Tuesday, July 18, 2017

Techniques: Sequencers/NGS pipelines/general steps...

Evolution of genomic landscape........
1953: DNA double helix structure (Watson and Crick)
1977: Sanger sequencing (Sanger)
1983: PCR (Mullis)
1993: Pyrosequencing
1998: Single molecule emulsion PCR
2005: 454 Life Science sequencer
2006: Solexa Genome Analyzer (Cambridge University, Lynx Therapeutics)
2006: Illumina (acquired Solexa)
2007: ABI SOLiD
2007: Roche acquires 454 Life Sciences
2008: GS FLX sequencing (454 Life Sciences/Roche)
2008: NGS human genome sequencing
2010: HiSeq 2000

#NGS has pervaded public health, epidemiology, disease studies, infertility studies, and agriculture. Variations of NGS include exomes, gene panels, RNA expression, fusions, hotspot panels, and CNV.
Cancer research, complex disease genomics, genomics in drug development, microbial genomics, forensic genomics, agrigenomics, reproductive and genetic health have been enriched by NGS.

---------------------------------------------------------------------------------------------------------
Sanger sequencing
It's been over 40 years since Fred Sanger introduced terminator nucleotides for sequencing (dideoxy sequencing) in the 1970s. The process is based on the detection of labelled chain-terminating nucleotides that are incorporated by a DNA polymerase during the replication of a template. Ten years ago, most genomes were still sequenced by the Sanger method.
Sanger method: Selecting and growing clones of whole genome shotgun libraries, isolating sequencing templates, and performing the sequencing reactions, followed by electrophoresis on a bank of 96 or 384 well capillary machines. The output from such production lines was then automatically assembled and usually generated a high-quality draft of the genome.
Computational challenges span everything from sample processing to assembly, binning, and identification of species; further challenges are annotation of genes and, of course, assignment of function.


(https://www.gatc-biotech.com/en/expertise/sanger-sequencing.html)
454/Roche introduced the first NGS technique. Terminators are also used in Illumina, Helicos and QIAGEN sequencers.
Genotyping by sequencing is possible by NGS 
All NGS platforms follow the same broad 3 steps: prep-------sequence------analyze
Prep step includes: DNA extraction and amplification

Illumina: sample prep-----cluster generation on solid phase---sequencing by synthesis with reversible terminator--Data analysis

Roche/454 Life Science: sample prep-----emulsion PCR---pyrosequencing...Data analysis
ABi SOLiD: sample prep-----emulsion PCR---sequencing by ligation...Data analysis
Helicos: sample prep-----single molecule---sequencing by synthesis with reversible terminator---Data analysis
Ion Torrent/Thermo Fisher: semiconductor technology
Pacific Biosciences: sample prep-----SMRT technology
Sample/library preparation differs for each sequencer
Next Generation Sequencing (NGS) data is high throughput
Information can be extracted from it through intensive manipulation and robust pipelines
Two main approaches are reference-based mapping and de novo assembly......
Read positions are manipulated to find SNPs, genotypes and indels
It suffers from low coverage depth in certain places (high-GC areas), which undermines confidence in SNP calling.
Less than 10x depth is ambiguous
BAM file is a map of the data to the reference. BAM is the compressed binary form of SAM; SAM is the human-readable text form.
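A minimal sketch of that conversion (hypothetical file names; old samtools 0.1.x syntax, matching the commands used later in these notes), including a check for ambiguous low-depth positions:
samtools view -bS reads.sam > reads.bam        # convert text SAM to binary BAM
samtools sort reads.bam reads.sorted           # coordinate-sort (output prefix syntax)
samtools index reads.sorted.bam                # index for IGV and random access
samtools depth reads.sorted.bam | awk '$3 < 10' | head   # positions below 10x depth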
(https://gatkforums.broadinstitute.org/gatk/discussion/6491/howto-visualize-an-alignment-with-igv)
-------------------------------------
Illumina........
#Now, ~90% of genome sequencing is done by Illumina
sample prep (workbench)---cluster amplification (Cluster station)..sequencing by synthesis (genome analyser)...analysis pipeline (Linux server)
The library prep step: gDNA--fragmented DNA---adaptor-ligated DNA---gel purification---library validation
Cluster amplification step: flow cell has 8 lanes--each has flow cell ports...(dNTPs and polymerase have been added).
36-100 cycles are used..
Analysis pipeline: image files--intensity files--base call files---alignment files.....consensus files...sequence analysis

(Firecrest for intensity files; Bustard for base call files; Gerald/ELAND for alignment files; CASAVA for consensus assembly; GenomeStudio for data visualization)



MiSeq: Small genome, amplicon, targeted gene sequencing (gene expression analysis)
e.g. targeted gene, small genome, and amplicon sequencing, 16S metagenomics,
 5 Mb genome, 50-100X coverage, 2 x 300 bp read length, Nextera XT Library Prep Kit, MiSeq Reagent v3 600-cycle kit
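As a quick sanity check of those numbers (coverage = read pairs x bases per pair / genome size), a back-of-envelope awk sketch:
awk 'BEGIN { genome=5e6; cov=100; bases_per_pair=2*300; printf "read pairs needed for 100x: %.0f\n", cov*genome/bases_per_pair }'
# prints: read pairs needed for 100x: 833333 -- well within a single MiSeq v3 run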

HiSeq: Big genome, exome, transcriptome

Sequenced using paired-end mode with 36 bp reads. The short reads require high coverage.
A large number of contigs remains before genome closure.


#Illumina SNP analysis

#Arrange Paired-end Reads
ls data/
for i in $(awk -F"L001" '{gsub("_$","",$1);print $1}' <(ls data/*.fastq.gz) | sort | uniq); do mkdir $(basename $i); mkdir $(basename $i)/Raw; cp $i*.fastq.gz $(basename $i)/Raw/. ; done
ls -d sample*
#Quality Trimming
for i in $(ls -d sample*/); do echo $i; cd $i; ls Raw; sickle pe -f Raw/*_R1_001.fastq.gz -r Raw/*_R2_001.fastq.gz -t sanger -o trimmed.$(basename ${i})_R1.fastq -p trimmed.$(basename ${i})_R2.fastq -s singles.$(basename ${i}).file.fastq -q 20 -l 10 ; cd ..; done
ls -d sample*/*.fastq
#Sequence Alignment and Quality Check
bwa index Cdiff078.fa 
for i in $(ls -d sample*/); do echo $i; cd $i; bwa mem -M ../Cdiff078.fa trimmed.$(basename ${i})_R1.fastq trimmed.$(basename ${i})_R2.fastq | samtools view -hbS - | samtools sort -m 1000000000  -  $(basename ${i});cd ..;done
#assess the quality of mapped reads by generating charts for GC Bias, Mean Quality by Cycle, and Quality Score Distribution
for i in $(ls -d sample*/); do echo $i; cd $i; java -Xmx2g -jar $(which CollectGcBiasMetrics.jar) R=../Cdiff078.fa I=$(basename ${i}).bam O=$(basename ${i})_GCBias.txt CHART=$(basename ${i})_GCBias.pdf ASSUME_SORTED=true; cd ..;done
for i in $(ls -d sample*/); do echo $i; cd $i; java -Xmx2g -jar $(which MeanQualityByCycle.jar) R=../Cdiff078.fa I=$(basename ${i}).bam O=$(basename ${i})_Qcycle.txt CHART=$(basename ${i})_Qcycle.pdf; cd ..;done
for i in $(ls -d sample*/); do echo $i; cd $i; java -Xmx2g -jar $(which QualityScoreDistribution.jar) R=../Cdiff078.fa I=$(basename ${i}).bam O=$(basename ${i})_Qdist.txt CHART=$(basename ${i})_Qdist.pdf; cd ..;done
#Mark and Remove Duplicates (by Picard)
for i in $(ls -d sample*/); do echo $i; cd $i; ls $(basename ${i}).bam ;java -jar $(which MarkDuplicates.jar) INPUT=$(basename ${i}).bam OUTPUT=$(basename ${i}).rmdup.bam METRICS_FILE=duplicateMatrix REMOVE_DUPLICATES=true;java -jar $(which AddOrReplaceReadGroups.jar) I=$(basename ${i}).rmdup.bam O=$(basename $i).rmdup.addgp.bam LB=whatever PL=illumina PU=whatever SM=whatever;samtools index $(basename ${i}).rmdup.addgp.bam; cd ..;done
#Indel Realignment
samtools faidx Cdiff078.fa
if [ ! -f  "Cdiff078.dict" ]; then java -jar $(which CreateSequenceDictionary.jar) R=Cdiff078.fa O=Cdiff078.dict; fi
for i in $(ls -d sample*/); do echo $i;cd $i; java -Xmx2g -jar $(which GenomeAnalysisTK.jar) -T RealignerTargetCreator -R ../Cdiff078.fa -I $(basename ${i}).rmdup.addgp.bam -o $(basename ${i}).rmdup.addgp.intervals;java -Xmx4g -jar $(which GenomeAnalysisTK.jar) -I $(basename ${i}).rmdup.addgp.bam -R ../Cdiff078.fa -T IndelRealigner -targetIntervals $(basename ${i}).rmdup.addgp.intervals  -o $(basename ${i}).realigned.bam;cd ..;done
#SNPs and Indels Detection
samtools mpileup -B -f Cdiff078.fa  sample1/sample1.realigned.bam sample2/sample2.realigned.bam sample3/sample3.realigned.bam | java -jar VarScan.v2.3.7.jar mpileup2cns --min-coverage 2 --min-var-freq 0.8 --p-value 0.005 --variants --output-vcf > variants.vcf
head -30 variants.vcf 
#Filter Recombination
for i in $(grep -v "#" variants.vcf| cut -f1|sort|uniq); do grep -i -e "$i" -e "#CHROM" variants.vcf> $i.vcf;done
#count SNP density
for i in $(ls Cdiff078_*.vcf); do echo $i; python contig_count_snp_density2.py -vcfs ${i} --win 200 ; done
#Collate all the snp density files together
for i in $(ls *snp_densitywin200.csv); do awk '{n=split(FILENAME,a,".");print a[1]"," $0}' $i | sed '1d' ;done> contigs_snps_density.csv
#plot the snp density using the following R script:
library(ggplot2)
library(reshape2)
library(colorspace) 
density<- read.csv("contigs_snps_density.csv", header=FALSE, sep=",")
colnames(density)<- c("contigs","pos","count")

p1 <- qplot(x=density$pos, y=density$count, color=density$contigs) + guides(col = guide_legend(title="Contigs", nrow = 18))

p <- p1 + labs(title="SNP density Plot", x="Base Position", y="Density")
ggsave("snp_density.pdf", plot=p, width =9)
#Filter the variants at a density level 5 to avoid false positives due to recombination.
python filter_variants.py --use-density --density 5 --window 200 variants.vcf > variants.filtered.vcf
#Number of SNPs after filtering:
vcftools --vcf variants.filtered.vcf --remove-indels
#Convert the filtered SNPs into a pseudo-sequence using the python script 'vcf2phyloviz.py':
python vcf2phyloviz.py --metadata metadata.txt --output_prefix snps variants.filtered.vcf
cat snps_all_alleles.fasta 
#Phylogenetic Tree Build using FastTree
FastTree -nt  snps_all_alleles.fasta > snps.fastree.nwk
#Filter Recombination and Tree Build using Gubbins
vcftools --vcf variants.vcf --out snps_only.variants --remove-indels --recode --recode-INFO-all
python vcf2phyloviz.py -m  metadata.txt -o snps_only.variants snps_only.variants.recode.vcf
cat snps_only.variants_all_alleles.fasta
export PYENV_ROOT="/home/opt/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
export PATH=/home/opt/fastml/programs/fastml:$PATH
export PATH=/home/opt/standard-RAxML:$PATH
#Now run Gubbins to generate the phylogeny with FastTree (or optionally RAxML).
run_gubbins.py --tree_builder fasttree --prefix gubbins snps_only.variants_all_alleles.fasta

#Visualize the tree in FigTree

---------------------------------------------------------------

Pacific BioSciences........
100x coverage possible with latest chemistry
The longest read possible for PacBio is ~5,000 bp (average read length ~1,500 bp)


Steps of de novo assembly: error correction, assembly, polishing, circularization
1#PacBio Corrected Reads (PBcR) is a pipeline to correct errors in long reads


2#Celera Assembler is used to assemble the corrected long reads into contigs
3#Quiver polishes (corrects) the assembly
4#When assembly results in a circular genome, as confirmed by a dot plot, the start and end contain the same sequence. The ends have low mapping scores, so poor consensus, which makes closing the genome tricky.
The AMOS package can address this issue.


polished assembly------->break in the middle


toAmos -s polished_assembly -o circularized.afg
minimus2 circularized
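#minimus2 merges the overlapping contig ends; contigs that did not merge are written to circularized.singletons.seq and appended back below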

cat circularized.singletons.seq >> circularized.fasta

#Use the perl code to find dnaA in the genome to create a break at OriC
circ-perm.pl

Ion Torrent (Thermo-Fisher).......
Semiconductor technology is used: a chip (silicon wafer) with millions to billions of sensors.
A library molecule is put on a bead; billions of beads represent the libraries. Clonal copies are produced.
Hydrogen ions are released as nucleotides are incorporated; the voltage change is measured.

Popular products include Ion S5 Systems, Ion AmpliSeq Technology, Ion Proton System, Ion PGM System, Ion Chef System, Ion Reporter Software and Server



Copy number variation (CNV) analysis to find chromosomal aberrations like aneuploidy is possible with Ion S5™ Systems


(https://vimeo.com/121444329)

10 ng of DNA per pool. PCR-based selection. Up to 24K primers per pool.

Ion AmpliSeq technology (amplify, digest, ligate)


Oxford Nanopore MinION
MinION (longer reads)
~10,000 bp typical; ultra-long reads possible (up to 882 kb and higher)
Uses traditional DNA extraction techniques, with some changes in the library preparation protocol:
a modified Sambrook phenol-chloroform extraction/purification, DNA QC, minimal pipetting steps, high-input to rapid kit, and MinKNOW 1.4

 92% genome coverage of E. coli, average depth of 3.8x

Its GridION X5 runs 5 MinION flowcells independently in parallel
a MinION plugged into the USB port
#Issues with sequencers
Illumina has systematic bias
PacBio has random error (no systematic bias)
Drawbacks of sequencing
Many genomes are still in draft status, with too many contigs
#GC-rich regions are hard to sequence
#A mutation can be a genetic marker and not a marker of resistance 
#Commonly inspected genes and their surrounding regions 
#Mutation can be in structural gene, promoter or intergenic regions
The balance between reference-supporting reads and variant-supporting reads determines whether a base is called a SNP.
A quality score is required because every base call has some probability of being incorrect.

Quality scores were originally derived from the PHRED program, which was used to read DNA sequence trace files.
The lower the integer, the higher the probability that the base has been called incorrectly.
The resulting quality score lookup tables are then used to calculate a quality score for de novo NGS data. 
Q10: 1 in 10 wrong : accuracy 90%
Q20: 1 in 100 wrong : accuracy 99%
Q30: 1 in 1000 wrong : accuracy 99.9%
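These rows follow from the Phred definition Q = -10*log10(P). A one-line awk sketch converts any Q to error probability and accuracy:
echo 30 | awk '{ p = 10^(-$1/10); printf "Q%d: error probability %.4f, accuracy %.2f%%\n", $1, p, (1-p)*100 }'
# prints: Q30: error probability 0.0010, accuracy 99.90%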
Synonymy: if a base change does not change the amino acid it is synonymous (S); if it does, it is non-synonymous (NS). For example, GAA→GAG (both Glu) is synonymous, while GAA→GTA (Glu→Val) is non-synonymous.

Lab variant call pipeline

FASTQ file mapped to reference genome------->SAM/BAM file------->recalibrated BAM file------->mpileup file----varscan---> Consensus file (a vcf file)--------->snp file

BAM file needs to be indexed by SAMtools. BAM can be viewed in IGV.
Consensus file, when filtered, generates the snp file
If the variant-supporting read count is greater than or equal to 3, consider it a SNP (commands used: awk, wc -l)
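A minimal sketch of that filter, assuming a hypothetical tab-separated snp file with the variant-supporting read count in column 6 and a header row:
awk 'NR > 1 && $6 >= 3' snp_file > snp_file.filtered   # keep rows with >= 3 variant-supporting reads
wc -l snp_file.filtered                                # count the SNPs that pass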


GATK (from the Broad Institute), SAMtools and VarScan are used for variant calling


Some genomic regions are difficult to sequence, so coverage depth is poor there
Coverage is generated by Bamtools
VCF: Variant Call Format
Tools to manipulate VCF files : samtools, bedtools, vcftools, GATK tools
vcf file has 19 columns. 
$1: chrom
$2: pos
$3: ref
$4: mut
$5: read1
$6: read2
$19: var allele
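A sketch of pulling a few of those columns with awk (assuming the 19-column layout above, tab-separated, with header lines starting with #):
grep -v "^#" file.vcf | awk -F'\t' '{print $2, $3, $4, $19}' | head   # pos, ref, mut, var allele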
Genes on the minus strand need to be reverse complemented
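A one-liner sketch for that (plain sequence on stdin; rev reverses, tr complements):
echo "ATGCACGT" | rev | tr 'ACGTacgt' 'TGCAtgca'   # prints ACGTGCAT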

False Positive SNP: SNP called by NGS platform, not by gold standard Sanger 
False Negative SNP: SNP called by gold standard Sanger, not by NGS platform 

A SNP analysis pipeline..........
step 1: Quality score recalibration
java  -jar file.jar  -R file.fasta  -I file.bam  -O output_file

step 2: Covariates
java  -jar file.jar  -R file.fasta  -Known_SNPs  -I file.bam   -T CountCovariates -recalFile recalibrated_file

step 3: Known_SNPs file generation (Use of MUMmer for rapid alignment of entire genome)
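A sketch of that step with MUMmer (hypothetical file names; nucmer aligns the two genomes, show-snps reports the SNPs between them):
nucmer --prefix=ref_vs_query ref.fasta query.fasta
show-snps -Clr ref_vs_query.delta > known_snps.txt   # -C excludes SNPs from ambiguous alignments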


step 4: mpileup file generation
samtools mpileup -BI  -f path_to_ref_fasta  recalibrated.bam  > file.mpileup

step 5: SNP and consensus calling
java -jar  VarScan.jar  pileup2snp  file.mpileup  --min-coverage 2  --min-var-freq 0.55 --p-value 0.10  > snp_file

java -jar  VarScan.jar  pileup2cns  file.mpileup  --min-coverage 2  --min-var-freq 0.55 --p-value 0.10  >consensus_file

step 6: Uploading SNP data to SQL database
LOAD DATA LOCAL INFILE 'path_to_file'
INTO TABLE TABLE_NAME (ID, pos, ref, mut,....)

step 7: Generate coverage file
bamtools coverage -in file.bam -out  output_file
awk '{if ($2 >= 1000 && $2 <= 2000 ) {print $3}}' file
awk '{if ($2 >= 1000 && $2 <= 2000) {sum += $3; count += 1}} END {print sum, count, sum/count}' file
########
Indel calculation
IMF in the vcf file is the maximum fraction of reads supporting an indel
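A hedged sketch of using IMF to filter indel calls with bcftools expression filtering (assumes bcftools 1.x and that IMF is present in the INFO field):
bcftools filter -i 'TYPE="indel" && INFO/IMF >= 0.5' calls.vcf > indels.imf50.vcf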
#Code for using the aligner bbmap
bbmap.sh in1=reads1.fq in2=reads2.fq out=mapped.sam ref=ecoli.fa nodisk
----------------------------------------------
#Check human genome  vcf file with polymorphisms 
#Using tools and R packages
#get bwa
cd /root
wget -O bwa-0.7.10.tar.bz2 http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.10.tar.bz2/download
#untar and compile bwa (via make)
tar xvfj bwa-0.7.10.tar.bz2
cd bwa-0.7.10
make
cp bwa /usr/local/bin
#install some tools
apt-get update
apt-get -y install samtools screen git curl gcc make g++ python-dev unzip \
      default-jre pkg-config libncurses5-dev r-base-core \
      r-cran-gplots python-matplotlib sysstat libcurl4-openssl-dev libxml2-dev
git clone https://github.com/schimar/ngs2014_popGen.git
cd ngs2014_popGen/var_call2/
#index the reference genome
bwa index ref_genome.fna
#map reads to the indexed reference genome
bwa aln ref_genome.fna read_file.fq > mapped_reads.sai
#create the SAM file
bwa samse ref_genome.fna mapped_reads.sai read_file.fq > mapped_reads.sam
#index the reference genome
samtools faidx ref_genome.fna
#convert from SAM to BAM
samtools view -b -S -o mapped_reads.bam mapped_reads.sam
#sort the BAM
samtools sort mapped_reads.bam mapped_reads.sorted
#index again, but now the sorted BAM file
samtools index mapped_reads.sorted.bam
#visualize the alignment
samtools tview mapped_reads.sorted.bam ref_genome.fna
#variant exploration with Bioconductor
R
source("http://bioconductor.org/biocLite.R")
biocLite()
biocLite("VariantAnnotation")
biocLite("SNPlocs.Hsapiens.dbSNP.20101109")
biocLite("BSgenome.Hsapiens.UCSC.hg19_1.3.1000")
#quality control
library(VariantAnnotation)
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
(hdr <- scanVcfHeader(fl))
info(hdr)[c("VT", "RSQ"),]
(vcf <- readVcf(fl, "hg19"))
head(rowData(vcf), 3)
rowData(vcf) <- renameSeqlevels(rowData(vcf), c("22"="ch22"))
#load the SNP database and discover whether our SNPs are in dbSNP
library(SNPlocs.Hsapiens.dbSNP.20101109)
destination <- tempfile()
pre <- FilterRules(list(isLowCoverageExomeSnp = function(x) {
grepl("LOWCOV,EXOME", x, fixed=TRUE)
}))
filt <- FilterRules(list(isSNP = function(x) info(x)$VT == "SNP"))
snpFilt <- filterVcf(fl, "hg19", destination, prefilters=pre, filters= filt)
vcf_filt <- readVcf(snpFilt, "hg19")
rowData(vcf)
rowData(vcf_filt)
#compare new snps with database
inDbSNP <- rownames(vcf) %in% rownames(vcf_filt)
table(inDbSNP)
metrics <- data.frame(inDbSNP = inDbSNP, RSQ = info(vcf)$RSQ)
#visualize
Integrative Genomics Viewer (IGV) from the Broad Institute is great for visualization
Use the Java Webstart to download and run IGV
http://www.broadinstitute.org/igv/
If you are using a remote server (your own or Cloud), you will need to download the BAM and .bai files from the server to your desktop/laptop in order to visualize.
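For example, a sketch with hypothetical host and paths:
scp user@server:/path/to/mapped_reads.sorted.bam ~/Desktop/
scp user@server:/path/to/mapped_reads.sorted.bam.bai ~/Desktop/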

or


library(ggplot2)

ggplot(metrics, aes(RSQ, fill=inDbSNP)) +
  geom_density(alpha=0.5) +
  scale_x_continuous(name="MaCH / Thunder Imputation Quality") +
  scale_y_continuous(name="Density") +
  theme(legend.position="top")

-------------------------------------------------------------------------------
 #To download human genome vcf files
 http://uswest.ensembl.org/info/data/ftp/index.html

==============================================
Alignment...
Alignment of sequencing reads to a reference genome is a core step in the analysis workflows for many high-throughput sequencing assays, including ChIP-Seq, RNA-seq, ribosome profiling and others.
Bowtie  uses an extremely economical data structure called the FM index to store the reference genome sequence and allows it to be searched rapidly. 
TopHat uses Bowtie as an alignment ‘engine’ 
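A minimal Bowtie sketch (hypothetical file names): build the FM index once, then align reads against it:
bowtie-build ref.fasta ref_index             # build the FM index
bowtie -S ref_index reads.fq > aligned.sam   # align, writing SAM output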


Burrows Wheeler Aligner (BWA)  is preferred for mapping low-divergent sequences against a large reference genome. The alignment steps include:

Indexing the reference genome
Aligning the reads to the reference genome
Index the reference genome

 # This step helps with the speed of alignment

 bwa index  ref.fasta  

#create a .sai file

bwa aln path/to/ref.fasta path/to/fastq > SAIfile
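To turn the .sai file into actual alignments, a samse step follows for single-end reads (mirroring the tutorial earlier in these notes):
bwa samse path/to/ref.fasta SAIfile path/to/fastq > aligned.sam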



Mauve........
#To run the Mauve GUI from within Terminal 
#Add directory with executables to Mauve path
cd Mauve/
ls
cd mauve_2.3.1/
./Mauve 

File
Align with progressive Mauve
Select the executable folder (by navigating)
Mauve Console starts running (1-2 minutes for two full genomes)
Viewing the alignment
Zoom in: Ctrl + Up
Scroll display left: Ctrl + Left
Scroll display right: Ctrl + Right
Large left scroll: Shift + Ctrl + Left
Large right scroll: Shift + Ctrl + Right
Tools ---------> Export ---------> Export SNPs




RPS-BLAST...
#Reversed Position Specific BLAST, or RPS-BLAST, used at the command line
#extract just these *.smp files from the large archive (cdd.tar.gz).
#run the formatrpsdb tool to build a database:
formatrpsdb -t Sigma.v001 -i Sigma.pn -o T -f 9.82 -n Sigma -S 100.0
#creates eight files, i.e. Sigma.aux, Sigma.loo, Sigma.phr, Sigma.pin, Sigma.psd, Sigma.psi, Sigma.psq and Sigma.rps, which together make up the database.
#Compare
rpsblast -i rpoD.faa -d Sigma -e 0.00001
rpsblast -i rpoD.faa -d Sigma -e 0.00001 -o rpoD.txt
rpsblast -i rpoD.faa -d Sigma -e 0.00001 -m 7 -o rpoD.xml
#If comparing with Pfam database
rpsblast -i rpoD.faa -d Pfam -e 0.00001
#comparing entire genome with the Sigma database made earlier.
rpsblast -i NC_003197.faa -d Sigma -e 0.00001 -o NC_003197.txt
rpsblast -i NC_003197.faa -d Sigma -e 0.00001 -m 7 -o NC_003197.xml

#Analyzing RPS-BLAST output with Biopython
#For the smaller xml file
from Bio.Blast import NCBIXML
for record in NCBIXML.parse(open("rpoD.xml")) :
    print "QUERY: %s" % record.query
    for align in record.alignments :
        print " MATCH: %s..." % align.title[:60]
        for hsp in align.hsps :
            print " HSP, e=%f, from position %i to %i" \
            % (hsp.expect, hsp.query_start, hsp.query_end)
            if hsp.align_length < 60 :
                print " Query: %s" % hsp.query
                print " Match: %s" % hsp.match
                print " Sbjct: %s" % hsp.sbjct
            else :
                print " Query: %s..." % hsp.query[:57]
                print " Match: %s..." % hsp.match[:57]
                print " Sbjct: %s..." % hsp.sbjct[:57]
print "Done"


#For the large xml file
from Bio.Blast import NCBIXML
for record in NCBIXML.parse(open("NC_003197.xml")) :
    #We want to ignore any queries with no search results:
    if record.alignments :
        print "QUERY: %s..." % record.query[:60]
        for align in record.alignments :
            for hsp in align.hsps :
                print " %s HSP, e=%f, from position %i to %i" \
                % (align.hit_id, hsp.expect, hsp.query_start, hsp.query_end)
print "Done"



#Running RPS-BLAST from Biopython
#Adjust the file locations to match your own:
rpsblast_db = "C:\\Blast\\cdd\\Sigma"
rpsblast_exe = "C:\\Blast\\bin\\rpsblast.exe"

query_filename = "rpoD.faa"
#query_filename = "NC_003197.faa"

E_VALUE_THRESH = 0.00001 #Adjust the expectation cut-off here

from Bio.Blast import NCBIStandalone
output_handle, error_handle = NCBIStandalone.rpsblast(rpsblast_exe, \
rpsblast_db, query_filename, expectation=E_VALUE_THRESH)


from Bio.Blast import NCBIXML
for record in NCBIXML.parse(output_handle) :
    #We want to ignore any queries with no search results:
    if record.alignments :
        print "QUERY: %s..." % record.query[:60]
        for align in record.alignments :
            for hsp in align.hsps :
                print " %s HSP, e=%f, from position %i to %i" \
                % (align.hit_id, hsp.expect, hsp.query_start, hsp.query_end)
                assert hsp.expect <= E_VALUE_THRESH
print "Done"
====================================
Exome sequencing

The exome is ~1% of the human genome (~30 Mb total)

Exome sequencing targets DNA from exon regions. It's cost-effective, but can skip information in other parts of the genome.
Workflow: fragmentation of genomic DNA; enrichment of the target fragments in the exome region with designed hybridization chips; sequencing
Exome sequencing analyzes the protein-coding region of the genome, as it is cheaper than whole-genome sequencing. It reduces the search space but suffers from the loss of intergenic regions. With both Rapid Run and High Output modes, the HiSeq 2500 System is a flexible system that can sequence 5–75 exomes* per run.
*Assumes 100x coverage 

(https://www.cd-genomics.com/whole-exome-sequencing.html)

(http://www.genomics.hk/Exome.htm)

De Novo Sequencing
With a unique mix of high output, long reads, and paired-end sequencing, HiSeq is a powerful tool for de novo sequencing of species without a reference genome.

Metagenomics
Metagenomic analysis includes sequencing and identification of entire microbial communities in an environment

Amplicon Sequencing
Ultra-deep sequencing of PCR amplicons is useful for the analysis of up to hundreds of target genomic regions in one assay.

Cytogenomics
Genome-wide detection of chromosomal variation
