Tuesday, August 15, 2017

Plant tissue culture..........

Plant tissue culture is a technique for growing plant cells, tissues, or organs under sterile conditions on a nutrient culture medium of known composition.
Plant tissue culture is widely used to produce clones of a plant in a method known as micropropagation. The new plantlets are produced in a short period of time, and they are free of soil-borne pathogens.
Sterile agar gel, plant hormones, and nutrients are needed for this purpose, which makes tissue culture more expensive and more difficult than taking plant cuttings. However, it is widely used for different goals, including clonal propagation and genetic alteration for better vigor against pathogens/pests.
The four steps of the tissue culture technique:
1. Inoculation of explant 2. Incubation of culture 3. Sub-culturing 4. Transplantation of the regenerated plant.
(http://askmissteong.blogspot.com/2013/01/plant-tissue-culture-recommended-videos.html)

Step 1. Inoculation of explant:
The sterilized explants are transferred to the nutrient medium.

Step  2. Incubation of culture: 
After inoculation, the cultures are incubated in culture room or in incubator. Low humidity can cause desiccation, while high humidity can lead to microbial contamination.

Step 3. Sub-culturing:
The progress of the in vitro-grown tissues is monitored periodically.
For suspension cultures, the medium is changed.
For callus culture, the sub-culturing of the callus tissue is performed.

Step 4. Transplantation of the regenerated plant:
Plants regenerated from in vitro tissue culture are transplanted to soil. These regenerated plants are acclimatized to prepare them for survival under field conditions.


Biostatistics for biomedical use..

Feature selection is used to identify the most discriminating features for biomarker discovery, medical diagnosis, and gene selection. 
Random Forest (RF): an ensemble classifier built from multiple decision trees; it applies bagging to construct the ensemble and uses randomization in the growth of each tree. RF is suitable for high-dimensional, small-sample datasets.
Support Vector Machine (SVM): a supervised classifier, generally used for binary classification, but it can be extended to multi-class problems.
Provide part of the data to a linear SVM and tune its parameters so that the SVM acts as a discriminant function separating the ham messages from the spam messages (see the R sketch after the code below).
#In R code
sms_data<-read.csv("sms_spam.csv",stringsAsFactors = FALSE)

head(sms_data)
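A minimal sketch of how the tuning step could look, assuming the e1071 package is installed and that sms_spam.csv has "type" (ham/spam) and "text" columns; the two numeric features are toy stand-ins for a proper document-term matrix, not the original pipeline.

#A minimal sketch (not the original pipeline): assumes the e1071 package and
#that sms_spam.csv has "type" (ham/spam) and "text" columns
library(e1071)

sms_data <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)
sms_data$type <- factor(sms_data$type)                       #ham / spam labels

#Two crude numeric features as stand-ins for a full document-term matrix
features <- cbind(
  n_chars  = nchar(sms_data$text),                           #message length
  n_digits = nchar(gsub("[^0-9]", "", sms_data$text))        #number of digits
)

set.seed(1)
train_idx <- sample(nrow(sms_data), 0.75 * nrow(sms_data))   #hold out 25% for testing

#Linear SVM as the discriminant separating ham from spam
fit <- svm(x = features[train_idx, ], y = sms_data$type[train_idx],
           kernel = "linear", cost = 1)

pred <- predict(fit, features[-train_idx, ])
table(predicted = pred, actual = sms_data$type[-train_idx])  #confusion matrix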
************
Parzen-window estimation calculates a probability density function (pdf) with a non-parametric approach.

Each data point contributes equally to the pdf.
Uniform (box) window: each point contributes a flat kernel.
Normal distribution (bell curve): here the kernel is a Gaussian function.
Variance is the square of the standard deviation.
Probability of a point falling in the window: p = k/n (k of the n samples lie inside the window); dividing by the window volume gives the density estimate (see the R sketch after this block).
************
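A minimal R sketch of the idea (the bimodal toy sample, window width h, and helper name parzen_box are illustrative assumptions):

#A minimal R sketch of Parzen-window density estimation (toy data)
set.seed(42)
x <- c(rnorm(100, mean = 0), rnorm(50, mean = 4))   #bimodal sample

#Non-parametric pdf with a Gaussian (bell-curve) kernel
dens <- density(x, kernel = "gaussian")
plot(dens, main = "Parzen-window estimate (Gaussian kernel)")

#Box (uniform) window at a point x0: k of the n samples fall inside a window of width h,
#so p = k/n is the probability mass in the window and k/(n*h) estimates the density
parzen_box <- function(x0, x, h = 1) {
  k <- sum(abs(x - x0) < h / 2)
  k / (length(x) * h)
}
parzen_box(0, x)    #density estimate near the first mode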
Artificial neural network (ANN)
ANN helps model complex relations between inputs and outputs, and finds patterns in data (e.g., protein catabolic rate, optical character recognition (OCR)).
Input---hidden---output
An ANN architecture can have many layers, e.g., layer 1 (3 nodes), layer 2 (4 nodes), layer 3 (2 nodes)...
Transfer (net input) function = sum over all inputs of weight * input; an activation function is then applied to this sum.
There are many activation functions.
Deep learning uses many such layers to make the data analysis sophisticated enough to derive abstract traits (e.g., personality) from data.
The lowest E means the least difference between the desired and actual values; training iterations tend to minimize E (see the R sketch after these notes).
A genetic algorithm (GA) searches more randomly.
If the error surface E(w) is wavy (many local minima), use a GA.
If it allows a steep descent, use back propagation or anything else based on gradient descent.
Clustering can be crisp or fuzzy
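A minimal R sketch of a single neuron trained by gradient descent to minimize E (the toy data, learning rate, and epoch count are assumptions; this is not a full multi-layer network):

#A minimal sketch of one artificial neuron trained by gradient descent
#to minimize E = sum((desired - actual)^2); toy data, not a full ANN
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)                 #two input features
y <- as.integer(x[, 1] + 2 * x[, 2] > 0)          #desired output (0/1)

w <- c(0, 0); b <- 0; lr <- 0.5                   #weights, bias, learning rate
sigmoid <- function(z) 1 / (1 + exp(-z))          #activation function

for (epoch in 1:2000) {
  net   <- x %*% w + b                            #transfer: sum of weight * input
  out   <- sigmoid(net)                           #actual output
  delta <- (y - out) * out * (1 - out)            #error term (desired - actual)
  #gradient-descent weight update (back-propagation for a single layer)
  w <- w + lr * t(x) %*% delta / nrow(x)
  b <- b + lr * mean(delta)
}
mean((sigmoid(x %*% w + b) > 0.5) == y)           #training accuracy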

FDA, Pharmacy and regulatory writing...........

I got interested in a career in medical writing......
(https://www.kent.ac.uk/careers/workin/sciencewriting.htm)
Though I have an immense number of medical publications, I did not have much knowledge of regulatory writing, so I had to learn from scratch. It had a steep learning curve, but the journey was exciting.
I had to familiarize myself with a number of acronyms......
AE = Adverse Event
BIMO = Bioresearch Monitoring program
CER=Clinical Evaluation Reports
CPM = Clinical Project Manager
CRC = Clinical Research Coordinator
CRF = Case Report Form
CFR = Code of Federal Regulations
CRA = Clinical Research Associate
CRO = Contract Research Organization
EDC = Electronic Data Capture
FCE = Field Clinical Engineer
FDA = Food and Drug Administration
FDF = Financial Disclosure Form
ICF = Informed Consent Form
IDE = Investigational Device Exemption
IMV = Interim Monitoring Visits
IND = Investigational New Drug
IRB = Institutional Review Board
MP = Monitoring Plan
NDA = New Drug Application
PI = Principal Investigator
PMA = Pre Market Approval
QC = Quality Check
RBM = Risk Based Monitoring
RMP = Risk Management Plan
TMF = Trial Master File
SAE = Serious Adverse Event
SSR=Safety Surveillance Reports
WL = Warning Letter
483 = Inspectional Findings


FDA categorizes medical devices into three classes:
Class I : lowest risk, subject to the least regulatory controls.
Class II : moderate risk, subject to general and special controls (most are cleared through premarket notification, 510(k)).
Class III : highest-risk devices, subject to the highest level of regulatory control, often requiring agency approval (PMA) before they can be marketed.



Molecular biology.....

RNA folding is the most essential process underlying RNA function
The DEAD-box helicase Mss116p is essential for respiratory growth, acting as a group I and group II intron splicing factor
Mss116p assists RNA folding
ai5γ group II intron
ai5γ ribozyme D135 achieves its complex architecture by following a direct path, in which formation of the scaffolding domain (D1) constitutes a compact intermediate on the way from the unfolded to the native conformation
A small substructure within D1, the κ–ζ element, was found to control D135 compaction and to initiate a cascade of folding events
Mss116p plays a role in mitochondrial translation and RNA processing
Mss116p influences the folding mechanism in another way: long exon sequences interfere with ai5γ splicing 
-------------
HDV ribozymes catalyze their own scission from the transcript during rolling circle replication of the hepatitis delta virus
Genomic mapping of these RNAs suggested several biological roles, one of which is the 5' processing of non-LTR retrotransposons



Mitochondrial DNA (mtDNA)...

The mitochondrion is a highly specialized organelle, present in almost all eukaryotic cells and principally charged with the production of cellular energy through oxidative phosphorylation

Somatic mutations in mtDNA are also linked to other complex traits, including neurodegenerative diseases, ageing and cancer.

The mtDNA genome mutations can lead to energy and respiratory-related disorders such as myoclonic epilepsy with ragged red fiber disease; mitochondrial myopathy, encephalopathy, lactic acidosis and stroke syndrome; and Leber's hereditary optic neuropathy
Heteroplasmy is the presence of more than one type of organellar genome (mitochondrial DNA or plastid DNA) within a cell or individual

It is an important factor in considering the severity of mitochondrial diseases.



Phylogeny.............

Phylogeny, or tracing back lineages, is vital for genomic interpretation. Tracing evolutionary links is important in clinical, historical, and conservation biology. In this context, it is important to be well-versed in the robust tools and data formats.
A high bootstrap value means the node is well supported. A bootstrap value of 95% (or 0.95) means the node is supported in 95 out of 100 iterations. In a maximum likelihood tree, bootstrap values of ~70% and above are generally considered acceptable.

Positive or diversifying or disruptive or accelerating selection
Negative or purifying selection

Sequence to analyze for phylogeny........
Molecular sequence (DNA, protein)
Molecular presence (RFLP, isozyme, RAPD, ISSR, AFLP)

Aligner: Blat, gmap
Multiple Alignment: Clustal, MUSCLE, TCoffee
Alignment Refinement: Gblocks
Substitution matrices: BLOSUM, PAM, WAG, JTT, DAYHOFF
Model selection: jModel, GTR (General Time Reversible), GTRCAT, GTRGAMMA, PROTGAMMAJTT
Phylogenetic analysis: MEGA, Mesquite, BioNJ, MrBayes, PAML, PAUP, PhyML, RAxML, SeaView, BEAST
PAML: Phylogenetic Analysis Using Maximum Likelihood
RAxML: Randomized Axelerated Maximum Likelihood (based on maximum likelihood)

RAxML is very popular as it is fast and generates maximum likelihood trees with good scores. It accepts phylip-format files.
File formats: Phylip, RAxML, Nexus
Tools and the files they accept:
MEGA:
PAUP: nexus (.nex)
MrBayes: nexus (.nex)
PhyML, RAxML: phylip (.phy)
Others: fasta (.fa)
Phylogenetic tree viewing (visualization): SplitsTree, Newick, Drawtree, TreeDyn, FigTree
Phylip: Phylogeny inference package

fasta------------>phylip------------>tree

(http://evolution.genetics.washington.edu/phylip/progs.algs.tree.html)
STEP 1: Code to convert fasta sequence to phylip format (convertFasta2Phylip.sh)........
#! /bin/sh
#The code convert.sh converts fasta sequence to phylip format
#Phylip format is almost the same, except each sequence is presented on one line. It has a header giving the number and length of the sequences (n, m), followed by the alignment.
#If the script is not called with exactly one argument, print usage and exit
if [ $# != 1 ]; then
    echo "USAGE: ./script <fasta-file>"
    exit
fi

#Each sequence header starts with >; count the > symbols to get the number of sequences
numSpec=$(grep -c  ">" $1)

#Collapse each record onto one line: rewrite each header as ";name<", delete newlines and spaces, drop the leading ";", then turn "<" into the space between name and sequence
tmp=$(cat $1 | sed "s/>[ ]*\(\w*\).*/;\1</"  | tr -d "\n" | tr -d ' '  | sed 's/^;//' | tr "<" " " )
#Length of the first sequence (the alignment length, assuming all sequences are aligned)
length=$(($(echo $tmp | sed 's/[^ ]* \([^;]*\);.*/\1/'   | wc -m ) - 1))

echo "$numSpec $length"
echo  $tmp | tr ";" "\n"
--------------------
 data_file
 >|cow|
ATCGGGGCTGCGTGAAAAAAAAATTGC
>|egret|
AGGGTCCAATGTTAACTTTCATGCGCTCG
>|turtle|
AGGTAAACCGTGAGCGGGCGGGATG
>|rabbit|
TATTGACTGACCCGGGCAATTCGTG
>|goat|
TTGAAAACCCGTGGGTGCGGGGCCCCGGG
--------------------
execution:
sh convert.sh data_file
 --------------------
output
 5 29
ATCGGGGCTGCGTGAAAAAAAAATTGC
 AGGGTCCAATGTTAACTTTCATGCGCTCG
 AGGTAAACCGTGAGCGGGCGGGATG
 TATTGACTGACCCGGGCAATTCGTG
 TTGAAAACCCGTGGGTGCGGGGCCCCGGG
############################################
STEP 2: Code to convert phylip sequence into tree........
# Key options: -s (input alignment), -n (output name), -N (number of bootstrap replicates / alternative runs), -T (number of threads), -f (algorithm; -f a = rapid bootstrap analysis plus search for the best-scoring ML tree in one run), -x (rapid bootstrap random seed), -b (standard bootstrap random seed), -p (parsimony random seed), -m (model; GTRGAMMA = GTR with optimization of substitution rates and a GAMMA model of rate heterogeneity)
#Adjust the parameters depending on the requirements
raxml -s phylip.phy -n phylip.raxml.signalTree -m GTRCAT -f a -T 2 -x 1000 -N 300

#Tree from DNA file

raxml-hpc -T 8 -m GTRGAMMA -s file.phylip -f d -n output
raxml-hpc -T 8 -m GTRGAMMA -s file.phylip -x 12345 -N 500 -n output.500rbs

#Single tree from protein file
raxmlHPC -s file.phy -n file.raxml.singleTree -c 4 -f d -m PROTGAMMAJTT

#A set of bootstrap tree from protein file
raxmlHPC -s file.phy -n file.raxml -c 4 -f d -m PROTGAMMAJTT -b 234534251 -N 10
---------------
# Tree from multiple alignment sequences (concatenated core genes, made by Roary)
raxmlHPC -m GTRGAMMA -p 12345 -s core_gene_alignment.aln -n NAME

# Run RAxML in bootstrap mode
raxmlHPC -m GTRGAMMA -p 12345 -s core_gene_alignment.aln -n NAME_bootstrap -f a -x 12345 -N 100 -T 12
# Results (open with Forester)
RAxML_bestTree.NAME_bootstrap - best-scoring ML tree
RAxML_bipartitions.NAME_bootstrap - best-scoring ML tree with support values
RAxML_bipartitionsBranchLabels.NAME_bootstrap - best-scoring ML tree with support values as branch labels
RAxML_bootstrap.NAME_bootstrap - all bootstrapped trees
RAxML_info.NAME_bootstrap  - program  info
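As an alternative to Forester, a minimal sketch for viewing the support-value tree in R (assumes the ape package is installed and uses the output file name from the run above):

#A minimal sketch (assumes the ape package is installed)
library(ape)
tree <- read.tree("RAxML_bipartitions.NAME_bootstrap")  #Newick ML tree with support values
plot(tree, cex = 0.8)                                   #draw the tree
nodelabels(tree$node.label, cex = 0.7)                  #bootstrap support at each node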
Most recent common ancestor (MRCA)



(https://evolution.berkeley.edu/evolibrary/article/0_0_0/evotrees_primer_04)
Paraphyletic taxon: does not include all the descendants of the most recent common ancestor.
Monophyletic taxon: include all the descendants of the most recent common ancestor.
Probabilistic analyses of a concatenated alignment are limited by large demands on memory and computing time.
Supertree methods: focus on the topology or structure of the phylogenetic tree rather than the evolutionary divergences associated with it.

Math formulae:.........

Good math software and websites: Matlab, Mathematica, Maple, Symbolab, CAS, Wolfram
PEMDAS: Parentheses, Exponents, Multiplication, Division, Addition, Subtraction
BODMAS: Brackets, Orders (powers and roots), Division and Multiplication, Addition and Subtraction
Identities (additive, associative, multiplicative, distributive, commutative, polynomial)

(a+b)^2 = a^2 + 2ab + b^2
(a+b)(c+d) = ac + ad + bc + bd
#Difference of squares
a^2 - b^2 = (a+b)(a-b)
#Sum and Difference of Cubes
a^3 (+-) b^3 = (a (+-) b)(a^2(-+) ab + b^2)
#Quadratic Formula
If ax^2 + bx + c = 0, then x = (-b (+-) sqrt(b^2 - 4ac)) / (2a)
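A tiny R sketch of this formula (the helper name quad_roots is just for illustration):

#A tiny R sketch of the quadratic formula
quad_roots <- function(a, b, c) {
  disc <- as.complex(b^2 - 4 * a * c)        #discriminant, complex-safe
  (-b + c(1, -1) * sqrt(disc)) / (2 * a)     #x = (-b (+-) sqrt(b^2 - 4ac)) / (2a)
}
quad_roots(1, -3, 2)    #roots of x^2 - 3x + 2 = 0  ->  2 and 1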
------------------------------
########Mensuration########
pi ≈ 3.1416
#Circle
Circumference = 2Pi r
Area = Pi r^2

Length of arc
    = theta (in degrees) (Pi/180) r
    = theta (in radians) r

Area of sector:
    = (theta/360) Pi r^2, with theta in degrees
    = (theta/(2 Pi)) Pi r^2, with theta in radians

#Surface Area formulae
Cube = 6a^2
Cylinder = 2pi r^2 + 2pi rh
Sphere = 4 pi r^2

#Volume formulae
cube = a^3
rectangular prism = abc
cylinder = pi r^2h
cone = (1/3)pi r^2h
sphere = (4/3) pi r^3

########Trigonometry########
sin(q) = opposite / hypotenuse = p/h
cos(q) = base / hypotenuse = b/h
sin(90°) = 1
cos(90°) = 0

Population Genetics............

Population genetics is the branch of genetics that deals with the distribution of, and changes in, allele frequencies
Population Genetics is the study of how populations change genetically over time
Microevolution - the change in the genetic makeup of a population from generation to generation
Gene pool: All the genes present in a population at any given time
Allele frequency : How often a particular allele shows up in a population
The Hardy-Weinberg Theorem: If allele frequencies stay the same from one generation to the next, then no evolution is occurring
p^2 + 2pq + q^2 = 1, where p^2 = frequency of the AA genotype, 2pq = frequency of the Aa genotype, and q^2 = frequency of the aa genotype (see the R sketch below)
(https://www.nature.com/scitable/knowledge/library/the-hardy-weinberg-principle-13235724)
H-W equilibrium assumptions: 1. the population is very large; 2. mating is random; 3. no mutations; 4. no migration
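A minimal sketch of checking these proportions in R (the genotype counts are hypothetical):

#A minimal R sketch of checking Hardy-Weinberg proportions (hypothetical counts)
obs <- c(AA = 360, Aa = 480, aa = 160)              #observed genotype counts
n   <- sum(obs)

p <- unname((2 * obs["AA"] + obs["Aa"]) / (2 * n))  #frequency of allele A
q <- 1 - p                                          #frequency of allele a

expd  <- n * c(p^2, 2 * p * q, q^2)                 #expected under p^2 + 2pq + q^2 = 1
chisq <- sum((obs - expd)^2 / expd)                 #chi-square goodness of fit
pchisq(chisq, df = 1, lower.tail = FALSE)           #df = 3 genotypes - 1 - 1 estimated allele frequency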
Data: SNP, microsatellite
Software: Arlequin, Structure, Populus
Founder effect : a way for nature to randomly create new populations, and eventually new species, from a small subset of an existing population

Geology..................

Mica is a sheet silicate (phyllosilicate).
Intrusive igneous rocks: diorite, gabbro, granite, pegmatite, and peridotite
Extrusive igneous rocks: andesite, basalt, obsidian, pumice, rhyolite, scoria, and tuff
---------
San Andreas Fault: it runs between the Pacific Plate and the North American Plate, and areas around it are earthquake-prone. The Pacific Plate side carries San Diego, Los Angeles and Big Sur; the North American Plate side carries San Francisco, Sacramento and the Sierra Nevada.
Scree: a collection of broken rock fragments at the foot of mountain cliffs or volcanoes. Landforms associated with these materials are often called talus deposits. Animals like marmots and pikas live in this habitat.


Embryology, IVF, PGD/PGS.................

Embryology  studies the prenatal development of gametes (sex cells), fertilization, and development of embryos and fetuses
Embryology encompasses the study of congenital disorders that occur before birth, known as teratology.
Through mitosis, the zygote subdivides into blastomeres which contain the full complement of paternal and maternal chromosomes.
After cleavage, the dividing cells, or morula, becomes a hollow ball, or blastula, which develops a hole or pore at one end.
In bilateral animals, the blastula has 2 fates.
If, in the blastula, the first pore (blastopore) becomes the mouth of the animal, it is a protostome; if the first pore becomes the anus, it is a deuterostome.
The protostomes include most invertebrate animals, such as insects, worms and molluscs, while the deuterostomes include the vertebrates.
In due course, the blastula changes into a more differentiated structure called the gastrula.
(https://www.britannica.com/science/prenatal-development)

The blastocyst has an inner cell mass which  forms the embryo. The outer layer called trophoblast gives rise to the placenta. 
A healthy blastocyst hatches from its outer shell, the zona pellucida, between day 5 and day 7 after fertilization. Within 24 hours after hatching, embryo implantation after IVF (or a "natural" pregnancy) begins as the embryo invades the uterine lining.

Once the blastocyst is implanted in the uterus, embryogenesis continues with the next stage, gastrulation. The gastrula with its blastopore soon develops three distinct layers of cells (the germ layers), from which all the bodily organs and tissues then develop (histogenesis).
The innermost layer, or endoderm, gives rise to the digestive organs, the gills, lungs or swim bladder if present, and kidneys or nephridia. The middle layer, or mesoderm, gives rise to the muscles, the skeleton if any, and the blood system. The outer layer of cells, or ectoderm, gives rise to the nervous system, including the brain, and the skin or carapace and hair, bristles, or scales.
(https://en.wikipedia.org/wiki/Human_embryogenesis)
Embryos of many species often appear similar to one another in early developmental stages. The reason for this similarity is that the species share an evolutionary history. These similarities among species are called homologous structures, which are structures that have the same or similar function and mechanism, having evolved from a common ancestor.
(http://www.biozoomer.com/2011/02/evolution-embryology-evidences.html)
In humans, the term embryo refers to the ball of dividing cells from the moment the zygote implants itself in the uterus wall until the end of the eighth week after conception. Beyond the eighth week after conception (tenth week of pregnancy), the developing human is then called a fetus.

In the first-trimester screen, the maternal blood is tested (by NIPT, or Non-Invasive Prenatal Test) for two normal first-trimester proteins.
Maternal blood also carries cell-free maternal DNA and fetal DNA (placental DNA). Fetal DNA makes up about 10% of all circulating cell-free DNA, with fragment lengths of 150-200 bp.

Then, an ultrasound is used to look at the nuchal translucency region under the skin behind the baby's neck. This test is done between the 11th and 14th week of pregnancy. Nuchal translucency test uses ultrasound to measure the thickness of the fluid buildup at the back of the developing baby's neck. If this area is thicker than normal, it can be an early sign of Down syndrome, trisomy 18, or cardiac problems.

AFP (alpha-fetoprotein), hCG, and Estriol level are tested.
AFP is a major plasma protein produced by the yolk sac of the fetus. The gene coding for this protein is in the q arm of chromosome 4. It is thought to be the fetal form of serum albumin. AFP binds to copper, nickel, fatty acids and bilirubin  and is found in monomeric, dimeric and trimeric forms.
It binds estradiol to prevent the transport of this hormone across the placenta to the fetus. It prevents the virilization of female fetuses. AFP may protect the fetus from maternal estradiol that would otherwise have a masculinizing effect on the fetus. AFP can tell about neural tube defects.
Nowadays, maternal serum is tested for fetal DNA.

Preimplantation genetic diagnosis/screening (PGD or PGS) is used prior to implantation to help identify genetic defects within embryos. PGD benefits couples at risk of passing on a genetic disease or condition, and it prevents certain genetic diseases or disorders from being passed on to the child.

The embryos used in PGD are usually created during the process of in vitro fertilization (IVF).
Eggs are retrieved and fertilized in a laboratory. Over the next three to five days, the embryos divide into multiple cells.
PGD steps:
A few cells (which would have become the placenta) are micro-surgically removed from the embryos, which are about 5 days developed.
After this cell collection, the embryos are safely frozen. The cells derived by embryo biopsy (for genetic material or DNA) are  placed in a tube.
The embryo is screened for genetic abnormalities. The genetic material is evaluated by PCR, FISH, CGH  (comparative genomic hybridisation) arrays or Next Generation Sequencing (NGS) to determine if the inheritance of a problematic gene is present in each embryo.
 
(https://www.invitra.com/preimplantation-genetic-diagnosis-pgd/)

This process takes at least one full week (time taken is constantly being reduced).
Embryos that are free of genetic problems are kept frozen (frozen embryo transfer). Embryos with problematic genes are destroyed.
If PGD finds that the embryos is free of genetic problems, the embryo(s) will be placed in the uterus (by an IVF procedure), and the wait for implantation and a positive pregnancy test begins.
Embryos with the correct number of chromosomes are called euploid embryos.
All women are at risk of producing chromosomally abnormal embryos. As a woman ages, the potential for chromosomally abnormal embryos increases, regardless of the number of embryos produced.
For each embryo tested, PGS results will fall into one of three categories: euploid, aneuploid, or mosaic.
Fertility clinic services: in vitro fertilization with standard insemination, intracytoplasmic sperm injection (ICSI), assisted hatching (AHA), embryo cryopreservation, blastocyst culture, TESE (testicular sperm extraction) and MESA (microsurgical epididymal sperm aspiration) for male factor infertility, and embryo biopsy for pre-implantation genetic diagnosis.
GnRH-Agonist is used to suppress the secretion of gonadotropin hormones
Then multiple follicles are recruited by daily injections of gonadotropins. Ultrasound imaging and hormone assessments are used to monitor follicular development
Final maturation of eggs is done by HCG administration
Egg retrieval is scheduled 34-36 hours after HCG injection, in a surgical suite under intravenous sedation
Ovarian follicles are aspirated using a needle guided by trans-vaginal ultrasonography. Follicular fluids are scanned by the embryologist to locate all available eggs. The eggs are placed in a special media and cultured in an incubator until insemination
If sperm parameters are normal, approximately 50,000 to 100,000 motile sperm are transferred to the dish containing the eggs. This is called standard insemination.
ICSI (Intracytoplasmic sperm injection) technique is utilized to fertilize mature eggs if sperm parameters are abnormal
Embryologist picks up a single spermatozoa using a fine glass micro needle and injects it directly into the egg cytoplasm
If there are no sperm in the ejaculate, sperm may be obtained via a surgical procedure
Fertilization is assessed 16-18 hours after insemination or ICSI
The fertilized eggs are called zygotes and are cultured in a specially formulated culture medium that supports their growth
They will be assessed on the second and third day after retrieval
Blastocyst culture has several advantages. Embryos at this stage have a higher potential for implantation, so fewer embryos can be transferred on day 5 to reduce the chance of multiple pregnancies. Blastocyst culture makes it possible to select the best one or two blastocysts, versus two or three (or rarely four) early embryos, to transfer back to the mother. This reduces the occurrence of potentially risky multiple births. Low embryo numbers and poor embryo quality reduce the chances of good blastocyst development.
Early in the morning on the day of your transfer the embryos are evaluated and photographed by the embryologist
The embryologist will decide based on the rate of development and appearance of the embryos, which and how many embryos are recommended to be transferred.
Typically embryos are transferred at the cleavage stage (4 – 8 cells) (Day 3 after oocyte retrieval) or at the blastocyst stage (a ball of cells with fluid inside)(Day 5)
Embryo transfer is a simple procedure that does not require any anesthesia. Embryos are loaded in a soft catheter and are placed in the uterine cavity through the cervix.
An embryo must hatch out of its outer membrane (zona pellucida) before implanting in the uterine wall (endometrium)
Sometimes the zona is abnormally thick. Laser assisted hatching is a technique that allows a small gap in the zona pellucida to be made. This will aid the embryo in breaking out of this membrane and facilitates implantation
This perforation/ assisted hatching is performed prior to embryo transfer and when doing trophectoderm biopsies
Assisted hatching improves IVF success rates in both fresh embryo transfers and frozen embryo transfers.
(https://emedicine.medscape.com/article/273415-overview)

#Risks of PGS....
Incorrect result
Biopsy stress may cause embryo to arrest growth
Freeze-thaw cycle may be harmful

#Benefits of PGS.......
Beneficial for older women
Females having repeated miscarriages may be benefitted
IVF failure rate may decrease




Metagenomics..........

Metagenome: all the genes in a particular community. Metagenomics is their study.
Metagenomic sequencing from samples (air, gut, tidepool, restroom)
The assembly of complete genomes from samples that are not pure cultures, requires the physical recovery of organism-specific clones from environmental-DNA libraries or the computational recovery from environmental-DNA sequence databases of overlapping target-organism-specific sequences (“contigs”).
(https://teachthemicrobiome.weebly.com/sequencing-the-microbiome.html)
For environments of low complexity, such as the acid mine drainage, it is possible to assemble several genomes simultaneously from an environmental sequence database by  binning methods

The 16S rRNA gene is conserved, so it is targeted by primers for PCR amplification. But species-specific (variable) regions are important for finding species identity in metagenomes.

In functional genomics, the function of each gene in the organism is determined
DNA microarrays, when bearing multiple rRNA as probes, can be used to track variations in population structure

Clustering: a large dataset is divided into distinct subsets based on some specific measure.
Genome annotation is a form of clustering. In metagenomics, where a substantial percentage of sequences cannot be easily classified, annotation often remains at the preliminary stage of clustering. Binning criteria can use GC content and codon usage. The challenge of simultaneously assembling multiple genomes was met by several binning procedures that allow provisional assignment of contigs to different genomes (a toy R sketch follows the link below).
(https://www.slideshare.net/MadsAlbertsen/130707-albertsen-mewe13-metagenomics)
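A toy R sketch of compositional binning by GC content (the contig sequences are made up; real binning would also use k-mer/codon profiles and read coverage):

#A toy R sketch of compositional binning by GC content (made-up contigs)
seqs <- c(contig1 = "ATGCGCGCGGCCGCGGCGCC",
          contig2 = "ATATATTTAAATATTATAAT",
          contig3 = "GCGCGGATCCGCGCGGGCCG")

gc_content <- function(s) {
  bases <- strsplit(s, "")[[1]]
  mean(bases %in% c("G", "C"))                 #fraction of G + C
}

gc   <- sapply(seqs, gc_content)
bins <- kmeans(gc, centers = 2)$cluster        #provisional assignment to two genome bins
split(names(seqs), bins)                       #contigs grouped by bin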
The determination of the genetic content of entire communities of organisms has many usages. The field of metagenomics has been responsible for advances in microbial ecology, evolution, and diversity.
Major steps in metagenomics:
Sample processing; sequencing technology; assembly; binning; annotation; experimental design; statistical analysis, and data storage and sharing

#Sampling and processing

The DNA extracted should be representative of all cells in the sample
Sufficient amounts of high-quality nucleic acids must be obtained for subsequent library production and sequencing.
Enrichment and processing options include fractionation or selective lysis, filtration or centrifugation, and flow cytometry.

#Sequencing technology
Sanger sequencing: the gold standard because of its low error rate, long read length (> 700 bp) and large insert sizes (e.g. > 30 Kb for fosmids or bacterial artificial chromosomes (BACs)). All of these aspects will improve assembly outcomes for shotgun data, and hence Sanger sequencing might still be applicable if generating close-to-complete genomes in low-diversity environments is the objective. Drawbacks of Sanger sequencing are its labor-intensive cloning process, the bias seen against genes toxic to the cloning host, and the overall cost per gigabase (approx. USD 400,000).
Both the 454/Roche and the Illumina/Solexa systems have been applied to metagenomic samples
(http://envgen.github.io/metagenomics.html)
========================
The 454/Roche system applies emulsion polymerase chain reaction (ePCR) to clonally amplify random DNA fragments, which are attached to microscopic beads. Beads are deposited into the wells of a picotitre plate and then individually and in parallel pyrosequenced. The pyrosequencing process involves the sequential addition of all four deoxynucleoside triphosphates, which, if complementary to the template strand, are incorporated by a DNA polymerase. This polymerization reaction releases pyrophosphate, which is converted via two enzymatic reactions to produce light. Light production of ~ 1.2 million reactions is detected in parallel via a charge-coupled device (CCD) camera and converted to the actual sequence of the template. Two aspects are important in this process with respect to metagenomic applications. First, the ePCR has been shown to produce artificial replicate sequences, which will impact any estimates of gene abundance. Understanding the amount of replicate sequences is crucial for the data quality of sequencing runs, and replicates can be identified and filtered out with bioinformatics tools [22,23]. Second, the intensity of light produced when the polymerase runs through a homopolymer is often difficult to correlate to the actual number of nucleotide positions. Typically, this results in insertion or deletion errors in homopolymers and can hence cause reading frameshifts, if protein coding sequences (CDSs) are called on a single read. This type of error can however be incorporated into models of CDS prediction thus resulting in high, albeit not perfect, accuracy [24]. Despite these disadvantages, the much cheaper cost of ~ USD 20,000 per gigabase pair has made 454/Roche pyrosequencing a popular choice for shotgun-sequencing metagenomics. In addition, the 454/Roche technology produces an average read length between 600-800 bp, which is long enough to cause only minor loss in the number of reads that can be annotated [25]. Sample preparation has also been optimized so that tens of nanograms of DNA are sufficient for sequencing single-end libraries [26,27], although pair-end sequencing might still require micrograms quantities. Moreover, the 454/Roche sequencing platform offers multiplexing allowing for up to 12 samples to be analyzed in a single run of ~500 Mbp.

The Illumina/Solexa technology immobilizes random DNA fragments on a surface and then performs solid-surface PCR amplification, resulting in clusters of identical DNA fragments. These are then sequenced with reversible terminators in a sequencing-by-synthesis process [28]. The cluster density is enormous, with hundreds of millions of reads per surface channel and 16 channels per run on the HiSeq2000 instrument. Read length is now approaching 150 bp, and clustered fragments can be sequenced from both ends. Continuous sequence information of nearly 300 bp can be obtained from two overlapping 150 bp paired-reads from a single insert. Yields of ~60 Gbp can therefore be typically expected in a single channel. While Illumina/Solexa has limited systematic errors, some datasets have shown high error rates at the tail ends of reads [29]. In general, clipping reads has proven to be a good strategy for eliminating the error in "bad" datasets, however, sequence quality values should also be used to detect "bad" sequences. The lower costs of this technology (~ USD 50 per Gbp) and recent success in its application to metagenomics, and even the generation of draft genomes from complex dataset [30,31], are currently making the Illumina technology an increasingly popular choice. As with 454/Roche sequencing, starting material can be as low as a 20 nanograms, but larger amounts (500-1000 ng) are required when matepair-libraries for longer insert libraries are made. The limited read length of the Illumina/Solexa technology means that a greater proportion of unassembled reads might be too short for functional annotation than are with 454/Roche technology [25]. While assembly might be advisable in such a case, potential bias, such as the suppression of low-abundance species (which can not be assembled) should be considered, as should the fact that some current software packages (e.g. MG-RAST) are capable of analyzing unassembled Illumina reads of 75 bp and longer. Multiplexing of samples is also available for individual sequencing channels, with more than 500 samples multiplexed per lane. Another important factor to consider is run time, with a 2 × 100 bp paired-end sequencing analysis taking approx. 10 days HiSeq2000 instrument time, in contrast to 1 day for the 454/ Roche technology. However, faster runtime (albeit at higher cost per Gbp of approx. USD 600) can be achieved with the new Illumina MiSeq instrument. This smaller version of Illumina/Solexa technology can also be used to test-run sequencing libraries, before analysis on HiSeq instrument for deeper sequencing.

A few additional sequencing technologies are available that might prove useful for metagenomic applications, now or in the near future. The Applied Biosystems SOLiD sequencer has been extensively used, for example, in genome resequencing [32]. SOLiD arguably provides the lowest error rate of any current NGS sequencing technology, however it does not achieve reliable read length beyond 50 nucleotides. This will limit its applicability for direct gene annotation of unassembled reads or for assembly of large contigs. Nevertheless, for assembly or mapping of metagenomic data against a reference genome, recent work showed encouraging outcomes [33]. Roche is also marketing a smaller-scale sequencer based on pyrosequencing with about 100 Mbp output and low per run costs. This system might be useful, because relatively low coverage of metagenomes can establish meaningful gene profiles [34]. Ion Torrent (and more recently Ion Proton) is another emerging technology and is based on the principle that protons released during DNA polymerization can detect nucleotide incorporation. This system promises read lengths of > 100 bp and throughput on the order of magnitude of the 454/Roche sequencing systems. Pacific Biosciences (PacBio) has released a sequencing technology based on single-molecule, real-time detection in zero-mode waveguide wells. Theoretically, this technology on its RS1 platform should provide much greater read lengths than the other technologies mentioned, which would facilitate annotation and assembly. In addition, a process called strobing will mimic pair-end reads. However, accuracy of single reads with PacBio is currently only at 85%, and random reads are "dropped," making the instrument unusable in its current form for metagenomic sequencing [35]. Complete Genomics is offering a technology based on sequencing DNA nanoballs with combinatorial probe-anchor ligation [36]. Its read length of 35 nucleotides is rather limited and so might be its utility for de novo assemblies. While none of the emerging sequencing technologies have been thoroughly applied and tested with metagenomics samples, they offer promising alternatives and even further cost reduction.

#Assembly
If the research aims at recovering the genomes of uncultured organisms or obtaining full-length CDSs for subsequent characterization, rather than a functional description of the community, then assembly of short read fragments will be performed to obtain longer genomic contigs. The majority of current assembly programs were designed to assemble single, clonal genomes, and their utility for complex pan-genomic mixtures should be approached with caution and critical evaluation.

Two strategies can be employed for metagenomics samples: reference-based assembly (co-assembly) and de novo assembly.

Reference-based assembly can be done with software packages such as Newbler (Roche), AMOS http://sourceforge.net/projects/amos/, or MIRA [37]. These software packages include algorithms that are fast and memory-efficient and hence can often be performed on laptop-sized machines in a couple of hours. Reference-based assembly works well, if the metagenomic dataset contains sequences where closely related reference genomes are available. However, differences in the true genome of the sample to the reference, such as a large insertion, deletion, or polymorphisms, can mean that the assembly is fragmented or that divergent regions are not covered.

De novo assembly typically requires larger computational resources. Thus, a whole class of assembly tools based on the de Bruijn graphs was specifically created to handle very large amounts of data [38,39]. Machine requirements for the de Bruijn assemblers Velvet [40] or SOAP [41] are still significantly higher than for reference-based assembly (co-assembly), often requiring hundreds of gigabytes of memory in a single machine and run times frequently being days.
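A toy R sketch of the de Bruijn-graph idea that these assemblers build on (made-up reads; real assemblers also handle errors, coverage, and far larger graphs):

#A toy R sketch of the de Bruijn-graph idea behind assemblers such as Velvet or SOAP
reads <- c("ATGGCGT", "GGCGTGC", "CGTGCAA")
k <- 4

#Decompose the reads into unique k-mers
kmers <- unique(unlist(lapply(reads, function(r)
  sapply(1:(nchar(r) - k + 1), function(i) substr(r, i, i + k - 1)))))

#Each k-mer is an edge from its (k-1)-prefix to its (k-1)-suffix
edges <- data.frame(from = substr(kmers, 1, k - 1),
                    to   = substr(kmers, 2, k))
edges   #walking this graph reconstructs the contig ATGGCGTGCAA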

The fact that most (if not all) microbial communities include significant variation on a strain and species level makes the use of assembly algorithms that assume clonal genomes less suitable for metagenomics. The "clonal" assumptions built into many assemblers might lead to suppression of contig formation for certain heterogeneous taxa at specific parameter settings. Recently, two de Bruijn-type assemblers, MetaVelvet and Meta-IDBA [42], have been released that deal explicitly with the non-clonality of natural populations. Both assemblers aim to identify within the entire de Bruijn graph a sub-graph that represents related genomes. Alternatively, the metagenomic sequence mix can be partitioned into "species bins" via k-mer binning (Titus Brown, personal communications). Those subgraphs or subsets are then resolved to build a consensus sequence of the genomes. For Meta-IDBA an improvement in terms of N50 and maximum contig length has been observed when compared to "classical" de Bruijn assemblers (e.g. Velvet or SOAP; results from the personal experience of the authors; data not shown here). The development of "metagenomic assemblers" is however still at an early stage, and it is difficult to assess their accuracy for real metagenomic data as typically no references exist to compare the results to. A true gold standard (i.e. a real dataset for a diverse microbial community with known reference sequences) that assemblers can be evaluated against is thus urgently required.

Several factors need to be considered when exploring the reasons for assembling metagenomic data; these can be condensed to two important questions. First, what is the length of the sequencing reads used to generate the metagenomic dataset, and are longer sequences required for annotation? Some approaches, e.g. IMG/M, prefer assembled contigs, other pipelines such as MG-RAST [43] require only 75 bp or longer for gene prediction or similarity analysis that provides taxonomic binning and functional classification. On the whole, however, the longer the sequence information, the better is the ability to obtain accurate information. One obvious impact is on annotation: the longer the sequence, the more information provided, making it easier to compare with known genetic data (e.g. via homology searches [25]). Annotation issues will be discussed in the next section. Binning and classification of DNA fragments for phylogenetic or taxonomic assignment also benefits from long, contiguous sequences and certain tools (e.g. Phylopythia) work reliably only over a specific cut-off point (e.g. 1 Kb) [44]. Second, is the dataset assembled to reduce data-processing requirements? Here, as an alternative to assembling reads into contigs, clustering near-identical reads with cd-hit [45] or uclust [46] will provide clear benefits in data reduction. The MG-RAST pipeline also uses clustering as a data reduction strategy.

Fundamentally, assembly is also driven by the specific problem that single reads have generally lower quality and hence lower confidence in accuracy than do multiple reads that cover the same segment of genetic information. Therefore, merging reads increases the quality of information. Obviously in a complex community with low sequencing depth or coverage, it is unlikely to actually get many reads that cover the same fragment of DNA. Hence assembly may be of limited value for metagenomics.

Unfortunately, without assembly, longer and more complex genetic elements (e.g., CRISPRS) cannot be analyzed. Hence there is a need for metagenomic assembly to obtain high-confidence contigs that enable the study of, for example, major repeat classes. However, none of the current assembly tools is bias-free. Several strategies have been proposed to increase assembly accuracy [38], but strategies such as removal of rare k-mers are no longer considered adequate, since rare k-mers do not represent sequence errors (as initially assumed), but instead represent reads from less abundant pan-genomes in the metagenomic mix.

#Binning
Binning refers to the process of sorting DNA sequences into groups that might represent an individual genome or genomes from closely related organisms. Several algorithms have been developed, which employ two types of information contained within a given DNA sequence. Firstly, compositional binning makes use of the fact that genomes have conserved nucleotide composition (e.g. a certain GC or the particular abundance distribution of k-mers) and this will be also reflected in sequence fragments of the genomes. Secondly, the unknown DNA fragment might encode for a gene and the similarity of this gene with known genes in a reference database can be used to classify and hence bin the sequence.

Compositional-based binning algorithms include Phylopythia [44], S-GSOM [47], PCAHIER [48,49] and TACAO [49], while examples of purely similarity-based binning software include IMG/M [50], MG-RAST [43], MEGAN [51], CARMA [52], SOrt-ITEMS [53] and MetaPhyler [54]. There is also a number of binning algorithms that consider both composition and similarity, including the programs PhymmBL [55] and MetaCluster [56]. All these tools employ different methods of grouping sequences, including self-organising maps (SOMs) or hierarchical clustering, and are operated either in an unsupervised manner or with input from the user (supervised) to define bins.

Important considerations for using any binning algorithm are the type of input data available and the existence of a suitable training datasets or reference genomes. In general, composition-based binning is not reliable for short reads, as they do not contain enough information. For example, a 100 bp read can at best possess only less than half of all 256 possible 4-mers and this is not sufficient to determine a 4-mer distribution that will reliably relate this read to any other read. Compositional assignment can however be improved, if training datasets (e.g. a long DNA fragment of known origin) exist that can be used to define a compositional classifier [44]. These "training" fragments can either be derived from assembled data or from sequenced fosmids and should ideally contain a phylogenetic marker (such as a rRNA gene) that can be used for high-resolution, taxonomic assignment of the binned fragments [57].

Short reads may contain similarity to a known gene and this information can be used to putatively assign the read to a specific taxon. This taxonomic assignment obviously requires the availability of reference data. If the query sequence is only distantly related to known reference genomes, only a taxonomic assignment at a very high level (e.g. phylum) is possible. If the metagenomic dataset, however, contains two or more genomes that would fall into this high taxon assignment, then "chimeric" bins might be produced. In this case, the two genomes might be separated by additional binning based on compositional features. In general, however this might again require that the unknown fragments have a certain length.

Binning algorithms will obviously benefit in the future from the availability of a greater number and phylogenetic breadth of reference genomes, in particular for similarity-based assignment to low taxonomic levels. Post-assembly, the binning of contigs can lead to the generation of partial genomes of yet-uncultured or unknown organisms, which in turn can be used to perform similarity-based binning of other metagenomic datasets. Caution should however be taken to ensure the validity of any newly created genome bin, as "contaminating" fragments can rapidly propagate into false assignments in subsequent binning efforts. Prior to assembly with clonal assemblers, binning can be used to reduce the complexity of an assembly effort and might reduce the computational requirement.

As major annotation pipelines like IMG/M or MG-RAST also perform taxonomic assignments of reads, one needs to carefully weigh the additional computational demands of the particular binning algorithm chosen against the added value they provide.

#Annotation
For the annotation of metagenomes, two different initial pathways can be taken. First, if reconstructed genomes are the objective of the study and assembly has produced large contigs, it is preferable to use existing pipelines for genome annotation, such as RAST [58] or IMG [59]. For this approach to be successful, minimal contig lengths of 30,000 bp or longer are required. Second, annotation can be performed on the entire community and relies on unassembled reads or short contigs. Here the tools for genome annotation are significantly less useful than those specifically developed for metagenomic analyses. Annotation of metagenomic sequence data has in general two steps. First, features of interest (genes) are identified (feature prediction) and, second, putative gene functions and taxonomic neighbors are assigned (functional annotation).

Feature prediction is the process of labeling sequences as genes or genomic elements. For completed genome sequences a number of algorithms have been developed [60,61] that identify CDS with more than 95% accuracy and a low false negative ratio. A number of tools were specifically designed to handle metagenomic prediction of CDS, including FragGeneScan [24], MetaGeneMark [62], MetaGeneAnnotator (MGA)/Metagene [63] and Orphelia [64,65]. All of these tools use internal information (e.g. codon usage) to classify sequence stretches as either coding or non-coding, however they distinguish themselves from each other by the quality of the training sets used and their usefulness for short or error-prone sequences. FragGeneScan is currently the only algorithm known to the authors that explicitly models sequencing errors and thus results in gene prediction errors of only 1-2%. True positive rates of FragGeneScan are around 70% (better than most other methods), which means that even this tool still misses a significant subset of genes. These missing genes can potentially be identified by BLAST-based searches, however the size of current metagenomic datasets makes this computationally expensive step often prohibitive.

There exists also a number of tools for the prediction of non-protein coding genes such as tRNAs [66,67], signal peptides [68] or CRISPRs [69,70], however they might require significant computational resources or long contiguous sequences. Clearly subsequent analysis depends on the initial identification of features and users of annotation pipelines need to be aware of the specific prediction approaches used. MG-RAST uses a two-step approach for feature identification, FGS and a similarity search for ribosomal RNAs against a non-redundant integration of the SILVA [71], Greengenes [72] and RDP [73] databases. CAMERA's RAMCAPP pipeline [74] uses FGA and MGA, while IMG/M employs a combination of tools, including FGS and MGA [58,59].

Functional annotation represents a major computational challenge for most metagenomic projects and therefore deserves much attention now and over the next years. Current estimates are that only 20 to 50% of metagenomic sequences can be annotated [75], leaving the immediate question of the importance and function of the remaining genes. We note that annotation is not done de novo, but via mapping to gene or protein libraries with existing knowledge (i.e., a non-redundant database). Any sequences that cannot be mapped to the known sequence space are referred to as ORFans. These ORFans are responsible for the seemingly never-ending genetic novelty in microbial metagenomics (e.g. [76]). Three hypotheses exist for the existence of this unknown fraction. First, ORFans might simply reflect erroneous CDS calls caused by imperfect detection algorithms. Second, these ORFans are real genes, but encode unknown biochemical functions. Third, ORFan genes have no sequence homology with known genes, but might have structural homology with known proteins, thus representing known protein families or folds. Future work will likely reveal that the truth lies somewhere between these hypotheses [77]. For improving the annotation of ORFan genes, we will rely on the challenging and labor-intensive task of protein structure analysis (e.g. via NMR and x-ray crystallography) and on biochemical characterization.

Currently, metagenomic annotation relies on classifying sequences to known functions or taxonomic units based on homology searches against available "annotated" data. Conceptually, the annotation is relatively simple and for small datasets (< 10,000 sequences) manual curation can be used to increase the accuracy of any automated annotation. Metagenomic datasets are typically very large, so manual annotation is not possible. Automated annotation therefore has to become more accurate and computationally inexpensive. Currently, running a BLASTX similarity search is computationally expensive; as much as ten times the cost of sequencing [78]. Unfortunately, computationally less demanding methods involving detecting feature composition in genes [44] have limited success for short reads. With growing dataset sizes, faster algorithms are urgently needed, and several programs for similarity searches have been developed to resolve this issue [46,79-81].

Many reference databases are available to give functional context to metagenomic datasets, such as KEGG [82], eggNOG [83], COG/KOG [84], PFAM [85], and TIGRFAM [86]. However, since no reference database covers all biological functions, the ability to visualize and merge the interpretations of all database searches within a single framework is important, as implemented in the most recent versions of MG-RAST and IMG/M. It is essential that metagenome analysis platforms be able to share data in ways that map and visualize data in the framework of other platforms. These metagenomic exchange languages should also reduce the burden associated with re-processing large datasets, minimizing, the redundancy of searching and enabling the sharing of annotations that can be mapped to different ontologies and nomenclatures, thereby allowing multifaceted interpretations. The Genomic Standards Consortium (GSC) with the M5 project is providing a prototypical standard for exchange of computed metagenome analysis results, one cornerstone of these exchange languages.

Several large-scale databases are available that process and deposit metagenomic datasets. MG-RAST, IMG/M, and CAMERA are three prominent systems [43,50,74]. MG-RAST is a data repository, an analysis pipeline and a comparative genomics environment. Its fully automated pipeline provides quality control, feature prediction and functional annotation and has been optimized for achieving a trade-off between accuracy and computational efficiency for short reads using BLAT [Kent, 2002]. Results are expressed in the form of abundance profiles for specific taxa or functional annotations. Supported are the comparison of NCBI taxonomies derived from 16S rRNA gene or whole genome shotgun data and the comparison of relative abundance for KEGG, eggNOG, COG and SEED subsystems on multiple levels of resolution. Users can also download all data products generated by MG-RAST, share them and publish within the portal. The MG-RAST web interface allows comparison using a number of statistical techniques and allows for the incorporation of metadata into the statistics. MG-RAST has more than 7000 users, > 38,000 uploaded and analyzed metagenomes (of which 7000 are publicly accessible) and 9 Terabases analyzed as of December 2011. These statistics demonstrate a move by the scientific community to centralize resources and standardize annotation.

IMG/M also provides a standardized pipeline, but with "higher" sensitivity as it performs, for example, hidden Markov model (HMM) and BLASTX searches at substantial computational cost. In contrast to MG-RAST, comparisons in IMG/M are not performed on an abundance table level, but are based on an all vs. all genes comparison. Therefore IMG/M is the only system that integrates all datasets into a single protein level abstraction. Both IMG/M and MG-RAST provide the ability to use stored computational results for comparison, enabling comparison of novel metagenomes with a rich body of other datasets without requiring the end-user to provide the computational means for reanalysis of all datasets involved in their study. Other systems, such as CAMERA [74], offer more flexible annotation schema but require that individual researchers understand the annotation of data and analytical pipelines well enough to be confident in their interpretation. Also for comparison, all datasets need to be analyzed using the same workflow, thus adding additional computational requirements. CAMERA allows the publication of datasets and was the first to support the Genomic Standards Consortium's Minimal Information checklists for metadata in their web interface [87].

MEGAN is another tool used for visualizing annotation results derived from BLAST searches in a functional or taxonomic dendrogram [51]. The use of dendrograms to display metagenomic data provides a collapsible network of interpretation, which makes analysis of particular functional or taxonomic groups visually easy.

#Experimental Design and Statistical Analysis
Owing to the high costs, many of the early metagenomic shotgun-sequencing projects were not replicated or were focused on targeted exploration of specific organisms (e.g. uncultured organisms in low-diversity acid mine drainage [2]). Reduction of sequencing cost (see above) and a much wider appreciation of the utility of metagenomics to address fundamental questions in microbial ecology now require proper experimental designs with appropriate replication and statistical analysis. These design and statistical aspects, while obvious, are often not properly implemented in the field of microbial ecology [88]. However, many suitable approaches and strategies are readily available from the decades of research in quantitative ecology of higher organisms (e.g. animals, plants). In a simplistic way, the data from multiple metagenomic shotgun-sequencing projects can be reduced to tables, where the columns represent samples, the rows indicate either a taxonomic group or a gene function (or groups thereof), and the fields contain abundance or presence/absence data. This is analogous to species-sample matrices in the ecology of higher organisms, and hence many of the statistical tools available to identify correlations and statistically significant patterns are transferable. As metagenomic data however often contain many more species or gene functions than the number of samples taken, appropriate corrections for multiple hypothesis testing have to be implemented (e.g. Bonferroni correction for t-test based analyses).
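A one-line illustration of such a correction in base R (the p-values are hypothetical):

#Multiple-testing correction in base R (hypothetical per-taxon t-test p-values)
pvals <- c(0.001, 0.01, 0.04, 0.20)
p.adjust(pvals, method = "bonferroni")   #Bonferroni-adjusted p-values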

The Primer-E package [89] is a well-established tool, allowing for a range of multivariate statistical analyses, including the generation of multidimensional scaling (MDS) plots, analysis of similarities (ANOSIM), and identification of the species or functions that contribute to the difference between two samples (SIMPER). Recently, multivariate statistics was also incorporated in a web-based tools called Metastats [90], which revealed with high confidence discriminatory functions between the replicated metagenome dataset of the gut microbiota of lean and obese mice [91]. In addition, the ShotgunFunctionalizeR package provides several statistical procedures for assessing functional differences between samples, both for individual genes and for entire pathways using the popular R statistical package [92].
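A minimal sketch of a comparable multivariate analysis in R with the vegan package (assumed installed; the abundance table and grouping below are made up, not from the studies cited above):

#A minimal sketch of a comparable multivariate analysis with the vegan package
library(vegan)
set.seed(7)
abund <- matrix(rpois(40, lambda = 10), nrow = 8,
                dimnames = list(paste0("sample", 1:8), paste0("taxon", 1:5)))
group <- factor(rep(c("lean", "obese"), each = 4))   #hypothetical sample grouping

dist_bc <- vegdist(abund, method = "bray")           #Bray-Curtis dissimilarities
mds     <- metaMDS(dist_bc)                          #non-metric MDS ordination
ordiplot(mds, type = "text")                         #MDS plot of the samples
anosim(dist_bc, group)                               #test for between-group differences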

Ideally, and in general, experimental design should be driven by the question asked (rather than technical or operational restriction). For example, if a project aims to identify unique taxa or functions in a particular habitat, then suitable reference samples for comparison should be taken and processed in consistent manner. In addition, variation between sample types can be due to true biological variation, (something biologist would be most interested in) and technical variation and this should be carefully considered when planning the experiment. One should also be aware that many microbial systems are highly dynamic, so temporal aspects of sampling can have a substantial impact on data analysis and interpretation. While the question of the number of replicates is often difficult to predict prior to the final statistical analysis, small-scale experiments are often useful to understand the magnitude of variation inherent in a system. For example, a small number of samples could be selected and sequenced to shallower depth, then analyzed to determine if a larger sampling size or greater sequencing effort are required to obtain statistically meaningful results [88]. Also, the level at which replication takes place is something that should not lead to false interpretation of the data. For example, if one is interested in the level of functional variation of the microbial community in habitat A, then multiple samples from this habitat should be taken and processed completely separately, but in the same manner. Taking just one sample and splitting it up prior to processing will provide information only about technical, but not biological, variation in habitat A. Taking multiple samples and then pooling them will lose all information on variability and hence will be of little use for statistical purposes. Ultimately, good experimental design of metagenomic projects will facilitate integration of datasets into new or existing ecological theories [93].

As metagenomics gradually moves through a range of explorative biodiversity surveys, it will also prove extremely valuable for manipulative experiments, which allow the impact of a treatment on the functional and phylogenetic composition of microbial communities to be observed. Initial experiments have already shown promising results [94]. However, careful experimental planning and interpretation should remain paramount in this field.

One of the ultimate aims of metagenomics is to link functional and phylogenetic information to the chemical, physical, and other biological parameters that characterize an environment. While measuring all of these parameters can be time-consuming and cost-intensive, it allows retrospective correlation analyses of metagenomic data that were perhaps not part of the initial aim of the project or that might be of interest for other research questions. The value of such metadata cannot be overstated and, in fact, providing it has become a mandatory or optional requirement for deposition of metagenomic data into some databases [50,74].
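
A minimal, hedged sketch of such a retrospective analysis in R is given below: contextual metadata are kept in a small table alongside a per-sample abundance value for one gene function, and a rank correlation is tested. All sample names and values are invented.

#In R code
meta <- data.frame(
  sample      = paste0("S", 1:6),
  temperature = c(4, 6, 5, 18, 20, 19),        # degrees C
  pH          = c(7.9, 8.1, 8.0, 6.5, 6.4, 6.6)
)
func_abund <- c(120, 135, 128, 310, 290, 305)  # e.g. reads per million assigned to one gene function

# Retrospective question: does the function track temperature across samples?
cor.test(meta$temperature, func_abund, method = "spearman")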

Sharing and Storage of Data
Data sharing has a long tradition in the field of genome research, but for metagenomic data this will require a whole new level of organization and collaboration to provide metadata and centralized services (e.g., IMG/M, CAMERA and MG-RAST) as well as sharing of both data and computational results. In order to enable sharing of computed results, some aspects of the various analytical pipelines mentioned above will need to be coordinated - a process currently under way under the auspices of the GSC. Once this has been achieved, researchers will be able to download intermediate and processed results from any one of the major repositories for local analysis or comparison.

A suite of standard languages for metadata is currently provided by the Minimum Information about any (x) Sequence (MIxS) checklists [95]. MIxS is an umbrella term covering MIGS (the Minimum Information about a Genome Sequence), MIMS (the Minimum Information about a Metagenome Sequence) and MIMARKS (the Minimum Information about a MARKer Sequence) [87], and it contains standard formats for recording environmental and experimental data. The latest of these checklists, MIMARKS, builds on the foundation of the MIGS and MIMS checklists by expanding the rich contextual information recorded about each environmental sample.

The question of centralized versus decentralized storage is also one of who pays for the storage, which is a matter with no simple answer. The US National Center for Biotechnology Information (NCBI) is mandated to store all metagenomic data; however, the sheer volume of data being generated means there is an urgent need for appropriate ways of storing vast amounts of sequence. As the cost of sequencing continues to drop while the cost of analysis and storage remains more or less constant, selective storage of data in either biological (i.e. the sample that was sequenced) or digital form in (de-)centralized archives might be required. Ongoing work and successes in the compression of (meta-)genomic data [96], however, might mean that digital information can still be stored cost-efficiently in the near future.
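
To make the storage trade-off concrete, the short R sketch below writes a toy FASTA file, compresses it with general-purpose gzip and reports the compression ratio; the file name is arbitrary, and the specialized (meta-)genomic compressors referenced above aim to do better than this baseline.

#In R code
fasta <- "toy_reads.fasta"                               # placeholder file name
writeLines(c(">read1",
             paste(sample(c("A", "C", "G", "T"), 1e4, replace = TRUE), collapse = "")),
           fasta)                                        # one random 10 kb record

gz <- paste0(fasta, ".gz")
writeLines(readLines(fasta), gzfile(gz))                 # gzip-compressed copy
file.size(fasta) / file.size(gz)                         # rough compression ratio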

Conclusion
Metagenomics has benefited in the past few years from many visionary investments in both financial and intellectual terms. To ensure that those investments are utilized in the best possible way, the scientific community should aim to share, compare, and critically evaluate the outcomes of metagenomic studies. As datasets become increasingly complex and comprehensive, novel tools for analysis, storage, and visualization will be required. These will ensure the best use of metagenomics as a tool to address fundamental questions of microbial ecology, evolution and diversity and to derive and test new hypotheses. Metagenomics will be employed as commonly and frequently as any other laboratory method, and "metagenomizing" a sample might become as colloquial as "PCRing." It is therefore also important that metagenomics be taught to students and young scientists in the same way that other techniques and approaches have been in the past.

Metagenomics Sequencing Guide
Coverage, read length, and workflow recommendations

Metagenomics overview
Metagenomics refers to both a research technique and a research field. Metagenomics, the field, can be defined as the genomic analysis of microbial DNA from environmental communities. Metagenomic tools enable the population-level analysis of unculturable or previously unknown microbes. This is important, as only around 1-2% of bacteria can be cultured in the laboratory (1). The ability to identify microbes without a priori knowledge of what a sample contains is opening new doors in disciplines such as microbial ecology, virology, microbiology, environmental science and biomedical research. Sequencing-based examination of the metagenome has become a powerful tool for generating novel hypotheses.

Shotgun metagenomic sequencing

Shotgun metagenomic sequencing is a relatively new environmental sequencing approach used to examine thousands of organisms in parallel and comprehensively sample all genes, providing insight into community biodiversity and function. Shotgun sequencing also allows for the detection of low-abundance members of microbial communities.

Metagenomics sequencing workflow
There are several steps involved in a sequencing-based metagenomics project. These include DNA extraction, library preparation, sequencing, assembly, annotation and statistical analysis.

Sample extraction
A reproducible method for extracting DNA from microbial communities is essential for both surveys and whole-genome metagenomic analysis. Isolation and extraction must yield high-quality nucleic acid for subsequent library preparation and sequencing. Sampling variation can affect comparisons and abundance measurements. This introduces several challenges, as some samples must be delivered anaerobically, and exposure to oxygen or freezing can change the composition of a given microbial community. For example, freezing, thawing and subsequent bead-beating can affect the cell wall of Gram-positive bacteria and introduce artifacts compared with extraction performed on fresh samples.

Kits frequently used for DNA extraction from environmental samples include:

MoBIO DNA Extraction Kit
Qiagen DNA Microbiome Kit
Epicentre Metagenomic DNA Isolation Kit for water
Epicentre Meta-G-Nome DNA Isolation Kit
If the target community is associated with a host, e.g. human or plant, then physical fractionation or selective lysis can be employed to ensure host DNA is kept to a minimum. Host material can also be removed by bioinformatic filtering and mapping. Regardless of the approach used, it is important to remember that extraction and isolation methods can introduce bias in terms of microbial diversity, yield and fragment lengths. It is highly recommended that exactly the same extraction method be used when comparing samples.
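
One common way to do the bioinformatic host-read removal mentioned above is to map reads against a host reference and keep only the pairs that fail to align; the hedged sketch below wraps a bowtie2 call from R, with the index name, FASTQ file names and thread count as placeholders.

#In R code
cmd <- paste(
  "bowtie2 -p 8 -x host_index",                    # prebuilt index of the host genome (placeholder name)
  "-1 sample_R1.fastq.gz -2 sample_R2.fastq.gz",   # raw paired-end reads (placeholder names)
  "--un-conc-gz sample_nonhost.fastq.gz",          # pairs that do not align to the host = putative microbial reads
  "-S /dev/null"                                   # discard the alignments themselves
)
system(cmd)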

Library preparation
One of the biggest considerations in library preparation of environmental samples for shotgun metagenomic sequencing is amplification. Certain types of samples (water, swabs) yield small amounts of DNA, necessitating amplification during library preparation. PCR can preferentially amplify certain fragments over others, confounding abundance and microbial diversity measurements. Often the user does not have a choice when faced with low DNA inputs. Minimizing variability, constructing libraries together to reduce batch effects, and keeping library preparation steps as consistent as possible between samples is good practice. If you are able to extract enough DNA (~250-500 ng), an amplification-free library preparation method is recommended. The following library preparation kits are frequently used for metagenomics library preparation:

Bioo Scientific NEXTflex PCR-Free DNA Sequencing Kit
Illumina TruSeq PCR-Free Library Preparation Kit
Kapa Hyper Prep Kit
Sequencing
Shotgun metagenomic sequencing is unique in the sense that you are trying to sequence a large, diverse pool of microbes, each with a different genome size, often mixed with host DNA. Current sequencing technologies offer a wide variety of read lengths and outputs. Illumina sequencing technology offers short reads (2x250 or 2x300 bp) but generates high sequencing depth. Longer reads are preferred because they overcome short contigs and other difficulties during assembly. However, instruments that offer longer reads, e.g. PacBio and Oxford Nanopore, come with higher error rates, lower sequencing depth and higher costs. PacBio error rates can be reduced using circular consensus sequencing (CCS), which involves repeatedly sequencing a circular template and generating a consensus for the DNA insert. High-quality reads of 500-4,000 bp can be generated with >99% (Q20) accuracy.

Setting costs aside and simply comparing a long PacBio read with a short Illumina read, PacBio reads can be expected to improve metagenomic assembly statistics and the genome binning of difficult-to-assemble genomes. PacBio sequencing is recommended for isolates or in cases where you are only interested in examining several abundant organisms. Illumina reads are recommended for metagenomics studies where the difference between rare and abundant cells is significant. A compromise many in the field now use is a hybrid of Illumina and PacBio reads. Hybrid assemblies using PacBio CCS reads and HiSeq contigs improve assembly statistics, the number of contigs and overall contig length. By combining both read types (PacBio and Illumina), you have a higher probability of achieving complete chromosomal closure. Rare microbial species will still have to rely on high-depth Illumina sequencing alone for proper assembly.

Assembly
Assembly involves merging reads from the same genome into a single contiguous sequence (contig). Most available tools build on the traditional de Bruijn graph approach to genome assembly. One of the biggest challenges in assembly is the generation of chimeras, where two sequences from different genomes, or from different parts of the same genome, are incorrectly merged because of similar sequence composition. This is often mitigated by performing a binning step, assigning each metagenomic sequence to a taxonomic group, and then assembling each bin independently. This reduces data complexity and the chance of chimeras.
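
A hedged toy sketch of the de Bruijn graph idea in R is shown below: each read is broken into k-mers, and an edge connects two k-mers whenever the (k-1)-base suffix of one equals the (k-1)-base prefix of the other. The reads are invented, and real metagenome assemblers (e.g. metaSPAdes, MEGAHIT) implement the same idea at vastly larger scale with error correction.

#In R code
kmers <- function(seq, k) {
  substring(seq, 1:(nchar(seq) - k + 1), k:nchar(seq))   # all k-mers of one read
}

reads <- c("ATGGCGT", "GGCGTAC", "CGTACGA")               # toy overlapping reads
k <- 4
nodes <- unique(unlist(lapply(reads, kmers, k = k)))

# Edge list: k-mer A -> k-mer B when suffix(A, k-1) == prefix(B, k-1)
edges <- do.call(rbind, lapply(nodes, function(a) {
  hits <- nodes[substring(nodes, 1, k - 1) == substring(a, 2, k)]
  if (length(hits)) data.frame(from = a, to = hits) else NULL
}))
edges   # walking this path reconstructs the contig ATGGCGTACGA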

Annotation
Once assembled, genes can be predicted and functionally annotated. Genes are typically predicted in one of three ways: 1) de novo gene prediction, 2) protein family classification, 3) fragment recruitment (binning). Functional annotation is performed by classifying predicted metagenomic proteins into protein families using sequence or hidden Markov model (HMM) databases. Frequently used sequence databases for functional annotation include:

SEED, KEGG, MetaCyc, EggNOG
HMM databases for metagenomics analysis are usually limited to Pfam, which uses HMMs to model protein domains.
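
As a hedged sketch of what the downstream tabulation of such annotations might look like: assuming a tab-delimited table of predicted proteins and their assigned families has already been produced (the file name "annotations.tsv" and its columns protein_id and family are hypothetical), the most abundant families can be summarized in R.

#In R code
ann <- read.delim("annotations.tsv", stringsAsFactors = FALSE)   # hypothetical table: protein_id, family

fam_counts <- sort(table(ann$family), decreasing = TRUE)
head(fam_counts, 10)                      # ten most abundant protein families in the sample

rel <- fam_counts / sum(fam_counts)       # relative abundances, comparable across samples of different size
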
Metagenomics is not 16S sequencing: the latter offers a phylogenetic survey based on the diversity of a single ribosomal gene, the 16S rRNA gene.
The 16S rRNA gene is a taxonomic genomic marker that is common to almost all bacteria and archaea. The marker allows one to examine genetic diversity in microbial communities, specifically what microbes are present in a sample. While some estimates of relative abundance within similar samples can be made, drawing conclusions across different sample types is not recommended due to amplification artifacts introduced during PCR.

16S rRNA sequencing is accomplished by designing primers against the entire 16S locus or by targeting multiple hypervariable domains within the gene. The nine variable regions of the 16S rRNA gene are flanked by conserved stretches in the majority of bacteria. These conserved regions can be used as targets for PCR primers designed upstream and downstream of the variable domains. The hypervariable regions provide the species-specific signature necessary for identification. After these domains have been amplified, sequencing adapters are either ligated or added in a second PCR step.
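
The flanking-primer logic can be illustrated with a hedged toy example in R: two made-up "conserved" sites bracket a variable stretch, and the region between them is pulled out with a regular expression. The sequences below are placeholders, not real 16S primers.

#In R code
template <- "GGGGGAAATTTCCAGTCAGTNNNNNNNNNNNNNNNNACGTTACGTTAAAAAGGGGG"   # toy gene with an N-filled variable region
fwd      <- "AAATTTCCAGTCAGT"    # stands in for a conserved site upstream of the variable region
rev_site <- "ACGTTACGTT"         # stands in for the conserved site downstream

amplicon <- regmatches(template, regexpr(paste0(fwd, ".*", rev_site), template))
amplicon   # the extracted stretch spans the variable region between the two conserved sites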

Advantages of 16S rRNA sequencing

The gene is universally distributed
The abundance of 16S rRNA gene sequences exceeds that of other bacterial genes
Phylogenetic relationships across different taxa can be measured easily
Horizontal gene transfer is not a big problem
Costs to perform 16S rRNA amplification and sequencing are typically between $47 and $60 per sample
Despite the wide use of 16S sequencing, several factors limit proper interpretation of the data.

Disadvantages of 16S rRNA sequencing

Copy numbers per genome can vary; while they tend to be taxon-specific, variation among strains is possible
Relative abundance measurements are unreliable because of amplification biases
Sequence diversity within the gene itself tends to inflate diversity estimates
The resolution of the 16S gene is often too low to differentiate between closely related species
As sequencing costs drop, microbiome research is moving from 16S rRNA gene sequencing to more comprehensive functional representations via whole-genome or shotgun metagenomic sequencing.

Sequencing coverage / depth for metagenomics studies
The short answer is that there is no easy way to estimate the read depth required for shotgun metagenomic sequencing. Environmental samples contain a large distribution of species, and each species would have to be accounted for individually. You would need to know the total number of species in the sample, and the genome size and relative abundance of each species. In most cases this is not possible when you are sequencing a sample for the first time.

Let's assume you were dealing with a simple sample that had 10 bacterial species and wanted 100x coverage depth for de novo assembly. If each of your 10 bacterial species had an estimated genome size of 2 Mb, you would aim for around 2 Gb of sequencing data per sample:

10 dominant bacterial species * 100x * 2 Mb = 2 Gb
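
The back-of-the-envelope calculation above can be generalized with a small, hedged R helper that accounts for unequal relative abundances: because reads are sampled roughly in proportion to abundance, the rarest genome of interest dictates the total output. The function name and all numbers are illustrative, and host or unclassifiable reads are ignored.

#In R code
required_gb <- function(genome_size_mb, rel_abundance, target_cov = 100) {
  # coverage_i ~ total_bases * rel_abundance_i / genome_size_i, so solve for total_bases
  total_bases_mb <- max(target_cov * genome_size_mb / rel_abundance)
  total_bases_mb / 1000                    # convert Mb to Gb
}

# The simple example from the text: 10 species of 2 Mb each, equally abundant
required_gb(genome_size_mb = rep(2, 10), rel_abundance = rep(0.1, 10))            # ~2 Gb

# Same community, but with one genome at only 1% relative abundance
required_gb(genome_size_mb = rep(2, 10), rel_abundance = c(rep(0.11, 9), 0.01))   # ~20 Gb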

In most metagenomics studies there are thousands or millions of species to contend with, often mixed with host DNA. Many of these reads will be removed by mapping and filtering, but they still need to be accounted for when planning sequencing output.

If you need to perform de novo assembly, sequence a sample on one paired-end Illumina HiSeq lane.
If you already have an assembly and need to measure abundances, start with one single-end Illumina MiSeq run.
While Illumina read lengths go up to 2x300 bp, those are currently reserved for the MiSeq, which may not give you the depth needed on a single lane.

Deep sequencing of viral or bacterial nucleic acids monitors the presence and diversity of microbes in select populations and locations.
Metagenomic study of mammalian viromes
High-throughput sequencing of patient and untreated-sewage microbiomes showed many sequences with no similarity to genomic sequences of known function or origin
To estimate the distribution of functional RNAs in these microbiomes, the hammerhead ribozyme (HHR) motif is used to search for sequences capable of assuming its three-way junction fold

Laboratory tools and reagents (Micro-pipettes)...

Micro-pipettes are essential tools in R&D labs and an integral part of Good Laboratory Practices (GLPs). Micro-pipetting methods include ...