Whole Genome Sequencing (WGS) and Metagenomics

by Joao Andre CARRIÇO — November 24, 2021

Genome Sequence



All of the information required for a living being to grow, reproduce, and mature is encoded in sequences of four building blocks called nucleotides (commonly abbreviated as C, G, A, and T). Inside cells, these nucleotides are polymerized into long chains of deoxyribonucleic acid (DNA). DNA consists of two complementary chains running alongside each other, and each nucleotide on one chain pairs with a partner on the other, so the length of a DNA molecule is usually counted in base pairs. Coding sequences within DNA are called genes; they carry the information needed to make functional products such as enzymes or other proteins. Non-coding sequences can act as regulators of that synthesis, and some have functions that are yet to be discovered. Inside the cell, the millions to billions of base pairs are arranged into chromosomes that together make up the genome of a particular organism. Throughout life, when an organism needs something, such as a particular protein, the information in the corresponding gene is copied into a discrete and mobile blueprint in the form of ribonucleic acid (RNA), which the cell reads to make whatever is needed.
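As a rough illustration of base pairing and of copying a gene into RNA, the sketch below treats DNA as a plain Python string. The function names and the 9-base example sequence are invented for illustration, and real transcription involves far more than a letter substitution.

```python
# Pairing rules for DNA: A pairs with T, and C pairs with G.
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, written in its own reading direction."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(strand))


def transcribe(coding_strand: str) -> str:
    """Copy a coding strand into RNA: same letters, with U in place of T."""
    return coding_strand.replace("T", "U")


gene = "ATGGCCTAA"  # hypothetical sequence, not a real gene
print(reverse_complement(gene))  # TTAGGCCAT
print(transcribe(gene))          # AUGGCCUAA
```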

Understanding the connection between what is encoded in the DNA of an organism, such as a specific strain of Salmonella bacteria, and the ultimate biological consequences, such as hardiness or resistance to antibiotics, is clearly of immense value. Similarly, because DNA defines an organism's physical traits, knowledge of the DNA sequence can sometimes help in predicting or overcoming genetic disorders such as cystic fibrosis and muscular dystrophy. However, obtaining this information is often challenging because it relies on determining the order of the building blocks within DNA.



Whole genome sequencing is the determination, in a single effort, of the order of an organism’s entire set of C, G, A, and T nucleotides. The first entire human genome sequence (approximately 3 billion base pairs long) was obtained by piecing together many separate measurements from different individuals. Whole genome sequencing of a human, including massive stretches of non-coding DNA, is uncommon, mostly due to its high cost and the time required. By contrast, more focused sequencing of the specific parts of an individual’s DNA relevant to risk of disease or ancestry is now commonplace and is a powerful tool in personalized healthcare and pharmaceuticals. Whole genome sequencing is also more easily achieved for simpler organisms, such as bacteria, and the knowledge base connecting genetic information to expression and biological function continually grows, adding further value to genetic sequencing.



The Sanger method for DNA sequencing was developed in 1977 and has historically been the most widely used. It remains in use today for smaller, targeted sequencing projects. Sanger sequencing relies on two scientific principles. The first is that DNA fragments can be precisely separated from each other on the basis of their length (for example, by gel separation); under the right conditions, fragments that differ in length by as little as a single nucleotide can be distinguished. Therefore, if you have a mixture of thousands of DNA fragments of different lengths in a “soup,” each length can be detected separately. The second is that there exist modified “marker” C, G, A, and T nucleotides that each glow a different bright color and that, once added to a DNA chain, prevent any further nucleotides from being linked on.

The Sanger sequencing process begins by mixing a sample of the DNA to be sequenced with all the ingredients needed for that DNA to be copied; if left alone, the DNA would be fully copied many times. However, small amounts of marker C, G, A, and T nucleotides are also added to the mixture. These are incorporated at random, and each copy stops growing as soon as one is added. The resulting “soup” is a mixture of DNA copies of different lengths, each with a brightly glowing marker nucleotide at its end. When the mixture is separated by gel electrophoresis, the lengths are detected one at a time, from shortest to longest, and the color of each marker nucleotide is recorded in order. The resulting series of colors can be read as the sequence of the original DNA.
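The chain-termination idea described above can be sketched as a small simulation. This is a simplified model under stated assumptions: the template sequence, the 5% terminator chance, and the function names are all made up for illustration, and the model ignores the fact that real copies are complementary to the template.

```python
import random


def sanger_fragments(sequence, copies=5000, terminator_chance=0.05, seed=0):
    """Simulate chain-terminated copies of `sequence`.

    Each copy grows one base at a time. At every step there is a small
    chance that a labelled terminator is used, which ends that copy.
    Returns a list of (fragment_length, label_of_last_base) pairs.
    """
    rng = random.Random(seed)
    fragments = []
    for _ in range(copies):
        for position, base in enumerate(sequence, start=1):
            if rng.random() < terminator_chance:
                fragments.append((position, base))
                break
    return fragments


def read_by_length(fragments):
    """Order fragments by length (the separation step) and read one label per length."""
    label_at_length = {}
    for length, label in fragments:
        label_at_length[length] = label
    return "".join(label_at_length[n] for n in sorted(label_at_length))


template = "GATTACAGGCTTACCGA"  # hypothetical sequence
fragments = sanger_fragments(template)
print(read_by_length(fragments))
```

With thousands of copies, every possible length is represented many times over, so reading the labels from shortest to longest fragment recovers the template.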

Using the Sanger sequencing method, DNA fragments of up to around one thousand base pairs can be sequenced in a single experiment. Although this number may seem impressive, whole genomes contain many megabase pairs or even gigabase pairs. For example, the fruit fly genome contains approximately 137,000,000 base pairs. More recent techniques have therefore been developed, building on the Sanger method, which enable sequencing of longer DNA segments, up to the entire genome, in a more time- and cost-efficient way.
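To put the gap between read length and genome size in numbers, the short calculation below uses the two figures quoted above. It is a lower bound only, assuming reads could be laid end to end with no overlap, which real projects cannot do.

```python
import math

genome_length = 137_000_000  # fruit fly genome size quoted above, in base pairs
read_length = 1_000          # approximate upper limit of a single Sanger run

# Lower bound: reads placed end to end, with no overlap and no repeats.
minimum_reads = math.ceil(genome_length / read_length)
print(minimum_reads)  # 137000
```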



Scientists at modern whole genome sequencing centers rely on principles similar to those of the Sanger method but take advantage of state-of-the-art microfluidics and bioinformatics to break a massive genome sequencing problem down into thousands of smaller DNA sequences that can later be reassembled into the full, or almost full, genome.



When a DNA molecule is too long to be sequenced in a single run, it must be cut into many more manageable “bite-sized” pieces, which are individually replicated and sequenced, with overlapping areas, and then reassembled to create the overall genetic picture. While this is possible to accomplish manually, each DNA cutting, replication, and sequencing step is the equivalent of an entire first-generation study, such as one using the Sanger or Maxam-Gilbert method.
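The cut-and-reassemble idea can be sketched with a toy greedy assembler that repeatedly joins the two pieces sharing the longest overlap. This is an illustrative simplification, not how production assemblers work: the 200-base "genome" is randomly generated, the cuts are evenly spaced rather than random so that full coverage is guaranteed, and all names are hypothetical.

```python
import random


def overlap_length(left, right, min_overlap):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(fragments, min_overlap=10):
    """Repeatedly merge the two pieces with the longest overlap."""
    pieces = list(fragments)
    while len(pieces) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(pieces):
            for j, right in enumerate(pieces):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_i is None:
            break  # nothing overlaps any more: keep the separate contigs
        merged = pieces[best_i] + pieces[best_j][best_size:]
        pieces = [p for k, p in enumerate(pieces) if k not in (best_i, best_j)]
        pieces.append(merged)
    return pieces


rng = random.Random(1)
genome = "".join(rng.choice("ACGT") for _ in range(200))  # made-up genome

# Cut 40-base fragments every 20 bases, so neighbours share 20 bases.
fragments = [genome[start:start + 40] for start in range(0, len(genome) - 20, 20)]
rng.shuffle(fragments)  # the assembler does not know the original order

contigs = greedy_assemble(fragments)
print(len(contigs), contigs[0] == genome)
```

Greedy merging is easy to follow but is easily misled by repeated regions, which is one reason real genomes need more careful methods.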

Next-generation shotgun sequencing offers an elegant solution by streamlining the production and management of these many fragments.

The details of the sequencing process itself vary by commercial technology, but sequencing by synthesis (SBS) is a common theme. In SBS, microfluidic chips are used that bind DNA fragments; once a DNA fragment is bound to a specific location on the chip, only identical segments of DNA can bind to that location, creating a cluster at the spot. The DNA cluster is then exposed to the ingredients needed for replication, but only one nucleotide is added at a time. For example, if the strand being built on a cluster has the sequence GACA and labelled nucleotides (G, A, T, and C) are flushed over the cluster, then in the first round only a G nucleotide with its label is able to bind to the DNA fragments in the cluster. All other nucleotides remain unbound and are washed away. In the second step, the label on the added nucleotide is read using, for example, the color of light emitted by the label (via fluorescence microscopy). The third step is the removal of the label from the added nucleotide (G in the example), which prepares the strand for the next synthesis step. The process is then repeated. In the second cycle, the A nucleotide with its label is the only one able to bind to the cluster; it is read, and its label is cleared in preparation for the third cycle. Cycles are run until the sequencing of the entire fragment is complete. The important differences between commercialized SBS methodologies lie primarily in the labelling and in the way those labels are read.
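The three-step cycle described above (add one labelled nucleotide, image it, remove the label) can be sketched as a loop over cycles that reads every cluster in parallel. This is a schematic model only: the cluster names and sequences are hypothetical, each cluster is represented by the strand to be built on it, and base pairing with the template is left out.

```python
def run_sbs(clusters, cycles):
    """Return one "image" per cycle, mapping cluster name to the label seen."""
    images = []
    for cycle in range(cycles):
        image = {}
        for name, strand in clusters.items():
            if cycle >= len(strand):
                continue  # this cluster has already been read to its end
            # Step 1: all four labelled nucleotides are offered, but only
            # the one matching the next position is incorporated.
            incorporated = next(n for n in "GATC" if n == strand[cycle])
            # Step 2: the label at this spot on the chip is imaged.
            image[name] = incorporated
            # Step 3: the label is removed, so nothing carries over to
            # the next cycle.
        images.append(image)
    return images


def reads_from_images(images):
    """Stack the per-cycle images into one read per cluster."""
    reads = {}
    for image in images:
        for name, label in image.items():
            reads[name] = reads.get(name, "") + label
    return reads


clusters = {"cluster_1": "GACA", "cluster_2": "TTGC"}  # made-up clusters
images = run_sbs(clusters, cycles=4)
print(images[0])                  # {'cluster_1': 'G', 'cluster_2': 'T'}
print(reads_from_images(images))  # {'cluster_1': 'GACA', 'cluster_2': 'TTGC'}
```

Each cycle yields one base for every cluster at once, which is where the throughput advantage over one-fragment-at-a-time sequencing comes from.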

The result of shotgun sequencing is a massive amount of data containing the sequences of all the small DNA fragments, which were cut at random. These data must then be processed to align the overlapping sequences that appear in more than one fragment and so rebuild the parent DNA molecule. The reconstruction is complicated by the realities of imperfect analytical measurements, errors that may occur in DNA synthesis, and large regions of DNA with repeating patterns. Typically, to generate a satisfactory reconstruction and reduce the rate of error, the same area of DNA will be sequenced up to 30 times in a eukaryotic genome or up to 100 times in a prokaryotic genome.
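Why repeated sequencing of the same area reduces errors can be shown with a simple majority vote across reads. The sketch below rests on several assumptions that are not from the text: the reads are already aligned and cover the whole region, the 5% per-base error rate is arbitrary, and the region itself is randomly generated.

```python
import random
from collections import Counter


def noisy_read(sequence, error_rate, rng):
    """Copy `sequence`, swapping each base for a wrong one with chance `error_rate`."""
    read = []
    for base in sequence:
        if rng.random() < error_rate:
            read.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            read.append(base)
    return "".join(read)


def consensus(reads):
    """Take the most common base at each position of equally long, aligned reads."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


def mismatches(a, b):
    """Count the positions at which two equally long sequences differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(42)
region = "".join(rng.choice("ACGT") for _ in range(500))  # made-up region
reads = [noisy_read(region, error_rate=0.05, rng=rng) for _ in range(30)]

print(mismatches(reads[0], region))          # errors in a single read
print(mismatches(consensus(reads), region))  # errors after a 30-fold majority vote
```

A single read carries its errors into the result, whereas an error in the consensus requires the same wrong base to win the vote at the same position, which becomes very unlikely as depth increases.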



DNA contains all of the coding for an organism’s biological function and is passed from one generation to the next. Because of this, DNA testing can be used to establish paternity. Additionally, genetic material inherited from only one parent, like mitochondrial DNA, which is passed from mother to offspring, can be used to trace maternal heritage many generations back, and comparable analyses are used in evolutionary biology to identify common ancestry across species. Genetic information has also revolutionized forensic science: a person’s DNA is effectively unique and is contained in nearly all of their cells, so we leave a trail of DNA throughout our daily activities. Therefore, if a person’s DNA is known and a matching sample of DNA is found at the scene of a crime, it is strong evidence that the person in question has been there at some point. This process is also known as DNA fingerprinting. Similar techniques can be used to track bacterial strains. These methods can significantly advance food safety and quality by detecting microbiological transmission events within a factory, and such knowledge can be used to avoid contamination of the final product by pathogens or spoilage organisms.

The raw data from the sequencing process are then passed through several bioinformatics analysis tools to produce either a draft genome, resulting from the assembly of this gigantic puzzle, or a variation map, obtained by comparing the millions of fragments to a known genome to find the variable regions. Both approaches can be used to infer the relationships between bacterial strains found in a given place, in ways similar to the DNA fingerprinting process. The draft genomes can also be used to find target genes or mutations of interest that can have important phenotypic effects, such as antibiotic or biocide resistance in bacterial genomes.
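A variation map and a simple measure of strain relatedness can be sketched as position-by-position comparisons. The example assumes the sequences are already aligned and of equal length; the 12-base sequences and the strain names are invented, and real pipelines work from millions of reads rather than finished strings.

```python
from itertools import combinations


def variant_positions(reference, sample):
    """List (position, reference_base, sample_base) wherever the two differ."""
    return [
        (i, r, s)
        for i, (r, s) in enumerate(zip(reference, sample))
        if r != s
    ]


def snp_distance(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


reference = "ATGACCGTTAGC"  # hypothetical known genome
strains = {
    "factory_line_A": "ATGACCGTTAGC",
    "factory_line_B": "ATGACCGTTAGT",
    "raw_material": "ATGTCCGATAGT",
}

for name, sequence in strains.items():
    print(name, variant_positions(reference, sequence))

# Pairwise distances: the smaller the number, the closer the strains.
for (name_a, seq_a), (name_b, seq_b) in combinations(strains.items(), 2):
    print(name_a, name_b, snp_distance(seq_a, seq_b))
```

In this made-up data set the two factory isolates differ at a single position, while the raw material isolate differs from each of them at two or three, which is the kind of pattern used to judge whether isolates share a recent common source.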



Microorganisms are found almost everywhere life exists. Many live in beneficial symbiotic relationships with larger organisms, while specific microbes are a source of infection and disease. Within the human body, an incredibly complex ecosystem of different microorganisms exists as a microbiome, and the state of that microbiome has been shown to affect human health. In other organisms and in larger ecological systems, microbiomes are also recognized as critical players. Understanding the identities of, and the balance between, the different species in a microbiome can yield valuable information about the activity of the microbial community. However, measuring a microbiome using traditional cell culture methods is problematic, as not all microbes grow well in standard cell culture conditions, and some do not grow at all, making it difficult or impossible to build a comprehensive view of the entire microbial community present.

Metagenomics is the sequencing of the DNA of a whole community of microorganisms at once, as opposed to sequencing individual microbes one at a time. Using methods common to whole genome sequencing, a mix of DNA from different microbes can be sequenced, and this information can be used to reconstruct the species profile of the entire microbiome population. In addition to the species profile, the genetic sequence of individual members can also be recovered and compared with members of the same species from other microbiomes to understand how those microbes have evolved or adapted to their specific environments.
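One simple way to turn a mix of reads into a species profile is to assign each read to the reference genome with which it shares the most short subsequences (k-mers), then count the assignments. The sketch below is a toy version of that idea under stated assumptions: the two reference sequences, the reads, the species names, and the choice of k = 5 are all invented for illustration.

```python
from collections import Counter


def kmers(sequence, k):
    """Return the set of all length-k substrings of `sequence`."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def build_index(references, k):
    """Map each k-mer to the names of the references that contain it."""
    index = {}
    for name, genome in references.items():
        for kmer in kmers(genome, k):
            index.setdefault(kmer, set()).add(name)
    return index


def classify(read, index, k):
    """Assign a read to the reference sharing the most k-mers with it."""
    votes = Counter()
    for kmer in kmers(read, k):
        for name in index.get(kmer, ()):
            votes[name] += 1
    if not votes:
        return "unclassified"
    return votes.most_common(1)[0][0]


def species_profile(reads, references, k=5):
    """Return the fraction of reads assigned to each reference."""
    index = build_index(references, k)
    counts = Counter(classify(read, index, k) for read in reads)
    return {name: count / len(reads) for name, count in counts.items()}


references = {  # hypothetical reference genomes
    "species_A": "ATGGCGTACGTTAGCCTAGGATCC",
    "species_B": "TTGACCAGTTTCAGGGAACTCTGA",
}
reads = [
    "GCGTACGTTAG",  # taken from species_A
    "CCTAGGATCC",   # taken from species_A
    "ACGTTAGCCT",   # taken from species_A
    "CAGTTTCAGGG",  # taken from species_B
    "AAAAAAAAAA",   # matches neither reference
]
print(species_profile(reads, references))
```

Reads that match no reference are reported as unclassified rather than dropped, mirroring the point that a community can contain members with no known counterpart.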

Metagenomics has helped to elucidate the relationship between microbiomes and larger systems, such as the soil of a farm, and the role of these microbiomes in nutrient cycling, disease suppression, and nitrogen fixation. In humans, a relationship between the gut microbiome and health is well established, although the underlying mechanisms remain under investigation.