Tuesday, 5 August 2014

The Digital Era of Biological Discovery

Originally published in The Australia Times Science Magazine, Vol.2 No.8 (August) 2014

The world of big data has re-focused research to a systems-level approach, promising extraordinary capacity for biological discovery.

Information Biology

Technology has undoubtedly transformed the way we interact with the world.

The ability to digitally organise, filter, and efficiently use the immense data we generate affects many aspects of our lives, from marketing to social policy. Similarly, the realm of biological research has been revolutionised by technological innovations of the digital age.

Major advances in DNA sequencing technologies, such as greater speed and affordability, have ensured that whole or partial genome analysis has become a common feature of many biological studies. Combined with improved methods for measuring other important components of cell function in large volumes, these advances give scientists access to more biological information than ever before.

With the ability to gather such unprecedented amounts of data, the re-focusing of research on systems-level investigations promises a remarkable increase in our capacity for biological discovery. While this is new territory for most biologists, the use of computational methods, tools, databases, and online resources is helping to translate this plethora of information into real scientific knowledge.

Incorporates image by Matt Trostle/Flickr  

The emergence of high throughput sequencing

From pioneering origins in the 1970s to recent technological innovations, DNA sequencing has considerably increased our awareness of the connection between genes and cell function.

Sequencing determines the order of nucleotides (T, G, C, and A) within a stretch of DNA. This approach has contributed to many scientific fields, from studies of microbes and viruses to conservation ecology and evolutionary biology. Equally, the usefulness of sequencing data for a molecular understanding of human health and disease is increasingly being appreciated.

The recent development of Next Generation Sequencing (NGS) technologies has dramatically enhanced the capacity for large-scale investigations.

These technologies are described as ‘high-throughput’ because they permit an enormous number of measurements to be taken in a short period of time. New sequencing platforms extend the methods used by earlier technologies across millions of parallel reactions. This allows researchers to take genome-wide measurements at once, as opposed to analysing a single gene or handful of DNA fragments.

The speed at which high throughput sequencing has developed is extraordinary. The latest NGS instruments are capable of astounding output, producing terabytes of data per sequencing run. A single laboratory can now sequence an entire human genome in a matter of days at a cost of a few thousand dollars. By comparison, the first human genome, completed in 2003, was sequenced using traditional methods by numerous collaborating laboratories over a period of 13 years, with a price tag of US$1 billion.

Radically reduced cost places NGS well within the means of the commercial biotechnology industry as well as many academic institutions. And the potential for its use in personalised healthcare in the clinic is rapidly increasing for the same reason.

Big data

In addition to DNA nucleotide sequencing, NGS can generate vast amounts of information on several distinct functions of cell biology. Borrowing from the term genome, which defines the entirety of an organism’s DNA, scientists use the suffix -ome (or -omics) to describe other large-scale biological systems. 

High throughput sequencing operates at this level and is applicable to the collective measurement of all of the RNA transcribed from genes (transcriptome), or genome-wide patterns of epigenetic modifications (epigenome).

But the capacity to gather massive amounts of biological information at such unprecedented rates presents researchers with a new set of challenges. Most importantly, how do they manage and interpret these enormous volumes of data for real biological discovery?

Making sense of large data sets

The answer is computer processing. Managing digital sequencing information to identify correlations and make useful predictions is key to harnessing the power of large data sets. But these are very different skills to those possessed by the classically trained biologist.

Enter the bioinformatician.

With one pylon in biological science and another firmly planted in information technology, bioinformaticians bridge these distinct, yet increasingly related disciplines. The field of bioinformatics draws from statistical expertise to develop new algorithms and data mining tools to complement the growing demand for high throughput data analysis.

For example, numerous computational methods have been specifically developed for analysis and visualisation of large-scale profiling of the transcriptome. Because RNA is the intermediary copy of the information encoded in a gene for protein expression or other processes, transcriptome sequencing provides the strongest indicator of gene activity across entire genomes.

Further information can be extracted from these profiles by combining them with known biological relationships. Initiatives such as The Gene Ontology project aim to organise information about genes and gene products in databases that can be easily used by researchers. This enables more accurate interpretation of experimental data in the context of shared scientific knowledge.

This type of analysis can reveal previously unknown interactions or biologically relevant trends in large data sets. Detection of similar changes across functionally related groups of genes or gene networks is particularly informative.

Gene network diagram assembled from transcriptome profiling data of human vascular cells using known biological relationships.  Blue circles indicate reduced gene expression and red circles indicate increased gene expression detected in the original experiment. Arrows indicate known relationships to the activator protein 1 (AP-1) transcription factor. 
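To make the idea of detecting changes across functionally related groups of genes more concrete, here is a minimal sketch of an over-representation test in Python. It asks how likely it would be, by chance alone, to find at least this many changed genes in one functional category. The gene identifiers, the category, and the function names are all hypothetical, invented for illustration. This is a simple hypergeometric calculation and is not the method of any particular tool or database mentioned in this article.

```python
from math import comb


def tail_probability(population, annotated, selected, overlap):
    """Chance of finding at least `overlap` category genes when `selected`
    genes are drawn at random from `population` genes, of which `annotated`
    belong to the category (hypergeometric upper tail)."""
    upper = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def over_representation(background, category, changed):
    """Count changed genes in a category and estimate how surprising that is."""
    background = set(background)
    category = set(category) & background   # ignore genes that were not measured
    changed = set(changed) & background
    overlap = len(category & changed)
    p_value = tail_probability(
        len(background), len(category), len(changed), overlap
    )
    return overlap, p_value


# Hypothetical data: 20 measured genes, 5 of them in one functional category,
# and 5 genes whose activity changed in an imagined experiment.
background = [f"gene{i}" for i in range(1, 21)]
category = {"gene1", "gene2", "gene3", "gene4", "gene5"}
changed = {"gene1", "gene2", "gene3", "gene4", "gene12"}

overlap, p_value = over_representation(background, category, changed)
print(f"{overlap} of {len(changed)} changed genes fall in the category "
      f"(p = {p_value:.4g})")
```

A small probability suggests the category is changing as a group rather than by coincidence. Real analyses test thousands of categories at once and must correct for that, which this sketch leaves out.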

Online resources

Bioinformatics has positioned itself as a key player in modern biology. To keep up with the accelerating accumulation of big data, new computational tools and databases are continually being developed, and many are freely accessible online.

A prominent example is the Encyclopedia of DNA Elements (ENCODE) consortium, which provides downloadable information on functional parts of the human genome, as well as free tools for bioinformatic analysis. Similarly, The Database for Annotation, Visualization and Integrated Discovery (DAVID) allows researchers to integrate experimental data with intracellular pathway maps.

At the same time, increased use of NGS technologies has propelled rapid accumulation of large repositories of data available for public use. Drawing on these data sets allows researchers to intersect their own experimental data with information from similar or related studies, greatly enhancing their power to discover new correlations and insights.

Systems biology

The genome, epigenome, transcriptome, and proteome (the collection of proteins in a cell) are intimately linked by complex interactions.

DNA plans that are organised into genes are copied to RNA transcripts. A proportion of these transcripts code for proteins, some of which can in turn act on the DNA to epigenetically regulate gene expression.  Other proteins collaborate to perform additional cell functions. On the other hand, the RNA transcripts that don’t code for proteins can directly affect gene activity and protein function.

Complex interactions of biological systems

Omics studies aim to translate the data sets generated from these immense, interacting networks into a description of the structural and functional dynamics of cell biology.

This shift to systems-level analysis signifies a fundamental change in the way cell biology is investigated and contrasts with the traditional, more limited focus on small groups of genes or gene products.

The ability to integrate multiple omics data sets to holistically consider all (or a subset of) DNA sequences, epigenetic modifications, gene expression, and proteins in relation to each other further demonstrates the power of computational biology.

For instance, intersection of transcriptome data and epigenome data from the same experimental conditions has been used to functionally link specific epigenetic changes to the activity of particular genes or genomic regions. And further integration of proteome data can reveal the biological effects of these changes at the protein level, the ultimate determinant of phenotype (observable characteristics or traits).
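As a rough illustration of what intersecting two omics data sets can look like in practice, the Python sketch below lines up per-gene changes in expression with per-gene changes in an epigenetic mark and keeps only the genes where both have shifted. The gene names, the numbers, the cut-off values, and the function name are assumptions made up for this example and do not come from any real experiment or published pipeline.

```python
def intersect_changes(expression_change, mark_change,
                      expression_cutoff=1.0, mark_cutoff=1.0):
    """Return genes measured in both data sets whose expression and
    epigenetic mark both changed by at least the given cut-offs, labelled
    by whether the two changes point in the same direction."""
    shared_genes = expression_change.keys() & mark_change.keys()
    linked = {}
    for gene in sorted(shared_genes):
        expression = expression_change[gene]
        mark = mark_change[gene]
        if abs(expression) >= expression_cutoff and abs(mark) >= mark_cutoff:
            same = (expression > 0) == (mark > 0)
            linked[gene] = "same direction" if same else "opposite direction"
    return linked


# Hypothetical log2 fold changes from the same experimental conditions.
expression_change = {"geneA": 2.1, "geneB": -1.8, "geneC": 0.2, "geneD": 1.5}
mark_change = {"geneA": 1.4, "geneB": 1.2, "geneC": 2.0, "geneE": -1.1}

linked = intersect_changes(expression_change, mark_change)
for gene, relationship in linked.items():
    print(gene, relationship)
```

Here geneC drops out because its expression barely moved, while geneD and geneE drop out because each was measured in only one of the two data sets. Whether a mark and expression moving together is biologically meaningful depends on the mark in question, so the labels are descriptive only.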

The future of iBiology

Increased access to high throughput technologies allows small laboratories to generate big data.  However, the accumulation of vast amounts of biological information places increasing emphasis on data handling strategies.

Computational biology is at the forefront of this interface. Drawing from expanding repositories of validated and inferred scientific knowledge, bioinformaticians are continually developing and improving analytical methods to distil the most useful information.

The future importance and applications of information biology are vast. More and more fields of research will benefit as the ability to interpret and integrate big data improves.

There is particularly strong potential for this interdisciplinary field to contribute to personalised medicine of the future.

Imagine a world where the precise cellular effects of particular drug treatments could be determined for individual patients to optimise treatment and reduce unwanted off-target effects.


Sanger F, Coulson AR (1975). A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. Journal of Molecular Biology 94(3):441-8. PMID: 1100841

Metzker ML (2010). Sequencing technologies – the next generation. Nature Reviews Genetics 11(1):31-46. PMID: 19997069

Huang DW, Sherman BT, Lempicki RA (2009). Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. Nature Protocols 4(1):44-57. PMID: 19131956

Keating ST, Ziemann M, Okabe J, Kahn AW, Balcerczyk A, El-Osta A (2014). Deep sequencing reveals novel Set7 networks. Cellular and Molecular Life Sciences [Epub ahead of print]. PMID: 24875254

Gullapalli RR, Desai KV, Santana-Santos L, Kant JA, Becich MJ (2012). Next generation sequencing in clinical medicine: Challenges and lessons for pathology and biomedical informatics. Journal of Pathology Informatics 3:40. PMID: 23248761

Kamalakaran S, Varadan V, Janevski A, Banerjee N, Tuck D, McCombie WR, Dimitrova N, Harris LN (2013). Translating next generation sequencing to practice: opportunities and necessary steps. Molecular Oncology 7(4):743-55. PMID: 23769412


Image 1 incorporates an image from Matt Trostle, “computer mouse isolated on white background”, Flickr, copyright 2011 under an attribution licence, http://www.flickr.com/photos/trostle/6848810640

Image 2: XPRIZE Foundation, “AGXP_20120723_Ion Torrent Bus Tour_LA_9224”, Flickr, copyright 2012 under an attribution licence, http://www.flickr.com/photos/50507112@N05/7633280450

Other images by S.Keating