The world of big data has re-focused research on a systems-level approach, promising an extraordinary capacity for biological discovery.
Information Biology
Technology has undoubtedly transformed the
way we interact with the world.
The ability to digitally organise, filter,
and efficiently use the immense data we generate affects many aspects of our
lives, from marketing to social policy. Similarly, the realm of biological
research has been revolutionised by technological innovations of the digital
age.
Major advances in DNA sequencing
technologies, such as greater speed and affordability, have ensured that whole
or partial genome analysis has become a common feature of many biological
studies. Combined with improved methods for measuring other important
components of cell function in large volumes, these advances mean that never
before have scientists had access to so much biological information.
With the ability to gather such unprecedented
amounts of data, the re-focusing of research on systems-level investigations
promises a remarkable increase in our capacity for biological discovery. While
this is new territory for most biologists, computational methods, tools,
databases, and online resources are helping to translate this plethora of
information into real scientific knowledge.
Incorporates image by Matt Trostle/Flickr
The emergence of high throughput sequencing
From pioneering origins in the 1970s to
recent technological innovations, DNA sequencing has steadily deepened our
understanding of the connection between genes and cell function.
Sequencing determines the order of
nucleotides (T, G, C, and A) within a stretch of DNA. This approach has
contributed to many scientific fields, from studies of microbes and
viruses to conservation ecology and evolutionary biology. Equally, the
usefulness of sequencing data for a molecular understanding of human health and
disease is increasingly being appreciated.
The recent development of Next Generation
Sequencing (NGS) technologies has dramatically enhanced the capacity for
large-scale investigations.
These technologies are described as
‘high-throughput’ because they permit an enormous number of measurements to be
taken in a short period of time. New sequencing platforms extend the methods used
by earlier technologies across millions of parallel reactions. This allows
researchers to take genome-wide measurements at once, as opposed to analysing a
single gene or a handful of DNA fragments.
The speed at which high throughput
sequencing has developed is extraordinary. The latest NGS instruments are
capable of astounding output, producing terabytes of data per sequencing run. A
single laboratory can now sequence an entire human genome in a matter of days at
the cost of a few thousand dollars. By comparison, the first human genome,
completed in 2003, was sequenced using traditional methods by numerous
collaborating laboratories over a period of 13 years, with a price tag of US$1 billion.
Radically reduced cost places NGS well
within the means of the commercial biotechnology industry as well as many
academic institutions. And, for the same reason, the potential for its use in
personalised healthcare in the clinic is rapidly increasing.
Big data
In addition to DNA nucleotide sequencing,
NGS can generate vast amounts of information on several distinct functions of
cell biology. Borrowing from the term genome,
which defines the entirety of an organism’s DNA, scientists use the suffix -ome (or -omics) to describe other large-scale
biological systems.
High throughput sequencing operates at this
level: it can collectively measure all of the RNA transcribed from genes (the
transcriptome) or genome-wide patterns of epigenetic modifications (the
epigenome).
But the capacity to gather massive amounts
of biological information at such unprecedented rates presents researchers with
a new set of challenges. Most importantly, how do they manage and interpret
these enormous volumes of data for real biological discovery?
Making sense of large data sets
The answer is computer processing. Managing
digital sequencing information to identify correlations and make useful
predictions is key to harnessing the power of large data sets. But these are
very different skills from those possessed by the classically trained biologist.
Enter the bioinformatician.
With one pylon in biological science and another
firmly planted in information technology, bioinformaticians bridge these
distinct, yet increasingly related disciplines. The field of bioinformatics draws
on statistical expertise to develop new algorithms and data mining tools to
meet the growing demand for high throughput data analysis.
For example, numerous
computational methods have been specifically developed for analysis and
visualisation of large-scale profiling of the transcriptome. Because RNA is the
intermediary copy of the information encoded in a gene for protein expression
or other processes, transcriptome sequencing provides the strongest indicator
of gene activity across entire genomes.
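To give a flavour of what this looks like in practice, here is a minimal sketch (in Python, with invented file and sample names) that computes log2 fold-changes in gene expression between two conditions from a hypothetical table of sequencing read counts. Real studies use dedicated statistical packages; this only illustrates the underlying idea.

    import numpy as np
    import pandas as pd

    # Hypothetical gene-level read counts (rows: genes, columns: samples).
    # The file name and sample labels are invented for illustration.
    counts = pd.read_csv("counts.tsv", sep="\t", index_col="gene")

    # Scale each sample to counts per million (CPM) to correct for
    # differences in sequencing depth between libraries.
    cpm = counts / counts.sum(axis=0) * 1e6

    # Average replicates per condition; a pseudocount of 1 avoids
    # division by zero for unexpressed genes.
    control = cpm[["ctrl_1", "ctrl_2"]].mean(axis=1) + 1
    treated = cpm[["treat_1", "treat_2"]].mean(axis=1) + 1

    # log2 fold-change: positive values indicate higher expression
    # in the treated condition.
    log2fc = np.log2(treated / control)
    print(log2fc.sort_values(ascending=False).head(10))  # most up-regulated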
Further information can be extracted from
these profiles by combining them with known biological relationships. Initiatives
such as The Gene Ontology project aim to organise information about genes and
gene products in databases that can be easily used by researchers. This enables
more accurate interpretation of experimental data in the context of shared
scientific knowledge.
This type of analysis can reveal previously
unknown interactions or biologically relevant trends in large data sets. Detection
of similar changes across functionally related groups of genes or gene networks
is particularly informative.
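The statistic behind many of these enrichment analyses is simple enough to sketch. The hypothetical example below asks whether genes carrying a particular Gene Ontology annotation appear among the differentially expressed genes more often than chance would predict, using a hypergeometric test; the gene counts are invented, and real tools also correct for testing many categories at once.

    from scipy.stats import hypergeom

    # Invented example numbers for illustration only.
    genome_size = 20000   # total genes considered
    annotated = 150       # genes carrying the GO term of interest
    deg_count = 400       # differentially expressed genes in the experiment
    overlap = 12          # of those, how many carry the GO term

    # Probability of seeing at least `overlap` annotated genes by chance;
    # a small p-value suggests the GO term is enriched in the gene list.
    p = hypergeom.sf(overlap - 1, genome_size, annotated, deg_count)
    print(f"Enrichment p-value: {p:.2e}")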
Gene network diagram assembled from transcriptome profiling data of human vascular cells using known biological relationships. Blue circles indicate reduced gene expression and red circles indicate increased gene expression detected in the original experiment. Arrows indicate known relationships to the activator protein 1 (AP-1) transcription factor.
Online resources
Bioinformatics has positioned itself as a key
player in modern biology. To keep up with the accelerating accumulation of big
data, new computational tools and databases are continually being developed,
and many are freely accessible online.
A prominent example is the Encyclopedia of DNA Elements (ENCODE) consortium, which
provides downloadable information on functional parts of the human genome, as
well as free tools for bioinformatic analysis. Similarly, the Database for Annotation, Visualization and Integrated Discovery (DAVID) allows researchers to integrate experimental data
with intracellular pathway maps.
At the same time, increased use of NGS
technologies has propelled the rapid accumulation of large repositories of data
available for public use. Leveraging these data sets allows researchers to intersect
their own experimental data with information from similar or related studies, tremendously
enhancing the power to discover new correlations and insights.
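As a toy example of this kind of reuse, the sketch below intersects a laboratory's own list of differentially expressed genes with a gene list from a published study; both file names stand in for data that would, in practice, be downloaded from a public repository.

    # Hypothetical file names; the second list would typically come from
    # a public repository of published sequencing results.
    def read_gene_list(path):
        with open(path) as handle:
            return {line.strip().upper() for line in handle if line.strip()}

    our_genes = read_gene_list("our_experiment_genes.txt")
    public_genes = read_gene_list("published_study_genes.txt")

    # Genes altered in both experiments are strong candidates for follow-up.
    shared = sorted(our_genes & public_genes)
    print(f"{len(shared)} genes in common, e.g.: {shared[:10]}")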
Systems biology
The genome, epigenome, transcriptome, and
proteome (the collection of proteins in a cell) are intimately linked by
complex interactions.
DNA blueprints that are
organised into genes are copied to RNA transcripts. A proportion of these
transcripts code for proteins, some of which can in turn act on the DNA to
epigenetically regulate gene expression. Other proteins collaborate to perform additional
cell functions. Meanwhile, the RNA transcripts that don't code for
proteins can directly affect gene activity and protein function.
Complex interactions of biological systems
Omics studies aim to translate the data
sets generated from these immense, interacting networks into a collective
description of the structural and functional dynamics of cell biology.
This shift to systems-level analysis
signifies a fundamental change in the way cell biology is investigated and
contrasts with the traditional, more limited focus on small groups of genes or
gene products.
The ability to integrate multiple omics
data sets to holistically consider all (or a subset of) DNA sequences,
epigenetic modifications, gene expression, and proteins in relation to each
other further demonstrates the power of computational biology.
For instance, intersection of transcriptome
data and epigenome data from the same experimental conditions has been used to
functionally link specific epigenetic changes to the activity of particular
genes or genomic regions. And further integration of proteome data can reveal
the biological effects of these changes at the protein level, the ultimate
determinant of phenotype (observable characteristics or traits).
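A toy version of such an intersection might look like the sketch below, which assumes each epigenetic change has already been assigned to its nearest gene; the file names, column labels, and thresholds are all invented for illustration.

    import pandas as pd

    # Hypothetical inputs: epigenetic changes already assigned to genes,
    # and expression log2 fold-changes from the same experiment.
    epigenome = pd.read_csv("marks_by_gene.tsv", sep="\t")    # gene, mark_change
    expression = pd.read_csv("expression_fc.tsv", sep="\t")   # gene, log2fc

    # Join the two omics layers on the gene identifier.
    combined = epigenome.merge(expression, on="gene")

    # Keep genes where a gain of the epigenetic mark coincides with
    # increased expression (thresholds are arbitrary for this sketch).
    linked = combined[(combined["mark_change"] > 1) & (combined["log2fc"] > 1)]
    print(linked.sort_values("log2fc", ascending=False).head())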
The future of iBiology
Increased access to high throughput
technologies allows small laboratories to generate big data. However, the accumulation of vast amounts of
biological information places increasing emphasis on data handling strategies.
Computational biology sits at the centre of
this interface. Drawing from expanding repositories of validated and
inferred scientific knowledge, bioinformaticians are continually developing and
improving analytical methods to distil the most useful information.
The potential applications of
information biology are vast. More and more fields of research will benefit
as the ability to interpret and integrate big data improves.
There is particularly strong potential for
this interdisciplinary field to contribute to personalised medicine of the
future.
Imagine a world where the precise
cellular effects of particular drug treatments can be determined for
individual patients, optimising treatment and reducing unwanted off-target
effects.
References
Sanger F, Coulson AR (1975). A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. Journal of Molecular Biology 94(3): 441-448. PMID: 1100841
Metzker ML (2010). Sequencing technologies – the next generation. Nature Reviews Genetics 11(1): 31-46. PMID: 19997069
Huang DW, Sherman BT, Lempicki RA (2009). Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. Nature Protocols 4(1): 44-57. PMID: 19131956
Keating ST, Ziemann M, Okabe J, Khan AW, Balcerczyk A, El-Osta A (2014). Deep sequencing reveals novel Set7 networks. Cellular and Molecular Life Sciences [Epub ahead of print]. PMID: 24875254
Gullapalli RR, Desai KV, Santana-Santos L, Kant JA, Becich MJ (2012). Next generation sequencing in clinical medicine: Challenges and lessons for pathology and biomedical informatics. Journal of Pathology Informatics 3: 40. PMID: 23248761
Kamalakaran S, Varadan V, Janevski A, Banerjee N, Tuck D, McCombie WR, Dimitrova N, Harris LN (2013). Translating next generation sequencing to practice: opportunities and necessary steps. Molecular Oncology 7(4): 743-755. PMID: 23769412
Images
Image 1 incorporates an image from Matt Trostle, “computer mouse isolated on white background”, Flickr, copyright 2011 under an attribution licence, http://www.flickr.com/photos/trostle/6848810640
Image 2: XPRIZE Foundation, “AGXP_20120723_Ion Torrent Bus Tour_LA_9224”, Flickr, copyright 2012 under an attribution licence, http://www.flickr.com/photos/50507112@N05/7633280450
Other images by S. Keating