IEEE Spectrum July, 2013 - 32

The roughly 2000 sequencing instruments in labs and hospitals around the world can collectively sequence 15 quadrillion nucleotides per year, which equals about 15 petabytes of compressed genetic data. A petabyte is 2^50 bytes, or in round numbers, 1000 terabytes. To put this into perspective, if you were to write this data onto standard DVDs, the resulting stack would be more than 2 miles tall. And with sequencing capacity increasing at a rate of around three- to fivefold per year, next year the stack would be around 6 to 10 miles tall. At this rate, within the next five years the stack of DVDs could reach higher than the orbit of the International Space Station.

Clearly, we're dealing with a data deluge in genomics. This data is vital for the advancement of biology and medicine, but storing, analyzing, and sharing such vast quantities is an immense challenge.

Still, it's not an unprecedented one: Other fields, notably high-energy physics and astronomy, have already encountered this problem. For example, the four main detectors at the Large Hadron Collider produced around 13 petabytes of data in 2010, and when the Large Synoptic Survey Telescope comes on line in 2016, it's anticipated to produce around 10 petabytes per year.

The crucial difference is that these physics and astronomy data deluges pour forth from just a few major instruments. The DNA data deluge comes from thousands, and soon tens of thousands, of sources. After all, almost any life-science laboratory can now afford to own and operate a sequencer. Major centers like the Broad Institute, in Cambridge, Mass., or BGI, in Shenzhen, China, have more than 100 high-capacity instruments on site, but smaller institutions like the Malaysia Genome Institute or the International Livestock Research Institute, in Kenya, also have their own instruments. In all these facilities, researchers are struggling to analyze the sequencing data for a wide variety of applications, such as investigations into human health and disease, plant and animal breeding, and monitoring microbial ecology and pathogen outbreaks.

The only hope for these overwhelmed researchers lies in advanced computing technologies. Genomics researchers are investigating a range of options, including very powerful but conventional servers, specialized hardware, and cloud computing. Each has strengths and weaknesses depending on the specific application and analysis. But for many, cloud computing is increasingly the best option, because it allows the close integration of powerful computational resources with extremely high-volume data storage.

One promising solution comes from Google, a company with plenty of experience searching vast troves of data. Google doesn't regularly release information on how much data it processes, but in May 2010 it reported searching 946 petabytes per month. Today, three years later, it's safe to assume that figure is at least an order of magnitude larger.

To mine the Internet, Google developed a parallel computing framework called MapReduce. Outside of Google, an open-source alternative to MapReduce called Apache Hadoop is emerging as a standard platform for analyzing huge data sets in genomics and other fields. Hadoop's two main advantages are its programming model, which harnesses the power of many computers in tandem, and its smart integration of storage and computational power.

While Hadoop and MapReduce are simple by design, their ability to coordinate the activity of many computers makes them powerful. Essentially, they divide a large computational task into small pieces that are distributed to many computers across the network. Those computers perform their jobs (the "map" step), and then communicate with each other to aggregate the results (the "reduce" step). This process can be repeated many times over, and the repetition of computation and aggregation steps quickly produces results. This framework is much more powerful than basic "queue system" software packages like the widely used HTCondor and Grid Engine. These systems also divide up large tasks among many computers but make no provision for the computers to exchange information.

Hadoop has another advantage: It uses the computer cluster's computational nodes for data storage as well. This means that Hadoop can often execute programs on the nodes themselves, thus

Like the index of a book, a genome index is organized by key terms (in this case, short strings of nucleotides). It lists all the places in the larger text (the genome) where those key terms appear.

gatcacagaaattccagcatatgacatccacg
cgctagccggtatatgaaatgagaggatcatc
acactatgtgatgacatactagaccggtgatg
gggatatcaggaattccagcatatgacatcca
cgcgctagccggtatatgaaggatgagaggga
gccaccactatgtgatgacatactagaccggt
acgatggattacaggaattccagcatatgaca
gaggccacgcgctagcgcgtatatgaaatgag
agagggacaccactatgtgatgacatactaga
ccccggtgatggattacaggaatcccagcata
tgacatacacgcgctagccgagtatatgagag
acatgagagggacaccactatgtgatgacata
ccctagaccggtgatggattacaggaattccc
gcatatgacacccacgcgctagcacgtataag
cattgaaatgagagaggaatccactatgtgat
gacatactagaccgtttgtgatggattacagg
aattccagcatatgacatccacatcctagctc
caggtatatgaaatgagagggacaccactatg

INDEX
aaa  Offsets: 9, 49, 257, 467, 571
atc  Offsets: 2, 60, 104, 127, 319, 480, 551
cgg  Offsets: 40, 124, 141, 194, 300, 404
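
The sidebar's idea of an index keyed by short strings of nucleotides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_index`, the key length of 3, and the choice to count offsets from 1 are ours, and the sample text is just the first 32 letters of the sidebar's sequence, so the lists it produces are not expected to reproduce the sidebar's offsets in full.

```python
from collections import defaultdict


def build_index(genome, k):
    """Map every k-letter string to the 1-based offsets where it occurs.

    build_index is a hypothetical helper name, not taken from any tool.
    """
    index = defaultdict(list)
    for start in range(len(genome) - k + 1):
        key = genome[start:start + k]
        index[key].append(start + 1)  # assumption: offsets are counted from 1
    return dict(index)


# First 32 nucleotides of the sidebar's example sequence.
genome = "gatcacagaaattccagcatatgacatccacg"
index = build_index(genome, 3)

for key in ("aaa", "atc", "cag"):
    print(key, "Offsets:", index[key])
```

Looking up a key is then a single dictionary access, which is what makes an index so much faster than rescanning the whole text for every read.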

an index at the back of a book, a genome index is a list of all the
places in the genome where a certain string of letters appears:
for example, the roughly 697 000 occurrences of the sequence
"GATTACA" in the human genome.
One powerful recent invention is a genome index based on the
Burrows-Wheeler transform, an algorithm originally developed
for text compression. This efficient index allows us to align many
thousands of 100-nucleotide reads per second. The algorithm works
by carefully changing the order of a sequence of letters into one
that's more compressible, and doing so in a way that's reversible.
So, for example, let's say you have 21 As in your jumbled string of
As, Ts, Gs, and Cs. That part of the string could then be compressed
into A21, thus using 3 characters instead of 21, a sevenfold savings.
By compiling a genome index of sequences reordered in this way,
the search algorithm can scroll through the entire genome much
more quickly, looking for a read's best match.
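
To make the reorder-then-compress idea concrete, here is a minimal Python sketch of the Burrows-Wheeler transform, its inverse, and a simple run-length encoder. It uses the textbook approach of sorting all rotations of the text, which is far too slow for a real genome; the `$` end marker and the function names are our assumptions, and the sketch shows only the transform and its reversibility, not the index search an aligner would build on top of it.

```python
from itertools import groupby

SENTINEL = "$"  # assumed end marker; it sorts before A, C, G, and T


def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (naive, for illustration)."""
    text += SENTINEL
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed):
    """Undo bwt() by repeatedly prepending the transformed column and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(SENTINEL))
    return original[:-1]


def run_length_encode(text):
    """Write each run of identical letters as the letter followed by its length."""
    return "".join(f"{ch}{len(list(run))}" for ch, run in groupby(text))


read = "GATTACAGATTACA"
transformed = bwt(read)
assert inverse_bwt(transformed) == read  # the reordering is reversible

print(transformed)
print(run_length_encode("A" * 21))  # a run of 21 As is written as A21
```

The transform tends to group identical letters into runs, which is what gives the run-length step something to compress.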
Once we have the best algorithms and data structures, we arrive
at the next massive challenge: scaling up, and getting many computers to divvy up the work of parsing a genome.
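
One way to picture dividing such work among many computers is the map-and-reduce pattern this article describes for MapReduce and Hadoop. The sketch below runs on a single machine in plain Python and does not use the Hadoop API; the task (counting 3-letter strings across reads), the function names, and the sample reads are all hypothetical choices made for illustration.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(reads, n_chunks):
    """Divide the reads into n_chunks roughly equal pieces."""
    return [reads[i::n_chunks] for i in range(n_chunks)]


def map_count_kmers(chunk, k):
    """The "map" step: count k-letter strings within one chunk of reads."""
    counts = Counter()
    for read in chunk:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


def reduce_counts(left, right):
    """The "reduce" step: merge two partial tallies into one."""
    return left + right


def count_kmers(reads, k, n_chunks=3):
    chunks = split_into_chunks(reads, n_chunks)
    # On a cluster, each chunk would be handled by a different computer.
    partial_counts = [map_count_kmers(chunk, k) for chunk in chunks]
    return reduce(reduce_counts, partial_counts, Counter())


reads = ["gattaca", "attacag", "tacagat"]  # made-up sample reads
totals = count_kmers(reads, 3)
print(totals.most_common(3))
```

Because the partial tallies can be merged in any order, the result does not depend on how the reads were split, which is what lets the map step run on many machines at once.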

GENOME INDEXING
