The Sorcerer II Global Ocean Sampling
Expedition: Northwest Atlantic through
Eastern Tropical Pacific
Douglas B. Rusch

1*
, Aaron L. Halpern

1
, Granger Sutton

1
, Karla B. Heidelberg

1,2
, Shannon Williamson

1
, Shibu Yooseph

1
,

Dongying Wu
1,3

, Jonathan A. Eisen
1,3

, Jeff M. Hoffman
1

, Karin Remington
1,4

, Karen Beeson
1

, Bao Tran
1

,

Hamilton Smith
1

, Holly Baden-Tillson
1

, Clare Stewart
1

, Joyce Thorpe
1

, Jason Freeman
1

, Cynthia Andrews-Pfannkoch
1

,

Joseph E. Venter
1

, Kelvin Li
1

, Saul Kravitz
1

, John F. Heidelberg
1,2

, Terry Utterback
1

, Yu-Hui Rogers
1

, Luisa I. Falcón
5

,

Valeria Souza
5

, Germán Bonilla-Rosso
5

, Luis E. Eguiarte
5

, David M. Karl
6

, Shubha Sathyendranath
7

, Trevor Platt
7

,

Eldredge Bermingham
8

, Victor Gallardo
9

, Giselle Tamayo-Castillo
10

, Michael R. Ferrari
11

, Robert L. Strausberg
1

,

Kenneth Nealson
1,12

, Robert Friedman
1

, Marvin Frazier
1

, J. Craig Venter
1

1 J. Craig Venter Institute, Rockville, Maryland, United States of America, 2 Department of Biological Sciences, University of Southern California, Avalon, California, United

States of America, 3 Genome Center, University of California Davis, Davis, California, United States of America, 4 Your Genome, Your World, Rockville, Maryland, United States

of America, 5 Departmento de Ecologı́a Evolutiva, Instituto de Ecologı́a, Universidad Nacional Autónoma de México, Mexico City, Mexico, 6 Department of Oceanography,

University of Hawaii, Honolulu, Hawaii, United States of America, 7 Bedford Institute of Oceanography, Dartmouth, Nova Scotia, Canada, 8 Smithsonian Tropical Research

Institute, Balboa, Ancon, Republic of Panama, 9 Departamento de Oceanografı́a, Universidad de Concepción, Concepción, Chile, 10 Escuela de Quı́mica, Universidad de Costa

Rica, San Pedro, Costa Rica, 11 Department of Environmental Sciences, Rutgers University, New Brunswick, New Jersey, United States of America, 12 Department of Earth

Sciences, University of Southern California, Los Angles, California, United States of America

The world’s oceans contain a complex mixture of micro-organisms that are for the most part, uncharacterized both
genetically and biochemically. We report here a metagenomic study of the marine planktonic microbiota in which
surface (mostly marine) water samples were analyzed as part of the Sorcerer II Global Ocean Sampling expedition.
These samples, collected across a several-thousand km transect from the North Atlantic through the Panama Canal and
ending in the South Pacific yielded an extensive dataset consisting of 7.7 million sequencing reads (6.3 billion bp).
Though a few major microbial clades dominate the planktonic marine niche, the dataset contains great diversity with
85% of the assembled sequence and 57% of the unassembled data being unique at a 98% sequence identity cutoff.
Using the metadata associated with each sample and sequencing library, we developed new comparative genomic and
assembly methods. One comparative genomic method, termed ‘‘fragment recruitment,’’ addressed questions of
genome structure, evolution, and taxonomic or phylogenetic diversity, as well as the biochemical diversity of genes
and gene families. A second method, termed ‘‘extreme assembly,’’ made possible the assembly and reconstruction of
large segments of abundant but clearly nonclonal organisms. Within all abundant populations analyzed, we found
extensive intra-ribotype diversity in several forms: (1) extensive sequence variation within orthologous regions
throughout a given genome; despite coverage of individual ribotypes approaching 500-fold, most individual
sequencing reads are unique; (2) numerous changes in gene content some with direct adaptive implications; and (3)
hypervariable genomic islands that are too variable to assemble. The intra-ribotype diversity is organized into
genetically isolated populations that have overlapping but independent distributions, implying distinct environmental
preference. We present novel methods for measuring the genomic similarity between metagenomic samples and show
how they may be grouped into several community types. Specific functional adaptations can be identified both within
individual ribotypes and across the entire community, including proteorhodopsin spectral tuning and the presence or
absence of the phosphate-binding gene PstS.

Citation: Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through eastern
tropical Pacific. PLoS Biol 5(3): e77. doi:10.1371/journal.pbio.0050077

Academic Editor: Nancy A. Moran, University of Arizona, United States of America

Received July 14, 2006; Accepted January 16, 2007; Published March 13, 2007

Copyright: � 2007 Rusch et al. This is an open-access article distributed under the
terms of the Creative Commons Attribution License, which permits unrestricted
use, distribution, and reproduction in any medium, provided the original author
and source are credited.

Abbreviations: CAMERA, Cyberinfrastructure for Advanced Marine Microbial
Ecology Research and Analysis; GOS, Global Ocean Sampling; NCBI, National
Center for Biotechnology Information

* To whom correspondence should be addressed. E-mail: DRusch@venterinstitute.
org

This article is part of Global Ocean Sampling collection in PLoS Biology. The full
collection is available online at http://collections.plos.org/plosbiology/gos-2007.
php.

PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770398

PLoS BIOLOGY


Introduction

The concept of microbial diversity is not well defined. It
can either refer to the genetic (taxonomic or phylogenetic)
diversity as commonly measured by molecular genetics
methods, or to the biochemical (physiological) diversity
measured in the laboratory with pure or mixed cultures.
However, we know surprisingly little about either the genetic
or biochemical diversity of the microbial world [1], in part
because so few microbes have been grown under laboratory
conditions [2,3], and also because it is likely that there are
immense numbers of low abundance ribotypes that have not
been detected using molecular methods [4]. Our under-
standing of microbial physiological and biochemical diversity
has come from studying the less than 1% of organisms that
can be maintained in enrichments or cultivated, while our
understanding of phylogenetic diversity has come from the
application of molecular techniques that are limited in terms
of identifying low-abundance members of the communities.

Historically, there was little distinction between genetic
and biochemical diversity because our understanding of
genetic diversity was based on the study of cultivated
microbes. Biochemical diversity, along with a few morpho-
logical features, was used to establish genetic diversity via an
approach called numerical taxonomy [5,6]. In recent years the
situation has dramatically changed. The determination of
genetic diversity has relied almost entirely on the use of gene
amplification via PCR to conduct taxonomic environmental
gene surveys. This approach requires the presence of slowly
evolving, highly conserved genes that are found in otherwise
very diverse organisms. For example, the gene encoding the
small ribosomal subunit RNA, known as 16S, based on
sedimentation coefficient, is most often used for distinguish-
ing bacterial and archaeal species [7–10]. The 16S rRNA
sequences are highly conserved and can be used as a
phylogenetic marker to classify organisms and place them
in evolutionary context. Organisms whose 16S sequences are
at least 97% identical are commonly considered to be the

same ribotype [11], otherwise referred to as species, opera-
tional taxonomic units, or phylotypes.
Although rRNA-based analysis has revolutionized our view

of genetic diversity, and has allowed the analysis of a large
part of the uncultivated majority, it has been less useful in
predicting biochemical diversity. Furthermore, the relation-
ship between genetic and biochemical diversity, even for
cultivated microbes, is not always predictable or clear. For
instance, organisms that have very similar ribotypes (97% or
greater homology) may have vast differences in physiology,
biochemistry, and genome content. For example, the gene
complement of Escherichia coli O157:H7 was found to be
substantially different from the K12 strain of the same species
[12].
In this paper, we report the results of the first phase of the

Sorcerer II Global Ocean Sampling (GOS) expedition, a
metagenomic study designed to address questions related to
genetic and biochemical microbial diversity. This survey was
inspired by the British Challenger expedition that took place
from 1872–1876, in which the diversity of macroscopic
marine life was documented from dredged bottom samples
approximately every 200 miles on a circumnavigation [13–15].
Through the substantial dataset described here, we identified
60 highly abundant ribotypes associated with the open ocean
and aquatic samples. Despite this relative lack of diversity in
ribotype content, we confirm and expand upon previous
observations that there is tremendous within-ribotype diver-
sity in marine microbial populations [4,7,8,16,17]. New
techniques and tools were developed to make use of the
sampling and sequencing metadata. These tools include: (1)
the fragment recruitment tool for performing and visualizing
comparative genomic analyses when a reference sequence is
available; (2) new assembly techniques that use metadata to
produce assemblies for uncultivated abundant microbial taxa;
and (3) a whole metagenome comparison tool to compare
entire samples at arbitrary degrees of genetic divergence.
Although there is tremendous diversity within cultivated and
uncultivated microbes alike, this diversity is organized into
phylogenetically distinct groups we refer to as subtypes.
Subtypes can occupy similar environments yet remain

genetically isolated from each other, suggesting that they are
adapted for different environmental conditions or roles
within the community. The variation between and within
subtypes consists primarily of nucleotide polymorphisms but
includes numerous small insertions, deletions, and hyper-
variable segments. Examination of the GOS data in these
terms sheds light on patterns of evolution and also suggests
approaches towards improving the assembly of complex
metagenomic datasets. At least some of this variation can be
associated with functional characters that are a direct
response to the environment. More than 6.1 million proteins,
including thousands of new protein families, have been
annotated from this dataset (described in the accompanying
paper [18]). In combination, these papers bring us closer to
reconciling the genetic and biochemical disconnect and to
understanding the marine microbial community.
We describe a metagenomic dataset generated from the

Sorcerer II expedition. The GOS dataset, which includes and
extends our previously published Sargasso Sea dataset [19],
now encompasses a total of 41 aquatic, largely marine
locations, constituting the largest metagenomic dataset yet
produced with a total of ;7.7 million sequencing reads. In

Author Summary

Marine microbes remain elusive and mysterious, even though they
are the most abundant life form in the ocean, form the base of the
marine food web, and drive energy and nutrient cycling. We know
so little about the vast majority of microbes because only a small
percentage can be cultivated and studied in the lab. Here we report
on the Global Ocean Sampling expedition, an environmental
metagenomics project that aims to shed light on the role of marine
microbes by sequencing their DNA without first needing to isolate
individual organisms. A total of 41 different samples were taken
from a wide variety of aquatic habitats collected over 8,000 km. The
resulting 7.7 million sequencing reads provide an unprecedented
look at the incredible diversity and heterogeneity in naturally
occurring microbial populations. We have developed new bioinfor-
matic methods to reconstitute large portions of both cultured and
uncultured microbial genomes. Organism diversity is analyzed in
relation to sampling locations and environmental pressures. Taken
together, these data and analyses serve as a foundation for greatly
expanding our understanding of individual microbial lineages and
their evolution, the nature of marine microbial communities, and
how they are impacted by and impact our world.

PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770399

Sorcerer II GOS Expedition


the pilot Sargasso Sea study, 200 l surface seawater was
filtered to isolate microorganisms for metagenomic analysis.
DNA was isolated from the collected organisms, and genome
shotgun sequencing methods were used to identify more than
1.2 million new genes, providing evidence for substantial
microbial taxonomic diversity [19]. Several hundred new and
diverse examples of the proteorhodopsin family of light-
harvesting genes were identified, documenting their exten-
sive abundance and pointing to a possible important role in
energy metabolism under low-nutrient conditions. However,
substantial sequence diversity resulted in only limited
genome assembly. These results generated many additional
questions: would the same organisms exist everywhere in the
ocean, leading to improved assembly as sequence coverage
increased; what was the global extent of gene and gene family
diversity, and can we begin to exhaust it with a large but
achievable amount of sequencing; how do regions of the
ocean differ from one another; and how are different
environmental pressures reflected in organisms and com-
munities? In this paper we attempt to address these issues.

Results

Sampling and the Metagenomic Dataset
Microbial samples were collected as part of the Sorcerer II

expedition between August 8, 2003, and May 22, 2004, by the
S/V Sorcerer II, a 32-m sailing sloop modified for marine
research. Most specimens were collected from surface water
marine environments at approximately 320-km (200-mile)
intervals. In all, 44 samples were obtained from 41 sites
(Figure 1), covering a wide range of distinct surface marine

environments as well as a few nonmarine aquatic samples for
contrast (Table 1).
Several size fractions were isolated for every site (see

Materials and Methods). Total DNA was extracted from one
or more fractions, mostly from the 0.1–0.8-lm size range.
This fraction is dominated by bacteria, whose compact
genomes are particularly suitable for shotgun sequencing.
Random-insert clone libraries were constructed. Depending
on the uniqueness of each sampling site and initial estimates
of the genetic diversity, between 44,000 and 420,000 clones
per sample were end-sequenced to generate mated sequenc-
ing reads. In all, the combined dataset includes 6.25 Gbp of
sequence data from 41 different locations. Many of the clone
libraries were constructed with a small insert size (,2 kbp) to
maximize cloning efficiency. As this often resulted in mated
sequencing reads that overlapped one another, overlapping
mated reads were combined, yielding a total of ;6.4 M
contiguous sequences, totaling ;5.9 Gbp of nonredundant
sequence. Taken together, this is the largest collection of
metagenomic sequences to date, providing more than a 5-fold
increase over the dataset produced from the Sargasso Sea
pilot study [19] and more than a 90-fold increase over the
other large marine metagenomic dataset [20].

Assembly
Assembling genomic data into larger contigs and scaffolds,

especially metagenomic data, can be extremely valuable, as it
places individual sequencing reads into a greater genomic
context. A largely contiguous sequence links genes into
operons, but also permits the investigation of larger
biochemical and/or physiological pathways, and also connects
otherwise-anonymous sequences with highly studied ‘‘taxo-

Figure 1. Sampling Sites

Microbial populations were sampled from locations in the order shown. Samples were collected at approximately 200 miles (320 km) intervals along the
eastern North American coast through the Gulf of Mexico into the equatorial Pacific. Samples 00 and 01 identify sets of sites sampled as part of the
Sargasso Sea pilot study [19]. Samples 27 through 36 were sampled off the Galapagos Islands (see inset). Sites shown in gray were not analyzed as part
of this study.
doi:10.1371/journal.pbio.0050077.g001

PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770400

Sorcerer II GOS Expedition


T
a

b
le

1
.

Sa
m

p
lin

g
Lo

ca
ti

o
n

s
an

d
En

vi
ro

n
m

e
n

ta
l

D
at

a

ID
S

a
m

p
le

L
o

ca
ti

o
n

C
o

u
n

tr
y

D
a

te
,

m
m

/d
d

/y
y

T
im

e
L

o
ca

ti
o

n
S

a
m

p
le

D
e

p
th

,

m

W
a

te
r

D
e

p
th

,

m

T
(8

C
)a

S
b

(p
p

t)

S
iz

e

F
ra

ct
io

n

(l
m

)

H
a

b
it

a
t

T
y

p
e

C
h

l
a

S
a

m
p

le

M
o

n
th

(A
n

n
u

a
l

6
S

E
)

m
g

/m
�

3

G
o

o
d

S
e

q
u

e
n

ce
s

G
S0

0
a

Sa
rg

as
so

St
at

io
n

s
1

3
an

d
1

1
B

e
rm

u
d

a
(U

K
)

0
2

/2
6

/0
3

3
:0

0
3

1
83

2
96

99
n

;
6

3
83

5
94

2
99

w
5

.0
.

4
,2

0
0

2
0

.0
2

0
.5

3
6

.6
0

.1
–

0
.8

O
p

e
n

o
ce

an
0

.1
7

(0
.0

.9
6

0
.0

2
)

6
4

4
,5

5
1

1
0

:1
0

3
1
81

0
95

0
99

n
;

6
4
81

9
92

7
99

w
3

6
.7

G
S0

0
b

Sa
rg

as
so

St
at

io
n

s
1

3
an

d
1

1
B

e
rm

u
d

a
(U

K
)

0
2

/2
6

/0
3

3
:3

5
3

1
83

2
91

0
99

n
;

6
3
83

5
97

0
99

w
5

.0
.

4
,2

0
0

2
0

.0
2

0
.5

3
6

.6
0

.2
2

–
0

.8
O

p
e

n
o

ce
an

0
.1

7
(0

.0
.9

6
0

.0
2

)
3

1
7

,1
8

0

1
0

:4
3

3
1
81

0
95

0
n

;
6

4
81

9
92

7
99

w
3

6
.7

G
S0

0
c

Sa
rg

as
so

St
at

io
n

s
3

B
e

rm
u

d
a

(U
K

)
0

2
/2

5
/0

3
1

3
:0

0
3

2
80

9
93

0
99

n
;

6
4
80

0
93

6
99

w
5

.0
.

4
,2

0
0

1
9

.8
3

6
.7

0
.2

2
–

0
.8

O
p

e
n

o
ce

an
0

.1
7

(0
.0

.9
6

0
.0

2
)

3
6

8
,8

3
5

G
S0

0
d

Sa
rg

as
so

St
at

io
n

s
1

3
B

e
rm

u
d

a
(U

K
)

0
2

/2
5

/0
3

1
7

:0
0

3
1
83

2
96

99
n

;
6

3
83

5
94

2
99

w
5

.0
.

4
,2

0
0

2
0

.0
3

6
.6

0
.2

2
–

0
.8

O
p

e
n

o
ce

an
0

.1
7

(0
.0

.9
6

0
.0

2
)

3
3

2
,2

4
0

G
S0

1
a

H
yd

ro
st

at
io

n
S

B
e

rm
u

d
a

(U
K

)
0

5
/1

5
/0

3
1

1
:4

0
3

2
81

0
90

0
99

n
6

4
83

0
90

0
99

w
5

.0
.

4
,2

0
0

2
2

.9
3

6
.7

3
.0

–
2

0
.0

O
p

e
n

o
ce

an
0

.1
0

(0
.1

0
6

0
.0

1
)

1
4

2
,3

5
2

G
S0

1
b

H
yd

ro
st

at
io

n
S

B
e

rm
u

d
a

(U
K

)
0

5
/1

5
/0

3
1

1
:4

0
3

2
81

0
90

0
99

n
;

6
4
83

0
90

0
99

w
5

.0
.

4
,2

0
1

2
2

.9
3

6
.7

0
.8

–
3

.0
O

p
e

n
o

ce
an

0
.1

0
(0

.1
0

6
0

.0
1

)
9

0
,9

0
5

G
S0

1
c

H
yd

ro
st

at
io

n
S

B
e

rm
u

d
a

(U
K

)
0

5
/1

5
/0

3
1

1
:4

0
3

2
81

0
90

0
99

n
;

6
4
83

0
90

0
99

w
5

.0
.

4
,2

0
2

2
2

.9
3

6
.7

0
.1

–
0

.8
O

p
e

n
o

ce
an

0
.1

(0
.1

6
0

.0
1

)
9

2
,3

5
1

G
S0

2
G

u
lf

o
f

M
ai

n
e

U
SA

0
8

/2
1

/0
3

6
:3

2
4

2
83

0
91

1
99

n
;

6
7
81

4
92

4
99

w
1

.0
1

0
6

1
8

.2
2

9
.2

0
.1

–
0

.8
C

o
as

ta
l

1
.4

(1
.1

2
6

0
.1

9
)

1
2

1
,5

9
0

G
S0

3
B

ro
w

n
s

B
an

k,
G

u
lf

o
f

M
ai

n
e

C
an

ad
a

0
8

/2
1

/0
3

1
1

:5
0

4
2
85

1
91

0
99

n
;

6
6
81

3
92

99
w

1
.0

1
1

9
1

1
.7

2
9

.9
0

.1
–

0
.8

C
o

as
ta

l
1

.4
(1

.1
2

6
0

.1
9

)
6

1
,6

0
5

G
S0

4
O

u
ts

id
e

H
al

if
ax

,
N

o
va

Sc
o

ti
a

C
an

ad
a

0
8

/2
2

/0
3

5
:2

5
4

4
88

91
4

99
n

;
6

3
83

8
94

0
99

w
2

.0
1

4
2

1
7

.3
2

8
.3

0
.1

–
0

.8
C

o
as

ta
l

0
.4

(0
.7

8
6

0
.1

7
)

5
2

,9
5

9

G
S0

5
B

e
d

fo
rd

B
as

in
,

N
o

va
Sc

o
ti

a
C

an
ad

a
0

8
/2

2
/0

3
1

6
:2

1
4

4
84

1
92

5
99

n
;

6
3
83

8
91

4
99

w
1

.0
6

4
1

5
.0

3
0

.2
0

.1
–

0
.8

Em
b

ay
m

e
n

t
6

(6
.7

6
6

0
.9

8
)

6
1

,1
3

1

G
S0

6
B

ay
o

f
Fu

n
d

y,
N

o
va

Sc
o

ti
a

C
an

ad
a

0
8

/2
3

/0
3

1
0

:4
7

4
5
86

94
2

99
n

;
6

4
85

6
94

8
99

w
1

.0
1

1
1

1
.2

0
.1

–
0

.8
Es

tu
ar

y
2

.8
(1

.8
7

6
0

.1
8

)
5

9
,6

7
9

G
S0

7
N

o
rt

h
e

rn
G

u
lf

o
f

M
ai

n
e

C
an

ad
a

0
8

/2
5

/0
3

8
:2

5
4

3
83

7
95

6
99

n
;

6
6
85

0
95

0
99

w
1

.0
1

3
9

1
7

.9
3

1
.7

c
0

.1
–

0
.8

C
o

as
ta

l
1

.4
(1

.1
2

6
0

.1
9

)
5

0
,9

8
0

G
S0

8
N

e
w

p
o

rt
H

ar
b

o
r,

R
I

U
SA

1
1

/1
6

/0
3

1
6

:4
5

4
1
82

9
99

99
n

;
7

1
82

1
94

99
w

1
.0

1
2

9
.4

2
6

.5
c

0
.1

–
0

.8
C

o
as

ta
l

2
.2

(1
.5

9
6

0
.1

7
)

1
2

9
,6

5
5

G
S0

9
B

lo
ck

Is
la

n
d

,
N

Y
U

SA
1

1
/1

7
/0

3
1

0
:3

0
4

1
85

92
8

99
n

;
7

1
83

6
98

99
w

1
.0

3
2

1
1

.0
3

1
.0

c
0

.1
–

0
.8

C
o

as
ta

l
4

.0
(2

.7
2

6
0

.2
4

)
7

9
,3

0
3

G
S1

0
C

ap
e

M
ay

,
N

J
U

SA
1

1
/1

8
/0

3
4

:3
0

3
8
85

6
92

4
99

n
;

7
4
84

1
96

99
w

1
.0

1
0

1
2

.0
3

1
.0

c
0

.1
–

0
.8

C
o

as
ta

l
2

.0
(2

.7
5

6
0

.3
3

)
7

8
,3

0
4

G
S1

1
D

e
la

w
ar

e
B

ay
,

N
J

U
SA

1
1

/1
8

/0
3

1
1

:3
0

3
9
82

5
94

99
n

;
7

5
83

0
91

5
99

w
1

.0
8

1
1

.0
0

.1
–

0
.8

Es
tu

ar
y

4
.8

(9
.2

3
6

1
.0

2
)

1
2

4
,4

3
5

G
S1

2
C

h
e

sa
p

e
ak

e
B

ay
,

M
D

U
SA

1
2

/1
8

/0
3

1
1

:3
2

3
8
85

6
94

9
99

n
;

7
6
82

5
92

99
w

1
.0

2
5

3
.2

3
.4

7
c

0
.1

–
0

.8
Es

tu
ar

y
2

1
.0

(1
5

.0
6

1
.0

1
)

1
2

6
,1

6
2

G
S1

3
O

ff
N

ag
s

H
e

ad
,

N
C

U
SA

1
2

/1
9

/0
3

6
:2

8
3

6
80

91
4

99
n

;
7

5
82

3
94

1
99

w
1

.0
2

0
9

.3
0

.1
–

0
.8

C
o

as
ta

l
3

.0
(2

.2
4

6
0

.2
5

)
1

3
8

,0
3

3

G
S1

4
So

u
th

o
f

C
h

ar
le

st
o

n
,

SC
U

SA
1

2
/2

0
/0

3
1

7
:1

2
3

2
83

0
92

5
99

n
;

7
9
81

5
95

0
99

w
1

.0
3

1
1

8
.6

0
.1

–
0

.8
C

o
as

ta
l

1
.7

0
(1

.9
2

6
0

.2
5

)
1

2
8

,8
8

5

G
S1

5
O

ff
K

e
y

W
e

st
,

FL
U

SA
0

1
/0

8
/0

4
6

:2
5

2
4
82

9
91

8
99

n
;

8
3
84

91
2

99
w

2
.0

4
7

2
5

.3
3

6
.0

0
.1

–
0

.8
C

o
as

ta
l

0
.2

(0
.2

7
6

0
.0

9
)

1
2

7
,3

6
2

G
S1

6
G

u
lf

o
f

M
e

xi
co

U
SA

0
1

/0
8

/0
4

1
4

:1
5

2
4
81

0
92

9
99

n
;

8
4
82

0
94

0
99

w
2

.0
3

,3
3

3
2

6
.4

3
5

.8
0

.1
–

0
.8

C
o

as
ta

l
se

a
0

.1
6

(0
.1

1
6

0
.0

1
)

1
2

7
,1

2
2

G
S1

7
Y

u
ca

ta
n

C
h

an
n

e
l

M
e

xi
co

0
1

/0
9

/0
4

1
3

:4
7

2
0
83

1
92

1
99

n
;

8
5
82

4
94

9
99

w
2

.0
4

,5
1

3
2

7
.0

3
5

.8
0

.1
–

0
.8

O
p

e
n

o
ce

an
0

.1
3

(0
.0

9
6

0
.0

1
)

2
5

7
,5

8
1

G
S1

8
R

o
sa

ri
o

B
an

k
H

o
n

d
u

ra
s

0
1

/1
0

/0
4

8
:1

2
1

8
82

91
2

99
n

;
8

3
84

7
95

99
w

2
.0

4
,4

7
0

2
7

.4
3

5
.4

0
.1

–
0

.8
O

p
e

n
o

ce
an

0
.1

4
(0

.0
9

6
0

.0
1

)
1

4
2

,7
4

3

G
S1

9
N

o
rt

h
e

as
t

o
f

C
o

ló
n

P
an

am
a

0
1

/1
2

/0
4

9
:0

3
1

0
84

2
95

9
99

n
;

8
0
81

5
91

6
99

w
2

.0
3

,3
3

6
2

7
.7

3
5

.4
0

.1
–

0
.8

C
o

as
ta

l
0

.2
3

(0
.1

5
6

0
.0

2
)

1
3

5
,3

2
5

G
S2

0
La

ke
G

at
u

n
P

an
am

a
0

1
/1

5
/0

4
1

0
:2

4
9
89

95
2

99
n

;
7

9
85

0
91

0
99

w
2

.0
4

2
8

.5
0

.0
6

0
.1

–
0

.8
Fr

e
sh

w
at

e
r

2
9

6
,3

5
5

G
S2

1
G

u
lf

o
f

P
an

am
a

P
an

am
a

0
1

/1
9

/0
4

1
6

:4
8

8
87

94
5

99
n

;
7

9
84

1
92

8
99

w
2

.0
7

6
2

7
.6

3
0

.7
0

.1
–

0
.8

C
o

as
ta

l
0

.5
0

(0
.7

3
6

0
.2

2
)

1
3

1
,7

9
8

G
S2

2
2

5
0

m
ile

s
fr

o
m

P
an

am
a

C
it

y
P

an
am

a
0

1
/2

0
/0

4
1

6
:3

9
6
82

9
93

4
99

n
;

8
2
85

4
91

4
99

w
2

.0
2

,4
3

1
2

9
.3

3
2

.3
0

.1
–

0
.8

O
p

e
n

o
ce

an
0

.3
3

(0
.2

8
6

0
.0

2
)

1
2

1
,6

6
2

G
S2

3
3

0
m

ile
s

fr
o

m
C

o
co

s
Is

la
n

d
C

o
st

a
R

ic
a

0
1

/2
1

/0
4

1
5

:0
0

5
83

8
92

4
99

n
;

8
6
83

3
95

5
99

w
2

.0
1

,1
3

9
2

8
.7

3
2

.6
0

.1
–

0
.8

O
p

e
n

o
ce

an
0

.0
7

(0
.1

9
6

0
.0

2
)

1
3

3
,0

5
1

G
S2

5
D

ir
ty

R
o

ck
,

C
o

co
s

Is
la

n
d

C
o

st
a

R
ic

a
0

1
/2

8
/0

4
1

0
:5

1
5
83

3
91

0
99

n
;

8
7
85

91
6

99
w

1
.1

3
0

2
8

.3
3

1
.4

0
.8

–
3

.0
Fr

in
g

in
g

re
e

f
0

.1
1

(0
.1

9
6

0
.0

1
)

1
2

0
,6

7
1

G
S2

6
1

3
4

m
ile

s
N

E
o

f
G

al
ap

ag
o

s
Ec

u
ad

o
r

0
2

/0
1

/0
4

1
6

:1
6

1
81

5
95

1
99

n
;

9
0
81

7
94

2
99

w
2

.0
2

,3
7

6
2

7
.8

3
2

.6
0

.1
–

0
.8

O
p

e
n

o
ce

an
0

.2
2

(0
.2

8
6

0
.0

2
)

1
0

2
,7

0
8

G
S2

7
D

e
vi

l’s
C

ro
w

n
,

Fl
o

re
an

a
Ec

u
ad

o
r

0
2

/0
4

/0
4

1
1

:4
1

1
81

2
95

8
99

s;
9

0
82

5
92

2
99

w
2

.0
2

.3
2

5
.5

3
4

.9
0

.1
–

0
.8

C
o

as
ta

l
0

.4
0

(0
.3

8
6

0
.0

3
)

2
2

2
,0

8
0

G
S2

8
C

o
as

ta
l

Fl
o

re
an

a
Ec

u
ad

o
r

0
2

/0
4

/0
4

1
5

:4
7

1
81

3
91

99
s;

9
0
81

9
91

1
99

w
2

.0
1

5
6

2
5

.0
c

0
.1

–
0

.8
C

o
as

ta
l

0
.3

5
(0

.3
5

6
0

.0
2

)
1

8
9

,0
5

2

G
S2

9
N

o
rt

h
Ja

m
e

s
B

ay
,

Sa
n

ti
g

o
Ec

u
ad

o
r

0
2

/0
8

/0
4

1
8

:0
3

0
81

2
90

99
s;

9
0
85

0
97

99
w

2
.0

1
2

2
6

.2
3

4
.5

0
.1

–
0

.8
C

o
as

ta
l

0
.4

0
(0

.3
9

6
0

.0
3

)
1

3
1

,5
2

9

G
S3

0
W

ar
m

se
e

p
,

R
o

ca
R

e
d

o
n

d
a

Ec
u

ad
o

r
0

2
/0

9
/0

4
1

1
:4

2
0
81

6
92

0
99

n
;

9
1
83

8
90

99
w

1
9

.0
1

9
2

6
.9

0
.1

–
0

.8
W

ar
m

se
e

p
3

5
9

,1
5

2

G
S3

1
U

p
w

e
lli

n
g

,
Fe

rn
an

d
in

a
Ec

u
ad

o
r

0
2

/1
0

/0
4

1
4

:4
3

0
81

8
94

99
s;

9
1
83

9
96

99
w

1
2

.0
1

9
1

8
.6

0
.1

–
0

.8
C

o
as

ta
l

u
p

w
e

lli
n

g
0

.3
5

(0
.3

9
6

0
.0

3
)

4
3

6
,4

0
1

G
S3

2
M

an
g

ro
ve

,
Is

ab
e

lla
Ec

u
ad

o
r

0
2

/1
1

/0
4

1
1

:3
0

0
83

5
93

8
99

s;
9

1
84

91
0

99
w

0
.3

0
.6

7
2

5
.4

0
.1

–
0

.8
M

an
g

ro
ve

1
4

8
,0

1
8

G
S3

3
P

u
n

ta
C

o
rm

o
ra

n
t

La
g

o
o

n
,

Fl
o

re
an

a
Ec

u
ad

o
r

0
2

/1
9

/0
4

1
3

:3
5

1
81

3
94

2
99

s;
9

0
82

5
94

5
99

w
0

.2
0

.3
3

3
7

.6
4

6
c

0
.1

–
0

.8
H

yp
e

rs
al

in
e

6
9

2
,2

5
5

G
S3

4
N

o
rt

h
Se

am
o

re
Ec

u
ad

o
r

0
2

/1
9

/0
4

1
7

:0
6

0
82

2
95

9
99

s;
9

0
81

6
94

7
99

w
2

.0
3

5
2

7
.5

0
.1

–
0

.8
C

o
as

ta
l

0
.3

6
(0

.3
5

6
0

.0
2

)
1

3
4

,3
4

7

G
S3

5
W

o
lf

Is
la

n
d

Ec
u

ad
o

r
0

3
/0

1
/0

4
1

6
:4

4
1
82

3
92

1
99

n
;

9
1
84

9
91

99
w

2
.0

7
1

2
1

.8
3

4
.5

0
.1

–
0

.8
C

o
as

ta
l

0
.2

8
(0

.3
1

6
0

.0
2

)
1

4
0

,8
1

4

G
S3

6
C

ab
o

M
ar

sh
al

l,
Is

ab
e

lla
Ec

u
ad

o
r

0
3

/0
2

/0
4

1
2

:5
2

0
81

91
5

99
s;

9
1
81

1
95

2
99

w
2

.0
6

7
2

5
.8

3
4

.6
0

.1
–

0
.8

C
o

as
ta

l
0

.6
5

(0
.4

5
6

0
.0

5
)

7
7

,5
3

8

G
S3

7
Eq

u
at

o
ri

al
P

ac
if

ic
T

A
O

B
u

o
y

In
te

rn
at

io
n

al
0

3
/1

7
/0

4
1

6
:3

8
1
85

8
92

6
99

s;
9

5
80

95
3

99
w

2
.0

3
,3

3
4

2
8

.8
0

.1
–

0
.8

O
p

e
n

o
ce

an
0

.2
1

(0
.2

4
6

0
.0

2
)

6
5

,6
7

0

G
S4

7
2

0
1

m
ile

s
fr

o
m

Fr
e

n
ch

P
o

ly
n

e
si

a
In

te
rn

at
io

n
al

0
3

/2
8

/0
4

1
5

:2
5

1
0
87

95
3

99
s;

1
3

5
82

6
95

8
99

w
3

0
.0

2
,4

0
0

2
8

.6
3

7
.3

0
.1

–
0

.8
O

p
e

n
o

ce
an

6
6

,0
2

3

G
S5

1
R

an
g

ir
o

ra
A

to
ll

Fr
e

n
ch

P
o

ly
n

e
si

a
0

5
/2

2
/0

4
7

:0
4

1
5
88

93
7

99
s;

1
4

7
82

6
96

99
w

1
.0

1
0

2
7

.3
3

4
.2

0
.1

–
0

.8
C

o
ra

l
re

e
f

at
o

ll
1

2
8

,9
8

2

T
o

ta
l

7
,6

9
7

,9
2

6

a
T

e
m

p
e

ra
tu

re
.

b
Sa

lin
it

y.
c
M

e
as

u
re

m
en

ts
w

e
re

ac
q

u
ir

e
d

fr
o

m
n

e
ar

b
y

ve
ss

e
ls

an
d

/o
r

re
se

ar
ch

st
at

io
n

s.
d

o
i:1

0
.1

3
7

1
/j

o
u

rn
al

.p
b

io
.0

0
5

0
0

7
7

.t
0

0
1

PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770401

Sorcerer II GOS Expedition


nomic markers’’ such as 16S or recA, thus clearly identifying
the taxonomic group with which they are associated. The
primary assembly of the combined GOS dataset was
performed using the Celera Assembler [21] with modifica-
tions as previously described [19] and as given in Materials
and Methods. The assembly was performed with quite
stringent criteria, beginning with an overlap cutoff of 98%
identity to reduce the potential for artifacts (e.g., chimeric
assemblies or consensus sequences diverging substantially
from the genome of any given cell). This assembly was the
substrate for annotation (see the accompanying paper by
Yooseph et al. [18]).

The degree of assembly of a metagenomic sample provides
an indication of the diversity of the sample. A few substantial
assemblies notwithstanding, the primary assembly was strik-
ingly fragmented (Table 2). Only 9% of sequencing reads
went into scaffolds longer than 10 kbp. A majority (53%) of
the sequencing reads remained unassembled singletons.
Scaffolds containing more than 50 kb of consensus sequence
totaled 20.7 Mbp; of these, .75% were produced from a
single Sargasso Sea sample and correspond to the Burkholderia
or Shewanella assemblies described previously [19]. These
results highlight the unusual abundance of these two
organisms in a single sample, which significantly affected

our expectations regarding the current dataset. Given the
large size of the combined dataset and the substantial amount
of sequencing performed on individual filters, the overall lack
of assembly provides evidence of a high degree of diversity in
surface planktonic communities. To put this in context,
suppose there were a clonal organism that made up 1% of
our data, or ;60 Mbp. Even a genome of 10 Mbp—enormous
by bacterial standards—would be covered ;6-fold. Such data
might theoretically assemble with an average contig ap-
proaching 50 kb [22]. While real assemblies generally fall
short of theory for various reasons, Shewanella data make up
,1% of the total GOS dataset, and yet most of the relevant
reads assemble into scaffolds .50 kb. Thus, with few scaffolds
of significant length, we could conclude that there are very
few clonal organisms present at even 1% in the GOS dataset.
To investigate the nature of the implied diversity and to see

whether greater assembly could be achieved, we explored
several alternative approaches. Breaks in the primary
assembly resulted from two factors: incomplete sequence
coverage and conflicts in the data. Conflicts can break
assemblies when there is no consistent way to chain together
all overlapping sequencing reads. As it was possible that there
would be fewer conflicts within a single sample (i.e., that
diversity within a single sample would be lower), assemblies
were attempted with individual samples. However, the results
did not show any systematic improvements even in those
samples with greater coverage (unpublished data). Upon
manual inspection, most assembly-breaking conflicts were
found to be local in nature. These observations suggested that
reducing the degree of sequence identity required for
assembly could ameliorate both factors limiting assembly:
effective coverage would increase and many minor conflicts
would be resolved.
Accordingly, we produced a series of assemblies based on

98%, 94%, 90%, 85%, and 80% identity overlaps for two
subsets of the GOS dataset, again using the Celera Assembler.
Assembly lengths increased as the overlap cutoff decreased
from 98% to 94% to 90%, and then leveled off or even
dropped as stringency was reduced below 90% (Table 3).
Although larger assemblies could be generated using lower
identity overlaps, significant numbers of overlaps satisfying
the chosen percent identity cutoff still went unused in each
assembly. This is consistent with a high rate of conflicting
overlaps and in turn diagnostic of significant polymorphism.
In mammalian sequencing projects the use of larger insert

libraries is critical to producing larger assemblies because of
their ability to span repeats or local polymorphic regions [23].
The shotgun sequencing libraries from the GOS filters were
typically constructed from inserts shorter than 2 kb. Longer
plasmid libraries were attempted but were much less stable.
We obtained paired-end sequences from 21,419 fosmid clones
(average insert size, 36 kb; [24,25]) from the 0.1-micron
fraction of GS-33. The effect of these long mate pairs on the
GS-33 assembly was quite dramatic, particularly at high
stringency (e.g., improving the largest scaffold from 70 kb to
1,247 kb and the largest contig from 70 kb to 427 kb). At least
for GS-33 this suggests that many of the polymorphisms affect
small, localized regions of the genome that can be spanned
using larger inserts. This degree of improvement may be
greater than what could be expected in general, as the
diversity of GS-33 is by far the lowest of any of the currently

Table 2. Summary Assembly Statistics

Category Statistic Value

Assembly

inputs

Number of reads used for assembly 7,697,926

Total read length (bp) 6,325,208,303

Number of ‘‘intigs’’ used for assemblya 6,389,523

Total intig length (bp)a 5,883,982,712

Assembly

outputs

Number of assembliesb 3,081,849

Total assembled consensus length (bp) 4,460,027,783

Percentage of unassembled reads 53%

Percentage of assembly at .13 coverage 15.3%

Base pairs in contigs � 10 kb 39,427,102

Base pairs in contigs � 50 kb 15,723,513

Base pairs in scaffolds � 2.5 kb (consensus bases) 458,196,599

Base pairs in scaffolds � 5 kb (consensus bases) 138,137,150

Base pairs in scaffolds � 10 kb (consensus bases) 65,238,481

Base pairs in scaffolds � 50 kb (consensus bases) 20,738,836

Base pairs in scaffolds � 100 kb (consensus bases) 16,005,244

Base pairs in scaffolds � 300 kb (consensus bases) 8,805,668

Percentage of assembly in scaffolds � 10 kbc 1.5%

Length of longest contig (bp) 977,960

Length of longest scaffold (bp) 2,097,794

N1d assembly bp 15,915

N10d assembly bp 2,533

N50d assembly bp 1,611

N1d contig bp 8,994

N10d contig bp 2,447

N50d contig bp (single reads)

aIntigs are overlapping mated reads that have been collapsed into a single sequence as
input into the assembler.
bAssemblies refers to the total number of scaffolds, pairs of mated nonoverlapping
singletons, and singleton unmated reads.
cFor comparison purposes, 10 kb is the average contig size predicted for 4.13 coverage
for an idealized shotgun assembly of a repeat-free, clonal genome [22].
dN1 indicates the length of the next largest assembled sequence or contig such that 1%
of the sequence data falls into longer assemblies or contigs. N10 and N50 indicate that
10% and 50% of the data fell into larger assemblies or contigs.
doi:10.1371/journal.pbio.0050077.t002

PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770402

Sorcerer II GOS Expedition


sequenced GOS samples, yet it clearly indicates the utility of
including larger insert libraries for assembly.

Fragment Recruitment
In the absence of substantial assembly, direct comparison

of the GOS sequencing data to the genomes of sequenced
microbes is an alternative way of providing context, and also
allows for exploration of genetic variation and diversity. A
large and growing set of microbial genomes are available

from the National Center for Biotechnology Information
(NCBI; http://www.ncbi.nlm.nih.gov). At the time of this study,
we used 334 finished and 250 draft microbial genomes as
references for comparison with the GOS sequencing reads.
Comparisons were carried out in nucleotide-space using the
sequence alignment tool BLAST [26]. BLAST parameters
were designed to be extremely lenient so as to detect even
distant similarities (as low as 55% identity). A large
proportion of the GOS reads, 70% in all, aligned to one or
more genomes under these conditions. However, many of the
alignments were of low identity and used only a portion of
the entire read. Such low-quality hits may reflect distant
evolutionary relationships, and therefore less information is
gained based on the context of the alignment. More stringent
criteria could be imposed requiring that the reads be aligned
over nearly their entire length without any large gaps. Using
this stringent criterion only about 30% of the reads aligned
to any of the 584 reference genomes. We refer to these fully
aligned reads as ‘‘recruited reads.’’ Recruited reads are far
more likely to be from microbes closely related to the
reference sequence (same species) than are partial align-
ments. Despite the large number of microbial genomes
currently available, including a large number of marine
microbes, these results indicate that a substantial majority of
GOS reads cannot be specifically related to available micro-
bial genomes.
The amount and distribution of reads recruited to any

given genome provides an indication of the abundance of
closely related organisms. Only genomes from the five
bacterial genera Prochlorococcus, Synechococcus, Pelagibacter,
Shewanella, and Burkholderia yielded substantial and uniform
recruitment of GOS fragments over most of a reference
genome (Table 4). These genera include multiple reference
genomes, and we observed significant differences in recruit-
ment patterns even between organisms belonging to the same
species (Figure 2A–2I). Three genera, Pelagibacter (Figure 2A),
Prochlorococcus (Figure 2B–2F), and Synechococcus (Figure 2G–
2I), were found abundantly in a wide range of samples and
together accounted for roughly 50% of all the recruited reads
(though only ;15% of all GOS sequencing reads). By
contrast, although every genome tested recruited some GOS
reads, most recruited only a small number, and these reads
clustered at lower identity to locations corresponding to large
highly conserved genes (for typical examples see Figure 2E–
2F). We refer to this pattern as nonspecific recruitment as it
reflects taxonomically nonspecific signals, with the reads in

Table 4. Microbial Genera that Recruited the Bulk of the GOS Reads

Genus Read Count Best Strain

All Reads 80%þa 90%þb All Reads 80%þa 90%þb

Pelagibacter 922,677 195,539 36,965 HTCC1062 HTCC1062 HTCC1062

Prochlorococcus 208,999 159,102 84,325 MIT9312 MIT9312 MIT9312

Synechococcus 60,650 26,365 21,594 CC9902 RS9917 RS9917

Burkholderia 151,123 108,610 93,081 383 383 383

Shewanella 59,086 34,138 27,693 MR-1 MR-1 MR-1

Remaining 43,244 2,367 564 Buchnera aphidicola Str. Sg Buchnera aphidicola Str. APS Alteromonas macleodii

aReads aligned at or above 80% identity over the entire length of the read.
bReads aligned at or above 90% identity over the entire length of the read.
doi:10.1371/journal.pbio.0050077.t004

Table 3. Evaluation of Alternative Assembly Methods

Dataset Type Percent

Identity

Base Pairs in

10 k Contigs

Base Pairs in

100 k Contigs

GS33 plasmids WGSa 98 13,669,678 0

94 19,536,324 2,749,543

90 20,996,826 3,729,765

85 20,327,989 3,505,324

80 19,245,637 4,195,959

E-asmb 98 22,000,579 5,604,857

94 22,781,462 7,302,801

90 22,702,764 7,600,441

85 22,570,933 7,937,079

80 20,335,558 4,779,684

GS33 with fosmids WGSa 98 15,031,557 1,306,992

94 22,310,335 4,449,710

90 22,944,278 5,585,959

85 22,251,738 5,485,013

80 21,088,975 5,684,925

GS17,18,23,26 WGSa 98 185,058 0

94 5,422,366 213,755

90 10,694,783 373,822

85 11,514,421 800,290

80 9,004,221 879,401

E-asmb 98 2,047,524 0

94 10,668,547 1,184,881

90 15,215,981 2,634,227

85 15,786,515 3,132,152

80 13,767,929 2,942,160

Combined GOS WGSa 98 39,427,102 11,488,828

94 98,887,937 12,376,236

E-asmb 98 91,526,091 16,444,304

94 163,612,717 25,564,163

90 186,614,813 28,752,198

85 181,887,218 27,154,335

80 161,160,091 23,794,832

aWhole-genome shotgun (WGS) assembly performed with the Celera Assembler.
bAssemblies performed using extreme assembly approach (E-asm).
doi:10.1371/journal.pbio.0050077.t003

PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770403

Sorcerer II GOS Expedition


Figure 2. Fragment Recruitment Plots

The horizontal axis of each panel corresponds to a 100-kb segment of genomic sequence from the indicated reference microbial genome. The vertical
axis indicates the sequence identity of an alignment between a GOS sequence and the reference genomic sequence. The identity ranges from 100%
(top) to 50% (bottom). Individual GOS sequencing reads were colored to reflect the sample from which they were isolated. Geographically nearby
samples have similar colors (see Poster S1 for key). Each organism shows a distinct pattern of recruitment reflecting its origin and relationship to the
environmental data collected during the course of this study.
(A) P. ubique HTCC1062 recruits the greatest density of GOS sequences of any genome examined to date. The GOS sequences show geographic

PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770404

Sorcerer II GOS Expedition


question often recruiting to distantly related sets of genomes.
Most microbial genomes, including many of the marine
microbes (e.g., the ubiquitous genus Vibrio), demonstrated this
nonspecific pattern of recruitment.

The relationship between the similarity of an individual
sequencing read to a given genome and the sample fromwhich
the read was isolated can provide insight into the structure,
evolution, and geographic distribution of microbial popula-
tions. These relationships were assessed by constructing a
‘‘percent identity plot’’ [27] in which the alignment of a read to
a reference sequence is shown as a bar whose horizontal
position indicates location on the reference andwhose vertical
position indicates the percent identity of the alignment. We
colored the plotted reads according to the samples to which
they belonged, thus indirectly representing various forms of
metadata (geographic, environmental, and laboratory varia-
bles). We refer to these plots that incorporate metadata as
fragment recruitment plots. Fragment recruitment plots of
GOS sequences recruited to the entire genomes of Pelagibacter
ubique HTCC1062, Prochlorococcus marinus MIT9312, and Syn-
echococcus WH8102 are presented in Poster S1.

Within-Ribotype Population Structure and Variation
Characteristic patterns of recruitment emerged from each

of these abundant marine microbes consisting of horizontal
bands made up of large numbers of GOS reads. These bands
seem constrained to a relatively narrow range of identities
that tile continuously (or at least uniformly, in the case when
abundance/coverage is lower) along ;90% of the reference
sequence. The uninterrupted tiling indicates that environ-
mental genomes are largely syntenic with the reference
genomes. Multiple bands, distinguished by degree of sim-
ilarity to the reference and by sample makeup, may arise on a
single reference (Poster S1D and S1F). Each of these bands
appears to represent a distinct, closely related population we
refer to as a subtype. In some cases, an abundant subtype is

highly similar to the reference genome, as is the case for P.
marinus MIT9312 (Poster S1) and Synechococcus RS9917
(unpublished data). P. ubique HTCC1062 and other Synecho-
coccus strains like WH8102 show more complicated banding
patterns (Poster S1D and S1F) because of the presence of
multiple subtypes that produce complex often overlapping
bands in the plots. Though the recruitment patterns can be
quite complex they are also remarkably consistent over much
of the reference genome. In these more complicated recruit-
ment plots, such as the one for P. ubique HTCC1062,
individual bands can show sudden shifts in identity or
disappear altogether, producing a gap in recruitment that
appears to be specific to that band (see P. ubique recruitment
plots on Poster S1B and S1E, and specifically between 130–
140 kb). Finally, phylogenetic analysis indicates that separate
bands are indeed evolutionarily distinct at randomly selected
locations along the genome.
The amount of sequence variation within a given band

cannot be reliably determined from the fragment recruit-
ment plots themselves. To examine this variation, we
produced multiple sequence alignments and phylogenies of
reads that recruited to several randomly chosen intervals
along given reference genomes to show that there can be
considerable within-subtype variation (Figure 3A–3B). For
example, within the primary band found in recruitment plots
to P. marinus MIT9312, individual pairs of overlapping reads
typically differ on average between 3%–5% at the nucleotide
level (depending on exact location in the genome). Very few
reads that recruited to MIT9312 have perfect (mismatch-free)
overlaps with any other read or to MIT9312, despite ;100-
fold coverage. While many of these differences are silent (i.e.,
do not change amino acid sequences), there is still consid-
erable variation at the protein level (unpublished data). The
amount of variation within subtypes is so great that it is likely
that no two sequenced cells contained identical genomes.

stratification into bands, with sequences from temperate water samples off the North American coast having the highest identity (yellow to yellow-
green colors). At lower identity, sequences from all the marine environments could be aligned to HTCC1062.
(B) P. marinus MIT9312 recruits a large number of GOS sequences into a single band that zigzags between 85%–95% identity on average. These
sequences are largely derived from warm water samples in the Gulf of Mexico and eastern Pacific (green to greenish-blue reads).
(C) P. marinus MED4 recruits largely the same set of reads as MIT9312 (B) though the sequences that form the zigzag recruit at a substantially lower
identity. A small number of sequences from the Sargasso Sea samples (red) are found at high identity.
(D) P. marinus NATL2A recruits far fewer sequences than any of the preceding panels. Like MED4, a small number of high-identity sequences were
recruited from the Sargasso samples.
(E) P. marinus MIT9313 is a deep-water low-light–adapted strain of Prochlorococcus. GOS sequences were recruited almost exclusively at low identity in
vertical stacks that correspond to the locations of conserved genes. On the left side of this panel is a very distinctive pattern of recruitment that
corresponds to the highly conserved 16S and 23S mRNA gene operon.
(F) P. marinus CCMP1375, another deep-water low-light–adapted strain, does not recruit GOS sequences at high identity. Only stacks of sequences are
seen corresponding to the location of conserved genes.
(G) Synechococcus WH8102 recruits a modest number of high-identity sequences primarily from the Sargasso Sea samples. A large number of moderate
identity matches from the Pacific and hypersaline lagoon (GS33) samples are also visible.
(H) Synechococcus CC9605 recruits largely the same sequences as does Synechococcus WH8102, but was isolated from Pacific waters. GOS sequences
from some of the Pacific samples recruit at high identity, while sequences from the Sargasso and hypersaline lagoon (bluish-purple) were recruited at
moderate identities.
(I) Synechococcus CC9902 is distantly related to either of the preceding Synechococcus strains. While this strain also recruits largely the same sequences
as the WH8102 and CC9902 strains, they recruit at significantly lower identity.
(J–O) Fragment recruitment plots to extreme assemblies seeded with phylogenetically informative sequences. Using this approach it is not only
possible to assemble contigs with strong similarities to known genomes but to identify contigs from previously uncultured genomes. In each case a
100-kb segment from an extreme assembly is shown. Each plot shows a distinct pattern of recruitment that distinguishes the panels from each other.
(J) Seeded from a Prochlorococcus marinus-related sequence, this contig recruits a broad swath of GOS sequences that correspond to the GOS
sequences that form the zigzag on P. marinus MIT9312 recruitment plots (see [B] or Poster S1 for comparison).
(K–L) Seeded from SAR11 clones, these contigs show significant synteny to the known P. ubique HTCC1062 genome. (K) is strikingly similar to previous
recruitment plots to the HTCC1062 genome (see [A] or Poster S1). In contrast, (L) identifies a different strain that recruits high-identity GOS sequences
primarily from the Sargasso Sea samples (red).
(M–O) These three panels show recruitment plots to contigs belonging to the uncultured Actinobacter, Roseobacter, and SAR86 lineages.
doi:10.1371/journal.pbio.0050077.g002

PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770405

Sorcerer II GOS Expedition


PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770406

Sorcerer II GOS Expedition


Identifying Genomic Structural Variation with
Metagenomic Data

Variation in genome structure in the form of rearrange-
ments, duplications, insertions, or deletions of stretches of
DNA can also be explored via fragment recruitment. The use
of mated sequencing reads (pairs of reads from opposite ends
of a clone insert) provides a powerful tool for assessing
structural differences between the reference and the environ-
mental sequences. The cloning and sequencing process
determines the orientation and approximate distance be-
tween two mated sequencing reads. Genomic structural
variation can be inferred when these are at odds with the
way in which the reads are recruited to a reference sequence.
Relative location and orientation of mated sequences provide
a form of metadata that can be used to color-code a fragment
recruitment plot (Figure 4). This makes it possible to visually
identify and classify structural differences and similarities
between the reference and the environmental sequences
(Figure 5). For the abundant marine microbes, a high

proportion of mated reads in the ‘‘good’’ category (i.e., in
the proper orientation and at the correct distance) show that
synteny is conserved for a large portion of the microbial
population. The strongest signals of structural differences
typically reflect a variant specific to the reference genome
and not found in the environmental data. In conjunction with
the requirement that reads be recruited over their entire
length without interruption, recruitment plots result in
pronounced recruitment gaps at locations where there is a
break in synteny. Other rearrangements can be partially
present or penetrant in the environmental data and thus may
not generate obvious recruitment gaps. However, given
sufficient coverage, breaks in synteny should be clearly
identifiable using the recruitment metadata based on the
presence of ‘‘missing’’ mates (i.e., the mated sequencing read
that was recruited but whose mate failed to recruit; Figure 4).
The ratio of missing mates to ‘‘good’’ mates determines how
penetrant the rearrangement is in the environmental
population.
In theory, all genome structure variations that are large

enough to prevent recruitment can be detected, and all such
rearrangements will be associated with missing mates.
Depending on the type of rearrangement present other
recruitment metadata categories will be present near the
rearrangements’ endpoints. This makes it possible to distin-
guish among insertions, deletions, translocations, inversions,
and inverted translocations directly from the recruitment
plots. Examples of the patterns associated with different
rearrangements are presented in Figure 5. This provides a
rapid and easy visual method for exploring structural
variation between natural populations and sequenced repre-
sentatives (Poster S1A and S1B).

Genomic Structural Variation in Abundant Marine
Microbes
Variation in genome structure potentially results in func-

tional differences. Of particular interest are those differences
between sequenced (reference) microbes and environmental
populations. These differences can indicate how representa-
tive a cultivated microbe might be and shed light on the
evolutionary forces driving change in microbial populations.
Fragment recruitment in conjunction with the mate metadata
helped us to identify both the consistent and the rare
structural differences between the genomes of microbial
populations in the GOS data and their closest sequenced
relatives. Our analysis has thus far been confined to the three
microbial genera that were widespread in the GOS dataset as
represented by the finished genomes of P. marinus MIT9312,
P. ubique HTCC1062, and to a lesser extent Synechococcus
WH8102. Each of these genomes is characterized by large and
small segments where little or no fragment recruitment took
place. We refer to these segments as ‘‘gaps.’’ These gaps

Figure 3. Population Structure and Variation as Revealed by Phylogeny

Phylogenies were produced using neighbor-joining. There is significant within-clade variation as well as an absence of strong geographic structure to
variants of SAR11 (P. ubique HTCC1062) and P. marinus MIT9312. Similar reads are not necessarily from similar locations, and reads from similar locations
are not necessarily similar.
(A) Geographic distribution of SAR11 proteorhodopsin variants. Keys to coloration: blue, Pacific; pink, Atlantic.
(B) Geographic distribution of Prochlorococcus variants. Keys to coloration: blue, Pacific; pink, Atlantic.
(C) Origins of spectral tuning of SAR11 proteorhodopsins. Reads are colored according to whether they contain the L (green) or Q (blue) variant at the
spectral tuning residue described in the text. The selection of tuning residue is lineage restricted, but each variant must have arisen on two separate
occasions.
doi:10.1371/journal.pbio.0050077.g003

Figure 4. Categories of Recruitment Metadata

The recruitment metadata distinguishes eight different general catego-
ries based on the relative placement of paired end sequencing reads
(mated reads) when recruited to a reference sequence in comparison to
their known orientation and separation on the clone from which they
were derived. Assuming orientation is correct, two mated reads can be
recruited closer together, further apart, or within expected distances
given the size of the clone from which the sequences were derived.
These sequences are categorized as ‘‘short,’’ ‘‘long,’’ or ‘‘good,’’
respectively. Alternately, the mated reads may be recruited in a mis-
oriented fashion, which trumps issues of separation. These reads can be
categorized as ‘‘normal,’’ ‘‘anti-normal,’’ or ‘‘outie.’’ In addition, there are
two other categories. ‘‘No mate’’ indicates that no mated read was
available for recruitment, possibly due to sequencing error. Perhaps most
useful of any of the recruitment categories, ‘‘missing’’ mates indicate
that while a mated sequence was available, it was not recruited to the
reference. ‘‘Missing’’ mates identify breaks in synteny between the
environmental data and the reference sequence.
doi:10.1371/journal.pbio.0050077.g004

PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770407

Sorcerer II GOS Expedition


represent reference-specific differences that are not found in
the environmental populations rather than a cloning bias
that identifies genes or gene segments that are toxic or
unclonable in E. coli. The presence of missing mates flanking
these gaps indicates that the associated clones do exist, and
therefore that cloning issues are not a viable explanation for
the absence of recruited reads. Although the reference-
specific differences are quite apparent due to the recruitment
gaps they generate, there are also sporadic rearrangements
associated with single clones, mostly resulting from small
insertions or deletions.

Careful examination of the unrecruited mates of the reads
flanking the gaps allowed us to identify, characterize, and
quantify specific differences between the reference genome
and their environmental relatives. The results of this analysis
for P. ubique and P. marinus have been summarized in Table 5.
With few exceptions, small gaps resulted from the insertion
or deletion of only a few genes. Many of the genes associated
with these small insertions and deletions have no annotated
function. In some cases the insertions display a degree of
variability such that different sets of genes are found at these
locations within a portion of the population. In contrast,
many of the larger gaps are extremely variable to the extent
that every clone contains a completely unrelated or highly
divergent sequence when compared to the reference or to
other clones associated with that gap. These segments are
hypervariable and change much more rapidly than would be
expected given the variation in the rest of the genome. Sites
containing a hypervariable segment nearly always contained
some insert. We identified two exceptions both associated
with P. ubique. The first is approximately located at the 166-kb
position in the P. ubique HTCC1062 genome. Though no large
gap is present, the mated reads indicate that under many
circumstances a highly variable insert is often present. The
second is a gap on HTCC1062 that appears between 50 and
90 kb. This gap appears to be less variable than other
hypervariable segments and is occasionally absent based on
the large numbers of flanking long mated reads (Poster S1A).

Interestingly, the long mated reads around this gap seem to
be disproportionately from the Sargasso Sea samples,
suggesting that this segment may be linked to geographic
and/or environmental factors. Thus, hypervariable segments
are highly variable even within the same sample, can on
occasion be unoccupied, and the variation, or lack thereof,
can be sample dependent.
Hypervariable segments have been seen previously in a

wide range of microbes, including P. marinus [28], but their
precise source and functional role, especially in an environ-
mental context, remains a matter of ongoing research. For
clues to these issues we examined the genes associated with
the missing mates flanking these segments and the nucleotide
composition of the gapped sequences in the reference
genomes. In some rare cases the genes identified on reads
that should have recruited within a hypervariable gap were
highly similar to known viral genes. For example, a viral
integrase was associated with the P. ubique HTCC1062
hypervariable gap between 516 and 561 kb. However, in the
majority of cases the genes associated with these gaps were
uncharacterized, either bearing no similarity to known genes
or resembling genes of unknown function. If these genes were
indeed acquired through horizontal transfer then we might
expect that they would have obvious compositional biases.
Oligonucleotide frequencies along the P. ubique HTCC1062
and Synechococcus WH8102 genomes are quite different in the
large recruitment gaps in comparison to the well-represented
portions of the genome (Poster S1). Surprisingly, this was less
true for P. marinus MIT9312, where the gaps have been linked
to phage activity [28]. These results suggest that these
hypervariable segments of the genome are widespread among
marine microbial populations, and that they are the product
of horizontal transfer events perhaps mediated by phage or
transposable elements. These results are consistent with and
expand upon the hypothesis put forward by Coleman et al.
[28] suggesting that these segments are phage mediated, and
conflicts with initial claims that the HTCC1062 genome was
devoid of genes acquired by horizontal transfer [29].

Figure 5. Fragment Recruitment at Sites of Rearrangements

Environmental sequences recruited near breaks in synteny have characteristic patterns of recruitment metadata. Indeed, each of five basic
rearrangements (i.e., insertion, deletion, translocation, inversion, and inverted translocation) produced a distinct pattern when examining the
recruitment metadata. Here, example recruitment plots for each type of rearrangement have been artificially generated. The ‘‘good’’ and ‘‘no mate’’
categories have been suppressed. In each case, breaks in synteny are marked by the presence of stacks of ‘‘missing’’ mate reads. The presence or
absence of other categories distinguishes each type of rearrangement from the others.
doi:10.1371/journal.pbio.0050077.g005

PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770408

Sorcerer II GOS Expedition


Table 5. Atypical Segments in P. marinus MIT9312 and P. ubique HTCC1062 (SAR11)

Reference Genome Begina Endb Size, bp Type of Variantc Description

MIT9312 36,401 38,311 2,132 Variable deletion 12 out of 66 clones support simple deletion. Remaining clones show considerable

sequence variation amongst themselves.

MIT9312 124,448 125,219 771 Variable insertion Associated with ASN tRNA gene; in the environment, half the reads identify pair

of small inserts with no similarity to known genes; the other half point to small

and large inserts of undetermined nature.

MIT9312 233,826 233,910 84 Insertion All clones support small insert (270 bp) with no clear sequence similarity to

known genes or sequences.

MIT9312 243,296 245,115 424 Variable deletion 24 of 42 clones support simple deletion of hypothetical protein. 13 support

slightly larger deletion. Remaining not clearly resolved.

MIT9312 296,818 300,888 4,344 Variable deletion 44 clones support deletion of 3,070 bp segment containing 4 genes (3 hypotheti-

cal; 1 carbamoyltransferase). 17 clones support alternative sequences with little or

no similarity to each other.

gMIT9312 342,404 342,662 326 Variable insert In environment is 93% chance of finding deoxyribodopyrimiden photolyase with

7% chance of finding an ABC type Fe3þ siderophore transport system permease

component.

MIT9312 345,933 365,351 19,418 Hypervariable Very limited similarity among clones indicates that this is a hypervariable seg-

ment. Note that it is closely associated with a site-specific integrase/recombinase.

MIT9312 551,347 552,025 678 Deletion Two small deletions within a hypothetical protein.

MIT9312 617,914 621,556 3,642 Hypervariable About 50% of the clones support small deletions among several hypothetical

proteins. Remaining support significant variability suggesting hypervariable seg-

ment. At least two of the missed mates contain integrase-like genes.

MIT9312 646,340 652,375 6,035 Deletion/Hypervariable Majority of clones support simple deletion of eight hypothetical and hypothetical

genes. Small number of clones indicate this may by hypervariable as well.

MIT9312 655,241 655,800 559 Deletion Small deletion between hypothetical proteins.

MIT9312 665,000 678000 12,000 Deletions Complex set of deletions and replacements that vary with geographic location.

MIT9312 665,824 666,380 556 Insertion Small inserted hypothetical protein.

MIT9312 670,747 671,933 1,186 Deletion Deletes hypothetical gene.

MIT9312 736,266 736,289 23 Insertion All clones support insertion of fructose-bisphosphate aldolase and fructose-1,6-bi-

sphosphate aldolase.

MIT9312 762,156 762,717 561 Deletion Small deletion between hypothetical proteins.

MIT9312 779,006 779,309 303 Insertion Small hypothetical protein inserted.

MIT9312 874,349 874,913 564 Insertion Small insertion including gene with similarity to RNA-dependent RNA-polymerase.

MIT9312 943,389 946,997 3,608 Variable Several small changes.

MIT9312 1,043,129 1,131,874 88,745 Hypervariable

MIT9312 1,140,922 1,141,412 490 Insertion Small insertion.

MIT9312 1,144,307 1,144,790 483 Insertion Small insertion of several genes.

MIT9312 1,155,123 1,156,440 1,317 Deletion Deletes high-light–inducible protein.

MIT9312 1,172,609 1,177,292 4,683 Variable Several genes have been replaced or deleted.

MIT9312 1,202,643 1,274,335 71,692 Hypervariable

MIT9312 1,288,481 1,290,367 1,886 Variable deletion Deletes polysaccharide export-related periplasmic protein (28 out of 55). Other

deletions are variable and may include replacement with alternate sequences.

MIT9312 1,323,606 1,324,523 917 Deletion Environmental sequences lack a small hypothetical protein.

MIT9312 1,369,637 1,369,996 359 Insertion NAD-dependent DNA ligase absent from MIT9312; has possible paralog

MIT9312 1,381,273 1,382,049 776 Deletion Small insert in MIT9312 not present in environment

MIT9312 1,384,664 1,385,110 446 Deletion Deletes delta(12)-fatty acid dehydrogenase and replaces gene with small (;100

bp) sequence. There is some variation in the exact location and replacement se-

quence.

MIT9312 1,388,430 1,389,718 1,288 Replacement Segment between two high-light–inducible proteins swapped for different se-

quence with no similarity.

MIT9312 1,392,865 1,392,976 111 Replacement Small replacement deletes hypothetical gene and replaces it with small, unknown

sequence. There is some variation in the precise boundaries of the deletion and

in the replacement sequences.

MIT9312 1,397,696 1,420,005 22,309 Hypervariable

MIT9312 1,486,145 1,487,971 231 Variable Small insertion of dolichyl-phosphate-mannose-protein mannosyltransferase; alter-

nately deletes glycosyl transferase (8 out of 56).

MIT9312 1,519,810 1,520,860 1,050 Variable deletion Deletes a single hypothetical protein; about half of the deletions contain variable

sequences of unknown origin.

MIT9312 1,568,049 1,569,121 1,072 Replacement Typically 928-bp portion of MIT9312 replaced by 175-bp stretch in environment;

some small amount of variation in environmental replacement sequence (11 out

of 51).

HTCC1062 50,555 93,942 43,387 Variable deletion Low recruitment segment containing many hypothetical, transporter, and secre-

tion genes.

HTCC1062 146,074 146,415 341 Deletion Often deleted segment containing DoxD-like and ferredoxin dependent gluta-

mate synthase peptide.

HTCC1062 166,600 166,700 100 Variable replacement GOS sequences indicate that variable blocks of genes are frequently inserted

here.

HTCC1062 308,720 309,633 913 Deletion Potential sulfotransferase domain deleted.

HTCC1062 339,545 339,951 406 Deletion Deletes a predicted O-linked N-acetylglucosamine transferase.

HTCC1062 385,348 386,224 876 Deletion SAM-dependent methyltransferase deleted.

PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770409

Sorcerer II GOS Expedition


Though insertions and deletions accounted for many of the
obvious regions of structural variation, we also looked for
rearrangements. The high levels of local synteny associated
with P. ubique and P. marinus suggested that large-scale
rearrangements were rare in these populations. To inves-
tigate this hypothesis we used the recruitment data to
examine how frequently rearrangements besides insertions
and deletions could be identified. We looked for rearrange-
ments consisting of large (greater than 50 kb) inversions and
translocations associated with P. marinus; however, we did not
identify any such rearrangements that consistently distin-
guished environmental populations from sequenced cultivars.
Rare inversions and translocations were identified in the
dominant subtype associated with MIT9312 (Table 6). Based

on the amount of sequence that contributed to the analysis,
we estimate that one inversion or translocation will be
observed for every 2.6 Mbp of sequence examined (less than
once per P. marinus genome).
A further observation concerns the uniformity along a

genome of the evolutionary history among and within
subtypes. For instance, the similarity between GOS reads
and P. marinus MIT9312 is typically 85%–95%, while the
similarity between MIT9312 and P. marinus MED4 is generally
;10% lower. However, there are several instances where the
divergence of MIT9312 and MED4 abruptly decreases to no
more than that between the GOS sequences and MIT9312
(Poster S1G). These results are consistent either with
horizontal transfer (recombination) or with inhomogeneous

Table 6. Six Large-Scale Translocations and Inversions Were Identified in the Abundant P. marinus Subtype

Group Low

Genome

Begin

Low

Genome

End

High

Genome

Begin

High

Genome

End

Read ID Low

Read

Begin

Low

Read

Breakpoint

Low

Read

Strand

High

Read

Breakpoint

High

Read

End

High

Read

Strand

Read

Length

Inversion Sample

1 34,428 34,904 1,536,417 1,536,774 1092255385627 0 476 1 477 834 1 834 No 15

2 607,778 608,467 1,131,368 1,131,621 1093017685727 270 959 0 0 251 1 959 Yes 18

3 618,997 619,375 1,172,217 1,172,728 1092963065572 19 397 1 425 939 1 939 No 17

3 618,997 619,372 1,172,203 1,172,728 1095433012642 0 375 1 403 931 1 941 No 26

3 618,997 619,251 1,172,216 1,172,728 1091140913752 2 256 1 284 799 1 799 No 19

3 618,997 619,331 1,172,280 1,172,728 1641121 2 337 1 365 816 1 816 No 00d

3 619,007 619,375 1,172,223 1,172,728 1092963490951 19 387 1 425 933 1 933 No 17

4 652,933 653,483 1,369,774 1,370,077 200560 325 875 0 0 303 1 875 Yes 00a

4 652,979 653,350 1,369,774 1,370,200 1492647 0 371 1 439 865 0 867 Yes 00d

4 652,979 653,496 1,369,774 1,370,086 1092256128910 10 527 1 595 905 0 906 Yes 15

4 652,979 653,496 1,369,774 1,370,061 1092405979387 14 531 1 599 886 0 886 Yes 25

4 652,979 653,353 1,369,774 1,370,132 1093017637883 2 376 1 444 802 0 802 Yes 18

4 652,980 653,496 1,369,774 1,370,111 1092343389654 6 522 1 591 928 0 928 Yes 25

5 1,049,484 1,049,793 1,172,301 1,172,782 1400802 518 827 0 1 483 0 827 No 00d

6 1,219,993 1,220,519 1,485,834 1,486,222 1092256207885 2 528 1 532 919 1 919 No 17

6 1,219,993 1,220,496 1,485,925 1,486,222 380485 300 803 0 0 296 0 803 No 00a

6 1,219,993 1,220,477 1,485,782 1,486,204 1484583 0 484 1 506 927 1 927 No 00d

6 1,220,005 1,220,388 1,485,733 1,486,216 1682914 507 890 0 2 484 0 913 No 00d

doi:10.1371/journal.pbio.0050077.t006

Table 5. Continued.

Reference Genome Begina Endb Size, bp Type of Variantc Description

HTCC1062 441,074 441,152 78 Deletion Deletes possible methyltransferase FkbM.

HTCC1062 516,041 561,604 45,563 Variable replacement Several high identity ‘‘missed’’ mates match phage genes, including an integrase.

HTCC1062 660,413 660,978 565 Deletion Deletes hypothetical protein.

HTCC1062 675,141 676,399 1,258 Deletion Deletes hypothetical protein.

HTCC1062 766,022 768,263 2,241 Deletion Deletes four hypothetical proteins.

HTCC1062 814,015 816,386 2,371 Deletion Deletes steroid monoxygenase and short-chain dehydrogenase.

HTCC1062 893,450 922,604 29,154 Deletion Deletes large segment including a 7317-aa hypothetical gene.

HTCC1062 941,203 942,403 1,200 Deletion Deletes hypothetical protein.

HTCC1062 991,461 997,861 6,400 Deletion Deletes several hypotheticals and mix of other genes; adjacent to recombinase.

HTCC1062 1,117,126 1,143,483 26,357 Deletion Large number of hypothetical and transporters that are deleted.

HTCC1062 1,160,908 1,166,589 5,681 Deletion Deletes a small cluster of peptides with various functions.

HTCC1062 1,188,887 1,189,231 344 Deletion Deletes portion of winged helix DNA-binding protein but inserts sequences with

similarity to gij71082757 sodium bile symporter family protein found in large gap

between 50555 and 93942.

aBegin indicates the approximate bp position which marks the beginning of the gap in recruitment.
bEnd indicates the approximate bp position which marks the ending of the gap in recruitment.
cThe type of change indicates what would have to happen to the reference genome to produce the sequences seen in the environment (e.g., a deletion indicates that the indicated
portion of the reference would have to be deleted to generate the variant(s) seen in the environment).
doi:10.1371/journal.pbio.0050077.t005

PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770410

Sorcerer II GOS Expedition


selectional pressures. Similar patterns are present in the two
high-identity subtypes seen on the P. ubique HTCC1062
genome (Poster S1D). Other regions show local increases in
similarity between MIT9312 and the dominant subtype that
are not reflected in the MIT9312/MED4 divergence (e.g., near
positions 50 kb, 288 kb, 730 kb, 850 kb, and 954 kb on
MIT9312; also see Poster S1G). These latter regions might
reflect either regions of homogenizing recombination or
regions of higher levels of purifying selection. However, the
lengths of the intervals (several are 10 kb or more) are longer
than any single gene and correspond to genes that are not
extremely conserved over greater taxonomic distances (in
contrast to the ribosomal RNA operon). Equally, if widespread
horizontal transfer of an advantageous segment explains these
intervals, the transfers occurred long enough ago for
appreciable variation to accumulate (unpublished data).

Extreme Assembly of Uncultivated Populations
The analyses described above have been confined to those

organisms with representatives in culture and for which
genomes were readily available. Producing assemblies for
other abundant but uncultivated microbial genera would
provide valuable physiological and biochemical information
that could eventually lead to the cultivation of these
organisms, help elucidate their role in the marine commun-
ity, and allow similar analyses of their evolution and variation
such as those performed on sequenced organisms. Previous
assembly efforts and the fragment recruitments plots showed
that there is considerable and in many cases conflicting
variation among related organisms. Such variation is known
to disrupt whole-genome assemblers. This led us to try an
assembly approach that aggressively resolves conflicts. We call
this approach ‘‘extreme assembly’’ (see Materials and Meth-
ods). This approach currently does not make use of mate-
pairing data and, therefore produces only contigs, not
scaffolded sequences. Using this approach, contigs as large
as 900 kb could be aligned almost in their entirety to the P.
marinus MIT9312 and P. ubique HTCC1062 genomes (Figure
2J–2L). Consistent patterns of fragment recruitment (see
below) generally provided evidence of the correctness of
contigs belonging to otherwise-unsequenced organisms.
Accordingly, large contigs from these alternate assemblies
were used to investigate genetic and geographic population
structure, as described below. However, the more aggressive
assemblies demonstrably suffered from higher rates of
assembly artifacts, including chimerism and false consensus
sequences (Figure 6). Thus, the more stringent primary
assembly was employed for most assembly-based analyses, as
manual curation was not practical.
As just noted, many of the large contigs produced by the

more aggressive assembly methods described above did not
align to any great degree with known genomes. Some could
be tentatively classified based on contained 16S sequences,
but the potential for computationally generated chimerism
within the rRNA operon is sufficiently high that inspection of
the assembly or other means of confirming such classifica-
tions is essential. An alternative to an unguided assembly that
facilitates the association of assemblies with known organisms
is to start from seed fragments that can be identified as
belonging to a particular taxonomic group. We employed
fragments outside the ribosomal RNA operon that were
mated to a 16S-containing read, limiting extension to the
direction away from the 16S operon. This produced contigs
of 100 kb or more for several of the ribotypes that were
abundant in the GOS dataset. When evaluated via fragment
recruitment (Figure 2M–2O), these assemblies revealed
patterns analogous to those seen for the sequenced genomes
described above: multiple subtypes could be distinguished
along the assembly, differing in similarity to the reference
sequence and sample distribution, with occasional gaps.
Hypervariable segments by definition were not represented
in these assemblies, but they may help explain the termi-
nation of the extreme assemblies for P. marinus and SAR11
and provide a plausible explanation for termination of
assemblies of the other deeply sampled populations as well.
This directed approach to assembly can also be used to

investigate variation within a group of related organisms (e.g.,
a 16S ribotype). We explored the potential to assemble

Figure 6. Examples of Chimeric Extreme Assemblies

(A) Fragment recruitment to an extreme assembly contig indicates the
assembly is chimeric between two organisms, based on dramatic shifts in
density of recruitment, level of conservation, and sample distribution.
(B) Fragment recruitment to a SAR11-related extreme assembly. Changes
in color, density, and vertical location toward the top of the figure
indicate transitions among multiple subtypes of SAR11.
doi:10.1371/journal.pbio.0050077.g006

PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770411

Sorcerer II GOS Expedition


distinct subtypes of SAR11 by repeatedly seeding extreme
assembly with fragments mated to a SAR11-like 16S sequence.
Figure 7 compares the first 20 kb from each of 24
independent assemblies. Eighteen of these segments could
be aligned full-length to a portion of the HTCC1062 genome
just upstream of 16S, while six appeared to reflect rearrange-
ments relative to HTCC1062. The rearranged segments were
associated with more divergent 16S sequences (8%–14%
diverged from the 16S of HTCC1062), while those without
rearrangements corresponded to less divergent 16S (averag-
ing less than 3% different from HTCC1062). In each segment,
many reads were recruited above 90% identity, but different
samples dominated different assemblies. Phylogenetic trees
support the inference of evolutionarily distinct subtypes with
distinctive sample distributions (Figure 8).

Taxonomic Diversity
Environmental surveys provide a cultivation-independent

means to examine the diversity and complexity of an
environmental sample and serve as a basis to compare the
populations between different samples. Typically, these

surveys use PCR to amplify ubiquitous but slowly evolving
genes such as the 16S rRNA or recA genes. These in turn can
be used to distinguish microbial populations. Since PCR can
introduce various biases, we identified 16S genes directly
from the primary GOS assembly. In total, 4,125 distinct full-
length or partial 16S were identified. Clustering of these
sequences at 97% identity gave a total of 811 distinct
ribotypes. Nearly half (48%) of the GOS ribotypes and 88%
of the GOS 16S sequences were assigned to ribotypes
previously deposited in public databases. That is, more than
half the ribotypes in the GOS dataset were found to be novel
at what is typically considered the species level [30]. The
overall taxonomic distribution of the GOS ribotypes sampled
by shotgun sequencing is consistent with previously published
PCR based studies of marine environments (Table 7) [31]. A
smaller amount (16%) of GOS ribotypes and 3.4% of the GOS
16S sequences diverged by more than 10% from any publicly
available 16S sequence, thus being novel to at least the family
level.
A census of microbial ribotypes allows us to identify the

abundant microbial lineages and estimate their contribution

Figure 7. Fragment Recruitment Plots to 20-kb Segments of SAR11-Like Contigs Show That Many SAR11 Subtypes, with Distinct Distributions, Can Be

Separated by Extreme Assembly

Each segment is constructed of a unique set of GOS sequencing reads (i.e., no read was used in more than one segment). Segments are arbitrarily
labeled (A–X) for reference in Figure 8.
doi:10.1371/journal.pbio.0050077.g007

PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770412

Sorcerer II GOS Expedition


Figure 8. Phylogeny of GOS Reads Aligning to P. ubique HTCC1062 Upstream of 16S Gene Indicates That the Extreme Assemblies in Figure 7

Correspond to Monophyletic Subtypes

Coloring of branches indicates that the corresponding reads align at .90% identity to the extreme assembly segments shown in Figure 7; colored
labels (A–X) correspond to the labels in Figure 7, indicating the segment or segments to which reads aligned.
doi:10.1371/journal.pbio.0050077.g008

PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770413

Sorcerer II GOS Expedition


to the GOS dataset. Of the 811 ribotypes, 60 contain more
than 8-fold coverage of the 16S gene (Table 8); jointly, these 60
ribotypes accounted for 73% of all the 16S sequence data. All
but one of the 60 have been detected previously, yet only a few
are represented by close relatives with complete or nearly
complete genome sequencing projects (see Fragment Recruit-
ment for further details). Several other abundant 16S
sequences belong to well-known environmental ribotypes that
do not have cultivated representatives (e.g., SAR86, Roseobacter
NAC-1–2, and branches of SAR11 other than those containing
P. ubique). Interestingly, archaea are nearly absent from the list
of dominant organisms in these near-surface samples.

The distribution of these ribotypes reveals distinct micro-
bial communities (Figure 9 and Table 8). Only a handful of
the ribotypes appear to be ubiquitously abundant; these are
dominated by relatives of SAR11 and SAR86. Many of the
ribotypes that are dominant in one or more samples appear
to reside in one of three separable marine surface habitats.
For example, several SAR11, SAR86, and alpha Proteobacteria,
as well as an Acidimicrobidae group, are widespread in the
surface waters, while a second niche delineated by tropical
samples contains several different SAR86, Synechococcus and
Prochlorococcus (both cyanobacterial groups), and a Rhodospir-
illaceae group. Other ribotypes related to Roseobacter RCA,
SAR11, and gamma Proteobacteria are abundant in the
temperate samples but were not observed in the tropical or
Sargasso samples. Not surprisingly, samples taken from
nonmarine environments (GS33, GS20, GS32), estuaries
(GS11, GS12), and larger-sized fraction filters (GS01a,
GS01b, GS25) have distinguishing ribotypes. Furthermore,
as the complete genomes of these dominant members are
obtained, the capabilities responsible for their abundances
may well lend insight into the community metabolism in
various oceanic niches.

Sample Comparisons
The most common approach for comparing the microbial

community composition across samples has been to examine
the ribotypes present as indicated by 16S rRNA genes or by
analyzing the less-conserved ITS located between the 16S and

23S gene sequences [7,8,16,17]. However, a clear observation
emerging from the fragment recruitment views was that the
reference ribotypes recruit multiple subtypes, and that these
subtypes were distributed unequally among samples (Figures
2, 7, 8; Poster S1D, S1F, and S1I).
We developed a method to assess the genetic similarity

between two samples that potentially makes use of all
portions of a genome, not just the 16S rRNA region. This
similarity measure is assembly independent; under certain
circumstances, it is equivalent to an estimate of the fraction
of sequence from one sample that could be considered to be
in the other sample. Whole-metagenomic similarities were
computed for all pairs of samples. Results are presented for
comparisons at �98% and 90% identity. No universal cutoff
consistently divides sequences into natural subsets, but the
98% identity cutoff provides a relatively high degree of
resolution, while the 90% cutoff appears to be a reasonable
heuristic for defining subtypes. For instance, a 90% cutoff
treats most of the reads specifically recruited to P. marinus
MIT9312 as similar (those more similar to MED4 notably
excepted), while reasonably separating clades of SAR11
(Figures 7 and 8). Reads with no qualifying overlap alignment
to any other read in a pair of samples are uninformative for
this analysis, as they correspond to lineages that were so
lightly sequenced that their presence in one sample and
absence in another may be a matter of chance. For the 90%
cutoff, 38% of the sequence reads contributed to the analysis.
The resulting similarities reveal clear and consistent group-
ings of samples, as well as the outlier status of certain samples
(Figures 10 and 11).
The broadest contrast was between samples that could be

loosely labeled ‘‘tropical’’ (including samples from the
Sargasso Sea [GS00b, GS00c, GS00d] and samples that are
temperate by the formal definition but under the influence of
the Gulf Stream [GS14, GS15]) and ‘‘temperate.’’ Further
subgroups can be identified within each of these categories, as
indicated in Figures 10 and 11. In some cases, these groupings
were composed of samples taken from different ocean basins
during different legs of the expedition. A few pairs of samples
with strikingly high similarity were observed, including GS17
and GS18, GS23 and GS26, GS27 and GS28, and GS00b and
GS00d. In each case, these pairs of samples were collected
from consecutive or nearly consecutive samples. However,
the same could be said of many other pairs of samples that do
not show this same degree of similarity. Indeed, geographi-
cally and temporally separated samples taken in the Atlantic
(GS17, GS18) and Pacific (GS23, GS26) during separate legs of
the expedition are more similar to one another than were
most pairs of consecutive samples. The samples with least
similarity to any other sample were from unique habitats.
Thus, similarity cannot be attributed to geographic separa-
tion alone.
The groupings described above can be reconstructed from

taxonomically distinct subsets of the data. Specifically, the
major groups of samples visible in Figure 10 were reproduced
when sample similarities were determined based only on
fragments recruiting to P. ubique HTCC1062 (unpublished
data). Likewise, the same groupings were observed when the
fragments recruiting to either HTCC1062 or P. marinus
MIT9312, or both, were excluded from the calculations
(unpublished data). Thus, the factors influencing sample
similarities do not appear to rely solely on the most abundant

Table 7. Taxonomic Makeup of GOS Samples Based on 16S Data
from Shotgun Sequencing

Phylum or Class Fractiona

Alpha Proteobacteria 0.32

Unclassified Proteobacteria 0.155

Gamma Proteobacteria 0.132

Bacteroidetes 0.13

Cyanobacteria 0.079

Firmicutes 0.075

Actinobacteria 0.046

Marine Group A 0.022

Beta Proteobacteria 0.017

OP11 0.008

Unclassified Bacteria 0.008

Delta Proteobacteria 0.005

Planctomycetes 0.002

Epsilon Proteobacteria 0.001

aValues shown are averages over all samples.
doi:10.1371/journal.pbio.0050077.t007

PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770414

Sorcerer II GOS Expedition


Table 8. Most Abundant Ribotypes (97% Identity Clusters)

Ribotype Classificationa Depth of Coverageb Range Number of Matching GenBank Entriesc

SAR11 Surface 1 581 Widespread 100þ
SAR11 Surface 2 182 Sargasso and GS31d 100þ
Burkholderia 139 00a 100þ
Acidomicrobidae type a 133 Tropical and Sargassod 77

Prochlorococcus 112 Tropical and Sargasso 76

SAR11 Surface 3 109 Widespread 63

SAR86-like type a 108 Widespread 88

Shewanella 80 00a 49

Synechococcus 59 Tropicald 100þ
Rhodospirillaceae 50 Tropical and Sargassod 15

SAR86-like type b 47 Hypersaline pondd 24

SAR86-like type c 47 Tropical and Sargasso 75

Chlorobi-like 40 Hypersaline pond 1

Alpha Proteobacteria type a 38 Widespread 10

Roseobacter type a 35 Tropical and Sargassod 42

Cellulomonadaceae type a 34 Hypersaline pond 12

SAR86-like type d 28 Widespread 30

Alpha Proteobacteria type b 27 Widespread 43

SAR86-like type e 26 Widespread 71

Cytophaga type a 24 Tropical and Sargasso 21

Bacteroidetes type a 23 Widespread 17

Bdellovibrionales type a 21 Tropical and Sargasso 36

Acidomicrobidae type b 21 Temperate 41

SAR116-like 21 Widespread 28

Marine Group A type a 20 Tropical and Sargassod 9

Remotely SAR11-like type a 19 Sargasso and tropical 12

Frankineae type a 19 Fresh and estuary 6

Frankineae type b 18 Hypersaline pondd 71

SAR86-like type f 18 Tropical 15

SAR86-like type g 17 Tropical 13

Remotely SAR11-like type b 16 Tropical and Sargasso 4

Gamma Proteobacteria type a 16 Sargasso and GS14 18

Microbacteriaceae 15 Hypersaline pond 1

SAR102/122-like 15 Tropical and Sargasso 15

SAR86-like type h 15 Tropical and Sargasso 27

Bacteroidetes type b 14 Tropicald 14

Remotely SAR11-like type c 14 Tropical and Sargasso 18

SAR86-like type i 14 Sargassod 13

Rhodobium-like type a 14 Sargasso and GS31d 9

Marine Group A type b 13 Tropical and Sargassod 28

Gamma Proteobacteria type b 12 S. temperate and GS31 22

Oceanospirillaceae 12 Mangrove 1

Gamma Proteobacteria type c 12 Widespread 14

SAR11-like type a 12 Temperate 17

SAR11-like type b 11 Fresh and estuary 11

SAR11-like type c 11 Sargasso and GS31d 12

SAR86-like type j 11 Tropical and Sargasso 1

Roseobacter Algicola 11 Widespread 20

Remotely SAR102/122-like 11 Hypersaline pond 0

Frankineae type c 10 Fresh and estuary 11

Rhodobium-like type b 10 Widespread 9

Roseobacter RCA 10 Temperate 39

Remotely SAR11-like type d 10 Tropical 32

Acidobacteria 9 Fresh 4

Remotely SAR11-like type e 9 Tropical and Sargasso 8

Frankineae type d 9 Estuary and fresh 89

SAR11-like type d 9 Sargasso and GS31 4

Methylophilus 9 Temperated 41

Archaea C1 C1a 8 Sargasso and GS31 2

Cytophaga type b 8 Widespread 9

aTaxonomic classifications based on Hugenholtz ARB database. Labels indicate the most specific taxonomic assignment that could be confidently assigned to each ribotype. ‘‘Type a,’’
‘‘type b,’’ etc., used to arbitrarily discriminate separate 97% ribotypes that would otherwise be given the same name.
bNote that the 16S rRNA gene can be multicopy.
cMatching GenBank entries required full-length matches at � 98% identity.
dLess than 13 coverage outside described range.
doi:10.1371/journal.pbio.0050077.t008

PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770415

Sorcerer II GOS Expedition


organisms but rather are reflected in multiple microbial
lineages.

It is tempting to view the groups of similar samples as
constituting community types. Sample similarities based on
genomic sequences correlated significantly with differences
in the environmental parameters (Table 1), particularly water
temperature and salinity (unpublished data). Samples that are
very similar to each other had relatively small differences in
temperature and salinity. However, not all samples that had
similar temperature and salinity had high community
similarities. Water depth, primary productivity, fresh water
input, proximity to land, and filter size appeared consistent

with the observed groupings. Other factors such as nutrients
and light for phototrophs and fixed carbon/energy for
chemotrophs may ultimately prove better predictors, but
these results demonstrate the potential of using metagenomic
data to tease out such relationships.
Examining the groupings in Figure 11 in light of habitat

and physical characteristics, the following may be observed.
The first two samples, a hypersaline pond in the Galapagos
Islands (GS33) and the freshwater Lake Gatun in the Panama
Canal (GS20) are quite distinct from the rest. Salinity—both
higher and lower than the remaining coastal and ocean
samples—is the simplest explanation.

Figure 9. Presence and Abundance of Dominant Ribotypes

The relative abundance of various ribotypes (rows) in each filter (columns) is represented by the area of the corresponding spot (if any). The listed
ribotypes each satisfied the following criteria in at least one filter: the ribotype was among the five most abundant ribotypes detected in the shotgun
data, and was represented by at least three sequencing reads. Relative abundance is based on the total number of 16S sequences in a given filter. Order
and grouping of filters is based on the clustering of genomic similarity shown in Figure 11. Ribotype order was determined based on similarity of
sample distribution. A marked contrast between temperate and tropical groups is visible. Estuarine samples GS11 and GS12 contained a mix of
ribotypes seen in freshwater and temperate marine samples, while samples from nonmarine habitats or larger filter sizes were pronounced outliers. The
presence of large amounts of Burkholderia and Shewanella in one Sargasso Sea sample (GS00a) makes this sample look much less like other Sargasso
and tropical marine samples than it otherwise would. Note that 16S is not a measure of cell abundance since 16S genes can be multicopy.
doi:10.1371/journal.pbio.0050077.g009

PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770416

Sorcerer II GOS Expedition


Twelve samples form a strong temperate cluster as seen in
the similarity matrix of Figure 11 as a darker square bounded
by GS06 and GS12. Embedded within the temperate cluster
are three subclusters. The first subcluster includes five

samples from Nova Scotia through the Gulf of Maine. This
is followed by a subcluster of four samples between Rhode
Island and North Carolina. The northern subcluster was
sampled in August, the southern subcluster in November and

Figure 10. Similarity between Samples in Terms of Shared Genomic Content

Genomic similarity, as described in the text, is an estimate of the amount of the genetic material in two filters that is ‘‘the same’’ at a given percent
identity cutoff—not the amount of sequence in common in a finite dataset, but rather in the total set of organisms present on each filter. Similarities are
shown for 98% identity.
(A) Hierarchical clustering of samples based on pairwise similarities.
(B) Pairwise similarities between samples, represented as a symmetric matrix of grayscale intensities; a darker cell in the matrix indicates greater
similarity between the samples corresponding to the row and column, with row and column ordering as in (A). Groupings of similar filters appear as
subtrees in (A) and as squares consisting of two or more adjacent rows and columns with darker shading. Colored bars highlight groups of samples
described in the text; labels are approximate characterizations rather than being strictly true of every sample in a group.
doi:10.1371/journal.pbio.0050077.g010

PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770417

Sorcerer II GOS Expedition


December. Though all samples were collected in the top few
meters, the southern samples were in shallower waters, 10 to
30 m deep, whereas most of the northern samples were in
waters greater than 100 m deep. Monthly average estimates of
chlorophyll a concentrations were typically higher in the
southern samples as well (Table 1). All of these factors—
temperature, system primary production, and depth of the
sampled water body—likely contribute to the differences in
microbial community composition that result in the two well-
defined clusters. The final temperate subgroup includes two
estuaries, Chesapeake Bay (GS12) and Delaware Bay (GS11),
distinguished by their lower salinity and higher productivity.
However, GS11 is markedly similar not only to GS12 but also
to coastal samples, whereas the latter appears much more
unique. Interestingly, the Bay of Fundy estuary sample (GS06)
clearly did not group with the two other estuaries, but rather
with the northern subgroup, perhaps reflecting differences in
the rate or degree of mixing at the sampling site.

Continuing to the right and downward in Figure 11, one
can see a large cluster of 25 samples from the tropics and
Sargasso Sea, bounded by GS47 and GS00b. This can be
further subdivided into several subclusters. The first sub-
cluster (a square bounded by GS47 and GS14) includes 14
samples, about half of which were from the Galapagos. The
second distinct subcluster (a square bounded by GS16 and
GS26) includes seven samples from Key West, Florida, in the
Atlantic Ocean to a sample close to the Galapagos Islands in
the Pacific Ocean. Loosely associated with this subcluster is a
sample from a larger filter size taken en route to the
Galapagos (GS25). The remaining samples group weakly with
the tropical cluster. GS32 was taken in a coastal mangrove in
the Galapagos. The thick organic sediment at a depth of less
than a meter is the likely cause for it being unlike the other
samples. Sample 00a was from the Sargasso Sea and contained
a large fraction of sequence reads from apparently clonal
Burkholderia and Shewanella species that are atypical. When this

Figure 11. Sample Similarity at 90% Identity

Similarity between samples in terms of shared genomic content similar to Figure 10, except that the plots were done using a 90% identity cutoff that
has proven reasonable for separating some moderately diverged subtypes
doi:10.1371/journal.pbio.0050077.g011

PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770418

Sorcerer II GOS Expedition


Table 9. Relative Abundance of TIGRFAMs Associated with a Specific Sample

TIGRFAM Number of

Peptides

Samplea Relative

Abundanceb
Major

Category

Minor

Category

Description

TIGR01526 131 GS01a 3.4 Nicotinamide-nucleotide adenylyltransferase

TIGR00661 214 GS01a 2.6 Hypothetical Conserved Conserved hypothetical protein

TIGR01833 135 GS01a 2 Hydroxymethylglutaryl-CoA synthase

TIGR01408 267 GS01b 4.5 Ubiquitin-activating enzyme E1

TIGR01678 144 GS01b 2.6 Sugar 1,4-lactone oxidases

TIGR01879 758 GS01b 2.5 Amidase, hydantoinase/carbamoylase family

TIGR00890 131 GS01b 2.4 Transport Carbohydrates, Oxalate/Formate Antiporter

TIGR01767 112 GS01b 2.4 5-methylthioribose kinase

TIGR00101 495 GS01b 2.3 Central Nitrogen Urease accessory protein UreG

TIGR01659 186 GS01b 2.2 Sex-lethal family splicing factor

TIGR00313 455 GS01b 2.1 Biosynthesis Heme, Cobyric acid synthase CobQ

TIGR00317 306 GS01b 2.1 Biosynthesis Heme, Cobalamin 59-phosphate synthase

TIGR00601 186 GS01b 2.1 DNA DNA UV excision repair protein Rad23

TIGR01792 904 GS01b 2.1 Central Nitrogen Urease, alpha subunit

TIGR00749 483 GS01b 2 Energy Glycolysis/gluconeogenesis Glucokinase

TIGR01001 485 GS01b 2 Amino Aspartate Homoserine O-succinyltransferase

TIGR02238 165 GS32 5.1 Meiotic recombinase Dmc1

TIGR02239 159 GS32 3.5 DNA repair protein RAD51

TIGR02232 140 GS32 2.6 Myxococcus cysteine-rich repeat

TIGR02153 212 GS32 2.5 Protein tRNA Glutamyl-tRNA(Gln) amidotransferase, subunit D

TIGR00519 289 GS32 2.4 L-asparaginases, type I

TIGR00288 248 GS32 2 Hypothetical Conserved Conserved hypothetical protein TIGR00288

TIGR01681 136 GS32 2 HAD-superfamily phosphatase, subfamily IIIC

TIGR02236 143 GS32 2 DNA DNA DNA repair and recombination protein RadA

TIGR00028 110 GS33 14.7 Mycobacterium tuberculosis PIN domain family

TIGR01550 131 GS33 9.1 Unknown General Death-on-curing family protein

TIGR01552 200 GS33 5.3 Mobile Other Prevent-host-death family protein

TIGR00143 151 GS33 4.6 Protein Protein [NiFe] hydrogenase maturation protein HypF

TIGR01641 131 GS33 4.3 Mobile Prophage Phage putative head morphogenesis protein, SPP1 gp7 family

TIGR01710 1,926 GS33 3.9 Cellular Pathogenesis General secretion pathway protein G

TIGR01539 251 GS33 3.6 Mobile Prophage Phage portal protein, lambda family

TIGR00016 217 GS33 3.5 Energy Fermentation Acetate kinase

TIGR00942 117 GS33 3.5 Transport Cations Multicomponent Naþ:Hþ antiporter

TIGR01836 174 GS33 3.3 Fatty Biosynthesis Poly(R)-hydroxyalkanoic acid synthase, class III, PhaC subunit

TIGR02110 135 GS33 3.2 Biosynthesis Other Coenzyme PQQ biosynthesis protein PqqF

TIGR01106 902 GS33 3.1 Energy ATP-proton Na,H/K antiporter P-type ATPase, alpha subunit

TIGR01497 726 GS33 3.1 Transport Cations Kþ-transporting ATPase, B subunit

TIGR01524 603 GS33 3.1 Transport Cations Magnesium-translocating P-type ATPase

TIGR02140 166 GS33 3.1 Transport Anions Sulfate ABC transporter, permease protein CysW

TIGR01409 122 GS33 3 Tat (twin-arginine translocation) pathway signal sequence

TIGR02195 282 GS33 3 Cell Biosynthesis Lipopolysaccharide heptosyltransferase II

TIGR01522 696 GS33 2.9 Calcium-transporting P-type ATPase, PMR1-type

TIGR01523 766 GS33 2.9 Potassium/sodium efflux P-type ATPase, fungal-type

TIGR00202 228 GS33 2.8 Regulatory RNA Carbon storage r