The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific Douglas B. Rusch 1* , Aaron L. Halpern 1 , Granger Sutton 1 , Karla B. Heidelberg 1,2 , Shannon Williamson 1 , Shibu Yooseph 1 , Dongying Wu 1,3 , Jonathan A. Eisen 1,3 , Jeff M. Hoffman 1 , Karin Remington 1,4 , Karen Beeson 1 , Bao Tran 1 , Hamilton Smith 1 , Holly Baden-Tillson 1 , Clare Stewart 1 , Joyce Thorpe 1 , Jason Freeman 1 , Cynthia Andrews-Pfannkoch 1 , Joseph E. Venter 1 , Kelvin Li 1 , Saul Kravitz 1 , John F. Heidelberg 1,2 , Terry Utterback 1 , Yu-Hui Rogers 1 , Luisa I. Falcón 5 , Valeria Souza 5 , Germán Bonilla-Rosso 5 , Luis E. Eguiarte 5 , David M. Karl 6 , Shubha Sathyendranath 7 , Trevor Platt 7 , Eldredge Bermingham 8 , Victor Gallardo 9 , Giselle Tamayo-Castillo 10 , Michael R. Ferrari 11 , Robert L. Strausberg 1 , Kenneth Nealson 1,12 , Robert Friedman 1 , Marvin Frazier 1 , J. Craig Venter 1 1 J. Craig Venter Institute, Rockville, Maryland, United States of America, 2 Department of Biological Sciences, University of Southern California, Avalon, California, United States of America, 3 Genome Center, University of California Davis, Davis, California, United States of America, 4 Your Genome, Your World, Rockville, Maryland, United States of America, 5 Departmento de Ecologı́a Evolutiva, Instituto de Ecologı́a, Universidad Nacional Autónoma de México, Mexico City, Mexico, 6 Department of Oceanography, University of Hawaii, Honolulu, Hawaii, United States of America, 7 Bedford Institute of Oceanography, Dartmouth, Nova Scotia, Canada, 8 Smithsonian Tropical Research Institute, Balboa, Ancon, Republic of Panama, 9 Departamento de Oceanografı́a, Universidad de Concepción, Concepción, Chile, 10 Escuela de Quı́mica, Universidad de Costa Rica, San Pedro, Costa Rica, 11 Department of Environmental Sciences, Rutgers University, New Brunswick, New Jersey, United States of America, 12 Department of Earth Sciences, University of Southern California, Los Angles, California, United States of America The world’s oceans contain a complex mixture of micro-organisms that are for the most part, uncharacterized both genetically and biochemically. We report here a metagenomic study of the marine planktonic microbiota in which surface (mostly marine) water samples were analyzed as part of the Sorcerer II Global Ocean Sampling expedition. These samples, collected across a several-thousand km transect from the North Atlantic through the Panama Canal and ending in the South Pacific yielded an extensive dataset consisting of 7.7 million sequencing reads (6.3 billion bp). Though a few major microbial clades dominate the planktonic marine niche, the dataset contains great diversity with 85% of the assembled sequence and 57% of the unassembled data being unique at a 98% sequence identity cutoff. Using the metadata associated with each sample and sequencing library, we developed new comparative genomic and assembly methods. One comparative genomic method, termed ‘‘fragment recruitment,’’ addressed questions of genome structure, evolution, and taxonomic or phylogenetic diversity, as well as the biochemical diversity of genes and gene families. A second method, termed ‘‘extreme assembly,’’ made possible the assembly and reconstruction of large segments of abundant but clearly nonclonal organisms. Within all abundant populations analyzed, we found extensive intra-ribotype diversity in several forms: (1) extensive sequence variation within orthologous regions throughout a given genome; despite coverage of individual ribotypes approaching 500-fold, most individual sequencing reads are unique; (2) numerous changes in gene content some with direct adaptive implications; and (3) hypervariable genomic islands that are too variable to assemble. The intra-ribotype diversity is organized into genetically isolated populations that have overlapping but independent distributions, implying distinct environmental preference. We present novel methods for measuring the genomic similarity between metagenomic samples and show how they may be grouped into several community types. Specific functional adaptations can be identified both within individual ribotypes and across the entire community, including proteorhodopsin spectral tuning and the presence or absence of the phosphate-binding gene PstS. Citation: Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through eastern tropical Pacific. PLoS Biol 5(3): e77. doi:10.1371/journal.pbio.0050077 Academic Editor: Nancy A. Moran, University of Arizona, United States of America Received July 14, 2006; Accepted January 16, 2007; Published March 13, 2007 Copyright: � 2007 Rusch et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Abbreviations: CAMERA, Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis; GOS, Global Ocean Sampling; NCBI, National Center for Biotechnology Information * To whom correspondence should be addressed. E-mail: DRusch@venterinstitute. org This article is part of Global Ocean Sampling collection in PLoS Biology. The full collection is available online at http://collections.plos.org/plosbiology/gos-2007. php. PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770398 PLoS BIOLOGY Introduction The concept of microbial diversity is not well defined. It can either refer to the genetic (taxonomic or phylogenetic) diversity as commonly measured by molecular genetics methods, or to the biochemical (physiological) diversity measured in the laboratory with pure or mixed cultures. However, we know surprisingly little about either the genetic or biochemical diversity of the microbial world [1], in part because so few microbes have been grown under laboratory conditions [2,3], and also because it is likely that there are immense numbers of low abundance ribotypes that have not been detected using molecular methods [4]. Our under- standing of microbial physiological and biochemical diversity has come from studying the less than 1% of organisms that can be maintained in enrichments or cultivated, while our understanding of phylogenetic diversity has come from the application of molecular techniques that are limited in terms of identifying low-abundance members of the communities. Historically, there was little distinction between genetic and biochemical diversity because our understanding of genetic diversity was based on the study of cultivated microbes. Biochemical diversity, along with a few morpho- logical features, was used to establish genetic diversity via an approach called numerical taxonomy [5,6]. In recent years the situation has dramatically changed. The determination of genetic diversity has relied almost entirely on the use of gene amplification via PCR to conduct taxonomic environmental gene surveys. This approach requires the presence of slowly evolving, highly conserved genes that are found in otherwise very diverse organisms. For example, the gene encoding the small ribosomal subunit RNA, known as 16S, based on sedimentation coefficient, is most often used for distinguish- ing bacterial and archaeal species [7–10]. The 16S rRNA sequences are highly conserved and can be used as a phylogenetic marker to classify organisms and place them in evolutionary context. Organisms whose 16S sequences are at least 97% identical are commonly considered to be the same ribotype [11], otherwise referred to as species, opera- tional taxonomic units, or phylotypes. Although rRNA-based analysis has revolutionized our view of genetic diversity, and has allowed the analysis of a large part of the uncultivated majority, it has been less useful in predicting biochemical diversity. Furthermore, the relation- ship between genetic and biochemical diversity, even for cultivated microbes, is not always predictable or clear. For instance, organisms that have very similar ribotypes (97% or greater homology) may have vast differences in physiology, biochemistry, and genome content. For example, the gene complement of Escherichia coli O157:H7 was found to be substantially different from the K12 strain of the same species [12]. In this paper, we report the results of the first phase of the Sorcerer II Global Ocean Sampling (GOS) expedition, a metagenomic study designed to address questions related to genetic and biochemical microbial diversity. This survey was inspired by the British Challenger expedition that took place from 1872–1876, in which the diversity of macroscopic marine life was documented from dredged bottom samples approximately every 200 miles on a circumnavigation [13–15]. Through the substantial dataset described here, we identified 60 highly abundant ribotypes associated with the open ocean and aquatic samples. Despite this relative lack of diversity in ribotype content, we confirm and expand upon previous observations that there is tremendous within-ribotype diver- sity in marine microbial populations [4,7,8,16,17]. New techniques and tools were developed to make use of the sampling and sequencing metadata. These tools include: (1) the fragment recruitment tool for performing and visualizing comparative genomic analyses when a reference sequence is available; (2) new assembly techniques that use metadata to produce assemblies for uncultivated abundant microbial taxa; and (3) a whole metagenome comparison tool to compare entire samples at arbitrary degrees of genetic divergence. Although there is tremendous diversity within cultivated and uncultivated microbes alike, this diversity is organized into phylogenetically distinct groups we refer to as subtypes. Subtypes can occupy similar environments yet remain genetically isolated from each other, suggesting that they are adapted for different environmental conditions or roles within the community. The variation between and within subtypes consists primarily of nucleotide polymorphisms but includes numerous small insertions, deletions, and hyper- variable segments. Examination of the GOS data in these terms sheds light on patterns of evolution and also suggests approaches towards improving the assembly of complex metagenomic datasets. At least some of this variation can be associated with functional characters that are a direct response to the environment. More than 6.1 million proteins, including thousands of new protein families, have been annotated from this dataset (described in the accompanying paper [18]). In combination, these papers bring us closer to reconciling the genetic and biochemical disconnect and to understanding the marine microbial community. We describe a metagenomic dataset generated from the Sorcerer II expedition. The GOS dataset, which includes and extends our previously published Sargasso Sea dataset [19], now encompasses a total of 41 aquatic, largely marine locations, constituting the largest metagenomic dataset yet produced with a total of ;7.7 million sequencing reads. In Author Summary Marine microbes remain elusive and mysterious, even though they are the most abundant life form in the ocean, form the base of the marine food web, and drive energy and nutrient cycling. We know so little about the vast majority of microbes because only a small percentage can be cultivated and studied in the lab. Here we report on the Global Ocean Sampling expedition, an environmental metagenomics project that aims to shed light on the role of marine microbes by sequencing their DNA without first needing to isolate individual organisms. A total of 41 different samples were taken from a wide variety of aquatic habitats collected over 8,000 km. The resulting 7.7 million sequencing reads provide an unprecedented look at the incredible diversity and heterogeneity in naturally occurring microbial populations. We have developed new bioinfor- matic methods to reconstitute large portions of both cultured and uncultured microbial genomes. Organism diversity is analyzed in relation to sampling locations and environmental pressures. Taken together, these data and analyses serve as a foundation for greatly expanding our understanding of individual microbial lineages and their evolution, the nature of marine microbial communities, and how they are impacted by and impact our world. PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770399 Sorcerer II GOS Expedition the pilot Sargasso Sea study, 200 l surface seawater was filtered to isolate microorganisms for metagenomic analysis. DNA was isolated from the collected organisms, and genome shotgun sequencing methods were used to identify more than 1.2 million new genes, providing evidence for substantial microbial taxonomic diversity [19]. Several hundred new and diverse examples of the proteorhodopsin family of light- harvesting genes were identified, documenting their exten- sive abundance and pointing to a possible important role in energy metabolism under low-nutrient conditions. However, substantial sequence diversity resulted in only limited genome assembly. These results generated many additional questions: would the same organisms exist everywhere in the ocean, leading to improved assembly as sequence coverage increased; what was the global extent of gene and gene family diversity, and can we begin to exhaust it with a large but achievable amount of sequencing; how do regions of the ocean differ from one another; and how are different environmental pressures reflected in organisms and com- munities? In this paper we attempt to address these issues. Results Sampling and the Metagenomic Dataset Microbial samples were collected as part of the Sorcerer II expedition between August 8, 2003, and May 22, 2004, by the S/V Sorcerer II, a 32-m sailing sloop modified for marine research. Most specimens were collected from surface water marine environments at approximately 320-km (200-mile) intervals. In all, 44 samples were obtained from 41 sites (Figure 1), covering a wide range of distinct surface marine environments as well as a few nonmarine aquatic samples for contrast (Table 1). Several size fractions were isolated for every site (see Materials and Methods). Total DNA was extracted from one or more fractions, mostly from the 0.1–0.8-lm size range. This fraction is dominated by bacteria, whose compact genomes are particularly suitable for shotgun sequencing. Random-insert clone libraries were constructed. Depending on the uniqueness of each sampling site and initial estimates of the genetic diversity, between 44,000 and 420,000 clones per sample were end-sequenced to generate mated sequenc- ing reads. In all, the combined dataset includes 6.25 Gbp of sequence data from 41 different locations. Many of the clone libraries were constructed with a small insert size (,2 kbp) to maximize cloning efficiency. As this often resulted in mated sequencing reads that overlapped one another, overlapping mated reads were combined, yielding a total of ;6.4 M contiguous sequences, totaling ;5.9 Gbp of nonredundant sequence. Taken together, this is the largest collection of metagenomic sequences to date, providing more than a 5-fold increase over the dataset produced from the Sargasso Sea pilot study [19] and more than a 90-fold increase over the other large marine metagenomic dataset [20]. Assembly Assembling genomic data into larger contigs and scaffolds, especially metagenomic data, can be extremely valuable, as it places individual sequencing reads into a greater genomic context. A largely contiguous sequence links genes into operons, but also permits the investigation of larger biochemical and/or physiological pathways, and also connects otherwise-anonymous sequences with highly studied ‘‘taxo- Figure 1. Sampling Sites Microbial populations were sampled from locations in the order shown. Samples were collected at approximately 200 miles (320 km) intervals along the eastern North American coast through the Gulf of Mexico into the equatorial Pacific. Samples 00 and 01 identify sets of sites sampled as part of the Sargasso Sea pilot study [19]. Samples 27 through 36 were sampled off the Galapagos Islands (see inset). Sites shown in gray were not analyzed as part of this study. doi:10.1371/journal.pbio.0050077.g001 PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770400 Sorcerer II GOS Expedition T a b le 1 . Sa m p lin g Lo ca ti o n s an d En vi ro n m e n ta l D at a ID S a m p le L o ca ti o n C o u n tr y D a te , m m /d d /y y T im e L o ca ti o n S a m p le D e p th , m W a te r D e p th , m T (8 C )a S b (p p t) S iz e F ra ct io n (l m ) H a b it a t T y p e C h l a S a m p le M o n th (A n n u a l 6 S E ) m g /m � 3 G o o d S e q u e n ce s G S0 0 a Sa rg as so St at io n s 1 3 an d 1 1 B e rm u d a (U K ) 0 2 /2 6 /0 3 3 :0 0 3 1 83 2 96 99 n ; 6 3 83 5 94 2 99 w 5 .0 . 4 ,2 0 0 2 0 .0 2 0 .5 3 6 .6 0 .1 – 0 .8 O p e n o ce an 0 .1 7 (0 .0 .9 6 0 .0 2 ) 6 4 4 ,5 5 1 1 0 :1 0 3 1 81 0 95 0 99 n ; 6 4 81 9 92 7 99 w 3 6 .7 G S0 0 b Sa rg as so St at io n s 1 3 an d 1 1 B e rm u d a (U K ) 0 2 /2 6 /0 3 3 :3 5 3 1 83 2 91 0 99 n ; 6 3 83 5 97 0 99 w 5 .0 . 4 ,2 0 0 2 0 .0 2 0 .5 3 6 .6 0 .2 2 – 0 .8 O p e n o ce an 0 .1 7 (0 .0 .9 6 0 .0 2 ) 3 1 7 ,1 8 0 1 0 :4 3 3 1 81 0 95 0 n ; 6 4 81 9 92 7 99 w 3 6 .7 G S0 0 c Sa rg as so St at io n s 3 B e rm u d a (U K ) 0 2 /2 5 /0 3 1 3 :0 0 3 2 80 9 93 0 99 n ; 6 4 80 0 93 6 99 w 5 .0 . 4 ,2 0 0 1 9 .8 3 6 .7 0 .2 2 – 0 .8 O p e n o ce an 0 .1 7 (0 .0 .9 6 0 .0 2 ) 3 6 8 ,8 3 5 G S0 0 d Sa rg as so St at io n s 1 3 B e rm u d a (U K ) 0 2 /2 5 /0 3 1 7 :0 0 3 1 83 2 96 99 n ; 6 3 83 5 94 2 99 w 5 .0 . 4 ,2 0 0 2 0 .0 3 6 .6 0 .2 2 – 0 .8 O p e n o ce an 0 .1 7 (0 .0 .9 6 0 .0 2 ) 3 3 2 ,2 4 0 G S0 1 a H yd ro st at io n S B e rm u d a (U K ) 0 5 /1 5 /0 3 1 1 :4 0 3 2 81 0 90 0 99 n 6 4 83 0 90 0 99 w 5 .0 . 4 ,2 0 0 2 2 .9 3 6 .7 3 .0 – 2 0 .0 O p e n o ce an 0 .1 0 (0 .1 0 6 0 .0 1 ) 1 4 2 ,3 5 2 G S0 1 b H yd ro st at io n S B e rm u d a (U K ) 0 5 /1 5 /0 3 1 1 :4 0 3 2 81 0 90 0 99 n ; 6 4 83 0 90 0 99 w 5 .0 . 4 ,2 0 1 2 2 .9 3 6 .7 0 .8 – 3 .0 O p e n o ce an 0 .1 0 (0 .1 0 6 0 .0 1 ) 9 0 ,9 0 5 G S0 1 c H yd ro st at io n S B e rm u d a (U K ) 0 5 /1 5 /0 3 1 1 :4 0 3 2 81 0 90 0 99 n ; 6 4 83 0 90 0 99 w 5 .0 . 4 ,2 0 2 2 2 .9 3 6 .7 0 .1 – 0 .8 O p e n o ce an 0 .1 (0 .1 6 0 .0 1 ) 9 2 ,3 5 1 G S0 2 G u lf o f M ai n e U SA 0 8 /2 1 /0 3 6 :3 2 4 2 83 0 91 1 99 n ; 6 7 81 4 92 4 99 w 1 .0 1 0 6 1 8 .2 2 9 .2 0 .1 – 0 .8 C o as ta l 1 .4 (1 .1 2 6 0 .1 9 ) 1 2 1 ,5 9 0 G S0 3 B ro w n s B an k, G u lf o f M ai n e C an ad a 0 8 /2 1 /0 3 1 1 :5 0 4 2 85 1 91 0 99 n ; 6 6 81 3 92 99 w 1 .0 1 1 9 1 1 .7 2 9 .9 0 .1 – 0 .8 C o as ta l 1 .4 (1 .1 2 6 0 .1 9 ) 6 1 ,6 0 5 G S0 4 O u ts id e H al if ax , N o va Sc o ti a C an ad a 0 8 /2 2 /0 3 5 :2 5 4 4 88 91 4 99 n ; 6 3 83 8 94 0 99 w 2 .0 1 4 2 1 7 .3 2 8 .3 0 .1 – 0 .8 C o as ta l 0 .4 (0 .7 8 6 0 .1 7 ) 5 2 ,9 5 9 G S0 5 B e d fo rd B as in , N o va Sc o ti a C an ad a 0 8 /2 2 /0 3 1 6 :2 1 4 4 84 1 92 5 99 n ; 6 3 83 8 91 4 99 w 1 .0 6 4 1 5 .0 3 0 .2 0 .1 – 0 .8 Em b ay m e n t 6 (6 .7 6 6 0 .9 8 ) 6 1 ,1 3 1 G S0 6 B ay o f Fu n d y, N o va Sc o ti a C an ad a 0 8 /2 3 /0 3 1 0 :4 7 4 5 86 94 2 99 n ; 6 4 85 6 94 8 99 w 1 .0 1 1 1 1 .2 0 .1 – 0 .8 Es tu ar y 2 .8 (1 .8 7 6 0 .1 8 ) 5 9 ,6 7 9 G S0 7 N o rt h e rn G u lf o f M ai n e C an ad a 0 8 /2 5 /0 3 8 :2 5 4 3 83 7 95 6 99 n ; 6 6 85 0 95 0 99 w 1 .0 1 3 9 1 7 .9 3 1 .7 c 0 .1 – 0 .8 C o as ta l 1 .4 (1 .1 2 6 0 .1 9 ) 5 0 ,9 8 0 G S0 8 N e w p o rt H ar b o r, R I U SA 1 1 /1 6 /0 3 1 6 :4 5 4 1 82 9 99 99 n ; 7 1 82 1 94 99 w 1 .0 1 2 9 .4 2 6 .5 c 0 .1 – 0 .8 C o as ta l 2 .2 (1 .5 9 6 0 .1 7 ) 1 2 9 ,6 5 5 G S0 9 B lo ck Is la n d , N Y U SA 1 1 /1 7 /0 3 1 0 :3 0 4 1 85 92 8 99 n ; 7 1 83 6 98 99 w 1 .0 3 2 1 1 .0 3 1 .0 c 0 .1 – 0 .8 C o as ta l 4 .0 (2 .7 2 6 0 .2 4 ) 7 9 ,3 0 3 G S1 0 C ap e M ay , N J U SA 1 1 /1 8 /0 3 4 :3 0 3 8 85 6 92 4 99 n ; 7 4 84 1 96 99 w 1 .0 1 0 1 2 .0 3 1 .0 c 0 .1 – 0 .8 C o as ta l 2 .0 (2 .7 5 6 0 .3 3 ) 7 8 ,3 0 4 G S1 1 D e la w ar e B ay , N J U SA 1 1 /1 8 /0 3 1 1 :3 0 3 9 82 5 94 99 n ; 7 5 83 0 91 5 99 w 1 .0 8 1 1 .0 0 .1 – 0 .8 Es tu ar y 4 .8 (9 .2 3 6 1 .0 2 ) 1 2 4 ,4 3 5 G S1 2 C h e sa p e ak e B ay , M D U SA 1 2 /1 8 /0 3 1 1 :3 2 3 8 85 6 94 9 99 n ; 7 6 82 5 92 99 w 1 .0 2 5 3 .2 3 .4 7 c 0 .1 – 0 .8 Es tu ar y 2 1 .0 (1 5 .0 6 1 .0 1 ) 1 2 6 ,1 6 2 G S1 3 O ff N ag s H e ad , N C U SA 1 2 /1 9 /0 3 6 :2 8 3 6 80 91 4 99 n ; 7 5 82 3 94 1 99 w 1 .0 2 0 9 .3 0 .1 – 0 .8 C o as ta l 3 .0 (2 .2 4 6 0 .2 5 ) 1 3 8 ,0 3 3 G S1 4 So u th o f C h ar le st o n , SC U SA 1 2 /2 0 /0 3 1 7 :1 2 3 2 83 0 92 5 99 n ; 7 9 81 5 95 0 99 w 1 .0 3 1 1 8 .6 0 .1 – 0 .8 C o as ta l 1 .7 0 (1 .9 2 6 0 .2 5 ) 1 2 8 ,8 8 5 G S1 5 O ff K e y W e st , FL U SA 0 1 /0 8 /0 4 6 :2 5 2 4 82 9 91 8 99 n ; 8 3 84 91 2 99 w 2 .0 4 7 2 5 .3 3 6 .0 0 .1 – 0 .8 C o as ta l 0 .2 (0 .2 7 6 0 .0 9 ) 1 2 7 ,3 6 2 G S1 6 G u lf o f M e xi co U SA 0 1 /0 8 /0 4 1 4 :1 5 2 4 81 0 92 9 99 n ; 8 4 82 0 94 0 99 w 2 .0 3 ,3 3 3 2 6 .4 3 5 .8 0 .1 – 0 .8 C o as ta l se a 0 .1 6 (0 .1 1 6 0 .0 1 ) 1 2 7 ,1 2 2 G S1 7 Y u ca ta n C h an n e l M e xi co 0 1 /0 9 /0 4 1 3 :4 7 2 0 83 1 92 1 99 n ; 8 5 82 4 94 9 99 w 2 .0 4 ,5 1 3 2 7 .0 3 5 .8 0 .1 – 0 .8 O p e n o ce an 0 .1 3 (0 .0 9 6 0 .0 1 ) 2 5 7 ,5 8 1 G S1 8 R o sa ri o B an k H o n d u ra s 0 1 /1 0 /0 4 8 :1 2 1 8 82 91 2 99 n ; 8 3 84 7 95 99 w 2 .0 4 ,4 7 0 2 7 .4 3 5 .4 0 .1 – 0 .8 O p e n o ce an 0 .1 4 (0 .0 9 6 0 .0 1 ) 1 4 2 ,7 4 3 G S1 9 N o rt h e as t o f C o ló n P an am a 0 1 /1 2 /0 4 9 :0 3 1 0 84 2 95 9 99 n ; 8 0 81 5 91 6 99 w 2 .0 3 ,3 3 6 2 7 .7 3 5 .4 0 .1 – 0 .8 C o as ta l 0 .2 3 (0 .1 5 6 0 .0 2 ) 1 3 5 ,3 2 5 G S2 0 La ke G at u n P an am a 0 1 /1 5 /0 4 1 0 :2 4 9 89 95 2 99 n ; 7 9 85 0 91 0 99 w 2 .0 4 2 8 .5 0 .0 6 0 .1 – 0 .8 Fr e sh w at e r 2 9 6 ,3 5 5 G S2 1 G u lf o f P an am a P an am a 0 1 /1 9 /0 4 1 6 :4 8 8 87 94 5 99 n ; 7 9 84 1 92 8 99 w 2 .0 7 6 2 7 .6 3 0 .7 0 .1 – 0 .8 C o as ta l 0 .5 0 (0 .7 3 6 0 .2 2 ) 1 3 1 ,7 9 8 G S2 2 2 5 0 m ile s fr o m P an am a C it y P an am a 0 1 /2 0 /0 4 1 6 :3 9 6 82 9 93 4 99 n ; 8 2 85 4 91 4 99 w 2 .0 2 ,4 3 1 2 9 .3 3 2 .3 0 .1 – 0 .8 O p e n o ce an 0 .3 3 (0 .2 8 6 0 .0 2 ) 1 2 1 ,6 6 2 G S2 3 3 0 m ile s fr o m C o co s Is la n d C o st a R ic a 0 1 /2 1 /0 4 1 5 :0 0 5 83 8 92 4 99 n ; 8 6 83 3 95 5 99 w 2 .0 1 ,1 3 9 2 8 .7 3 2 .6 0 .1 – 0 .8 O p e n o ce an 0 .0 7 (0 .1 9 6 0 .0 2 ) 1 3 3 ,0 5 1 G S2 5 D ir ty R o ck , C o co s Is la n d C o st a R ic a 0 1 /2 8 /0 4 1 0 :5 1 5 83 3 91 0 99 n ; 8 7 85 91 6 99 w 1 .1 3 0 2 8 .3 3 1 .4 0 .8 – 3 .0 Fr in g in g re e f 0 .1 1 (0 .1 9 6 0 .0 1 ) 1 2 0 ,6 7 1 G S2 6 1 3 4 m ile s N E o f G al ap ag o s Ec u ad o r 0 2 /0 1 /0 4 1 6 :1 6 1 81 5 95 1 99 n ; 9 0 81 7 94 2 99 w 2 .0 2 ,3 7 6 2 7 .8 3 2 .6 0 .1 – 0 .8 O p e n o ce an 0 .2 2 (0 .2 8 6 0 .0 2 ) 1 0 2 ,7 0 8 G S2 7 D e vi l’s C ro w n , Fl o re an a Ec u ad o r 0 2 /0 4 /0 4 1 1 :4 1 1 81 2 95 8 99 s; 9 0 82 5 92 2 99 w 2 .0 2 .3 2 5 .5 3 4 .9 0 .1 – 0 .8 C o as ta l 0 .4 0 (0 .3 8 6 0 .0 3 ) 2 2 2 ,0 8 0 G S2 8 C o as ta l Fl o re an a Ec u ad o r 0 2 /0 4 /0 4 1 5 :4 7 1 81 3 91 99 s; 9 0 81 9 91 1 99 w 2 .0 1 5 6 2 5 .0 c 0 .1 – 0 .8 C o as ta l 0 .3 5 (0 .3 5 6 0 .0 2 ) 1 8 9 ,0 5 2 G S2 9 N o rt h Ja m e s B ay , Sa n ti g o Ec u ad o r 0 2 /0 8 /0 4 1 8 :0 3 0 81 2 90 99 s; 9 0 85 0 97 99 w 2 .0 1 2 2 6 .2 3 4 .5 0 .1 – 0 .8 C o as ta l 0 .4 0 (0 .3 9 6 0 .0 3 ) 1 3 1 ,5 2 9 G S3 0 W ar m se e p , R o ca R e d o n d a Ec u ad o r 0 2 /0 9 /0 4 1 1 :4 2 0 81 6 92 0 99 n ; 9 1 83 8 90 99 w 1 9 .0 1 9 2 6 .9 0 .1 – 0 .8 W ar m se e p 3 5 9 ,1 5 2 G S3 1 U p w e lli n g , Fe rn an d in a Ec u ad o r 0 2 /1 0 /0 4 1 4 :4 3 0 81 8 94 99 s; 9 1 83 9 96 99 w 1 2 .0 1 9 1 8 .6 0 .1 – 0 .8 C o as ta l u p w e lli n g 0 .3 5 (0 .3 9 6 0 .0 3 ) 4 3 6 ,4 0 1 G S3 2 M an g ro ve , Is ab e lla Ec u ad o r 0 2 /1 1 /0 4 1 1 :3 0 0 83 5 93 8 99 s; 9 1 84 91 0 99 w 0 .3 0 .6 7 2 5 .4 0 .1 – 0 .8 M an g ro ve 1 4 8 ,0 1 8 G S3 3 P u n ta C o rm o ra n t La g o o n , Fl o re an a Ec u ad o r 0 2 /1 9 /0 4 1 3 :3 5 1 81 3 94 2 99 s; 9 0 82 5 94 5 99 w 0 .2 0 .3 3 3 7 .6 4 6 c 0 .1 – 0 .8 H yp e rs al in e 6 9 2 ,2 5 5 G S3 4 N o rt h Se am o re Ec u ad o r 0 2 /1 9 /0 4 1 7 :0 6 0 82 2 95 9 99 s; 9 0 81 6 94 7 99 w 2 .0 3 5 2 7 .5 0 .1 – 0 .8 C o as ta l 0 .3 6 (0 .3 5 6 0 .0 2 ) 1 3 4 ,3 4 7 G S3 5 W o lf Is la n d Ec u ad o r 0 3 /0 1 /0 4 1 6 :4 4 1 82 3 92 1 99 n ; 9 1 84 9 91 99 w 2 .0 7 1 2 1 .8 3 4 .5 0 .1 – 0 .8 C o as ta l 0 .2 8 (0 .3 1 6 0 .0 2 ) 1 4 0 ,8 1 4 G S3 6 C ab o M ar sh al l, Is ab e lla Ec u ad o r 0 3 /0 2 /0 4 1 2 :5 2 0 81 91 5 99 s; 9 1 81 1 95 2 99 w 2 .0 6 7 2 5 .8 3 4 .6 0 .1 – 0 .8 C o as ta l 0 .6 5 (0 .4 5 6 0 .0 5 ) 7 7 ,5 3 8 G S3 7 Eq u at o ri al P ac if ic T A O B u o y In te rn at io n al 0 3 /1 7 /0 4 1 6 :3 8 1 85 8 92 6 99 s; 9 5 80 95 3 99 w 2 .0 3 ,3 3 4 2 8 .8 0 .1 – 0 .8 O p e n o ce an 0 .2 1 (0 .2 4 6 0 .0 2 ) 6 5 ,6 7 0 G S4 7 2 0 1 m ile s fr o m Fr e n ch P o ly n e si a In te rn at io n al 0 3 /2 8 /0 4 1 5 :2 5 1 0 87 95 3 99 s; 1 3 5 82 6 95 8 99 w 3 0 .0 2 ,4 0 0 2 8 .6 3 7 .3 0 .1 – 0 .8 O p e n o ce an 6 6 ,0 2 3 G S5 1 R an g ir o ra A to ll Fr e n ch P o ly n e si a 0 5 /2 2 /0 4 7 :0 4 1 5 88 93 7 99 s; 1 4 7 82 6 96 99 w 1 .0 1 0 2 7 .3 3 4 .2 0 .1 – 0 .8 C o ra l re e f at o ll 1 2 8 ,9 8 2 T o ta l 7 ,6 9 7 ,9 2 6 a T e m p e ra tu re . b Sa lin it y. c M e as u re m en ts w e re ac q u ir e d fr o m n e ar b y ve ss e ls an d /o r re se ar ch st at io n s. d o i:1 0 .1 3 7 1 /j o u rn al .p b io .0 0 5 0 0 7 7 .t 0 0 1 PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770401 Sorcerer II GOS Expedition nomic markers’’ such as 16S or recA, thus clearly identifying the taxonomic group with which they are associated. The primary assembly of the combined GOS dataset was performed using the Celera Assembler [21] with modifica- tions as previously described [19] and as given in Materials and Methods. The assembly was performed with quite stringent criteria, beginning with an overlap cutoff of 98% identity to reduce the potential for artifacts (e.g., chimeric assemblies or consensus sequences diverging substantially from the genome of any given cell). This assembly was the substrate for annotation (see the accompanying paper by Yooseph et al. [18]). The degree of assembly of a metagenomic sample provides an indication of the diversity of the sample. A few substantial assemblies notwithstanding, the primary assembly was strik- ingly fragmented (Table 2). Only 9% of sequencing reads went into scaffolds longer than 10 kbp. A majority (53%) of the sequencing reads remained unassembled singletons. Scaffolds containing more than 50 kb of consensus sequence totaled 20.7 Mbp; of these, .75% were produced from a single Sargasso Sea sample and correspond to the Burkholderia or Shewanella assemblies described previously [19]. These results highlight the unusual abundance of these two organisms in a single sample, which significantly affected our expectations regarding the current dataset. Given the large size of the combined dataset and the substantial amount of sequencing performed on individual filters, the overall lack of assembly provides evidence of a high degree of diversity in surface planktonic communities. To put this in context, suppose there were a clonal organism that made up 1% of our data, or ;60 Mbp. Even a genome of 10 Mbp—enormous by bacterial standards—would be covered ;6-fold. Such data might theoretically assemble with an average contig ap- proaching 50 kb [22]. While real assemblies generally fall short of theory for various reasons, Shewanella data make up ,1% of the total GOS dataset, and yet most of the relevant reads assemble into scaffolds .50 kb. Thus, with few scaffolds of significant length, we could conclude that there are very few clonal organisms present at even 1% in the GOS dataset. To investigate the nature of the implied diversity and to see whether greater assembly could be achieved, we explored several alternative approaches. Breaks in the primary assembly resulted from two factors: incomplete sequence coverage and conflicts in the data. Conflicts can break assemblies when there is no consistent way to chain together all overlapping sequencing reads. As it was possible that there would be fewer conflicts within a single sample (i.e., that diversity within a single sample would be lower), assemblies were attempted with individual samples. However, the results did not show any systematic improvements even in those samples with greater coverage (unpublished data). Upon manual inspection, most assembly-breaking conflicts were found to be local in nature. These observations suggested that reducing the degree of sequence identity required for assembly could ameliorate both factors limiting assembly: effective coverage would increase and many minor conflicts would be resolved. Accordingly, we produced a series of assemblies based on 98%, 94%, 90%, 85%, and 80% identity overlaps for two subsets of the GOS dataset, again using the Celera Assembler. Assembly lengths increased as the overlap cutoff decreased from 98% to 94% to 90%, and then leveled off or even dropped as stringency was reduced below 90% (Table 3). Although larger assemblies could be generated using lower identity overlaps, significant numbers of overlaps satisfying the chosen percent identity cutoff still went unused in each assembly. This is consistent with a high rate of conflicting overlaps and in turn diagnostic of significant polymorphism. In mammalian sequencing projects the use of larger insert libraries is critical to producing larger assemblies because of their ability to span repeats or local polymorphic regions [23]. The shotgun sequencing libraries from the GOS filters were typically constructed from inserts shorter than 2 kb. Longer plasmid libraries were attempted but were much less stable. We obtained paired-end sequences from 21,419 fosmid clones (average insert size, 36 kb; [24,25]) from the 0.1-micron fraction of GS-33. The effect of these long mate pairs on the GS-33 assembly was quite dramatic, particularly at high stringency (e.g., improving the largest scaffold from 70 kb to 1,247 kb and the largest contig from 70 kb to 427 kb). At least for GS-33 this suggests that many of the polymorphisms affect small, localized regions of the genome that can be spanned using larger inserts. This degree of improvement may be greater than what could be expected in general, as the diversity of GS-33 is by far the lowest of any of the currently Table 2. Summary Assembly Statistics Category Statistic Value Assembly inputs Number of reads used for assembly 7,697,926 Total read length (bp) 6,325,208,303 Number of ‘‘intigs’’ used for assemblya 6,389,523 Total intig length (bp)a 5,883,982,712 Assembly outputs Number of assembliesb 3,081,849 Total assembled consensus length (bp) 4,460,027,783 Percentage of unassembled reads 53% Percentage of assembly at .13 coverage 15.3% Base pairs in contigs � 10 kb 39,427,102 Base pairs in contigs � 50 kb 15,723,513 Base pairs in scaffolds � 2.5 kb (consensus bases) 458,196,599 Base pairs in scaffolds � 5 kb (consensus bases) 138,137,150 Base pairs in scaffolds � 10 kb (consensus bases) 65,238,481 Base pairs in scaffolds � 50 kb (consensus bases) 20,738,836 Base pairs in scaffolds � 100 kb (consensus bases) 16,005,244 Base pairs in scaffolds � 300 kb (consensus bases) 8,805,668 Percentage of assembly in scaffolds � 10 kbc 1.5% Length of longest contig (bp) 977,960 Length of longest scaffold (bp) 2,097,794 N1d assembly bp 15,915 N10d assembly bp 2,533 N50d assembly bp 1,611 N1d contig bp 8,994 N10d contig bp 2,447 N50d contig bp (single reads) aIntigs are overlapping mated reads that have been collapsed into a single sequence as input into the assembler. bAssemblies refers to the total number of scaffolds, pairs of mated nonoverlapping singletons, and singleton unmated reads. cFor comparison purposes, 10 kb is the average contig size predicted for 4.13 coverage for an idealized shotgun assembly of a repeat-free, clonal genome [22]. dN1 indicates the length of the next largest assembled sequence or contig such that 1% of the sequence data falls into longer assemblies or contigs. N10 and N50 indicate that 10% and 50% of the data fell into larger assemblies or contigs. doi:10.1371/journal.pbio.0050077.t002 PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770402 Sorcerer II GOS Expedition sequenced GOS samples, yet it clearly indicates the utility of including larger insert libraries for assembly. Fragment Recruitment In the absence of substantial assembly, direct comparison of the GOS sequencing data to the genomes of sequenced microbes is an alternative way of providing context, and also allows for exploration of genetic variation and diversity. A large and growing set of microbial genomes are available from the National Center for Biotechnology Information (NCBI; http://www.ncbi.nlm.nih.gov). At the time of this study, we used 334 finished and 250 draft microbial genomes as references for comparison with the GOS sequencing reads. Comparisons were carried out in nucleotide-space using the sequence alignment tool BLAST [26]. BLAST parameters were designed to be extremely lenient so as to detect even distant similarities (as low as 55% identity). A large proportion of the GOS reads, 70% in all, aligned to one or more genomes under these conditions. However, many of the alignments were of low identity and used only a portion of the entire read. Such low-quality hits may reflect distant evolutionary relationships, and therefore less information is gained based on the context of the alignment. More stringent criteria could be imposed requiring that the reads be aligned over nearly their entire length without any large gaps. Using this stringent criterion only about 30% of the reads aligned to any of the 584 reference genomes. We refer to these fully aligned reads as ‘‘recruited reads.’’ Recruited reads are far more likely to be from microbes closely related to the reference sequence (same species) than are partial align- ments. Despite the large number of microbial genomes currently available, including a large number of marine microbes, these results indicate that a substantial majority of GOS reads cannot be specifically related to available micro- bial genomes. The amount and distribution of reads recruited to any given genome provides an indication of the abundance of closely related organisms. Only genomes from the five bacterial genera Prochlorococcus, Synechococcus, Pelagibacter, Shewanella, and Burkholderia yielded substantial and uniform recruitment of GOS fragments over most of a reference genome (Table 4). These genera include multiple reference genomes, and we observed significant differences in recruit- ment patterns even between organisms belonging to the same species (Figure 2A–2I). Three genera, Pelagibacter (Figure 2A), Prochlorococcus (Figure 2B–2F), and Synechococcus (Figure 2G– 2I), were found abundantly in a wide range of samples and together accounted for roughly 50% of all the recruited reads (though only ;15% of all GOS sequencing reads). By contrast, although every genome tested recruited some GOS reads, most recruited only a small number, and these reads clustered at lower identity to locations corresponding to large highly conserved genes (for typical examples see Figure 2E– 2F). We refer to this pattern as nonspecific recruitment as it reflects taxonomically nonspecific signals, with the reads in Table 4. Microbial Genera that Recruited the Bulk of the GOS Reads Genus Read Count Best Strain All Reads 80%þa 90%þb All Reads 80%þa 90%þb Pelagibacter 922,677 195,539 36,965 HTCC1062 HTCC1062 HTCC1062 Prochlorococcus 208,999 159,102 84,325 MIT9312 MIT9312 MIT9312 Synechococcus 60,650 26,365 21,594 CC9902 RS9917 RS9917 Burkholderia 151,123 108,610 93,081 383 383 383 Shewanella 59,086 34,138 27,693 MR-1 MR-1 MR-1 Remaining 43,244 2,367 564 Buchnera aphidicola Str. Sg Buchnera aphidicola Str. APS Alteromonas macleodii aReads aligned at or above 80% identity over the entire length of the read. bReads aligned at or above 90% identity over the entire length of the read. doi:10.1371/journal.pbio.0050077.t004 Table 3. Evaluation of Alternative Assembly Methods Dataset Type Percent Identity Base Pairs in 10 k Contigs Base Pairs in 100 k Contigs GS33 plasmids WGSa 98 13,669,678 0 94 19,536,324 2,749,543 90 20,996,826 3,729,765 85 20,327,989 3,505,324 80 19,245,637 4,195,959 E-asmb 98 22,000,579 5,604,857 94 22,781,462 7,302,801 90 22,702,764 7,600,441 85 22,570,933 7,937,079 80 20,335,558 4,779,684 GS33 with fosmids WGSa 98 15,031,557 1,306,992 94 22,310,335 4,449,710 90 22,944,278 5,585,959 85 22,251,738 5,485,013 80 21,088,975 5,684,925 GS17,18,23,26 WGSa 98 185,058 0 94 5,422,366 213,755 90 10,694,783 373,822 85 11,514,421 800,290 80 9,004,221 879,401 E-asmb 98 2,047,524 0 94 10,668,547 1,184,881 90 15,215,981 2,634,227 85 15,786,515 3,132,152 80 13,767,929 2,942,160 Combined GOS WGSa 98 39,427,102 11,488,828 94 98,887,937 12,376,236 E-asmb 98 91,526,091 16,444,304 94 163,612,717 25,564,163 90 186,614,813 28,752,198 85 181,887,218 27,154,335 80 161,160,091 23,794,832 aWhole-genome shotgun (WGS) assembly performed with the Celera Assembler. bAssemblies performed using extreme assembly approach (E-asm). doi:10.1371/journal.pbio.0050077.t003 PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770403 Sorcerer II GOS Expedition Figure 2. Fragment Recruitment Plots The horizontal axis of each panel corresponds to a 100-kb segment of genomic sequence from the indicated reference microbial genome. The vertical axis indicates the sequence identity of an alignment between a GOS sequence and the reference genomic sequence. The identity ranges from 100% (top) to 50% (bottom). Individual GOS sequencing reads were colored to reflect the sample from which they were isolated. Geographically nearby samples have similar colors (see Poster S1 for key). Each organism shows a distinct pattern of recruitment reflecting its origin and relationship to the environmental data collected during the course of this study. (A) P. ubique HTCC1062 recruits the greatest density of GOS sequences of any genome examined to date. The GOS sequences show geographic PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770404 Sorcerer II GOS Expedition question often recruiting to distantly related sets of genomes. Most microbial genomes, including many of the marine microbes (e.g., the ubiquitous genus Vibrio), demonstrated this nonspecific pattern of recruitment. The relationship between the similarity of an individual sequencing read to a given genome and the sample fromwhich the read was isolated can provide insight into the structure, evolution, and geographic distribution of microbial popula- tions. These relationships were assessed by constructing a ‘‘percent identity plot’’ [27] in which the alignment of a read to a reference sequence is shown as a bar whose horizontal position indicates location on the reference andwhose vertical position indicates the percent identity of the alignment. We colored the plotted reads according to the samples to which they belonged, thus indirectly representing various forms of metadata (geographic, environmental, and laboratory varia- bles). We refer to these plots that incorporate metadata as fragment recruitment plots. Fragment recruitment plots of GOS sequences recruited to the entire genomes of Pelagibacter ubique HTCC1062, Prochlorococcus marinus MIT9312, and Syn- echococcus WH8102 are presented in Poster S1. Within-Ribotype Population Structure and Variation Characteristic patterns of recruitment emerged from each of these abundant marine microbes consisting of horizontal bands made up of large numbers of GOS reads. These bands seem constrained to a relatively narrow range of identities that tile continuously (or at least uniformly, in the case when abundance/coverage is lower) along ;90% of the reference sequence. The uninterrupted tiling indicates that environ- mental genomes are largely syntenic with the reference genomes. Multiple bands, distinguished by degree of sim- ilarity to the reference and by sample makeup, may arise on a single reference (Poster S1D and S1F). Each of these bands appears to represent a distinct, closely related population we refer to as a subtype. In some cases, an abundant subtype is highly similar to the reference genome, as is the case for P. marinus MIT9312 (Poster S1) and Synechococcus RS9917 (unpublished data). P. ubique HTCC1062 and other Synecho- coccus strains like WH8102 show more complicated banding patterns (Poster S1D and S1F) because of the presence of multiple subtypes that produce complex often overlapping bands in the plots. Though the recruitment patterns can be quite complex they are also remarkably consistent over much of the reference genome. In these more complicated recruit- ment plots, such as the one for P. ubique HTCC1062, individual bands can show sudden shifts in identity or disappear altogether, producing a gap in recruitment that appears to be specific to that band (see P. ubique recruitment plots on Poster S1B and S1E, and specifically between 130– 140 kb). Finally, phylogenetic analysis indicates that separate bands are indeed evolutionarily distinct at randomly selected locations along the genome. The amount of sequence variation within a given band cannot be reliably determined from the fragment recruit- ment plots themselves. To examine this variation, we produced multiple sequence alignments and phylogenies of reads that recruited to several randomly chosen intervals along given reference genomes to show that there can be considerable within-subtype variation (Figure 3A–3B). For example, within the primary band found in recruitment plots to P. marinus MIT9312, individual pairs of overlapping reads typically differ on average between 3%–5% at the nucleotide level (depending on exact location in the genome). Very few reads that recruited to MIT9312 have perfect (mismatch-free) overlaps with any other read or to MIT9312, despite ;100- fold coverage. While many of these differences are silent (i.e., do not change amino acid sequences), there is still consid- erable variation at the protein level (unpublished data). The amount of variation within subtypes is so great that it is likely that no two sequenced cells contained identical genomes. stratification into bands, with sequences from temperate water samples off the North American coast having the highest identity (yellow to yellow- green colors). At lower identity, sequences from all the marine environments could be aligned to HTCC1062. (B) P. marinus MIT9312 recruits a large number of GOS sequences into a single band that zigzags between 85%–95% identity on average. These sequences are largely derived from warm water samples in the Gulf of Mexico and eastern Pacific (green to greenish-blue reads). (C) P. marinus MED4 recruits largely the same set of reads as MIT9312 (B) though the sequences that form the zigzag recruit at a substantially lower identity. A small number of sequences from the Sargasso Sea samples (red) are found at high identity. (D) P. marinus NATL2A recruits far fewer sequences than any of the preceding panels. Like MED4, a small number of high-identity sequences were recruited from the Sargasso samples. (E) P. marinus MIT9313 is a deep-water low-light–adapted strain of Prochlorococcus. GOS sequences were recruited almost exclusively at low identity in vertical stacks that correspond to the locations of conserved genes. On the left side of this panel is a very distinctive pattern of recruitment that corresponds to the highly conserved 16S and 23S mRNA gene operon. (F) P. marinus CCMP1375, another deep-water low-light–adapted strain, does not recruit GOS sequences at high identity. Only stacks of sequences are seen corresponding to the location of conserved genes. (G) Synechococcus WH8102 recruits a modest number of high-identity sequences primarily from the Sargasso Sea samples. A large number of moderate identity matches from the Pacific and hypersaline lagoon (GS33) samples are also visible. (H) Synechococcus CC9605 recruits largely the same sequences as does Synechococcus WH8102, but was isolated from Pacific waters. GOS sequences from some of the Pacific samples recruit at high identity, while sequences from the Sargasso and hypersaline lagoon (bluish-purple) were recruited at moderate identities. (I) Synechococcus CC9902 is distantly related to either of the preceding Synechococcus strains. While this strain also recruits largely the same sequences as the WH8102 and CC9902 strains, they recruit at significantly lower identity. (J–O) Fragment recruitment plots to extreme assemblies seeded with phylogenetically informative sequences. Using this approach it is not only possible to assemble contigs with strong similarities to known genomes but to identify contigs from previously uncultured genomes. In each case a 100-kb segment from an extreme assembly is shown. Each plot shows a distinct pattern of recruitment that distinguishes the panels from each other. (J) Seeded from a Prochlorococcus marinus-related sequence, this contig recruits a broad swath of GOS sequences that correspond to the GOS sequences that form the zigzag on P. marinus MIT9312 recruitment plots (see [B] or Poster S1 for comparison). (K–L) Seeded from SAR11 clones, these contigs show significant synteny to the known P. ubique HTCC1062 genome. (K) is strikingly similar to previous recruitment plots to the HTCC1062 genome (see [A] or Poster S1). In contrast, (L) identifies a different strain that recruits high-identity GOS sequences primarily from the Sargasso Sea samples (red). (M–O) These three panels show recruitment plots to contigs belonging to the uncultured Actinobacter, Roseobacter, and SAR86 lineages. doi:10.1371/journal.pbio.0050077.g002 PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770405 Sorcerer II GOS Expedition PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770406 Sorcerer II GOS Expedition Identifying Genomic Structural Variation with Metagenomic Data Variation in genome structure in the form of rearrange- ments, duplications, insertions, or deletions of stretches of DNA can also be explored via fragment recruitment. The use of mated sequencing reads (pairs of reads from opposite ends of a clone insert) provides a powerful tool for assessing structural differences between the reference and the environ- mental sequences. The cloning and sequencing process determines the orientation and approximate distance be- tween two mated sequencing reads. Genomic structural variation can be inferred when these are at odds with the way in which the reads are recruited to a reference sequence. Relative location and orientation of mated sequences provide a form of metadata that can be used to color-code a fragment recruitment plot (Figure 4). This makes it possible to visually identify and classify structural differences and similarities between the reference and the environmental sequences (Figure 5). For the abundant marine microbes, a high proportion of mated reads in the ‘‘good’’ category (i.e., in the proper orientation and at the correct distance) show that synteny is conserved for a large portion of the microbial population. The strongest signals of structural differences typically reflect a variant specific to the reference genome and not found in the environmental data. In conjunction with the requirement that reads be recruited over their entire length without interruption, recruitment plots result in pronounced recruitment gaps at locations where there is a break in synteny. Other rearrangements can be partially present or penetrant in the environmental data and thus may not generate obvious recruitment gaps. However, given sufficient coverage, breaks in synteny should be clearly identifiable using the recruitment metadata based on the presence of ‘‘missing’’ mates (i.e., the mated sequencing read that was recruited but whose mate failed to recruit; Figure 4). The ratio of missing mates to ‘‘good’’ mates determines how penetrant the rearrangement is in the environmental population. In theory, all genome structure variations that are large enough to prevent recruitment can be detected, and all such rearrangements will be associated with missing mates. Depending on the type of rearrangement present other recruitment metadata categories will be present near the rearrangements’ endpoints. This makes it possible to distin- guish among insertions, deletions, translocations, inversions, and inverted translocations directly from the recruitment plots. Examples of the patterns associated with different rearrangements are presented in Figure 5. This provides a rapid and easy visual method for exploring structural variation between natural populations and sequenced repre- sentatives (Poster S1A and S1B). Genomic Structural Variation in Abundant Marine Microbes Variation in genome structure potentially results in func- tional differences. Of particular interest are those differences between sequenced (reference) microbes and environmental populations. These differences can indicate how representa- tive a cultivated microbe might be and shed light on the evolutionary forces driving change in microbial populations. Fragment recruitment in conjunction with the mate metadata helped us to identify both the consistent and the rare structural differences between the genomes of microbial populations in the GOS data and their closest sequenced relatives. Our analysis has thus far been confined to the three microbial genera that were widespread in the GOS dataset as represented by the finished genomes of P. marinus MIT9312, P. ubique HTCC1062, and to a lesser extent Synechococcus WH8102. Each of these genomes is characterized by large and small segments where little or no fragment recruitment took place. We refer to these segments as ‘‘gaps.’’ These gaps Figure 3. Population Structure and Variation as Revealed by Phylogeny Phylogenies were produced using neighbor-joining. There is significant within-clade variation as well as an absence of strong geographic structure to variants of SAR11 (P. ubique HTCC1062) and P. marinus MIT9312. Similar reads are not necessarily from similar locations, and reads from similar locations are not necessarily similar. (A) Geographic distribution of SAR11 proteorhodopsin variants. Keys to coloration: blue, Pacific; pink, Atlantic. (B) Geographic distribution of Prochlorococcus variants. Keys to coloration: blue, Pacific; pink, Atlantic. (C) Origins of spectral tuning of SAR11 proteorhodopsins. Reads are colored according to whether they contain the L (green) or Q (blue) variant at the spectral tuning residue described in the text. The selection of tuning residue is lineage restricted, but each variant must have arisen on two separate occasions. doi:10.1371/journal.pbio.0050077.g003 Figure 4. Categories of Recruitment Metadata The recruitment metadata distinguishes eight different general catego- ries based on the relative placement of paired end sequencing reads (mated reads) when recruited to a reference sequence in comparison to their known orientation and separation on the clone from which they were derived. Assuming orientation is correct, two mated reads can be recruited closer together, further apart, or within expected distances given the size of the clone from which the sequences were derived. These sequences are categorized as ‘‘short,’’ ‘‘long,’’ or ‘‘good,’’ respectively. Alternately, the mated reads may be recruited in a mis- oriented fashion, which trumps issues of separation. These reads can be categorized as ‘‘normal,’’ ‘‘anti-normal,’’ or ‘‘outie.’’ In addition, there are two other categories. ‘‘No mate’’ indicates that no mated read was available for recruitment, possibly due to sequencing error. Perhaps most useful of any of the recruitment categories, ‘‘missing’’ mates indicate that while a mated sequence was available, it was not recruited to the reference. ‘‘Missing’’ mates identify breaks in synteny between the environmental data and the reference sequence. doi:10.1371/journal.pbio.0050077.g004 PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770407 Sorcerer II GOS Expedition represent reference-specific differences that are not found in the environmental populations rather than a cloning bias that identifies genes or gene segments that are toxic or unclonable in E. coli. The presence of missing mates flanking these gaps indicates that the associated clones do exist, and therefore that cloning issues are not a viable explanation for the absence of recruited reads. Although the reference- specific differences are quite apparent due to the recruitment gaps they generate, there are also sporadic rearrangements associated with single clones, mostly resulting from small insertions or deletions. Careful examination of the unrecruited mates of the reads flanking the gaps allowed us to identify, characterize, and quantify specific differences between the reference genome and their environmental relatives. The results of this analysis for P. ubique and P. marinus have been summarized in Table 5. With few exceptions, small gaps resulted from the insertion or deletion of only a few genes. Many of the genes associated with these small insertions and deletions have no annotated function. In some cases the insertions display a degree of variability such that different sets of genes are found at these locations within a portion of the population. In contrast, many of the larger gaps are extremely variable to the extent that every clone contains a completely unrelated or highly divergent sequence when compared to the reference or to other clones associated with that gap. These segments are hypervariable and change much more rapidly than would be expected given the variation in the rest of the genome. Sites containing a hypervariable segment nearly always contained some insert. We identified two exceptions both associated with P. ubique. The first is approximately located at the 166-kb position in the P. ubique HTCC1062 genome. Though no large gap is present, the mated reads indicate that under many circumstances a highly variable insert is often present. The second is a gap on HTCC1062 that appears between 50 and 90 kb. This gap appears to be less variable than other hypervariable segments and is occasionally absent based on the large numbers of flanking long mated reads (Poster S1A). Interestingly, the long mated reads around this gap seem to be disproportionately from the Sargasso Sea samples, suggesting that this segment may be linked to geographic and/or environmental factors. Thus, hypervariable segments are highly variable even within the same sample, can on occasion be unoccupied, and the variation, or lack thereof, can be sample dependent. Hypervariable segments have been seen previously in a wide range of microbes, including P. marinus [28], but their precise source and functional role, especially in an environ- mental context, remains a matter of ongoing research. For clues to these issues we examined the genes associated with the missing mates flanking these segments and the nucleotide composition of the gapped sequences in the reference genomes. In some rare cases the genes identified on reads that should have recruited within a hypervariable gap were highly similar to known viral genes. For example, a viral integrase was associated with the P. ubique HTCC1062 hypervariable gap between 516 and 561 kb. However, in the majority of cases the genes associated with these gaps were uncharacterized, either bearing no similarity to known genes or resembling genes of unknown function. If these genes were indeed acquired through horizontal transfer then we might expect that they would have obvious compositional biases. Oligonucleotide frequencies along the P. ubique HTCC1062 and Synechococcus WH8102 genomes are quite different in the large recruitment gaps in comparison to the well-represented portions of the genome (Poster S1). Surprisingly, this was less true for P. marinus MIT9312, where the gaps have been linked to phage activity [28]. These results suggest that these hypervariable segments of the genome are widespread among marine microbial populations, and that they are the product of horizontal transfer events perhaps mediated by phage or transposable elements. These results are consistent with and expand upon the hypothesis put forward by Coleman et al. [28] suggesting that these segments are phage mediated, and conflicts with initial claims that the HTCC1062 genome was devoid of genes acquired by horizontal transfer [29]. Figure 5. Fragment Recruitment at Sites of Rearrangements Environmental sequences recruited near breaks in synteny have characteristic patterns of recruitment metadata. Indeed, each of five basic rearrangements (i.e., insertion, deletion, translocation, inversion, and inverted translocation) produced a distinct pattern when examining the recruitment metadata. Here, example recruitment plots for each type of rearrangement have been artificially generated. The ‘‘good’’ and ‘‘no mate’’ categories have been suppressed. In each case, breaks in synteny are marked by the presence of stacks of ‘‘missing’’ mate reads. The presence or absence of other categories distinguishes each type of rearrangement from the others. doi:10.1371/journal.pbio.0050077.g005 PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770408 Sorcerer II GOS Expedition Table 5. Atypical Segments in P. marinus MIT9312 and P. ubique HTCC1062 (SAR11) Reference Genome Begina Endb Size, bp Type of Variantc Description MIT9312 36,401 38,311 2,132 Variable deletion 12 out of 66 clones support simple deletion. Remaining clones show considerable sequence variation amongst themselves. MIT9312 124,448 125,219 771 Variable insertion Associated with ASN tRNA gene; in the environment, half the reads identify pair of small inserts with no similarity to known genes; the other half point to small and large inserts of undetermined nature. MIT9312 233,826 233,910 84 Insertion All clones support small insert (270 bp) with no clear sequence similarity to known genes or sequences. MIT9312 243,296 245,115 424 Variable deletion 24 of 42 clones support simple deletion of hypothetical protein. 13 support slightly larger deletion. Remaining not clearly resolved. MIT9312 296,818 300,888 4,344 Variable deletion 44 clones support deletion of 3,070 bp segment containing 4 genes (3 hypotheti- cal; 1 carbamoyltransferase). 17 clones support alternative sequences with little or no similarity to each other. gMIT9312 342,404 342,662 326 Variable insert In environment is 93% chance of finding deoxyribodopyrimiden photolyase with 7% chance of finding an ABC type Fe3þ siderophore transport system permease component. MIT9312 345,933 365,351 19,418 Hypervariable Very limited similarity among clones indicates that this is a hypervariable seg- ment. Note that it is closely associated with a site-specific integrase/recombinase. MIT9312 551,347 552,025 678 Deletion Two small deletions within a hypothetical protein. MIT9312 617,914 621,556 3,642 Hypervariable About 50% of the clones support small deletions among several hypothetical proteins. Remaining support significant variability suggesting hypervariable seg- ment. At least two of the missed mates contain integrase-like genes. MIT9312 646,340 652,375 6,035 Deletion/Hypervariable Majority of clones support simple deletion of eight hypothetical and hypothetical genes. Small number of clones indicate this may by hypervariable as well. MIT9312 655,241 655,800 559 Deletion Small deletion between hypothetical proteins. MIT9312 665,000 678000 12,000 Deletions Complex set of deletions and replacements that vary with geographic location. MIT9312 665,824 666,380 556 Insertion Small inserted hypothetical protein. MIT9312 670,747 671,933 1,186 Deletion Deletes hypothetical gene. MIT9312 736,266 736,289 23 Insertion All clones support insertion of fructose-bisphosphate aldolase and fructose-1,6-bi- sphosphate aldolase. MIT9312 762,156 762,717 561 Deletion Small deletion between hypothetical proteins. MIT9312 779,006 779,309 303 Insertion Small hypothetical protein inserted. MIT9312 874,349 874,913 564 Insertion Small insertion including gene with similarity to RNA-dependent RNA-polymerase. MIT9312 943,389 946,997 3,608 Variable Several small changes. MIT9312 1,043,129 1,131,874 88,745 Hypervariable MIT9312 1,140,922 1,141,412 490 Insertion Small insertion. MIT9312 1,144,307 1,144,790 483 Insertion Small insertion of several genes. MIT9312 1,155,123 1,156,440 1,317 Deletion Deletes high-light–inducible protein. MIT9312 1,172,609 1,177,292 4,683 Variable Several genes have been replaced or deleted. MIT9312 1,202,643 1,274,335 71,692 Hypervariable MIT9312 1,288,481 1,290,367 1,886 Variable deletion Deletes polysaccharide export-related periplasmic protein (28 out of 55). Other deletions are variable and may include replacement with alternate sequences. MIT9312 1,323,606 1,324,523 917 Deletion Environmental sequences lack a small hypothetical protein. MIT9312 1,369,637 1,369,996 359 Insertion NAD-dependent DNA ligase absent from MIT9312; has possible paralog MIT9312 1,381,273 1,382,049 776 Deletion Small insert in MIT9312 not present in environment MIT9312 1,384,664 1,385,110 446 Deletion Deletes delta(12)-fatty acid dehydrogenase and replaces gene with small (;100 bp) sequence. There is some variation in the exact location and replacement se- quence. MIT9312 1,388,430 1,389,718 1,288 Replacement Segment between two high-light–inducible proteins swapped for different se- quence with no similarity. MIT9312 1,392,865 1,392,976 111 Replacement Small replacement deletes hypothetical gene and replaces it with small, unknown sequence. There is some variation in the precise boundaries of the deletion and in the replacement sequences. MIT9312 1,397,696 1,420,005 22,309 Hypervariable MIT9312 1,486,145 1,487,971 231 Variable Small insertion of dolichyl-phosphate-mannose-protein mannosyltransferase; alter- nately deletes glycosyl transferase (8 out of 56). MIT9312 1,519,810 1,520,860 1,050 Variable deletion Deletes a single hypothetical protein; about half of the deletions contain variable sequences of unknown origin. MIT9312 1,568,049 1,569,121 1,072 Replacement Typically 928-bp portion of MIT9312 replaced by 175-bp stretch in environment; some small amount of variation in environmental replacement sequence (11 out of 51). HTCC1062 50,555 93,942 43,387 Variable deletion Low recruitment segment containing many hypothetical, transporter, and secre- tion genes. HTCC1062 146,074 146,415 341 Deletion Often deleted segment containing DoxD-like and ferredoxin dependent gluta- mate synthase peptide. HTCC1062 166,600 166,700 100 Variable replacement GOS sequences indicate that variable blocks of genes are frequently inserted here. HTCC1062 308,720 309,633 913 Deletion Potential sulfotransferase domain deleted. HTCC1062 339,545 339,951 406 Deletion Deletes a predicted O-linked N-acetylglucosamine transferase. HTCC1062 385,348 386,224 876 Deletion SAM-dependent methyltransferase deleted. PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770409 Sorcerer II GOS Expedition Though insertions and deletions accounted for many of the obvious regions of structural variation, we also looked for rearrangements. The high levels of local synteny associated with P. ubique and P. marinus suggested that large-scale rearrangements were rare in these populations. To inves- tigate this hypothesis we used the recruitment data to examine how frequently rearrangements besides insertions and deletions could be identified. We looked for rearrange- ments consisting of large (greater than 50 kb) inversions and translocations associated with P. marinus; however, we did not identify any such rearrangements that consistently distin- guished environmental populations from sequenced cultivars. Rare inversions and translocations were identified in the dominant subtype associated with MIT9312 (Table 6). Based on the amount of sequence that contributed to the analysis, we estimate that one inversion or translocation will be observed for every 2.6 Mbp of sequence examined (less than once per P. marinus genome). A further observation concerns the uniformity along a genome of the evolutionary history among and within subtypes. For instance, the similarity between GOS reads and P. marinus MIT9312 is typically 85%–95%, while the similarity between MIT9312 and P. marinus MED4 is generally ;10% lower. However, there are several instances where the divergence of MIT9312 and MED4 abruptly decreases to no more than that between the GOS sequences and MIT9312 (Poster S1G). These results are consistent either with horizontal transfer (recombination) or with inhomogeneous Table 6. Six Large-Scale Translocations and Inversions Were Identified in the Abundant P. marinus Subtype Group Low Genome Begin Low Genome End High Genome Begin High Genome End Read ID Low Read Begin Low Read Breakpoint Low Read Strand High Read Breakpoint High Read End High Read Strand Read Length Inversion Sample 1 34,428 34,904 1,536,417 1,536,774 1092255385627 0 476 1 477 834 1 834 No 15 2 607,778 608,467 1,131,368 1,131,621 1093017685727 270 959 0 0 251 1 959 Yes 18 3 618,997 619,375 1,172,217 1,172,728 1092963065572 19 397 1 425 939 1 939 No 17 3 618,997 619,372 1,172,203 1,172,728 1095433012642 0 375 1 403 931 1 941 No 26 3 618,997 619,251 1,172,216 1,172,728 1091140913752 2 256 1 284 799 1 799 No 19 3 618,997 619,331 1,172,280 1,172,728 1641121 2 337 1 365 816 1 816 No 00d 3 619,007 619,375 1,172,223 1,172,728 1092963490951 19 387 1 425 933 1 933 No 17 4 652,933 653,483 1,369,774 1,370,077 200560 325 875 0 0 303 1 875 Yes 00a 4 652,979 653,350 1,369,774 1,370,200 1492647 0 371 1 439 865 0 867 Yes 00d 4 652,979 653,496 1,369,774 1,370,086 1092256128910 10 527 1 595 905 0 906 Yes 15 4 652,979 653,496 1,369,774 1,370,061 1092405979387 14 531 1 599 886 0 886 Yes 25 4 652,979 653,353 1,369,774 1,370,132 1093017637883 2 376 1 444 802 0 802 Yes 18 4 652,980 653,496 1,369,774 1,370,111 1092343389654 6 522 1 591 928 0 928 Yes 25 5 1,049,484 1,049,793 1,172,301 1,172,782 1400802 518 827 0 1 483 0 827 No 00d 6 1,219,993 1,220,519 1,485,834 1,486,222 1092256207885 2 528 1 532 919 1 919 No 17 6 1,219,993 1,220,496 1,485,925 1,486,222 380485 300 803 0 0 296 0 803 No 00a 6 1,219,993 1,220,477 1,485,782 1,486,204 1484583 0 484 1 506 927 1 927 No 00d 6 1,220,005 1,220,388 1,485,733 1,486,216 1682914 507 890 0 2 484 0 913 No 00d doi:10.1371/journal.pbio.0050077.t006 Table 5. Continued. Reference Genome Begina Endb Size, bp Type of Variantc Description HTCC1062 441,074 441,152 78 Deletion Deletes possible methyltransferase FkbM. HTCC1062 516,041 561,604 45,563 Variable replacement Several high identity ‘‘missed’’ mates match phage genes, including an integrase. HTCC1062 660,413 660,978 565 Deletion Deletes hypothetical protein. HTCC1062 675,141 676,399 1,258 Deletion Deletes hypothetical protein. HTCC1062 766,022 768,263 2,241 Deletion Deletes four hypothetical proteins. HTCC1062 814,015 816,386 2,371 Deletion Deletes steroid monoxygenase and short-chain dehydrogenase. HTCC1062 893,450 922,604 29,154 Deletion Deletes large segment including a 7317-aa hypothetical gene. HTCC1062 941,203 942,403 1,200 Deletion Deletes hypothetical protein. HTCC1062 991,461 997,861 6,400 Deletion Deletes several hypotheticals and mix of other genes; adjacent to recombinase. HTCC1062 1,117,126 1,143,483 26,357 Deletion Large number of hypothetical and transporters that are deleted. HTCC1062 1,160,908 1,166,589 5,681 Deletion Deletes a small cluster of peptides with various functions. HTCC1062 1,188,887 1,189,231 344 Deletion Deletes portion of winged helix DNA-binding protein but inserts sequences with similarity to gij71082757 sodium bile symporter family protein found in large gap between 50555 and 93942. aBegin indicates the approximate bp position which marks the beginning of the gap in recruitment. bEnd indicates the approximate bp position which marks the ending of the gap in recruitment. cThe type of change indicates what would have to happen to the reference genome to produce the sequences seen in the environment (e.g., a deletion indicates that the indicated portion of the reference would have to be deleted to generate the variant(s) seen in the environment). doi:10.1371/journal.pbio.0050077.t005 PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770410 Sorcerer II GOS Expedition selectional pressures. Similar patterns are present in the two high-identity subtypes seen on the P. ubique HTCC1062 genome (Poster S1D). Other regions show local increases in similarity between MIT9312 and the dominant subtype that are not reflected in the MIT9312/MED4 divergence (e.g., near positions 50 kb, 288 kb, 730 kb, 850 kb, and 954 kb on MIT9312; also see Poster S1G). These latter regions might reflect either regions of homogenizing recombination or regions of higher levels of purifying selection. However, the lengths of the intervals (several are 10 kb or more) are longer than any single gene and correspond to genes that are not extremely conserved over greater taxonomic distances (in contrast to the ribosomal RNA operon). Equally, if widespread horizontal transfer of an advantageous segment explains these intervals, the transfers occurred long enough ago for appreciable variation to accumulate (unpublished data). Extreme Assembly of Uncultivated Populations The analyses described above have been confined to those organisms with representatives in culture and for which genomes were readily available. Producing assemblies for other abundant but uncultivated microbial genera would provide valuable physiological and biochemical information that could eventually lead to the cultivation of these organisms, help elucidate their role in the marine commun- ity, and allow similar analyses of their evolution and variation such as those performed on sequenced organisms. Previous assembly efforts and the fragment recruitments plots showed that there is considerable and in many cases conflicting variation among related organisms. Such variation is known to disrupt whole-genome assemblers. This led us to try an assembly approach that aggressively resolves conflicts. We call this approach ‘‘extreme assembly’’ (see Materials and Meth- ods). This approach currently does not make use of mate- pairing data and, therefore produces only contigs, not scaffolded sequences. Using this approach, contigs as large as 900 kb could be aligned almost in their entirety to the P. marinus MIT9312 and P. ubique HTCC1062 genomes (Figure 2J–2L). Consistent patterns of fragment recruitment (see below) generally provided evidence of the correctness of contigs belonging to otherwise-unsequenced organisms. Accordingly, large contigs from these alternate assemblies were used to investigate genetic and geographic population structure, as described below. However, the more aggressive assemblies demonstrably suffered from higher rates of assembly artifacts, including chimerism and false consensus sequences (Figure 6). Thus, the more stringent primary assembly was employed for most assembly-based analyses, as manual curation was not practical. As just noted, many of the large contigs produced by the more aggressive assembly methods described above did not align to any great degree with known genomes. Some could be tentatively classified based on contained 16S sequences, but the potential for computationally generated chimerism within the rRNA operon is sufficiently high that inspection of the assembly or other means of confirming such classifica- tions is essential. An alternative to an unguided assembly that facilitates the association of assemblies with known organisms is to start from seed fragments that can be identified as belonging to a particular taxonomic group. We employed fragments outside the ribosomal RNA operon that were mated to a 16S-containing read, limiting extension to the direction away from the 16S operon. This produced contigs of 100 kb or more for several of the ribotypes that were abundant in the GOS dataset. When evaluated via fragment recruitment (Figure 2M–2O), these assemblies revealed patterns analogous to those seen for the sequenced genomes described above: multiple subtypes could be distinguished along the assembly, differing in similarity to the reference sequence and sample distribution, with occasional gaps. Hypervariable segments by definition were not represented in these assemblies, but they may help explain the termi- nation of the extreme assemblies for P. marinus and SAR11 and provide a plausible explanation for termination of assemblies of the other deeply sampled populations as well. This directed approach to assembly can also be used to investigate variation within a group of related organisms (e.g., a 16S ribotype). We explored the potential to assemble Figure 6. Examples of Chimeric Extreme Assemblies (A) Fragment recruitment to an extreme assembly contig indicates the assembly is chimeric between two organisms, based on dramatic shifts in density of recruitment, level of conservation, and sample distribution. (B) Fragment recruitment to a SAR11-related extreme assembly. Changes in color, density, and vertical location toward the top of the figure indicate transitions among multiple subtypes of SAR11. doi:10.1371/journal.pbio.0050077.g006 PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770411 Sorcerer II GOS Expedition distinct subtypes of SAR11 by repeatedly seeding extreme assembly with fragments mated to a SAR11-like 16S sequence. Figure 7 compares the first 20 kb from each of 24 independent assemblies. Eighteen of these segments could be aligned full-length to a portion of the HTCC1062 genome just upstream of 16S, while six appeared to reflect rearrange- ments relative to HTCC1062. The rearranged segments were associated with more divergent 16S sequences (8%–14% diverged from the 16S of HTCC1062), while those without rearrangements corresponded to less divergent 16S (averag- ing less than 3% different from HTCC1062). In each segment, many reads were recruited above 90% identity, but different samples dominated different assemblies. Phylogenetic trees support the inference of evolutionarily distinct subtypes with distinctive sample distributions (Figure 8). Taxonomic Diversity Environmental surveys provide a cultivation-independent means to examine the diversity and complexity of an environmental sample and serve as a basis to compare the populations between different samples. Typically, these surveys use PCR to amplify ubiquitous but slowly evolving genes such as the 16S rRNA or recA genes. These in turn can be used to distinguish microbial populations. Since PCR can introduce various biases, we identified 16S genes directly from the primary GOS assembly. In total, 4,125 distinct full- length or partial 16S were identified. Clustering of these sequences at 97% identity gave a total of 811 distinct ribotypes. Nearly half (48%) of the GOS ribotypes and 88% of the GOS 16S sequences were assigned to ribotypes previously deposited in public databases. That is, more than half the ribotypes in the GOS dataset were found to be novel at what is typically considered the species level [30]. The overall taxonomic distribution of the GOS ribotypes sampled by shotgun sequencing is consistent with previously published PCR based studies of marine environments (Table 7) [31]. A smaller amount (16%) of GOS ribotypes and 3.4% of the GOS 16S sequences diverged by more than 10% from any publicly available 16S sequence, thus being novel to at least the family level. A census of microbial ribotypes allows us to identify the abundant microbial lineages and estimate their contribution Figure 7. Fragment Recruitment Plots to 20-kb Segments of SAR11-Like Contigs Show That Many SAR11 Subtypes, with Distinct Distributions, Can Be Separated by Extreme Assembly Each segment is constructed of a unique set of GOS sequencing reads (i.e., no read was used in more than one segment). Segments are arbitrarily labeled (A–X) for reference in Figure 8. doi:10.1371/journal.pbio.0050077.g007 PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770412 Sorcerer II GOS Expedition Figure 8. Phylogeny of GOS Reads Aligning to P. ubique HTCC1062 Upstream of 16S Gene Indicates That the Extreme Assemblies in Figure 7 Correspond to Monophyletic Subtypes Coloring of branches indicates that the corresponding reads align at .90% identity to the extreme assembly segments shown in Figure 7; colored labels (A–X) correspond to the labels in Figure 7, indicating the segment or segments to which reads aligned. doi:10.1371/journal.pbio.0050077.g008 PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770413 Sorcerer II GOS Expedition to the GOS dataset. Of the 811 ribotypes, 60 contain more than 8-fold coverage of the 16S gene (Table 8); jointly, these 60 ribotypes accounted for 73% of all the 16S sequence data. All but one of the 60 have been detected previously, yet only a few are represented by close relatives with complete or nearly complete genome sequencing projects (see Fragment Recruit- ment for further details). Several other abundant 16S sequences belong to well-known environmental ribotypes that do not have cultivated representatives (e.g., SAR86, Roseobacter NAC-1–2, and branches of SAR11 other than those containing P. ubique). Interestingly, archaea are nearly absent from the list of dominant organisms in these near-surface samples. The distribution of these ribotypes reveals distinct micro- bial communities (Figure 9 and Table 8). Only a handful of the ribotypes appear to be ubiquitously abundant; these are dominated by relatives of SAR11 and SAR86. Many of the ribotypes that are dominant in one or more samples appear to reside in one of three separable marine surface habitats. For example, several SAR11, SAR86, and alpha Proteobacteria, as well as an Acidimicrobidae group, are widespread in the surface waters, while a second niche delineated by tropical samples contains several different SAR86, Synechococcus and Prochlorococcus (both cyanobacterial groups), and a Rhodospir- illaceae group. Other ribotypes related to Roseobacter RCA, SAR11, and gamma Proteobacteria are abundant in the temperate samples but were not observed in the tropical or Sargasso samples. Not surprisingly, samples taken from nonmarine environments (GS33, GS20, GS32), estuaries (GS11, GS12), and larger-sized fraction filters (GS01a, GS01b, GS25) have distinguishing ribotypes. Furthermore, as the complete genomes of these dominant members are obtained, the capabilities responsible for their abundances may well lend insight into the community metabolism in various oceanic niches. Sample Comparisons The most common approach for comparing the microbial community composition across samples has been to examine the ribotypes present as indicated by 16S rRNA genes or by analyzing the less-conserved ITS located between the 16S and 23S gene sequences [7,8,16,17]. However, a clear observation emerging from the fragment recruitment views was that the reference ribotypes recruit multiple subtypes, and that these subtypes were distributed unequally among samples (Figures 2, 7, 8; Poster S1D, S1F, and S1I). We developed a method to assess the genetic similarity between two samples that potentially makes use of all portions of a genome, not just the 16S rRNA region. This similarity measure is assembly independent; under certain circumstances, it is equivalent to an estimate of the fraction of sequence from one sample that could be considered to be in the other sample. Whole-metagenomic similarities were computed for all pairs of samples. Results are presented for comparisons at �98% and 90% identity. No universal cutoff consistently divides sequences into natural subsets, but the 98% identity cutoff provides a relatively high degree of resolution, while the 90% cutoff appears to be a reasonable heuristic for defining subtypes. For instance, a 90% cutoff treats most of the reads specifically recruited to P. marinus MIT9312 as similar (those more similar to MED4 notably excepted), while reasonably separating clades of SAR11 (Figures 7 and 8). Reads with no qualifying overlap alignment to any other read in a pair of samples are uninformative for this analysis, as they correspond to lineages that were so lightly sequenced that their presence in one sample and absence in another may be a matter of chance. For the 90% cutoff, 38% of the sequence reads contributed to the analysis. The resulting similarities reveal clear and consistent group- ings of samples, as well as the outlier status of certain samples (Figures 10 and 11). The broadest contrast was between samples that could be loosely labeled ‘‘tropical’’ (including samples from the Sargasso Sea [GS00b, GS00c, GS00d] and samples that are temperate by the formal definition but under the influence of the Gulf Stream [GS14, GS15]) and ‘‘temperate.’’ Further subgroups can be identified within each of these categories, as indicated in Figures 10 and 11. In some cases, these groupings were composed of samples taken from different ocean basins during different legs of the expedition. A few pairs of samples with strikingly high similarity were observed, including GS17 and GS18, GS23 and GS26, GS27 and GS28, and GS00b and GS00d. In each case, these pairs of samples were collected from consecutive or nearly consecutive samples. However, the same could be said of many other pairs of samples that do not show this same degree of similarity. Indeed, geographi- cally and temporally separated samples taken in the Atlantic (GS17, GS18) and Pacific (GS23, GS26) during separate legs of the expedition are more similar to one another than were most pairs of consecutive samples. The samples with least similarity to any other sample were from unique habitats. Thus, similarity cannot be attributed to geographic separa- tion alone. The groupings described above can be reconstructed from taxonomically distinct subsets of the data. Specifically, the major groups of samples visible in Figure 10 were reproduced when sample similarities were determined based only on fragments recruiting to P. ubique HTCC1062 (unpublished data). Likewise, the same groupings were observed when the fragments recruiting to either HTCC1062 or P. marinus MIT9312, or both, were excluded from the calculations (unpublished data). Thus, the factors influencing sample similarities do not appear to rely solely on the most abundant Table 7. Taxonomic Makeup of GOS Samples Based on 16S Data from Shotgun Sequencing Phylum or Class Fractiona Alpha Proteobacteria 0.32 Unclassified Proteobacteria 0.155 Gamma Proteobacteria 0.132 Bacteroidetes 0.13 Cyanobacteria 0.079 Firmicutes 0.075 Actinobacteria 0.046 Marine Group A 0.022 Beta Proteobacteria 0.017 OP11 0.008 Unclassified Bacteria 0.008 Delta Proteobacteria 0.005 Planctomycetes 0.002 Epsilon Proteobacteria 0.001 aValues shown are averages over all samples. doi:10.1371/journal.pbio.0050077.t007 PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770414 Sorcerer II GOS Expedition Table 8. Most Abundant Ribotypes (97% Identity Clusters) Ribotype Classificationa Depth of Coverageb Range Number of Matching GenBank Entriesc SAR11 Surface 1 581 Widespread 100þ SAR11 Surface 2 182 Sargasso and GS31d 100þ Burkholderia 139 00a 100þ Acidomicrobidae type a 133 Tropical and Sargassod 77 Prochlorococcus 112 Tropical and Sargasso 76 SAR11 Surface 3 109 Widespread 63 SAR86-like type a 108 Widespread 88 Shewanella 80 00a 49 Synechococcus 59 Tropicald 100þ Rhodospirillaceae 50 Tropical and Sargassod 15 SAR86-like type b 47 Hypersaline pondd 24 SAR86-like type c 47 Tropical and Sargasso 75 Chlorobi-like 40 Hypersaline pond 1 Alpha Proteobacteria type a 38 Widespread 10 Roseobacter type a 35 Tropical and Sargassod 42 Cellulomonadaceae type a 34 Hypersaline pond 12 SAR86-like type d 28 Widespread 30 Alpha Proteobacteria type b 27 Widespread 43 SAR86-like type e 26 Widespread 71 Cytophaga type a 24 Tropical and Sargasso 21 Bacteroidetes type a 23 Widespread 17 Bdellovibrionales type a 21 Tropical and Sargasso 36 Acidomicrobidae type b 21 Temperate 41 SAR116-like 21 Widespread 28 Marine Group A type a 20 Tropical and Sargassod 9 Remotely SAR11-like type a 19 Sargasso and tropical 12 Frankineae type a 19 Fresh and estuary 6 Frankineae type b 18 Hypersaline pondd 71 SAR86-like type f 18 Tropical 15 SAR86-like type g 17 Tropical 13 Remotely SAR11-like type b 16 Tropical and Sargasso 4 Gamma Proteobacteria type a 16 Sargasso and GS14 18 Microbacteriaceae 15 Hypersaline pond 1 SAR102/122-like 15 Tropical and Sargasso 15 SAR86-like type h 15 Tropical and Sargasso 27 Bacteroidetes type b 14 Tropicald 14 Remotely SAR11-like type c 14 Tropical and Sargasso 18 SAR86-like type i 14 Sargassod 13 Rhodobium-like type a 14 Sargasso and GS31d 9 Marine Group A type b 13 Tropical and Sargassod 28 Gamma Proteobacteria type b 12 S. temperate and GS31 22 Oceanospirillaceae 12 Mangrove 1 Gamma Proteobacteria type c 12 Widespread 14 SAR11-like type a 12 Temperate 17 SAR11-like type b 11 Fresh and estuary 11 SAR11-like type c 11 Sargasso and GS31d 12 SAR86-like type j 11 Tropical and Sargasso 1 Roseobacter Algicola 11 Widespread 20 Remotely SAR102/122-like 11 Hypersaline pond 0 Frankineae type c 10 Fresh and estuary 11 Rhodobium-like type b 10 Widespread 9 Roseobacter RCA 10 Temperate 39 Remotely SAR11-like type d 10 Tropical 32 Acidobacteria 9 Fresh 4 Remotely SAR11-like type e 9 Tropical and Sargasso 8 Frankineae type d 9 Estuary and fresh 89 SAR11-like type d 9 Sargasso and GS31 4 Methylophilus 9 Temperated 41 Archaea C1 C1a 8 Sargasso and GS31 2 Cytophaga type b 8 Widespread 9 aTaxonomic classifications based on Hugenholtz ARB database. Labels indicate the most specific taxonomic assignment that could be confidently assigned to each ribotype. ‘‘Type a,’’ ‘‘type b,’’ etc., used to arbitrarily discriminate separate 97% ribotypes that would otherwise be given the same name. bNote that the 16S rRNA gene can be multicopy. cMatching GenBank entries required full-length matches at � 98% identity. dLess than 13 coverage outside described range. doi:10.1371/journal.pbio.0050077.t008 PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770415 Sorcerer II GOS Expedition organisms but rather are reflected in multiple microbial lineages. It is tempting to view the groups of similar samples as constituting community types. Sample similarities based on genomic sequences correlated significantly with differences in the environmental parameters (Table 1), particularly water temperature and salinity (unpublished data). Samples that are very similar to each other had relatively small differences in temperature and salinity. However, not all samples that had similar temperature and salinity had high community similarities. Water depth, primary productivity, fresh water input, proximity to land, and filter size appeared consistent with the observed groupings. Other factors such as nutrients and light for phototrophs and fixed carbon/energy for chemotrophs may ultimately prove better predictors, but these results demonstrate the potential of using metagenomic data to tease out such relationships. Examining the groupings in Figure 11 in light of habitat and physical characteristics, the following may be observed. The first two samples, a hypersaline pond in the Galapagos Islands (GS33) and the freshwater Lake Gatun in the Panama Canal (GS20) are quite distinct from the rest. Salinity—both higher and lower than the remaining coastal and ocean samples—is the simplest explanation. Figure 9. Presence and Abundance of Dominant Ribotypes The relative abundance of various ribotypes (rows) in each filter (columns) is represented by the area of the corresponding spot (if any). The listed ribotypes each satisfied the following criteria in at least one filter: the ribotype was among the five most abundant ribotypes detected in the shotgun data, and was represented by at least three sequencing reads. Relative abundance is based on the total number of 16S sequences in a given filter. Order and grouping of filters is based on the clustering of genomic similarity shown in Figure 11. Ribotype order was determined based on similarity of sample distribution. A marked contrast between temperate and tropical groups is visible. Estuarine samples GS11 and GS12 contained a mix of ribotypes seen in freshwater and temperate marine samples, while samples from nonmarine habitats or larger filter sizes were pronounced outliers. The presence of large amounts of Burkholderia and Shewanella in one Sargasso Sea sample (GS00a) makes this sample look much less like other Sargasso and tropical marine samples than it otherwise would. Note that 16S is not a measure of cell abundance since 16S genes can be multicopy. doi:10.1371/journal.pbio.0050077.g009 PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770416 Sorcerer II GOS Expedition Twelve samples form a strong temperate cluster as seen in the similarity matrix of Figure 11 as a darker square bounded by GS06 and GS12. Embedded within the temperate cluster are three subclusters. The first subcluster includes five samples from Nova Scotia through the Gulf of Maine. This is followed by a subcluster of four samples between Rhode Island and North Carolina. The northern subcluster was sampled in August, the southern subcluster in November and Figure 10. Similarity between Samples in Terms of Shared Genomic Content Genomic similarity, as described in the text, is an estimate of the amount of the genetic material in two filters that is ‘‘the same’’ at a given percent identity cutoff—not the amount of sequence in common in a finite dataset, but rather in the total set of organisms present on each filter. Similarities are shown for 98% identity. (A) Hierarchical clustering of samples based on pairwise similarities. (B) Pairwise similarities between samples, represented as a symmetric matrix of grayscale intensities; a darker cell in the matrix indicates greater similarity between the samples corresponding to the row and column, with row and column ordering as in (A). Groupings of similar filters appear as subtrees in (A) and as squares consisting of two or more adjacent rows and columns with darker shading. Colored bars highlight groups of samples described in the text; labels are approximate characterizations rather than being strictly true of every sample in a group. doi:10.1371/journal.pbio.0050077.g010 PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770417 Sorcerer II GOS Expedition December. Though all samples were collected in the top few meters, the southern samples were in shallower waters, 10 to 30 m deep, whereas most of the northern samples were in waters greater than 100 m deep. Monthly average estimates of chlorophyll a concentrations were typically higher in the southern samples as well (Table 1). All of these factors— temperature, system primary production, and depth of the sampled water body—likely contribute to the differences in microbial community composition that result in the two well- defined clusters. The final temperate subgroup includes two estuaries, Chesapeake Bay (GS12) and Delaware Bay (GS11), distinguished by their lower salinity and higher productivity. However, GS11 is markedly similar not only to GS12 but also to coastal samples, whereas the latter appears much more unique. Interestingly, the Bay of Fundy estuary sample (GS06) clearly did not group with the two other estuaries, but rather with the northern subgroup, perhaps reflecting differences in the rate or degree of mixing at the sampling site. Continuing to the right and downward in Figure 11, one can see a large cluster of 25 samples from the tropics and Sargasso Sea, bounded by GS47 and GS00b. This can be further subdivided into several subclusters. The first sub- cluster (a square bounded by GS47 and GS14) includes 14 samples, about half of which were from the Galapagos. The second distinct subcluster (a square bounded by GS16 and GS26) includes seven samples from Key West, Florida, in the Atlantic Ocean to a sample close to the Galapagos Islands in the Pacific Ocean. Loosely associated with this subcluster is a sample from a larger filter size taken en route to the Galapagos (GS25). The remaining samples group weakly with the tropical cluster. GS32 was taken in a coastal mangrove in the Galapagos. The thick organic sediment at a depth of less than a meter is the likely cause for it being unlike the other samples. Sample 00a was from the Sargasso Sea and contained a large fraction of sequence reads from apparently clonal Burkholderia and Shewanella species that are atypical. When this Figure 11. Sample Similarity at 90% Identity Similarity between samples in terms of shared genomic content similar to Figure 10, except that the plots were done using a 90% identity cutoff that has proven reasonable for separating some moderately diverged subtypes doi:10.1371/journal.pbio.0050077.g011 PLoS Biology | www.plosbiology.org March 2007 | Volume 5 | Issue 3 | e770418 Sorcerer II GOS Expedition Table 9. Relative Abundance of TIGRFAMs Associated with a Specific Sample TIGRFAM Number of Peptides Samplea Relative Abundanceb Major Category Minor Category Description TIGR01526 131 GS01a 3.4 Nicotinamide-nucleotide adenylyltransferase TIGR00661 214 GS01a 2.6 Hypothetical Conserved Conserved hypothetical protein TIGR01833 135 GS01a 2 Hydroxymethylglutaryl-CoA synthase TIGR01408 267 GS01b 4.5 Ubiquitin-activating enzyme E1 TIGR01678 144 GS01b 2.6 Sugar 1,4-lactone oxidases TIGR01879 758 GS01b 2.5 Amidase, hydantoinase/carbamoylase family TIGR00890 131 GS01b 2.4 Transport Carbohydrates, Oxalate/Formate Antiporter TIGR01767 112 GS01b 2.4 5-methylthioribose kinase TIGR00101 495 GS01b 2.3 Central Nitrogen Urease accessory protein UreG TIGR01659 186 GS01b 2.2 Sex-lethal family splicing factor TIGR00313 455 GS01b 2.1 Biosynthesis Heme, Cobyric acid synthase CobQ TIGR00317 306 GS01b 2.1 Biosynthesis Heme, Cobalamin 59-phosphate synthase TIGR00601 186 GS01b 2.1 DNA DNA UV excision repair protein Rad23 TIGR01792 904 GS01b 2.1 Central Nitrogen Urease, alpha subunit TIGR00749 483 GS01b 2 Energy Glycolysis/gluconeogenesis Glucokinase TIGR01001 485 GS01b 2 Amino Aspartate Homoserine O-succinyltransferase TIGR02238 165 GS32 5.1 Meiotic recombinase Dmc1 TIGR02239 159 GS32 3.5 DNA repair protein RAD51 TIGR02232 140 GS32 2.6 Myxococcus cysteine-rich repeat TIGR02153 212 GS32 2.5 Protein tRNA Glutamyl-tRNA(Gln) amidotransferase, subunit D TIGR00519 289 GS32 2.4 L-asparaginases, type I TIGR00288 248 GS32 2 Hypothetical Conserved Conserved hypothetical protein TIGR00288 TIGR01681 136 GS32 2 HAD-superfamily phosphatase, subfamily IIIC TIGR02236 143 GS32 2 DNA DNA DNA repair and recombination protein RadA TIGR00028 110 GS33 14.7 Mycobacterium tuberculosis PIN domain family TIGR01550 131 GS33 9.1 Unknown General Death-on-curing family protein TIGR01552 200 GS33 5.3 Mobile Other Prevent-host-death family protein TIGR00143 151 GS33 4.6 Protein Protein [NiFe] hydrogenase maturation protein HypF TIGR01641 131 GS33 4.3 Mobile Prophage Phage putative head morphogenesis protein, SPP1 gp7 family TIGR01710 1,926 GS33 3.9 Cellular Pathogenesis General secretion pathway protein G TIGR01539 251 GS33 3.6 Mobile Prophage Phage portal protein, lambda family TIGR00016 217 GS33 3.5 Energy Fermentation Acetate kinase TIGR00942 117 GS33 3.5 Transport Cations Multicomponent Naþ:Hþ antiporter TIGR01836 174 GS33 3.3 Fatty Biosynthesis Poly(R)-hydroxyalkanoic acid synthase, class III, PhaC subunit TIGR02110 135 GS33 3.2 Biosynthesis Other Coenzyme PQQ biosynthesis protein PqqF TIGR01106 902 GS33 3.1 Energy ATP-proton Na,H/K antiporter P-type ATPase, alpha subunit TIGR01497 726 GS33 3.1 Transport Cations Kþ-transporting ATPase, B subunit TIGR01524 603 GS33 3.1 Transport Cations Magnesium-translocating P-type ATPase TIGR02140 166 GS33 3.1 Transport Anions Sulfate ABC transporter, permease protein CysW TIGR01409 122 GS33 3 Tat (twin-arginine translocation) pathway signal sequence TIGR02195 282 GS33 3 Cell Biosynthesis Lipopolysaccharide heptosyltransferase II TIGR01522 696 GS33 2.9 Calcium-transporting P-type ATPase, PMR1-type TIGR01523 766 GS33 2.9 Potassium/sodium efflux P-type ATPase, fungal-type TIGR00202 228 GS33 2.8 Regulatory RNA Carbon storage r