Sunday, November 29, 2009

Bioinformatics and cloud computing

From the “Using clouds for parallel computations in systems biology” workshop at the recent SC09 conference (Informatics Iron writeup) to last month’s Genome Informatics meeting, everyone in bioinformatics is talking about cloud computing these days. Last week Steven Salzberg’s group published a paper on their Crossbow tool entitled “Searching for SNPs with cloud computing” (Cloudera blog post on Crossbow). In the paper, the authors describe how they analyzed the human sequence data published last year by BGI using Amazon EC2. Specifically, they developed an alignment (Bowtie) and SNP detection (SOAPsnp) pipeline that is executed in parallel across a cluster using the Hadoop framework (a free software implementation of Google’s MapReduce framework). Using a 40-node, 320-core EC2 cluster, they were able to analyze 38× coverage sequence data in about three hours. The whole analysis, including data transfer and storage on Amazon S3, cost about $125. You can find a more detailed cost breakdown and comparison in Gary Stiehr’s HPCInfo post and more detail on the SNP detection in Dan Koboldt’s Mass Genomics post.

For analyzing a single genome, you really can’t beat that price. Of course, at the rate next-generation sequencing instruments are generating data, most people are not going to want to analyze just one genome. So the question becomes: what is the break-even point? That is, how many genomes do you have to sequence to make buying compute resources cheaper than renting them from Amazon? We currently estimate that the fully loaded (node, chassis, rack, networking, etc.) cost of a single computational core is about $500. Thus, purchasing 320 cores would cost you about $160,000. At $125 per genome in the cloud, it would take a lot of genomes (1,280) to hit that break-even point. But do you really need to analyze a genome in three hours? With the current per-run throughput of a single Illumina GA IIx, it would take about four ten-day runs (40 days) to generate 38× coverage of a human genome. After each run, you could align the sequence data from that run. Each lane of data would take 8-12 core·hours to align, so a whole run’s (eight lanes’) worth of data would take about 80 core·hours. Therefore, even if you had just one core, you could align all the data well before the next ten-day (240-hour) run completed. The consensus calling and variant detection portions of the pipeline typically take a handful of core·hours and therefore do not change the economics; they too can be completed before the first run of the next genome is completed. Thus, with a $500 investment in computational resources, you can more than keep pace with the Illumina instrument. Note that I am completely excluding the cost of storage, as that will be needed for the data and results regardless of where the computation is done. Of course, you probably wouldn’t buy just one core.
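The throughput argument above can be checked with a few lines of arithmetic. This is only a sketch using the estimates quoted in this post (eight lanes per run, 8-12 core·hours per lane, ten-day runs, four runs per genome); the function and constant names are made up for illustration, and none of the figures are measurements.

```python
# Back-of-the-envelope check: can one core keep up with one sequencer?
# Every figure below is an estimate quoted in the post, not a benchmark.

LANES_PER_RUN = 8                # lanes on one Illumina GA IIx flow cell
CORE_HOURS_PER_LANE = (8, 12)    # low and high alignment estimates per lane
RUN_DAYS = 10                    # wall-clock duration of one run
RUNS_PER_GENOME = 4              # runs needed for roughly 38x human coverage


def alignment_core_hours(per_lane_hours):
    """Core-hours needed to align every lane of one run."""
    return LANES_PER_RUN * per_lane_hours


def core_utilization(per_lane_hours):
    """Fraction of a single core kept busy aligning one run's data
    during the wall-clock time the next run takes to finish."""
    wall_clock_hours = RUN_DAYS * 24
    return alignment_core_hours(per_lane_hours) / wall_clock_hours


low, high = (alignment_core_hours(h) for h in CORE_HOURS_PER_LANE)
days_per_genome = RUNS_PER_GENOME * RUN_DAYS

print(f"Alignment per run: {low}-{high} core-hours")
print(f"Single-core utilization: "
      f"{core_utilization(CORE_HOURS_PER_LANE[0]):.0%}-"
      f"{core_utilization(CORE_HOURS_PER_LANE[1]):.0%}")
print(f"Sequencing time per genome: {days_per_genome} days")
```

Even at the high end of the per-lane estimate, a single core is busy well under half the time, which is the sense in which one core "more than keeps pace" with the instrument.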
Checking over at the Dell Higher Education web site, you can get a Quad Core Precision T3500n with 4 GiB of RAM (more RAM per core than the Amazon EC2 Extra Large Instance used in the paper) and 750 GB local storage capacity (about the same storage per core as the Extra Large Instance) for $1700. You would need less than one core (25%) of that workstation’s capacity dedicated to alignment and variant detection on data from a single Illumina GA IIx (thanks to Burrows-Wheeler Transform aligners like bowtie and bwa). Using the single-core numbers, the break-even point for purchase versus cloud is fewer than five whole genomes. Using the entire cost of the Dell workstation (even though you require less than 25% of its computational capacity), the break-even point is about 14 genomes. It would take about 1.5 years (about half the expected life of IT hardware) at current throughput to sequence 14 genomes with a single Illumina GA IIx. At data rates expected in January 2010, it would take less than a year to break even.
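For anyone who wants to plug in their own prices, the break-even arithmetic can be sketched as follows. The dollar figures are the ones quoted in this post ($125 per genome in the cloud, $500 per core, $1700 for the workstation, 40 days of sequencing per genome); the helper name `break_even_genomes` is purely illustrative, and storage is excluded as in the text.

```python
import math

# All prices in USD, taken from the estimates in the post.
CLOUD_COST_PER_GENOME = 125.0   # EC2 compute plus S3 transfer and storage
COST_PER_CORE = 500.0           # fully loaded cost of one purchased core
WORKSTATION_COST = 1700.0       # quad-core workstation quote
DAYS_PER_GENOME = 40            # four ten-day GA IIx runs


def break_even_genomes(hardware_cost, cloud_cost=CLOUD_COST_PER_GENOME):
    """Number of genomes at which buying hardware costs the same as renting."""
    return hardware_cost / cloud_cost


scenarios = {
    "320-core cluster": 320 * COST_PER_CORE,
    "single core": COST_PER_CORE,
    "whole workstation": WORKSTATION_COST,
}

for name, cost in scenarios.items():
    print(f"{name}: ${cost:,.0f} -> break even at "
          f"{break_even_genomes(cost):.1f} genomes")

# Time for one sequencer to produce enough genomes to pay off the workstation.
genomes_needed = math.ceil(break_even_genomes(WORKSTATION_COST))
years_to_break_even = genomes_needed * DAYS_PER_GENOME / 365
print(f"{genomes_needed} genomes take about {years_to_break_even:.1f} years "
      f"on one instrument")
```

Changing `CLOUD_COST_PER_GENOME` or `DAYS_PER_GENOME` shows how quickly the break-even point moves as cloud pricing or instrument throughput changes.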

These numbers indicate that unless you are just sequencing a few genomes, you are probably better off purchasing a (possibly single-node) cluster. With the proliferation of sequencing applications and publications in the last couple of years, not many researchers will fall into the “few genomes” bin. Our experience has been that the more sequencing data people get, the more they want. Another way to look at this is that the entire cost of the computational hardware for analysis (<$1700) is less than 1% of the sequencing instrument cost; or that the computational cost to analyze a whole genome (<$500) is less than 1% of the total data generation costs (reagents, flow cells, instrument depreciation, technician time, etc.). None of this is to say that there is no place for cloud and other distributed computing frameworks in bioinformatics, but that’s the topic of a future post.
(source URL, Via PolITiGenomics.)

Crossbow: NGS Informatics in the Cloud

Just online at Genome Biology is a new paper from the Steven Salzberg lab (UMD) on searching for SNPs with cloud computing.  Using $85 of computing time rented from Amazon’s EC2, Langmead et al processed an entire human genome - 3.3 billion reads totaling 38x coverage - in three hours.




The “Cloud” Can Be Nebulous


Cloud computing is a term bandied about often these days.  What it boils down to is this:  places with huge banks of computers (Providers, e.g. Amazon) rent out processing time to people who need it (Users).  The “cloud” refers to a software layer between providers and users that acts like a virtual operating system - it loads any software needed by the user, and also provides an access point for running highly parallelized tasks on the cluster. Next-gen sequencing data is well suited to this kind of processing, since a large NGS dataset can usually be broken into smaller subsets (e.g. Illumina lanes) and processed at the same time on different computers, without affecting the results.


Map, Sort, Reduce


Crossbow - the cloud computing software featured in this publication - cleverly breaks down the analysis into a series of map, sort, and reduce steps.  It takes a large sequencing dataset, breaks the reads into subsets, and maps them to the human genome using Bowtie (map).  Then, it divides the 3.2 gigabase human genome into 1,600 non-overlapping 2-megabase partitions and assigns every mapped read to a bin (sort).  The SNP caller, in this case SOAPsnp, is applied to each of these smaller bins rather than to the entire genome (reduce).
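To make the map, sort, and reduce steps concrete, here is a toy sketch in plain Python. It is not Crossbow's code and does not use Hadoop, Bowtie, or SOAPsnp: `toy_align` and `toy_call` are hypothetical stand-ins (an exact-match search against a tiny made-up reference, and a simple read count), and the demo uses an 8-base bin size instead of 2 megabases so the binning is visible.

```python
from collections import defaultdict

BIN_SIZE = 2_000_000  # 2-megabase partitions, as described in the post


def map_reads(reads, align):
    """Map: align each read independently; drop reads that do not align."""
    for read in reads:
        hit = align(read)
        if hit is not None:
            chrom, pos = hit
            yield chrom, pos, read


def sort_into_bins(alignments, bin_size=BIN_SIZE):
    """Sort: assign each alignment to a non-overlapping genomic bin
    and order the reads inside every bin by position."""
    bins = defaultdict(list)
    for chrom, pos, read in alignments:
        bins[(chrom, pos // bin_size)].append((pos, read))
    for members in bins.values():
        members.sort()
    return dict(bins)


def reduce_bins(bins, call):
    """Reduce: run the caller on each bin separately. The bins are
    independent, so a cluster could process them in parallel."""
    return {key: call(members) for key, members in sorted(bins.items())}


# --- Toy stand-ins for the real tools (assumptions for illustration) ---
TOY_REFERENCE = {"chr1": "ACGTACGTTTGACCAGT"}


def toy_align(read):
    """Exact-match lookup in the toy reference; returns (chrom, pos) or None."""
    for chrom, seq in TOY_REFERENCE.items():
        pos = seq.find(read)
        if pos >= 0:
            return chrom, pos
    return None


def toy_call(members):
    """Placeholder for a SNP caller: just counts the reads in the bin."""
    return len(members)


reads = ["ACGTACG", "TTGACC", "CCAGT", "GGGGG"]
binned = sort_into_bins(map_reads(reads, toy_align), bin_size=8)
result = reduce_bins(binned, toy_call)
print(result)
```

The point of the design is that neither the map step (per read) nor the reduce step (per bin) needs to see the whole dataset, which is what lets the work spread across many machines.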


The Need for Parallelization


The CHB dataset is ~3.3 billion reads, with an average read length of 35 bp.  Even with Bowtie’s multi-threading and incredible speed, this massive dataset would take months to process on a single computer.  However, the authors divided the input reads into smaller subsets and aligned them in parallel, then processed the 2-Mbp genome “bins” in parallel as well.  Throw all of these parallel tasks at Amazon’s Elastic Compute Cloud (EC2), and it eats them up.  The high-performance EC2 cluster (40 nodes, each with 8 CPUs and 7 GB of RAM) finished all of the tasks in about 3 hours.


Digging into the Numbers


There are a couple of inconsistencies in the numbers that need to be ironed out.  For example, the BGI study reported 36X coverage from 3.3 billion reads (2.02 billion single-end, 658 million paired-end), whereas Langmead et al downloaded 2.7 billion reads from the “YanHuang Site” and noted that it represented 38X coverage.  Where did that extra 2X come from?  Langmead et al do cite the Nature paper by Wang et al, and I believe it’s the same dataset.


At first I was concerned that the Salzberg group had only downloaded the mapped reads and run them, which would have been a biased test of alignment performance.  However, I don’t believe this is the case.  Instead, I believe they meant to say that they’d downloaded 2.02 billion single-end reads, and they’d also downloaded 657 million read pairs (1.314 billion paired-end reads).  This would yield the correct total of 3.3 billion reads.  I realize this is nitpicky.
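The reconciliation above is easy to check. The sketch below assumes my reading is right, namely that the 657 million figure counts read pairs rather than individual reads; the variable names are just for illustration.

```python
# Read counts as interpreted in the paragraph above (an assumption,
# not something stated explicitly in either paper).
SINGLE_END_READS = 2_020_000_000   # unpaired reads
READ_PAIRS = 657_000_000           # paired-end fragments, two reads each

# Counting each pair once gives the figure quoted by Langmead et al.
counting_pairs_once = SINGLE_END_READS + READ_PAIRS

# Counting both mates of every pair gives the total in the BGI study.
counting_both_mates = SINGLE_END_READS + 2 * READ_PAIRS

print(f"Pairs counted once:  {counting_pairs_once / 1e9:.1f} billion reads")
print(f"Both mates counted:  {counting_both_mates / 1e9:.1f} billion reads")
```

Under that assumption the "2.7 billion" and "3.3 billion" figures describe the same dataset, counted two different ways.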


More of a concern, and hopefully less nitpicky, are the SNP calling numbers.  Langmead et al reported over 21% more SNPs (3.73 million) than BGI did (3.07 million) on the same dataset, and attributed the difference to less stringent filtering.  Yet both groups used the same SNP caller, so is it possible that the Bowtie alignment, not the SNP filters, was responsible for what we presume are false positives?  This is an important question that Heng Li and others are already considering.


Whole-genome Sequencing Analysis for the Masses


I like the Salzberg group because they’re all about the small lab, about putting NGS processing capabilities into the hands of people without substantial computing resources.  Bowtie made it possible to map a lane of Illumina/Solexa data in a few hours, using only a laptop with 4 GB of RAM.  Now, Crossbow lets anyone with $85 in their budget run entire WGS datasets on borrowed (or rented) CPU time.  There’s no need to purchase, maintain, or continuously upgrade expensive computing hardware.  Even the storage space can be rented (e.g. from Amazon S3, which the authors used).  It is literally now possible for someone to analyze an entire human genome while sitting with their laptop at the local coffee house.


References

Ben Langmead, Michael C. Schatz, Jimmy Lin, Mihai Pop and Steven L. Salzberg (2009). Searching for SNPs with cloud computing Genome Biology, 10 (R134) : doi:10.1186/gb-2009-10-11-r134


(source URL, Via MassGenomics.)

Saturday, November 21, 2009

Whole Genome Distribution and Ethnic Differentiation of Copy Number Variation in Caucasian and Asian Populations

Although copy number variation (CNV) has recently received much attention as a form of structure variation within the human genome, knowledge is still inadequate on fundamental CNV characteristics such as occurrence rate, genomic distribution and ethnic differentiation. In the present study, we used the Affymetrix GeneChip® Mapping 500K Array to discover and characterize CNVs in the human genome and to study ethnic differences of CNVs between Caucasians and Asians. Three thousand and nineteen CNVs, including 2381 CNVs in autosomes and 638 CNVs in X chromosome, from 985 Caucasian and 692 Asian individuals were identified, with a mean length of 296 kb. Among these CNVs, 190 had frequencies greater than 1% in at least one ethnic group, and 109 showed significant ethnic differences in frequencies (p<0.01). After merging overlapping CNVs, 1135 copy number variation regions (CNVRs), covering approximately 439 Mb (14.3%) of the human genome, were obtained. Our findings of ethnic differentiation of CNVs, along with the newly constructed CNV genomic map, extend our knowledge on the structural variation in the human genome and may furnish a basis for understanding the genomic differentiation of complex traits across ethnic groups.


(source URL, Via Genetics and Genomics.)

Tip of the Week: SwissVar, a New Genotype-phenotype Resource from SIB


Today’s tip is on a new genotype/phenotype resource from the Swiss Institute of Bioinformatics, or SIB. I was already a fan of many SIB tools and resources, and was using one (ENZYME) when I found a notice about SwissVar. SwissVar is described as ‘a portal to Swiss-Prot diseases and variants.’ It includes information about genotype-phenotype relationships for each specific variant, manually annotated from literature. Manual annotation adds a level of quality and believability to this data. The SwissVar portal also contains various pre-computed information that may aid in determining the effect of the variant. Genotype-phenotype searches can begin with either Medical Subject Headings, or MeSH terms (Disease), gene or protein names (General characteristics) or variants (Functional/structural features). There are multiple ways to modify your searches, and results are clean tables of data including gene/protein accessions, names, links to MeSH definitions and links to variation reports.

If your research could benefit from high quality, manually curated genotype/phenotype information, I suggest you watch this tip, and then explore SwissVar according to your own interests.

SwissVar – a Portal to Swiss-Prot Diseases and Variants: http://www.expasy.ch/swissvar/

(source URL, Via The OpenHelix Blog.)

Friday, November 20, 2009

De novo sequencing of plant genomes using second-generation technologies

The ability to sequence the DNA of an organism has become one of the most important tools in modern biological research. Until recently, the sequencing of even small model genomes required substantial funds and international collaboration. The development of ‘second-generation’ sequencing technology has increased the throughput and reduced the cost of sequence generation by several orders of magnitude. These new methods produce vast numbers of relatively short reads, usually at the expense of read accuracy. Since the first commercial second-generation sequencing system was produced by 454 Technologies and commercialised by Roche, several other companies including Illumina, Applied Biosystems, Helicos Biosciences and Pacific Biosciences have joined the competition. Because of the relatively high error rate and lack of assembly tools, short-read sequence technology has mainly been applied to the re-sequencing of genomes. However, some recent applications have focused on the de novo assembly of these data. De novo assembly remains the greatest challenge for DNA sequencing and there are specific problems for second generation sequencing which produces short reads with a high error rate. However, a number of different approaches for short-read assembly have been proposed and some have been implemented in working software. In this review, we compare the current approaches for second-generation genome sequencing, explore the future direction of this technology and the implications for plant genome research.


(source URL, Via Briefings in Bioinformatics - recent issues.)

Bioinformatics approaches for genomics and post genomics applications of next-generation sequencing

Technical advances such as the development of molecular cloning, Sanger sequencing, PCR and oligonucleotide microarrays are key to our current capacity to sequence, annotate and study complete organismal genomes. Recent years have seen the development of a variety of so-called ‘next-generation’ sequencing platforms, with several others anticipated to become available shortly. The previously unimaginable scale and economy of these methods, coupled with their enthusiastic uptake by the scientific community and the potential for further improvements in accuracy and read length, suggest that these technologies are destined to make a huge and ongoing impact upon genomic and post-genomic biology. However, like the analysis of microarray data and the assembly and annotation of complete genome sequences from conventional sequencing data, the management and analysis of next-generation sequencing data requires (and indeed has already driven) the development of informatics tools able to assemble, map, and interpret huge quantities of relatively or extremely short nucleotide sequence data. Here we provide a broad overview of bioinformatics approaches that have been introduced for several genomics and functional genomics applications of next-generation sequencing.


(source URL, Via Briefings in Bioinformatics - Advance Access.)

ENCODE whole-genome data in the UCSC Genome Browser

The Encyclopedia of DNA Elements (ENCODE) project is an international consortium of investigators funded to analyze the human genome with the goal of producing a comprehensive catalog of functional elements. The ENCODE Data Coordination Center at The University of California, Santa Cruz (UCSC) is the primary repository for experimental results generated by ENCODE investigators. These results are captured in the UCSC Genome Bioinformatics database and download server for visualization and data mining via the UCSC Genome Browser and companion tools (Rhead et al. The UCSC Genome Browser Database: update 2010, in this issue). The ENCODE web portal at UCSC (http://encodeproject.org or http://genome.ucsc.edu/ENCODE) provides information about the ENCODE data and convenient links for access.


(source URL, Via NAR - Advance Access.)