Genome Sequencing with Big Data

DNA sequencing has significantly impacted healthcare and biotechnology by greatly accelerating medical and biological research. It has even become indispensable in areas like forensic science. Modern sequencing technology extends these benefits to diagnostics, to early detection of genetic predisposition to disease, and to agriculture and livestock breeding and processing. The possibilities seem endless, hampered only by limitations in computing throughput, speed, scalability, and resolution. This changed with the advent of next-generation sequencing (NGS).

NGS has enabled scientists and researchers to maximize the potential of sequencing data by searching and comparing billions of DNA fragments. This processing, however, produces massive amounts of data that pose challenges for storage, analysis, management, and sharing. Unless these issues are addressed, researchers face a new infrastructure bottleneck.

Whiteklay looked for a solution to address these issues and improve the quality and speed of NGS data handling. It developed the BioDek distributed system, which combines the post-processing of DNA reads after sequencing, which prepares the NGS data for further analysis, with the post-processing steps of sequence alignment. BioDek addressed the challenge in two steps. First, it converted the FASTQ and FASTA genome formats to a Sequence Alignment/Map (SAM) file, a fundamental post-processing step in nearly all applications of deep sequencing technologies. Second, it improved the quality and reduced the amount of stored data by removing duplicate reads.
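To make the two steps concrete, here is a minimal Python sketch. It is an illustration only, not BioDek's implementation, which the article does not describe. The function names (`parse_fastq`, `drop_duplicate_reads`, `to_unmapped_sam`) are hypothetical, duplicates are assumed to be reads with an identical base sequence, and the SAM lines are written as unmapped records because a real conversion would run an aligner against a reference genome first.

```python
import io
from typing import Iterable, Iterator, NamedTuple


class Read(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(handle: Iterable[str]) -> Iterator[Read]:
    """Yield reads from 4-line FASTQ records: @name, bases, '+', qualities."""
    lines = (line.rstrip("\n") for line in handle)
    for header in lines:
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"malformed FASTQ header: {header!r}")
        sequence = next(lines, None)
        separator = next(lines, None)
        quality = next(lines, None)
        if quality is None or not separator.startswith("+"):
            raise ValueError(f"truncated or malformed record: {header!r}")
        # Keep only the read identifier, dropping any trailing description.
        yield Read(header[1:].split()[0], sequence, quality)


def drop_duplicate_reads(reads: Iterable[Read]) -> Iterator[Read]:
    """Keep the first read for each distinct sequence (assumed definition
    of a duplicate; production tools usually compare aligned positions)."""
    seen = set()
    for read in reads:
        if read.sequence in seen:
            continue
        seen.add(read.sequence)
        yield read


def to_unmapped_sam(read: Read) -> str:
    """Format a read as one SAM line with the 11 mandatory fields:
    QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL.
    FLAG 4 marks the read as unmapped; no alignment is performed here."""
    fields = [read.name, "4", "*", "0", "0", "*", "*", "0", "0",
              read.sequence, read.quality]
    return "\t".join(fields)


if __name__ == "__main__":
    sample = "@r1 demo\nACGT\n+\nIIII\n@r2\nACGT\n+\nHHHH\n@r3\nTTGA\n+\nIIII\n"
    for read in drop_duplicate_reads(parse_fastq(io.StringIO(sample))):
        print(to_unmapped_sam(read))
```

In this sample, `r2` carries the same bases as `r1`, so only `r1` and `r3` are written out. Holding every sequence in an in-memory set would not scale to billions of reads; a distributed system would partition this work across nodes, which is the kind of workload Hadoop is meant for.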

The BioDek environment covered such tasks as annotating sequence data, browsing annotations mapped to a reference genome, and comparing and analyzing genomic sequences. Whiteklay initially tested BioDek using Apache Hadoop 1.x on a different server architecture, but subsequently deployed CDH (Cloudera's Hadoop distribution) running on servers based on the Intel Xeon processor E5-2680 v2. The migration to CDH on Intel architecture increased BioDek's performance by managing server resources more efficiently, speeding up the whole process of converting FASTQ or FASTA genome formats to SAM. This resulted in a 30-percent performance improvement that cut the time to analyze NGS data by almost 50 percent. The resulting increase in productivity lets scientists and researchers accomplish more sequencing in less time, cutting operating costs and increasing ROI by allowing more sequences to be processed on the same infrastructure.

