Big Data in Genomics and Cancer Treatment

Tanya Maslyanko

Big data. These are two words the world has been hearing a lot lately, in connection with a wide array of use cases: social media, government regulation, auto insurance, retail targeting, and so on. A concept that deserves the same (if not more) recognition, however, is the presence of big data in human genome research. Three billion base pairs make up the DNA present in humans. It is safe to say that such a massive amount of data should be organized in a useful way, especially if it presents the possibility of eliminating cancer.

Cancer treatment has been around since the first documented case in Egypt (circa 1500 BC), when people began distinguishing between malignant and benign tumors and learning to remove them surgically. It is intriguing, and scientifically useful, to look at how far the world's knowledge of cancer has progressed since that time, and at the role big data (along with its management and analysis) plays in the search for a cure.

The most concerning issue with cancer, and the ultimate reason it still has not been completely cured, is that it mutates differently in every individual and interacts in unexpected ways with each person's genetic makeup. Professionals and researchers in oncology must accept that each patient requires personalized treatment and medication to manage the specific type of cancer that they have. Elaine Mardis, PhD, co-director of the Genome Institute at the Washington University School of Medicine, believes it is essential to identify the mutations at the root of each tumor and to map their genetic evolution in order to make progress in the battle against cancer: "Genome analysis can play a role at multiple time points during a patient's treatment, to identify 'driver' mutations in the tumor genome and to determine whether cells carrying those mutations have been eliminated by treatment." (A toy sketch of this kind of before-and-after comparison appears below.)

However, the extensive amount of data that comes with the analysis of human genetics requires a stable structure and organization if researchers and scientists are to make sense of it all and relate it to the necessary medical care. Many companies have recently been developing their own compilations of genomic data to sort and analyze. This is a significant step forward, but to bring this data to its full potential, companies could benefit from Apache Hadoop as a data platform, letting it store and sort the massive inflow of information that keeps growing with new research.

For instance, MediSapiens, a Finnish company, hosts the world's largest unified gene expression database and provides software that allows oncologists to cross-reference 19,000 genes (as well as 40 tissue types and 70 cancer types) across more than 20,000 patients; the second sketch below illustrates this kind of query. New research advancements are presented through quarterly data updates, which include molecular profile data selection (the most recent, relevant gene expression data), clinical data curation (data annotations and validity analysis), and data unification (publication of journals).

Nevertheless, simply storing this information is not enough to support the organization and comparison of scientific findings that continue to develop today. The cost of sequencing a genome has dropped from roughly $1 million in 2007 to about $1,000 in 2012, which allows for an incredibly large increase in sequencing activity and data.
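To make the longitudinal analysis Mardis describes concrete, here is a toy sketch of comparing a tumor's mutation set before and after treatment. All variants shown are placeholders; real pipelines work from VCF files and model allele frequencies, but the shape of the comparison is the same.

```python
# Toy before/after comparison of tumor variant calls, in the spirit of the
# longitudinal genome analysis Mardis describes. Variants are represented
# as (chromosome, position, ref, alt) tuples; all values are placeholders.
pre_treatment = {
    ("chr17", 7578406, "C", "T"),   # placeholder TP53-region variant
    ("chr12", 25398284, "C", "A"),  # placeholder KRAS-region variant
}
post_treatment = {
    ("chr17", 7578406, "C", "T"),
}

persisting = pre_treatment & post_treatment  # still detected after therapy
cleared = pre_treatment - post_treatment     # no longer detected

print(f"{len(persisting)} mutation(s) persist, {len(cleared)} cleared")
```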
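And here is a minimal sketch of the kind of cross-referencing a unified gene expression database like MediSapiens's makes possible. The table layout and column names are hypothetical, not MediSapiens's actual schema or API:

```python
# A minimal sketch (hypothetical schema) of cross-referencing gene
# expression across tissue and cancer types.
import pandas as pd

# Hypothetical long-format table: one row per (patient, gene) measurement.
expression = pd.DataFrame({
    "patient_id":  ["P001", "P001", "P002", "P002", "P003"],
    "gene":        ["TP53", "BRCA1", "TP53", "BRCA1", "TP53"],
    "tissue":      ["breast", "breast", "lung", "lung", "breast"],
    "cancer_type": ["ductal carcinoma", "ductal carcinoma",
                    "adenocarcinoma", "adenocarcinoma", "ductal carcinoma"],
    "expression":  [4.2, 7.9, 3.1, 6.5, 4.8],  # e.g. log2-normalized units
})

# Cross-reference a single gene across tissue and cancer types.
tp53 = expression[expression["gene"] == "TP53"]
print(tp53.groupby(["tissue", "cancer_type"])["expression"].mean())
```

A long-format table like this scales the same group-by query to thousands of genes and tens of thousands of patients.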
Although it is comforting that genome studies have become so financially accessible, this actually creates a problem for the efficient management of genomic datasets. At Hadoop Summit 2010, Jeremy Bruestle of Spiral Genetics, Inc. spoke about how Hadoop could help solve the challenge of big datasets in genomics: Hadoop supports parallelization, offers good composability, and maps genomics problems naturally onto MapReduce (a minimal sketch of this pattern appears at the end of this section). According to Bruestle, assembly and annotation could become significantly less complicated. There is a clear need for Hadoop in genomic studies and progress.

In "Making sense of cancer genomic data," Lynda Chin et al. explain that genome analysis has already developed into something extraordinary, leading to new cancer therapy targets and to discoveries about certain cancer mutations and the medical responses they require. They also point out that these discoveries need to be handled more effectively: "For one, these large-scale genome characterization efforts involve generation and interpretation of data at an unprecedented scale which has brought into sharp focus the need for improved information technology infrastructure and new computational tools to render the data suitable for meaningful analysis." This presents the perfect opportunity for Hadoop.

Fortunately, quite a few groups have tapped into the useful power of Hadoop. It started at the University of Maryland in 2009, where Michael Schatz released his CloudBurst software on Hadoop, which specialized in mapping sequence data to reference genomes. Since then, many software tools have been developed that focus particularly on genome analysis. Crossbow, for example, is a software pipeline that runs components like the short read aligner Bowtie and the genotyper SoapSNP on a Hadoop cluster. The UNC-CH Lineberger Bioinformatics Group has also stated that it uses Hadoop in its high-throughput sequencing services for computational analysis. Hadoop-BAM is another tool: a library that works with the BAM (Binary Alignment/Map) format and uses MapReduce to perform functions like genotyping and peak calling (the BAM sketch below shows the kind of records it distributes). Deepak Singh, the principal product manager at Amazon Web Services, said, "We've definitely seen an uptake in adopting Hadoop in the life sciences community, mostly targeting next-generation sequencing, and simple read mapping because what [developers] discovered was that a number of bioinformatics problems transferred very well to Hadoop, especially at scale." On top of sequencing, Hadoop has also sparked interest at pharmaceutical companies because it takes the tedium and worry out of data formatting, allowing these companies to focus their efforts (and money) on building hypotheses from their collected data.

Together, the worlds of bioinformatics and big data are joining forces to conjure up innovative ways to spread knowledge about personalized cancer treatments. For example, NantWorks is working with Verizon to develop the Cancer Knowledge Action Network, built on a cloud database, which will allow doctors to easily access protocols for specific cancer medicines and treatments. Dr. Patrick Soon-Shiong of NantWorks stated, "Our goal is to turn this data into actionable information at the point of care, enabling better care through mobile devices in hospitals, clinics and homes." In essence, this network would be a self-learning health care system, kept current through continuous reassessment of information.
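To make Bruestle's point about mapping genomics problems onto MapReduce concrete, here is a minimal Hadoop Streaming job: a k-mer counter, which is essentially word count over DNA reads. The file names and the simplified one-read-per-line input format are assumptions for illustration, not anything Bruestle or Spiral Genetics shipped.

```python
#!/usr/bin/env python3
# mapper.py -- a minimal Hadoop Streaming mapper. Assumes input lines are
# raw DNA reads, one per line (a simplification; real pipelines parse
# FASTQ records). Emits (k-mer, 1) pairs.
import sys

K = 21  # k-mer length; a common choice in assembly pipelines

for line in sys.stdin:
    read = line.strip().upper()
    for i in range(len(read) - K + 1):
        print(f"{read[i:i + K]}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sums counts per k-mer. Hadoop sorts mapper output by key,
# so all occurrences of a given k-mer arrive consecutively.
import sys

current, total = None, 0
for line in sys.stdin:
    if not line.strip():
        continue
    kmer, count = line.rstrip("\n").split("\t")
    if kmer != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = kmer, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")
```

Locally the pair can be tested with `cat reads.txt | ./mapper.py | sort | ./reducer.py`; on a cluster it would be launched with something like `hadoop jar .../hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input reads -output kmer_counts`. Hadoop handles the splitting, shuffling, and sorting that make the reducer's sequential scan valid, which is exactly the composability Bruestle highlighted.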
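As for Hadoop-BAM, the records it distributes are ordinary BAM alignments. The single-machine sketch below uses pysam (a real library, though unrelated to Hadoop-BAM itself) to show what those records look like; "sample.bam" is a placeholder filename.

```python
# Read alignment records from a BAM file and tally mapped reads per
# chromosome. Hadoop-BAM's contribution is splitting this same record
# stream across MapReduce workers; this loop is the one-machine analogue.
import pysam

coverage_by_chrom = {}
with pysam.AlignmentFile("sample.bam", "rb") as bam:  # placeholder path
    for read in bam:
        if read.is_unmapped:
            continue
        chrom = read.reference_name
        coverage_by_chrom[chrom] = coverage_by_chrom.get(chrom, 0) + 1

for chrom, n in sorted(coverage_by_chrom.items()):
    print(f"{chrom}\t{n} mapped reads")
```

Parallelizing exactly this kind of per-record work is what lets tasks like genotyping and peak calling scale across a cluster.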
Big data bioinformatics projects like CloudBurst and the Cancer Knowledge Action Network are placing doctors and scientists at the center of cancer treatment research and development. Oncologists can now access the information they need on the spot to make medical decisions, and possibly save lives by evaluating and removing tumors before they spread. The momentum that big data gains every day has enabled impressive advances in the world's health. The key is to continue down this path knowledgeably and efficiently, so that upcoming research is used to its utmost advantage.