High Performance Computing
Microsoft Customer Solution

Discovering Genetic Variations and Improving Lives with Windows High Performance Computing Clusters
Published: May 2003

Solution Overview

Customer Profile
Perlegen Sciences conducts genetics research and develops products that impact and improve people’s lives. Through extensive research, the company has identified and validated millions of genetic variations in humans. Perlegen uses these markers to identify genetic factors associated with disease states and drug metabolism. It can interrogate these markers in thousands of individuals at an unprecedented level of resolution, making whole genome scanning a reality.

Business Situation
Perlegen’s Bioinformatics organization uses a Microsoft Windows-based High Performance Computing cluster to analyze individual variations in human genome data. Perlegen provides a cost-effective way for drug companies to research and develop treatments for a variety of diseases, helping to improve the lives of millions of people who suffer from them.

Background
Perlegen Sciences is a privately held company founded in 2000 to conduct genetics research and develop therapeutic and diagnostic products that impact and improve people's lives. Perlegen has identified and validated millions of genetic variations in humans using high-density microarray technology. These variations, which occur in about 0.1% of the sequence that comprises human DNA, are known as single nucleotide polymorphisms, or SNPs (pronounced “snips”). They are responsible for the traits that distinguish one individual from another, including differences in disease susceptibility and variations in drug metabolism that can impact the effectiveness of therapeutic treatments.
Many of today’s most debilitating and costly illnesses, including heart disease, diabetes, cancer, and migraines, have a significant genetic component. Perlegen’s technology will provide researchers with new insights into such diseases, and new tools for crafting effective therapies.

Perlegen combines this information about the natural genetic variations with high-density microarray whole genome scans to compare millions of genetic variations in thousands of individuals at an unprecedented level of resolution. This makes whole genome scanning of patient populations a cost-effective tool in determining the genetic factors involved in disease and drug response. Based on this technology platform, Perlegen has developed partnerships with leading pharmaceutical companies to conduct ongoing research into how variations in genes and regulatory sequences are associated with disease, drug response, and other traits.

Many common diseases have a genetic basis. In order to better understand how these diseases work and how they may be treated, detailed analysis of massive amounts of genomic data is required.

Solution
Perlegen uses a Microsoft® Windows®-based High Performance Computing (HPC) cluster and Microsoft SQL Server™ 2000 Laboratory Information Management System (LIMS) to manage the experiments and process the voluminous genomic data.

Benefits
Perlegen’s HPC cluster uses commodity hardware, the Windows operating system, and Microsoft® .NET applications, providing a low-cost solution using high-productivity development tools.
Through these collaborations, Perlegen is accelerating the discovery and development of pharmaceutical and diagnostic products by enabling:

• Discovery of novel potential drug targets and markers that predict drug response
• Prioritization of drug targets for further development
• Stratification of clinical trial participants for drug efficacy and side effect susceptibility
• Expansion of use for drugs already on the market
• Development of new pharmaceutical products and diagnostic tests

Scenario
High Performance Computing

Solution Components
Hardware: Dell Intel-based dual-processor compute nodes.
Software: Microsoft Windows 2000 Server, SQL Server 2000, Microsoft .NET development environment, and custom application software.

Solution
In order to conduct such studies, Perlegen first had to locate these variations in a representative human population. In its SNP discovery effort, Perlegen performed full genome scans of the DNA of 50 unique individuals, nearly ten times the genetic content analyzed by Celera and the Human Genome Project in generating the draft of the Human Genome. It did so using high-density oligonucleotide microarray wafers and a Microsoft® Windows® 2000-based compute cluster that processed the information on the wafers. Perlegen was able to complete this activity in less than eighteen months, under budget and ahead of schedule.

With the discovery effort nearing completion and a robust data processing and analysis facility in place, the informatics team turned its attention to the development of an enhanced Laboratory Information Management System (LIMS) that would form the foundation for its genotyping and association studies.
Perlegen’s approach to the implementation of informatics systems that support both of these efforts is based on a development philosophy that includes:

• Exploitation of commodity hardware and software components and low-cost distributed computing;
• Use of industry-standard RDBMS technology and database-centric applications;
• Use of current enterprise software development tools and methodologies; and
• Development of highly leveraged applications employing both native client and web client architectures.

“This is the architectural blueprint that provides the context for Perlegen’s development efforts,” according to Bruce Moxon, Director of Bioinformatics at Perlegen. “The novel approaches and unprecedented scale of our operations led us to select development tools and implementation platforms that would allow us to best leverage rich software development frameworks and concentrate on the challenging problems before us.”

Database-centric application development is a key theme that differentiates Perlegen from many bioinformatics organizations and provides significant leverage in the day-to-day management and analysis of large datasets. The database-centric approach greatly facilitates real-time tracking, monitoring, and reporting, not just of laboratory activities but also of the complex analytic tasks and their results.

Perlegen’s Laboratory Information Management System (LIMS) employs a SQL Server 2000 database and Microsoft .NET desktop and web applications. Microsoft’s Data Transformation Services (DTS) are used to publish experimental metadata from the LIMS to the Oracle analytics database. Perlegen’s distributed computing pipeline executes and manages tens of thousands of application tasks per day on its Windows 2000 compute cluster. These Windows applications post results directly to the analytics database using OLE DB through ADO and Microsoft ADO.NET.
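The database-centric pattern described above, in which each compute task writes its outcome straight to a database and reporting is a matter of ordinary SQL, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: Python's built-in sqlite3 module stands in for the analytics database and the ADO/ADO.NET access layer, and the table, column, and wafer names are hypothetical rather than Perlegen's schema.

```python
import sqlite3

# In-memory SQLite stands in for the analytics database; the table and
# column names are hypothetical, chosen only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE task_results (
        task_id    INTEGER PRIMARY KEY,
        wafer_id   TEXT NOT NULL,
        run_date   TEXT NOT NULL,     -- ISO date on which the task finished
        status     TEXT NOT NULL,     -- 'ok' or 'error'
        bytes_read INTEGER NOT NULL
    )
    """
)


def post_result(conn, wafer_id, run_date, status, bytes_read):
    """Called by a compute task when it finishes: record the outcome
    directly in the database rather than in a node-local log file."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT INTO task_results (wafer_id, run_date, status, bytes_read) "
            "VALUES (?, ?, ?, ?)",
            (wafer_id, run_date, status, bytes_read),
        )


def daily_summary(conn):
    """Per-day operational metrics computed with a plain SQL aggregate."""
    return conn.execute(
        """
        SELECT run_date,
               COUNT(*) AS tasks,
               SUM(CASE WHEN status = 'error' THEN 1 ELSE 0 END) AS errors,
               SUM(bytes_read) AS bytes_read
        FROM task_results
        GROUP BY run_date
        ORDER BY run_date
        """
    ).fetchall()


# Three hypothetical tasks reporting in over two days.
post_result(conn, "W001", "2003-01-06", "ok", 8_000_000_000)
post_result(conn, "W002", "2003-01-06", "error", 1_000_000_000)
post_result(conn, "W003", "2003-01-07", "ok", 8_000_000_000)

summary = daily_summary(conn)
for run_date, tasks, errors, bytes_read in summary:
    print(run_date, tasks, errors, bytes_read)
```

Because every task reports to one shared table, a trend view is a single GROUP BY query, and its result set could be handed to any standard reporting tool.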
“With simple [Microsoft] SQL queries and standard reporting tools, such as Microsoft Excel PivotTable® views, Perlegen is able to quickly obtain current and historical (trend) views of a wide range of operational metrics,” stated Pascual Starink, Manager of Laboratory Informatics at Perlegen. “This standardized approach to high throughput informatics enables us to provide a production-oriented set of services for our internal and external research partners.”

Much of Perlegen’s analysis requires processing of very large datasets, typically with a wide range of data subsetting and reporting requirements. Perlegen employs techniques and tools that have been developed in support of commercial VLDB (very large database) and Data Warehousing and Mining systems, including dimensional modeling and parallel ETL approaches, to leverage the state of the art in large-scale data management.

“Increasingly, bioinformatics organizations in genomics and proteomics companies are realizing the value of using commercial tools and approaches to enterprise data management challenges,” says Mr. Moxon. “The leverage that such tools provide is critical in converting new technologies and protocols in early phase biotechnology companies into scalable, replicable production biology.”

Solution Details
Perlegen’s computing infrastructure was developed around the database-centric distributed computing model. This model uses scalable commercial relational database technology in conjunction with commodity Network Attached Storage (NAS) technology, Linux and Windows-based distributed computing “farms”, and Gigabit-over-copper networking to support the required range of computing activities. This allows Perlegen to scale its computational and data management infrastructure incrementally to meet the needs of its internal R&D teams and of its growing set of partners and customers.
Dell Intel-based compute nodes are configured into a centrally managed distributed computing farm that is used both to process Perlegen-generated wafer data and to provide more traditional sequence analysis and annotation capabilities. The rack-mount dual-processor nodes can be configured and managed remotely, providing a scalable computing infrastructure that affords capacity on demand.

Perlegen’s Laboratory Information Management System (LIMS) is used to acquire, track, manage, and monitor information associated with laboratory operations in its experimental studies. This includes:

• Experiment scheduling;
• Sample and reagent acquisition and tracking, and management of inventories;
• Instrument and environment operational monitoring and data collection; and
• Chain-of-custody tracking and electronic enforcement of Standard Operating Procedures (SOPs).

The LIMS includes modules supporting secure web-based remote data entry and data access of blinded study data, wireless handheld Pocket PCs with integrated barcode scanners, and a unique lab workflow engine that allows new activities and protocols to be quickly brought online. Microsoft Data Transformation Services (DTS) are used to automatically publish experimental data from the LIMS to a multi-terabyte Oracle analytic database. The LIMS was developed with the Microsoft .NET development environment, using Microsoft Visual C#® and the Microsoft SQL Server™ 2000 database. It is currently in operation in support of Perlegen’s Genotyping and Disease Association collaborations.

The processing of the microarray wafer data is a high throughput application, requiring both massive datasets and extensive computation. The Windows-based compute farm is managed by Perlegen’s Production Computing Task Management System. This system provides for scheduling of compute tasks (Windows applications) that process the data on the compute cluster and store results to the analytic database.
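One way to picture a task management system of this kind is as a task table that compute nodes poll for work. The sketch below is an illustration under stated assumptions, not a description of Perlegen's actual system: sqlite3 stands in for the production database, and the table layout, the priority column, the node names, and the command strings are all hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the production task database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tasks (
        task_id  INTEGER PRIMARY KEY,
        command  TEXT NOT NULL,
        priority INTEGER NOT NULL,             -- higher runs first (assumed policy)
        status   TEXT NOT NULL DEFAULT 'queued',
        node     TEXT,
        error    TEXT
    )
    """
)


def submit(conn, command, priority):
    """Queue a compute task and return its id."""
    with conn:
        cur = conn.execute(
            "INSERT INTO tasks (command, priority) VALUES (?, ?)",
            (command, priority),
        )
    return cur.lastrowid


def claim_next(conn, node):
    """Give the highest-priority queued task to a node, or return None.
    A multi-node deployment would need the database to make this
    select-then-update step atomic; a single connection suffices here."""
    with conn:
        row = conn.execute(
            "SELECT task_id, command FROM tasks WHERE status = 'queued' "
            "ORDER BY priority DESC, task_id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE tasks SET status = 'running', node = ? WHERE task_id = ?",
            (node, row[0]),
        )
    return row


def finish(conn, task_id, error=None):
    """Record the outcome of a task; an error message marks it as failed."""
    status = "failed" if error else "done"
    with conn:
        conn.execute(
            "UPDATE tasks SET status = ?, error = ? WHERE task_id = ?",
            (status, error, task_id),
        )


def failed_tasks(conn):
    """List only the tasks that need an operator's attention."""
    return conn.execute(
        "SELECT task_id, node, error FROM tasks "
        "WHERE status = 'failed' ORDER BY task_id"
    ).fetchall()


# Hypothetical workload: the wafer scan outranks the annotation batch.
low = submit(conn, "annotate batch-17", priority=1)
high = submit(conn, "process-wafer W042", priority=5)

first = claim_next(conn, "node-01")
finish(conn, first[0], error="input file missing")
second = claim_next(conn, "node-02")
finish(conn, second[0])

failures = failed_tasks(conn)
idle = claim_next(conn, "node-03")
```

Keeping the queue in a database means scheduling state, task outcomes, and cluster status are all visible to the same SQL-based reporting tools.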
The Task Management System employs a database-centric execution and monitoring component that affords policy-based prioritization, management by exception and immediate notification in case of application error, and cluster status and monitoring (including trending) using standard SQL-based reporting tools.

Benefits
As of January 2003, the system has been in operation for nearly eighteen months, processing and tracking over 100 terabytes of information. At peak processing, system throughput exceeded 500 GB a day, accomplished through the execution and management of over 15,000 computational tasks daily. During this time, nearly 6,000 high-density oligonucleotide arrays (wafers) were scanned and analyzed. Each of these wafers consists of 60 million individual DNA probes and generates a little over 8 GB of raw data when scanned.

Perlegen has benefited greatly in its effort from the effective use of Microsoft technologies, including Windows compute clusters, SQL Server 2000, and the .NET development environment. Microsoft .NET is software for connecting people, information, systems, and devices. Some of the key benefits include:

• Outstanding overall system reliability and availability;
• Low total cost of ownership (TCO);
• Scalability to support growing business requirements; and
• Rapid application development and deployment.

“Perlegen’s informatics infrastructure has enabled unprecedented insight into human genetic variations,” states Perlegen’s Chief Information Officer, Greg Brandeau. “We will use this knowledge to explore the genetic cause of disease and drug response so that ultimately we can make a difference in people’s lives.”

Conclusion
Perlegen Sciences’ mission is to improve lives through better understanding of the molecular basis of disease and drug response. Perlegen’s approach employs whole genome scanning, a powerful technology that generates large amounts of data and requires sophisticated data management and analysis capabilities.
Perlegen has been successful in meeting its aggressive research and business goals through the development of a world-class informatics infrastructure based on commercial information technologies, including key components from Microsoft.

For More Information
For more information about Microsoft High Performance Computing, go to: http://www.microsoft.com/hpc

For more information about Microsoft products and services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Information Centre at (877) 568-2495. Customers who are deaf or hard-of-hearing can reach Microsoft text telephone (TTY/TDD) services at (800) 892-5234 in the United States or (905) 568-9641 in Canada. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information using the World Wide Web, go to: http://www.microsoft.com/

© 2003 Microsoft Corporation. All rights reserved. This case study is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Microsoft, PivotTable, Visual C#, and Windows are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners.