High Performance Computing
Microsoft Customer Solution
Solution Overview
Customer Profile
Discovering Genetic Variations and
Improving Lives with Windows High
Performance Computing Clusters
Published: May 2003
Perlegen Sciences conducts
genetics research and develops
products that impact and improve
people’s lives. Through extensive
research, the company has
identified and validated millions of
genetic variations in humans.
Perlegen uses these markers to
identify genetic factors associated
with disease states and drug
metabolism. It can interrogate
these markers in thousands of
individuals at an unprecedented
level of resolution, making whole
genome scanning a reality.
Business Situation
Perlegen’s Bioinformatics organization uses a Microsoft Windows-based
High Performance Computing cluster to analyze individual variations in
human genome data. Perlegen provides a cost-effective way for drug
companies to research and develop treatments for a variety of diseases,
helping to improve the lives of millions of people who suffer from them.
Background
Perlegen Sciences is a privately held company founded in 2000 to conduct genetics
research and develop therapeutic and diagnostic products that impact and improve
people's lives. Perlegen has identified and validated millions of genetic variations in
humans using high density microarray technology. These variations, which occur in
about 0.1% of the sequence that comprises human DNA, are known as single
nucleotide polymorphisms, or SNPs (pronounced “snips”). They are responsible for the
traits that distinguish one individual from another – including differences in disease
susceptibility and variations in drug metabolism that can impact the effectiveness of
therapeutic treatments. Many of today’s most debilitating and costly illnesses, including
heart disease, diabetes, cancer, and migraines, have a significant genetic component.
Perlegen’s technology will provide researchers with new insights into such diseases
and new tools for crafting effective therapies.
Perlegen combines this information about the natural genetic variations with high
density microarray whole genome scans to compare millions of genetic variations in
thousands of individuals at an unprecedented level of resolution. This makes whole
genome scanning of patient populations a cost-effective tool in determining the genetic
factors involved in disease and drug response.
Based on this technology platform, Perlegen has developed partnerships with leading
pharmaceutical companies to conduct ongoing research into how variations in genes
and regulatory sequences are associated with disease, drug response and other traits.
Many common diseases have a
genetic basis. In order to better
understand how these diseases
work and how they may be
treated, detailed analysis of
massive amounts of genomic data
is required.
Solution
Perlegen uses a Microsoft®
Windows®-based High
Performance Computing (HPC)
cluster and Microsoft SQL
Server™ 2000 Laboratory
Information Management System
(LIMS) to manage the
experiments and process the
voluminous genomic data.
Benefits
Perlegen’s HPC cluster uses
commodity hardware, the
Windows operating system, and
Microsoft® .NET applications,
providing a low cost solution using
high productivity development
tools.
Scenario
High Performance Computing
Through these collaborations, Perlegen is accelerating the discovery and development
of pharmaceutical and diagnostic products by enabling:
• Discovery of novel potential drug targets and markers which predict drug
response
• Prioritization of drug targets for further development
• Stratification of clinical trial participants for drug efficacy and side effect
susceptibility
• Expansion of use for drugs already on the market
• Development of new pharmaceutical products and diagnostic tests
Solution Components
Hardware
Dell Intel-based dual
processor compute nodes.
Software
Microsoft Windows 2000
Server, SQL Server 2000,
Microsoft .NET development
environment, and custom
application software.
Application Software
Solution
In order to conduct such studies, Perlegen first had to locate these variations in a
representative human population. In its SNP discovery effort, Perlegen performed full
genome scans of the DNA of 50 unique individuals – nearly ten times the genetic
content analyzed by Celera and the Human Genome Project in generating the draft of
the Human Genome. It did so using high density oligonucleotide microarray wafers and
a Microsoft® Windows® 2000-based compute cluster used to process the information
on the wafers. Perlegen was able to complete this activity in less than eighteen months
– under budget and ahead of schedule.
With the discovery effort nearing completion and a robust data processing and analysis
facility in place, the informatics team turned its attention to the development of an
enhanced Laboratory Information Management System (LIMS) that would form the
foundation for its genotyping and association studies.
Perlegen’s approach to the implementation of informatics systems that support both of
these efforts is based on a development philosophy that includes:
• Exploitation of commodity hardware and software components and low-cost
distributed computing;
• Use of industry-standard RDBMS technology and database-centric applications;
• Use of current enterprise software development tools and methodologies; and
• Development of highly leveraged applications employing both native client and
web client architectures.
“This is the architectural blueprint that provides the context for Perlegen’s development
efforts,” according to Bruce Moxon, Director of Bioinformatics at Perlegen. “The novel
approaches and unprecedented scale of our operations led us to select development
tools and implementation platforms that would allow us to best leverage rich software
development frameworks and concentrate on the challenging problems before us.”
Database-centric application development is a key theme that differentiates Perlegen
from many bioinformatics organizations and provides significant leverage in the day-to-day management and analysis of large datasets. The database-centric approach
greatly facilitates real-time tracking, monitoring, and reporting – not just of laboratory
activities, but also of the complex analytic tasks and their results.
Perlegen’s Laboratory
Information Management
System (LIMS) employs a
SQL Server 2000 database
and Microsoft .NET desktop
and web applications.
Microsoft’s Data
Transformation Services
(DTS) are used to publish
experimental metadata from
the LIMS system to the Oracle
analytics database.
Perlegen’s distributed
computing pipeline executes
and manages tens of
thousands of application tasks
per day on its Windows 2000
compute cluster. These
Windows applications post
results directly to the analytics
database using OLEDB
through ADO and Microsoft
ADO.NET.
“With simple [Microsoft] SQL queries and standard reporting tools, such as Microsoft
Excel PivotTable® views, Perlegen is able to quickly obtain current and historical
(trend) views of a wide range of operational metrics,” stated Pascual Starink, Manager
of Laboratory Informatics at Perlegen. “This standardized approach to high throughput
informatics enables us to provide a production-oriented set of services for our internal
and external research partners.”
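The kind of operational query described above can be sketched as follows. This is a minimal illustration only: it uses Python's built-in sqlite3 module as a stand-in for the production databases, and the task_log table, its columns, and the sample rows are hypothetical rather than Perlegen's actual schema. The query produces one row per day with completed and failed task counts, which is the shape of data a PivotTable or trend chart would consume.

```python
import sqlite3


def daily_status_counts(conn):
    """Return (run_date, completed, failed) rows, one per day, oldest first."""
    query = """
        SELECT run_date,
               SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) AS completed,
               SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) AS failed
        FROM task_log
        GROUP BY run_date
        ORDER BY run_date
    """
    return conn.execute(query).fetchall()


# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_log ("
    "task_id INTEGER PRIMARY KEY, run_date TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO task_log (run_date, status) VALUES (?, ?)",
    [
        ("2003-01-06", "completed"),
        ("2003-01-06", "completed"),
        ("2003-01-06", "failed"),
        ("2003-01-07", "completed"),
    ],
)

trend = daily_status_counts(conn)
for run_date, completed, failed in trend:
    print(run_date, completed, failed)
```

Because the metrics live in ordinary relational tables, the same result set could be pulled into any standard reporting tool without custom code.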
Much of Perlegen’s analysis requires processing of very large datasets – typically with a
wide range of data subsetting and reporting requirements. Perlegen employs
techniques and tools that have been developed in support of commercial VLDB (very
large database) and Data Warehousing and Mining systems, including dimensional
modeling and parallel ETL approaches, to leverage the state of the art in large-scale
data management.
“Increasingly, bioinformatics organizations in genomics and proteomics companies are
realizing the value of using commercial tools and approaches to enterprise data
management challenges,” says Mr. Moxon. “The leverage that such tools provide is
critical in converting new technologies and protocols in early phase biotechnology
companies into scalable, replicable production biology.”
Solution Details
Perlegen’s computing infrastructure was developed around the database-centric
distributed computing model. This model utilizes scalable commercial relational
database technology in conjunction with commodity Network Attached Storage (NAS)
technology, Linux and Windows-based distributed computing “farms”, and Gbit-over-copper networking to effectively support the required range of computing activities.
This affords Perlegen the ability to incrementally scale its computational and data
management infrastructure to meet the needs of its internal R&D teams, and of its
growing set of partners and customers. Dell Intel-based compute nodes are configured
into a centrally managed distributed computing farm that is used both to process
Perlegen-generated wafer data and to provide more traditional sequence analysis and
annotation capabilities. The rack-mount dual-processor nodes can be configured and
managed remotely, providing a scalable computing infrastructure that affords capacity
on demand.
Perlegen’s Laboratory Information Management System (LIMS) is used to acquire,
track, manage, and monitor information associated with laboratory operations in its
experimental studies. This includes: experiment scheduling; sample and reagent
acquisition and tracking and management of inventories; instrument and environment
operational monitoring and data collection; and chain-of-custody tracking and electronic
enforcement of Standard Operating Procedures (SOPs). The LIMS system includes
modules supporting secure web-based remote data entry and data access of blinded
study data, wireless handheld Pocket PCs with integrated barcode scanners, and a
unique lab workflow engine that allows new activities and protocols to be quickly
brought online. Microsoft Data Transformation Services (DTS) are used to
automatically publish experimental data from the LIMS to a multi-terabyte Oracle
analytic database. The LIMS was developed with the Microsoft .NET development
environment, using Microsoft Visual C#® and the Microsoft SQL Server™ 2000
database. It is currently in operation in support of Perlegen’s Genotyping and Disease
Association collaborations.
The processing of the microarray wafer data is a high throughput application, involving
both massive datasets and extensive computation. The Windows-based compute farm
is managed by Perlegen’s Production Computing Task Management System. This
system provides for scheduling of compute tasks (Windows applications) that process
the data on the compute cluster and store results to the analytic database. It employs a
database-centric execution and monitoring component that affords policy-based
prioritization, management by exception and immediate notification in case of
application error, and cluster status and monitoring (including trending) using standard
SQL-based reporting tools.
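A database-centric task manager of the general kind described above might look like the following sketch. It is not Perlegen's Production Computing Task Management System: the tasks table, the "higher priority runs first" policy, the handler names, and the notify callback are all assumptions made for illustration, and sqlite3 stands in for the production database.

```python
import sqlite3


def create_queue(conn):
    # Hypothetical schema: one row per compute task.
    conn.execute(
        """CREATE TABLE tasks (
               task_id  INTEGER PRIMARY KEY,
               name     TEXT NOT NULL,
               priority INTEGER NOT NULL,              -- higher runs first (assumed policy)
               status   TEXT NOT NULL DEFAULT 'pending',
               error    TEXT
           )"""
    )


def submit(conn, name, priority):
    cur = conn.execute(
        "INSERT INTO tasks (name, priority) VALUES (?, ?)", (name, priority)
    )
    return cur.lastrowid


def claim_next(conn):
    """Mark the highest-priority pending task as running and return it."""
    row = conn.execute(
        "SELECT task_id, name FROM tasks WHERE status = 'pending' "
        "ORDER BY priority DESC, task_id ASC LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE tasks SET status = 'running' WHERE task_id = ?", (row[0],))
    return row


def run_one(conn, handlers, notify):
    """Run one task; record the outcome in the database and notify on error."""
    task = claim_next(conn)
    if task is None:
        return False
    task_id, name = task
    try:
        handlers[name]()
        conn.execute(
            "UPDATE tasks SET status = 'completed' WHERE task_id = ?", (task_id,)
        )
    except Exception as exc:
        conn.execute(
            "UPDATE tasks SET status = 'failed', error = ? WHERE task_id = ?",
            (str(exc), task_id),
        )
        notify(task_id, name, str(exc))  # management by exception
    return True


def failing_task():
    raise RuntimeError("scan file missing")


conn = sqlite3.connect(":memory:")
create_queue(conn)
handlers = {"analyze_wafer": lambda: None, "bad_task": failing_task}
submit(conn, "analyze_wafer", 1)
submit(conn, "bad_task", 5)

alerts = []
while run_one(conn, handlers, lambda *args: alerts.append(args)):
    pass

statuses = conn.execute("SELECT name, status FROM tasks ORDER BY task_id").fetchall()
print(statuses)
print(alerts)
```

Because every state change is a row update, cluster status and error trends can be read back with plain SQL. This sketch runs a single worker; with many compute nodes, the claim step would need to be made atomic (for example, inside a transaction) so that two nodes cannot claim the same task.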
Benefits
As of January 2003, the system has been in operation for nearly eighteen months,
processing and tracking over 100 terabytes of information. At peak processing,
system throughput exceeded 500 GB a day, accomplished through the execution and
management of over 15,000 daily computational tasks. During this time, nearly 6,000
high density oligonucleotide arrays (wafers) were scanned and analyzed. Each of
these wafers consists of 60 million individual DNA probes and generates a little over 8
GB of raw data when scanned.
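The figures quoted above can be related to one another with some back-of-the-envelope arithmetic. The sketch below uses the rounded values from the text (6,000 wafers, 8 GB per wafer, 60 million probes, 500 GB a day, 15,000 tasks a day) and assumes decimal units (1 GB = 10^9 bytes), so the results are rough estimates rather than reported measurements.

```python
GB = 10**9   # decimal gigabyte (assumption; the source does not specify)
TB = 10**12  # decimal terabyte (assumption)

# Rounded figures taken from the text.
wafers_scanned = 6000
raw_bytes_per_wafer = 8 * GB
probes_per_wafer = 60_000_000
peak_bytes_per_day = 500 * GB
peak_tasks_per_day = 15_000

# Total raw scan data across all wafers.
total_raw_tb = wafers_scanned * raw_bytes_per_wafer / TB

# Raw data generated per DNA probe.
bytes_per_probe = raw_bytes_per_wafer / probes_per_wafer

# Average data handled per task on a peak day.
mb_per_task = peak_bytes_per_day / peak_tasks_per_day / 10**6

# Peak daily throughput expressed in wafer-sized units of raw data.
wafer_equivalents_per_day = peak_bytes_per_day / raw_bytes_per_wafer

print(f"Raw scan data: about {total_raw_tb:.0f} TB")
print(f"Raw data per probe: about {bytes_per_probe:.0f} bytes")
print(f"Average per task at peak: about {mb_per_task:.1f} MB")
print(f"Peak day: about {wafer_equivalents_per_day:.1f} wafer-equivalents")
```

On these assumptions, the raw scans alone come to roughly 48 TB, a little under half of the more than 100 terabytes the system is said to have processed and tracked.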
Perlegen has benefited greatly in its effort from the effective use of Microsoft
technologies, including Windows compute clusters, SQL Server 2000, and the .NET
development environment. Microsoft .NET is software for connecting people,
information, systems, and devices. Some of the key benefits include:
• Outstanding overall system reliability and availability
• Low total cost of ownership (TCO)
• Scalability to support growing business requirements
• Rapid application development and deployment
“Perlegen’s informatics infrastructure has enabled unprecedented insight into human
genetic variations,” states Perlegen’s Chief Information Officer, Greg Brandeau. “We
will use this knowledge to explore the genetic cause of disease and drug response so
that ultimately we can make a difference in people’s lives.”
Conclusion
Perlegen Sciences’ mission is to improve lives through better understanding of the
molecular basis of disease and drug response. Perlegen’s approach employs whole
genome scanning, a powerful technology that generates large amounts of data and
requires sophisticated data management and analysis capabilities. Perlegen has been
successful in meeting its aggressive research and business goals through the
development of a world-class informatics infrastructure based on commercial
information technologies – including key components from Microsoft.
For more information about Microsoft High Performance Computing, go to:
http://www.microsoft.com/hpc
For More Information
For more information about Microsoft products and services, call the Microsoft Sales Information Center at
(800) 426-9400. In Canada, call the Microsoft Canada Information Centre at (877) 568-2495. Customers who
are deaf or hard-of-hearing can reach Microsoft text telephone (TTY/TDD) services at (800) 892-5234 in the
United States or (905) 568-9641 in Canada. Outside the 50 United States and Canada, please contact your
local Microsoft subsidiary. To access information using the World Wide Web, go to:
http://www.microsoft.com/
© 2003 Microsoft Corporation. All rights reserved.
This case study is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR
IMPLIED, IN THIS SUMMARY.
Microsoft, PivotTable, Visual C#, and Windows are either registered trademarks or trademarks of Microsoft
Corporation in the United States and/or other countries. The names of actual companies and products
mentioned herein may be the trademarks of their respective owners.