PTI Fact Sheet - Data to Insight Center

An Overview of PTI at Indiana University
Beth Plale
PTI Managing Director
William K. Barnett
Director, Science Community Tools, Research Technologies
Robert H. McDonald
Associate Dean for Library Technologies
Mayfield Visit
12.12.12
PTI Fact Sheet
• The Pervasive Technology Institute employs about 120 full-time staff
• At any one time PTI has over 70 graduate research assistants engaged in research in one of the PTI centers
• Total active external grant funding in PTI as of 30 November 2012: $72,641,407
• PTI outreach activities, which number over 100 every year, reach 10,000 people, the majority located in Indiana
What Is Creating Big Data?
• Some driving forces: patient records growing fast (70 PB in pathology); network graphs from the Internet, leading to community detection
• Large Hadron Collider (Switzerland, physics): analysis is mainly through creating histogram charts (see the sketch below)
• Commercial: Google and Bing run the largest data analytics operations in the world
• Time series: earthquakes, Twitter tweets, the stock market
• Image processing: from climate simulations to NASA to DoD to radiology
• Financial decision support: marketing; fraud detection; automatic preference detection (mapping users to books and films)
– From Professor Geoffrey Fox, Director of the Digital Science Center in PTI
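To make the histogramming point concrete, here is a minimal Python sketch using synthetic data. The event counts, distributions, and the location of the "bump" are invented for illustration only, not LHC results: the idea is simply to bin a large stream of measurements and look for a localized excess.

```python
import numpy as np

# Toy illustration of histogram-based analysis (not LHC code): bin a large
# stream of simulated "event" measurements and look for an excess in one region.
rng = np.random.default_rng(0)
background = rng.exponential(scale=20.0, size=1_000_000)  # smooth falling background
signal = rng.normal(loc=125.0, scale=2.0, size=2_000)     # small localized bump
events = np.concatenate([background, signal])

counts, edges = np.histogram(events, bins=200, range=(0, 200))
peak_bin = counts[100:140].argmax() + 100                  # scan a search window
print(f"Largest excess near {edges[peak_bin]:.0f} with {counts[peak_bin]} events")
```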
PTI Big Data Strategy
• Developing new technology
– New system implementations
– New software and technology
• Training the 21st-century workforce
– People with strong analytical and technical skills in statistics and
machine learning who can analyze large volumes of data to
derive business (and other) insights
– Data-savvy managers and analysts who have the skills to be
effective consumers of big data insights and who are capable of
posing the right questions for analysis, interpreting and
challenging the results, and making appropriate decisions
– Technology personnel who develop, implement, and maintain
the hardware and software tools needed to make and use big
data.
Big Data Needs Big Storage
• Spreading data over many machines and servers lets them share the work
– Data storage, like computing, can take advantage of parallelism (see the sketch below)
• IU has operated the Data Capacitor since 2006, providing high-end performance for Big Data and Big Science. A just-announced upgrade takes it to 5 petabytes of storage; this much data on CDs would make a stack about 5 miles high
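A minimal Python sketch of the parallelism idea. The round-robin striping below is illustrative only, not Data Capacitor or Lustre internals (though Lustre stripes files across storage servers in a similar spirit), and it ends with a sanity check of the CD-stack figure, assuming 700 MB CDs that are 1.2 mm thick.

```python
# Illustrative sketch: a file split into fixed-size stripes that land on
# different servers, so reads and writes can proceed in parallel.
STRIPE_SIZE = 1 << 20          # 1 MiB stripes
NUM_SERVERS = 4

def stripe_layout(file_size: int) -> dict[int, list[int]]:
    """Map each stripe index to a server, round-robin."""
    layout: dict[int, list[int]] = {s: [] for s in range(NUM_SERVERS)}
    num_stripes = (file_size + STRIPE_SIZE - 1) // STRIPE_SIZE
    for stripe in range(num_stripes):
        layout[stripe % NUM_SERVERS].append(stripe)
    return layout

print(stripe_layout(10 * STRIPE_SIZE))   # 10 stripes spread over 4 servers

# Sanity check on the CD-stack claim: 5 PB on 700 MB CDs, each 1.2 mm thick.
cds = (5 * 10**15) / (700 * 10**6)
stack_miles = cds * 1.2e-3 / 1609.34
print(f"{cds:.2e} CDs, stack ~{stack_miles:.1f} miles high")  # ~5.3 miles
```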
Big Data Needs Big Computation
• Big Red II – the first university-funded, university-owned supercomputer capable of 1 PetaFLOPS (a thousand trillion mathematical operations per second). It would take one person, doing one calculation per second with a calculator, about 31 million years to do what Big Red II will be able to do in a second (see the arithmetic below)
• Big Red II and Data Capacitor II provide the system resources to address Big Data challenges
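The calculator comparison is straightforward arithmetic; a quick check, assuming exactly one calculation per second, gives roughly 31.7 million years:

```python
# One second of a 1 PetaFLOPS machine is 10**15 operations. At one
# calculation per second, a person would need 10**15 seconds:
ops = 10**15
seconds_per_year = 3600 * 24 * 365.25
years = ops / seconds_per_year
print(f"{years:.2e} years")   # ~3.17e+07, i.e. about 31.7 million years
```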
Networked Data Access
• Monon100 – a 100 gigabit per second connection to Internet2 (Indiana was the first state to announce one!). It moves data FAST (see the worked example after this list)
• IU has leveraged its significant network expertise to make
the Data Capacitor available to users at IU, in Indiana,
nationally, and internationally
– High-performance networks together with high-performance
data storage for a “data cloud” for Big Science
• To keep pace with the tremendous growth in data, we must
stay on the cutting edge of computing, storage and network
technologies
– We can’t sit still or we will be crushed by the data deluge!
• PTI is a leader in developing and integrating the latest
approaches across these various technology domains
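For a sense of what a 100 gigabit per second link means in practice, here is a quick calculation of ideal transfer times, assuming the full line rate is sustained (which real transfers rarely achieve):

```python
# Ideal transfer times on a 100 Gb/s link at full line rate.
link_bps = 100 * 10**9                       # 100 gigabits per second
for label, size_bytes in [("1 TB", 10**12), ("1 PB", 10**15), ("5 PB", 5 * 10**15)]:
    seconds = size_bytes * 8 / link_bps      # bytes -> bits, then divide by rate
    print(f"{label}: {seconds:,.0f} s (~{seconds / 86400:.2f} days)")
# 1 TB: 80 s; 1 PB: ~0.93 days; 5 PB (a full Data Capacitor II): ~4.6 days
```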
Dealing with Big Data - NSF DataNet Program
Motivation:
“… one of the major challenges of this scientific
generation: how to develop the new methods,
management structures and technologies to manage the
diversity, size, and complexity of current and future data
sets and data streams.”
Response:
DataNet creates “a set of exemplar national and global
data research infrastructure organizations” to address
this challenge.
SEAD Approach to DataNet Challenges
SEAD Partners - http://sead-data.net
• Contribute infrastructure to the
NSF DataNet vision that supports
data access, sharing, reuse, and
preservation for the long tail
• Develop a data access and
preservation environment that
supports the research, technical,
and economic requirements for
data management in the long tail
• Enable Active and Social Curation
• Utilize emerging preservation and access infrastructures
SEAD Social Networking/Virtual Archive at IU
[Workflow diagram: over time, a curator works through the Active Curation Repository (ACR) UI to request a preview, mark data for publication, and accept licensing terms; the ACR ingests the data into the Virtual Archive (VA) via a SWORD endpoint; metadata is updated, queried, and returned through a (SPARQL) query endpoint; end users then query the VA UI for a DOI and receive the DOI metadata.]
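As a concrete illustration of the "user queries the VA for DOI metadata" step in the diagram, here is a hedged Python sketch of a SPARQL-over-HTTP lookup. The endpoint URL, the DOI value, and the Dublin Core vocabulary choice are all assumptions for illustration, not details of the actual SEAD deployment; the HTTP request and result format are standard SPARQL mechanics.

```python
import requests

# Hypothetical sketch of querying a Virtual Archive SPARQL endpoint by DOI.
ENDPOINT = "https://va.example.org/sparql"   # placeholder, not a real SEAD URL

query = """
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?dataset ?title WHERE {
  ?dataset dcterms:identifier "doi:10.xxxx/example" ;
           dcterms:title ?title .
}
"""

resp = requests.get(ENDPOINT, params={"query": query},
                    headers={"Accept": "application/sparql-results+json"})
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["dataset"]["value"], "-", row["title"]["value"])
```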
RoCE [rok-ee] Demonstration at SC12
• At SC’12, we demonstrated a data system capable of
moving enough data to stream ~1000 high-definition
Blu-ray movies at once
– This was possible previously, but our approach reduced
the server stack required from 6 feet tall to about 9 inches
tall, reducing power and increasing efficiency
• We deployed this system in collaboration with Orange Telecom (the French telephone company), which offers service worldwide and fields a research office in San Francisco
• Many of their clients do video distribution, and our solution eliminates much of the customized, expensive, and power-hungry hardware they currently use
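Rough arithmetic behind those claims, assuming a typical HD Blu-ray stream of about 40 megabits per second (the per-stream rate is our assumption, not a figure from the demo):

```python
# Back-of-envelope check on the SC12 claims.
streams = 1000
mbps_per_stream = 40                          # assumed typical Blu-ray bitrate
aggregate_gbps = streams * mbps_per_stream / 1000
print(f"~{aggregate_gbps:.0f} Gb/s aggregate for {streams} streams")  # ~40 Gb/s

# Server stack shrink: 6 feet down to about 9 inches.
print(f"~{6 * 12 / 9:.0f}x less rack height")  # ~8x
```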
RoCE [rok-ee] Technology
• Our approach integrated years of experience tuning networks and filesystems with an emerging protocol called RoCE (pronounced "Rocky")
• RoCE eliminates many sources of overhead and
inefficiency in the venerable Internet Protocols
– If the Internet is like a highway full of cars and trucks, RoCE
is like an Indy car pulling a semi trailer!
• The expertise required to tune and operate a system
with Lustre and RoCE is significant, and we at IU were
the first to demonstrate it working over a long distance
– We focus today on making it work for Big Data and Big
Science, and tomorrow on automating it for a wider
audience
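One way to see the overhead RoCE avoids is to count per-packet work: at 100 gigabits per second with ordinary 1500-byte Ethernet frames, a kernel TCP/IP stack must handle millions of packets per second, each with headers to parse and buffers to copy, while RoCE offloads the transport to the NIC and delivers data directly between application buffers. A back-of-envelope sketch using standard header sizes (illustrative, not a measurement of our system):

```python
# Per-packet load a kernel TCP/IP stack faces at 100 Gb/s line rate.
link_bps = 100 * 10**9
mtu_bits = 1500 * 8
packets_per_sec = link_bps / mtu_bits
print(f"~{packets_per_sec / 1e6:.1f} million packets/s at 1500-byte MTU")  # ~8.3M

header_bytes = 14 + 20 + 20        # Ethernet + IPv4 + TCP headers per packet
overhead = header_bytes / (1500 + 14)
print(f"~{overhead:.1%} of each frame is headers")  # ~3.6%
```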
National Center for Genome Analysis Support (NCGAS)
• Sequencing a human genome cost $95M in 2001. Now it costs $5,000 (the numbers are worked through below)
• Genomics is now part of most biology and all disease research
• Sequences are huge: each one is 250 gigabytes (IU has the storage), and they need supercomputers to analyze (IU has the supercomputers)
• Researchers often don't know how to use supercomputers, so we help them
• The National Science Foundation has provided $1.5M for us to support genomics analysis
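Worked through from the slide's own figures (the 5 PB capacity is the Data Capacitor upgrade mentioned earlier):

```python
# Scale implied by the figures above: cost drop and storage demand.
cost_2001, cost_now = 95_000_000, 5_000
print(f"Cost reduction: {cost_2001 / cost_now:,.0f}x")    # 19,000x in ~a decade

genome_gb = 250
capacity_gb = 5 * 10**6                 # 5 PB expressed in gigabytes
print(f"A 5 PB store holds ~{capacity_gb / genome_gb:,.0f} "
      f"sequences of {genome_gb} GB each")                # ~20,000 sequences
```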
FutureGrid
Motivation:
FutureGrid will make it possible for researchers to conduct
experiments by submitting an experiment plan that is then
executed via a sophisticated workflow engine, preserving the
provenance and state information necessary to allow
reproducibility.
Response:
The FutureGrid Project provides a distributed test-bed of
networked HPC resources that makes it possible for
researchers to tackle complex research challenges in
computer science related to the use and security of grids and
clouds.
What Does FutureGrid Offer?
• Traditional HPC and Grid computing support
• Cloud platforms – Nimbus, Eucalyptus, OpenStack
• GPU computing
• Dynamic Provisioning through RAIN and RAIN-MOVE
– Image Generation and Registration
– Generic Image Repository
– Image Deployment
• Experiment Management
• Information Services and Performance tools
• Network device and virtual network tools
• A convenient portal for easy account and project management
• Help and Support via a ticket system
Big Data & the 21st Century Economy - PTI Creates High-Quality Jobs
1,108 person-years of employment supported by grants &
contracts since 1999
Q & A on PTI
PTI Impact at IU