Big Data and Data-intensive science - CS4HS

“Big Data” and
Data-Intensive Science (eScience)
Ed Lazowska
Bill & Melinda Gates Chair in
Computer Science & Engineering
University of Washington
July 2013
Exponential improvements in technology and
algorithms are enabling the “big data” revolution
• A proliferation of sensors
  - Think about the sensors on your phone
• More generally, the creation of almost all information in digital form
  - It doesn’t need to be transcribed in order to be processed
• Dramatic cost reductions in storage
  - You can afford to keep all the data
• Dramatic increases in network bandwidth
  - You can move the data to where it’s needed
• Dramatic cost reductions and scalability improvements in computation
  - With Amazon Web Services, or Google App Engine, or Microsoft Azure, 1000 computers for 1 day cost the same as 1 computer for 1000 days!
• Dramatic algorithmic breakthroughs
  - Machine learning, data mining: fundamental advances in computer science and statistics
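The "1000 computers for 1 day" claim follows from billing by the machine-hour. A minimal sketch, assuming a flat per-machine-hour price with no volume discounts or minimum commitments; the function name and the $0.10 rate are hypothetical placeholders, not a quoted price from any provider:

```python
def rental_cost(machines: int, days: int, price_per_machine_hour: float) -> float:
    """Total cost under flat per-machine-hour billing (an assumed pricing model)."""
    machine_hours = machines * days * 24
    return machine_hours * price_per_machine_hour

PRICE = 0.10  # hypothetical dollars per machine-hour, for illustration only

wide = rental_cost(machines=1000, days=1, price_per_machine_hour=PRICE)
narrow = rental_cost(machines=1, days=1000, price_per_machine_hour=PRICE)

# Both configurations consume the same 24,000 machine-hours.
print(wide == narrow)
```

Under this model the bill depends only on total machine-hours, so the wide configuration buys the same work in a thousandth of the elapsed time, provided the job can be split across machines.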
Some examples of “big data” in action
• Collaborative filtering
• Fraud detection
• Price prediction
• Hospital re-admission prediction
• Travel time prediction under specific circumstances
• Sports
• Home energy monitoring
Larry Smarr, UCSD
John Guttag & Collin Stultz, MIT
Google self-driving car
Gordon Bell, Microsoft Research
• Speech recognition
• Machine translation
  - Speech -> text
  - Text -> text translation
  - Text -> speech in speaker’s voice
http://www.youtube.com/watch?v=Nu-nlQqFCKg&t=7m30s
7:30 – 8:40
• Scientific discovery
  - Gene Sequencing
  - Ocean Observatories Initiative
  - Large Synoptic Survey Telescope
  - Large Hadron Collider
• Presidential campaigning
• Electoral forecasting
• Real data-driven decision-making (vs. MBA baloney) for every sector!
eScience: Sensor-driven (data-driven)
science and engineering
Jim Gray
Transforming science (again!)
Theory
Experiment
Observation
[John Delaney, University of Washington]
Theory
Experiment
Observation
Computational
Science
eScience
eScience is driven by data
more than by cycles
• Massive volumes of data from sensors and networks of sensors
Apache Point telescope, SDSS
80 TB of raw image data (80,000,000,000,000 bytes) over a 7-year period
Large Synoptic Survey Telescope (LSST)
40 TB/day (an SDSS every two days), 100+ PB in its 10-year lifetime
400 Mbps sustained data rate between Chile and NCSA
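The SDSS and LSST figures can be cross-checked with a few lines of arithmetic. This is a rough sketch, assuming decimal units (1 TB = 10^12 bytes, matching the byte count given for SDSS) and, as an upper bound, that LSST produces data every day of its 10-year lifetime; the variable names are illustrative:

```python
TB = 10**12  # decimal terabyte, as in "80,000,000,000,000 bytes"
PB = 10**15  # decimal petabyte

sdss_total_bytes = 80 * TB    # SDSS: 80 TB over 7 years
lsst_bytes_per_day = 40 * TB  # LSST: 40 TB/day

# How many LSST days produce one SDSS worth of data?
days_per_sdss = sdss_total_bytes / lsst_bytes_per_day

# Upper bound on lifetime volume, assuming data is produced every day for 10 years.
lifetime_pb = lsst_bytes_per_day * 365 * 10 / PB

print(days_per_sdss, lifetime_pb)
```

The first figure matches "an SDSS every two days", and the every-day upper bound is consistent with the "100+ PB" lifetime estimate.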
Large Hadron Collider
700 MB of data per second, 60 TB/day, 20 PB/year
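The three LHC rates are related by simple unit conversion. A quick sketch, assuming decimal units and a constant 700 MB/s; the implied number of running days per year is derived from the slide's figures, not taken from any LHC schedule:

```python
MB = 10**6
TB = 10**12
PB = 10**15
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_second = 700 * MB
tb_per_day = bytes_per_second * SECONDS_PER_DAY / TB

# Days of continuous running at this rate needed to reach 20 PB.
days_for_20_pb = 20 * PB / (bytes_per_second * SECONDS_PER_DAY)

print(round(tb_per_day, 2), round(days_for_20_pb))
```

At a steady 700 MB/s the daily volume is close to the quoted 60 TB, and 20 PB corresponds to somewhat less than a full year of continuous running.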
Illumina HiSeq 2000 Sequencer
~1 TB/day
Major labs have 25-100 of these machines
Regional Scale Nodes of the NSF Ocean Observatories Initiative
1000 km of fiber optic cable on the seafloor, connecting thousands of chemical, physical, and biological sensors
The Web
20+ billion web pages x 20 KB = 400+ TB
One computer can read 30-35 MB/sec from disk => 4 months just to read the web
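The web-reading estimate can be reproduced directly. A back-of-the-envelope sketch, assuming decimal units and the lower bounds from the slide (20 billion pages, 20 KB each); the 1000-machine case is an added illustration that assumes the data is evenly partitioned with no coordination overhead:

```python
KB = 10**3
MB = 10**6
TB = 10**12

web_bytes = 20 * 10**9 * 20 * KB  # 20 billion pages x 20 KB
web_tb = web_bytes / TB

def days_to_read(total_bytes: int, mb_per_sec: float, machines: int = 1) -> float:
    """Days to scan total_bytes once, split evenly across machines (idealized)."""
    seconds = total_bytes / (mb_per_sec * MB * machines)
    return seconds / 86400

days_slow = days_to_read(web_bytes, 30)  # one disk at 30 MB/sec
days_fast = days_to_read(web_bytes, 35)  # one disk at 35 MB/sec

# Hypothetical cluster of 1000 machines, each reading its own share at 30 MB/sec.
hours_cluster = days_to_read(web_bytes, 30, machines=1000) * 24

print(web_tb, round(days_slow), round(days_fast), round(hours_cluster, 1))
```

A single machine needs on the order of four to five months for one pass, while the idealized 1000-machine cluster needs a few hours, which is why cluster computing appears among the eScience techniques.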
eScience is about the analysis of data
• The automated or semi-automated extraction of knowledge from massive volumes of data
  - There’s simply too much of it to look at
• It’s not just a matter of volume
  - Volume
  - Rate
  - Complexity / dimensionality
eScience utilizes a spectrum of computer
science techniques and technologies
• Sensors and sensor networks
• Backbone networks
• Databases
• Data mining
• Machine learning
• Data visualization
• Cluster computing at enormous scale
eScience will be pervasive
• Simulation-oriented computational science has been transformational, but it has been a niche
  - As an institution (e.g., a university), you didn’t need to excel in order to be competitive
• eScience capabilities must be broadly available in any institution
  - If not, the institution will simply cease to be competitive