, part2

Big data
What they are
and why we should care
Previous hot topics
60’s Catastrophe theory
70’s Fractals
80’s Chaos theory
90’s Data mining
00’s Machine learning
10’s Big Data
Math
CS
Carroll et al., 1997
“Due to the large number
of observations... we developed a
fast method to estimate the
parameters”
Hourly ozone measurements for
14 years at 12 stations in the
Houston area
12 x 24 x 365 x 14 = 1.5 million obs
or about 12 MB
Really only 12 observations...
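The observation count and storage estimate above can be checked with a short calculation. This is a minimal sketch: the 8 bytes per observation is an assumption (one double-precision number per measurement), chosen because it gives a figure close to the 12 MB quoted on the slide.

```python
# Hourly observations: stations x hours/day x days/year x years.
stations, hours_per_day, days_per_year, years = 12, 24, 365, 14
n_obs = stations * hours_per_day * days_per_year * years

# Assumption: each observation is stored as one 8-byte floating-point number.
BYTES_PER_OBS = 8
size_mb = n_obs * BYTES_PER_OBS / 1_000_000  # decimal megabytes

print(f"{n_obs:,} observations, about {size_mb:.1f} MB")
```

The count comes to 1,471,680, which the slide rounds to 1.5 million.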
Huber (1994)
Carroll’s data set is between
medium and large
Units of data
8 bits = 1 byte
1,024 bytes = 1 kB
1,048,576 bytes = 1 MB
1,073,741,824 bytes = 1 GB
1,099,511,627,776 bytes = 1 TB
1,024 TB = 1 PB
1,024 PB = 1 EB
1,024 EB = 1 ZB
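The table of units above follows one rule: each unit is 1,024 times the previous one. A small helper makes this concrete. The function name `human_readable` is hypothetical, introduced only for illustration.

```python
# Binary units as listed above: each step is a factor of 1,024.
UNITS = ["bytes", "kB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def human_readable(n_bytes):
    """Express a byte count in the largest unit that keeps the value below 1,024."""
    value = float(n_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024

print(human_readable(1_073_741_824))       # one GB from the table above
print(human_readable(200 * 1024**5))       # 200 PB
```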
Human genome
April 2003 the human genome was
decoded
About 3 billion base pairs
< 400 gaps
99 percent finished
Accuracy rate < 1 error every
10,000 base pairs
Project started in 1988
Storage about 6 GB
Huge, in Huber’s classification
Remote sensing
EROS Consolidated Report on Data Distributed
All Projects Combined – Monthly/Cumulative
94 remote sensing satellites launched in 2014
First LANDSAT satellite launched in 1972
LANDSAT 8 launched in 2013
LANDSAT 7 and 8 are currently operational
Big Science
Large Synoptic Survey Telescope
Goal: 10 years of biweekly
surveys of the visible sky
Location: Cerro Pachón, Chile
Product: 200 PB of data
Operational: 2023
Social media
7,152 tweets per second
≈ 226 billion tweets per year
Live Twitter statistics
714 Instagram photos per second
1,101 Tumblr posts per second
2,068 Skype calls per second
53,456 Google searches per second
119,318 YouTube videos viewed per second
2.5 million emails sent per second
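The yearly tweet total follows from the per-second rate. As a rough check, assuming a constant rate and a 365-day year (both simplifications):

```python
# Seconds in a non-leap year.
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

def per_year(rate_per_second):
    """Scale a constant per-second rate up to a yearly total."""
    return rate_per_second * SECONDS_PER_YEAR

tweets_per_year = per_year(7_152)
print(f"about {tweets_per_year / 1e9:.0f} billion tweets per year")
```

The same helper applies to the other per-second rates on the slide.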
US Library of Congress
2009:
142 million items (32 million
books)
74 TB digitized material
6 million videos, films, and audio
Digitizing 3-5 PB per year
Digital storage
Hollerith cards
Tape
Disk
Floppy disk
CD
DVD
Flash memory
How much
information is there?
We don’t know. But in 2007:
290 EB compressed storage
(a human’s memory is about 225
MB; humanity 1-2 PB)
6.4 x 10^18 instructions/second
(about the same as the maximum
number of neural signals in the
brain per second)
Storage grows by 23% per year
Instructions by 58% per year
Moore’s law
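The annual growth rates above can be turned into doubling times, which makes the comparison with Moore's law easier to see. A minimal sketch, assuming steady compound growth at the quoted rates:

```python
import math

def doubling_time(annual_growth):
    """Years needed to double at a constant annual growth rate (0.23 = 23%)."""
    return math.log(2) / math.log(1 + annual_growth)

storage_years = doubling_time(0.23)   # storage capacity
compute_years = doubling_time(0.58)   # instructions per second

print(f"storage doubles about every {storage_years:.1f} years")
print(f"computing doubles about every {compute_years:.1f} years")
```

At 58% per year the doubling time is roughly a year and a half, close to the period usually quoted for Moore's law.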
Google flu trends
2008 Nature paper: predict onset
of flu epidemic based on Google
searches on flu-related keywords
Two weeks earlier than CDC
Relate flu doctor visits to 45
“best” query terms 2003-07
Includes seasonal terms such as
“high school basketball”
“97% accurate compared to CDC
data”
2012-13 predicted twice as many
flu cases as observed
Some more successful
examples
Matching stem cell donors
store stem cell DNA from donors
match patient to database
graph representation
2 million nodes
How can we
learn from data?
The data mining mantra:
find relationships in data
the more data, the more
relationships
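One reading of the mantra is that the number of candidate relationships grows much faster than the number of variables, and that some will look strong by chance alone. The sketch below illustrates this with pure noise. Everything in it is an assumption made for illustration: the variables are independent random numbers, there are 20 observations each, and a correlation above 0.5 in absolute value counts as a "relationship".

```python
import random
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def count_strong_pairs(n_vars, n_obs=20, threshold=0.5, seed=0):
    """Return (number of variable pairs, pairs with |r| above the threshold)."""
    rng = random.Random(seed)
    # Independent noise: no variable is truly related to any other.
    data = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    pairs = list(combinations(range(n_vars), 2))
    strong = sum(
        1 for i, j in pairs if abs(pearson(data[i], data[j])) > threshold
    )
    return len(pairs), strong

for n_vars in (10, 50, 100):
    total, strong = count_strong_pairs(n_vars)
    print(f"{n_vars} variables: {total} pairs, {strong} with |r| > 0.5")
```

The number of pairs grows as n(n-1)/2, so 100 variables already give 4,950 pairs to search through.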
For commercial uses, such as
personalized ads, we may not
need to know why
In science we need understanding
to use these relationships