Big data
What they are and why we should care

Previous hot topics
60’s: Catastrophe theory
70’s: Fractals
80’s: Chaos theory
90’s: Data mining
00’s: Machine learning
10’s: Big Data
Math CS

Carroll et al., 1997
“Due to the large number of observations... we developed a fast method to estimate the parameters”
Hourly ozone measurements for 14 years at 12 stations in the Houston area
12 x 24 x 365 x 14 = 1.5 million obs, or about 12 MB
Really only 12 observations...

Huber (1994)
Carroll’s data set is between medium and large

Units of data
8 bits = 1 byte
1,024 bytes = 1 kB
1,048,576 bytes = 1 MB
1,073,741,824 bytes = 1 GB
1,099,511,627,776 bytes = 1 TB
1,024 TB = 1 PB
1,024 PB = 1 EB
1,024 EB = 1 ZB

Human genome
April 2003: the human genome was decoded
About 3 billion base pairs
< 400 gaps
99 percent finished
Accuracy rate: < 1 error every 10,000 base pairs
Project started in 1988
Storage: about 6 GB
Huge, in Huber’s classification

Remote sensing
[Figure: EROS Consolidated Report on Data Distributed, All Projects Combined – Monthly/Cumulative]
2 94 remote sensing satellites launched 2014
First LANDSAT satellite launched in 1972
LANDSAT 8 launched 2013
LANDSAT 7 and 8 are currently operational

Big Science
Large Synoptic Survey Telescope
Goal: 10 years of biweekly surveys of the visible sky
Location: Cerro Pachón, Chile
Product: 200 PB of data
Operational: 2023

Social media
7,152 tweets per second ≈ 226 billion tweets per year
Live Twitter statistics
714 Instagram photos per second
1,101 Tumblr posts per second
2,068 Skype calls per second
53,456 Google searches
119,318 YouTube videos viewed
2.5 million emails sent per second

US Library of Congress
2009: 142 million items (32 million books)
74 TB digitized material
6 million videos, films, and audio
Digitizing 3-5 PB per year

Digital storage
Hollerith cards
Tape
Disk
Floppy disk
CD
DVD
Flash memory

How much information is there?
We don’t know.
But in 2007:
290 EB compressed storage (a human’s memory is about 225 MB; humanity 1-2 PB)
6.4 x 10^18 instructions/second (about the same as the maximum number of neural signals in the brain per second)
Storage grows by 23% per year
Instructions by 58% per year
Moore’s law

Google flu trends
2008 Nature paper: predict onset of flu epidemic based on Google searches on flu-related keywords
Two weeks earlier than CDC
Relate flu doctor visits to 45 “best” query terms, 2003-07
Includes seasonal terms such as “high school basketball”
“97% accurate compared to CDC data”
2012-13: predicted twice as many flu cases as observed

Some more successful examples
Matching stem cell donors
  store stem cell DNA from donors
  match patient to database
  graph representation
  2 million nodes

How can we learn from data?
The data mining mantra:
  find relationships in data
  the more data, the more relationships
For commercial uses (personalized ads etc.) we may not need to know why
In science we need understanding to use these relationships
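The mantra "the more data, the more relationships" has a downside that the Google flu trends example hints at: if enough candidate predictors are screened, some will correlate with the target by chance alone. The following is a minimal sketch of that effect, not the method used in any of the studies mentioned above. It assumes pure Gaussian noise for both the target and the candidates; the function names `pearson` and `running_max_abs_correlation`, the sample size, and the seed are all hypothetical choices made for illustration.

```python
import random
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def running_max_abs_correlation(n_obs, checkpoints, rng):
    """Screen noise candidates against a noise target.

    Returns {k: largest |correlation| among the first k candidates}
    for every k in `checkpoints`. Nothing here is related to anything
    else by construction, so any correlation found is spurious.
    """
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    wanted = set(checkpoints)
    best, results = 0.0, {}
    for k in range(1, max(checkpoints) + 1):
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(target, candidate)))
        if k in wanted:
            results[k] = best
    return results


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs agree; illustrative only
    for k, r in running_max_abs_correlation(50, [10, 100, 1000], rng).items():
        print(f"best |r| among first {k} noise candidates: {r:.2f}")
```

The best correlation can only grow as more candidates are screened, even though every candidate is noise. This is one way to read the warning in the last slide: a relationship selected from a very large pool needs an explanation before it can be trusted in science.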