An Introduction to Big Data
Harry E. Pence, TLTC Faculty Fellow for Emerging Technologies, 2013

Why Big Data Now?
Big Data is produced by the collision of cheaper, faster computing with very large social networks. In 1980 a terabyte of disk storage cost $14 million; now it costs about $30. Amazon or Google will "rent" a cloud-based supercomputer cluster for only a few hundred dollars an hour. (Sebastien Pierre's Facebook Map)

The annual growth rate of big data is 60%. Twitter generates more than 7 terabytes (TB) a day, Facebook more than 10 TB, and some enterprises already store data in the petabyte range.
1 megabyte = 1E6 bytes
1 gigabyte = 1E9 bytes
1 terabyte = 1E12 bytes
1 petabyte = 1E15 bytes, or 250,000 DVDs
Facebook currently stores more than 100 petabytes of data. (Source: http://tinyurl.com/a8zwman)

Or Garbage?
The amount of data available is growing much faster than the ability of most companies to process and understand it. Moore's Law 2: data expands to fill available storage. According to researchers at UC-San Diego, Americans consumed about 3.6 zettabytes of information in 2008. David Weinberger (p. 7) says a digital War and Peace is about 2 megabytes, so one zettabyte equals 5E14 copies of War and Peace; it would take light 2.9 days to travel from the top to the bottom of this stack.

One definition is, "Big Data is when the size of the data itself becomes a problem." But Big Data is important not just because of size but also because of how it connects data, people, and information structures (http://tinyurl.com/ato2hbu). It enables us to see patterns that weren't visible before. Researchers at the Harvard School of Public Health combined location data from over 14 million cell phone subscribers in Kenya with a malaria prevalence map to identify a main source of malaria transmission in Kenya. Anyone can use powerful, open, online databases, like http://data.worldbank.org/ and https://explore.data.gov/.
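The arithmetic behind Weinberger's comparison can be checked directly; a quick sketch using the decimal unit definitions from the slide (the 2 MB figure for a digital War and Peace is Weinberger's own estimate):

```python
# Unit sizes in bytes (decimal prefixes, as defined on the slide)
MEGABYTE = 10**6
TERABYTE = 10**12
PETABYTE = 10**15
ZETTABYTE = 10**21

war_and_peace = 2 * MEGABYTE  # Weinberger's estimate for the digital text

# How many copies fit in one zettabyte?
copies_per_zettabyte = ZETTABYTE // war_and_peace
print(copies_per_zettabyte)   # 500000000000000, i.e. 5E14 copies, as stated
```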
Google's BigQuery (https://developers.google.com/bigquery/) lets anyone query all of Wikipedia, Shakespeare, and weather-station data from within a browser for less than $0.035 per GB.

John Wanamaker once said, "I know that half of my advertising doesn't work. The problem is I don't know which half." (Gerd Leonhard)

Did Big Data help to determine the 2012 election? (NY Times, 2/14/13, http://tinyurl.com/acj8hk5) Romney raised slightly more money from his online ads than he spent on them, while Obama's team more than doubled the return on its online-ad investment. Romney's get-out-the-vote digital tool, Orca, crashed on Election Day; Obama's Narwhal gave every member of the campaign instant access to continuously updated voter information. Obama was the very first candidate to appear on Reddit, and the photo of the Obamas became the most popular image ever seen on Twitter or Facebook. Romney's senior strategist, Stuart Stevens, may well be the last guy to run a presidential campaign who never tweeted.

Commercial users seem more concerned with correlation than causality. By comparing the 50 million most common search terms with CDC data on the flu, Google found a combination of search terms that strongly correlated with the spread and intensity of the 2009 flu season. Both Amazon and Netflix use correlation-based recommendations to boost sales. Target assumes that if a 20-something female shopper purchases unscented lotion, supplements such as zinc and calcium, and a large purse, she is pregnant.

Cheap sensors constantly monitor personal data to allow for personalized support. Inexpensive ($100-200) devices, like the Fitbit, already track your daily physical activity and report it to a web page. General Motors (OnStar) and several insurance companies offer driving monitors. Inexpensive sensors are rapidly moving us towards an Internet of Things. (Slide graphic: the physical world; "We are here!")
Google's Knowledge Graph (http://tinyurl.com/aoev3x9) is a step towards the Semantic Web. Companies ranging from the NY Times to the UK Ordnance Survey are creating linked data to make the web more interconnected. Each entity is defined by a Uniform Resource Identifier (URI), which is machine readable. The hope is to attach metadata to each entity to show how entities relate to each other: employees to companies, actors to motion pictures, etc.

Some predict that the Internet of Things will soon produce a massive volume and variety of data at unprecedented velocity. Welcome to the new information age. (http://tinyurl.com/ahytzdf)

Big Data and Big Science: "The Fourth Paradigm"
Massive sensor arrays are important in many areas of science, like meteorology, environmental monitoring, astronomy, and climate measurement. The Square Kilometre Array (SKA) under development in Australia and South Africa will collect one exabyte of data per day from 36 small antennas spread over more than 3000 km to simulate a single giant radio telescope. Bradley Voytek (Big Data, location 1186) argues that Big Data analysis allows researchers to identify patterns that were previously invisible; it is possible to automate critical aspects of the scientific method itself.

The 17-mile Large Hadron Collider at CERN produces so much data that scientists must discard most of it, hoping they haven't thrown away anything useful. Weather prediction combines data from multiple earth satellites with massive computing power. Most of the satellites belong to the U.S., but the Europeans have a more powerful computer, and our weather satellites are old. (http://tinyurl.com/cvpz5qe)

Big Data and the Future of Health Care
A patient can use a smartphone to do many tests now done in the laboratory, such as an EKG or glucose test. The quest for the $1000 genome: in May 2007, the first personal genome cost $1 million.
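Linked data of the kind described above is usually expressed as subject-predicate-object triples, with each entity named by a URI so machines can follow the connections. A minimal sketch (the "ex:" identifiers here are hypothetical stand-ins for full URIs, not real ones):

```python
# Each fact is a (subject, predicate, object) triple; the URIs make
# entities machine readable and linkable across datasets.
triples = [
    ("ex:HarrisonFord", "ex:actedIn",    "ex:StarWars"),
    ("ex:StarWars",     "ex:directedBy", "ex:GeorgeLucas"),
    ("ex:AliceSmith",   "ex:worksFor",   "ex:AcmeCorp"),
]

def objects(subject, predicate):
    """Follow one edge of the graph: all objects linked from subject."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects("ex:HarrisonFord", "ex:actedIn"))  # ['ex:StarWars']
```

Because every dataset uses the same URIs for the same entities, graphs published by different organizations (the NY Times, Ordnance Survey, etc.) can be merged simply by concatenating their triples.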
Recently several companies have come close to $1000, and some predict a $100 cost soon. Knowing an individual's genome should allow treatment to be customized to the individual. As the cost plummets, the bottleneck will increasingly be the cost of interpreting the genomic data.

The Three V's
IBM defines big data as the combination of volume, variety, and velocity. The sheer volume of stored data is exploding; IBM predicts that there will be 35 zettabytes stored by 2020. This data comes in a bewildering variety of structured and unstructured formats. The velocity of data depends not just on the speed at which the data is flowing but also on the pace at which it must be collected, analyzed, and retrieved. IBM claims that its software will adjust to changes in any of these three factors.

Search engines, like Yahoo! and Google, were the first companies to work with datasets that were too large for conventional methods. (According to Big Data 2, location 177, Google has over a million servers.) To power its searches, Google developed a strategy called MapReduce: you map a task onto a multitude of processors and then retrieve and combine the results. Traditional data warehouses use a relational database (think rows and columns); search engines need to handle non-relational databases, sometimes called NoSQL.

The most popular software for searching NoSQL databases is called Hadoop, and several different versions are open source. (It is named after the creator's son's toy elephant.) Hadoop is designed to collect data, even if it doesn't fit nicely into tables, distribute a query across a large number of separate processors, and then combine the results into a single answer set in order to deliver results in almost real time. It is often paired with machine-learning applications, like recommendation engines, classification, error detection, or facial recognition.
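The map-then-combine pattern described above can be illustrated with the classic word-count example. This toy version runs on one machine; Hadoop's contribution is running the same two phases across thousands of processors:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (key, value) pair for every word in one document
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(pairs):
    # Reduce: combine all the values that share a key
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data is big", "data about data"]
pairs = chain.from_iterable(map_phase(d) for d in docs)  # maps could run in parallel
print(reduce_phase(pairs))  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

Because each map call touches only one document, the map phase parallelizes trivially; the framework's real work is shuffling the intermediate pairs so that all values for a key reach the same reducer.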
Personal aside: I suggest that Google Analytics might be the best way to introduce students to crafting a query for a Big Data exercise.

We are in the Golden Age of Data Visualization (http://tinyurl.com/beqxuyl). A streamgraph can show the conversation around a brand.

Classify each of these as pro or con fracking:
- Don't appoint a #fracking proponent to lead the Dept. of Energy. (often RT)
- Broome County Executive took $82,428 in pro-#fracking campaign contributions.
- The only politician I know that has backed up his promise to address #fracking is Gov. Cuomo.
- Natural gas is neither perfect nor perfectly evil.
- Businesses surprised to see their names on #fracking petition.
- Mmmm @fracking fluid bit.ly/ZmQVQj

Amazon's Mechanical Turk (https://www.mturk.com/mturk/welcome) is one way to manually create a standard data template.

"Smart Data"
Moving from Big Data to Smart Data is a multistep process. "The issue is not about the volume of data but the ability to analyze and act on data in real time." (http://tinyurl.com/atcanjw)

Scraping, processing, and buying Big Data
One can only copyright a specific arrangement of the data, but the metadata is often extremely important and may not follow the scrape. Companies like Infochimps are scraping and cleaning selected data from Twitter and then selling access to these datasets. Pete Warden reports that it cost him only $120 to gather, analyze, and visualize 220 million public Facebook profiles (http://tinyurl.com/yb2q3dv), and 80legs allowed him to download a million web pages for about $2.20.

Problems with Big Data: Selection bias (http://tinyurl.com/bdb7wgy)
Selection bias occurs when the individuals or groups selected to take part in a study don't represent the general population. According to a recent Pew survey, Twitter users are younger than the general public and more likely to lean toward the Democratic Party.
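The difficulty of the fracking exercise above is exactly what makes automated sentiment classification hard. A naive keyword classifier (the keyword lists here are hypothetical, chosen only for illustration) handles the obvious cases but is blind to the sarcasm of the "Mmmm @fracking fluid" tweet:

```python
# Hypothetical keyword lists; a real classifier would be trained on
# human-labeled examples (e.g. from Mechanical Turk workers).
PRO = {"jobs", "energy", "cheap"}
CON = {"ban", "pollution", "contributions"}

def classify(tweet):
    words = set(tweet.lower().replace("#", "").split())
    score = len(words & PRO) - len(words & CON)
    return "pro" if score > 0 else "con" if score < 0 else "unclear"

print(classify("Natural gas means cheap energy and jobs"))  # pro
print(classify("Mmmm @fracking fluid"))                     # unclear: sarcasm is invisible to keywords
```

This is why services like Mechanical Turk matter: humans supply the labeled examples that keyword matching cannot.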
Twitter reactions are often at odds (six out of eight times) with overall public opinion (http://tinyurl.com/cvqq5hz). A recent article in EPJ Data Science says Twitter actually comprises modern-day tribes: groups of people who use a discrete language and are connected to a character, occupation, or interest (http://preview.tinyurl.com/bc8gecu). A partition of English-speaking Twitter users into communities can be annotated with words typical of those often used by members of each community (http://tinyurl.com/bmb9r9e).

Problems with Big Data: Misclassification bias (http://tinyurl.com/bdb7wgy)
Misclassification occurs when either the cause or the effect is not accurately recognized: what if a response is not correctly identified as the participant intended? This is especially likely when subjective interpretation is required to classify an answer (http://tinyurl.com/cvqq5hz). Remember: correlation does not imply causation, but some projects, like targeting building inspections in New York City (http://tinyurl.com/clhmv3t), seem to work well on the basis of correlation alone.

Comments on Big Data for education
The information in LMSs like Moodle or Angel (e.g. time on task, number of logins, number of list posts) is well structured and is already being analyzed. In MOOCs, students interact entirely online, leaving behind a record of every page they visited. The Gates Foundation recently gave $100 million to inBloom to improve ways to transfer information among the many technology silos where student records are currently stored. Nine states (including New York) are participating in this pilot project and plan to offer third-party vendors access to student data (without student or parental consent) (http://tinyurl.com/b8e4whh). Unanswered questions include security, who owns learner-produced data (TurnItIn?), who owns the data analysis, and what will be shared with the students?
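Selection bias of the Twitter kind is easy to simulate: poll only a skewed subgroup and the estimate drifts away from the population value. All the numbers below are invented purely for illustration:

```python
import random

random.seed(42)

# Invented population: 40% support a policy overall,
# but 60% support it within the Twitter-active subgroup.
population    = [1] * 400 + [0] * 600  # 1 = supports, 0 = opposes
twitter_users = [1] * 60 + [0] * 40

def poll(group, n=1000):
    """Estimate support from n randomly sampled responses."""
    return sum(random.choices(group, k=n)) / n

print(f"population poll: {poll(population):.2f}")     # near 0.40
print(f"twitter-only poll: {poll(twitter_users):.2f}")  # near 0.60, badly off
```

No amount of extra Twitter data fixes this: a bigger sample from the wrong population just gives a more precise wrong answer.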
Chico State University (CA) has been using learning analytics to study how student achievement is related to LMS use and student characteristics (http://tinyurl.com/acz5f7o). Dwell time is how long a student spent on a given activity. The project merges LMS data with student characteristics and course performance from the campus database. The study reports a direct positive relationship between LMS usage and the student's final grade. Voytek's Third Law: any sufficiently advanced statistics can trick people into believing the results reflect truth.

Problems with Big Data: Confounding bias (http://tinyurl.com/bdb7wgy)
Confounding occurs when there is a failure to control for some other factor that is affecting the outcome. How do other factors, like attendance, reading the textbook, or attending extra help sessions, relate to the time spent on the LMS (and so to the course grade)? "Low-income students spend more time on the LMS." The article notes that "No individually identifiable information is included in the data files." Really???

Problems with Big Data: Privacy
A sensible back-up strategy may create more than 100 copies, which can make it hard to protect individual privacy. How can organizations protect all this data from hackers? Recent experience suggests that there are now so many public datasets available for cross-referencing that it is difficult to assure that any Big Data records can be kept private. In a number of cases, information from "anonymous studies" has been traced back to identify individuals and even their families.
"Just as Ford changed the way we make cars – and then transformed work itself – Big Data has emerged as a system of knowledge that is already changing the objects of knowledge, while also having the power to inform how we understand human networks and community." (danah boyd, http://tinyurl.com/bdb7wgy) "And finally, how will the harvesting of Big Data change the meaning of learning and what new possibilities and limitations may come from these systems of knowing?"

Thank you for listening. That was Vienna Teng singing "The Hymn of Acxiom," sacred music for the age of big data (http://soundcloud.com/tracks/search?q=acxiom). Any questions?