An Introduction to Big Data
Harry E. Pence, TLTC Faculty Fellow for Emerging Technologies
Harry E. Pence 2013
Gerd Leonhard

John Wanamaker once said, “I know that half of my advertising doesn’t work. The problem is I don’t know which half.”

What’s new about Big Data? IBM describes the problem as the Three V’s. The sheer volume of stored data is exploding; IBM predicts that there will be 35 zettabytes stored by 2020. This data comes in a bewildering variety of structured and unstructured formats. The velocity of data depends not just on the speed at which the data is flowing but also on the pace at which it must be collected, analyzed, and retrieved.

Although most businesses already collect terabytes of information about customers, employees, and their enterprise, a recent survey found that 62% of business leaders couldn’t access their information fast enough, and 83% believed it didn’t give them what they needed to know. The annual growth rate of big data is 60%. Many business schools call it Business Analytics rather than Big Data.

Computing resources are becoming much cheaper and more powerful. In 1980 a terabyte of disk storage cost $14 million; now it costs about $30. Amazon or Google will “rent” a cloud-based supercomputer cluster for only a few hundred dollars an hour.

Social networks, like Facebook and Twitter, span the globe. Twitter generates more than 7 terabytes (TB) a day, Facebook more than 10 TB, and some enterprises already store data in the petabyte range. (Sebastien Pierre’s Facebook Map)

1 megabyte = 1E6 bytes; 1 gigabyte = 1E9 bytes; 1 terabyte = 1E12 bytes; 1 petabyte = 1E15 bytes, or 250,000 DVDs. Facebook currently stores more than 100 petabytes of data. Source: http://tinyurl.com/a8zwman

The amount of data available is growing much faster than the ability of most companies to process and understand it. Moore’s Law 2: Data expands to fill available storage.
According to researchers at UC San Diego, Americans consumed about 3.6 zettabytes of information in 2008. David Weinberger (p. 7) says a digital War and Peace is about 2 megabytes (1,296 pages), so one zettabyte equals 5E14 copies of War and Peace. It would take light 2.9 days to go from the top to the bottom of this stack.

Some applications of Big Data: Topological data analysis is being used to rethink basketball (five positions, or thirteen?). http://tinyurl.com/c5ajwm3 In March 2012, the Obama Administration announced $200 million in R&D investments for Big Data. http://tinyurl.com/85oytkj Google combined search terms with CDC data to identify search terms that correlated with the spread of flu during the 2009 season. Both Amazon and Netflix use correlation-based suggestions to boost sales. Target assumes that if a 20-something female shopper purchases unscented lotion, supplements such as zinc and calcium, and a large purse, she is pregnant.

Did Big Data help to determine the 2012 election? (NY Times, 2/14/13, http://tinyurl.com/acj8hk5) While Romney raised only slightly more money from his online ads than he spent on them, Obama’s team more than doubled the return on its online-ad investment. Romney’s get-out-the-vote digital tool, Orca, crashed on Election Day; Obama’s Narwhal gave every member of the campaign instant access to continuously updated voter information. Obama was the first candidate to appear on Reddit, and the photo of the Obamas became the most popular image ever seen on Twitter or Facebook. Romney’s senior strategist, Stuart Stevens, may well be the last person to run a presidential campaign who never tweeted.

One definition is, “Big Data is when the size of the data itself becomes a problem.” Often so much data is collected that much of it must be discarded, a process known as the Donner Party Effect. Open, online databases, like http://data.worldbank.org/ and https://explore.data.gov/, are now available.
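The zettabyte arithmetic above is easy to check; a quick sketch, assuming the decimal (SI) units the slides use:

```python
# Checking the scale arithmetic above, assuming decimal (SI) units.
ZETTABYTE = 10**21          # bytes
WAR_AND_PEACE = 2 * 10**6   # bytes; Weinberger's ~2 MB digital edition

copies = ZETTABYTE // WAR_AND_PEACE
print(copies)  # 500000000000000, i.e. 5E14 copies per zettabyte, as stated
```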
Google Analytics allows us to query search patterns, and Google’s BigQuery (https://developers.google.com/bigquery/) allows anyone to query all of Wikipedia, Shakespeare, and weather-station data for less than $0.035 per GB. http://tinyurl.com/bvd2yve But Big Data is important not just because of size but also because of how it connects data, people, and information structures (http://tinyurl.com/ato2hbu). It enables us to see patterns that weren’t visible before.

We are moving towards the Semantic Web. Companies ranging from the NY Times to the UK Ordnance Survey are creating linked data to make the web more interconnected. Each entity is defined by a machine-readable Uniform Resource Identifier (URI). The hope is to attach metadata to each entity to show how entities relate to each other: employees to companies, actors to motion pictures, etc.

Inexpensive sensors are rapidly moving us towards an Internet of Things. (Figure: “Physical World – We are here!” http://tinyurl.com/aoev3x9) Some predict that the Internet of Things will soon produce a massive volume and variety of data at unprecedented velocity. http://tinyurl.com/ahytzdf Welcome to the new information age. Morgan Stanley predicts the following data applications will grow fastest in 2013: 1. Healthcare, 2. Entertainment, 3. Communications/Media, 4. Manufacturing, 5. Financial.

Inexpensive ($100-200) devices, like Fitbit, already track your daily physical activity and report it to a web page. The American Society of Clinical Oncology is creating a database, CancerLinQ, to centralize cancer records so that Big Data methods can evaluate the effectiveness of treatments and hasten the development of new medicines. http://tinyurl.com/cnv6wfw

Big Data and the Future of Health Care: A patient can use a smartphone to do many tests now done in the laboratory, such as an EKG or glucose test.
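The linked-data idea described above can be sketched in a few lines: every entity gets a URI, and metadata connects entities as (subject, predicate, object) triples. All of the URIs and predicate names below are hypothetical examples, not real identifiers:

```python
# A minimal sketch of linked data: entities are URIs, and metadata
# connects them as (subject, predicate, object) triples.
# All URIs and predicates here are invented examples.
TRIPLES = [
    ("http://example.org/person/alice", "worksFor", "http://example.org/company/acme"),
    ("http://example.org/person/bob",   "actedIn",  "http://example.org/film/gadget"),
    ("http://example.org/person/alice", "actedIn",  "http://example.org/film/gadget"),
]

def related(subject, predicate):
    """Everything `subject` is linked to via `predicate`."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]

print(related("http://example.org/person/alice", "actedIn"))
# ['http://example.org/film/gadget']
```

Because the identifiers are machine readable, a program can follow such links across datasets published by different organizations, which is exactly what makes the web “more interconnected.”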
The cost of a personal genome is dropping rapidly, and some are predicting a $100 cost soon. Knowing an individual’s genome should allow treatment to be customized to the individual. A recent report says that Big Data could save as much as $450 million in health care costs, but the AMA says that current electronic health record systems lack the sophistication to manage the storage and retrieval of big data. http://tinyurl.com/cln8vf9

Ever since the 1970s, the roughly 25,000 families who create the Nielsen ratings have determined which TV shows survive and what ad rates apply. In November 2012, Nielsen purchased SocialGuide, which measures the “social impact” of TV, and announced it was partnering with Twitter. Now Twitter has purchased Bluefin Labs, a social-TV analytics company. As more and more people TiVoed, Nielsen created the C3 rating in 2007, and recently added the C7 rating to measure how many people viewed a show after it originally aired. Now advertisers want to know not just whether people watched, but whether they were “engaged.” (Wired, April 2013, 92-94)

Big Data and Big Science: “The Fourth Paradigm.” The University of California, San Diego, is building a "big data freeway system" for science projects in "genomic sequencing, climate science, electron microscopy, oceanography and physics." The Square Kilometre Array (SKA), under development in Australia and South Africa, will collect one exabyte of data per day from 36 small antennas spread over more than 3000 km to simulate a single giant radio telescope. Bradley Voytek (Big Data, location 1186) argues that Big Data analysis allows researchers to identify patterns that were previously invisible; it may even be possible to automate critical aspects of the scientific method itself.

The Large Hadron Collider at CERN (17 miles around) produces so much data that scientists must discard most of it, hoping they haven’t thrown away anything useful.
Weather prediction combines data from multiple earth satellites with massive computing power. Most of the satellites belong to the U.S., but the Europeans have more powerful computers. Our weather satellites are old. http://tinyurl.com/cvpz5qe

Search engines, like Yahoo! and Google, were the first companies to work with datasets that were too large for conventional methods. (According to Big Data, location 177, Google has over a million servers.) To power its searches, Google developed a strategy called MapReduce: you map a task onto a multitude of processors, then collect and combine the results. Traditional data warehouses use relational databases (think Excel rows and columns); search engines need to handle non-relational databases, sometimes called NoSQL.

The most popular software for searching NoSQL databases is called Hadoop (named after the creator’s son’s toy elephant), and several different versions are freeware. Hadoop is designed to collect data even if it doesn’t fit nicely into tables, distribute a query across a large number of separate processors, and then combine the results into a single answer set in order to deliver results in almost real time. It is often paired with machine-learning applications, like recommendation engines, classification, error detection, or facial recognition.

Personal aside: I suggest that Google Analytics might be the best way to introduce students to crafting a query for a Big Data exercise.

We are in the Golden Age of Data Visualization. http://tinyurl.com/beqxuyl A streamgraph can show the conversation around a brand, but it can be difficult to classify a tweet as pro or con. Consider these #fracking tweets:
“Don’t appoint a #fracking proponent to lead the Dept. of Energy.” (often retweeted)
“Broome County Executive took $82,428 in pro-#fracking campaign contributions.”
“The only politician I know that has backed up his promise to address #fracking is Gov. Cuomo.”
“Natural gas is neither perfect nor perfectly evil.”
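The map-then-reduce strategy described above can be sketched on a single machine. This toy word count (the canonical MapReduce example) has none of the distribution or fault tolerance that make Hadoop and Google’s system useful, but it shows the shape of the computation:

```python
# Toy single-machine sketch of MapReduce-style word counting.
# Real systems run the map phase on many processors and shuffle the
# (key, value) pairs between machines; here everything is local.
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    """Shuffle and reduce: group the pairs by word and sum the counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big query", "big analytics"]
print(reduce_phase(map_phase(docs)))
# {'big': 3, 'data': 1, 'query': 1, 'analytics': 1}
```

The key design idea is that the map step is independent per document, so it parallelizes across as many processors as you can afford.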
“Businesses surprised to see their names on #fracking petition.”
“Mmmm @fracking fluid bit.ly/ZmQVQj”

Amazon’s Mechanical Turk is one way to manually create a standard data template. https://www.mturk.com/mturk/welcome

Scraping, processing, and buying Big Data: One can only copyright a specific arrangement of data, but the metadata is often extremely important and may not follow the scrape. Companies, like Infochimps, are scraping and cleaning selected data from Twitter and then selling access to these datasets. Pete Warden reports that it only cost him $120 to gather, analyze, and visualize 220 million public Facebook profiles (http://tinyurl.com/yb2q3dv), and 80legs allowed him to download a million web pages for about $2.20.

Problems with Big Data – Selection bias (http://tinyurl.com/bdb7wgy): Selection bias occurs when the individuals or groups who take part in a study don’t represent the general population. According to a recent Pew survey, Twitter users are younger than the general public and more likely to lean toward the Democratic Party. Twitter reactions are often at odds (six out of eight times) with overall public opinion. http://tinyurl.com/cvqq5hz A recent article in EPJ Data Science says Twitter is actually comprised of modern-day tribes: groups of people who use a discrete language and are connected by a character, occupation, or interest. http://preview.tinyurl.com/bc8gecu (Figure: the partition of English-speaking Twitter users into communities, annotated with words typical of those often used by members of each community. http://tinyurl.com/bmb9r9e)

Problems with Big Data – Misclassification bias (http://tinyurl.com/bdb7wgy): Misclassification occurs when either the cause or the effect is not accurately recognized, for example when a response is not correctly identified as the customer intended. This is especially likely when subjective interpretation is required to classify an answer.
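A toy simulation makes the selection-bias problem concrete. The numbers below are invented for the illustration, not real polling figures: suppose true support for some policy is 50%, but the platform we sample from over-represents supporters at 70%.

```python
# Toy illustration of selection bias, with invented numbers: the
# platform we sample from leans 70% in favor, while the full
# population is split 50/50. Polling the platform, however carefully,
# recovers the platform's lean, not the population's opinion.
import random

random.seed(42)
TRUE_SUPPORT = 0.50
PLATFORM_SUPPORT = 0.70  # the platform over-represents supporters

sample = [random.random() < PLATFORM_SUPPORT for _ in range(10_000)]
estimate = sum(sample) / len(sample)
print(estimate)  # lands near 0.70, far from the true 0.50
```

Note that taking a bigger sample makes this worse, not better: the estimate converges ever more precisely to the wrong number.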
Remember: correlation does not imply causation.

Comments on Big Data for education: The information in LMSs like Moodle or Angel (e.g., time on task, number of logins, number of list posts) is well structured and is already being analyzed. In MOOCs, students interact entirely online, leaving behind a record of every page they visited. The Gates Foundation recently gave $100 million to InBloom to improve ways to transfer information among the many technology silos where student records are currently stored. Nine states (including New York) are participating in this pilot project and plan to offer third-party vendors access to student data (without student or parental consent). http://tinyurl.com/b8e4whh Unanswered questions include security, who owns learner-produced data (TurnItIn?), who owns the data analysis, and what will be shared with the students.

E-textbooks can now track student use patterns. (Image from the blog Electric Venom: Motherhood, mid-life crisis, martinis.) Instructors who use an e-text from CourseSmart receive information about each student showing how much she is reading the book, what pages she skips, how much she highlights, and whether she is taking notes. (NY Times, April 9, 2013, http://tinyurl.com/d7dfob4) Students who take notes with pen and paper may be penalized, even if they are doing well in the course. If it can be measured, some teachers will grade it!

Some Learning Management Systems already display a “dashboard” of a student’s performance, for example:
Attendance: Danger
Homework: Needs improvement
Time on LMS: Good
Math SAT score: 390 (You need extra work in math-intensive courses)
Avg. grade of students like you who took this course: C
Avg. grade of students like you who had this instructor: D
Your grade in prerequisite courses – Precalculus: D

Dwell time is how long a student spent on a given activity.
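Dwell time is typically estimated from a timestamped click log: the time “spent” on an activity is just the gap until the next recorded event. A minimal sketch (the log format here is invented, not any particular LMS’s):

```python
# Toy sketch of estimating "dwell time" from a timestamped event log.
# The (timestamp_seconds, activity) format is a hypothetical example.
events = [
    (0, "quiz_1"),
    (300, "forum"),    # 300 s elapsed on quiz_1 before this click
    (900, "logout"),   # 600 s elapsed on forum before logging out
]

def dwell_times(log):
    """Time per activity = gap between each event and the next one."""
    times = {}
    for (t0, activity), (t1, _) in zip(log, log[1:]):
        times[activity] = times.get(activity, 0) + (t1 - t0)
    return times

print(dwell_times(events))  # {'quiz_1': 300, 'forum': 600}
```

The sketch also shows why dwell times are suspect: a student who opens a quiz and walks away for ten minutes looks identical to one who worked on it for ten minutes.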
Chico State University (CA) has been using learning analytics to study how student achievement is related to LMS use and student characteristics. http://tinyurl.com/acz5f7o The project merges LMS data with student characteristics and course performance from the campus database. The study reports a direct positive relationship between LMS usage and the student’s final grade.

Voytek’s Third Law: Any sufficiently advanced statistics can trick people into believing the results reflect truth.

Problems with Big Data – Confounding bias (http://tinyurl.com/bdb7wgy): Confounding occurs when there is a failure to control for some other factor that is affecting the outcome. How do other factors, like attendance, reading the textbook, or attending extra help sessions, relate to the time spent on the LMS (and so to the course grade)? Do the dwell times really make sense? The article notes that “no individually identifiable information is included in the data files.” Really???

Problems with Big Data – Privacy: A sensible back-up strategy may create more than 100 copies of the data, which makes it hard to protect individual privacy, and recent experience suggests that there are now so many public datasets available for cross-referencing that it is difficult to assure that any Big Data records can be kept private. In a number of cases, information from “anonymous studies” has been traced back to identify individuals and even their families. How can organizations protect all this data from hackers?
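The confounding problem described above can be simulated with an entirely made-up model: here a hidden “motivation” factor drives both LMS hours and the final grade, so the two correlate strongly even though, by construction, LMS time has no direct effect on the grade at all.

```python
# Toy sketch of confounding (invented model): hidden "motivation"
# drives both LMS hours and the grade, so the two correlate strongly
# even though LMS time has no direct effect on the grade here.
import random

random.seed(0)
rows = []
for _ in range(1000):
    motivation = random.random()                    # hidden confounder
    lms_hours = 10 * motivation + random.random()   # driven by motivation
    grade = 60 + 40 * motivation + random.random()  # also driven by motivation
    rows.append((lms_hours, grade))

def corr(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

hours, grades = zip(*rows)
print(round(corr(hours, grades), 2))  # strongly positive despite no direct effect
```

A naive analyst looking only at this correlation would conclude that requiring more LMS time raises grades; the model says it would do nothing.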
“Just as Ford changed the way we make cars – and then transformed work itself – Big Data has emerged as a system of knowledge that is already changing the objects of knowledge, while also having the power to inform how we understand human networks and community.” – danah boyd (http://tinyurl.com/bdb7wgy)

“And finally, how will the harvesting of Big Data change the meaning of learning, and what new possibilities and limitations may come from these systems of knowing?”

Ira “Gus” Hunt, chief technology officer for the Central Intelligence Agency, said, “The value of any piece of information is only known when you can connect it with something else that arrives at a future point in time.”

Thank you for listening. Any questions?