COS 109 Wednesday December 9

• Housekeeping
  – Lab 8 and Assignment 9 are posted
  – Final exam: January 18 (Monday) at 7:30 PM, in class, 3-hour exam
  – A review session will be held on January 14 (time and location TBA)
  – Reading assignment for next Monday (also posted on the course home page):
    https://static.newamerica.org/attachments/3421-riding-thewave/Riding%20the%20Wave_Final_and_Teaching_Note.d71b398fa011446ca2739f5b0737dcfb.pdf
• Today's class
  – Big data: data mining and machine learning, with applications

The era of big data
• 2.5 exabytes of information are generated every day
  – 1 exabyte = a billion gigabytes = 10^18 bytes
• This amounts to nearly 1 zettabyte (1000 exabytes) per year
• This number is doubling every 40 months
• Tools are continually being developed to use this data productively

Some examples of data that is out there
• Locations of all Starbucks
• What can we observe?
  – A residential building at 73 Payson Avenue in Inwood is as far from a Starbucks as it is possible to be in Manhattan (1.36 miles)
  – The farthest residential address from a Starbucks south of 96th St. is 140 Baruch Place on the Lower East Side (0.93 miles)

Some other examples

A situation of concern

Excerpts from the article
• Dr. Janese Trimaldi, 40, recently completed her residency in Tampa, Fla. On a terrifying evening in July 2011, she says, she locked herself in her bedroom to hide from a drunken, belligerent boyfriend. He went into the kitchen, retrieved a steak knife and jimmied open the door. The screams and commotion caused a neighbor to call the police.
• The boyfriend contended that a bleeding scratch on his chest had been inflicted by Dr. Trimaldi with the knife. (It was from one of her fingernails, she says.) She was arrested and charged with aggravated assault with a deadly weapon and battery domestic violence. The state dropped the charges.
• A few months later, her booking photograph turned up on a Florida mug-shot Web site, and with it another mug shot from a 1996 arrest on an accusation of possession of marijuana and steroids.
• The authorities had raided her apartment on suspicion that a different boyfriend — this one a bodybuilder — was illegally selling the steroids. Records show that she was quickly released, and a certificate of disposition from the 13th Judicial Circuit of Florida shows that she was not prosecuted for either charge.
• She paid $30 to have the images taken down, but they soon appeared on other sites, one of which wanted $400 to pull the picture.

Publishing mug shots

This information was available about Janese Trimaldi

This information may be more accurate

What is the real story?

Details in the article
In March last year, a college freshman named Maxwell Birnbaum was riding in a van filled with friends from Austin, Tex., to a spring-break rental house in Gulf Shores, Ala. As they neared their destination, the police pulled the van over, citing a faulty taillight. When an officer asked if he could search the vehicle, the driver — a fraternity brother of Mr. Birnbaum's who quickly regretted his decision — said yes. Six Ecstasy pills were found in Mr. Birnbaum's knapsack, and he was handcuffed and placed under arrest. Mr. Birnbaum later agreed to enter a multiyear, pretrial diversion program that has involved counseling and drug tests, as well as visits to Alabama every six months to update a judge on his progress. But once he is done, Mr. Birnbaum's record will be clean.
Which means that by the time he graduates from the University of Texas at Austin, he can start his working life without taint. Has it worked for him?

Ad placement

Who does the tracking?
Drawbridge is one of several start-ups that have figured out how to follow people without cookies, and to determine that a cellphone, work computer, home computer and tablet belong to the same person, even if the devices are in no way connected. Before, logging onto a new device presented advertisers with a clean slate.
Let's visit Drawbridge, which is at http://www.drawbrid.ge/
How do they do it?

.ge is the country code top-level domain (ccTLD) for Georgia. .ge top-level domain names are available for registration for residents of Georgia (unlimited) or for foreign companies via representation of any local legal person (limited to one domain name per registrant). Second-level domain names are also available for registration for several specific types of registrants. (from http://en.wikipedia.org/wiki/.ge)

How can we use the data out there?
• In 2013, the National Security Agency confirmed it had collected data from cellphone towers in 2010 and 2011 to locate Americans' cellphones, though it said it never used the information. (further from the NYTimes article)
• Testifying before a hearing of the U.S. Senate Judiciary Committee on Tuesday, Oct. 1, Ed Felten said that searching for patterns in large collections of data, called metadata, "can now reveal startling insights about the behavior of individuals or groups." … [He] said that merely by combining their analysis of phone records with call times and durations, investigators can learn about people's work, social habits, religion and political affiliations.
• Want to do it yourself?

Interesting displays of data
[Figure: "The Selection of Majors by the Classes of 2000-2003 (Net Movement of More Than 20 Students)" – a flow diagram showing, for each group of majors (English; Mechanical, Civil & Chemical Engineering; Languages and Literatures; Music / Art & Arch. / Architecture; Religion / Philosophy; Computer Science / Electrical Engineering; Operations Research; Anthropology / Sociology; Physical Sciences / Math; History / Politics; Biology / Chemistry; Psychology; Economics; Woodrow Wilson School; Unknown), how many students intended that major, how many actually chose it, and how many were unchanged, with arrows indicating net movements of 40, 80, 120 and 160 students. The net movement of 20 or fewer students is not shown.]
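Returning to the "Want to do it yourself?" question on the metadata slide above: here is a minimal sketch, in Python, of the kind of pattern-finding Felten describes, run on a handful of made-up call records. The phone numbers, times and durations are invented, and the record layout is an assumption, not any carrier's real format; the point is only that counting who you call, when, and for how long already hints at routines.

```python
# A toy sketch of the pattern-finding Felten describes, using a handful of
# made-up call records (no real data; the record layout is an assumption).
from collections import Counter
from datetime import datetime

# Each record: (other party's number, start time, duration in seconds)
calls = [
    ("555-0101", "2015-12-07 09:15", 320),   # short weekday-morning calls
    ("555-0101", "2015-12-08 09:40", 210),
    ("555-0199", "2015-12-05 22:30", 1800),  # long late-night calls
    ("555-0199", "2015-12-06 23:10", 2400),
    ("555-0142", "2015-12-06 10:05", 600),   # a Sunday-morning call
]

calls_per_contact = Counter(number for number, _, _ in calls)
calls_per_hour = Counter(datetime.strptime(start, "%Y-%m-%d %H:%M").hour
                         for _, start, _ in calls)

print("Calls per contact:", calls_per_contact.most_common())
print("Calls by hour of day:", sorted(calls_per_hour.items()))
# Even this tiny summary hints at routines: who gets the short weekday-morning
# calls (a co-worker?), who gets the long late-night ones (a close friend?),
# and which number is called on Sunday morning.
```

Scaled up to millions of records and joined with location data, the same kind of counting is what makes metadata so revealing.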
Example applications of big data
• Searches for X usually lead to website Y
• Purchasers of X often browse for Y
• X is rated as a good seller
• When stock X goes up, so does stock Y
• Symptoms X and Y typically indicate disease Z
• Mutations on Chromosome Y indicate a tendency towards Z
• Those who buy X and Y are likely to perform act Z

Application areas
• Searches for X usually lead to website Y
  – Search engines (Google)
• Purchasers of X often browse for Y
  – Recommendation systems (Amazon, Netflix)
• X is rated as a good seller
  – Rating systems (eBay, Amazon)
• When stock X goes up, so does stock Y
  – Quantitative stock trading (Renaissance Technologies, D. E. Shaw, …)
• Symptoms X and Y typically indicate disease Z
  – Disease diagnosis
• Mutations on Chromosome Y indicate a tendency towards Z
  – Human Genome Project and genomics
• Those who buy X and Y are likely to perform act Z
  – Total Information Awareness (DoD)

Techniques
• Who knows who
  – People near each other have similar characteristics
• How important are the people you know
  – This indicates your own importance
• How can we cluster data observations
  – Decimation
• How can we separate the YES's from the NO's
  – Linear decision trees
• Merging information from multiple sources
  – Consistency checking
  – Building complete profiles

Who knows who (sidebar)
• Some experiments

How does Google work?
• 3 components
  – A web crawler
  – An indexer
  – A query processor

How Google works (cont.)
• Crawl the web starting from a web page
  – For each web page reached:
    Follow its hyperlinks to reach more web pages
    Retrieve the contents of each page
    Give each page a number
• Index the pages reached
  – For each page, record the words that occur there
  – Reverse the index so that every word lists the pages on which it occurs
• A search goes through this index
  – For each query word, find its pages
  – Merge the page lists to find the pages that contain all the words
• List pages by page rank

Making Google work efficiently
• Crawl the web starting from a web page
  – For each web page reached (a separate computer for each page):
    Follow its hyperlinks to reach more web pages
    Retrieve the contents of each page
    Give each page a number
• Index the pages reached
  – For each page, record the words that occur there (several words per computer)
  – Reverse the index so that every word lists the pages on which it occurs
• A search goes through this index
  – For each query word, find its pages (one (or more) computer per word)
  – Merge the lists to find pages with all the words (one or more computers)
• List pages by page rank

Page Rank
• Factors
  – Number of links
  – Importance of links
  – Proximity of words
  – Position of words (e.g. in title)
  – Frequency of words on page
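To make the "How Google works" outline above concrete, here is a toy sketch in Python of an indexer and query processor: it builds a reverse index mapping each word to the pages that contain it, then answers a query by merging (intersecting) the page lists for the query words. The three "pages" are invented, and sorting results by page number is only a stand-in for real ranking, which would combine factors like those just listed under Page Rank. A real engine also distributes the crawl, the index and the queries across many computers, as the "Making Google work efficiently" slide describes.

```python
# A toy indexer and query processor: pages get numbers, a reverse index maps each
# word to the set of pages containing it, and a query merges (intersects) the page
# lists for its words.  The pages and the trivial ordering are invented examples.

pages = {                                   # page number -> retrieved page text
    1: "big data and machine learning",
    2: "machine learning for search engines",
    3: "search engines index the web",
}

# Build the reverse (inverted) index: word -> set of page numbers.
index = {}
for number, text in pages.items():
    for word in text.split():
        index.setdefault(word, set()).add(number)

def search(query):
    """Return the pages that contain ALL of the query's words."""
    page_sets = [index.get(word, set()) for word in query.split()]
    if not page_sets:
        return []
    hits = set.intersection(*page_sets)     # the "merge page lists" step
    return sorted(hits)                     # stand-in for ranking by page rank

print(search("machine learning"))   # -> [1, 2]
print(search("search engines"))     # -> [2, 3]
print(search("big web"))            # -> []
```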
The actual search done by Google
From http://www.googleguide.com/google_works.html

Processing by Google
• More than a million CPUs are in Google data centers
• They are in data centers spread around the globe
  – It costs $500M-$1B to build a data center
• Where are the Google processors located?

How does Google decide where to locate data centers?
• The availability of large volumes of cheap electricity to power the data centers.
• Google's commitment to carbon neutrality, which has sharpened its focus on renewable power sources such as wind power and hydro power. The Dalles was chosen primarily for the availability of hydro power from the Columbia River, while the local utility's wind power program influenced the selection of Council Bluffs, Iowa.
• The presence of a large supply of water to support the chillers and water towers used to cool Google's data centers. A number of recent Google data center sites have been next to rivers or lakes.
• Large parcels of land, which allow for large buffer zones between the data center and nearby roads. This makes the facilities easier to secure, and is consistent with Google's focus on data center secrecy. Google purchased 215 acres in Lenoir, 520 acres for the Goose Creek project, 800 acres of land in Pryor, and more than 1,200 acres in Council Bluffs. The extra land may also be used for building windmill farms to provide supplemental power at some facilities.
• Distance to other Google data centers. Google needs lightning-fast response time for its searches, and prizes fast connections between its data centers. While big pipes can help address this requirement, some observers believe Google carefully spaces its data centers to preserve low latency in connections between facilities.
• Tax incentives. Legislators in North Carolina, South Carolina, Oklahoma and Iowa have all passed measures to provide tax relief to Google.
From http://www.datacenterknowledge.com/google-data-center-faq-part-2/

Website of the day
• 35 Random Corners Of The Internet You Should Visit When You Need A Break

A common technique
• We want a machine to learn from data (machine learning)
• and to make decisions (classification theory)
• Often, our methods of doing so have no connection to previous ways of thinking about the problem

Sample

A sample data set
• Wisconsin breast cancer data set
  – 1. Sample code number: id number
  – 2. Clump Thickness: 1 - 10
  – 3. Uniformity of Cell Size: 1 - 10
  – 4. Uniformity of Cell Shape: 1 - 10
  – 5. Marginal Adhesion: 1 - 10
  – 6. Single Epithelial Cell Size: 1 - 10
  – 7. Bare Nuclei: 1 - 10
  – 8. Bland Chromatin: 1 - 10
  – 9. Normal Nucleoli: 1 - 10
  – 10. Mitoses: 1 - 10
  – 11. Class: (2 for benign, 4 for malignant)
• The data set consists of 10-tuples (the numeric values for attributes 1-10 above) along with a class label (benign or malignant, attribute 11)
• We can then ask questions like
  – Is 3*#2 – 4*#3 + 6*#5 > 12? (tests like this may bear no relation to physical reality)
  – Various mathematical techniques are used to design tests
  – We study tests to see which classify best

How we might do a classification task
Goal: Devise a set of linear tests that separate the circles from the squares

Classifying
Find those points above the solid line and also to the left of the dotted line. Add those points above the dashed line. A new point can be classified as circle or square by these tests.

Alternative classifier
Find those points above the solid line. Add those points above the dashed line and also to the left of the dotted line. A new point can be classified as circle or square by these tests.

Discrepancies
One classifier labels the green point a square; the other labels it a circle.

Resolving discrepancies
• An algorithm for building a better classifier:
  1. Find many classifiers, each of which works for a portion of your test set
  2. For each new point in your test set, determine which classifiers work and which don't
  3. Combine the results to determine the vitality of each classifier and weight them accordingly
• The algorithm is said to learn the weights by many repetitions of steps 2 and 3.
• Algorithms of this type are said to use machine learning and emulate how humans learn
  – Find a hypothesis (or set of hypotheses)
  – Use observations to refine the hypotheses and match them to special circumstances
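The "Resolving discrepancies" algorithm above combines many partial classifiers by a weighted vote, adjusting the weights as it sees how often each classifier is right. Here is a minimal sketch of that idea in Python: each weak classifier is a single linear test (the same form as the "Is 3*#2 – 4*#3 + 6*#5 > 12" example), and a simple multiplicative update shrinks the weight of a test whenever it misclassifies a point. The 2-D points, the three tests and the update rule are invented for illustration and are not the lecture's exact method; real systems learn such weights with algorithms like boosting.

```python
# Weighted vote over simple linear tests, with weights adjusted by how often
# each test is right.  Points, tests and the update rule are illustrative only.

# Labeled 2-D points: +1 = "circle", -1 = "square" (made up for illustration).
points = [((1, 5), +1), ((2, 6), +1), ((6, 2), -1),
          ((7, 1), -1), ((3, 3), +1), ((5, 5), -1)]

# Each weak classifier is one linear test of the form sign(a*x + b*y - c).
tests = [
    lambda x, y: +1 if y - x > 0 else -1,        # above the "solid line"?
    lambda x, y: +1 if 4 - x > 0 else -1,        # left of the "dotted line"?
    lambda x, y: +1 if x + y - 9 < 0 else -1,    # below the "dashed line"?
]
weights = [1.0] * len(tests)

# "Learn" the weights by repeatedly checking which tests work on which points.
for _ in range(10):                      # many repetitions of steps 2 and 3
    for (x, y), label in points:
        for i, test in enumerate(tests):
            if test(x, y) != label:      # this test got the point wrong,
                weights[i] *= 0.8        # so trust it a little less

def classify(x, y):
    vote = sum(w * test(x, y) for w, test in zip(weights, tests))
    return "circle" if vote > 0 else "square"

print([round(w, 3) for w in weights])
print(classify(2, 5), classify(6, 1))    # expected: circle square
```

The test that is always right keeps its full weight, while the tests that misclassify some points are gradually discounted, so the combined vote resolves the discrepancies between the individual classifiers.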
The Netflix challenge (begun in 2006)
• A $1M prize offered to the research team that could best predict the ratings users would give to movies they had not yet rated.
• Netflix provided a training data set of 100,480,507 ratings that 480,189 users gave to 17,770 movies (as <user, movie, date of grade, grade>).
• The qualifying data set contains 2,817,131 entries of the form <user, movie, date of grade>, with grades known only to the jury.
• To get the grand prize, you had to beat Netflix's own algorithm (Cinematch) by 10%. Each year, a progress prize would be awarded if no team had achieved that goal.

The winners
• BellKor
  – Robert M. Bell, Yehuda Koren and Chris Volinsky of AT&T Labs
• How they did it
  – "Predictive accuracy is substantially improved when blending multiple predictors. Our experience is that most efforts should be concentrated in deriving substantially different approaches, rather than refining a single technique. Consequently, our solution is an ensemble of many methods."
  – "Below, we list all 107 results that were blended to deliver RMSE=0.8712, with their weights within the ensemble."
From http://www.netflixprize.com/assets/ProgressPrize2007_KorBell.pdf

Did they really need 107 results?
• "For completeness, we listed all 107 results that were blended in our RMSE=0.8712 submission. It is important to note that the major reason for using all these results was convenience, as we anyway accumulated them during the year. This is the nature of an ongoing competition. However, in hindsight, we would probably drop many of these results, and recompute some in a different way. We believe that far fewer results are really necessary. For example, based on just three results one can breach the RMSE=0.8800 barrier: a blend of #8, #38, and #92, with weights 0.1893, 0.4225, and 0.4441, respectively, would already achieve a solution with an RMSE of 0.8793."
• "Similarly, combining #8, #38, and #64 yields RMSE=0.8798. Notice that these combinations touch the main approaches (kNN, factorization, RBMs and asymmetric factor models). In addition, we have found that at most 11 results suffice for achieving a blend with above 8% improvement over Cinematch score."

Encore – the next Netflix challenge
• From Neil Hunt, Chief Product Officer for Netflix (March 12, 2010):
  – "About five months ago we announced that Netflix would sponsor a sequel to the Netflix Prize."
  – "In the past few months, the Federal Trade Commission (FTC) asked us how a Netflix Prize sequel might affect Netflix members' privacy, and a lawsuit was filed by KamberLaw LLC pertaining to the sequel. With both the FTC and the plaintiffs' lawyers, we've had very productive discussions centered on our commitment to protecting our members' privacy."
  – "In light of all this, we have decided to not pursue the Netflix Prize sequel that we announced on August 6, 2009."
From: http://blog.netflix.com/2010/03/this-is-neil-hunt-chief-product-officer.html
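As a closing illustration of the blending idea in the BellKor write-up above, the sketch below combines three predictors' rating estimates as a weighted sum and scores each by RMSE. Only the three weights (0.1893, 0.4225, 0.4441 for results #8, #38 and #92) come from the quoted paper; the "true" ratings and the predictors' outputs are made up here, so the numbers show the mechanics of blending, not the actual Netflix results.

```python
# Blend several predictors' rating estimates as a weighted sum and score by RMSE.
# Only the three weights come from the quoted BellKor write-up; the ratings and
# the predictors' outputs are invented for illustration.
from math import sqrt

true_ratings = [4, 3, 5, 2, 4]                     # invented user ratings
predictions = {                                    # invented predictor outputs
    "#8":  [3.4, 2.6, 4.4, 1.8, 3.5],
    "#38": [4.1, 3.0, 4.9, 2.1, 3.9],
    "#92": [3.6, 2.7, 4.5, 1.7, 3.7],
}
weights = {"#8": 0.1893, "#38": 0.4225, "#92": 0.4441}

def rmse(pred, truth):
    return sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))

# The blend: a weighted sum of the individual predictions, movie by movie.
blend = [sum(weights[name] * predictions[name][i] for name in weights)
         for i in range(len(true_ratings))]

for name, pred in predictions.items():
    print(name, "alone:", round(rmse(pred, true_ratings), 3))
print("blend:", round(rmse(blend, true_ratings), 3))
# With these invented numbers the blend scores better (lower RMSE) than any
# single predictor, which is the effect the BellKor team describes.
```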