New Directions in Large-Scale Data Analysis

Padhraic Smyth
Professor, Department of Computer Science
Director, UCI Data Science Initiative

Padhraic Smyth, SIMS Presentation, March 2015

Terminology
Large-scale Data Analysis
Data Mining
Data Science
Big Data
Machine Learning
Computational Statistics
……
……Using computer algorithms to analyze data sets that are too large and complex for humans to work with

350,000 new tweets
204 million emails sent
2.5 million search queries issued
Graphic from http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html, downloaded in 2011

A Revolution in Data Technology
Graphic from Ray Kurzweil, singularity.com

A Paradigm Shift in Data Analysis
• Technological drivers
  – Sensors
  – Storage
  – Computation
  – Algorithms
  – Internet Access
• Convergence … tremendous demand for data analysis
  – In the sciences, in medicine, in engineering, in business, and more……
• Problems often require a combination of skills
  – Algorithms and optimization
  – Large-scale data management
  – Machine learning
  – Statistics

Statistics and Computing
• Post World War II
  – Increasing use of computing to solve algorithmic aspects of statistical analyses
• 1960’s
  – Development of statistical computing and exploratory data analysis
• 1980’s
  – Computing allowed statisticians to explore more flexible models
  – Increase in use of “non-parametric” techniques and simulation methods
• 1990’s
  – Development of “machine learning” – very flexible predictive modeling techniques
• Today
  – Distinctions between statistics and computer science often blurred
  – “Data science”, “big data”, “predictive analytics” are everywhere

Graphics from Lars Backstrom, ESWC 2011

From “Chocolate Consumption, Cognitive Function, and Nobel Laureates,” F. H. Messerli, New Eng. J. Medicine, 2012

How Much Climate Data Do We Actually Have?
Image from http://cimss.ssec.wisc.edu/
Image from ipcc.ch

Research at UC Irvine in Large-Scale Data Analysis

Three Illustrative Examples of Current UCI Research
1. Learning to make predictions with neural networks
2. Automatically extracting information from text
3. Modeling social media and sensor data

Examples of Input/Output Prediction

Application            Input (x variables)         Predicted Output
Spam email detection   Word counts in an email     Spam or not?
Sentiment detection    Word counts in a document   Positive or negative?
Online advertising     Text and user features      Will a user click or not?
Face detection         Image pixels                Face in image or not?
Speech recognition     Spectral energies           Identity of spoken word

An Artificial Neuron Model
[Figure: inputs x1, x2, x3, x4 connected by edges to an output f(x)]
Each “edge” has an associated weight or parameter
Output is a weighted sum of the inputs
Goal: learn the weights that best predict the output

Training and Prediction
Training Data (used to learn the model): Labeled Examples, Input Variables, Class Labels are Known
Future Data (using the model to make predictions): Unlabeled Examples, Class Labels are Unknown

A Neural Network with 1 Hidden Layer
[Figure: network with inputs x1, x2, x3, x4 and output f(x)]
Can recursively create more complex prediction models
Many more weights now … requires more data to estimate

Deep Learning: Models with 2 or More Hidden Layers
We can build on this idea to create “deep models” with many hidden layers
[Figure: network with inputs x1, x2, x3, x4 and output f(x)]
The model is now a very flexible, highly non-linear function
Significant resurgent interest in the past 3 years in “deep learning”

Figure from Krizhevsky, Sutskever, Hinton, 2012
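
The artificial neuron and the one-hidden-layer network described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the weights are made-up values rather than learned ones, the function names are hypothetical, and the sigmoid non-linearity on the hidden units is an assumption, since the slides do not name one.

```python
import math

def neuron(x, weights):
    """Weighted sum of the inputs: one weight per input 'edge'."""
    return sum(w * xi for w, xi in zip(weights, x))

def sigmoid(z):
    """Squashing non-linearity (an assumed choice, not named in the slides)."""
    return 1.0 / (1.0 + math.exp(-z))

def one_hidden_layer(x, hidden_weights, output_weights):
    """Each hidden unit is a neuron over x; the output is a neuron over the hidden units."""
    hidden = [sigmoid(neuron(x, w)) for w in hidden_weights]
    return neuron(hidden, output_weights)

x = [1.0, 2.0, 3.0, 4.0]        # inputs x1..x4
w = [0.5, -1.0, 0.25, 0.0]      # illustrative weights, not learned
single_output = neuron(x, w)    # 0.5 - 2.0 + 0.75 + 0.0

# Two hidden units, each with its own weight per input, plus two output weights.
hidden_w = [[0.1, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, -0.1]]
out_w = [1.0, 1.0]
network_output = one_hidden_layer(x, hidden_w, out_w)

# Weight counts: 4 for the single neuron, 4*2 + 2 = 10 for this small network.
n_single = len(w)
n_network = sum(len(row) for row in hidden_w) + len(out_w)
```

Even this tiny network has more than twice as many weights as the single neuron, which is the sense in which deeper models require more data to estimate.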

Visualizing what the Hidden Units are Learning
Figure from Lee et al., ICML 2009

Geolocated Tweets in Southern California over 6 months

Geolocated Tweets around UC Irvine

Geolocated Tweets at John Wayne Airport

Research with Geolocated Event Data
Applications?
  Personalization (e.g., for recommendations)
  Advertising
  Public and individual health
  Social science/behavioral research
  Urban planning/smart cities
Challenges?
  Privacy (big brother)
  Non-stationarity
  Heterogeneity
  Sparsity
  Diverse data

[Figure panels: Model from Population Data; Combined Spatial Density Model; Model from Individual Data]

Text Collections
NYT: 330,000 articles
CiteSeer: 600,000 abstracts
Enron: 250,000 emails
Pennsylvania Gazette: 80,000 articles, 1728-1800
NSF/NIH: 100,000 grants
16 million Medline articles

Topics are Represented as Distributions over Words
Terrorism: SEPT_11, WAR, SECURITY, IRAQ, TERRORISM, NATION, KILLED, AFGHANISTAN, ATTACKS, OSAMA_BIN_LADEN
Wall Street Firms: WALL_STREET, ANALYSTS, INVESTORS, FIRM, GOLDMAN_SACHS, FIRMS, INVESTMENT, MERRILL_LYNCH, COMPANIES, SECURITIES
Stock Market: WEEK, DOW_JONES, POINTS, 10_YR_TREASURY_YIELD, PERCENT, CLOSE, NASDAQ_COMPOSITE, STANDARD_POOR, CHANGE, FRIDAY
Bankruptcy: BANKRUPTCY, CREDITORS, BANKRUPTCY_PROTECTION, ASSETS, COMPANY, FILED, BANKRUPTCY_FILING, ENRON, BANKRUPTCY_COURT, KMART

Documents are Represented as Combinations of Topics
Topics: Terrorism, Wall Street Firms, Stock Market, Bankruptcy (same word lists as above)
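
Both ideas, topics as distributions over words and documents as combinations of topics, can be sketched with ordinary dictionaries. This is a rough illustration rather than the topic-modeling algorithm itself: the word counts are hypothetical (the slides list the words but not their weights), the helper names are invented for this example, and assigning a 70%/30% split to these two particular topics is an assumption.

```python
def to_distribution(word_counts):
    """Normalize non-negative word counts into probabilities that sum to 1."""
    total = sum(word_counts.values())
    return {word: count / total for word, count in word_counts.items()}

def top_words(distribution, k):
    """Return the k most probable words, highest probability first."""
    ranked = sorted(distribution.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:k]]

def mix_topics(topics, proportions):
    """Word distribution of a document that combines topics in the given proportions."""
    mixed = {}
    for name, share in proportions.items():
        for word, prob in topics[name].items():
            mixed[word] = mixed.get(word, 0.0) + share * prob
    return mixed

# Hypothetical counts for a few of the words listed under two of the topics.
topic_counts = {
    "Terrorism": {"SEPT_11": 50, "WAR": 30, "SECURITY": 20},
    "Bankruptcy": {"BANKRUPTCY": 60, "CREDITORS": 25, "ENRON": 15},
}
topics = {name: to_distribution(counts) for name, counts in topic_counts.items()}
terrorism_top = top_words(topics["Terrorism"], 2)

# A document that is 70% one topic and 30% another (assignment is illustrative).
document_1 = mix_topics(topics, {"Terrorism": 0.7, "Bankruptcy": 0.3})
```

Because each topic sums to 1 and the proportions sum to 1, the mixed word distribution for the document also sums to 1.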
[Figure: Document 1, Document 2, and Document 3, each shown as a mixture of topics, with proportions 70%, 30%, 50%, 50%, 90%, …]

Topic Modeling Algorithm: Learn Topics from Documents
[Figure: Topic 1 to Topic 4 and the topic proportions for Document 1, Document 2, Document 3, all shown as unknown (?)]

Examples of Topics Learned from 100,000 NIH Grant Abstracts

Breast Cancer           Skin Cancer     Testing and Biomarkers   Diet       Conference Support
breast cancer           melanoma        detection                diet       conference
women                   skin cancer     assay                    vitamin    meeting
breast cancer cells     skin            method                   obesity    field
estrogen                melanoma cell   technology               activity   participant
human breast cancer     melanomacyte    sample                   risk       session
breast cancer patient   scc             biomarker                selenium   area
tamoxifen               keratinocyte    approach                 change     scientist
breast tumor            mutation        analysis                 subject    workshop
estrogen receptor       selenoprotein   early detection          month      topic
brca1                   exposure        phase                    food       symposium

Topic Trends (New York Times Articles)
[Figure: eight time-series panels, one per topic (Basketball, Tour-de-France, Oscars, Quarterly-Earnings, Sept-11-Attacks, Anthrax, DC-Sniper, Enron); y-axis in kwords, x-axis from Jan00 to Jan03]

Real-Time Topic Modeling of Search Results
[Figure panels: Learned Topics; Topic Mixtures]

Challenges in Large-Scale Data Analysis
• Statistical
  – Data are usually not from a nice random sample
• Algorithmic
  – Scalability: applying an O(n^3) algorithm when n = 1 million
• Engineering and Operations
  – Can the model be updated automatically every night?
• Human and Socio-Cultural
  – Customer privacy
• Educational
  – Intersection of statistics, computer science, applied math

New UCI undergraduate degree program proposed in Data Science, jointly between Statistics and Computer Science

March 13th, Calit2 Auditorium, UCI
May 9th, Calit2 Auditorium, UCI
Data Science Website: http://datascience.uci.edu

Acknowledgements
Students and Colleagues: Arthur Asuncion, Carter Butts, Chris DuBois, Jon Hutchins, Jimmy Foulds, Moshe Lichman, Nick Navaroli
Funding

Thanks for listening…… questions?
More information at www.datascience.uci.edu

BACKUP SLIDES
from IEEE Intelligent Systems, 2009
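
Returning to the scalability challenge listed earlier (an O(n^3) algorithm with n = 1 million), a back-of-the-envelope calculation shows why it matters. The sketch below assumes a hypothetical machine performing one billion basic operations per second and ignores constant factors; neither assumption comes from the slides.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # ignoring leap years

def runtime_seconds(n, exponent, ops_per_second):
    """Seconds needed for n**exponent basic operations at a fixed rate."""
    return n ** exponent / ops_per_second

n = 1_000_000
assumed_ops_per_second = 1e9  # hypothetical machine speed, not from the slides

cubic_seconds = runtime_seconds(n, 3, assumed_ops_per_second)
cubic_years = cubic_seconds / SECONDS_PER_YEAR
quadratic_seconds = runtime_seconds(n, 2, assumed_ops_per_second)

print(f"O(n^3): about {cubic_years:.1f} years")
print(f"O(n^2): about {quadratic_seconds / 60:.1f} minutes")
```

Under these assumptions the cubic algorithm needs 10^18 operations, roughly three decades of computing, while a quadratic one finishes in under twenty minutes.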