New Directions in Large-Scale Data Analysis
Padhraic Smyth
Professor, Department of Computer Science
Director, UCI Data Science Initiative
Padhraic Smyth, SIMS Presentation, March 2015: 2
Terminology
Large-scale Data Analysis
Data Mining
Data Science
Big Data
Machine Learning
Computational Statistics
…
… Using computer algorithms to analyze data sets that are too large and complex for humans to work with
In one Internet minute:
• 350,000 new tweets
• 204 million emails sent
• 2.5 million search queries issued
Graphic from http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html, downloaded in 2011
A Revolution in Data Technology
Graphic from Ray Kurzweil, singularity.com
A Paradigm Shift in Data Analysis
• Technological drivers
– Sensors
– Storage
– Computation
– Algorithms
– Internet Access
• Convergence… tremendous demand for data analysis
– In the sciences, in medicine, in engineering, in business, and more…
• Problems often require a combination of skills
– Algorithms and optimization
– Large-scale data management
– Machine learning
– Statistics
Statistics and Computing
• Post World War II
– Increasing use of computing to solve algorithmic aspects of statistical analyses
• 1960s
– Development of statistical computing and exploratory data analysis
• 1980s
– Computing allowed statisticians to explore more flexible models
– Increase in use of “non-parametric” techniques and simulation methods
• 1990s
– Development of “machine learning”: very flexible predictive modeling techniques
• Today
– Distinctions between statistics and computer science are often blurred
– “Data science”, “big data”, and “predictive analytics” are everywhere
Graphics from Lars Backstrom, ESWC 2011
From “Chocolate Consumption, Cognitive Function, and Nobel Laureates,” F. H. Messerli, New England Journal of Medicine, 2012
How Much Climate Data Do We Actually Have?
Image from http://cimss.ssec.wisc.edu/
Image from ipcc.ch
Research at UC Irvine in Large-Scale Data Analysis
Three Illustrative Examples of Current UCI Research
1. Learning to make predictions with neural networks
2. Automatically extracting information from text
3. Modeling social media and sensor data
Examples of Input/Output Prediction
Application           Input (x variables)         Predicted Output
Spam email detection  Word counts in an email     Spam or not?
Sentiment detection   Word counts in a document   Positive or negative?
Online advertising    Text and user features      Will a user click or not?
Face detection        Image pixels                Face in image or not?
Speech recognition    Spectral energies           Identity of spoken word
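As a rough illustration of the "word counts" inputs in the examples above, the sketch below turns raw text into a vector of x variables. The vocabulary, the sample email, and the function name `word_counts` are invented for illustration; a real system would use a much larger vocabulary and a more careful tokenizer.

```python
import re
from collections import Counter

# Hypothetical vocabulary: each word is one input variable x_j.
VOCAB = ["free", "money", "meeting", "report", "winner"]

def word_counts(text, vocab=VOCAB):
    """Return the input vector x: how often each vocabulary word occurs in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

email = "Free money! You are a WINNER. Claim your free prize."
x = word_counts(email)
print(dict(zip(VOCAB, x)))
# {'free': 2, 'money': 1, 'meeting': 0, 'report': 0, 'winner': 1}
```

The predicted output (spam or not) would then come from a model applied to `x`.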
An Artificial Neuron Model
[Figure: four inputs x1, x2, x3, x4, each connected by an edge to a single output f(x)]
Each “edge” has an associated weight or parameter
Output is a weighted sum of the inputs
Goal: learn the weights that best predict the output
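A minimal sketch of the neuron described above: one weight per input edge, and an output that is the weighted sum. The optional `bias` term and the numeric values are assumptions added for illustration and are not taken from the slide.

```python
def neuron(x, weights, bias=0.0):
    """Output f(x): the weighted sum of the inputs (plus an optional bias)."""
    if len(x) != len(weights):
        raise ValueError("need one weight per input")
    return bias + sum(w * xi for w, xi in zip(weights, x))

# Four inputs x1..x4 and one hypothetical weight per edge.
x = [1.0, 0.0, 2.0, 1.0]
weights = [0.5, -1.0, 0.25, 2.0]
print(neuron(x, weights))  # 0.5 + 0.0 + 0.5 + 2.0 = 3.0
```

Learning means choosing `weights` so that `neuron(x, weights)` is close to the known output on the training examples.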
Training and Prediction
Training Data (used to learn the model): labeled examples, where the input variables are observed and the class labels are known
Future Data (using the model to make predictions): unlabeled examples, where the input variables are observed and the class labels are unknown
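The two phases can be sketched as follows. The perceptron update rule is used here only because it is short; the slides do not say which learning algorithm is intended, and the word-count data are invented.

```python
def predict(weights, bias, x):
    """Predict class 1 if the weighted sum is positive, else class 0."""
    s = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if s > 0 else 0

def train_perceptron(examples, labels, epochs=20, lr=1.0):
    """Learn weights from labeled examples with the perceptron update rule."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            error = y - predict(weights, bias, x)  # -1, 0, or +1
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# Training data: hypothetical word counts [free, money, meeting]; label 1 = spam.
train_x = [[2, 1, 0], [3, 0, 0], [0, 0, 2], [0, 1, 3]]
train_y = [1, 1, 0, 0]
weights, bias = train_perceptron(train_x, train_y)

# Future data: class labels are unknown, so the model predicts them.
future_x = [[4, 2, 0], [0, 0, 5]]
print([predict(weights, bias, x) for x in future_x])  # [1, 0]
```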
A Neural Network with 1 Hidden Layer
[Figure: inputs x1, x2, x3, x4 feed a layer of hidden units, which in turn feed the output f(x)]
Can recursively create more complex prediction models
Many more weights now… requires more data to estimate
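A sketch of a one-hidden-layer network built from the same weighted-sum idea, with a count of how many weights it has. The `tanh` squashing function, the layer sizes, and all numeric weights are assumptions chosen for illustration; the slide does not specify them.

```python
import math

def layer(x, weight_rows, biases):
    """One layer: each unit takes a weighted sum of x, then a tanh squashing."""
    return [math.tanh(b + sum(w * xi for w, xi in zip(row, x)))
            for row, b in zip(weight_rows, biases)]

def one_hidden_layer_net(x, hidden_w, hidden_b, out_w, out_b):
    """f(x): a weighted sum of hidden units, each itself a neuron on x."""
    h = layer(x, hidden_w, hidden_b)
    return out_b + sum(w * hi for w, hi in zip(out_w, h))

def count_weights(n_inputs, n_hidden):
    """Weights and biases in the hidden layer plus those of the output unit."""
    return n_hidden * (n_inputs + 1) + (n_hidden + 1)

# 4 inputs, 3 hidden units, hypothetical weights.
hidden_w = [[0.1, 0.2, -0.1, 0.0], [0.0, -0.3, 0.2, 0.1], [0.5, 0.0, 0.0, -0.2]]
hidden_b = [0.0, 0.0, 0.0]
out_w = [1.0, -1.0, 0.5]
print(one_hidden_layer_net([1.0, 0.0, 2.0, 1.0], hidden_w, hidden_b, out_w, 0.0))
print(count_weights(4, 3))  # 19, versus 5 for a single neuron with a bias
```

Even this tiny network has almost four times as many parameters as a single neuron, which is why more data are needed to estimate them.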
Deep Learning: Models with 2 or More Hidden Layers
We can build on this idea to create “deep models” with many hidden layers
[Figure: inputs x1, x2, x3, x4 pass through several hidden layers to the output f(x)]
The model is now a very flexible, highly non-linear function
Significant resurgent interest in “deep learning” in the past 3 years
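To make "highly non-linear" concrete, the sketch below stacks hidden layers in a loop and checks that the result does not behave like a linear function. The `tanh` activation, the 4 -> 3 -> 2 layer sizes, and every weight are made-up values, not anything specified in the slides.

```python
import math

def dense(x, weight_rows, biases, activation):
    """Apply one fully connected layer to the input vector x."""
    return [activation(b + sum(w * xi for w, xi in zip(row, x)))
            for row, b in zip(weight_rows, biases)]

def deep_forward(x, hidden_layers, out_weights, out_bias=0.0):
    """Pass x through every hidden layer in turn, then take a weighted sum."""
    h = x
    for weight_rows, biases in hidden_layers:
        h = dense(h, weight_rows, biases, math.tanh)
    return out_bias + sum(w * hi for w, hi in zip(out_weights, h))

# Two hidden layers (4 -> 3 -> 2) with made-up weights.
hidden_layers = [
    ([[0.5, -0.2, 0.1, 0.3], [0.0, 0.4, -0.6, 0.2], [-0.3, 0.1, 0.2, 0.5]],
     [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5], [0.3, 0.8, -0.7]], [0.0, 0.0]),
]
out_weights = [1.5, -2.0]

def f(x):
    return deep_forward(x, hidden_layers, out_weights)

a = [1.0, 0.0, 2.0, 1.0]
b = [0.0, 1.0, -1.0, 3.0]
a_plus_b = [ai + bi for ai, bi in zip(a, b)]
# A linear function would satisfy f(a + b) == f(a) + f(b); this one does not.
print(f(a_plus_b), f(a) + f(b))
```

Adding another hidden layer only means appending one more `(weights, biases)` pair to `hidden_layers`.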
Figure from Krizhevsky, Sutskever, Hinton, 2012
Visualizing what the Hidden Units are Learning
Figure from Lee et al., ICML 2009
Geolocated Tweets in Southern California over 6 months
Geolocated Tweets around UC Irvine
Geolocated Tweets at John Wayne Airport
Research with Geolocated Event Data
• Applications?
– Personalization (e.g., for recommendations)
– Advertising
– Public and individual health
– Social science/behavioral research
– Urban planning/smart cities
• Challenges?
– Privacy (big brother)
– Non-stationarity
– Heterogeneity
– Sparsity
– Diverse data
Combined Spatial Density Model
[Figure: a model from population data and a model from individual data are combined into one spatial density model]
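One way to read this figure is as a mixture: the density for a person is a weighted blend of a density fit to that person's own points and a density fit to everyone's points, which helps when an individual has only a few observations. The slide gives no formula, so the Gaussian kernels, the fixed `bandwidth`, the mixing weight `alpha`, and the coordinates below are all illustrative assumptions.

```python
import math

def kde(point, data, bandwidth):
    """2-D Gaussian kernel density estimate at point from a list of (x, y) data."""
    px, py = point
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2)
    total = 0.0
    for dx, dy in data:
        sq_dist = (px - dx) ** 2 + (py - dy) ** 2
        total += norm * math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / len(data)

def combined_density(point, individual, population, alpha=0.7, bandwidth=0.5):
    """Mixture of an individual-level and a population-level density."""
    return (alpha * kde(point, individual, bandwidth)
            + (1.0 - alpha) * kde(point, population, bandwidth))

# Made-up coordinates: one user with two events, plus a larger population sample.
individual = [(0.0, 0.0), (0.2, 0.1)]
population = [(0.0, 0.0), (0.2, 0.1), (3.0, 3.0), (3.1, 2.9), (2.8, 3.2), (5.0, 0.0)]

print(combined_density((0.1, 0.0), individual, population))  # near the user's events
print(combined_density((3.0, 3.0), individual, population))  # never visited by the user
```

The second location gets a small but non-zero density because the population component covers places this user has not yet been seen.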
Text Collections
NYT: 330,000 articles
CiteSeer: 600,000 abstracts
Enron: 250,000 emails
Pennsylvania Gazette: 80,000 articles, 1728-1800
NSF/NIH: 100,000 grants
16 million Medline articles
Topics are Represented as Distributions over Words
Terrorism: SEPT_11, WAR, SECURITY, IRAQ, TERRORISM, NATION, KILLED, AFGHANISTAN, ATTACKS, OSAMA_BIN_LADEN
Wall Street Firms: WALL_STREET, ANALYSTS, INVESTORS, FIRM, GOLDMAN_SACHS, FIRMS, INVESTMENT, MERRILL_LYNCH, COMPANIES, SECURITIES
Stock Market: WEEK, DOW_JONES, POINTS, 10_YR_TREASURY_YIELD, PERCENT, CLOSE, NASDAQ_COMPOSITE, STANDARD_POOR, CHANGE, FRIDAY
Bankruptcy: BANKRUPTCY, CREDITORS, BANKRUPTCY_PROTECTION, ASSETS, COMPANY, FILED, BANKRUPTCY_FILING, ENRON, BANKRUPTCY_COURT, KMART
Documents are Represented as Combinations of Topics
Topics: Terrorism, Wall Street Firms, Stock Market, Bankruptcy (same word lists as on the preceding slide)
[Figure: Documents 1, 2, 3, … are each shown as a weighted combination of these topics; the weights shown are 70%, 30%, 50%, 50%, 90%, …]
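The two ideas (topics as distributions over words, documents as combinations of topics) can be sketched together. The word probabilities, the four-word vocabulary, and the choice of which two topics receive the 70% and 30% weights are assumptions for illustration; the original figure does not survive in enough detail to recover them.

```python
# Each topic is a probability distribution over words (hypothetical numbers,
# truncated to a 4-word vocabulary so that each topic sums to 1).
topics = {
    "Terrorism":  {"WAR": 0.6, "SECURITY": 0.4, "INVESTORS": 0.0, "ENRON": 0.0},
    "Bankruptcy": {"WAR": 0.0, "SECURITY": 0.1, "INVESTORS": 0.3, "ENRON": 0.6},
}

def word_distribution(doc_mixture, topics):
    """p(word | document) = sum over topics of p(topic | document) * p(word | topic)."""
    probs = {}
    for topic, weight in doc_mixture.items():
        for word, p in topics[topic].items():
            probs[word] = probs.get(word, 0.0) + weight * p
    return probs

# A document that is 70% one topic and 30% another (topic assignment is assumed).
doc = {"Terrorism": 0.7, "Bankruptcy": 0.3}
print(word_distribution(doc, topics))
```

Because the topic weights and each topic's word probabilities both sum to 1, the resulting word distribution for the document also sums to 1.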
Topic Modeling Algorithm: Learn Topics from Documents
[Figure: Topics 1 to 4 and the topic weights for Documents 1, 2, 3, … are all unknown (shown as “?”) and must be learned from the documents]
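The slide does not name the learning algorithm. One widely used choice is latent Dirichlet allocation (LDA) fit by collapsed Gibbs sampling; the toy sketch below assumes that choice. The documents, the number of topics, and the hyperparameters `alpha` and `beta` are invented, and a practical implementation would be vectorized and run on far more text.

```python
import random

def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs is a list of lists of words."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]       # tokens of doc d in topic k
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_d)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized probability of each topic given all other tokens.
                scores = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + V * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=scores)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed estimates of the topic-word and document-topic distributions.
    topics = [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + V * beta) for w in vocab}
        for k in range(n_topics)
    ]
    mixtures = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    return topics, mixtures

docs = [
    "war security iraq war attacks".split(),
    "investors firm securities investors analysts".split(),
    "war iraq security firm investors".split(),
]
topics, mixtures = lda_gibbs(docs, n_topics=2)
for k, topic in enumerate(topics):
    print(k, sorted(topic, key=topic.get, reverse=True)[:3])
print(mixtures)
```

The returned `topics` fill in the unknown word distributions and `mixtures` fill in the unknown per-document weights; with so little text the result depends heavily on the random seed.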
Examples of Topics Learned from 100,000 NIH Grant Abstracts
Breast Cancer: breast cancer, women, breast cancer cells, estrogen, human breast cancer, breast cancer patient, tamoxifen, breast tumor, estrogen receptor, brca1
Skin Cancer: melanoma, skin cancer, skin, melanoma cell, melanomacyte, scc, keratinocyte, mutation, selenoprotein, exposure
Testing and Biomarkers: detection, assay, method, technology, sample, biomarker, approach, analysis, early detection, phase
Diet: diet, vitamin, obesity, activity, risk, selenium, change, subject, month, food
Conference Support: conference, meeting, field, participant, session, area, scientist, workshop, topic, symposium
Topic Trends (New York Times Articles)
[Figure: eight time-series panels, one per topic, with time on the x-axis (Jan00 to Jan03) and kwords on the y-axis. Panels: Basketball, Sept-11-Attacks, Tour-de-France, Anthrax, Oscars, DC-Sniper, Quarterly-Earnings, Enron]
Real-Time Topic Modeling of Search Results
[Figure labels: Learned Topics; Topic Mixtures]
Challenges in Large-Scale Data Analysis
• Statistical
– Data are usually not from a nice random sample
• Algorithmic
– Scalability: applying an O(n^3) algorithm when n = 1 million
• Engineering and Operations
– Can the model be updated automatically every night?
• Human and Socio-Cultural
– Customer privacy
• Educational
– Intersection of statistics, computer science, applied math
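A back-of-the-envelope calculation shows why the cubic-time example in the scalability bullet is a problem. The rate of one billion basic steps per second is an assumed round number, not a figure from the talk.

```python
import math

n = 1_000_000
operations = n ** 3                      # 10**18 basic steps
ops_per_second = 1e9                     # assumed: about one billion steps per second
seconds = operations / ops_per_second    # 1e9 seconds
years = seconds / (365.25 * 24 * 3600)
print(f"cubic: about {years:.1f} years")

# For comparison, an O(n log n) algorithm at the same assumed rate:
nlogn_seconds = n * math.log2(n) / ops_per_second
print(f"n log n: about {nlogn_seconds:.3f} seconds")
```

At that rate the cubic algorithm needs roughly three decades, while an n log n algorithm on the same data finishes in a small fraction of a second.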
New UCI undergraduate degree program proposed in Data
Science, jointly between Statistics and Computer Science
March 13th, Calit2 Auditorium, UCI
May 9th, Calit2 Auditorium, UCI
Data Science Website: http://datascience.uci.edu
Acknowledgements
Students and Colleagues
Arthur Asuncion, Carter Butts, Chris DuBois, Jon Hutchins, Jimmy Foulds, Moshe Lichman, Nick Navaroli
Funding
Thanks for listening… Questions?
More information at www.datascience.uci.edu
BACKUP SLIDES
from IEEE Intelligent Systems, 2009