Data Intensive Research: Data Analysis Chris Williams March 2010

Data Intensive Research: Data Analysis
Chris Williams
School of Informatics, University of Edinburgh
March 2010
- “... drowning in information, but starving for knowledge” (Naisbitt, 1982).
- Statistics, machine learning, data mining (called “the numerati” in a recent popular book)
- Data-centric thinking: problems and solutions that arise across a wide range of application domains
- Serious progress will likely require a partnership between domain experts and data analysts to develop new techniques/methods to address specific structure or features in the problem domain
Outline
- Types of analysis
  - Exploratory data analysis
  - Descriptive data analysis
  - Predictive data analysis
- Issues
  - Data Cleaning and Quality
  - Complexity and Prior Knowledge
  - Scale
  - Training requirements
Exploratory data analysis (EDA)
Visualization:
- Uses the power of eye/brain to find structure in data
- Opposite end of spectrum from formal model building
- Helps to find unexpected relationships and to identify outliers etc.
Descriptive Data Analysis
- Task is to discover significant patterns or features in the input data, without a teacher; self-organization
- No external teacher or critic, but often an internal quality measure is optimized
- Examples
  - Clustering
  - Dimensionality reduction/fitting a lower-dimensional manifold
[Figure: scatter plot of points summarized by cluster centres (0-d), marked X]
[Figure: scatter plot of points summarized by lines, sheets (1-d, 2-d)]
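A minimal sketch of clustering as the optimization of an internal quality measure. It uses a basic k-means loop, which is an assumed choice of method here since the slides do not name one. The function names, the toy points and the naive initialisation (first k points) are illustrative only. The quantity being reduced is the within-cluster sum of squared distances.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def kmeans(points, k, iters=20):
    """Basic k-means: returns (centres, within-cluster sum of squares)."""
    # Naive initialisation: use the first k points as starting centres.
    centres = list(points[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centres[c]))
            clusters[nearest].append(p)
        # Move each centre to the mean of its cluster (keep it if the cluster is empty).
        centres = [mean_point(c) if c else centres[i] for i, c in enumerate(clusters)]
    cost = sum(min(sq_dist(p, c) for c in centres) for p in points)
    return centres, cost

# Toy data (made up): two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centres, cost = kmeans(points, k=2)
```

On this toy data the two centres settle at the middle of each group. k-means only finds a local optimum, so a different starting point can give a worse solution; in practice several restarts are usually compared.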
- Example: Topic Modelling (Blei et al, 2003)
  - Bag-of-words representation for each document (ignore word order)
  - Each document is regarded as being generated from a weighted set of topics
  - Learn topics and estimate per-document topic weightings
  - AP corpus, 16,333 articles with 23,075 unique terms
Figure credit: Blei et al, 2003
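A small sketch of the bag-of-words representation mentioned above, assuming a very simple tokenizer (lowercase runs of letters). The two documents and all names are made up for illustration. It covers only the representation step, not the learning of topics.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Map a document to word counts, discarding word order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

# Two made-up documents containing the same words in a different order.
doc_a = "the star is not the galaxy"
doc_b = "the galaxy is not the star"
bag_a, bag_b = bag_of_words(doc_a), bag_of_words(doc_b)

# A shared vocabulary turns each bag into a fixed-length count vector.
vocab = sorted(set(bag_a) | set(bag_b))
vector_a = [bag_a[w] for w in vocab]
vector_b = [bag_b[w] for w in vocab]
```

Because word order is ignored, both documents map to the same count vector. That is the information a bag-of-words model deliberately throws away.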
Example: Learning about objects from image sequences
(Williams and Titsias, 2004)
Predictive Modelling
- Learning from input-output pairs
- Predict output(s) given inputs
- Examples
  - SKICAT (JPL/Caltech): Predict if an astronomical object is a star or a galaxy
  - Predicting disease presence/absence based on gene expression data
  - PapNet – searching for abnormal cells in Pap smears
- Classification and regression problems
- Methods: linear regression, logistic regression, decision trees, nearest-neighbour methods, neural networks, support vector machines
- Example 2: Netflix data
  - Over 100 million ratings from over 480,000 customers on nearly 18,000 movies, competition to win $1,000,000
Figure credit: Andriy Mnih and Ruslan Salakhutdinov
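Of the predictive methods listed under Predictive Modelling, a nearest-neighbour classifier is perhaps the shortest to sketch. The training pairs below are made up and the function name is hypothetical. It is meant only to show what "learning from input-output pairs" looks like in code, not as a recommended method for any of the examples above.

```python
import math

def nearest_neighbour_predict(train_inputs, train_outputs, query):
    """Return the output of the training input closest to `query` (1-NN)."""
    best = min(range(len(train_inputs)),
               key=lambda i: math.dist(train_inputs[i], query))
    return train_outputs[best]

# Made-up training set: 2-d inputs with class labels 0 or 1.
train_inputs = [(0.0, 0.0), (0.5, 1.0), (1.0, 0.5),
                (5.0, 5.0), (5.5, 4.5), (4.5, 5.5)]
train_outputs = [0, 0, 0, 1, 1, 1]

# Predict the output for two new inputs.
pred_low = nearest_neighbour_predict(train_inputs, train_outputs, (0.8, 0.7))
pred_high = nearest_neighbour_predict(train_inputs, train_outputs, (4.9, 5.2))
```

With class labels as outputs this is classification. Returning a numeric output instead (or an average over several neighbours) would give a simple form of regression.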
The Data Mining Process
Figure credit: Cross Industry Standard Process for Data Mining, http://www.crisp-dm.org/
eScience Issues
- Data Cleaning and Quality
- Complexity and Prior Knowledge
- Scale
Data Cleaning and Quality
- Dealing with missing or corrupted data
- Widely quoted that 50-80% of the time in real-world data mining projects is spent on data preparation
- Large amounts of “dirty” data vs small curated datasets
- Can sometimes include data cleaning steps in the analysis itself
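As one hedged illustration of dealing with missing data, the sketch below fills each missing entry with the mean of the observed values in its column. Mean imputation is an assumed, deliberately simple choice that the slides do not prescribe. The table and the function name are made up, and missing values are represented by `None`.

```python
def impute_column_means(rows):
    """Replace None entries by the mean of the observed values in the same column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))  # assumes >= 1 observed value
    return [[means[j] if value is None else value for j, value in enumerate(row)]
            for row in rows]

# Made-up table with two missing entries.
table = [[1.0, None],
         [3.0, 4.0],
         [None, 6.0]]
cleaned = impute_column_means(table)
```

Imputing before analysis treats the filled-in values as if they had been observed. Handling missingness inside the model itself avoids that, at the cost of a more complex analysis.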
Complexity and Prior Knowledge
- Complexity of the data (e.g. multiple sources), or complexity of the models used for analysis
- Structured probabilistic graphical models are one way of encoding domain knowledge (Williams’ talk, Wednesday)
- Addressing the (probabilistic) representation of uncertainty in science (Rougier talk, Tuesday)
Figure credit: Kevin Murphy, UBC
Scale
- Beware of machismo: do you really need to deal with 10^9 instances? Start small!
- A simple way to “deal” with big data (big n) is to sample it (this has a long history in statistics ...)
- Sampling is problematic with respect to rare events (cf. Zipf’s law). Sometimes we really do need to handle all the data
- Exact calculations with efficient data structures (e.g. Andrew Moore, CMU and collaborators)
- Approximation algorithms, e.g. Nyström algorithm (Williams and Seeger, 2000) for approximation of top eigenvectors/values of large Gram matrices
- Necessary tricks for handling large datasets are a means to an end
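A sketch of the sampling idea and its rare-event caveat. It assumes reservoir sampling (Algorithm R) as the way to draw a uniform sample in one pass; the slides only say "sample it", so that technique is swapped in here. All names and numbers are illustrative. The second function computes the chance that a uniform sample contains none of a small set of rare items.

```python
import random

def reservoir_sample(stream, k, rng):
    """Uniform sample of k items from an iterable in one pass (Algorithm R)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive at both ends
            if j < k:
                reservoir[j] = item
    return reservoir

def prob_sample_misses_all(n, r, m):
    """Probability that a uniform sample of m out of n items contains none of r rare items."""
    p = 1.0
    for i in range(m):
        p *= (n - r - i) / (n - i)
    return p

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = reservoir_sample(range(1_000_000), 1000, rng)
estimate = sum(sample) / len(sample)  # estimates the population mean, 499999.5

# Chance that 1,000 samples out of 1,000,000 contain none of 10 rare items.
miss = prob_sample_misses_all(1_000_000, 10, 1000)
```

Here `miss` is roughly 0.99: a sample of 1,000 is adequate for estimating an average but will almost always contain none of the 10 rare items. That is the sense in which sampling is problematic for rare events.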
Training
- Data-centric thinking: problems and solutions that arise across a wide range of application domains
- How should we best prepare researchers (who will come from disparate disciplines)?
- Serious progress will likely require a partnership between domain experts and data analysts; this requires time investment from both sides
- In an MSc or centre for doctoral training (CDT) one should have core first-year courses on data analysis
The week ahead
- Understand the variety of analysis challenges faced in the different areas
- Consider what forms of training would be most effective to address these challenges