Data Intensive Research: Data Analysis
Chris Williams
School of Informatics, University of Edinburgh
March 2010

- “... drowning in information, but starving for knowledge” (Naisbitt, 1982).
- Statistics, machine learning, data mining (called “the numerati” in a recent popular book)
- Data-centric thinking: problems and solutions that arise across a wide range of application domains
- Serious progress will likely require a partnership between domain experts and data analysts to develop new techniques/methods to address specific structure or features in the problem domain

Outline
- Types of analysis
  - Exploratory data analysis
  - Descriptive data analysis
  - Predictive data analysis
- Issues
  - Data Cleaning and Quality
  - Complexity and Prior Knowledge
  - Scale
  - Training requirements

Exploratory data analysis (EDA)
Visualization:
- Uses the power of eye/brain to find structure in data
- Opposite end of spectrum from formal model building
- Helps to find unexpected relationships and to identify outliers etc

Descriptive Data Analysis
- Task is to discover significant patterns or features in the input data, without a teacher; self-organization
- No external teacher or critic, but often an internal quality measure is optimized
- Examples
  - Clustering
  - Dimensionality reduction/fitting a lower-dimensional manifold

[Figure: scatter of data points with centres marked X. Cluster centres (0-d)]
[Figure: Lines, sheets (1-d, 2-d)]

- Example: Topic Modelling (Blei et al, 2003)
  - Bag-of-words representation for each document (ignore word order)
  - Each document is regarded as being generated from a weighted set of topics
  - Learn topics and estimate per-document topic weightings
  - AP corpus, 16,333 articles with 23,075 unique terms

Figure credit: Blei et al, 2003

Example: Learning about objects from image sequences (Williams and Titsias, 2004)

Predictive Modelling
- Learning from input-output pairs
- Predict output(s) given inputs
- Examples
  - SKICAT (JPL/Caltech): predict whether an astronomical object is a star or a galaxy
  - Predicting disease presence/absence based on gene expression data
  - PapNet: searching for abnormal cells in Pap smears
- Classification and regression problems
- Methods: linear regression, logistic regression, decision trees, nearest-neighbour methods, neural networks, support vector machines

- Example 2: Netflix data
  - Over 100 million ratings from over 480,000 customers on nearly 18,000 movies; competition to win $1,000,000

Figure credit: Andriy Mnih and Ruslan Salakhutdinov

The Data Mining Process
Figure credit: Cross Industry Standard Process for Data Mining, http://www.crisp-dm.org/

eScience Issues
- Data Cleaning and Quality
- Complexity and Prior Knowledge
- Scale

Data Cleaning and Quality
- Dealing with missing or corrupted data
- Widely quoted that 50-80% of the time in real-world data mining projects is spent on data preparation
- Large amounts of “dirty” data vs small curated datasets
- Can sometimes include data cleaning steps in the analysis itself

Complexity and Prior Knowledge
- Complexity of the data (e.g.
  multiple sources), or complexity of the models used for analysis
- Structured probabilistic graphical models are one way of encoding domain knowledge (Williams’ talk, Wednesday)
- Addressing the (probabilistic) representation of uncertainty in science (Rougier talk, Tuesday)

Figure credit: Kevin Murphy, UBC

Scale
- Beware of machismo: do you really need to deal with 10^9 instances? Start small!
- A simple way to “deal” with big data (big n) is to sample it (this has a long history in statistics ...)
- Sampling is problematic wrt rare events (cf Zipf’s law). Sometimes we really do need to handle all the data
- Exact calculations with efficient data structures (e.g. Andrew Moore, CMU and collaborators)
- Approximation algorithms, e.g. Nyström algorithm (Williams and Seeger, 2000) for approximation of top eigenvectors/values of large Gram matrices
- Necessary tricks for handling large datasets are a means to an end

Training
- Data-centric thinking: problems and solutions that arise across a wide range of application domains
- How should we best prepare researchers (who will come from disparate disciplines)?
- Serious progress will likely require a partnership between domain experts and data analysts; this requires time investment from both sides
- In an MSc or centre for doctoral training (CDT) one should have core first-year courses on data analysis

The week ahead
- Understand the variety of analysis challenges faced in the different areas
- Consider what forms of training would be most effective to address these challenges
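
Note on the Scale slide (illustration only, not part of the original talk): a minimal sketch of why sampling is problematic wrt rare events. It assumes a uniform sample of m instances drawn without replacement from n instances, of which k are "rare"; the chance that the sample contains none of the rare instances is then C(n-k, m) / C(n, m). The function name and the numbers used are hypothetical, chosen only for illustration.

```python
def prob_sample_misses_all(n: int, k: int, m: int) -> float:
    """Probability that a uniform sample of m items, drawn without
    replacement from n items, contains none of the k rare items.

    Uses the identity C(n-k, m) / C(n, m) = prod_{i<k} (n-m-i) / (n-i),
    which avoids forming huge binomial coefficients.
    """
    if not (0 <= k <= n and 0 <= m <= n):
        raise ValueError("need 0 <= k <= n and 0 <= m <= n")
    if k > n - m:
        # More rare items than unsampled slots: at least one must be sampled.
        return 0.0
    p = 1.0
    for i in range(k):
        # Chance that the (i+1)-th rare item also lands outside the sample.
        p *= (n - m - i) / (n - i)
    return p


if __name__ == "__main__":
    # Illustrative (assumed) numbers: 10^9 instances, 100 rare, 0.1% sample.
    print(prob_sample_misses_all(10**9, 100, 10**6))
```

With these illustrative numbers the sample misses every rare instance about 90% of the time, which is one way to see why "sometimes we really do need to handle all the data".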