Examples of High-Dimensional Data

Genevera I. Allen
Statistical Learning: High-Dimensional Data
January 10, 2011
(Stat 699)
High-Dimensional Data
1. Large p, Small n: Biological Data
2. Random Fields Data
3. Collaborative Filtering Data
Example: Microarrays
[Figure: data matrix with axes labeled "genes" and "arrays"]
Measure gene expression.
Often tens of thousands of genes (features).
Only tens to hundreds of samples.
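To make the "many genes, few samples" setting concrete, here is a minimal sketch showing that when the number of features p exceeds the number of samples n, the p x p matrix X^T X has rank at most n and is therefore singular. The sizes, the random integer data, and the helper `matrix_rank` are all hypothetical and chosen only for illustration.

```python
from fractions import Fraction
import random


def matrix_rank(rows):
    """Rank of a matrix via exact Gaussian elimination on Fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current pivot row with a nonzero entry.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate the entries below the pivot.
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == n_rows:
            break
    return rank


random.seed(0)
n, p = 4, 10  # hypothetical toy sizes: 4 arrays (samples), 10 genes (features)
X = [[random.randint(-5, 5) for _ in range(p)] for _ in range(n)]

# p x p Gram matrix X^T X (unnormalized second-moment matrix of the genes).
gram = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
        for a in range(p)]

gram_rank = matrix_rank(gram)
print("rank of X^T X:", gram_rank, "| n =", n, "| p =", p)
```

Because the rank cannot exceed n, methods that need to invert X^T X (for example ordinary least squares) are not directly usable when p > n.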
Review: Microarrays
(Stears, R. L. et al., 2003)
Statistical Questions: Microarrays
[Figure: data matrix with axes labeled "genes" and "arrays"]
Data pre-processing:
  - Normalization.
  - Missing data imputation.
Inference:
  - Which genes are significant?
Clustering:
  - Groups of genes, groups of samples.
Model Building:
  - Small n, large p.
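As one possible way to approach "Which genes are significant?", the sketch below computes a Welch two-sample t-statistic for each gene and ranks genes by its absolute value. The expression values, gene names, group labels, and the 0.05 level are hypothetical, and the Bonferroni-adjusted level is just one common multiple-testing correction, not necessarily the one used in the course.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch two-sample t-statistic for two lists of measurements."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


# Hypothetical toy expression matrix: one row per gene, one column per array.
expression = {
    "gene_A": [5.1, 4.9, 5.3, 8.0, 8.2, 7.9],
    "gene_B": [3.0, 3.4, 2.9, 3.1, 3.3, 3.0],
    "gene_C": [6.5, 6.1, 6.4, 6.0, 6.6, 6.2],
}
group = [0, 0, 0, 1, 1, 1]  # assumed labels, e.g. control vs. treatment

t_stats = {}
for gene, values in expression.items():
    a = [v for v, g in zip(values, group) if g == 0]
    b = [v for v, g in zip(values, group) if g == 1]
    t_stats[gene] = welch_t(a, b)

# Rank genes from most to least differentially expressed.
ranked = sorted(t_stats, key=lambda g: abs(t_stats[g]), reverse=True)

# With one test per gene, a Bonferroni correction divides the level by the
# number of genes tested.
bonferroni_alpha = 0.05 / len(expression)

for gene in ranked:
    print(gene, round(t_stats[gene], 2))
```

With tens of thousands of genes, the per-test level becomes very small, which is one reason multiple testing is a central issue for this kind of data.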
Other Types of Biological Data
Genetics:
Deep Sequencing - Counts.
Micro RNA Expression - Continuous.
CGH (Copy Number Variation) - Continuous / Categorical.
SNPs (Single Nucleotide Polymorphisms) - Binary / Categorical.
Methylation - Continuous.
Proteomics / Metabolomics (Chemometrics):
H-NMR: measures the chemical shift associated with various metabolites.
Example: Functional MRIs (fMRI)
Rows: Voxels.
Columns: Subjects (and/or replicates and times).
Measurement: Hemodynamic response (change in blood flow).
[Figure: three image panels titled Slice 15, Slice 16, and Slice 17]
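The "rows are voxels" layout means a 3-D image volume is flattened into a single row index per voxel. A minimal sketch of one such mapping (row-major order) follows. The volume shape, the number of subjects, and the function names `voxel_to_row` and `row_to_voxel` are hypothetical. Real fMRI software may use a different ordering.

```python
def voxel_to_row(x, y, z, shape):
    """Map 3-D voxel coordinates to a row index in row-major order."""
    _, ny, nz = shape
    return (x * ny + y) * nz + z


def row_to_voxel(row, shape):
    """Invert voxel_to_row: recover (x, y, z) from a row index."""
    _, ny, nz = shape
    x, rem = divmod(row, ny * nz)
    y, z = divmod(rem, nz)
    return x, y, z


shape = (4, 5, 3)  # hypothetical tiny volume: 4 x 5 x 3 voxels
n_subjects = 2     # hypothetical number of columns
n_voxels = shape[0] * shape[1] * shape[2]

# Data matrix with one row per voxel and one column per subject.
data = [[0.0] * n_subjects for _ in range(n_voxels)]

row = voxel_to_row(2, 3, 1, shape)
data[row][0] = 1.5  # store a measurement for subject 0 at voxel (2, 3, 1)
print(row, row_to_voxel(row, shape))
```

Flattening discards the spatial neighborhood structure from the matrix itself, so the inverse map is needed whenever a method wants to use the fact that nearby voxels tend to behave similarly.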
Review: fMRIs
(Heeger & Ress, 2002)
Statistical Questions: fMRIs
Inference:
  - Which voxels are significant?
  - Which groups of voxels (regions of interest) are significant?
Clustering:
  - Groups of voxels that behave similarly: finding regions of interest.
Networks (Functional Connectivity):
  - How are voxels or groups of voxels related to each other?
  - How are voxels or groups of voxels related through time?
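One simple way to look at how voxels relate to each other is to correlate their time series and keep the strongly correlated pairs as network edges. The sketch below does this with Pearson correlation. The three toy time series, the voxel names, and the 0.8 threshold are hypothetical, and correlation thresholding is only one of several ways to estimate functional connectivity.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length time series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical toy time series for three voxels (8 time points each).
series = {
    "voxel_1": [1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 3.0, 2.0],
    "voxel_2": [2.1, 3.9, 6.2, 4.0, 1.8, 4.1, 5.9, 4.0],  # tracks voxel_1
    "voxel_3": [0.5, 0.4, 0.6, 0.5, 0.4, 0.6, 0.5, 0.5],  # weaker relation
}

threshold = 0.8  # assumed cutoff for declaring two voxels "connected"
names = list(series)

edges = []
for i, a in enumerate(names):
    for b in names[i + 1:]:
        r = pearson(series[a], series[b])
        if abs(r) > threshold:
            edges.append((a, b, r))

for a, b, r in edges:
    print(a, b, round(r, 3))
```

With p voxels there are p(p - 1)/2 pairs to examine, which is part of what makes network estimation a high-dimensional problem.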
Others
Finance.
  - Time Series Data.
Climate Data.
  - Spatial Data.
  - Spatio-temporal Data.
Neuroimaging.
  - DTI - Diffusion Tensor Imaging.
  - Calcium-Fluorescence Imaging.
  - EEG & MEG.
Example: Netflix Movie Rating Data
Rows: Movies.
Columns: Customers.
Measurement: Movie ratings (scale of 1 - 5).
                    Anne  Ben  Charlie  Doug  Eve  ...
Star Wars            2     5      4      4     3   ...
Harry Potter         3     4      5      3     ?   ...
Pretty Woman         4     ?      2      ?     5   ...
Titanic              5     ?      2      1     3   ...
Lord of the Rings    ?     5      5      4     4   ...
...                 ...   ...    ...    ...   ...  ...
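The example ratings (movies as rows, customers Anne through Eve as columns, "?" for an unrated movie) can be stored with `None` marking the missing entries. The sketch below counts the missing fraction and computes per-movie means over the observed ratings only. The variable names are hypothetical. The ratings are the ones shown in the example.

```python
from statistics import mean

customers = ["Anne", "Ben", "Charlie", "Doug", "Eve"]

# Rows are movies, columns follow the order in `customers`; None marks a "?".
ratings = {
    "Star Wars":         [2, 5, 4, 4, 3],
    "Harry Potter":      [3, 4, 5, 3, None],
    "Pretty Woman":      [4, None, 2, None, 5],
    "Titanic":           [5, None, 2, 1, 3],
    "Lord of the Rings": [None, 5, 5, 4, 4],
}

total = len(ratings) * len(customers)
missing = sum(r is None for row in ratings.values() for r in row)
missing_fraction = missing / total

# Mean rating per movie, using only the ratings that were actually observed.
movie_means = {
    movie: mean(r for r in row if r is not None)
    for movie, row in ratings.items()
}

print(f"{missing} of {total} entries missing ({missing_fraction:.0%})")
for movie, m in movie_means.items():
    print(movie, round(m, 2))
```

In this toy excerpt only a fifth of the entries are missing. In realistic collaborative filtering data the large majority of entries are unobserved.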
Netflix Prize
Challenge: Predict ratings of unrated movies with a 10% improvement in RMSE over Cinematch.
Training Set: Ratings from 480,000 customers on 18,000 movies (about 100 million ratings).
Around 98.7% missing ratings!
$1,000,000 prize!
Contest: October 2006 - August 2009.
Winners: Team led by Robert Bell and Yehuda Koren.
Methods: Variations on the SVD and k-nearest neighbors (Bell & Koren, 2008).
Fields: Recommender systems & Collaborative filtering.
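To give a flavor of "variations on the SVD", here is a minimal sketch of a low-rank matrix factorization fitted by stochastic gradient descent on the observed entries only, evaluated with RMSE. This is a generic textbook-style illustration, not the winning team's method. The 5 x 5 toy ratings, the rank `k`, the learning rate, the regularization strength, and the number of epochs are all assumptions.

```python
import random
from math import sqrt

# Toy ratings matrix: rows are movies, columns are customers; None = unrated.
R = [
    [2, 5, 4, 4, 3],
    [3, 4, 5, 3, None],
    [4, None, 2, None, 5],
    [5, None, 2, 1, 3],
    [None, 5, 5, 4, 4],
]
observed = [(i, j, r) for i, row in enumerate(R)
            for j, r in enumerate(row) if r is not None]


def predict(U, V, i, j):
    """Predicted rating: inner product of movie and customer factors."""
    return sum(u * v for u, v in zip(U[i], V[j]))


def rmse(U, V):
    """Root mean squared error over the observed ratings."""
    sq = [(r - predict(U, V, i, j)) ** 2 for i, j, r in observed]
    return sqrt(sum(sq) / len(sq))


random.seed(0)
k, lr, reg, epochs = 2, 0.02, 0.02, 500  # assumed hyperparameters
U = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(len(R))]
V = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(len(R[0]))]

initial_rmse = rmse(U, V)
for _ in range(epochs):
    for i, j, r in observed:
        err = r - predict(U, V, i, j)
        for f in range(k):
            u, v = U[i][f], V[j][f]
            # Gradient step on squared error with an L2 penalty.
            U[i][f] += lr * (err * v - reg * u)
            V[j][f] += lr * (err * u - reg * v)
final_rmse = rmse(U, V)

print("training RMSE before:", round(initial_rmse, 3))
print("training RMSE after: ", round(final_rmse, 3))
# Fill in one unrated entry: movie index 4, customer index 0.
print("predicted rating:", round(predict(U, V, 4, 0), 2))
```

The RMSE here is computed on the training entries, so it only shows that the fit improves. Judging predictive accuracy, as in the contest, requires held-out ratings.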
Visualizing Netflix Data
Zipf
(Justin S. Dyer & Art B. Owen, 2010)
Copulas
(Justin S. Dyer & Art B. Owen, 2010)
Other Examples
Amazon
Facebook
Yahoo!
Twitter