CEng 574 Statistical Data Analysis

Volkan Atalay
Fall 2014
Wisconsin Breast Cancer Database
Number of Instances: 699
 #  Attribute                     Domain
--  ----------------------------  -------------------
 1. Sample code number            id number
 2. Clump Thickness               1 - 10
 3. Uniformity of Cell Size       1 - 10
 4. Uniformity of Cell Shape      1 - 10
 5. Marginal Adhesion             1 - 10
 6. Single Epithelial Cell Size   1 - 10
 7. Bare Nuclei                   1 - 10
 8. Bland Chromatin               1 - 10
 9. Normal Nucleoli               1 - 10
10. Mitoses                       1 - 10
11. Class                         (benign, malignant)
5,1,1,1,2,1,3,1,1,2
5,4,4,5,7,10,3,2,1,2
3,1,1,1,2,2,3,1,1,2
6,8,8,1,3,4,3,7,1,2
4,1,1,3,2,1,3,1,1,2
8,10,10,8,7,10,9,7,1,4
1,1,1,1,2,10,3,1,1,2
2,1,2,1,2,1,3,1,1,2
2,1,1,1,2,1,1,1,5,2
4,2,1,1,2,1,2,1,1,2
1,1,1,1,1,1,3,1,1,2
2,1,1,1,2,1,2,1,1,2
5,3,3,3,2,3,4,4,1,4
1,1,1,1,2,3,3,1,1,2
8,7,5,10,7,9,5,5,4,4
7,4,6,4,6,1,4,3,1,4
4,1,1,1,2,1,2,1,1,2
4,1,1,1,2,1,3,1,1,2
10,7,7,6,4,10,4,1,2,4
6,1,1,1,2,1,3,1,1,2
7,3,2,10,5,10,5,4,4,4
10,5,5,3,6,7,7,10,1,4
3,1,1,1,2,1,2,1,1,2
1,1,1,1,2,1,3,1,1,2
5,2,3,4,2,7,3,6,1,4
Projection by PCA
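A minimal sketch of such a projection, using the sample rows listed above. It assumes each row holds the nine features followed by the class code (the id column appears to be omitted) and that the codes 2 and 4 stand for benign and malignant. The function names are illustrative and not part of the course material; plain power iteration with deflation stands in for a library eigensolver.

```python
# Sample rows copied from the listing above: nine features, then the class code.
# Assumption: class code 2 = benign, 4 = malignant.
ROWS = """
5,1,1,1,2,1,3,1,1,2
5,4,4,5,7,10,3,2,1,2
3,1,1,1,2,2,3,1,1,2
6,8,8,1,3,4,3,7,1,2
4,1,1,3,2,1,3,1,1,2
8,10,10,8,7,10,9,7,1,4
1,1,1,1,2,10,3,1,1,2
2,1,2,1,2,1,3,1,1,2
2,1,1,1,2,1,1,1,5,2
4,2,1,1,2,1,2,1,1,2
1,1,1,1,1,1,3,1,1,2
2,1,1,1,2,1,2,1,1,2
5,3,3,3,2,3,4,4,1,4
1,1,1,1,2,3,3,1,1,2
8,7,5,10,7,9,5,5,4,4
7,4,6,4,6,1,4,3,1,4
4,1,1,1,2,1,2,1,1,2
4,1,1,1,2,1,3,1,1,2
10,7,7,6,4,10,4,1,2,4
6,1,1,1,2,1,3,1,1,2
7,3,2,10,5,10,5,4,4,4
10,5,5,3,6,7,7,10,1,4
3,1,1,1,2,1,2,1,1,2
1,1,1,1,2,1,3,1,1,2
5,2,3,4,2,7,3,6,1,4
"""


def load(text):
    """Split each row into a feature vector and an integer class code."""
    table = [[float(v) for v in line.split(",")] for line in text.split()]
    return [row[:-1] for row in table], [int(row[-1]) for row in table]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def covariance(Xc):
    """Sample covariance matrix of rows that are already centered."""
    n, d = len(Xc), len(Xc[0])
    return [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(d)]
            for i in range(d)]


def leading_eigenpair(C, iterations=500):
    """Power iteration: repeatedly apply C to a vector and renormalize it."""
    v = [float(i + 1) for i in range(len(C))]
    for _ in range(iterations):
        w = [dot(row, v) for row in C]
        norm = dot(w, w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the eigenvalue estimate.
    return dot(v, [dot(row, v) for row in C]), v


def pca_2d(X):
    """Project the rows of X onto the two leading principal components."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    C = covariance(Xc)
    lam1, v1 = leading_eigenpair(C)
    # Deflation: remove the first component so that the next
    # power iteration converges to the second one.
    C2 = [[C[i][j] - lam1 * v1[i] * v1[j] for j in range(d)] for i in range(d)]
    lam2, v2 = leading_eigenpair(C2)
    coords = [(dot(r, v1), dot(r, v2)) for r in Xc]
    return coords, (lam1, lam2), (v1, v2)


features, labels = load(ROWS)
coords, (lam1, lam2), (v1, v2) = pca_2d(features)
for (x, y), label in zip(coords, labels):
    print(f"class {label}: PC1 = {x:7.2f}, PC2 = {y:7.2f}")
```

Plotting the printed PC1 and PC2 values, colored by class code, would give a two-dimensional view of the nine-dimensional sample.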
‘Poverty map’ based on 39 indicators
from World Bank statistics (1992)
Poverty Map
Notion of a Cluster can be Ambiguous
How many clusters?
Six Clusters
Two Clusters
Four Clusters
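The ambiguity can be made concrete with a small sketch. It assumes hypothetical 2-D toy data (not the points from the slide figure): six tight groups arranged as two distant triples. A plain k-means with a deterministic farthest-point start, chosen here only to keep the run reproducible, is applied for k = 2, 4, and 6.

```python
# Hypothetical toy data: six tight groups forming two distant triples.
GROUP_CENTERS = [(0, 0), (2, 0), (1, 2), (10, 0), (12, 0), (11, 2)]
OFFSETS = [(0, 0), (0.2, 0), (-0.2, 0), (0, 0.2), (0, -0.2)]
points = [(cx + dx, cy + dy) for cx, cy in GROUP_CENTERS for dx, dy in OFFSETS]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(pts, k):
    """Start from the first point, then repeatedly add the point that is
    farthest from all centers chosen so far (deterministic)."""
    centers = [pts[0]]
    while len(centers) < k:
        centers.append(max(pts, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans(pts, k, max_iter=100):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    centers = farthest_point_init(pts, k)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                          for p in pts]
        if new_assignment == assignment:
            break  # assignments stopped changing: converged
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(pts, assignment) if a == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    sse = sum(sq_dist(p, centers[a]) for p, a in zip(pts, assignment))
    return centers, assignment, sse


sse_by_k = {k: kmeans(points, k)[2] for k in (2, 4, 6)}
for k, sse in sse_by_k.items():
    print(f"k = {k}: within-cluster sum of squares = {sse:.2f}")
```

The within-cluster sum of squares keeps shrinking as k grows, so this objective alone cannot say whether two, four, or six clusters is the right answer.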
Instructor
Volkan Atalay
Phone: 210 2108
E-mail: vatalay@metu.edu.tr
Class
Tuesday 9:40-12:30
(A-101)
Office Hour by appointment
Course web page address
http://www.ceng.metu.edu.tr/courses/ceng574/
Course Objectives
•  The objective of this course is to introduce the
concepts and techniques of clustering and
multivariate and exploratory data analysis.
•  This course also offers an opportunity to perform
data analysis by using data visualization and
projection.
•  In addition, it allows students to apply these
techniques in a specific field, such as
bioinformatics.
•  Prerequisites: knowledge of programming,
probability, and linear algebra.
Main Reference Book
E. Alpaydın (2010) Introduction to Machine
Learning. 2nd Edition, The MIT Press.
Yapay Öğrenme, Turkish language edition,
translated by the author, Boğaziçi
Üniversitesi Yayınevi, April 2011
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e/
Other Reference Books
•  W. Härdle and L. Simar (2007) Applied Multivariate Statistical
Analysis. Springer.
•  A. K. Jain and R. C. Dubes (1988) Algorithms for Clustering
Data. Prentice Hall. (freely available online)
http://www.cse.msu.edu/%7Ejain/Clustering_Jain_Dubes.pdf
•  S. Theodoridis and K. Koutroumbas (2003) Pattern Recognition.
2nd Edition. Academic Press.
•  B. Everitt, S. Landau, and M. Leese (2001) Cluster Analysis.
4th Edition. Edward Arnold Pubs. Ltd.
•  A. Webb (2002) Statistical Pattern Recognition. Wiley. New
York.
•  R. O. Duda, P. E. Hart and D. G. Stork (2001) Pattern
Classification (2nd ed.). John Wiley.
Grading
•  Assignments: 40
•  Term Paper/Report: 20
•  Presentations: 30
•  Attendance and class participation: 10
Course Outline
1.  Sept 25: Syllabus distribution; getting to know each other; overview of ML and PR; mechanics of the course
2.  Oct 2: Data, measurements, features, similarities
3.  Oct 9: Review of probability; R; sample assignments; projects from previous years
4.  Oct 16: Data set presentations by students
5.  Oct 23: Linear projections and principal component analysis
6.  Oct 30: Non-linear projections and multi-dimensional scaling
7.  Nov 6: Clustering: hierarchical clustering, k-means clustering, and their variations
8.  Nov 13: Clustering by mixture of Gaussians and the EM algorithm; evaluation and validity of clusters
9.  Nov 20: Choosing a paper; information about advanced topics
10.  Nov 27: Bioinformatics research; discussion of advanced topic papers
11.  Dec 4: Presentations by students
12.  Dec 11: Presentations by students
13.  Dec 18: Presentations by students
14.  Dec 25: Presentations by students
Assignments
0.  Reading and Report: due Oct 2
1.  Data Set selection: due Oct 16
2.  Projections 1: due Oct 30
3.  Projections 2: due Nov 6
4.  Clustering: due Nov 13
5.  Validation: due Nov 20
Also: decide on an advanced topic and paper
by Nov 29
Presentations
•  Data Set (All)
•  Weka, Python (2 persons)
•  Projections: PCA (1 person)
•  Projections: MDS, GTM, LLE, Isomap (1 person)
•  Clustering: k-means, hierarchical (1 person)
•  Validation (1 person)
•  Advanced Topics (paper; all except the above 6 persons)
–  Semi-Supervised Clustering
–  Kernel-based clustering
–  Manifold Learning and Clustering
–  Spectral Clustering
•  Assignments, the term paper, and the project
should be done on an individual basis.
•  Note that R seems to be the most
convenient environment for the
computational work in this course.
http://www.r-project.org/
Learning Management System
METUCLASS-Moodle
http://metuclass.metu.edu.tr
Resources: Web Pages and Tutorials
•  Web pages
–  http://www.sciencemag.org/site/feature/data/compsci/machine_learning.xhtml
–  http://dataclustering.cse.msu.edu/
–  http://en.wikipedia.org/wiki/Machine_learning
•  Tutorials
–  http://homepages.inf.ed.ac.uk/rbf/IAPR/researchers/MLPAGES/mltut.htm
–  https://www.coursera.org/course/ml
Resources: Datasets
UCI Repository:
http://www.ics.uci.edu/~mlearn/MLRepository.html
UCI KDD Archive:
http://kdd.ics.uci.edu/summary.data.application.html
Statlib: http://lib.stat.cmu.edu/
Delve: http://www.cs.utoronto.ca/~delve/
Resources: Journals
Journal of Machine Learning Research www.jmlr.org
Machine Learning
Neural Computation
Neural Networks
IEEE Transactions on Neural Networks
IEEE Transactions on Pattern Analysis and Machine
Intelligence
Annals of Statistics
Journal of the American Statistical Association
...
Resources: Conferences
International Conference on Machine Learning (ICML)
European Conference on Machine Learning (ECML)
Neural Information Processing Systems (NIPS)
Uncertainty in Artificial Intelligence (UAI)
Computational Learning Theory (COLT)
International Conference on Artificial Neural Networks
(ICANN)
International Conference on AI & Statistics (AISTATS)
International Conference on Pattern Recognition (ICPR)
...
Resources: Computational
•  MATLAB, R, Weka, Python
•  Machine Learning Open Source Software
–  http://jmlr.org/mloss/ and http://mloss.org/software/
–  http://www.dmoz.org/Computers/Artificial_Intelligence/Machine_Learning/Software/ and
http://www.dmoz.org/Science/Math/Statistics/Software/
–  http://www.cs.ubc.ca/~murphyk/Teaching/CS540_Fall05/software.html
Article
http://www.wired.com/magazine/2010/06/ff_sergeys_search/all/1
•  Sergey Brin's mother was diagnosed with Parkinson’s in 1999.
•  In 2006, his wife-to-be, Anne Wojcicki, started the personal
genetics company 23andMe (Google is an investor).
•  As an alpha tester, Brin had the chance to get an early look at
his genome.
•  He looked up a spot known as G2019S—the notch on the
LRRK2 gene where an adenine nucleotide, the A in the ACTG
code of DNA, sometimes substitutes for a guanine nucleotide,
the G. And there it was: He had the mutation. His mother’s
23andMe readout showed that she had it, too.
Sergey Brin’s Search for a Parkinson’s Cure
WIRED Magazine, July (August) 2010, pp. 124-133.
Most Parkinson’s research, like much of medical research, relies
on the classic scientific method: hypothesis, analysis, peer
review, publication.
Brin proposes a different approach, one driven by computational
muscle and staggeringly large data sets. It’s a method that
draws on his algorithmic sensibility—and Google’s storied
faith in computing power—with the aim of accelerating the
pace and increasing the potential of scientific research.
“Generally the pace of medical research is glacial compared to
what I’m used to in the Internet,” Brin says. “We could be
looking lots of places and collecting lots of information. And if
we see a pattern, that could lead somewhere.”
Increasingly, though, scientists—especially those
with a background in computing and information
theory—are starting to wonder if that model
could be inverted.
Why not start with tons of data, a deluge of
information, and then wade in, searching for
patterns and correlations?
Assignment #0: Alternatives
Read the article and the comments, and do
one of the following in at most a page:
•  Write your comments OR
•  Write a letter to the author about the
article OR
•  Write a review of the article as a referee.
Due: October 2, 2014