Data Mining and Materials Informatics: a primer Krishna Rajan

advertisement
Data Mining and Materials Informatics:
a primer
Krishna Rajan
Department of Materials Science and Engineering
NSF Intl. Materials Institute Combinatorial Sciences &
Materials Informatics Collaboratory
Iowa State University
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
OUTLINE
What is data?
primary, secondary , derivative
sources of data
What do we mean by mining data?
data correlations – dimensional analysis approach
data correlations – data mining
What can we learn from data mining?
Classification
data mining databases
hierarchy of data
trends in data
Prediction
predicting structure-property relationships
computational informatics vs. computational
materials science
data mining for building databases
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
INFORMATICS
What is it ?
Why ?
• Searching for patterns of
behavior among multivariate
data sets
• Establish new correlations
• Can pattern recognition lead
to predictions?
• Enlarge database / virtual libraries
• Identify outliers
• Evaluate databases
• Establish predictions
Krishna Rajan
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
COMPLEXITY OF BIOSYSTEMS
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
INFORMATICS STRATEGY
Data
+
Correlations
+
Theory
=
Knowledge-base
InformaticsBased
Design
• Atomistic based calculations
• Continuum based theories
• Machine learning
• Data compression
• Pattern recognition
• Simulations
• Combinatorially derived datasets
• Digital libraries
• Spectral and imaging libraries
Machine learning toolkit:
• Classification
• Prediction
• Hybrid tools
ULTRA LARGE
SCALE
INFORMATION
SPACE
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
INFORMATICS-BASED DESIGN STRATEGIES
BIOLOGICAL ANALOGUE
Ideker and Lauffenburger:(2003)
CRYSTAL STRUCTURE
ANALOGUE
Nye (1950)
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
Statistical Tools
ƒ Experimental data
ƒ Computational derived
data sets
ƒSimulations
Principal Component Analysis
(PCA)
Prediction
Materials Databases
Support Vector Machine
(SVM)
Latent Variables /
Partial Least Squares
(PLS)
Classification
Combinatorial &
Spectral Libraries
Input & Output
Association Mining
(AM)
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
MACHINE LEARNING AND MATERIALS DISCOVERY
Requirements
•
•
•
•
•
•
Global minima
Categorical data
Missing / variable data
Skewed distributions
Large data sets
Scalable
Pitfalls
• Local minima
• Categorical data difficult
• Convergence difficult for large
data sets
• Few outliers can lead to poor
performance
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
Database administration & management
DATA STORAGE
• thermodynamic, crystallographic & property
data bases
• combinatorial experimental data
Oracle, Unix, Supercomputing, SQL
DATA CURATION
• taxonomy and ontology of materials
science data
• data sharing , networking / cyber infrastructure
JAVA, HTML, Python
DATA REPRESENTATION
• object oriented programming language
• visualization of high dimensional data
Data mining algorithms
KNOWLEDGE
DISCOVERY
• Clustering analysis
• Quantitative Structure-Activity Relationships
(QSAR) for materials design
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
WHY COUPLE COMPUTATIONAL MATERIALS SCIENCE AND INFORMATICS ?
• Accelerated insertion of materials into engineering systems
• Rapid multiscale design and optimization of materials properties
• Establishment of new structure –property correlations among
large, heterogeneous and distributed data sets
• Discovery of new chemistries and compounds
• Formulation and / or refinement of new theories for materials
behavior
• Rapid identification of critical data and theoretical needs for
future problems
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
SOFT MODELING vs. HARD MODELING
Interpretation
Softmodeling
modeling
Soft
Data Mining
& Visualization
Feature Extraction
Data
Warehousing
Knowledge
Patterns
Transformed
Data
Constitutive
Equations
Dimensional
Analysis and/or
Hardmodeling
modeling
Theory
Hard
Original
Data
Data bases
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Rajan
2006
Cincinatti, OH October 15th Krishna
DATA MAP
22x22 (484 cells – 2500 data points):
Silicon nitride descriptors
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
DIMENSIONALITY REDUCTION
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
EIGENVALUE DECOMPOSITION
Sp = λ p
eigenvector
eigenvalue
SP = PΛ
Eigenvalues on the diagonal of
this diagonal matrix
Eigenvectors forming the columns of this matrix
If V is nonsingular, this become the eigenvalue decomposition
Eigenvalue Matrix
S = P ΛP
−1
Loading Matrix=Eigenvector Matrix
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
PREDICTIONS
PLS
ƒ ‘Upgradeable’ linear predictive tool
ƒ Quadratic and interaction terms :flexibility
T
U
Factors
Responses
Calculate latent factors from original
predictor variables
T=XW where T- latent factors, X – predictors, W weights
Use latent factors to predict response
Y=TQ+E where Y - responses, Q – loadings, Enoise
• W maximizes covariance between Y and T
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
INFORMATICS AND THE PERIODIC TABLE
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
DATA WAREHOUSING
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
BIVARIATE MAPS
1. # of atoms/unit cell 2. Valence electron # 3. Electronegativity 4. 1st Ionization potential
5. Atomic radius 6. Melting point 7. Boiling point 8. Density
6
4
1
2
20
2
10
0
4
3
2
0
10
4
5
0
3
5
2
1
4000
6
2000
0
10000
7
5000
0
8
40
20
0
2
4
1
6
0
10
2
20
0
2
3
4
0
5
4
10
1
5
2
3
0
6
2000
4000
7
0
5000
10000
0
20
40
8
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
SEEKING PATTERNS
6
PC1
4
2
0
-2
-4
-6
6
PC2
4
2
0
-2
-4
-6
2
PC3
1
0
-1
-2
-6
-4
-2
0
PC1
2
4
6
-6
-4
-2
0
PC2
2
4
6
-2
-1
0
1
2
PC3
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
MENDELEEV SEQUENCING
5
54
4
Cs, Ba, La, Ce, Pr, Nd, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb,
55
Lu, Hf, Ta, W, Re, Os, Ir, Pt, Au, Hg, Tl, Pb, and Bi
27
Scores on PC 3 (12.02%)
3
14
57
56
28
2
525853
15
1
29
30
0
3
-1
-4
18
11
1
4
-2
41
44
38 31
3742
49
51
32
43
3639
40
46
34
33 35
4750
48
23
21
76
8
-3
17
24
22 2019
910
-2 Na, Mg, Al
-3
26
25
16
5 1213
2
45
-1
0
K, Ca, Sc, Ti, V, Cr, Mn, Fe, Co, Ni, Cu,Scores
Zn, andon
GaPC 2 (22.61%)
1
2
Sr, Y, Zr, Nb, Mo, Tc, Ru, Rh, Pd, Ag, Cd, In, and Sn
Each cluster represent each row in the periodic table
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
DEVELOPING PHYSICAL LAWS
Valence
Electron
Melting
Heat of
number
Point
Vaporization
Heat of
Fusion
MartynovBatsanov
electronegativity
Boiling
Point
Specific Heat
Density
Pauling
electronegativity
First
Ionization
Potential
Pseudopotential
Radius
Covalent
radius
Molar
Volume Metallic
radius
•
Location of property in loading plot
indicates influence of property on
PC
• Atomic weight has no influence on
PC1, high influence on PC2 and 3
•
Clustered properties indicate
relationships
• Electrical and thermal conductivity
• Melting point and density
• Molar volume and atomic radius
• Melting and boiling points, heats of
fusion and vaporization, and
valence number
• Pauling electronegativity and first
ionization potential
Atomic
Weight
Thermal
Conductivity
Electrical
Conductivity
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
DEVELOPING PHYSICAL LAWS
Activation energy for self-diffusion versus melting point
Fe
Ni
Co
Linear regression
Pt
60
Cu
50
Ag
18R
Au
40
g
Al
29. Copper (Cu)
Bokshtein & Zhukhovitskii
30
Pb
20
500
1000
1500
T [K]
m
From Glicksman
2000
2500
Thermal Conductivity (W/(m*K))
Activation Energy, Q [kcal/mol]
70
6. Carbon, diamond (C)
100
32. Germanium (Ge)
10
52. Tellurium (Te)
83. Bismuth (Bi)
1
1e-12
34. Selenium (Se)
1e-10
1e-8
1e-6
1e-4
0.01
RPI Electrical Conductivity (0.00000001*ohm*m)
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
PCA of 34 binary, ternary, quaternary compounds: Score plot
9 descriptors from NIST data
- Lattice parameters (a, b, c, α, β, γ)
- c/a, b/c
- V/abc
Orthorhombic
1
2
22
9
8
21
10
23
3
Tetragonal
(I-42d)
Scores Plot
Hexagonal
15
25
26
27
28
10
PC 3 ( 5.85%)
Cubic
30
5
7
33
24
20
11
0
16
4
18
1
9
21
382
23
10
15
16
422
28
27
26
25
32
34
31
14
12
13
-10
31
34
32
19
17
-20
14
13
12
-30
29
-40
50
6
Monoclinic
Tetragonal
(P42/mmc)
50
0
0
-50
PC 2 (29.17%)
-50
-100
PCA Score plot (34 binary, ternary, quaternary compounds)
PC 1 (61.71%)
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
Cluster analysis of 34 binary, ternary, quaternary compounds
α, β, γ = 90°
α, β, γ ≠ 90°
1
2
9
22
25
28
27
26
3
23
10
15
8
4
16
21
12
13
14
31
32
34
29
5
7
11
20
24
18
30
33
6
17
19
Tetragonal
(I4/mcm)
Cubic
(Fm-3m)
Tetragonal
(I-42d)
Cubic
(Im-3,Pm-3)
Monoclinic
Hexagonal
(P63/mmc)
Hexagonal (P-62m,P-6)
Orthorhombic
Cubic
(P4132,P-43m,Pm-3,
Pm-3,Fd-3ms)
Hexagonal (P63/mmc,P6/mmm, P-6m2)
/ Trigonal (P-3m1)
Tetragonal
(P42/mmc)
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
RULE EXTRACTION
Atomic packing of single element
Ex.) Engel’s model using “# of sp electrons/atom”†
BCC < 1.5
HCP 1.7 – 2.1
FCC 2.5-3
sp electrons/atom†
Li
Al
Mg
Cu
Sc
1
3
2
11
2
BCC
FCC
FCC
HCP
HCP
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
PATTERN DETECTION
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
DATA MINING DATABASES
prediction
classification
6
PC3(19.47%)
4
2
0
-2
-4
-6
-6
-4
PC
1(4
4
-2
9%
covalent
ionic
2
0
0
-2
2
9.8
6
-4
4
)
6
-6
2
PC
(2
2
3
.4
%
)
Graduciz, Gaune-Escard
Univ.Novi Sad, Serbia
CNRS, Marseilles
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
COMPUTATIONAL ISSUES
Focus on properties of signal / macroscopic behavior
rather than noise/ error. Assume complexity !!!
Establish multivariate database:
Seek DIVERSITY in datasets
Data can come across
length and time scales
Utilize data dimensionality
reduction techniques
Model relationships in data to seek
heuristic relationships:
Analyze variation and correlation
in data
Covariance
Advanced statistical learning tools
can deal with:
- skewed data
- missing data
- differentiate between local
and global minima
- ultra large scale datasets
- variable uncertainty
Establish correlations across
diverse data sets ( ie. length &
time scales)
Identify outliers: explore cause
Develop predictive models
- Target requirements of missing
data
- Quantitatively assess data
diversity
•Singular value decomposition
•Support vector machines
•Association mining
•Fuzzy clustering
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Rajan
2006
Cincinatti, OH October 15th Krishna
CRYSTAL CHEMISTRY DESIGN
Ching et.al J. Amer. Ceram.Soc. 85 75-80 (2002)
1. Assess influence of latent
variables ( i.e. electronic structure
parameters) on properties of
known data
2. Establish heuristic relationships
on database of all input variables
instead of phenomenological
relationships in bivariate manner
3. Use statistical learning to
predict new materials behavior
on new multivariate input data
4. Inverse problem approach to
formulate quantitative structureproperty relationships
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Rajan
2006
Cincinatti, OH October 15th Krishna
Virtual library:
via informatics
Refractory
metals
“Real” library:
via first principles
calculations
Suh and Rajan (2005, 2006)
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Rajan
2006
Cincinatti, OH October 15th Krishna
LINKING DISPARATE LENGTH SCALES
Visualizing Associations
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
SYNTHESIZING REMOTE DATA SETS
hcp
fcc
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
STRATEGY FOR MATERIALS DESIGN
• Search and retrieve
• Structure databases
• Refining descriptors
• Diffraction spectra databases
• Developing predictions
• Hyper spectral imaging databases
• Filling in missing data
Mapping / Visualization of databases:
• Correlations
• Trend analysis
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
CONCLUSIONS
Science was originally empirical, like
Leonardo, making wonderful drawings of
nature. Next came the theorists who
tried to write down the equations that
explained the observed behaviors, like
Kepler or Einstein. Then when we got to
complex enough systems like the clustering
of a million galaxies, there came the
computer simulations, the computational
branch of science. Now we are getting into
the data exploration part of science, which
is kind of a little bit of them all
The Facts of the Matter: Finding,
understanding, and using information about our
physical world (2000)
Information science based design of
materials……next stage of the scientific
discovery process
”..
Dr. Alex Szalay: Virtual Observatory Project
Division of Materials Research:
Cyber infrastructure and cyber discovery in
materials science (2006)
May 20th 2003
Krishna Rajan
TMS / ASM Materials Informatics Workshop
Cincinatti, OH October 15th 2006
Download