Data Mining and Materials Informatics: a primer

Krishna Rajan
Department of Materials Science and Engineering
NSF Intl. Materials Institute
Combinatorial Sciences & Materials Informatics Collaboratory
Iowa State University

TMS / ASM Materials Informatics Workshop, Cincinnati, OH, October 15th 2006

OUTLINE
What is data?
• primary, secondary, derivative sources of data
What do we mean by mining data?
• data correlations – dimensional analysis approach
• data correlations – data mining
What can we learn from data mining?
Classification
• data mining databases
• hierarchy of data
• trends in data
Prediction
• predicting structure-property relationships
• computational informatics vs. computational materials science
• data mining for building databases

INFORMATICS
What is it? Why?
• Searching for patterns of behavior among multivariate data sets
• Establish new correlations
• Can pattern recognition lead to predictions?
• Enlarge database / virtual libraries
• Identify outliers
• Evaluate databases
• Establish predictions

COMPLEXITY OF BIOSYSTEMS

INFORMATICS STRATEGY
Data + Correlations + Theory = Knowledge-base
Informatics-Based Design
• Atomistic based calculations
• Continuum based theories
• Machine learning
• Data compression
• Pattern recognition
• Simulations
• Combinatorially derived datasets
• Digital libraries
• Spectral and imaging libraries
Machine learning toolkit:
• Classification
• Prediction
• Hybrid tools
ULTRA LARGE SCALE INFORMATION SPACE

INFORMATICS-BASED DESIGN STRATEGIES
BIOLOGICAL ANALOGUE: Ideker and Lauffenburger (2003)
CRYSTAL STRUCTURE ANALOGUE: Nye (1950)

Statistical Tools
Input:
• Experimental data
• Computationally derived data sets
• Simulations
• Materials databases
• Combinatorial & spectral libraries
Tools:
• Principal Component Analysis (PCA)
• Support Vector Machine (SVM)
• Latent Variables / Partial Least Squares (PLS)
• Association Mining (AM)
Output:
• Prediction
• Classification

MACHINE LEARNING AND MATERIALS DISCOVERY
Requirements
• Global minima
• Categorical data
• Missing / variable data
• Skewed distributions
• Large data sets
• Scalable
Pitfalls
• Local minima
• Categorical data difficult
• Convergence difficult for large data sets
• Few outliers can lead to poor performance

Database administration & management
DATA STORAGE
• thermodynamic, crystallographic & property databases
• combinatorial experimental data
Oracle, Unix,
Supercomputing, SQL
DATA CURATION
• taxonomy and ontology of materials science data
• data sharing, networking / cyber infrastructure
JAVA, HTML, Python
DATA REPRESENTATION
• object oriented programming language
• visualization of high dimensional data
Data mining algorithms
KNOWLEDGE DISCOVERY
• Clustering analysis
• Quantitative Structure-Activity Relationships (QSAR) for materials design

WHY COUPLE COMPUTATIONAL MATERIALS SCIENCE AND INFORMATICS?
• Accelerated insertion of materials into engineering systems
• Rapid multiscale design and optimization of materials properties
• Establishment of new structure-property correlations among large, heterogeneous and distributed data sets
• Discovery of new chemistries and compounds
• Formulation and / or refinement of new theories for materials behavior
• Rapid identification of critical data and theoretical needs for future problems

SOFT MODELING vs.
HARD MODELING
[Diagram. Soft modeling: Databases, Original Data, Transformed Data, Patterns, Knowledge, linked by Data Warehousing, Feature Extraction, Data Mining & Visualization, and Interpretation. Hard modeling: Theory, Dimensional Analysis and/or Constitutive Equations.]

DATA MAP
22x22 (484 cells – 2500 data points): Silicon nitride descriptors

DIMENSIONALITY REDUCTION

EIGENVALUE DECOMPOSITION
S p = λ p, where p is an eigenvector and λ is its eigenvalue.
S P = P Λ, where the eigenvectors form the columns of P and the eigenvalues lie on the diagonal of the diagonal matrix Λ.
If P is nonsingular, this becomes the eigenvalue decomposition:
S = P Λ P⁻¹
Λ: eigenvalue matrix. P: loading matrix = eigenvector matrix.

PREDICTIONS: PLS
• 'Upgradeable' linear predictive tool
• Quadratic and interaction terms: flexibility
[Diagram labels: T, U; Factors, Responses]
• Calculate latent factors from original predictor variables: T = XW, where T = latent factors, X = predictors, W = weights
• Use latent factors to predict response: Y = TQ + E, where Y = responses, Q = loadings, E = noise
• W maximizes covariance between Y and T

INFORMATICS AND THE PERIODIC TABLE

DATA WAREHOUSING

BIVARIATE MAPS
1. # of atoms/unit cell
2. Valence electron #
3. Electronegativity
4. 1st Ionization potential
5. Atomic radius
6. Melting point
7. Boiling point
8.
Density
[Figure: scatterplot matrix of bivariate maps among the eight descriptors listed above]

SEEKING PATTERNS
[Figure: pairwise score plots of PC1, PC2 and PC3]

MENDELEEV SEQUENCING
[Figure: score plot, scores on PC 3 (12.02%) versus scores on PC 2 (22.61%), with numbered points forming four labeled clusters]
• Na, Mg, Al
• K, Ca, Sc, Ti, V, Cr, Mn, Fe, Co, Ni, Cu, Zn, and Ga
• Sr, Y, Zr, Nb, Mo, Tc, Ru, Rh, Pd, Ag, Cd, In, and Sn
• Cs, Ba, La, Ce, Pr, Nd, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb, Lu, Hf, Ta, W, Re, Os, Ir, Pt, Au, Hg, Tl, Pb, and Bi
Each cluster corresponds to one row of the periodic table.

DEVELOPING PHYSICAL LAWS
[Figure: loading plot of the properties: valence electron number, melting point, boiling point, heat of vaporization, heat of fusion, specific heat, density, Martynov-Batsanov electronegativity, Pauling electronegativity, first ionization potential, pseudopotential radius, covalent radius, metallic radius, molar volume, atomic weight, thermal conductivity, electrical conductivity]
• Location of property in loading plot indicates influence of property on PC
• Atomic weight has no influence on PC1, high influence on PC2 and 3
• Clustered properties indicate relationships:
  - Electrical and thermal conductivity
  - Melting point and density
  - Molar volume and atomic radius
  - Melting and boiling points, heats of fusion and vaporization, and valence number
  - Pauling electronegativity and first ionization potential

DEVELOPING
PHYSICAL LAWS
[Figure: activation energy for self-diffusion, Q [kcal/mol], versus melting point, T_m [K], with a linear regression; points for Fe, Ni, Co, Pt, Cu, Ag, Au, Al and Pb; annotation "18R". Bokshtein & Zhukhovitskii; from Glicksman.]
[Figure: thermal conductivity (W/(m*K)) versus electrical conductivity (0.00000001*ohm*m); labeled points: 6. Carbon, diamond (C); 29. Copper (Cu); 32. Germanium (Ge); 34. Selenium (Se); 52. Tellurium (Te); 83. Bismuth (Bi)]

PCA of 34 binary, ternary, quaternary compounds: Score plot
9 descriptors from NIST data:
- Lattice parameters (a, b, c, α, β, γ)
- c/a, b/c
- V/abc
[Figure: PCA score plot of the 34 compounds on PC 1 (61.71%), PC 2 (29.17%) and PC 3 (5.85%); labeled groups: Orthorhombic, Tetragonal (I-42d), Hexagonal, Cubic, Monoclinic, Tetragonal (P42/mmc)]

Cluster analysis of 34 binary, ternary, quaternary compounds
[Figure: dendrogram with two main branches, α, β, γ = 90° and α, β, γ ≠ 90°; leaf order: 1 2 9 22 25 28 27 26 3 23 10 15 8 4 16 21 12 13 14 31 32 34 29 5 7 11 20 24 18 30 33 6 17 19]
Labeled groups:
• Tetragonal (I4/mcm)
• Cubic (Fm-3m)
• Tetragonal (I-42d)
• Cubic (Im-3, Pm-3)
• Monoclinic
• Hexagonal (P63/mmc)
• Hexagonal (P-62m, P-6)
• Orthorhombic
• Cubic (P4132, P-43m, Pm-3, Pm-3, Fd-3ms)
• Hexagonal (P63/mmc, P6/mmm, P-6m2) / Trigonal (P-3m1)
• Tetragonal (P42/mmc)

RULE EXTRACTION
Atomic packing of single element
Ex.)
Engel's model using "# of sp electrons/atom"†
• BCC: < 1.5
• HCP: 1.7 – 2.1
• FCC: 2.5 – 3

Element   sp electrons/atom†   Structure
Li        1                    BCC
Al        3                    FCC
Mg        2                    HCP
Cu        11                   FCC
Sc        2                    HCP

PATTERN DETECTION

DATA MINING DATABASES
prediction, classification
[Figure: three-dimensional score plot on PC1, PC2 and PC3 (19.47%) separating covalent and ionic compounds]
Graduciz (Univ. Novi Sad, Serbia), Gaune-Escard (CNRS, Marseilles)

COMPUTATIONAL ISSUES
Focus on properties of signal / macroscopic behavior rather than noise / error. Assume complexity!
Establish multivariate database:
• Seek DIVERSITY in datasets
• Data can come across length and time scales
• Utilize data dimensionality reduction techniques
Model relationships in data to seek heuristic relationships:
• Analyze variation and correlation in data
• Covariance
Advanced statistical learning tools can deal with:
- skewed data
- missing data
- differentiate between local and global minima
- ultra large scale datasets
- variable uncertainty
Establish correlations across diverse data sets (i.e., length & time scales)
Identify outliers: explore cause
Develop predictive models:
- Target requirements of missing data
- Quantitatively assess data diversity
• Singular value decomposition
• Support vector machines
• Association mining
• Fuzzy clustering

CRYSTAL CHEMISTRY DESIGN
Ching et al., J. Amer. Ceram. Soc. 85, 75-80 (2002)
1. Assess influence of latent variables (i.e., electronic structure parameters) on properties of known data
2. Establish heuristic relationships on database of all input variables instead of phenomenological relationships in bivariate manner
3.
Use statistical learning to predict new materials behavior on new multivariate input data
4. Inverse problem approach to formulate quantitative structure-property relationships

Refractory metals
• Virtual library: via informatics
• "Real" library: via first principles calculations
Suh and Rajan (2005, 2006)

LINKING DISPARATE LENGTH SCALES
Visualizing Associations

SYNTHESIZING REMOTE DATA SETS
[Figure labels: hcp, fcc]

STRATEGY FOR MATERIALS DESIGN
• Search and retrieve
• Refining descriptors
• Developing predictions
• Filling in missing data
Databases:
• Structure databases
• Diffraction spectra databases
• Hyper spectral imaging databases
Mapping / Visualization of databases:
• Correlations
• Trend analysis

CONCLUSIONS
"Science was originally empirical, like Leonardo, making wonderful drawings of nature. Next came the theorists who tried to write down the equations that explained the observed behaviors, like Kepler or Einstein. Then when we got to complex enough systems like the clustering of a million galaxies, there came the computer simulations, the computational branch of science. Now we are getting into the data exploration part of science, which is kind of a little bit of them all."
The Facts of the Matter: Finding, understanding, and using information about our physical world (2000)
Information science based design of materials ... next stage of the scientific discovery process
Dr.
Alex Szalay: Virtual Observatory Project
Division of Materials Research: Cyber infrastructure and cyber discovery in materials science (2006)
May 20th 2003
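
WORKED SKETCH: PCA BY EIGENVALUE DECOMPOSITION
The EIGENVALUE DECOMPOSITION slide (S P = P Λ, S = P Λ P⁻¹) underlies the score and loading plots shown throughout the talk. The following is a minimal sketch of that calculation, not the workflow used for the slides. It assumes NumPy, autoscaled descriptors, and a small synthetic data matrix; the function name pca_eig and the data are hypothetical and chosen only for illustration.

```python
import numpy as np

def pca_eig(X):
    """PCA via eigenvalue decomposition of the covariance matrix S."""
    X = np.asarray(X, dtype=float)
    # Autoscale each descriptor (zero mean, unit variance) so that
    # descriptors with large units do not dominate. This is an assumption.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    S = np.cov(Z, rowvar=False)          # symmetric matrix S
    eigvals, P = np.linalg.eigh(S)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]    # reorder by decreasing variance
    eigvals, P = eigvals[order], P[:, order]
    scores = Z @ P                       # sample coordinates (score plot)
    explained = eigvals / eigvals.sum()  # fraction of variance per PC
    return S, eigvals, P, scores, explained

# Hypothetical data: 50 samples, 4 descriptors, two of them strongly correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 3))
X = np.column_stack([
    base[:, 0],
    2.0 * base[:, 0] + 0.1 * base[:, 1],
    base[:, 1],
    base[:, 2],
])

S, lam, P, T, frac = pca_eig(X)

# Check the relations from the slide: S P = P Λ and S = P Λ P⁻¹.
Lam = np.diag(lam)
print("S P = P Λ holds:", np.allclose(S @ P, P @ Lam))
print("S = P Λ P⁻¹ holds:", np.allclose(S, P @ Lam @ np.linalg.inv(P)))
print("Explained variance fractions:", np.round(frac, 3))
```

The columns of P play the role of the loading matrix, and T holds the scores that would be plotted as PC1 versus PC2 and so on. Because S is symmetric, np.linalg.eigh is used rather than a general eigensolver; it returns orthonormal eigenvectors, so P⁻¹ equals the transpose of P.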