Graphical Data Mining for Computational Estimation in Materials Science Applications Aparna Varde Ph.D. Dissertation August 15, 2006 Committee Members Prof. Elke Rundensteiner (Advisor) Prof. Carolina Ruiz Prof. David Brown Prof. Neil Heffernan Prof. Richard Sisson Jr. (Head of Materials Science, WPI) This work is supported by the Center for Heat Treating Excellence and by Department of Energy Award DE-FC-07-01ID14197 1 Introduction Scientific domains: Experiments conducted with given input conditions Results plotted as graphs: Good visual depictions Experimental results help in analysis: Assist decision-making Performing experiment: Consumes time and resources 2 Motivating Example Heat Treating of Materials • Controlled heating & cooling of materials to achieve mechanical & thermal properties Pneumatic on/off switch Pneumatic cylinder K-type thermocouple Performing experiments involves • • • • One time cost: $1000s Recurrent costs: $100s Time: 5 to 6 hours Human labor Desirable to estimate • Graphs given input conditions • Conditions to achieve given graph Connecting rod Computer with Data Acquisition Card Probe tip Furnace Oil beaker Thermocouple for Oil temp. CHTE Experimental Setup 3 Problem Definition To develop an estimation technique with following goals: 1. Given input conditions in an experiment, estimate resulting graph 2. Given desired graph in an experiment, estimate conditions to obtain it 4 Proposed Estimation Approach: AutoDomainMine 5 Knowledge Discovery in AutoDomainMine 6 Estimation of Graph in AutoDomainMine 7 Estimation of Conditions in AutoDomainMine 8 Main Tasks Task 1 AutoDomainMine Learning Strategy of Integrating Clustering and Classification [AAAI-06 Poster, ACM SIGART’s ICICIS-05] Task 2 Learning Domain-Specific Distance Metrics for Graphs Task 3 Designing Semantics-Preserving Representatives for Clusters [ACM KDD’s MDM-05, MTAP-06 Journal] [ACM SIGMOD’S IQIS-06, ACM CIKM-06] 9 Task 2: Learning Domain-Specific Distance Metrics for Graphs 10 Motivation Various distance metrics • Absolute position of points • Statistical observations • Critical features Issues • Not known what metrics apply • Multiple metrics may be relevant Need for distance metric learning in graphs Example of domain-specific problem 11 Proposed Distance Metric Learning Approach: LearnMet Given • Training set with actual clusters of graphs Additional Input • Components: distance metrics applicable to graphs LearnMet Metric • D = ∑wiDi 12 Evaluate Accuracy Use pairs of graphs A pair (ga,gb) is TP - same predicted, same actual cluster: (g1, g2) TN - different predicted, different actual clusters: (g2,g3) FP - same predicted cluster, different actual clusters: (g3,g4) FN - different predicted, same actual clusters: (g4,g5) 13 Evaluate Accuracy (Contd.) How do we compute error for whole set of graphs? • For all pairs Error Measure • Failure Rate FR • FR = (FP+FN) / (TP+TN+FP+FN) Error Threshold (t) • Extent of FR allowed • If (FR < t) then clustering is accurate 14 Adjust the Metric Weight Adjustment Heuristic: for each Di • New wi = wi – sfi (DFNi/DFN + DFPi/DFP) [KDD’s MDM-05] 15 Evaluation of LearnMet • G = number of graphs, e.g., = 25 • GC2 = total number of pairs, e.g., = 300 • Select subset of GC2 pairs per epoch Accuracy of Learned Metrics over Test Set Learning Efficiency over Training Set Details: MTAP-06 Effect of pairs per epoch (ppe) Observations • Highest accuracy with middle range of ppe • Learning efficiency best with low ppe • Average accuracy with LearnMet 86% 16 Task 3: Designing SemanticsPreserving Representatives for Clusters 17 Motivation Different combinations of conditions could lead to a single cluster Graphs in a cluster could have variations Need for designing representatives that • Incorporate semantics • Avoid visual clutter • Cater to various users 18 Proposed Approach for Designing Representatives: DesRept 19 Candidates for Conditions 1. Nearest Representative Set of conditions in Cluster A Nearest Representative for Cluster A Return set of conditions closest to all others in cluster Notion of distance: Domain-specific distance metric from decision tree paths [CIKM-06] 20 Candidates for Conditions (Contd.) 2. Summarized Representative Cluster A Build sub-clusters of condition using domain knowledge Return nearest subcluster representatives Sort them Subclusters within the Cluster A Summarized Representative for Cluster 21 Candidates for Conditions (Contd.) 3. Combined Representative Cluster A Combined Representative for Cluster A Return all sets of conditions Sort them in ascending order 22 Candidates for Graphs 1. Nearest Representative Select graph that is nearest neighbor for all others Notion of distance: Domain-specific metric from LearnMet 23 Candidates for Graphs (Contd.) 2. Medoid Representative Select graph closest to average of all graphs Average of y-coordinate values since x-coordinates are same 24 Candidates for Graphs (Contd.) 3. Summarized Representative Construct average graph with prediction limits Average: centroid, Prediction limits: domain-specific thresholds 25 Candidates for Graphs (Contd.) 3. Combined Representative Construct superimposed graph of all graphs in cluster Same x-values, so plot y-values on a common x-axis 26 Effectiveness Measure for Candidates Minimum Description Length Principle • Theory: Representative, Examples: all items in cluster Representative: Measure Complexity (ease of interpretation) Complexity = log2 N for graphs, log2 AV for conditions, • N = number of points to store representative graph • A = number of attributes for conditions, • V = number of values in representative set of conditions Examples: Measure distance of items from representative (information loss) Distance for graphs = log2 (1/G)∑{i=1 to G} D(r,gi) • D: distance using domain-specific metric • G: total number of graphs in cluster • gi: each graph • r: representative graph Encoding [SIGMOD IQIS-06] Effectiveness= UBC*Complexity + UBD*Distance • UBC, UBD: User bias % weights for complexity and distance 27 Evaluation of DesRept: Conditions Details • Data Set Size = 400, Number of Clusters = 20 Observations • Overall winner is Summarized • As weight for complexity increases, Nearest wins • Designed better than Random 28 Evaluation of DesRept: Graphs Details • Data Set Size = 400, Number of Clusters = 20 Observations • Overall winner is Summarized • As weight for complexity increases, Nearest / Medoid wins • Designed better than Random 29 User Evaluation of AutoDomainMine System Formal user surveys in different applications Evaluation Process • Compare estimation with real data in test set • If they match estimation is accurate Accuracy: Estimating Conditions Observations • Estimation Accuracy around 90 to 95 % Accuracy: Estimating Graphs 30 Related Work Similarity Search [HK-01, WF-00] • Non-matching conditions could be significant Mathematical Modeling [M-95, S-60] • Existing models not applicable under certain situations Case-based Reasoning [K-93, AP-03] • Adaptation of cases not feasible with graphs Learning nearest neighbor in high-dimensional spaces: [HAK-00] • Focus is dimensionality reduction, do not deal with graphs Distance metric learning given basic formula: [XNJR-03] • Deal with position-based distances for points, no graphs involved Similarity search in multimedia databases [KB-04] • Use various metrics in different applications, do not learn a single metric Image Rating: [HH-01] • User intervention involved in manual rating Semantic Fish Eye Views: [JP-04] • Display multiple objects in small space, no representatives PDA Displays in Levels of Detail: [BGMP-01] • Do not evaluate different types of representatives 31 Summary Dissertation Contributions • AutoDomainMine: Integrating Clustering and Classification for Estimation [AAAI-06 Poster, ACM SIGART’s ICICIS-05] • LearnMet: Learning Domain-Specific Distance Metrics for Graphs [ACM KDD’s MDM-05, MTAP-06 Journal] • DesRept: Designing Semantics-Preserving Representatives for Clusters [ACM SIGMOD’s IQIS-06, ACM CIKM-06] • Trademarked Tool for Computational Estimation in Materials Science [ASM HTS-05, ASM HTS-03] Future Work • Image Mining, e.g., Comparing Nanostructures • Data Stream Matching, e.g., Stock Market Analysis • Visual Displays, e.g., Summarizing Web Information 32 Publications Dissertation-Related Papers 1. 2. 3. 4. 5. 6. 7. Designing Semantics-Preserving Representatives for Scientific Input Conditions, A. Varde, E. Rundensteiner, C. Ruiz, D. Brown, M. Maniruzzaman and R. Sisson Jr., In CIKM, Arlington, VA, Nov 2006. Integrating Clustering and Classification for Estimating Process Variables in Materials Science. A. Varde, E. Rundensteiner, C. Ruiz, D. Brown, M. Maniruzzaman and R. Sisson Jr. In AAAI, Poster Track, Boston, MA, Jul 2006. Effectiveness of Domain-Specific Cluster Representatives for Graphical Plots. A. Varde, E. Rundensteiner, C. Ruiz, D. Brown, M. Maniruzzaman and R. Sisson Jr. In ACM SIGMOD IQIS, Chicago, IL, Jun 2006. LearnMet: Learning Domain-Specific Distance Metrics for Plots of Scientific Functions. A. Varde, E. Rundensteiner, C. Ruiz, M. Maniruzzaman and R. Sisson Jr. Accepted in the International MTAP Journal, Springer Publications, Special Issue on Multimedia Data Mining, 2006. Learning Semantics-Preserving Distance Metrics for Clustering Graphical Data. A. Varde, E. Rundensteiner, C. Ruiz, M. Maniruzzaman and R. Sisson Jr. In ACM KDD MDM, Chicago, IL, Aug 2005, pp. 107-112. Apriori Algorithm and Game-of-Life for Predictive Analysis in Materials Science. A. Varde, M. Takahashi, E. Rundensteiner, M. Ward, M. Maniruzzaman and R. Sisson Jr. In KES Journal, IOS Press, Netherlands, Vol. 8, No. 4, 2004, pp. 213 – 228. Data Mining over Graphical Results of Experiments with Domain Semantics. A. Varde, E. Rundensteiner, C. Ruiz, M. Maniruzzaman and R. Sisson Jr. In ACM SIGART ICICIS, Cairo, Egypt, Mar 2005, pp. 603 – 611. 33 Publications (Contd.) 8. 9. 10 . QuenchMiner: Decision Support for Optimization of Heat Treating Processes. A. Varde, M. Takahashi, E. Rundensteiner, M. Ward, M. Maniruzzaman and R. Sisson Jr. In IEEE IICAI, Hyderabad, India, Dec 2003, pp. 993 – 1003. Estimating Heat Transfer Coefficients as a Function of Temperature by Data Mining. A. Varde, E. Rundensteiner, M. Maniruzzaman and R. Sisson Jr. In ASM HTS, Pittsburgh, PA, Sep 2005. The QuenchMiner Expert System for Quenching and Distortion Control. A. Varde, E. Rundensteiner, M. Maniruzzaman and R. Sisson Jr. In ASM HTS, Indianapolis, IN, Sep 2003, pp. 174 – 183. Other Papers 11. 12. MEDWRAP: Consistent View Maintenance over Distributed Multi-Relation Sources. A. Varde and E. Rundensteiner. In DEXA. Aix-en-Provence, France, Sep 2002, pp. 341 – 350. SWECCA for Data Warehouse Maintenance. A. Varde and E. Rundensteiner. In SCI, Orlando, FL, Jul 2002, Vol. 5, pp. 352 – 357. 13. MatML: XML for Information Exchange with Materials Property Data, A. Varde, 14. E. Begley, S. Fahrenholz-Mann. In ACM KDD DM-SPP, Philadelphia, PA, Aug 2006. Semantic Extensions to Domain-Specific Markup Languages. A. Varde, E. Rundensteiner, M. Mani, M. Maniruzzaman and R. Sisson Jr. In IEEE CCCT, Austin, TX, Aug 2004, Vol. 2, pp. 55 – 60. 34 Acknowledgments First of all, my Advisor: Prof. Elke Rundensteiner Committee: Prof. Carolina Ruiz,Prof. David Brown, Prof Neil Heffernan External Member: Prof. Richard D. Sisson Jr., Head of Materials Program Director of Metal Processing Institute: Prof. Diran Apelian Domain Expert: Dr. Mohammed Maniruzzaman Members of Center for Heat Treating Excellence CS Department Head: Prof. Michael Gennert Former CS Department Head: Prof. Micha Hofri WPI Administration (CS, Materials): In particular Mrs. Rita Shilansky Reviewers of Conferences and Journals where my papers got accepted Members of DSRG, AIRG, KDDRG and Quenching Research Group Colleagues and Friends: Shuhui, Sujoy, Viren, Olly, Mariana, Rimma, Maged, Bin, Lydia, Shimin and others… Great Thanks to my Family: Parents Dr. Sharad Varde and Dr. (Mrs.) Varsha Varde, Grandparents Mr. D.A. Varde and Mrs. Vimal Varde, Brother Ameya Varde and Sister-in-law Deepa Varde All the attendees of my Ph.D. Defense Finally, God for guiding me throughout my doctoral journey 35 Thank You 36