Program for the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
KDD-2003
Washington, DC, USA
August 24-27, 2003

Program Highlights

Invited Talks
    On-Line Science: The World-Wide Telescope as a Prototype for the New Computational Science, Jim Gray, Microsoft Research
    Statistical Learning from Relational Data, Daphne Koller, Stanford University
    Analyzing Customer Behavior at Amazon.com, Andreas Weigend, Chief Scientist, Amazon.com

Research and Industrial/Government Tracks
34 research papers divided into two tracks with nine sessions in total
13 industrial/government papers in four sessions
36 research posters
10 industrial/government posters
2 panels
7 tutorials:
o Data Mining for Computer Security
o Data Mining for Machine Learners
o Information Extraction from the World Wide Web
o Multi-Relational Data Mining
o Privacy-Preserving Data Mining
o Sequence Data Mining Techniques and Applications
o The Top 10 Data Mining Mistakes and How to Avoid Them
9 workshops:
o BIOKDD03: Data Mining in Bioinformatics
o Data Cleaning, Record Linkage and Object Consolidation
o Data Mining Standards, Services and Platforms
o Fractals and Self Similarity in Data Mining: Issues and Approaches
o Link Analysis
o MDM/KDD 2003: Integrated Media Mining
o MRDM 2003: Multi-relational Data Mining
o Operational Text Classification
o WebKDD2003: WebMining as a Premise to Intelligent and Effective Web Applications

Summarized Technical Program

Sunday
SIGKDD 2003 Opening
Awards Ceremony
Innovation Award Talk
KDD Cup 2003
Joint KDD/ICML Invited Talk
2 Joint KDD/ICML Sessions
1 Tutorial

Monday
Invited Talk
Research Track
o Clustering and Pattern Discovery (4 papers)
o Temporal Data (4 papers)
o Classification and Contrast Sets (3 papers)
Industrial/Government Track
o IT (4 papers)
o Science (3 papers)
1 Panel
1 Tutorial
Poster Highlights
Poster Session

Tuesday
Invited Talk
Research Track
o Relational and Graph Data (3 papers)
o Data Streams and Sequential Data (3 papers)
o Web Mining and Data Cubes (3 papers)
o Distance-based Methods (3 papers)
o Frequent Sets (5 papers)
o Data Reduction and Visualization (3 papers)
Industrial/Government Track
o Healthcare (3 papers)
o Systems (3 papers)
1 Panel
1 Tutorial

Wednesday
9 Workshops
4 Tutorials

Saturday, August 23
16:00-20:00 (Concourse) Registration

Sunday, August 24
9:00-18:00 (Concourse) Registration
10:00-10:15 (International Ballroom – Center) Opening Remarks
    Ted Senator, General Chair
    Pedro Domingos, Christos Faloutsos, Program Chairs
10:15-10:30 (International Ballroom – Center) Award Presentations
    Chairs: Mark Craven, Daryl Pregibon
10:30-11:30 (International Ballroom – Center) Award Talk
    Chair: Gregory Piatetsky-Shapiro
    Innovation Award Talk by Heikki Mannila
11:30-12:30 (International Ballroom – Center) KDD Cup Awards
    Chairs: Johannes Gehrke, Paul Ginsparg, Jon Kleinberg
12:30-14:00 Lunch (on your own)
14:00-15:00 (International Ballroom – Center) Joint KDD/ICML Invited Talk
    Chair: Pedro Domingos
    Statistical Learning from Relational Data
        Daphne Koller, Stanford University
Much of the data in the world is relational in nature, involving multiple objects, related to each other in a variety of ways. Examples include structured databases such as customer transaction data, semi-structured data such as hyperlinked pages on the world-wide web or networks of interacting genes, and unstructured data such as text. In this talk, I will describe a statistical framework for learning from relational data. The approach is based on probabilistic models, which have been applied with great success to a variety of machine learning tasks. Generally, this framework has been applied to data represented as fixed-length attribute-value vectors, or to sequence data. I will describe the language of probabilistic relational models (PRMs), which extend probabilistic graphical models with the expressive power of object-relational languages.
PRMs model the uncertainty over the attributes of objects in the domain as well as uncertainty over the existence of relations between objects. I will present techniques for automatically learning PRMs directly from a relational data set, and applications of these techniques to various tasks, such as: collective classification of an entire set of related entities; clustering a set of linked entities into coherent groups; and even predicting the existence of links between entities. The talk will demonstrate the applicability of the techniques on several domains, such as web data and biological data. We discuss some recent trends and events, e.g., the dot com meltdown, and some ways for the field to respond to the challenges, and the opportunities.
15:00-16:00 (International Ballroom – Center) Joint KDD/ICML Session I
    Chair: Pedro Domingos
    BEST RESEARCH PAPER AWARD
    Maximizing the Spread of Influence through a Social Network
        David Kempe, Jon Kleinberg, Eva Tardos
    Bayesian Network Anomaly Pattern Detection for Disease Outbreaks
        Weng-Keen Wong, Andrew Moore, Gregory Cooper, Michael Wagner
15:00-18:30 (Georgetown Room) Tutorial: The Top 10 Data Mining Mistakes and How to Avoid Them
    John F. Elder, Elder Research, USA
16:00-16:30 Coffee Break
16:30-18:30 (International Ballroom – Center) Joint KDD/ICML Session II
    Chair: Tom Fawcett
    XRules: An Effective Structural Classifier for XML Data
        Mohammed Zaki, Charu Aggarwal
    Learning on the Test Data: Leveraging "Unseen" Features
        Ben Taskar, Ming Fai Wong, Daphne Koller
    Information-Theoretic Co-clustering
        Inderjit Dhillon, Subramanyam Mallela, Dharmendra Modha
    ICML BEST STUDENT PAPER AWARD
    A Kernel between Sets of Vectors
        Risi Kondor, Tony Jebara

Monday, August 25
7:30-8:30 Continental Breakfast
8:00-18:00 (Concourse) Registration
8:00-17:00 (Exhibit Hall) Exhibits
8:30-9:30 (International Ballroom – Center) Invited Talk
    Chair: Christos Faloutsos
    On-Line Science: The World-Wide Telescope as a Prototype for the New Computational Science
        Jim Gray, Microsoft Research
Computational science has historically meant simulation, but there is an increasing role for analysis and mining of online scientific data. As a case in point, half of the world's astronomy data is public. The astronomy community is putting all that data on the Internet so that the Internet becomes the world's best telescope: it has the whole sky, in many bands, and in detail as good as the best 2-year-old telescopes. It is useable by all astronomers everywhere. This is the vision of the virtual observatory, also called the World Wide Telescope (WWT). As one step along that path I have been working with the Sloan Digital Sky Survey (especially Alex Szalay of Johns Hopkins) and CalTech to federate their data in web services on the Internet, and to make it easy to ask questions of the database (see http://skyserver.sdss.org). This talk explains the rationale for the WWT, discusses how we designed the database, and talks about some data mining tasks. It also describes computer science challenges of publishing, federating, and mining scientific data, and argues that XML web services are key to federating diverse data sources.
9:30-10:00 Coffee Break
10:00-12:00 Research Track 1 (Monroe Room) Clustering and Pattern Discovery
    Chair: Gregory Piatetsky-Shapiro
    Privacy-Preserving K-Means Clustering over Vertically Partitioned Data
        Jaideep Vaidya, Chris Clifton
    Assessment and Pruning of Hierarchical Model Based Clustering
        Jeremy Tantrum, Alejandro Murua, Werner Stuetzle
    Generative Model-Based Clustering of Directional Data
        Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, Suvrit Sra
    An Alternative Hypothesis-Testing Strategy for Pattern Discovery
        Richard Bolton, Niall Adams
10:00-12:00 Research Track 2 (Military Room) Temporal Data
    Chair: Sunita Sarawagi
    Indexing Multi-Dimensional Time-Series with Support for Multiple Distance Measures
        Michail Vlachos, Marios Hadjieleftheriou, Dimitrios Gunopulos, Eamonn Keogh
    Translation-Invariant Mixture Models for Curve Clustering
        Darya Chudova, Scott Gaffney, Eric Mjolsness, Padhraic Smyth
    Generating English Summaries of Time Series Data Using the Gricean Maxims
        Somayajulu Sripada, Ehud Reiter, Jim Hunter, Jin Yu
    To Buy or Not to Buy: Mining Airline Fare Data to Minimize Ticket Purchase Price
        Oren Etzioni, Craig Knoblock, Rattapoon Tuchinda, Alexander Yates
10:00-12:00 Industrial/Govt. Track (Georgetown Room) IT
    Chair: Michael Pazzani
    Passenger-Based Predictive Modeling of Airline No-show Rates
        Richard D. Lawrence, Se J. Hong, Jacques Cherrier
    The Data Mining Approach to Automated Software Testing
        Mark Last, Menahem Friedman, Abraham Kandel
    Critical Event Prediction for Proactive Management in Large-scale Computer Clusters
        R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, S. Ma
    Information Awareness: A Prospective Technical Assessment
        David Jensen, Matt Rattigan, Hannah Blau

SIGKDD-2003 Program Committee, cont.
Kai Ming Ting, Monash University, Australia
Hannu Toivonen, University of Helsinki, Finland
Alexander Tuzhilin, New York University, USA
Geoff Webb, Monash University, Australia
Stefan Wrobel, Fraunhofer AIS and University of Bonn, Germany
Yiming Yang, Carnegie Mellon University, USA
Philip Yu, IBM T. J. Watson Research Center, USA
Osmar Zaiane, University of Alberta, Canada
Ruben Zamar, University of British Columbia, Canada
Zijian Zheng, Microsoft Corporation, USA

Industrial/Government Track Program Committee
Scott Bennett, SRA International, USA
Eric Bloedorn, Mitre, USA
John Elder, Elder Research, USA
Herb Edelstein, Two Crows, USA
Ronen Feldman, ClearForest, USA
Steve Gallant, Xchange, USA
Monte Hancock, CSI, USA
Richard Lathrop, University of California – Irvine, USA
Brian Lent, Intelligent Results, USA
Chris Merz, Mastercard, USA
Claudia Pearce, NSA, USA
Dorian Pyle, Data Miners, USA
Bharat Rao, Siemens, Germany
Neal Rothleder, digiMine, USA
Joseph Sirosh, Fair Isaac, USA
Ming Tan, RulesPower, USA
Ramasamy Uthurusamy, General Motors, USA

Best Paper Awards Committee
Corinna Cortes, AT&T Labs - Research, USA
Charles Elkan, University of California, San Diego, USA
H.V. Jagadish, University of Michigan, USA
David Madigan, Rutgers University, USA
Raymond Ng, University of British Columbia, Canada
Padhraic Smyth, University of California, Irvine, USA
Alexander Tuzhilin, New York University, USA

ACM SIGKDD Chair
Won Kim, Cyber Database Solutions, USA

SIGKDD-2003 Program Committee, cont.
Daniel Keim, University of Konstanz, Germany
Eamonn Keogh, University of California, Riverside, USA
Masaru Kitsuregawa, University of Tokyo, Japan
Jon Kleinberg, Cornell University, USA
Ron Kohavi, Blue Martini Software, USA
Nick Koudas, AT&T Labs – Research, USA
Hans-Peter Kriegel, University of Munich, Germany
Vipin Kumar, University of Minnesota, USA
Diane Lambert, Bell Labs, USA
Nada Lavrac, Jozef Stefan Institute, Slovenia
Wenke Lee, Georgia Institute of Technology, USA
David Lin, University of Memphis, USA
Sheng Ma, IBM T. J. Watson Research Center, USA
Dragos Margineantu, The Boeing Company, USA
Brij Masand, Data Miners, Inc., USA
Llew Mason, Blue Martini Software, USA
Andrew McCallum, University of Massachusetts, Amherst, USA
Vasileios Megalooikonomou, Temple University, USA
Marina Meila, University of Washington, USA
Dunja Mladenic, Jozef Stefan Institute, Slovenia
Raymond Mooney, University of Texas, Austin, USA
Katharina Morik, University of Dortmund, Germany
Rajeev Motwani, Stanford University, USA
Richard Muntz, University of California, Los Angeles, USA
Raymond Ng, University of British Columbia, Canada
William Stafford Noble, University of Washington, USA
Stephen North, AT&T Labs – Research, USA
David Page, University of Wisconsin, Madison, USA
Dmitry Pavlov, NEC Research Institute, USA
Jian Pei, State University of New York at Buffalo, USA
David Pennock, Overture Services, Inc., USA
Gregory Piatetsky-Shapiro, KDnuggets, USA
Foster Provost, New York University, USA
Raghu Ramakrishnan, University of Wisconsin, Madison, USA
Pat Riddle, University of Auckland, New Zealand
Greg Ridgeway, RAND, USA
Mehran Sahami, Google and Stanford University, USA
Lorenza Saitta, University of Piemonte Orientale, Italy
Joerg Sander, University of Alberta, Canada
Sunita Sarawagi, IIT Bombay, India
Dale Schuurmans, University of Waterloo, Canada
Steven L.
Scott, University of Southern California, USA
Ken Sevcik, University of Toronto, Canada
Jude Shavlik, University of Wisconsin, Madison, USA
Arno Siebes, Utrecht University, Netherlands
Simeon Simoff, University of Technology, Sydney, Australia
Myra Spiliopoulou, Otto-von-Guericke-Universitaet Magdeburg, Germany
Jaideep Srivastava, University of Minnesota, USA
Werner Stuetzle, University of Washington, USA
Latanya Sweeney, Carnegie Mellon University, USA

Monday, August 25
12:00-13:30 (International Ballroom – Center) Lunch
13:30-15:00 Research Track 1 (Military Room) Classification and Contrast Sets
    Chair: Lorenza Saitta
    Classifying Large Data Sets Using SVMs with Hierarchical Clusters
        Hwanjo Yu, Jiong Yang, Jiawei Han
    Cross-Training: Learning Probabilistic Mappings Between Topics
        Sunita Sarawagi, Soumen Chakrabarti, Shantanu Godbole
    On Detecting Differences Between Groups
        Geoff Webb, Shane Butler, Douglas Newlands
13:30-15:00 Industrial/Govt. Track (Monroe Room) Science
    Chair: Bharat Rao
    Capturing Best Practice for Microarray Gene Expression Data Analysis
        Gregory Piatetsky-Shapiro, Tom Khabaza, Sridhar Ramaswamy
    Frequent-Subsequence-Based Prediction of Outer Membrane Proteins
        Rong She, Fei Chen, Ke Wang, Martin Ester, Jennifer L. Gardy, Fiona S. L. Brinkman
    Discovery of Climate Indices using Clustering
        Michael Steinbach, Pang-Ning Tan, Vipin Kumar, Steven Klooster, Christopher Potter
13:30-17:00 (Georgetown Room) Tutorial: Multi-Relational Data Mining
    Luc De Raedt, Albert-Ludwigs-University Freiburg, Germany
    Saso Dzeroski, Jozef Stefan Institute, Slovenia
15:00-15:30 Coffee Break
15:30-17:00 (International Ballroom – Center) Panel: Privacy and Data Mining: Friends or Foes?
    Chair: Rakesh Agrawal, IBM Almaden Research Center
The explosive progress in networking, storage, and processor technologies has created an unprecedented capability to collect, store, and process massive amounts of data.
Data mining, with its promise of efficiently discovering valuable, non-obvious information from large databases, is posing an interesting dilemma. Applications abound where data mining could do enormous good. However, in misguided hands, in conjunction with other advanced technologies, it could be vulnerable to misuse. Indeed, of late, data mining has come to be portrayed by some as a potential threat to civil liberties and privacy. The goal of this panel is to debate and understand the concerns with data mining and to identify research directions that may address those concerns. Panelists will address the following specific questions:
    1. Perceived concerns with data mining
    2. How real are those concerns
    3. What the data mining community is doing to address those concerns
    4. What more needs to be done
    Panelists:
        Christopher Clifton, Purdue University
        Lawrence Cox, National Center for Health Statistics
        James Dempsey, Center for Democracy & Technology
        Mike Gurski, Information & Privacy Commission, Ontario, Canada
        Bhavani Thuraisingham, National Science Foundation
        Jeff Ullman, Stanford University
17:00-18:30 (International Ballroom – Center) Poster Highlights
    Chair: Usama Fayyad
18:30-20:30 (Exhibit Hall) Poster Session and Reception

SIGKDD-2003 Program Committee
Niall Adams, Imperial College, UK
Deepak K. Agarwal, AT&T Labs – Research, USA
Mihael Ankerst, The Boeing Company, USA
Chid Apte, IBM T. J.
Watson Research Center, USA
Lars Asker, Stockholm University, Sweden
Daniel Barbara, George Mason University, USA
Roberto Bayardo, IBM Almaden Research Center, USA
Kristin Bennett, Rensselaer Polytechnic Institute, USA
Michael Berthold, Tripos, Inc., USA
Richard Bolton, Imperial College, UK
Pavel Brazdil, University of Porto, Portugal
Carla Brodley, Purdue University, USA
Wray Buntine, Helsinki Institute for Information Technology, Finland
Rich Caruana, Cornell University, USA
Soumen Chakrabarti, IIT Bombay, India
Philip Chan, MIT/FIT, USA
Surajit Chaudhuri, Microsoft Research, USA
Ken Church, AT&T Labs – Research, USA
Chris Clifton, Purdue University, USA
William Cohen, Carnegie Mellon University, USA
David Cohn, Google, USA
Mark Craven, University of Wisconsin, Madison, USA
Tamraparni Dasu, AT&T Labs – Research, USA
Umeshwar Dayal, Hewlett-Packard Laboratories, USA
Luc De Raedt, Albert-Ludwigs-University Freiburg, Germany
Thomas G. Dietterich, Oregon State University, USA
Susan Dumais, Microsoft Research, USA
William DuMouchel, AT&T Labs – Research, USA
Jennifer Dy, Northeastern University, USA
Saso Dzeroski, Jozef Stefan Institute, Slovenia
Charles Elkan, University of California, San Diego, USA
Martin Ester, Simon Fraser University, Canada
Usama Fayyad, DMX Group, USA
Doug Fisher, Vanderbilt University, USA
Gary William Flake, Overture Services, Inc., USA
Takeshi Fukuda, IBM Tokyo Laboratory, Japan
Minos Garofalakis, Bell Labs, USA
Johannes Gehrke, Cornell University, USA
Lee Giles, Pennsylvania State University, USA
Henry Goldberg, NASD, USA
Marko Grobelnik, Jozef Stefan Institute, Slovenia
Dimitrios Gunopulos, University of California, Riverside, USA
Jiawei Han, University of Illinois at Urbana, USA
David Heckerman, Microsoft Research, USA
Haym Hirsh, Rutgers University, USA
Piotr Indyk, MIT, USA
Yannis Ioannidis, University of Athens, Greece
H.V.
Jagadish, University of Michigan, USA
David Jensen, University of Massachusetts, Amherst, USA
Thorsten Joachims, Cornell University, USA

SIGKDD-2003 Organizing Committee
General Chair:
    Ted Senator, DARPA, USA
Associate General Chair:
    Hillol Kargupta, University of Maryland, Baltimore County, USA
Program Chairs:
    Pedro Domingos, University of Washington, USA
    Christos Faloutsos, Carnegie Mellon University, USA
Industrial/Government Track Chairs:
    Paul Bradley, Microsoft Research, USA
    Michael Pazzani, University of California, Irvine, USA
Best Paper Awards Chair:
    Daryl Pregibon, AT&T Labs - Research, USA
Exhibits Chairs:
    Kirk Borne, Raytheon and NASA Goddard Space Flight Ctr, USA
    David Vennergrund, SRA International Inc., USA
Government Relations Chairs:
    Eric Bloedorn, MITRE Corp., USA
    Ashok Srivastava, RIACS/NASA Ames Research Ctr, USA
KDD Cup Chairs:
    Johannes Gehrke, Cornell University, USA
    Paul Ginsparg, Cornell University, USA
    Jon Kleinberg, Cornell University, USA
Local Arrangements Chair:
    Tim Oates, University of Maryland, Baltimore County, USA
Local Publicity Chair:
    Lisa Singh, Georgetown University, USA
Panels Chair:
    Steve Lawrence, Google, USA
Proceedings Chair:
    Lise Getoor, University of Maryland, College Park, USA
Publicity Chair:
    Osmar R. Zaïane, University of Alberta, Canada
Registration Chairs:
    Rita Doerr, Department of Defense, USA
    Anupam Joshi, University of Maryland, Baltimore County, USA
Sponsorship Chairs:
    Herb Edelstein, Two Crows Corp., USA
    John F. Elder IV, Elder Research Inc., USA
Student Awards Chair:
    Mark Craven, University of Wisconsin, Madison, USA
Treasurer:
    Henry Goldberg, NASD, USA
Tutorials Chair:
    Ramakrishnan Srikant, IBM Almaden Research Ctr, USA
Webmaster:
    Osmar R. Zaïane, University of Alberta, Canada
Workshops Chair:
    Charu Aggarwal, IBM T. J.
    Watson Research Ctr, USA

Poster Papers – Research Track
    Stylistic Mining of Electronic Messages for Multiple Authorship Discrimination: First Results
        Shlomo Argamon, Marin Saric, Sterling Stein
    Mining High Dimensional Data for Classifier Knowledge
        Raj Bhatnagar, Goutham Kurra, Wen Niu
    Finding Recent Frequent Itemsets Adaptively over Online Data Streams
        Joong Hyuk Chang, Won Suk Lee
    Probabilistic Discovery of Time Series Motifs
        Bill Chiu, Eamonn Keogh, Stefano Lonardi
    Understanding Captions in Biomedical Publications
        William Cohen, Richard Wang, Robert Murphy
    Using Randomized Response Techniques for Privacy-Preserving Data Mining
        Wenliang Du, Zhijun Zhan
    Applications of Sampling and Fractional Factorial Designs to Model-Free Data Squashing
        William DuMouchel, Deepak K. Agarwal
    Experiments with Random Projections for Machine Learning
        Dmitriy Fradkin, David Madigan
    Accurate Decision Trees for Mining High-Speed Data Streams
        Joao Gama, Ricardo Rocha, Pedro Medas
    Correlating Synchronous and Asynchronous Data Streams
        Sudipto Guha, Dimitrios Gunopulos, Nick Koudas
    A Web Page Prediction Model Based On Click-stream Tree Representation of User Behavior
        Sule Gunduz, M. Tamer Ozsu
    Natural Communities in Large Linked Networks
        John Hopcroft, Omar Khan, Brian Kulis, Bart Selman
    Navigating Massive Data Sets via Local Clustering
        Michael E. Houle
    Mining Viewpoint Patterns in Image Databases
        Wynne Hsu, Jing Dai, Mong Li Lee
    Playing Hide-And-Seek with Correlations
        Christopher Jermaine
    Interactive Exploration of Coherent Patterns in Time-series Gene Expression Data
        Daxin Jiang, Jian Pei, Aidong Zhang

Poster Papers – Research Track, cont.
    Efficient Decision Tree Construction on Streaming Data
        Ruoming Jin, Gagan Agrawal
    A Bag-of-Paths Model for Representing Document Structure with Application to Web Mining
        Sachindra Joshi, Neeraj Agrawal, Raghu Krishnapuram, Sumit Negi
    Nantonac Collaborative Filtering: Recommendation Based on Order Responses
        Toshihiro Kamishima
    A Two-Way Visualization Method for Clustered Data
        Yehuda Koren, David Harel
    Empirical Comparisons of Various Voting Schemes in Boosting and Bagging
        Kelvin Leung, D. Stott Parker
    Mining Data Records in Web Pages
        Bing Liu, Robert Grossman, Yanhong Zhai
    On Computing, Storing and Querying Frequent Patterns
        Guimei Liu, Hongjun Lu, Wenwu Lou, Jeffrey Xu Yu
    Online Novelty Detection on Temporal Sequences
        Junshui Ma, Simon Perkins
    Distributed Cooperative Mining for Information Consortia
        Satoshi Morinaga, Kenji Yamanishi, Jun-ichi Takeuchi
    Learning Relational Probability Trees
        Jennifer Neville, David Jensen, Lisa Friedland, Michael Hay
    Graph-Based Anomaly Detection
        Caleb Noble, Diane Cook
    CARPENTER: Finding Closed Patterns in Long Biological Datasets
        Feng Pan, Gao Cong, Anthony K. H. Tung, Jiong Yang, Mohammed Zaki
    New Unsupervised Clustering Algorithm for Large Datasets
        William Peter, John Chiochetti
    Improving Spatial Locality of Programs via Data Mining
        Karlton Sequeira, Mohammed Zaki, Boleslaw Szymanski, Christopher Carothers
    Mining Phenotypes and Informative Genes from Gene Expression Data
        Chun Tang, Aidong Zhang, Jian Pei
    Weighted Association Rule Mining Using Weighted Support and Significance Framework
        Feng Tao, Fionn Murtagh, Mohsen Farid
    PaintingClass: Interactive Construction, Visualization and Exploration of Decision Trees
        Soon Tee Teoh, Kwan-Liu Ma
    Time and Sample Efficient Discovery of Markov Blankets and Direct Causal Relations
        Ioannis Tsamardinos, Constantin F. Aliferis, Alexander Statnikov
    Distributed Multivariate Regression Based on Influential Observations
        Hang Yu, Ee-Chien Chang
    Efficiently Handling Feature Redundancy in High-Dimensional Data
        Lei Yu, Huan Liu

Acknowledgements
The SIGKDD 2003 Conference gratefully acknowledges the contributions of the following institutions:
    Gold Sponsors
    Silver Sponsors
    Bronze Sponsors
    Sponsoring Organizations

Wednesday, August 27
8:00-12:00 (Concourse) Registration
8:30-17:00 Full Day Workshops:
    BIOKDD03: Data Mining in Bioinformatics (Monroe East)
    Data Cleaning, Record Linkage and Object Consolidation (Georgetown West)
    Fractals and Self Similarity in Data Mining: Issues and Approaches (Map Room – terrace level)
    Link Analysis (Georgetown East)
    MDM/KDD 2003: Integrated Media Mining (Caucus Room – terrace level)
    MRDM 2003: Multi-relational Data Mining (Monroe West)
    Operational Text Classification (Hemisphere Room)
    WebKDD2003: WebMining as a Premise to Intelligent and Effective Web Applications (Military Room)
8:30-12:00 Half Day Workshop:
    Data Mining Standards, Services and Platforms (Conservatory – terrace level)
8:30-12:00
    Tutorial: Data Mining for Computer Security (Lincoln West)
        Carla Brodley, Purdue University
        Philip Chan, MIT/FIT
    Tutorial: Data Mining for Machine Learners (Thoroughbred Room)
        Johannes Gehrke, Cornell University
        Jiawei Han, University of Illinois at Urbana
12:00-13:30 Lunch (on your own)
13:30-17:00
    Tutorial: Privacy-Preserving Data Mining (Lincoln West)
        Chris Clifton, Purdue University
    Tutorial: Sequence Data Mining Techniques and Applications (Thoroughbred Room)
        Mark Craven, University of Wisconsin, Madison
        Sunita Sarawagi, IIT Bombay

Poster Papers – Industrial/Government Track
    An Adaptive Nearest Neighbor Search for a
    Parts Acquisition ePortal
        Rafael Alonso, Jeffrey A. Bloom, Hua Li, Chumki Basu
    Architecting a Knowledge Discovery Engine for Military Commanders Utilizing Massive Runs of Simulations
        Philip Barry, Jianping Zhang, Mary McDonald
    Data Quality through Knowledge Engineering
        Tamraparni Dasu, Gregg T. Vesonder, Jon R. Wright
    Similarity Analysis on Government Regulations
        Gloria T. Lau, Kincho H. Law, Gio Wiederhold
    Experimental Design for Solicitation Campaigns
        Uwe F. Mayer, Armand Sarkissian
    Towards NIC-based Intrusion Detection
        M. Otey, S. Parthasarathy, A. Ghoting, G. Li, S. Narravula
    Data-Driven Validation, Completion and Construction of Event Relation Networks
        Chang-Shing Perng, David Thoenen, Sheng Ma, Genady Grabarnik, Joseph Hellerstein
    Visualizing Concept Drift
        Kevin B. Pratt, Gleb Tschapek
    Experimental Study of Discovering Essential Information from Customer Inquiry
        Keiko Shimazu, Atsuhito Momma, Koichi Furukawa
    Applying Data Mining in Investigating Money Laundering Crimes
        Zhongfei Zhang, John J. Salerno, Philip S. Yu

Tuesday, August 26
7:30-8:30 Continental Breakfast
8:00-18:00 (Concourse) Registration
8:00-17:00 (Exhibit Hall) Exhibits
8:30-9:30 (International Ballroom – Center) Invited Talk
    Chair: Paul Bradley
    Analyzing Customer Behavior at Amazon.com
        Andreas Weigend, Chief Scientist, Amazon.com
The first part of the talk gives an overview of the different kinds of data available at Amazon.com, emphasizing that data mining needs to drive actions such as emails, coupons, and recommendations of products, product groups, or site features. The scope of the actions ranges from the individual customer, over pre-computed customer segments, to the entire customer base. The second part presents joint work with Bruce D'Ambrosio (Cleverset, Inc.) on probabilistic relational models for customer behavior, both for discovering static customer attributes, and for dynamically predicting the intention of the customer and the outcome of a session. The third part outlines current research problems, such as modeling and eventually influencing the long-term behavior of customers. In addition to the importance of machine learning, it shows the central role that principles of behavioral economics, judgment and decision making play in computational marketing.
9:30-10:00 Coffee Break
10:00-11:30 Research Track 1 (Monroe Room) Relational and Graph Data
    Chair: Ray Mooney
    Aggregation-Based Feature Invention and Relational Concept Classes
        Claudia Perlich, Foster Provost
    Algorithms for Estimating Relative Importance in Networks
        Scott White, Padhraic Smyth
    CloseGraph: Mining Closed Frequent Graph Patterns
        Xifeng Yan, Jiawei Han
10:00-11:30 Research Track 2 (Georgetown Room) Data Streams and Sequential Data
    Chair: Johannes Gehrke
    Mining Concept-Drifting Data Streams using Ensemble Classifiers
        Haixun Wang, Wei Fan, Philip Yu, Jiawei Han
    Efficient Elastic Burst Detection in Data Streams
        Yunyue Zhu, Dennis Shasha
    Fragments of Order
        Aristides Gionis, Teija Kujala, Heikki Mannila
10:00-11:30 Industrial/Govt. Track (Military Room) Healthcare
    Chair: Eric Bloedorn
    Mining Hepatitis Data with Temporal Abstraction
        Tu B. Ho, Trong Dung Nguyen, S. Kawasaki, S. Q. Le, H. Yokoi, K. Takabayashi
    Clinical and Financial Outcomes Analysis with Existing Hospital Patient Records
        R. Bharat Rao, Radu S. Niculescu, Colin Germond, Harsha Rao
    BEST APPLICATION PAPER AWARD
    Empirical Bayesian Data Mining for Discovering Patterns in Post-Marketing Drug Safety
        David M. Fram, June S. Almenoff, William DuMouchel
11:45-13:45 (International Ballroom – Center) SIGKDD Business Lunch
14:00-15:30 Research Track 1 (Monroe Room) Web Mining and Data Cubes
    Chair: Ronny Kohavi
    Eliminating Noisy Information in Web Pages for Data Mining
        Lan Yi, Bing Liu, Xiaoli Li
    SEWeP: Using Site Semantics and a Taxonomy to Enhance the Web Personalization Process
        Magdalini Eirinaki, Michalis Vazirgiannis, Iraklis Varlamis
    Extracting Semantics from Data Cubes using Cube Transversals and Closures
        Alain Casali, Rosine Cicchetti, Lotfi Lakhal
14:00-15:30 Research Track 2 (Georgetown Room) Distance-Based Methods
    Chair: Martin Ester
    Towards Systematic Design of Distance Functions for Data Mining Applications
        Charu Aggarwal
    Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule
        Stephen Bay, Mark Schwabacher
    Adaptive Duplicate Detection Using Learnable String Similarity Measures
        Mikhail Bilenko, Raymond Mooney
14:00-15:30 Industrial/Govt. Track (Military Room) Systems
    Chair: Monte Hancock
    Knowledge-Based Data Mining
        Sholom M. Weiss, Stephen J. Buckley, Shubir Kapoor, Søren Damgaard
    The Anatomy of a Multimodal Information Filter
        Yi-Leh Wu, King-Shy Goh, Beitao Li, Huaxing You, Edward Y. Chang
    Golden Path Analyzer: Using Divide-and-Conquer to Cluster Web Clickstreams
        Kamal Ali, Steven P. Ketchpel
15:30-16:00 Coffee Break
15:45-18:45 (Georgetown Room) Tutorial: Information Extraction from the World Wide Web
    William Cohen, Carnegie Mellon University
    Andrew McCallum, University of Massachusetts, Amherst
16:00-18:30 Research Track 1 (Monroe Room) Frequent Sets
    Chair: Geoff Webb
    Screening and Interpreting Multi-item Associations Based on Log-linear Modeling
        Xintao Wu, Daniel Barbara, Yong Ye
    Fast Vertical Mining Using Diffsets
        Mohammed Zaki, Karam Gouda
    CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets
        Jianyong Wang, Jiawei Han, Jian Pei
    Inverted Matrix: Efficient Discovery of Frequent Items in Large Datasets in the Context of Interactive Mining
        Mohammad El-Hajj, Osmar R. Zaiane
    Mining Unexpected Rules by Pushing User Dynamics
        Ke Wang, Yuelong Jiang, Laks Lakshmanan
16:00-17:30 Research Track 2 (International Ballroom – Center) Data Reduction and Visualization
    Chair: Mihael Ankerst
    Efficient Data Reduction with EASE
        Hervé Brönnimann, Bin Chen, Manoranjan Dash, Peter Haas, Peter Scheuermann
    PROXIMUS: A Framework for Analyzing Very High Dimensional Discrete-Attributed Datasets
        Mehmet Koyuturk, Ananth Grama
    Visualizing Changes in the Structure of Data for Exploratory Feature Extraction
        Elias Pampalk, Werner Goebl, Gerhard Widmer
17:30-18:30 (International Ballroom – Center) Panel: Data Mining: The Next 10 Years
    Chair: Usama Fayyad, President, DMX Group
After nearly a decade and a half of KDD conferences and a significant growth in demand for data mining technology driven by a glut in data, data mining has grown as a healthy research community. However, we still struggle on two important fronts: the scientific and the commercial. On the scientific front, Data Mining still needs to reach a stronger level of attracting steady contributions from the related fields. On the commercial front, the huge opportunity has not yet been met with adequate tools and solutions. This panel will attempt to address the possible future directions for Data Mining and KDD. Will we continue a healthy evolution to being a scientific field of study with a healthy contributing community? Will we go more down the path of systems and engineering? What are the next challenge problems? What are the milestones that define healthy growth and significant advances? Is data mining destined to continue to be a visible area of focus and research, or will it evolve towards embedded technology studied as part of other systems? The presence of a significant set of research challenge problems against which measurable progress can be made is a crucial component for the growth of a scientific field. What will these challenge problems look like for KDD and Data Mining over the next 10 years and beyond?
    Panelists:
        Rakesh Agrawal, IBM Almaden Research
        Gregory Piatetsky-Shapiro, KDnuggets
        Daryl Pregibon, AT&T Research
        Raghu Ramakrishnan, University of Wisconsin, Madison
        Ramasamy Uthurusamy, General Motors
18:30-19:30 (Adams Room) Transfer meeting - KDD 2003 and KDD 2004 organizing committees
19:30-22:00 Program committee dinner (by invitation only)