Knowledge Discovery in Databases & Information Retrieval
University of Texas at Austin
School of Information
Knowledge Management Systems
Presented April 29, 2003
By Anne Marie Donovan

Knowledge Discovery in Databases
• “The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad, Piatetsky-Shapiro, and Smyth, 1996, p. 30)
• Also known as knowledge extraction, information harvesting, data archeology, and information extraction (p. 28)

Information Retrieval
• “The methods and processes for searching relevant information out of information systems that contain extremely large numbers of documents” (Rocha, 2001, 1.1)
• “The ultimate goal of IR is to produce or recommend relevant information to users” (1.2)
• “Traditional IR does not identify users and classifies subjects only with unchanging keywords and categories” (1.2)

Institutions that use KDD/IR systems
• Require knowledge-based decisions
• Have a large quantity of accessible, relevant, historical and current data
• Have a high payoff for correct decisions
• Financial: banking & investment
• Medical: healthcare & insurance
• Sales: marketing & customer relations
(Piatetsky-Shapiro, 1998, Slides 28-31)

Database Management Systems
• File Systems
• Relational Database Management Systems (RDBMS)
• Object-Oriented Database Management Systems (OODBMS)
• Object-Relational Database Management Systems (ORDBMS)
(Devarakonda, 2001, ORDBMS)

Relational Database Management Systems (RDBMS)
• Relational databases are composed of many relations in the form of two-dimensional tables of rows and columns
• RDBMS advantages include the SQL standard (which enables migration between database systems), rapid data access, and large storage capacity
• RDBMS disadvantages include an inability to handle complex data types and relationships
(Devarakonda, 2001, RDBMS)

Object-Oriented Database Management Systems (OODBMS)
• OODBMS use abstract data types (ADTs) in which the internal data structure is hidden
• OODBMS data is managed through two sets of relations, one describing the interrelations of data items and another describing the abstract relationships
• OODBMS handle complex data relationships, but suffer from poor performance and problems of scalability
(Devarakonda, 2001, OODBMS)

Object-Relational Database Management Systems (ORDBMS)
• ORDBMS store all database information in tables, but some entries have richer data structures, which are also called abstract data types (ADTs)
• ORDBMS exhibit features of both the relational and object models, such as scalability and support for rich data types
• Their main advantage is massive scalability
(Devarakonda, 2001, ORDBMS)

The KDD Process
• Collecting and pre-processing data
  • The problem of continually increasing volumes of data
  • The problem of increasingly complex forms of data
• Identifying and extracting useful knowledge from large data repositories
  • What knowledge is in the data set?
  • What can be observed about the data set?
• Presenting the knowledge in usable forms
(Fayyad et al., 1996)

The KDD Process (continued)
• Data management problems in data collection, storage, and retrieval
  • Translation, change detection, integration, duplication, summarization, aggregation, timeliness/datedness (Widom, 1995)
• The impracticality of manual analysis
  • Billions of records and hundreds of fields
• Increasing desire for on-the-fly analysis and more flexible presentation
(Fayyad et al., 1996, p. 28)

The KDD Process (continued)
• A need to automate the knowledge discovery and extraction processes
  • Data selection and pre-processing
  • Data transformation and mining
  • Interpretation and evaluation (p. 28)
• Automation requires attention to:
  • Data collection, storage, and retrieval
  • Statistical foundations of search and retrieval processes (p.
29)

Stages in the KDD process
• Learning the application domain
• Creating a target data set
• Data cleaning and preprocessing
• Data reduction and projection
• Choosing the function of data mining
• Choosing the data mining algorithm
• Data mining
• Interpretation
• Using discovered knowledge
(pp. 30-31)

Data mining
• The application of specific algorithms to a data set for the purpose of extracting data patterns (p. 28)
• “Fitting models to or determining patterns from observed data” (p. 31)

Data warehousing
• Collecting and “cleaning” transactional data to make it available for online analysis and decision support (p. 30)

Data mining tasks
• Classification: predicting an item class
• Forecasting: predicting a parameter value
• Clustering: finding groups of items
• Description: describing a group
• Deviation detection: finding changes
• Link analysis: finding relationships and associations
• Visualization: presenting data visually to facilitate human discovery
(Piatetsky-Shapiro, 1998, Slide 17)

Components of data mining systems
• Model functions: classification, regression, clustering, etc. (pp. 31-32)
• Model representation: decision trees and rules, linear models, non-linear models, example-based methods, etc. (p. 32)
• Preference criterion: quantitative criterion embedded in the search algorithm; implicit criterion embedded in the KDD process
• Search algorithms: parameter search (given a model) or model search over model space

There is NO universal search algorithm
• Each type of search suits specific types of search problems
• The searcher must be careful to properly formulate the question
• The searcher must understand the search goal (p.
31)
• Every search can be improved by an increase in data or query context

Creating context for KDD and IR
• Extending IR throughout the social network of an organization, e.g., Answer Garden (Ackerman, 1994 & Ackerman and McDonald, 1996)
• Providing social context for data exchange, e.g., PeopleGarden (Xiong and Donath, 1999)
• Relational database reverse engineering, which “extracts a conceptual model from an existing relational database by analyzing data instances as well as metadata” (Lee and Hwang, 2002, Conclusion)

KD & IR problems for Web resources
• Collecting and pre-processing data
  • Even more continually changing data
  • Complex data; streaming & multi-media
• The problem of identifying and extracting useful knowledge from Web resources
  • No consistent data models; no context
  • A lack of descriptive information
• Presenting the knowledge in usable forms
  • More and more wireless devices and time-sensitive, multi-media applications

Current methods for Web KD & IR
• Collecting and pre-processing data
  • Web crawlers and link-based ranking
  • Human indexing and categorization
• Identifying and extracting useful knowledge from Web resources
  • Keyword search on natural language text
  • Topical directories or topical Web sites
• Presenting the knowledge in usable forms
  • Content presented in native format (plugins) or in HTML

Automating KD & IR for the Web
• Semantic markup to enable machine understanding/processing (RDF/S & DAML/OIL) & inference analysis
• Intelligent search engines and agents to exploit semantic statements
• Ontologies to provide context (a data model) for agents
(Shah et al., 2002)

Automating KD & IR for the Web (continued)
• Automated data collection, automated context collection (data pre-processing)
• Value-added services (query routing)
• Integrated query systems/knowledge delivery systems (accessibility)
• Social accounting metrics to provide context for humans (Smith, 2002, p.
52)

Enhanced presentation for the Web
• Reformatting for presentation
• Differentiated service
• Variable visualization
  • Adaptive graphics, “a unifying framework that allows visual representations of information to be customized and mixed together into new ones” (Boier-Martin, 2003, pp. 6-9)
  • Previewing & interactive content
  • Selective presentation & customized views

KDD and IR for pervasive computing
• Achieving “ubiquitous data access” (Cherniack, Franklin, & Zdonik, 2001, slide 7)
• Data management problems
  • Dissemination (context-dependent pull/push)
  • Synchronization (multiple collectors/devices)
  • Recharging (renewing) multiple data streams
  • Profile-driven data management

KDD and IR for pervasive computing (continued)
• Achieving “ubiquitous data access” (Cherniack, Franklin, & Zdonik, 2001, slide 7)
• Location-aware, mobile devices
• Service discovery for mobile services
• Distributed sensors/collectors
(slides 8-27)

Next generation KDD & IR will…
• Focus on solving business problems, not data analysis problems
• Embed knowledge discovery engines
• Integrate access to enterprise and external data on the back-end
• Integrate knowledge discovery process with knowledge delivery tools
(Piatetsky-Shapiro, 1998, Slide 7)

Next generation KDD & IR will…
• Manage information retrieval contextually
• Allow contextual query/continuous query
• Synchronize multiple data flows from disparate sensors/input devices
• Enable KD in virtual networks of peer-to-peer databases (data “clusters” or “cubes”)
• Interpolate or extrapolate for missing data
(Cherniack et al., 2001, slides 115-138)

Next generation KDD & IR will…
• Recognize individual users
• Characterize information resources
• Provide a way to exchange knowledge between users and information resources (push and pull of information)
• Adapt to the user community and enable the reuse and recombination of information as well as its exchange
(Rocha, 2001, 1.2)

KDD research problems
• Massive data sets & high dimensionality
• User interaction & prior knowledge
• Determining statistical significance
• Missing data
• Understandability of patterns
• Management of changing data & knowledge
• Data integration
• Non-standard, multimedia, & object-oriented data
(Fayyad, Piatetsky-Shapiro, & Smyth, 1996, pp. 33-34)

“Top Ten” IR research issues
• Integrated solutions
• Distributed IR
• Efficient, flexible indexing and retrieval
• “Magic” (automatic query expansion)
• Interfaces and browsing
• Routing and filtering
• Effective retrieval
• Multimedia retrieval
• Information extraction
• Relevance feedback
(Croft, 1995)

Total Information Awareness - DARPA on the bleeding edge…
• New database technologies
• Database architectures
• Database population
• New search algorithms and data models
• Genisys
  • Goal is to produce technology enabling ultra-large, all-source information repositories
  • http://www.darpa.mil/iao/Genisys.htm

Social Issues
• Communicating context
• Creating trust/social value
• Inciting cooperation/collaboration
• Privacy tradeoffs: convenience/service or security/privacy?

References

Ackerman, M. S. (1998, July). Augmenting the organizational memory: A field study of Answer Garden. ACM Transactions on Information Systems, 16(3), 203-204. Retrieved March 28, 2003 from http://doi.acm.org/10.1145/290159.290160

Ackerman, M. S., & Malone, T. W. (1990, April). Answer Garden: A tool for growing organizational memory. ACM SIGOIS Bulletin, 11(2-3), 31-39. Retrieved March 28, 2003 from http://doi.acm.org/10.1145/91474.91485

Ackerman, M. S., & McDonald, D. W. (1996). Proceedings of the ACM Conference on Computer-Supported Cooperative Work 1996 (CSCW96 Boston, MA).
Retrieved March 28, 2003 from http://doi.acm.org/10.1145/240080.240203

Boier-Martin, I. M. (2003, January/February). Adaptive graphics. In T. Rhyne (Ed.), Visualization Viewpoints, IEEE Computer Graphics and Applications, 23(1), 6-10. Retrieved April 5, 2003 from http://www.research.ibm.com/people/i/imartin/papers/visviewpoints.pdf

Chakrabarti, S., Srivastava, S., Subramanyam, M., & Tiwari, M. (2000). Using Memex to archive and mine community Web browsing experience. Paper presented at the 9th International World Wide Web Conference, Amsterdam, May 15-19, 2000. Retrieved April 12, 2003 from http://www9.org/w9cdrom/98/98.html

Croft, W. B. (1995, November). What do people want from information retrieval?: The top 10 research issues for companies that use and sell IR systems. D-Lib Magazine. Retrieved April 5, 2003 from http://sunsite.anu.edu.au/mirrors/dlib/dlib/november95/11croft.html

DARPA Information Awareness Office. (2003a). Genisys. Retrieved from the DARPA Information Awareness Office Web site at: http://www.darpa.mil/iao/Genisys.htm

DARPA Information Awareness Office. (2003b). Total Information Awareness System. Retrieved from the DARPA Information Awareness Office Web site at: http://www.darpa.mil/iao/TIASystems.htm

Devarakonda, R. (2001, March). Object-Relational database systems - The road ahead. ACM Crossroads Student Magazine. Retrieved April 12, 2003 from www.acm.org/crossroads/xrds7-3/ordbms.html

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996, November). The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11), 27-34. Retrieved March 3, 2003 from http://wwwhome.cs.utwente.nl/~mpoel/colleges/dwdm/ACM_artikelen/fayyad 2.pdf

Lee, D., & Hwang, Y. (2002, March 1). Extracting semantic metadata and its visualization. ACM Crossroads Student Magazine. Retrieved March 27, 2003 from www.acm.org/crossroads/xrds7-3/smeva.html

Piatetsky-Shapiro, G. (1998, December 4).
Data mining and knowledge discovery tools: The next generation. Retrieved February 27, 2003 from kdnuggets.com at http://www.kdnuggets.com/gpspubs/dama-nextgen-98/index.htm

Rauber, A., Aschenbrenner, A., Witvoet, O., Bruckner, R. M., & Kaiser, M. (2002, December). Uncovering information hidden in Web archives: A glimpse at Web analysis building on data warehouses. D-Lib Magazine, 8(12). Retrieved March 28, 2003 from http://www.dlib.org/dlib/december02/rauber/12rauber.html

Rocha, L. M. (2001). TalkMine: A soft computing approach to adaptive knowledge recommendation [Electronic version]. In V. Loia & S. Sessa (Eds.), Studies in fuzziness and soft computing: Vol. 75. Soft computing agents: New trends for designing autonomous systems (pp. 89-116). New York: Springer. Retrieved March 28, 2003 from http://www.c3.lanl.gov/~rocha/softagents.html

Shah, U., Finin, T., Joshi, A., Cost, R. S., & Mayfield, J. (2002, November). Information retrieval on the Semantic Web. Paper presented at the ACM Conference on Information and Knowledge Management, November 2002. Retrieved March 28, 2003 from http://www.csee.umbc.edu/~finin/papers/cikm02/cikm02.pdf

Smith, M. (2002). Tools for navigating large social cyberspaces. Communications of the ACM, 45(4), 51-55. Retrieved March 28, 2003 from http://delivery.acm.org/10.1145/510000/505272/p51smith.html?key1=505272&key2=5541680501&coll=GUIDE&dl=GUIDE&CFID=9914049&CFTOKEN=12943474

Whitted, T. (1999, July/August). Draw on the Wall. IEEE Computer Graphics and Applications, 19(4), 6-9. Retrieved April 8, 2003 from ieeexplore.ieee.org at: http://ieeexplore.ieee.org/iel5/38/16795/00773957.pdf?isNumber=16795&arnumber=773957&prod=JNL&arSt=6&ared=9&arAuthor=Whitted%2C+T.

Widom, J. (1995, November). Research problems in data warehousing. Proceedings of the 4th International Conference on Information and Knowledge Management (CIKM).
Retrieved March 28, 2003 from http://www.ischool.utexas.edu/~i385tkms/readings/Widom-1995ResearchProblems.pdf

Xiong, R., & Donath, J. (1999). PeopleGarden: Creating data portraits for users. CHI Letters, 1(1), 37-44. Retrieved April 8, 2003 from http://smg.media.mit.edu/papers/Xiong/pgarden_uist99.pdf