Knowledge Management Systems: Development and Applications
Part III: Case Studies and Future

Hsinchun Chen, Ph.D.
McClelland Professor, Director, Artificial Intelligence Lab and Hoffman E-Commerce Lab
The University of Arizona
Founder, Knowledge Computing Corporation

Acknowledgement: NSF DLI1, DLI2, NSDL, DG, ITR, IDM, CSS, NIH/NLM, NCI, NIJ, CIA, NCSA, HP, SAP

Knowledge Management Systems: Case Studies

Multi-lingual Knowledge Portal (1M): Meta searching, post-retrieval analysis, summarization, categorization, AI Lab toolkits

• Knowledge Portals are online searching systems that provide a large amount of information resources and services within a specific domain.
– Providing frequently updated and highly domain-specific information.
– Providing efficient and precise searching service.
– Providing advanced analysis functionalities which can help users find the information needed among huge amounts of data.
– Providing additional tools such as Personalization and Alerting System to facilitate the searching tasks.

NanoPort: Knowledge Portal for Nanotechnology Researchers
• Goal:
– Providing information services to nanotechnology researchers.
– The design of the content and function is based on the feedback of Nanoscale Science and Engineering (NSSE) experts.
• Content:
– 1,000,000 high quality nanotechnology-related webpages in database.
– Meta-search 4 search engines, 5 online databases and 3 online journals.
• Key Features:
– Dynamic summarization
– Folder display
– Visualization using self-organizing map (SOM)
– Patent analysis
• Funding:
– US National Science Foundation (NSF) Nano Initiative
• Demo:
– http://nanoport.org/

[Screenshots: NanoPort folder display and visualization with SOM.]
[Screenshots: NanoPort search and summary pages. Labels: input keywords; select search engines; select online databases; select online journals; summarize result dynamically; the original page; summary. The summary is highlighted in the original page with a corresponding color. Clicking a summary sentence jumps to its position in the original page.]

MedTextus: English Medical Intelligence
• Goal:
– Providing information services to researchers in the medical domain.
• Content:
– Meta-search 5 large medicine-related online databases and journals.
• Key Features:
– Keyword suggester
– Folder display
– Visualization using SOM
• Funding:
– US National Library of Medicine (NLM)
• Demo:
– http://ai23.bpa.aizona.edu/medtextus/

[Screenshots: MedTextus folder display, visualization with SOM, and result page. Labels: select databases; input keywords; keyword suggester; keywords suggested by the system; advanced search options.]

eBizPort: English Business Intelligence
• Goal:
– Providing business, trading and financial information services to commercial users.
• Content:
– 500,000 high quality webpages in database.
– Meta-search 10 authoritative online business magazines.
• Key Features:
– Search by date
– Keyword suggester
– Dynamic summarization
– Folder display
– Visualization using SOM
• Demo:
– http://ai18.bpa.arizona.edu:8080/ebizport/

[Screenshots: eBizPort result page, folder display and SOM, keyword suggester. Labels: keywords suggested by the system; limit the date of the result pages; date of the result page.]

Chinese Medical Intelligence (CMI)
• Goal:
– Providing medical and health information services to both researchers and the public.
• Content:
– 350,000 high quality medical-related webpages collected from mainland China, Hong Kong and Taiwan.
– Meta-search 3 large general Chinese search engines.
• Key Features:
– Built-in Simplified/Traditional Chinese encoding conversion
– Dynamic summarization for both Simplified and Traditional Chinese
– Automatic categorization
– Visualization using SOM
• Demo:
– http://128.196.40.169:8000/gbmed/

[Screenshots: CMI Chinese folder display, Chinese visualization with SOM, and Simplified/Traditional Chinese summarization (Simplified Chinese summary, Traditional Chinese summary, original encoding of the result). Labels: select websites from mainland China, Hong Kong and Taiwan; select search engines from mainland China, Hong Kong and Taiwan; results are from both Simplified and Traditional Chinese; Traditional Chinese results have been converted into Simplified Chinese.]

Chinese Business Intelligence (CBI)
• Goal:
– Providing business, trading and financial information services to Chinese commercial users.
• Content:
– 300,000 high quality webpages collected from mainland China, Hong Kong and Taiwan.
• Key Features:
– Built-in Simplified/Traditional Chinese encoding conversion
– Dynamic summarization for both Simplified and Traditional Chinese
– Folder display
– Visualization using SOM
• Demo:
– http://ai14.bpa.arizona.edu:8081/nanoport/

[Screenshots: CBI Chinese folder display, Chinese summarizer (Simplified Chinese summary, Traditional Chinese summary), and Chinese visualization with SOM. Labels: the largest business, trading and financial websites in mainland China, Hong Kong and Taiwan; both Simplified and Traditional Chinese results are returned.]

Spanish Business Intelligence Portal
[Screenshots: Search Page, Result Page, Summarizer, Categorizer, Visualizer, and Spanish Business Taxonomy.]
• Search Page:
– Example keyword: comercio electronico
– Keyword suggestion from Scirus and Concept Space
– Detailed directory of Spanish business resources on the Web
– Meta searches 7 major sources and provides searching of its own collection (PIN)
– Supports boolean searching and allows the display of 10, 20, 30, 50, or 100 results per meta searcher
• Result Page:
– Search, organize, or visualize results
– Results organized by meta searchers
– Automatic keyword suggestion
• Summarizer:
– Summarize in 3 or 5 sentences
– A three-sentence summary on left, original page shown on right
• Categorizer:
– Web pages grouped by key phrases extracted by mutual information algorithm (non-exclusive categorization)
• Visualizer:
– Web pages visualized by self-organizing map (SOM) algorithm
• Spanish Business Taxonomy:
– Web sites about the topic “Electronic Commerce” in Spanish-speaking countries

Arabic Medical Intelligence Portal
[Screenshots: Search Page, Result Page, Categorizer, Visualizer.]
• Provides a virtual Arabic keyboard to facilitate input

Lessons Learned
• The content selection and functionality design of a knowledge portal should meet the needs of real users.
• Using meta-search together with other traditional data collecting methods can improve the recall without sacrificing the precision of the knowledge portal.
• The structure of the webpage may introduce noise into the dynamic summary.
• The AI Lab toolkits support scalable multi-lingual spidering, indexing, searching, summarization, and categorization.
• New Spanish and Arabic portals completed.
• New cross-lingual web retrieval engine completed.

Biomedical Informatics (10M): Biomedical content, biomedical ontologies, linguistic phrasing, categorization, text mining

HelpfulMED Search of Medical Websites
[Screenshot.]

HelpfulMED Search of Evidence-based Databases
[Screenshot. Labels: What does the database cover? Search which databases? How many documents? Enter search term.]

Consulting HelpfulMED Cancer Space (Thesaurus)
[Screenshot sequence.]
• Enter search term
• Select relevant search terms
• New terms are posted
• Search again...
• Or find relevant webpages

Browsing HelpfulMED Cancer Map
[Screenshot sequence, steps 1 to 5. Labels: Visual Site Browser; top level map; Diagnosis, Differential; Brain Neoplasms; Brain Tumors.]

Genescene Overview
• Knowledge Base: Integrate gene relations from literature and outside databases and provide knowledge for learning and evaluation in data mining.
• Text Mining: Process Medline abstracts and extract gene relations automatically from the text.
• Data Mining: Process gene expression data (and existing knowledge) and use different algorithms to extract regulatory networks.
• Interface & Visualization: Allow searching for keywords, display a map of the relations extracted from the text and/or from the microarray.

Genescene Overview
[Architecture diagram. Labeled components:
– Sources: Publications (Medline), Ontologies, External Databases (HUGO, GO, UMLS, JIF)
– GeneScene Text Mart: XML Parser, Publications & Meta Information, Titles & Abstracts
– Text Mining: Relation Parsers (Lexical lookup, UMLS, AZ Noun Phraser, POS Tagging, Adjuster & Tagger, Full Parser, FSA Relation Grammar, Feature Structures, relations in flat files); Concept Space (co-occurrence relations, relations in flat files)
– Knowledge Base and GeneScene Data Mart
– Data Mining: Micro Array Data, Bayesian Networks, Association Rule Mining
– GeneScene Information Retrieval and Visualization: Spring Algorithm]

Problem: Gene Pathway
• Title: Key roles for E2F1 in signaling p53-dependent apoptosis and in cell division within developing tumors.
• Abstract: Apoptosis induced by the p53 tumor suppressor can attenuate cancer growth in preclinical animal models. Inactivation of the pRb proteins in mouse brain epithelium by the T121 oncogene induces aberrant proliferation and p53-dependent apoptosis. p53 inactivation causes aggressive tumor growth due to an 85% reduction in apoptosis. Here, we show that E2F1 signals p53-dependent apoptosis since E2F1 deficiency causes an 80% apoptosis reduction. E2F1 acts upstream of p53 since transcriptional activation of p53 target genes is also impaired. Yet, E2F1 deficiency does not accelerate tumor growth.
Unlike normal cells, tumor cell proliferation is impaired without E2F1, counterbalancing the effect of apoptosis reduction. These studies may explain the apparent paradox that E2F1 can act as both an oncogene and a tumor suppressor in experimental systems.

Action, Protocols, Graphic Representation
[Table of an expert's reading protocol. The graph drawings over the nodes E2F1, p53, apoptosis, and tumor growth are not reproduced.]
• Reads: "E2F1 signals p53-dependent apoptosis"
• Infers: "So, I'm assuming... a straight line pathway..."
• Expert errs and corrects
• Reads: "E2F1 acts upstream of p53"
• Reads: "E2F1 deficiency does not accelerate tumor growth"
• Final graph

Prepositions: OF/BY/IN
[Diagram: finite state automaton with states q0 to q18. Transition labels include NP, verb, aux, mod, negation, nominalization (-ion), adjective/noun/verb (-ed), and the prepositions OF, BY, and IN.]

Example Map (one abstract)
[Screenshots. Labels: select interesting relations to visualize; overview; double click to expand; expanded node.]

Finding the truth: p38 acts as a negative feedback for Ras signaling
[Screenshot.]

Lessons Learned:
• Biomedical information is precise but terminologies are fluid
• SOM performance for medical documents = 80%
• Biomedical professionals need search and analysis help
• Biomedical linguistic parsing and ontologies are promising for biomedical text mining
• There is a need for integrated biomedical data (gene microarray) and text mining (literature)
• New testbeds completed: p53, AP1, and yeast

COPLINK Crime Data Mining (10M): Intelligence and security informatics, crime association, crime network analysis and visualization

COPLINK Connect
Consolidating & sharing information promotes problem solving and collaboration
[Diagram. Labeled sources: Records Management Systems (RMS), Gang Database, Mugshots Database.]

COPLINK Connect Functionality
• Generic, common XML based criminal elements representation
• Data migration (batch and incremental) and mapping for all major databases and legacy systems
• Database independent: ODBC compliance data warehouse
• Multi-layered Web-based architecture: database server, Web server, browser
• Powerful and flexible search tools for various reports, e.g., incidents, warrants, pawns, etc.
• Graphical browser-based GUI interface for ease of use, training and maintenance
H. Chen, J. Schroeder, R. V. Hauck, L. Ridgeway, H. Atabakhsh, H. Gupta, C. Boarman, K. Rasmussen, and A. W. Clements, “COPLINK Connect: Information and Knowledge Management for Law Enforcement,” Decision Support Systems, Special Issue on Digital Government, 2003.

COPLINK Detect
Consolidated information enables targeted problem solving via powerful investigative criminal association analysis

COPLINK Detect Functionality
• Simple association rule mining applied to criminal elements relationships
• Generic, common XML based representation for criminal relationships
• Incremental data migration and association analysis on databases
• Support powerful, multi-attribute queries using partial crime information
• Graphical browser-based GUI interface for simple crime relationship analysis and case retrieval
H. Chen, D. Zeng, H. Atabakhsh, W. Wyzga, J. Schroeder, “COPLINK: Managing Law Enforcement Data and Knowledge,” Communications of the ACM, 2003.

COPLINK Detect 2.0/2.5
[Screenshot.]

COPLINK Connect/Detect Status
• Systems stable and shown useful.
Commercialized and supported by KCC
• Systems deployed at: TPD, UAPD, PPD, Phoenix, Huntsville (TX), Des Moines (Iowa), Ann Arbor (Michigan), Boston (Massachusetts), Montgomery County (sniper investigation)
• Systems under deployment: Salt River (AZ), Cambridge (Massachusetts), Redmond (Washington), many others
• COPLINK acclaimed in the LA Times, New York Times, and Newsweek (sniper investigation)

COPLINK Visual Data Mining Research

COPLINK Criminal Network Analysis: Association Tree, Association Network Analysis, Temporal-Spatial Visualization
• P1000: A picture is worth 1000 words.
• Use visual representations and effective HCI to assist in more efficient and effective crime analysis
• Leverage different representations and algorithms: hyperbolic trees, network placement algorithms, structural analysis, geospatial mapping, time visualization
H. Chen, D. Zeng, H. Atabakhsh, W. Wyzga, J. Schroeder, “COPLINK: Managing Law Enforcement Data and Knowledge,” Communications of the ACM, 2003.

A 9/11 Terrorist Network
[Figure.]

COPLINK Association Tree and Network (2nd generation)
Figure 1a: Relations among multiple criminal elements are shown on both a hyperbolic tree (right) and a hierarchical list (left).
Figure 1b: A hyperbolic tree with multiple levels of investigative leads.
Figure 2c: A user may choose only the type that is of interest (e.g., person) and view crime associations (e.g., person name, address).
Figure 2a: The initial layout of a criminal network before analysis.
Figure 2b: The network is analyzed and automatically adjusted to reflect subgroups and central criminal figures.

COPLINK Criminal Structural Analysis (3rd generation)
• Criminal association identification
– Using shortest-path algorithms to find the strongest associations between two or more criminals in a network
• SNA (Social Network Analysis)
– Using blockmodel analysis to detect subgroups and patterns of interactions between groups
– Identifying leaders, gatekeepers, and outliers from a criminal network
J. Xu & H. Chen, “Criminal Network Analysis: A Data Mining Perspective,” Decision Support Systems, 2004, forthcoming.

The proposed framework
[Figure.]

COPLINK SNA Experiment
• Data Sets
– TPD incident summaries
• Time period: Narcotics: 2000-present; Gangs: 1995-present
• Size:

  Network     Total # individuals   # sub-networks   Size of sub-networks
  Narcotics   12,842                2,628            1-10: 2,587; 11-20: 31; 21-100: 9; 502: 1
  Gangs       4,376                 289              1-10: 264; 11-20: 20; 21-100: 4; 2,595: 1

– Two testing networks
• Narcotics (60 individuals)
• Gang (24 individuals)

A narcotic network example
[Screenshot. Labels:]
• A bubble represents a subgroup labeled by its leader's name.
• The size of a bubble is proportional to the number of individuals in the group.
• A line between bubbles implies that some individuals in one group interact with some individuals in the other group. The thicker the link, the more individual interactions between the two groups.
• A point represents an individual labeled by his name.
• A line between points represents a link between two persons.
• The rankings of the members of a selected group (green).
• Controls: switch between narcotic network and gang network; show network and reset network; adjust level of details.

A gang network example
[Screenshot. Labels: the leader; the reduced network structure; a clique; a gatekeeper.]

Patterns Found
• The chain structure of the narcotic network
• Implications: disrupt the network by breaking the chain
• The star structure of the gang network
• Implications: disrupt the network by removing the leader

Expert Validation
[Network figure. Group labels: a group of black gangs; white gangs who were involved in murders and shootings; white gangs who sold crack cocaine.]
“(211) and (173) are best friends”
“Yes, these two groups are together very often”
“He is very important. He has a lot of money and sells drugs.
His girl friend brings a lot of dancers in the city and buy drugs.”

Lessons Learned:
• Data warehousing and gateway approaches are needed for information consolidation
• XML and data normalization are critical
• Co-occurrence analysis and link analysis are extremely useful for crime investigation
• Visual data mining is essential for criminal network analysis
• Wireless (laptop, PDA, cell phone) application is essential
• KM techniques may create unintended cultural and practice side effects

GetSmart Concept Maps: Knowledge creation, transfer and mapping

Meaningful Learning: A Continuum
• Meaningful Learning (Creative Production):
– Substantive synthesis
– Relate to experiences
– Intentionally connect to prior knowledge
• Most School Learning:
– Practice, rehearsal and thoughtful replication contribute to meaningful learning.
• Rote Learning:
– Memorization
– Unrelated to experience
– No effort to link to existing knowledge
* Adapted from Novak’s model of meaningful learning

Six Steps of Information Search: A Constructivist Approach
Learners are actively involved in building on what they already know to come to a new understanding of the subject under study.
• Initiation: Introduce a problem.
• Selection: Identify a general area for investigation.
• Exploration and Formulation: Explore information to form a focus.
• Collection: Gather information that defines/supports the focus.
• Presentation: Summarize the topic and prepare to present to the intended audience.

GetSmart Learning Tools
[Diagram. Labels: Digital Library; Curriculum; Keyword Suggestion; Filtered Material; A Place to Store Work; Assignments; Announcements; Linked Resources; Knowledge Representation; Concept Map; Customized Resources.]

A Concept Map about Concept Maps
[Figure.]

GetSmart Interface
[Screenshot. Labels: navigation bar; search tools; concept map management tools; meta search options.]

1 By right clicking on a node you can delete the node, change the properties of the node, or add a resource to the node. Resources can be URLs, Maps, or Notes.
2 You can either type a URL, or click the “Add From URL Clipboard” button.
3 This is the clipboard. Simply highlight the URL you would like to add to a node and click OK.
4 Your URL will appear in the window; click the Done button to add it to your map.

Printing
Choosing the Print option will cause a new window to open. This window will show your map, the title of the map, and any URLs, notes, or maps you have linked to your map.

Usage: Overall at UA and VT
• 114 student users; all UA students (54) turned in all assignments (VT assignments still pending)
• 4,000+ user sessions
• 1,000+ maps created for homework and presentations
• 600+ searches performed
• 50+ maps created as a group
• 40,000+ relationships represented in the maps

Results (1)
• 120 cue phrases were used to extract 37,674 links, which accounted for 93% of the pool.
• These cue phrases were categorized into the proposed link types:
– About 50 cue phrases map to the five previously determined link types: hierarchical, componential, comparative, influential, and procedural.
– Over 50% of cue phrases expressed hierarchical and componential relationships.
– Descriptive relationships accounted for a large portion (30%), which were analyzed further.
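The cue-phrase categorization can be sketched in a few lines of Python. This is a minimal illustration, not the GetSmart implementation: the cue phrases are the representative ones listed in the Link Type Distribution table, while the prefix-matching rule, the first-match-wins ordering, and the function names `classify_link` and `link_type_distribution` are assumptions made for the example.

```python
import re
from collections import Counter

# Representative cue phrases per link type, taken from the Link Type
# Distribution table. The full set of 120 cue phrases is not given in the
# slides, so this dictionary is only a partial, illustrative list.
CUE_PHRASES = {
    "hierarchical": ["example", "such as", "case", "type", "member", "is a"],
    "componential": ["consist", "contain", "include", "compose", "part", "made of"],
    "comparative": ["like", "compare", "similar", "differ", "alternative"],
    "influential": ["lead to", "cause", "result", "influence", "determine"],
    "procedural": ["next", "go to", "procedure"],
    "descriptive": ["use", "implement", "present", "advantages", "feature"],
}


def classify_link(label: str) -> str:
    """Return the first link type whose cue phrase starts a word in the label.

    Assumption: a cue matches when it begins at a word boundary, so
    "consist" also matches "consists" and "use" also matches "used".
    """
    text = label.lower()
    for link_type, cues in CUE_PHRASES.items():
        for cue in cues:
            if re.search(r"\b" + re.escape(cue), text):
                return link_type
    return "unclassified"


def link_type_distribution(labels):
    """Return the percentage of link labels that fall into each link type."""
    counts = Counter(classify_link(label) for label in labels)
    total = sum(counts.values())
    return {t: round(100 * c / total, 2) for t, c in counts.items()}
```

Prefix matching is deliberately loose, so a short cue such as "part" would also match "particular"; a real system would likely need stemming or a curated phrase list. Labels that match no cue fall into "unclassified", which mirrors the slide's observation that the cue phrases covered only part of the link pool.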
Link Type Distribution
[Bar chart: percentage of links per link type. Over 50% of the links expressed hierarchical or componential relationships. Descriptive relationships accounted for a large portion at 30%, so we further analyze this link type.]

  Link type      Number*   Percentage   Representative cue phrases
  Hierarchical   8,026     21.30%       example, such as, case, type, member, is a
  Componential   12,307    32.67%       consist, contain, include, compose, part, made of
  Comparative    1,455     3.86%        like, compare, similar, differ, alternative
  Influential    3,635     9.65%        lead to, cause, result, influence, determine
  Procedural     1,097     2.91%        next, go to, procedure
  Descriptive    11,153    29.60%       use/implement/present/advantages/feature
  Sum            37,673    100.00%

* The number of links which had those identified cue phrases in them

Lessons Learned:
• Digital library and concept maps support meaningful learning
• Digital library systems provide support for community knowledge creation.
• Semi-open link systems are useful for capturing knowledge and learning process
• NSDL is not a “library.” It should be a learning or knowledge creation environment.

Knowledge Management Systems: Future

Other Emerging Categorization Challenges/Opportunities:
• Multilingual terminology and semantic issues
• Web analysis and categorization issues
• E-Commerce information (transactions) classification issues
• Multimedia content and wireless delivery issues
• Future: semantic web, multilingual web, multimedia web, wireless web!
The Road Ahead
• The Semantic Web: XML, RDF, Ontologies
• The Wireless Web: WML, WIFI, display
• The Multimedia Web: content indexing and analysis
• The Multilingual Web: cross-lingual MT and IR

Requirements For Successful KMS Implementation (General)
• Sponsor for the application
• Business case for the application clearly understood and measurable
• High likelihood of having a significant impact on the business
• Good quality, relevant data in sufficient quantities
• The right people: business domain, data management, and data mining experts

Requirements For Successful KMS Implementation (KM Specific)
• Information overload is more than anyone can handle
• Productivity gained and decision improvements evident among knowledge workers
• Organization’s IT infrastructure ready
• Need to integrate with consulting, process, content, and policy considerations

For Project Information at AI Lab:
• http://ai.bpa.arizona.edu
• hchen@bpa.arizona.edu