BRIDGING THE GAP BETWEEN UNSTRUCTURED DATA AND STRUCTURED DATA
A presentation by W H Inmon

The informal systems of the corporation:
- unstructured data
- .doc files
- .txt files
- .xls files
- email
- transcribed telephone conversations

The formal systems of a corporation:
- structured systems
- structured data
- corporate transactions
- corporate reports
- corporate databases
- customer files
- audit reports

[Diagram: 80% / 20%]
It is estimated that less than 20% of corporate systems are structured.

[Diagram labels: search engines, web content, legal discovery, ontology, taxonomy, applications, dbms, email archive, document mgmt, compliance, business intelligence]
Imagine what would happen if the two worlds could be integrated...

[Diagram labels: ERP, OLTP, transactions]
The world of dbms, analytics, and other processing opens up.

Tight integration between the two types of data.

There is a gulf between the two worlds:
- technology
- business practice
- organizational
- historical

Think of the possibilities! Imagine this:
Reports and visualization show a lot. Have you ever wondered why you can’t hook up your Business Objects to email? Or to telephone conversations?

There is a fundamental disconnect between unstructured data and business intelligence.

So what would happen if we had powerful visualization for text?
[Visualization labels: liver cancer, skin cancer, diabetes, blood pressure, thirst]
Correlative information becomes very easy to spot.

Doing analysis on subpopulations:
- for the general population
- for women
- for women who smoke
- for women who smoke over the age of 50

The contrast between the correlations of different populations leads to great insight.

What about looking at customer feedback, such as complaints?
[Visualization labels: broken, wait too long, late service, did not fit, installation, salesman attitude, delivery]
Now you can see the broader picture of what is happening.

But there are plenty of other places where the technology applies:
- manufacturing warranties (what patterns of defects are there?)
- weblogs (marketing: who is saying what?)
- customer complaints (what are the problem products?)
- general email (what’s the buzz? what is on people’s minds?)
- insurance claims (what are the circumstances of accidents?)

Another possibility is the monitoring of email and the transport of email to the structured environment.

Monitoring emails and other corporate conversations:
- compliance: making sure that email is being used properly
- compliance: corporate standard for language
[Diagram labels: Sarbanes Oxley, HIPAA, BASEL II]

A bunch of emails and conversations:
Jan 3 - vp to vp “This is going to be a real barn burner of a quarter…”
Jan 5 - finance to vp “It looks like we are going to do $9,000,000 this quarter…”
Jan 5 - president to analyst “This quarter looks like we are going to break new records…”
Feb 1 - employee to employee “Did you see the stock market? Everything is going down…”
Feb 3 - president to vp “What is happening to sales in the midwest? We didn’t expect this…”
Feb 3 - vp to vp “The sales cycle looks like it is extending. The economy is tanking…”
Feb 4 - sales manager to vp “It looks like we are going to be a little short this quarter…”
Feb 6 - president to vp “What are we going to do to get sales up? Do we need to do some discounting?”
Mar 2 - sales person to vp “Demand has dried up. We aren’t going to close as many sales this quarter as we thought…”

What do you do with them?

Examining emails (“combing” them) for important corporate information:
[Slide repeats the emails above, annotated with Sarbanes Oxley terms: quarter, stock, sales, discount, demand, sales cycle]

External categories:

sales
- email - Feb 2
- email - Mar 5
- phone - Mar 8
- …

quarter
- email - Jan 2
- email - Jan 4
- email - Feb 5
- …

sales cycle
- email - Feb 24
- phone conversation - Mar 14
- meeting notes - Mar 18
- …

discount
- phone conversation - Jan 6
- email - Jan 12
- email - Jan 14
- …

[Diagram label: Structured Environment]
The “combed” information is brought over to the structured environment. Now you can use standard tools, such as Cognos, Business Objects, Crystal Reports, and MicroStrategy, to do analysis.

But there are other ways that communications can be used.

[Diagram labels: customer data, probabilistic match]
Emails and telephone conversations can be linked to CDI/CRM data. A true 360 degree view of the customer can be formed.

“I placed an order last week and when it arrived it was the wrong size. And then your company would not take it back.
I’m mad.”

How easy is it going to be to engage Mrs Jones until she is satisfied about her order?

[Diagram labels: communications, demographics]
Delivering on the promise of CDI.

Can’t I just use a search engine to link the two worlds?

Search engines do not integrate textual information.

Text doesn’t need to be searched, it needs to be integrated.

“ha” could mean “head ache”, “heart attack”, or “Hepatitis A”.

“oblique fractured ulna”, “oblique fractured tibia”, and “obliq fractured tarsi” are all a “broken bone”.

What is meant by editing, integrating text?
1 - stop word editing
2 - stemming
3 - synonym replacement
4 - synonym concatenation
5 - homograph resolution
6 - alternate spelling resolution
7 - external category classification
8 - theming
9 - probabilistic matching
10 - negation exclusion
11 - concept clustering
12 - mid process editing
13 - change sensitivity

[Diagram: DW 2.0, “The architecture for the next generation of data warehousing”. Sector labels: Interactive (very current; transaction data; applications), Integrated (current++), Near line (less than current), Archival (older). Unstructured component labels: textual subjects (internal, external), captured text, text id, simple pointers, linkage (text to subj). Structured component labels: detailed subject data, reference and master data, continuous snapshot data, profile data, summary. Side labels: business, technical.]

For a detailed description of how the unstructured environment should be linked to the structured environment, go to www.inmoncif.com and look for DW 2.0 TM, or go to www.inmondatasystems.com

Copyright 2006 Bill Inmon and Inmon Data Systems. DW 2.0 is a trademark of Bill Inmon and Inmon Data Systems. All rights reserved. “The architecture for the next generation of data warehousing” is copyrighted by Bill Inmon and Inmon Data Systems, 2006.

[Diagram labels: Unstructured Data, probabilistic match, Structured Environment (DB2), Query, visualization; tools: Business Objects, Cognos, MicroStrategy, Crystal Reports]
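
The “combing” step described earlier, where emails are examined for external category terms and the hits are carried over to the structured environment, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the method used by any Inmon product: the names `comb`, `load`, `counts`, and the `mention` table are hypothetical, the matching rule (a lowercase word-prefix match) is a simplification, and an in-memory SQLite database stands in for the DB2 structured environment shown on the slide. The category terms and email texts are taken from the slides.

```python
import re
import sqlite3

# External category terms from the slide (assumed to be matched as word prefixes).
CATEGORIES = ["quarter", "stock", "sales cycle", "sales", "discount", "demand"]

# (date, parties, text) tuples copied from the example emails in the slides.
EMAILS = [
    ("Jan 3", "vp to vp", "This is going to be a real barn burner of a quarter"),
    ("Jan 5", "finance to vp", "It looks like we are going to do $9,000,000 this quarter"),
    ("Feb 1", "employee to employee", "Did you see the stock market? Everything is going down"),
    ("Feb 3", "vp to vp", "The sales cycle looks like it is extending. The economy is tanking"),
    ("Feb 6", "president to vp",
     "What are we going to do to get sales up? Do we need to do some discounting?"),
    ("Mar 2", "sales person to vp",
     "Demand has dried up. We aren't going to close as many sales this quarter as we thought"),
]


def comb(text):
    """Return the category terms that start a word somewhere in the text."""
    lowered = text.lower()
    return [c for c in CATEGORIES if re.search(r"\b" + re.escape(c), lowered)]


def load(emails):
    """Comb each email and store one row per (category, email) hit."""
    conn = sqlite3.connect(":memory:")  # stand-in for the structured environment
    conn.execute("CREATE TABLE mention (category TEXT, sent TEXT, parties TEXT)")
    for sent, parties, text in emails:
        for category in comb(text):
            conn.execute("INSERT INTO mention VALUES (?, ?, ?)", (category, sent, parties))
    conn.commit()
    return conn


def counts(conn):
    """Count hits per category, the kind of query a reporting tool might issue."""
    cursor = conn.execute(
        "SELECT category, COUNT(*) FROM mention GROUP BY category ORDER BY category"
    )
    return dict(cursor.fetchall())


if __name__ == "__main__":
    print(counts(load(EMAILS)))
```

Once the hits sit in an ordinary table, any SQL-based reporting tool can query them. A real pipeline would also need the editing steps listed above (stemming, synonym replacement, homograph resolution, and so on), which this sketch leaves out.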