Bridging the Gap Between Unstructured Data and Structured Data

A presentation by
W H Inmon
The informal systems of the corporation:
- unstructured data
- .doc files
- .txt files
- .xls files
- email
- transcribed telephone conversations
The formal systems of a corporation:
- structured systems
- structured data
- corporate transactions
- corporate reports
- corporate databases
- customer files
- audit reports
[Figure: 80% unstructured (.doc, .txt, email) versus 20% structured]
It is estimated that less than 20% of corporate
systems are structured.
[Figure: the two worlds and the technologies around them: search engines, web content, legal discovery, ontology, taxonomy, applications, dbms, email archive, document mgmt, compliance, business intelligence]
Imagine what would happen if the two worlds could be integrated...
[Figure: the structured world: ERP, OLTP, transactions]
The world of dbms, analytics, and other processing opens up.
Tight integration between the two types of data.
There is a gulf between the two worlds:
- technology
- business practice
- organizational
- historical
Think of the possibilities!
Imagine this:
Reports and visualization show a lot.
Have you ever wondered why you can't hook up your Business Objects to email? Or to telephone conversations?
There is a fundamental disconnect
between unstructured data
and business intelligence.
So what would happen if we had powerful visualization
for text?
[Figure: visualization of related terms: liver cancer, skin cancer, diabetes, blood pressure, thirst]
Correlative information becomes very easy to spot.
Doing analysis on subpopulations:
- for the general population
- for women
- for women who smoke
- for women who smoke over the age of 50
Comparing the general population with women who smoke over the age of 50:
the contrast between the different correlations of different populations leads to great insight.
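The subpopulation comparison above can be sketched as a count of which terms appear together, computed once for everyone and once for a filtered group. This is a minimal illustration only: the records, the field names (`sex`, `smoker`, `age`, `terms`) and the `cooccurrence` helper are all hypothetical and carry no medical meaning.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: attributes of a person plus the terms found in their text.
RECORDS = [
    {"sex": "F", "smoker": True, "age": 62, "terms": {"liver cancer", "thirst"}},
    {"sex": "F", "smoker": True, "age": 55, "terms": {"liver cancer", "thirst", "blood pressure"}},
    {"sex": "F", "smoker": False, "age": 41, "terms": {"diabetes", "thirst"}},
    {"sex": "M", "smoker": False, "age": 67, "terms": {"diabetes", "blood pressure"}},
    {"sex": "M", "smoker": True, "age": 38, "terms": {"skin cancer"}},
]


def cooccurrence(records, predicate=lambda r: True):
    """Count how often each pair of terms appears in the same record."""
    counts = Counter()
    for record in records:
        if predicate(record):
            # Sort so that each pair is always stored in the same order.
            for pair in combinations(sorted(record["terms"]), 2):
                counts[pair] += 1
    return counts


general = cooccurrence(RECORDS)
women_smokers_over_50 = cooccurrence(
    RECORDS, lambda r: r["sex"] == "F" and r["smoker"] and r["age"] > 50
)
```

Placing the two counters side by side is the "contrast" step: a pair that is common in one population and absent in the other is the kind of difference a visualization would make visible.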
What about looking at customer feedback (complaints)?
[Figure: visualization of complaint terms: broken, wait too long, late, service, did not fit, installation, salesman, attitude, delivery]
Now you can see the broader picture of what is happening.
But there are plenty of other places where the technology applies:
- manufacturing warranties (what patterns of defects are there?)
- weblogs (marketing: who is saying what?)
- customer complaints (what are the problem products?)
- general email (what's the buzz? what is on people's minds?)
- insurance claims (what are the circumstances of accidents?)
Another possibility is the monitoring of email and the transport of email to the structured environment.
Monitoring emails and other corporate conversations means making sure that email is being used properly:
- compliance
- corporate standard for language
[Figure: Sarbanes Oxley, HIPAA, Basel II]
A bunch of emails and conversations:
Jan 3 - vp to vp
“This is going to be a real barn burner of a quarter….”
Jan 5 – finance to vp
“It looks like we are going to do $9,000,000 this quarter…”
Jan 5 – president to analyst
“This quarter looks like we are going to break new records…”
Feb 1 – employee to employee
“Did you see the stock market? Everything is going down…”
Feb 3 – president to vp
“What is happening to sales in the Midwest? We didn’t expect this…”
Feb 3 – vp to vp
“The sales cycle looks like it is extending. The economy is tanking…”
Feb 4 – sales manager to vp
“It looks like we are going to be a little short this quarter…”
Feb 6 – president to vp
“What are we going to do to get sales up? Do we need to do some discounting?”
Mar 2 – sales person to vp
“Demand has dried up. We aren’t going to close as many sales this
quarter as we thought…”
What do you do with them?
Examining the emails above (“combing” them) for important corporate information, using the terms of an external category:
Sarbanes Oxley: quarter, stock, sales, discount, demand, sales cycle
External categories:
sales
- email – Feb 2
- email – Mar 5
- phone – Mar 8
- ...
quarter
- email – Jan 2
- email – Jan 4
- email – Feb 5
- ...
sales cycle
- email – Feb 24
- phone conversation – Mar 14
- meeting notes – Mar 18
- ...
discount
- phone conversation – Jan 6
- email – Jan 12
- email – Jan 14
- ...
The “combed” information is brought over to
the structured environment.
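One way to picture "combing" is as a scan of each communication for the terms of an external category, producing an index from term to the communications that mention it. The sketch below is an assumption about how that could be done, not the presenter's implementation: the `comb` function is hypothetical, the messages are the sample emails quoted earlier, and treating all of them as type "email" is a simplification.

```python
import re
from collections import defaultdict

# Terms of the external category shown in the slides.
CATEGORY_TERMS = ["quarter", "stock", "sales", "discount", "demand", "sales cycle"]

# (type, date, text); the type "email" is assumed for every sample message.
COMMUNICATIONS = [
    ("email", "Jan 3", "This is going to be a real barn burner of a quarter"),
    ("email", "Jan 5", "It looks like we are going to do $9,000,000 this quarter"),
    ("email", "Feb 1", "Did you see the stock market? Everything is going down"),
    ("email", "Feb 3", "What is happening to sales in the midwest? We didn't expect this"),
    ("email", "Feb 3", "The sales cycle looks like it is extending. The economy is tanking"),
    ("email", "Feb 6", "What are we going to do to get sales up? Do we need to do some discounting?"),
    ("email", "Mar 2", "Demand has dried up. We aren't going to close as many sales this quarter as we thought"),
]


def comb(communications, terms):
    """Build an index: term -> list of (type, date) where the term occurs."""
    index = defaultdict(list)
    for kind, date, text in communications:
        lowered = text.lower()
        for term in terms:
            # Match at the start of a word so "discount" also finds "discounting".
            if re.search(r"\b" + re.escape(term), lowered):
                index[term].append((kind, date))
    return dict(index)


index = comb(COMMUNICATIONS, CATEGORY_TERMS)
```

The resulting dictionary has the same shape as the term-by-term listing in the slides: each category term points at the dated communications in which it was found.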
Now you can use standard tools, such as Cognos, Business Objects,
Crystal Reports, and MicroStrategy, to do analysis.
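Once the combed terms sit in a relational table, any SQL-based tool can query them. The following sketch uses Python's built-in `sqlite3` module as a stand-in for the structured environment; the table name `combed_term` and its columns are hypothetical, and the rows are the entries listed in the slides (the elided entries are left out).

```python
import sqlite3

# (term, source type, date) rows taken from the combed listing in the slides.
ROWS = [
    ("sales", "email", "Feb 2"),
    ("sales", "email", "Mar 5"),
    ("sales", "phone", "Mar 8"),
    ("quarter", "email", "Jan 2"),
    ("quarter", "email", "Jan 4"),
    ("quarter", "email", "Feb 5"),
    ("sales cycle", "email", "Feb 24"),
    ("sales cycle", "phone conversation", "Mar 14"),
    ("sales cycle", "meeting notes", "Mar 18"),
    ("discount", "phone conversation", "Jan 6"),
    ("discount", "email", "Jan 12"),
    ("discount", "email", "Jan 14"),
]

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(
    "CREATE TABLE combed_term (term TEXT, source_type TEXT, comm_date TEXT)"
)
conn.executemany("INSERT INTO combed_term VALUES (?, ?, ?)", ROWS)

# A typical reporting query: how many communications mention each term?
counts = conn.execute(
    "SELECT term, COUNT(*) FROM combed_term GROUP BY term ORDER BY term"
).fetchall()

# Another: which terms came up in email specifically?
email_terms = conn.execute(
    "SELECT DISTINCT term FROM combed_term WHERE source_type = ? ORDER BY term",
    ("email",),
).fetchall()
```

A reporting tool would issue queries of this kind against the same table; the point is only that the text has become ordinary rows and columns.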
But there are other ways that communications can be used.
[Figure: communications linked to customer data by a probabilistic match]
Emails and telephone conversations can be linked
to CDI/CRM data.
A true 360 degree view
of the customer can be
formed.
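A probabilistic match links a communication to the customer record it most likely belongs to, even when names are written differently. The sketch below is one simple way to score such a match, using `difflib.SequenceMatcher` from the Python standard library; the customer records, the field names, the weights (0.6 and 0.4) and the threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical CDI/CRM records.
CUSTOMERS = [
    {"id": 101, "name": "Mary Jones", "email": "mjones@example.com"},
    {"id": 102, "name": "Mark Johnson", "email": "mark.johnson@example.com"},
    {"id": 103, "name": "Maria Gomez", "email": "mgomez@example.com"},
]


def similarity(a, b):
    """Return a 0..1 string similarity, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(comm, customer):
    """Combine fuzzy name similarity with an exact email comparison."""
    name_score = similarity(comm["sender_name"], customer["name"])
    same_email = comm["sender_email"].lower() == customer["email"].lower()
    return 0.6 * name_score + 0.4 * (1.0 if same_email else 0.0)


def best_match(comm, customers, threshold=0.5):
    """Return (customer, score), or (None, score) if no candidate is good enough."""
    scored = [(match_score(comm, c), c) for c in customers]
    score, customer = max(scored, key=lambda pair: pair[0])
    return (customer, score) if score >= threshold else (None, score)


complaint = {"sender_name": "Mrs Jones", "sender_email": "mjones@example.com"}
customer, score = best_match(complaint, CUSTOMERS)
```

The threshold matters: a low score is better reported as "no match" than attached to the wrong customer, because a false link corrupts the 360 degree view.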
“I placed an order last week and
when it arrived it was the wrong
size. And then your company
would not take it back. I’m mad.”
How easy is it going to be to engage
Mrs Jones until she has satisfaction
about her order?
A true 360 degree view of the customer, combining communications with demographics, delivers on the promise of CDI.
Can't I just use a search engine to link the two worlds?
Search engines do not integrate textual information.
Text doesn't need to be searched, it needs to be integrated.
For example, the abbreviation “ha” may mean “headache”, “heart attack”, or “Hepatitis A”.
Likewise, “oblique fractured ulna”, “oblique fractured tibia”, and “obliq fractured tarsi” all mean “broken bone”.
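These two examples can be sketched as lookups: one that picks a meaning for an ambiguous abbreviation from context, and one that maps specific phrases onto a common term. The use of the author's specialty as the context, the specialty names, and both function names are assumptions made for this illustration; the slides do not say how the ambiguity is resolved.

```python
# Assumed: the meaning of an abbreviation depends on who wrote it.
HOMOGRAPHS = {
    "ha": {
        "cardiology": "heart attack",
        "neurology": "headache",
        "hepatology": "Hepatitis A",
    }
}

# Specific phrases (including a misspelled one) mapped to a common term.
SPECIFIC_TO_GENERAL = {
    "oblique fractured ulna": "broken bone",
    "oblique fractured tibia": "broken bone",
    "obliq fractured tarsi": "broken bone",
}


def resolve_homograph(term, specialty):
    """Expand an ambiguous term using the author's specialty, if known."""
    meanings = HOMOGRAPHS.get(term.lower())
    if meanings is None:
        return term  # not a known homograph: leave unchanged
    return meanings.get(specialty, term)  # unknown context: leave unchanged


def generalize(phrase):
    """Replace a specific phrase with its common term, if one is defined."""
    return SPECIFIC_TO_GENERAL.get(phrase.lower(), phrase)
```

With these tables, `resolve_homograph("ha", "cardiology")` yields "heart attack", and a query for "broken bone" can find all three fracture notes after `generalize` has been applied, which a plain text search would miss.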
What is meant by editing and integrating text?
1 – stop word editing
2 – stemming
3 – synonym replacement
4 – synonym concatenation
5 – homograph resolution
6 – alternate spelling resolution
7 – external category classification
8 – theming
9 – probabilistic matching
10 – negation exclusion
11 – concept clustering
12 – mid process editing
13 – change sensitivity
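A few of the steps listed above (stop word editing, stemming, synonym replacement, and negation exclusion) can be sketched as a small pipeline. This is a simplified illustration, not the presenter's method: the word lists are tiny and hypothetical, the stemmer only strips a few suffixes, and negation is handled by dropping the single word that follows a negating word.

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "of", "to", "and", "was", "it", "with"}
SYNONYMS = {"purchase": "order", "angry": "mad"}  # hypothetical synonym table
NEGATIONS = {"no", "not", "never", "without"}
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def stem(word):
    """Crude stemming: strip one known suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def edit_text(text):
    """Apply negation exclusion, stop word editing, synonyms, then stemming."""
    result = []
    skip_next = False
    for token in tokenize(text):
        if skip_next:
            # Negation exclusion: drop the word right after a negating word.
            skip_next = False
            continue
        if token in NEGATIONS:
            skip_next = True
            continue
        if token in STOP_WORDS:
            continue
        token = SYNONYMS.get(token, token)
        result.append(stem(token))
    return result


edited = edit_text("The patient reports no diabetes and was diagnosed with broken bones")
```

Here `edited` keeps "patient", "report", "diagnos", "broken" and "bone" but not "diabetes", since the note says the patient does not have it. Because synonyms are looked up before stemming in this sketch, a plural such as "purchases" would not be replaced; a real pipeline would need to settle the order of the steps.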
DW 2.0
The architecture for the next generation of data warehousing
[Figure: DW 2.0 architecture diagram with four sectors. Interactive (very current): transaction data, applications. Integrated (current++), Near line (less than current), and Archival (older): each shows textual subjects, captured text, text ids, simple pointers, and text-to-subject linkage alongside detailed subject data, continuous snapshot data, profile data, and reference and master data (internal, external). Summary data and the Business and Technical labels appear in the Integrated and Near line sectors.]
For a detailed description of how the unstructured environment should be linked to the structured environment, go to www.inmoncif.com and look for DW 2.0 (TM), or go to www.inmondatasystems.com.
[Figure labels: Unstructured component, Structured component, Business, Technical, Summary]
Copyright 2006 Bill Inmon and Inmon Data Systems
DW 2.0 is a trademark of Bill Inmon and
Inmon Data Systems. All rights reserved.
“The architecture for the next generation of data warehousing”
is copyrighted by Bill Inmon and Inmon Data Systems. 2006
[Figure: unstructured data is linked to the structured environment (DB2) by a probabilistic match, and is then available for visualization and for query tools: Business Objects, Cognos, MicroStrategy, Crystal Reports]