
Data Warehousing and Knowledge Management
Data Warehousing: the New Knowledge Management Architecture for Humanities Research?
Janet Delve
University of Portsmouth, UK
UKAIS 2004
Slide 1
Introduction
Data Warehouses everywhere
• Amazon
• Wal*Mart
• Opodo
DWs are used a lot in industry and scientific
research, but not in humanities research.
Written paper covers linguistics and history. Talk
covers history in detail and gestures towards
linguistics.
Slide 2
Overview
Introduction
Data modelling and traditional databases
Source-oriented data modelling
Data Mining
Philosophy of data warehousing
Background of DWs
Basic components of a data warehouse (DW)
Advantages of DWs
Findings – Humanities and DWs
Humanities and DWs – some issues
Examples of possible Humanities DWs
Ideas for the future?
Slide 3
Data Modelling
Relational data modelling – material split into
many tables in order to gain enhanced
performance – no duplication, updating or
insertion anomalies etc.
Source-oriented data modelling – emphasis on
modelling data as closely as possible to original
source which is included in its entirety for
posterity.
DW data modelling nearer to source-oriented
approach in spirit.
Slide 4
Traditional databases
ERD p117 Harvey and Press
Slide 5
Traditional databases
Harvey and Press p.129
Slide 6
Historical Data
This can be difficult to model because:
• It is irregular in structure,
• It is complex
• It is erratic in terms of when it occurs
Using a relational database can mean data from a
single source being split into many tables.
Slide 7
Source-oriented data modelling
‘a semantic network tempered by hierarchical
considerations’ [Thaller 1991, 155].
Its flexible nature gives κλειω a ‘rubber band data
structures’ facility [Denley 1994, 37].
The fluid nature of creating a database with κλειω
marks it out as an ‘organic’ DBMS.
Slide 8
Data Mining
The whole field is often referred to as data mining,
which is also a major component within the
field.
Data mining (DM) is normally used on large
quantities (terabytes) of data, to find meaningful
patterns. Neural nets, statistical modelling,
decision trees are just some AI methods used.
SQL can be used too. Parallel data processing is
used with DM.
In order to mine data, it must be kept in a suitable
system - a data warehouse is ideal.
Slide 9
Philosophy of data warehousing
‘Data warehousing is an architecture, not a
technology. There is the architecture, and there
is the underlying technology, and they are two
very different things. Unquestionably there is a
relationship between data warehousing and
database technology, but they are most
certainly not the same. Data warehousing
requires the support of many different kinds of
technology.’
Inmon 2002
Slide 10
Background of DWs
Business-oriented – serve the analytical needs
of a company. The ordinary DBMS is still
needed for the day-to-day queries, and also to
feed the DW.
W.H. Inmon, father of DW. Cabinet effect – 1991
R. Kimball, expert on dimensional modelling
Need for single, integrated source of clean data,
particularly for multinational etc. companies
Supporting technology from e.g. Oracle, Prism
Solutions, IBM
Slide 11
Data Marts
Data marts contain DW data but are restricted to
one department or one business process.
The industry is divided about data marts.
Inmon recommends building the DW first, then
siphoning off the data to data marts.
Kimball believes you should build several data
marts first, then integrate them into a DW.
Slide 12
Basic components of a Data
Warehouse (DW)
A DW is subject-oriented, integrated, non-volatile &
time-variant.
The major subjects for an insurance company are
customer, policy, premium and claim.
Previously, data were modelled around applications: car, health, life and accident.
Integration is the most important facet of a DW.
Previous inconsistencies are ironed out and all
data unambiguously entered into DW. Many
sources of data can be placed in DW.
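Integration can be pictured with a small sketch. It assumes, purely for illustration, that two application databases encode the same attribute differently (the gender codes, record layout and function name below are hypothetical) and shows them being ironed out into one unambiguous encoding on the way into the DW:

```python
# Hypothetical encodings used by different source applications.
GENDER_MAP = {
    "m": "male", "f": "female",
    "1": "male", "0": "female",
    "male": "male", "female": "female",
}

def integrate(records, source):
    """Return records with a single gender encoding plus their provenance."""
    cleaned = []
    for rec in records:
        raw = str(rec["gender"]).strip().lower()
        cleaned.append({
            "customer": rec["customer"],
            "gender": GENDER_MAP[raw],  # raises KeyError on an unknown code
            "source": source,
        })
    return cleaned

# Two source applications that disagree about how gender is stored.
car_app = [{"customer": "A. Smith", "gender": "M"}]
life_app = [{"customer": "B. Jones", "gender": 0}]

warehouse = integrate(car_app, "car") + integrate(life_app, "life")
```

Failing loudly on an unknown code is one possible design choice; a real load would more likely route such rows to a review queue.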
Slide 13
Basic components of a Data
Warehouse (DW)
Non-volatile data in a DW means that it is not
changed in the way data is in operational
database – data is loaded en masse and isn’t
updated. Obviates need for normalisation.
Time-variant – the DW time horizon is 5–10 years,
against 2–3 months for an operational database. A DW
holds snapshots, an operational database holds current data.
A DW always has an element of time; an operational
database may or may not. Inmon 2002
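The non-volatile and time-variant properties can be sketched with Python's built-in sqlite3 module. The table name, columns and figures are invented for illustration: each load appends a dated snapshot and nothing is updated in place, so the history of a policy can be read back later.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE policy_snapshot ("
    " snapshot_date TEXT, policy_id INTEGER, premium REAL,"
    " PRIMARY KEY (snapshot_date, policy_id))"
)

def load_snapshot(snapshot_date, rows):
    """Append one dated snapshot en masse; earlier snapshots are never changed."""
    con.executemany(
        "INSERT INTO policy_snapshot VALUES (?, ?, ?)",
        [(snapshot_date, policy_id, premium) for policy_id, premium in rows],
    )
    con.commit()

# Two yearly snapshots of the same (hypothetical) policies.
load_snapshot("2003-12-31", [(1, 200.0), (2, 350.0)])
load_snapshot("2004-12-31", [(1, 220.0), (2, 350.0)])

# The time element is part of the key, so a policy's history is a simple query.
history = con.execute(
    "SELECT snapshot_date, premium FROM policy_snapshot"
    " WHERE policy_id = 1 ORDER BY snapshot_date"
).fetchall()
```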
Slide 14
Basic components of a Data Warehouse (DW)
Kimball p7
Slide 15
Typical Architecture of a Data Warehouse
Slide 16
Meta Data
Meta data is extremely important in a DW. It is
used:
• to log the extraction and loading of data into the
warehouse;
• in query management to locate the most
appropriate data source and also to help end
users to build queries;
• to show how the data has been mapped when
carrying out data cleansing and
transformations;
• to manage all the data in the DW – recording
where data came from, when etc.
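As a rough illustration of the logging and mapping roles listed above, the sketch below records where each batch came from, when it was loaded and which transformation was applied. The function names, the file name and the log layout are all hypothetical.

```python
from datetime import datetime, timezone

load_log = []  # the meta data: one entry per load

def load_with_metadata(source_name, rows, transform):
    """Apply a cleansing transformation and log what was done to which source."""
    loaded = [transform(row) for row in rows]
    load_log.append({
        "source": source_name,
        "loaded_at": datetime.now(timezone.utc).isoformat(),
        "rows_read": len(rows),
        "rows_loaded": len(loaded),
        "transformation": transform.__name__,
    })
    return loaded

def uppercase_surname(row):
    """Example cleansing step: standardise the case of surnames."""
    return {**row, "surname": row["surname"].upper()}

# 'hearth_tax_sample.csv' is a made-up source name.
rows = load_with_metadata(
    "hearth_tax_sample.csv", [{"surname": "delve"}], uppercase_surname
)
```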
Slide 17
Basic components of a Data
Warehouse (DW)
Fact Tables
‘A fact table is the primary table in a dimensional
model where the numerical performance
measurements of the business are stored…
The measurement data resulting from a business
process is stored in a single data mart.
Since measurement data is overwhelmingly the
largest part of any data mart, we avoid
duplicating it in multiple places around the
enterprise’
Kimball 2002
Slide 18
Basic components of a Data
Warehouse (DW)
Dimension tables
These contain the textual descriptors of the
business. Their depth and breadth define the
usefulness of the DW.
Contains data that doesn’t change frequently
Can have 50-100 attributes.
Not usually normalized. (Snowflake and starflake)
Coding disparaged (Long term view)
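A minimal sketch of how a fact table and its dimension tables work together, using Python's built-in sqlite3 module. The table names, columns and rows are invented for illustration: the fact table holds only keys and numerical measures, while the dimensions carry the textual descriptors used to group and label them.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY, year INTEGER, month_name TEXT);
CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity INTEGER,
    sales_amount REAL);
INSERT INTO dim_product VALUES (1, 'Atlas', 'Books'), (2, 'County map', 'Maps');
INSERT INTO dim_date VALUES (20040101, 2004, 'January'), (20040201, 2004, 'February');
INSERT INTO fact_sales VALUES
    (20040101, 1, 2, 40.0), (20040201, 1, 1, 20.0), (20040201, 2, 3, 15.0);
""")

# Measures come from the fact table; the descriptive label comes from a dimension.
totals = con.execute("""
    SELECT p.category, SUM(f.sales_amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```

The dimensions are kept as flat, unnormalised tables here, in line with the point that dimension tables are not usually normalised.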
Slide 19
Basic components of a Data Warehouse (DW)
Star schema Kimball p51
Slide 20
Basic components of a Data Warehouse (DW)
Kimball p43
Slide 21
Basic components of a Data Warehouse (DW)
Kimball p39
Slide 22
Data Warehousing Tools and
Technologies
Building a data warehouse is a complex task
because there is no vendor that provides an
‘end-to-end’ set of tools.
Necessitates that a data warehouse is built using
multiple products from different vendors.
Ensuring that these products work well together
and are fully integrated is a major challenge.
Slide 23
Advantages of DWs
• Flexibility in modelling data.
• Time dimension – country-specific calendars
and synchronization across multiple time
zones.
• Easy to add external data and summarised
data.
• Built for analysis.
• Built for huge volumes of data (terabytes of
data – a trillion, 10^12).
• Can cope with ‘idiosyncrasies of geographic
location dimensions’ within GISs.
Slide 24
Possible advantages of DWs
• Indexing facilities of DW.
• Publishing the ‘right data’ – data collected from
a variety of sources and edited for quality and
consistency.
• DW seeks to collate all data so a variety of
different subsets can be analysed whenever
required.
• Easy to extend DW and add material from a new
source.
• Data cleansing techniques.
• Tracking facility afforded by meta data
Slide 25
Disadvantages of DWs
• Some humanities data fits into the ‘numerical
fact’ topology, some doesn’t
• Technology not easy and is based on having
existing databases to extract from
• Regular snapshots are not quite the same thing, but
they could equate to data sets taken at different
periods of time (e.g. 1841 census, 1861 census)
• A lot to learn.
Slide 26
Findings – Humanities and
DWs
NAGARA
(National Association of Government Archives and
Records Administrators)
Article on DWs by Mary Klauda of the Minnesota
Historical Society 1999 (archivist)
Eastern Connecticut schools DW 2002
Bo Wandschneider – University of Guelph, Canada
– DW and the use of census data. ICPSR (Inter-university
Consortium for Political and Social Research)
Slide 27
Findings – Humanities and
DWs
University of California DW – memo to
Humanities department
Social Science DW – Human Resources DW
project of Human Sciences Research Council,
South Africa
GEOBASE, Israel. DW of Israel’s regional
statistics, supported by National Planning
Authority in the Ministry of Interior Affairs.
Slide 28
Humanities and DWs – some
issues
Scale – can cope with really large country-wide /
state-wide problems.
Can analyse e.g. British censuses 1841-1901
(10^8).
Can put several databases together to produce a
time run – e.g. hearth taxes, window taxes, poll
taxes, land taxes, poor rates all in one DW.
Oracle site licenses.
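The idea of putting several databases together to produce a time run can be sketched as follows. The three record lists stand in for separate source databases, and their years, household identifier and amounts are invented placeholders; the point is only that each row keeps its provenance while the combined set is ordered in time.

```python
# Placeholder rows standing in for three separate source databases.
hearth_tax = [{"year": 1665, "household": "H1", "amount": 2}]
window_tax = [{"year": 1750, "household": "H1", "amount": 6}]
land_tax = [{"year": 1798, "household": "H1", "amount": 4}]

def time_run(**sources):
    """Merge rows from named sources into one chronologically ordered run."""
    combined = []
    for name, rows in sources.items():
        # Tag every row with the source it came from before merging.
        combined.extend({**row, "source": name} for row in rows)
    return sorted(combined, key=lambda row: row["year"])

# Sources are passed in arbitrary order; the run comes back sorted by year.
run = time_run(land_tax=land_tax, hearth_tax=hearth_tax, window_tax=window_tax)
```

The amounts from different taxes are not directly comparable, so a real DW would also need to record the unit or basis of each assessment.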
Slide 29
Examples of possible History DWs
Slide 30
Examples of possible History DWs
PROPERTY INFORMATION
--------------------
Property Id
Property description
Property value
Etc.

MANOR
-----
Manor Id
Holding Id
Property Id
Original Owner Id
Date
Manor Value
Tax (Hides)
Cottar Population
Bordar Population
Villein Population
Sokeman Population
Priest Population
Number of Burgesses
Number of slaves
Etc.

HOLDING DETAILS
---------------
Holding Id
King
Tenant in Chief
Manor Lord
VILL
Etc.

ORIGINAL OWNER
--------------
Original Owner Id
Etc.
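The tables on this slide could be expressed as a star schema along the following lines, using Python's built-in sqlite3 module. This is a sketch under stated assumptions: treating MANOR as the central fact table is an interpretation of the diagram, the column types and the owner_name column are guesses (the slide only says "Etc."), and the inserted rows are placeholders rather than real entries.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE property_information (
    property_id INTEGER PRIMARY KEY,
    property_description TEXT,
    property_value REAL);
CREATE TABLE holding_details (
    holding_id INTEGER PRIMARY KEY,
    king TEXT, tenant_in_chief TEXT, manor_lord TEXT, vill TEXT);
CREATE TABLE original_owner (
    original_owner_id INTEGER PRIMARY KEY,
    owner_name TEXT);  -- assumed column, not on the slide
CREATE TABLE manor (
    manor_id INTEGER PRIMARY KEY,
    holding_id INTEGER REFERENCES holding_details(holding_id),
    property_id INTEGER REFERENCES property_information(property_id),
    original_owner_id INTEGER REFERENCES original_owner(original_owner_id),
    date TEXT,
    manor_value REAL,
    tax_hides REAL,
    cottar_population INTEGER,
    bordar_population INTEGER,
    villein_population INTEGER,
    sokeman_population INTEGER,
    priest_population INTEGER,
    number_of_burgesses INTEGER,
    number_of_slaves INTEGER);
-- Invented placeholder rows for illustration only.
INSERT INTO holding_details (holding_id, tenant_in_chief)
    VALUES (1, 'Tenant A'), (2, 'Tenant B');
INSERT INTO manor (manor_id, holding_id, tax_hides, villein_population)
    VALUES (1, 1, 5.0, 10), (2, 1, 2.5, 4), (3, 2, 3.0, 7);
""")

# Numerical facts from MANOR, grouped by a descriptor held in HOLDING DETAILS.
by_tenant = con.execute("""
    SELECT h.tenant_in_chief, SUM(m.villein_population), SUM(m.tax_hides)
    FROM manor AS m
    JOIN holding_details AS h ON h.holding_id = m.holding_id
    GROUP BY h.tenant_in_chief
    ORDER BY h.tenant_in_chief
""").fetchall()
```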
Slide 31
Examples of possible History
DWs
Data from a variety of sources over time– hearth
tax, poor rates, trade directories, census, street
directories, wills and inventories, GIS maps for
a city e.g. Winchester.
Voting data – poll book data and rate book data up
to 1870 for whole country (note some data
missing).
Port data – all data from portbooks for all British
ports together with yearly trade figures.
Street directories for whole country for last 100
years.
Taxation overview – different types / areas /
periods.
Slide 32
Examples of possible History
DWs
19th C British census data doesn’t fit into the
typical DW model as it doesn’t have the
numerical facts to go into a fact table.
However, there’s a recent development in DWs –
‘factless’ fact tables.
There is real scope to be able to model historical
data using these.
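A ‘factless’ fact table holds only keys into its dimensions and no numerical measure; analysis is done by counting rows. The sketch below, using Python's built-in sqlite3 module, shows one way census enumerations might be modelled like this. The table names, dimensions and rows are all hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_census (census_key INTEGER PRIMARY KEY, census_year INTEGER);
CREATE TABLE dim_place (place_key INTEGER PRIMARY KEY, parish TEXT);
CREATE TABLE dim_occupation (occupation_key INTEGER PRIMARY KEY, occupation TEXT);
-- Factless fact table: one row per person enumerated, keys only, no measures.
CREATE TABLE fact_enumeration (
    census_key INTEGER REFERENCES dim_census(census_key),
    place_key INTEGER REFERENCES dim_place(place_key),
    occupation_key INTEGER REFERENCES dim_occupation(occupation_key));
INSERT INTO dim_census VALUES (1, 1841), (2, 1861);
INSERT INTO dim_place VALUES (1, 'Parish A');
INSERT INTO dim_occupation VALUES (1, 'labourer'), (2, 'mariner');
INSERT INTO fact_enumeration VALUES
    (1, 1, 1), (1, 1, 1), (1, 1, 2), (2, 1, 2), (2, 1, 2);
""")

# With no stored measure, the analysis counts the events themselves.
counts = con.execute("""
    SELECT c.census_year, o.occupation, COUNT(*)
    FROM fact_enumeration AS f
    JOIN dim_census AS c ON c.census_key = f.census_key
    JOIN dim_occupation AS o ON o.occupation_key = f.occupation_key
    GROUP BY c.census_year, o.occupation
    ORDER BY c.census_year, o.occupation
""").fetchall()
```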
Slide 33
Examples of possible History DWs
Kimball p247
Slide 34
Examples of possible Humanities DWs
Language DW – could contain databases of
different languages for comparison, or many
databases of same languages over larger area.
DW of worldwide scholarly community / whole
culture
GIS or archaeological DW by continent etc. rather
than country.
DW of biographies.
DW of library catalogues or archives for enhanced
public access.
Slide 35
Ideas for the future?
Instead of ‘me and my database’ - emphasis on
smallish, individual, national projects,
Maybe
‘Our integrated warehouse’ – emphasis on large
scale, collaborative, international projects?
Slide 36