Workshop on Database Preservation
Data warehouses in the path from databases to archives
Gabriel David
gtd@fe.up.pt
PresDB 2007, Gabriel David
Context



Organizations are increasingly relying on databases as the
main component of their record keeping systems.
More information means more risk of losing it in data
repositories that have become unreadable.
When the current technology becomes obsolete
• Hardware
• Operating systems
• Database management systems
• Applications
The paperless office increases the risk of losing significant
chunks of organizational memory and thus harming the
cultural heritage.
Previous research

Preserve the technology
• Preserve specimens of the machines, system software and
applications, in all their main versions, so that the backups of
every significant system could be used whenever needed

Simulation
• Simulating the older hardware on newer machines

Migration
• Up-to-date DBMS
- Technical
- New DBMS, deep reengineering
• Open neutral format
- Conversion of database contents into an XML dialect.
Preserving digital information


General problem: preserving the wealth of
information that is being generated in digital form
or converted to it
Several projects
• Fundamental models for integrating preservation into the
management of current records
- From the beginning of IS development
- Added to a previously existing IS
• Solutions to concrete problems that arise in specialized
domains
Operational system: football data
[Entity-relationship diagram of the operational database. One fact = one record.]
• Player: NrPass (PK), ShortName, FullName, Birthdate, Nacionality
• Team: Acro (PK), Name, Address, City
• Contract: Player (PK, FK1), Team (PK, FK2), Date (PK), AgreedDuration, BreakDate, Amount
• Enroled: NrPass (PK, FK1), Acro (PK, FK2), Year (PK, FK3), Championship (PK, FK3)
• Championship: Champ (PK), Year (PK), Champion
• Week: Champ (PK, FK1), Year (PK, FK1), Week (PK)
• Game: Champ (PK, FK1), Year (PK, FK1), Week (PK, FK1), Game (PK), Visited (FK2), Visitor (FK3), Date
• Event: Event# (PK), Minute, Type, Position, NrPass (FK2), Acro (FK2), Week (FK1), Game (FK1), Year (FK1, FK2), Champ (FK1, FK2)
Star Event
[Star schema diagram: largely denormalized, with rich dimensions (authority files) around basic facts.]
• Event (fact table): Game_id, Team_id, Player_id, Opponent_id, Day_id, Minute_Id (foreign keys to the dimensions), Home, Type, Quant(=1)
• Team (dimension): Team_id (PK), Acro, Name, Address, City
• Player (dimension): Player_id (PK), NrPass, ShortName, FullName, Birthday, Nacionality
• Game (dimension): Game_id (PK), Game#, Week_id, Week, Year_id, Year, Champ_id, Campeonato
• Day (dimension): Day_id (PK), Day, Month, Year, MonthName, Date, WeekDay, Week
• Minute (dimension): Minute_id (PK), Minute, Quarter, Part
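As an illustration of a rich, denormalized dimension, one row of the Day dimension could be derived from a calendar date roughly as follows. This is a minimal sketch in Python, not the author's implementation: the slide does not define Week or the format of Day_id, so the ISO week number and a yyyymmdd integer key are assumptions made here.

```python
from datetime import date


def day_dimension_row(d: date) -> dict:
    """Build one denormalized Day dimension row from a calendar date."""
    return {
        # Assumption: surrogate key encoded as yyyymmdd (not stated in the slides).
        "Day_id": d.year * 10000 + d.month * 100 + d.day,
        "Day": d.day,
        "Month": d.month,
        "Year": d.year,
        "MonthName": d.strftime("%B"),   # month name in the current locale
        "Date": d.isoformat(),
        "WeekDay": d.strftime("%A"),     # weekday name in the current locale
        # Assumption: Week is the ISO 8601 week number.
        "Week": d.isocalendar()[1],
    }


row = day_dimension_row(date(2007, 3, 23))
```

Every attribute is stored explicitly, so a future reader of the archive does not need calendar logic to interpret a date.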
Star GameResult
[Star schema diagram.]
• GameResult (fact table): Game_id, Team_id, Opponent_id, Day_id (foreign keys to the dimensions), Home, GoalsScored, GoalsSuffered, Points, TotalPoints, Classification
• Team (dimension): Team_id (PK), Acro, Name, Address, City
• Game (dimension): Game_id (PK), Game#, Week_id, Week, Year_id, Year, Champ_id, Campeonato
• Day (dimension): Day_id (PK), Day, Month, Year, MonthName, Date, WeekDay, Week
Notes
• Goals: sum of events of type Goal, including self-goals from the opponent team
• Points: 3 points for a win, 1 point for a tie and 0 for a defeat
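The two rules attached to this star (goals as a sum of Goal events, and 3/1/0 points) can be sketched as a derivation of GameResult rows from the events of one game. This is a hedged Python sketch: the event type labels "Goal" and "SelfGoal", the team acronyms and the function names are hypothetical, and TotalPoints and Classification are left out because they need the whole championship history.

```python
from collections import Counter


def points(scored: int, suffered: int) -> int:
    """3 points for a win, 1 for a tie, 0 for a defeat."""
    if scored > suffered:
        return 3
    return 1 if scored == suffered else 0


def game_results(game_id, home, visitor, events):
    """Derive the two GameResult rows (one per team) of a single game.

    events: iterable of (team_id, event_type) pairs for that game.
    """
    opponent = {home: visitor, visitor: home}
    goals = Counter()
    for team, kind in events:
        if kind == "Goal":            # hypothetical type label
            goals[team] += 1
        elif kind == "SelfGoal":      # hypothetical label; counts for the opponent
            goals[opponent[team]] += 1
    rows = []
    for team in (home, visitor):
        scored, suffered = goals[team], goals[opponent[team]]
        rows.append({
            "Game_id": game_id,
            "Team_id": team,
            "Opponent_id": opponent[team],
            "Home": team == home,
            "GoalsScored": scored,
            "GoalsSuffered": suffered,
            "Points": points(scored, suffered),
        })
    return rows


# Hypothetical game: AAA scores once, BBB scores once and concedes a self-goal.
rows = game_results(1, "AAA", "BBB",
                    [("AAA", "Goal"), ("BBB", "SelfGoal"), ("BBB", "Goal")])
```

Storing the computed GoalsScored, GoalsSuffered and Points in the fact table means the scoring rules no longer live only in application code.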
Parallel attitude
Data warehouse designer
• Approach a database-centred operational information system (IS) to specify a data warehouse (DW)
• Integrated model of the organization
• Merge information from a diversity of sources, systems and technologies
• Process-centric methodology
• Specify data marts
• Evaluation attitude
- leave out irrelevant details in the data
• Long-term validity and integrity requirements
• Stable monotonic archive
• Expose information contents in a simple and systematic way.

Archivist
• Analyse a document-centred organizational IS to specify an archiving policy and system
• Integrated model of the organization
• Merge information from a diversity of sources, systems and technologies
• Process-centric methodology
• Classify related series of documents
• Evaluation attitude
- leave out irrelevant details in the documents
• Long-term validity and integrity requirements
• Stable monotonic archive
• Expose information contents in a simple and systematic way.
Differences
Archivist
• Preserve the memory of the organization and its processes, for future generations

Data warehouse designer
• Answer the information needs of the management
- Decision support
- Monitoring
- Trend analysis and forecast

Goals are different
• Concrete decisions on evaluation and elimination procedures may differ
• Different details on metadata
Research proposal


Explore the adequacy of the DW approach as a
vehicle to perform, with respect to a given IS,
the functions considered essential from an archivist's
viewpoint, such as appraisal, classification, elimination,
description, and access, while respecting properties
such as authenticity and integrity.
That is, how to preserve the information in a database
• Not taking the DB as a single digital object
• The DB is a complex representation of the facts produced by a set of
processes
Three relevant research areas

Description of documents to render them available
for retrieval, across the frontiers of domain
modelling, document nature and storage technology
• grown along with the Web

Concerns of archivists with the fragility and opacity
of digital materials
• broader research agenda fuelled by the needs of organizations
and increased awareness of the need for new approaches

Data Warehouse construction techniques
• transform complex structures in operational DBs into simple stars
One solution for DB preservation


Serialize the database into a single archiving
model
Store the data dictionary
• Table names
• Column names
• Integrity constraints

Store the actual values of each column in each
table row.
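The serialization approach above can be sketched with Python's standard `sqlite3` and `xml.etree.ElementTree` modules. This is an illustrative sketch only: SQLite stands in for the operational DBMS, and the element names (`database`, `table`, `ddl`, `row`, `value`) are an ad hoc dialect invented here, not an established archival format.

```python
import sqlite3
import xml.etree.ElementTree as ET


def serialize_database(conn: sqlite3.Connection) -> ET.Element:
    """Dump the data dictionary and all row values into one XML tree."""
    root = ET.Element("database")
    tables = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for name, ddl in tables:
        table = ET.SubElement(root, "table", name=name)
        # Data dictionary: the DDL carries column names and integrity constraints.
        ET.SubElement(table, "ddl").text = ddl
        cursor = conn.execute(f'SELECT * FROM "{name}"')
        columns = [col[0] for col in cursor.description]
        # Actual values of each column in each table row.
        for record in cursor:
            row = ET.SubElement(table, "row")
            for col, value in zip(columns, record):
                cell = ET.SubElement(row, "value", column=col)
                if value is None:
                    cell.set("null", "true")   # keep NULL distinct from ""
                else:
                    cell.text = str(value)
    return root


# Hypothetical one-table database, loosely modelled on the Team table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Team (Acro TEXT PRIMARY KEY, Name TEXT NOT NULL, City TEXT)")
conn.execute("INSERT INTO Team VALUES ('AAA', 'Example Club', NULL)")
archive = serialize_database(conn)
xml_text = ET.tostring(archive, encoding="unicode")
```

The result captures table names, column names, constraints and values, but nothing of the rules that the applications apply to those values.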
Why it is not enough




Data is just part of the problem in a database
system
Most real information systems are structured in
three layers: data + business rules + presentation
The presentation layer may contain little
knowledge
The data and the business rules layers keep their
own part of the semantics of the data
• In certain cases, the values are meaningless without the code
that discloses their interpretation
Database layers
[Diagram: the operational DB drawn as three layers (presentation, business rules, data), with F1 and F2 in the business rules layer; the data warehouse, labelled F1, F2; and the archive.]
Idea

Perform a preliminary step of eliciting implicit
knowledge in the application code and storing it
as explicit columns in the new data model.
• This operation is a typical step in a DW design process.

Investigate
• Transformation rules from typical structures in operational
systems into simple star DW structures
• XML version for exchange and archive
• Metadata needs: technical, semantic, authenticity
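The elicitation step can be illustrated with the Minute dimension of the football example, whose Quarter and Part columns make explicit what an application would otherwise compute from the raw minute. The sketch below is in Python and rests on assumptions: the slides do not define Quarter or Part, so 15-minute blocks and first/second half are guesses, and extra time is ignored.

```python
def minute_dimension_row(minute: int) -> dict:
    """Turn an implicit application rule into explicit dimension columns."""
    if not 1 <= minute <= 90:
        raise ValueError("sketch covers regular time only (minutes 1-90)")
    return {
        "Minute_id": minute,
        "Minute": minute,
        # Assumption: Quarter numbers the 15-minute blocks of the game, 1 to 6.
        "Quarter": (minute - 1) // 15 + 1,
        # Assumption: Part is 1 for the first half and 2 for the second.
        "Part": 1 if minute <= 45 else 2,
    }


# The whole dimension is small enough to generate once and archive as data.
minute_dimension = [minute_dimension_row(m) for m in range(1, 91)]
```

Once the rows are stored, the rule is preserved as data and survives even if the code that embodied it is lost.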