Workshop on Database Preservation
Data warehouses in the path from databases to archives
Gabriel David
gtd@fe.up.pt
PresDB 2007, Gabriel David

Context
Organizations increasingly rely on databases as the main component of their record-keeping systems.
More information means more risk of losing it in data repositories that have become unreadable.
This happens when the current technology becomes obsolete:
• Hardware
• Operating systems
• Database management systems
• Applications
The paperless office increases the risk of losing significant chunks of organizational memory and thus harming the cultural heritage.

Previous research
Preserve the technology
• Preserve specimens of the machines, system software and applications, in all their main versions, so that the backups of every significant system can be used whenever needed
Simulation
• Simulating the older hardware on newer machines
Technical migration
• Up-to-date DBMS
  - New DBMS, deep reengineering
• Open neutral format
  - Conversion of database contents into an XML dialect
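The "open neutral format" option can be illustrated with a minimal sketch, assuming a SQLite database and a made-up XML dialect. The element names (table, row, column), the function table_to_xml and the sample row are hypothetical choices for illustration, not a format proposed in these slides.

```python
import sqlite3
import xml.etree.ElementTree as ET

def table_to_xml(conn, table):
    """Serialize one table (column names plus row values) as an XML element."""
    cur = conn.execute(f'SELECT * FROM "{table}"')
    columns = [d[0] for d in cur.description]  # column names from the cursor
    root = ET.Element("table", name=table)
    for row in cur:
        row_el = ET.SubElement(root, "row")
        for col, value in zip(columns, row):
            cell = ET.SubElement(row_el, "column", name=col)
            if value is None:
                cell.set("null", "true")  # keep NULL distinct from empty text
            else:
                cell.text = str(value)
    return root

# Made-up sample data, only to exercise the function.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Team (Acro TEXT PRIMARY KEY, Name TEXT, City TEXT)")
conn.execute("INSERT INTO Team VALUES ('AAA', 'Team A', 'City A')")

xml_text = ET.tostring(table_to_xml(conn, "Team"), encoding="unicode")
print(xml_text)
```

A dump of this kind keeps the values and the column names only; data types and integrity constraints would have to be exported separately.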
Preserving digital information
General problem: preserving the wealth of information that is being generated in digital form or converted to it
Several projects
• Fundamental models for integrating preservation into the management of current records
  - From the beginning of IS development
  - Added to a previously existing IS
• Solutions to concrete problems that arise in specialized domains

Operational system: football data
[Figure: schema of the operational database. One fact = one record.]
• Player (PK NrPass; ShortName, FullName, Birthdate, Nacionality)
• Team (PK Acro; Name, Address, City)
• Contract (PK,FK1 Player; PK,FK2 Team; PK Date; AgreedDuration, BreakDate, Amount)
• Enroled (PK,FK1 NrPass; PK,FK2 Acro; PK,FK3 Year; PK,FK3 Championship)
• Championship (PK Champ; PK Year; Champion)
• Week (PK,FK1 Champ; PK,FK1 Year; PK Week)
• Game (PK,FK1 Champ; PK,FK1 Year; PK,FK1 Week; PK Game; FK2 Visited; FK3 Visitor; Date)
• Event (PK Event#; Minute, Type, Position; foreign keys NrPass, Acro, Week, Game, Year, Champ)

Star Event
[Figure: star schema around the Event fact table.]
Largely denormalized
Rich dimensions (authority files)
• Team (PK Team_id; Acro, Name, Address, City)
• Player (PK Player_id; NrPass, ShortName, FullName, Birthday, Nacionality)
• Game (PK Game_id; Game#, Week_id, Week, Year_id, Year, Champ_id, Campeonato)
• Day (PK Day_id; Day, Month, Year, MonthName, Date, WeekDay, Week)
• Minute (PK Minute_id; Minute, Quarter, Part)
Basic facts
• Event (foreign keys Game_id, Team_id, Player_id, Opponent_id, Day_id, Minute_Id; Home, Type, Quant(=1))

Star GameResult
[Figure: star schema around the GameResult fact table.]
• Team (PK Team_id; Acro, Name, Address, City)
• Game (PK Game_id; Game#, Week_id, Week, Year_id, Year, Champ_id, Campeonato)
• Day (PK Day_id; Day, Month, Year, MonthName, Date, WeekDay, Week)
• GameResult (foreign keys Game_id, Team_id, Opponent_id, Day_id; Home, GoalsScored, GoalsSuffered, Points, TotalPoints, Classification)
Annotations
• Sum of events of type Goal, including self-goals from the opponent team
• 3 points for a win, 1 point for a tie and 0 for a defeat

Parallel attitude
Data warehouse designer
• Approach a database-centred operational information system (IS) to specify a data warehouse (DW)
• Integrated model of the organization
• Merge information from a diversity of sources, systems and technologies
• Process-centric methodology
• Specify data marts
• Long-term validity and integrity requirements
• Evaluation attitude
  - Leave out irrelevant details in the data
• Stable monotonic archive
• Expose information contents in a simple and systematic way
Archivist
• Analyse a document-centred organizational IS to specify an archiving policy and system
• Integrated model of the organization
• Merge information from a diversity of sources, systems and technologies
• Process-centric methodology
• Classify related series of documents
• Long-term validity and integrity requirements
• Evaluation attitude
  - Leave out irrelevant details in the documents
• Stable monotonic archive
• Expose information contents in a simple and systematic way

Differences
Data warehouse designer
• Answer the information needs of the management
  - Decision support
  - Monitoring
  - Trend analysis and forecast
Archivist
• Preserve the memory of the organization and its processes for future generations
Goals are different
Concrete decisions on evaluation and elimination procedures may differ
Different details on metadata

Research proposal
Explore the adequacy of the DW approach as a target vehicle to perform, with respect to a given IS, the functions considered essential from an archivist's viewpoint, such as appraisal, classification, elimination, description and access, while respecting properties like authenticity and integrity, i.e.
How to preserve the information in a database
• Not taking the DB as a single digital object
• The DB is a complex representation of the facts produced by a set of processes

Three relevant research areas
Description of documents to render them available for retrieval, across the frontiers of domain modelling, document nature and storage technology
• Has grown along with the Web
Concerns of archivists with the fragility and opacity of digital materials
• Broader research agenda fuelled by the needs of organizations and increased awareness of the need for new approaches
Data warehouse construction techniques
• Transform complex structures in the operational DB into simple stars

One solution for DB preservation
Serialize the database into a single archiving model
Store the data dictionary
• Table names
• Column names
• Integrity constraints
Store the actual values of each column in each table row.

Why it is not enough
Data is just part of the problem in a database system
Most real information systems are structured in three layers: data + business rules + presentation
The presentation layer may not contain much knowledge
The data and the business rules layers each keep their own part of the semantics of the data
• In certain cases, the values are meaningless without the code that discloses their interpretation

Database layers
[Figure: layers of the operational DB (presentation, business rules, data), the data warehouse and the archive; the labels F1 and F2 appear in the operational DB and again, as "F1, F2", in the data warehouse.]

Idea
Perform a preliminary step of eliciting the implicit knowledge in the application code and storing it as explicit columns in the new data model.
• This operation is a typical step in a DW design process.
Investigate
• Transformation rules from typical structures in operational systems into simple star DW structures
• XML version for exchange and archive
• Metadata needs: technical, semantic, authenticity
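One such transformation rule can be sketched with the football example: deriving GameResult facts from Event facts, so that the scoring rule (3 points for a win, 1 for a tie, 0 for a defeat) is stored as an explicit Points column instead of staying hidden in application code. This is only a sketch under stated assumptions: the functions points_for and game_results, the in-memory representation and the sample data are hypothetical, and each Goal event is assumed to carry the Team_id of the team credited with the goal, so self-goals are already attributed to the opponent.

```python
from collections import defaultdict

def points_for(scored, suffered):
    """Business rule made explicit: 3 points for a win, 1 for a tie, 0 for a defeat."""
    if scored > suffered:
        return 3
    if scored == suffered:
        return 1
    return 0

def game_results(games, events):
    """Build one GameResult row per team and game from the Event facts.

    games  : iterable of (game_id, home_team, visitor_team) tuples
    events : iterable of dicts with keys Game_id, Team_id, Type, Quant
    """
    goals = defaultdict(int)
    for event in events:
        if event["Type"] == "Goal":  # other event types do not affect the result
            goals[(event["Game_id"], event["Team_id"])] += event["Quant"]

    rows = []
    for game_id, home_team, visitor_team in games:
        for team, opponent, home in ((home_team, visitor_team, True),
                                     (visitor_team, home_team, False)):
            scored = goals[(game_id, team)]
            suffered = goals[(game_id, opponent)]
            rows.append({
                "Game_id": game_id, "Team_id": team, "Opponent_id": opponent,
                "Home": home, "GoalsScored": scored, "GoalsSuffered": suffered,
                "Points": points_for(scored, suffered),
            })
    return rows

# Made-up sample data: team A beats team B by 2 goals to 1.
games = [(1, "A", "B")]
events = [
    {"Game_id": 1, "Team_id": "A", "Type": "Goal", "Quant": 1},
    {"Game_id": 1, "Team_id": "B", "Type": "Goal", "Quant": 1},
    {"Game_id": 1, "Team_id": "A", "Type": "Goal", "Quant": 1},
    {"Game_id": 1, "Team_id": "B", "Type": "Card", "Quant": 1},
]
results = game_results(games, events)
for row in results:
    print(row)
```

Once Points is a stored column, a future reader of the archive can interpret the result of a game without access to the original application.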