ITEC 423 Data Warehousing and Data Mining Lecture 3 Architecture Architecture is the art and science of designing buildings and other structures; Architecture is as a system design decision that is usually not easily changed. There are many different architectural choices available with different solutions for Data transfer Data Staging Area Data storage Information Delivery A General Data Warehouse Architecture A General Data Warehouse Architecture with Staging Area A General Data Warehouse Architecture with Staging Area and Data Marts Architectural Types Architectural Types :Centralized Data Warehouse Takes into account the enterprise-level information requirements Atomic level data at the lowest level of granularity is stored Some summarized data may be included Queries and applications access the central data warehouse. No separate data marts Architectural Types- Independent Data Marts Evolves in companies where the organizational units develop their own data marts for their own specific purposes Each data mart serves a particular organizational unit More than one version of the truth may be found Data marts are independent of one another Different data marts may have inconsistent data definitions and standards Such variances hinder analysis of data across data marts. Architectural Types-Federated An existing legacy of an assortment of DSS in the form of operational systems, extracted datasets, primitive data marts, … May not be possible to discard investment and start from scratch Practical solution is a federated architectural type data may be physically or logically integrated through shared key fields, overall global metadata , distributed queries, and such other methods No one overall data warehouse Architectural Types- Data-Mart Bus Conformed supermarts approach Analyzing requirements for a specific business subject such as orders, shipments, billings, insurance claims, car rentals ... Build the first data mart (supermart) using business dimensions and metrics These business dimensions will be shared in the future data marts. Conform dimensions among the various data marts Result would be logically integrated supermarts that will provide an enterprise view of the data Data marts contain atomic data organized as a dimensional data model Results from adopting an enhanced bottom-up approach to data warehouse development Architectural Types- Hub and Spoke Similar to the centralized data warehouse architecture: enterprise-wide data warehouse Atomic data is stored in the centralized data warehouse The centralized data warehouse feeds data to the dependent data marts on the spokes Dependent data marts may be developed for departmental analytical needs, specialized queries, data mining ... Dependent data mart may have normalized, denormalized, summarized, or dimensional data structures based on individual requirements Most queries are directed to the dependent data marts Centralized data warehouse may also be used for querying Results from adopting a top-down approach to data warehouse development. Building Blocks of Data Warehouses Production Internal Source Data Archived External Production Data Problems • variations in the data formats. • data residing on different hardware platforms • data supported by different DBMS/OS Choose appropriate data from different operational systems based on DW requirements comes from the various operational systems of the enterprise • financial systems • manufacturing systems • the supply chain Internal Data Determine strategies for collecting data • Start with most significant data • Limit to the most important portions Increases complexity of transformation and integration process Held by individuals and departments in private files • • • • Spreadsheets Documents customer profiles departmental databases Archived Data Backup data old data of the operational databases are stored in archived files. Decisions related to archiving how often Different methods of archiving Recent data archived to a separate archival database that may still be online. which portions Older data archived to flat files on disk storage. Oldest data archived to tape cartridges or microfilm may be kept off-site. Data warehouse keeps historical snapshots of data. need historical data for analysis over time. Look into your archived data sets. Depending on your data warehouse requirements, you have to include sufficient historical data. External Data External data is used especially by decision makers statistics relating to their industry produced by external agencies and national statistical offices. market share data of competitors. standard values of financial indicators for their business to check on their performance. Production data and archived data give you a picture based on what you are doing or have done in the past. Is not enough for understanding industry trends and compare performance Extraction Transformation Loading Data Staging Component Data Extraction Deals with numerous data sources • Source data may be from different source machines in diverse data formats. Tools are available on the market for data extraction. • Outside tools suitable for certain data sources • Develop in-house programs to do the data extraction. After extraction where to keep the data for further preparation? Perform the extraction function in the legacy system Extract the source into a separate physical environment from which moving the data into the data warehouse would be easier. • extract the source data into a • group of flat files • a data -staging relational data base • a combination of both. Data Transformation Data for a data warehouse comes from many disparate sources • Clean the data from each source: misspellings, resolution, missing data, duplicates • Standardize data elements: data types, lengths, synonyms/homonyms • Combine related information • Purge useless data • Choose appropriate keys • Summarize if necessary Data feed is not just an initial load. • Same (maybe slightly adapted) transformation process will be applied periodically. Data Loading The initial load moves large volumes of data • very time consuming. Periodically Extract/Transform/Load • Yearly • Quarterly • Monthly • Daily Data Storage Component A separate repository Large volumes of historical data for analysis not for quick retrieval of individual pieces of information multidimensional databases store data aggregated at different levels Information Delivery Component complex queries, multidimensional analysis, and statistical analysis Ad hoc reports casual user need prepackaged information novice user once in a while need prefabricated reports and preset queries business analyst need ability to do complex analysis using the information in the data warehouse power user need to be able to navigate throughout the data warehouse, pick up interesting data, format his or her own queries, drill through the data layers, and create custom reports and ad hoc queries. Information Delivery Component Information Delivery Component Information fed into executive information systems (EIS) is meant for senior executives and high-level managers. Some data warehouses also provide data to data mining applications. • knowledge discovery systems where the mining algorithms help to discover trends and patterns from the data In your data warehouse , you may include several information delivery mechanisms. • online queries and reports • scheduled reports through e-mail or intranet • information delivery over the Internet Metadata Component Similar to the data dictionary or the data catalog in a DBMS Data about the data in the data warehouse. key architectural component of the data warehouse. Types of Metadata: • Operational metadata • Extraction and transformation metadata • End-user metadata Importance of Metadata • connects all parts of the data warehouse . • provides information about the contents and structures to the developers. • makes the contents recognizable to the end users. Management and Control Component sits on top of all the other components. coordinates the services and activities within the data warehouse. controls the data transformation and the data transfer into the data warehouse storage. moderates the information delivery to the users. works with the database management systems and enables data to be properly stored in the repositories. monitors the movement of data into the staging area and from there into the data warehouse storage itself. interacts with the metadata component to perform the management and control functions.