Infrastructure for Data Warehouses

Basics Of Data Access
[Figure: Bus Structure, showing Data Store, Buffer, Memory Cache, and Machine Memory]

Basics Of Data Access: Storage
• Data on a single disk all shares one controller.
• Striping data randomly across several disks reduces contention for controller time.
• Databases requiring 100% uptime use striping or mirroring to facilitate backup and maintenance.
• With mirroring, backups can be written from one copy while processing proceeds with the other.
• Striping combined with redundancy (mirroring or parity) in a RAID environment permits replacement of failed hardware without bringing down the database.

Basics Of Data Access: Retrieval
• The speed of processing a given retrieval is primarily governed by the number of disk accesses required to execute it.
• Data is transferred to and from the disk in buffer-sized units.
• On large systems the size of these buffers (blocks) can be set by the code; on PCs the buffer sizes (sectors) are fixed.
• A block may contain several records. If all of the records in a block can be processed before another retrieval is needed, processing is faster.

Basics Of Data Access: Busses
• A bus transfers data from device to device.
• In single systems the bus is internal. In distributed systems the network acts as the bus.
• Busses transfer data in units of a word. Normally a word is smaller than a buffer unit, so transferring one buffer takes several bus cycles. (On networks, packets play the same role as words on a backplane bus.)
• A bus can service only one attached unit at a time, so multiple units on the same bus can generate bus contention.

Basics Of Data Access: Cache
• A cache is a high-speed storage location that holds the most recently used data being transferred between units in a system.
• Cache speeds up processing by taking advantage of the data reuse (looping) typical of most programs, which reduces the number of physical DASD (disk) accesses required.
• Memory cache (as opposed to CPU cache) is a region of main memory, and its size can be set by the system administrator.
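The effect of data reuse on a cache can be illustrated with a small simulation. This is only a sketch: it assumes a least-recently-used (LRU) eviction policy, which the slides do not specify, and the function name hit_rate and the workload sizes are made up for the example. It compares a looping workload, which rereads the same blocks, against a one-pass sequential scan.

```python
from collections import OrderedDict


def hit_rate(accesses, cache_blocks):
    """Fraction of block accesses served from an LRU cache of the given size.

    LRU eviction is an assumption made for this sketch.
    """
    cache = OrderedDict()
    hits = 0
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = True            # a miss costs one physical disk access
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return hits / len(accesses)


# Looping workload: the same 8 blocks are reread 100 times.
looping = list(range(8)) * 100
# Sequential scan: 800 distinct blocks, each read exactly once.
scan = list(range(800))

print(hit_rate(looping, cache_blocks=16))  # 0.99 (only the first 8 reads miss)
print(hit_rate(scan, cache_blocks=16))     # 0.0  (nothing is ever reused)
```

Under these assumptions the looping workload is served almost entirely from cache, while the sequential scan gains nothing from it no matter how large the cache is.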

Program Characteristics
Transaction Systems
• Access few records at a time.
• Require records from random locations.
• Update and modify data frequently.
Data Warehouse Systems
• Access many records at a time.
• Require records in order.
• Update and modify data infrequently.

System Tuning
Transaction Systems
• Small buffers
• Large cache
• Fast busses
Data Warehouse Systems
• Large buffers
• Small cache
• Wide busses

Acxiom Overview
Acxiom creates and delivers Customer and Information Management Solutions that enable many of the largest, most respected companies in the world to build great relationships with their customers. Acxiom achieves this by blending data, technology, and services to provide the most advanced customer information infrastructure available in the marketplace today.

Data Warehouses
The characteristics of an Acxiom data warehouse generally are...
• Large multi-terabyte databases
• Large periodic sequential data loads
• Denormalized database schema
• Sequential reads/full table scans
• Little or no indices
• Little or no transaction logging
• Robust periodic backup solutions
• Performance measured using megabytes/gigabytes per second (MBPS, GBPS)

Data Warehouses
• Processing platform: generally a large global-class server or cluster of servers running UNIX.
• Database: a large, vertical, denormalized database with few tables that are very long, hold sorted data, and sometimes contain several billion rows. The data is striped across the storage in a manner that prevents physical hot spots and takes advantage of the wide bandwidth.
• Storage subsystem: very fast, with wide bandwidth and high levels of redundancy, which makes it possible to move large amounts of sequential data in a very short time.

Transactional Databases
The characteristics of an Acxiom transactional database generally are...
• Small, usually no larger than a few terabytes
• Random and simultaneous inserts, updates, deletes, and queries
• Random reads and writes
• Normalized database schema
• Transaction logging and archiving with incremental and periodic backup solutions
• Generally sub-second response required per transaction, taking concurrency into account
• Performance measured using transactions per second (TPS) and I/O latency

Transactional Databases
• Processing platform: generally a medium/enterprise-class server.
• Database: a normalized database that utilizes lookup tables. The data is stored randomly within a table but striped across the storage to prevent physical hot spots.
• Storage subsystem: very fast, with low latency, nominal bandwidth, and high levels of redundancy, which makes it possible to move small amounts of selected data quickly.

Hybrid Databases
The characteristics of an Acxiom hybrid database generally are...
• Medium sized, usually three to ten terabytes
• Random and simultaneous inserts, updates, deletes, and queries
• Random and sequential reads and writes
• Loosely normalized database schema
• Indices used sparingly
• Usually a batch maintenance process
• Transaction logging and archiving with incremental and periodic backup solutions
• Generally sub-second response required per transaction, taking concurrency into account
• Performance measured using TPS, I/O latency, and MBPS

Hybrid Databases
• Processing platform: generally a medium-sized global-class server.
• Database: a large, vertical, loosely normalized database with few tables that are very long, hold sorted data, and sometimes contain more than a billion rows. The data is striped across the storage in a manner that prevents physical hot spots and takes advantage of the wide bandwidth.
• Storage subsystem: very fast, with wide bandwidth and high levels of redundancy, which makes it possible to move large amounts of random and sequential data in a very short time.

What's New / Future Innovations
Grid or scale-out environments...
• Utilize low-cost, commodity-based servers
• Low-cost/no-cost operating systems
• Many servers can work on one problem, with aggregate processing power greater than that of one large server, for less money
• Not locked into a single vendor or supplier
• When adding a new node, able to use current technology at a lower price
• Need to understand and factor in peripheral costs such as network, administration, data center, etc.

[Figure: Parallel Grid and Clustered Grid, each built from IBM pSeries servers running an OS and a DB]

Distributed Grid Database
• Shared-nothing environment: each partition has its own resources, allowing scaling up to 999 partitions.
• Any partition can receive connections and distribute queries among the other nodes.
• Centralized management of the partitioned environment.
• Data is equally distributed across all partitions.

Summary
• Understand the process in which the database is to be used, and fashion a solution that meets the requirements and customer expectations.
• Even though a DBA may be responsible only for the database, many factors such as operating system and hardware configuration affect the functionality of the database and thus are a concern to the DBA.
• A DBA must relate the database to its environment to achieve an optimized solution.
• A large multi-terabyte database is not a scary monster; it is the same as dealing with a smaller database, just with a few more zeros.
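
As a closing illustration of the Distributed Grid Database slide, the idea of spreading data across shared-nothing partitions can be sketched with hash partitioning. This is a minimal sketch, not the method of any particular database product: the slides do not say how rows are assigned to partitions, so the use of a CRC32 hash, the function name partition_for, the partition count, and the customer keys are all assumptions made for the example.

```python
import zlib
from collections import Counter


def partition_for(key: str, partitions: int) -> int:
    """Map a row key to a partition number in [0, partitions).

    Hashing with CRC32 is an assumption for this sketch; a real
    partitioned database uses its own internal hashing scheme.
    """
    return zlib.crc32(key.encode("utf-8")) % partitions


PARTITIONS = 4  # hypothetical partition count
keys = [f"customer-{i}" for i in range(100_000)]  # hypothetical row keys

# Count how many rows land on each partition.
counts = Counter(partition_for(k, PARTITIONS) for k in keys)
for p in sorted(counts):
    print(p, counts[p])
```

Because the partition is computed from the key alone, any node can work out where a row lives without consulting the others, which is what lets any partition accept a connection and route work to the rest.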