Jan Janke
Software Engineer
CERN / GS-AIS
October 25 - 29, 2010
JINR/CERN Grid and Management Information Systems

Jan Janke: "Data Warehouses and Analytical Data Processing ..."

Data Warehouses in Administrative Computing
Recap: Data Warehouses Theory
Data Warehouses and Information Systems in AIS
  ◦ Foundation, HR and FI Information Systems
  ◦ Complex Data Extraction Processes
  ◦ Pixel-Perfect Reporting
  ◦ Dashboards
Detailed Data Warehouse Example
  ◦ Management Data Layer (MDL)

Data Warehouses in Administrative Computing

Provides means to administrate CERN
Enables physicists to focus on their work
Allows management to make the right moves

Heterogeneous computing landscape
Various specialised OLTP systems
Planning needs
Legal requirements
Support administrative staff
Enforce security and safety on site
Allow management to make decisions

Specialised Systems
  ◦ Accounting, ERP for CERN stores
  ◦ External contracts management
  ◦ Payroll, treasury management, …
Specialised small user groups
Distinct databases
Systems only accessible to authorised specialists
High availability and performance, real-time data
General Financial Information System
  ◦ Single system
  ◦ Access to data from multiple sources
  ◦ Different levels of complexity
Users from all areas of CERN
Single data warehouse
Security is extremely important! System is accessible CERN-wide.
High availability and performance, but no necessity for real-time data

Keep data in sync with data providers
Master complex data extraction process
Ensure high query performance
Base for detailed data analysis
Technologies:
  ◦ ORACLE RAC database
  ◦ Java Enterprise web applications
  ◦ In-house developed frameworks
  ◦ Third-party BI and reporting tools

Recap: Data Warehouses Theory

                    OLTP                OLAP
Data source         Operations          OLTP (consolidated)
Data purpose        Run the business    Reporting, analysis
Inserts, updates    High                Periodic batch jobs
Query complexity    Low                 High
DB design           Normalized          Star, snowflake
Availability        Critical            Less critical
Target              Operational staff   Middle/higher Mgmt.
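The OLTP/OLAP contrast in the table above can be illustrated with two queries against the same data. This is only a minimal sketch: it uses Python's built-in sqlite3 module as a stand-in for a real database, and the `orders` table and its columns are hypothetical, not taken from any CERN system.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a real system (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (            -- hypothetical table for illustration
        order_id INTEGER PRIMARY KEY,
        unit     TEXT,
        year     INTEGER,
        amount   REAL
    );
    INSERT INTO orders VALUES (1, 'A', 2009, 100.0);
    INSERT INTO orders VALUES (2, 'A', 2010, 250.0);
    INSERT INTO orders VALUES (3, 'B', 2010, 400.0);
    INSERT INTO orders VALUES (4, 'B', 2010,  50.0);
""")

# OLTP style: low query complexity, one row addressed by its key.
oltp_row = conn.execute(
    "SELECT unit, amount FROM orders WHERE order_id = ?", (3,)
).fetchone()

# OLAP style: scan many rows and aggregate them for reporting.
olap_rows = conn.execute(
    "SELECT unit, year, SUM(amount) FROM orders "
    "GROUP BY unit, year ORDER BY unit, year"
).fetchall()

print(oltp_row)
print(olap_rows)
```

The first query is typical for operational staff working on a single record; the second is the kind of consolidated view a reporting system serves.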
                    OLTP                OLAP
Data source         Operations          OLTP (consolidated)
Data purpose        Run the business    Reporting, analysis
Inserts, updates    High                Periodic batch jobs
Query complexity    Low                 Depends …
DB design           Normalized          Snowflake and others
Availability        Critical            May be very critical
Target              Operational staff   Mgmt. + Operations

1NF
  ◦ 1 table = 1 relation, no repeating groups or duplicate rows
2NF
  ◦ All non-prime attributes depend on all parts (attributes) of a composite key
3NF
  ◦ All non-prime attributes depend only on the (whole) key

Not in 3NF, why?

Course       Category    Winner      Origin
Monaco '10   Formula 1   M. Webber   Australia
Japan '10    Formula 1   S. Vettel   Germany
Japan '10    Rally       S. Ogier    France

[Diagram: star schema]
Sales Fact Table: item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
item: item_key, item_name, brand, type, supplier_type
Branch: branch_key, branch_name, branch_type
location: location_key, street, city, state_or_province, country
Source: http://www.executionmih.com/data-warehouse/star-snowflake-schema.php (16/10/2010)

[Diagram: snowflake schema]
Sales Fact Table: item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
Branch: branch_key, branch_name, branch_type
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
location: location_key, street, city_key
city: city_key, city, state_or_province, country
Source: http://www.executionmih.com/data-warehouse/star-snowflake-schema.php (16/10/2010)

[Diagram: data warehouse with source systems FI, ERP, HR, …]
Source: http://www.deakin.edu.au/ddw/what-is.php (16/10/2010)

Data Mining
Drilldown
  ◦ Finer detail granularity (e.g.
add a group-by column)
Slice & dice
  ◦ Play with the dimensions
    Combine different dimensions
    Remove/add a dimension
Analyse fact changes

Data Warehouses and Information Systems in AIS

Common data layer for various AIS services
Data interfaces for other CERN services
Common applications (e.g. mgmt. of roles)
Operative systems
HR Information System (HRT)
FI Information System (CET)
… more domain specific information systems

ORACLE HR
CERN Training Application
Safety & access systems
EDH (Electronic Document Handling)
Accounting Application
ERP system for CERN stores
Contract follow-up
…

Source databases:
  ◦ ORACLE 10g
  ◦ Microsoft Excel
HR/FI Information Systems:
  ◦ ORACLE 10g
  ◦ Java Enterprise web applications
  ◦ SAP Business Objects tool family

Nightly scheduled batch jobs
Extractions organised in SQL scripts
Run by self-developed “batch runner”
  ◦ Controls
    Order of execution (sequential, parallel)
    Criticality
    Logging
    Problem escalation (automatic emails)

[Screenshot: General definitions]

[Screenshot: Batches & commands]

New hardware for DEV databases (gain > 1h)
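The controls of the batch runner listed earlier (order of execution, criticality, logging, escalation) can be sketched in a few lines. This is not the in-house tool: the names `Command`, `run_command` and `run_batch` are hypothetical, a Python callable stands in for running one SQL script, and escalation is a callback where the real system sends emails.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("batch_runner")


@dataclass
class Command:
    name: str
    action: Callable[[], None]  # stands in for executing one SQL script
    critical: bool = True       # a failed critical command aborts the batch


def run_command(cmd: Command, escalate: Callable[[str], None]) -> bool:
    """Run one command, log the outcome, escalate on failure."""
    try:
        cmd.action()
        log.info("%s finished", cmd.name)
        return True
    except Exception as exc:
        log.error("%s failed: %s", cmd.name, exc)
        escalate(f"{cmd.name} failed: {exc}")
        return False


def run_batch(steps: List[List[Command]], escalate: Callable[[str], None]) -> bool:
    """Groups run one after another; commands inside a group run in parallel."""
    for group in steps:
        with ThreadPoolExecutor(max_workers=len(group)) as pool:
            results = list(pool.map(lambda c: run_command(c, escalate), group))
        if any(cmd.critical and not ok for cmd, ok in zip(group, results)):
            log.error("critical failure, aborting batch")
            return False
    return True


def fail() -> None:
    raise RuntimeError("source database unreachable")


executed: List[str] = []
sent: List[str] = []  # collects escalation messages instead of sending email
steps = [
    [Command("extract_hr", lambda: executed.append("extract_hr")),
     Command("extract_fi", fail, critical=False)],
    [Command("build_summaries", lambda: executed.append("build_summaries"))],
]
ok = run_batch(steps, sent.append)
print(ok, executed, sent)
```

In this run the non-critical extraction fails and is escalated, but the batch continues with the next group; marking the same command as critical would stop the batch before `build_summaries`.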
Pre-aggregated summaries
Benefit from query rewrite
Source: ORACLE 10g Documentation / Data Warehousing Guide

Don't use remote tables if you need query rewrite
Create materialized view log on all source tables

Use snapshots to efficiently access remote tables
  ◦ Syntax: CREATE SNAPSHOT … AS [Your Query]
  ◦ Refresh options:
    FAST
    COMPLETE
    FORCE

PL/SQL is the data source instead of a table
May increase performance in environments with heavy PL/SQL use

1.
    -- Row type returned by the function
    CREATE OR REPLACE TYPE myTableFormat AS OBJECT(
        col_a NUMBER,
        col_b DATE,
        col_c VARCHAR2(25)
    )
    /
    -- Collection type used as the function's return type
    CREATE OR REPLACE TYPE myTableType AS TABLE OF myTableFormat
    /

2.
    CREATE OR REPLACE FUNCTION myFunc RETURN myTableType PIPELINED IS
    BEGIN
        FOR i IN 1 .. 5 LOOP
            -- Emit one row at a time to the calling query
            PIPE ROW ( myTableFormat( i, SYSDATE+i, 'Row '||i ) );
        END LOOP;
        RETURN;
    END;
    /

3.
    SELECT * FROM TABLE( myFunc() );

    col_a  col_b       col_c
    -----  ----------  -----
    1      27/10/2010  Row 1
    2      28/10/2010  Row 2
    3      29/10/2010  Row 3
    4      30/10/2010  Row 4
    5      31/10/2010  Row 5

Use a pipelined function if you require a data source other than a table!

Star schema like
Highly de-normalised incl. duplication of data
Use single-attribute keys wherever possible
Performance matters!
  ◦ Be careful when extracting over database links
  ◦ Certain tables from operational systems are copied
  ◦ Deletion & recreation of indexes
  ◦ Use partitions
  ◦ Manual control of statistics collection
  ◦ Optimizing execution plans is very time-consuming
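To make the idea of pre-aggregated summaries with a COMPLETE refresh more concrete, here is a minimal sketch. It uses SQLite through Python's sqlite3 module as a stand-in: SQLite has neither materialized views nor query rewrite, so the summary is an ordinary table that is rebuilt on demand and must be queried explicitly. The `sales` table and the name `sales_by_item_year` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical detail (fact) table
    CREATE TABLE sales (item TEXT, year INTEGER, units_sold INTEGER);
    INSERT INTO sales VALUES ('pen', 2009, 10);
    INSERT INTO sales VALUES ('pen', 2010, 5);
    INSERT INTO sales VALUES ('ink', 2010, 7);
    INSERT INTO sales VALUES ('pen', 2010, 3);
""")


def refresh_summary(conn: sqlite3.Connection) -> None:
    """Complete refresh: rebuild the summary table from the detail rows."""
    conn.executescript("""
        DROP TABLE IF EXISTS sales_by_item_year;
        CREATE TABLE sales_by_item_year AS
            SELECT item, year, SUM(units_sold) AS units_sold
            FROM sales
            GROUP BY item, year;
    """)


def read_summary(conn: sqlite3.Connection):
    """Reports read the small summary table instead of the detail rows."""
    return conn.execute(
        "SELECT item, year, units_sold FROM sales_by_item_year "
        "ORDER BY item, year"
    ).fetchall()


refresh_summary(conn)
summary = read_summary(conn)

# New detail rows are not visible in the summary until the next refresh.
conn.execute("INSERT INTO sales VALUES ('ink', 2010, 1)")
stale = read_summary(conn)
refresh_summary(conn)
fresh = read_summary(conn)
print(summary, stale, fresh)
```

With a real materialized view and query rewrite enabled, the database could redirect a matching query on the detail table to the summary automatically; in this sketch the redirection is done by hand.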
Column and ordering selection
Sub reports
Various output formats (e.g. HTML, PDF)
Charts
Self-service reporting
Automated scheduled report execution
Row and column based access control

Which data (columns) am I allowed to see?
  ◦ As a supervisor I may not be entitled to see the health insurance category.
  ◦ A safety or medical officer may not see the salary, etc.
Which rows are visible to me?
  ◦ Unit leader of B only sees persons from Unit B.

Name      Unit   Tel     Salary    Category
Meyer     A      12345   $ 4,900   3
Schmidt   B      23456   $ 6,400   1
Cook      B      34567   $ 5,700   2

Use of Apache FOP library
  ◦ Examples:
    Employment & training attestations
    Swiss / French card application forms
Business Objects XI Enterprise
  ◦ Direct use
  ◦ Indirect use via Business Objects Java SDK
  ◦ Examples:
    Salary slips
    Car stickers
    Work orders

Commercial tool family from SAP
Advantages
  ◦ Rich reporting possibilities (interactive or via SDK)
  ◦ Appealing dashboards using Xcelsius
  ◦ Only a few users need the knowledge to design reports
Drawbacks
  ◦ Two-way data storage (file system & database)
  ◦ Sometimes stability problems
  ◦ Time-intensive administration and maintenance
  ◦ Expensive

Designed locally using MS Office and Xcelsius.
Data comes from the MDL data warehouse.
Published as Flash to the BO Server.
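The row and column based access control described earlier in this part can be sketched as a filter applied before a report is rendered. This is an illustration only: the `Viewer` class and the `visible` function are hypothetical and do not describe how the HR/FI information systems implement it; the sample rows are the ones from the table above, and Category is assumed to be the health insurance category.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional

# Sample rows from the example table.
PERSONS: List[Dict[str, object]] = [
    {"Name": "Meyer", "Unit": "A", "Tel": "12345", "Salary": 4900, "Category": 3},
    {"Name": "Schmidt", "Unit": "B", "Tel": "23456", "Salary": 6400, "Category": 1},
    {"Name": "Cook", "Unit": "B", "Tel": "34567", "Salary": 5700, "Category": 2},
]


@dataclass(frozen=True)
class Viewer:
    columns: FrozenSet[str]     # columns this viewer may see
    unit: Optional[str] = None  # restrict rows to one unit; None means all units


def visible(rows: List[Dict[str, object]], viewer: Viewer) -> List[Dict[str, object]]:
    """Apply the row filter first, then drop the columns the viewer may not see."""
    return [
        {col: value for col, value in row.items() if col in viewer.columns}
        for row in rows
        if viewer.unit is None or row["Unit"] == viewer.unit
    ]


# Unit leader of B: sees only Unit B, and as a supervisor not the Category.
unit_b_leader = Viewer(frozenset({"Name", "Unit", "Tel", "Salary"}), unit="B")
# Medical officer (assumed profile): all units, but no Salary.
medical_officer = Viewer(frozenset({"Name", "Unit", "Tel", "Category"}))

leader_report = visible(PERSONS, unit_b_leader)
medical_report = visible(PERSONS, medical_officer)
print(leader_report)
print(medical_report)
```

Filtering before rendering means every output format (HTML, PDF, charts) receives only the data the viewer is entitled to.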
Detailed Data Warehouse Example

KPI data warehouse
Very extensible
Fixed generic schema
Feeds management dashboards

Performance: Currently ca. 170 GB data in two tables
Generality: Different forms of data sources, new sources are added and removed all the time.
Integration with existing tools and development frameworks (ORACLE, Excel, BO, …)

[Schema diagram: MDL_HEADERS, MDL_DIMENSIONS, MDL_VALUES, MDL_RAW_DATA, MDL_SUMMARY_DATA, MDL_LOOKUP_DATA and MDL_LOOKUP_INFO, connected by 1:n and "describes" relationships]

Range Partitioning:   …   |   2008   |   2009   |   2010
Hash Partitioning within each range partition:
  Data Set 1
  Data Set 2
  …
  Data Set n
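The combination of range and hash partitioning shown above can be illustrated by routing rows to partitions by hand. This is a sketch under assumptions: the range key is taken to be the year of an integer date in YYYYMMDD form, the hash key is taken to be the data set identifier, a simple modulo stands in for the database's hash function, and the number of hash partitions is made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

N_HASH_PARTITIONS = 4  # assumed value, for illustration only

Row = Tuple[int, int, float]  # (data_id, value_date as YYYYMMDD, value)


def partition_for(data_id: int, value_date: int) -> Tuple[int, int]:
    """Return (range partition, hash partition) for a row."""
    year = value_date // 10000            # range key: 20100315 -> 2010
    bucket = data_id % N_HASH_PARTITIONS  # stand-in for the real hash function
    return year, bucket


rows: List[Row] = [
    (45, 20100315, 1.5),
    (45, 20090101, 2.0),
    (46, 20100401, 3.0),
    (49, 20100620, 4.0),
]

partitions: Dict[Tuple[int, int], List[Row]] = defaultdict(list)
for row in rows:
    partitions[partition_for(row[0], row[1])].append(row)


def scan(data_id: int, year: int) -> List[Row]:
    """Partition pruning: read only the one partition that can hold the rows."""
    candidates = partitions.get((year, data_id % N_HASH_PARTITIONS), [])
    return [r for r in candidates if r[0] == data_id]


result = scan(45, 2010)
print(sorted(partitions))
print(result)
```

A query restricted to one data set and one year touches a single partition instead of the whole table, which is why partitions matter at this data volume.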
Keep it simple
Redesign / add data source if required
Use partitions and indexes

    SELECT dimension1, dimension3, sum(value2)
    FROM mdl_raw_data
    WHERE data_id = 45
      AND value_date > 20100000
    GROUP BY dimension1, dimension3
    ORDER BY 1, 2;

High data volumes + analysis = data warehouse
OLTP vs. OLAP
Use the facilities the tool provides
  ◦ Materialized views, snapshots, pipelined functions
Keep things extensible and simple!
Partitions are very helpful