Data Warehouses and Analytical Data Processing in CERN's

Jan Janke
Software Engineer
CERN / GS-AIS
October 25 - 29, 2010
JINR/CERN Grid and Management Information Systems



Data Warehouses in Administrative Computing
Recap: Data Warehouses Theory
Data Warehouses and Information Systems in AIS
◦ Foundation, HR and FI Information Systems
◦ Complex Data Extraction Processes
◦ Pixel-Perfect Reporting
◦ Dashboards
Detailed Data Warehouse Example
◦ Management Data Layer (MDL)
Jan Janke: "Data Warehouses and Analytical Data Processing ..."






Provides means to administrate CERN
Enables physicists to focus on their work
Allows management to make the right moves







Heterogeneous computing landscape
Various specialised OLTP systems
Planning needs
Legal Requirements
Support administrative staff
Enforce security and safety on site
Allow management to make decisions

Specialised Systems
◦ Accounting, ERP for CERN stores
◦ External contracts management
◦ Payroll, treasury management, …
Specialised small user groups
Distinct databases
Systems only accessible to authorised specialists
High availability and performance, real-time data


General Financial Information System
◦ Single system
◦ Access to data from multiple sources
◦ Different levels of complexity
Users from all areas of CERN
Single data warehouse
Security is extremely important! System is accessible CERN-wide.
High availability and performance, but no necessity for real-time data




Keep data in sync with data providers
Master complex data extraction process
Ensure high query performance
Base for detailed data analysis
Technologies:
◦ ORACLE RAC database
◦ Java Enterprise web applications
◦ In-house developed frameworks
◦ Third-party BI and reporting tools



                    OLTP                 OLAP
Data source         Operations           OLTP (consolidated)
Data purpose        Run the business     Reporting, analysis
Inserts, updates    High                 Periodic batch jobs
Query complexity    Low                  High
DB design           Normalized           Star, snowflake
Availability        Critical             Less critical
Target              Operational staff    Middle/higher Mgmt.
                    OLTP                 OLAP
Data source         Operations           OLTP (consolidated)
Data purpose        Run the business     Reporting, analysis
Inserts, updates    High                 Periodic batch jobs
Query complexity    Low                  Depends …
DB design           Normalized           Snowflake and others
Availability        Critical             May be very critical
Target              Operational staff    Mgmt. + Operations

1NF
◦ 1 table = 1 relation, no repeating groups or duplicate rows

2NF
◦ All non-prime attributes depend on all parts (attributes) of a composite key

3NF
◦ All non-prime attributes depend only on the (whole) key
Not in 3NF, why?
Course       Category    Winner      Origin
Monaco '10   Formula 1   M. Webber   Australia
Japan '10    Formula 1   S. Vettel   Germany
Japan '10    Rally       S. Ogier    France
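One possible answer to the question, assuming the key of the table is (Course, Category): Winner depends on the key, but Origin depends on Winner rather than on the key. That transitive dependency is what 3NF rules out. The sketch below shows a decomposition into two tables; the table and column names (drivers, race_results, driver_name) are illustrative assumptions and do not come from the slides.

```sql
-- Hypothetical 3NF decomposition of the race table (all names are assumptions).
-- Origin is a fact about the winner, so it moves into its own table.
CREATE TABLE drivers (
  driver_name VARCHAR2(50) PRIMARY KEY,
  origin      VARCHAR2(50) NOT NULL
);

CREATE TABLE race_results (
  course   VARCHAR2(50),
  category VARCHAR2(50),
  winner   VARCHAR2(50) NOT NULL REFERENCES drivers (driver_name),
  CONSTRAINT race_results_pk PRIMARY KEY (course, category)
);

INSERT INTO drivers VALUES ('M. Webber', 'Australia');
INSERT INTO drivers VALUES ('S. Vettel', 'Germany');
INSERT INTO drivers VALUES ('S. Ogier',  'France');

INSERT INTO race_results VALUES ('Monaco ''10', 'Formula 1', 'M. Webber');
INSERT INTO race_results VALUES ('Japan ''10',  'Formula 1', 'S. Vettel');
INSERT INTO race_results VALUES ('Japan ''10',  'Rally',     'S. Ogier');

-- The original four-column table is recovered with a join.
SELECT r.course, r.category, r.winner, d.origin
FROM   race_results r
JOIN   drivers d ON d.driver_name = r.winner
ORDER  BY r.course, r.category;
```

With this layout a driver's origin is stored once, so it cannot become inconsistent across several race rows.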
Star schema (diagram):
Sales Fact Table: item_key, branch_key, location_key; Measures: units_sold, dollars_sold, avg_sales
item: item_key, item_name, brand, type, supplier_type
Branch: branch_key, branch_name, branch_type
location: location_key, street, city, state_or_province, country
Source: http://www.executionmih.com/data-warehouse/star-snowflake-schema.php (16/10/2010)
Snowflake schema (diagram):
Sales Fact Table: item_key, branch_key, location_key; Measures: units_sold, dollars_sold, avg_sales
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
Branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, state_or_province, country
Source: http://www.executionmih.com/data-warehouse/star-snowflake-schema.php (16/10/2010)
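To make the difference between the two layouts concrete, here is a minimal sketch of a reduced snowflake schema and a query against it. The table name sales_fact and the data types are assumptions; the item and Branch dimensions are left out to keep the example short.

```sql
-- Sketch only: a reduced version of the snowflake layout (types are assumptions).
CREATE TABLE city (
  city_key          NUMBER PRIMARY KEY,
  city              VARCHAR2(50),
  state_or_province VARCHAR2(50),
  country           VARCHAR2(50)
);

CREATE TABLE location (
  location_key NUMBER PRIMARY KEY,
  street       VARCHAR2(100),
  city_key     NUMBER REFERENCES city (city_key)
);

CREATE TABLE sales_fact (
  item_key     NUMBER,
  branch_key   NUMBER,
  location_key NUMBER REFERENCES location (location_key),
  units_sold   NUMBER,
  dollars_sold NUMBER
);

-- Snowflake: reaching "country" takes two joins (fact -> location -> city).
-- In the star layout, country is a column of location and one join is enough.
SELECT c.country,
       SUM(f.units_sold)   AS units_sold,
       SUM(f.dollars_sold) AS dollars_sold
FROM   sales_fact f
JOIN   location l ON l.location_key = f.location_key
JOIN   city     c ON c.city_key     = l.city_key
GROUP  BY c.country
ORDER  BY c.country;
```

The trade-off is the usual one: the snowflake form stores each city once, while the star form saves a join at query time.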
[Figure: data warehouse overview; source systems labelled FI, ERP, HR, …]
Source: http://www.deakin.edu.au/ddw/what-is.php (16/10/2010)


Data Mining
Drilldown
◦ Finer detail granularity (e.g. add a group-by column)

Slice & dice
◦ Play with the dimensions
  ▪ Combine different dimensions
  ▪ Remove/add a dimension
  ▪ Analyse fact changes
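The operations above map fairly directly onto SQL. The following sketch uses an invented table demo_sales with made-up rows; it shows a coarse query, a drilldown that adds a group-by column, and a slice that fixes one dimension value.

```sql
-- Invented sample data so the queries can be run stand-alone.
CREATE TABLE demo_sales (
  sales_year NUMBER(4),
  city       VARCHAR2(30),
  brand      VARCHAR2(30),
  units_sold NUMBER
);

INSERT INTO demo_sales VALUES (2009, 'Geneva', 'A', 10);
INSERT INTO demo_sales VALUES (2009, 'Meyrin', 'A', 4);
INSERT INTO demo_sales VALUES (2010, 'Geneva', 'A', 7);
INSERT INTO demo_sales VALUES (2010, 'Geneva', 'B', 5);
INSERT INTO demo_sales VALUES (2010, 'Meyrin', 'B', 9);

-- Starting point: one row per year.
SELECT sales_year, SUM(units_sold) AS units_sold
FROM   demo_sales
GROUP  BY sales_year
ORDER  BY sales_year;

-- Drilldown: add "city" as a group-by column for finer granularity.
SELECT sales_year, city, SUM(units_sold) AS units_sold
FROM   demo_sales
GROUP  BY sales_year, city
ORDER  BY sales_year, city;

-- Slice: fix one dimension value and analyse the remaining dimensions.
SELECT sales_year, brand, SUM(units_sold) AS units_sold
FROM   demo_sales
WHERE  city = 'Geneva'
GROUP  BY sales_year, brand
ORDER  BY sales_year, brand;
```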






Common data layer for various AIS services
Data interfaces for other CERN services
Common applications (e.g. mgmt. of roles)
Operative systems
HR Information System (HRT)
FI Information System (CET)
… more domain specific information systems








ORACLE HR
CERN Training Application
Safety & access systems
EDH (Electronic Document Handling)
Accounting Application
ERP system for CERN stores
Contract follow-up
…

Source databases:
◦ ORACLE 10g
◦ Microsoft Excel

HR/FI Information Systems:
◦ ORACLE 10g
◦ Java Enterprise web applications
◦ SAP Business Objects tool family



Nightly scheduled batch jobs
Extractions organised in SQL scripts
Run by self-developed “batch runner”
◦ Controls
  ▪ Order of execution (sequential, parallel)
  ▪ Criticality
  ▪ Logging
  ▪ Problem escalation (automatic emails)
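The slides do not show how the batch runner is built, so the following is only a rough PL/SQL sketch of the controls listed above, limited to sequential execution, criticality and logging. The tables batch_commands and batch_log and all their columns are hypothetical; parallel execution and e-mail escalation are left out.

```sql
-- Hypothetical control tables; the real batch runner's design is not shown.
CREATE TABLE batch_commands (
  exec_order  NUMBER PRIMARY KEY,
  sql_text    VARCHAR2(4000) NOT NULL,
  is_critical CHAR(1) DEFAULT 'N' NOT NULL   -- 'Y': abort the batch on failure
);

CREATE TABLE batch_log (
  logged_at  TIMESTAMP DEFAULT SYSTIMESTAMP,
  exec_order NUMBER,
  status     VARCHAR2(10),
  message    VARCHAR2(4000)
);

BEGIN
  -- Run the commands one after another in the configured order.
  FOR cmd IN (SELECT exec_order, sql_text, is_critical
              FROM   batch_commands
              ORDER  BY exec_order)
  LOOP
    BEGIN
      EXECUTE IMMEDIATE cmd.sql_text;
      INSERT INTO batch_log (exec_order, status)
      VALUES (cmd.exec_order, 'OK');
    EXCEPTION
      WHEN OTHERS THEN
        -- SQLERRM cannot be used directly in a SQL statement, so copy it first.
        DECLARE
          v_msg VARCHAR2(4000) := SUBSTR(SQLERRM, 1, 4000);
        BEGIN
          INSERT INTO batch_log (exec_order, status, message)
          VALUES (cmd.exec_order, 'FAILED', v_msg);
        END;
        IF cmd.is_critical = 'Y' THEN
          COMMIT;   -- keep the log entries written so far
          RAISE;    -- escalate: skip the remaining commands
        END IF;
    END;
  END LOOP;
  COMMIT;
END;
/
```

A non-critical failure is logged and the batch continues; a critical one stops the run, which is where an escalation step such as an automatic e-mail would hook in.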
General definitions
Batches & commands
New hardware for DEV databases (gain > 1h)


Pre-aggregated summaries
Benefit from query rewrite
Source: ORACLE 10g Documentation / Data Warehousing Guide
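In Oracle, pre-aggregated summaries with query rewrite are provided by materialized views. A minimal sketch, assuming a hypothetical detail table sales_detail (neither the table nor the view name comes from the slides):

```sql
-- Hypothetical detail table.
CREATE TABLE sales_detail (
  sale_date  DATE,
  branch_key NUMBER,
  amount     NUMBER
);

-- Pre-aggregated summary. ENABLE QUERY REWRITE allows the optimizer to answer
-- matching queries on sales_detail from the summary instead.
CREATE MATERIALIZED VIEW sales_by_branch_mv
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT branch_key,
       SUM(amount) AS total_amount,
       COUNT(*)    AS row_count
FROM   sales_detail
GROUP  BY branch_key;

-- Written against the detail table; if query rewrite is enabled for the
-- session, the optimizer may use sales_by_branch_mv transparently.
SELECT branch_key, SUM(amount)
FROM   sales_detail
GROUP  BY branch_key;
```

Whether a rewrite actually happens depends on optimizer settings such as QUERY_REWRITE_ENABLED and QUERY_REWRITE_INTEGRITY and on the freshness of the view, so the execution plan is worth checking.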


Don’t use remote tables if you need query rewrite
Create materialized view log on all source tables
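As an illustration of the second point, the sketch below creates a materialized view log on a hypothetical local table order_lines and a summary that can then be refreshed incrementally. All object names are assumptions.

```sql
-- Hypothetical local source table.
CREATE TABLE order_lines (
  order_id   NUMBER,
  branch_key NUMBER,
  amount     NUMBER
);

-- The log records changes to the source table so that only the
-- differences need to be applied at refresh time.
CREATE MATERIALIZED VIEW LOG ON order_lines
  WITH ROWID, SEQUENCE (branch_key, amount)
  INCLUDING NEW VALUES;

-- For a fast-refreshable aggregate, COUNT(*) and a COUNT of each
-- summed column are included alongside the SUM.
CREATE MATERIALIZED VIEW order_totals_mv
  REFRESH FAST ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT branch_key,
       SUM(amount)   AS total_amount,
       COUNT(amount) AS amount_count,
       COUNT(*)      AS row_count
FROM   order_lines
GROUP  BY branch_key;

-- Apply only the logged changes ('F' = fast refresh).
BEGIN
  DBMS_MVIEW.REFRESH(list => 'ORDER_TOTALS_MV', method => 'F');
END;
/
```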

Use snapshots to efficiently access remote tables
◦ Syntax: CREATE SNAPSHOT … AS [Your Query]
◦ Refresh options:
  ▪ FAST
  ▪ COMPLETE
  ▪ FORCE
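A sketch of what such a snapshot might look like. It assumes that a database link named src_db and a remote table hr_persons with the listed columns already exist; both names are placeholders, so the statement will not run without them. In current Oracle versions CREATE SNAPSHOT is kept as a backward-compatible spelling of CREATE MATERIALIZED VIEW, which is used here.

```sql
-- Placeholders: database link SRC_DB and remote table HR_PERSONS must exist.
-- REFRESH FORCE tries a FAST refresh first and falls back to COMPLETE
-- (FAST needs a materialized view log on the remote table).
CREATE MATERIALIZED VIEW persons_snap
  REFRESH FORCE
  START WITH SYSDATE
  NEXT TRUNC(SYSDATE) + 1 + 2/24   -- then every night at 02:00
AS
SELECT person_id, last_name, unit
FROM   hr_persons@src_db;

-- Queries now read the local copy instead of crossing the database link.
SELECT unit, COUNT(*) AS persons
FROM   persons_snap
GROUP  BY unit;

-- A refresh can also be triggered explicitly ('C' = COMPLETE).
BEGIN
  DBMS_MVIEW.REFRESH(list => 'PERSONS_SNAP', method => 'C');
END;
/
```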


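PL/SQL serves as the data source instead of a table
May increase performance in environments with heavy PL/SQL use
Step 1: define the row type and the table type
-- Object type describing one output row
CREATE OR REPLACE TYPE myTableFormat AS OBJECT (
  col_a NUMBER,
  col_b DATE,
  col_c VARCHAR2(25)
)
/
-- Collection type returned by the function
CREATE OR REPLACE TYPE myTableType AS TABLE OF myTableFormat
/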
Step 2: create the pipelined function
CREATE OR REPLACE FUNCTION myFunc
  RETURN myTableType PIPELINED IS
BEGIN
  FOR i IN 1 .. 5
  LOOP
    -- Emit one row at a time to the calling query
    PIPE ROW ( myTableFormat( i, SYSDATE + i, 'Row ' || i ) );
  END LOOP;
  RETURN;
END;
/
Step 3: query the function like a table
SELECT * FROM TABLE( myFunc() );

col_a  col_b       col_c
-----  ----------  -----
    1  27/10/2010  Row 1
    2  28/10/2010  Row 2
    3  29/10/2010  Row 3
    4  30/10/2010  Row 4
    5  31/10/2010  Row 5

Use a pipelined function if you require a data source other than a table!




Star schema like
Highly de-normalised incl. duplication of data
Use single-attribute keys wherever possible
Performance matters!
◦ Be careful when extracting over database links
◦ Certain tables from operational systems are copied
◦ Deletion & recreation of indexes
◦ Use partitions
◦ Manual control of statistics collection
◦ Optimizing execution plans very time-consuming
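Two of the performance points, deletion and recreation of indexes and manual statistics collection, can be sketched as a load sequence. The table dw_persons, its index and the generated rows are invented for the example; the SELECT stands in for a real extraction query.

```sql
-- Hypothetical warehouse table and index (names are assumptions).
CREATE TABLE dw_persons (
  person_id NUMBER,
  unit      VARCHAR2(10),
  last_name VARCHAR2(50)
);
CREATE INDEX dw_persons_unit_ix ON dw_persons (unit);

-- 1. Drop the index so the bulk load does not maintain it row by row.
DROP INDEX dw_persons_unit_ix;

-- 2. Reload the table (generated rows stand in for the extraction query).
TRUNCATE TABLE dw_persons;
INSERT /*+ APPEND */ INTO dw_persons (person_id, unit, last_name)
SELECT LEVEL, 'U' || MOD(LEVEL, 10), 'Person ' || LEVEL
FROM   dual
CONNECT BY LEVEL <= 1000;
COMMIT;

-- 3. Recreate the index in a single pass over the loaded data.
CREATE INDEX dw_persons_unit_ix ON dw_persons (unit);

-- 4. Collect optimizer statistics explicitly once the load is complete.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => USER,
    tabname => 'DW_PERSONS',
    cascade => TRUE);   -- also gathers statistics for the index
END;
/
```

Gathering statistics right after the load means the optimizer sees the new data volumes before the first reports of the day run.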







Column and ordering selection
Sub reports
Various output formats (e.g. HTML, PDF)
Charts
Self-service reporting
Automated scheduled report execution
Row and column based access control
Which data (columns) am I allowed to see? As a supervisor I may not be entitled to see the health insurance category. A safety or medical officer may not see the salary, etc.
Which rows are visible to me? The leader of Unit B sees only persons from Unit B.
Name     Unit  Tel    Salary   Category
Meyer    A     12345  $ 4,900  3
Schmidt  B     23456  $ 6,400  1
Cook     B     34567  $ 5,700  2
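One simple way to express both rules in plain SQL is a view that filters rows and masks columns according to an entitlement table. This is only a sketch under assumptions: the tables persons and user_entitlements, the view persons_secure and the user name UNIT_B_LEADER are invented, and the slides do not say how the reporting tools implement the checks.

```sql
-- Sample data taken from the table above; object names are assumptions.
CREATE TABLE persons (
  name     VARCHAR2(50),
  unit     VARCHAR2(10),
  tel      VARCHAR2(10),
  salary   NUMBER,
  category NUMBER
);
INSERT INTO persons VALUES ('Meyer',   'A', '12345', 4900, 3);
INSERT INTO persons VALUES ('Schmidt', 'B', '23456', 6400, 1);
INSERT INTO persons VALUES ('Cook',    'B', '34567', 5700, 2);

-- Hypothetical entitlements: which unit a user may see, and which columns.
CREATE TABLE user_entitlements (
  username         VARCHAR2(30) PRIMARY KEY,
  unit             VARCHAR2(10),
  may_see_salary   CHAR(1),
  may_see_category CHAR(1)
);
INSERT INTO user_entitlements VALUES ('UNIT_B_LEADER', 'B', 'Y', 'N');

-- Row-based control: the join keeps only rows of the user's unit.
-- Column-based control: CASE returns NULL for columns the user may not see.
CREATE OR REPLACE VIEW persons_secure AS
SELECT p.name,
       p.unit,
       p.tel,
       CASE WHEN e.may_see_salary   = 'Y' THEN p.salary   END AS salary,
       CASE WHEN e.may_see_category = 'Y' THEN p.category END AS category
FROM   persons p
JOIN   user_entitlements e ON e.unit = p.unit
WHERE  e.username = SYS_CONTEXT('USERENV', 'SESSION_USER');
```

Reports would then select from persons_secure rather than from persons, so the restrictions apply regardless of which columns and ordering a user picks.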

Use of Apache FOP library
◦ Examples:
  ▪ Employment & training attestations
  ▪ Swiss / French card application forms

Business Objects XI Enterprise
◦ Direct use
◦ Indirect use via Business Objects Java SDK
◦ Examples:
  ▪ Salary slips
  ▪ Car stickers
  ▪ Work orders


Commercial tool family from SAP
Advantages
◦ Rich reporting possibilities (interactive or via SDK)
◦ Appealing dashboards using Xcelsius
◦ Only a few users need the knowledge to design reports

Drawbacks
◦ Two-way data storage (file system & database)
◦ Sometimes stability problems
◦ Time-intensive administration and maintenance
◦ Expensive
Designed locally using MS Office and Xcelsius.
Data comes from the MDL data warehouse.
Published as Flash to the BO Server.







KPI data warehouse
Very extensible
Fixed generic schema
Feeds management dashboards
Performance: Currently ca. 170 GB data in two tables
Generality: Different forms of data sources, new sources are added and removed all the time.
Integration with existing tools and development frameworks (ORACLE, Excel, BO, …)
[Diagram: MDL schema. Tables: MDL_HEADERS, MDL_DIMENSIONS, MDL_VALUES, MDL_RAW_DATA, MDL_SUMMARY_DATA, MDL_LOOKUP_DATA, MDL_LOOKUP_INFO. The connecting relationships carry "n" cardinalities; two are labelled "describes".]
[Diagram: composite partitioning. Range partitioning by year (…, 2008, 2009, 2010); each range partition is hash-partitioned across Data Set 1, Data Set 2, …, Data Set n.]
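A layout like this corresponds to Oracle's composite range-hash partitioning. The sketch below is an assumption-laden illustration: the table name mdl_raw_data_demo, the column list, the choice of data_id as hash key, the numeric YYYYMMDD encoding of value_date and the subpartition count are all guesses for the sake of the example, not the actual MDL definition.

```sql
-- Illustrative table only; the real MDL column list is not shown in the slides.
CREATE TABLE mdl_raw_data_demo (
  data_id    NUMBER    NOT NULL,   -- identifies the data set
  value_date NUMBER(8) NOT NULL,   -- assumed to be encoded as YYYYMMDD
  dimension1 VARCHAR2(100),
  value1     NUMBER
)
PARTITION BY RANGE (value_date)
SUBPARTITION BY HASH (data_id) SUBPARTITIONS 8
(
  PARTITION p2008 VALUES LESS THAN (20090000),
  PARTITION p2009 VALUES LESS THAN (20100000),
  PARTITION p2010 VALUES LESS THAN (20110000),
  PARTITION pmax  VALUES LESS THAN (MAXVALUE)
);

-- Both predicates help pruning: value_date selects the range partitions,
-- data_id selects the hash subpartition inside each of them.
SELECT dimension1, SUM(value1) AS total
FROM   mdl_raw_data_demo
WHERE  data_id = 45
AND    value_date > 20100000
GROUP  BY dimension1;
```

Range partitions by year also make it cheap to archive or drop old data, while the hash level spreads the data sets evenly within a year.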



Keep it simple
Redesign / add data source if required
Use partitions and indexes
SELECT dimension1, dimension3, SUM(value2)
FROM mdl_raw_data
WHERE data_id = 45
AND value_date > 20100000
GROUP BY dimension1, dimension3
ORDER BY 1, 2;



High data volumes + analysis = data warehouse
OLTP vs. OLAP
Use the facilities the tool provides
◦ Materialized views, snapshots, pipelined functions


Keep things extensible and simple!
Partitions are very helpful


