My experience building a custom ETL system

advertisement
My experience building a custom
ETL system
Problems, solutions and Oracle quirks
or
How scary Oracle can look for a Java developer
Agenda
• WHY do we need an ETL?
• HOW it works
• Experience:
– task
– existing solutions?
– problem or Oracle quirk
– my solution
WHY do we need an ETL?
• OLTP - in most cases calculation is non-trivial:
– SQLs grow in size & complexity
– increased maintenance effort
– poor SQL performance
• business values calculation is implemented in most
of reports independently - no code reuse:
– maintenance effort is multiplied by number of reports
– copy-paste-driven development
WHY do we need an ETL?
Balance / UPL calculation:
OLTP-like RC
• 46 lines
• 7 joins
• 3 levels
after RADIO
• 13 lines
• 3 tables
• 1 level
RADIO: Data flow
Routines:
Streams
•
• Propagate changes to RC - replica tables
Triggers
• Collect transaction info
• Collect changes – change log tables
Processing routines
• Read changes
• Analyze transaction info
• Denormalize, calculate additional fields
• Insert into denormalized tables
• O!
•
•
FLAG – extract rows/entities
for each event in transaction,
sort
MAIN – given event rows,
run actions
SNAP – when we have all TX
from OLTP snap – calculate
appropriate DN snap
Radio: Data flow
CLDao
CLOG
tables
FLAG
job
TX info
DN
tables
Action
MAIN
job
Dao
DNDao
PLSQL code generator
>600kB of pl/sql code:
• TX element row  create type as object
• resulting DN row  create type as object
• action for each event type  create type as object
Code maintenance is pain  use higher level language !!
JPLSQL = java+pl/sql: jsp-like parser for producing pl/sql.
• XML-based DB structure
• XML-based flag/action mapping
• power of Java
Streams, Triggers & CLOGs
• after trigger
• my equals
• duplicate scn
After trigger
• we keep TX apply state in package variables
• before trigger is invoked
• SUDDENLY!, transaction is rolled back –
package variables stay altered!
Use only after triggers
My equals
We need to filter changes, that happened in
columns we don’t collect. But what we do
with Oracle’s null ?
nvl(:new.val = :old.val,
:new.val is null and :old.val is null)
Simple inline in JPLSQL.
Streams duplicate SCN
Ingredients:
• several sessions
• several tables
• Streams replication
Apply process can produce message with same SCN.
Oracle BUG ID: ???
Processing
• Job control
• FLAG  MAIN communication
• MAIN
Processing: JOB control
• identification – how do we know a job is running ?
• communication – how do we communicate a job ?
– dbms_alert has implicit commits
– dbms_pipe is not compatible with RAC
• sleep – conditioned wait
Processing: FLAG
• 250 lines of SQL
• 300 lines of explain plan
• 1 kTX p/second
Processing: FLAG
FLAG
job
MAIN
job
TX info
FLAG computes:
– table of
• table of
– table of
» number
– event types
• event occasions
– table index for this event
» TX row id
3-dimensional table problems:
• ordering (no order by in collect statement in 10g)
• storing – nested table doesn’t preserve ordering
Processing: MAIN
What is different from Java
• object has default constructors – very useful for bulk creation
• encapsulation is bad - package method access is slower, than variable access
• reading from package variable is much, much faster, than reading from tables
 cache everything
Processing: MAIN
What is very different from Java
• object/record assignment works by value, not by reference
Processing: MAIN
Java-like toString:
• get all object fields using user_source view
• execute immediate …
• very useful for debugging
Processing: MAIN
Tom Kyte’s “when others” rule exception:
• we really want to catch all kind of errors:
– infrastructure logic
– business logic constraints
– Oracle internal errors
• we really want to stop after any error
Post-processing: Deployer
Relieves system engineers from deployment paint
•
•
•
•
Read installation bundle
Read DB objects
Compute difference
Build patch
Each object type has it’s own:
• create / change statement syntax
• system view structure
Post-processing: Deployer
Oracle has object dependencies:
• pl/sql depends on tables
• tables depend on user types
• user types depend on their parent types
Misc
• SNAP – Trade/RC scn mapping
• datatest – xmlforest, emails
• very slow dbms_output retrieval
Questions ?
Download