My experience building a custom ETL system Problems, solutions and Oracle quirks or How scary Oracle can look for a Java developer Agenda • WHY do we need an ETL? • HOW it works • Experience: – task – existing solutions? – problem or Oracle quirk – my solution WHY do we need an ETL? • OLTP - in most cases calculation is non-trivial: – SQLs grow in size & complexity – increased maintenance effort – poor SQL performance • business values calculation is implemented in most of reports independently - no code reuse: – maintenance effort is multiplied by number of reports – copy-paste-driven development WHY do we need an ETL? Balance / UPL calculation: OLTP-like RC • 46 lines • 7 joins • 3 levels after RADIO • 13 lines • 3 tables • 1 level RADIO: Data flow Routines: Streams • • Propagate changes to RC - replica tables Triggers • Collect transaction info • Collect changes – change log tables Processing routines • Read changes • Analyze transaction info • Denormalize, calculate additional fields • Insert into denormalized tables • O! • • FLAG – extract rows/entities for each event in transaction, sort MAIN – given event rows, run actions SNAP – when we have all TX from OLTP snap – calculate appropriate DN snap Radio: Data flow CLDao CLOG tables FLAG job TX info DN tables Action MAIN job Dao DNDao PLSQL code generator >600kB of pl/sql code: • TX element row create type as object • resulting DN row create type as object • action for each event type create type as object Code maintenance is pain use higher level language !! JPLSQL = java+pl/sql: jsp-like parser for producing pl/sql. • XML-based DB structure • XML-based flag/action mapping • power of Java Streams, Triggers & CLOGs • after trigger • my equals • duplicate scn After trigger • we keep TX apply state in package variables • before trigger is invoked • SUDDENLY!, transaction is rolled back – package variables stay altered! Use only after triggers My equals We need to filter changes, that happened in columns we don’t collect. But what we do with Oracle’s null ? nvl(:new.val = :old.val, :new.val is null and :old.val is null) Simple inline in JPLSQL. Streams duplicate SCN Ingredients: • several sessions • several tables • Streams replication Apply process can produce message with same SCN. Oracle BUG ID: ??? Processing • Job control • FLAG MAIN communication • MAIN Processing: JOB control • identification – how do we know a job is running ? • communication – how do we communicate a job ? – dbms_alert has implicit commits – dbms_pipe is not compatible with RAC • sleep – conditioned wait Processing: FLAG • 250 lines of SQL • 300 lines of explain plan • 1 kTX p/second Processing: FLAG FLAG job MAIN job TX info FLAG computes: – table of • table of – table of » number – event types • event occasions – table index for this event » TX row id 3-dimensional table problems: • ordering (no order by in collect statement in 10g) • storing – nested table doesn’t preserve ordering Processing: MAIN What is different from Java • object has default constructors – very useful for bulk creation • encapsulation is bad - package method access is slower, than variable access • reading from package variable is much, much faster, than reading from tables cache everything Processing: MAIN What is very different from Java • object/record assignment works by value, not by reference Processing: MAIN Java-like toString: • get all object fields using user_source view • execute immediate … • very useful for debugging Processing: MAIN Tom Kyte’s “when others” rule exception: • we really want to catch all kind of errors: – infrastructure logic – business logic constraints – Oracle internal errors • we really want to stop after any error Post-processing: Deployer Relieves system engineers from deployment paint • • • • Read installation bundle Read DB objects Compute difference Build patch Each object type has it’s own: • create / change statement syntax • system view structure Post-processing: Deployer Oracle has object dependencies: • pl/sql depends on tables • tables depend on user types • user types depend on their parent types Misc • SNAP – Trade/RC scn mapping • datatest – xmlforest, emails • very slow dbms_output retrieval Questions ?