Advanced PowerCenter v7.1.1
www.infocrest.com, Copyright ©2004

Table of Contents
- Introduction and Objectives
- Mapping Techniques: Local Variables, Expression Transformation, Aggregator, Aggregate Functions, De-Normalizing, Lookup Transformation, Dynamic Lookups, Transaction Control, Transaction Based Processing, Joiner, Stored Procedure
- Debugger Tips
- Mapping Error Handling: Duplicate Keys, Audits
- Workflow Manager: Design, Sessions & Tasks, Worklets, Workflows, Triggers and Schedulers, Testing, Session and Server Variables
- Parameter Files
- Incremental Loads
- Partitioning: Architecture Review, Guidelines and Definitions, Partition Methods, Cache Partitioning, Limitations, Demo
- Performance & Tuning: Tuning 101, Target Bottlenecks, Source Bottlenecks, Mapping Bottlenecks, Session Bottlenecks, System Bottlenecks, Using Performance Counter
- Command Line Utilities: pmcmd, pmrep, pmrepagent
- Repository: MX Views, Viewing Dependencies, Performance, Historical Data, Documentation

Introduction

Informatica Corporation
- Founded in 1993
- Leader in the Enterprise Data Integration Platform market
- More than 1,500 customers
- Global presence, high growth
- Public since April 1999, symbol: INFA

Infocrest
- Training & knowledge transfer: creator of the Informatica Boot Camp
- INFA project reviews
- PM/PC implementation methodologies
- Founded in 1997 in Los Angeles, CA
- Principal: Jean Abi-Jaoudé
- Informatica Partner since 1997
- Fortune 100 clients (GE, Qualcomm, Disney, ...)

Class Objectives
- Advanced Mapping Techniques
- Mapping Error Handling
- Workflow Techniques
- Parameter Files
- Partitioning
- Performance and Tuning
- Command Line Utilities
- MX Views

Mapping Techniques

Transformation Objects
- Expression Transformation: local variables, forward and self references; using an expression with a Sequence Generator to achieve very large primary key sequences
- Aggregator Transformation: sorted input; aggregate functions; de-normalizer techniques
- Lookup Transformation: lookup caches; SQL overrides; dynamic lookups
- Transaction Control: dynamic commits; transaction based processing
- Joiner Transformation: using sorted input; self-joins
- Stored Procedures: pre and post load procedures; calling unconnected stored procedures

Local Variables
- Forward reference: a local variable can refer to a variable that is further down the port list, as in v_CustomerIDLagged taking the value of v_CustomerID before v_CustomerID is re-evaluated. Using this method, you can compare data from 2 or more consecutive rows.
- Self reference: a local variable can refer to itself, for example to keep the current value of v_EmployeeID if v_IsDupe is true.
- Un-initialized values: strings are empty (not a NULL value!), numbers are 0, dates are 01/01/1753.
- Availability: local variables are available in the Expression, Aggregator and Rank transformations.
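As an illustration of the forward and self references above, here is a minimal sketch of an Expression transformation port list; the port names follow the examples in this section, but the exact layout is an assumption. Variable ports are evaluated top to bottom for each row, so a port that reads a variable defined below it still sees that variable's value from the previous row:

    Port                Type       Expression
    CustomerID          Input
    EmployeeID          Input
    v_CustomerIDLagged  Variable   v_CustomerID
                                   -- forward reference: v_CustomerID has not been updated yet,
                                   -- so this still holds the previous row's CustomerID
    v_CustomerID        Variable   CustomerID
    v_IsDupe            Variable   IIF(v_CustomerID = v_CustomerIDLagged, 1, 0)
    v_EmployeeID        Variable   IIF(v_IsDupe = 1, v_EmployeeID, EmployeeID)
                                   -- self reference: keep the current value when the row is a duplicate
    o_IsDupe            Output     v_IsDupe

On the first row, v_CustomerIDLagged holds the un-initialized default (an empty string or 0), which is why the lagger mapping described next has to treat the first row specially.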
Expression Transformation: Lagger Technique
- Use when you need to make a decision or set a value based on the contents of 2 or more consecutive rows
- The Expression holds one or more field values in memory for one or more consecutive rows
- Makes use of the top to bottom order of evaluation for variable ports
- One input port, two variable ports and one output port are needed for each lagged field
- Add one variable port for each additional row you wish to hold in memory
- Use a variable forward reference to assign v_ShipCountry to v_ShipCountry_Lagged before v_ShipCountry's value is changed
- Usage example: one row holds a start time, the next row holds an end time; hold the first row in memory to compute the elapsed time

Expression Transformation: Lagger Mapping
- Needs a dummy 'pusher' row to push the last row of data out of the lagger expression
  - Flat files: add an extra row using a pre-session command task
  - Relational: add a UNION ALL statement in a SQL override
- Needs a Filter or Router group to remove the first row coming out of the lagger expression; this row holds empty (un-initialized) values from the expression
- Example mapping: a Source Qualifier with a UNION ALL SQL override, a lagger expression that detects duplicates with a 3-value flag (dupe, not dupe, first row), and a Router that directs the flow to the dupe and non-dupe targets and rejects the first row based on the flag value

Expression Transformation: Generating Very Large Surrogate Key Values
- The Sequence Generator is limited to a little over 2 billion values
- Use an expression to increment a multiplier when the sequence reaches a given threshold (see the sketch below)
- Use a mapping variable to store the multiplier value in the repository after each run
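One possible arrangement of that expression is sketched below; the port names, the $$KeyBlock mapping variable and the block size are assumptions, not part of the original material. It assumes a Sequence Generator set to Start Value 1, End Value 2,000,000,000 with Cycle enabled, and an integer mapping variable $$KeyBlock (Max aggregation) that remembers how many 2-billion blocks have already been consumed:

    Port        Type      Expression
    NEXTVAL     Input     -- from the Sequence Generator (Start 1, End Value 2000000000, Cycle on)
    v_Block     Variable  IIF(NEXTVAL < v_LastVal, v_Block + 1,
                              IIF(v_LastVal = 0, $$KeyBlock, v_Block))
                          -- first row: start from the block saved in the repository
                          -- later rows: move to the next block when the sequence wraps back to 1
    v_LastVal   Variable  NEXTVAL
    v_Persist   Variable  SETMAXVARIABLE($$KeyBlock, v_Block)
                          -- saved to the repository at the end of a successful run
    o_BigKey    Output    (v_Block * 2000000000) + NEXTVAL
                          -- decimal(28,0) output port; enable high precision in the session

The output key grows well beyond the 2-billion limit of the Sequence Generator, and each run picks up from the block the previous run left in the repository.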
Aggregator: Sorted Input
- Use it whenever you can: it improves pipeline performance
  - Only the rows needed for one group are kept in memory
  - All aggregations are done using RAM, not the disk cache
- You need to pre-sort the data using a SQL override in your source qualifier, a Sorter transformation or pre-sorted flat files
- You need to sort the data in the order of the 'group by' ports in your aggregator
- The Informatica Server will fail the session if the data is not sorted in strict ascending or descending order
- The Informatica Server will revert to unsorted input behavior when:
  - There is no 'group by' port in your aggregator
  - You select 'Incremental Aggregation' in the session properties
  - You have nested aggregate functions in your aggregator's expressions

Aggregator: Why Sorted Input Improves Performance
- The Informatica Server processes data in buffered stages
- The Read, Transform and Write stages normally operate simultaneously, feeding each other's buffers
- An unsorted Aggregator is a bottleneck: the Write stage must wait until all aggregation is done before processing any row
- (Diagram: Read, Transform and Write timelines for unsorted vs sorted input; with unsorted input the total elapsed time stretches because writing only starts after the Transform stage completes.)

Aggregate Functions
- Nesting functions
  - You can nest aggregate functions, as in MAX(SUM(order_amount))
  - The Informatica Server ignores 'sorted input' when you nest aggregate functions
  - You cannot mix nested and non-nested functions in the same aggregator
- Conditional clause
  - Use instead of an IIF statement for aggregate expressions
  - Syntax: FUNCT(input, [Boolean expression])
  - Example: AVG(price, price > 2000) computes the average price when price is above 2000
  - The input is taken into account for the aggregation if and only if the Boolean expression returns TRUE
- Availability: output ports only; no variable port can have an aggregate expression

De-Normalizing
- Definition: pivot the data to arrange it in multiple fields of a single row rather than one field in multiple rows
  Input rows:
    BookID  Language
    1324    English
    1324    French
    1324    Spanish
    1325    English
  De-normalized output:
    BookID  EnglishFlag  FrenchFlag  SpanishFlag
    1324    Y            Y           Y
    1325    Y            N           N
- Methods
  - Using the FIRST, LAST, MIN or MAX functions with a conditional clause: for simple cases
  - Using local variables with the forward and self reference techniques: for more complex aggregations or concatenation of string values
- Limitations: use only when the input has a known number of repeating rows, or a known maximum number of repeating rows

De-Normalizing: Using Aggregate Functions
- Use an aggregate function with a conditional expression to extract the value of the first or last row that matches the condition
  Input rows:
    Amount   Year  Quarter
    254,556  2003  first
    546,332  2003  second
    129,034  2003  third
    490,223  2003  fourth
    165,768  2004  first
    265,443  2004  second
    510,412  2004  third
    435,690  2004  fourth
  De-normalized output:
    Year  Q1_Amount  Q2_Amount  Q3_Amount  Q4_Amount
    2003  254,556    546,332    129,034    490,223
    2004  165,768    265,443    510,412    435,690
- In this case it does not matter which aggregate function is used; LAST, MAX or AVG would do just as well, since each group has exactly one row per quarter

De-Normalizing: Using Local Variables
- Local variables inside an Aggregator are evaluated for every row
- In the debugger you can only see the values for each group, if the aggregator uses sorted input
- Local variables are needed for custom aggregation, like string concatenations
- Example (book languages): group by BookID, use the forward reference technique to identify new groups, use expressions to set each language flag to 'Y' if the book is published in that language, using self reference to keep the previous setting unless it is a new group, and have output ports hold the value of each flag for the group
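A minimal sketch of the quarter pivot above, using an Aggregator grouped by Year; the output port names are assumptions:

    -- Aggregator, group by Year; one output port per quarter
    Q1_Amount:  FIRST(Amount, Quarter = 'first')
    Q2_Amount:  FIRST(Amount, Quarter = 'second')
    Q3_Amount:  FIRST(Amount, Quarter = 'third')
    Q4_Amount:  FIRST(Amount, Quarter = 'fourth')

The simple book-language flags could be built the same way, for example IIF(MAX(IIF(Language = 'English', 1, 0)) = 1, 'Y', 'N') grouped by BookID, when the conditional-aggregate form is enough.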
Lookup Transformation: Lookup Caches
- 1 - The Informatica engine issues a SELECT statement against the database when the first row arrives at the lookup; the result set is split into 2 caches
- 2 - The data cache holds the columns in the output ports
- 3 - The index cache holds the columns used in the lookup condition
- 4 - The caches are queried internally for each row passing through the lookup; the query is based on the lookup condition and the values of the lookup input ports

Lookup Transformation: Lookup Query
- Always look at the session log and review the SQL query issued to populate the lookup caches
- This query always includes an ORDER BY clause
  - The clause is generated automatically by Informatica and cannot be modified directly
  - Since every column is included in this ORDER BY, it can become very expensive
  - The only way around it is to override the SQL query, add your own ORDER BY and finish the statement with a comment delimiter (such as --)
  - Your custom ORDER BY must include the columns used in the lookup condition, in the order they appear in the condition tab
  - The engine expects rows to be sorted in the caches, so an ORDER BY clause is mandatory
- Session log extract:
    TRANSF_1_1_1> DBG_21097 Default sql to create lookup cache: SELECT OrderID,CustomerID FROM Orders ORDER BY CustomerID,OrderID
    TRANSF_1_1_1> DBG_21079 Creating Lookup Cache : (Thu Sep 02 07:58:34 2004)
    TRANSF_1_1_1> DBG_21682 Lookup table row count : 830
    TRANSF_1_1_1> DBG_21297 Lookup cache row count : 830

Lookup Transformation: Lookup Query (cont.)
- If you only need a subset of the lookup table, override the SQL and write a WHERE clause
- You can perform joins in the SQL override as long as the joined tables come from the same database
- You must use the syntax SELECT <column name> AS <port name>; your session will fail otherwise, even though the query may validate in the transformation
- Example: set the lookup condition to OrderCount >= Zero and leave the 'Zero' port unconnected
- You can use mapping variables and parameters in your override, for dynamic queries
- You have to cache the lookup when you have a SQL override, otherwise you will get a run-time error
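A sketch of a lookup SQL override that follows these rules; the WHERE filter and any column not shown in the log extract above are assumptions. Here the lookup condition uses CustomerID only, so the custom ORDER BY lists just that column and the trailing comment delimiter hides the ORDER BY that the server would otherwise append:

    SELECT OrderID    AS OrderID,
           CustomerID AS CustomerID
    FROM   Orders
    WHERE  OrderStatus = 'SHIPPED'
    ORDER BY CustomerID --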
Lookup Transformation: Sharing Lookup Caches
- Un-named caches
  - For multiple lookups on the same table in the same mapping
  - Automatically shared if the condition ports are the same (although the operators can be different)
  - The first lookup transformation must extract the same data set as the subsequent lookups, or a superset
  - Cannot share a dynamic lookup
- Named caches
  - To share lookups among sessions
  - Lookup structures must be identical (same conditions, same output ports)
  - SQL overrides must be identical
  - The session high precision setting must be the same for all sessions sharing the lookup files
  - You can mix dynamic and static shared lookups, but you cannot run them in simultaneous sessions
- Tip: make your shared lookup transformation reusable

Lookup Transformation: Shared Lookup Usage
- Data warehousing
  - Dimensions may need to be looked up several times during the load process
  - Optimize your workflow by having the first lookup create a named persistent lookup file
  - Alter your other lookups to read from these cache files
  - Example: an Orders lookup used 3 times, in sessions 1, 2 and 3; the first session must re-cache from the source
- Debugging
  - Sessions with large lookups are time-consuming to debug
  - If you cannot use a subset of the lookup data, make the cache persistent
  - The first run will have to create the cache files, but subsequent runs will be much faster

Dynamic Lookups
- Use when looking up a target table and you need to keep the lookup cache and target contents in sync
- Dynamic lookups can insert new rows or update existing rows in the cache (but cannot delete rows)
- Updated, inserted or unchanged rows are flagged by the lookup transformation via the NewLookupRow port
- The row type is not changed by the dynamic lookup; you need to set up Update Strategy transformations downstream to change the actual row type

Dynamic Lookups: Ports
- Select 'ignore in comparison' for fields you do not need to compare to determine the update status
- Select 'ignore null in comparison' so NULL source values do not trigger an update
- Lookup fields not used in the condition must be manually associated with their corresponding source field

Dynamic Lookups: Ports (cont.)
- You can associate a lookup field holding a surrogate key (integer or small integer) with the special Sequence-ID value, selected from the drop-down menu
- The startup value of Sequence-ID is the highest value found in the initial cache data for the surrogate key field, or 1 if the initial cache is empty
- This value is incremented every time a new row is added to the lookup cache
- You can use this feature in lieu of a Sequence Generator transformation

Dynamic Lookups: NewLookupRow Port
- Pass this field to a Router to branch out the data flow based on the insert/update/no-change status
- 3 possible values:
  - 0 = no change
  - 1 = row inserted in the lookup cache
  - 2 = row updated in the lookup cache

Dynamic Lookups: Properties
- Lookup caching must be enabled
- The dynamic lookup cache toggle automatically adds the NewLookupRow port when turned on
- When the lookup updates a row in the cache, you can choose to return either the old (before the update) or the new (after the update) values; useful for slowly changing dimensions, when you need both values

Dynamic Lookups: Properties (cont.)
- Updating the cache
  - The lookup cache can be updated if the incoming source row exists in the cache but the values of one or more fields differ between the source row and the cached row
  - By default, the dynamic lookup does not perform both insert and update operations on the lookup cache
  - You have to turn on the 'Insert Else Update' or 'Update Else Insert' property to enable this, depending on the row type coming into the transformation
- Insert Else Update: applies when the row type is 'insert' (the default)
  - Off: the lookup cache is not updated; only new source rows are inserted into the cache
  - On: both inserts and updates are performed
- Update Else Insert: applies when the row type is 'update' (you need an Update Strategy transformation upstream to change the row type to update)
  - Off: no lookup cache inserts are performed, only updates
  - On: both inserts and updates are performed
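A sketch of the downstream routing driven by NewLookupRow; the group names, and the choice to drop unchanged rows, are assumptions:

    -- Router group filter conditions on the NewLookupRow port
    INSERTS group:  NewLookupRow = 1
    UPDATES group:  NewLookupRow = 2
    -- unchanged rows (NewLookupRow = 0) fall through to the default group and can be dropped

    -- Update Strategy expression in each branch, so the actual row type matches the cache operation
    INSERTS branch:  DD_INSERT
    UPDATES branch:  DD_UPDATE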
Dynamic Lookups: Caveats
- SQL override with a WHERE clause (using a subset of data to build the lookup cache)
  - The lookup cache and the target may get out of sync if your source stream does not use the same filter as your WHERE clause
  - Example: if the SQL override includes a WHERE clause that excludes some ProductIDs, a Filter must exclude the same ProductIDs from the source
  - When using the Sequence-ID feature to generate surrogate keys, you may get duplicate key values if your lookup cache does not hold the entire data set
- Ignore NULLs in comparison
  - You may need an Expression transformation after the dynamic lookup to make sure you are not passing NULL values from the source into the target
- Speed penalty
  - Dynamic lookups are a bit slower than regular lookups because they have to update their cache contents on the fly
  - The biggest hit occurs when the lookup cache is paging to disk and there are many updated or inserted rows

Transaction Control: Overview
- Active (row level) transformation
- Enables dynamic commit or rollback points
- Works for relational targets or MQ Series targets
- Naming convention: TC_<what it does>

Transaction Control: Setting Up the Mapping
- Define a Boolean condition expression and a transaction type; the transaction type is executed when the expression evaluates to TRUE
- Transaction constants
  - TC_COMMIT_BEFORE, TC_ROLLBACK_BEFORE: commit or roll back the rows of the current transaction, not including the current row; initiate a new transaction starting at the current row
  - TC_COMMIT_AFTER, TC_ROLLBACK_AFTER: commit or roll back the rows of the current transaction, including the current row; initiate a new transaction starting after the current row
  - TC_CONTINUE_TRANSACTION: do not commit or roll back at this point

Transaction Control: Setting Up the Session
- The commit type is set to 'user defined' automatically when you have a valid Transaction Control transformation
- Commit on end of file is checked by default; otherwise the server rolls back leftover transactions if the last row does not coincide with a commit point
- You can tell the server to roll back transactions from the last commit point when it encounters non-fatal errors

Transaction Based Processing
- Transformation Scope property
  - Available for the Aggregator, Sorter, Joiner and Rank transformations
  - If 'Transaction' is selected, the transformation logic only applies to rows within the same transaction
- Example scenario
  - Source: a very large flat file containing order details at the line item level, already sorted by order id
  - Target: a 'shipments' table containing one row per order per ship date
  - Challenge: you need to aggregate the data by order id and ship date to populate this table; the file is huge and you want to avoid sorting the entire data set in the aggregator or sorting the data in a pre-process
- Transaction based solution
  - The data is already sorted by order id; any additional sorting can be done within the relatively small set of order lines that comprise each order
  - Each set of line items triggers a new transaction, and the scope of the Sorter and the Aggregator is a single transaction
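A sketch of the Transaction Control condition used by this kind of solution; the o_NewOrderFlag port is an assumed output of an upstream lagger Expression that compares the current order id with the previous row's order id (a change marks the first row of a new order):

    IIF(o_NewOrderFlag = 1, TC_COMMIT_BEFORE, TC_CONTINUE_TRANSACTION)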
Transaction Based Processing: Example Mapping
- Expression to detect a change in order ids: compare the current order id with the previous order id; a change means we have the first row of a new order
- Transaction Control: set the type to TC_COMMIT_BEFORE every time a new order id is encountered
- Sorter: sort by order id and ship date, with the transformation scope set to 'Transaction'
- Aggregator: aggregate by order id and ship date, with sorted input on
- A second Transaction Control to reset the commit point to target based
- Sample input: sorted by order id but not by ship date
- Sample output: one row per distinct order id and ship date, ordered by order id and ship date

Joiner: Sorted Input
- Sorted Input not selected
  - Caches the master table rows in memory (or on disk)
  - Brings in detail rows one at a time and compares them to the master rows, depending on your join type and condition
  - You get better performance if the master table is the smaller of the two sources, even if, logically, this table is not the master
- Sorted Input selected
  - Works only if both sources are pre-sorted
  - Saves time by not having to cache the entire master source
  - If your sources are not sorted, you can use a SQL override or a Sorter transformation before the Joiner
  - Pulls rows from both sources and compares them using the join type and condition
  - Fastest way to join if your sources are already sorted; otherwise weigh the cost of using a database sort or a Sorter transformation

Joiner: Caching
- Specify a cache directory on a separate disk to reduce I/O contention
- Use RAM cache only (no disk paging) for best performance
- Monitor the cache files on the server while the session is running
- Session log extract:
    TRANSF_1_1_1> DBG_21077 Create joiner cache on master relation : (Wed Sep 08 05:31:04 2004)
    TRANSF_1_1_1> CMN_1690 Created new data file [D:\Program Files\Informatica PowerCenter 7.1\Server\Cache\PMJNR530_19.dat] and index file [D:\Program Files\Informatica PowerCenter 7.1\Server\Cache\PMJNR530_19.idx] for Joiner [JNR_OldNew].
    TRANSF_1_1_1> DBG_21214 Finished joiner cache on master relation : (Wed Sep 08 05:31:04 2004)
    TRANSF_1_2_1> DBG_21603 Open master relation cache for detail joiner : (Wed Sep 08 05:31:04 2004)
    TRANSF_1_2_1> DBG_21215 Finished processing detail relation : (Wed Sep 08 05:31:04 2004)

Joiner: Self-Join
- Join two instances of the same source, for example: aggregate dollars per company, then join the source data to the aggregated data
- Or join two branches of the same pipeline

Joiner: Self-Join (cont.)
- To join two branches of the same pipeline, you must use sorted data and turn on the sorted input property
- If you cannot use sorted data, you must read your source twice and join the two pipelines
- You can also join the output of multi-group transformations such as the Router, XML source qualifier or Custom transformation
- If you join 2 branches with an Aggregator in one branch, the Aggregator must use sorted input as well
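A sketch of the pipeline-branch self-join described above; the table, port and group-by names are assumptions. Both branches must arrive sorted on the join key, and the Aggregator in the aggregating branch must use sorted input:

    -- Source Qualifier SQL override (or a Sorter) so both branches are sorted on the join key:
    SELECT CompanyID, InvoiceAmount
    FROM   Invoices
    ORDER BY CompanyID

    -- Branch 1: detail rows go straight to the Joiner detail side
    -- Branch 2: Aggregator (sorted input, group by CompanyID)
    TotalDollars:  SUM(InvoiceAmount)
    -- Joiner (sorted input), join condition:  CompanyID = CompanyID1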
Stored Procedure: Pre and Post Load Stored Procedures
- Connections
  - The $Source connection alias is automatically selected for the Source Pre and Post-Load types
  - The $Target alias is selected for the Target Pre and Post-Load types
  - When changing from a Source to a Target type, the corresponding connection alias is restored, overwriting any custom relational connection you might have selected
- Properties tab: Stored Procedure Type and Call Text
  - The Call Text accepts only hard-coded values; mapping parameters or variables will not be expanded
  - Use Pre/Post SQL if you want dynamic stored procedure calls
  - Use double quotes if your input parameters include spaces; single quotes are ignored

Stored Procedure: Pre and Post Load Stored Procedures (cont.)
- Execution plan
  - Used to select the execution order of procedures having the same type; the interface is similar to the Target Load Plan function
  - 1 - Choose Stored Procedures Plan… in the Mappings menu
  - 2 - Move stored procedures within a group using the up and down arrows
  - 3 - The Execution Order property is updated to reflect the execution plan
- Execution
  - Source Pre Load and Target Pre Load types run before the mapping starts reading rows, similar to a Pre SQL statement
  - Source Post Load and Target Post Load types run after the mapping has finished writing to the target(s), similar to Post SQL statements
- Debugging
  - You need to use a session instance (within a workflow) to have the debugger execute your Pre/Post Load procedures
- Session error handling
  - Either Stop or Continue, set under the Config Object tab, Error Handling
  - If you choose Stop and your session fails on a Post Load procedure, your target rows are already committed

Stored Procedure: Calling Normal Unconnected Stored Procedures
- Using PROC_RESULT: one output parameter, the return value is lost
- Using a local variable: one output parameter, the return value is kept; the local variable and the associated output parameter must have the same datatype
- Using PROC_RESULT and a local variable: two output parameters

Stored Procedure: Calling Normal Unconnected Stored Procedures (cont.)
- Nesting stored procedure calls using PROC_RESULT
  - The output of the innermost procedure is passed as an input parameter to the outermost procedure
  - You can use PROC_RESULT once per procedure call
- Within a conditional expression: your expression must evaluate to TRUE for the procedure to be called
- Within an expression attached to an output port: the port must be connected forward into the data flow or your procedure will not run
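A few sketches of the :SP call syntax for these cases; the procedure and port names are assumptions:

    -- one output parameter captured through PROC_RESULT (the procedure's return value is lost):
    :SP.GET_CUSTOMER_NAME(in_CustomerID, PROC_RESULT)

    -- conditional call: the procedure runs only for rows where the condition is TRUE
    IIF(ISNULL(in_CustomerName), :SP.GET_CUSTOMER_NAME(in_CustomerID, PROC_RESULT), in_CustomerName)

    -- nested calls: the inner result becomes an input parameter of the outer procedure,
    -- with PROC_RESULT used once per procedure call
    :SP.FORMAT_NAME(:SP.GET_CUSTOMER_NAME(in_CustomerID, PROC_RESULT), PROC_RESULT)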
Stored Procedure: Normal Stored Procedures, Session Error Handling
- Fatal errors in connected or unconnected normal procedures will always cause the session to fail
- Non-fatal errors increment the session error counter; your session will fail when this counter reaches the threshold set in the session properties
- Non-fatal error rows are skipped by the server and written to the session log and the reject file

Debugger Tips

Debugger Tips: Reusable Sessions
- Create a reusable session to debug your mapping when practical; use it for both the debugger and the workflow
- Cuts down on development time, especially if you have lots of sources and targets
- Many parameters are not available in simple debugger sessions
- Using existing sessions
  - Choose a session instance to run the debugger within the context of a workflow
    - If you want to test mapping or workflow parameters and variables
    - If you want your debug session to execute Pre or Post Load stored procedures
    - To run session components such as command tasks or emails
  - Choose a session instance or a reusable session to test all session attributes not available with a simple debug session

Debugger Tips: Drawback
- Remember to validate and save the reusable session when you make a big change in your mapping (adding a source or target, adding a transformation)
- Sometimes you will have to disconnect and reopen your folder in the Workflow Manager to register mapping changes
- Beware of overridden parameters, like SQL overrides
  - A session override takes precedence over the mapping
  - A session instance override takes precedence over the reusable session object
- Configuration objects
  - You cannot specify a line buffer size or a block buffer size in a debug session
  - But you can create a configuration object with the settings you want and use it with your debug session

Debugger Tips: Workspace
- Organize your workspace to display
  - Debug and session logs at the bottom, using the full window length
  - Transformation instances and the target instance above, side by side
- Remember to switch to the session log pane after the debug session is initialized
- First row
  - Monitor the time it takes to bring in the first row of data; is it acceptable?
  - If not, review the SQL query
  - Or a flat file source may be rejecting all the rows (line buffer too small, wrong date/time conversions)

Debugger Tips: Source Data
- Always examine source data closely during the first run
  - Look for bad or unexpected data formats (date/time, amounts)
  - Look for truncated data
  - Make sure the data is indeed what you expect

Debugger Tips: Data Movement
- Follow data through each transformation during the first run
  - Pay attention to unnecessary type conversions and back and forth type conversions
  - Verify the logic in each complex expression
- Look at how the data moves within your mapping: do you have too many data paths? Is there a way to simplify the logic?
- Record the time it takes to load cached lookups
  - Review the SQL query in the session log
  - Do you need to bring in all the fields in the lookup?
  - You may want to override the ORDER BY clause inserted automatically by the server
  - Do you have a complex join condition? Make sure the conditions with an equal sign (=) are put at the top of the list
Debugger Tips: Evaluate Expression
- Find out the value of a port or a variable, or enter your own expression to evaluate
- Not available for every transformation, only for: Aggregator, Expression, Filter, Rank, Router, Update Strategy
- Results are displayed between square brackets, as in [VALUE]
  - Easy to spot unexpected padded char values, such as [VALUE ]
  - Unexpected padded values cause trouble in comparisons, lookups and expressions
- Find out the value of a mapping parameter or variable
  - Start value: enter the name of the variable or parameter
  - Current value: use the variable or parameter with a variable function (SETMAXVARIABLE, SETMINVARIABLE, SETCOUNTVARIABLE)

Debugger Tips: Discard or Not Discard?
- When discarding target data, the Informatica Server does NOT
  - execute pre or post SQL commands against the target database
  - write sequence generator values to the repository
  - truncate target tables
  - prepare insert or delete SQL statements against the targets
  - verify that the target tables actually exist in the target database
- Choose to write target data for a true test of the target system
- The debugger only commits rows to the database at the end of a completed (successful) debug session
- Debugger shutdown: if the debugger is running, click the 'Break Now' icon first for a faster response

Mapping Error Handling
1. Guidelines
2. Overview of the most common errors
3. Handling data errors
4. Handling duplicate key errors
5. Example of an error table
6. Performing audits

Error Handling: Guidelines
- Data quality
  - Develop a set of standards or adopt existing company standards
  - Standards should include
    - The definition of an error
    - Error handling procedures in your mappings and sessions
    - What data gets rejected and why
    - How data gets rejected or corrected
    - What to do with rejected data: ignore, save for manual review, or save for automated reload
- Define your error strategy
  - Errors are inevitable; a good error strategy tries to:
    - Catch errors and reject bad rows
    - Store these rows somewhere
    - Set a threshold for an acceptable number of errors
    - Provide a way to report on rejected rows as part of the load process

Typical Errors
- Data errors: incorrect or corrupted source data; unexpected values; data conflicts with business rules; referential integrity failure
- Database errors: a row violates primary key constraints; inserting a NULL value in a non-null field
- Session errors: exceeds the error threshold; wrong connection settings or line buffer too small; insufficient resources (disk space or memory)
- Dependency errors: a session or a job on which your session depends did not run properly; missing trigger file
- Time constraints: the load process did not terminate within the allotted time window
- Audit/balance errors: row counts or sum totals do not match; data mismatch between production systems and the data warehouse
- Server errors: the Informatica server or database server goes down; bad network connections; hardware failure

Data Errors: Handling in Mappings
- At the field level
  - Program defensively; do not leave open doors or loose ends
  - Handle NULL values, blanks or zeroes consistently and in accordance with your error strategy
  - Use consistent default values for cases such as missing or blank data, an invalid or NULL lookup return value, or invalid or corrupted source data; make sure these default values are understood and approved by your users
  - Override the default ERROR() function for output ports with a constant value or expression
- At the row level
  - Use a custom ERROR() function to force a row to be skipped and to describe the details of the error condition in the session log
    - Beware: this method can throw off a Joiner transformation downstream
    - Rows rejected this way will not be written to the reject file
  - Transfer rejected rows to a dedicated error table or error file, for automated reloads and for audit purposes; this is the most flexible solution but also the most expensive in terms of work and processing overhead
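Two small sketches of these techniques; the port names and the message text are assumptions:

    -- field level: replace the output port's default value, ERROR('transformation error'),
    -- with a constant so a bad value no longer skips the whole row
    0

    -- row level: skip the row and log the reason when a business rule is violated
    IIF(ISNULL(in_CustomerID),
        ERROR('Row skipped: CustomerID is NULL'),
        in_CustomerID)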
Duplicate Keys: Informatica Server Behavior
- During a normal load (row by row)
  - Rows that violate database constraints are rejected by the database
  - Rejected rows are written to the reject file (.bad); details also appear in the session log (normal tracing)
  - Error handling is automatic, but there is a big performance hit if there are more than a few errors
- Using external loaders
  - The load may succeed but the indexes will be left in an unusable state
  - Rejected rows are written to the .ldrreject file; error details may be found in the .ldrlog file

Duplicate Keys: Handling
- If performing full refreshes: whenever possible, eliminate duplicates at the source, before the load
- If performing updates: load in normal mode and let Informatica handle it, then run a post-process audit to correct rejected rows
- Alternative, when performing full refreshes or updates: be proactive and catch potential duplicates inside your mapping; this is the most expensive solution, so reserve it for critical tables

Duplicate Keys: Mapping Solutions for Full Refresh
- 1 - Relational source: duplicate rows are routed to an error table; normal mapping processing continues downstream
- 2 - Flat file source: duplicate rows are destroyed and the last row of each duplicate group is sent to the target; normal mapping processing continues downstream

Audits: Create a Versatile Error Table
- Should provide storage for all types of audit errors: primary key errors, data errors, row count errors, control totals errors
- Not tied to one particular source or target table format; should be reused by all mappings that perform audits
- One row per error, with all data values displayed as strings
- Includes a concatenated value of the primary key fields to reference the source or target system
- Source and target names are fully qualified
- Identifies the process (mapping) that wrote rows to the table
- Layout (field name, data type, precision):
    SourceName        String     32
    TargetName        String     32
    SourceFieldName   String     32
    SourceFieldValue  String     500
    TargetFieldName   String     32
    TargetFieldValue  String     500
    PrimaryKeyConcat  String     100
    ErrorType         String     16
    AuditProcess      String     64
    InsertDate        Date/time  19
- Truncate the error table before each load; if you need to keep data for several days, archive the daily table at the end of each load process
- Run an automated report after each load: total number of errors per error type, and a list of the sources and targets having triggered an error

Audits: Row by Row Audit
- For critical tables, when extracting data from production systems to the data warehouse
- The source must be unchanged or staged for the audit to work
- A Joiner is needed only when the source and target are in different databases, or the source is a flat or VSAM file
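A DDL sketch of the error table laid out above; the table name and the exact database types are assumptions (the date/time precision of 19 maps to a DATE or TIMESTAMP column, depending on the database):

    CREATE TABLE LOAD_AUDIT_ERRORS (
        SourceName        VARCHAR(32),
        TargetName        VARCHAR(32),
        SourceFieldName   VARCHAR(32),
        SourceFieldValue  VARCHAR(500),
        TargetFieldName   VARCHAR(32),
        TargetFieldValue  VARCHAR(500),
        PrimaryKeyConcat  VARCHAR(100),
        ErrorType         VARCHAR(16),
        AuditProcess      VARCHAR(64),
        InsertDate        DATE
    );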
Workflow Manager

Assembling Workflows
- Step 1: design the flow, analyze dependencies and devise the error strategy
- Step 2: gather unit-tested mappings into reusable sessions
- Step 3: assemble load stages and/or subject areas into worklets
- Step 4: assemble worklets into the workflow and implement workflow level extra functionality
- Step 5: implement trigger mechanisms and the scheduler
- Step 6: test the worklets, workflow, triggers and scheduler

Step 1 - Design: Large Workflows
- Complex workflows handling complete loads offer more flexibility and functionality than a collection of small workflows
- The new Workflow Manager interface makes it easy to design large and complex load processes
- The Workflow Manager workspace window is self-documenting
- The Workflow Monitor will give you a very good idea of the load's progress, using the Gantt chart view
- Exception: if you have an external scheduler, you can either
  - Run single-session workflows and let the scheduler handle the Informatica job dependencies
  - Run more complex workflows from your scheduler and let the workflows handle the job dependencies

Step 1 - Design: The Big Picture
- Analyze your sources and targets to determine your dependencies at the mapping level
- Devise triggering mechanisms to start the process when the sources are ready
- Use a diagram to visualize the flow of data
- Design your error strategy at the session level and workflow level
  - What to do when a session fails
  - Audit and balancing procedures to capture data discrepancies
- Think about production issues
  - Who will get notified when an error occurs, and by what means
  - Design restartability into your workflows, worklets and mappings

Step 2 - Sessions: Reusable Sessions
- After a session is unit-tested, move it to the project folder as a reusable object
- Reusable sessions are easier to manage
  - Easy to copy between folders
  - Can be exported/imported as XML objects
  - Easy to integrate into worklets and nested worklets
- All reusable sessions are conveniently listed in the Navigator, under the 'Sessions' folder

Step 2 - Sessions: Overriding Reusable Session Instances
- Reusable sessions provide an additional level of override at the instance level, either in a workflow or a worklet
- The instance override takes precedence over the master object
  - If you change a property in the master object and that property is overridden at the instance level, the override will still be in effect
  - You must use the Revert button to cancel an override, even if the value of the overridden property is the same as the value in the master object
  - Example: the target connection in an instance has been changed; click Revert to return to the connection specified in the original session object
- Transformation properties cannot be overridden at the instance level
Step 2 - Sessions: Error Strategies
- Use a reusable session configuration object to set the error threshold, the number of session logs to save, the log tracing level, and so on
  - 1 - Define common error handling behavior in a session configuration object
  - 2 - In the session, pick your reusable session config from the menu of available configurations
- Use a reusable email task in your session components to send pre-formatted messages on failure or success
  - 1 - Create an email task you can reuse for each session failure
  - 2 - Attach this email to your sessions as a reusable component

Step 2 - Sessions and Tasks: Error Strategies
- Use the task general properties to implement error handling at the task level
- Fail parent if this task fails
  - Changes the end status of the enclosing Workflow or Worklet to failed when the task fails
  - Simply marks the enclosing Workflow or Worklet as failed; it does not stop or abort it
  - Use when your scheduler relies on a failed status code to flag an error at the workflow level
- Fail parent if this task does not run
  - Changes the end status of the enclosing Workflow to failed when the task does not run
  - For instance, if a link condition between the previous task and this task evaluates to FALSE
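A sketch of such a link condition; the session name is an assumption. If the condition evaluates to FALSE, the downstream task does not run, and with 'fail parent if this task does not run' checked the enclosing workflow ends with a failed status:

    $s_m_Load_Orders.Status = SUCCEEDED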