Step 3 - Worklets
Logical associations
- Assemble tasks that form a logical step into a worklet, e.g.: Load Staging, Load Dimensions, Load Facts…
Parallel paths
- Avoid the temptation of connecting all your sessions directly to the worklet start task, even if they can run concurrently
- Know your system's glass ceiling and build accordingly
- Start with one path or two parallel paths and build up from there, while watching your system performance
- (Figure: start with one or two parallel paths and add more if your hardware permits; don't start with every session fanned out from the start task)

Step 3 - Worklets
Usability
- Build your worklets as restartable units
- Nest worklets:
  - for groups of interdependent sessions
  - for sessions that must be executed in a prescribed order, for instance a session doing a batch delete that must precede the session doing the batch insert
Variables
- A worklet-level variable can be set to the value of a workflow-level variable
- A workflow-level variable cannot be set to the value of a worklet-level variable
- Use the worklet task's Parameters tab to assign the value of a workflow variable to a worklet variable

Step 4 - Workflow
Flow
- Join your worklets to form the master workflow
- Link worklets in series when:
  - worklets already contain parallel paths
  - worklets must be executed in sequence
Functionality
- Add workflow-level functionality:
  - timed notifications
  - success email
  - suspend-on-error email
- Modify the default links between worklets: set links to return false on error

Step 4 - Workflow
Error handling: Suspend on Error
- Stops the workflow if a task errors
- Only stops the tasks in the same execution path; other paths still run
- Can send an email on suspension; the email task must be reusable, and only one email is sent per suspension
- Restart the workflow from the Workflow Monitor after the error is fixed
- The task that suspended the workflow is restarted from the top, unless you specified recovery for the failed session; in that case you can use 'Recover Workflow From Task'
- Works well with fully restartable mappings

Step 4 - Workflow
Error handling: using task properties and link conditions
- The link condition between a failed task and the next task must evaluate to FALSE if you don't want the workflow to continue
- If a link evaluates to FALSE in a workflow with multiple execution branches, only the branch affected by the error is stopped
- (Figure: 'Fail parent if this task fails' is set on the worklet, and the outgoing link tests PrevTaskStatus=SUCCEEDED; this stops the execution of the top branch if the first worklet fails, and marks the workflow as failed when the bottom branch completes)

Step 4 - Workflow
Error handling: using link conditions and Control tasks
- Use a Control task if you want to stop or abort the entire workflow or an entire worklet
- As with session properties, setting the Control task to 'Fail Parent…' only marks the enclosing object as failed but does not stop or abort it
- (Figure: the success link tests PrevTaskStatus=SUCCEEDED, and a second link testing PrevTaskStatus=FAILED leads to a Control task set to 'Stop top-level workflow'; this stops the execution of both branches, but sets the final status of the workflow to stopped, not failed)
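For reference, link conditions use the workflow expression language with predefined task variables; a minimal sketch with a hypothetical worklet name (wklt_LoadStaging), assuming a Control task set to 'Stop top-level workflow' on the failure branch:

    -- Link to the next worklet: continue only on success
    $wklt_LoadStaging.PrevTaskStatus = SUCCEEDED

    -- Link to the Control task: fires only when the worklet failed
    $wklt_LoadStaging.PrevTaskStatus = FAILED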
Step 5 - Triggers
Are the sources available?
- You will probably need to wait for some kind of triggering event before you can start the workflow
Event Wait
- This task is the easiest way to implement triggers
- Trigger files must be directed to the Informatica server machine
- The 'Delete Filewatch File' property removes the trigger file when checked; if you want to archive trigger files instead of deleting them, leave the property cleared and add a command task that moves the file

Step 5 - Triggers
Waiting for multiple source systems
- One workflow branch for each system, each with an Event Wait task
- Re-join the branches at a Decision task or Event Raise task (optional, but cleaner)
- Make sure the task where all branches rejoin has the 'Treat links as' property set to 'AND' (the default)

Step 5 - Scheduler
Availability
- Reusable, as a special object
- Non-reusable, under Workflows > Edit > Scheduler

Step 5 - Scheduler
Basic properties
- (Figure: a schedule that runs every 15 minutes, starting 3/7/03 15:21 and ending 3/24/03)
- Other options: start when the server starts; not scheduled (the default mode); run again as soon as the previous run completes; calendar-based run windows

Step 5 - Scheduler
Custom repeats
- Repeat frequency
- Repeat any day of the month, or several days a month
- Repeat any day of the week, or several days a week
- (Figure: a schedule that repeats every last Saturday of the month)

Step 5 - Scheduler
Custom repeats
- The scheduler cannot specify a time window within a day (e.g. run every day between 8 PM and 11 PM)
- For this, use a link condition between the Start task and the next task, and schedule the workflow to run continuously or every (n) minutes
- (Figure: a link condition that runs the rest of the workflow only if it started between 8:00 and 10:59 PM)

Step 6 - Testing
One piece at a time
- Verify the functionality of your worklets using the 'Start Task' command
Testing worklet tasks
- To test and run individual tasks within a worklet, copy all the tasks in the worklet and paste them into a new, empty workflow
- These test workflows can also be used in production, if you have to rerun a worklet or part of a worklet while the main workflow is still running

Step 6 - Testing
Gantt chart
- Monitor this view and the server performance at the same time
- Identify workflow bottlenecks quickly (candidates for partitioning)
- Monitor session performance when sessions run concurrently

Session and Server Variables

Session & Server Variables
Session variables
- Some session properties can be parameterized with the following variable names: $DBConnection_Name, $BadFile_Name, $InputFile_Name, $OutputFile_Name, $LookupFile_Name, $PMSessionLogFile
- Use parameters in session properties to override:
  - source, target, lookup or stored procedure connections
  - source, target or lookup file names
  - reject file names
  - session log file names
- Session parameters do not have default values; you must provide a value in a parameter file or the session will fail

Session & Server Variables
Using session variables in session properties
1. Use parameter names in the session properties
2. Specify a parameter file name in the general properties
3. Add an entry for each parameter used in the session properties:

    [ses_BrowserReport]
    $InputFile_dailylog_part1=daily_1.log
    $InputFile_dailylog_part2=daily_2.log
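One parameter file can serve several sessions, one [section] per session; a minimal sketch, where the second session (ses_LoadFacts) and the connection name (DW_PROD) are hypothetical:

    [ses_BrowserReport]
    $InputFile_dailylog_part1=daily_1.log
    $InputFile_dailylog_part2=daily_2.log
    $DBConnection_target=DW_PROD

    [ses_LoadFacts]
    $DBConnection_target=DW_PROD
    $BadFile_facts=facts_rejects.bad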
Session & Server Variables
Server variables
- Specify the default location of various folders on the Informatica server machine, such as:
  - root directory
  - session log directory
  - cache and temp directories
  - source, target and lookup file directories
  - external procedures directory
- Also provide default values for the following properties:
  - success or failure email user
  - session and workflow log count
  - session error threshold
- These variables are set at the server level and cannot be overridden with a parameter file

Session & Server Variables
Using session and server variables in session components
1. Select a session component, either non-reusable or reusable
2. Use either session or server parameters within the command
3. Add an entry for each session parameter used:

    [ses_BrowserReport]
    $InputFile_dailylog_part1=daily_1.log
    $InputFile_dailylog_part2=daily_2.log
    $PMSessionLogFile=ses_BrowserReport.log

Parameter Files
Workflow parameter files
- Use to override workflow or worklet user-defined variables
- Can also contain values for parameters and variables used in sessions and mappings within the workflow
- The path to the parameter file can be set in the workflow properties or provided on the 'pmcmd' command line; if you use both, the command-line argument takes precedence

Parameter Files
Format

    [Heading]
    parameterName=parameterValue

- Heading formats:
  - Workflow: [folderName.WF:workflowName]
  - Worklet: [folderName.WF:workflowName.WT:workletName]
  - Nested worklet: [folderName.WF:workflowName.WT:workletName.WT:nestedWorkletName]
  - Session: [folderName.WF:workflowName.ST:sessionName] (the workflow name is required if the session name is not unique in the folder), [folderName.sessionName] (the folder name is required if the session name is not unique in the repository), or [sessionName]
- Names in headings are case-sensitive; parameter and variable names are not
- Strings are not quoted
- Default date formats: mm/dd/yyyy hh24:mi:ss and mm/dd/yyyy
- Prefix mapplet variables with the mapplet name
- Example:

    [tradewind.WF:WKF_BrowserReport.ST:SES_loadFacts]
    $$lastLoadDate=10/2/2003 23:34:56
    $$Filter=California,Nevada
    Fact_Mapplet.$$maxValues=78694.2

Incremental Load
Using mapping variables for incremental load
- Speeds up the load by processing only the rows added or changed since the last load
- Requires a good knowledge of the source systems, to figure out exactly what a new or changed row is
- You can use Informatica's mapping variables to implement incremental load parameters; Informatica updates the variables in the repository upon completion of the load process
- For added safety, save the variables in parameter files: our process only updates values in the parameter file if the data is valid (balanced); if not, we can rerun the load with the old parameter values
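A minimal sketch of the technique, assuming a hypothetical mapping variable $$LastLoadDate (aggregation type Max) and a hypothetical source column LAST_UPDATE_TS: the source filter picks up only the new or changed rows, and an expression port raises the high-water mark as rows flow through.

    -- Source Qualifier source filter (the variable expands as text)
    Orders.LAST_UPDATE_TS > TO_DATE('$$LastLoadDate', 'MM/DD/YYYY HH24:MI:SS')

    -- Expression transformation, variable port: raise the high-water mark
    SETMAXVARIABLE($$LastLoadDate, LAST_UPDATE_TS)

The parameter file entry, rewritten by the post-load process only when the load balances:

    [tradewind.WF:WKF_BrowserReport.ST:SES_loadFacts]
    $$LastLoadDate=10/2/2003 23:34:56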
Partitioning

Informatica Server Architecture
1. The session is started
2. The load manager finds a slot for the session
3. The load manager starts the DTM
4. The DTM fetches the session's mapping from the repository
5. The DTM creates and starts the stage threads

DTM Architecture
Efficient, multi-threaded architecture
- Data is processed in stages; each stage is buffered, and stage processes overlap
- There is one DTM process for every running session (the process runs as pmdtm):
  - at least one thread per reader, transformation and writer stage
  - other threads to control and monitor the overall process
- Data is being transformed as more data is being read; data is being written as more data is being transformed
User control
- As a user you have control over many performance aspects:
  - memory usage per session and, in some cases, per transformation
  - buffer block size (data is moved through memory in chunks)
  - disk allocations, for caches, indexes and logs
  - server allocation, with PowerCenter
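To make the buffer arithmetic concrete (illustrative numbers only): with the 25 MB DTM buffer and 64 KB buffer block size suggested as starting points later in this material, the buffer pool holds roughly 25 MB / 64 KB ≈ 400 blocks, and one block moves about 65,536 / 655 ≈ 100 rows at a time when the average row is 655 bytes wide.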
Partitioning Guidelines
When to do it
- After unit testing: the mapping should be free of errors and optimized
- Only mappings seen as bottlenecks during volume testing should be candidates for partitioning
How to do it
- Look at the big picture:
  - what sessions are running concurrently with the session you want to partition
  - what other processes may be running at the same time as the partitioned session
  - a monster session you wish to partition should not be scheduled to run concurrently with other sessions
- Reserve some time on the server for your initial partitioning tests: if you have to compete with other processes, the test results may be skewed or meaningless
- Add one partition at a time:
  - monitor the system closely: look at RAM usage, CPU usage and disk I/O
  - monitor the session closely: re-compute the total throughput at each step
  - each partition adds a new set of concurrent threads and a new set of memory caches
  - sooner or later you will hit your server's glass ceiling, at which point performance will degrade
- Add partition points at transformations where you suspect a bottleneck: partition points redistribute the data among partitions and allow for process overlap

Definitions
Source pipeline
- Data flowing from a source qualifier to transformations and targets
- There are two pipelines per Joiner transformation; the master source pipeline stops at the Joiner transformation
- Pipelines are processed sequentially by the Informatica Server
Partition point
- Set at the transformation object level; defines the boundaries between DTM threads
- Default partition points are created for:
  - Source Qualifier or Normalizer transformations (reader threads)
  - target instances (writer threads)
  - Rank and unsorted Aggregator transformations
- Data can be redistributed between partitions at each partition point
Pipeline stage
- The area between partition points, where the partition threads operate
- Three default stages:
  - reader stage: reads the source data and brings it into the Source Qualifier
  - transformation stage: moves data from the Source Qualifier up to a target instance
  - writer stage: writes data to the target(s)
- Adding a partition point creates one more transformation stage
- The processing in each stage can overlap, resulting in improved performance

Flow Example
- (Figure: 2 partitions, 5 stages, 10 threads, with partition points separating stages 1 through 5)
- One concurrent connection is opened per partition, for relational sources and targets
- Data can be redistributed between partition threads at partition points, using these methods: pass-through, round-robin, hash keys, key/value range

Partition Methods
- You can change the partitioning method at each partition point to redistribute data between stage threads more efficiently
- All methods but pass-through come at the cost of some performance
Round-robin
- Distributes data evenly between stage threads
- Use in the transformation stage, when reading from unevenly partitioned sources
Hash key
- Keeps data belonging to the same group in the same partition, so the data is aggregated or sorted properly
- Use with Aggregator, Sorter, Joiner and Rank transformations
- Hash auto keys: hash keys generated by the server engine, based on the 'group by' and 'order by' ports of the transformations
- Hash user keys: you define the ports to group by

Partition Methods
Key/value range
- Define a key (one or more ports) and a range of values for each partition
- Use with relational sources or targets
- You can specify additional SQL filters for relational sources, or override the SQL entirely
- The Workflow Manager does not validate key ranges (missing or overlapping values)
Pass-through
- Use when you want to create a new pipeline stage without redistributing data
- If you want to set a partition point at an Aggregator with sorted input, pass-through is the only method available
DB target partitioning
- Only available for DB2 targets
- Queries the system tables and distributes output to the appropriate nodes
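As an illustration of key ranges (hypothetical port and values; remember that the Workflow Manager will not flag gaps or overlaps for you), a three-partition key range on an ORDER_ID port could be laid out as:

    Partition 1:  ORDER_ID        1 - 1000000
    Partition 2:  ORDER_ID  1000001 - 2000000
    Partition 3:  ORDER_ID  2000001 and above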
Partitions and Caches
Partitioned cache files
- Each partition's cache only holds the data needed to process that partition
- Caches are partitioned automatically for Aggregator and Rank transformations
- Joiner caches will be partitioned if you set a partition point at the Joiner transformation
- Lookup caches will be partitioned if:
  - you set a hash auto-keys partition point at the Lookup transformation
  - you use only equality operators in the lookup condition
  - the database is set for case-sensitive comparisons
- Sorter caches are not partitioned
- When using a Joiner with sorted input and multiple partitions on both the master and detail sides, make sure all the data before the Joiner is kept in one partition to maintain the sort order, then use the hash auto-keys partition method at the Joiner. To keep the data in one partition:
  - flat files: use a pass-through partition point at the source qualifier, with the flat file source connected to the first partition and dummy (empty) files connected to the other partitions
  - relational: use a key-range partition point at the source qualifier to bring the entire data set into the first partition

Limitations
Partition points
- You cannot delete the default partition points at the reader or writer stages
- You cannot delete the default partition points at a Rank or unsorted Aggregator transformation unless there is only one partition for the session, or there is a partition point upstream that uses hash keys
- You cannot add a partition point at a Sequence Generator or at an unconnected transformation
- A transformation can only receive input from one pipeline stage; you cannot add a partition point if it violates this rule
XML sources
- You cannot partition a pipeline that contains an XML source
Joiners
- You cannot partition the pipeline that contains the master source unless you create a partition point at the Joiner transformation
Mapping changes
- After you partition a session, you could make changes to the underlying mapping that violate the partitioning rules above; these changes would not get validated in the Workflow Manager, and the session would fail

Limitations
Hash auto keys
- Make sure the row grouping stays the same when one auto-keys partition point feeds data to several 'grouping' transformations, such as a Sorter followed by an Aggregator
- If the grouping is different, you may not get the results you expect
External loaders
- You cannot partition a session that feeds an external loader: the session may validate, but the server will fail the session (the exception is the Oracle external loader, under certain conditions)
- One potential solution is to load the target data into a flat file, then use an external loader to push the data to the database
- On UNIX, however, the server pipes data through to the external loader as the output data is produced, and you would lose this advantage with that solution
Debugger
- You cannot run a session with multiple partitions in the debugger
Resources
- Partitioning can be a great help in speeding up sessions, but it can use up resources very quickly
- Review the session's performance in the production environment to make sure you are not hitting your system's glass ceiling

Partitioning Demo
- You have a mapping that reads data from web log files and aggregates values by user session into a daily table
- In addition, you produce a report file of the top 10 most active sessions
- You have three web servers, each dedicated to its own subject area
- Data for one user session can be spread across several log files
- Log file sizes vary between servers
- You have a persistent session id in the logs
- (Mapping flow: log file reader -> filter out unwanted transactions -> sort input by session id -> aggregate log data per session -> rank the top 10 most active sessions)

Partition Demo - Strategy
Define your strategy
- Using partitions, you can process several log files concurrently, one log per partition
- Because the log files vary in size, you need to re-balance the data load across partitions
- To keep the Sorter and Aggregator working properly, you need to group the data by session id across partitions; this way, data that belongs to the same user session is always processed in the same partition
- For the Rank transformation, you need all the data channeled through one partition, so it can extract the top 10 sessions from the entire data set
Three partitions, each reading a separate log file
- Partition point #1: pass-through, to read each log file entirely
- Partition point #2: round-robin, to even out the load between partitions
- Partition point #3: hash auto keys; the server will channel data based on session id, the 'sort by' port
- Partition point #4: hash auto keys; this Rank does not use a 'group by' port, so all data is lumped into one default group and one partition
Partition Demo - Implementation I
Create one partition per source file
1. Edit the session task properties in the Workflow Manager
2. Select the Mapping tab
3. Select the Partitions sub-tab
4. Select your source qualifier
5. Click 'Edit Partition Point'
6. Click 'Add' twice
- Note: a source qualifier partition point can only be 'Pass Through' for flat file sources, or 'Pass Through' or 'Key Range' for relational sources

Partition Demo - Implementation II
Specify the source files
1. Select the Transformations sub-tab
2. Select your source qualifier
3. Type the filenames

Partition Demo - Implementation III
Re-balance the data load
1. Select the Partitions sub-tab
2. Select your filter
3. Click 'Add Partition Point'
4. Select 'Round Robin'

Partition Demo - Implementation IV
Re-group the data for the Sorter and Aggregator transformations
1. Select your sorter
2. Click 'Add Partition Point'
3. Select 'Hash Auto Keys'
- Note: the Aggregator transformation has the 'Sorted Input' property set and therefore does not have a default partition point; since we added a partition point at the Sorter, we don't need one at the Aggregator

Partition Demo - Implementation V
Ensure the default Rank transformation partition point is set correctly
- Every Rank transformation gets a default partition point set to hash auto keys, and this is the behavior we want

Partition Demo - Implementation VI
Set the defaults for your top-10 file target
1. Select the Transformations sub-tab
2. Select your flat file target
- Note: when you write to a partitioned flat file target, data for each partition ends up in its own file; click 'Merge Partitioned Files' to have the server merge those files into one

Partition Demo - Implementation VII
Set the session performance parameters
1. Select the Properties tab
2. Increase the total DTM buffer size if needed; the right size depends on the number of partitions and the number of sources and targets
3. Check the 'Collect Performance Data' box for a test run, to see how your partitioning strategy is performing

Partition Demo - Test Run I
Monitor session statistics through the Workflow Monitor
- Select the Transformation Statistics tab
- (Figure callouts: the number of input rows for each log file; the number of output rows for the relational target table, with the load spread evenly across partitions; the output rows sent to the flat file top-10 target, confined to one partition)

Partition Demo - Test Run II
Monitor session performance through the Workflow Monitor
- Select the Performance tab, only visible while the session is running and until you close the window; these numbers are saved in the '.perf' file
- (Figure callouts: round-robin evens out the load at the Filter transformation; the number of output rows at the Filter differs from the number of input rows at the Sorter, showing that data was redistributed; all output rows from the Aggregator end up in partition [1] to be ranked, and a single group is created)
Partition Demo - Test Run III
Examine pipeline stage thread performance in the session log
- Scroll down to the Run Info section
- Thread-by-thread performance, for each pipeline stage: total run time, idle time, busy percentage
- A high idle time means a thread is waiting for data: look for a bottleneck upstream

Performance & Tuning

Informatica Tuning 101
Collect base performance data
- Establish reference points for your particular system; your goal is to measure optimal I/O performance on your system
- Create pass-through mappings for each main source/target combination
- Make notes of the read and write throughput counters in the session statistics
- Time these sessions and compute MB/hour or GB/hour numbers
- Do this for various combinations of file and relational sources and targets
- Try to have the system to yourself when you run your benchmarks
Collect performance data for your existing mappings
- Before tuning them, collect read and write throughput data, and compute MB/hour or GB/hour numbers
Identify and remove the bottlenecks in your mappings
- Keep notes of what you do and how it affects performance
- Go after one problem at a time, and re-check performance after each change
- If a fix does not provide a speed improvement, revert to your previous configuration

Collecting Reference Data
Use a pass-through mapping
- A source definition, a source qualifier and a target definition
- No transformations, hence no transformation thread: the best possible engine performance for this source and target combination

Identifying Bottlenecks
Check the candidates in order:
1. Writing to a slow target?
2. Reading from a slow source?
3. Transformation inefficiencies?
4. Session inefficiencies?
5. System not optimized?

Target Bottleneck
- To test for a target bottleneck, change the session's writer to a file writer; if throughput improves markedly, the relational target is the bottleneck

Target Bottleneck
Common sources of problems
- Indexes or key constraints
- Database commit points too high or too low
Common solutions
- Drop indexes and key constraints before loading, and rebuild them after loading
- Use bulk loading or external loaders when practical
- Experiment with the frequency of database commit points
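A minimal sketch of the drop/rebuild approach, for example as pre- and post-session SQL (hypothetical table SALES_FACT and index IDX_SALES_DATE; the exact DDL depends on your database):

    -- Pre-session SQL: drop the index before the bulk load
    DROP INDEX idx_sales_date;

    -- Post-session SQL: rebuild it once the load completes
    CREATE INDEX idx_sales_date ON sales_fact (sale_date);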
Source Bottleneck
- (Figure: two alternative test setups for isolating a slow source)

Source Bottleneck
Common sources of problems
- Inefficient SQL query
- Table partitioning that does not fit the query
Common solutions
- Analyze the query issued by the Source Qualifier; it appears in the session log, and most SQL tools let you view an execution plan for the query
- Consider database optimizer hints, to make sure the correct indexes are used
- Consider indexing tables when you have order by or group by clauses
- Try database parallel queries, if supported
- Try partitioning the session, if appropriate
- If you have table partitioning, make sure your query does not pull data across partition lines
- If you have a query filter on non-indexed columns, try moving the filter outside of the query, into a Filter transformation

Mapping Bottleneck
- (Figure: enable the performance counters under the session's Properties -> Performance section)

Mapping Bottleneck
Common sources of problems
- Too many transformations
- Unused links between ports
- Too many input/output or output ports connected out of Aggregator, Rank or Lookup transformations
- Unnecessary datatype conversions
Common solutions
- Eliminate transformation errors
- If several mappings read from the same source, try single-pass reading
- Optimize datatypes: use integers for comparisons, and don't convert back and forth between datatypes
- Optimize lookups and lookup tables, using caches and indexed tables
- Put your filters early in the data flow, and use a simple filter condition
- For Aggregators: use sorted input, group by integer columns, and simplify expressions
- If you use reusable Sequence Generators, increase the number of cached values
- If you use the same logic in different data streams, apply it before the streams branch off
- Optimize expressions:
  - isolate slow and complex expressions
  - reduce or simplify aggregate functions
  - use local variables to encapsulate repeated computations
  - integer computations are faster than character computations
  - use operators rather than the equivalent function: '||' is faster than CONCAT()
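A small sketch of the local-variable idea in an Expression transformation (hypothetical ports; the v_ prefix marks a variable port): the repeated concatenation is computed once per row, and the output ports reuse it.

    v_FULL_NAME (variable port):  FIRST_NAME || ' ' || LAST_NAME
    o_GREETING  (output port):    'Dear ' || v_FULL_NAME
    o_NAME_LEN  (output port):    LENGTH(v_FULL_NAME)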
Session Bottleneck
Common sources of problems
- Inappropriate memory allocation settings
- Under-utilized or over-utilized resources (CPU and RAM)
- Error tracing override set to a high level
Common solutions
- Experiment with the DTM buffer pool and buffer block size; a good starting point is 25 MB for the DTM buffer and 64 KB for the buffer block size
- Make sure to keep data caches and indexes in memory: avoid paging to disk, but be aware of your RAM limits
- Run sessions in parallel, in parallel workflow execution paths, whenever possible; here also, be careful not to hit your glass ceiling
- If your mapping allows it, use partitioning
- Experiment with the database commit interval
- Turn off decimal arithmetic (it is off by default)
- Use the debugger rather than a high error tracing level, and reduce your tracing level for production runs; create a reusable session configuration object to store the tracing level and buffer block size
- Don't stage your data if you can avoid it; read directly from the original sources
- Look at the performance of your session components (run each separately)

System Bottleneck
Common sources of problems
- Slow network connections
- Overloaded or under-powered servers
- Slow disk performance
Common solutions
- Get the best machines to run your server; better yet, use several servers against the same repository (PowerCenter only)
- Use multiple CPUs and session partitioning
- Make sure you have good network connections between the Informatica server and the database servers
- Locate the repository database on the Informatica server machine
- Shut down unneeded processes or network services on your servers
- Use 7-bit ASCII data movement (the default) if you don't need Unicode
- Evaluate hard disk performance; try locating sources and targets on different drives
- Use different drives for transformation caches, if they don't fit in memory
- Get as much RAM as you can for your servers

Using Statistics Counters
View session statistics through the Workflow Monitor
- These numbers are available in real time; they are updated every few seconds
- Select the Transformation Statistics tab
- (Figure callouts: the number of input rows for each source file; the number of output rows for the relational target table, with the load spread evenly across partitions; the output rows sent to the flat file top-10 target, confined to one partition)

Using Performance Counters
Turning it on
- In the Workflow Manager, edit the session; collecting performance data requires an additional 200 KB of memory per session
1. Select the Properties tab
2. Select the Performance section
3. Check the 'Collect Performance Data' box for a test run, to see how your partitioning strategy is performing

Using Performance Counters
Monitor session performance through the Workflow Monitor
- Select the Performance tab, only visible while the session is running and until you close the window; these numbers are saved in the '.perf' file
- (Figure callouts: input and output row counters for each transformation; error row counters for each transformation; read-from-disk/cache and write-to-disk/cache counters for Rank, Aggregator and Joiner transformations)

Using Performance Counters
How to use the counters
- Input & output rows: verify data integrity, and check how rows are repartitioned at a partition point
- Error rows: did you expect this transformation to reject rows due to errors?
- Read/write to disk: if these counters have non-zero values, your transformation is paging to disk
- Read/write to cache: use in conjunction with read/write to disk to estimate the cache size needed to hold everything in RAM
- New group key (Aggregator and Rank): the number of groups created; does this number seem right? If not, your grouping condition may be wrong
- Old group key (Aggregator and Rank): the number of times a group was reused
- Rows in lookup cache (Lookup only): use to estimate the total cache size

Using Run Info Counters
Using the session log's Run Info section
- Only available when the session is finished; scroll down to the Run Info section
- One entry per stage, per partition
- Counters:
  - run time: the total run time for the thread
  - idle time: the total time the thread spent doing nothing (included in the total run time)
  - busy percentage: a function of the two counters above (replaces the V5 buffer efficiency counters)

Using Run Info Counters
Run Info busy percentage
- You need to compare the values for each stage to properly evaluate where the bottleneck may be
- Look for a high (busy) value that stands out; it indicates a problem area
- High values across the board are an indicator of an optimized session
- The stage with the standout busy percentage marks the bottleneck:

    Reader   Transform   Writer
    High %   Low %       Low %     (reader / source bottleneck)
    Low %    High %      Low %     (transformation bottleneck)
    Low %    Low %       High %    (writer / target bottleneck)
Review Quiz
1. What is a benefit of buffered processing stages?
   a) A safety net against network errors
   b) Lower memory requirements
   c) Overlapping data processing
2. How do you identify a target bottleneck?
   a) By changing the output of the session to point to a flat file instead of the relational target
   b) By reading the Run Info section of the session log and looking for a low busy percentage at the writer stage
   c) By replacing the mapping with a pass-through mapping connected to the same target
3. Is the 'Collect Performance Data' option enabled by default?
   a) No, never
   b) Yes, always
   c) No, unless you run a debugging session

Review Quiz
4. You have shared session memory set to 25 MB and a buffer block size set to 64 KB. How many rows of data can the server move to memory in a single operation?
   a) 40,000 rows, if the average row size is 655 bytes
   b) 100 rows, if the average row size is 655 bytes
   c) 2,500 rows, if the average row size is 64 KB
5. The Aggregator transformation's 'Write To Cache' counter tells the number of rows written to the disk cache.
   a) TRUE
   b) FALSE

Command Line Utilities

Overview
pmcmd
- Communicates with the Informatica Server
- Located in the server install directory: on Windows, in the 'bin' folder; on Unix, at the parent level
- Use with external scheduler tools or server scripts
- Use for remote administration when the Workflow Manager or Monitor GUI is not accessible
pmrep & pmrepagent
- Communicate with the Repository Server
- Located in the repository server install directory: on Windows, in the 'bin' folder; on Unix, at the parent level
- Use to back up the repository
- Use to change database connection parameters or server variables ($PM…)
- Use to perform security-related tasks

Working with pmcmd & pmrep (I)
Two modes
Command line
- Pass the entire command and its parameters to the utility
- Use when writing server scripts that automate server or repository functions
- Connection parameters need flags (-u, -p, …); the main command is followed by its main parameter:

    >>pmcmd startworkflow -u User -p Password -s InfaServer:4001 wkf_LoadFactTables

Interactive
- Maintains a connection to the server or repository until you type exit
- Use to enter a series of commands when operating the server or repository remotely
- Just type the utility name at the console to start interactive mode
- Type a command name without parameters, and the utility prompts for them (not available for all commands):

    >>pmcmd
    >>Informatica™ PMCMD 7.1 (1119)
    >>Copyright © Informatica Corporation 1994-2004
    >>All Rights Reserved
    >>Invoked at Fri Apr 25 13:14:23 2003
    >>pmcmd> connect
    >>username: User
    >>password:
    >>server address <[host:]portno>: InfaServer:4001

Working with pmcmd & pmrep (II)
Getting help
- Command line: type pmcmd help | more to get a paged list of all pmcmd or pmrep commands and arguments; type pmcmd help <commandname> to get help on a specific command
- Interactive: type help or help <commandname> at the pmcmd or pmrep prompt
- Example, getting help on the repository server's backup command:

    >>pmrep help backup
    >> backup
    >>   -o <output file name>
    >>   -f (override existing output file)
    >>help completed successfully
Terminating an interactive session
- Type exit at the pmcmd or pmrep prompt; you can also type quit at the pmcmd prompt

Using pmcmd (I)
Running workflows and tasks
- Commands for starting, stopping and aborting tasks and workflows: starttask, startworkflow, stoptask, stopworkflow, aborttask, abortworkflow, scheduleworkflow, unscheduleworkflow
- You can start a process in wait or nowait mode; you can also specify a parameter file
- Commands to resume a workflow or worklet: resumeworkflow, resumeworklet
- Commands to wait for a process to finish: waittask, waitworkflow; the utility returns control to the user when the given process terminates
- Task names are fully qualified: if the task is within a worklet, use the syntax workletname.taskname; specify the folder (-f) and workflow (-w) hosting the task:

    >>pmcmd starttask -u joe -p 1234 -s InfaServer:4001 -f prodFolder -w wkf_loadFacts ses_LoadTrans

Using pmcmd (II)
Server administration (interactive mode)
- pingserver, shutdownserver, version
Gathering information about the server
- getserverdetails, getserverproperties: server name, type and version, repository name
- Server and workflow status: you can get info about all, running or scheduled workflows
- getsessionstatistics, gettaskdetails, getworkflowdetails, getrunningsessionsdetails
Repository (both modes)
- connect, disconnect: you need a user name, a password, and a server address and port to connect
Setting defaults and properties (interactive mode only)
- setfolder, unsetfolder: set a default folder
- setwait, setnowait: set a run mode valid for the entire session

Using pmcmd (III)
Return codes
- In command-line mode, 'pmcmd' returns a value to indicate the status of the last command
- Zero: the command was successful; when starting a workflow or task in wait mode, zero indicates successful completion of the process; in no-wait mode, zero indicates the server successfully received and processed the request
- Non-zero: an error status, such as an invalid user name or password, or wrong command parameters
- See your Informatica documentation for the list of the latest return codes
Catching return codes
- Within a DOS batch file, use the ERRORLEVEL variable; since IF ERRORLEVEL n is true for any value >= n, check for exact values starting with the highest one, as in:

    pmcmd pingserver Infa61:4001
    IF ERRORLEVEL 1 GOTO error
    IF ERRORLEVEL 0 GOTO success

- Within a Perl script, you can use the $? variable shifted right by 8 bits, as in:

    system('pmcmd pingserver Infa61:4001');
    $returnVal = $? >> 8;
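On Unix, the shell's exit status gives you the same check; a minimal sketch combining it with wait mode and a parameter file (server, folder, workflow and file names are hypothetical, and the exact flag spelling should be verified against your pmcmd help output):

    #!/bin/sh
    # Start the workflow, pass a parameter file, and block until it completes
    pmcmd startworkflow -u User -p Password -s InfaServer:4001 \
        -f prodFolder -paramfile /opt/infa/params/daily.txt -wait wkf_LoadFactTables
    if [ $? -ne 0 ]; then
        echo "wkf_LoadFactTables failed" >&2
        exit 1
    fi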
PMREP Commands
Change management commands
- Deployment group functions, to create, add to, deploy, clear or delete a group (groups can be either static or dynamic): CreateDeploymentGroup, AddToDeploymentGroup, ClearDeploymentGroup, DeployDeploymentGroup, DeleteDeploymentGroup
- Label functions, to create, apply or delete a label: CreateLabel, ApplyLabel, DeleteLabel
- Versioning functions: Checkin, UndoCheckout, FindCheckout, Validate
- DeployFolder: folder copy
- ExecuteQuery: executes an existing query

PMREP Commands
Persistent input files
- You can create reusable input files for some repository and versioning commands; these files describe the objects that will be affected by the operations
- Input files can be created manually or by using repository commands
- Operations that support input files: AddToDeploymentGroup, ApplyLabel, Validate, ObjectExport, ListObjectDependencies
- Operations that can create a persistent input file: ExecuteQuery, ListObjectDependencies
Deployment control files
- XML files written to specify deployment options such as 'Replace Folder' or 'Retain Mapping Persistent Values'
- Used with DeployFolder and DeployDeploymentGroup

PMREP Commands
Repository commands
- ListObjects: lists repository objects for a given type and folder
- Listtablesbysess: lists the source and target table instance names for a given session
- ObjectExport, ObjectImport: export and import repository objects as XML files
- ListObjectDependencies: lists the objects dependent on another object (or objects, if you use an input file) for a given type and folder
- Updateseqgenvals: changes the values of non-reusable Sequence Generators in a mapping; for instance, you can reset dimension key generators' start values to 1 when reloading a data mart from scratch (a second initial load)
- Updatesrcprefix, Updatetargprefix: change the value of a source or target owner name for a given table in a given session

PMREPAGENT Commands
Repository commands
- Backup: backs up a repository to a file; the repository must be stopped
- Create: creates a new repository in a pre-configured database
- Delete: deletes the repository tables from a database
- Restore: restores a repository from a backup file into an empty database
- Upgrade: upgrades an existing repository to the latest version

Repository MX Views

Repository Views
Summary
- Provided for reporting on the repository:
  - historical load performance
  - documentation
  - dependencies
- These views take most of the complexity out of the production repository tables
- Use them whenever possible, rather than going against the production repository tables
- Never modify the production repository tables themselves

Repository Views
Accessing repository MX views
- The views cannot be accessed by Informatica directly:
  - direct access to these tables is prohibited by Informatica
  - you cannot import the view definitions in the Source Designer either
- This can however be circumvented:
  - create a copy of the views under a different name in the production repository (potentially dangerous)
  - create a copy of the views in a different database (safer but slower): use different view names, and create them with an account that has read permission on the production repository views
- The views can be queried with other database tools:
  - SQL*Plus for Oracle, or SQL Query Analyzer for MS SQL Server
  - Perl scripts using the DBI/DBD modules
  - PHP scripts
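A minimal sketch of the second approach on Oracle (hypothetical names: the repository lives in a PRODREPO schema, and the copy is created in a separate reporting schema that has been granted SELECT on the view):

    -- In the reporting schema, under a different name than the original view
    CREATE VIEW mx_sess_log AS
        SELECT * FROM prodrepo.rep_sess_log;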
Repository Views
All views at a glance
- REP_DATABASE_DEFS: a list of source subfolders for each folder
- REP_SCHEMA: a list of folders and version info
- REP_SESSION_CNXS: info about database connections in reusable sessions
- REP_SESSION_INSTANCES: info about session instances in workflows or worklets
- REP_SRC_FILE_FLDS: detailed info about flat file, ISAM & XML source fields
- REP_SRC_FILES: detailed info about flat file, ISAM & XML source definitions
- REP_SRC_FLD_MAP: info about data transformations at the field level, for relational sources
- REP_SRC_MAPPING: the sources for each mapping
- REP_SRC_TBL_FLDS: detailed info about relational source fields
- REP_SRC_TBLS: info about relational sources, for each folder
- REP_TARG_FLD_MAP: info about data transformations at the field level, for relational targets
- REP_TARG_MAPPING: the targets for each mapping
- REP_TARG_TBL_COLS: detailed info about relational target columns
- REP_TARG_TBL_JOINS: primary/foreign key relationships between targets, per folder
- REP_TARG_TBLS: info about relational targets, per folder
- REP_TBL_MAPPING: the list of sources & targets per mapping, with filters, group-bys and SQL overrides
- REP_WORKFLOWS: limited info about workflows
- REP_SESS_LOG: historical data about session runs
- REP_SESS_TBL_LOG: historical load info for targets
- REP_FLD_MAPPING: describes the data path from source field to target field

Repository Views
Usage: dependencies
- You are changing a source or a target table and need to know which mappings are affected
- You are changing a database connection and need to know which sessions are affected
Useful views
- Source dependencies: REP_SRC_MAPPING (SUBJECT_AREA, MAPPING_NAME, SOURCE_NAME, SRC_BUSNAME, VERSION_ID, VERSION_NAME, MAPPING_COMMENT, MAPPING_LAST_SAVED)
- Target dependencies: REP_TARG_MAPPING (SUBJECT_AREA, MAPPING_NAME, TARGET_NAME, TARG_BUSNAME, plus CONDITIONAL_LOAD, SOURCE_FILTER, GROUP_BY_CLAUSE and SQL_OVERRIDE, which explain how the data is transformed from the source to the target, and VERSION_ID, VERSION_NAME, MAPPING_COMMENT, MAPPING_LAST_SAVED)
- Connection dependencies: REP_SESSION_INSTANCES (SUBJECT_AREA, WORKFLOW_NAME, SESSION_INSTANCE_NAME, CONNECTION_NAME, CONNECTION_ID, IS_TARGET, VERSION_ID, DESCRIPTION)
- Notes: SUBJECT_AREA is the name of the folder in the repository tables and views; VERSION_ID and VERSION_NAME refer to folder versioning

Repository Views
Dependencies: example queries
- Display all mappings in the TradeWind folder that have a source or a target named 'customers':

    select distinct mapping_name from rep_src_mapping
    where source_name = 'customers' and subject_area = 'Tradewind'
    union
    select distinct mapping_name from rep_targ_mapping
    where target_name like 'customers' and subject_area = 'Tradewind'

- Display all workflows and worklets using a server target connection named 'Target_DB':

    select distinct workflow_name from rep_session_instances
    where connection_name = 'Target_DB' and is_target = 1
Repository Views
Usage: session performance
- Run a report on historical load performance for given targets
- Run a post-load check on warnings, errors and rejected rows for all sessions within a folder
Useful views
- Target performance: REP_SESS_TBL_LOG (SUBJECT_AREA, SESSION_NAME, SESSION_INSTANCE, TABLE_NAME, TABLE_BUSNAME, TABLE_INSTANCE_NAME, SUCCESSFUL_ROWS, FAILED_ROWS, LAST_ERROR_CODE, LAST_ERROR, START_TIME and END_TIME (the start and end times for the writer stage), SESSION_TIMESTAMP, BAD_FILE_LOCATION)
- Session performance: REP_SESS_LOG (SUBJECT_AREA, SESSION_NAME, SESSION_INSTANCE, SUCCESSFUL_ROWS, FAILED_ROWS, FIRST_ERROR_CODE, FIRST_ERROR_MSG, LAST_ERROR_CODE, LAST_ERROR (both the first and the last error messages), ACTUAL_START (the time when the server received the start-session request), SESSION_TIMESTAMP, SESSION_LOG_FILE, BAD_FILE_LOCATION)
- A post-load process can use SESSION_LOG_FILE and BAD_FILE_LOCATION to access the session logs and reject files

Repository Views
Session performance: example queries
- Display historical load data and elapsed load times for a target called 'T_Orders' in the folder 'TradeWind'; the method to compute the elapsed time is database dependent:

    select successful_rows, failed_rows, <end_time - start_time>
    from rep_sess_tbl_log
    where table_name = 'T_Orders' and subject_area = 'Tradewind'
    order by session_timestamp desc

- Display a post-load report showing the sessions that have error or warning messages; the load start time is taken from the repository production table opb_wflow_run, and the query assumes there is only one workflow called 'DailyDatawarehouseLoad' in the repository:

    select session_instance_name, successful_rows, failed_rows, first_error_msg
    from rep_sess_log
    where subject_area = 'TradeWind'
    and first_error_msg != 'No errors encountered.'
    and session_timestamp >= (select max(start_time) from opb_wflow_run
                              where workflow_name = 'DailyDatawarehouseLoad')
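For instance, on Oracle, where subtracting two DATE values yields a number of days, the elapsed time in seconds could be computed as:

    select successful_rows, failed_rows,
           (end_time - start_time) * 86400 as elapsed_seconds
    from rep_sess_tbl_log
    where table_name = 'T_Orders' and subject_area = 'Tradewind'
    order by session_timestamp desc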
Repository Views
Usage: documentation (schema)
- Document the schema (sources and targets with their respective fields) for a given folder or for the entire repository
Useful views
- REP_SRC_TBLS and REP_TARG_TBLS: table-level info (SUBJECT_AREA, TABLE_NAME, TABLE_BUSNAME, TABLE_ID, DESCRIPTION, VERSION_ID, VERSION_NAME, LAST_SAVED; the source view adds SOURCE_TYPE, DATABASE_TYPE, DATABASE_NAME and SCHEMA_NAME, while the target view adds TABLE_CONSTRAINT and CREATE_OPTIONS)
- REP_SRC_TBL_FLDS and REP_TARG_TBL_COLS: field-level info (SUBJECT_AREA, TABLE_NAME, TABLE_BUSNAME, COLUMN_NAME, COLUMN_BUSNAME, COLUMN_ID, COLUMN_NUMBER, KEY_TYPE / COLUMN_KEYTYPE, DATA_TYPE, DATA_PRECISION, DATA_SCALE, a description, and VERSION_ID, VERSION_NAME)

Repository Views
Usage: documentation (data paths)
- Document the path from each source field to each target column, with the data transformations in between
- For each mapping, document the source and target objects, including SQL overrides and group conditions
Useful views
- Source & target level: REP_TBL_MAPPING (SUBJECT_AREA, MAPPING_NAME, SOURCE_NAME, SRC_BUSNAME, TARGET_NAME, TARG_BUSNAME, SOURCE_FILTER (from the Source Qualifier properties), CONDITIONAL_LOAD (from Filter transformations), GROUP_BY_CLAUSE (from Aggregator transformations), SQL_OVERRIDE, VERSION_ID, VERSION_NAME, MAPPING_COMMENT, MAPPING_LAST_SAVED)
- Field level: REP_FLD_MAPPING (SUBJECT_AREA, MAPPING_NAME, SOURCE_FIELD_NAME, SRC_FLD_BUSNAME, SOURCE_NAME, SRC_BUSNAME, TARGET_COLUMN_NAME, TARG_COL_BUSNAME, TRANS_EXPRESSION (a digest of all the transformations that occur between the source field and the target column; sometimes cryptic and hard to read), USER_COMMENT, DBA_COMMENT, VERSION_ID, VERSION_NAME, MAPPING_COMMENT, MAPPING_LAST_SAVED)

Repository Views - Documentation
Example queries
- Display the source schema (all source definitions and field properties) for the folder 'TradeWind'; the output is sorted by column number to keep it in sync with the field order of each source definition:

    select table_name, column_name, source_type, data_precision, data_scale
    from rep_src_tbl_flds
    where subject_area = 'TradeWind'
    and version_name = '010000'
    and table_name in (select table_name from rep_src_tbls
                       where subject_area = 'TradeWind')
    order by table_name, column_number

- Display the path of the data from source field to target column, each with its concatenated transformation expression, for the mapping 'OrdersTimeMetric' in the folder 'TradeWind':

    select source_field_name, target_column_name, trans_expression
    from rep_fld_mapping
    where mapping_name = 'OrdersTimeMetric'
    and subject_area = 'Tradewind'

Repository Views - Documentation
Sample output
- The :SD. prefix denotes a source definition; note that the RequiredDate and ShippedDate source fields both feed the OnTime_Orders column:

    ShipCountry   Country        :SD.Orders.ShipCountry
    RequiredDate  OnTime_Orders  SUM(IIF(DATE_COMPARE(IIF(ISNULL(:SD.Orders.RequiredDate),
                                    :SD.Orders.ShippedDate, :SD.Orders.RequiredDate),
                                    :SD.Orders.ShippedDate) >= 0, 1, 0))
    ShippedDate   OnTime_Orders  SUM(IIF(DATE_COMPARE(IIF(ISNULL(:SD.Orders.RequiredDate),
                                    :SD.Orders.ShippedDate, :SD.Orders.RequiredDate),
                                    :SD.Orders.ShippedDate) >= 0, 1, 0))
    OrderID       Late_Orders    COUNT(:SD.Orders.OrderID)
                                    SUM(IIF(DATE_COMPARE(IIF(ISNULL(:SD.Orders.RequiredDate),
                                    :SD.Orders.ShippedDate, :SD.Orders.RequiredDate),
                                    :SD.Orders.ShippedDate) >= 0, 1, 0))