Module 1
DS324EE – DataStage Enterprise
Edition
Concept Review
Ascential's Enterprise Data Integration Platform
[Diagram: data flows from ANY SOURCE (CRM, ERP, SCM, RDBMS, legacy, real-time, client-server, web services, data warehouse, other apps.) through DISCOVER (data profiling – gather relevant information for target enterprise applications), PREPARE (data quality – cleanse, correct, and match input data), and TRANSFORM (extract, transform, load – standardize and enrich data and load to targets) to ANY TARGET (CRM, ERP, SCM, BI/analytics, RDBMS, real-time, client-server, web services, data warehouse, other apps.). Parallel execution, meta data management, and command & control span the platform.]
Course Objectives

You will learn to:
– Build DataStage EE jobs using complex logic
– Utilize parallel processing techniques to increase job performance
– Build custom stages based on application needs

Course emphasis is on:
– Advanced usage of DataStage EE
– Application job development
– Best-practice techniques
Course Agenda

Day 1
– Review of EE Concepts
– Sequential Access
– Standards
– DBMS Access
Day 2
– EE Architecture
– Transforming Data
– Sorting Data
Day 3
– Combining Data
– Configuration Files
Day 4
– Extending EE
– Meta Data Usage
– Job Control
– Testing
Module Objectives

Provide a background for completing work in the DSEE advanced course

Ensure all students will have a successful advanced class

Tasks
– Review parallel processing concepts
Review Topics

DataStage architecture

DataStage client review
– Administrator
– Manager
– Designer
– Director

Parallel processing paradigm
DataStage Enterprise Edition Client-Server Architecture
[Diagram: the DataStage clients – Designer, Director, Administrator, and Manager – run on Microsoft® Windows NT/2000/XP and connect to the DataStage server and its repository on Microsoft® Windows NT or UNIX. The server discovers (extract), prepares (cleanse), transforms, and extends/integrates data from ANY SOURCE to ANY TARGET (CRM, ERP, SCM, BI/analytics, RDBMS, real-time, client-server, web services, data warehouse, other apps.), with parallel execution, meta data management, and command & control.]
Process Flow

Administrator – add/delete projects, set defaults

Manager – import meta data, backup projects

Designer – assemble jobs, compile, and execute

Director – execute jobs, examine job run logs
Administrator – Licensing and
Timeout
Administrator – Project
Creation/Removal
Functions
specific to a
project.
Administrator – Project Properties
RCP for parallel
jobs should be
enabled
Variables for
parallel
processing
Administrator – Environment
Variables
Variables are
category
specific
OSH is what is
run by the EE
Framework
DataStage Manager
Export Objects to MetaStage
Push meta data
to MetaStage
Designer Workspace
Can execute
the job from
Designer
DataStage Generated OSH
The EE
Framework
runs OSH
Director – Executing Jobs
Messages from
previous run in
different color
Stages
Can now customize the Designer’s palette
Popular Stages
Row
generator
Peek
Row Generator

Can build test data
Edit row in
column tab
Repeatable
property
Peek

Displays field values
– Will be displayed in job log or sent to a file
– Skip records option
– Can control number of records to be displayed

Can be used as stub stage for iterative
development (more later)
Why EE is so Effective

Parallel processing paradigm
– More hardware, faster processing
– Level of parallelization is determined by a configuration
file read at runtime

Emphasis on memory
– Data read into memory and lookups performed like
hash table
Scalable Systems

Parallel processing = executing your application
on multiple CPUs
– Scalable processing = add more resources
(CPUs, RAM, and disks) to increase system
performance
[Figure: an example system containing 6 CPUs (or processing nodes) and disks.]
Scalable Systems: Examples
Three main types of scalable systems

Symmetric Multiprocessors (SMP), shared
memory

Clusters: UNIX systems connected via networks

MPP (Massively Parallel Processing): shared nothing
SMP: Shared Everything
• Multiple CPUs with a single operating system
• Programs communicate using shared memory
• All CPUs share system resources (OS, memory with a single linear address space, disks, I/O)
When used with enterprise edition:
• Data transport uses shared memory
• Simplified startup
[Figure: four CPUs sharing memory]
Enterprise edition treats NUMA (Non-Uniform Memory Access) as SMP.
Traditional Batch Processing
[Diagram: source data (operational and archived) is transformed, written to disk, cleaned, written to disk, then loaded – again via disk – into the data warehouse target.]
Traditional approach to batch processing:
• Write to disk and read from disk before each processing operation
• Sub-optimal utilization of resources
  • a 10 GB stream leads to 70 GB of I/O
  • processing resources can sit idle during I/O
• Very complex to manage (lots and lots of small jobs)
• Becomes impractical with big data volumes
  • disk I/O consumes the processing
  • terabytes of disk required for temporary staging
Pipeline Multiprocessing
Data pipelining:
• Transform, clean, and load processes execute simultaneously on the same processor
• Rows move forward through the flow
[Diagram: source (operational and archived data) → Transform → Clean → Load → data warehouse target.]
• Start a downstream process while an upstream process is still running.
• This eliminates intermediate storing to disk, which is critical for big data.
• This also keeps the processors busy.
• Still has limits on scalability.
Think of a conveyor belt moving the rows from process to process!
Partition Parallelism
Data partitioning:
• Break up big data into partitions
• Run one partition on each processor
• 4X faster on 4 processors; with data big enough, 100X faster on 100 processors
• This is exactly how parallel databases work!
• Data partitioning requires the same transform on all partitions: Aaron Abbott and Zygmund Zorn undergo the same transform
[Diagram: source data is partitioned by last name into A-F, G-M, N-T, and U-Z ranges; each partition is transformed on its own node (Node 1 – Node 4).]
Combining Parallelism Types
Putting It All Together: Parallel Dataflow
[Diagram: source data flows through Transform, Clean, and Load into the data warehouse target; pipelining moves rows between the stages while each stage runs partitioned across the nodes.]
EE Program Elements
• Dataset: uniform set of rows in the Framework's internal representation
  - Three flavors:
    1. file sets (*.fs): stored on multiple Unix files as flat files
    2. persistent (*.ds): stored on multiple Unix files in Framework format; read and written using the DataSet Stage
    3. virtual (*.v): links, in Framework format, NOT stored on disk
  - The Framework processes only datasets – hence the possible need for Import
  - Different datasets typically have different schemas
  - Convention: "dataset" = Framework data set.
• Partition: subset of rows in a dataset earmarked for processing by the same node (virtual CPU, declared in a configuration file).
  - All the partitions of a dataset follow the same schema: that of the dataset.
Repartitioning
Putting It All Together: Parallel Dataflow with Repartitioning on-the-fly
[Diagram: source data is partitioned by customer last name (A-F, G-M, N-T, U-Z) for Transform, repartitioned by customer zip code for Clean, and repartitioned by credit card number for Load into the data warehouse target – with pipelining throughout and without landing to disk!]
DataStage EE Architecture
DataStage Engine: provides the data integration platform
Orchestrate Framework: provides parallel processing
[Diagram: an Orchestrate program (a sequential data flow – Import, Clean 1, Clean 2, Merge, Analyze – over flat files and relational data) runs on the Orchestrate Application Framework and Runtime System, which supplies parallel pipelining, parallelization of operations, parallel access to data in files and in RDBMSs, inter-node communications, centralized error handling and event logging, performance visualization, and the configuration file.]
DataStage Enterprise Edition: best-of-breed scalable data integration platform with no limitations on data volumes or throughput.
Introduction to DataStage EE

DSEE:
– Automatically scales to fit the machine
– Handles data flow among multiple CPUs and disks

With DSEE you can:
– Create applications for SMPs, clusters, and MPPs… enterprise edition is architecture-neutral
– Access relational databases in parallel
– Execute external applications in parallel
– Store data across multiple disks and nodes
Job Design vs. Execution
User assembles the flow using the DataStage Designer… and gets parallel access, propagation, transformation, and load.
The design is good for 1 node, 4 nodes, or N nodes. To change the number of nodes, just swap the configuration file. No need to modify or recompile your design!
Partitioners and Collectors

Partitioners distribute rows into partitions
– implement data-partition parallelism


Collectors = inverse partitioners
Live on input links of stages running
– in parallel (partitioners)
– sequentially (collectors)

Use a choice of methods
Example Partitioning Icons
partitioner
Exercise

Complete exercises 1-1, 1-2, and 1-3
Module 2
DSEE Sequential Access
Module Objectives

You will learn to:
– Import sequential files into the EE Framework
– Utilize parallel processing techniques to increase
sequential file access
– Understand usage of the Sequential, DataSet, FileSet,
and LookupFileSet stages
– Manage partitioned data stored by the Framework
Types of Sequential Data Stages

Sequential
– Fixed or variable length

File Set

Lookup File Set

Data Set
Sequential Stage Introduction

The EE Framework processes only datasets

For files other than datasets, such as flat files,
enterprise edition must perform import and export
operations – this is performed by import and
export OSH operators (generated by Sequential
or FileSet stages)

During import or export DataStage performs
format translations – into, or out of, the EE
internal format

Data is described to the Framework in a schema
How the Sequential Stage Works

Generates import/export operators

Types of transport:
– Performs direct C++ file I/O streams
– Source programs that write to stdout (e.g. gunzip) feed EE through a sequential pipe (see the sketch below)
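A minimal sketch of the second transport type (the file name is an assumption): the source program decompresses on the fly and its stdout is piped into enterprise edition, so no uncompressed copy is ever landed on disk.

# gunzip writes the uncompressed rows to stdout; EE reads them through
# a sequential pipe instead of a landed flat file
$ gunzip -c /data/source/customers.txt.gz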
Using the Sequential File Stage
Both import and export of general files (text, binary) are performed by the Sequential File stage.
Importing/exporting data:
– Data import: external format → EE internal format
– Data export: EE internal format → external format
Working With Flat Files

Sequential File Stage
– Normally will execute in sequential mode
– Can execute in parallel if reading multiple files (file
pattern option)
– Can use multiple readers within a node on fixed width
file
– DSEE needs to know


How file is divided into rows
How row is divided into columns
Processes Needed to Import Data

Recordization
– Divides input stream into records
– Set on the format tab

Columnization
– Divides the record into columns
– Default set on the format tab but can be overridden on
the columns tab
– Can be “incomplete” if using a schema or not even
specified in the stage if using RCP
File Format Example

Final delimiter = end:
Field 1 , Field 2 , Field 3 , Last field <nl>
(fields are separated by the field delimiter ",", and the record ends with the record delimiter <nl>)

Final delimiter = comma:
Field 1 , Field 2 , Field 3 , Last field , <nl>
(a field delimiter also follows the last field, before the record delimiter)

A schema-file sketch of these record-level properties follows.
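A minimal sketch (column names are assumptions) of the same record-level properties expressed in a schema file, using the record-property syntax shown later in the Schema section:

record {final_delim=end, delim=","}
( field_1: string;
  field_2: string;
  field_3: string;
  last_field: string;
)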
Sequential File Stage

To set the properties, use stage editor
– Page (general, input/output)
– Tabs (format, columns)

Sequential stage link rules
– One input link
– One output link (plus an optional reject link)
– One reject link

Will reject any records not matching the meta data in the column definitions
Job Design Using Sequential Stages
Stage categories
General Tab – Sequential Source
Multiple output links
Show records
Properties – Multiple Files
Click to add more files having
the same meta data.
Properties - Multiple Readers
Multiple readers option allows
you to set number of readers
Format Tab
File into records
Record into columns
Read Methods
Reject Link

Reject mode = output

Source
– All records not matching the meta data (the column
definitions)

Target
– All records that are rejected for any reason

Meta data – one column, datatype = raw
File Set Stage

Can read or write file sets

Files suffixed by .fs

File set consists of:
1. Descriptor file – contains location of raw data files +
meta data
2. Individual raw data files

Can be processed in parallel
File Set Stage Example
Descriptor file
File Set Usage

Why use a file set?
– 2 GB file-size limit on some file systems
– Need to distribute data among nodes to prevent overruns
– If used in parallel, runs faster than a sequential file
Lookup File Set Stage

Can create file sets

Usually used in conjunction with Lookup stages
Lookup File Set > Properties
Key column
specified
Key column
dropped in
descriptor file
Data Set

Operating system (Framework) file

Suffixed by .ds

Referred to by a control file

Managed by Data Set Management utility from
GUI (Manager, Designer, Director)

Represents persistent data

Key to good performance in set of linked jobs
Persistent Datasets

Accessed from/to disk with the DataSet Stage.

Two parts:
– Descriptor file (e.g. input.ds): contains the metadata and the data location, but NOT the data itself, e.g.
  record (
    partno: int32;
    description: string;
  )
– Data file(s): contain the data, as multiple Unix files (one per node), accessible in parallel, e.g.
  node1:/local/disk1/…
  node2:/local/disk2/…
Quiz!
• True or False?
Everything that has been data-partitioned must be
collected in same job
Data Set Stage
Is the data partitioned?
Engine Data Translation

Occurs on import
– From sequential files or file sets
– From RDBMS

Occurs on export
– From datasets to file sets or sequential files
– From datasets to RDBMS

Engine is most efficient when processing internally formatted records (i.e. data contained in datasets)
Managing DataSets


GUI (Manager, Designer, Director) – tools > data
set management
Alternative methods
– Orchadmin



Unix command line utility
List records
Remove data sets (will remove all components)
– Dsrecords

Lists number of records in a dataset
Data Set Management
Display data
Schema
Data Set Management From Unix

Alternative method of managing file sets and data sets:
– dsrecords
  Unix command-line utility; gives the record count
  $ dsrecords ds_name
  e.g. $ dsrecords myDS.ds
  156999 records
– orchadmin
  Unix command-line utility; manages EE persistent data sets
  e.g. $ orchadmin rm myDataSet.ds
Exercise

Complete exercises 2-1, 2-2, 2-3, and 2-4.
Module 3
Standards and Techniques
Objectives

Establish standard techniques for DSEE development

Will cover:
– Job documentation
– Naming conventions for jobs, links, and stages
– Iterative job design
– Useful stages for job development
– Using configuration files for development
– Using environment variables
– Job parameters
Job Presentation
Document using the
annotation stage
Job Properties Documentation
Organize jobs into
categories
Description shows in DS Manager
and MetaStage
Naming conventions

Stages named after the
– Data they access
– Function they perform
– DO NOT leave defaulted stage names like
Sequential_File_0

Links named for the data they carry
– DO NOT leave defaulted link names like DSLink3
Stage and Link Names
Stages and links
renamed to data they
handle
Create Reusable Job Components

Use enterprise edition shared containers when
feasible
Container
Use Iterative Job Design



Use copy or peek stage as stub
Test job in phases – small first, then increasing in
complexity
Use Peek stage to examine records
Copy or Peek Stage Stub
Copy stage
Transformer Stage Techniques

Suggestions:
– Always include a reject link.
– Always test for a null value before using a column in a function.
– Try to use RCP and only map columns that have a derivation other than a copy. More on RCP later.
– Be aware of column and stage variable data types.
  Often the user does not pay attention to the stage variable type.
– Avoid type conversions.
  Try to maintain the data type as imported.
The Copy Stage
With 1 link in and 1 link out, the Copy Stage is the ultimate "no-op" (place-holder):
– Partitioners
– Sort / Remove Duplicates
– Rename, Drop column
… can be inserted on:
– the input link (Partitioning): Partitioners, Sort, Remove Duplicates
– the output link (Mapping page): Rename, Drop
Sometimes it replaces the transformer.
Developing Jobs
1. Keep it simple
   • Jobs with many stages are hard to debug and maintain.
2. Start small and build to the final solution
   • Use view data, copy, and peek.
   • Start from the source and work out.
   • Develop with a 1-node configuration file.
3. Solve the business problem before the performance problem.
   • Don't worry too much about partitioning until the sequential flow works as expected.
4. If you have to write to disk, use a persistent data set.
Final Result
Good Things to Have in each Job

Use job parameters

Some helpful environment variables to add to job parameters (a shell sketch follows):
– $APT_DUMP_SCORE
  Reports OSH to the message log
– $APT_CONFIG_FILE
  Establishes runtime parameters to the EE engine, e.g. the degree of parallelization
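A minimal sketch, when setting these from the shell rather than as job parameters (the configuration file path is an assumption):

# point the run at a specific configuration file and request the score
# dump in the job log
$ export APT_CONFIG_FILE=/opt/Ascential/DataStage/configs/4node.apt
$ export APT_DUMP_SCORE=True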
Setting Job Parameters
Click to add
environment
variables
DUMP SCORE Output
Setting APT_DUMP_SCORE yields:
Double-click
Partitoner
And
Collector
Mapping
Node--> partition
Use Multiple Configuration Files

Make a set for 1X, 2X,….

Use different ones for test versus production

Include as a parameter in each job
Exercise

Complete exercise 3-1
Module 4
DBMS Access
Objectives

Understand how DSEE reads and writes records
to an RDBMS

Understand how to handle nulls on DBMS lookup

Utilize this knowledge to:
– Read and write database tables
– Use database tables to lookup data
– Use null handling options to clean data
Parallel Database Connectivity

Traditional client-server:
• Only the RDBMS runs in parallel
• Each application has only one connection
• Suitable only for small data volumes

Enterprise edition:
• The parallel server runs the APPLICATIONS
• The application has parallel connections to the RDBMS
• Suitable for large data volumes
• Higher levels of integration possible

[Diagram: on the traditional side, client applications (Sort, Load, …) each hold a single connection to the parallel RDBMS; on the enterprise edition side, the same operations hold parallel connections to the parallel RDBMS.]
RDBMS Access
Supported Databases
enterprise edition provides high performance /
scalable interfaces for:

DB2

Informix

Oracle

Teradata
Users must be granted specific privileges,
depending on RDBMS.
RDBMS Access
Supported Databases

Automatically convert RDBMS table layouts to/from
enterprise edition Table Definitions

RDBMS nulls converted to/from nullable field values

Support for standard SQL syntax for specifying:
– field list for SELECT statement
– filter for WHERE clause
– open command, close command


Can write an explicit SQL query to access RDBMS
EE supplies additional information in the SQL query
RDBMS Stages

DB2/UDB Enterprise

Teradata Enterprise

Informix Enterprise

ODBC

Oracle Enterprise
RDBMS Usage

As a source
– Extract data from table (stream link)
– Extract as table, generated SQL, or user-defined SQL
– User-defined can perform joins, access views
– Lookup (reference link)
– Normal lookup is memory-based (all table data read into memory)
– Can perform one lookup at a time in DBMS (sparse option)
– Continue/drop/fail options

As a target
– Inserts
– Upserts (Inserts and updates)
– Loader
RDBMS Source – Stream Link
Stream link
DBMS Source - User-defined SQL
Columns in SQL statement
must match the meta data
in columns tab
Exercise

User-defined SQL
– Exercise 4-1
DBMS Source – Reference Link
Reject link
Lookup Reject Link
“Output” option automatically
creates the reject link
Null Handling

Must handle null condition if lookup record is not
found and “continue” option is chosen

Can be done in a transformer stage
Lookup Stage Mapping
Link name
Lookup Stage Properties
Reference
link
Must have same column name in
input and reference links. You will
get the results of the lookup in the
output column.
DBMS as a Target
DBMS As Target

Write methods:
– Delete
– Load
– Upsert
– Write (DB2)

Write modes for the load method:
– Truncate
– Create
– Replace
– Append
Target Properties
Generated code
can be copied
Upsert mode
determines options
Checking for Nulls

Use a Transformer stage to test for fields with null values (use the IsNull functions)

In the Transformer, you can reject the row or load a default value (a derivation sketch follows)
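A minimal sketch of such a derivation (the link name lkp_out, the column CustName, and the default value 'UNKNOWN' are assumptions): the output column receives the looked-up value, or a default when the lookup returned null.

If IsNull(lkp_out.CustName) Then 'UNKNOWN' Else lkp_out.CustName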
Exercise

Complete exercise 4-2
Module 5
Platform Architecture
Objectives

Understand how enterprise edition Framework
processes data

You will be able to:
– Read and understand OSH
– Perform troubleshooting
Concepts

The EE Platform

OSH (generated by DataStage Parallel Canvas, and run
by DataStage Director)

Conductor,Section leaders,players.

Configuration files (only one active at a time, describes
H/W)

Schemas/tables

Schema propagation/RCP

Buildop,Wrapper

Datasets (data in Framework's internal representation)
DS-EE Program Elements
EE Stages Involve a Series of Processing Steps
[Diagram: an input data set (schema: prov_num:int16; member_num:int8; custid:int32;) passes through a partitioner and the stage's business logic to produce an output data set with the same schema.]
• An EE Stage is a piece of application logic running against individual records
• Parallel or sequential
• Three sources of stages:
  – Ascential-supplied
  – Commercial tools/applications
  – Custom/existing programs
DS-EE Program Elements
Stage Execution
Dual parallelism eliminates bottlenecks!
• EE delivers parallelism in two ways:
  – Pipeline (producer → consumer)
  – Partition
• Block buffering between components
  – Eliminates the need for program load balancing
  – Maintains orderly data flow
Stages Control Partition Parallelism

Execution mode (sequential/parallel) is controlled by the Stage
– Default = parallel for most Ascential-supplied Stages
– A parallel Stage inserts the default partitioner (Auto) on its input links
– A sequential Stage inserts the default collector (Auto) on its input links
– The user can override the defaults:
  execution mode (parallel/sequential) of the Stage (Advanced tab)
  choice of partitioner/collector (Input – Partitioning tab)
How Parallel Is It?

Degree of Parallelism is determined
by the configuration file
– Total number of logical nodes in default pool, or a
subset if using "constraints".

Constraints are assigned to specific pools as defined in
configuration file and can be referenced in the stage
OSH

DataStage EE GUI generates OSH scripts
– Ability to view OSH turned on in Administrator
– OSH can be viewed in Designer using job properties

The Framework executes OSH

What is OSH?
– Orchestrate shell
– Has a UNIX command-line interface
OSH Script

An osh script is a quoted string which
specifies:
– The operators and connections of a single Orchestrate
step
– In its simplest form, it is:
osh “op < in.ds > out.ds”

Where:
– op is an Orchestrate operator
– in.ds is the input data set
– out.ds is the output data set
OSH Operators

Operator is an instance of a C++ class inheriting from
APT_Operator

Developers can create new operators

Examples of existing operators:
– Import
– Export
– RemoveDups
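A minimal sketch following the generic osh form shown earlier (the data set names are assumptions): run the existing copy operator over a persistent data set to produce a second one.

# one operator, one input data set, one output data set
$ osh "copy < customers.ds > customers_backup.ds"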
Enable Visible OSH in Administrator
Will be enabled for
all projects
View OSH in Designer
Operator
Schema
OSH Practice

Exercise 5-1
Orchestrate May Add Operators to Your Command
Let's revisit the following OSH command:
$ osh " echo 'Hello world!' [par] > outfile "
The Framework silently inserts operators (steps 1, 2, 3, 4):
1) a "wrapper" (turning a Unix command into a DS/EE operator)
2) a partitioner
3) a collector
4) an export (from dataset to Unix flat file)
Elements of a Framework Program
Steps, with internal and terminal datasets and links, described by schemas
• Step:
unit of OSH program
– one OSH command = one step
– at end of step: synchronization, storage to disk
• Datasets: set of rows processed by Framework
– Orchestrate data sets:
– persistent (terminal) *.ds, and
– virtual (internal) *.v.
– Also: flat “file sets” *.fs
• Schema: data description (metadata) for datasets and links.
Orchestrate Datasets
• Consist of partitioned data and schema
• Can be persistent (*.ds) or virtual (*.v, link)
• Overcome the 2 GB file limit
What you program (GUI) = what gets generated (OSH) = what gets processed (the data files of x.ds)
$ osh "operator_A > x.ds"
[Diagram: operator_A runs on Node 1 through Node 4; each partition writes multiple files, each up to 2 GB (or larger).]
Computing Architectures: Definition

Uniprocessor – dedicated disk:
• a single CPU with its own memory and disk
• PC, workstation, single-processor server

SMP system (Symmetric Multiprocessor) – shared memory, shared disk:
• multiple CPUs sharing one memory and the disks
• IBM, Sun, HP, Compaq
• 2 to 64 processors
• majority of installations

Clusters and MPP systems – shared nothing:
• each node has its own CPU(s), memory, and disks
• 2 to hundreds of processors
• MPP: IBM and NCR Teradata
• each node is a uniprocessor or SMP
Job Execution: Orchestrate
[Diagram: the Conductor process (C) on the conductor node communicates with a Section Leader process (SL) on each processing node, and each Section Leader manages its Player processes (P).]
• Conductor – the initial DS/EE process; the step composer:
  – Creates Section Leader processes (one per node)
  – Consolidates messages and outputs them
  – Manages orderly shutdown
• Section Leader:
  – Forks Player processes (one per Stage)
  – Manages up/down communication
• Players – the actual processes associated with Stages:
  – Combined players: one process only
  – Send stderr to the Section Leader
  – Establish connections to other players for data flow
  – Clean up upon completion
• Communication:
  – SMP: shared memory
  – MPP: TCP
Working with Configuration Files

You can easily switch between configuration files:
– a '1-node' file: for sequential execution and lighter reports – handy for testing
– a 'MedN-nodes' file: aims at a mix of pipeline and data-partitioned parallelism
– a 'BigN-nodes' file: aims at full data-partitioned parallelism

Only one file is active while a step is running

The Framework queries (first) the environment variable $APT_CONFIG_FILE

The # of nodes declared in the configuration file need not match the # of CPUs

The same configuration file can be used on development and target machines
Scheduling: Nodes, Processes, and CPUs

DS/EE does not:
– know how many CPUs are available
– schedule

Who knows what?
– User: knows the # of nodes (Y); not the # of ops (N)
– Orchestrate: knows the # of nodes and ops (Y, Y); # of processes = Nodes * Ops; not the # of CPUs (N)
– O/S: knows the # of processes and the # of CPUs (Y)

Where:
– Nodes = # of logical nodes declared in the config. file
– Ops = # of ops. (approx. the # of blue boxes in V.O.)
– Processes = # of Unix processes
– CPUs = # of available CPUs

Who does what?
– DS/EE creates (Nodes * Ops) Unix processes
– The O/S schedules these processes on the CPUs
(a worked example follows)
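For example (the numbers are illustrative, not from the course): a flow whose score shows 6 operators, run with a 4-node configuration file, yields roughly 4 * 6 = 24 Unix player processes; the operating system then schedules those 24 processes onto however many CPUs the machine actually has.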
Configuring DSEE – Node Pools
{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {}
    resource scratchdisk "/temp" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
    resource scratchdisk "/temp" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}
Configuring DSEE – Disk Pools
{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {"bigdata"}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {"bigdata"}
    resource scratchdisk "/temp" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
    resource scratchdisk "/temp" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}
Re-Partitioning
Parallel-to-parallel flow may incur reshuffling: records may jump between nodes.
[Diagram: a partitioner sits between node 1 and node 2 in the flow.]
Re-Partitioning X-ray
Partitioner with parallel import.
When a partitioner receives:
• sequential input (1 partition), it creates N partitions
• parallel input (N partitions), it outputs N partitions*, which may result in re-partitioning
* assuming no "constraints"
Automatic Re-Partitioning
If Stage 2 runs in parallel, DS/EE silently inserts a partitioner upstream of it. If Stage 1 also runs in parallel, re-partitioning occurs.
[Diagram: Stage 1 (partitions 1 and 2) → partitioner → Stage 2.]
In most cases, automatic re-partitioning is benign (no reshuffling), preserving the same partitioning as upstream.
Re-partitioning can be forced to be benign, using either:
– the Same partitioning method
– preserve partitioning
Partitioning Methods

Auto

Hash

Entire

Range

Range Map
Collectors
• Collectors combine partitions of a dataset into a
single input stream to a sequential Stage
...
data partitions
collector
–Collectors do NOT synchronize data
sequential Stage
Partitioning and Repartitioning Are
Visible On Job Design
Partitioning and Collecting Icons
Partitioner
Collector
Setting a Node Constraint in the GUI
Reading Messages in Director

Set APT_DUMP_SCORE to true

Can be specified as job parameter

Messages sent to Director log

If set, parallel job will produce a report showing
the operators, processes, and datasets in the
running job
Messages With APT_DUMP_SCORE
= True
Exercise

Complete exercise 5-2
Module 6
Transforming Data
Module Objectives

Understand ways DataStage allows you to
transform data

Use this understanding to:
– Create column derivations using user-defined code or
system functions
– Filter records based on business criteria
– Control data flow based on data conditions
Transformed Data

Transformed data is:
– Outgoing column is a derivation that may, or may not,
include incoming fields or parts of incoming fields
– May be comprised of system variables

Frequently uses functions performed on
something (ie. incoming columns)
– Divided into categories – I.e.





Date and time
Mathematical
Logical
Null handling
More
Stages Review

Stages that can transform data
– Transformer


Parallel
Basic (from Parallel palette)
– Aggregator (discussed in later module)

Sample stages that do not transform data
–
–
–
–
Sequential
FileSet
DataSet
DBMS
Transformer Stage Functions

Control data flow

Create derivations
Flow Control

Separate record flow down links based on data
condition – specified in Transformer stage
constraints

Transformer stage can filter records

Other stages can filter records but do not exhibit
advanced flow control
– Sequential
– Lookup
– Filter
Rejecting Data

Reject option on sequential stage
– Data does not agree with meta data
– Output consists of one column with binary data type

Reject links (from Lookup stage) result from the
drop option of the property “If Not Found”
– Lookup “failed”
– All columns on reject link (no column mapping option)

Reject constraints are controlled from the
constraint editor of the transformer
– Can control column mapping
– Use the “Other/Log” checkbox
Rejecting Data Example
Constraint – Other/log option
Property Reject
Mode = Output
“If Not Found”
property
Transformer Stage Properties
Transformer Stage Variables

First of the transformer stage entities to execute

Execute in order from top to bottom
– You can write a program by using one stage variable to point to the results of a previous stage variable

Multi-purpose:
– Counters
– Hold values from previous rows to make comparisons
– Hold derivations to be used in multiple field derivations
– Can be used to control execution of constraints
Stage Variables
Show/Hide button
Transforming Data

Derivations
– Using expressions
– Using functions (e.g. date/time)

Transformer stage issues:
– Sometimes sorting is required before the transformer stage, e.g. when using a stage variable as an accumulator and needing to break on a change of column value (a stage-variable sketch follows)

Checking for nulls
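A minimal sketch of such stage variables (the link and column names in.CustID and in.Amount, and the variable names, are assumptions): because stage variables execute top to bottom, the comparison against the previous key happens before the previous key is overwritten. The input must be sorted on CustID for the break logic to work.

svIsNewKey   : If in.CustID <> svPrevKey Then 1 Else 0
svRunningTot : If svIsNewKey = 1 Then in.Amount Else svRunningTot + in.Amount
svPrevKey    : in.CustID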
Checking for Nulls

Nulls can get introduced into the dataflow
because of failed lookups and the way in which
you chose to handle this condition

Can be handled in constraints, derivations, stage
variables, or a combination of these
Nullability
You can set the value used for a null, e.g. if the value of the column is null, put "NULL" in the outgoing column.

Source field → destination field results:
– not_nullable → not_nullable: source value propagates to destination.
– not_nullable → nullable: source value propagates; destination value is never null.
– nullable → not_nullable: WARNING messages in the log; if the source value is null, a fatal error occurs – must handle in a transformer.
– nullable → nullable: source value or null propagates.
Transformer Stage – Handling Rejects
1. Constraint rejects
   – All expressions are false and the reject row option is checked
2. Expression error rejects
   – An improperly handled null
Transformer: Execution Order
• Derivations in stage variables are executed first
• Constraints are executed before derivations
• Column derivations in earlier links are executed before later links
• Derivations in higher columns are executed before lower columns
Two Transformers for the Parallel Palette
(found under All > Processing and Parallel > Processing)

Transformer:
– The non-Universe transformer
– Has a specific set of functions
– No DS routines available

Basic Transformer:
– Makes server-style transforms available on the parallel palette
– Can use DS routines
– No need for a shared container to get Universe functionality on the parallel palette

Program in Basic for both transformers
Transformer Functions From
Derivation Editor

Date & Time

Logical

Mathematical

Null Handling

Number

Raw

String

Type Conversion

Utility
Timestamps and Dates

Date & Time

Also some in Type Conversion
Exercise

Complete exercises 6-1, 6-2, and 6-3
Module 7
Sorting Data
Objectives

Understand DataStage EE sorting options

Use this understanding to create sorted list of data
to enable functionality within a transformer stage
Sorting Data

Important because
– Transformer may be using stage variables for
accumulators or control breaks and order is important
– Other stages may run faster – I.e Aggregator
– Facilitates the RemoveDups stage, order is important
– Job has partitioning requirements

Can be performed
– Option within stages (use input > partitioning tab and
set partitioning to anything other than auto)
– As a separate stage (more complex sorts)
Sorting Alternatives
• Alternative representation of same flow:
Sort Option on Stage Link
Sort Stage
Sort Utility

DataStage – the default

SyncSort

UNIX
Sort Stage - Outputs

Specifies how the output is derived
Sort Specification Options

Input Link Property
– Limited functionality
– Max memory/partition is 20 MB, then
spills to scratch

Sort Stage
– Tunable to use more memory before
spilling to scratch.

Note: Spread I/O by adding more scratch file
systems to each node of the APT_CONFIG_FILE
Removing Duplicates

Can be done by Sort
– Use unique option
OR

Remove Duplicates stage
– Has more sophisticated ways to remove duplicates
Exercise

Complete exercise 7-1
Blank
Module 8
Combining Data
Objectives

Understand how DataStage can combine data
using the Join, Lookup, Merge, and Aggregator
stages

Use this understanding to create jobs that will
– Combine data from separate input streams
– Aggregate data to form summary totals
Combining Data

There are two ways to combine data:
– Horizontally: several input links; one output link (plus optional rejects) made of columns from different input links, e.g.:
  Joins
  Lookup
  Merge
– Vertically: one input link; an output whose columns combine values from all input rows, e.g.:
  Aggregator
Recall the Join, Lookup &
Merge
Stages

These "three Stages" combine two or more input
links according to values of user-designated "key"
column(s).

They differ mainly in:
– Memory usage
– Treatment of rows with unmatched key values
– Input requirements (sorted, de-duplicated)
Joins - Lookup - Merge: Not All Links Are Created Equal!
• enterprise edition distinguishes between:
  - the Primary Input (Framework port 0)
  - Secondary inputs – in some cases "Reference" (other ports)
• Naming convention (Primary Input = port 0; Secondary Input(s) = ports 1, …):
  – Joins: primary = Left; secondary = Right
  – Lookup: primary = Source; secondary = LU Table(s)
  – Merge: primary = Master; secondary = Update(s)
Tip: check the "Input Ordering" tab to make sure the intended primary is listed first.
Join Stage Editor
Link Order
immaterial for Inner
and Full Outer Joins
(but VERY important
for Left/Right Outer
and Lookup and
Merge)
One of four variants:
– Inner
– Left Outer
– Right Outer
– Full Outer
Several key columns
allowed
1. The Join Stage
Four types:
• Inner
• Left Outer
• Right Outer
• Full Outer

2 sorted input links, 1 output link
– "left" on the primary input, "right" on the secondary input
– Pre-sort makes joins "lightweight": few rows need to be in RAM
2. The Lookup Stage
Combines:
– one source link with
– one or more duplicate-free table links
[Diagram: the source input (port 0) and one or more lookup tables (LUTs, ports 1, 2, …) feed the Lookup stage, which produces an output link (0) and an optional reject link (1).]
• No pre-sort necessary
• Allows multiple-key LUTs
• Flexible exception handling for source input rows with no match
The Lookup Stage

Lookup Tables should be small enough to fit into
physical memory (otherwise, performance hit due to
paging)
– Space time trade-off: presort vs. in
RAM table

On an MPP you should partition the lookup tables
using entire partitioning method, or partition them the
same way you partition the source link

On an SMP, no physical duplication of a Lookup
Table occurs
The Lookup Stage

Lookup File Set
– Like a persistent data set, except that it contains metadata about the key
– Useful for staging lookup tables

RDBMS lookup
– SPARSE: issues a select for each source row; might become a performance bottleneck
– NORMAL: loads the table into an in-memory hash table first
3. The Merge Stage

Combines:
– one sorted, duplicate-free master (primary) link with
– one or more sorted update (secondary) links
– Pre-sort makes merge "lightweight": few rows need to be in RAM (as with joins, but opposite to lookup)

Follows the Master-Update model:
– A master row and one or more update rows are merged if they have the same value in the user-specified key column(s)
– If a non-key column occurs in several inputs, the lowest input port number prevails (e.g., master over update; update values are ignored)
– Unmatched ("bad") master rows can be either kept or dropped
– Unmatched ("bad") update rows in an input link can be captured in a "reject" link
– Matched update rows are consumed
The Merge Stage
[Diagram: the master input (port 0) and one or more update inputs (ports 1, 2, …) feed the Merge stage, which produces one output link plus one reject link per update link.]
• Allows composite keys
• Multiple update links
• Matched update rows are consumed
• Unmatched updates can be captured
• Lightweight
• Space/time tradeoff: pre-sorts vs. in-RAM table
Synopsis: Joins, Lookup, & Merge

– Model: Joins = RDBMS-style relational; Lookup = Source vs. in-RAM LU Table; Merge = Master - Update(s)
– Memory usage: Joins = light; Lookup = heavy; Merge = light
– # and names of inputs: Joins = exactly 2 (1 left, 1 right); Lookup = 1 Source, N LU Tables; Merge = 1 Master, N Update(s)
– Mandatory input sort: Joins = both inputs; Lookup = no; Merge = all inputs
– Duplicates in primary input: Joins = OK (x-product); Lookup = OK; Merge = Warning!
– Duplicates in secondary input(s): Joins = OK (x-product); Lookup = Warning!; Merge = OK only when N = 1
– Options on unmatched primary: Joins = NONE; Lookup = [fail] | continue | drop | reject; Merge = [keep] | drop
– Options on unmatched secondary: Joins = NONE; Lookup = NONE; Merge = capture in reject set(s)
– On match, secondary entries are: Joins = reusable; Lookup = reusable; Merge = consumed
– # outputs: Joins = 1; Lookup = 1 out (1 reject); Merge = 1 out (N rejects)
– Captured in reject set(s): Joins = nothing (N/A); Lookup = unmatched primary entries; Merge = unmatched secondary entries
The Aggregator Stage
Purpose: perform data aggregations
Specify:
– Zero or more key columns that define the aggregation units (or groups)
– Columns to be aggregated
– Aggregation functions: sum, count (nulls/non-nulls), mean, max/min/range, standard error, % coeff. of variation, standard deviation, sum of weights, un/corrected sum of squares, variance
The grouping method (hash table or pre-sort) is a performance issue.
Grouping Methods

Hash: results for each aggregation group are stored in a
hash table, and the table is written out after all input has
been processed
– doesn't require sorted data
– good when the number of unique groups is small: the running tally for each group's aggregate calculations needs to fit easily into memory, requiring about 1 KB of RAM per group
– Example: average family income by state requires about 50 groups × 1 KB ≈ 0.05 MB of RAM

Sort: results for only a single aggregation group are kept in
memory; when new group is seen (key value changes),
current group written out.
– requires input sorted by grouping keys
– can handle unlimited numbers of groups
– Example: average daily balance by credit card
Aggregator Functions

Sum

Min, max

Mean

Missing value count

Non-missing value count

Percent coefficient of variation
Aggregator Properties
Aggregation Types
Aggregation types
Containers

Two varieties
– Local
– Shared

Local
– Simplifies a large, complex diagram

Shared
– Creates reusable object that many jobs can include
Creating a Container

Create a job

Select (loop) portions to containerize

Edit > Construct container > local or shared
Using a Container

Select as though it were a stage
Exercise

Complete exercise 8-1
Module 9
Configuration Files
Objectives

Understand how DataStage EE uses configuration
files to determine parallel behavior

Use this understanding to
– Build a EE configuration file for a computer system
– Change node configurations to support adding
resources to processes that need them
– Create a job that will change resource allocations at the
stage level
Configuration File Concepts

Determine the processing nodes and disk space
connected to each node

When system changes, need only change the
configuration file – no need to recompile jobs

When DataStage job runs, platform reads
configuration file
– Platform automatically scales the application to fit the
system
Processing Nodes Are

Locations on which the framework runs
applications

Logical rather than physical construct

Do not necessarily correspond to the number of
CPUs in your system
– Typically one node for two CPUs

Can define one processing node for multiple
physical nodes or multiple processing nodes for
one physical node
Optimizing Parallelism

Degree of parallelism determined by number of
nodes defined

Parallelism should be optimized, not maximized
– Increasing parallelism distributes work load but also
increases Framework overhead

Hardware influences degree of parallelism
possible

System hardware partially determines
configuration
More Factors to Consider

Communication amongst operators
– Should be optimized by your configuration
– Operators exchanging large amounts of data should be
assigned to nodes communicating by shared memory
or high-speed link

SMP – leave some processors for operating
system

Desirable to equalize partitioning of data

Use an experimental approach
– Start with small data sets
– Try different parallelism while scaling up data set sizes
Factors Affecting Optimal Degree of
Parallelism

CPU intensive applications
– Benefit from the greatest possible parallelism

Applications that are disk intensive
– Number of logical nodes equals the number of disk
spindles being accessed
EE Configuration File

Text file containing string data that is passed to
the Framework
– Sits on server side
– Can be displayed and edited

Name and location found in environmental
variable APT_CONFIG_FILE

Components
–
–
–
–
Node
Fast name
Pools
Resource
Sample Configuration File
{
  node "Node1" {
    fastname "BlackHole"
    pools "" "node1"
    resource disk "/usr/dsadm/Ascential/DataStage/Datasets" {pools ""}
    resource scratchdisk "/usr/dsadm/Ascential/DataStage/Scratch" {pools ""}
  }
}
Node Options
•
Node name - name of a processing node used by EE
– Typically the network name
– Use command uname -n to obtain network name

Fastname –
– Name of node as referred to by fastest network in the system
– Operators use physical node name to open connections
– NOTE: for SMP, all CPUs share single connection to network

Pools
– Names of pools to which this node is assigned
– Used to logically group nodes
– Can also be used to group resources

Resource
– Disk
– Scratchdisk
Node Pools

node "node1" {
  fastname "server_name"
  pool "pool_name"
}

"pool_name" is the name of the node pool, e.g. "extra"

Node pools group processing nodes based on usage
– Example: memory capacity and high-speed I/O

One node can be assigned to multiple pools.
The default node pool ("") is made up of every node defined in the config file, unless the node is qualified as belonging to a different pool and is not designated as belonging to the default pool (see the following example).
Resource Disk and Scratchdisk
node "node_0" {
  fastname "server_name"
  pool "pool_name"
  resource disk "path" {pool "pool_1"}
  resource scratchdisk "path" {pool "pool_1"}
  ...
}

The resource type can be disk(s) or scratchdisk(s).
"pool_1" is the disk or scratchdisk pool, allowing you to group disks and/or scratchdisks.
Disk Pools

Disk pools allocate storage

Pooling applies to both disk types
pool "bigdata"

By default, EE uses the default
pool, specified by “”
Sorting Requirements
Resource pools can also be specified for
sorting:

The Sort stage looks first for scratch disk resources
in a
“sort” pool, and then in the default disk pool

Sort uses as many scratch disks as defined in the
first pool it finds
Configuration File: Example
{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "sort"
    resource disk "/data/n1/d1" {}
    resource disk "/data/n1/d2" {}
    resource scratchdisk "/scratch" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/data/n2/d1" {}
    resource scratchdisk "/scratch" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/data/n3/d1" {}
    resource scratchdisk "/scratch" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/data/n4/d1" {}
    resource scratchdisk "/scratch" {}
  }
  ...
}
Resource Types

Disk

Scratchdisk

DB2

Oracle

Saswork

Sortwork

Can exist in a pool
– Groups resources together
Using Different Configurations
Lookup stage where DBMS is using a sparse lookup type
Building a Configuration File

Scoping the hardware:
– Is the hardware configuration SMP, Cluster, or MPP?
– Define each node structure (an SMP would be a single node):
  Number of CPUs
  CPU speed
  Available memory
  Available page/swap space
  Connectivity (network/back-panel speed)
– Is the machine dedicated to EE? If not, what other applications are running on it?
– Get a breakdown of the resource usage (vmstat, mpstat, iostat)
– Are there other configuration restrictions? E.g. the DB only runs on certain nodes and ETL cannot run on them?
To Create Hardware Specifications

Complete one worksheet per node:
– Hardware type, node name
– TCP address, switch address
– # of CPUs, CPU speed
– Memory, page space, swap space

Complete one worksheet per disk subsystem:
– Hardware type, storage type, storage size
– Shared between nodes? (Y or N)
– # of channels/controllers, throughput
– Read/write cache size, read/write hit ratio, I/O rate (R/W)

In addition, record all coexistence usage (other applications or subsystems sharing this disk subsystem).
"Ballpark" Tuning
As a rule of thumb, generally expect the application to be I/O constrained. First try to parallelize (spread out) the I/O as much as possible. To validate that you've achieved this:
1. Calculate the theoretical I/O bandwidth available (don't forget to reduce this by the amount other applications on this machine or on others are impacting the I/O subsystem). See DS Performance and Tuning for calculation methods.
2. Determine the I/O bandwidth being achieved by the DS application (rows/sec * bytes/row) – a worked example follows.
3. If the I/O rate isn't approximately equal to the theoretical, there is probably a bottleneck elsewhere (CPU, memory, etc.).
4. Attempt to tune to the I/O bandwidth.
5. Pay particular attention to I/O-intensive competing workloads such as database logging, paging/swapping, etc.

Some useful commands:
– iostat (I/O activity)
– vmstat (system activity)
– mpstat (processor utilization)
– sar (resource usage)
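Worked example for step 2 (the numbers are illustrative, not from the course): a job reporting 50,000 rows/sec with 200-byte rows is moving about 50,000 * 200 ≈ 10 MB/sec per stream. If that is well below the theoretical bandwidth from step 1, look for a CPU or memory bottleneck before adding disks.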
Exercise

Complete exercise 9-1 and 9-2
Module 10
Extending DataStage EE
Objectives

Understand the methods by which you can add
functionality to EE

Use this understanding to:
– Build a DataStage EE stage that handles special
processing needs not supplied with the vanilla stages
– Build a DataStage EE job that uses the new stage
EE Extensibility Overview
Sometimes it will be to your advantage to
leverage EE’s extensibility. This extensibility
includes:

Wrappers

Buildops

Custom Stages
When To Leverage EE Extensibility
Types of situations:
Complex business logic, not easily accomplished using standard EE
stages
Reuse of existing C, C++, Java, COBOL, etc…
Wrappers vs. Buildop vs. Custom

Wrappers are good if you cannot or do not
want to modify the application and
performance is not critical.

Buildops: good if you need custom coding but
do not need dynamic (runtime-based) input
and output interfaces.

Custom (C++ coding using framework API): good
if you need custom coding and need dynamic
input and output interfaces.
Building "Wrapped" Stages
You can "wrapper" a legacy executable:
– a binary
– a Unix command
– a shell script
… and turn it into an enterprise edition stage capable, among other things, of parallel execution, as long as the legacy executable is:
– amenable to data-partition parallelism (no dependencies between rows)
– pipe-safe (can read rows sequentially; no random access to data)
Wrappers (Cont’d)
Wrappers are treated as a black box

EE has no knowledge of contents

EE has no means of managing anything that occurs inside
the wrapper

EE only knows how to export data to and import data from
the wrapper

User must know at design time the intended behavior of the
wrapper and its schema interface

If the wrappered application needs to see all records prior
to processing, it cannot run in parallel.
LS Example

Can this command be wrappered?
Creating a Wrapper
To create the “ls” stage
Used in this job ---
Wrapper Starting Point
Creating Wrapped Stages
From Manager:
Right-Click on Stage Type
> New Parallel Stage > Wrapped
We will "wrapper" an existing Unix executable – the ls command
Wrapper - General Page
Name of stage
Unix command to be wrapped
The "Creator" Page
Conscientiously maintaining the Creator page for all your wrapped stages
will eventually earn you the thanks of others.
Wrapper – Properties Page

If your stage will have properties appear, complete the
Properties page
This will be the name of
the property as it appears
in your stage
Wrapper - Wrapped Page
Interfaces – input and output columns; these should first be entered into the table definitions meta data (DS Manager). Let's do that now.
Interface schemas
• Layout interfaces describe what columns the
stage:
– Needs for its inputs (if any)
– Creates for its outputs (if any)
– Should be created as tables with columns in Manager
Column Definition for Wrapper
Interface
How Does the Wrapping Work?
– Define the schema for export and import
  Schemas become interface schemas of the operator and allow for by-name column access
– Define multiple inputs/outputs required by the UNIX executable
[Diagram: input schema → export → stdin or named pipe → UNIX executable → stdout or named pipe → import → output schema]
QUIZ: Why does export precede import?
Update the Wrapper Interfaces

This wrapper will have no input interface – i.e. no input
link. The location will come as a job parameter that will be
passed to the appropriate stage property.
Resulting Job
Wrapped stage
Job Run

Show file from Designer palette
Wrapper Story: Cobol Application

Hardware Environment:
– IBM SP2, 2 nodes with 4 CPU’s per node.

Software:
– DB2/EEE, COBOL, EE

Original COBOL Application:
– Extracted source table, performed lookup against table in DB2, and
Loaded results to target table.
– 4 hours 20 minutes sequential execution

enterprise edition Solution:
– Used EE to perform Parallel DB2 Extracts and Loads
– Used EE to execute COBOL application in Parallel
– EE Framework handled data transfer between
DB2/EEE and COBOL application
– 30 minutes 8-way parallel execution
Buildops
Buildop provides a simple means of extending beyond the
functionality provided by EE, but does not use an existing
executable (like the wrapper)
Reasons to use Buildop include:


Speed / Performance
Complex business logic that cannot be easily represented
using existing stages
– Lookups across a range of values
– Surrogate key generation
– Rolling aggregates


Build once and reusable everywhere within project, no
shared container necessary
Can combine functionality from different stages into one
BuildOps
– The user performs the fun tasks:
encapsulate the business logic in a custom operator
– The enterprise edition interface called “buildop”
automatically performs the tedious, error-prone tasks:
invoke needed header files, build the necessary
“plumbing” for a correct and efficient parallel execution.
– Exploits extensibility of EE Framework
BuildOp Process Overview
From Manager (or Designer):
Repository pane:
Right-Click on Stage Type
> New Parallel Stage > {Custom | Build | Wrapped}
• "Build" stages
from within enterprise edition
• "Wrapping” existing “Unix”
executables
General Page
Identical
to Wrappers,
except:
Under the Build
Tab, your program!
Logic Tab for
Business Logic
Enter Business C/C++
logic and arithmetic in
four pages under the
Logic tab
Main code section goes
in Per-Record page- it
will be applied to all
rows
NOTE: Code will need
to be Ansi C/C++
compliant. If code does
not compile outside of
EE, it won’t compile
within EE either!
Code Sections under Logic Tab
– Definitions: temporary variables are declared [and initialized] here
– Pre-Loop: logic here is executed once BEFORE processing the FIRST row
– Per-Record: the main code section, applied to all rows
– Post-Loop: logic here is executed once AFTER processing the LAST row
I/O and Transfer
Under Interface tab: Input, Output & Transfer pages
First line:
output 0
Optional
renaming of
output port
from default
"out0"
Write row
Input page: 'Auto Read'
Read next row
In-Repository
Table
Definition
'False' setting,
not to interfere
with Transfer
page
I/O and Transfer
First line: transfer of index 0
• Transfer from input in0 to output out0.
• If the page is left blank or Auto Transfer = "False" (and RCP = "False"), only the columns in the output Table Definition are written.
Building Stages: Simple Example

Example – sumNoTransfer
– Adds input columns "a" and "b"; ignores other columns that might be present in the input
– Produces a new "sum" column
– Does not transfer the input columns
[Diagram: input schema a:int32; b:int32 → sumNoTransfer → output schema sum:int32]
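A minimal sketch of the Per-Record logic for this example (assuming buildop exposes the input columns "a" and "b" and the output column "sum", taken from the stage's table definitions, as variables in the generated C++):

// add the two input columns and write the result to the output column
sum = a + b;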
No Transfer
From Peek:
NO TRANSFER
• Causes:
- RCP set to "False" in stage definition
and
- Transfer page left blank, or Auto Transfer = "False"
• Effects:
- input columns "a" and "b" are not transferred
- only new column "sum" is transferred
Compare with transfer ON…
Transfer
TRANSFER
• Causes:
- RCP set to "True" in stage definition
or
- Auto Transfer set to "True"
• Effects:
- new column "sum" is transferred, as well as
- input columns "a" and "b" and
- input column "ignored" (present in input, but
not mentioned in stage)
Adding a Column With Row ID
[Screenshot callouts: the Out table; the output]

Columns vs. Temporary C++ Variables

Columns:
– DS-EE type
– Defined in Table Definitions
– Value refreshed from row to row

Temporary C++ variables:
– C/C++ type
– Need declaration (in the Definitions or Pre-Loop page)
– Value persistent throughout the "loop" over rows, unless modified in code
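A minimal sketch of the row-ID idea (the variable name rowId and the output column name row_id are assumptions):

// Definitions (or Pre-Loop) page – a temporary C++ variable that persists
// across rows
long long rowId = 0;

// Per-Record page – copy the counter into the output column defined in the
// table definition, then increment it for the next row
row_id = rowId;
rowId++;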
Exercise

Complete exercise 10-1 and 10-2
Exercise

Complete exercises 10-3 and 10-4
Custom Stage

Reasons for a custom stage:
– Add EE operator not already in DataStage EE
– Build your own Operator and add to DataStage EE

Use EE API

Use Custom Stage to add new operator to EE
canvas
Custom Stage
DataStage Manager > select Stage Types branch >
right click
Custom Stage
Number of input and
output links allowed
Name of Orchestrate
operator to be used
Custom Stage – Properties Tab
The Result
Module 11
Meta Data in DataStage EE
Objectives

Understand how EE uses meta data, particularly
schemas and runtime column propagation

Use this understanding to:
– Build schema definition files to be invoked in DataStage
jobs
– Use RCP to manage meta data usage in EE jobs
Establishing Meta Data

Data definitions
– Recordization and columnization
– Fields have properties that can be set at individual field
level

Data types in GUI are translated to types used by EE
– Described as properties on the format/columns tab
(outputs or inputs pages) OR
– Using a schema file (can be full or partial)

Schemas
– Can be imported into Manager
– Can be pointed to by some job stages (i.e. Sequential)
Data Formatting – Record Level

Format tab

Meta data described on a record basis

Record level properties
Data Formatting – Column Level

Defaults for all columns
Column Overrides

Edit row from within the columns tab

Set individual column properties
Extended Column Properties
Field and
string
settings
Extended Properties – String Type

Note the ability to convert ASCII to EBCDIC
Editing Columns
Properties depend
on the data type
Schema

Alternative way to specify column definitions for data used in EE jobs

Written in a plain text file

Can be written as a partial record definition

Can be imported into the DataStage repository

record {final_delim=end, delim=",", quote=double}
( first_name: string;
  last_name: string;
  gender: string;
  birth_date: date;
  income: decimal[9,2];
  state: string;
)
Creating a Schema

Using a text editor
– Follow correct syntax for definitions
– OR

Import from an existing data set or file set
– On DataStage Manager import > Table Definitions >
Orchestrate Schema Definitions
– Select checkbox for a file with .fs or .ds
Importing a Schema
Schema location can be
on the server or local work
station
Data Types

Date

Vector

Decimal

Subrecord

Floating point

Raw

Integer

Tagged

String

Time

Timestamp
Partial Schemas

Only need to define the column definitions that you are actually going to operate on (a sketch follows)

Allowed by stages with a format tab:
– Sequential file stage
– File set stage
– External source stage
– External target stage
– Column import stage
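A minimal sketch (reusing the record properties and column names from the schema example above; whether the remaining columns are carried along depends on RCP, covered next): the schema describes only the two columns the job operates on.

record {final_delim=end, delim=",", quote=double}
( first_name: string;
  income: decimal[9,2];
)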
Runtime Column Propagation

DataStage EE is flexible about meta data. It can cope with the
situation where meta data isn’t fully defined. You can define part
of your schema and specify that, if your job encounters extra
columns that are not defined in the meta data when it actually
runs, it will adopt these extra columns and propagate them
through the rest of the job. This is known as runtime column
propagation (RCP).

RCP is always on at runtime.

Design- and compile-time column mapping enforcement:
– RCP is off by default
– Enable it first at the project level (Administrator, project properties)
– Then enable it at the job level (job properties, General tab)
Enabling RCP at Project Level
Enabling RCP at Job Level
Enabling RCP at Stage Level


Go to output link’s columns tab
For transformer you can find the output links
columns tab by first going to stage properties
Using RCP with Sequential Stages

To utilize runtime column propagation in the sequential stage you must use the "use schema" option

Stages with this restriction:
– Sequential
– File Set
– External Source
– External Target
Runtime Column Propagation

When RCP is Disabled
– DataStage Designer will enforce Stage Input Column
to Output Column mappings.
– At job compile time modify operators are inserted on
output links in the generated osh.
Runtime Column Propagation

When RCP is Enabled
– DataStage Designer will not enforce mapping rules.
– No Modify operator inserted at compile time.
– Danger of a runtime error if incoming column names do not match the column names on the outgoing link (names are case sensitive).
Exercise

Complete exercises 11-1 and 11-2
Module 12
Job Control Using the Job
Sequencer
Objectives

Understand how the DataStage job sequencer
works

Use this understanding to build a control job to run
a sequence of DataStage jobs
Job Control Options

Manually write job control
– Code generated in Basic
– Use the job control tab on the job properties page
– Generates basic code which you can modify

Job Sequencer
– Build a controlling job much the same way you build
other jobs
– Comprised of stages and links
– No basic coding
Job Sequencer

Build like a regular job

Type “Job Sequence”

Has stages and links

Job Activity stage
represents a DataStage
job

Links represent passing
control
Stages
Example
Job Activity
stage – contains
conditional
triggers
Job Activity Properties
Job to be executed –
select from dropdown
Job parameters
to be passed
Job Activity Trigger

Trigger appears as a link in the diagram

Custom options let you define the code
Options

Use custom option for conditionals
– Execute if job run successful or warnings only


Can add “wait for file” to execute
Add “execute command” stage to drop real tables
and rename new tables to current tables
Job Activity With Multiple Links
Different links having
different triggers
Sequencer Stage

Build job sequencer to control job for the
collections application
Can be set to all
or any
Notification Stage
Notification
Notification Activity
Sample DataStage log from Mail Notification
Notification Activity Message

E-Mail Message
Exercise

Complete exercise 12-1
Module 13
Testing and Debugging
Objectives

Understand spectrum of tools to perform testing
and debugging

Use this understanding to troubleshoot a
DataStage job
Environment Variables
Parallel Environment Variables
Environment Variables
Stage Specific
Environment Variables
Environment Variables
Compiler
The Director
Typical Job Log Messages:

Environment variables

Configuration File information

Framework Info/Warning/Error messages

Output from the Peek Stage

Additional info with "Reporting" environments

Tracing/Debug output
– Must compile job in trace mode
– Adds overhead
Job Level Environmental Variables
• Job Properties, from Menu Bar of Designer
• Director will
prompt you
before each
run
Troubleshooting
If you get an error during compile, check the following:

Compilation problems
– If a Transformer is used, check the C++ compiler and LD_LIBRARY_PATH
– If there are Buildop errors, try buildop from the command line
– Some stages may not support RCP – this can cause a column mismatch
– Use the Show Error and More buttons
– Examine the generated OSH
– Check environment variable settings

Very little integrity checking is done during compile; you should run Validate from the Director.
Highlights source of error
Generating Test Data

Row Generator stage can be used
– Column definitions
– Data type dependent

Row Generator plus lookup stages provides good
way to create robust test data from pattern files