DataStage Enterprise Edition - Object Arena Software Solutions

DataStage
Enterprise Edition
Proposed Course Agenda

Day 1
– Review of EE Concepts
– Sequential Access
– Best Practices
– DBMS as Source

Day 2
– EE Architecture
– Transforming Data
– DBMS as Target
– Sorting Data

Day 3
– Combining Data
– Configuration Files
– Extending EE
– Meta Data in EE

Day 4
– Job Sequencing
– Testing and Debugging
The Course Material
Course Manual
Exercise Files and
Exercise Guide
Online Help
Using the Course Material

Suggestions for learning
– Take notes
– Review previous material
– Practice
– Learn from errors
Intro
Part 1
Introduction to DataStage EE
What is DataStage?


Design jobs for Extraction, Transformation, and
Loading (ETL)
Ideal tool for data integration projects – such as,
data warehouses, data marts, and system
migrations

Import, export, create, and manage metadata for
use within jobs

Schedule, run, and monitor jobs all within
DataStage

Administer your DataStage development and
execution environments
DataStage Server and Clients
DataStage Administrator
Client Logon
DataStage Manager
DataStage Designer
DataStage Director
Developing in DataStage

Define global and project properties in
Administrator

Import meta data into Manager

Build job in Designer

Compile in Designer

Validate, run, and monitor in Director
DataStage Projects
Quiz– True or False

DataStage Designer is used to build and compile
your ETL jobs

Manager is used to execute your jobs after you
build them

Director is used to execute your jobs after you
build them

Administrator is used to set global and project
properties
Intro
Part 2
Configuring Projects
Module Objectives

After this module you will be able to:
– Explain how to create and delete projects
– Set project properties in Administrator
– Set EE global properties in Administrator
Project Properties

Projects can be created and deleted in
Administrator

Project properties and defaults are set in
Administrator
Setting Project Properties

To set project properties, log onto Administrator,
select your project, and then click “Properties”
Licensing Tab
Projects General Tab
Environment Variables
Permissions Tab
Tracing Tab
Tunables Tab
Parallel Tab
Intro
Part 3
Managing Meta Data
Module Objectives

After this module you will be able to:
– Describe the DataStage Manager components and
functionality
– Import and export DataStage objects
– Import metadata for a sequential file
What Is Metadata?
Data
Source
Transform
Meta
Data
Target
Meta
Data
Meta Data
Repository
DataStage Manager
Manager Contents

Metadata describing sources and targets: Table
definitions

DataStage objects: jobs, routines, table
definitions, etc.
Import and Export

Any object in Manager can be exported to a file

Can export whole projects

Use for backup

Sometimes used for version control

Can be used to move DataStage objects from
one project to another

Use to share DataStage jobs and projects with
other developers
Export Procedure

In Manager, click “Export>DataStage
Components”

Select DataStage objects for export

Specify type of export: DSX, XML

Specify file path on client machine
Quiz: True or False?

You can export DataStage objects such as jobs,
but you can’t export metadata, such as field
definitions of a sequential file.
Quiz: True or False?

The directory to which you export is on the
DataStage client machine, not on the DataStage
server machine.
Exporting DataStage Objects
Import Procedure


In Manager, click “Import>DataStage
Components”
Select DataStage objects for import
Importing DataStage Objects
Import Options
Exercise

Import DataStage Component (table definition)
Metadata Import

Import format and column definitions from
sequential files

Import relational table column definitions

Imported as “Table Definitions”

Table definitions can be loaded into job stages
Sequential File Import Procedure

In Manager, click Import>Table
Definitions>Sequential File Definitions

Select directory containing sequential file and
then the file

Select Manager category

Examine format and column definitions and edit
if necessary
Manager Table Definition
Importing Sequential Metadata
Intro
Part 4
Designing and Documenting Jobs
Module Objectives

After this module you will be able to:
– Describe what a DataStage job is
– List the steps involved in creating a job
– Describe links and stages
– Identify the different types of stages
– Design a simple extraction and load job
– Compile your job
– Create parameters to make your job flexible
– Document your job
What Is a Job?

Executable DataStage program

Created in DataStage Designer, but can use
components from Manager

Built using a graphical user interface

Compiles into Orchestrate shell language (OSH)
Job Development Overview

In Manager, import metadata defining sources
and targets

In Designer, add stages defining data extractions
and loads

Add Transformers and other stages to define
data transformations

Add links defining the flow of data from sources
to targets

Compile the job

In Director, validate, run, and monitor your job
Designer Work Area
Designer Toolbar
Provides quick access to the main functions of Designer
Show/hide metadata markers
Job
properties
Compile
Tools Palette
Adding Stages and Links

Stages can be dragged from the tools palette or
from the stage type branch of the repository view

Links can be drawn from the tools palette or by
right clicking and dragging from one stage to
another
Sequential File Stage

Used to extract data from, or load data to, a
sequential file

Specify full path to the file

Specify a file format: fixed width or delimited

Specify column definitions

Specify write action
Job Creation Example Sequence

Brief walkthrough of procedure

Presumes meta data already loaded in repository
Designer - Create New Job
Drag Stages and Links Using
Palette
Assign Meta Data
Editing a Sequential Source Stage
Editing a Sequential Target
Transformer Stage

Used to define constraints, derivations, and
column mappings

A column mapping maps an input column to an
output column

In this module we will define only column mappings
(no derivations)
Transformer Stage Elements
Create Column Mappings
Creating Stage Variables
Result
Adding Job Parameters

Makes the job more flexible

Parameters can be:
– Used in constraints and derivations
– Used in directory and file names

Parameter values are determined at run time
Adding Job Documentation

Job Properties
– Short and long descriptions
– Shows in Manager

Annotation stage
– Is a stage on the tool palette
– Shows on the job GUI (work area)
Job Properties Documentation
Annotation Stage on the Palette
Annotation Stage Properties
Final Job Work Area with
Documentation
Compiling a Job
Errors or Successful Message
Intro
Part 5
Running Jobs
Module Objectives

After this module you will be able to:
– Validate your job
– Use DataStage Director to run your job
– Set run options
– Monitor your job’s progress
– View job log messages
Prerequisite to Job Execution
Result from Designer compile
DataStage Director

Can schedule, validate, and run jobs

Can be invoked from DataStage Manager or
Designer
– Tools > Run Director
Running Your Job
Run Options – Parameters and
Limits
Director Log View
Message Details are Available
Other Director Functions

Schedule job to run on a particular date/time

Clear job log

Set Director options
– Row limits
– Abort after x warnings
Module 1
DSEE – DataStage EE
Review
Ascential’s Enterprise
Data Integration Platform
[Diagram: the platform sits between ANY SOURCE and ANY TARGET]
ANY SOURCE: CRM, ERP, SCM, RDBMS, Legacy, Real-time, Client-server, Web services, Data Warehouse, Other apps.
DISCOVER (Data Profiling): gather relevant information for target enterprise applications
PREPARE (Data Quality): cleanse, correct and match input data
TRANSFORM (Extract, Transform, Load): standardize and enrich data and load to targets
Platform services: Command & Control, Parallel Execution, Meta Data Management
ANY TARGET: CRM, ERP, SCM, BI/Analytics, RDBMS, Real-time, Client-server, Web services, Data Warehouse, Other apps.
Course Objectives

You will learn to:
– Build DataStage EE jobs using complex logic
– Utilize parallel processing techniques to increase job
performance
– Build custom stages based on application needs

Course emphasis is:
– Advanced usage of DataStage EE
– Application job development
– Best practices techniques
Course Agenda

Day 1
– Review of EE Concepts
– Sequential Access
– Standards
– DBMS Access

Day 2
– EE Architecture
– Transforming Data
– Sorting Data

Day 3
– Combining Data
– Configuration Files

Day 4
– Extending EE
– Meta Data Usage
– Job Control
– Testing
Module Objectives

Provide a background for completing work in the
DSEE course

Tasks
– Review concepts covered in DSEE Essentials course

Skip this module if you recently completed the
DataStage EE essentials modules
Review Topics

DataStage architecture

DataStage client review
– Administrator
– Manager
– Designer
– Director

Parallel processing paradigm

DataStage Enterprise Edition (DSEE)
Client-Server Architecture
[Diagram: DataStage clients and server between ANY SOURCE and ANY TARGET]
Clients (Microsoft® Windows NT/2000/XP): Designer, Director, Administrator, Manager
Server (Microsoft® Windows NT or UNIX): Repository
Functions shown: Discover, Extract, Prepare, Cleanse, Transform, Extend, Integrate
Platform services: Command & Control, Parallel Execution, Meta Data Management
Sources and targets: CRM, ERP, SCM, BI/Analytics, RDBMS, Real-Time, Client-server, Web services, Data Warehouse, Other apps.
Process Flow

Administrator – add/delete projects, set defaults

Manager – import meta data, backup projects

Designer – assemble jobs, compile, and execute

Director – execute jobs, examine job run logs
Administrator – Licensing and
Timeout
Administrator – Project
Creation/Removal
Functions
specific to a
project.
Administrator – Project Properties
RCP for parallel
jobs should be
enabled
Variables for
parallel
processing
Administrator – Environment
Variables
Variables are
category
specific
OSH is what is
run by the EE
Framework
DataStage Manager
Export Objects to MetaStage
Push meta
data to
MetaStage
Designer Workspace
Can execute
the job from
Designer
DataStage Generated OSH
The EE
Framework
runs OSH
Director – Executing Jobs
Messages
from previous
run in different
color
Stages
Can now customize the Designer’s palette
Select desired stages
and drag to favorites
Popular Developer Stages
Row
generator
Peek
Row Generator

Can build test data
Edit row in
column tab
Repeatable
property
Peek

Displays field values
– Will be displayed in job log or sent to a file
– Skip records option
– Can control number of records to be displayed

Can be used as stub stage for iterative
development (more later)
Why EE is so Effective

Parallel processing paradigm
– More hardware, faster processing
– Level of parallelization is determined by a
configuration file read at runtime

Emphasis on memory
– Data read into memory and lookups performed like
hash table
Parallel Processing Systems

DataStage EE Enables parallel processing =
executing your application on multiple CPUs
simultaneously
– If you add more resources
(CPUs, RAM, and disks) you increase system
performance
[Figure: example system containing 6 CPUs (or processing nodes) and disks]
Scalable Systems: Examples
Three main types of scalable systems

Symmetric Multiprocessors (SMP): shared
memory and disk

Clusters: UNIX systems connected via networks

MPP: Massively Parallel Processing
SMP: Shared Everything
• Multiple CPUs with a single operating system
• Programs communicate using shared memory
• All CPUs share system resources
(OS, memory with single linear address space,
disks, I/O)
When used with Enterprise Edition:
• Data transport uses shared memory
• Simplified startup
[Figure: four CPUs sharing system resources]
Enterprise Edition treats NUMA (NonUniform Memory Access) as plain SMP
Traditional Batch Processing
[Diagram: Source (operational data, archived data) -> Transform -> Disk -> Clean -> Disk -> Load -> Disk -> Target (data warehouse)]
Traditional approach to batch processing:
• Write to disk and read from disk before each processing operation
• Sub-optimal utilization of resources
• a 10 GB stream leads to 70 GB of I/O
• processing resources can sit idle during I/O
• Very complex to manage (lots and lots of small jobs)
• Becomes impractical with big data volumes
• disk I/O consumes the processing
• terabytes of disk required for temporary staging
Pipeline Multiprocessing
Data Pipelining
• Transform, clean and load processes are executing simultaneously on the same processor
• rows are moving forward through the flow
[Diagram: Source (operational data, archived data) -> Transform -> Clean -> Load -> Target (data warehouse), with no disk between the steps]
• Start a downstream process while an upstream process is still
running.
• This eliminates intermediate storing to disk, which is critical for big data.
• This also keeps the processors busy.
• Still has limits on scalability
Think of a conveyor belt moving the rows from process to process!
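The conveyor-belt idea can be sketched with Python generators, which hand each row to the next step as soon as it is produced instead of landing an intermediate file. This is an illustration in ordinary Python, not DataStage code: generators run inside one process, so the sketch shows the row-at-a-time flow but not the simultaneous execution. The row layout and the three step functions are assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, str]


def transform(rows: Iterable[Row]) -> Iterator[Row]:
    """Upper-case the name of each row as it arrives."""
    for row in rows:
        yield {**row, "name": row["name"].upper()}


def clean(rows: Iterable[Row]) -> Iterator[Row]:
    """Drop rows whose name is blank."""
    for row in rows:
        if row["name"].strip():
            yield row


def load(rows: Iterable[Row]) -> List[Row]:
    """Stand-in for the target: pull rows through the pipeline into a list."""
    return list(rows)


source = [{"name": "abbott"}, {"name": " "}, {"name": "zorn"}]

# No step writes an intermediate file; each row moves straight to the next step.
warehouse = load(clean(transform(source)))
print(warehouse)
```

Nothing runs until `load` starts pulling rows, and each row passes through all three steps before the next row is read from the source.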
Partition Parallelism
Data Partitioning
• Break up big data into partitions
• Run one partition on each processor
• 4X faster on 4 processors; with data big enough,
100X faster on 100 processors
• This is exactly how the parallel
databases work!
• Data Partitioning requires the
same transform to all partitions:
Aaron Abbott and Zygmund Zorn
undergo the same transform
[Diagram: source data is split by last name into the ranges A-F, G-M, N-T and U-Z; Node 1 to Node 4 each run the same Transform on one range]
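The last-name ranges on this slide (A-F, G-M, N-T, U-Z) can be sketched as a small range partitioner. This is a plain Python illustration under assumptions, not how the Framework implements partitioning: the `last_name` column, the function names and the sample rows are hypothetical.

```python
from typing import Dict, List, Tuple

# The four ranges from the slide, one per node.
RANGES: List[Tuple[str, str]] = [("A", "F"), ("G", "M"), ("N", "T"), ("U", "Z")]


def partition_for(last_name: str) -> int:
    """Return the index of the range that contains the initial of last_name."""
    initial = last_name[0].upper()
    for index, (low, high) in enumerate(RANGES):
        if low <= initial <= high:
            return index
    raise ValueError(f"no range for initial {initial!r}")


def partition_rows(rows: List[Dict[str, str]]) -> List[List[Dict[str, str]]]:
    """Split rows into one list per range."""
    partitions: List[List[Dict[str, str]]] = [[] for _ in RANGES]
    for row in rows:
        partitions[partition_for(row["last_name"])].append(row)
    return partitions


def transform(row: Dict[str, str]) -> Dict[str, str]:
    """The same transform is applied in every partition."""
    return {**row, "last_name": row["last_name"].upper()}


rows = [{"last_name": "Abbott"}, {"last_name": "Zorn"}, {"last_name": "Miller"}]
results = [[transform(r) for r in part] for part in partition_rows(rows)]
print(results)
```

Aaron Abbott lands in the first partition and Zygmund Zorn in the last, yet both pass through the identical `transform`, which is the requirement the slide states.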
Combining Parallelism Types
Putting It All Together: Parallel Dataflow
[Diagram: partitioned source data flows through pipelined Transform, Clean and Load steps into the data warehouse target]
Repartitioning
Putting It All Together: Parallel Dataflow
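with Repartitioning on-the-fly
[Diagram: pipelined Source -> Transform -> Clean -> Load -> Target (data warehouse); the source data starts in partitions A-F, G-M, N-T and U-Z and is repartitioned between steps by customer last name, customer zip code, and credit card number]
Without Landing To Disk!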
EE Program Elements
• Dataset: uniform set of rows in the Framework's internal representation
- Three flavors:
1. file sets (*.fs): stored on multiple Unix files as flat files
2. persistent (*.ds): stored on multiple Unix files in Framework format;
read and written using the DataSet Stage
3. virtual (*.v): links, in Framework format, NOT stored on disk
- The Framework processes only datasets, hence the possible need for Import
- Different datasets typically have different schemas
- Convention: "dataset" = Framework data set.
• Partition: subset of rows in a dataset earmarked for processing by the
same node (virtual CPU, declared in a configuration file).
- All the partitions of a dataset follow the same schema: that of the dataset
DataStage EE Architecture
DataStage: provides data integration platform
Orchestrate Framework: provides application scalability
[Diagram: an Orchestrate Program is designed as a sequential data flow (flat files and relational data feeding Import, Clean 1, Clean 2, Merge and Analyze); together with a Configuration File, the Orchestrate Application Framework and Runtime System runs the same steps in parallel]
Diagram labels:
• Centralized Error Handling and Event Logging
• Performance Visualization
• Parallel access to data in RDBMS
• Parallel access to data in files
• Parallel pipelining
• Inter-node communications
• Parallelization of operations
DataStage Enterprise Edition:
Best-of-breed scalable data integration platform
No limitations on data volumes or throughput
Introduction to DataStage EE

DSEE:
– Automatically scales to fit the machine
– Handles data flow among multiple CPUs and disks

With DSEE you can:
– Create applications for SMPs, clusters and MPPs
(Enterprise Edition is architecture-neutral)
– Access relational databases in parallel
– Execute external applications in parallel
– Store data across multiple disks and nodes
Job Design VS. Execution
Developer assembles data flow using the Designer
…and gets: parallel access, propagation, transformation, and
load.
The design is good for 1 node, 4 nodes,
or N nodes. To change the number of nodes, just swap the configuration file.
No need to modify or recompile the design
Partitioners and Collectors

Partitioners distribute rows into partitions
– implement data-partition parallelism


Collectors = inverse partitioners
Live on input links of stages running
– in parallel (partitioners)
– sequentially (collectors)

Use a choice of methods
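A partitioner and its inverse, a collector, can be sketched as a pair of functions. Round robin is used here only because it is the simplest method to show; the function names are hypothetical and the sketch is plain Python, not the Framework's implementation.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def round_robin_partition(rows: Sequence[T], partition_count: int) -> List[List[T]]:
    """Deal rows out to partitions in turn: row 0 -> partition 0, row 1 -> partition 1, ..."""
    if partition_count < 1:
        raise ValueError("partition_count must be at least 1")
    partitions: List[List[T]] = [[] for _ in range(partition_count)]
    for index, row in enumerate(rows):
        partitions[index % partition_count].append(row)
    return partitions


def round_robin_collect(partitions: Sequence[Sequence[T]]) -> List[T]:
    """Read one row from each partition in turn until all partitions are empty."""
    collected: List[T] = []
    longest = max((len(p) for p in partitions), default=0)
    for position in range(longest):
        for partition in partitions:
            if position < len(partition):
                collected.append(partition[position])
    return collected


rows = list(range(7))
partitions = round_robin_partition(rows, 3)
print(partitions)
print(round_robin_collect(partitions))
```

Collecting round robin from partitions that were filled round robin restores the original row order, which is why the slide calls collectors inverse partitioners.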
Example Partitioning Icons
partitioner
Exercise

Complete exercises 1-1, 1-2, and 1-3
Module 2
DSEE Sequential Access
Module Objectives

You will learn to:
– Import sequential files into the EE Framework
– Utilize parallel processing techniques to increase
sequential file access
– Understand usage of the Sequential, DataSet, FileSet,
and LookupFileSet stages
– Manage partitioned data stored by the Framework
Types of Sequential Data Stages

Sequential
– Fixed or variable length

File Set

Lookup File Set

Data Set
Sequential Stage Introduction

The EE Framework processes only datasets

For files other than datasets, such as flat files,
Enterprise Edition must perform import and
export operations – this is performed by import
and export OSH operators generated by
Sequential or FileSet stages

During import or export DataStage performs
format translations – into, or out of, the EE
internal format

Data is described to the Framework in a schema
How the Sequential Stage Works

Generates Import/Export operators, depending on
whether stage is source or target

Performs direct C++ file I/O streams
Using the Sequential File Stage
Both import and export of general files (text, binary) are
performed by the SequentialFile Stage.
Importing/Exporting Data
– Data import: external format into EE internal format
– Data export: EE internal format out to external format
Working With Flat Files

Sequential File Stage
– Normally will execute in sequential mode
– Can be parallel if reading multiple files (file pattern
option)
– Can use multiple readers within a node
– DSEE needs to know
  • How file is divided into rows
  • How row is divided into columns

Processes Needed to Import Data

Recordization
– Divides input stream into records
– Set on the format tab

Columnization
– Divides the record into columns
– Default set on the format tab but can be overridden on
the columns tab
– Can be “incomplete” if using a schema or not even
specified in the stage if using RCP
File Format Example
Final Delimiter = end
Field 1 , Field 1 , Field 1 , Last field nl
Final Delimiter = comma
Field 1 , Field 1 , Field 1 , Last field , nl
(the comma is the field delimiter; nl is the record delimiter)
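Recordization, columnization, and the two final-delimiter settings can be sketched in a few lines of Python. This illustrates the concepts only, under the assumption of a newline record delimiter and a comma field delimiter; the function and parameter names are hypothetical and are not Sequential File stage properties.

```python
from typing import List


def recordize(stream: str, record_delimiter: str = "\n") -> List[str]:
    """Divide the input stream into records."""
    records = stream.split(record_delimiter)
    if records and records[-1] == "":
        records.pop()  # the stream ended with a record delimiter
    return records


def columnize(record: str, field_delimiter: str = ",",
              final_delimiter_is_field_delimiter: bool = False) -> List[str]:
    """Divide one record into columns.

    With final delimiter = end, the last field runs to the end of the record.
    With final delimiter = comma, the last field is also followed by a comma.
    """
    if final_delimiter_is_field_delimiter:
        if not record.endswith(field_delimiter):
            raise ValueError("record does not end with the field delimiter")
        record = record[: -len(field_delimiter)]
    return record.split(field_delimiter)


for record in recordize("a,b,c\nd,e,f\n"):
    print(columnize(record))

print(columnize("a,b,c,", final_delimiter_is_field_delimiter=True))
```

A record that does not match the declared format raises an error here; in the Sequential File stage such records would go to the reject link instead.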
Sequential File Stage

To set the properties, use stage editor
– Page (general, input/output)
– Tabs (format, columns)

Sequential stage link rules
– One input link
– One output link (except for reject link definition)
– One reject link

Will reject any records not matching meta data in the column
definitions
Job Design Using Sequential Stages
Stage categories
General Tab – Sequential Source
Multiple output links
Show records
Properties – Multiple Files
Click to add more files having
the same meta data.
Properties - Multiple Readers
Multiple readers option allows
you to set number of readers
Format Tab
File into records
Record into columns
Read Methods
Reject Link

Reject mode = output

Source
– All records not matching the meta data (the column
definitions)

Target
– All records that are rejected for any reason

Meta data – one column, datatype = raw
File Set Stage

Can read or write file sets

Files suffixed by .fs

File set consists of:
1. Descriptor file – contains location of raw data files +
meta data
2. Individual raw data files

Can be processed in parallel
File Set Stage Example
Descriptor file
File Set Usage

Why use a file set?
– 2G limit on some file systems
– Need to distribute data among nodes to prevent
overruns
– If used in parallel, runs faster than a sequential file
Lookup File Set Stage

Can create file sets

Usually used in conjunction with Lookup stages
Lookup File Set > Properties
Key column
specified
Key column
dropped in
descriptor file
Data Set

Operating system (Framework) file

Suffixed by .ds

Referred to by a control file

Managed by Data Set Management utility from
GUI (Manager, Designer, Director)

Represents persistent data

Key to good performance in set of linked jobs
Persistent Datasets

Accessed from/to disk with DataSet Stage.

Two parts:
– Descriptor file (e.g. input.ds):
  contains metadata, data location, but NOT the data itself
    record (
      partno: int32;
      description: string;
    )
– Data file(s):
  contain the data
  multiple Unix files (one per node), accessible in parallel
    node1:/local/disk1/…
    node2:/local/disk2/…
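The schema shown for input.ds is a list of `name: type;` declarations inside `record ( ... )`. As a rough sketch, assuming only that simple form (no nullability, lengths or nested records), it can be read into name/type pairs like this; `parse_schema` is a hypothetical helper, not a DataStage utility.

```python
import re
from typing import List, Tuple


def parse_schema(text: str) -> List[Tuple[str, str]]:
    """Return (column name, type) pairs from a simple 'record ( ... )' schema."""
    match = re.search(r"record\s*\((.*)\)", text, re.S)
    if match is None:
        raise ValueError("no record ( ... ) block found")
    fields = []
    for declaration in match.group(1).split(";"):
        declaration = declaration.strip()
        if declaration:
            name, type_name = (part.strip() for part in declaration.split(":", 1))
            fields.append((name, type_name))
    return fields


schema_text = """
record (
  partno: int32;
  description: string;
)
"""
print(parse_schema(schema_text))
```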
Quiz!
• True or False?
Everything that has been data-partitioned must be
collected in same job
Data Set Stage
Is the data partitioned?
Engine Data Translation

Occurs on import
– From sequential files or file sets
– From RDBMS

Occurs on export
– From datasets to file sets or sequential files
– From datasets to RDBMS

Engine most efficient when processing internally
formatted records (i.e. data contained in datasets)
Managing DataSets


GUI (Manager, Designer, Director) – Tools > Data
Set Management

Alternative methods
– Orchadmin
  Unix command-line utility
  • List records
  • Remove data sets (will remove all components)
– Dsrecords
  Lists number of records in a dataset
Data Set Management
Display data
Schema
Data Set Management From Unix

Alternative method of managing file sets and data
sets
– Dsrecords
  Gives record count
  Unix command-line utility
  $ dsrecords ds_name
  e.g. $ dsrecords myDS.ds
  156999 records
– Orchadmin
  Manages EE persistent data sets
  Unix command-line utility
  e.g. $ orchadmin rm myDataSet.ds
Exercise

Complete exercises 2-1, 2-2, 2-3, and 2-4.
Module 3
Standards and Techniques
Objectives

Establish standard techniques for DSEE
development

Will cover:
– Job documentation
– Naming conventions for jobs, links, and stages
– Iterative job design
– Useful stages for job development
– Using configuration files for development
– Using environmental variables
– Job parameters
Job Presentation
Document using the
annotation stage
Job Properties Documentation
Organize jobs into
categories
Description shows in DS
Manager and MetaStage
Naming conventions

Stages named after the
– Data they access
– Function they perform
– DO NOT leave defaulted stage names like
Sequential_File_0

Links named for the data they carry
– DO NOT leave defaulted link names like DSLink3
Stage and Link Names
Stages and links
renamed to data they
handle
Create Reusable Job Components

Use Enterprise Edition shared containers when
feasible
Container
Use Iterative Job Design



Use copy or peek stage as stub
Test job in phases – small first, then increasing in
complexity
Use Peek stage to examine records
Copy or Peek Stage Stub
Copy stage
Transformer Stage
Techniques

Suggestions
– Always include a reject link.
– Always test for null value before using a column in a
function.
– Try to use RCP and only map columns that have a
derivation other than a copy. More on RCP later.
– Be aware of Column and Stage variable Data Types.
  Often the user does not pay attention to Stage Variable type.
– Avoid type conversions.
  Try to maintain the data type as imported.
The Copy Stage
With 1 link in, 1 link out:
the Copy Stage is the ultimate "no-op" (place-holder):
– Partitioners
– Sort / Remove Duplicates
– Rename, Drop column
… can be inserted on:
– input link (Partitioning): Partitioners, Sort, Remove Duplicates
– output link (Mapping page): Rename, Drop.
Can sometimes replace the Transformer.
Developing Jobs
1. Keep it simple
• Jobs with many stages are hard to debug and maintain.
2. Start small and build to final solution
• Use view data, copy, and peek.
• Start from source and work out.
• Develop with a 1 node configuration file.
3. Solve the business problem before the performance
problem.
• Don’t worry too much about partitioning until the
sequential flow works as expected.
4. If you have to write to disk use a persistent data set.
Final Result
Good Things to Have in each Job

Use job parameters

Some helpful environmental variables to add to
job parameters
– $APT_DUMP_SCORE
  Report OSH to message log
– $APT_CONFIG_FILE
  Establishes runtime parameters to EE engine; i.e. degree of
  parallelization
Setting Job Parameters
Click to add
environment
variables
DUMP SCORE Output
Setting APT_DUMP_SCORE yields:
Double-click
Partitioner and Collector
Mapping node --> partition
Use Multiple Configuration Files

Make a set for 1X, 2X,….

Use different ones for test versus production

Include as a parameter in each job
Exercise

Complete exercise 3-1
Module 4
DBMS Access
Objectives

Understand how DSEE reads and writes records
to an RDBMS

Understand how to handle nulls on DBMS lookup

Utilize this knowledge to:
– Read and write database tables
– Use database tables to lookup data
– Use null handling options to clean data
Parallel Database Connectivity
Traditional Client-Server
[Diagram: several clients, each with its own connection to a parallel RDBMS]
– Only RDBMS is running in parallel
– Each application has only one connection
– Suitable only for small data volumes

Enterprise Edition
[Diagram: clients, with parallel Sort and Load steps connected in parallel to a parallel RDBMS]
– Parallel server runs APPLICATIONS
– Application has parallel connections to RDBMS
– Suitable for large data volumes
– Higher levels of integration possible

RDBMS Access
Supported Databases
Enterprise Edition provides high performance /
scalable interfaces for:

DB2

Informix

Oracle

Teradata
RDBMS Access

Automatically convert RDBMS table layouts to/from
Enterprise Edition Table Definitions

RDBMS nulls converted to/from nullable field values

Support for standard SQL syntax for specifying:
– field list for SELECT statement
– filter for WHERE clause


Can write an explicit SQL query to access RDBMS
EE supplies additional information in the SQL query
RDBMS Stages

DB2/UDB Enterprise

Informix Enterprise

Oracle Enterprise

Teradata Enterprise
RDBMS Usage

As a source
– Extract data from table (stream link)
  • Extract as table, generated SQL, or user-defined SQL
  • User-defined can perform joins, access views
– Lookup (reference link)
  • Normal lookup is memory-based (all table data read into
    memory)
  • Can perform one lookup at a time in DBMS (sparse option)
  • Continue/drop/fail options

As a target
– Inserts
– Upserts (Inserts and updates)
– Loader
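A normal (memory-based) lookup with the continue/drop/fail choices can be sketched as a dictionary probe. This is a plain Python illustration, not the Lookup stage itself; the `lookup` function, its `on_failure` parameter and the sample columns are assumptions made for the example.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def lookup(rows: List[Row], reference: List[Row], key: str,
           on_failure: str = "continue") -> List[Row]:
    """Join each input row to the reference row with the same key value."""
    if on_failure not in ("continue", "drop", "fail"):
        raise ValueError("on_failure must be 'continue', 'drop' or 'fail'")
    # Normal lookup: the whole reference table is read into memory once.
    table = {ref[key]: ref for ref in reference}
    looked_up_columns = sorted({col for ref in reference for col in ref} - {key})
    output: List[Row] = []
    for row in rows:
        match = table.get(row[key])
        if match is not None:
            output.append({**row, **match})
        elif on_failure == "continue":
            # Pass the row on with nulls in the looked-up columns.
            output.append({**row, **{col: None for col in looked_up_columns}})
        elif on_failure == "drop":
            continue
        else:
            raise LookupError(f"no reference row for {key}={row[key]!r}")
    return output


rows = [{"custid": 1}, {"custid": 2}]
reference = [{"custid": 1, "city": "Oslo"}]
print(lookup(rows, reference, "custid", on_failure="continue"))
print(lookup(rows, reference, "custid", on_failure="drop"))
```

With "continue", the unmatched row carries a null in the looked-up column, so a later step has to handle that null. A sparse lookup would instead query the database once per input row rather than building `table` up front.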
RDBMS Source – Stream Link
Stream link
DBMS Source - User-defined SQL
Columns in SQL statement
must match the meta data
in columns tab
Exercise

User-defined SQL
– Exercise 4-1
DBMS Source – Reference Link
Reject link
Lookup Reject Link
“Output” option automatically
creates the reject link
Null Handling

Must handle null condition if lookup record is not
found and “continue” option is chosen

Can be done in a transformer stage
Lookup Stage Mapping
Link name
Lookup Stage Properties
Reference
link
Must have same column name
in input and reference links. You
will get the results of the lookup
in the output column.
DBMS as a Target
DBMS As Target

Write Methods
– Delete
– Load
– Upsert
– Write (DB2)

Write mode for load method
– Truncate
– Create
– Replace
– Append
Target Properties
Generated code
can be copied
Upsert mode
determines options
Checking for Nulls

Use Transformer stage to test for fields with null
values (Use IsNull functions)

In Transformer, can reject or load default value
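The two choices named above, reject the row or load a default value, can be sketched as follows. This is ordinary Python standing in for Transformer logic: `is_null` and `handle_nulls` are hypothetical names, and Python's `None` plays the role of a null field.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]


def is_null(value: Any) -> bool:
    """Stand-in for a null test on a column value."""
    return value is None


def handle_nulls(rows: List[Row], column: str,
                 default: Optional[Any] = None) -> Tuple[List[Row], List[Row]]:
    """Return (accepted, rejected) rows.

    If a default is given, null values in the column are replaced by it.
    If no default is given, rows with a null in the column are rejected.
    """
    accepted: List[Row] = []
    rejected: List[Row] = []
    for row in rows:
        if not is_null(row.get(column)):
            accepted.append(row)
        elif default is not None:
            accepted.append({**row, column: default})
        else:
            rejected.append(row)
    return accepted, rejected


rows = [{"custid": 1, "city": "Oslo"}, {"custid": 2, "city": None}]
print(handle_nulls(rows, "city", default="UNKNOWN"))
print(handle_nulls(rows, "city"))
```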
Exercise

Complete exercise 4-2
Module 5
Platform Architecture
Objectives

Understand how Enterprise Edition Framework
processes data

You will be able to:
– Read and understand OSH
– Perform troubleshooting
Concepts

The Enterprise Edition Platform
– Script language - OSH (generated by DataStage
Parallel Canvas, and run by DataStage Director)
– Communication - conductor, section leaders, players.
– Configuration files (only one active at a time,
describes H/W)
– Meta data - schemas/tables
– Schema propagation - RCP
– EE extensibility - Buildop, Wrapper
– Datasets (data in Framework's internal
representation)
DS-EE Stage Elements
EE Stages Involve A Series Of Processing Steps
[Diagram: an EE Stage consists of a Partitioner followed by Business Logic, between an input data set and an output data set]
Input Data Set schema:
prov_num:int16;
member_num:int8;
custid:int32;
Output Data Set schema:
prov_num:int16;
member_num:int8;
custid:int32;
Business Logic
• Piece of Application Logic Running Against Individual Records
• Parallel or Sequential
DSEE Stage Execution
Dual Parallelism Eliminates Bottlenecks!
• EE Delivers Parallelism in Two Ways
– Pipeline
– Partition
• Block Buffering Between Components
– Eliminates Need for Program Load Balancing
– Maintains Orderly Data Flow
[Diagram: Producer and Consumer connected by a pipeline, replicated across partitions]
Stages Control Partition Parallelism

Execution Mode (sequential/parallel) is controlled by Stage
– default = parallel for most Ascential-supplied Stages
– Developer can override default mode
– Parallel Stage inserts the default partitioner (Auto) on its
input links
– Sequential Stage inserts the default collector (Auto) on
its input links
– Developer can override default
  • execution mode (parallel/sequential) of Stage >
    Advanced tab
  • choice of partitioner/collector on Input > Partitioning
    tab
How Parallel Is It?

Degree of parallelism is determined by the
configuration file
– Total number of logical nodes in default pool, or a
subset if using "constraints".

Constraints are assigned to specific pools as defined in
configuration file and can be referenced in the stage
OSH

DataStage EE GUI generates OSH scripts
– Ability to view OSH turned on in Administrator
– OSH can be viewed in Designer using job properties

The Framework executes OSH

What is OSH?
– Orchestrate shell
– Has a UNIX command-line interface
OSH Script

An osh script is a quoted string which
specifies:
– The operators and connections of a single
Orchestrate step
– In its simplest form, it is:
osh "op < in.ds > out.ds"

Where:
– op is an Orchestrate operator
– in.ds is the input data set
– out.ds is the output data set
OSH Operators

OSH Operator is an instance of a C++ class inheriting
from APT_Operator

Developers can create new operators

Examples of existing operators:
– Import
– Export
– RemoveDups
Enable Visible OSH in Administrator
Will be enabled for
all projects
View OSH in Designer
Operator
Schema
OSH Practice

Exercise 5-1 – Instructor demo (optional)
Elements of a Framework Program
• Operators
• Datasets: set of rows processed by Framework
– Orchestrate data sets:
– persistent (terminal) *.ds, and
– virtual (internal) *.v.
– Also: flat “file sets” *.fs
• Schema: data description (metadata) for datasets and links.
Datasets
• Consist of Partitioned Data and Schema
• Can be Persistent (*.ds)
or Virtual (*.v, Link)
• Overcome 2 GB File Limit
What you program: GUI
What gets generated: OSH
$ osh "operator_A > x.ds"
What gets processed: Operator A runs on Node 1, Node 2, Node 3, Node 4, . . .
and writes the data files of x.ds
Multiple files per partition
Each file up to 2GBytes (or larger)
Computing Architectures: Definition
Uniprocessor (dedicated disk)
[Diagram: one CPU with its own memory and disk]
• PC
• Workstation
• Single processor server

SMP System (Symmetric Multiprocessor; shared disk, shared memory)
[Diagram: four CPUs sharing memory and disks]
• IBM, Sun, HP, Compaq
• 2 to 64 processors
• Majority of installations

Clusters and MPP Systems (shared nothing)
[Diagram: four CPUs, each with its own memory and disk]
• 2 to hundreds of processors
• MPP: IBM and NCR Teradata
• each node is a uniprocessor or SMP
Job Execution:
Orchestrate
[Diagram: a Conductor Node (C) and Processing Nodes, each with one Section Leader (SL) and several Players (P)]
• Conductor - initial DS/EE process
– Step Composer
– Creates Section Leader processes (one/node)
– Consolidates messages, outputs them
– Manages orderly shutdown.
• Section Leader
– Forks Players processes (one/Stage)
– Manages up/down communication.
• Players
– The actual processes associated with Stages
– Combined players: one process only
– Send stderr to SL
– Establish connections to other players for data
flow
– Clean up upon completion.
• Communication:
- SMP: Shared Memory
- MPP: TCP
Working with Configuration Files

You can easily switch between config files:
– '1-node' file: for sequential execution, lighter reports; handy for
testing
– 'MedN-nodes' file: aims at a mix of pipeline and data-partitioned
parallelism
– 'BigN-nodes' file: aims at full data-partitioned parallelism

Only one file is active while a step is running

The Framework queries (first) the environment variable:
$APT_CONFIG_FILE

The number of nodes declared in the config file need not match the number of CPUs

Same configuration file can be used in development and
Scheduling
Nodes, Processes, and CPUs

DS/EE does not:
– know how many CPUs are available
– schedule

Who knows what?
               Nodes   Ops   Processes     CPUs
User           Y       N
Orchestrate    Y       Y     Nodes * Ops   N
O/S                          "             Y

Nodes = # logical nodes declared in config. file
Ops = # ops. (approx. # blue boxes in V.O.)
Processes = # Unix processes
CPUs = # available CPUs
Who does what?
– DS/EE creates (Nodes*Ops) Unix processes
– The O/S schedules these processes on the CPUs
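The process count above is simple arithmetic. A minimal Python sketch follows; the function name is hypothetical, and the figure is only the slide's approximation (it ignores combined players and the conductor and section leader processes):

```python
def estimated_player_processes(logical_nodes: int, operators: int) -> int:
    """Approximate number of Unix processes DS/EE creates: Nodes * Ops."""
    if logical_nodes < 1 or operators < 0:
        raise ValueError("need at least one node and a non-negative operator count")
    return logical_nodes * operators


# Example: a 4-node configuration file and a job with 5 operators
print(estimated_player_processes(4, 5))  # 20
```

The operating system, not DS/EE, decides how these processes are spread over the available CPUs.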
Configuring DSEE – Node Pools
{
node "n1" {
fastname "s1"
pool "" "n1" "s1" "app2" "sort"
resource disk "/orch/n1/d1" {}
resource disk "/orch/n1/d2" {}
resource scratchdisk "/temp" {"sort"}
}
node "n2" {
fastname "s2"
pool "" "n2" "s2" "app1"
resource disk "/orch/n2/d1" {}
resource disk "/orch/n2/d2" {}
resource scratchdisk "/temp" {}
}
node "n3" {
fastname "s3"
pool "" "n3" "s3" "app1"
resource disk "/orch/n3/d1" {}
resource scratchdisk "/temp" {}
}
node "n4" {
fastname "s4"
pool "" "n4" "s4" "app1"
resource disk "/orch/n4/d1" {}
resource scratchdisk "/temp" {}
}
}
Configuring DSEE – Disk Pools
{
node "n1" {
fastname "s1"
pool "" "n1" "s1" "app2" "sort"
resource disk "/orch/n1/d1" {}
resource disk "/orch/n1/d2" {"bigdata"}
resource scratchdisk "/temp" {"sort"}
}
node "n2" {
fastname "s2"
pool "" "n2" "s2" "app1"
resource disk "/orch/n2/d1" {}
resource disk "/orch/n2/d2" {"bigdata"}
resource scratchdisk "/temp" {}
}
node "n3" {
fastname "s3"
pool "" "n3" "s3" "app1"
resource disk "/orch/n3/d1" {}
resource scratchdisk "/temp" {}
}
node "n4" {
fastname "s4"
pool "" "n4" "s4" "app1"
resource disk "/orch/n4/d1" {}
resource scratchdisk "/temp" {}
}
}
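To make the pool idea concrete, here is a small Python model of the two examples above. It is only a sketch: the dictionaries copy the pool names from the slides, and the helper names are hypothetical, not part of DataStage.

```python
# Node pool membership copied from the four-node example ("" is the default pool).
NODE_POOLS = {
    "n1": ["", "n1", "s1", "app2", "sort"],
    "n2": ["", "n2", "s2", "app1"],
    "n3": ["", "n3", "s3", "app1"],
    "n4": ["", "n4", "s4", "app1"],
}

# Disk resources and the named disk pools they are tagged with.
DISK_POOLS = {
    "/orch/n1/d1": [],
    "/orch/n1/d2": ["bigdata"],
    "/orch/n2/d1": [],
    "/orch/n2/d2": ["bigdata"],
    "/orch/n3/d1": [],
    "/orch/n4/d1": [],
}


def nodes_in_pool(pool: str) -> list:
    """Return the nodes that declare membership in the given node pool."""
    return [node for node, pools in NODE_POOLS.items() if pool in pools]


def disks_in_pool(pool: str) -> list:
    """Return the disk resources tagged with the given disk pool."""
    return [path for path, pools in DISK_POOLS.items() if pool in pools]


print(nodes_in_pool("app1"))     # ['n2', 'n3', 'n4']
print(disks_in_pool("bigdata"))  # ['/orch/n1/d2', '/orch/n2/d2']
```

Under this model, a stage constrained to node pool "app1" would be limited to n2, n3 and n4, and data directed to the "bigdata" disk pool would land only on the two d2 disks.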
Re-Partitioning
Parallel to parallel flow may incur reshuffling:
Records may jump between nodes
[Diagram: records move between node 1 and node 2 through a partitioner]
Partitioning Methods

Auto

Hash

Entire

Range

Range Map
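Two of the methods listed above, Hash and Entire, can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical and `zlib.crc32` stands in for whatever hash function the Framework actually uses.

```python
import zlib


def hash_partition(rows, key, n_partitions):
    """Send each row to a partition chosen by hashing its key value.

    Rows with the same key value always land in the same partition.
    """
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        partitions[digest % n_partitions].append(row)
    return partitions


def entire_partition(rows, n_partitions):
    """Give every partition a complete copy of the data."""
    return [list(rows) for _ in range(n_partitions)]


rows = [{"cust": c} for c in ["a", "b", "a", "c", "b", "a"]]
for i, part in enumerate(hash_partition(rows, "cust", 3)):
    print(i, part)
```

Keeping equal keys together is what makes Hash suitable ahead of key-based stages, while Entire trades memory for having all rows available on every node.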
Collectors
• Collectors combine partitions of a dataset into a single input stream to a sequential Stage
• Collectors do NOT synchronize data
[Diagram: data partitions → collector → sequential Stage]
Partitioning and Repartitioning Are
Visible On Job Design
Partitioning and Collecting Icons
Partitioner
Collector
Setting a Node Constraint in the GUI
Reading Messages in Director

Set APT_DUMP_SCORE to true

Can be specified as job parameter

Messages sent to Director log

If set, parallel job will produce a report showing
the operators, processes, and datasets in the
running job
Messages With APT_DUMP_SCORE
= True
Exercise

Complete exercise 5-2
Module 6
Transforming Data
Module Objectives

Understand ways DataStage allows you to
transform data

Use this understanding to:
– Create column derivations using user-defined code or
system functions
– Filter records based on business criteria
– Control data flow based on data conditions
Transformed Data

Transformed data is:
– An outgoing column whose derivation may, or may not, include incoming fields or parts of incoming fields
– May be comprised of system variables

Frequently uses functions performed on something (e.g. incoming columns)
– Divided into categories, e.g.:
  • Date and time
  • Mathematical
  • Logical
  • Null handling
  • More
Stages Review

Stages that can transform data
– Transformer
  • Parallel
  • Basic (from Parallel palette)
– Aggregator (discussed in later module)

Sample stages that do not transform data
– Sequential
– FileSet
– DataSet
– DBMS
Transformer Stage Functions

Control data flow

Create derivations
Flow Control

Separate records flow down links based on data
condition – specified in Transformer stage
constraints

Transformer stage can filter records

Other stages can filter records but do not exhibit
advanced flow control
– Sequential can send bad records down reject link
– Lookup can reject records based on lookup failure
– Filter can select records based on data value
Rejecting Data

Reject option on sequential stage
– Data does not agree with meta data
– Output consists of one column with binary data type

Reject links (from Lookup stage) result from the
drop option of the property “If Not Found”
– Lookup “failed”
– All columns on reject link (no column mapping option)

Reject constraints are controlled from the
constraint editor of the transformer
– Can control column mapping
– Use the “Other/Log” checkbox
Rejecting Data Example
Constraint –
Other/log option
Property Reject
Mode = Output
“If Not Found”
property
Transformer Stage Properties
Transformer Stage Variables

First of transformer stage entities to execute

Execute in order from top to bottom
– Can write a program by using one stage variable to
point to the results of a previous stage variable

Multi-purpose
– Counters
– Hold values for previous rows to make comparison
– Hold derivations to be used in multiple field derivations
– Can be used to control execution of constraints
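The counter and previous-row uses listed above can be modelled outside DataStage. The Python sketch below is only an analogy: the local variables play the role of stage variables, their names are hypothetical, and the input is assumed to be sorted on the key.

```python
def number_rows_within_group(rows, key):
    """Number the rows inside each key group, restarting at every key change."""
    prev_key = None  # "stage variable" holding the previous row's key
    counter = 0      # "stage variable" used as a counter
    out = []
    for row in rows:
        # Variables are evaluated top to bottom, so later ones can use earlier ones.
        is_new_group = row[key] != prev_key
        counter = 1 if is_new_group else counter + 1
        prev_key = row[key]
        out.append({**row, "seq_in_group": counter})
    return out


rows = [{"acct": "A"}, {"acct": "A"}, {"acct": "B"}]
print(number_rows_within_group(rows, "acct"))
```

Because the comparison is against the previous row only, unsorted input would restart the counter every time the key flips back and forth.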
Stage Variables
Show/Hide button
Transforming Data

Derivations
– Using expressions
– Using functions


Date/time
Transformer Stage Issues
– Sometimes require sorting before the transformer stage, e.g. when using a stage variable as an accumulator and needing to break on change of column value

Checking for nulls
Checking for Nulls

Nulls can get introduced into the dataflow
because of failed lookups and the way in which
you chose to handle this condition

Can be handled in constraints, derivations, stage
variables, or a combination of these
Transformer - Handling Rejects
Constraint Rejects
– All expressions are
false and reject row is
checked
Transformer: Execution Order
• Derivations in stage variables are executed first
• Constraints are executed before derivations
• Column derivations in earlier links are executed before later links
• Derivations in higher columns are executed before lower columns
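The ordering above can be pictured with a small Python model of one Transformer. It is a sketch under assumptions: the link names, the `amount` column and the thresholds are invented for illustration, and the reject branch mimics a reject link that takes rows for which every constraint is false.

```python
def run_transformer(rows):
    """Apply stage variables, then constraints, then derivations to each row."""
    links = {"high": [], "low": [], "reject": []}
    for row in rows:
        # 1. Stage variables are evaluated first, top to bottom.
        sv_is_null = row["amount"] is None
        sv_amount = 0 if sv_is_null else row["amount"]

        # 2. Constraints decide which output links receive the row.
        constraints = {
            "high": not sv_is_null and sv_amount >= 100,
            "low": not sv_is_null and 0 <= sv_amount < 100,
        }

        # 3. Column derivations run for the links whose constraint is true.
        matched = False
        for link, passed in constraints.items():
            if passed:
                links[link].append({"id": row["id"], "amount_cents": sv_amount * 100})
                matched = True

        # Reject link: used when all constraint expressions are false.
        if not matched:
            links["reject"].append(dict(row))
    return links


rows = [{"id": 1, "amount": 150}, {"id": 2, "amount": 20},
        {"id": 3, "amount": None}, {"id": 4, "amount": -5}]
print(run_transformer(rows))
```

Checking for null in a stage variable first means neither the constraints nor the derivations ever operate on a null value.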
Parallel Palette - Two Transformers

All > Processing > Transformer
– Is the non-Universe transformer
– Has a specific set of functions
– No DS routines available

Parallel > Processing > Basic Transformer
– Makes server style transforms available on the parallel palette
– Can use DS routines

• Program in Basic for both transformers
Transformer Functions From
Derivation Editor

Date & Time

Logical

Null Handling

Number

String

Type Conversion
Exercise

Complete exercises 6-1, 6-2, and 6-3
Module 7
Sorting Data
Objectives

Understand DataStage EE sorting options

Use this understanding to create sorted list of
data to enable functionality within a transformer
stage
Sorting Data

Important because
– Some stages require sorted input
– Some stages may run faster, e.g. Aggregator

Can be performed
– Option within stages (use input > partitioning tab and
set partitioning to anything other than auto)
– As a separate stage (more complex sorts)
Sorting Alternatives
• Alternative representation of same flow:
Sort Option on Stage Link
Sort Stage
Sort Utility

DataStage – the default

UNIX
Sort Stage - Outputs

Specifies how the output is derived
Sort Specification Options

Input Link Property
– Limited functionality
– Max memory/partition is 20 MB, then spills to scratch

Sort Stage
– Tunable to use more memory before spilling to
scratch.

Note: Spread I/O by adding more scratch file
systems to each node of the APT_CONFIG_FILE
Removing Duplicates

Can be done by Sort stage
– Use unique option
OR

Remove Duplicates stage
– Has more sophisticated ways to remove duplicates
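Once data is sorted on the key, removing duplicates only requires comparing neighbouring rows. A minimal Python sketch; the function name and the `retain` parameter are assumptions for illustration, not stage property names:

```python
from itertools import groupby
from operator import itemgetter


def remove_duplicates(sorted_rows, key, retain="first"):
    """Keep one row per key value from input already sorted on that key."""
    out = []
    for _, group in groupby(sorted_rows, key=itemgetter(key)):
        rows = list(group)  # only one key group is held in memory at a time
        out.append(rows[0] if retain == "first" else rows[-1])
    return out


rows = [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]
print(remove_duplicates(rows, "id"))          # keeps v='a' and v='c'
print(remove_duplicates(rows, "id", "last"))  # keeps v='b' and v='c'
```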
Exercise

Complete exercise 7-1
Module 8
Combining Data
Objectives

Understand how DataStage can combine data
using the Join, Lookup, Merge, and Aggregator
stages

Use this understanding to create jobs that will
– Combine data from separate input streams
– Aggregate data to form summary totals
Combining Data

There are two ways to combine data:
– Horizontally:
Several input links; one output link (+ optional rejects) made of columns from different input links. E.g.,
  • Joins
  • Lookup
  • Merge
– Vertically:
One input link, one output link with column combining values from all input rows. E.g.,
  • Aggregator
Join, Lookup & Merge Stages

These "three Stages" combine two or more input
links according to values of user-designated "key"
column(s).

They differ mainly in:
– Memory usage
– Treatment of rows with unmatched key values
– Input requirements (sorted, de-duplicated)
Not all Links are Created Equal
• Enterprise Edition distinguishes between:
- The Primary Input (Framework port 0)
- Secondary - in some cases "Reference" (other ports)
• Naming convention:
                                Joins   Lookup        Merge
Primary Input: port 0           Left    Source        Master
Secondary Input(s): ports 1,…   Right   LU Table(s)   Update(s)
Tip:
Check "Input Ordering" tab to make sure intended
Primary is listed first
Join Stage Editor
• Link Order immaterial for Inner and Full Outer Joins (but VERY important for Left/Right Outer and Lookup and Merge)
• One of four variants:
– Inner
– Left Outer
– Right Outer
– Full Outer
• Several key columns allowed
1. The Join Stage
Four types:
• Inner
• Left Outer
• Right Outer
• Full Outer

2 sorted input links, 1 output link
– "left outer" on primary input, "right outer" on secondary input
– Pre-sort makes joins "lightweight": few rows need to be in RAM
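Why pre-sorting keeps a join "lightweight" is easiest to see in code. The Python sketch below is a generic sort-merge join, not the stage's actual implementation; the function names are hypothetical and only the inner and left outer variants are shown.

```python
from itertools import groupby
from operator import itemgetter


def _next_group(groups):
    """Return (key, rows) for the next key group, or (None, None) at the end."""
    try:
        key, rows = next(groups)
        return key, list(rows)
    except StopIteration:
        return None, None


def merge_join(left, right, key, how="inner"):
    """Join two row lists that are both already sorted on `key`."""
    lgroups = groupby(left, key=itemgetter(key))
    rgroups = groupby(right, key=itemgetter(key))
    lk, lrows = _next_group(lgroups)
    rk, rrows = _next_group(rgroups)
    out = []
    while lrows is not None:
        if rrows is None or lk < rk:
            # Left key has no partner on the right.
            if how == "left":
                out.extend(dict(l) for l in lrows)
            lk, lrows = _next_group(lgroups)
        elif lk > rk:
            rk, rrows = _next_group(rgroups)
        else:
            # Duplicate keys on both sides produce a cross product.
            out.extend({**r, **l} for l in lrows for r in rrows)
            lk, lrows = _next_group(lgroups)
            rk, rrows = _next_group(rgroups)
    return out


left = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 4, "name": "d"}]
right = [{"id": 2, "city": "X"}, {"id": 2, "city": "Y"}, {"id": 3, "city": "Z"}]
print(merge_join(left, right, "id"))
print(merge_join(left, right, "id", how="left"))
```

Only the current key group from each side is held in memory, which is the point made above about few rows needing to be in RAM.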
2. The Lookup Stage
Combines:
– one source link with
– one or more duplicate-free table links
[Diagram: the Source input (port 0) and one or more tables (LUTs, ports 1, 2) feed the Lookup stage, which produces Output (port 0) and Reject (port 1)]
• no pre-sort necessary
• allows multiple keys LUTs
• flexible exception handling for source input rows with no match
The Lookup Stage

Lookup Tables should be small enough to fit
into physical memory (otherwise,
performance hit due to paging)

On an MPP you should partition the lookup
tables using entire partitioning method, or
partition them the same way you partition the
source link

On an SMP, no physical duplication of a
Lookup Table occurs
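The lookup model (an in-RAM table probed once per source row) can be sketched in Python as follows. This is an illustration under assumptions: the function and parameter names are hypothetical, and the four not-found behaviours mirror the fail, continue, drop and reject options named in this module.

```python
def lookup(source, table, key, if_not_found="fail"):
    """Probe an in-memory lookup table for every source row."""
    # The whole lookup table is loaded into memory; it should be duplicate-free.
    lut = {row[key]: row for row in table}
    out, rejects = [], []
    for row in source:
        match = lut.get(row[key])
        if match is not None:
            out.append({**match, **row})
        elif if_not_found == "continue":
            out.append(dict(row))  # the real stage would supply nulls here
        elif if_not_found == "drop":
            continue
        elif if_not_found == "reject":
            rejects.append(dict(row))
        else:
            raise LookupError(f"no match for {key}={row[key]!r}")
    return out, rejects


source = [{"id": 1}, {"id": 2}]
table = [{"id": 1, "city": "Rome"}]
print(lookup(source, table, "id", if_not_found="reject"))
```

No sorting is needed, but memory use grows with the size of the lookup table, which is why the table should fit in physical memory.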
The Lookup Stage

Lookup File Set
– Like a persistent data set only it contains metadata about the key.
– Useful for staging lookup tables

RDBMS LOOKUP
– NORMAL
  • Loads to an in memory hash table first
– SPARSE
  • Select for each row.
  • Might become a performance bottleneck.
3. The Merge Stage

Combines
– one sorted, duplicate-free master (primary) link with
– one or more sorted update (secondary) links.
– Pre-sort makes merge "lightweight": few rows need to be in RAM (as with
joins, but opposite to lookup).

Follows the Master-Update model:
– Master row and one or more update rows are merged if they have the same value in user-specified key column(s).
– A non-key column occurs in several inputs? The lowest input port number prevails (e.g., master over update; update values are ignored)
– Unmatched ("Bad") master rows can be either
  • kept
  • dropped
– Unmatched ("Bad") update rows in input link can be captured in a "reject" link
– Matched update rows are consumed.
The Merge Stage
• Allows composite keys
• Multiple update links
• Matched update rows are consumed
• Unmatched updates can be captured
• Lightweight
• Space/time tradeoff: presorts vs. in-RAM table
[Diagram: Master (port 0) and one or more updates (ports 1, 2) feed the Merge stage, which produces Output (port 0) and Rejects (ports 1, 2)]
Synopsis:
Joins, Lookup, & Merge
                                   Joins                        Lookup                              Merge
Model                              RDBMS-style relational       Source - in RAM LU Table            Master - Update(s)
Memory usage                       light                        heavy                               light
# and names of Inputs              exactly 2: 1 left, 1 right   1 Source, N LU Tables               1 Master, N Update(s)
Mandatory Input Sort               both inputs                  no                                  all inputs
Duplicates in primary input        OK (x-product)               OK                                  Warning!
Duplicates in secondary input(s)   OK (x-product)               Warning!                            OK only when N = 1
Options on unmatched primary       NONE                         [fail] | continue | drop | reject   [keep] | drop
Options on unmatched secondary     NONE                         NONE                                capture in reject set(s)
On match, secondary entries are    reusable                     reusable                            consumed
# Outputs                          1                            1 out, (1 reject)                   1 out, (N rejects)
Captured in reject set(s)          Nothing (N/A)                unmatched primary entries           unmatched secondary entries

In this table:
• , <comma> = separator between primary and secondary input links (out and reject links)
The Aggregator Stage
Purpose: Perform data aggregations
Specify:

Zero or more key columns that define the
aggregation units (or groups)

Columns to be aggregated

Aggregation functions:
count (nulls/non-nulls) sum
max/min/range

The grouping method (hash table or pre-sort)
is a performance issue
Grouping Methods

Hash: results for each aggregation group are stored in a
hash table, and the table is written out after all input has
been processed
– doesn’t require sorted data
– good when number of unique groups is small. The running tally for each group’s aggregate calculations needs to fit easily into memory. Requires about 1 KB of RAM per group.
– Example: average family income by state requires about 0.05 MB of RAM (roughly 50 groups at 1 KB each)

Sort: results for only a single aggregation group are kept
in memory; when new group is seen (key value changes),
current group written out.
– requires input sorted by grouping keys
– can handle unlimited numbers of groups
– Example: average daily balance by credit card
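The difference between the two grouping methods shows up clearly in a small Python sketch. Both functions compute a mean per group; the names are hypothetical and the code illustrates the technique rather than the stage's implementation.

```python
def aggregate_hash(rows, key, value):
    """Hash method: one running tally per group, written out at the end."""
    totals = {}
    for row in rows:
        total, count = totals.get(row[key], (0, 0))
        totals[row[key]] = (total + row[value], count + 1)
    return {k: total / count for k, (total, count) in totals.items()}


def aggregate_sort(sorted_rows, key, value):
    """Sort method: input sorted on key; only the current group is in memory."""
    current, total, count, started = None, 0, 0, False
    for row in sorted_rows:
        if started and row[key] != current:
            yield current, total / count  # key changed: write the group out
            total, count = 0, 0
        current, started = row[key], True
        total += row[value]
        count += 1
    if started:
        yield current, total / count


rows = [{"state": "MA", "income": 10}, {"state": "MA", "income": 30},
        {"state": "NY", "income": 50}]
print(aggregate_hash(rows, "state", "income"))
print(dict(aggregate_sort(rows, "state", "income")))
```

The hash version needs memory proportional to the number of groups, while the sort version needs memory for one group only but depends on sorted input.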
Aggregator Functions

Sum

Min, max

Mean

Missing value count

Non-missing value count

Percent coefficient of variation
Aggregator Properties
Aggregation Types
Aggregation types
Containers

Two varieties
– Local
– Shared

Local
– Simplifies a large, complex diagram

Shared
– Creates reusable object that many jobs can include
Creating a Container

Create a job

Select (loop) portions to containerize

Edit > Construct container > local or shared
Using a Container

Select as though it were a stage
Exercise

Complete exercise 8-1
Module 9
Configuration Files
Objectives

Understand how DataStage EE uses
configuration files to determine parallel behavior

Use this understanding to
– Build an EE configuration file for a computer system
– Change node configurations to support adding
resources to processes that need them
– Create a job that will change resource allocations at
the stage level
Configuration File Concepts

Determine the processing nodes and disk space
connected to each node

When system changes, need only change the
configuration file – no need to recompile jobs

When DataStage job runs, platform reads
configuration file
– Platform automatically scales the application to fit the
system
Processing Nodes Are

Locations on which the framework runs
applications

Logical rather than physical construct

Do not necessarily correspond to the number of
CPUs in your system
– Typically one node for two CPUs

Can define one processing node for multiple
physical nodes or multiple processing nodes for
one physical node
Optimizing Parallelism

Degree of parallelism determined by number of
nodes defined

Parallelism should be optimized, not maximized
– Increasing parallelism distributes work load but also
increases Framework overhead

Hardware influences degree of parallelism
possible

System hardware partially determines
configuration
More Factors to Consider

Communication amongst operators
– Should be optimized by your configuration
– Operators exchanging large amounts of data should
be assigned to nodes communicating by shared
memory or high-speed link

SMP – leave some processors for operating
system

Desirable to equalize partitioning of data

Use an experimental approach
– Start with small data sets
– Try different parallelism while scaling up data set sizes
Factors Affecting Optimal Degree of
Parallelism

CPU intensive applications
– Benefit from the greatest possible parallelism

Applications that are disk intensive
– Number of logical nodes equals the number of disk
spindles being accessed
Configuration File

Text file containing string data that is passed to
the Framework
– Sits on server side
– Can be displayed and edited

Name and location found in environment variable APT_CONFIG_FILE

Components
– Node
– Fast name
– Pools
– Resource
Node Options

Node name – name of a processing node used by EE
– Typically the network name
– Use command uname -n to obtain network name

Fastname –
– Name of node as referred to by fastest network in the system
– Operators use physical node name to open connections
– NOTE: for SMP, all CPUs share single connection to network

Pools
– Names of pools to which this node is assigned
– Used to logically group nodes
– Can also be used to group resources

Resource
– Disk
– Scratchdisk
Sample Configuration File
{
node "Node1"
{
fastname "BlackHole"
pools "" "node1"
resource disk
"/usr/dsadm/Ascential/DataStage/Datasets" {pools "" }
resource scratchdisk
"/usr/dsadm/Ascential/DataStage/Scratch" {pools "" }
}
}
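When several files are needed (a 1-node file for testing, larger ones for production), the text can be generated rather than typed. The Python sketch below simply reproduces the layout of the sample file above for a list of node names; the function name is hypothetical, and all nodes are assumed to share one host, as on an SMP.

```python
def render_config(node_names, fastname, disk, scratch):
    """Render configuration file text in the layout of the sample above."""
    lines = ["{"]
    for name in node_names:
        lines += [
            f'  node "{name}"',
            "  {",
            f'    fastname "{fastname}"',
            f'    pools "" "{name}"',
            f'    resource disk "{disk}" {{pools ""}}',
            f'    resource scratchdisk "{scratch}" {{pools ""}}',
            "  }",
        ]
    lines.append("}")
    return "\n".join(lines)


# Assumed example values; replace with the paths used on your server.
print(render_config(["node1", "node2"], "BlackHole", "/data", "/scratch"))
```

The generated text would still need to be saved on the server and referenced through APT_CONFIG_FILE before a job could use it.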
Disk Pools

pool "bigdata"

Disk pools allocate storage
By default, EE uses the default pool, specified by ""
Sorting Requirements
Resource pools can also be specified for sorting:

The Sort stage looks first for scratch disk resources in a "sort" pool, and then in the default disk pool
Another Configuration File Example
{
node "n1" {
fastname "s1"
pool "" "n1" "s1" "sort"
resource disk "/data/n1/d1" {}
resource disk "/data/n1/d2" {}
resource scratchdisk "/scratch" {"sort"}
}
node "n2" {
fastname "s2"
pool "" "n2" "s2" "app1"
resource disk "/data/n2/d1" {}
resource scratchdisk "/scratch" {}
}
node "n3" {
fastname "s3"
pool "" "n3" "s3" "app1"
resource disk "/data/n3/d1" {}
resource scratchdisk "/scratch" {}
}
node "n4" {
fastname "s4"
pool "" "n4" "s4" "app1"
resource disk "/data/n4/d1" {}
resource scratchdisk "/scratch" {}
}
...
}
Resource Types

Disk

Scratchdisk

DB2

Oracle

Saswork

Sortwork

Can exist in a pool
– Groups resources together
Using Different Configurations
Lookup stage where DBMS is using a sparse lookup type
Building a Configuration File

Scoping the hardware:
– Is the hardware configuration SMP, Cluster, or MPP?
– Define each node structure (an SMP would be single node):
  • Number of CPUs
  • CPU speed
  • Available memory
  • Available page/swap space
  • Connectivity (network/back-panel speed)
– Is the machine dedicated to EE? If not, what other
applications are running on it?
– Get a breakdown of the resource usage (vmstat, mpstat,
iostat)
– Are there other configuration restrictions? E.g. DB only
runs on certain nodes and ETL cannot run on them?
Exercise

Complete exercise 9-1 and 9-2
Module 10
Extending DataStage EE
Objectives

Understand the methods by which you can add
functionality to EE

Use this understanding to:
– Build a DataStage EE stage that handles special
processing needs not supplied with the vanilla stages
– Build a DataStage EE job that uses the new stage
EE Extensibility Overview
Sometimes it will be to your advantage to
leverage EE’s extensibility. This extensibility
includes:

Wrappers

Buildops

Custom Stages
When To Leverage EE Extensibility
Types of situations:
Complex business logic, not easily accomplished using standard
EE stages
Reuse of existing C, C++, Java, COBOL, etc…
Wrappers vs. Buildop vs. Custom

Wrappers are good if you cannot or do not
want to modify the application and
performance is not critical.

Buildops: good if you need custom coding but
do not need dynamic (runtime-based) input
and output interfaces.

Custom (C++ coding using framework API): good
if you need custom coding and need dynamic
input and output interfaces.
Building “Wrapped” Stages
You can “wrapper” a legacy executable:
 Binary
 Unix command
 Shell script
… and turn it into an Enterprise Edition stage capable, among other things, of parallel execution…
As long as the legacy executable is:
• amenable to data-partition parallelism
– no dependencies between rows
• pipe-safe
– can read rows sequentially
– no random access to data
Wrappers (Cont’d)
Wrappers are treated as a black box

EE has no knowledge of contents

EE has no means of managing anything that occurs
inside the wrapper

EE only knows how to export data to and import data
from the wrapper

User must know at design time the intended behavior of
the wrapper and its schema interface

If the wrappered application needs to see all records prior
to processing, it cannot run in parallel.
LS Example

Can this command be wrappered?
Creating a Wrapper
To create the “ls” stage
Used in this job ---
Wrapper Starting Point
Creating Wrapped Stages
From Manager:
Right-Click on Stage Type
> New Parallel Stage > Wrapped
We will "wrapper" an existing Unix executable: the ls command
Wrapper - General Page
Name of stage
Unix command to be wrapped
The "Creator" Page
Conscientiously maintaining the Creator page for all your wrapped stages
will eventually earn you the thanks of others.
Wrapper – Properties Page

If your stage will have properties appear, complete the
Properties page
This will be the name of
the property as it
appears in your stage
Wrapper - Wrapped Page
Interfaces – input and output columns: these should first be entered into the table definitions meta data (DS Manager); let’s do that now.
Interface schemas
• Layout interfaces describe what columns the
stage:
– Needs for its inputs (if any)
– Creates for its outputs (if any)
– Should be created as tables with columns in
Manager
Column Definition for Wrapper
Interface
How Does the Wrapping Work?
– Define the schema for export and import
  • Schemas become interface schemas of the operator and allow for by-name column access
[Diagram: input schema → export → stdin or named pipe → UNIX executable → stdout or named pipe → import → output schema]
QUIZ: Why does export precede import?
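The export, pipe, import sequence can be imitated with a few lines of Python. This is a sketch under assumptions: a one-line Python filter stands in for the legacy executable, rows are exported as comma-separated text, and the function and column names are hypothetical.

```python
import subprocess
import sys

# Stand-in for a legacy executable: reads stdin, writes upper-cased lines to stdout.
LEGACY_CMD = [
    sys.executable, "-c",
    "import sys\nfor line in sys.stdin:\n    sys.stdout.write(line.upper())",
]


def run_wrapped(rows, columns_in, columns_out, cmd=LEGACY_CMD):
    """Export rows to the executable's stdin, then import rows from its stdout."""
    # Export: the input schema fixes which columns are written, and in what order.
    text = "".join(",".join(str(row[c]) for c in columns_in) + "\n" for row in rows)
    result = subprocess.run(cmd, input=text, capture_output=True, text=True, check=True)
    # Import: the output schema names the columns read back from stdout.
    return [dict(zip(columns_out, line.split(","))) for line in result.stdout.splitlines()]


rows = [{"name": "ann", "city": "rome"}]
print(run_wrapped(rows, ["name", "city"], ["NAME", "CITY"]))
```

The executable itself stays a black box: the wrapper only knows the two schemas, which is why they must be defined at design time.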
Update the Wrapper Interfaces

This wrapper will have no input interface – i.e. no input
link. The location will come as a job parameter that will
be passed to the appropriate stage property. Therefore,
only the Output tab entry is needed.
Resulting Job
Wrapped stage
Job Run

Show file from Designer palette
Wrapper Story: Cobol Application

Hardware Environment:
– IBM SP2, 2 nodes with 4 CPU’s per node.

Software:
– DB2/EEE, COBOL, EE

Original COBOL Application:
– Extracted source table, performed lookup against table in DB2,
and Loaded results to target table.
– 4 hours 20 minutes sequential execution

Enterprise Edition Solution:
– Used EE to perform Parallel DB2 Extracts and Loads
– Used EE to execute COBOL application in Parallel
– EE Framework handled data transfer between
DB2/EEE and COBOL application
– 30 minutes 8-way parallel execution
Buildops
Buildop provides a simple means of extending beyond the
functionality provided by EE, but does not use an existing
executable (like the wrapper)
Reasons to use Buildop include:
• Speed / Performance
• Complex business logic that cannot be easily represented using existing stages
– Lookups across a range of values
– Surrogate key generation
– Rolling aggregates
• Build once and reusable everywhere within project, no shared container necessary
• Can combine functionality from different stages into one
BuildOps
– The DataStage programmer encapsulates the business
logic
– The Enterprise Edition interface called “buildop”
automatically performs the tedious, error-prone tasks:
invoke needed header files, build the necessary
“plumbing” for a correct and efficient parallel execution.
– Exploits extensibility of EE Framework
BuildOp Process Overview
From Manager (or Designer):
Repository pane:
Right-Click on Stage Type
> New Parallel Stage > {Custom | Build | Wrapped}
• "Build" stages
from within Enterprise Edition
• "Wrapping” existing “Unix”
executables
General Page
Identical
to Wrappers,
except:
Under the Build
Tab, your program!
Logic Tab for
Business Logic
Enter Business C/C++ logic and arithmetic in four pages under the Logic tab
Main code section goes in Per-Record page; it will be applied to all rows
NOTE: Code will need to be ANSI C/C++ compliant. If code does not compile outside of EE, it won’t compile within EE either!
Code Sections under Logic Tab
• Temporary variables declared [and initialized] here
• Logic here is executed once BEFORE processing the FIRST row
• Logic here is executed once AFTER processing the LAST row
I/O and Transfer
Under Interface tab: Input, Output & Transfer pages
• First line: output 0
• Optional renaming of output port from default "out0"
• Write row
• Input page: 'Auto Read' (read next row)
• In-Repository Table Definition
• 'False' setting, not to interfere with Transfer page
I/O and Transfer
First line:
Transfer of index 0
• Transfer all columns from input to output.
• If page left blank or Auto Transfer = "False" (and RCP = "False")
Only columns in output Table Definition are written
BuildOp Simple Example

Example - sumNoTransfer
– Add input columns "a" and "b"; ignore other columns
that might be present in input
– Produce a new "sum" column
– Do not transfer input columns
a:int32; b:int32
sumNoTransfer
sum:int32
No Transfer
From Peek:
NO TRANSFER
- RCP set to "False" in stage definition
and
- Transfer page left blank, or Auto Transfer = "False"
• Effects:
- input columns "a" and "b" are not transferred
- only new column "sum" is transferred
Compare with transfer ON…
Transfer
TRANSFER
- RCP set to "True" in stage definition
or
- Auto Transfer set to "True"
• Effects:
- new column "sum" is transferred, as well as
- input columns "a" and "b" and
- input column "ignored" (present in input, but
not mentioned in stage)
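The sum example and the effect of the transfer setting can be modelled in Python. A real buildop is written in C/C++, so this is only an analogy: the comments map each part to the Logic pages described earlier, and the function and parameter names are hypothetical.

```python
def sum_buildop(rows, auto_transfer=False):
    """Model of the sum example: add columns a and b for every input row."""
    # Definitions / Pre-Loop: temporary variables, set once before the first row.
    rows_seen = 0
    out = []

    # Per-Record: applied to every row.
    for row in rows:
        result = {"sum": row["a"] + row["b"]}
        if auto_transfer:
            # Transfer ON: all input columns are carried to the output as well.
            result = {**row, **result}
        out.append(result)
        rows_seen += 1  # persists across rows, unlike column values

    # Post-Loop: runs once after the last row.
    return out, rows_seen


rows = [{"a": 1, "b": 2, "ignored": 9}]
print(sum_buildop(rows))                      # ([{'sum': 3}], 1)
print(sum_buildop(rows, auto_transfer=True))  # ([{'a': 1, 'b': 2, 'ignored': 9, 'sum': 3}], 1)
```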
Columns vs.
Temporary C++ Variables
Columns
• DS-EE type
• Defined in Table Definitions
• Value refreshed from row to row

Temp C++ variables
• C/C++ type
• Need declaration (in Definitions or Pre-Loop page)
• Value persistent throughout "loop" over rows, unless modified in code
Exercise

Complete exercise 10-1 and 10-2
Exercise

Complete exercises 10-3 and 10-4
Custom Stage

Reasons for a custom stage:
– Add EE operator not already in DataStage EE
– Build your own Operator and add to DataStage EE

Use EE API

Use Custom Stage to add new operator to EE
canvas
Custom Stage
DataStage Manager > select Stage Types branch
> right click
Custom Stage
Number of input and
output links allowed
Name of Orchestrate
operator to be used
Custom Stage – Properties Tab
The Result
Module 11
Meta Data in DataStage EE
Objectives

Understand how EE uses meta data, particularly
schemas and runtime column propagation

Use this understanding to:
– Build schema definition files to be invoked in
DataStage jobs
– Use RCP to manage meta data usage in EE jobs
Establishing Meta Data

Data definitions
– Recordization and columnization
– Fields have properties that can be set at individual
field level

Data types in GUI are translated to types used by EE
– Described as properties on the format/columns tab
(outputs or inputs pages) OR
– Using a schema file (can be full or partial)

Schemas
– Can be imported into Manager
– Can be pointed to by some job stages (i.e. Sequential)
Data Formatting – Record Level

Format tab

Meta data described on a record basis

Record level properties
Data Formatting – Column Level

Defaults for all columns
Column Overrides

Edit row from within the columns tab

Set individual column properties
Extended Column Properties
Field
and
string
settings
Extended Properties – String Type

Note the ability to convert ASCII to EBCDIC
Editing Columns
Properties depend
on the data type
Schema

Alternative way to specify column definitions for
data used in EE jobs

Written in a plain text file

Can be written as a partial record definition

Can be imported into the DataStage repository
Creating a Schema

Using a text editor
– Follow correct syntax for definitions
– OR

Import from an existing data set or file set
– On DataStage Manager import > Table Definitions >
Orchestrate Schema Definitions
– Select checkbox for a file with .fs or .ds
Importing a Schema
Schema location can be
on the server or local
work station
Data Types

Date

Vector

Decimal

Subrecord

Floating point

Raw

Integer

Tagged

String

Time

Timestamp
Runtime Column Propagation

DataStage EE is flexible about meta data. It can cope with the
situation where meta data isn’t fully defined. You can define
part of your schema and specify that, if your job encounters
extra columns that are not defined in the meta data when it
actually runs, it will adopt these extra columns and propagate
them through the rest of the job. This is known as runtime
column propagation (RCP).

RCP is always on at runtime.

Design and compile time column mapping enforcement.
– RCP is off by default.
– Enable first at project level. (Administrator project
properties)
– Enable at job level. (job properties General tab)
– Enable at Stage. (Link Output Column tab)
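The effect of RCP on a single stage can be pictured with a short Python sketch. It is an illustration only: the function name, the `derivations` mapping and the column names are hypothetical.

```python
def apply_stage(row, derivations, rcp_enabled):
    """Produce one output row from the columns the stage defines at design time."""
    out = {col: derive(row) for col, derive in derivations.items()}
    if rcp_enabled:
        # Columns not mentioned in the stage's meta data are carried through.
        for col, value in row.items():
            out.setdefault(col, value)
    return out


row = {"id": 7, "name": "ann", "extra": "x"}
derivations = {"id": lambda r: r["id"], "name": lambda r: r["name"].upper()}
print(apply_stage(row, derivations, rcp_enabled=False))  # {'id': 7, 'name': 'ANN'}
print(apply_stage(row, derivations, rcp_enabled=True))   # {'id': 7, 'name': 'ANN', 'extra': 'x'}
```

With RCP off, only the mapped columns reach the output; with it on, the undeclared column `extra` is adopted and propagated.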
Enabling RCP at Project Level
Enabling RCP at Job Level
Enabling RCP at Stage Level


Go to output link’s columns tab
For transformer you can find the output links
columns tab by first going to stage properties
Using RCP with Sequential Stages


To utilize runtime column propagation in the
sequential stage you must use the “use schema”
option
Stages with this restriction:
– Sequential
– File Set
– External Source
– External Target
Runtime Column Propagation

When RCP is Disabled
– DataStage Designer will enforce Stage Input Column
to Output Column mappings.
– At job compile time modify operators are inserted on
output links in the generated osh.
Runtime Column Propagation

When RCP is Enabled
– DataStage Designer will not enforce mapping rules.
– No Modify operator inserted at compile time.
– Danger of runtime error if incoming column names do not match the column names on the outgoing link (column names are case sensitive).
Exercise

Complete exercises 11-1 and 11-2
Module 12
Job Control Using the Job
Sequencer
Objectives

Understand how the DataStage job sequencer
works

Use this understanding to build a control job to
run a sequence of DataStage jobs
Job Control Options

Manually write job control
– Code generated in Basic
– Use the job control tab on the job properties page
– Generates basic code which you can modify

Job Sequencer
– Build a controlling job much the same way you build
other jobs
– Comprised of stages and links
– No basic coding
Job Sequencer

Build like a regular job

Type “Job Sequence”

Has stages and links

Job Activity stage
represents a DataStage
job

Links represent passing
control
Stages
Example
Job Activity
stage –
contains
conditional
triggers
Job Activity Properties
Job to be executed –
select from dropdown
Job parameters
to be passed
Job Activity Trigger

Trigger appears as a link in the diagram

Custom options let you define the code
Options

Use custom option for conditionals
– Execute if job run successful or warnings only


Can add “wait for file” to execute
Add “execute command” stage to drop real tables
and rename new tables to current tables
Job Activity With Multiple Links
Different links
having different
triggers
Sequencer Stage

Build job sequencer to control job for the
collections application
Can be set to all
or any
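The "all or any" setting amounts to a simple rule over the incoming trigger links. A minimal Python sketch, with a hypothetical function name and each input represented as a boolean saying whether that trigger has fired:

```python
def sequencer_fires(input_triggers, mode="all"):
    """Decide whether a sequencer passes control on to its output links."""
    if mode not in ("all", "any"):
        raise ValueError("mode must be 'all' or 'any'")
    return all(input_triggers) if mode == "all" else any(input_triggers)


print(sequencer_fires([True, False], mode="all"))  # False
print(sequencer_fires([True, False], mode="any"))  # True
```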
Notification Stage
Notification
Notification Activity
Sample DataStage log from Mail
Notification

Notification Activity Message

E-Mail Message
Exercise

Complete exercise 12-1
Module 13
Testing and Debugging
Objectives

Understand spectrum of tools to perform testing
and debugging

Use this understanding to troubleshoot a
DataStage job
Environment Variables
Parallel Environment Variables
Environment Variables
Stage Specific
Environment Variables
Environment Variables
Compiler
The Director
Typical Job Log Messages:

Environment variables

Configuration File information

Framework Info/Warning/Error messages

Output from the Peek Stage

Additional info with "Reporting" environments

Tracing/Debug output
– Must compile job in trace mode
– Adds overhead
Job Level Environmental Variables
• Job Properties, from Menu Bar of Designer
• Director will
prompt you
before each
run
Troubleshooting
If you get an error during compile, check the following:

Compilation problems
– If Transformer used, check C++ compiler, LD_LIBRARY_PATH
– If Buildop errors, try buildop from command line
– Some stages may not support RCP, which can cause column mismatch.
– Use the Show Error and More buttons
– Examine Generated OSH
– Check environment variables settings

Very little integrity checking is done during compile; run validate from Director.
Highlights source of error
Generating Test Data

Row Generator stage can be used
– Column definitions
– Data type dependent

Row Generator plus lookup stages provides good
way to create robust test data from pattern files