

Spring Batch

Christopher Jeffers

August 2012

Agenda

• Intro to Spring Batch and Use-Cases

• Spring Batch Technical Explanation

– Architecture

– The Batch Job

– Skipping and Retrying Steps

– Scaling Features

• Spring Batch Evaluation

– Solving Use-Cases

– Benefits

– Issues

– Integration Options

– Future Steps


Spring Batch Overview

• Lightweight framework designed to enable the development of robust batch applications used in enterprise systems

• As a part of Spring, it builds on the ease of use of the POJO-based development approach, while making it easy for developers to use more advanced enterprise services when necessary

• Provides reusable functions that are essential in processing large volumes of data

• Provides scaling features, including multi-threading and massive parallelism for Spring Batch Jobs


Batch Use-Cases

• DataRoomBatch

– Physically delete all rows marked for deletion from a given bucket (DeepSix)

– Rerun user documents through publishing workflow

– Proactive auditing of the environment

• Public Records Batch Processing

– User submits a file with search criteria for many individuals; the program searches the database for changes in their information and returns a report of hits to the user

– Read, Process, and Write sequence

– Satisfies Government and Corporate requirements


Reason for Spring Batch POC

• Current batch system for public records is not powerful enough to handle very large requests

• Have had to turn away customers because of this

• A more powerful and flexible batch solution could solve this problem



Architecture

• Layered architecture

• The application layer contains all batch jobs and custom code

• Batch Core contains runtime classes necessary to launch and control a batch job

• Batch Infrastructure contains common readers and writers, and services used by both the application and the core framework

http://static.springsource.org/spring-batch/reference/html/spring-batch-intro.html


The Batch Job

• A Job entity encapsulates an entire batch process

• A Job is composed of Steps, each of which encapsulates one phase of the batch job

– A Step can be as simple or as complex as the developer wants

http://static.springsource.org/spring-batch/reference/html/domain.html


Chunk Processing

• Typical Spring Batch Step

– Read, Process, Write sequence

• Multiple items are read and processed before being written as a “chunk”

– Size of chunk declared in configuration (commit-interval)

http://static.springsource.org/spring-batch/reference/html/configureStep.html
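The read, process, write sequence can be sketched in plain Java without the framework. This is only an illustration of the chunking idea, not Spring Batch code: `ChunkSketch`, `runStep`, and the `commitInterval` parameter are hypothetical names, and the reader, processor, and writer are modeled with standard-library interfaces instead of the framework's own item reader, processor, and writer types.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical stand-in for a chunk-oriented step; not Spring Batch code.
class ChunkSketch {

    // Reads items one at a time, processes each one, and hands the results
    // to the writer in groups of at most commitInterval items.
    // Returns the number of chunks that were written.
    static <I, O> int runStep(Iterator<I> reader,
                              Function<I, O> processor,
                              Consumer<List<O>> writer,
                              int commitInterval) {
        if (commitInterval < 1) {
            throw new IllegalArgumentException("commitInterval must be >= 1");
        }
        int chunksWritten = 0;
        List<O> chunk = new ArrayList<>();
        while (reader.hasNext()) {
            chunk.add(processor.apply(reader.next()));
            if (chunk.size() == commitInterval) {
                // A full chunk is written in one call.
                writer.accept(new ArrayList<>(chunk));
                chunk.clear();
                chunksWritten++;
            }
        }
        if (!chunk.isEmpty()) {
            // The final chunk may be smaller than commitInterval.
            writer.accept(new ArrayList<>(chunk));
            chunksWritten++;
        }
        return chunksWritten;
    }
}
```

With five input items and a `commitInterval` of 2, the writer in this sketch is called three times: twice with two items and once with the remaining one.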


Step Flow

• Steps can be configured to flow sequentially or conditionally

– Allows for some complex jobs

http://static.springsource.org/spring-batch/reference/html/configureStep.html
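Conditional flow amounts to choosing the next step from the exit status of the current one. The sketch below models that with a transition table in plain Java. `StepFlowSketch` and its methods are hypothetical names for illustration, and the status strings are arbitrary labels chosen for the example; in Spring Batch itself the flow is declared in the job configuration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical model of conditional step flow; not Spring Batch code.
class StepFlowSketch {
    // Step name -> step body; the body returns an exit status string.
    private final Map<String, Supplier<String>> steps = new HashMap<>();
    // Step name -> (exit status -> name of the next step).
    private final Map<String, Map<String, String>> transitions = new HashMap<>();

    StepFlowSketch step(String name, Supplier<String> body) {
        steps.put(name, body);
        return this;
    }

    StepFlowSketch on(String from, String exitStatus, String to) {
        transitions.computeIfAbsent(from, k -> new HashMap<>()).put(exitStatus, to);
        return this;
    }

    // Runs steps starting at firstStep and follows the transition that
    // matches each exit status. Stops when no transition matches.
    // Returns the names of the steps executed, in order.
    // Note: this sketch does not guard against cyclic transitions.
    List<String> run(String firstStep) {
        List<String> executed = new ArrayList<>();
        String current = firstStep;
        while (current != null) {
            Supplier<String> body = steps.get(current);
            if (body == null) {
                throw new IllegalStateException("Unknown step: " + current);
            }
            executed.add(current);
            String status = body.get();
            current = transitions
                    .getOrDefault(current, Collections.<String, String>emptyMap())
                    .get(status);
        }
        return executed;
    }
}
```

A sequential job is the special case where every step has a single transition; a conditional job registers several transitions for the same step, one per exit status.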


Job Repository

• The JobRepository is used to do CRUD operations with Meta-Data relating to Job and Step execution

– Example: Job Parameters, Job/Step status, etc.

http://static.springsource.org/spring-batch/reference/html/domain.html


Step Skipping

• An item is skipped if processing it throws an exception listed in the configuration, rather than the Step failing and stopping the batch execution

• Used for exceptions that will be thrown on every attempt to process the same item

– FileNotFoundException, Parse Exceptions, etc.

• SkipListener can be used to log skipped items
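The skip behavior can be illustrated with a small plain-Java loop. This is a sketch under stated assumptions, not the framework's implementation: `SkipSketch`, `process`, `Result`, and the `skipLimit` parameter are hypothetical names, a single skippable exception class is assumed, and the `skipped` list plays the role that a SkipListener would play for logging.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of skip handling; not Spring Batch code.
class SkipSketch {

    // Outcome of a run: successfully processed items and skipped inputs.
    static final class Result<O> {
        final List<O> processed = new ArrayList<>();
        final List<Object> skipped = new ArrayList<>();
    }

    // Processes every item. An item that throws the skippable exception is
    // recorded and passed over. Any other exception, or a skippable one
    // after skipLimit skips, is rethrown and ends the run.
    static <I, O> Result<O> process(List<I> items,
                                    Function<I, O> processor,
                                    Class<? extends RuntimeException> skippable,
                                    int skipLimit) {
        Result<O> result = new Result<>();
        for (I item : items) {
            try {
                result.processed.add(processor.apply(item));
            } catch (RuntimeException e) {
                boolean limitReached = result.skipped.size() >= skipLimit;
                if (!skippable.isInstance(e) || limitReached) {
                    throw e; // not skippable, or too many skips: fail the run
                }
                result.skipped.add(item); // where a listener would log the item
            }
        }
        return result;
    }
}
```

Retrying a malformed record would fail the same way every time, which is why a deterministic failure such as a parse error is treated as skippable here rather than retried.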


Retrying Steps

• If an exception listed in the configuration is thrown, the operation is attempted again

• Used for exceptions that may not be thrown on every attempt of the Step

– ConcurrencyFailureException, DeadlockLoserDataAccessException, etc.

• Can set a limit on number of retries

• RetryListener can be used to log retried items

• RetryTemplate can be used to further customize retry logic
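The retry rules above can be sketched as a bounded loop in plain Java. `RetrySketch`, `withRetry`, and `maxAttempts` are hypothetical names used only for illustration, and the sketch assumes a single retryable exception class with no back-off between attempts; it is not how RetryTemplate is implemented.

```java
import java.util.function.Supplier;

// Hypothetical illustration of retry with a limit; not Spring Batch code.
class RetrySketch {

    // Calls the operation until it succeeds, up to maxAttempts times.
    // Only exceptions of the retryable class trigger another attempt;
    // anything else is rethrown immediately. If every attempt fails,
    // the last retryable exception is rethrown.
    static <T> T withRetry(Supplier<T> operation,
                           Class<? extends RuntimeException> retryable,
                           int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                if (!retryable.isInstance(e)) {
                    throw e; // not a transient failure: do not retry
                }
                last = e; // where a listener would log the failed attempt
            }
        }
        throw last;
    }
}
```

This suits transient failures such as a lost deadlock, where the same call may succeed a moment later.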


Scaling Features (Single Process)

• Multi-Threaded Jobs or Steps

– Using Spring’s TaskExecutor object

• Parallel Steps

– Using split flows and a TaskExecutor in Job configuration.

http://static.springsource.org/spring-batch/reference/html/scalability.html
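The idea behind parallel steps is that independent flows run concurrently and the job continues only once all of them have finished. The sketch below shows that with the JDK's `ExecutorService` standing in for Spring's TaskExecutor. `SplitSketch` and `runInParallel` are hypothetical names, and each flow is modeled as a `Callable` that returns a status string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration of a split flow; not Spring Batch code.
class SplitSketch {

    // Runs all flows on a fixed-size thread pool and waits for every one
    // to finish. Returns the statuses in the same order as the input list.
    static List<String> runInParallel(List<Callable<String>> flows, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // invokeAll blocks until all flows have completed.
            List<Future<String>> futures = pool.invokeAll(flows);
            List<String> statuses = new ArrayList<>();
            for (Future<String> future : futures) {
                statuses.add(future.get());
            }
            return statuses;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for flows", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A flow failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The flows must not depend on each other's output, since their relative execution order is not defined.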


Scaling Features (Multi-Process)

• Remote Chunking

– Splits Step processing across multiple processes, using some middleware to communicate

http://static.springsource.org/spring-batch/reference/html/scalability.html


Scaling Features (Multi-Process)

• Step Partitioning

– Splits input and executes remote steps in parallel

– PartitionHandler sends StepExecution requests to remote steps

– Partitioner generates the input for new step executions

http://static.springsource.org/spring-batch/reference/html/scalability.html
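The Partitioner's job of generating input for each step execution can be sketched as splitting a range of row ids into contiguous sub-ranges. This is a plain-Java illustration under assumptions: `RangePartitionerSketch`, its `partition` method, and the `partitionN` naming are hypothetical, and each partition is represented as a two-element array rather than the execution context the framework would pass to a remote step.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of range partitioning; not Spring Batch code.
class RangePartitionerSketch {

    // Splits the inclusive id range minId..maxId into at most gridSize
    // contiguous ranges of near-equal size. Each value is {firstId, lastId}.
    static Map<String, long[]> partition(long minId, long maxId, int gridSize) {
        if (gridSize < 1 || maxId < minId) {
            throw new IllegalArgumentException("Invalid range or grid size");
        }
        Map<String, long[]> partitions = new LinkedHashMap<>();
        long total = maxId - minId + 1;
        long size = (total + gridSize - 1) / gridSize; // ceiling division
        long start = minId;
        int index = 0;
        while (start <= maxId) {
            long end = Math.min(start + size - 1, maxId);
            partitions.put("partition" + index++, new long[] {start, end});
            start = end + 1;
        }
        return partitions;
    }
}
```

A handler would then send one step execution request per entry, so that each remote step reads only the rows in its own range.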


Job Flow with Client/Server and Partitioning



Solving the Use-Cases

• DataRoomBatch (DeepSix Example)

– Bucket is input to JdbcCursorItemReader

– Create an Item Processor to check if the row is marked for deletion and delete it if so

– Item Writer could be empty or used to output statistics

– Partitioning easily done by dividing up number of rows per partition


Solving the Use-Cases

• Public Records Batch Processing

– Input file is input to FlatFileItemReader

– Custom Item Processor to search the database for hits

– Custom Item Writer to compile report of search results

– Following step to send report to user

– Easy to implement a Partitioner for the input file


Benefits of Spring Batch

• Part of Spring Framework

– Allows easy integration with other Spring features

– General simplicity offered by Spring

• Step flow customizable

• Basic Item Readers and Writers already available

• Features available for monitoring Jobs and Steps

• Many scaling options available


Issues with Spring Batch

• No built-in scheduler

– Not a big issue; scheduler libraries are easily integrated

• Potentially a lot of XML configuration

– Business logic across Java and XML files can complicate debugging and maintenance

– Annotations can help

• Anything but very basic components will need to be created as new classes


Helpful Integration Options

• Spring Batch Admin

– Web-Based administration console

– Contains Spring Batch Integration, allowing use of Spring Integration messages to launch and monitor jobs

• Scheduler (cron, Spring Scheduling, Quartz)

• Clustering Framework (Hadoop, GridGain, Terracotta)

– Ideal for improving horizontal scaling

– Spring Data Hadoop is a fairly new Spring feature that helps integrate Spring with Hadoop


Future Steps

• Get Spring Batch set up with a clustered environment

– Evaluate performance

– Figure out dynamic load balancing

• Play around with more features and integration options

– Spring Batch Admin, manual job restarting, etc.

• Integrate Spring Batch Admin into the Cobalt GUI?

• Look more into the information stored in Meta-data database and figure out how to use for monitoring/managing jobs

• Look into Partitioning and how much must be done to implement sending partitions off to remote machines

• Look into job/step timeout


Questions?
