Spring Batch and doing ETL between systems

SivaKumar
3 min read · Sep 1, 2020


Here I thought of writing about implementing data processing between databases, and between a database and Kafka topics.

Usually, to solve any data processing problem, we first look for a traditional ETL tool like Talend, Informatica, Qlik, or IBM DataStage. Apache Camel is also widely used. These are the most widely adopted tools for ETL tasks.

What if we have to look for ETL alternatives? First, we can check streaming platforms like Kafka for real-time streaming, since traditional ETL tools are based on batch processing. The second option is Hadoop batch processing. There are other tools like Apache Spark and Apache Flink to consider as well.

Here we can do this data transformation using Spring Batch, together with its scheduling support or another scheduler framework like Quartz. To be clear, this is just one of the approaches to consider for solving ETL problems.

Reasons to consider the Spring Batch approach:
→ Open-source framework
→ Built on top of the Spring framework
→ Closely follows the JSR-352 specification
→ Runs on the JVM and can be scaled easily
→ Parallel processing can also be achieved
For more guidelines and principles, check the Spring Batch framework reference documentation here.

Spring Batch:

Spring Batch is a lightweight batch framework designed to process data automatically and periodically between systems. It works easily alongside schedulers such as Quartz; a minimal sketch of launching a job on a schedule follows below.
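As an illustration, here is a minimal sketch of launching a job periodically using Spring's own @Scheduled support (Quartz could drive the same JobLauncher call instead). The etlJob bean name and the 10-minute rate are assumptions for the example, and @EnableScheduling is needed on a configuration class.

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class EtlJobScheduler {

    private final JobLauncher jobLauncher;
    private final Job etlJob; // assumed Job bean, defined elsewhere

    public EtlJobScheduler(JobLauncher jobLauncher, Job etlJob) {
        this.jobLauncher = jobLauncher;
        this.etlJob = etlJob;
    }

    // Launch every 10 minutes; the unique timestamp parameter gives
    // each run a fresh JobInstance so Spring Batch does not skip it.
    @Scheduled(fixedRate = 600_000)
    public void runEtlJob() throws Exception {
        JobParameters params = new JobParametersBuilder()
                .addLong("runAt", System.currentTimeMillis())
                .toJobParameters();
        jobLauncher.run(etlJob, params);
    }
}
```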
The framework consists of 3 layers:
→ Batch Application
→ Batch Core
→ Batch Infrastructure

Here the Batch Application is the application we write in Java using the Spring framework and Spring Batch.
Batch Core consists of the core classes used to launch and control jobs, such as JobLauncher, Job, and Step.
Batch Infrastructure consists of ItemReader, ItemWriter, ItemProcessor, and RetryTemplate.

Main components of Spring Batch:
JobLauncher, Job, Step, JobRepository, ItemReader, ItemWriter, ItemProcessor.
Job: An entity where we define the job configuration details, such as the job name, the order of its steps, and whether the job is restartable.
Step: In a step we write the code that performs the actual work, such as reading data, applying business logic, and loading the results into other systems.
JobRepository: Stores all batch-related metadata, such as JobExecutions, StepExecutions, and batch execution timings.
ItemReader: Provides the input for a step.
ItemWriter: Writes the output of a step.
ItemProcessor: Transforms the input received from the ItemReader and applies business-logic processing.
These are brief descriptions of a Job and its related concepts; a minimal configuration wiring them together is sketched below.
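To make the wiring concrete, here is a minimal, hedged sketch of a job configuration using the Spring Batch 4 builder factories. The names (sampleJob, sampleStep), the in-memory ListItemReader, and the chunk size of 10 are illustrative assumptions, not part of the original example.

```java
import java.util.List;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.support.ListItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class SampleJobConfig {

    @Bean
    public Job sampleJob(JobBuilderFactory jobs, Step sampleStep) {
        // Job: a name plus the ordered steps to run
        return jobs.get("sampleJob").start(sampleStep).build();
    }

    @Bean
    public Step sampleStep(StepBuilderFactory steps) {
        // Chunk-oriented step: read -> process -> write, 10 items per transaction
        return steps.get("sampleStep")
                .<String, String>chunk(10)
                .reader(reader())
                .processor(processor())
                .writer(writer())
                .build();
    }

    @Bean
    public ItemReader<String> reader() {
        // Simple in-memory reader, stands in for a file/database/Kafka reader
        return new ListItemReader<>(List.of("a", "b", "c"));
    }

    @Bean
    public ItemProcessor<String, String> processor() {
        return String::toUpperCase; // business logic goes here
    }

    @Bean
    public ItemWriter<String> writer() {
        return items -> items.forEach(System.out::println);
    }
}
```

The chunk size controls how many items are read and processed before the writer is invoked within a single transaction.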
What can we read and write?
Using ItemReaders and ItemWriters we can read from and write to different data stores, such as flat files, XML files, JSON files, databases, Kafka topics, and message brokers. A small flat-file example follows below.
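For example, a delimited flat file can be read with FlatFileItemReader. This is a sketch only: the Person POJO, the people.csv resource, and its id/name columns are assumptions made for illustration.

```java
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.ClassPathResource;

// Hypothetical POJO; the fields mirror the CSV columns
public class Person {
    private Long id;
    private String name;
    public Long getId() { return id; }
    public void setId(Long id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
}

@Configuration
public class FileReaderConfig {

    @Bean
    public FlatFileItemReader<Person> personFileReader() {
        // Maps each delimited row onto the matching Person setters
        BeanWrapperFieldSetMapper<Person> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
        fieldSetMapper.setTargetType(Person.class);
        return new FlatFileItemReaderBuilder<Person>()
                .name("personFileReader")
                .resource(new ClassPathResource("people.csv")) // assumed file on the classpath
                .delimited()
                .names("id", "name") // column order in the CSV
                .fieldSetMapper(fieldSetMapper)
                .build();
    }
}
```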
For other in-depth topics like parallel processing, scaling, and batch integration, the Spring Batch reference documentation is the right place to explore further.

Reading data from a database and sending it to a Kafka topic:
In this section we explore reading data from PostgreSQL using JdbcCursorItemReader, HibernateCursorItemReader, HibernatePagingItemReader, or RepositoryItemReader. A hedged RepositoryItemReader sketch follows below.
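As one illustration of the paging style, here is a sketch of a RepositoryItemReader built on a Spring Data repository. The PersonRepository interface, the Person entity from the earlier flat-file sketch, and the page size of 100 are assumptions for the example.

```java
import java.util.Map;

import org.springframework.batch.item.data.RepositoryItemReader;
import org.springframework.batch.item.data.builder.RepositoryItemReaderBuilder;
import org.springframework.data.domain.Sort;
import org.springframework.data.repository.PagingAndSortingRepository;

// Hypothetical Spring Data repository for the Person entity
interface PersonRepository extends PagingAndSortingRepository<Person, Long> {}

class RepositoryReaderConfig {

    public RepositoryItemReader<Person> personRepositoryReader(PersonRepository repository) {
        // Pages through the repository 100 records at a time,
        // ordered by id so pages stay stable between reads
        return new RepositoryItemReaderBuilder<Person>()
                .name("personRepositoryReader")
                .repository(repository)
                .methodName("findAll") // repository method invoked per page
                .pageSize(100)
                .sorts(Map.of("id", Sort.Direction.ASC))
                .build();
    }
}
```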

The code base can be found at the GitHub URL below. To run this example you need PostgreSQL and Kafka running locally on your system.
https://github.com/shivakumarksk/kafkapub-sub

This is the same example given in the Pub/Sub with Kafka and microservices story; we added the batch code on top of it. KafkaItemWriter uses KafkaTemplate under the hood. A sketch of the reader-to-Kafka wiring is shown below.
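To illustrate the shape of that wiring (not the exact code in the repository), here is a hedged sketch of a JdbcCursorItemReader paired with a KafkaItemWriter. The person table, its columns, and the Person POJO are assumptions; note that KafkaItemWriter publishes through the template's default topic, so spring.kafka.template.default-topic must be configured.

```java
import javax.sql.DataSource;

import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.batch.item.kafka.KafkaItemWriter;
import org.springframework.batch.item.kafka.builder.KafkaItemWriterBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.BeanPropertyRowMapper;
import org.springframework.kafka.core.KafkaTemplate;

@Configuration
public class DbToKafkaConfig {

    @Bean
    public JdbcCursorItemReader<Person> personDbReader(DataSource dataSource) {
        // Streams rows from PostgreSQL through a cursor instead of
        // loading the whole result set into memory
        return new JdbcCursorItemReaderBuilder<Person>()
                .name("personDbReader")
                .dataSource(dataSource)
                .sql("SELECT id, name FROM person") // assumed table and columns
                .rowMapper(new BeanPropertyRowMapper<>(Person.class))
                .build();
    }

    @Bean
    public KafkaItemWriter<Long, Person> personKafkaWriter(KafkaTemplate<Long, Person> kafkaTemplate) {
        // Sends each Person to the template's default topic,
        // keyed by id so records for one person land on one partition
        return new KafkaItemWriterBuilder<Long, Person>()
                .kafkaTemplate(kafkaTemplate)
                .itemKeyMapper(Person::getId)
                .build();
    }
}
```

These two beans would then be plugged into a chunk-oriented step exactly like the reader/processor/writer wiring sketched earlier.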
