I’m part of a team that builds a data warehouse for our client, an international recognized car and motorcycle manufacturer. Right now, the data comes from various sources, most of it is provided in CSV files. As a first step, these files are loaded into a staging area. It has been decided to accomplish this task with the help of an ETL tool. The tool that has been selected is Talend Open Studio (TOS).
Although I wasn’t involved in the selection process, there’re good reasons for TOS, some were decisive:
- it’s an Open Source solution
- there’s a branch office in our country
- based on the Java and Eclipse platform
- active community
- training and support offerings
I’ve had really good experiences with our main Talend contact over here. Based on a couple of workshops and support cases I have to say that he was of great help for our project.
Depending on where you come from and what you need, some of the points speaking in favor of TOS for us might speak against it for you. The good thing is, there’re enough solutions/products out there, maybe even one for you. But it might not be TOS. If you’re not sure if an ETL solution is right for you, maybe this article by J. Mundy helps.
The staging area
The database system used for the staging area is IBM DB2 Express-C. The staging area is cleared before every import and has two stages itself:
- data is loaded into tables that reflect the schemas of the CSV files; all fields are VARCHAR fields
- then, the data is converted into the proper types, e.g. DECIMAL(15,2), DATE, INTEGER, …
The table schemas for the tables in stage one and stage two have a couple more fields than the CSV schemas:
- ID_ROW identifies the data row from the CSV file
- STATUS holds information regarding the correctness of the data (and thus if it has been successfully written to the stage-two table)
- ERRMSG contains a detailed description on the problem (if any)
There’re many CSV files to import into the staging area, but the basic process is the same for all. The TOS job that does the work looks like this:
I’ll give you a short description of the process:
- first line imports the CSV data into the stage-one table
- the CSV file is read in
- because the CSV schema and the stage-one table schema differ (more fields, see above) we map the CSV schema to the table schema
- each row gets the additional ID_ROW field filled
- the data is written to the stage-one table
- the data now residing in the stage-one table is converted and written to the stage-two table
- stage-one table data is read in
- the conversion to the proper types takes place; the status and the error messages are generated here
- all problem rows are sorted out, based on the content in STATUS
- the rows without problems are written to the stage-two table
- error logging and reporting
- all rows that didn’t make it to the stage-two table are updated in the stage-one table to have the correct STATUS and ERRMSG set
- problematic rows are written to the console for this example
TOS basically lets you work with components. You place them on the screen and connect them, enter some information into the properties of these components, and then TOS generates Java source code ready to compile and run. Once generated, you cannot alter the Java source code directly. You have to do it by altering the components properties. Some informations like schemas for CSV files and database tables, database connection info, etc. can be stored in a repository and you can connect them with components. That’s nice if you need the same information in more than one place.
For the type conversions, we use a component called tJavaRow. It has two main properties (called settings in TOS):
- the input and output schemas
- a text area that contains the Java source code executed by this component
I’ll explain later, why we use this really basic component instead of a more fancy one like tMap that seems to do the job more nicely. It doesn’t. Anyhow, the Java source code looks like this:
C_OUT.STATUS = "OK"; C_OUT.ERRMSG = null; convert(C_IN, C_OUT, "ID_ROW", "toField", "ID_ROW"); convert(C_IN, C_OUT, "FIELD_ONE", "toNumber", "SignedNumber", ","); convert(C_IN, C_OUT, "FIELD_TWO", "toNormString"); convert(...);
A central conversion routine written in Java is used for all conversions. It does the exception handling in case of a problem (e.g. sets the fields STATUS and ERRMSG) and controls the conversion. The input and output Java data objects are passed in as well as the field names, the name of the conversion routine, and additional parameters. The convert method does the work by using the Java Reflection API to look up fields and methods. The advantage is that it simplifies the exception handling, but it comes with a price: there’s no type safety anymore, i.e. problems are reported at run time, not compile time. We automatically generate the calls from a repository that contains all necessary information. The Java source code is simply copied, not written by hand.
The tMap component on the other hand is nice for ad-hoc mapping and conversion of a few fields, but isn’t really fit for the task otherwise. It might look good during a presentation/sale pitch, but it’s just too painful to work with. That’s because you have to select every field where you want to enter some functionality. It’s just too many clicks. And guess what, if you have to refactor you code, you’ll start all over again, i.e. go to every output field and do the refactoring manually. Imagine a schema with couple of hundred fields and you go nuts.
What I think so far
I’ve only written about a small number of things regarding TOS, but they were important for this part of the project. The import of data from CSV files into the DB2 database is possible within the requirements given above. I didn’t have performance issues related to TOS. The visual components are nice, but sometimes get in the way of working efficiently. What I really want to see is the possibility to edit the components settings with a text editor as a whole in addition to the form based entry.
This information is based on TOS 2.3.2.r12707. Things that were broken in this release were
- HTML documentation was not generated at all
- given exception locations weren’t in sync with generated source code
- export project didn’t work without refreshing project in navigator view (F5)
These things will be fixed in the next releases I’m sure.
Two last things I want to say:
- if you don’t have someone on the team who is sufficiently fluent in Java, you’ll have a hard time to implement non-trivial solutions; depends on your or my definition of ‘non-trivial’ of course 😉
- after working with TOS for some time now, I have the feeling that it can solve nearly any problem that might occur in such a project