Showing posts with label data migration. Show all posts

InfoSphere DataStage

While researching ETL tools, I came across this IBM Redbook about the InfoSphere DataStage product:

You can get IBM InfoSphere DataStage Data Flow and Job Design here.
I also recommend InfoSphere DataStage Parallel Framework Standard Practices.

./M6

Pentaho first impressions

I've just been to a Pentaho presentation given by Xpand IT.
It was the first opportunity I had to get some answers from someone close to the project and I must say, in general, it is what I was expecting.

The presentation was a bit technical, but I think it could not be otherwise, since Pentaho itself is technical. What I mean is that it does not address one of the most important issues in an ETL project: functional mapping.
Technically, it's all there, but I do feel the transformations could, and should, be easier to define and implement. Point and click is cool when one wishes to sell the product to management, but for real work it slows down the development process.
There are just too many clicks involved even for the simplest task. Some could be avoided with a GUI revision focused on user productivity. For instance, remembering previous tree filters, instead of making the user type the word every time they wish to filter something, would save the end user a lot of time.

Another weak point is the lack of rule mapping management and documentation. There is no clean or fast way to see field mappings. If one wishes to see which fields are mapped or which transformation rules are implemented, one has to manually search all the transformations and click on their graphical representations in order to find them.
That's a lot of clicks when one has hundreds of tables and thousands of transformation rules defined.
And, obviously, this lack of management is reflected in the project documentation, because one can neither access nor generate rule mapping reports.
This is a serious issue Pentaho should solve. Rule definition is the core of an ETL process, and it is totally unacceptable that one cannot access it in a simple, fast and clear way.
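
To make clear what I mean by a rule mapping report, here is a minimal Python sketch. It has nothing to do with Pentaho's internals: the FieldMapping structure and the sample mappings are invented for illustration, and a real tool would have to read them from its own transformation metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldMapping:
    source: str  # "TABLE.COLUMN" in the legacy system
    target: str  # "table.column" in the new system
    rule: str    # human-readable transformation rule


# Hypothetical mappings, invented for illustration only.
MAPPINGS = [
    FieldMapping("CUST.NAME", "customer.full_name", "trim and title-case"),
    FieldMapping("CUST.BDATE", "customer.birth_date", "parse YYYYMMDD to ISO date"),
    FieldMapping("ORD.AMT", "orders.amount", "divide by 100 (implied decimals)"),
]


def mapping_report(mappings):
    """Return the mappings as a plain-text table, sorted by target field."""
    rows = sorted(mappings, key=lambda m: m.target)
    header = ("Target", "Source", "Rule")
    # Column widths: wide enough for the header and for every value.
    target_width = max([len(header[0])] + [len(m.target) for m in rows])
    source_width = max([len(header[1])] + [len(m.source) for m in rows])
    lines = [f"{header[0]:<{target_width}}  {header[1]:<{source_width}}  {header[2]}"]
    for m in rows:
        lines.append(f"{m.target:<{target_width}}  {m.source:<{source_width}}  {m.rule}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(mapping_report(MAPPINGS))
```

One page like this per target table is the kind of thing I would expect to get with a single click, instead of opening every transformation by hand.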

On the strong side, Pentaho has a lot of connectors and operators already available. It's entirely written in Java, supports scripting languages and it's open source, through the Kettle project. All this means that it can be easily expanded, either through customization of its core or through the development of plugins.
It can work from the command line, which is excellent because it can be included in a shell script, and it has its own J2EE server, also excellent because it provides out-of-the-box integration solutions. For instance, one can write a transformation that starts when a web service is called or when a JMS queue receives a message.
It comes with some simple fuzzy functions that help to clean data, but don't expect too much from them.
It seems to scale well, mainly through parallelization, but orchestration can only be achieved manually.
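
For readers who have never used a fuzzy cleaning function, this is roughly the idea, sketched in Python with the standard difflib module. It is not how Pentaho implements it; the list of valid values, the clean_value name and the 0.8 cutoff are assumptions made up for the example.

```python
import difflib

# Hypothetical reference list of valid values.
VALID_CITIES = ["Lisboa", "Porto", "Coimbra", "Braga"]


def clean_value(raw, valid_values, cutoff=0.8):
    """Return the closest valid value, or None when nothing is similar enough."""
    candidate = raw.strip()
    # get_close_matches ranks candidates by similarity ratio (0.0 to 1.0)
    # and drops anything below the cutoff.
    matches = difflib.get_close_matches(candidate, valid_values, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    for value in [" Lisbao ", "Porto", "Xyz"]:
        print(repr(value), "->", clean_value(value, VALID_CITIES))
```

A simple typo gets corrected and garbage gets rejected, but anything more ambiguous than that needs a real data quality process, which is why I say not to expect too much.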

In short, Pentaho has a lot of room to evolve, especially when it comes to the functional part of ETL. But even technically, it lacks an orchestrator to help with job orchestration.
Currently, as a data migration expert, I'm still not convinced that Pentaho can be used on "hard core ETL" projects where the functional mapping management, the development time and the data migration time window are critical points.

./M6

Talend Open Studio

Since one of my professional interests is ETL/data migration, and since it's an open source solution, I'm evaluating Talend Open Studio, version 3.1 RC1.

I downloaded the product and installed it. When I opened it, I had to read and accept the license, and then I got a dialog box asking for a connection and a project. Obviously I had neither, so I tried to create one... That proved to be a not so easy task! I did not understand what I should do, so I pressed F1 for help and... No luck... I had to figure out what the hell I was supposed to do to be able to create a project. It was not that hard to find out, but still, the first impression was not a very positive one.
I had to register, or at least so it seemed since I had to enter my email address, and then I was able to import a Java demo project, which I did.
Then I opened the project and, finally, arrived at what I was expecting to be the real first Open Studio window, the Welcome page! Talend is an RCP application, and in RCP applications the welcome page is the first thing the user sees after the traditional splash screen.

Finally, on the welcome page, I got a registration pop-up where I was supposed to enter my email address and state my location... I really don't get it! If registration is optional, why the hell did I have to enter my email address to create a new repository and then a project in that repository?
All Open Studio achieves with this awkward interface is confusing its users, since it hides, in a very confusing way, the Eclipse workspace and projects.
From the start page I went to the previously desired but inaccessible help page, from where I could watch an also desired kick-start tutorial in which the workspace and project creation steps were shown. Unfortunately, it came too late, since by then it was totally irrelevant.
I know I'm using an RC, but these kinds of issues are not RC bugs, they are design faults!

Since I'm a technical guy, unfortunately I'm used to bad user interfaces, so I focused on the juicy stuff, its features, performance and transformations.

I started to explore the application and ran into one ugly dialog box! I haven't seen a dialog box this ugly in a long time. And it is so big that I almost felt that if I were not using a wide screen (1280x800) I would be unable to see all of it. The dialog box rules are also a bit confusing. For instance, I was forced to choose a week day (I chose Monday) even after I had chosen a day of the month (I chose day 1). I wonder what will happen if the first day of the next month is not a Monday...

Talend Open Studio ugly "Add a task" dialog box.
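
The question about day 1 versus Monday is not just nitpicking. I don't know how Open Studio interprets that combination, but there are two obvious readings: run only when both conditions hold, or run when either holds (classic cron does the latter when both the day of month and the day of week are restricted). The Python sketch below, with a made-up matching_days function, shows how different the two readings are.

```python
from datetime import date, timedelta


def matching_days(year, month_day, week_day, require_both):
    """List the dates in `year` that match the schedule.

    week_day follows datetime's convention: Monday is 0, Sunday is 6.
    require_both=True  -> day of month AND week day must both match.
    require_both=False -> a match on either one is enough.
    """
    current = date(year, 1, 1)
    result = []
    while current.year == year:
        on_month_day = current.day == month_day
        on_week_day = current.weekday() == week_day
        if require_both:
            hit = on_month_day and on_week_day
        else:
            hit = on_month_day or on_week_day
        if hit:
            result.append(current)
        current += timedelta(days=1)
    return result


if __name__ == "__main__":
    strict = matching_days(2009, month_day=1, week_day=0, require_both=True)
    loose = matching_days(2009, month_day=1, week_day=0, require_both=False)
    print("both must match:", len(strict), "runs")
    print("either matches: ", len(loose), "runs")
```

With the strict reading the task would run only once in the whole of 2009 (1 June is the only first of the month that falls on a Monday), while the loose reading runs it every Monday plus every first of the month. A dialog box should not leave the user guessing which one applies.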

Definitely, the Open Studio interface has a long way to go before becoming really user friendly.

After that shocking moment, I continued to explore the product.

There's a business model area, where it is possible to specify very simple business diagrams. My first impression about this is that I have doubts about the real value and usefulness of this feature. I'll have to explore it more to know if it is really useful or not.
Open Studio has some simple data quality components, including a fuzzy one. Talend already has a data cleaning tool, Talend Data Quality.
It supports a variety of file formats, including Excel, XML and EBCDIC. EBCDIC in particular is extremely useful when it involves files from IBM mainframes.
There's a nice set of connections, including a connection for AS/400 and SAP.
It supports orchestration through a set of iterative and job execution components.
There's a set of SQL templates, though some of them are not really that useful. There are templates that contain nothing more than COMMIT; or DROP TABLE <%= __DATABASE_NAME__ %>.<%=__TABLE_NAME_TARGET__%>;.
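
The EBCDIC support mentioned above matters because mainframe extracts are usually fixed-width records in an EBCDIC code page, not delimited ASCII text. As a rough illustration of what the tool has to do under the hood, here is a Python sketch, not Talend code. The record layout is hypothetical and cp037 is just one common EBCDIC code page; real files may use another and often carry packed decimal fields, which this sketch does not handle.

```python
# cp037 is one common EBCDIC code page (US/Canada); this choice is an assumption.
EBCDIC_CODEC = "cp037"

# Hypothetical record layout: (field name, width in bytes).
LAYOUT = [("customer_id", 5), ("name", 10)]


def decode_fixed_width(record, layout, codec=EBCDIC_CODEC):
    """Split a fixed-width EBCDIC record into named text fields."""
    fields = {}
    offset = 0
    for name, width in layout:
        chunk = record[offset:offset + width]
        # Decode the bytes to text and drop the trailing padding spaces.
        fields[name] = chunk.decode(codec).rstrip()
        offset += width
    return fields


if __name__ == "__main__":
    # Build a sample record by encoding text, standing in for mainframe data.
    record = "00042JOHN DOE  ".encode(EBCDIC_CODEC)
    print(record.hex())
    print(decode_fixed_width(record, LAYOUT))
```

Having a component that does this from a declared layout, instead of hand-written code per file, is what makes the feature so useful.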

Almost all components and processes have history, which is a very nice feature. It looks like there's no version control implemented, just history, but that is a good first step towards version control.
The same applies to documentation: almost all components and processes seem to have documentation associated with them. This is not just an interesting feature, it's a must-have in such a tool.

Since it is possible to document the components, the processes and the business rules through diagrams, I looked around for a way to export the project documentation, but I was unable to find such a feature.
There's a Documentation area, but it's not what I expected. It seems to be just a file link interface, from which documentation files, like spreadsheets, can be accessed.
And there's a javadoc export functionality, which also does not do what I expected. Apparently it exports the documentation of Talend components.
Documentation is of no real use when it is not easily accessible. It's like having a fully documented jar library but no javadoc tool to build its documentation, forcing anyone who needs to read the documentation to open the source code and read it from there. It does not make much sense.

Finally, one of the most interesting features is the real-time debug. I still haven't had the opportunity to try it out, but from what I could see, it is the Open Studio feature that will be the ETL developer's best friend.

I've already watched some videos showing how easy ETL is with Open Studio, dragging and dropping and graphically connecting the components and all that. In the next few days, I'll try it for myself.

./M6