Pentaho Kettle Solutions

Name: Pentaho Kettle Solutions | Building Open Source ETL Solutions with Pentaho Data Integration
Brand: Wiley
Price: 32.99 EUR
Availability: OnlineOnly

Building Open Source ETL Solutions with Pentaho Data Integration

Matt Casters, Roland Bouman, Jos van Dongen (Authors)

Wiley (Publisher)

Published on 24 September 2010

720 pages

E-Book

ePUB with Adobe DRM

978-0-470-94752-4 (ISBN)

€32.99 incl. 7% VAT

E-book single license

Available as download

Contents

Introduction.
Part I Getting Started.
Chapter 1 ETL Primer.
Chapter 2 Kettle Concepts.
Chapter 3 Installation and Configuration.
Chapter 4 An Example ETL Solution--Sakila.
Part II ETL.
Chapter 5 ETL Subsystems.
Chapter 6 Data Extraction.
Chapter 7 Cleansing and Conforming.
Chapter 8 Handling Dimension Tables.
Chapter 9 Loading Fact Tables.
Chapter 10 Working with OLAP Data.
Part III Management and Deployment.
Chapter 11 ETL Development Lifecycle.
Chapter 12 Scheduling and Monitoring.
Chapter 13 Versioning and Migration.
Chapter 14 Lineage and Auditing.
Part IV Performance and Scalability.
Chapter 15 Performance Tuning.
Chapter 16 Parallelization, Clustering, and Partitioning.
Chapter 17 Dynamic Clustering in the Cloud.
Chapter 18 Real-Time Data Integration.
Part V Advanced Topics.
Chapter 19 Data Vault Management.
Chapter 20 Handling Complex Data Formats.
Chapter 21 Web Services.
Chapter 22 Kettle Integration.
Chapter 23 Extending Kettle.
Appendix A The Kettle Ecosystem.
Appendix B Kettle Enterprise Edition Features.
Appendix C Built-in Variables and Properties Reference.
Index.

Introduction

More than 50 years ago the first computers for general use emerged, and we saw a gradually increasing adoption of their use by the scientific and business world. In those early days, most organizations had just one computer with a single display and printer attached to it, so the need for integrating data stored in different systems simply didn't exist. This changed when in the late 1970s the relational database made inroads into the corporate world. The 1980s saw a further proliferation of both computers and databases, all holding different bits and pieces of an organization's total collection of information. Ultimately, this led to the start of a whole new industry, which was sparked by IBM researchers Dr. Barry Devlin and Paul Murphy in their seminal paper "An architecture for a business and information system" (first published in 1988 in IBM Systems Journal, Volume 27, Number 1). The concept of a business data warehouse was introduced for the first time as being "the single logical storehouse of all the information used to report on the business." Less than five years later, Bill Inmon published his landmark book, Building the Data Warehouse, which further popularized the concepts and technologies needed to build this "logical storehouse."

One of the core themes in all data warehouse-related literature is the concept of integrating data. The term data integration refers to the process of combining data from different sources to provide a single comprehensible view of all of the combined data. A typical example of data integration would be combining the data from a warehouse inventory system with that of the order entry system to allow order fulfillment to be directly related to changes in the inventory. Another example of data integration is merging customer and contact data from separate departmental customer relationship management (CRM) systems into a corporate customer relationship management system.
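The inventory and order entry example can be made concrete with a minimal Python sketch. It is only an illustration of the idea of a single combined view: the record layouts, the field names (`sku`, `on_hand`, `quantity`), and the `integrate` function are assumptions made for this example and do not come from the book or from Kettle.

```python
# Hypothetical records from two separate systems; all field names are illustrative.
inventory = [
    {"sku": "A-100", "on_hand": 12},
    {"sku": "B-200", "on_hand": 0},
]
orders = [
    {"order_id": 1, "sku": "A-100", "quantity": 5},
    {"order_id": 2, "sku": "B-200", "quantity": 3},
    {"order_id": 3, "sku": "C-300", "quantity": 1},
]


def integrate(orders, inventory):
    """Return one combined row per order, enriched with stock information."""
    # Index the inventory by SKU so each order can be matched in one lookup.
    stock_by_sku = {row["sku"]: row["on_hand"] for row in inventory}
    combined = []
    for order in orders:
        # on_hand is None when the inventory system does not know the SKU.
        on_hand = stock_by_sku.get(order["sku"])
        combined.append({
            **order,
            "on_hand": on_hand,
            "can_fulfill": on_hand is not None and on_hand >= order["quantity"],
        })
    return combined


integrated_view = integrate(orders, inventory)
for row in integrated_view:
    print(row)
```

The third order refers to a product the inventory system has never seen, which is the kind of inconsistency between sources that a real integration process has to detect and handle.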

NOTE Throughout this book, you'll find the terms "data integration" and "ETL" (short for extract, transform, and load) used interchangeably. Although technically not entirely correct (ETL is only one of the possible data integration scenarios, as you'll see in Chapter 1), most developers treat these terms as synonyms, a sin that we've adopted over the years as well.

In an ideal world, there wouldn't be a need for data integration. All the data needed for running a business and reporting on its past and future performance would be stored and managed in a single system, all master data would be 100 percent correct, and every piece of external data needed for analysis and decision making would be automatically linked to our own data. This system wouldn't have any problems with storing all available historical data, nor with offering split-second response times when querying and analyzing this data.

Unfortunately, we don't live in an ideal world. In the real world, most organizations use different systems for different purposes. They have systems for CRM (Customer Relationship Management), for accounting, for sales and sales support, for supporting a help desk, for managing inventory, for supporting a logistics process, and the list goes on and on. To make things worse, the same data is often stored and maintained independently, and probably inconsistently, in different systems. Customer and product data might be available in all the aforementioned systems, and when a customer calls to pass on a new telephone number or a change of address, chances are that this information is only updated in the CRM system, causing inconsistency of the customer information within the organization.

To cope with all these challenges and create a single, integrated, conformed, and trustworthy data store for reporting and analysis, data integration tools are needed. One of the more popular and powerful solutions available is Kettle, also known as Pentaho Data Integration, which is the topic of this book.

The Origins of Kettle

Kettle originated ten years ago, at the turn of the century. Back then, ETL tools could be found in all sorts of shapes and forms. At least 50 known tools competed in this software segment. Beneath that collection of software, there was an even larger set of ETL frameworks. In general, you could split up the tools into different types based on their respective origin and level of sophistication, as shown in Figure 1.

Figure 1: ETL tool generations

Quick hacks: These tools typically were responsible for extraction of data or the load of text files. A lot of these solutions existed out there and still do. Words such as "hacker" and "hacking" have an undeservedly negative connotation. Business intelligence can get really complex and in most cases, the quick hacks make the difference between project disaster and success. As such, they pop up quite easily because someone simply has a job to do with limited time and money. Typically, these ETL quick hack solutions are created by consultancy firms and are meant to be one-time solutions.
Frameworks: Usually when a business intelligence consultant does a few similar projects, the idea begins to emerge that code needs to be written in such a way that it can be re-used on other projects with a few minor adjustments. At one point in time it seemed like every self-respecting consultancy company had an ETL framework out there. The reason for this is that these frameworks offer a great way to build up knowledge regarding the ETL processes. Typically, it is easy to change parameters for extraction, loading, logging, change data capture, database connections, and such.
Code generators: When a development interface is added as an extra level of abstraction for the frameworks, it is possible to generate code for a certain platform (C, Java, SQL, and so on) based on sets of metadata. These code generators come in different types, varying from one-shot generators that require you to maintain the code afterward to full-fledged ETL tools that can generate everything you need. These kinds of ETL tools were also written by consultancy companies left and right, but mostly by known, established vendors.
Engines: In the continuing quest by ETL vendors to make life easier for their users, ETL engines were created so that no code had to be generated. With these engines, the entire ETL process can be executed based on parameterization and configuration, that is, on the description of the ETL process itself, as covered throughout this book. This by itself does away with any code generation, compilation, and deployment difficulties.
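
The difference between a code generator and an engine can be illustrated with a small Python sketch. It is a toy under stated assumptions, not Kettle's design or API: the step names (`read_csv`, `filter`, `rename`), the `STEP_TYPES` registry, and the `run` function are all hypothetical. The point it shows is that the process is described purely as data, and a generic interpreter executes that description directly, with no generated source code in between.

```python
import csv
import io


# Hypothetical step implementations; the names are illustrative, not Kettle's API.
def read_csv(rows, text):
    """Ignore incoming rows and produce rows parsed from CSV text."""
    return list(csv.DictReader(io.StringIO(text)))


def filter_rows(rows, field, equals):
    """Keep only the rows whose field has the given value."""
    return [r for r in rows if r[field] == equals]


def rename_field(rows, old, new):
    """Rename one field in every row, leaving the other fields untouched."""
    return [{(new if k == old else k): v for k, v in r.items()} for r in rows]


# Registry that maps a step type named in the metadata to its implementation.
STEP_TYPES = {"read_csv": read_csv, "filter": filter_rows, "rename": rename_field}


def run(transformation):
    """Execute a transformation that is described purely as metadata."""
    rows = []
    for step in transformation:
        handler = STEP_TYPES[step["type"]]
        rows = handler(rows, **step["options"])
    return rows


# The ETL process itself is configuration, not code.
transformation = [
    {"type": "read_csv", "options": {"text": "id,country\n1,BE\n2,NL\n3,BE\n"}},
    {"type": "filter", "options": {"field": "country", "equals": "BE"}},
    {"type": "rename", "options": {"old": "id", "new": "customer_id"}},
]

result = run(transformation)
print(result)
```

Changing the process means editing the `transformation` description only. Nothing has to be regenerated, compiled, or redeployed, which is the advantage the engine approach claims over code generation.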

Based on totally non-scientific samples of projects that were executed back then, it's safe to say that over half of the projects used quick hacks or frameworks. Code generators in all sorts of shapes and forms accounted for most of the rest, with ETL engines only being used in exceptional cases, usually for very large projects.

NOTE Very few tools were available under an open source license ten years ago. The only known available tool was Enhydra Octopus, a Java-based code generator type of ETL tool (available at www.enhydra.org/tech/octopus/). To its credit, and to the benefit of its users, it's still available as a shining example of the persistence of open source.

It's in this software landscape that Matt Casters, the author of Kettle and one of the authors of this book, was busy with consultancy, writing quick hacks and frameworks, and deploying all sorts of code generators.

Back in the early 2000s he was working as a business intelligence consultant, usually in the position of a data warehouse architect or administrator. In such a position, you have to take care of bridging the well-known gap between information and communication technology and business needs. Usually this sort of work was done without a big-vendor ETL tool because those things were prohibitively costly back then. As such, these tools were too expensive for most, if not all, small-to-medium-sized projects. In that situation, you don't have much of a choice: You face the problem time after time and you do the best you can with all sorts of frameworks and code generation. Poor examples of that sort of work include a program, written in C and embedded SQL (ESQL/C) to extract data from Informix; an extraction tool written in Microsoft Visual Basic to get data from an IBM AS/400 Mainframe system; and a complete data warehouse consisting of 35 fact tables and 90 slowly changing dimensions for a large bank, written manually in Oracle PL/SQL and shell scripts.

Thus, it would be fair to say that Matt knew what he was up to when he started thinking about writing his own ETL tool. Nevertheless, the idea to write it goes back as far as 2001:

Matt: "I'm going to write a new piece of software to do ETL. It's going to take up some time left and right in the evenings and weekends."

Kathleen (Matt's wife): "Oh, that's great! How long is this going to take?"

Matt: "If all goes well, I should have a first somewhat working version in three years and a complete version in five years."

The Design of Kettle

After more than ten years of wrestling with ETL tools of dubious quality, one of the main design goals of Kettle was to be as open as possible. Back then that specifically meant:

Open, readable metadata (XML) format
Open, readable relational repository format
Open API
Easy to set up (less than 2 minutes)
Open to all kinds of databases
Easy-to-use graphical user...
