
Pentaho Kettle Solutions
Description
This practical book is a complete guide to installing, configuring, and managing Pentaho Kettle. If you're a database administrator or developer, you'll first get up to speed on Kettle basics and how to apply Kettle to create ETL solutions, before progressing to specialized concepts such as clustering, extensibility, and data vault models. Learn how to design and build every phase of an ETL solution.
* Shows developers and database administrators how to use the open-source Pentaho Kettle for enterprise-level ETL processes (Extracting, Transforming, and Loading data)
* Assumes no prior knowledge of Kettle or ETL, and brings beginners thoroughly up to speed at their own pace
* Explains how to get Kettle solutions up and running, then follows the 34 ETL subsystems model, as created by the Kimball Group, to explore the entire ETL lifecycle, including all aspects of data warehousing with Kettle
* Goes beyond routine tasks to explore how to extend Kettle and scale Kettle solutions using a distributed "cloud"
Get the most out of Pentaho Kettle and your data warehousing with this detailed guide, from simple single table data migration to complex multisystem clustered data integration tasks.
Contents
Part I Getting Started.
Chapter 1 ETL Primer.
Chapter 2 Kettle Concepts.
Chapter 3 Installation and Configuration.
Chapter 4 An Example ETL Solution--Sakila.
Part II ETL.
Chapter 5 ETL Subsystems.
Chapter 6 Data Extraction.
Chapter 7 Cleansing and Conforming.
Chapter 8 Handling Dimension Tables.
Chapter 9 Loading Fact Tables.
Chapter 10 Working with OLAP Data.
Part III Management and Deployment.
Chapter 11 ETL Development Lifecycle.
Chapter 12 Scheduling and Monitoring.
Chapter 13 Versioning and Migration.
Chapter 14 Lineage and Auditing.
Part IV Performance and Scalability.
Chapter 15 Performance Tuning.
Chapter 16 Parallelization, Clustering, and Partitioning.
Chapter 17 Dynamic Clustering in the Cloud.
Chapter 18 Real-Time Data Integration.
Part V Advanced Topics.
Chapter 19 Data Vault Management.
Chapter 20 Handling Complex Data Formats.
Chapter 21 Web Services.
Chapter 22 Kettle Integration.
Chapter 23 Extending Kettle.
Appendix A The Kettle Ecosystem.
Appendix B Kettle Enterprise Edition Features.
Appendix C Built-in Variables and Properties Reference.
Index.
Introduction
More than 50 years ago the first computers for general use emerged, and the scientific and business worlds gradually adopted them. In those early days, most organizations had just one computer with a single display and printer attached to it, so the need for integrating data stored in different systems simply didn't exist. This changed in the late 1970s, when the relational database made inroads into the corporate world. The 1980s saw a further proliferation of both computers and databases, all holding different bits and pieces of an organization's total collection of information. Ultimately, this led to the start of a whole new industry, which was sparked by IBM researchers Dr. Barry Devlin and Paul Murphy in their seminal paper "An architecture for a business and information system" (first published in 1988 in IBM Systems Journal, Volume 27, Number 1). The paper introduced the concept of a business data warehouse for the first time, describing it as "the single logical storehouse of all the information used to report on the business." Less than five years later, Bill Inmon published his landmark book, Building the Data Warehouse, which further popularized the concepts and technologies needed to build this "logical storehouse."
One of the core themes in all data warehouse-related literature is the concept of integrating data. The term data integration refers to the process of combining data from different sources to provide a single comprehensible view on all of the combined data. A typical example of data integration would be combining the data from a warehouse inventory system with that of the order entry system to allow order fulfillment to be directly related to changes in the inventory. Another example of data integration is merging customer and contact data from separate departmental customer relationship management (CRM) systems into a corporate customer relationship management system.
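The inventory and order entry example can be illustrated with a minimal Python sketch. It is only an illustration of the idea, not a Kettle transformation: the record layouts, the field names (`sku`, `on_hand`, `quantity`), and the `integrate` function are all hypothetical.

```python
# Hypothetical records from two separate systems (illustrative data only).
inventory = [
    {"sku": "A-100", "on_hand": 12},
    {"sku": "B-200", "on_hand": 0},
]
orders = [
    {"order_id": 1, "sku": "A-100", "quantity": 5},
    {"order_id": 2, "sku": "B-200", "quantity": 3},
    {"order_id": 3, "sku": "C-300", "quantity": 1},
]


def integrate(inventory, orders):
    """Join orders to inventory on SKU and flag whether each can be fulfilled."""
    # Index the inventory system by SKU so each order can be looked up directly.
    stock = {row["sku"]: row["on_hand"] for row in inventory}
    combined = []
    for order in orders:
        on_hand = stock.get(order["sku"])  # None when inventory does not know the SKU
        combined.append({
            **order,
            "on_hand": on_hand,
            "can_fulfill": on_hand is not None and on_hand >= order["quantity"],
        })
    return combined


integrated_view = integrate(inventory, orders)
```

Each row of `integrated_view` carries fields from both systems, which is the "single comprehensible view" the definition refers to. The third order shows the kind of mismatch that integration brings to light: a SKU that the order entry system knows but the inventory system does not.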
NOTE Throughout this book, you'll find the terms "data integration" and "ETL" (short for extract, transform, and load) used interchangeably. Although technically not entirely correct (ETL is only one of the possible data integration scenarios, as you'll see in Chapter 1), most developers treat these terms as synonyms, a sin that we've adopted over the years as well.
In an ideal world, there wouldn't be a need for data integration. All the data needed for running a business and reporting on its past and future performance would be stored and managed in a single system, all master data would be 100 percent correct, and every piece of external data needed for analysis and decision making would be automatically linked to our own data. This system wouldn't have any problems with storing all available historical data, nor with offering split-second response times when querying and analyzing this data.
Unfortunately, we don't live in an ideal world. In the real world, most organizations use different systems for different purposes. They have systems for CRM, for accounting, for sales and sales support, for supporting a help desk, for managing inventory, for supporting a logistics process, and the list goes on and on. To make things worse, the same data is often stored and maintained independently, and probably inconsistently, in different systems. Customer and product data might be available in all the aforementioned systems, and when a customer calls to pass on a new telephone number or a change of address, chances are that this information is only updated in the CRM system, causing inconsistency of the customer information within the organization.
To cope with all these challenges and create a single, integrated, conformed, and trustworthy data store for reporting and analysis, data integration tools are needed. One of the more popular and powerful solutions available is Kettle, also known as Pentaho Data Integration, which is the topic of this book.
The Origins of Kettle
Kettle originated ten years ago, at the turn of the century. Back then, ETL tools could be found in all sorts of shapes and forms. At least 50 known tools competed in this software segment. Beneath that collection of software, there was an even larger set of ETL frameworks. In general, you could split up the tools into different types based on their respective origin and level of sophistication, as shown in Figure 1.
Figure 1: ETL tool generations
- Quick hacks: These tools typically were responsible for extraction of data or the load of text files. A lot of these solutions existed out there and still do. Words such as "hacker" and "hacking" have an undeservedly negative connotation. Business intelligence can get really complex and in most cases, the quick hacks make the difference between project disaster and success. As such, they pop up quite easily because someone simply has a job to do with limited time and money. Typically, these ETL quick hack solutions are created by consultancy firms and are meant to be one-time solutions.
- Frameworks: Usually when a business intelligence consultant does a few similar projects, the idea begins to emerge that code needs to be written in such a way that it can be re-used on other projects with a few minor adjustments. At one point in time it seemed like every self-respecting consultancy company had an ETL framework out there. The reason for this is that these frameworks offer a great way to build up knowledge regarding the ETL processes. Typically, it is easy to change parameters for extraction, loading, logging, change data capture, database connections, and such.
- Code generators: When a development interface is added as an extra level of abstraction for the frameworks, it is possible to generate code for a certain platform (C, Java, SQL, and so on) based on sets of metadata. These code generators come in different types, varying from one-shot generators that require you to maintain the code afterward to full-fledged ETL tools that can generate everything you need. These kinds of ETL tools were also written by consultancy companies left and right, but mostly by known, established vendors.
- Engines: In the continuing quest by ETL vendors to make life easier for their users, ETL engines were created so that no code had to be generated. With these engines, the entire ETL process can be executed based on parameterization and configuration, that is, on the description of the ETL process itself, as covered throughout this book. This by itself does away with any code generation, compilation, and deployment difficulties.
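The difference between a code generator and an engine can be made concrete with a small Python sketch. This is a toy interpreter written for illustration, not Kettle's metadata format or API: the step types, the option names, and the `run` function are all assumptions. What it shows is that the ETL process is held as plain data and executed directly, with no generation or compilation step in between.

```python
# Hypothetical step implementations; each takes a list of row dicts plus options.
def filter_rows(rows, field, equals):
    """Keep only rows whose field matches the given value."""
    return [row for row in rows if row.get(field) == equals]


def rename_field(rows, old, new):
    """Return rows with one field renamed."""
    return [{(new if key == old else key): value for key, value in row.items()}
            for row in rows]


def uppercase_field(rows, field):
    """Return rows with one string field converted to upper case."""
    return [{**row, field: row[field].upper()} for row in rows]


# Registry the engine uses to look up a step by its declared type.
STEP_TYPES = {
    "filter": filter_rows,
    "rename": rename_field,
    "uppercase": uppercase_field,
}


def run(description, rows):
    """Interpret the description step by step; no code is generated."""
    for step in description:
        rows = STEP_TYPES[step["type"]](rows, **step["options"])
    return rows


# The ETL process itself is plain data (metadata), not source code.
description = [
    {"type": "filter", "options": {"field": "country", "equals": "BE"}},
    {"type": "rename", "options": {"old": "name", "new": "customer_name"}},
    {"type": "uppercase", "options": {"field": "customer_name"}},
]

source_rows = [
    {"name": "Ann", "country": "BE"},
    {"name": "Bob", "country": "NL"},
]

result = run(description, source_rows)
```

Changing the process means editing `description` only; the engine code stays untouched. A code generator would instead turn the same description into source code that then has to be compiled, deployed, and possibly maintained by hand.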
Based on totally non-scientific samples of projects that were executed back then, it's safe to say that over half of the projects used quick hacks or frameworks. Code generators in all sorts of shapes and forms accounted for most of the rest, with ETL engines only being used in exceptional cases, usually for very large projects.
NOTE Very few tools were available under an open source license ten years ago. The only known available tool was Enhydra Octopus, a Java-based code generator type of ETL tool (available at www.enhydra.org/tech/octopus/). To its credit and benefit to its users, it's still available as a shining example of the persistence of open source.
It's in this software landscape that Matt Casters, the author of Kettle and one of the authors of this book, was busy with consultancy, writing quick hacks and frameworks, and deploying all sorts of code generators.
Back in the early 2000s he was working as a business intelligence consultant, usually in the position of a data warehouse architect or administrator. In such a position, you have to take care of bridging the well-known gap between information and communication technology and business needs. Usually this sort of work was done without a big-vendor ETL tool because those things were prohibitively costly back then. As such, these tools were too expensive for most, if not all, small-to-medium-sized projects. In that situation, you don't have much of a choice: You face the problem time after time and you do the best you can with all sorts of frameworks and code generation. Poor examples of that sort of work include a program, written in C and embedded SQL (ESQL/C) to extract data from Informix; an extraction tool written in Microsoft Visual Basic to get data from an IBM AS/400 Mainframe system; and a complete data warehouse consisting of 35 fact tables and 90 slowly changing dimensions for a large bank, written manually in Oracle PL/SQL and shell scripts.
Thus, it would be fair to say that Matt knew what he was up to when he started thinking about writing his own ETL tool. Nevertheless, the idea to write it goes back as far as 2001:
Matt: "I'm going to write a new piece of software to do ETL. It's going to take up some time left and right in the evenings and weekends."
Kathleen (Matt's wife): "Oh, that's great! How long is this going to take?"
Matt: "If all goes well, I should have a first somewhat working version in three years and a complete version in five years."
The Design of Kettle
After more than ten years of wrestling with ETL tools of dubious quality, Matt made openness one of the main design goals of Kettle. Back then that specifically meant:
- Open, readable metadata (XML) format
- Open, readable relational repository format
- Open API
- Easy to set up (less than 2 minutes)
- Open to all kinds of databases
- Easy-to-use graphical user...
System Requirements
File format: ePUB
Copy protection: Adobe DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Before downloading, install the free Adobe Digital Editions software (see E-Book Help).
- Tablet/smartphone (Android; iOS): Before downloading, install the free Adobe Digital Editions app or the PocketBook app (see E-Book Help).
- E-book reader: Bookeen, Kobo, Pocketbook, Sony, Tolino, and many others (not Kindle)
The ePUB file format is well suited to novels and nonfiction, that is, to "flowing" text without complex layout. On e-readers and smartphones, line and page breaks adapt automatically to the small display.
Adobe DRM is a "hard" form of copy protection. If the necessary requirements are not met, you will not be able to open the e-book, so you need to prepare your reading hardware before downloading.
Please note: We strongly recommend authorizing the reading software with your personal Adobe ID once it is installed.
For more information, see our E-Book Help.