
Pentaho Data Integration

Pentaho Data Integration (PDI) is an extract, transform, and load (ETL) solution that uses an innovative metadata-driven approach.

  • It includes the DI Server, a design tool, three command-line utilities, and several plugins.
  • PDI is most frequently used in data warehouse environments.
  • PDI can also be used for other purposes:
    • Migrating data between applications or databases
    • Exporting data from databases to flat files
    • Bulk-loading data into databases
    • Data cleansing
    • Integrating applications



Common uses of Pentaho Data Integration

  • It is easy to use: every process is created with a graphical tool in which you specify what to do, without writing code to indicate how to do it.
  • It supports a vast array of input and output formats, including text files, spreadsheets, and commercial and free database engines.
  • It is used for data migration between different databases and applications.
  • Loading huge data sets into databases, taking full advantage of cloud, clustered, and massively parallel processing environments.
  • Data cleansing, with steps ranging from very simple to very complex transformations.






Key Benefits of Pentaho Data Integration

  • It installs in minutes.
  • 100% Java, with cross-platform support for Windows, Linux, and macOS.
  • Easy to use graphical designer including inputs, transforms, and outputs.
  • Simple plug-in architecture for adding your own custom extensions.
  • Enterprise Data Integration server providing security integration, scheduling, and robust content management including full revision history for jobs and transformations.
  • Combined with Pentaho's metadata modeling and data visualization tools, it provides a complete environment for rapidly developing new Business Intelligence solutions.
  • Streaming engine architecture provides the ability to work with extremely large data volumes.
  • Live support provided by a knowledgeable team of product experts with consistently high customer-satisfaction ratings.
  • Dedicated customer case triage providing faster response times and increased priority for customer-reported defects and enhancements.
  • Centralized content management facilitating team collaboration, including secured sharing of content, content versioning (revision history), and transformation and job locking.



Pentaho Data Integration Components

Pentaho Data Integration is composed of the following primary components:

  • Spoon : A desktop application that provides a graphical interface and editor for transformations and jobs. Spoon lets you create complex ETL jobs without having to read or write code. When you think of Pentaho Data Integration as a product, Spoon is what comes to mind because, as a developer, this is the application on which you will spend most of your time: any time you author, edit, run, or debug a transformation or job, you will be using Spoon.
  • Pan : A standalone command-line process that executes the transformations you created in Spoon. Pan drives the data transformation engine, which reads data from and writes data to various data sources and allows you to manipulate data along the way.
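As a sketch, a transformation saved from Spoon as a .ktr file can be executed with Pan from the command line; the installation directory, file path, and log level below are example values:

```shell
# Run a transformation file (.ktr) created in Spoon.
# /opt/pentaho/data-integration and the .ktr path are example values.
cd /opt/pentaho/data-integration
./pan.sh -file=/home/etl/transformations/load_customers.ktr -level=Basic

# Pan exits with a non-zero status on failure, so wrapper scripts can react:
if [ $? -ne 0 ]; then
    echo "Transformation failed" >&2
    exit 1
fi
```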
  • Kitchen : A standalone command-line process that executes the jobs designed in the Spoon graphical interface, stored either as XML files or in a database repository. Jobs are usually scheduled to run in batch mode at regular intervals.
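As a sketch of batch scheduling, a job saved from Spoon as a .kjb file can be run with Kitchen and triggered from cron; the paths and schedule below are example values:

```shell
# Run a job file (.kjb) designed in Spoon.
/opt/pentaho/data-integration/kitchen.sh \
    -file=/home/etl/jobs/nightly_load.kjb -level=Basic

# A typical crontab entry to run the job nightly at 02:00:
# 0 2 * * * /opt/pentaho/data-integration/kitchen.sh -file=/home/etl/jobs/nightly_load.kjb >> /var/log/pdi/nightly.log 2>&1
```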
  • Carte : Carte is a lightweight Web container that allows you to set up a dedicated, remote ETL server. This provides similar remote execution capabilities as the Data Integration Server, but does not provide scheduling, security integration, and a content management system.
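A minimal sketch of starting a Carte server; the host, port, and installation directory are example values:

```shell
# Start Carte listening on the given interface and port.
/opt/pentaho/data-integration/carte.sh 127.0.0.1 8081

# Spoon, Pan, and Kitchen can then target this server
# to run transformations and jobs remotely.
```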