Definition of Pentaho and its usage.

Regarded as one of the most efficient and versatile data integration (DI) tools, Pentaho supports virtually all available data sources and allows scalable data clustering and data mining. It is a lightweight Business Intelligence suite providing Online Analytical Processing (OLAP) services, ETL functions, report and dashboard creation, and other data-analysis and visualization operations.

Important features of Pentaho.

- Capable of creating advanced reporting algorithms regardless of the input and output data format, and supports a variety of report formats, whether Excel spreadsheets, XML, PDF documents, or CSV files.
- A professionally certified DI software product from the Pentaho company, headquartered in Florida, United States.
- Offers enhanced functionality, including in-Hadoop functionality.
- Allows dynamic drill-down into larger and greater volumes of information.
- Rapid, interactive response optimization.
- Lets you explore and view multidimensional data.

Major applications comprising the Pentaho BI Project.

- Business Intelligence Platform
- Dashboards and Visualizations
- Reporting
- Data Mining
- Data Analysis
- Data Integration and ETL (also called Kettle)
- Data Discovery and Analysis (OLAP)

Architecture of Pentaho Data Integration.

Spoon is the design interface for building ETL jobs and transformations. Spoon provides a drag-and-drop interface that allows you to graphically describe what you want to take place in your transformations. Transformations can then be executed locally within Spoon, on a dedicated Data Integration Server, or on a cluster of servers.

The Data Integration Server is a dedicated ETL server whose primary functions are:

- Execution: executes ETL jobs and transformations using the Pentaho Data Integration engine.
- Security: allows you to manage users and roles (default security) or integrate security with your existing security provider such as LDAP or Active Directory.
- Content Management: provides a centralized repository that allows you to manage your ETL jobs and transformations, including full revision history on content and features such as sharing and locking for collaborative development environments.
- Scheduling: provides the services that allow you to schedule and monitor activities on the Data Integration Server from within the Spoon design environment.

Pentaho Data Integration is composed of the following primary components:

- Spoon: introduced earlier, Spoon is a desktop application that provides a graphical interface and editor for transformations and jobs. Spoon lets you create complex ETL jobs without having to read or write code. When you think of Pentaho Data Integration as a product, Spoon is what comes to mind because, as a database developer, this is the application on which you will spend most of your time: any time you author, edit, run, or debug a transformation or job, you will be using Spoon.
- Pan: a standalone command-line process that can be used to execute transformations you created in Spoon. The data transformation engine behind Pan reads data from and writes data to various data sources, and also allows you to manipulate data along the way.
- Kitchen: a standalone command-line process that executes jobs designed in the Spoon graphical interface, whether stored as XML or in a database repository. Jobs are usually scheduled to run in batch mode at regular intervals.
- Carte: a lightweight web container that allows you to set up a dedicated, remote ETL server. It provides remote execution capabilities similar to the Data Integration Server, but does not provide scheduling, security integration, or a content management system.
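Pan and Kitchen wrap the same Pentaho Data Integration (Kettle) engine that Spoon and the Data Integration Server use, and that engine can also be embedded in a Java application. The following is a minimal, illustrative sketch of running a Spoon-designed transformation (.ktr file) through the Kettle Java API; the class name and file path are placeholders, and it assumes the PDI core libraries are on the classpath.

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunTransformation {
    public static void main(String[] args) throws Exception {
        // Initialize the Kettle environment (loads step plugins, etc.)
        KettleEnvironment.init();

        // Load a transformation definition saved from Spoon (.ktr file).
        // The path is a placeholder; point it at your own transformation.
        TransMeta transMeta = new TransMeta("/path/to/sample_transformation.ktr");

        // Create and run the transformation with the embedded PDI engine --
        // the same engine that Pan invokes from the command line.
        Trans trans = new Trans(transMeta);
        trans.execute(null);          // null = no command-line arguments
        trans.waitUntilFinished();    // block until all steps complete

        if (trans.getErrors() > 0) {
            System.err.println("Transformation finished with errors.");
        } else {
            System.out.println("Transformation finished successfully.");
        }
    }
}

Run this way, the transformation behaves much as it would when launched with Pan; repository access, scheduling, and security would still come from the Data Integration Server.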
Benefits of Data Integration.

- The biggest benefit is that integrating data improves consistency and reduces conflicting and erratic data in the database.
- Integration of data allows users to fetch exactly what they are looking for, enabling them to utilize and work with what they have collected.
- Accurate data extraction, which in turn facilitates flexible reporting and monitoring of the available volumes of data.
- Helps meet deadlines for effective business management.
- Tracks customer information and buying behavior to improve traffic and conversions in the future, thus advancing business performance.

Major types of Data Integration Jobs.

- Transformation Jobs: used for preparing data, and only when no change to the data is allowed until the transformation job has finished.
- Provisioning Jobs: used for transmitting or transferring large volumes of data, and only when no change to the data is allowed until the job transformation finishes and the provisioning requirement is large.
- Hybrid Jobs: execute both transformation and provisioning jobs. There are no limitations on data changes; data can be updated regardless of success or failure, and the transformation and provisioning requirements are not large in this case.

Whatever the type, such jobs are designed in Spoon and launched in the same way; a minimal sketch of running one programmatically appears at the end of this section.

Pentaho Metadata.

Pentaho Metadata is a component of the Pentaho BI Platform designed to make it easier for users to access information in business terms. With the help of Pentaho's open source metadata capabilities, administrators can define a layer of abstraction that presents database information to business users in familiar business terms.
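As a companion to the transformation example above, here is a similarly hedged sketch of launching a job (.kjb file) programmatically, the counterpart of running it with Kitchen in batch mode. Again, the class name and file path are placeholders, and the PDI libraries are assumed to be on the classpath.

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.job.Job;
import org.pentaho.di.job.JobMeta;

public class RunJob {
    public static void main(String[] args) throws Exception {
        // Initialize the Kettle environment
        KettleEnvironment.init();

        // Load a job definition saved from Spoon (.kjb file); the path is a placeholder.
        JobMeta jobMeta = new JobMeta("/path/to/sample_job.kjb", null);

        // Run the job with the embedded engine -- the programmatic
        // counterpart of launching it with Kitchen in batch mode.
        Job job = new Job(null, jobMeta);
        job.start();
        job.waitUntilFinished();

        if (job.getErrors() > 0) {
            System.err.println("Job finished with errors.");
        } else {
            System.out.println("Job finished successfully.");
        }
    }
}

In practice such a launcher would typically be wrapped by a scheduler (or by the Data Integration Server's own scheduling service) so the job runs at regular intervals, as described above.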