Writing Binary Files to HDFS © 2014 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation. All other company and product names may be trade names or trademarks of their respective owners and/or copyrighted materials of such owners. Abstract You can run an Informatica mapping to write a binary file to HDFS. In the mapping, create a Data Processor transformation to generate the binary file. Define a complex file data object to read the binary file and write it to HDFS. Supported Versions • Informatica Big Data Edition 9.6.1 Table of Contents Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Complex File Writer Mapping. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Mapping Run-Time Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Customer Source Files. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Sales Source Files. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Joiner Promotions Transformation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Joiner Products Transformation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Customer Sales Data Processor Transformation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Creating the Data Processor Transformation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Write Binary Single File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Creating a Complex File Data Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Overview You can run an Informatica mapping to write a binary file to HDFS. In the mapping, create a Data Processor transformation to generate the binary file. Define a complex file data object to read the binary file and write it to HDFS. Example The following example describes a sample use case. An organization receives sales data in large flat files. They need to organize the sales data by customer and store the information on HDFS. The organization has multiple flat files that contain sales information. They need to consolidate the information by customer. The organization has a customer file, sales file, promotion file, products file, and a countries file. The sales, the promotions, and the products files contain a Sale_ID field that identifies a specific sale transaction. The countries file identifies the country for each customer. A developer creates a mapping to join the customer ID to the product information, and to join the customer ID to the promotions data. A Data Processor transformation consolidates the information from each file and writes an XML file to HDFS. To write the binary file to HDFS, the developer completes the following steps: 1. Configure an Informatica mapping to process the sales data. 2. Configure the physical data objects that contain the source data. 2 3. Configure a Joiner transformation to join the customer ID to the products record. Configure another Joiner transformation to join the customer ID to the promotions record. 4. Configure a Data Processor transformation to read the customer data, products, promotions, sales, and countries records. Configure the Data Processor transformation to return an XML file that contains sales data by customer. 5. Configure a complex data object to read the XML and write the binary stream to HDFS. The example is part of the Informatica Big Data Trial Sandbox for Hortonworks VM. You can download the VM from Informatica Marketplace in the following location: https://community.informatica.com/solutions/bde_trial_sandbox_for_hortonworks Complex File Writer Mapping The Complex File Writer mapping combines multiple flat files into an XML file and writes the XML file to HDFS. The following image shows the objects in the mapping: The mapping contains the following objects: Customer flat file source files The customer source files include information about customers. The files are Customers and Countries flat files. Sales flat file source files The sales source files include information about sales transactions. The files are Sales, Products, and Promotions flat files. Joiner_Promotions Joiner transformation that joins promotions with sales. 3 Joiner_Products Joiner transformation that joins sales with products. Customer_Sales_XML_Generator Data Processor transformation that generates an XML file of the sales by customer. Write_Binary_Single_File Complex file data object that writes one binary file to HDFS. Mapping Run-Time Properties Select the run-time environment to use when the mapping runs. When you run a mapping in the native environment, the Data Integration Service processes the transformation logic. When you run a mapping in the Hive environment, the Data Integration Service pushes the transformation logic to the Hadoop cluster through the Hive connection. The Hadoop cluster processes the data. For this example, the run-time environment is Hive. You can select either environment. Customer Source Files The Customers and Countries flat files contain information about the customer. The Customers source file contains the customer information, such as the customer ID, the name, and address. The following table includes example Customer records: Cust_Id 3 4 Cust_Fir Buick Frank Cust_Last Emmerson Hardy Cust_Gender M M Cust_Yr 1939 1934 Cust_Street 7 South Boyd Circle 17 Boyd Court The Countries source file contains the country and region information for each customer. The following table includes example Countries records: Customer 3 4 CountryID 52770 52770 CountryCD IT IT CountryName Italy Italy SubRegion Western Europe Western Europe Region Europe Europe Sales Source Files The Sales, the Products, and the Promotions files contain information about the sales. Each file contains the sales ID that identifies a sales transaction. The Sales file contains the customer ID, the sales ID, the product ID, and the promotion number. The following table includes examples of Sales records: Customer 3 4 Sale_ID 1 169 Prod 18 24 Date/Time 1999-06-03 00:00:00 2000-07-30 00:00:00 Promo 1 999 Qty 1.00 1.00 Cost 1758.11 47.86 The Products file contains product detail information for the products in each sale. The Products file contains the sales ID, the product ID, and the fields that describe the product. 4 The following table includes examples of Products records: Sale_ID 1 169 Prod_ID 18 24 Prod_Name Envoy Amb PCMCIA modem/fax Prod_Desc Envoy Amb PCMCIA modem/fax Prod_Subcat PCs Modems/Fax Prod_Cat Hardware Accessories The Promotions file contains promotion information for each sale. A Promotions record contains the sales ID, the promotion ID, and the fields that describe the promotion. The following table includes examples of Promotion records: Sale_ID 1 169 Promo_ID 1 999 Promo_Name Promotion 1 No Promotion Promo_Subcategory PCs No Promotion Promo_Category Hardware No Promotion Joiner Promotions Transformation The Joiner_Promotions transformation joins the customer ID from the sales record with the fields from the promotions records. The transformation returns a record that contains the promotion information for each sale for a customer. The following image show the ports in the Joiner transformation: The master source is the sales record. The detail source is the promotions record. The join condition is SALES_ID = SALE_ID. The transformation passes the customer ID and the promotion fields to the Data Processor transformation. Joiner Products Transformation The Joiner_Products transformation joins the customer ID from the sales records and the product information from the products record. The transformation returns a record that contains the product information in each sale for a customer. The following image show the ports in the Joiner transformation: 5 The master source is the sales record. The detail source is the products record. The join condition is SALES_ID = SALE_ID. The transformation passes the customer ID and the product fields to the Data Processor transformation. Customer Sales Data Processor Transformation The Customer_Sales_XML_Generator transformation is a Data Processor transformation that processes the customers, the sales, the promotions, the products, and the countries files. The transformation runs returns an XML file that contains sales data by customer. To process the source rows, enable the Data Processor transformation for relational input. Then, map the input ports to nodes in the XML output hierarchy. The following image shows where to map nodes from the input ports to the output XML hierarchy to the output ports in the Data Processor transformation: The XML output hierarchy contains the customers node. Within each customer node is the customer's country and multiple-occurring sales nodes. Within each sale are product and promotion nodes. The following text shows the structure of the output XML: customers customer (multiple-occurring) customerid firstname lastname gender ... 6 country countryID countrycode countryname sales sale (multiple-occurring) saleid timesold quantitysold ... product productid productname description ... promotion promotionid promotionname promotionsubcategory Note: The sample XML hierarchy does not include all the elements for each complex data type. Creating the Data Processor Transformation To create the Data Processor transformation, choose an XML schema that defines the XML output and then define the parser that processes the data. 1. In the Developer tool, click File > New > Transformation. 2. Select the Data Processor transformation and click Next. 3. Choose to create a Data Processor transformation with a wizard. 4. Select Relational Data as the input format and then click Next. 5. Select XML as the output format and click Next. 6. Browse for the XML schema that represents the structure of the target file. The schema for this example is customer_sales_schema.xsd. When you select this schema from the file system, the Developer tool imports the schema to the repository. The Developer tool shows that the customers node is the input hierarchy root. 7. Click Finish. The Developer tool creates the transformation in the repository. The Overview view appears in the Developer tool. You can view the input and the output ports in the Overview view. 8. Change the Output port precision to 65536, which is the maximum port size. The transformation must return an XML file in the port. 9. Click Input Mapping to view how the Developer tool maps the nodes from the input ports to the output hierarchy. The transformation writes the XML data to the binary Output port. 10. Save the transformation. Write Binary Single File Write_Binary_Single_File is a complex file data object that reads the XML file from the Data Processor transformation and writes the file to HDFS. When you create a complex file data object, the Developer tool creates a read and write operation. Use the complex file data object write operation as the target in the mapping. Use an HDFS connection to access the Hadoop distributed file system. For this example, the complex file data object writes a single file to HDFS. 7 When you open the complex file data object in a mapping, you can view a source transformation and an output transformation in the editor. Click the Source transformation to edit the General, Port, and Sources properties. Select the Output transformation and edit the Advanced properties. The Advanced properties include the output file path, the file format, and the file compression format. The following image shows the Advanced properties: Creating a Complex File Data Object The complex file data object receives the XML file from the Data Processor transformation and it writes a binary stream to HDFS. 1. In the Developer tool, select a project or folder in the Object Explorer view. 2. Click File > New > Data Object. 3. Select Complex File Data Object and click Next. The New Complex File Data Object dialog box appears. 4. Click Browse next to the Location option and select a repository location. 5. In the Resource Location list, select Remote. 6. Click Browse next to the Connection option and select an HDFS connection. 7. To add a resource to the data object, click Add next to the Selected Resource option. The resource is the complex file to write to HDFS. You can override the resource in the mapping. 8. Browse for the resource to add to the data object and click OK. 9. Optionally, enter a name for the complex file data object. The default name is Complex_File_Data_Object. 10. Click Finish. A read data object and two write data objects appear under the Physical Data Objects category in the Object Explorer view. 11. Select the write_binary_single_file operation as a target in the mapping. 12. Click the Advanced Properties in the complex file write data object. Configure the target file path, the target file type, and the type of compression. 13. Save the complex file data object. Author Ellen Chandler Principal Technical Writer 8