HP TRIM Bulk Data Importing Programming Guide

Introduction
Concepts
How It Works
The BulkDataLoader class in detail
    Constructor
    Methods
    Miscellaneous Properties
Locations and the BulkDataLoader
    Location Matching Functions
    Location Creation Functions
The Origin object
Appendix A Bulk Loader Sample Code

Introduction

TRIM SDK.NET has introduced a new class, BulkDataLoader, to manage loading high volumes of records into the TRIM database. It uses the bulk loading capabilities of the underlying database engine (BULK INSERT for SQL Server, SQLLDR.exe for Oracle). In addition, some intrinsic TRIM behaviours have been modified to further increase throughput.

This document explains the underlying mechanics of the approach and provides guidelines for developers on using the various BulkDataLoader methods.

Concepts

Whenever you use BulkDataLoader to import records, you need a TRIM SDK object called an Origin. This object encapsulates defaults that determine how new records are initialized, and it also provides a way of identifying records in TRIM after an import run has completed.

An Origin has associated OriginHistory objects, which record details of each invocation of the Origin. In the TRIM user interface, you can navigate to these runs by selecting the Origin (see Tools > Record Origins) and from there navigating to the OriginHistory objects that represent the runs done for this Origin. The OriginHistory object is also used by the Event Processor to efficiently schedule content indexing, word indexing and retention trigger calculations.

How It Works

The BulkDataLoader provides a function called SubmitRecord(), which replaces the Record.Save() call you would normally use when creating new records.
This function places the underlying TRIM database management layer into a “Bulk Loading” state, telling TRIM to write the details of SQL INSERT transactions to temporary disk files rather than having each INSERT executed individually by the database engine. A set, or batch, of records is processed in this way, with each record appended to the temporary files until the whole batch has been added. At that point the temporary disk files must be passed to the SQL engine for bulk loading, which is done by calling the ProcessAccumulatedData() function.

The batch size can be any value greater than zero, but larger batches give better performance. For large import tasks in the millions of records, a batch size of 10,000 is reasonable. Any error that occurs while loading a batch results in the whole batch being discarded, so keep this in mind if you use very large batch sizes.

The BulkDataLoader class in detail

Constructor

BulkDataLoader(Origin usingOrigin)

Constructs an instance of the BulkDataLoader class that your program can use for all data importing. You must provide a valid Origin object, which will be used by all records created by the BulkDataLoader. See the section on the Origin object below for more details.

Methods

bool Initialise()
bool Initialise(bool useBulkLoadMode)

The Initialise function must be called before any of the other BulkDataLoader functions. It lets you handle any error conditions in the current operating environment that may prevent the BulkDataLoader from operating correctly. Initialise checks that it can access the OLE DB connection string and, if the database is Oracle, that SQLLDR.exe is available on the path of the host machine. The bulk load data files created for each batch must be accessible to the database server, as either a local path or a shared network path.
To obtain the OLE DB connection string, the program doing the bulk loading must be running on the same machine as a TRIM Workgroup Server.

The second form of the function lets you specify that the bulk loading style of update should not be used, in which case functions like SubmitRecord will do a normal Save(). This may be useful for testing or development scenarios.

If the Initialise function fails for any reason, the BulkDataLoader is set to do normal Save() operations. The reason for the failure can be read from the ErrorMessage property.

Int64 StartRun(String nameOfActualSourceData, String workingFolder)

Having created your BulkDataLoader object and successfully called Initialise, you are ready to start creating records. When you process your record source, particularly if you have a large quantity of records to import, you need to break the import up into one or more “runs”. A run consists of one or more import “batches”.

TRIM creates an OriginHistory object for each run executed, which is useful if you later wish to inspect the results of the import. These OriginHistory items can be accessed in TRIM via the Origin object. In addition, the OriginHistory is recorded against each record, so that you can later inspect a TRIM record and see where it originated. The OriginHistory is also used to pass work to the TRIM Event Processor: a single event can be created to process all the records in a run, which makes data processing more efficient.

The nameOfActualSourceData parameter is documentary and lets you provide some extra information for the OriginHistory item that is created in TRIM. Typically it contains the name of the source XML or ASCII file from which you are creating records. It is up to your program to provide whatever may be useful.
The workingFolder parameter is the name of a folder in which all the temporary bulk loader files are created during SubmitRecord processing. The folder needs enough free space to store all the intermediary SQL bulk loading files. There may be 10 or 12 separate tables involved in the bulk update process, and the BulkDataLoader creates one file for each table that needs data loaded. To allow simultaneous BulkDataLoader processes, a uniquely named subfolder of the workingFolder is created automatically for each run. For SQL Server, this folder also needs to be accessible by the SQL Server instance, which usually entails using a UNC path.

Return Code

The function returns the Unique Identifier value of the OriginHistory. This value is preallocated and is stored against each record created during the run. You can use it to select the processed records if your program needs to access them after bulk data loading has completed (refer to the TRIM record search methods, originRun).

void EndRun()

Call this method when you have loaded all the records for a run. You can then either start another run or simply finish with bulk loading. The function automatically submits any work remaining since the last call to ProcessAccumulatedData, updates the statistical counters on the OriginHistory object, and creates events for bulk content indexing (if any electronic documents were created in the run), bulk word indexing (record titles and record notes) and calculating disposition schedules.

Record NewRecord()
Record NewRecord(RecordType typeOfRecord)

Called within a BulkDataLoader run, this function begins the process of creating a record. It is very similar to calling a standard Record constructor, except that the Record is initialized using the defaults contained in the Origin object associated with the BulkDataLoader instance.
Note that the first form of this function uses the record type of the Origin object, while the second form allows the default record type of the Origin to be bypassed. The TRIM database is placed in Bulk Loading mode while the new record is initialized.

Return Code

The function returns a Record object, which you will eventually submit for bulk loading. Take care when calling functions or setting properties of this Record using normal SDK.NET commands: these do not execute in Bulk Loader mode and so may have performance side effects. In particular, do not call the Save() or Delete() methods. To set property values, set custom user field values or attach an electronic document, use the special BulkDataLoader methods below.

bool SetProperties(TrimObject ofNewObject, array<PropertyOrFieldValue> objectMetadata)

This function sets up the properties of the record that you want to create. Presumably your input source contains values that you wish to transfer to your TRIM record. The BulkDataLoader provides this single method for setting all property and custom user field values in one call. (You could call the method several times for one record, using a different property value array each time, but bulk loading is about minimizing calls.)

To set up your record, first create an array of all the property and custom field values you want to set, using the SDK.NET PropertyOrFieldValue helper class. Set each PropertyOrFieldValue to the appropriate value from your input data source, then call SetProperties, passing the record you created with the NewRecord call as the ofNewObject parameter and the array as the objectMetadata parameter.

Return Code

Some of the TRIM business rules may be “relaxed” during bulk loader mode; however, it is possible that some of the property values will be deemed unsuitable by TRIM.
In this case, the function returns false and the reason for the failure is described in the BulkDataLoader ErrorMessage property.

bool SetDocument(Record ofNewRecord, InputDocument electronicRecord, BulkCopyStyle copyStyle)
bool SetDocument(Record ofNewRecord, InputDocument electronicRecord)

If you wish to create an electronic record in TRIM, you need to pass through details of the electronic document that will be attached to the record. The SetDocument method allows you to do this in a highly optimized way. The second form of the function uses a BulkCopyStyle of WindowsFileCopy.

To attach a document, first construct an instance of the InputDocument helper class and set it up to represent your source document (it may be a Windows file, an email message, etc.). Then call the BulkDataLoader SetDocument function, passing in the Record object created by the NewRecord function, the InputDocument and a BulkCopyStyle value.

The BulkCopyStyle parameter allows you to increase the performance of document transfers to the TRIM document store. If you are using a Windows File System store type, the BulkDataLoader checks whether the path of this document store is accessible to the BulkDataLoader process. If it is, the BulkDataLoader creates a special part of this store for bulk loaded documents. This store location is fully supported by the TRIM Workgroup, and documents stored in this special subfolder behave like any other TRIM electronic document.

To further improve performance, you can specify that a Windows FileMove operation be used rather than a FileCopy operation. This means that the source document is removed as part of the operation. A Windows FileMove is much faster if the document store is located on the same hard drive as the source document.
If your normal storage arrangements make it difficult to set up this high-performance storage option, it may be worth creating a temporary store for transfers and then manually migrating the documents to a permanent store at a later stage, using the TRIM Store Transfer option. If none of the storage optimizations is possible, the BulkDataLoader falls back to a normal streaming operation to the TRIM Workgroup.

Int64 SubmitRecord(Record newRecordToAdd)

Call SubmitRecord to add the record to the batch of pending records awaiting the SQL bulk loading process. This function is designed to replace the normal Record.Save() function; it appends details of the Record to one or more files in a format suitable for bulk loading into your TRIM database. Call SubmitRecord after you have set any properties using SetProperties and, if required, set up an electronic document using SetDocument. The newRecordToAdd parameter is the Record object you used for these setup calls. After the call, this Record is “done”: it should not be used any further and can be disposed of.

SubmitRecord also checks all properties, ensuring that values are consistent with each other (for example, that Date Closed is after Date Created).

The BulkDataLoader has its own way of allocating a unique record number if one is not set by the SetProperties call. It provides this sequencing so that records can be loaded without making a synchronized numbering call to the TRIM Workgroup. The number has a prefix of “BL” followed by an 8-digit number equivalent to the record’s unique identifier (e.g. BL00000077).

Return Code

The function returns the Unique Identifier (Uri) that will be used for this record when it is eventually created. You cannot use this Uri to instantiate a Record object until the record has been processed by a call to ProcessAccumulatedData.
Keep this in mind if you wish to create Record relationships during an import. For example, if you want to set the Container property of a Record, the Container must already have been created in a previous batch (although it can be within the same run). For this reason, it is sometimes necessary to order your input data to work within this limitation.

void ProcessAccumulatedData()

This function issues the SQL commands that actually create the TRIM records (and any bulk loaded locations) that have been queued using the SubmitRecord method since the most recent call to ProcessAccumulatedData. When you choose to call this function effectively determines how many records there are in a particular “batch”. Larger batch sizes perform better; the drawback is that calls to the SQL bulk loader may sometimes fail. In the case of failure, the BulkDataLoader attempts to restore the database to its state prior to the function call, although this too can fail.

The processing involves issuing a BULK INSERT SQL command for each table involved in storing record metadata, which can be between 1 and 10 individual SQL tables. If, for example, the 5th of these 10 BULK INSERT statements fails, the effects of the preceding 4 BULK INSERT statements must be reverted. To do this, the BulkDataLoader automatically issues DELETE statements for all the rows that were inserted by the BULK INSERT statements that are no longer wanted.

Further work is required to diagnose the reason for a failure. This can be difficult, as the BULK INSERT error often does not indicate the particular row that is invalid. Diagnosis may involve resubmitting the records using a batch size of one to identify the failing record. A failure generally indicates some unlikely fault in the format of the input data.

After the data has been processed successfully, the run data is updated with the latest running totals.
You can then decide that the run is complete and call EndRun(), or continue creating records with subsequent calls to NewRecord().

bool UpdateDocument(Int64 recordUri, InputDocument electronicRecord)
bool UpdateDocument(Int64 recordUri, InputDocument electronicRecord, BulkCopyStyle copyStyle)

These two functions allow the performance improvements provided for SetDocument to be applied to checking in new revisions as well. This is useful when developing applications that synchronise documents stored locally on a hard drive with TRIM. The first form is identical to calling the second form with a copyStyle of WindowsFileCopy. Refer to the SetDocument function for details on how the BulkCopyStyle works.

The UpdateDocument function either attaches the document to an existing record that has no document, or creates a new revision of the document if the record already has one. Note that bulk data loading only supports INSERT, so the record is updated immediately by UpdateDocument rather than waiting for ProcessAccumulatedData.

Miscellaneous Properties

bool CreateContentIndexEvents

Default is true. Instructs the BulkDataLoader to create events to content index all records created within a run. Content indexing involves calls to the TRIM content indexing engine to analyse the contents of electronic documents and build an index, so that records can be retrieved based on words within the document contents.

bool CreateWordIndexEvents

Default is true. Instructs the BulkDataLoader to create events to word index all records created within a run. Word indexing involves creating title word and notes word indexes for fast searching of records by title and notes.

String LogFileName

The BulkDataLoader class allows you to specify a log file to which all SQL calls are written, together with some timings and the tables affected.
The log file is invaluable for diagnosing any SQL errors that occur during import runs, particularly if rollbacks have failed.

bool CreateRetentionTriggerEvents

Default is true. Instructs the BulkDataLoader to create events to calculate disposition schedules for all records created within a run. A disposition schedule is created by applying the rules of a TRIM retention schedule that has been attached to a record, determining optimal dates for archival or destruction.

Int64 RunHistoryUri

Allows you to construct the OriginHistory object that was created for the run currently in progress.

Locations and the BulkDataLoader

Because TRIM relies heavily on relationships between records and locations, some extra capabilities are provided for creating and finding locations. When you wish to set a record property that is a location (for example, the Author property), you need to convert your input data in such a way that either a new location is created or an existing location is used. Typically you would first search to see whether a location exists; this can be tricky because people’s names are not necessarily unique. To improve the performance of location searches, the BulkDataLoader provides search methods that act on a set of prefetched in-memory arrays; using these search indexes can speed up the import.

Location Matching Functions

Int64 FindLocation(PropertyValue locationMatchingProperty)

This function searches the in-memory locations array for a location that has a property value matching the value you wish to search by. The properties available for this search are:

    PropertyIds.LocationSortName
    PropertyIds.LocationIdNumber
    PropertyIds.LocationEAddressName

Return Code

If a single match is found, the Unique Identifier of that location is returned. If no matches were found, or if multiple matches were found, the return value is 0.
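
Because FindLocation returns 0 both when nothing matches and when several locations match, a caller that needs to tell the two cases apart can follow up with FindAllLocations (described next). The following is a minimal sketch of that decision only. The two lookups are passed in as delegates so that the snippet compiles without the TRIM SDK; the MatchKind enum and the LocationMatchSketch class are hypothetical names used for illustration and are not part of SDK.NET.

```csharp
using System;

// Hypothetical helper type, not part of TRIM SDK.NET.
public enum MatchKind { None, Unique, Ambiguous }

public static class LocationMatchSketch
{
    // findOne stands in for a call such as loader.FindLocation(value);
    // findAll stands in for a call such as loader.FindAllLocations(value).
    public static MatchKind Classify(Func<long> findOne, Func<long[]> findAll, out long uri)
    {
        uri = findOne();
        if (uri != 0)
        {
            return MatchKind.Unique;   // exactly one location matched
        }

        // 0 means either "none" or "several"; the array lookup tells them apart.
        long[] all = findAll();
        if (all == null || all.Length == 0)
        {
            return MatchKind.None;     // nothing matched, a new location could be created
        }
        return MatchKind.Ambiguous;    // several matched, needs a tie-break or review
    }
}
```

In an import program the None case would typically lead to NewLocation and SubmitLocation, while the Ambiguous case is a decision for your own matching rules.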
array<Int64> FindAllLocations(PropertyValue locationMatchingProperty)

This function works in the same way as the FindLocation() function above, except for the return code. The properties available for this search are:

    PropertyIds.LocationSortName
    PropertyIds.LocationIdNumber
    PropertyIds.LocationEAddressName

Return Code

If no match is found, a null value is returned. If one or more matches are found, the array of matching Unique Identifiers is returned.

Int64 FindPerson(String surname, String givenNames, String initials, TrimDateTime dateOfBirth)

This function searches the in-memory locations arrays for a person that matches the parameter values. You should always supply a non-blank surname. The other parameters can be left blank if you do not want to use them in the match.

Return Code

If a single match is found, the Unique Identifier of that person location is returned. If no matches were found, or if multiple matches were found, the return value is 0.

array<Int64> FindAllPeople(String surname, String givenNames, String initials, TrimDateTime dateOfBirth)

This function works in the same way as the FindPerson() function above, except for the return code.

Return Code

If no match is found, a null value is returned. If one or more matches are found, the array of matching Unique Identifiers is returned.

Location Creation Functions

If you do not find an existing location to use for a record property, you may choose to create one. BulkDataLoader supports creating locations in much the same way as records are created. Note that the SetProperties call works for both Location and Record objects.

Location NewLocation()

Initialises a new Location object that is to be created using the BulkDataLoader. This is similar to the NewRecord function. As with records, you should not call Save() or Delete() on this location.

Int64 SubmitLocation(Location newLocationToAdd)

Submits the location to the bulk loader process.
Data relating to the location is accumulated in the temporary SQL loader table. The ProcessAccumulatedData function processes any accumulated locations together with any records accumulated since the preceding call.

The Origin object

The TRIM Origin object is provided primarily to support managing the bulk loading process. It has a number of properties that provide default values for records created with the BulkDataLoader. This means that these defaults can be changed using the TRIM user interface, without having to either write an alternative interface or change code.

In addition, the Origin has links to the OriginHistory objects that record details of every BulkDataLoader run that has been done for that Origin. This allows records imported in this way to be easily identified for subsequent quality control measures, user review, etc.

The Origin object can also store additional configuration parameters by way of a custom XML file. Origins of type “Tab Delimited” also support a mapping capability that maps columns in the input source to record properties. This feature may be useful when developing your own importing utility.

Appendix A Bulk Loader Sample Code

The following example code is part of the SDK samples for SDK.NET. It demonstrates a typical sequence used to create records via the Bulk Loader.

class bulkLoaderSample : IDisposable
{
    private Origin m_origin;
    private BulkDataLoader m_loader;
    private int m_recordCount;

    public bulkLoaderSample()
    {
        m_origin = null;
        m_loader = null;
        m_recordCount = 0;
    }

    public void Dispose()
    {
        if (m_loader != null)
        {
            m_loader.Dispose();
            m_loader = null;
        }
        if (m_origin != null)
        {
            m_origin.Dispose();
            m_origin = null;
        }
    }

    bool getNextRecord(PropertyOrFieldValue[] fields)
    {
        // this function would normally be processing some input data source,
        // reading the next item, setting up the fields array from this input.
        // For the example, we "hard code" some simple property values and
        // provide 5 records.
        m_recordCount++;
        if (m_recordCount > 5)
        {
            return false;
        }

        // a simple title
        String title = "Test Bulk Loaded Record Import #" + System.Convert.ToString(m_recordCount);
        title += ", imported from run number " + System.Convert.ToString(m_loader.RunHistoryUri);

        // some notes
        String notes = "Some notes for the bulk imported record #" + System.Convert.ToString(m_recordCount)
            + " data imported at: " + TrimDateTime.Now.ToLongDateTimeString();

        // a random date
        TrimDateTime dateCreated = new TrimDateTime(2007, 12, 14, 12, 0, 0);

        // create a location "on the fly" for the Author property
        PropertyOrFieldValue[] authorFields = new PropertyOrFieldValue[2];
        authorFields[0] = new PropertyOrFieldValue(PropertyIds.LocationTypeOfLocation);
        authorFields[1] = new PropertyOrFieldValue(PropertyIds.LocationSortName);
        authorFields[0].SetValue(LocationType.Position);
        authorFields[1].SetValue("Position #" + System.Convert.ToString(m_recordCount));

        // if we are rerunning, check to see if it already exists
        PropertyValue positionName = new PropertyValue(PropertyIds.LocationSortName);
        positionName.SetValue("Position #" + System.Convert.ToString(m_recordCount));
        Int64 authorUri = m_loader.FindLocation(positionName);
        if (authorUri == 0)
        {
            // use the loader to setup the new location's properties
            using (Location authorLoc = m_loader.NewLocation())
            {
                m_loader.SetProperties(authorLoc, authorFields);

                // now submit the location to the bulk loader queue
                authorUri = m_loader.SubmitLocation(authorLoc);
            }
        }

        // now set up the fields array based on these values
        fields[0].SetValue(title);
        fields[1].SetValue(dateCreated);
        fields[2].SetValue(notes);
        fields[3].SetValue(authorUri);
        return true;
    }

    public bool run(Database db)
    {
        Console.WriteLine("Initialise BulkDataLoader sample ...");

        // create an origin to use for this sample. Look it up first just so you can rerun the code.
        m_origin = db.FindTrimObjectByName(BaseObjectTypes.Origin, "Bulk Loader Sample") as Origin;
        if (m_origin == null)
        {
            m_origin = new Origin(db, OriginType.Custom1);
            m_origin.Name = "Bulk Loader Sample";
            m_origin.OriginLocation = "n.a";

            // sample code assumes you have a record type defined called "Document"
            m_origin.DefaultRecordType = db.FindTrimObjectByName(BaseObjectTypes.RecordType, "Document") as RecordType;

            // don't bother with other origin defaults for the sample, just save it so we can use it
            m_origin.Save();
        }

        // construct a BulkDataLoader for this origin
        m_loader = new BulkDataLoader(m_origin);

        // initialise it as per instructions
        if (!m_loader.Initialise())
        {
            // this sample has no way of dealing with the error.
            Console.WriteLine(m_loader.ErrorMessage);
            return false;
        }

        Console.WriteLine("Starting up an import run ...");

        // the sample is going to do just one run, let's get started...
        // you will need to specify a working folder that works for you (see programming guide)
        m_loader.StartRun("Simulated Input Data", "C:\\junk");

        // setup the property array that will be used to transfer record metadata
        PropertyOrFieldValue[] recordFields = new PropertyOrFieldValue[4];
        recordFields[0] = new PropertyOrFieldValue(PropertyIds.RecordTitle);
        recordFields[1] = new PropertyOrFieldValue(PropertyIds.RecordDateCreated);
        recordFields[2] = new PropertyOrFieldValue(PropertyIds.RecordNotes);
        recordFields[3] = new PropertyOrFieldValue(PropertyIds.RecordAuthor);

        // now let's add some records
        while (getNextRecord(recordFields))
        {
            Console.WriteLine("Importing record #" + System.Convert.ToString(m_recordCount) + " ...");
            using (Record importRec = m_loader.NewRecord())
            {
                // set the record properties
                m_loader.SetProperties(importRec, recordFields);

                // attach an electronic document. The sample uses a mail message
                // to avoid having to find a document on the hard drive somewhere.
                using (InputDocument doc = new InputDocument())
                {
                    doc.SetAsMailMessage();
                    doc.Subject = "Imported mail message #" + System.Convert.ToString(m_recordCount);
                    doc.Content = "Some mail messages have very little content";
                    doc.SentDate = new TrimDateTime(2007, 12, 14, 10, 30, 15);
                    doc.ReceivedDate = new TrimDateTime(2007, 12, 14, 14, 30, 30);
                    using (EmailParticipant mailContactFrom = new EmailParticipant("random.kindness@oneworld.com", "Random Kindness", "SMTP"))
                    {
                        doc.SetAuthor(mailContactFrom);
                    }
                    using (EmailParticipant mailContactTo = new EmailParticipant("little.animals@oneworld.com", "Little Animals", "SMTP"))
                    {
                        doc.AddRecipient(MailRecipientType.To, mailContactTo);
                    }
                    m_loader.SetDocument(importRec, doc, BulkCopyStyle.WindowsFileCopy);
                }

                // submit it to the bulk loader
                m_loader.SubmitRecord(importRec);
            }
        }

        // by now the loader has accumulated 5 record inserts and 5 location inserts
        // if you set a breakpoint here and look into the right subfolder of your working folder
        // you will see a bunch of temporary files ready to be loaded into the SQL engine

        // process this batch
        Console.WriteLine("Processing import batch ...");
        m_loader.ProcessAccumulatedData();

        // grab a copy of the history object's uri (it is not available in bulk loader after we end the run)
        Int64 runHistoryUri = m_loader.RunHistoryUri;

        // we're done, let's end the run now
        Console.WriteLine("Processing complete ...");
        m_loader.EndRun();

        // just for interest, let's look at the origin history object and output what it did
        using (OriginHistory hist = new OriginHistory(db, runHistoryUri))
        {
            Console.WriteLine("Number of records created ..... " + System.Convert.ToString(hist.RecordsCreated));
            Console.WriteLine("Number of locations created ... " + System.Convert.ToString(hist.LocationsCreated));
        }
        return true;
    }
}
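
The sample loads all five records in a single batch. For a larger import, the batching described under How It Works (submit records, then process the accumulated data every N records) could be sketched as follows. This is an illustration under stated assumptions, not code from the SDK samples: the loop is written against two delegates so that it compiles on its own, where submit stands in for the NewRecord, SetProperties and SubmitRecord sequence and flush stands in for ProcessAccumulatedData. The BatchingSketch class and its parameter names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of TRIM SDK.NET.
public static class BatchingSketch
{
    // Submits every item and flushes after each full batch.
    // Returns the number of batches flushed.
    public static int ImportInBatches<T>(IEnumerable<T> source, int batchSize,
                                         Action<T> submit, Action flush)
    {
        if (batchSize <= 0)
        {
            throw new ArgumentOutOfRangeException("batchSize", "Batch size must be greater than zero.");
        }

        int pending = 0;   // items submitted since the last flush
        int batches = 0;

        foreach (T item in source)
        {
            submit(item);
            pending++;
            if (pending == batchSize)
            {
                flush();
                pending = 0;
                batches++;
            }
        }

        // Flush the final, partially filled batch.
        // (According to the guide, EndRun would also submit any remaining work.)
        if (pending > 0)
        {
            flush();
            batches++;
        }
        return batches;
    }
}
```

With 25 items and a batch size of 10, this flushes three times (10, 10 and 5 items). Records that refer to one another, such as a container and its contents, must fall in different batches, so the input may need to be ordered before it is fed to a loop like this.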