HP TRIM Bulk Data Importing Programming Guide

Introduction
Concepts
How It Works
The BulkDataLoader class in detail
Constructor
Methods
Miscellaneous Properties
Locations and the BulkDataLoader
Location Matching Functions
Location Creation Functions
The Origin object
Appendix A
Bulk Loader Sample Code
Introduction
TRIM SDK.NET has introduced a new class to manage loading high volumes of records into
the TRIM database. The class is called BulkDataLoader. This mechanism utilizes the inherent
bulk loading capabilities of the underlying database engine (BULK INSERT for SQL Server,
SQLLDR.exe for Oracle). In addition, some of the intrinsic TRIM behaviours have also been
modified to further increase throughput. This document explains the underlying mechanics
of the approach and also provides guidelines for developers on using the various
BulkDataLoader methods.
Concepts
Whenever you wish to use BulkDataLoader to import records, you will need to use a TRIM
SDK object called an Origin. This object encapsulates some defaults which can be useful in
determining how new records are initialized and it also provides a way of identifying
records in TRIM after an import run has completed. An Origin has an associated
OriginHistory object which records details of each invocation of the Origin. In the TRIM user
interface, you can navigate to these runs by selecting the Origin (see Tools → Record →
Origins) and from there, navigate to all OriginHistory objects representing all the runs done
for this Origin. This OriginHistory object is also used by the Event Processor to efficiently
schedule content indexing, word indexing and retention trigger calculations.
How It Works
The BulkDataLoader provides a function called SubmitRecord(), which replaces the
Record.Save() call you would normally use when creating new records. This function places
the underlying TRIM database management layer into a “Bulk Loading” state, telling TRIM
to write details of SQL INSERT transactions to temporary disk files, rather than having
INSERT statements executed individually by the database engine.
A set, or batch, of records would be processed in this way, continually adding to the
temporary files until all the records in the batch have been added. At this stage, the
temporary disk files need to be passed up to the SQL engine for bulk loading. This is done
by calling the ProcessAccumulatedData() function. The batch size can be any value (greater
than zero), but the larger the batch size, the better the performance gain. For large import
tasks in the millions of records, a batch size of 10,000 is reasonable. Any errors that occur
during the loading of a single batch result in the whole batch being discarded, so this needs
to be kept in mind if very large batch sizes are used.
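The batching pattern described above can be sketched as follows. This is only an illustration of the control flow, not SDK code: IBatchSink and CountingSink are hypothetical stand-ins for the BulkDataLoader calls SubmitRecord(), ProcessAccumulatedData() and EndRun(), so the loop can be read and compiled without a TRIM installation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the three BulkDataLoader calls used in a batch loop.
public interface IBatchSink
{
    void Submit(string record);   // plays the role of SubmitRecord()
    void Flush();                 // plays the role of ProcessAccumulatedData()
    void End();                   // plays the role of EndRun(), which submits leftovers
}

// In-memory fake that only records how many records each flush would load.
public sealed class CountingSink : IBatchSink
{
    private int _pending;
    public List<int> FlushSizes { get; } = new List<int>();

    public void Submit(string record) { _pending++; }

    public void Flush()
    {
        if (_pending == 0) return;
        FlushSizes.Add(_pending);
        _pending = 0;
    }

    public void End() { Flush(); }
}

public static class BatchLoop
{
    // Submits every record, flushing after each full batch; End() picks up the remainder.
    public static void Load(IEnumerable<string> records, IBatchSink sink, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        int inBatch = 0;
        foreach (string record in records)
        {
            sink.Submit(record);
            if (++inBatch == batchSize)
            {
                sink.Flush();
                inBatch = 0;
            }
        }
        sink.End();
    }
}
```

With 25 records and a batch size of 10, the fake sink records three flushes of 10, 10 and 5 records. In a real import the same loop shape would call the BulkDataLoader methods directly, and the batch size would be chosen with the discard-on-error behaviour in mind.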
The BulkDataLoader class in detail
Constructor
BulkDataLoader(Origin usingOrigin)
Constructs an instance of the BulkDataLoader class that can be used by your program
for all data importing. You need to provide a valid Origin object which will be used by all
records created by the BulkDataLoader class. See below for more details on the Origin
object.
Methods
bool Initialise()
bool Initialise(bool useBulkLoadMode)
The Initialise function must be called prior to any of the other functions of
BulkDataLoader. It allows you to manage any error situations in the current operating
environment that may prevent the BulkDataLoader from operating in the correct
manner. The Initialise function will check that it can access the OLE DB connection string
and that SQLLDR.exe is available on the path for the host machine if Oracle is the
database being used. The bulk load data files created for each batch must be accessible
to the database server, as either a local path or a shared network path. To obtain the
OLE DB connection string the program doing the bulk loading must be running on the
same machine as a TRIM Workgroup Server. The second form of the function allows you
to specify not to use the BulkLoading style of update, in which case functions like
SubmitRecord will do a normal Save(). This may be useful for testing or development
scenarios. If the Initialise function fails for any reason, the BulkDataLoader will be set to
do normal Save() operations. The reason for the failure can be accessed via the
ErrorMessage property.
Int64 StartRun(String nameOfActualSourceData, String workingFolder)
Having created your BulkDataLoader object and successfully called Initialise, you are
ready to start using it to create some records. When you process your record source,
particularly if you have a large quantity of records to import, you need to break the
import up into one or more “runs”. A run consists of one or more import “batches”. TRIM
creates an OriginHistory object for each run executed – this can be useful if you wish to
later inspect the results of the import. These OriginHistory items can be accessed in
TRIM via the Origin object. In addition, the OriginHistory is recorded against each record
so that you can later inspect a TRIM record and see where it originated. An OriginHistory
is also used to pass work across to the TRIM Event Processor. A single event can be created
for all records in a run to be processed, providing further efficiencies in data processing.
The nameOfActualSourceData parameter is documentary and allows you to provide
some extra information for the OriginHistory item that is created in TRIM. Typically it
may contain the name of the source XML or ASCII file you are processing to create
records from. It is up to your program to provide whatever may be useful.
The workingFolder parameter is the name of a folder where all the temporary bulk
loader files will be created during the SubmitRecord processing. It needs to be large
enough to store all the intermediary SQL bulk loading files. There may be 10 or 12
separate tables involved in the bulk update process. The BulkDataLoader will create a file
corresponding to each table that needs data loaded. In order to allow simultaneous
BulkDataLoader processes, a uniquely named subfolder of the workingFolder will be
automatically created for this Run. For SQL Server, this folder also needs to be accessible
by the SQL Server instance, which usually entails using a UNC path.
Return Code
The function returns the Unique Identifier value for the OriginHistory. This is
pre-allocated and is stored against each record created during a run. You can use this
value to select processed records if your program wishes to subsequently access
these records after bulk data loading has been completed (refer to TRIM record
search methods – originRun).
void EndRun()
Call this method when you have loaded all the records for a run. You can then either
start another run, or simply finish with bulk loading. The function will automatically
submit any remaining work since the last call to ProcessAccumulatedData, update the
statistical counters on the OriginHistory object, and also create events for bulk content
indexing (if any electronic documents were created in the Run), bulk word indexing
(record titles and record notes) and for calculating disposition schedules.
Record NewRecord()
Record NewRecord(RecordType typeOfRecord)
Called within a BulkDataLoader run, this function commences the process of creating a
record. This is very similar to calling a standard Record constructor, however the Record
is initialized using defaults contained within the Origin object associated with the
BulkDataLoader instance. Note that the first form of this function uses the record type of
the Origin object, the second form allows the default record type of the Origin to be
bypassed. The TRIM database is placed in Bulk Loading mode whilst the new record is
initialized.
Return Code
The function returns a Record object which you will eventually submit for Bulk
Loading. Take care when calling functions or setting properties of the Record using
normal SDK.NET commands: these do not execute in Bulk Loader mode and so may have
performance side effects. In particular, never call the Save() or Delete() methods.
To set property values, custom user field values and
to attach an electronic document, use the special BulkDataLoader methods below.
bool SetProperties(TrimObject ofNewObject,
                   array<PropertyOrFieldValue> objectMetadata)
This function allows you to set up the properties of the record that you want to create.
Presumably your input source will contain values that you wish to transfer across to your
TRIM record. The BulkDataLoader provides this single method for setting all property
and custom user-field values in one method call. (You could call the method many times
for one record, using different property value arrays if you like, but Bulk Loading is about
minimizing calls). So, to setup your record you would first create an array of all the
property and custom field values you want to set, using the SDK.NET
PropertyOrFieldValue helper class. Set each PropertyOrFieldValue to the appropriate
value from your input data source, then call SetProperties, passing the record you
created with the NewRecord call as the ofNewObject parameter and the array as the
objectMetadata parameter.
Return Code
Some of the TRIM business rules may be “relaxed” during bulk loader mode, however
it is possible that some of the property values will be deemed unsuitable by TRIM. In
this case, the function will return false, the reason for the failure will be described in
the BulkDataLoader ErrorMessage property.
bool SetDocument(Record ofNewRecord,
                 InputDocument electronicRecord,
                 BulkCopyStyle copyStyle)
bool SetDocument(Record ofNewRecord,
                 InputDocument electronicRecord)
If you wish to create an electronic record in TRIM, you need to pass through details of
the electronic document that will be attached to the record. The SetDocument method
allows you to do this in a highly optimized way. The second form of the function uses the
BulkCopyStyle of WindowsFileCopy. To attach a document, you first construct an
instance of the InputDocument helper class and set it up appropriately to represent your
source document (it may be a Windows file, an email message, etc.). You then call the
BulkDataLoader SetDocument function passing in the Record object created in the
NewRecord function, the InputDocument and a BulkCopyStyle value.
The BulkCopyStyle parameter allows you to increase the performance of document
transfers to the TRIM document store. If you are using a Windows File System store type,
the BulkDataLoader will see if the Path of this Document Store is accessible to the
BulkDataLoader process. If this is the case, it will create a special part of this store for
storing bulk loaded documents. This store location is fully supported by the TRIM
workgroup and documents stored in this special store subfolder will behave like any
other normal TRIM electronic document. To further improve performance, you can
specify to use a Windows FileMove operation rather than a FileCopy operation. This
means that the source document will be removed as part of the operation. A Windows
FileMove will be much faster if the Document Store is located on the same hard drive as
the source document.
If your normal storage arrangements make it difficult to set up this high-performance
storage option, it may well be worth considering creating a temporary store for transfers,
then manually migrating the documents to a permanent store at a later stage, using the
TRIM Store Transfer option.
Should none of the storage optimizations be possible, the BulkDataLoader will drop
back to doing a normal streaming operation to the TRIM Workgroup.
Int64 SubmitRecord(Record newRecordToAdd)
You call SubmitRecord to add the record to the batch of pending records awaiting the SQL
bulk loading process. This function is designed to replace the normal Record.Save() function
and appends details of the Record to one or more files in a format suitable for bulk loading to
your TRIM database. You call SubmitRecord after you have set any properties using
SetProperties and after you have possibly set up an electronic document using SetDocument.
The newRecordToAdd is the Record object you used for these setup calls. This Record is
basically now “done”, should not be used any further and can be disposed of following the call.
SubmitRecord will also do a check of all properties, ensuring values are consistent with
each other (for example that Date Closed is after Date Created, etc).
The BulkDataLoader has its own way of allocating unique record numbers if one is not
set by the SetProperties call. It provides this sequencing in order that records can be
loaded without the need to make a synchronized numbering call to the TRIM
Workgroup. The value will have a prefix of “BL” followed by an 8-digit number which is
equivalent to the record’s unique identifier (e.g. BL00000077).
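As an illustration of the numbering format just described, the following sketch formats a unique identifier as "BL" followed by an 8-digit, zero-padded number. BulkRecordNumber is a hypothetical helper written for this guide, not part of the SDK; the guide does not say what happens for identifiers longer than 8 digits, so this sketch simply lets the number grow.

```csharp
using System;
using System.Globalization;

public static class BulkRecordNumber
{
    // Formats a unique identifier as "BL" plus an 8-digit, zero-padded number.
    // Identifiers with more than 8 digits are emitted unpadded (an assumption).
    public static string Format(long uri)
    {
        if (uri < 0) throw new ArgumentOutOfRangeException(nameof(uri));
        return "BL" + uri.ToString("D8", CultureInfo.InvariantCulture);
    }
}
```

For example, BulkRecordNumber.Format(77) yields "BL00000077", matching the example above. A helper like this may be handy when reconciling import logs against the record numbers that appear in TRIM.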
Return Code
The function returns the Unique Identifier that will be used for this record when it is
eventually created. You cannot use this Uri to instantiate a Record object until the
record has been processed by calling the ProcessAccumulatedData call. It is worth
keeping this in mind if you wish to create Record relationships during an import. For
example if you wished to set the Container property of a Record, the Container must
have already been created in a previous batch (although it can be within the same
run). For this reason, sometimes it is necessary to order your input data to work
within this limitation.
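One way to order input data so that containers are always created in an earlier batch is to group the input by nesting depth. The sketch below is a minimal illustration under stated assumptions: InputItem and ContainerOrdering are hypothetical names, items are identified by a string key from the input source, and any container not present in the input is assumed to exist in TRIM already.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical description of one input row: its own key and its container's key.
public sealed class InputItem
{
    public InputItem(string id, string containerId) { Id = id; ContainerId = containerId; }
    public string Id { get; }
    public string ContainerId { get; }   // null when the item has no container
}

public static class ContainerOrdering
{
    // Returns batches such that every item's container appears in an earlier batch.
    // Containers not present in the input are assumed to exist in the database already.
    public static List<List<InputItem>> ToBatches(IEnumerable<InputItem> items)
    {
        Dictionary<string, InputItem> byId = items.ToDictionary(i => i.Id);
        var depth = new Dictionary<string, int>();

        // Depth 0 = no container in this input; depth n = container is at depth n - 1.
        int DepthOf(InputItem item, int guard)
        {
            if (depth.TryGetValue(item.Id, out int known)) return known;
            if (guard > byId.Count)
                throw new InvalidOperationException("Container cycle at " + item.Id);
            int d = 0;
            if (item.ContainerId != null && byId.TryGetValue(item.ContainerId, out InputItem parent))
                d = DepthOf(parent, guard + 1) + 1;
            depth[item.Id] = d;
            return d;
        }

        return byId.Values
            .GroupBy(i => DepthOf(i, 0))
            .OrderBy(g => g.Key)
            .Select(g => g.ToList())
            .ToList();
    }
}
```

Each returned list would be submitted as its own batch, with a call to ProcessAccumulatedData() between lists, so that the Unique Identifiers of the containers are usable by the time their contents are submitted. Very large depth groups can of course be split further into smaller batches.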
void ProcessAccumulatedData()
This function will issue SQL commands to actually create TRIM records (and possibly any
bulk loaded locations) that have been queued using the SubmitRecord method since the
most recent call to ProcessAccumulatedData. When you choose to call this function
effectively determines how many records there will be in a particular “batch”. Larger
batch sizes will have better performance, however there can be a drawback in that calls
to the SQL bulk loader may sometimes fail.
In the case of failure, the BulkDataLoader attempts to restore the database to its state
prior to the function call, although this too can fail. Basically, the processing involves
issuing a BULK INSERT SQL command for each table involved in storing record metadata
– this can be between 1 and 10 individual SQL tables. If, for example, the 5th of these 10
BULK INSERT statements fails, it is necessary to revert the effects of the preceding 4
BULK INSERT statements. To do this, the BulkDataLoader automatically issues DELETE
statements for all the rows that have been inserted in a “no longer wanted” BULK
INSERT.
Further work will be required to diagnose the reason for failure. This can be difficult as
often the BULK INSERT error will not indicate a particular row that is invalid. Diagnosis
may involve resubmitting the records using a batch size of one to try to identify the
failing record. Failure would generally indicate some unlikely fault in the format of the
input data.
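The "batch size of one" diagnosis described above can be sketched as a small helper. This guide does not state how ProcessAccumulatedData() signals a failure, so the sketch hides that behind a hypothetical tryLoad callback that loads the given records as one batch and returns false if the batch was rejected; BatchDiagnosis is likewise an illustrative name, not an SDK class.

```csharp
using System;
using System.Collections.Generic;

public static class BatchDiagnosis
{
    // tryLoad is a hypothetical callback: it loads the given records as one batch and
    // returns false if the batch was rejected (and therefore discarded as a whole).
    public static List<T> LoadAndIsolateFailures<T>(
        IReadOnlyList<T> batch,
        Func<IReadOnlyList<T>, bool> tryLoad)
    {
        var failed = new List<T>();
        if (tryLoad(batch)) return failed;      // whole batch loaded, nothing to isolate

        // The batch was discarded, so resubmit with a batch size of one.
        foreach (T record in batch)
        {
            if (!tryLoad(new[] { record })) failed.Add(record);
        }
        return failed;
    }
}
```

The records returned are the ones that failed on their own and need inspection; every other record in the batch has been loaded by the single-record retries. This trades throughput for diagnosability, so it is best reserved for the batches that actually fail.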
After successfully processing the data, the Run data is updated with the latest running
totals. You can then decide that the run is complete and call EndRun(), or continue
creating records with subsequent calls of NewRecord().
bool UpdateDocument(Int64 recordUri,
                    InputDocument electronicRecord)
bool UpdateDocument(Int64 recordUri,
                    InputDocument electronicRecord,
                    BulkCopyStyle copyStyle)
These two functions allow the performance improvements provided for SetDocument to
also be applied to checking in new revisions. This is useful when developing applications
to synchronise documents stored locally on a hard drive with TRIM. The first format is
identical to calling the second form with a copyStyle of WindowsFileCopy. Refer to the
SetDocument function for details on how the BulkCopyStyle works. The
UpdateDocument function will either attach the document to an existing record if it has
no document, or will create a new revision of the document associated with TRIM. Note
that BulkDataLoading only supports INSERT, so the record is updated immediately after
an UpdateDocument, rather than relying on ProcessAccumulatedData.
Miscellaneous Properties
bool CreateContentIndexEvents
Default is true. Instructs the BulkDataLoader to create events to content index all records
created within a Run. Content Indexing involves calls to the TRIM Content Indexing
engine to analyse the contents of electronic documents and build an index so that
records can be retrieved based on words within the document contents.
bool CreateWordIndexEvents
Default is true. Instructs the BulkDataLoader to create events to word index all records
created within a Run. Word indexing involves creating title word and notes word indexes
for fast searching of records by title and notes.
String LogFileName
The BulkDataLoader class allows you to specify a log file to which all SQL calls are
written, together with some timings and the tables affected. The log file will be invaluable for
diagnosing any SQL errors that may occur during import runs, particularly if rollbacks have
failed.
bool CreateRetentionTriggerEvents
Default is true. Instructs the BulkDataLoader to create events to calculate disposition
schedules for all records created within a Run. A disposition schedule is created based
on applying rules within a TRIM retention schedule that has been attached to a record,
determining optimal dates for archival or destruction.
Int64 RunHistoryUri
The Unique Identifier of the OriginHistory object that has been created for the run
currently in progress. You can use it to construct that OriginHistory object.
Locations and the BulkDataLoader
Because TRIM has a heavy reliance on relationships between records and locations, some
extra capabilities have been provided for creating and finding locations. When you wish to
set a property on a record that is a location (for example, the Author property), you will
need to convert your input data in such a way that either a new location is created or an
existing location is used. Typically you would first search to see if a location exists – this can
be tricky because people’s names are not necessarily unique. To improve performance of
location searches, the BulkDataLoader has some search methods that act on a set of
pre-fetched arrays held in memory.
Location Matching Functions
Int64 FindLocation(PropertyValue locationMatchingProperty)
This function will search the in-memory locations array for a location that has a property
value that matches the value you wish to search by. The properties available for this
search are:
PropertyIds.LocationSortName
PropertyIds.LocationIdNumber
PropertyIds.LocationEAddressName
Return Code
If a single match is found, the Unique Identifier of that location is returned. If no
matches were found, or if multiple matches were found, the return value is 0.
array<Int64> FindAllLocations(PropertyValue locationMatchingProperty)
This function works similarly to the FindLocation() function above except for the return
code. The properties available for this search are:
PropertyIds.LocationSortName
PropertyIds.LocationIdNumber
PropertyIds.LocationEAddressName
Return Code
If no match is found, returns a null value. If one or more matches occur, the array of
matching Unique Identifiers is returned.
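Because FindLocation() returns 0 both when nothing matches and when several locations match, a caller that needs to tell those cases apart can inspect the result of FindAllLocations() instead. The sketch below shows one way to classify such a result; MatchKind and LocationMatch are hypothetical helper names, and the array is assumed to be surfaced in C# as long[].

```csharp
using System;

public enum MatchKind { None, Unique, Ambiguous }

public static class LocationMatch
{
    // Classifies the result of a FindAllLocations-style call, which the guide says
    // returns null when nothing matches and an array of identifiers otherwise.
    public static MatchKind Classify(long[] matches, out long uri)
    {
        uri = 0;
        if (matches == null || matches.Length == 0) return MatchKind.None;
        if (matches.Length > 1) return MatchKind.Ambiguous;
        uri = matches[0];
        return MatchKind.Unique;
    }
}
```

A typical use would be to create a new location only for MatchKind.None, and to log MatchKind.Ambiguous cases for manual review rather than creating a duplicate.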
Int64 FindPerson(String surname,
                 String givenNames,
                 String initials,
                 TrimDateTime dateOfBirth)
This function will search the in-memory locations arrays for a person that matches the
parameter values. You should always use a non-blank surname. The other parameters
can be left blank if you do not want to use them in the match.
Return Code
If a single match is found, the Unique Identifier of that person location is returned. If
no matches were found, or if multiple matches were found, the return value is 0.
array<Int64> FindAllPeople(String surname,
                           String givenNames,
                           String initials,
                           TrimDateTime dateOfBirth)
This function works similarly to the FindPerson() function above except for the return
code.
Return Code
If no match is found, returns a null value. If one or more matches occur, the array of
matching Unique Identifiers is returned.
Location Creation Functions
If you do not find an existing location to use for a record property, you may choose to
create one. BulkDataLoader supports creating locations in a similar way to the way records
are created. Note that the SetProperties call will work for Location and Record objects.
Location NewLocation()
Initialises a new Location object that is to be created using the BulkDataLoader. This is
similar to the NewRecord function. Similarly you should not call Save() or Delete() on this
location.
Int64 SubmitLocation(Location newLocationToAdd)
Submits the location to the bulk loader process. Data relating to the location will be
accumulated in the temporary SQL loader table. The ProcessAccumulatedData function
will process any accumulated locations together with any records accumulated since the
preceding call.
The Origin object
The TRIM Origin object is provided primarily to support managing the bulk loading process.
It has a number of properties which provide default values for records created with the
BulkDataLoader. This means that these defaults can be changed using the TRIM User
interface, without having to either write an alternative interface or change code. In addition,
the Origin has links to OriginHistory objects which record details of every BulkDataLoader
run that has been done for that origin. This allows records imported in this way to be easily
identified for possible subsequent quality control measures / user review, etc. The Origin
object also has the capability of storing additional configuration parameters by way of
supporting a custom XML file. Origins of type “Tab Delimited” also support a mapping
capability which maps columns in the input source to record properties. This feature may be
useful when developing your own importing utility.
Appendix A
Bulk Loader Sample Code
The following example code is part of the SDK samples for SDK.NET. It demonstrates a typical
sequence used to create records via the Bulk Loader.
using System;
using HP.HPTRIM.SDK;   // TRIM SDK.NET namespace (assumed; adjust to match your SDK assembly)

class bulkLoaderSample : IDisposable
{
    // state shared between run() and getNextRecord()
    private Origin m_origin;
    private BulkDataLoader m_loader;
    private int m_recordCount;

    public bulkLoaderSample()
    {
        m_origin = null;
        m_loader = null;
        m_recordCount = 0;
    }

    public void Dispose()
    {
        if (m_loader != null)
        {
            m_loader.Dispose();
            m_loader = null;
        }
        if (m_origin != null)
        {
            m_origin.Dispose();
            m_origin = null;
        }
    }

    bool getNextRecord(PropertyOrFieldValue[] fields)
    {
        // this function would normally be processing some input data source,
        // reading the next item, setting up the fields array from this input.
        // For the example, we "hard code" some simple property values and
        // provide 5 records.
        m_recordCount++;
        if (m_recordCount > 5)
        {
            return false;
        }

        // a simple title
        String title = "Test Bulk Loaded Record Import #" + System.Convert.ToString(m_recordCount);
        title += ", imported from run number " + System.Convert.ToString(m_loader.RunHistoryUri);

        // some notes
        String notes = "Some notes for the bulk imported record #"
            + System.Convert.ToString(m_recordCount)
            + " data imported at: "
            + TrimDateTime.Now.ToLongDateTimeString();

        // a random date
        TrimDateTime dateCreated = new TrimDateTime(2007, 12, 14, 12, 0, 0);

        // create a location "on the fly" for the Author property
        PropertyOrFieldValue[] authorFields = new PropertyOrFieldValue[2];
        authorFields[0] = new PropertyOrFieldValue(PropertyIds.LocationTypeOfLocation);
        authorFields[1] = new PropertyOrFieldValue(PropertyIds.LocationSortName);
        authorFields[0].SetValue(LocationType.Position);
        authorFields[1].SetValue("Position #" + System.Convert.ToString(m_recordCount));

        // if we are rerunning, check to see if it already exists
        PropertyValue positionName = new PropertyValue(PropertyIds.LocationSortName);
        positionName.SetValue("Position #" + System.Convert.ToString(m_recordCount));
        Int64 authorUri = m_loader.FindLocation(positionName);
        if (authorUri == 0)
        {
            // use the loader to setup the new location's properties
            using (Location authorLoc = m_loader.NewLocation())
            {
                m_loader.SetProperties(authorLoc, authorFields);
                // now submit the location to the bulk loader queue
                authorUri = m_loader.SubmitLocation(authorLoc);
            }
        }

        // now set up the fields array based on these values
        fields[0].SetValue(title);
        fields[1].SetValue(dateCreated);
        fields[2].SetValue(notes);
        fields[3].SetValue(authorUri);
        return true;
    }

    public bool run(Database db)
    {
        Console.WriteLine("Initialise BulkDataLoader sample ...");

        // create an origin to use for this sample. Look it up first just so you can rerun the code.
        m_origin = db.FindTrimObjectByName(BaseObjectTypes.Origin, "Bulk Loader Sample") as Origin;
        if (m_origin == null)
        {
            m_origin = new Origin(db, OriginType.Custom1);
            m_origin.Name = "Bulk Loader Sample";
            m_origin.OriginLocation = "n.a";
            // sample code assumes you have a record type defined called "Document"
            m_origin.DefaultRecordType = db.FindTrimObjectByName(
                BaseObjectTypes.RecordType,
                "Document") as RecordType;
            // don't bother with other origin defaults for the sample, just save it so we can use it
            m_origin.Save();
        }

        // construct a BulkDataLoader for this origin
        m_loader = new BulkDataLoader(m_origin);

        // initialise it as per instructions
        if (!m_loader.Initialise())
        {
            // this sample has no way of dealing with the error.
            Console.WriteLine(m_loader.ErrorMessage);
            return false;
        }

        Console.WriteLine("Starting up an import run ...");
        // the sample is going to do just one run, let's get started...
        // you will need to specify a working folder that works for you (see programming guide)
        m_loader.StartRun("Simulated Input Data", "C:\\junk");

        // set up the property array that will be used to transfer record metadata
        PropertyOrFieldValue[] recordFields = new PropertyOrFieldValue[4];
        recordFields[0] = new PropertyOrFieldValue(PropertyIds.RecordTitle);
        recordFields[1] = new PropertyOrFieldValue(PropertyIds.RecordDateCreated);
        recordFields[2] = new PropertyOrFieldValue(PropertyIds.RecordNotes);
        recordFields[3] = new PropertyOrFieldValue(PropertyIds.RecordAuthor);

        // now let's add some records
        while (getNextRecord(recordFields))
        {
            Console.WriteLine("Importing record #" + System.Convert.ToString(m_recordCount) + " ...");
            using (Record importRec = m_loader.NewRecord())
            {
                // set the record properties
                m_loader.SetProperties(importRec, recordFields);

                // attach an electronic document. The sample uses a mail message
                // to avoid having to find a document on the hard drive somewhere.
                using (InputDocument doc = new InputDocument())
                {
                    doc.SetAsMailMessage();
                    doc.Subject = "Imported mail message #" + System.Convert.ToString(m_recordCount);
                    doc.Content = "Some mail messages have very little content";
                    doc.SentDate = new TrimDateTime(2007, 12, 14, 10, 30, 15);
                    doc.ReceivedDate = new TrimDateTime(2007, 12, 14, 14, 30, 30);
                    using (EmailParticipant mailContactFrom = new
                        EmailParticipant("random.kindness@oneworld.com",
                        "Random Kindness", "SMTP"))
                    {
                        doc.SetAuthor(mailContactFrom);
                    }
                    using (EmailParticipant mailContactTo = new
                        EmailParticipant("little.animals@oneworld.com", "Little Animals", "SMTP"))
                    {
                        doc.AddRecipient(MailRecipientType.To, mailContactTo);
                    }
                    m_loader.SetDocument(importRec, doc, BulkCopyStyle.WindowsFileCopy);
                }

                // submit it to the bulk loader
                m_loader.SubmitRecord(importRec);
            }
        }

        // by now the loader has accumulated 5 record inserts and 5 location inserts
        // if you set a breakpoint here and look into the right subfolder of your working folder
        // you will see a bunch of temporary files ready to be loaded into the SQL engine

        // process this batch
        Console.WriteLine("Processing import batch ...");
        m_loader.ProcessAccumulatedData();

        // grab a copy of the history object uri (it is not available in bulk loader after we end the run)
        Int64 runHistoryUri = m_loader.RunHistoryUri;

        // we're done, let's end the run now
        Console.WriteLine("Processing complete ...");
        m_loader.EndRun();

        // just for interest, let's look at the origin history object and output what it did
        using (OriginHistory hist = new OriginHistory(db, runHistoryUri))
        {
            Console.WriteLine("Number of records created ..... "
                + System.Convert.ToString(hist.RecordsCreated));
            Console.WriteLine("Number of locations created ... "
                + System.Convert.ToString(hist.LocationsCreated));
        }
        return true;
    }
}