ProjectWise 101 – Chapter 9
Document Indexing
Gary Cochrane – Technical Director, Geospatial Sales – North America

Introduction
• ProjectWise Document Indexing
  – Really means three things
    • Full Text Indexing, in support of full text searching
    • Thumbnail Extraction
    • Document Property Extraction
      – We won’t cover this one in PW101
      – See the Bentley Institute PW Admin course guide for this

Full Text Indexing
• We did not write the engine for this
  – But elected to use the one Microsoft provides
    • Included with every copy of Windows
  – That engine is called the MS Indexing Service
    • And it was installed in the VM as an optional Windows component
  – Microsoft indexes the following file formats
    • MS Word, Excel, PPT, HTML, XML, TXT

Pre-installed in VM
• ProjectWise Integration Server
• ProjectWise Orchestration Framework
• MicroStation V8i-SS1
• Supported Database Engine
• Microsoft Message Queuing Service
• Microsoft Indexing Service
• Microsoft .NET Framework 2.0
• Windows Server 2003 with SP2

Extending the MS Index Service
• Microsoft provides an SDK for third parties to extend the Indexing Service
  – So the Indexing Service will know how to “filter” files from that vendor
• For instance, Adobe provides an “iFilter” that teaches the MS Index Service how to extract text from a PDF file
• The Adobe PDF iFilter is installed with Acrobat Reader V9x

Indexing Overview
• Within PW, Indexing consists of:
  – Scheduling
    • A process that wakes up, checks for new (or modified) files, adds them to the Copy-out queue, and goes back to sleep
  – Copy-out
    • Copy the file from the Storage Area to the machine running the Indexing Service. Then add the file to the Extraction queue.
    • Remember, files may be stored on multiple servers
    • Also, in large installations, a machine may be dedicated to indexing

Indexing Overview – Part II
• Overview – continued
  – Extraction
    • This process gets the text from the file and adds it to the MS Index catalog. Then it adds the file to the Update queue
  – Update
    • This process sets the flag on the file (in the PW database) that says it is “done”
    • New files are added with the flag set to “undone”
    • Check-out/in causes the flag to be set to “undone”

A note on “done”
• Done does not necessarily mean it was successful
  – It means the file has been processed
• In other words, what happens if an unknown file (e.g., an AutoCAD file) is sent to the Indexing Service?
  – The file is attempted…
    • And the Indexing Service says, “I don’t know how to extract text from this file”
  – There would be no point in trying the file again
    • So it is marked as “done”, even when unsuccessful

MicroStation and AutoCAD
• ProjectWise provides a mechanism to index the text from these file types
  – Instead of writing an iFilter, Bentley elected to:
    • Copy-out the file
    • Run MicroStation in the background, extract all the text, and write it to an XML file
    • Send the XML file to the Indexing Engine
  – Since MicroStation can parse DWG as well…
    • This method saved us from having to write two iFilters

Summary
• So within ProjectWise, we index:
  – Word, PPT, Excel, XML, HTML, TXT
  – Adobe PDF
  – DGN and DWG
• More good news
  – iFilters can be found for many file formats
    • Some free, and some for purchase

PW Orchestration Framework
• Remember when we installed this?
  – PWOF is responsible for managing batch processes for ProjectWise
    • This includes all those processes discussed on the previous slides
  – For Full Text Indexing, that means
    • Scheduler process, Copy-out process, Extraction process, Updater process, and the MicroStation instance running in the background

Lab 1a
• PW Orchestration Framework
  – Start the Windows Task Manager
    • Hint: Right-click on an empty part of the Taskbar
  – Examine memory usage
    • On the Performance tab
  – Switch to Processes tab
    • Sort by Mem Usage column (descending)
    • Look for ustation.exe
    • Look for DmsAfpEngine(s)
  – Lots of memory consumed here…

Lab 1b
• Now open Services dialog
  – Remember “gears” icon on Quick-Launch
• Locate PW Orchestration Framework service
  – Select the PWOF service, and choose> Stop
    • Watch memory usage in Task Manager
  – For remainder of exercise, we need PWOF running
    • So start it back up now
• Note PWOF is configured for automatic startup
  – It will run each time the machine is booted
  – Close Services and Task Manager

Lab 2a
• Open PW Administrator
  – Log in as> adminpw
  – Drill down to:
    • Document Processors> Full Text Indexing
  – Right-click, choose> Properties

Lab 2b - Full Text Indexing
• Accept default, unless Indexing is to be run on another machine
• Turn on
• adminpw
• adminpw
• Set to 60

Lab 2c - Full Text Indexing
• Enable all times in the schedule
• Set to 2

Lab 2d
• Switch to File Type Associations tab
  – Press> Add
    • In the Extension field, enter> DWG
    • In the bottom field, enter> DGN
      – So that DWG files are processed as if they were DGN
  – Press> OK

Lab 2e

Lab 2f
• Still on the File Type Associations tab
  – Again, press> Add
    • In the Extension field, enter> itiff
    • In the bottom, enable> Do not process these documents
      – You can’t extract text from a raster, so this prevents wasted file transfers
  – Press> OK
• Press OK again
  – To close the Full Text Indexing Properties

Lab 2g
• Open Task Manager again
  – Switch to Performance tab
    • Within 2 minutes, you should see heavy CPU usage
    • Memory usage will also go up
  – Up to 60 documents will be indexed in the first pass
    • If there are more than 60 documents to be done, the rest will be queued for the next pass
      – 2 minutes from now

Analysis
• All documents will eventually be processed
  – When done, the index will be ready for fast full text searches
• Once the indexer has caught up, future load will be lighter because only new or modified documents are processed

Lab 3a
• When done, close Task Manager, open PW Explorer
  – Log in as user1
• From the main tool box, select> Find Documents
  – Binocular icon
• Change to Full Text tab
  – Enter Look For> detail
• Press OK to start search
  – Then Close the Search dialog
• Your results should include: DGNs, DWGs, and PDFs

Lab 3b
• Browse to:
  – User1/Document Indexing/MS-SHT
    • These files were not successful because they have an unknown extension
    • But they were attempted, and flagged as done
• Return to PW Administrator
  – Select datasource name (pwdemo)
    • Right-click, choose> Properties
    • Change to Statistics tab
    • Choose Refresh
    • Review Full Text Statistics
  – Close dialog

Lab 3c
• While still in PW Administrator
  – Open Full Text Indexing Properties again
• Switch to the File Type Associations tab
  – Press Add
    • In the Extension field, enter> SHT
    • In the bottom Extension field, enter> DGN
      – So that SHT files will be processed as if they were DGN files
• Press OK to complete the Extension mapping
  – Press OK again to close the Properties dialog

Lab 3d
• Once new file type has been added…
  – Now a small problem
    • These files were flagged as done, and the Indexer won’t try them again unless they are checked out/in
    • And even that won’t work unless you actually make changes…
    • PW compares files to the version on the server, and doesn’t transfer back if there are no changes

Lab 3e
• Rather than check them all out, and back in
  – From PW Administrator
    • Right-click Full Text Indexing
  – Choose>
    • Mark folder Documents for Reprocessing
  – Browse “…” to
    • User1/Document Indexing/MS-SHT
  – Press OK
• Press OK again

Analysis
• Within 2 minutes, these documents will be reprocessed
  – If you run the search again (in a few minutes), you should also get SHT files in your results
  – Re-visit Datasource statistics to see if the Full Text categories have changed

Summary
• Once the index is created,
  – You can stop the PW Orchestration Framework service
    • It is used to create the index, but not to search the index
  – This will save memory, and CPU cycles
    • So in a demo, your machine will run faster
    • BUT new (or modified) files will not be re-indexed
  – Up until now, the PWOF was not being used at all
    • Full Text Indexing is the first time we’ve needed PWOF, even though it has been running since installation

PW Thumbnails
• PW Thumbnails is not “indexing” in the proper sense, but it is similar in nature to Full Text
  – PW Thumbnails extracts a thumbnail from the document, and stores a copy in the PW database
    • This allows one to browse PW Explorer, and see thumbnails in the Preview Pane
  – Not all file types support thumbnails
    • Among those that do, some don’t do it per the industry standard

Thumbnails – Part II
• Important to remember
  – ProjectWise does not create thumbnails
    • It only extracts what might be in the file
  – A good test is to check to see if Windows Explorer displays a thumbnail for the file
    • If it does, then PW should as well

Lab 4a
• Open Windows Explorer
  – Browse to:
    • C:\PW-101 Class Files\Document Indexing\MS-V8
  – Change to Thumbnail display
    • MicroStation V8 files have thumbnails

Lab 4b
• Browse through remaining Document Indexing folders
  – Note which include thumbnails
  – Additional notes
    • PDF files take a long time because you are really looking at a small view of the whole file, not a thumbnail
    • AutoCAD doesn’t adhere to the industry standard
      – These files only display correctly because MicroStation is installed, and is responsible for displaying a thumbnail
      – Autodesk may have fixed this in later versions?
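
The "extract, don't create" point from the Thumbnails slides can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not ProjectWise code: the `StoredFile` class, the `thumbnail_table` dictionary standing in for the PW database, and the file names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class StoredFile:
    """Hypothetical stand-in for a file in a storage area."""
    name: str
    # None means the file carries no embedded thumbnail.
    embedded_thumbnail: Optional[bytes] = None


def extract_thumbnail(file: StoredFile) -> Optional[bytes]:
    """Return the thumbnail already embedded in the file, or None.

    Nothing is rendered here: a file without an embedded thumbnail
    simply yields no preview.
    """
    return file.embedded_thumbnail


def run_thumbnail_pass(files: List[StoredFile], thumbnail_table: Dict[str, bytes]) -> None:
    """Copy each embedded thumbnail into the simulated database table."""
    for f in files:
        thumb = extract_thumbnail(f)
        if thumb is not None:
            thumbnail_table[f.name] = thumb


def preview(name: str, thumbnail_table: Dict[str, bytes]) -> Optional[bytes]:
    """Serve the preview from the database copy, never from the storage area."""
    return thumbnail_table.get(name)


# Hypothetical sample data: one file with an embedded thumbnail, one without.
files = [
    StoredFile("plan.dgn", embedded_thumbnail=b"fake-thumbnail-bytes"),
    StoredFile("notes.txt"),
]
table: Dict[str, bytes] = {}
run_thumbnail_pass(files, table)
```

In this sketch `preview("plan.dgn", table)` returns the stored bytes, while `preview("notes.txt", table)` returns `None` because there was nothing in the file to extract.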
Lab 5a
• Open PW Administrator
  – Log in as> adminpw
  – Drill down to:
    • Document Processors> Thumbnail Extraction
  – Right-click, choose> Properties
• Similar to Full Text Indexing
  – But actually less involved

Lab 5b
• Turn on
• adminpw
• adminpw
• Set to 60

Lab 5c
• Enable all times in the schedule
• Set to 2

Lab 5d
• No changes required on the File Type Associations tab
  – Press OK to complete the configuration and close the dialog
• Within a few minutes, thumbnails should show up in the preview pane

Analysis
• Thumbnails are extracted and stored in the PW database
  – Because document storage may not be local
    • Thus “touching” the document to see the thumbnail in real time is not practical
  – Thumbnail notes
    • Requires less processing than full text
      – MicroStation not running in this process
    • Requires PWOF to extract, but not to display

Review
• Topics covered in this Chapter
  – Full Text Indexing – Configuration
  – Full Text Searches
  – ProjectWise Orchestration Framework
  – Thumbnail Extraction
  – Microsoft Indexing Service
    • And iFilters to extend default supported file types
    • (I have a free Visio and MSG iFilter from Microsoft)
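
As a recap of the full text indexing flow from this chapter (Scheduling, Copy-out, Extraction, Update, the "done" flag, file type associations, and marking a folder for reprocessing), here is a minimal Python sketch. It is a simulation only and does not call any ProjectWise or Microsoft API. The `Document` class, the `KNOWN_FILTERS` set, the file names, and the choice to flag "do not process" documents as done without copying them out are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, List, Optional

# Extensions the simulated index service can filter (hypothetical list).
KNOWN_FILTERS = {"doc", "xls", "ppt", "html", "xml", "txt", "pdf", "dgn"}

# Extension -> extension to process it as; None means "do not process".
associations: Dict[str, Optional[str]] = {"dwg": "dgn", "itiff": None}


@dataclass
class Document:
    path: str
    text: str
    done: bool = False  # new files start as "undone"


def extension(path: str) -> str:
    return path.rsplit(".", 1)[-1].lower()


def run_pass(docs: List[Document], catalog: Dict[str, str], batch_size: int = 60) -> int:
    """Run one scheduled pass and return how many documents were processed."""
    # Scheduling: pick up to batch_size undone documents.
    copy_out_queue = deque([d for d in docs if not d.done][:batch_size])
    extraction_queue: deque = deque()
    update_queue: deque = deque()

    # Copy-out: skip the transfer for extensions marked "do not process".
    # Assumption: such documents go straight to the Update step.
    while copy_out_queue:
        doc = copy_out_queue.popleft()
        ext = extension(doc.path)
        if ext in associations and associations[ext] is None:
            update_queue.append(doc)
        else:
            extraction_queue.append(doc)

    # Extraction: add text to the catalog only when a filter exists.
    while extraction_queue:
        doc = extraction_queue.popleft()
        ext = extension(doc.path)
        effective = associations.get(ext, ext)
        if effective in KNOWN_FILTERS:
            catalog[doc.path] = doc.text
        update_queue.append(doc)

    # Update: flag as done whether or not extraction succeeded.
    processed = len(update_queue)
    for doc in update_queue:
        doc.done = True
    return processed


def mark_folder_for_reprocessing(docs: List[Document], folder: str) -> None:
    """Reset the done flag for every document under the given folder."""
    for doc in docs:
        if doc.path.startswith(folder):
            doc.done = False


# Hypothetical file names under folders like those used in the labs.
docs = [
    Document("User1/Document Indexing/MS-V8/site.dgn", "detail A"),
    Document("User1/Document Indexing/MS-SHT/plan.sht", "detail B"),
    Document("User1/Document Indexing/Raster/scan.itiff", ""),
]
catalog: Dict[str, str] = {}

first = run_pass(docs, catalog)    # all three flagged done; only site.dgn is indexed
associations["sht"] = "dgn"        # add the SHT -> DGN association (as in Lab 3c)
second = run_pass(docs, catalog)   # nothing happens: plan.sht is already "done"
mark_folder_for_reprocessing(docs, "User1/Document Indexing/MS-SHT")
third = run_pass(docs, catalog)    # plan.sht is retried and now indexed
```

The middle pass is the "small problem" from Lab 3d: adding the association alone changes nothing, because the done flag hides the SHT file from the scheduler until the folder is marked for reprocessing.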