Another Way to Attack the BLOB:
Server-side Access via PL/SQL and Perl

VUGM 2002
Saturday, April 27
Session 61

Presented by:
Roy Zimmer
Programmer/Analyst
Office of Information Technology
Western Michigan University


-- PL/SQL example of accessing the BLOB

-- need following line to enable output to screen
set serveroutput on format wrapped size 1000000
-- also bear in mind that with dbms_output.put_line,
-- we're limited to a maximum string length of 255

-- this example is easily adaptable to bib or mfhd.
-- merely change "auth" to "bib" or "mfhd"
-- in the case of mfhd, the recseg length needs to be 300

-- also remember to replace "wmichdb" in the query below
-- with the database name at your installation

declare
  cursor marcrec is
    select auth_id, record_segment, seqnum
    from wmichdb.auth_data
    where auth_id >= 650161 and auth_id <= 650163
    order by auth_id asc, seqnum desc;
  -- why order query like this?
  -- see assemble marc record section below for explanation

  m_authid    number(22);
  m_recseg    char(990);
  m_seqnum    number(22);

  marc        varchar2(17000);
  marclen     integer(5);
  baseaddr    integer(5);
  strptr      integer(5);
  tagid       varchar2(3);
  taglen      integer(4);
  offset      integer(5);
  tagaddr     integer(5);
  tagdata     varchar2(9999);

  idx         integer(5);
  strlen      integer(5);
  pos         integer(5);
  mchar       char(1);
  subfldchar  char(1) := '|';
  marceof     boolean := false;

begin
  open marcrec;

  -- this loop goes through all the rows returned by the cursor above
  loop

    -- begin assemble marc record section
    marc := '';
    fetch marcrec into m_authid, m_recseg, m_seqnum;
    exit when marcrec%notfound;
    -- previous line is our exit for when we've gotten all the cursor's records.
    -- exit mechanism below in case we hit the end of the cursor
    while m_seqnum <> 1 loop
      marc := m_recseg || marc;
      fetch marcrec into m_authid, m_recseg, m_seqnum;
      if marcrec%notfound then
        marceof := true;
        exit;
      end if;
    end loop;
    exit when marceof;
    -- get segment #1
    marc := m_recseg || marc;

    ----------------------
    -- EXPLANATION: start off by getting a marc record segment.
    -- if it's a typical record, all the data is in this one segment and
    -- only the last line of code gets executed.
    -- if this is a longer record spread across multiple rows, we have the
    -- last segment of this record (remember the query order above?).
    -- the while loop is now employed to march through successive rows from
    -- our cursor, going backwards by seqnum and prepending successive
    -- segments to what we already have.
    -- so why go "backwards"? if we predicate the segment-to-marc-record
    -- assembly on when the auth_id changes, once it changes we've gone too
    -- far and can't go back to get the last segment to completely assemble
    -- the now previous record. thus the looping is predicated on seqnum in
    -- reverse order because there will *always* be a seqnum of -1-.
    -- if there are multiple segments, we'll always end with a seqnum
    -- of -1- *and* still be on the same auth_id and can go on processing
    -- the record.
    -- end assemble marc record section

    -- get the record length and the base address where tag data begins
    marclen := substr(marc, 1, 5);
    baseaddr := substr(marc, 13, 5) + 1;
    dbms_output.put_line('authid= ' || m_authid || ' marclen= ' || marclen ||
                         ' baseaddr= ' || baseaddr);

    -- begin get tag section
    -- this loop goes through the record's directory, reading a tag's
    -- information and its data, and doing some formatting

    -- position ourselves at the start of the directory
    strptr := 25;
    -- start going through directory and stopping at the end,
    -- before the base address where the data begins
    while strptr < baseaddr-1 loop
      -- getting the tag's parameters from the directory "triplet"
      tagid := substr(marc, strptr, 3);
      taglen := substr(marc, strptr+3, 4);
      offset := substr(marc, strptr+7, 5);
      dbms_output.put_line('strptr= ' || strptr || ' tagid= ' || tagid ||
                           ' taglen= ' || taglen || ' offset= ' || offset);
      -- compute where the tag's data starts
      tagaddr := baseaddr + offset;
      dbms_output.put_line('tagaddr= ' || tagaddr);
      -- read the tag's data
      tagdata := substr(marc, baseaddr+offset, taglen-1);
      -- I like to convert the subfield delimiter to
      -- a vertical bar (|) for better readability
      tagdata := translate(tagdata, chr(31), subfldchar);

      -- do some additional readability formatting by adding spaces
      -- around a subfield indicator
      idx := 2;
      strlen := length(tagdata);
      while idx < strlen loop
        mchar := substr(tagdata, idx, 1);
        if mchar = subfldchar then
          tagdata := substr(tagdata, 1, idx-1) || ' ' || substr(tagdata, idx, 2)
                     || ' ' || substr(tagdata, idx+2);
          idx := idx + 2;
          strlen := length(tagdata);
        end if;
        idx := idx + 1;
      end loop;

      -- output the tag's data
      -- breaking it into 80 character sections looks better and
      -- gets us around the 255 character limit
      -- (<= so that a final one-character section is not dropped)
      pos := 1;
      while pos <= strlen loop
        dbms_output.put_line(substr(tagdata, pos, 80));
        pos := pos + 80;
      end loop;

      -- move to the next tag in the directory
      strptr := strptr + 12;
    end loop;
    -- end get tag section

  end loop;
  close marcrec;
end;
/


#!/usr/local/bin/perl

### extract marc data to standard output from Voyager database as
### raw Marc records or human-readable data

# this is required for database access
use DBI;

# get input arguments; the first three are required
if ($#ARGV < 2) {usage();}
$searchtype = $ARGV[0];
$idstart = $ARGV[1];
$idend = $ARGV[2];
# optional fourth argument requests raw output
if ($#ARGV >= 3) {$raw = $ARGV[3];}
if ($raw) {$raw = 1;} else {$raw = 0;}

# show program usage
if (($searchtype ne "auth") and ($searchtype ne "bib") and ($searchtype ne "mfhd"))
  {usage();}

### connect to database
# specifying which type of database, the host name, the SID
# and the database username and password
# receive a database handle for this database connection
$dbh = DBI->connect('DBI:Oracle:host=voyager.library.wmich.edu;sid=LIBR',
                    'dbread', 'dbread')
  or die "connecting: $DBI::errstr";

# formulate the query statement to be used
# parameters taken from program arguments above
$sqlquery = sprintf("select %s_id, record_segment, seqnum " .
                    "from wmichdb.%s_data " .
                    "where %s_id >= %s and %s_id <= %s " .
                    "order by %s_id asc, seqnum desc",
                    $searchtype, $searchtype, $searchtype, $idstart,
                    $searchtype, $idend, $searchtype);

# have DBI prepare
# the query, identified by statement handle "sth"
$sth = $dbh->prepare($sqlquery) or die "preparing query statement";

# execute the query, getting a return code
$rc = $sth->execute;

### usual assembly of marc data in reverse order (per sort in query)
### by auth/bib/mfhd id
# shunt complete records to stdout (screen) for raw output, or
# write to array for processing to get human-readable output
$marcstuff = "";
$marc = "";
$oldrec_id = 0;
# following statement gets one row at a time from the query result set
while (($rec_id, $recseg, $seqnum) = $sth->fetchrow_array)
{
  # when transitioning from one marc record to another,
  # print or store previous marc record, and
  # start storing this marc record
  if ($rec_id != $oldrec_id)
  {
    if (!$raw) {$marcstuff = $marcstuff . $marc;} else {print $marc;}
    $oldrec_id = $rec_id;
    $marc = $recseg;
  }
  # else just prepend the record segment to the current marc record being built
  else {$marc = $recseg . $marc;}
}

# handle the last record at the end
if (!$raw) {$marcstuff = $marcstuff .
  $marc;} else {print $marc;}

# release resources associated with this statement handle
$sth->finish;

# release the database connection associated with this database handle
$dbh->disconnect;

# if want human-readable output
if (!$raw)
{
  # marc records are delimited by this character
  # this creates the array of marc records from the
  # previously built string of marc data
  @marcrec = split /\x1d/, $marcstuff;

  # loop through array of marc records
  $idx = 0;
  while ($idx < @marcrec)
  {
    # output the leader
    $leader = substr($marcrec[$idx], 0, 24);
    if ($idx != 0) {printf("\n");}
    printf("LDR:%s\n", $leader);

    # grab the record length and the data base-address,
    # "move" to the start of the directory
    # (substr is zero-based in Perl, so the record length starts at 0)
    $reclen = substr($marcrec[$idx], 0, 5);
    # the - 1 makes each read below start on the preceding field
    # terminator, which is stripped during formatting
    $baseaddr = substr($marcrec[$idx], 12, 5) - 1;
    $strptr = 24;

    # loop through the directory
    while ($strptr < $baseaddr-1)
    {
      # get the tag id, the tag's length, and the tag's offset
      $tagid = substr($marcrec[$idx], $strptr, 3);
      $taglen = substr($marcrec[$idx], $strptr+3, 4);
      $offset = substr($marcrec[$idx], $strptr+7, 5);

      # read the tag's data from the computed start of the tag's data,
      # for tag length characters
      $tagdata = substr($marcrec[$idx], $baseaddr+$offset, $taglen);

      # do the pretty printing formatting for human readability
      # (subfield codes may be digits as well as lowercase letters)
      $tagdata =~ s/\x1f[a-z0-9]/ \|$& /g;  # use " |x " for subfield ind,
      $tagdata =~ s/\x1f//g;                # remove original subfield ind,
      $tagdata =~ s/\x1e//g;                # remove field ind,
      if (substr($tagdata, 2, 2) eq " |")   # & remove the "1st" space in the line
        {$tagdata = substr($tagdata, 0, 2) .
          substr($tagdata, 3);}

      # output the tag parameters and its data
      printf("%3s:%4s:%5s:%s\n", $tagid, $taglen, $offset, $tagdata);

      # move to the next tag in the directory
      $strptr += 12;
    }

    # move to the next record in the array of marc records
    $idx++;
  }

  # provide count of marc records handled
  if ($idx != 1) {$plural = "s read";} else {$plural = " read";}
  printf ("\n<<%d Marc record%s>>\n\n", $idx, $plural);
}

# show this to illustrate program usage
sub usage()
{
  printf ("\nUsage: perl example1.pl [auth | bib | mfhd] startID endID [raw]\n");
  printf ("  Pick one of the 3 data types.\n");
  printf ("  Specify record ID numbers; specified range is inclusive.\n");
  printf ("  Parameters must be in the above order.\n");
  printf ("  All parameters are required except for the last one.\n");
  printf ("  Program extracts marc data from blobs in Oracle.\n");
  printf ("  Output is human-formatted unless *raw* is specified\n");
  printf ("  and it goes to STDOUT.\n");
  exit(0);
}


USMARC Concise Holdings: Leader and Directory

LEADER

A fixed field that comprises the first 24 character positions (00-23) of
each record and provides information for the processing of the record.

00-04 - Logical record length
  The computer-generated, five-character numeric string that specifies the
  length of the entire record. The number is right justified and each
  unused position contains a zero.

05 - Record status
  A one-character code that indicates the relation of the record to a file.
    c - Corrected or revised
    d - Deleted
    n - New

06 - Type of record
  A one-character code that indicates the characteristics of and defines
  the components of the record. When holdings information is embedded in a
  USMARC bibliographic record, this information may be contained in field
  841 $a (Holdings Coded Data Values, Type of record).
    v - Multipart item holdings
    x - Single-part item holdings
    y - Serial item holdings

07-08 - Undefined character positions
  Each contains a blank (#)

09 - Character coding scheme
    # - MARC-8
    a - UCS/Unicode

10 - Indicator count
  The computer-generated number 2 that indicates the number of character
  positions used for indicators in a variable data field.

11 - Subfield code count
  The computer-generated number 2 that indicates the number of character
  positions used for each subfield code in a variable data field.

12-16 - Base address of data
  The computer-generated, five-character numeric string that specifies the
  first character position of the first variable control field in a
  record. The number is right justified and each unused position contains
  a zero.

17 - Encoding level
  A one-character code that indicates the ANSI Z39.44 or ANSI/NISO Z39.57
  level-of-specificity requirements met by the holdings statement. When
  holdings information is embedded in a USMARC bibliographic record, this
  information may be contained in field 841 $e (Holdings Coded Data
  Values, Encoding level).
    1 - Holdings level 1
    2 - Holdings level 2
    3 - Holdings level 3
    4 - Holdings level 4
    5 - Holdings level 4 with piece designation
        Physical piece designation is contained in subfield p (Piece
        designation) of field 852 (Location) or one of the 863-865
        Enumeration and Chronology fields, or in subfield $a (Textual
        holdings) in one of the 866-868 Textual Holdings fields.
    m - Mixed levels
        Holdings are recorded at more than one level. The value in the
        first indicator position (Field encoding level) of the applicable
        863-868 field indicates the level for each holdings data field.
    u - Unknown
    z - Other level

18 - Item information in record
  One-character code that indicates whether item information is in the
  record, contained in one or more occurrences of fields 876-878 (Item
  information fields).
    i - Item information
    n - No item information

19 - Undefined character position
  Contains a blank (#)

20-23 - Entry map
  Four computer-generated, single-digit numeric characters that indicate
  the structure of each entry in the Directory.
    20 - Length of the length-of-field portion
         Contains a 4
    21 - Length of the starting-character-position portion
         Contains a 5
    22 - Length of the implementation-defined portion
         Contains a 0
    23 - Undefined Entry map character position
         Contains a 0

DIRECTORY

A computer-generated index to the location of the variable control and
data fields within a record. The Directory immediately follows the Leader
at character position 24 and consists of a series of fixed-length (12
character positions) entries that give the tag, length, and starting
character position of each variable field.

00-02 - Tag
  Three numeric or alphabetic characters (uppercase or lowercase, but not
  both) that identify an associated field.

03-06 - Field length
  Four numeric characters that indicate the length of the field, including
  indicators, subfield codes, data, and the field terminator. The number
  is right justified and each unused position contains a zero.

07-11 - Starting character position
  Five numeric characters that indicate the starting character position of
  the field relative to the Base address of data (Leader/12-16) of the
  record. The number is right justified and each unused position contains
  a zero.

from: http://www.loc.gov/marc/holdings/echdldrd.html


Resources

Books
  Oracle PL/SQL Programming; Steven Feuerstein with Bill Pribyl; O’Reilly
  Oracle SQL*PLUS The Definitive Guide; Jonathan Gennick; O’Reilly
  Programming Perl; Larry Wall, Tom Christiansen, Randal L. Schwartz; O’Reilly
  Perl Cookbook; Tom Christiansen & Nathan Torkington; O’Reilly
  Perl in a Nutshell; Ellen Siever, Stephen Spainhour, Nathan Patwardhan; O’Reilly

Oracle Documentation
  SQL*PLUS User’s Guide and Reference
  Oracle SQL Reference
  PL/SQL User’s Guide and Reference
  Oracle Supplied PL/SQL Packages Reference

Web
  www.loc.gov/marc/holdings/echdldrd.html   MARC leader details
  lcweb.loc.gov/marc/umb/um01to06.html      MARC record format details
  lcweb.loc.gov/marc/umb/um07to10.html      MARC record format details
  lcweb.loc.gov/marc/umb/um11to12.html      MARC record format details
  http://www.orafaq.org/faq2.htm            Answers to all sorts of questions
  www.revealnet.com                         Lots of Oracle stuff including PL/SQL
  www.cpan.org                              Perl modules such as DBD/DBI, and module documentation
  www.cpan.org/authors/id/TIMB/             DBD-Oracle-1.12.tar.gz
  www.cpan.org/authors/id/TIMB/             DBI-1.20.tar.gz
  www.perl.org                              Perl related links
  www.perl.com                              O’Reilly Perl site: books, documentation, etc.
  www.gnu.org/manual/manual/html            GNU Organization documentation for Unix, other links
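

Supplementary sketch: trying the leader and directory layout without a database

The leader and directory description quoted earlier in this handout can be tried out with no Oracle connection at all. The following is a minimal sketch, not part of the original programs: it builds one small MARC record in memory and then walks its directory using the same leader positions (00-04 and 12-16) and 12-character directory entries. The sample field contents, the leader filler characters, and the helper name parse_marc are all assumptions made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample fields (tag, data); not taken from any real record.
my @fields = (
    [ '001', '650161' ],
    [ '100', "1 \x1faDoe, Jane" ],
);

# Build the directory and the data area; every field ends with \x1e.
my ($directory, $data) = ('', '');
for my $f (@fields) {
    my $body = $f->[1] . "\x1e";
    $directory .= sprintf('%3s%04d%05d', $f->[0], length($body), length($data));
    $data .= $body;
}
$directory .= "\x1e";                       # directory terminator

# Leader: only the record length (00-04) and base address (12-16) matter
# to the parser; the other positions are assumed filler values.
my $base   = 24 + length($directory);
my $reclen = $base + length($data) + 1;     # + 1 for the \x1d record terminator
my $leader = sprintf('%05dnz   22%05dn  4500', $reclen, $base);
die "leader must be 24 characters" unless length($leader) == 24;
my $record = $leader . $directory . $data . "\x1d";

# parse_marc (hypothetical helper): returns the record length, the base
# address, and a list of [tag, text] pairs read via the directory.
sub parse_marc {
    my ($rec) = @_;
    my $len_field = substr($rec, 0, 5) + 0;
    my $base_addr = substr($rec, 12, 5) + 0;
    my @out;
    # Entries are 12 characters each; the directory's own terminator sits
    # at base address - 1, so stop before it.
    for (my $p = 24; $p < $base_addr - 1; $p += 12) {
        my $tag    = substr($rec, $p, 3);
        my $len    = substr($rec, $p + 3, 4);
        my $offset = substr($rec, $p + 7, 5);
        # Read length - 1 characters to leave off the field terminator.
        my $text = substr($rec, $base_addr + $offset, $len - 1);
        $text =~ s/\x1f(.)/ |$1 /g;         # show subfield codes as " |x "
        push @out, [ $tag, $text ];
    }
    return ($len_field, $base_addr, \@out);
}

my ($len, $addr, $tags) = parse_marc($record);
printf("reclen=%d base=%d\n", $len, $addr);
printf("%s: %s\n", @$_) for @$tags;
```

Unlike the program above, this sketch keeps the base address as stored in the leader and reads one character fewer than the field length, so no field terminator needs to be stripped afterwards.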