Another Way to Attack the BLOB: VUGM 2002

Another Way to Attack the BLOB:
Server-side Access via PL/SQL and Perl
VUGM 2002
Saturday, April 27
Session 61
Presented by:
Roy Zimmer
Programmer/Analyst
Office of Information Technology
Western Michigan University
-- PL/SQL example of accessing the BLOB
-- need following line to enable output to screen
set serveroutput on format wrapped size 1000000
-- also bear in mind that with dbms_output.put_line,
--   we're limited to a maximum string length of 255
-- this example is easily adaptable to bib or mfhd:
--   merely change "auth" to "bib" or "mfhd";
--   in the case of mfhd, the recseg length needs to be 300
-- also remember to replace "wmichdb" in the query below
--   with the database name at your installation
declare
cursor marcrec is
select auth_id,
record_segment,
seqnum
from wmichdb.auth_data
where auth_id >= 650161 and auth_id <= 650163
order by auth_id asc, seqnum desc;
-- why order query like this?
--   see assemble marc record section below for explanation
m_authid number(22);
m_recseg char(990);
m_seqnum number(22);
marc       varchar2(17000);
marclen    integer(5);
baseaddr   integer(5);
strptr     integer(5);
tagid      varchar2(3);
taglen     integer(4);
offset     integer(5);
tagaddr    integer(5);
tagdata    varchar2(9999);
idx        integer(5);
strlen     integer(5);
pos        integer(5);
mchar      char(1);
subfldchar char(1):= '|';
marceof    boolean:= false;
begin
open marcrec;
-- this loop goes through all the rows returned by the cursor above
loop
-- begin assemble marc record section
marc:= '';
fetch marcrec into m_authid, m_recseg, m_seqnum;
exit when marcrec%notfound;
-- previous line is our exit for when we've gotten all the cursor's records.
-- exit mechanism below in case we hit the end of the cursor
while m_seqnum <> 1
loop
marc:= m_recseg || marc;
fetch marcrec into m_authid, m_recseg, m_seqnum;
if marcrec%notfound then
marceof:= true;
exit;
end if;
end loop;
exit when marceof;
-- get segment #1
marc:= m_recseg || marc;
----------------------
-- EXPLANATION:
-- start off by getting a marc record segment.
-- if it's a typical record, all the data is in this one
-- segment and only the last line of code gets executed.
-- if this is a longer record spread across multiple rows,
-- we have the last segment of this record (remember the
-- query order above?). the while loop is now employed to
-- march through successive rows from our cursor, going
-- backwards by seqnum and prepending successive segments
-- to what we already have.
--
-- so why go "backwards"?
-- if we predicate the segment-to-marc-record assembly on
-- when the auth_id changes, once it changes we've gone
-- too far and can't go back to get the last segment to
-- completely assemble the now previous record.
-- thus the looping is predicated on seqnum in reverse order
-- because there will *always* be a seqnum of -1-.
-- if there are multiple segments, we'll always end with a
-- seqnum of -1- *and* still be on the same auth_id and can
-- go on processing the record.
-- end assemble marc record section
-- get the record length and the base address where tag data begins
marclen:= substr(marc, 1, 5);
baseaddr:= substr(marc, 13, 5) + 1;
dbms_output.put_line('authid= ' || m_authid || ' marclen= ' || marclen ||
                     ' baseaddr= ' || baseaddr);
-- begin get tag section
-- this loop goes through the record's directory, reading a tag's
--   information and its data, and doing some formatting
-- position ourselves at the start of the directory
strptr:= 25;
-- start going through directory and stopping at the end,
--   before the base address where the data begins
while strptr < baseaddr-1
loop
-- getting the tag's parameters from the directory "triplet"
tagid:= substr(marc, strptr, 3);
taglen:= substr(marc, strptr+3, 4);
offset:= substr(marc, strptr+7, 5);
dbms_output.put_line('strptr= ' || strptr || ' tagid= ' || tagid ||
                     ' taglen= ' || taglen || ' offset= ' || offset);
-- compute where the tag's data starts
tagaddr:= baseaddr + offset;
dbms_output.put_line('tagaddr= ' || tagaddr);
-- read the tag's data
tagdata:= substr(marc, baseaddr+offset, taglen-1);
-- I like to convert the subfield delimiter to a vertical bar (|)
--   for better readability
tagdata:= translate(tagdata, chr(31), subfldchar);
-- do some additional readability formatting by adding spaces
--   around a subfield indicator
idx:= 2;
strlen:= length(tagdata);
while idx < strlen
loop
mchar:= substr(tagdata, idx, 1);
if mchar = subfldchar then
tagdata:= substr(tagdata, 1, idx-1) || ' ' ||
          substr(tagdata, idx, 2) || ' ' ||
          substr(tagdata, idx+2);
idx:= idx + 2;
strlen:= length(tagdata);
end if;
idx:= idx + 1;
end loop;
-- output the tag's data
-- breaking it into 80 character sections looks better and
--   gets us around the 255 character limit
-- (use <= so a final section of one character is not dropped)
pos:= 1;
while pos <= strlen
loop
dbms_output.put_line(substr(tagdata, pos, 80));
pos:= pos + 80;
end loop;
-- move to the next tag in the directory
strptr:= strptr + 12;
end loop;
-- end get tag section
end loop;
close marcrec;
end;
/
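
The reverse-order assembly used in the PL/SQL above can be tried out without a database. The following Python is only a minimal sketch of that idea, not the presenter's code. It assumes the rows arrive as (record id, segment, seqnum) tuples already sorted by id ascending and seqnum descending, as the cursor returns them. The function name assemble_records and the sample rows are hypothetical.

```python
def assemble_records(rows):
    """Yield (rec_id, marc) pairs from rows sorted by id asc, seqnum desc.

    Each row is (rec_id, segment, seqnum). Because seqnum descends, the
    row with seqnum 1 is always the last row fetched for its record, so
    it marks the point where the record is complete.
    """
    marc = ""
    for rec_id, segment, seqnum in rows:
        # Prepend: the later segments of a record are fetched first.
        marc = segment + marc
        if seqnum == 1:
            yield rec_id, marc
            marc = ""


# Hypothetical sample data: record 650161 fits in one segment,
# record 650162 is spread across three rows.
rows = [
    (650161, "AAAA", 1),
    (650162, "ccc", 3),
    (650162, "bbb", 2),
    (650162, "aaa", 1),
]

records = list(assemble_records(rows))
for rec_id, marc in records:
    print(rec_id, marc)
```

Keying completion on seqnum 1 rather than on a change of record id means no look-ahead row is needed to know that a record is finished.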
#!/usr/local/bin/perl
### extract marc data to standard output from Voyager database as
###   raw Marc records or human-readable data
# this is required for database access
use DBI;
# get input arguments
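# the search type and both IDs are required
if ($#ARGV < 2) {usage();}
$searchtype = $ARGV[0];
$idstart = $ARGV[1];
$idend = $ARGV[2];
# an optional 4th argument requests raw output
#   ($#ARGV is the index of the last argument, so 3 means four arguments)
if ($#ARGV == 3) {$raw = $ARGV[3];}
if ($raw) {$raw = 1;}
else {$raw = 0;}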
# show program usage
if (($searchtype ne "auth") and ($searchtype ne "bib") and ($searchtype ne "mfhd"))
{usage();}
### connect to database
# specifying which type of database, the host name, the SID
#   and the database username and password
#   receive a database handle for this database connection
$dbh = DBI->connect('DBI:Oracle:host=voyager.library.wmich.edu;sid=LIBR',
'dbread', 'dbread')
or die "connecting: $DBI::errstr";
# formulate the query statement to be used
#   parameters taken from program arguments above
$sqlquery = sprintf("select %s_id,
record_segment,
seqnum
from wmichdb.%s_data
where %s_id >= %s and %s_id <= %s
order by %s_id asc, seqnum desc",
$searchtype, $searchtype,
$searchtype, $idstart,
$searchtype, $idend,
$searchtype);
# have DBI prepare the query, identified by statement handle "sth"
$sth = $dbh->prepare($sqlquery) or die "preparing query statement";
# execute the query, getting a return code
$rc = $sth->execute;
### usual assembly of marc data in reverse order (per sort in query)
###   by auth/bib/mfhd id
# shunt complete records to stdout (screen) for raw output, or
#   write to array for processing to get human-readable output
$marcstuff = "";
$marc = "";
$oldrec_id = 0;
# following statement gets one row at a time from the query result set
while (($rec_id, $recseg, $seqnum) = $sth->fetchrow_array)
{
# when transitioning from one marc record to another,
#   print or store previous marc record, and
#   start storing this marc record
if ($rec_id != $oldrec_id)
{
if (!$raw) {$marcstuff = $marcstuff . $marc;}
else {print $marc;}
$oldrec_id = $rec_id;
$marc = $recseg;
}
# else just prepend the record segment to the current marc record being built
else {$marc = $recseg . $marc;}
}
# handle the last record at the end
if (!$raw) {$marcstuff = $marcstuff . $marc;}
else {print $marc;}
# release resources associated with this statement handle
$sth->finish;
# release the database connection associated with this database handle
$dbh->disconnect;
# if want human-readable output
if (!$raw)
{
# marc records are delimited by this character
# this creates the array of marc records from the
#   previously built string of marc data
@marcrec = split /\x1d/, $marcstuff;
# loop through array of marc records
$idx = 0;
while ($idx < @marcrec)
{
# output the leader
$leader = substr($marcrec[$idx], 0, 24);
if ($idx != 0) {printf("\n");}
printf("LDR:%s\n", $leader);
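# grab the record length and the data base-address,
#   "move" to the start of the directory
$reclen = substr($marcrec[$idx], 0, 5);
# base address less one: the data slice below then starts on the
#   preceding field terminator, which the formatting code strips
$baseaddr = substr($marcrec[$idx], 12, 5) - 1;
$strptr = 24;
# loop through the directory
while ($strptr < $baseaddr-1)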
{
# get the tag id, the tag's length, and the tag's offset
$tagid = substr($marcrec[$idx], $strptr, 3);
$taglen = substr($marcrec[$idx], $strptr+3, 4);
$offset = substr($marcrec[$idx], $strptr+7, 5);
# read the tag's data from the computed start of the tag's data,
#   for tag length characters
$tagdata = substr($marcrec[$idx], $baseaddr+$offset, $taglen);
# do the pretty printing formatting for human readability
#   (subfield codes may be digits as well as lowercase letters)
$tagdata =~ s/\x1f[a-z0-9]/ \|$& /g;   # use " |x " for subfield ind,
$tagdata =~ s/\x1f//g;                 # remove original subfield ind,
$tagdata =~ s/\x1e//g;                 # remove field ind,
if (substr($tagdata, 2, 2) eq " |")    # & remove the "1st" space in the line
{$tagdata = substr($tagdata, 0, 2) . substr($tagdata, 3);}
# output the tag parameters and its data
printf("%3s:%4s:%5s:%s\n", $tagid, $taglen, $offset, $tagdata);
# move to the next tag in the directory
$strptr+= 12;
}
# move to the next record in the array of marc records
$idx++;
}
# provide count of marc records handled
if ($idx != 1) {$plural = "s read";}
else {$plural = " read";}
printf ("\n<<%d Marc record%s>>\n\n", $idx, $plural);
}
# show this to illustrate program usage
sub usage()
{
printf ("\nUsage: perl example1.pl [auth | bib | mfhd] startID endID [raw]\n");
printf ("       Pick one of the 3 data types.\n");
printf ("       Specify record ID numbers; specified range is inclusive.\n");
printf ("       Parameters must be in the above order.\n");
printf ("       All parameters are required except for the last one.\n");
printf ("       Program extracts marc data from blobs in Oracle.\n");
printf ("       Output is human-formatted unless *raw* is specified\n");
printf ("       and it goes to STDOUT.\n");
exit(0);
}
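
The subfield formatting done by the Perl substitutions can also be checked in isolation. The following Python is a minimal sketch of the same steps, not a replacement for the script. The function name format_field and the sample field are hypothetical, and accepting digits as subfield codes is an assumption of this sketch.

```python
import re

SUBFIELD = "\x1f"   # MARC subfield delimiter
FIELD_END = "\x1e"  # MARC field terminator


def format_field(tagdata):
    """Return field data with each subfield delimiter shown as ' |x '."""
    # Replace the delimiter plus its one-character code with " |x ".
    text = re.sub(SUBFIELD + r"([a-z0-9])", r" |\1 ", tagdata)
    # Drop any field terminator that was included in the slice.
    text = text.replace(FIELD_END, "")
    # Remove the space between the two indicators and the first subfield.
    if text[2:4] == " |":
        text = text[:2] + text[3:]
    return text


# Hypothetical field: indicators "10", then subfields a and c.
sample = "10\x1faExample title /\x1fcby an example author.\x1e"
print(format_field(sample))
```

For the sample field this prints the two indicators followed directly by "|a", with later subfields set off by a space on each side.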
USMARC Concise Holdings: Leader and Directory
LEADER
A fixed field that comprises the first 24 character
positions (00-23) of each record and provides information
for the processing of the record.
00-04 - Logical record length
The computer-generated, five-character numeric string
that specifies the length of the entire record. The
number is right justified and each unused position
contains a zero.
05 - Record status
A one-character code that indicates the relation of the
record to a file.
c - Corrected or revised
d - Deleted
n - New
06 - Type of record
A one-character code that indicates the characteristics
of and defines the components of the record. When
holdings information is embedded in a USMARC
bibliographic record, this information may be contained
in field 841 $a (Holdings Coded Data Values, Type of
record).
v - Multipart item holdings
x - Single-part item holdings
y - Serial item holdings
07-08 - Undefined character positions
Each contains a blank (#)
09 - Character coding scheme
# - MARC-8
a - UCS/Unicode
10 - Indicator count
The computer-generated number 2 that indicates the number
of character positions used for indicators in a variable
data field.
11 - Subfield code count
The computer-generated number 2 that indicates the number
of character positions used for each subfield code in a
variable data field.
12-16 - Base address of data
The computer-generated, five-character numeric string
that specifies the first character position of the first
variable control field in a record. The number is right
justified and each unused position contains a zero.
17 - Encoding level
A one-character code that indicates the ANSI Z39.44 or
ANSI/NISO Z39.57 level-of-specificity requirements met by
the holdings statement. When holdings information is
embedded in a USMARC bibliographic record, this
information may be contained in field 841 $e (Holdings
Coded Data Values, Encoding level).
1 - Holdings level 1
2 - Holdings level 2
3 - Holdings level 3
4 - Holdings level 4
5 - Holdings level 4 with piece designation
Physical piece designation is contained in
subfield p (Piece designation) of field 852
(Location) or one of the 863-865 Enumeration and
Chronology fields, or in subfield $a (Textual
holdings) in one of the 866-868 Textual Holdings
fields.
m - Mixed levels
Holdings are recorded at more than one level. The
value in the first indicator position (Field
encoding level) of the applicable 863-868 field
indicates the level for each holdings data field.
u - Unknown
z - Other level
18 - Item information in record
One-character code that indicates whether item
information is in the record, contained in one or more
occurrences of fields 876-878 (Item information fields).
i - Item information
n - No item information
19 - Undefined character position
Contains a blank (#)
20-23 - Entry map
Four computer-generated, single-digit numeric characters
that indicate the structure of each entry in the
Directory.
20 - Length of the length-of-field portion
Contains a 4
21 - Length of the starting-character-position portion
Contains a 5
22 - Length of the implementation-defined portion
Contains a 0
23 - Undefined Entry map character position
Contains a 0
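
The leader positions listed above map directly onto string slices. The following Python is a minimal sketch of that mapping, assuming the leader is held as a 24-character string. The function name parse_leader, the dictionary keys, and the sample leader are hypothetical.

```python
def parse_leader(leader):
    """Split a 24-character MARC holdings leader into named parts."""
    if len(leader) != 24:
        raise ValueError("leader must be exactly 24 characters")
    return {
        "record_length": int(leader[0:5]),        # 00-04
        "record_status": leader[5],               # 05
        "record_type": leader[6],                 # 06
        "coding_scheme": leader[9],               # 09
        "indicator_count": int(leader[10]),       # 10
        "subfield_code_count": int(leader[11]),   # 11
        "base_address": int(leader[12:17]),       # 12-16
        "encoding_level": leader[17],             # 17
        "item_information": leader[18],           # 18
        "entry_map": leader[20:24],               # 20-23
    }


# Hypothetical leader, built piece by piece so the positions are visible.
sample_leader = (
    "00203"   # 00-04 logical record length
    + "n"     # 05 record status: new
    + "x"     # 06 type of record: single-part item holdings
    + "  "    # 07-08 undefined
    + " "     # 09 character coding scheme: blank (MARC-8)
    + "2"     # 10 indicator count
    + "2"     # 11 subfield code count
    + "00085" # 12-16 base address of data
    + "4"     # 17 encoding level
    + "n"     # 18 no item information
    + " "     # 19 undefined
    + "4500"  # 20-23 entry map
)

parsed = parse_leader(sample_leader)
print(parsed["record_length"], parsed["base_address"], parsed["entry_map"])
```

Python slices are zero-based, like the leader positions, so Leader/12-16 is leader[12:17]. PL/SQL substr is one-based, which is why the PL/SQL listing reads the same field starting at position 13.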
DIRECTORY
A computer-generated index to the location of the
variable control and data fields within a record. The
Directory immediately follows the Leader at character
position 24 and consists of a series of fixed-length (12
character positions) entries that give the tag, length,
and starting character position of each variable field.
00-02 - Tag
Three numeric or alphabetic characters (uppercase or
lowercase, but not both) that identify an associated
field.
03-06 - Field length
Four numeric characters that indicate the length of the
field, including indicators, subfield codes, data, and
the field terminator. The number is right justified and
each unused position contains a zero.
07-11 - Starting character position
Five numeric characters that indicate the starting
character position of the field relative to the Base
address of data (Leader/12-16) of the record. The number
is right justified and each unused position contains a
zero.
from: http://www.loc.gov/marc/holdings/echdldrd.html
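
The directory layout described above is enough to locate every field in a record. The following Python is a minimal sketch of a directory walk, assuming the record is a string of single-byte characters so that character counts equal the byte counts stored in the record. The names read_fields and build_record and the sample fields are hypothetical; build_record exists only to produce test data.

```python
FT = "\x1e"  # field terminator
RT = "\x1d"  # record terminator


def read_fields(record):
    """Return (tag, data) pairs for one raw MARC record string."""
    base = int(record[12:17])   # Leader/12-16, zero-based offset
    fields = []
    pos = 24                    # the directory follows the leader
    # The directory ends with a field terminator at position base - 1.
    while pos < base - 1:
        tag = record[pos:pos + 3]
        length = int(record[pos + 3:pos + 7])
        start = int(record[pos + 7:pos + 12])
        # The length includes the trailing field terminator; drop it.
        data = record[base + start:base + start + length - 1]
        fields.append((tag, data))
        pos += 12               # each directory entry is 12 characters
    return fields


def build_record(fields):
    """Build a minimal raw record from (tag, data) pairs (sample data only)."""
    directory = ""
    body = ""
    for tag, data in fields:
        length = len(data) + 1  # +1 for the field terminator
        directory += f"{tag}{length:04d}{len(body):05d}"
        body += data + FT
    directory += FT
    base = 24 + len(directory)
    total = base + len(body) + 1  # +1 for the record terminator
    leader = f"{total:05d}nx   22{base:05d}1n 4500"
    return leader + directory + body + RT


sample_fields = [("001", "12345"), ("852", "  \x1fbMAIN")]
record = build_record(sample_fields)
for tag, data in read_fields(record):
    print(tag, repr(data))
```

Reading back a record that was just built gives a quick round-trip check that the length and starting-position arithmetic agree.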
Resources
Books
Oracle PL/SQL Programming; Steven Feuerstein with Bill Pribyl; O’Reilly
Oracle SQL*PLUS The Definitive Guide; Jonathan Gennick; O’Reilly
Programming Perl; Larry Wall, Tom Christiansen, Randal L. Schwartz; O’Reilly
Perl Cookbook; Tom Christiansen & Nathan Torkington; O’Reilly
Perl in a Nutshell; Ellen Siever, Stephen Spainhour, Nathan Patwardhan; O’Reilly
Oracle Documentation
SQL*PLUS User’s Guide and Reference
Oracle SQL Reference
PL/SQL User’s Guide and Reference
Oracle Supplied PL/SQL Packages Reference
Web
www.loc.gov/marc/holdings/echdldrd.html
MARC leader details
lcweb.loc.gov/marc/umb/um01to06.html
MARC record format details
lcweb.loc.gov/marc/umb/um07to10.html
MARC record format details
lcweb.loc.gov/marc/umb/um11to12.html
MARC record format details
http://www.orafaq.org/faq2.htm
Answers to all sorts of questions
www.revealnet.com
Lots of Oracle stuff including PL/SQL
www.cpan.org
Perl modules such as DBD/DBI, and module documentation
www.cpan.org/authors/id/TIMB/DBD-Oracle-1.12.tar.gz
www.cpan.org/authors/id/TIMB/DBI-1.20.tar.gz
www.perl.org
Perl related links
www.perl.com
O’Reilly Perl site: books, documentation, etc.
www.gnu.org/manual/manual/html
GNU Organization documentation for Unix, other links