Another Way to Attack the BLOB: Server-side Access via PL/SQL and Perl
Why Server-side?
• Your choice of tools to handle queries and generate reports
• Complete programmatic control
• Easier to write complex reports
• No (well, fewer) limitations
• Easier to restrict the masses’ access to the database
Syllabus
• Brief MARC record review
• The BLOB Plan of Attack
• Data Retrieval via PL/SQL
• Required tools for Perl: getting DBD & DBI
• Data Retrieval via Perl
MARC
• MARC is an acronym for MAchine Readable Cataloging.
• It’s a standard format for storing an item’s data.
• It’s machine readable, but not so easy for us humans to read.
• With a bit of practice, a raw MARC record can be parsed by hand.
• However, doing so is about as exciting and satisfying as trying to thread a needle one-handed.
A MARC record’s three pieces:
• Leader
• Directory
• Data
Partial view of a MARC record
this is the leader:
01551nam  22003738a 4500
this is the directory:
001001300000003000600013005001700019008004100036
010001700077035001800094040001800112043001200130049003000142050002500172
074000900197082001600206086001700222099001700239100001800256245011000274
260011200384300003800496490005400534500016500588500007500753500003400828
500003900862504005200901650004600953650005000999650004901049710002901098
830005001127
this is the data:
ocm10726696 OCoLC 19961223115432.0 840406s1996 dcuab b f000 0 eng
a 84600065 a(GPO)97054409 dGPO dDLC dMvI an-us-az
awdoc,sudc i31141009995734 00 aQE611.5.U6 bF84 1996 a06
/\/\/\/\/\/\/\/\/\/\/\/\/\/\ skipping part of record here /\/\/\/\/\/\/\/\/\/\/\/\
tural zArizona zMohave County. 2 aGeological Survey (U.S.) 0 aGeologic
al Survey professional paper ; v1266.
Dissection of MARC record leader
(pertinent details)
01551nam  22003738a 4500
• record length: the first 5 characters (01551)
• data starts at this offset, the base address: characters 13-17 (00373)
Dissection of MARC record directory
how to parse it
01551nam  22003738a 4500001001300000003000600013005001700019008004100036
010001700077035001800094040001800112043001200130049003000142050002500172
header: 01551nam  22003738a 4500
tag  len   offset
001  0013  00000
003  0006  00013
005  0017  00019
008  0041  00036
010  0017  00077
035  0018  00094
040  0018  00112
etc.
Each 12-character “triplet” is associated with one field.
Where in the record does a field’s data start?
header: 01551nam  22003738a 4500
tag  len   offset
001  0013  00000
003  0006  00013
005  0017  00019
008  0041  00036
010  0017  00077
035  0018  00094
040  0018  00112
etc.
Where a field’s data starts is determined by adding its offset to the base address (00373 in this header).
Data for the first field, tag 001, begins at position 373, tag 003 begins at 386, tag 005 begins at 392, etc.
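The arithmetic above can be checked with a few lines of code. This is only an illustrative sketch in Python (the talk's own code is PL/SQL and Perl); the function name field_starts is hypothetical, and the directory string holds just the first three entries of the sample record.

```python
# First three directory entries of the sample record: tag(3) + length(4) + offset(5).
DIRECTORY = "001001300000003000600013005001700019"
BASE_ADDRESS = 373  # from the sample leader, columns 13-17 ("00373")

def field_starts(directory, base_address):
    """Return (tag, start position, length) for each 12-character entry."""
    starts = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3]
        length = int(entry[3:7])
        offset = int(entry[7:12])
        # A field's data starts at base address + offset.
        starts.append((tag, base_address + offset, length))
    return starts

for tag, start, length in field_starts(DIRECTORY, BASE_ADDRESS):
    print(tag, start, length)
```

The start positions come out as 373, 386, and 392, matching the figures on the slide.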
Partial view of a raw MARC record, data section
In the raw record, the MARC format’s binary separation characters show up as “box characters”. Below they are written as <TAG>, <SUB>, and <EOR>.
01551nam  22003738a 4500001001300000003000600013005001700019008004100036
010001700077035001800094040001800112043001200130049003000142050002500172
074000900197082001600206086001700222099001700239100001800256245011000274
260011200384300003800496490005400534500016500588500007500753500003400828
500003900862504005200901650004600953650005000999650004901049710002901098
830005001127<TAG>ocm10726696 <TAG>OCoLC<TAG>19961223115432.0
<TAG>840406s1996 dcuab b f000 0 eng
<TAG> <SUB>a 84600065
<TAG> <SUB>a(GPO)97054409<TAG> <SUB>dGPO<SUB>dDLC<SUB>dMvI
<TAG> <SUB>an-us-az<TAG> <SUB>awdoc,sudc<SUB>i31141009995734
<TAG>00<SUB>aQE611.5.U6<SUB>bF84 1996<TAG> <SUB>a06
/\/\/\/\/\/\/\/\/\/\/\/\/\/\ skipping part of record here /\/\/\/\/\/\/\/\/\/\/\/\
tural<SUB>zArizona<SUB>zYavapai County.
<TAG>0<SUB>aGeology, Structural<SUB>zArizona<SUB>zMohave County.
<TAG>2 <SUB>aGeological Survey (U.S.)
<TAG> 0<SUB>aGeological Survey professional paper ;<SUB>v1266.<TAG><EOR>
The MARC format uses the following characters:
<TAG>  hex 1e  tag delimiter
<SUB>  hex 1f  subfield delimiter
<EOR>  hex 1d  end of record indicator
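As a small illustration of how these three characters are used, the sketch below splits one field's data into its leading indicator characters and its subfields. It is written in Python purely for illustration; the constant names and the function split_subfields are hypothetical and do not come from the handout.

```python
FIELD_END = "\x1e"   # <TAG>, hex 1e
SUBFIELD = "\x1f"    # <SUB>, hex 1f
RECORD_END = "\x1d"  # <EOR>, hex 1d

def split_subfields(field_data):
    """Split one field into (indicators, [(subfield code, value), ...])."""
    # Drop any trailing field/record terminators before splitting.
    body = field_data.rstrip(FIELD_END + RECORD_END)
    # Everything before the first <SUB> is the indicator area.
    indicators, *parts = body.split(SUBFIELD)
    # The first character after each <SUB> is the subfield code.
    return indicators, [(part[0], part[1:]) for part in parts if part]

sample = "2 " + SUBFIELD + "aGeological Survey (U.S.)" + FIELD_END
print(split_subfields(sample))
```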
Programmer’s MARC format review
• Get the record length from the 1st 5 columns.
• Get the data base-address from columns 13-17.
• Parse through the directory for the desired field by looking at the 1st 3 columns of each tag’s 12-character “triplet”. Get the tag’s length (next 4 columns) and offset (last 5 columns of the “triplet”).
• Read the tag’s data by:
    Adding the tag’s offset to the record’s base address.
    Starting at that position, read the tag’s data for tag length columns.
• Make sure the position you’re reading from is not beyond the end of the record.
Beware the common “off by 1” error. Depending on the language you’re using, you could be off by 1 in either direction regarding your position within the record.
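The steps in this review can be sketched as follows. This is a minimal Python illustration, not the handout's code: get_field and build_record are hypothetical names, build_record exists only to make a tiny test record, and the sketch assumes one character per byte (MARC lengths are byte counts). Note the off-by-1 point: the slide's columns are counted from 1, while Python slices count from 0, so columns 13-17 become record[12:17].

```python
FIELD_END = "\x1e"
RECORD_END = "\x1d"

def build_record(fields):
    """Test helper (hypothetical): build a tiny record from (tag, data) pairs."""
    directory, data = "", ""
    for tag, value in fields:
        value += FIELD_END
        directory += f"{tag}{len(value):04d}{len(data):05d}"
        data += value
    directory += FIELD_END
    base = 24 + len(directory)          # leader is 24 characters
    length = base + len(data) + 1       # +1 for the end-of-record character
    leader = f"{length:05d}nam  22{base:05d}8a 4500"
    return leader + directory + data + RECORD_END

def get_field(record, wanted_tag):
    """Return the raw data of the first field with wanted_tag, or None."""
    record_length = int(record[0:5])    # columns 1-5
    base_address = int(record[12:17])   # columns 13-17
    pos = 24                            # directory starts right after the leader
    while pos + 12 <= base_address and record[pos] != FIELD_END:
        tag = record[pos:pos + 3]
        length = int(record[pos + 3:pos + 7])
        offset = int(record[pos + 7:pos + 12])
        if tag == wanted_tag:
            start = base_address + offset
            if start + length > record_length:
                raise ValueError("field extends beyond the end of the record")
            return record[start:start + length]
        pos += 12                       # next 12-character "triplet"
    return None

record = build_record([("001", "ocm10726696 "), ("003", "OCoLC")])
print(repr(get_field(record, "003")))
```

The two test fields produce the same directory entries as the first two in the sample record (001001300000 and 003000600013).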
The BLOB Plan of Attack
• Voyager’s BLOB data is stored the same way for the Auth, Bib, and Mfhd data tables.
table_data (where “table” is auth, bib, or mfhd)
table_id    record_segment    seqnum
• A MARC record is typically stored entirely in one row in the table. Records longer than the record_segment size have to be stored in more than one row.
• Each table_id is unique to an item’s record. However, if more than one row makes up a record, those rows share the same table_id. In that case, we’ll have seqnum = 1, 2, 3, etc., for that record.
The BLOB Plan of Attack
auth_id    record_segment    seqnum
635406     MARC data         1
An example of a record contained completely in one row.
This record is ready to be processed after extraction from the record_segment.
The BLOB Plan of Attack
auth_id    record_segment    seqnum
635406     MARC data         1
635406     MARC data         2
635406     MARC data         3
This longer record is spread across 3 rows.
Assemble the MARC record by concatenating MARC data in seqnum order:
MARC-record =
    record_segment<-seqnum1 +
    record_segment<-seqnum2 +
    record_segment<-seqnum3
This record is then ready to be processed.
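A minimal sketch of this concatenation, in Python for illustration only. The function name assemble_records and the placeholder segment strings are hypothetical; rows are assumed to arrive as (id, record_segment, seqnum) tuples in any order.

```python
from collections import defaultdict

def assemble_records(rows):
    """Group rows by id and join each record's segments in seqnum order."""
    segments = defaultdict(list)
    for record_id, segment, seqnum in rows:
        segments[record_id].append((seqnum, segment))
    return {
        record_id: "".join(seg for _, seg in sorted(parts, key=lambda p: p[0]))
        for record_id, parts in segments.items()
    }

rows = [
    (635406, "CCC", 3),   # placeholder strings stand in for MARC data
    (635406, "AAA", 1),
    (635406, "BBB", 2),
    (635407, "XYZ", 1),
]
print(assemble_records(rows))
```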
PL/SQL Example
The example code retrieves a few MARC
records, and displays them on the
screen in human-readable format, along
with some diagnostics.
(The code examined in the following
slides starts on Page 2 of the handout.)
PL/SQL Example
Use a cursor to retrieve data
Also declare necessary variables in this section
PL/SQL Example
Open the cursor and start
looping through the rows
PL/SQL Example
Get a row from the cursor into the program
variables
PL/SQL Example
Assemble the MARC record. The typical record fits into one row, thus seqnum = 1 and we skip the loop.
PL/SQL Example
For a longer, multi-segment record (from an earlier example), we first have seqnum=3 and put it into marc. Then we have seqnum=2 and PREPEND that to marc. Last, we exit the loop since seqnum is now 1, and the last statement here takes care of that segment.
PL/SQL Example
Why go “backwards” in assembling a MARC record?
If we base the segment-to-MARC-record assembly on detecting when the auth_id changes in our loop, then once it changes we’ve gone too far: we can’t go back to get the last segment and finish assembling what is now the previous record.
It’s simpler to base the looping on seqnum in reverse order, because there will always be a seqnum of 1.
If there are multiple segments, we’ll always end on a seqnum of 1 while still on the same auth_id, and can go on to process the record.
This reasoning is not for PL/SQL only, although that
is “where” the idea came from.
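To show that the idea carries over to other languages, here is a hedged sketch in Python (not the handout's PL/SQL). It assumes the rows arrive ordered by id ascending and seqnum descending; the function name records_from_rows and the placeholder segment strings are hypothetical.

```python
def records_from_rows(rows):
    """Yield (record_id, marc) from rows ordered by id asc, seqnum desc."""
    marc = ""
    for record_id, segment, seqnum in rows:
        marc = segment + marc      # PREPEND: later segments arrive first
        if seqnum == 1:            # seqnum 1 always completes the record
            yield record_id, marc
            marc = ""

rows = [
    (635406, "CCC", 3),
    (635406, "BBB", 2),
    (635406, "AAA", 1),
    (635407, "XYZ", 1),
]
print(list(records_from_rows(rows)))
```

Because every record ends on seqnum 1, no look-ahead is needed and the last record in the result set needs no special handling after the loop.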
PL/SQL Example
Now that we have a MARC record, let’s get the
record length and data base-address. We set our
pointer to the start of the directory and start
looping through the directory.
PL/SQL Example
As we loop through the directory, we read the tag
id, its length, and its offset in the data part. The
actual tag address where we get the data is the data
base-address plus the offset.
PL/SQL Example
In the last line here, the subfield delimiters (hex 1f = dec 31) are replaced by the vertical bar character “|” for better readability.
PL/SQL Example
Along with the subfield delimiter substitution, we add some space formatting to further increase readability.
Thus, instead of
0aPetroleumxDrilling fluids
we get
0|a Petroleum |x Drilling fluids
for tag data.
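One way to get this effect, sketched in Python for illustration. The exact spacing rule is an assumption inferred from the before/after example on the slide (a space after each subfield code, and a space before every bar except the first); format_field is a hypothetical name and this is not the handout's code.

```python
import re

SUBFIELD = "\x1f"  # hex 1f = dec 31

def format_field(data):
    """Replace each subfield delimiter with '|' and add spacing."""
    # "<SUB>a" becomes " |a " everywhere...
    text = re.sub(SUBFIELD + "(.)", r" |\1 ", data)
    # ...then the space in front of the first bar is dropped.
    return text.replace(" |", "|", 1)

raw = "0" + SUBFIELD + "aPetroleum" + SUBFIELD + "xDrilling fluids"
print(format_field(raw))
```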
PL/SQL Example
Now we can output the tag’s data. Output is broken
into 80 character chunks to get around the 255
character limit of dbms_output and for better
readability.
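The chunking itself is simple in any language. A minimal Python sketch follows; the name chunks is hypothetical, and the width of 80 is the figure from the slide.

```python
def chunks(text, width=80):
    """Split text into consecutive pieces of at most width characters."""
    return [text[i:i + width] for i in range(0, len(text), width)]

for line in chunks("x" * 200):
    print(len(line), line[:10] + "...")
```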
PL/SQL Example
We’re done with this tag, so we move on to the next
tag in the directory. At the end, close loops and
clean up.
End looping for directory traversal
End looping for cursor
Don’t forget that this ending character is
required for your PL/SQL code to run!
PL/SQL Example
Demo…
example.pls
Additional tools required for Perl
to talk to Oracle:
• DBI, the generic DataBase Interface
software.
• DBD, the specific DataBase Driver, for
Oracle in our case.
Getting and installing DBI and DBD
Point your browser to:
http://www.cpan.org/authors/id/TIMB/
Complete the above URL with
“DBD-Oracle-1.12.tar.gz” to get DBD software
“DBI-1.20.tar.gz” to get DBI software
Getting and installing DBI and DBD
• gunzip each file.
• un-tar each file.
• READ the instructions!
• Installation takes 4 or 5 steps and requires you to be root.
• If you don’t have root access, or if you’re uncomfortable doing any of this, seek out your SysAdmin for assistance.
Perl Example
The following real-world example lets you
retrieve an arbitrary range of MARC records
from your choice of Auth, Bib, or Mfhd.
Output goes to <stdout>, and can be raw MARC
data, or formatted for human readability.
(The code examined in the following
slides starts on Page 5 of the handout.)
Perl Example
Must pull in DBI stuff
Handle program arguments and show how to use the program if necessary
Perl Example
Here we create the database connection and assign its context to a database handle. We need to specify what type of database (Oracle), the name of the machine to which we’re connecting, the SID, and the username and password.
Perl Example
We saw this query in the PL/SQL example. Here
we build the query statement, inserting the
program arguments where needed. This allows
this query to work with any MARC table type and
an arbitrary table_id range.
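A hedged sketch of building such a statement, in Python for illustration. The handout's actual query text is not reproduced here: the column and table names follow the table_data layout described earlier, and build_query, the TABLES mapping, and the use of "between" are assumptions. Because the arguments are pasted into the SQL text, the sketch checks the table type against a fixed list and forces the ids to integers.

```python
TABLES = {"auth": "auth_data", "bib": "bib_data", "mfhd": "mfhd_data"}

def build_query(table_type, first_id, last_id):
    """Build the select statement for one table type and an id range."""
    if table_type not in TABLES:
        raise ValueError("table type must be auth, bib, or mfhd")
    id_column = table_type + "_id"
    return (
        f"select {id_column}, record_segment, seqnum "
        f"from {TABLES[table_type]} "
        f"where {id_column} between {int(first_id)} and {int(last_id)} "
        f"order by {id_column} asc, seqnum desc"
    )

print(build_query("bib", 1, 100))
```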
Perl Example
Create the query context and assign it to a
statement handle.
Execute the statement and receive a return
code.
Perl Example
This is how we get rows from the result set of the
query, via the statement handle. The three columns in
the row fall into the list of three variables.
Perl Example
Raw output:
On record transition, output
the MARC record we just
built, reset the ID
variable, and store the MARC
data for the record we just
started reading.
If on the same record, keep
on storing MARC data.
Output last record here
Perl Example
Formatted (not raw) output:
On record transition, store the accumulated MARC
record and start building a new one, else just prepend
to the present marc record.
(We’re effectively building
a MARC file in memory, a
virtual file, in the
$marcstuff variable.)
Store last record here
Perl Example
Release the resources associated with the
statement handle and the database handle.
Perl Example
Executing this part for formatted, readable output
MARC data contains no CR-LFs; instead
it uses the hex 1d character to
delimit the end of a MARC record.
Create the array of MARC records here.
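The same split, sketched in Python for illustration (the talk's code does this in Perl on the $marcstuff variable). The name split_records is hypothetical; each record keeps its end-of-record character so that its length still matches the leader.

```python
RECORD_END = "\x1d"  # hex 1d, end of record

def split_records(marcstuff):
    """Split a string of back-to-back MARC records into a list of records."""
    return [rec + RECORD_END for rec in marcstuff.split(RECORD_END) if rec]

marcstuff = "AAA" + RECORD_END + "BBB" + RECORD_END  # placeholder records
print(split_records(marcstuff))
```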
Perl Example
Executing this part for formatted, readable output
Start looping through the
array of MARC records.
Perl Example
Executing this part for formatted, readable output
We get and output the leader, and then get the
record length and the data base-address. Then we
position ourselves at the start of the directory.
Perl Example
Executing this part for formatted, readable output
Loop through the directory
Perl Example
Executing this part for formatted, readable output
Get the tag id, its length, and its offset. Then
read the tag’s data. The actual tag address where
we get the data is the data base-address plus the
offset.
Perl Example
Executing this part for formatted, readable output
Now do some formatting for readability. We substitute
the vertical bar character “|” for the subfield
delimiter, and remove the other delimiters.
Perl Example
Executing this part for formatted, readable output
Output the tag’s parameters, and the data. Then go
to the next tag in the directory.
Perl Example
Executing this part for formatted, readable output
End of program stuff. Close loops and show
count of records output.
Perl Example
Demo…
example.pl
Perl
• PROBLEM: if you’re reading the entire table, you can still run into problems with too much data at one time.
• SOLUTION: process your data in small chunks.
• Dividing the table into chunks of about 50,000 rows has worked very well for us.
• The following method has proven useful:
Perl
Large Table Solution in a Nutshell
• This example uses the BIB_DATA table.
In your setup section:
    set a db_increment variable to 50,000
    set max_bib_id to the highest bib_id from the table
    set beginning_bib_id to 0,
        ending_bib_id to db_increment
Perl
Large Table Solution in a Nutshell
This outer loop goes through the entire table:
while beginning_bib_id <= max_bib_id
    call chunkthrudb
    set beginning_bib_id to (ending_bib_id + 1)
    increment ending_bib_id by db_increment
end while
Perl
Large Table Solution in a Nutshell
This inner loop goes through db_increment-sized chunks:
sub chunkthrudb
    select bib_id,
           record_segment,
           seqnum
    from bib_data
    where bib_id >= beginning_bib_id and
          bib_id <= ending_bib_id
    order by bib_id asc, seqnum desc
    build the MARC record and call processrec
end sub
Perl
Large Table Solution in a Nutshell
sub processrec
process the MARC record as needed
end sub
Perl
Large Table Solution in a Nutshell
Page 8 of the handout has a diagram
illustrating this process.
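The whole process can be tried end to end with a small runnable sketch. It is written in Python and uses an in-memory SQLite table purely as a stand-in for Voyager's Oracle BIB_DATA table; process_in_chunks and the placeholder segment strings are hypothetical. Each chunk covers an inclusive bib_id range, so no id falls between two chunks.

```python
import sqlite3

def process_in_chunks(conn, db_increment, processrec):
    """Walk bib_data in bib_id ranges of db_increment ids at a time."""
    (max_bib_id,) = conn.execute("select max(bib_id) from bib_data").fetchone()
    if max_bib_id is None:          # empty table
        return
    beginning = 0
    while beginning <= max_bib_id:
        ending = beginning + db_increment - 1   # inclusive upper bound
        rows = conn.execute(
            "select bib_id, record_segment, seqnum from bib_data "
            "where bib_id >= ? and bib_id <= ? "
            "order by bib_id asc, seqnum desc",
            (beginning, ending),
        )
        marc = ""
        for bib_id, segment, seqnum in rows:
            marc = segment + marc   # prepend; seqnum 1 arrives last
            if seqnum == 1:
                processrec(bib_id, marc)
                marc = ""
        beginning = ending + 1

# Stand-in table with placeholder data (SQLite here, Oracle in real life).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table bib_data (bib_id integer, record_segment text, seqnum integer)"
)
conn.executemany(
    "insert into bib_data values (?, ?, ?)",
    [(1, "AAA", 1), (2, "BBB", 1), (2, "CCC", 2), (5, "DDD", 1), (6, "EEE", 1)],
)
seen = []
process_in_chunks(conn, 2, lambda bib_id, marc: seen.append((bib_id, marc)))
print(seen)
```

With db_increment set to 2, the demo walks the ranges 0-1, 2-3, 4-5, and 6-7, which exercises the chunk boundaries and the case where the highest bib_id is the first id of the final chunk.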
Thanks for listening.
Questions?
Email: zimmer@wmich.edu
Phone: 616.387.3885