Repository Services

1. Developing a broad digital storage architecture that supports the long-term storage and
availability of digital masters, complete restoration of repository objects within one week of a
disaster, and online access to presentation files within 24 hours of a disaster. The architecture
should encompass the Systems and SCC facilities and should indicate hardware and software
needs, both for replacement and to supplement existing infrastructure. The broad architecture
must be presented by the Working Group Chair, Dave Hoover, at the January 6 CISC meeting.
See Diagram #1 for existing Repository Infrastructure servers and storage.
24-hour recovery of presentation files will be handled by keeping live copies of all components
that support online delivery of objects from the repository. These copies will be made to
alternate servers initially housed in Systems and in the SCC (although they could be housed
anywhere). These will be read-only versions of the software, with no provision made for the
addition of any new files to the systems. These systems will have an installed copy of all the
applications (Fedora, MySQL, Apache, etc.) as well as the most recent Repository code base.
Based on the current Repository disk storage, approximately 10% of all datastreams are
presentation datastreams.
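One way to keep only those presentation copies current on the alternate servers would be an
rsync pass that skips the archival masters. The command below is an illustrative sketch only: the
datastream path, the alternate host name dralt, and the reliance on the ARCH datastream prefix
(seen in the datastream counts later in this document) are assumptions, not the production setup.

# Illustrative only: copy new or changed presentation datastreams to an alternate
# server, skipping archival masters (assumes masters carry the ARCH prefix).
rsync -a --exclude '*ARCH*' /repository/fedora/datastreams/ dralt:/repository/fedora/datastreams/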
If there is a need for ongoing WMS work and/or ingestion into Fedora, adequate disk space for
the workarea filesystem as well as Fedora datastream space will need to be taken into account.
The WMS software will need to be fully installed and configured, and indexing routines and
other cron scripts will need to be put into place.
Restoration of the existing archival masters is not seen as an immediate need, as they are not
delivered to end users once ingested. That restoration will take place from the nearline and/or
offline tape backup copies that were made.
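For reference, a restore of archival masters under SAM-FS would typically restore filesystem
metadata from a samfsdump image and then let files stage back from tape on access, or force
them back with the stage command. The outline below is a hedged sketch only; the dump file
location is hypothetical and the exact procedure would follow the SAM-FS documentation.

# Hypothetical outline of restoring archival files from the tape copies.
cd /repository
samfsrestore -f /dr/dumps/repository.samfsdump    # restore file metadata (dump path is hypothetical)
sfind /repository -offline -exec stage {} \;      # stage offline files back from nearline/offline tape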
2. Identify digital storage needs, in terabytes, for the next 3-5 years, in a report to CISC at the
January 6 meeting. This storage should encompass ongoing collection building, including ETDs,
faculty deposits, SC/UA collection building, and NJDH. In addition, this storage should
encompass the 400 hour Video Mosaic collection and should support a forthcoming data sets
collection, perhaps as part of an NSF DataNets grant. Calculations should include large data sets,
such as the Coastal and Marine Sciences ORBIS data, that the library might assume responsibility
for, as well as future grants by science and social science faculty that the libraries might support.
Approximately 85.12 TB is needed for existing projects in the pipeline that should be able to be
completed in the next 2 years. The majority of this space (80 TB) is for the Video Mosaic project.
Over the next 3-5 years it is anticipated that we will need an additional 247.57 TB of storage to
support Data sets as well as other video and audio projects.
See attached document for specific details.
3. Determine backup and disaster recovery strategies, including tape backup, removable disk
backup, etc., to ensure complete recovery in the event of both hardware failure and complete
physical disaster at the Main Distribution Facility (Systems) and Intermediate Distribution
Facility (SCC).
Backups of the repository filesystem (which contains the ingested datastreams and objects) and
the workarea filesystem (which contains in-process files) will continue to be made to tape using
SAM-FS software. Under the current backup schedule, files in the repository filesystem are
written to one set of tapes in the library once they are 20 minutes old, and then to a second set
of tapes once they are 24 hours old. Files in the workarea filesystem are backed up when they
are 24 hours old. The second-copy repository filesystem tapes are removed when they are filled
and sent offsite for permanent storage. During the week, the current unfilled tapes for the
repository and workarea filesystems are copied to another disk partition that is backed up
and sent offsite weekly.
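For context, this kind of age-based copy schedule is what the SAM-FS archiver.cmd file expresses.
The fragment below is a rough sketch under assumed names; the filesystem names (samfs1, samfs2)
and archive set names are placeholders, not the production configuration, and the tape pool
assignments are omitted.

# Rough sketch of an archiver.cmd matching the schedule described above (all names are placeholders).
cat > /etc/opt/SUNWsamfs/archiver.cmd <<'EOF'
fs = samfs1          # filesystem holding /repository
repoarch .
    1 20m            # copy 1 to the in-library tape set after 20 minutes
    2 24h            # copy 2 to the set that is sent offsite after 24 hours

fs = samfs2          # filesystem holding /workarea
workarch .
    1 24h            # single tape copy after 24 hours
EOF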
The current tape library uses LTO2 tape drives, which can fit a maximum of 200 GB per tape of
uncompressed data. As we move forward with larger amounts of data to be backed up, it will be
necessary to move to a more modern tape library that has newer LTO drives as well as sufficient
slot capacity for nearline tapes.
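As a rough illustration of the slot capacity involved: at 800 GB per uncompressed LTO4 tape, the
two tape copies of the 82 TB repository proposed later in this document would consume on the
order of (2 x 82,000 GB) / 800 GB, or roughly 205 tapes, before workarea copies or compression
are taken into account.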
4. Accommodate potential data sharing and failover collaborations with external partners,
particularly OIT and NJEdge.
As another safeguard for protecting our data, we could enter into a partnership with OIT,
NJEDGE, or a cloud storage vendor to store our data, but we do not feel that this should replace
a robust tape solution. We also view any of these partners as providing only a dark archive of
our data, not versions that would be delivered through the public interfaces to end users.
5. Determine energy efficient strategies to accommodate an aging air conditioning system that will
probably be replaced after the installation of a new digital storage system.
Energy-efficient servers will be purchased as part of the plan. Servers will be consolidated as
much as possible by running multiple services from the smallest number of machines, either on
the same OS or through the use of Solaris Zones and/or VMware. After the new storage is
verified, only enough disks will be kept running to support the current work.
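As one possible consolidation path, individual services could be moved into Solaris Zones on a
single physical host. The commands below are a generic sketch; the zone name, path, interface,
and address are hypothetical and would depend on the actual consolidation plan.

# Generic sketch of creating and booting a Solaris zone for a consolidated service
# (zone name, path, interface, and address are hypothetical).
zonecfg -z repo-svc1 <<'EOF'
create
set zonepath=/zones/repo-svc1
add net
set physical=bge0
set address=192.0.2.20
end
commit
EOF
zoneadm -z repo-svc1 install
zoneadm -z repo-svc1 boot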
Diagram #1: Existing Repository Infrastructure (content recovered from the diagram)
- 2GB fiber switch connecting the servers, the 6320 storage array, and the tape library
- Tape library (SAM-FS): 2 LTO2 tape drives, 200GB tapes; 2 tape copies of the repository
  filesystem, 1 copy of the workarea filesystem
- 6320 storage array: 7TB total (5TB repository, 1.2 TB workarea)
- Production server: mss3 (V440, Solaris) with Apache, Fedora, MySQL, Repository code and
  support software, plus attached storage
- Staging server: mss2 (V440, Solaris) with Apache, Fedora, MySQL, Repository code and
  support software
- Development server: lefty64 (SuSE) with Apache, Fedora, MySQL, Repository code and
  support software, plus attached storage
- Production pdfserver and development pdfserver
- Production & development OCR server, Handle server, and streaming server
Disaster recovery and new infrastructure (content recovered from the diagram)
- Read-only system: configured like mss3 (V440, Solaris) with Apache, Fedora, MySQL,
  Repository code and support software, plus the production & development Handle and
  streaming servers. Its storage will need to house the /repository presentation datastreams
  (10-15% of total space), the MySQL databases (mysql usernames/passwords; dlrcollections,
  fedora, portal, fedsearch), the indexes (4.1 GB), and the support software (4.0 GB).
- WMS, ingest and edit system: configured like mss3 (V440, Solaris) with Apache, Fedora,
  MySQL, Repository code and support software, plus the production pdfserver. Its storage will
  need to house all /workarea files, the WMS and dlrcollections MySQL databases, the support
  software, the WMS Repository code, and dlr/EDIT.
- Restoring the existing archival files and resuming checksum checking requires the full
  /repository filesystem space, plus the production & development OCR server.
- Tape libraries (SAM-FS): the existing library (2 LTO2 tape drives, 200GB tapes) and a new
  library (2 LTO4 tape drives, 800GB tapes), each holding 2 copies of the repository filesystem
  and 1 copy of the workarea filesystem.
- Dark archive copies held by OIT, NJEDGE, and/or a cloud storage vendor.
- Storage: the existing 6320 array with 7TB total (5TB repository, 1.2 TB workarea, 200 GB
  drtape) alongside new storage with an initial 90TB total, expandable to 320TB (82TB
  repository, of which 20TB managed and 62TB external; 8 TB workarea; 1 TB drtape).
Notes on new storage:
- RAID 5, 6, or 10 for the storage setup.
- Archival files can live on larger, slower disks (500 GB, 10K RPM).
- Presentation files, software, and indexes should be on faster (15K RPM) drives.
- The larger the disk, the longer the reconstruction time when a disk fails; if the rebuild time is
  too long, a mirroring solution should be considered.
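As a rough illustration of the rebuild-time concern, assuming a sustained rebuild rate on the
order of 50 MB/s: reconstructing a failed 500 GB disk takes roughly 500,000 MB / 50 MB/s =
10,000 seconds, or just under 3 hours, while a 2 TB disk at the same rate takes closer to 11
hours. If windows of that length are unacceptable, a mirrored (RAID 10) layout keeps the
rebuild to a straight disk-to-disk copy.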
RECOVERY OF PRESENTATION FILES WITHIN 24 HOURS OF A DISASTER
For redundancy and restoration of services we need to look at an active read-only system and a fully
functioning read/write system.
For a read-only system it is assumed that the following is needed:
a) Fedora - server software, objects and presentation datastreams
b) MySQL - mysql usernames/passwords; databases dlrcollections, fedora, portal, fedsearch
c) Apache - configuration to answer as the real server name
d) Support software to index, search and retrieve records (amberfish, php, xsltproc, dlr/,
   disseminators/, partnerportal/, api/, search/, /mellon/includes/, etc.). Note that it is
   probably best to have a full contingent of repository support software installed
e) Quicktime streaming files and server
f) Websites, if they live on the Repository servers. This currently includes NJDH and RUcore.
g) A second Ethernet interface (or the ability to reconfigure the existing one) to answer on the
   production IP address. If subnets are different, DNS changes will need to be made (see the
   sketch following this list).
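As an illustration of items c) and g), the commands below sketch bringing up a second interface
on the production address; the interface name and address are placeholders, not the production
values, and the corresponding Apache change is noted as a comment.

# g) Plumb a second interface and bring it up on the production IP address
#    (interface name and address are placeholders; update DNS if subnets differ).
ifconfig bge1 plumb
ifconfig bge1 192.0.2.10 netmask 255.255.255.0 up
# c) In the Apache configuration, ServerName would be set to the production hostname
#    so that the read-only copy answers as the real server.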
Ideally there would be an alternate server in the Systems office as well as an alternate server in the SCC
where we could copy the necessary files and databases. So nightly we would:
1) Run an scp script to copy all new objects and presentation datastreams to the alternate servers
2) Take the nightly mysql dumps of the above databases, push them to the alternate servers, and
   load them into the mysql server on those machines
3) Copy the nightly built indexes to the alternate server (if of the same OS type). If the OS is
   different, then rebuilding the indexes would be required.
4) Make sure that support software is up to date (it should only need refreshing after a new release).
5) Keep websites up to date using tar or rsync.
6) Configure the target machine software (i.e., Apache, Fedora) to answer as the production server.
A sketch of such a nightly job follows.
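The script below is a minimal sketch of that nightly job, using rsync in place of a raw scp pass so
that only new or changed files are sent. The alternate host names (dralt1, dralt2), the Fedora
object/datastream paths, and the dump directory are assumptions for illustration, not the
production values.

#!/bin/sh
# Illustrative nightly DR sync; host names, paths, and dump locations are hypothetical.
for host in dralt1 dralt2
do
    # 1) copy new objects and presentation datastreams (archival masters excluded)
    rsync -a /repository/fedora/objects/ $host:/repository/fedora/objects/
    rsync -a --exclude '*ARCH*' /repository/fedora/datastreams/ $host:/repository/fedora/datastreams/

    # 2) push the nightly mysql dumps and load them on the alternate server
    for db in dlrcollections fedora portal fedsearch
    do
        scp /backups/mysql/$db.sql $host:/tmp/$db.sql
        ssh $host "mysql $db < /tmp/$db.sql"
    done

    # 3) copy the nightly built indexes (same OS type assumed)
    rsync -a /u3/INDEX/ $host:/u3/INDEX/

    # 4) refresh support software; 5) websites would be synced the same way from their
    #    document root (path not shown here)
    rsync -a /local/ $host:/local/
done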
Breakdown of external archive space usage (used for video masters)

Space used for the existing /repository/rarch/temp_upload directory (28 videos)
Total space used: 491376544 KB

> find /repository/rarch/ -type f -name "*tar" -exec du -ks {} \; | grep temp_upload | awk '{sum = sum + $1; print $1" "sum}' | tail -1
467039440    tar files              467 GB                95 %

> find /repository/rarch/ -type f -name "*mov" -exec du -ks {} \; | grep temp_upload | awk '{sum = sum + $1; print $1" "sum}' | tail -1
14025856     Quicktime mov files    14 GB                 2.85 %

> find /repository/rarch/ -type f -name "*flv" -exec du -ks {} \; | grep temp_upload | awk '{sum = sum + $1; print $1" "sum}' | tail -1
10311248     Flash files            10 GB                 2.15 %

Space used for the /repository/rarch directory (does not include temp_upload)
Total space used: 285048032 KB

> find /repository/rarch/ -type f -name "*tar" -exec du -ks {} \; | grep -v temp_upload | awk '{sum = sum + $1; print $1" "sum}' | tail -1
277905872    tar files              277 GB (83 files)     97.5 %

> find /repository/rarch/ -type f -name "*mov" -exec du -ks {} \; | grep -v temp_upload | awk '{sum = sum + $1; print $1" "sum}' | tail -1
3992720      Quicktime mov files    3.99 GB (31 files)    1.4 %

> find /repository/rarch/ -type f -name "*flv" -exec du -ks {} \; | grep -v temp_upload | awk '{sum = sum + $1; print $1" "sum}' | tail -1
3149440      Flash files            3.14 GB (31 files)    1.1 %
> du -hs /local /local/src /u3/INDEX
8.1G    /local        application software
4.0G    /local/src    application software source directory
4.1G    /local        installed application software
4.3G    /u3/INDEX     amberfish indexes (16119 objects; 146 collections)
Ingested managed datastreams in the /repository filesystem
# du -hs objects
483M          objects
# du -ks datastreams
1092984188    datastreams    1.0 TB
Datastream type counts
# find datastreams -type f | awk -F"/" '{print $6}' | awk -F"+" '{print $2}' | awk -F"-" '{print $1}' | nawk -f /home/dhoover/cntaip | sort | more
ARCH1 15864 ARCH2 22 ARCH3 8 ARCH4 5 ARCH5 3 ARCH6 2 ARCH7 1 ARCH8 1 ARCH9 1
DJVU 13072 JPEG 19053 PDF 15725 SMAP1 15343 XML 3948
THUMB 23 THUMBJPEG 8482
FLV 78 MOV 79 SMOV 47
MP3 5
EMBARGOPDF 1 PLAIN 1 POLICY 2 TECHNICAL1 1
Breakdown of major type by size (bytes)
# find datastreams -type f -ls | awk '{sum=sum+$7;print $7" "sum}' | tail -1
1118598075423    100 %
# find datastreams -type f -ls | grep ARCH | awk '{sum=sum+$7;print $7" "sum}' | tail -1
1019268590231    91.1 %
# find datastreams -type f -ls | grep -v ARCH | awk '{sum=sum+$7;print $7" "sum}' | tail -1
99329485192      8.9 %
# find datastreams -type f -ls | grep DJVU | awk '{sum=sum+$7;print $7" "sum}' | tail -1
12994992446      1.16 %
# find datastreams -type f -ls | grep JPEG | awk '{sum=sum+$7;print $7" "sum}' | tail -1
13871755474      1.23 %
# find datastreams -type f -ls | grep PDF | awk '{sum=sum+$7;print $7" "sum}' | tail -1
54711738229      4.89 %
# find datastreams -type f -ls | grep SMAP1 | awk '{sum=sum+$7;print $7" "sum}' | tail -1
6107104          less than .01 %
# find datastreams -type f -ls | grep XML | awk '{sum=sum+$7;print $7" "sum}' | tail -1
536796300        .05 %
# find datastreams -type f -ls | grep THUMB | awk '{sum=sum+$7;print $7" "sum}' | tail -1
88954339         less than .01 %
# find datastreams -type f -ls | grep FLV | awk '{sum=sum+$7;print $7" "sum}' | tail -1
6755777719       .60 %
# find datastreams -type f -ls | grep MOV | awk '{sum=sum+$7;print $7" "sum}' | tail -1
9788254297       .88 %
# find datastreams -type f -ls | grep MP3 | awk '{sum=sum+$7;print $7" "sum}' | tail -1
663139005
/workarea filesystem (file count and bytes)
# find /workarea -type f -ls | grep -v temp_upload | awk '{count=count+1;sum=sum+$7; print count" "sum}' | tail -1
23948    313538177123    100 %
# find /workarea -type f -ls | grep -v temp_upload | grep tar | awk '{count=count+1;sum=sum+$7; print count" "sum}' | tail -1
2018     149770308976    48.0 %
# find /workarea -type f -ls | grep -v temp_upload | grep -v tar | awk '{count=count+1;sum=sum+$7; print count" "sum}' | tail -1
21930    163767868147    52.0 %

/workarea/temp_upload
# find /workarea -type f -ls | grep temp_upload | awk '{count=count+1;sum=sum+$7; print count" "sum}' | tail -1
6512     128921741086    100 %
# find /workarea -type f -ls | grep temp_upload | grep tar | awk '{count=count+1;sum=sum+$7; print count" "sum}' | tail -1
973      38291583507     29.7 %
# find /workarea -type f -ls | grep temp_upload | grep -v tar | awk '{count=count+1;sum=sum+$7; print count" "sum}' | tail -1
5539     90630157579     70.3 %