Repository Services

1. Developing a broad digital storage architecture that supports the long-term storage and
availability of digital masters, complete restoration of repository objects within one week of a
disaster, and online access to presentation files within 24 hours of a disaster. The architecture
should encompass the Systems and SCC facilities and should indicate hardware and software
needs, both for replacement and to supplement existing infrastructure. The broad architecture
must be presented by the Working Group Chair, Dave Hoover, at the January 6 CISC meeting.
See Diagram #1 for existing Repository Infrastructure servers and storage.
24-hour recovery of presentation files will be handled by keeping live copies of all components
that support online delivery of objects from the repository. These copies will be made to
alternate servers initially housed in Systems and in the SCC (although they could be housed
anywhere). These will be read-only versions of the software, with no provision made for the
addition of any new files to the systems. These systems will have an installed copy of all the
applications (Fedora, MySQL, Apache, etc.) as well as the most recent Repository code base.
Based on the current Repository disk storage, approximately 10% of all datastreams are
presentation datastreams.
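One way to keep only those presentation copies current on the alternate servers would be an
rsync pass that skips the archival masters. The command below is an illustrative sketch only: the
datastream path, the alternate host name dralt, and the reliance on the ARCH datastream prefix
(seen in the datastream counts later in this document) are assumptions, not the production setup.

# Illustrative only: copy new or changed presentation datastreams to an alternate
# server, skipping archival masters (assumes masters carry the ARCH prefix).
rsync -a --exclude '*ARCH*' /repository/fedora/datastreams/ dralt:/repository/fedora/datastreams/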
If there is a need for ongoing WMS work and/or ingestion into Fedora, adequate disk space for
the workarea filesystem as well as Fedora datastream space will need to be taken into account.
The WMS software will need to be fully installed and configured, and indexing routines and
other cron scripts will need to be put into place.
Restoration of the existing archival masters is not seen as an immediate need, as they are not
delivered to end users once ingested. That restoration will take place from the nearline and/or
offline tape backup copies that were made.
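For reference, a restore of archival masters under SAM-FS would typically restore filesystem
metadata from a samfsdump image and then let files stage back from tape on access, or force
them back with the stage command. The outline below is a hedged sketch only; the dump file
location is hypothetical and the exact procedure would follow the SAM-FS documentation.

# Hypothetical outline of restoring archival files from the tape copies.
cd /repository
samfsrestore -f /dr/dumps/repository.samfsdump    # restore file metadata (dump path is hypothetical)
sfind /repository -offline -exec stage {} \;      # stage offline files back from nearline/offline tape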
2. Identify digital storage needs, in terabytes, for the next 3-5 years, in a report to CISC at the
January 6 meeting. This storage should encompass ongoing collection building, including ETDs,
faculty deposits, SC/UA collection building, and NJDH. In addition, this storage should
encompass the 400 hour Video Mosaic collection and should support a forthcoming data sets
collection, perhaps as part of an NSF DataNets grant. Calculations should include large data sets,
such as the Coastal and Marine Sciences ORBIS data, that the library might assume responsibility
for, as well as future grants by science and social science faculty that the libraries might support.
Approximately 85.12 TB is needed for existing projects in the pipeline that should be able to be
completed in the next 2 years. The majority of this space (80 TB) is for the Video Mosaic project.
Over the next 3-5 years it is anticipated that we will need an additional 247.57 TB of storage to
support Data sets as well as other video and audio projects.
See attached document for specific details.
3. Determine backup and disaster recovery strategies, including tape backup, removable disk
backup, etc., to ensure complete recovery in the event of both hardware failure and complete
physical disaster at the Main Distribution Facility (Systems) and Intermediate Distribution
Facility (SCC).
Backups of the repository filesystem (which contains the ingested datastreams and objects) and
the workarea filesystem (which contains in-process files) will continue to be made to tape using
SAM-FS software. Under the current backup schedule, files in the repository filesystem are
written to one set of tapes in the library once they are 20 minutes old, and then to a second set
of tapes once they are 24 hours old. Files in the workarea filesystem are backed up when they
are 24 hours old. The second-copy repository filesystem tapes are removed when they are filled
and sent offsite for permanent storage. During the week, the current unfilled tapes for the
repository and workarea filesystems are copied to another disk partition that is backed up
and sent offsite weekly.
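For context, this kind of age-based copy schedule is what the SAM-FS archiver.cmd file expresses.
The fragment below is a rough sketch under assumed names; the filesystem names (samfs1, samfs2)
and archive set names are placeholders, not the production configuration, and the tape pool
assignments are omitted.

# Rough sketch of an archiver.cmd matching the schedule described above (all names are placeholders).
cat > /etc/opt/SUNWsamfs/archiver.cmd <<'EOF'
fs = samfs1          # filesystem holding /repository
repoarch .
    1 20m            # copy 1 to the in-library tape set after 20 minutes
    2 24h            # copy 2 to the set that is sent offsite after 24 hours

fs = samfs2          # filesystem holding /workarea
workarch .
    1 24h            # single tape copy after 24 hours
EOF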
The current tape library uses LTO2 tape drives, which can fit a maximum of 200 GB per tape of
uncompressed data. As we move forward with larger amounts of data to be backed up, it will be
necessary to move to a more modern tape library that has newer LTO drives as well as sufficient
slot capacity for nearline tapes.
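As a rough illustration of the slot capacity involved: at 800 GB per uncompressed LTO4 tape, the
two tape copies of the 82 TB repository proposed later in this document would consume on the
order of (2 x 82,000 GB) / 800 GB, or roughly 205 tapes, before workarea copies or compression
are taken into account.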
4. Accommodate potential data sharing and failover collaborations with external partners,
particularly OIT and NJEdge.
As another safeguard for protecting our data, we could enter into a partnership with OIT,
NJEDGE, or a cloud storage vendor to store our data, but we do not feel that this should replace
a robust tape solution. We also view any of these partners as providing only a dark archive of
our data, not versions that would be delivered through the public interfaces to end users.
5. Determine energy efficient strategies to accommodate an aging air conditioning system that will
probably be replaced after the installation of a new digital storage system.
Energy-efficient servers will be purchased as part of the plan. Servers will be consolidated as
much as possible by running multiple services from the smallest number of machines, either on
the same OS or through the use of Solaris Zones and/or VMware. After the new storage is
verified, only enough disks will be kept running to support the current work.
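As one possible consolidation path, individual services could be moved into Solaris Zones on a
single physical host. The commands below are a generic sketch; the zone name, path, interface,
and address are hypothetical and would depend on the actual consolidation plan.

# Generic sketch of creating and booting a Solaris zone for a consolidated service
# (zone name, path, interface, and address are hypothetical).
zonecfg -z repo-svc1 <<'EOF'
create
set zonepath=/zones/repo-svc1
add net
set physical=bge0
set address=192.0.2.20
end
commit
EOF
zoneadm -z repo-svc1 install
zoneadm -z repo-svc1 boot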
Diagram #1: Existing Repository Infrastructure (content recovered from the diagram)
- 2GB fiber switch connecting the servers, the 6320 storage array, and the tape library
- Tape library (SAM-FS): 2 LTO2 tape drives, 200GB tapes; 2 tape copies of the repository
  filesystem, 1 copy of the workarea filesystem
- 6320 storage array: 7TB total (5TB repository, 1.2 TB workarea)
- Production server: mss3 (V440, Solaris) with Apache, Fedora, MySQL, Repository code and
  support software, plus attached storage
- Staging server: mss2 (V440, Solaris) with Apache, Fedora, MySQL, Repository code and
  support software
- Development server: lefty64 (SuSE) with Apache, Fedora, MySQL, Repository code and
  support software, plus attached storage
- Production pdfserver and development pdfserver
- Production & development OCR server, Handle server, and streaming server
Disaster recovery and new infrastructure (content recovered from the diagram)
- Read-only system: configured like mss3 (V440, Solaris) with Apache, Fedora, MySQL,
  Repository code and support software, plus the production & development Handle and
  streaming servers. Its storage will need to house the /repository presentation datastreams
  (10-15% of total space), the MySQL databases (mysql usernames/passwords; dlrcollections,
  fedora, portal, fedsearch), the indexes (4.1 GB), and the support software (4.0 GB).
- WMS, ingest and edit system: configured like mss3 (V440, Solaris) with Apache, Fedora,
  MySQL, Repository code and support software, plus the production pdfserver. Its storage will
  need to house all /workarea files, the WMS and dlrcollections MySQL databases, the support
  software, the WMS Repository code, and dlr/EDIT.
- Restoring the existing archival files and resuming checksum checking requires the full
  /repository filesystem space, plus the production & development OCR server.
- Tape libraries (SAM-FS): the existing library (2 LTO2 tape drives, 200GB tapes) and a new
  library (2 LTO4 tape drives, 800GB tapes), each holding 2 copies of the repository filesystem
  and 1 copy of the workarea filesystem.
- Dark archive copies held by OIT, NJEDGE, and/or a cloud storage vendor.
- Storage: the existing 6320 array with 7TB total (5TB repository, 1.2 TB workarea, 200 GB
  drtape) alongside new storage with an initial 90TB total, expandable to 320TB (82TB
  repository, of which 20TB managed and 62TB external; 8 TB workarea; 1 TB drtape).
Notes on new storage:
- RAID 5, 6, or 10 for the storage setup.
- Archival files can live on larger, slower disks (500 GB, 10K RPM).
- Presentation files, software, and indexes should be on faster (15K RPM) drives.
- The larger the disk, the longer the reconstruction time when a disk fails; if the rebuild time is
  too long, a mirroring solution should be considered.
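As a rough illustration of the rebuild-time concern, assuming a sustained rebuild rate on the
order of 50 MB/s: reconstructing a failed 500 GB disk takes roughly 500,000 MB / 50 MB/s =
10,000 seconds, or just under 3 hours, while a 2 TB disk at the same rate takes closer to 11
hours. If windows of that length are unacceptable, a mirrored (RAID 10) layout keeps the
rebuild to a straight disk-to-disk copy.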
RECOVERY OF PRESENTATION FILES WITHIN 24 HOURS OF A DISASTER
For redundancy and restoration of services we need to look at an active read-only system and a fully
functioning read/write system.
For a read-only system it is assumed that the following is needed:
a) Fedora - server software, objects and presentation datastreams
b) MySQL - mysql usernames/passwords; databases dlrcollections, fedora, portal, fedsearch
c) Apache - configuration to answer as the real server name
d) Support software to index, search and retrieve records (amberfish, php, xsltproc, dlr/,
   disseminators/, partnerportal/, api/, search/, /mellon/includes/, etc.). Note that it is
   probably best to have a full contingent of repository support software installed
e) Quicktime streaming files and server
f) Websites, if they live on the Repository servers. This currently includes NJDH and RUcore.
g) A second Ethernet interface (or the ability to reconfigure the existing one) to answer on the
   production IP address. If subnets are different, DNS changes will need to be made (see the
   sketch following this list).
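As an illustration of items c) and g), the commands below sketch bringing up a second interface
on the production address; the interface name and address are placeholders, not the production
values, and the corresponding Apache change is noted as a comment.

# g) Plumb a second interface and bring it up on the production IP address
#    (interface name and address are placeholders; update DNS if subnets differ).
ifconfig bge1 plumb
ifconfig bge1 192.0.2.10 netmask 255.255.255.0 up
# c) In the Apache configuration, ServerName would be set to the production hostname
#    so that the read-only copy answers as the real server.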
Ideally there would be an alternate server in the Systems office as well as an alternate server in the SCC
where we could copy the necessary files and databases. So nightly we would:
1) Run an scp script to copy all new objects and presentation datastreams to the alternate servers
2) Take the nightly mysql dumps of the above databases, push them to the alternate servers, and
   load them into the mysql server on those machines
3) Copy the nightly built indexes to the alternate server (if of the same OS type). If the OS is
   different, then rebuilding the indexes would be required.
4) Make sure that support software is up to date (it should only need refreshing after a new release).
5) Keep websites up to date using tar or rsync.
6) Configure the target machine software (i.e., Apache, Fedora) to answer as the production server.
A sketch of such a nightly job follows.
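The script below is a minimal sketch of that nightly job, using rsync in place of a raw scp pass so
that only new or changed files are sent. The alternate host names (dralt1, dralt2), the Fedora
object/datastream paths, and the dump directory are assumptions for illustration, not the
production values.

#!/bin/sh
# Illustrative nightly DR sync; host names, paths, and dump locations are hypothetical.
for host in dralt1 dralt2
do
    # 1) copy new objects and presentation datastreams (archival masters excluded)
    rsync -a /repository/fedora/objects/ $host:/repository/fedora/objects/
    rsync -a --exclude '*ARCH*' /repository/fedora/datastreams/ $host:/repository/fedora/datastreams/

    # 2) push the nightly mysql dumps and load them on the alternate server
    for db in dlrcollections fedora portal fedsearch
    do
        scp /backups/mysql/$db.sql $host:/tmp/$db.sql
        ssh $host "mysql $db < /tmp/$db.sql"
    done

    # 3) copy the nightly built indexes (same OS type assumed)
    rsync -a /u3/INDEX/ $host:/u3/INDEX/

    # 4) refresh support software; 5) websites would be synced the same way from their
    #    document root (path not shown here)
    rsync -a /local/ $host:/local/
done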
Breakdown of external archive space usage (used for video masters)

Space used for the existing /repository/rarch/temp_upload directory (28 videos)
Total space used: 491376544 KB

> find /repository/rarch/ -type f -name "*tar" -exec du -ks {} \; | grep temp_upload | awk '{sum = sum + $1; print $1" "sum}' | tail -1
467039440    tar files              467 GB                95 %

> find /repository/rarch/ -type f -name "*mov" -exec du -ks {} \; | grep temp_upload | awk '{sum = sum + $1; print $1" "sum}' | tail -1
14025856     Quicktime mov files    14 GB                 2.85 %

> find /repository/rarch/ -type f -name "*flv" -exec du -ks {} \; | grep temp_upload | awk '{sum = sum + $1; print $1" "sum}' | tail -1
10311248     Flash files            10 GB                 2.15 %

Space used for the /repository/rarch directory (does not include temp_upload)
Total space used: 285048032 KB

> find /repository/rarch/ -type f -name "*tar" -exec du -ks {} \; | grep -v temp_upload | awk '{sum = sum + $1; print $1" "sum}' | tail -1
277905872    tar files              277 GB (83 files)     97.5 %

> find /repository/rarch/ -type f -name "*mov" -exec du -ks {} \; | grep -v temp_upload | awk '{sum = sum + $1; print $1" "sum}' | tail -1
3992720      Quicktime mov files    3.99 GB (31 files)    1.4 %

> find /repository/rarch/ -type f -name "*flv" -exec du -ks {} \; | grep -v temp_upload | awk '{sum = sum + $1; print $1" "sum}' | tail -1
3149440      Flash files            3.14 GB (31 files)    1.1 %
> du -hs /local /local/src /u3/INDEX
8.1G    /local        application software
4.0G    /local/src    application software source directory
4.1G    /local        installed application software
4.3G    /u3/INDEX     amberfish indexes (16119 objects; 146 collections)
Ingested managed datastreams in the /repository filesystem
# du -hs objects
483M          objects
# du -ks datastreams
1092984188    datastreams    1.0 TB
Datastream type counts
# find datastreams -type f | awk -F"/" '{print $6}' | awk -F"+" '{print $2}' | awk -F"-" '{print $1}' | nawk -f /home/dhoover/cntaip | sort | more
ARCH1 15864 ARCH2 22 ARCH3 8 ARCH4 5 ARCH5 3 ARCH6 2 ARCH7 1 ARCH8 1 ARCH9 1
DJVU 13072 JPEG 19053 PDF 15725 SMAP1 15343 XML 3948
THUMB 23 THUMBJPEG 8482
FLV 78 MOV 79 SMOV 47
MP3 5
EMBARGOPDF 1 PLAIN 1 POLICY 2 TECHNICAL1 1
Breakdown of major type by size (bytes)
# find datastreams -type f -ls | awk '{sum=sum+$7;print $7" "sum}' | tail -1
1118598075423    100 %
# find datastreams -type f -ls | grep ARCH | awk '{sum=sum+$7;print $7" "sum}' | tail -1
1019268590231    91.1 %
# find datastreams -type f -ls | grep -v ARCH | awk '{sum=sum+$7;print $7" "sum}' | tail -1
99329485192      8.9 %
# find datastreams -type f -ls | grep DJVU | awk '{sum=sum+$7;print $7" "sum}' | tail -1
12994992446      1.16 %
# find datastreams -type f -ls | grep JPEG | awk '{sum=sum+$7;print $7" "sum}' | tail -1
13871755474      1.23 %
# find datastreams -type f -ls | grep PDF | awk '{sum=sum+$7;print $7" "sum}' | tail -1
54711738229      4.89 %
# find datastreams -type f -ls | grep SMAP1 | awk '{sum=sum+$7;print $7" "sum}' | tail -1
6107104          less than .01 %
# find datastreams -type f -ls | grep XML | awk '{sum=sum+$7;print $7" "sum}' | tail -1
536796300        .05 %
# find datastreams -type f -ls | grep THUMB | awk '{sum=sum+$7;print $7" "sum}' | tail -1
88954339         less than .01 %
# find datastreams -type f -ls | grep FLV | awk '{sum=sum+$7;print $7" "sum}' | tail -1
6755777719       .60 %
# find datastreams -type f -ls | grep MOV | awk '{sum=sum+$7;print $7" "sum}' | tail -1
9788254297       .88 %
# find datastreams -type f -ls | grep MP3 | awk '{sum=sum+$7;print $7" "sum}' | tail -1
663139005
/workarea filesystem (file count and bytes)
# find /workarea -type f -ls | grep -v temp_upload | awk '{count=count+1;sum=sum+$7; print count" "sum}' | tail -1
23948    313538177123    100 %
# find /workarea -type f -ls | grep -v temp_upload | grep tar | awk '{count=count+1;sum=sum+$7; print count" "sum}' | tail -1
2018     149770308976    48.0 %
# find /workarea -type f -ls | grep -v temp_upload | grep -v tar | awk '{count=count+1;sum=sum+$7; print count" "sum}' | tail -1
21930    163767868147    52.0 %

/workarea/temp_upload
# find /workarea -type f -ls | grep temp_upload | awk '{count=count+1;sum=sum+$7; print count" "sum}' | tail -1
6512     128921741086    100 %
# find /workarea -type f -ls | grep temp_upload | grep tar | awk '{count=count+1;sum=sum+$7; print count" "sum}' | tail -1
973      38291583507     29.7 %
# find /workarea -type f -ls | grep temp_upload | grep -v tar | awk '{count=count+1;sum=sum+$7; print count" "sum}' | tail -1
5539     90630157579     70.3 %