Hussein Suleman
University of Cape Town, Department of Computer Science
Advanced Information Management Laboratory
High Performance Computing Laboratory
April 2007

Open Access

What is Open Access?
- free online access to electronic resources: research papers, courseware, ETDs, etc.
Why?
- Lower costs, Empower producers, Empower consumers, Improve visibility …
How?
- Institutional Repository: online system to manage documents, typically at one institution.
- Open Access Journal: online system to manage publication and dissemination.
- Advocacy, Policy, Procedures, Management, …
- Software Tools: DSpace, EPrints, OJS, …

Very Large Data Collections

Source:

Sizing Open Access

- UCT-CS Publication Archive: Average size of PDF document: 848K
- UCT 2005 Research Report: 4150 Research document artefacts
- Estimate output for one year: 4150 documents, 3.15 GB
- And this is only published peer-reviewed research as we know it!
- What about theses/dissertations?
- What about technical and project reports?
  1000 documents, 10MB each, totalling 10GB
- Courseware?
  1000 documents, 5MB each, totalling 5GB
  50MB per course?
- Datasets?
  Almost infinite!

Repository Software UnScalability

- Most systems do not scale well beyond small collections.
[Charts: EPrints, DSpace]
source: Technical Evaluation of selected Open Source Repository Systems, Catalyst IT

Service Provision UnScalability

UCT-CS Archive:
- Average of 6.24 user accesses per document per month
- Average of 18 accesses per document per month
For 83000 documents:
- 1.494 million accesses per month
- 34.58 accesses per minute
What about search/browse and other services?
And this is only published peer-reviewed research as we know it!

Some Solutions

- Devise completely new algorithms and systems to deal with massive quantities of information.
  Fedora/SRB/etc., Parallel OAI-PMH, Terascale IR systems…
- Use computing resources more efficiently, to maximise benefits with minimum cost.
  Efficient Cluster and Grid computing
- Make the users’ computer do more work.
  Client-side computation: AJAX
- Make the users do all or most of the work.
  Web 2.0

Scalable Repositories

- Fedora: Digital Repository system developed at Cornell with API for higher level services.
- Storage Resource Broker: Storage abstraction for large-scale stores developed at San Diego Supercomputing Centre.
- Grid-based Storage Systems: Systems to utilise Grid computing for storage of data in distributed fashion.
- Amazon and Google: Third party providers of storage at a premium.

Parallel Harvesting

- Multiple harvesters cycle through harvest and process operations in parallel.
- Significant benefit when workload is high.
- Parallelism helps even on one machine!
- What about parallel data provision?
[Diagram: OAI Data Provider; Primary Harvester; drone, drone, …; Beowulf cluster or multiprocessor]

High(er) Performance IR for the Rest of Us

- Efficient search engine on a small cluster, more likely in developing countries.
- Nodes either do querying or indexing and can swap if needed.
- Reasonably good performance.
- Work is being extended for larger collections and grids.
- Terascale IR?
[Diagram: Job dispatcher; index node, index node, query node, query node, …; Beowulf cluster]

High-Level Component-Based DL Scalability

- Split DL into components and spread across cluster.
- Services are Web-distributed.
- Make services mobile and create replicates.
- Performance improvement and better use of multiple cheap computers.
- Moving digital archives to grids can deal with service provision scalability.
- See DILIGENT…
[Diagram: Registry; node1, node2; Resolver instances; Resolver]

Web 2.0 – User Contributions

- Make users provide as much information as possible.
- Users managing content means less central management.
- Greater scalability!

Ajax for DL Services

- AJAX supports applications or services within the browser.
- Move computation from server to client
- Greater scalability!

Scalable Preservation

- XML is often used for Preservation while Databases are used for Access.
- How do we make XML tools scalable?
- Can we?

Concluding Thoughts

- Open Access and Digital Repositories must consider scalability as fundamental as preservation and access.
- Realistically, we have not had major scalability problems yet, but we don’t have many Open Access systems either.
- Google works because scalability is a primary concern. They intend to index the world’s information.
- If we believe in similar ideals (like producing or curating the world’s information), we too must plan for scalability!

That’s all Folks!
direct all comments to:
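
Worked example: the back-of-envelope estimates on the Sizing Open Access and Service Provision UnScalability slides can be reproduced with a few lines of Python. This is only a sketch. The function names are hypothetical, and the decimal unit convention (1 GB = 1,000,000 KB) and the 30-day month are assumptions, not something the slides state. Multiplying the two storage figures as printed gives about 3.5 GB (about 3.4 GB with binary units), slightly above the 3.15 GB quoted, which may reflect rounding or a different average in the original calculation. The access figures reproduce exactly.

```python
# Figures below are copied from the slides; everything else is an assumption.
AVG_PDF_KB = 848                   # average PDF size, UCT-CS Publication Archive
DOCS_PER_YEAR = 4150               # artefacts, UCT 2005 Research Report
ACCESSES_PER_DOC_PER_MONTH = 18    # accesses per document per month
COLLECTION_DOCS = 83000            # collection size used on the slide
MINUTES_PER_MONTH = 30 * 24 * 60   # assumption: a 30-day month


def yearly_storage_gb(docs: int, avg_kb: float) -> float:
    """Storage for one year of output, assuming 1 GB = 1,000,000 KB."""
    return docs * avg_kb / 1_000_000


def accesses_per_month(docs: int, per_doc: float) -> float:
    """Total monthly accesses for a collection of `docs` documents."""
    return docs * per_doc


def accesses_per_minute(monthly: float) -> float:
    """Average access rate per minute over a 30-day month."""
    return monthly / MINUTES_PER_MONTH


if __name__ == "__main__":
    monthly = accesses_per_month(COLLECTION_DOCS, ACCESSES_PER_DOC_PER_MONTH)
    print(f"{yearly_storage_gb(DOCS_PER_YEAR, AVG_PDF_KB):.2f} GB per year")  # 3.52
    print(f"{monthly} accesses per month")                                    # 1494000
    print(f"{accesses_per_minute(monthly):.2f} accesses per minute")          # 34.58
```

Changing COLLECTION_DOCS or ACCESSES_PER_DOC_PER_MONTH shows how quickly the load grows once theses, reports, courseware and datasets are added to the collection.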