Scalability - Digital Libraries Laboratory @ UCT CS

Hussein Suleman
hussein@cs.uct.ac.za
University of Cape Town
Department of Computer Science
Advanced Information Management Laboratory
High Performance Computing Laboratory
April 2007
Open Access
What is Open Access?
Free online access to electronic resources: research papers, courseware, ETDs, etc.
Why?
Lower costs, empower producers, empower consumers, improve visibility, …
How?
Institutional Repository: online system to manage documents, typically at one institution.
Open Access Journal: online system to manage publication and dissemination.
Advocacy, Policy, Procedures, Management, …
Software Tools: DSpace, EPrints, OJS, …
Very Large Data Collections
Source: http://www2.warwick.ac.uk/fac/sci/physics/research/astro/postgraduate/galplane/
Sizing Open Access

UCT-CS Publication Archive:
  Average size of PDF document: 848K
UCT 2005 Research Report:
  4150 research document artefacts
Estimated output for one year:
  4150 documents, 3.15 GB
And this is only published peer-reviewed research as we know it!
What about theses/dissertations?
  1000 documents, 10MB each, totalling 10GB
What about technical and project reports?
  1000 documents, 5MB each, totalling 5GB
Courseware?
  50MB per course?
Datasets?
  Almost infinite!
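
The round totals on this slide follow from simple multiplication. A minimal sketch of that arithmetic, assuming decimal units (1 GB = 1000 MB); the helper name `total_gb` is hypothetical and only the document counts and per-document sizes come from the slide:

```python
# Back-of-envelope storage estimate using the slide's figures.
# Assumption: 1 GB = 1000 MB (decimal units), which matches the round totals.

def total_gb(documents: int, mb_each: float) -> float:
    """Return total storage in GB for `documents` files of `mb_each` MB each."""
    return documents * mb_each / 1000

ten_mb_collection = total_gb(1000, 10)   # 1000 documents at 10 MB each
five_mb_collection = total_gb(1000, 5)   # 1000 documents at 5 MB each

print(f"{ten_mb_collection:.0f} GB and {five_mb_collection:.0f} GB")
```

The same helper can be reused for any other collection once a document count and an average size are known.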
Repository Software UnScalability

Most systems do not scale well beyond small collections.
[Figure panels: EPrints; DSpace]
Source: Technical Evaluation of selected Open Source Repository Systems, Catalyst IT
Service Provision UnScalability

UCT-CS Archive:
  Average of 6.24 user accesses per document per month
  Average of 18 accesses per document per month
For 83000 documents:
  1.494 million accesses per month
  34.58 accesses per minute
What about search/browse and other services?
And this is only published peer-reviewed research as we know it!
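
The projected load can be reproduced with a few lines of arithmetic. A minimal sketch, assuming a 30-day month (an assumption that happens to reproduce the slide's per-minute figure); the constant names are illustrative:

```python
# Access-rate projection using the slide's figures.
# Assumption: a month is taken as 30 days.

DOCUMENTS = 83_000
ACCESSES_PER_DOC_PER_MONTH = 18
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

accesses_per_month = DOCUMENTS * ACCESSES_PER_DOC_PER_MONTH
accesses_per_minute = accesses_per_month / MINUTES_PER_MONTH

print(f"{accesses_per_month:,} accesses per month")
print(f"{accesses_per_minute:.2f} accesses per minute")
```

Note that this counts document accesses only; search, browse and other services would add to the load.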
Some Solutions

Devise completely new algorithms and systems to deal with massive quantities of information.
  Fedora/SRB/etc., Parallel OAI-PMH, Terascale IR systems…
Use computing resources more efficiently, to maximise benefits with minimum cost.
  Efficient Cluster and Grid computing
Make the users’ computers do more work.
  Client-side computation: AJAX
Make the users do all or most of the work.
  Web 2.0
Scalable Repositories

Fedora
  Digital repository system developed at Cornell, with an API for higher-level services.
Storage Resource Broker
  Storage abstraction for large-scale stores, developed at the San Diego Supercomputer Center.
Grid-based Storage Systems
  Systems that use Grid computing to store data in a distributed fashion.
Amazon and Google
  Third-party providers of storage at a premium.
Parallel Harvesting




Multiple harvesters cycle through harvest and process operations in parallel.
Significant benefit when workload is high.
Parallelism helps even on one machine!
What about parallel data provision?
[Diagram labels: OAI Data Provider; Primary Harvester; drone, drone, …; Beowulf cluster or multiprocessor]
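
A rough illustration of the primary-harvester-and-drones idea, not the system described in the talk. It assumes the work is partitioned by set, uses a simulated in-memory provider instead of real OAI-PMH requests, and all names (`FAKE_PROVIDER`, `drone`, `primary_harvester`) are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an OAI data provider: set names mapped to record IDs.
FAKE_PROVIDER = {
    "setA": ["oai:a:1", "oai:a:2"],
    "setB": ["oai:b:1"],
    "setC": ["oai:c:1", "oai:c:2", "oai:c:3"],
}

def harvest(set_name: str) -> list[str]:
    """Harvest step: fetch the records of one set (simulated, no network)."""
    return FAKE_PROVIDER[set_name]

def process(records: list[str]) -> list[str]:
    """Process step: placeholder transformation of the harvested records."""
    return [record.upper() for record in records]

def drone(set_name: str) -> list[str]:
    """One drone runs harvest and then process for its assigned set."""
    return process(harvest(set_name))

def primary_harvester(set_names: list[str], drones: int = 2) -> dict[str, list[str]]:
    """Hand the sets to a pool of drones and collect results by set name."""
    with ThreadPoolExecutor(max_workers=drones) as pool:
        results = pool.map(drone, set_names)  # map preserves input order
        return dict(zip(set_names, results))

harvested = primary_harvester(sorted(FAKE_PROVIDER))
print(sum(len(records) for records in harvested.values()), "records processed")
```

Threads suit this sketch because real harvesting is dominated by network waits; on a Beowulf cluster the drones would instead be separate processes on separate nodes.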
High(er) Performance IR for the Rest of Us


Efficient search engine on a small cluster, the kind of hardware more likely to be available in developing countries.
Nodes either do querying or indexing and can swap if needed.
Reasonably good performance.
Work is being extended for larger collections and grids.
Terascale IR?
[Diagram labels: Job dispatcher; index node, index node, query node, query node, …; Beowulf cluster]
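
One way a job dispatcher might decide when nodes swap roles is to split them in proportion to the pending work of each kind. The policy below is an assumption made for illustration, not the dispatcher used in the talk's system; `assign_roles` and its parameters are hypothetical:

```python
# Hypothetical dispatcher policy: divide nodes between indexing and querying
# in proportion to the pending work of each kind.

def assign_roles(nodes: list[str], pending_index: int, pending_query: int) -> dict[str, str]:
    """Return a role ("index" or "query") for every node.

    With two or more nodes and some pending work, at least one node is kept
    in each role so that neither queue is left unserved.
    """
    total = pending_index + pending_query
    if total == 0 or len(nodes) < 2:
        index_count = len(nodes) // 2  # no information: split evenly
    else:
        index_count = round(len(nodes) * pending_index / total)
        index_count = max(1, min(len(nodes) - 1, index_count))
    return {
        node: ("index" if position < index_count else "query")
        for position, node in enumerate(nodes)
    }

cluster = ["n1", "n2", "n3", "n4"]
print(assign_roles(cluster, pending_index=30, pending_query=10))
print(assign_roles(cluster, pending_index=0, pending_query=50))
```

Calling the function again as the queues change yields a new assignment, and any node whose role differs from before has effectively swapped.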
High-Level Component-Based DL Scalability

Split DL into components and spread across cluster.
  Services are Web-distributed.
Make services mobile and create replicas.
Performance improvement and better use of multiple cheap computers.
Moving digital archives to grids can deal with service provision scalability.
  See DILIGENT…
[Diagram labels: Registry; node1, node2; Resolver instances; Resolver]
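
A minimal sketch of how a registry and resolver could spread requests across replicated services. It assumes simple round-robin selection, which the slide does not specify; the `Registry` and `Resolver` classes here are hypothetical and are not taken from DILIGENT or any existing system:

```python
from itertools import cycle

class Registry:
    """Hypothetical registry recording which nodes host each service."""

    def __init__(self) -> None:
        self._replicas: dict[str, list[str]] = {}

    def register(self, service: str, node: str) -> None:
        """Record that `node` hosts a replica of `service`."""
        self._replicas.setdefault(service, []).append(node)

    def replicas(self, service: str) -> list[str]:
        """Return a copy of the nodes hosting `service`."""
        return list(self._replicas.get(service, []))

class Resolver:
    """Hypothetical resolver that hands out replicas in round-robin order."""

    def __init__(self, registry: Registry, service: str) -> None:
        nodes = registry.replicas(service)
        if not nodes:
            raise LookupError(f"no replicas registered for {service!r}")
        self._next_node = cycle(nodes)

    def resolve(self) -> str:
        """Return the node that should serve the next request."""
        return next(self._next_node)

registry = Registry()
registry.register("search", "node1")
registry.register("search", "node2")
resolver = Resolver(registry, "search")
print([resolver.resolve() for _ in range(4)])
```

Adding capacity in this sketch means registering another replica and creating a fresh resolver; a load-aware selection rule could replace round-robin without changing the interface.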
Web 2.0 – User Contributions

Make users provide as much information as possible.
Users managing content means less central management.
Greater scalability!
AJAX for DL Services

AJAX supports applications or services within the browser.
Move computation from server to client.
Greater scalability!
Scalable Preservation

XML is often used for Preservation while Databases are used for Access.
How do we make XML tools scalable? Can we?

Concluding Thoughts

Open Access and Digital Repositories must treat scalability as being as fundamental as preservation and access.
Realistically, we have not had major scalability problems yet, but we don’t have many Open Access systems either.
Google works because scalability is a primary concern.
  They intend to index the world’s information.
If we believe in similar ideals (like producing or curating the world’s information), we too must plan for scalability!
That’s all Folks!
direct all comments to:
hussein@cs.uct.ac.za