Regional Mega Scanning Centre IIIT Hyderabad

Digital Library Mega Scanning Centre
IIIT Hyderabad
Vamshi Ambati
Major Objectives of our Centre
 Digitizing to produce books of quality in quantity
 Development of core technologies needed for Digital Libraries
 Knowledge and Experience Dissemination
   Training
   Sharing resources
Progress
 Established centers at Osmania University, Telugu University, Salarjung Museum
 Content generation at SVDL, CCL, SCL
 Conducted a workshop for sharing resources and establishing common standards
 Generated content of about 32 Million Pages
 Content hosted at http://dli.iiit.ac.in

Effort Distribution of Digitization
 Scanning, 30%
 Image Processing, 20%
 Quality Assurance, 15%
 Identification, 15%
 OCR, 10%
 Metadata, 5%
 Web Enablement, 5%
Current Status
 170,000 books
 72,000 English books
 18 other languages
 http://dli.iiit.ac.in

Operations
 50 scanners
 15 centers
 300 people in all
Language Report
[Bar chart: RMSC BOOKS WISE REPORT. Number of books (0 to 80,000) per language: English, European Languages, Sanskrit, Marathi, Telugu, Hindi, Persian, Urdu, Tamil, Kannada, Arabic, Others.]
Source Library Report
[Bar chart: RMSC SOURCE LIBRARY BOOKS REPORT. Number of books (0 to 30,000) per source library: AOU, AP Text Books, CCL-HYD, EPW, FAO, Kansas, OUL, PSTU, State Archive, Salarjung Museum, SCL-HYD, Washington, TTD, Others.]
Scanning Centre Report
[Bar chart: RMSC SCANNING LOCATION BOOKS WISE REPORT. Number of books (0 to 35,000) per scanning centre. Legible centre labels include IIITH, PSTU, TTD and Others.]
Technologies: Research
 Content Search in Images
 Text Mining
 Cross Lingual Retrieval
 Summarization tools
 UniTrans: Universal Transliteration tool
   Languages: Arabic, Persian, Urdu, Assamese, Bengali, Tamil, Telugu, Kannada, Malayalam, Sanskrit, Hindi, Marathi
Technologies: Workflow

Workflow Tools
 Metadata creation, Structural metadata etc
 Server management
Image Processing
 Image Processing tools
 Plug-in
Server Management Tools
 Digital Library of India Portal

Rare Collections
 50 years of Andhra Pradesh State Legislature Proceedings (multilingual data)
 Rare Telugu classics (like Kalidasa’s work)
 Andhra Pradesh State Archive books (rare collection dating back to 1835)
 State Board of Education textbooks (1st to 10th grade)

Acknowledgements
Ajay Pannala, CEO Par Informatics
 C S N Mohan, CEO Thrinaina Ltd
 T N Sreenivas, CEO SV Infosys
 Bhuman Reddy, Director SVDL
 Kiran V K, Planning Director DLI
 Nadendla Manohar, MLA Tenali
 Rajeev Sangal, Director IIIT
 C. V. Jawahar, Professor IIIT

Thank you
Workshops held

Tools and Resources for DLI
 5th May to 7th May 2005
 36 participants
Research Challenges in DLI
 30th December 2006
 100 participants
Speakers and Dignitaries
 Raj Reddy, Pradeep Chopra, Sunil Alag, Yagna Narayana among others
Center Specific Technology
Search Similar Images based on Image Patterns
Problem
 Huge amount of content generated by DLI
 Users need to search the DLI
 Queries are generally text words
 Not all document images can currently be converted into text
 Can we match words in the image space by converting the query into an image?
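
The slides do not spell out the matching method, so the following is only a minimal sketch of one common word-spotting approach: describe each word image by per-column profile features and compare two words with dynamic time warping (DTW), which tolerates differences in width. It assumes the text query has already been rendered into a binary image; the names column_features, dtw_distance and word_distance are hypothetical.

```python
from typing import List, Tuple

# One string per pixel row; '#' marks ink, any other character is background.
Image = List[str]
Feature = Tuple[float, float, float]


def column_features(img: Image) -> List[Feature]:
    """Per-column features in [0, 1]: ink fraction, top ink row, bottom ink row."""
    height, width = len(img), len(img[0])
    feats: List[Feature] = []
    for x in range(width):
        ink_rows = [y for y in range(height) if img[y][x] == "#"]
        if ink_rows:
            feats.append((len(ink_rows) / height,
                          ink_rows[0] / height,
                          ink_rows[-1] / height))
        else:
            # Empty column: no ink, profiles placed at the vertical centre.
            feats.append((0.0, 0.5, 0.5))
    return feats


def dtw_distance(a: List[Feature], b: List[Feature]) -> float:
    """Dynamic time warping cost between two feature sequences, length-normalized."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = sum(abs(p - q) for p, q in zip(a[i - 1], b[j - 1]))
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def word_distance(img_a: Image, img_b: Image) -> float:
    """Smaller values mean the two word images look more alike."""
    return dtw_distance(column_features(img_a), column_features(img_b))
```

A candidate word would be accepted as a match when its distance to the rendered query falls below a threshold; that threshold is an assumption that would have to be tuned on real scanned pages.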

Challenges
Match two word images in the presence of
 Degradations
   Salt and Pepper noise
   Cuts and Breaks
   Blobs
   Erosion of Boundary pixels
 Print Variations
   Font Type
   Font Size
 Variability due to Language Cases
Proposed Solution
Results and Discussion

Partial Matching
Demo
Core Technologies for Digital Library
Workflow and Tools

Workflow Management
 Vendor Progress Tracking
 Report Generation tool
Server Management Tools
 Server uptime monitoring, Server cluster solution
Metadata Management tools
 Regular metadata, Structural metadata
Quality Assurance
 Online metadata verification and correction interface
 Centralized Duplicate Detection tool
 Image quality assurance tool (QualCheck)
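
The slides do not describe how the centralized duplicate detection tool works. As a rough illustration only, one way to flag probable duplicates is to normalize the title and author metadata and compare the results with a fuzzy string ratio. The record fields, the function names and the 0.9 threshold below are all assumptions, not the actual tool.

```python
import unicodedata
from difflib import SequenceMatcher
from typing import Dict


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    no_punct = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    return " ".join(no_punct.lower().split())


def is_probable_duplicate(rec_a: Dict[str, str], rec_b: Dict[str, str],
                          threshold: float = 0.9) -> bool:
    """Compare two metadata records (hypothetical 'title' and 'author' fields)."""
    key_a = normalize(rec_a["title"] + " " + rec_a["author"])
    key_b = normalize(rec_b["title"] + " " + rec_b["author"])
    return SequenceMatcher(None, key_a, key_b).ratio() >= threshold
```

Punctuation is removed by Unicode category rather than with a word-character pattern so that vowel signs in Indic scripts are preserved during normalization.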
Multilingual Information Retrieval

Cross Lingual Information Retrieval
 Universal Dictionary based
 Query expansion
   Explicit (user feedback) and Implicit (word frequency)
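
A minimal sketch of the two ideas above, under stated assumptions: the query is translated term by term through a bilingual dictionary, then expanded with the most frequent content words of a set of feedback documents. For explicit feedback these would be documents the user marked as relevant; for implicit feedback they could simply be the top-ranked results. The function names and the toy dictionary are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def translate_query(terms: Iterable[str],
                    bilingual_dict: Dict[str, List[str]]) -> List[str]:
    """Dictionary-based translation: each term maps to all listed translations."""
    translated: List[str] = []
    for term in terms:
        # Terms missing from the dictionary (e.g. names) are kept unchanged.
        translated.extend(bilingual_dict.get(term, [term]))
    return translated


def expand_query(query_terms: List[str], feedback_docs: Iterable[str],
                 stop_words: Set[str], k: int = 2) -> List[str]:
    """Append the k most frequent non-stop, non-query words from feedback docs."""
    counts = Counter(
        word
        for doc in feedback_docs
        for word in doc.lower().split()
        if word not in stop_words and word not in query_terms
    )
    return list(query_terms) + [word for word, _ in counts.most_common(k)]


# Toy romanized Telugu-to-English dictionary, for illustration only.
toy_dict = {"pustakam": ["book"], "granthalayam": ["library"]}
english_query = translate_query(["pustakam", "granthalayam"], toy_dict)
```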
Automatic Text Summarization

Summarization system for Telugu
 Frequency based
 Position based
 Most informative sentence identification
 Dictionary lookup
 Approximate String Matching to compensate for lack of Morph Analyzers
 Stop Word vs. Content Word identification
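
The items above can be combined into a small extractive summarizer. The sketch below is an illustration under assumptions, not the IIIT system: it scores each sentence by the average frequency of its content words plus a position bonus, and uses approximate string matching against a dictionary as a stand-in for a morphological analyzer. English text is used for readability; a Telugu system would need Telugu stop-word and dictionary lists. The function names, the 0.7 cutoff and the position weight are hypothetical.

```python
import difflib
from collections import Counter
from typing import List, Set


def canonical(word: str, vocabulary: List[str], cutoff: float = 0.7) -> str:
    """Map an inflected form to its closest dictionary word, if one is close enough."""
    match = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return match[0] if match else word


def summarize(sentences: List[str], stop_words: Set[str], vocabulary: List[str],
              n: int = 1, position_weight: float = 0.5) -> List[str]:
    """Return the n highest-scoring sentences, in their original order."""
    # Stop words are dropped; remaining content words are mapped to dictionary forms.
    tokens = [
        [canonical(w, vocabulary) for w in s.lower().split() if w not in stop_words]
        for s in sentences
    ]
    freq = Counter(w for sent in tokens for w in sent)
    scores = []
    for idx, sent in enumerate(tokens):
        freq_score = sum(freq[w] for w in sent) / max(len(sent), 1)
        pos_score = 1.0 / (idx + 1)  # earlier sentences get a larger bonus
        scores.append(freq_score + position_weight * pos_score)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(best)]
```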
Search and Indexing

Web Crawler
 Focused Crawling
 Incremental Crawling
 Crawls Telugu, Malayalam and Tamil web pages
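
As an illustration of the crawler items above, a focused crawler needs some way to decide whether a page is in one of the target languages, and an incremental crawler needs to know whether a page changed since the last visit. The sketch below makes two assumptions that are not stated in the slides: language is approximated by the dominant Unicode script block, and change is detected with a content hash. The function names are hypothetical.

```python
import hashlib
from typing import Dict, Optional

# Unicode blocks for the three scripts named on the slide.
SCRIPT_RANGES = {
    "Telugu": (0x0C00, 0x0C7F),
    "Malayalam": (0x0D00, 0x0D7F),
    "Tamil": (0x0B80, 0x0BFF),
}


def dominant_script(text: str, min_fraction: float = 0.5) -> Optional[str]:
    """Return the target script covering at least min_fraction of non-space characters."""
    counts = {name: 0 for name in SCRIPT_RANGES}
    total = 0
    for ch in text:
        if ch.isspace():
            continue
        total += 1
        for name, (low, high) in SCRIPT_RANGES.items():
            if low <= ord(ch) <= high:
                counts[name] += 1
                break
    if total == 0:
        return None
    best = max(counts, key=counts.get)
    return best if counts[best] / total >= min_fraction else None


def has_changed(url: str, page_text: str, seen: Dict[str, str]) -> bool:
    """Incremental crawling helper: True if the page is new or its content changed."""
    digest = hashlib.sha256(page_text.encode("utf-8")).hexdigest()
    changed = seen.get(url) != digest
    seen[url] = digest
    return changed
```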
Content Based Image Retrieval
 Addresses queries in multiple formats (sample image or text)
 Uses features such as color, texture to match images.
 Learns from user feedback.
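
To make the color-feature matching concrete, here is a minimal sketch using quantized RGB histograms compared by histogram intersection. It covers only the color feature; texture features and learning from user feedback are not shown. The representation of an image as a list of (r, g, b) tuples and all function names are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]  # (r, g, b), each 0-255


def color_histogram(pixels: List[Pixel], bins: int = 4) -> List[float]:
    """Normalized color histogram with `bins` levels per channel."""
    step = 256 / bins
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        # Quantize each channel, clamping to the last bin for safety.
        rq, gq, bq = (min(int(c / step), bins - 1) for c in (r, g, b))
        hist[rq * bins * bins + gq * bins + bq] += 1
    return [count / len(pixels) for count in hist]


def histogram_intersection(h1: List[float], h2: List[float]) -> float:
    """Similarity in [0, 1]; 1.0 means identical color distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def rank_by_color(query: List[Pixel], collection: Dict[str, List[Pixel]]) -> List[str]:
    """Image names sorted from most to least similar to the query image."""
    q_hist = color_histogram(query)
    return sorted(
        collection,
        key=lambda name: histogram_intersection(q_hist, color_histogram(collection[name])),
        reverse=True,
    )
```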

Search and Indexing

ITRANS based search for DLI servers
 Search on actual content as opposed to search on metadata
 Ability to extend for multiple languages
 Users can query in their native language; documents stored in ITRANS are converted to the native script on the fly
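
The on-the-fly conversion can be illustrated with a deliberately tiny ITRANS-to-Devanagari converter. This is a sketch under assumptions: it covers only a small subset of the ITRANS scheme (a few consonants and simple vowels), Devanagari is chosen as the example script, and the function names are hypothetical. It is not the DLI implementation.

```python
from typing import Dict, List, Optional

# Small illustrative subset of ITRANS; the full scheme has many more symbols.
CONSONANTS = {
    "k": "\u0915", "kh": "\u0916", "g": "\u0917", "t": "\u0924", "d": "\u0926",
    "n": "\u0928", "p": "\u092A", "b": "\u092C", "m": "\u092E", "y": "\u092F",
    "r": "\u0930", "l": "\u0932", "v": "\u0935", "s": "\u0938", "h": "\u0939",
}
INDEPENDENT_VOWELS = {
    "a": "\u0905", "aa": "\u0906", "i": "\u0907", "ii": "\u0908",
    "u": "\u0909", "uu": "\u090A", "e": "\u090F", "o": "\u0913",
}
# Dependent vowel signs; the inherent 'a' needs no sign.
VOWEL_SIGNS = {
    "a": "", "aa": "\u093E", "i": "\u093F", "ii": "\u0940",
    "u": "\u0941", "uu": "\u0942", "e": "\u0947", "o": "\u094B",
}
VIRAMA = "\u094D"  # suppresses the inherent vowel of a consonant


def _match(text: str, pos: int, table: Dict[str, str]) -> Optional[str]:
    """Longest key of `table` (two characters, then one) starting at `pos`."""
    for length in (2, 1):
        piece = text[pos:pos + length]
        if piece in table:
            return piece
    return None


def itrans_to_devanagari(text: str) -> str:
    out: List[str] = []
    i = 0
    while i < len(text):
        cons = _match(text, i, CONSONANTS)
        if cons is not None:
            out.append(CONSONANTS[cons])
            i += len(cons)
            vowel = _match(text, i, VOWEL_SIGNS)
            if vowel is not None:
                out.append(VOWEL_SIGNS[vowel])
                i += len(vowel)
            else:
                out.append(VIRAMA)  # consonant not followed by a vowel
            continue
        vowel = _match(text, i, INDEPENDENT_VOWELS)
        if vowel is not None:
            out.append(INDEPENDENT_VOWELS[vowel])
            i += len(vowel)
        else:
            out.append(text[i])  # spaces, digits, unsupported symbols pass through
            i += 1
    return "".join(out)


def search_itrans_store(query_native: str, docs_itrans: List[str]) -> List[str]:
    """Stored ITRANS documents whose native-script rendering contains the query."""
    return [doc for doc in docs_itrans if query_native in itrans_to_devanagari(doc)]
```

Converting at query time keeps a single ASCII-based store while letting the same text be rendered in different scripts, which is one plausible reading of the extensibility point above.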
Multimodal Multimedia Tools

Book Reading Interface
 Developed TIFF Plugin (released open source)
 Image Server for ‘on the fly’
   Format conversions
   Resolution conversion
   Thumbnail generation etc
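
As a small illustration of the resolution conversion and thumbnail generation steps, the sketch below downsamples a grayscale image by averaging square blocks of pixels. It assumes the page has already been decoded into rows of 0-255 intensity values; a real image server would rely on an imaging library and handle TIFF decoding, which is not shown. The function name is hypothetical.

```python
from typing import List


def make_thumbnail(pixels: List[List[int]], factor: int) -> List[List[int]]:
    """Downscale a grayscale image by an integer factor using block averaging."""
    height = len(pixels) // factor
    width = len(pixels[0]) // factor
    thumb: List[List[int]] = []
    for ty in range(height):
        row = []
        for tx in range(width):
            # Collect the factor x factor block that maps to this output pixel.
            block = [
                pixels[ty * factor + dy][tx * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            ]
            row.append(sum(block) // len(block))
        thumb.append(row)
    return thumb
```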
Speech Interface
 Plugin for IE and Firefox for Reading a Book
 Text To Speech System (developed at IIIT using the CMU Festvox toolkit)

Tools for Download

Tools available for download at http://dli.iiit.ac.in/download.html