Digital Library Mega Scanning Centre
IIIT Hyderabad
Vamshi Ambati

Major Objectives of our Centre
- Digitizing to produce books of quality in quantity
- Development of core technologies needed for Digital Libraries
- Knowledge and Experience Dissemination
  - Training
  - Sharing resources

Progress
- Established centers at Osmania University, Telugu University, Salarjung Museum
- Content generation at SVDL, CCL, SCL
- Conducted a workshop for sharing resources and establishing common standards
- Generated content of about 32 Million Pages
- Content hosted at http://dli.iiit.ac.in

Effort Distribution of Digitization
- Web Enablement: 5%
- Quality Assurance: 15%
- Identification: 15%
- Metadata: 5%
- OCR: 10%
- Image Processing: 20%
- Scanning: 30%

Current Status
- 170,000 books
  - 72,000 English books
  - 18 other languages
- http://dli.iiit.ac.in
- Operations
  - 50 scanners
  - 15 centers
  - 300 people in all

Language Report
[Bar chart "RMSC Books Wise Report": number of books per language, vertical axis 0 to 80,000. Languages shown: English, European Languages, Sanskrit, Marathi, Telugu, Hindi, Persian, Urdu, Tamil, Kannada, Arabic, Others.]

Source Library Report
[Bar chart "RMSC Source Library Books Report": number of books per source library, vertical axis 0 to 30,000. Libraries shown: AOU, AP Text Books, CCL-HYD, EPW, FAO, Kansas, OUL, PSTU, State Archive, Salarjung Museum, SCL-HYD, Washington, TTD, Others.]

Scanning Centre Report
[Bar chart "RMSC Scanning Location Books Wise Report": number of books per scanning centre, vertical axis 0 to 35,000.]

Technologies: Research
- Content Search in Images
- Text Mining
- Cross Lingual Retrieval
- Summarization tools
- UniTrans: Universal Transliteration tool
  - Languages: Arabic, Persian, Urdu, Assamese, Bengali, Tamil, Telugu, Kannada, Malayalam, Sanskrit, Hindi, Marathi

Technologies: Workflow
- Workflow Tools
- Metadata creation, structural metadata, etc.
- Server management
- Image Processing
- Image Processing tools
- Plug-in
- Server Management Tools

Digital Library of India Portal

Rare Collections
- 50 years of Andhra Pradesh State Legislature Proceedings (multilingual data)
- Rare Telugu classics (like Kalidasa’s work)
- Andhra Pradesh State Archive books (rare collection as old as 1835)
- Text Books: State Board of Education (1st to 10th grade)

Acknowledgements
- Ajay Pannala, CEO, Par Informatics
- C S N Mohan, CEO, Thrinaina Ltd
- T N Sreenivas, CEO, SV Infosys
- Bhuman Reddy, Director, SVDL
- Kiran V K, Planning Director, DLI
- Nadendla Manohar, MLA, Tenali
- Rajeev Sangal, Director, IIIT
- C. V. Jawahar, Professor, IIIT

Thank you

Workshops held
- Tools and Resources for DLI
- Research Challenges in DLI (5th May to 7th May 2005)
- 36 participants
- 30th December 2006
- 100 participants
- Speakers and dignitaries: Raj Reddy, Pradeep Chopra, Sunil Alag, Yagna Narayana, among others

Center Specific Technology
Search Similar Images based on Image Patterns

Problem
- Huge amount of content generated by DLI
- Search the DLI
- Query is generally in the form of a text word
- Currently cannot convert all document images into text
- Can we match words in the image space by converting the query into an image?
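The question of matching words in image space can be illustrated with a small sketch. The slides do not describe the matching method itself, so what follows is an assumption chosen for illustration: each word image is reduced to its per-column ink counts, and two such profiles are compared with dynamic time warping (DTW), which tolerates horizontal stretching such as a change in font size. The names `to_image`, `column_profile`, `dtw_distance` and `word_distance` are hypothetical, and the tiny "images" are made up.

```python
def to_image(rows):
    """Convert rows of '#' (ink) and '.' (background) into a binary 2D list."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


def column_profile(image):
    """Count the ink pixels in each column of a binary image."""
    height = len(image)
    width = len(image[0])
    return [sum(image[r][c] for r in range(height)) for c in range(width)]


def dtw_distance(a, b):
    """DTW cost between two numeric sequences, divided by their combined length."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed alignments.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def word_distance(image_a, image_b):
    """Illustrative dissimilarity between two word images (assumed feature: column ink counts)."""
    return dtw_distance(column_profile(image_a), column_profile(image_b))


# Made-up examples: a query shape, the same shape printed wider, and a different shape.
query = to_image(["#..#", "####", "#..#"])
stretched = to_image(["##..##", "######", "##..##"])
other = to_image(["####", "#..#", "####"])

print(word_distance(query, stretched))
print(word_distance(query, other))
```

In this toy setting the stretched copy aligns with the query at a lower cost than the different shape does. A single column feature like this would not cope with the degradations and font-type variations that a real system must handle.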
Challenges
- Match two word images in the presence of:
  - Degradations
    - Salt and pepper noise
    - Cuts and breaks
    - Blobs
    - Erosion of boundary pixels
  - Print variations
    - Font type
    - Font size
  - Variability due to language
    - Cases

Proposed Solution

Results and Discussion

Partial Matching

Demo

Core Technologies for Digital Library

Workflow and Tools
- Workflow Management
  - Vendor Progress Tracking
  - Report Generation tool
- Server Management Tools
  - Server uptime monitoring, server cluster solution
- Metadata Management tools
  - Regular metadata, structural metadata
- Quality Assurance
  - Online metadata verification and correction interface
  - Centralized Duplicate Detection tool
  - Image quality assurance tool (QualCheck)

Multilingual Information Retrieval
- Cross Lingual Information Retrieval
  - Universal Dictionary based
  - Query expansion: explicit (user feedback) and implicit (word frequency)
- Automatic Text Summarization
  - Summarization system for Telugu
  - Frequency based
  - Position based
  - Most informative sentence identification
- Dictionary lookup
  - Approximate String Matching to compensate for lack of Morph Analyzers
  - Stop Word vs. Content Word identification

Search and Indexing
- Web Crawler
  - Focused Crawling
  - Incremental Crawling
  - Crawls Telugu, Malayalam and Tamil web pages
- Content Based Image Retrieval
  - Addresses queries in multiple formats (sample image or text)
  - Uses features such as color and texture to match images
  - Learns from user feedback
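The color-feature matching mentioned for Content Based Image Retrieval can be sketched as a histogram comparison. This is a minimal illustration under stated assumptions, not the centre's implementation: images are given as non-empty lists of `(r, g, b)` tuples, color is quantized to 4 levels per channel, and similarity is histogram intersection. Texture features and learning from user feedback are left out. All function names and the sample data are hypothetical.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalized color histogram of a non-empty list of (r, g, b) pixels (0-255 each)."""
    step = 256 // bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        index = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[index] += 1.0
    total = float(len(pixels))
    return [count / total for count in hist]


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: 1.0 for identical normalized histograms, 0.0 for disjoint ones."""
    return sum(min(x, y) for x, y in zip(h1, h2))


def rank_by_color(query_pixels, collection):
    """Return the image names in `collection`, most similar to the query first."""
    query_hist = color_histogram(query_pixels)
    scored = [
        (histogram_intersection(query_hist, color_histogram(pixels)), name)
        for name, pixels in collection.items()
    ]
    return [name for _score, name in sorted(scored, reverse=True)]


# Made-up data: a red query image and two candidate images.
query_image = [(250, 10, 10)] * 10
collection = {
    "mostly_red": [(240, 20, 20)] * 8 + [(10, 10, 250)] * 2,
    "blue": [(10, 10, 250)] * 10,
}
print(rank_by_color(query_image, collection))
```

Coarse quantization keeps near-identical shades in the same bin, which is why the two slightly different reds in the example still match.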
Search and Indexing
- ITRANS based search for DLI servers
  - Search on actual content as opposed to search on metadata
  - Ability to extend to multiple languages
  - Allows users to query in their native languages and converts the documents, which are actually stored in ITRANS, to the native language on the fly

Multimodal Multimedia Tools
- Book Reading Interface
  - Developed TIFF Plugin (released as open source)
  - Image Server for on-the-fly:
    - Format conversions
    - Resolution conversion
    - Thumbnail generation, etc.
- Speech Interface
  - Plugin for IE and Firefox for reading a book
  - Text To Speech system (developed at IIIT using the Festvox toolkit from CMU)

Tools for Download
- Tools available for download at http://dli.iiit.ac.in/download.html
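The on-the-fly conversion of ITRANS text to a native script, described under ITRANS based search, can be sketched as a table-driven transliterator. This is a rough illustration, not the DLI code: the tables cover only a small subset of ITRANS, Devanagari is chosen purely as an example target script, and the names `itrans_to_devanagari` and `match_longest` are hypothetical.

```python
# Assumed, heavily reduced ITRANS tables; a full table has many more symbols.
CONSONANTS = {
    "kh": "ख", "k": "क", "g": "ग", "t": "त", "d": "द", "n": "न",
    "p": "प", "b": "ब", "m": "म", "y": "य", "r": "र", "l": "ल",
    "v": "व", "s": "स", "h": "ह",
}
INDEPENDENT_VOWELS = {
    "aa": "आ", "ai": "ऐ", "au": "औ", "ii": "ई", "uu": "ऊ",
    "a": "अ", "i": "इ", "u": "उ", "e": "ए", "o": "ओ",
}
# Dependent vowel signs, written as escapes because they are combining marks.
VOWEL_SIGNS = {
    "aa": "\u093e", "ai": "\u0948", "au": "\u094c", "ii": "\u0940", "uu": "\u0942",
    "a": "",  # the short 'a' is inherent in the consonant letter
    "i": "\u093f", "u": "\u0941", "e": "\u0947", "o": "\u094b",
}
VIRAMA = "\u094d"  # suppresses the inherent vowel of a consonant


def match_longest(text, pos, table):
    """Return the longest key of `table` that starts at text[pos], or None."""
    for length in (2, 1):
        key = text[pos:pos + length]
        if key in table:
            return key
    return None


def itrans_to_devanagari(text):
    """Convert a string in the ITRANS subset above to Devanagari."""
    out = []
    pos = 0
    while pos < len(text):
        consonant = match_longest(text, pos, CONSONANTS)
        if consonant is not None:
            out.append(CONSONANTS[consonant])
            pos += len(consonant)
            vowel = match_longest(text, pos, VOWEL_SIGNS)
            if vowel is not None:
                out.append(VOWEL_SIGNS[vowel])
                pos += len(vowel)
            else:
                # No vowel follows, so the consonant carries no inherent 'a'.
                out.append(VIRAMA)
            continue
        vowel = match_longest(text, pos, INDEPENDENT_VOWELS)
        if vowel is not None:
            out.append(INDEPENDENT_VOWELS[vowel])
            pos += len(vowel)
        else:
            out.append(text[pos])  # pass spaces, digits and unknown symbols through
            pos += 1
    return "".join(out)


print(itrans_to_devanagari("raama"))
print(itrans_to_devanagari("namaste"))
```

Storing one romanized form and converting at display time is what makes the approach extensible: supporting another script in this sketch would mean supplying another set of tables rather than re-indexing the stored text.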