Content Level Access to Digital Library of India Pages Praveen Krishnan, Ravi Shekhar, C.V. Jawahar CVIT, IIIT Hyderabad IIIT Hyderabad Digital Library of India (DLI) http://www.dli.iiit.ac.in/ Vision : To enhance access to information and knowledge to masses. • Partner to Million Book Universal Digital Library Programme. Information for people Dataset for researchers IIIT Hyderabad Vamshi Ambati, N.Balakrishnan, Raj Reddy, Lakshmi Pratha, C V Jawahar: The Digital Library of India Project: Process, Policies and Architecture, ICDL , 2007. Digital Library of India (DLI) Vision : To enhance access to information and knowledge to masses. Content Languages Statistics IIIT Hyderabad • #Books 4 Lakhs • 41 different languages • #Pages 134 Million • Includes - Hindi, Telugu, Marathi.. • #Words 26 Billion - English, French, Greek.. Source: http://www.new1.dli.ernet.in/ Digital Library of India (DLI) Meta data search • Supports Meta data based search. • No Content Level Access Indian freedom struggle and independence Search IIIT Hyderabad Digital Library of India (DLI) • Need Content Level Access • Content + Meta Data Indian freedom struggle and independence Search IIIT Hyderabad Digital Library of India (DLI) • Need Content Level Access • Content + Meta Data Indian freedom struggle and independence Search Reliable Text Representation ? IIIT Hyderabad Goal Digital Library of India Search • Build a search engine with support for Indian languages. • Word Spotting IIIT Hyderabad Goal Indian Language Document Search Engine Text Query Support खोज Page 1 IIIT Hyderabad Goal Indian Language Document Search Engine शिवाजी और मराठा साम्राज्य खोज Multi Keyword Support Page 1 IIIT Hyderabad Goal Indian Language Document Search Engine शिवाजी और मराठा साम्राज्य खोज Ranks based on # Occurrences Page 1 IIIT Hyderabad Goal Indian Language Document Search Engine शिवाजी और मराठा साम्राज्य खोज Semantically Related Words Page 1 IIIT Hyderabad Goal Indian Language Document Search Engine शिवाजी और मराठा साम्राज्य खोज Seamless scaling to billions of word images. Sub second retrieval Page 1 IIIT Hyderabad Text from OCR Hindi Page Telugu Page IIIT Hyderabad - Hindi: Title - Praachin Bhaartiy Vichaar Aur Vibhutiyaan, Published in 1624 - Telugu: Title - Andhra Vagmayaramba Dasha, Published in 1960 Text from OCR Hindi Page IIIT Hyderabad Cuts Telugu Page Text from OCR Hindi Page IIIT Hyderabad Merges Cuts Telugu Page Text from OCR Hindi Page Telugu Page IIIT Hyderabad Variations in Script,Cuts Font and Typesetting. Text from OCR Char % Hindi Telugu IIIT Hyderabad [1 ] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati, “Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts,” in ICDAR MOCR Workshop, 2011. Text from OCR Word % Hindi Telugu IIIT Hyderabad [1 ] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati, “Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts,” in ICDAR MOCR Workshop, 2011. Text from OCR Search % Hindi Telugu IIIT Hyderabad BoVW for Image Retrieval Text Retrieval Image Recognition Query Image Ranked Retrieved Results IIIT Hyderabad Josef Sivic, Andrew Zisserman: Video Google: A Text Retrieval Approach to Object Matching in Videos. ICCV 2003 BoVW for Image Retrieval • Fixed Length Representation • Invariant to popular deformation Query Image Ranked Retrieved Results IIIT Hyderabad Josef Sivic, Andrew Zisserman: Video Google: A Text Retrieval Approach to Object Matching in Videos. ICCV 2003 BoVW for Document Image Retrieval IIIT Hyderabad R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012. BoVW for Document Image Retrieval Histogram of Visual Words IIIT Hyderabad R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012. BoVW for Document Image Retrieval Cuts IIIT Hyderabad R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012. BoVW for Document Image Retrieval Cuts Histogram of Visual Words IIIT Hyderabad R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012. BoVW for Document Image Retrieval Merges IIIT Hyderabad R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012. BoVW for Document Image Retrieval Merges Histogram of Visual Words IIIT Hyderabad R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012. BoVW for Document Image Retrieval • Robust against degradation • Lost Geometry • Use Spatial Verification – SIFT based. – Longest Subsequence alignment. y 1 0.5 Clean 0 0.5 IIIT Hyderabad V 1 1 V 2 V 6 1.5 Cuts 2 V 4 V 4 2.5 V 8 3 V 9 x Merge R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012. I. Z. Yalniz and R. Manmatha. An Efficient Framework for Searching Text in Noisy Document Images. In DAS, 2012 Query Expansion Querying Database Query Image Rank 1 Rank 2 Histogram Rank 3 Rank 4 Rank 5 Rank 6 IIIT Hyderabad Refined Histogram Query Expansion Querying Database Query Image Rank 1 Rank 2 Query Histogram Rank 3 Rank 4 Better Results Rank 5 Rank 6 IIIT Hyderabad Text Query Support • Originally formulated in a “query by example” setting. Input Query Image Histogram IIIT Hyderabad Text Query Support • Originally formulated in a “query by example” setting. • Need Text Queries Input Text Query IIIT Hyderabad Text Query Histogram Observations • Are the results of OCR and BoVW complementary? IIIT Hyderabad BoVW OCR OCR BoVW Observations mAP • mAP v/s Word Length IIIT Hyderabad No. of Characters Observations • “OCR system has a high precision while BoVW approach has a high recall.” • Example: #GT = 5 OCR Out List; Precision = 1 ; Recall = 0.4 BoVW Out List; Precision = 0.8 ; Recall = 1 IIIT Hyderabad Fusion • Fusion Techniques:• Naïve Fusion mAP Chart OCR IIIT Hyderabad Fusion • Fusion Techniques:• Naïve Fusion mAP Chart BoVW IIIT Hyderabad Fusion • Fusion Techniques:• Naïve Fusion Concatenating OCR Results with BoVW OCR BoVW mAP Chart IIIT Hyderabad Fusion • Fusion Techniques:• Edit Distance Based Fusion OCR BoVW mAP Chart IIIT Hyderabad Fusion • Fusion Techniques:• Edit Distance Based Fusion • Reordering BoVW • BoVW score • Modified Edit distance cost BoVW mAP Chart IIIT Hyderabad Fusion • Fusion Techniques:• Edit Distance Based Fusion • Reordering BoVW • BoVW score • Modified Edit distance cost BoVW mAP Chart IIIT Hyderabad Fusion • Fusion Techniques:• Edit Distance Based Fusion OCR BoVW mAP Chart IIIT Hyderabad Fusion • Fusion Techniques:• Hybrid Fusion OCR BoVW mAP Chart IIIT Hyderabad Fusion • Fusion Techniques:• Hybrid Fusion mAP Chart • Re-querying BoVW using • OCR retrieved results. • Using rank aggregation techniques BoVW IIIT Hyderabad Fusion • Fusion Techniques:• Hybrid Fusion mAP Chart • Re-querying BoVW using • OCR retrieved results. • Using rank aggregation techniques BoVW IIIT Hyderabad Fusion • Fusion Techniques:• Hybrid Fusion OCR BoVW mAP Chart IIIT Hyderabad Experimental Results IIIT Hyderabad Experimental Details • OCR [1] • Feature Detector – Harris Interest point detection. [2] • Feature Descriptor – SIFT [2] • Indexing – Lucene [3] IIIT Hyderabad [1 ] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati, “Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts,”in ICDAR MOCR Workshop, 2011. [2] http://www.vlfeat.org [3] http://lucene.apache.org/ Test Bed Sample Word Images Language #Books #Pages #Words #Annotation Hindi (HS1) 11 1000 362,593 Yes Hindi (HS2) 52 10,196 4,290,864 No Telugu (TS1) 11 1000 161,276 Yes Telugu (TS2) 69 13,871 2,531,069 No DLI Corpus IIIT Hyderabad • In addition, we used HP1 & TP1 fully annotated dataset Evaluation Measures • Precision • Recall • TP = True Positive FP = False Positive FN = False Negative mAP (Mean Average Precision) Mean of the area under the precision recall curve for all the queries. • Precision @ 10 Shows how accurate top 10 retrieved IIIT Hyderabad results are. Precision-Recall Curve BoVW Search Language #Query BoVW + Query Expansion mAP Prec@10 mAP Prec@10 Hindi (HP1) 100 62.54 81.30 66.09 83.86 Telugu (TP1) 100 71.13 78 73.08 79.89 Comparison of naïve BoVW with BoVW + Query Expansion IIIT Hyderabad BoVW Search Language #Query BoVW using Text Queries mAP Prec@10 mAP Prec@10 Hindi (HP1) 100 62.54 81.30 56.32 73.89 Telugu (TP1) 100 71.13 78 69.06 78.83 Comparison of naïve BoVW with BoVW + Text Query Support IIIT Hyderabad Naïve Language #Query Edit Distance Hybrid mAP Prec@10 mAP Prec@10 mAP Prec@10 Hindi (HP1) 100 75.66 90.7 79.58 90.8 80.37 91.4 Telugu (TP1) 100 76.02 81.2 78.01 81.4 80.23 83.7 Comparative performance of different fusion techniques on HP1 & TP1 IIIT Hyderabad OCR Language #Query BoVW Fusion mAP Prec@10 mAP Prec@10 mAP Prec@10 Hindi (HS1) 100 14.95 62.60 60.55 95.5 68.81 95.6 Telugu (TS1) 100 27.03 62.10 74.38 90.6 78.41 91.9 Performance statistics on DLI Annotated Corpus IIIT Hyderabad Language Hindi (HS2) Telugu (TS2) #Query 50 50 Precision @ N OCR BoVW Fusion Prec@10 82.03 96.94 97.11 Prec@20 75.16 94.83 95.42 Prec@30 71.12 92.82 93.16 Prec@10 90.85 99.14 99.14 Prec@20 85.42 98.00 98.85 Prec@30 80.76 96.38 96.57 Performance statistics on DLI Un-Annotated Corpus IIIT Hyderabad Retrieved Results IIIT Hyderabad Retrieved Results IIIT Hyderabad Failure Cases • The word images shown in the figure fails in both OCR and BoVW. • Reason: – (a) Word Image smaller in length and containing a character not used these days. IIIT Hyderabad – (b) A highly degraded word image. Implementation Details • Search Engine Development – An elegant web based search and retrieval interface. No of Images Time in milliseconds Lucene Scalability IIIT Hyderabad Sample Retrieved Page No of Visual Words Search Architecture (Ongoing) Search Query Ranked Results Delegator Partial Scores FUSION Query Expansion Ranking OCR BoVW IIIT Hyderabad OCR Index Web Service BoVW Index Web Service Web Service Ongoing Work • Learn to improve from annotated dataset – Use of visual confusion matrix to improve BoVW results from annotated datasets. • Necessity of Costly Features for Re-ranking – The images shows in failure cases would require costly features to show up. – Use of machine learning algorithms. IIIT Hyderabad • Exploration of features better than SIFT. Thank You IIIT Hyderabad