MOVIE SUBTITLES EXTRACTION
Final Project Presentation – Part 2
Student: Elad Hoffer
Advisor: Eli Appleboim

INTRODUCTION TO SUBTITLE RECOGNITION
The project's objective is to extract and recognize subtitles embedded within video. While the first part demonstrated the algorithm using a Matlab implementation, the current part focuses on delivering a working system using OpenCV.
Main obstacles:
- Distinguishing text from the background.
- Recognizing text reliably.
- Allowing a real-time implementation.

ALGORITHM IMPLEMENTED
[Flowchart] The processing loop:
1. Acquire a frame from the video.
2. Identify text; filter unwanted noise and image artifacts.
3. Detect text using the Tesseract OCR engine, supported by dictionary and language data.
4. Compare text detections and keep the best match.
5. New subtitle? If FALSE (same subtitle), repeat from step 1 to search for a better match; if TRUE (new subtitle), output the best identified subtitle text to file and repeat the loop for the new subtitle.

STEP 1 – DISTINGUISH TEXT FROM IMAGE
Black-and-white thresholding is a crucial step for the algorithm, as recognition heavily depends on clean text. Different approaches were examined:
- Global and local thresholding using Otsu's method
- Text-specific thresholding – Sauvola, Niblack
- Adaptive local Gaussian thresholding
- Stroke Width Transform (SWT)
There is no clear winner – performance is very content dependent. Adaptive Gaussian thresholding seems to give the best all-around results; SWT is interesting, but far too slow for our purpose.

STEP 1 – DISTINGUISH TEXT FROM IMAGE (CONT.)
After binarization, the image is segmented into objects.
Noted obstacle: OpenCV does not contain a connected-component analysis implementation, so the cvBlob library was used as a replacement, and other necessary functions were added to it.
Noise and non-text objects are then filtered out. Text is identified by:
- Size – height, width, and area parameters.
- Color – subtitles are white and bright.
- Location and alignment – text is displayed over one or two horizontal lines.
- Proximity – objects remote from other text objects are removed.
- Containment – a text object is not contained within any other text object.
Special consideration is given to punctuation marks and other language-dependent markings.

STEP 2 – DETECT TEXT FROM SUBTITLE IMAGE
OCR engine used: Tesseract by Google (http://code.google.com/p/tesseract-ocr/)
- High success rate
- Multiple languages
- Open source
Tesseract expects a "clean" text image – no overlay, and preferably little noise:
- The image must be thoroughly cleaned of any objects that are not part of the subtitle.
- Non-English languages are even more sensitive.
Big improvement over the Matlab implementation – Tesseract's full API can now be used:
- User-designed dictionaries: DAWGs (directed acyclic word graphs) are created from word lists.
- All of Tesseract's features and "switches" are available.
- Per-word confidence can be obtained.

STEP 3 – DETECT CHANGE OF SUBTITLES
The average subtitle appears for about 3-4 seconds. Consequences:
- The output must not contain multiple detections of the same content.
- A large number of frames contain no subtitle at all.
- Multiple frames of the same subtitle are informative: some frames are noisier than others, and the main image content can change in ways that better suit our algorithm.
Change detection uses our cleaned output image (the same input fed into the OCR engine), as sketched below:
- Decide according to the correlation between consecutive cleaned images.
- Fast and reliable – our input images are binary matrices.
- The threshold was heuristically set at ~0.5.
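Since the cleaned subtitle images are binary, the correlation check reduces to a normalized dot product. Below is a minimal sketch of such a comparison, written against the modern OpenCV C++ API rather than the C API the project used; the function names, the assumption that both images have already been brought to the same size, and the threshold handling are illustrative, not the project's actual code.

```cpp
#include <cmath>
#include <opencv2/core.hpp>

// Normalized cross-correlation between two equal-sized binary images
// (pixel values are 0 or 255). Returns a value in [0, 1]; values above
// the heuristic threshold (~0.5) are treated as "same subtitle".
double subtitleSimilarity(const cv::Mat& a, const cv::Mat& b) {
    CV_Assert(a.size() == b.size() && a.type() == CV_8UC1 && b.type() == CV_8UC1);
    cv::Mat af, bf;
    a.convertTo(af, CV_32F, 1.0 / 255.0);   // map {0,255} -> {0,1}
    b.convertTo(bf, CV_32F, 1.0 / 255.0);
    double num = af.dot(bf);                // count of overlapping text pixels
    double den = std::sqrt(af.dot(af) * bf.dot(bf));
    return den > 0.0 ? num / den : 0.0;     // empty frames correlate to 0
}

bool isSameSubtitle(const cv::Mat& prev, const cv::Mat& curr) {
    return subtitleSimilarity(prev, curr) > 0.5;  // heuristic threshold
}
```

Because the inputs are binary masks, this is both fast (two passes over small images) and robust to the per-frame noise that the later averaging step smooths out.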
STEP 4 – IMPROVE SUCCESS RATE USING MULTIPLE FRAMES
For each subtitle we can have a large number of possible detections (4 seconds of subtitles at 30 fps gives ~120 frames). Frames containing the same subtitle are averaged to reduce noise.
How can we compare different detections and choose the best one? Candidate metrics:
- Number of valid words (the metric used in the 1st part).
- Confidence level reported by our OCR engine.
- Grammatically correct sentences.
- Other text analysis – an attempt was made to "autocorrect" words, but it proved unreliable and slow.
The metric was modified relative to the 1st part – we now check the OCR confidence level for each subtitle:
- It can be tweaked using a "non-dictionary" penalty.
- A modified dictionary was created for use with Tesseract.

STEP 4 – IMPROVE SUCCESS RATE USING MULTIPLE FRAMES – AVERAGE
[Figure: averaging frames that contain the same subtitle reduces background noise.]

STEP 4 – IMPROVE SUCCESS RATE USING MULTIPLE FRAMES – RANK & CHOOSE
Example – two candidate detections of the same French subtitle with their confidence scores:
- "comme d ia colère et e laïfureur." – confidence 83
- "comme de la colère et de la fureur." – confidence 87
The higher-confidence candidate is chosen; a sketch of this scoring follows the statistics below.

STATISTICS AND PERFORMANCE
Success rate: 80-85% per word.
Processing time: real-time capable – at most 1x the video length, and often much shorter depending on content (frames without text are skipped). It can be greatly improved further, as very little of the work is currently parallelized.
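To make the rank-and-choose step concrete, here is a minimal sketch of scoring one cleaned candidate image with Tesseract's mean word confidence (the same 0-100 scale as the 83 vs. 87 example above). It assumes an already-initialized TessBaseAPI; all function and type names are illustrative, and the real system additionally applies the non-dictionary penalty and the custom dictionary described earlier.

```cpp
#include <string>
#include <tesseract/baseapi.h>

// One candidate detection of a subtitle: the recognized text and
// Tesseract's mean word confidence (0-100).
struct Candidate {
    std::string text;
    int confidence;
};

// Run Tesseract on a cleaned, 8-bit binarized subtitle image supplied
// as a raw pixel buffer, and return the text with its confidence.
Candidate scoreSubtitle(tesseract::TessBaseAPI& ocr,
                        const unsigned char* pixels,
                        int width, int height, int bytesPerLine) {
    ocr.SetImage(pixels, width, height, /*bytes_per_pixel=*/1, bytesPerLine);
    char* raw = ocr.GetUTF8Text();          // triggers recognition
    Candidate c{raw ? raw : "", ocr.MeanTextConf()};
    delete[] raw;                           // caller owns the returned string
    return c;
}

// Keep the higher-confidence candidate across frames of one subtitle,
// e.g. "...de la fureur." at 87 beats the noisier detection at 83.
Candidate best(const Candidate& a, const Candidate& b) {
    return a.confidence >= b.confidence ? a : b;
}
```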