Future Star Smart Technology Camp: Concept Proposal

Project Title: Improving Automated Character Recognition Using Bidirectional Attention Models
Keywords: Neural Network, Attention Mechanism, OCR

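一、Abstract: In recent years, OCR (optical character recognition) has become increasingly common on the market, yet the accuracy of many OCR programs is still strongly affected by how clear the image is. I therefore hope to develop a system that can recognize both clear and unclear character images. The guiding principle of the system is to "improve the recognition of unclear words while maintaining accuracy on clear words, under minimum error from the true sequence." To this end, I use an attention mechanism model built on a recurrent neural network (RNN) structure to further optimize the plausibility and accuracy of the probability sequence in each character space. In short, when computing the final probability sequence for each character space, the system first computes the probability of every possible word sequence (that is, the product of the character-recognition probabilities), multiplies each one by the plausibility probability sequence that the attention mechanism produces for that word combination, and finally sums all the products. The output should match the original intent: for clear character images, the final optimized sequence will be extremely close to the probability sequence produced by character recognition, whereas for less clear character images the final sequence will differ from it to an appropriate degree. These differences only correct the recognition of unclear characters and still keep the minimum difference from the true sequence. Besides recognizing pure-text images, the concept of this system can also be applied to other kinds of image recognition, license plate recognition, advanced data acquisition and processing, and smart monitoring systems.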

二、Content:
(一) 、Introduction
Nowadays, people routinely use OCR to convert images containing characters into various types of documents and to access the text within. However, in many cases the characters in an image have low resolution because of hardware or accessibility problems. As a consequence, the conversion process becomes hard and ambiguous. I have encountered these problems in many OCR programs; thus, I decided to use an attention model to help the system handle both clear and unclear characters or words by choosing the best-fit output.
(二) 、Methods and Process
Since my goal is to improve the accuracy of unclear character recognition while maintaining that of clear characters, under minimum difference from the real sequence, I use an attention model to translate the probability sequences into the final predicted sequence in the transcription layer. As in common character recognition systems, the image passes through a CNN (convolutional neural network) and an RNN (recurrent neural network) with an attention modeling layer, and becomes a series of probability sequences. (Sang, Dinh Viet, and Le Tran Cuong. 2) More specifically, every probability sequence has passed through the softmax function, and each one represents the probabilities of the possible letters or numerals being the true character in that particular character space. (Shi, et al. 2) Conventionally, the system would end here by selecting the candidate with the highest probability from each sequence and outputting it into the final predicted sequence. In this design, however, the process continues differently from the conventional approach. The probability sequences become the inputs of the attention model. As the input sequences are fed to the encoder, they are modified by a function that amplifies the variation within each sequence. Besides the amplification, probability values that are extremely small (unneeded) are eliminated from the sequences.
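The amplification and elimination step described above can be illustrated with a minimal sketch. The proposal does not specify the exact functions, so the power-based sharpening, the names `amplify_and_prune`, `gamma`, and `eps`, and the example probabilities are all assumptions made only for illustration.

```python
def amplify_and_prune(probs, gamma=2.0, eps=0.01):
    """Sharpen one softmax output and drop negligible candidates.

    probs: dict mapping candidate character -> probability (sums to 1).
    gamma, eps: hypothetical hyperparameters, not taken from the proposal.
    """
    # Amplify (assumed form): raising to a power > 1 widens the gap
    # between large and small probabilities.
    powered = {ch: p ** gamma for ch, p in probs.items()}
    total = sum(powered.values())
    sharpened = {ch: v / total for ch, v in powered.items()}

    # Eliminate: discard candidates whose sharpened probability is below eps.
    kept = {ch: p for ch, p in sharpened.items() if p >= eps}

    # Renormalize so the surviving candidates form a distribution again.
    kept_total = sum(kept.values())
    return {ch: p / kept_total for ch, p in kept.items()}


# A clear character space: one dominant candidate.
clear = amplify_and_prune({"O": 0.97, "0": 0.02, "Q": 0.01})
# An unclear character space: several plausible candidates.
blurry = amplify_and_prune({"O": 0.45, "0": 0.40, "Q": 0.15})
print(clear)
print(blurry)
```

With these made-up inputs, the clear character space keeps a single candidate, while the unclear one keeps all three, which is the situation in which the attention model has something left to decide.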
Afterward, each timestep has different ways to output its final predicted character. (Kang) These outputs are probability sequences similar to the inputs, but they blend the original probability sequence with deep-learning predictions based on the surrounding context. In mathematical terms, each output sequence is the sum of products of two factors: the probability of the chosen characters in the context vector (the pattern recognition probability) and the probability sequence of predicted characters given those chosen characters, as shown in the following figure.
(三) 、Results and Discussions
As a result, this improved system gives users a better prediction of words whether or not the text is clear. Overall, its distinguishing feature is that it integrates text rationality checks with the original recognition outcomes, making sure that the final predictions are the most reasonable ones under minimum difference from the original images. For high-resolution text images, the outcomes will be close to, and in many cases identical to, the original predictions of traditional CRNN models. This is because each probability sequence will retain only one possibility after the amplification and elimination functions are applied. In contrast, for unclear characters the system uses the attention mechanism's text prediction to make the recognition more reasonable and accurate. The worst-case time complexity of integrating the attention vectors is O(n * a^n), where a is the average length of a probability sequence after most of its values have been eliminated, which seems large. Yet in real-life cases not every character space will be so ambiguous, so many combinations are almost impossible. This suggests that the computation can be optimized toward a lower bound of O(n), since many low-probability products are unnecessary for the final integration.
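The integration and the pruning argument above can be sketched in a few lines. This is only one possible reading of the summation, not the proposal's implementation: the function `integrate`, the cutoff `min_product`, and the lexicon-based `plausibility` function (a stand-in for the attention model's score) are hypothetical, and the probabilities are made up.

```python
def integrate(seqs, plausibility, min_product=1e-4):
    """Combine per-position recognition probabilities with a plausibility score.

    seqs: list of dicts (one per character space), candidate -> probability.
    plausibility: callable str -> float; a hypothetical stand-in for the
        attention model's score of a whole candidate string.
    min_product: assumed cutoff; prefixes whose recognition product falls
        below it are abandoned, which trims the a^n enumeration.
    Returns the final per-position distributions and the number of full
    candidate strings that were actually scored.
    """
    final = [dict.fromkeys(s, 0.0) for s in seqs]
    visited = 0

    def extend(prefix, product):
        nonlocal visited
        if len(prefix) == len(seqs):
            visited += 1
            # Recognition product times plausibility of this combination.
            score = product * plausibility("".join(prefix))
            for t, ch in enumerate(prefix):
                final[t][ch] += score  # sum of products per character
            return
        for ch, p in seqs[len(prefix)].items():
            if product * p >= min_product:  # skip near-impossible branches
                extend(prefix + [ch], product * p)

    extend([], 1.0)

    # Normalize each character space so it is a probability sequence again.
    for dist in final:
        total = sum(dist.values())
        if total > 0:
            for ch in dist:
                dist[ch] /= total
    return final, visited


# Toy plausibility model: strings in a tiny lexicon are far more plausible.
LEXICON = {"CAT"}

def plausibility(word):
    return 1.0 if word in LEXICON else 0.05

# The middle character is unclear; recognition alone slightly prefers "4".
seqs = [{"C": 1.0}, {"A": 0.45, "4": 0.55}, {"T": 1.0}]
final, visited = integrate(seqs, plausibility)
decoded = "".join(max(d, key=d.get) for d in final)
print(decoded, visited)
```

In this toy case only the middle character space has more than one candidate, so just two strings are scored instead of a^n, and the context score moves the unclear character from "4" to "A" while the clear characters stay unchanged.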
(四) 、Conclusions and Applications
To conclude, the system has great potential to convert images of any resolution into text more accurately than conventional systems. Besides pure-text images, it can also be applied to other kinds of pictures, for example in license plate recognition. Furthermore, it can be used in advanced data acquisition (RPA for OCR) and smart monitoring. (OCR3 大應用) Smart monitoring in particular relies heavily on recognizing unclear characters, because most images in smart monitoring databases have lower resolution.