
Improving Automated Character Recognitions using Bidirectional Attention Models

Future Star - Smart Technology Camp Proposal
2021 (110th ROC Year) Taiwan International Science Fair
Research Report
Region: Northern Taiwan
Category: Computer Science and Information Engineering
Title: Improving Automated Character Recognitions
using Bidirectional Attention Models
Keywords: Neural Network, Attention Mechanism, OCR
Entry No.:
(To be filled in by the National Taiwan Science Education Center; please leave blank.)
一、 Abstract:
In recent years, OCR (optical character recognition) has become increasingly common on the market, but the accuracy of many OCR programs is heavily affected by image clarity. I therefore aim to develop a system that can recognize both clear and unclear character images. The system's guiding principle is to "optimize the recognition of unclear words while maintaining the accuracy on clear words, under minimum deviation from the true sequence." To this end, I use an attention-mechanism model built on a recurrent neural network (RNN) to further optimize the plausibility and accuracy of the probability sequence at each character space. In short, when computing the final probability sequence for each character space, the system first computes the probability of every possible word sequence (the product of the per-character recognition probabilities), multiplies each by the plausibility sequence that the attention mechanism assigns to that word combination, and finally sums the products. The output behaves as intended: for highly legible character images, the optimized sequence is nearly identical to the raw recognition sequence, while for less legible images the final sequence differs from the raw recognition sequence to an appropriate degree. These differences merely correct the recognition of unclear characters, and the result still deviates minimally from the true sequence. Beyond pure text images, the system's concept can also be applied to other kinds of image recognition, such as license plate recognition, advanced data acquisition and processing, and smart monitoring systems.
二、Content:
(一)、Introduction
Nowadays, people are used to using OCR to convert images with
characters to different types of documents and get access to the texts
within. However, in many cases, the characters in images may have low
resolutions due to hardware or accessibility problems. As a consequence,
the conversion process becomes difficult and ambiguous. I have also
encountered these problems with many OCR programs; thus, I decided
to use an attention model to help the system distinguish both clear and
unclear characters or words by choosing the best-fit output.
(二)、Methods and Process
Since my goal is to improve the accuracy of unclear character
recognition and maintain that of the clear ones under minimum
difference from the real sequence, I use an attention model to translate
the probability sequences to the final predicted sequence in the
transcription layer. As in common character recognition systems, the image passes through a CNN (convolutional neural network) and an RNN (recurrent neural network) with an attention modeling layer, producing a series of probability sequences. (Sang and Cuong 2) More specifically, each probability sequence has passed through the softmax function and represents the probabilities of every possible letter or numeral being the true character in a particular character space. (Shi et al. 2) A conventional system would stop here: it would select the highest-probability candidate from each sequence and output the results as the final predicted sequence. My process, however, diverges from this conventional approach. The probability sequences become the inputs of the attention model. As the input sequences are encoded, they are modified by a function that amplifies the variation within each sequence; in addition, probability values that are extremely small (and thus unneeded) are eliminated from the sequences. Afterward, every timestep can output its final predicted character in a different way. (Kang) These outputs are probability sequences
similar to the inputs, but they blend the original probability sequence
with deep-learning predictions based on the surrounding context. In mathematical terms, each final sequence is a summation, over the candidate character combinations, of products: the probability of the chosen characters in the context vector (the pattern-recognition probability) multiplied by the probability sequence of the characters predicted from those chosen characters, as shown in the following figure.
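The amplify-prune-and-blend step above can be sketched in Python. This is a minimal illustration under my own assumptions: `amplify`, `prune`, and `context_model` are hypothetical stand-ins for the actual amplification function, elimination step, and attention model, and the toy alphabet and probabilities are invented purely for demonstration.

```python
# Sketch of the transcription step: sharpen each softmax distribution,
# drop tiny probabilities, then blend recognition probabilities with
# context-conditioned predictions over every surviving combination.
import itertools
import math

ALPHABET = "abc"  # toy alphabet for illustration

def amplify(dist, temperature=0.5):
    """Sharpen a distribution to amplify its variation (lower T = sharper)."""
    scaled = [p ** (1.0 / temperature) for p in dist]
    total = sum(scaled)
    return [p / total for p in scaled]

def prune(dist, threshold=0.05):
    """Eliminate extremely small probabilities; return {char_index: prob}."""
    return {i: p for i, p in enumerate(dist) if p >= threshold}

def context_model(seq, t):
    """Hypothetical stand-in for the attention model: a distribution over
    the character at timestep t given the chosen surrounding characters.
    Here it mildly favors repeating the previous character, as a demo only."""
    dist = [1.0] * len(ALPHABET)
    if t > 0:
        dist[seq[t - 1]] += 1.0
    total = sum(dist)
    return [d / total for d in dist]

def rescore(probs):
    """p(c at t) = sum over candidate sequences y of P_recog(y) * P_ctx(c | y, t)."""
    pruned = [prune(amplify(dist)) for dist in probs]
    n = len(probs)
    final = [[0.0] * len(ALPHABET) for _ in range(n)]
    # Enumerate every surviving character combination (the context vectors).
    for combo in itertools.product(*[list(p.items()) for p in pruned]):
        seq = [c for c, _ in combo]
        weight = math.prod(p for _, p in combo)  # pattern-recognition probability
        for t in range(n):
            ctx = context_model(seq, t)
            for c in range(len(ALPHABET)):
                final[t][c] += weight * ctx[c]
    # Normalize each timestep's blended distribution.
    return [[v / sum(row) for v in row] for row in final]

blended = rescore([[0.6, 0.3, 0.1],     # clear character: one candidate dominates
                   [0.4, 0.35, 0.25]])  # unclear character: ambiguous
print([max(range(len(ALPHABET)), key=row.__getitem__) for row in blended])  # → [0, 0]
```

Note how the ambiguous second timestep is pulled toward the clear first one by the context model, which is exactly the intended behavior for unclear characters.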
(三)、Results and Discussions
As a result, this improved system gives users better word predictions
whether or not the text is clear. Overall, its unique feature is that it
integrates text rationality checks perfectly with the original recognition
outcomes, making sure that the final predictions are most reasonable
under minimum differences from the original images. In text images of
high resolution, the outcomes will be close to, and in many cases even
identical to, the original predictions from traditional CRNN models.
(This is because, after the amplification and elimination functions, each probability sequence is typically left with only a single candidate.) On the contrary, in cases of unclear characters, the system utilizes the attention mechanism's text prediction to make the recognition more reasonable and accurate. The worst-case time complexity of integrating the attention vectors is O(n * a^n) (where n is the sequence length and a is the average number of candidates remaining per character space after elimination), which seems large. Yet, in real-life cases, not every feature in every character space is ambiguous, so many combinations are almost impossible to appear. That suggests the integration can be optimized down toward a lower bound of O(n), since many low-probability products are unnecessary for the final integration.
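The report does not specify how this optimization would be implemented, but a standard pruning strategy such as beam search illustrates why the a^n enumeration can collapse toward linear cost: keeping only the k most probable partial sequences at each timestep reduces the work to O(n * k * a). The function and data below are an illustrative sketch, not the system's actual code.

```python
# Beam search over per-timestep candidate distributions: instead of scoring
# all a^n character combinations, keep only the k best partial sequences.
def beam_search(probs, k=2):
    """probs: list of per-timestep {char: prob} dicts (already pruned).
    Returns the k best (sequence, probability) pairs, best first."""
    beams = [((), 1.0)]
    for dist in probs:
        # Extend every surviving partial sequence by every candidate character.
        candidates = [(seq + (c,), p * cp)
                      for seq, p in beams
                      for c, cp in dist.items()]
        # Keep only the k highest-probability extensions.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
    return beams

# Toy pruned distributions for a three-character word.
probs = [{"h": 0.7, "b": 0.3},
         {"e": 0.9, "a": 0.1},
         {"y": 0.6, "v": 0.4}]
best = beam_search(probs, k=2)
print(best[0])  # most probable surviving combination
```

With k fixed, the loop does a constant amount of work per timestep, so the low-probability products the text mentions are simply never generated.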
(四)、Conclusions and Applications
To conclude, the system has great potential to convert images of any
resolution to text more accurately than conventional systems. Besides
pure text images, it can also be applied to various kinds of pictures, such
as license plate recognition. Furthermore, it can be used in advanced
data acquisition (RPA for OCR) and smart monitoring. (OCR 3 大應用)
In particular, smart monitoring will rely heavily on the recognition of
unclear characters, because most images in smart monitoring databases
have low resolution.
(五)、References
“原來 OCR 不只能辨識平面文字?完整介紹帶你認識 OCR 3 大應用.” 大數看時事, 13 Nov. 2020, www.largitdata.com/blog_detail/20111113.
Kang, WenWei. “Attention Mechanism.” Medium, Taiwan AI Academy, 10 Apr. 2019, medium.com/ai-academy-taiwan/attention-mechanism-fad735db3c2c.
“RPA and Intelligent Automation for Optical Character Recognition Based Business Processes.” Accelirate, 7 Jan. 2019, www.accelirate.com/rpa-intelligent-automation-optical-character-recognition-based-business-processes/.
Sang, Dinh Viet, and Le Tran Cuong. “Improving CRNN with
EfficientNet-like Feature Extractor and Multi-Head Attention for Text
Recognition.” Proceedings of the Tenth International Symposium on
Information and Communication Technology - SoICT 2019, 2019,
doi:10.1145/3368926.3369689.
Shi, Baoguang, et al. “An End-to-End Trainable Neural Network for
Image-Based Sequence Recognition and Its Application to Scene Text
Recognition.” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 39, no. 11, 2017, pp. 2298–2304,
doi:10.1109/tpami.2016.2646371.