Harbin Institute of Technology Opening Recognition Corpus

advertisement
DATASET: Harbin Institute of Technology Opening
Recognition Corpus for Chinese Characters (HITOR3C)
Contact Author
Dr Qingcai Chen
Shenzhen Graduate School,
Harbin Institute of Technology.
Shenzhen, P.R. China 518055
e. qingcai.chen@hitsz.edu.cn
t. +86 755 26033475
Keywords
HIT-OR3C, Online handwriting Chinese characters, Online handwriting
Chinese document, Character recognition corpus
Description
The characters of HIT-OR3C are collected using handwriting pad and are
recorded and labelled automatically via the proposed handwriting document
collection software OR3C Toolkit. HIT-OR3C consists of 5 subsets, namely
GB1, GB2, Letter, Digit and Document. The first 4 corpora contain 6,825
categories produced by 122 persons and 832,650 samples in total. The
document corpus corresponds to 10 news articles that contain 2,442
categories produced by 20 persons and 77,168 samples in total.
The dataset contains 909,818 images. The total size of the dataset is 15.5 GB
(1125 Mb compressed).
There are three file formats, defined by ourselves and introduced in the
related documents. The image sizes is 128 x 128, and colour depths from
1bpp.
Metadata
For each image, the label is provided. The labels of digits and letters are
encoded in ASCII; the labels of Chinese characters are encoded in
GB2312-80. The label file is in every folder and named “labels.txt”.
Related Ground Truth Data
N/A
Related Tasks
[link to task page]
References
1. S. Zhou, Q. Chen, X. Wang, “HIT-OR3C: An Opening Recognition Corpus
for Chinese Characters”, DAS 2010, to appear
Submitted Files
File Formats
TASK: Handwriting recognition for Chinese characters
Description
The aim of this task is to automatically recognize a series of characters that
written on paper or handwriting input device.
This is a topic that has received a lot of attention lately, including the shape
normalization methods for handwritten Chinese character recognition [1], and
various methods for online recognition of Chinese characters (see [2] for a
review).
Evaluation Protocol
This task consists of a total number of 909,818 handwriting characters. We
provide one training set (character dataset, 832,650 samples) and one testing
set (document dataset, 77,168 samples). For evaluation, we compare the
recognition rate of varies methods.
References
1. C. L. Liu, and K. Marukawa, 2005, “Pseudo two-dimensional shape
normalization methods for handwritten Chinese character recognition,”
Pattern Recognition, vol. 38, no. 12, pp. 2242-2255, Dec.
2. C. L. Liu, S. Jaeger, and M. Nakagawa, 2004, “Online recognition of
Chinese characters: The state-of-the-art,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 26, no. 2, pp. 198-213.
Download