DATASET: Harbin Institute of Technology Opening Recognition Corpus for Chinese Characters (HITOR3C) Contact Author Dr Qingcai Chen Shenzhen Graduate School, Harbin Institute of Technology. Shenzhen, P.R. China 518055 e. qingcai.chen@hitsz.edu.cn t. +86 755 26033475 Keywords HIT-OR3C, Online handwriting Chinese characters, Online handwriting Chinese document, Character recognition corpus Description The characters of HIT-OR3C are collected using handwriting pad and are recorded and labelled automatically via the proposed handwriting document collection software OR3C Toolkit. HIT-OR3C consists of 5 subsets, namely GB1, GB2, Letter, Digit and Document. The first 4 corpora contain 6,825 categories produced by 122 persons and 832,650 samples in total. The document corpus corresponds to 10 news articles that contain 2,442 categories produced by 20 persons and 77,168 samples in total. The dataset contains 909,818 images. The total size of the dataset is 15.5 GB (1125 Mb compressed). There are three file formats, defined by ourselves and introduced in the related documents. The image sizes is 128 x 128, and colour depths from 1bpp. Metadata For each image, the label is provided. The labels of digits and letters are encoded in ASCII; the labels of Chinese characters are encoded in GB2312-80. The label file is in every folder and named “labels.txt”. Related Ground Truth Data N/A Related Tasks [link to task page] References 1. S. Zhou, Q. Chen, X. Wang, “HIT-OR3C: An Opening Recognition Corpus for Chinese Characters”, DAS 2010, to appear Submitted Files File Formats TASK: Handwriting recognition for Chinese characters Description The aim of this task is to automatically recognize a series of characters that written on paper or handwriting input device. This is a topic that has received a lot of attention lately, including the shape normalization methods for handwritten Chinese character recognition [1], and various methods for online recognition of Chinese characters (see [2] for a review). Evaluation Protocol This task consists of a total number of 909,818 handwriting characters. We provide one training set (character dataset, 832,650 samples) and one testing set (document dataset, 77,168 samples). For evaluation, we compare the recognition rate of varies methods. References 1. C. L. Liu, and K. Marukawa, 2005, “Pseudo two-dimensional shape normalization methods for handwritten Chinese character recognition,” Pattern Recognition, vol. 38, no. 12, pp. 2242-2255, Dec. 2. C. L. Liu, S. Jaeger, and M. Nakagawa, 2004, “Online recognition of Chinese characters: The state-of-the-art,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 2, pp. 198-213.