IFN/ENIT-database – DATABASE OF HANDWRITTEN ARABIC WORDS – http://www.ifnenit.com/
Database Specification
Overview
• Over 2200 form-pages of (937) Tunisian town/village names
• All data are digitized using a resolution of 300 dpi (b/w)
• Ground truth information available
  o e.g. sequence of Arabic character shapes
  o baseline/reference line position
  o topline position for set_a
• A wide variety of writing styles; 411 different writers
• About 26000 Arabic handwritten Tunisian town/village names
• Approximately 212 000 Arabic characters and ligatures
• Divided into 4 sets (a-d)
• Images and ground truth documentation included
Set specification
The whole database is divided into 4 disjoint sets for training and testing Arabic OCR systems. We
recommend using the specified sets for comparability with results of other groups.
SET                          a          b          c          d          SUM
Number of words              6537       6710       6477       6735       26459
Number of writers (OK+bad)   88+14=102  89+13=102  88+15=103  90+14=104  355+56=411
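The per-set figures can be cross-checked against the SUM column with a few lines of Python. This is only a sketch: the name SETS and the tuple layout are made up for illustration, and the numbers are copied from the table.

```python
# Per-set figures copied from the set specification table:
# set -> (number of words, writers counted as "OK", writers counted as "bad")
SETS = {
    "a": (6537, 88, 14),
    "b": (6710, 89, 13),
    "c": (6477, 88, 15),
    "d": (6735, 90, 14),
}

total_words = sum(words for words, _, _ in SETS.values())
total_ok = sum(ok for _, ok, _ in SETS.values())
total_bad = sum(bad for _, _, bad in SETS.values())

print(total_words)           # 26459
print(total_ok, total_bad)   # 355 56
print(total_ok + total_bad)  # 411
```

The three printed totals agree with the SUM column.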
Data formats
Image format
The form-pages are stored in uncompressed TIFF format (file extension .tif).
All cropped Tunisian town/village names are provided as uncompressed TIFF images and as BMP images
(file extension .bmp). Label information is stored in the “Image description” field of the TIFF header.
Ground Truth format
For each cropped Tunisian town/village name there is a truth file (file extension .tru). The truth file is an
ASCII text file containing all available ground truth information. One example is shown below.
01: COM: IFN/ENIT-database truth (label) file
02: COM: http://www.ifnenit.com
03: COM: IfN, TU-BS
04: COM: di45_019.tif coming from pb377_6.tif
05: X_Y: 498 87
06: BDR: begin data record
07: LBL: ZIP:3032;AW1:ﻣﺮآﺰدروﻳﺶ;AW2:maB|raE|keB|zaE|daA|raA|waA|yaB|shE|;QUA:YB1;ADD:P6
08: CHA: 9
09: BLN: 56,42
10: TLN: 23,19
11: EDR: end of data record
Arabic database -- IFN/ENIT-database for developing and testing recognition systems for handwritten Arabic words (Arabic OCR) (version 1.0p2)
lines 01-04:  comments
line 05:      image size in pixel (x,y)
line 06:      begin data record
line 07:      label
              ZIP: Tunisian post code / ZIP code
              AW1: Tunisian town/village name in Arabic windows encoding
              AW2: Arabic character shape sequence of Tunisian town/village name in Latin code
                   (refer to the Appendix for the lookup table)
              QUA: baseline quality tag (B1 = OK; B2 = bad)
              ADD: number of pieces of Arabic words (PAWs)
line 08:      number of characters
line 09:      baseline/reference line information Y1,Y2
line 10:      topline information Y1,Y2 (in set_a only)
line 11:      end of data record
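The record layout above can be read with a small parser. The following Python sketch is not part of the database distribution; the function name parse_truth and the dictionary keys are made up for illustration. It assumes that each line carries one of the three-character tags shown in the example (COM, X_Y, LBL, CHA, BLN, TLN) and accepts an optional leading line number such as "07:", since the example may or may not reproduce the files literally.

```python
import re

# Optional two-digit line number ("07: "), then a three-character tag ("LBL: ").
_LINE = re.compile(r"^(?:\d{2}:\s*)?([A-Z_]{3}):\s*(.*)$")


def parse_truth(text):
    """Parse the text of one .tru file into a dictionary (hypothetical keys)."""
    rec = {"comments": [], "topline": None}  # topline exists in set_a only
    for raw in text.splitlines():
        match = _LINE.match(raw.strip())
        if match is None:
            continue  # skip anything that does not look like a record line
        tag, value = match.group(1), match.group(2).strip()
        if tag == "COM":
            rec["comments"].append(value)
        elif tag == "X_Y":
            x, y = value.split()
            rec["size"] = (int(x), int(y))
        elif tag == "LBL":
            # The label line is a ';'-separated list of KEY:value fields.
            fields = dict(p.split(":", 1) for p in value.split(";") if ":" in p)
            rec["zip"] = fields.get("ZIP")
            rec["arabic"] = fields.get("AW1")
            rec["shapes"] = [s for s in fields.get("AW2", "").split("|") if s]
            rec["quality"] = fields.get("QUA")
            rec["paws"] = fields.get("ADD")
        elif tag == "CHA":
            rec["n_chars"] = int(value)
        elif tag in ("BLN", "TLN"):
            y1, y2 = (int(v) for v in value.split(","))
            rec["baseline" if tag == "BLN" else "topline"] = (y1, y2)
    return rec


SAMPLE = """\
01: COM: IFN/ENIT-database truth (label) file
02: COM: http://www.ifnenit.com
03: COM: IfN, TU-BS
04: COM: di45_019.tif coming from pb377_6.tif
05: X_Y: 498 87
06: BDR: begin data record
07: LBL: ZIP:3032;AW1:ﻣﺮآﺰدروﻳﺶ;AW2:maB|raE|keB|zaE|daA|raA|waA|yaB|shE|;QUA:YB1;ADD:P6
08: CHA: 9
09: BLN: 56,42
10: TLN: 23,19
11: EDR: end of data record
"""

rec = parse_truth(SAMPLE)
print(rec["zip"], rec["size"], rec["n_chars"], len(rec["shapes"]))
```

When reading real files, the encoding has to be chosen by the caller. "Arabic windows encoding" most likely corresponds to Python's cp1256 codec, for example open(path, encoding="cp1256"), but that is an assumption and should be checked against the data.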
Database Organisation
The IFN/ENIT-database has the following directory structure.
IFN/ENIT-database
    doc
    data
        forms
        set_a
            doc
            tru
            tif
            bmp
        ...
        set_d
            ...
doc
In this directory documentation in pdf and/or txt format is available.
data
In this directory all available data are stored. Each of the four sets has its own directory (set_?).
set_?
In these directories all data are available in tif and bmp file format. The ground truth information is
in the directory tru. The subdirectory doc under data/set_? contains a pdf file for each writer with all
words and baselines.
File name convention
The following file name convention is used.
SWww_NNN.EXT
• S:=set (a,b,c,d)
• Www:=writerID; W=(e,f,i,j,m,q), w=(0..9)
• NNN:=word_number; N=(0..9)
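The convention can be checked mechanically. The sketch below is illustrative only: parse_name and truth_path are hypothetical helper names, and the relative path assumes that truth files live under data/set_?/tru as described in the directory overview.

```python
import re
from pathlib import PurePosixPath

# S = set (a-d), Www = writer ID (one letter of e,f,i,j,m,q plus two digits),
# NNN = three-digit word number, EXT = file extension.
_NAME = re.compile(r"^([a-d])([efijmq][0-9]{2})_([0-9]{3})\.([A-Za-z]+)$")


def parse_name(filename):
    """Split a name such as 'di45_019.tif' into its components."""
    match = _NAME.match(filename)
    if match is None:
        raise ValueError(f"not a word-image file name: {filename!r}")
    set_id, writer, word, ext = match.groups()
    return {"set": set_id, "writer": writer, "word": int(word), "ext": ext}


def truth_path(filename):
    """Relative path of the matching truth file, assuming data/set_?/tru/."""
    info = parse_name(filename)
    stem = filename.rsplit(".", 1)[0]
    return PurePosixPath("data") / f"set_{info['set']}" / "tru" / f"{stem}.tru"


print(parse_name("di45_019.tif"))
print(truth_path("di45_019.tif"))
```

Form-page names such as pb377_6.tif do not follow this pattern, so parse_name rejects them with a ValueError.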
Ordering Information
IFN/ENIT-database is made available for non-commercial use. The data is supplied with no guarantee of
accuracy or usability. We can’t guarantee to maintain the IFN/ENIT-database, but would be interested in
hearing of any comments or results that you have.
Upon request we make the data available on the Internet for free download. If the database has to be
shipped on CD-ROM, production and shipping costs will be charged. In both cases please contact us.
BUGS
When working with the IFN/ENIT-database you may discover bugs concerning the labels
or the extraction of the data. Please report any bugs you find to us so that we can improve the quality of
the data over time.
Reporting results
We kindly invite you to publish the results you obtain with the IFN/ENIT-database. To keep recognition results
comparable, we suggest reporting results as shown in the following example:
Database version: IFN/ENIT-database v1.0p2

Test   Training set(s)   Test set   Recognition result(*)
1      a,b,c             d          23.7%
2      c,b,d             a          25.4%
…
(*) Percentage of correctly recognised words in the specified test set. We recommend using the ground
truth / label category “ZIP:” as reference for the recognition result. For each test, please use the full
set of 937 different Tunisian town/village names as the lexicon if a lexicon is needed.
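The recognition result defined in the footnote is a plain word-level accuracy. A minimal Python sketch, assuming that each test word has exactly one predicted ZIP label and one ground-truth ZIP label; the function name and the toy data are made up for illustration.

```python
def recognition_rate(predicted_zips, truth_zips):
    """Percentage of test words whose predicted ZIP equals the ground truth."""
    if len(predicted_zips) != len(truth_zips):
        raise ValueError("need exactly one prediction per test word")
    if not truth_zips:
        raise ValueError("empty test set")
    correct = sum(p == t for p, t in zip(predicted_zips, truth_zips))
    return 100.0 * correct / len(truth_zips)


# Made-up toy data: four test words, three of them recognised correctly.
truth = ["3032", "1000", "2070", "3032"]
predicted = ["3032", "1000", "2071", "3032"]
print(f"{recognition_rate(predicted, truth):.1f}%")  # 75.0%
```

Comparing ZIP labels instead of character sequences keeps the score independent of how label supplements in the AW2 field are handled.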
Contact
Please feel free to contact contact@ifnenit.com.
References
Mario Pechwitz, Samia Snoussi Maddouri, Volker Märgner, Noureddine Ellouze, Hamid Amiri; IFN/ENIT-database of handwritten Arabic words, In Proceedings of CIFED’02, Hammamet, Tunisia, 21.-23.10.2002, p. 129-136
Appendix
Label legend / lookup table – Arabic label to Latin label & statistic of occurrence in
the database
Arabic label   Latin label    Quantity
0_A            0A             342
1_A            1A             279
2_A            2A             384
6_A            6A             311
7_A            7A             354
8_A            8A             284
9_A            9A             341
ء_A            hhA            520
ﺁ_A            amA            544
أ_A            aeA            1660
إ_A            ahA            631
أ_Eل_B         aeElaB         360
إ_Eل_B         ahElaB         122
ئ_M            alM            355
ا_A            aaA            20251
ا_E            aaE            13308
ا_Eل_B         aaElaB         799
ا_Eل_M         aaElaM         1076
ب_A            baA            331
ب_B            baB            5636
ب_E            baE            344
ب_M            baM            3407
ة_A            teA            2182
ت_A            taA            356
ت_B            taB            2324
ة_E            teE            7259
ت_E            taE            357
ت_M            taM            1045
ث_A            thA            338
ث_B            thB            353
ث_M            thM            327
ج_A            jaA            504
ج_B            jaB            981
ج_E            jaE            346
ج_M            jaM            1218
ج_Mل_B         jaMlaB         539
ح_A            haA            314
ح_B            haB            2483
ح_E            haE            295
ح_M            haM            1804
ح_Mل_B         haMlaB         365
ح_Mم_Mل_B      haMmaMlaB      64
ح_Mن_B         haMnaB         100
خ_A            khA            341
خ_B            khB            863
خ_E            khE            321
خ_M            khM            425
خ_Mل_B         khMlaB         310
د_A            daA            2718
د_E            daE            4883
ذ_A            dhA            353
ذ_E            dhE            703
ر_A            raA            6252
ر_E            raE            9369
ز_A            zaA            1764
ز_E            zaE            2718
س_A            seA            873
س_B            seB            4218
س_E            seE            811
س_M            seM            1110
ش_A            shA            475
ش_B            shB            1617
ش_E            shE            351
ش_M            shM            1277
ص_A            saA            355
ص_B            saB            956
ص_E            saE            357
ص_M            saM            1096
ض_A            deA            351
ض_B            deB            730
ض_E            deE            343
ض_M            deM            328
ط_A            toA            343
ط_B            toB            359
ط_E            toE            350
ط_M            toM            1258
ظ_B            zaB            339
ظ_M            zaM            690
ع_A            ayA            915
ع_B            ayB            1650
ع_E            ayE            347
ع_M            ayM            1990
غ_B            ghB            326
غ_M            ghM            600
ف_A            faA            319
ف_B            faB            898
ف_E            faE            316
ف_M            faM            1647
ق_A            kaA            397
ق_B            kaB            2608
ق_E            kaE            348
ق_M            kaM            1307
ك_B            keB            1221
ك_E            keE            335
ك_M            keM            980
ل_A            laA            1485
ل_B            laB            14340
ل_E            laE            1056
ل_M            laM            2594
م_A            maA            890
م_B            maB            3886
م_E            maE            536
م_M            maM            4626
م_Mل_B         maMlaB         458
ن_A            naA            1267
ن_B            naB            3723
ن_E            naE            2119
ن_M            naM            2912
ﻩ_A            heA            696
ﻩ_B            heB            1924
ﻩ_E            heE            351
ﻩ_M            heM            347
و_A            waA            3511
و_E            waE            6529
ى_A            eeA            322
ي_A            yaA            2932
ي_B            yaB            4383
ى_E            eeE            350
ي_E            yaE            2167
ي_M            yaM            7759
Note:
• “llL” is often added and means ligature “chadda”.
• The character shape indicators (A,B,M,E) are sometimes supplemented with a “1” or a “2”,
like “baA1”. In this case a point error was detected. Take care: this feature is not
consistent over the whole database. We recommend ignoring these kinds of label supplements.
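One way to follow the recommendation is to strip the supplement before a label is looked up in the table. The Python sketch below assumes that the “1” or “2” only ever appears at the very end of a label, directly after the shape indicator; the helper names are made up for illustration, and “llL” labels are left untouched.

```python
import re

# A shape indicator (A, B, M or E) followed by a trailing "1" or "2".
_SUPPLEMENT = re.compile(r"([ABME])[12]$")


def strip_supplement(label):
    """Drop a trailing '1'/'2' supplement, e.g. 'baA1' -> 'baA'."""
    return _SUPPLEMENT.sub(r"\1", label)


def split_shapes(aw2):
    """Split an AW2 field such as 'maB|raE|' into cleaned shape labels."""
    return [strip_supplement(s) for s in aw2.split("|") if s]


print(strip_supplement("baA1"))            # baA
print(split_shapes("maB|raE1|keB|zaE2|"))  # ['maB', 'raE', 'keB', 'zaE']
```

Digit labels such as “1A” and “2A” are not affected, because the pattern only matches a digit that follows a shape indicator at the end of the label.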