Historical Document Analysis
Mini-Project
Processing a historical document image follows these procedures:
1. Image De-noising and Binarization
2. Page Layout Analysis
3. Feature Extraction
4. Classification
5. Applications
The input dataset is a set of images in one folder. For each manuscript, the system should
record its main properties (title, author, writer, period, region, etc.) in an XML file.
Each procedure applied to a manuscript or set of images should also be recorded in an XML
file or log file. This record determines whether a procedure is applicable, because some
procedures are prerequisites for others; e.g., features must be extracted or computed before
classification is applied. The GUI controls the flow while respecting these prerequisites.
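The record keeping and prerequisite check described above could look roughly like the following sketch. It is only an illustration: the element names, the procedure names, and the PREREQUISITES table are assumptions, not a schema fixed by this project.

```python
import xml.etree.ElementTree as ET

# Hypothetical prerequisite table: each procedure lists the procedures
# that must already be logged before it may run. Names are assumptions.
PREREQUISITES = {
    "binarization": [],
    "layout_analysis": ["binarization"],
    "feature_extraction": ["layout_analysis"],
    "classification": ["feature_extraction"],
}


def new_manuscript(title, author, writer, period, region):
    """Build the XML record holding a manuscript's main properties."""
    root = ET.Element("manuscript")
    for tag, value in [("title", title), ("author", author),
                       ("writer", writer), ("period", period),
                       ("region", region)]:
        ET.SubElement(root, tag).text = value
    ET.SubElement(root, "procedures")  # log of applied procedures
    return root


def applied(root):
    """Return the names of the procedures already logged."""
    return [p.get("name") for p in root.find("procedures")]


def can_apply(root, procedure):
    """A procedure is applicable once all its prerequisites are logged."""
    done = applied(root)
    return all(req in done for req in PREREQUISITES[procedure])


def apply_procedure(root, procedure, algorithm):
    """Log a procedure, refusing it when a prerequisite is missing."""
    if not can_apply(root, procedure):
        raise ValueError(f"prerequisites of {procedure!r} not satisfied")
    ET.SubElement(root.find("procedures"), "procedure",
                  {"name": procedure, "algorithm": algorithm})
```

A GUI could call can_apply to enable or grey out a menu entry, and ET.tostring(root) or ET.ElementTree(root).write(path) to persist the record.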
For each of the above-mentioned procedures there are several algorithms that vary in
accuracy and speed. Thus, we plan to implement multiple algorithms for each procedure
and let the user select one of them via the GUI.
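One possible way to keep several algorithms per procedure selectable is a small registry, sketched below. The registry layout and the two toy thresholding functions are assumptions made for illustration; images are plain lists of grey levels here so that the sketch needs no external library.

```python
# Registry mapping procedure name -> algorithm name -> callable.
REGISTRY = {}


def register(procedure, name):
    """Decorator that files an algorithm under its procedure."""
    def decorator(func):
        REGISTRY.setdefault(procedure, {})[name] = func
        return func
    return decorator


@register("binarization", "fixed_threshold")
def fixed_threshold(image, threshold=128):
    """Mark pixels darker than a fixed threshold as ink (1)."""
    return [[1 if px < threshold else 0 for px in row] for row in image]


@register("binarization", "mean_threshold")
def mean_threshold(image):
    """Use the mean grey level of the page as the threshold."""
    pixels = [px for row in image for px in row]
    mean = sum(pixels) / len(pixels)
    return [[1 if px < mean else 0 for px in row] for row in image]


def available(procedure):
    """Names the GUI could offer to the user for one procedure."""
    return sorted(REGISTRY.get(procedure, {}))


def run(procedure, name, image, **params):
    """Run the algorithm the user selected."""
    return REGISTRY[procedure][name](image, **params)
```

With this layout, adding an algorithm means writing one decorated function; the GUI only needs available() to fill a selection list and run() to execute the choice.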
Here is the list of projects, which is derived from the procedure list:
1. Graphical User Interface (GUI)
2. Image Denoising and Binarization: this will be done by the GUI group, as it is just a
   call to an OpenCV function.
3. Page Layout Analysis
   a. Manual Page Layout Analysis: this should be done in close collaboration with the
      GUI group.
   b. Automatic Page Layout Analysis: this requires some research and suits those who
      would like to play with algorithms.
4. Feature Extraction
   a. Define the set of features and their parameters in an XML file, and determine the
      applicable classifiers. We will handle pixel-based and contour-based features.
   b. Compute the features and set the appropriate fields in the XML file from (a).
5. Classification
   a. Integrate classification libraries into the system.
   b. Pass the features to the appropriate classifier.
6. Application
   a. Search for a word in a document
   b. Script recognition
   c. Compare two scripts
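The two feature extraction steps, defining features in an XML file and then filling in computed values, might be sketched as follows. The XML layout, the feature names, and the classifier names are all hypothetical; the two pixel-based features are simple stand-ins chosen for illustration, and contour-based features are left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical feature definition file (step a), inlined as a string.
FEATURES_XML = """
<features>
  <feature name="ink_density" type="pixel">
    <classifier>svm</classifier>
    <classifier>knn</classifier>
  </feature>
  <feature name="row_profile" type="pixel">
    <classifier>knn</classifier>
  </feature>
</features>
"""


def ink_density(binary):
    """Fraction of pixels marked as ink (1) in a binary image."""
    total = sum(len(row) for row in binary)
    return sum(map(sum, binary)) / total


def row_profile(binary):
    """Number of ink pixels in each row (horizontal projection)."""
    return [sum(row) for row in binary]


# Maps the feature names used in the XML to the functions computing them.
EXTRACTORS = {"ink_density": ink_density, "row_profile": row_profile}


def compute_features(root, binary):
    """Step (b): store each computed value in the definition tree."""
    for feature in root.iter("feature"):
        value = EXTRACTORS[feature.get("name")](binary)
        feature.set("value", repr(value))
    return root


def features_for(root, classifier):
    """Names of the features declared applicable to a classifier."""
    return [f.get("name") for f in root.iter("feature")
            if classifier in [c.text for c in f.findall("classifier")]]
```

Here features_for is what the classification project could use to decide which features to pass to a given classifier.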