Objectives of layout analysis and segmentation

Document Analysis:
Segmentation & Layout Analysis
Prof. Rolf Ingold, University of Fribourg
Master course, spring semester 2008
Outline
Objectives of layout analysis
Classification of layout analysis methods
Splitting methods
Grouping methods
Text-Graphics-Image Separation
Text line segmentation
Word and character segmentation
Field extraction from forms
© Prof. Rolf Ingold
Objectives of layout analysis and segmentation

The role of segmentation is to split a document image into regions of interest
Regions of interest may have different levels of granularity: graphics or text blocks, text lines, words, characters
The goal of layout analysis is to obtain a hierarchical description of the segmented objects
Segmentation strategies

Segmentation produces a hierarchy of physical objects

Two strategies can be used
 - top-down segmentation: starting with the entire image, split it recursively down to elementary shapes
 - bottom-up segmentation: starting at pixel level, detect connected components and group them hierarchically

Hybrid methods combine both strategies

Segmentation methods can be
 - data-driven, i.e., using only data properties (without contextual knowledge)
 - model-driven, i.e., using contextual knowledge
Top-down methods

Top-down methods decompose the entire page into a hierarchy of rectangular regions

Top-down approaches perform recursive XY-cuts
 - horizontal and vertical projection profile analysis
 - white streams (spaces) analysis
 - run length smoothing algorithm (RLSA)
Recursive XY-Cut

The page is cut alternately horizontally and vertically along white spaces
 - Robust for most modern printed documents
 - Assumes that page images are not skewed
 - Does not work for all kinds of layouts
    - Non-rectangular formatting
    - Complex mosaics (illustration next)
 - The resulting hierarchy may not reflect the natural structure (illustration below)
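
The recursive cutting described above can be sketched in a few lines of Python. This is only a minimal illustration, not the lecture's implementation: it assumes a binary image given as a list of rows (1 = black, 0 = white), and the names `shrink`, `split`, `xy_cut` and the parameter `min_gap` (minimal width of a white gap that triggers a cut) are hypothetical.

```python
def shrink(img, box):
    """Trim a box (top, left, bottom, right; bottom/right exclusive) to its black pixels."""
    top, left, bottom, right = box
    rows = [y for y in range(top, bottom) if any(img[y][left:right])]
    cols = [x for x in range(left, right)
            if any(img[y][x] for y in range(top, bottom))]
    if not rows:
        return None  # the box is entirely white
    return (rows[0], cols[0], rows[-1] + 1, cols[-1] + 1)


def split(img, box, min_gap, horizontal):
    """Cut a trimmed box at every white gap of at least min_gap lines in the projection profile."""
    top, left, bottom, right = box
    if horizontal:  # profile over rows, cuts separate bands stacked vertically
        profile = [sum(img[y][left:right]) for y in range(top, bottom)]
    else:           # profile over columns, cuts separate side-by-side columns
        profile = [sum(img[y][x] for y in range(top, bottom))
                   for x in range(left, right)]
    segments, seg_start, i, n = [], 0, 0, len(profile)
    while i < n:
        if profile[i] == 0:
            j = i
            while j < n and profile[j] == 0:
                j += 1
            if j - i >= min_gap:
                segments.append((seg_start, i))
                seg_start = j
            i = j
        else:
            i += 1
    segments.append((seg_start, n))
    if horizontal:
        return [(top + s, left, top + e, right) for s, e in segments]
    return [(top, left + s, bottom, left + e) for s, e in segments]


def xy_cut(img, box=None, min_gap=2, horizontal=True):
    """Return the leaf boxes obtained by alternating horizontal and vertical cuts."""
    if box is None:
        box = (0, 0, len(img), len(img[0]))
    box = shrink(img, box)
    if box is None:
        return []
    # Try the preferred direction first, then the other one.
    for direction in (horizontal, not horizontal):
        parts = split(img, box, min_gap, direction)
        if len(parts) > 1:
            leaves = []
            for part in parts:
                leaves += xy_cut(img, part, min_gap, not direction)
            return leaves
    return [box]  # no sufficiently wide gap in either direction


page = [
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
blocks = xy_cut(page)  # two columns on top, one full-width block below
```

The sketch relies on empty rows and columns in the projection profiles, which is why it breaks down on skewed pages: a small rotation is enough to fill the white gaps.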
Top-Down Segmentation

Recursive splitting can be performed by horizontal and vertical projection profile analysis
 - images need to be "unskewed" (deskewed) first!
Top-Down Segmentation (2)

The order in which X-Y cuts are performed is critical
White streams analysis

Principle: detect maximal rectangular white blocks
 - split regions recursively according to thresholds
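
As a building block for this principle, the sketch below finds the largest all-white rectangle of a binary image with the classical "largest rectangle in a histogram" technique. It is an illustrative assumption rather than the method shown in the lecture: the function name `largest_white_rectangle` is hypothetical, the image is a list of rows (1 = black, 0 = white), and the recursive splitting with thresholds is not covered.

```python
def largest_white_rectangle(img):
    """Return (area, top, left, height, width) of the largest all-white rectangle."""
    if not img:
        return (0, 0, 0, 0, 0)
    n_cols = len(img[0])
    heights = [0] * n_cols  # height of the white column ending at the current row
    best = (0, 0, 0, 0, 0)
    for y, row in enumerate(img):
        for x in range(n_cols):
            heights[x] = heights[x] + 1 if row[x] == 0 else 0
        stack = []  # column indices whose heights are strictly increasing
        for x in range(n_cols + 1):
            h = heights[x] if x < n_cols else 0  # sentinel flushes the stack
            while stack and heights[stack[-1]] >= h:
                top_h = heights[stack.pop()]
                left = stack[-1] + 1 if stack else 0
                width = x - left
                if top_h * width > best[0]:
                    best = (top_h * width, y - top_h + 1, left, top_h, width)
            stack.append(x)
    return best


page = [
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
white_block = largest_white_rectangle(page)
```

A full white-stream segmenter would call such a routine repeatedly, keep only rectangles above a size threshold, and use them as separators between regions.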
Run Length Smearing Algorithm (RLSA)

The Run Length Smearing Algorithm (RLSA) is a morphological operator
 - it replaces white runs whose length is less than or equal to a given threshold with black runs
 - it can be applied horizontally as well as vertically
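
A minimal sketch of the operator on a single scan line, assuming pixels are coded 1 = black and 0 = white. The function name `rlsa_1d` is hypothetical, and the choice to smear only white runs enclosed by black pixels on both sides (leaving margin runs untouched) is an assumption of this example.

```python
def rlsa_1d(line, threshold):
    """Fill white runs of length <= threshold that lie between two black pixels."""
    out = list(line)
    last_black = None  # index of the most recent black pixel
    for i, value in enumerate(line):
        if value == 1:
            if last_black is not None:
                gap = i - last_black - 1
                if 0 < gap <= threshold:
                    for j in range(last_black + 1, i):
                        out[j] = 1
            last_black = i
    return out


line = [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
smeared = rlsa_1d(line, threshold=2)
# the gap of 2 is filled, the gap of 4 is kept:
# [0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
```

Applying the same function to every row gives horizontal smearing; applying it to every column gives vertical smearing.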
RLSA based segmentation

RLSA can be used to segment a page into blocks in three steps
 - RLSA applied horizontally
 - RLSA applied vertically
 - both results combined by a logical AND operator

Threshold values are critical and have to be chosen
 - according to the document class
 - using statistical white space analysis
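
The three steps can be sketched as follows. This is an illustration under stated assumptions: the image is a list of rows with 1 = black and 0 = white, the helper `smear` and the function `rlsa_blocks` are hypothetical names, and only white runs enclosed by black pixels are filled.

```python
def smear(line, threshold):
    """1-D RLSA: fill white runs of length <= threshold between two black pixels."""
    out, last_black = list(line), None
    for i, value in enumerate(line):
        if value:
            if last_black is not None and i - last_black - 1 <= threshold:
                for j in range(last_black + 1, i):
                    out[j] = 1
            last_black = i
    return out


def rlsa_blocks(img, h_threshold, v_threshold):
    """Horizontal RLSA, vertical RLSA, then a pixel-wise logical AND."""
    horizontal = [smear(row, h_threshold) for row in img]
    columns = [smear(col, v_threshold) for col in zip(*img)]  # smear each column
    vertical = [list(row) for row in zip(*columns)]            # transpose back
    return [[h & v for h, v in zip(h_row, v_row)]
            for h_row, v_row in zip(horizontal, vertical)]


page = [
    [1, 1, 1, 0, 0, 0, 1],
    [1, 0, 1, 0, 0, 0, 1],
    [1, 1, 1, 0, 0, 0, 1],
]
blocks = rlsa_blocks(page, h_threshold=1, v_threshold=1)
```

In this toy page the small hole inside the left block is closed in both directions and therefore survives the AND, while the three-pixel gap to the right column stays white. The connected components of the result are the candidate blocks.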
Bottom-up methods

Bottom-up methods start at the pixel level and group pixels into a hierarchy of
 - multi-rectangular regions (shapes delimited by horizontal and vertical segments)
 - arbitrary shapes

Bottom-up methods use
 - connected component extraction
 - region grouping
Connected components

In a binary image, a connected component is a maximal set of black pixels connected by 4- or 8-adjacency
(illustrations: five 4-connected components; two 8-connected components)
Extraction of connected components

Connected components can be extracted by different algorithms
 - By a one-pass scan of the full image, from top to bottom and from left to right
 - By a border following algorithm, starting from a border pixel that is assumed to be known
Scanning based CC Extraction
for each scan line ly
    for each black run r
        if on line ly-1 there is no run k-adjacent to r
            create a new component containing r
        else if on line ly-1 there exists exactly one run r' k-adjacent to r
            add r to the component containing r'
        else if on line ly-1 there exist several runs ri k-adjacent to r
            merge all components containing such an ri
            add r to that component
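
The scanning procedure above can be sketched in Python as follows. This is a minimal illustration, not the lecture's code: the image is a list of rows with 1 = black, the names `find_runs` and `count_components` are hypothetical, and a union-find table is used as one possible way to implement the "merge" step.

```python
def find_runs(row):
    """Return the black runs of a scan line as (start, end) column pairs, end inclusive."""
    runs, start = [], None
    for x, value in enumerate(row):
        if value and start is None:
            start = x
        elif not value and start is not None:
            runs.append((start, x - 1))
            start = None
    if start is not None:
        runs.append((start, len(row) - 1))
    return runs


def count_components(img, k=8):
    """Count k-connected components (k = 4 or 8) in a single top-to-bottom scan."""
    parent = []  # union-find forest over provisional component labels

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    slack = 1 if k == 8 else 0  # 8-adjacency also accepts diagonal contact
    previous = []               # (start, end, label) of the runs on line ly-1
    for row in img:
        current = []
        for start, end in find_runs(row):
            touching = [label for p_start, p_end, label in previous
                        if p_start <= end + slack and p_end >= start - slack]
            if not touching:
                parent.append(len(parent))      # create a new component
                label = len(parent) - 1
            else:
                label = find(touching[0])       # add r to an existing component
                for other in touching[1:]:      # merge when several runs touch r
                    parent[find(other)] = label
            current.append((start, end, label))
        previous = current
    return len({find(i) for i in range(len(parent))})


diagonal = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
```

On the `diagonal` image the three pixels only touch at their corners, so they form three components under 4-adjacency and a single one under 8-adjacency.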
Border following algorithm
consider P0 ∈ S having a 4-neighbor Q0 ∉ S
P ← P0 ; Q ← Q0 ; d ← direction of Q relative to P ;
repeat
    let Ri be the neighbor of P in direction (d+i) mod 8
    if R2 ∉ S then Q ← R2 ; d ← (d+2) mod 8 ;
    else
        if R1 ∉ S then P ← R2 ; Q ← R1 ;
        else P ← R1 ; d ← (d−2) mod 8 ;
    add P to the contour
until P = P0 and Q = Q0
(illustration: positions of Q, R1 and R2 around P for a given direction d)
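
A direct transcription of the border following pseudocode into Python may help to check its behavior on small shapes. The sketch makes several assumptions that are not in the slides: S is given as a set of (x, y) pixel coordinates, directions are numbered in 45 degree steps following the table `DIRS`, the starting neighbor Q0 is taken as the first 4-neighbor of P0 outside S, and consecutive duplicates of P are not stored. The names `DIRS`, `neighbor` and `follow_border` are hypothetical.

```python
# Eight directions as (dx, dy); consecutive entries are 45 degrees apart,
# even indices are the 4-neighbors, odd indices the diagonals.
DIRS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]


def neighbor(p, d):
    """Neighbor of pixel p in direction d (taken modulo 8)."""
    dx, dy = DIRS[d % 8]
    return (p[0] + dx, p[1] + dy)


def follow_border(S, p0):
    """Follow the outer border of the pixel set S, starting at border pixel p0."""
    for d in (0, 2, 4, 6):          # look for a 4-neighbor Q0 outside S
        q0 = neighbor(p0, d)
        if q0 not in S:
            break
    else:
        raise ValueError("p0 has no 4-neighbor outside S")
    p, q = p0, q0
    contour = [p0]
    while True:
        r1, r2 = neighbor(p, d + 1), neighbor(p, d + 2)
        if r2 not in S:             # turn around P: Q moves to R2
            q, d = r2, (d + 2) % 8
        elif r1 not in S:           # straight move: P goes to R2, Q to R1
            p, q = r2, r1
        else:                       # diagonal move: P goes to R1, Q stays
            p, d = r1, (d - 2) % 8
        if contour[-1] != p:
            contour.append(p)
        if p == p0 and q == q0:
            return contour


square = {(x, y) for x in range(3) for y in range(3)}
border = follow_border(square, (0, 0))
```

For the 3 × 3 square the contour visits the eight outer pixels and returns to the start, while the center pixel (1, 1) is never reached.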
Illustration of connected components
Connected components from RLSA

Connected components can be used to detect characters

Words can be located using RLSA
Grouping components

Grouping connected components is nontrivial

Grouping rules are based on
 - relative positioning
 - distances and thresholds
 - component classification

Parameters can be estimated statistically
Allen's relations in 2D space

The relative positioning of two rectangles generates 169 configurations (13 Allen interval relations on each axis, 13 × 13 = 169)!
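
The count can be made concrete with a small sketch that names the Allen relation of two intervals and combines the relations on both axes for two rectangles. The representation of rectangles as (left, top, right, bottom) and the function names `allen_relation` and `rectangle_relation` are assumptions of this example.

```python
ALLEN_RELATIONS = [
    "before", "meets", "overlaps", "starts", "during", "finishes", "equals",
    "finished-by", "contains", "started-by", "overlapped-by", "met-by", "after",
]


def allen_relation(a, b):
    """Name the Allen relation of interval a = (a1, a2) with respect to b = (b1, b2)."""
    a1, a2 = a
    b1, b2 = b
    if a2 < b1:
        return "before"
    if a2 == b1:
        return "meets"
    if b2 < a1:
        return "after"
    if b2 == a1:
        return "met-by"
    # From here on the two intervals share more than a single point.
    if a1 == b1 and a2 == b2:
        return "equals"
    if a1 == b1:
        return "starts" if a2 < b2 else "started-by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "contains"
    return "overlaps" if a1 < b1 else "overlapped-by"


def rectangle_relation(r, s):
    """Pair of Allen relations (x axis, y axis) for rectangles (left, top, right, bottom)."""
    return (allen_relation((r[0], r[2]), (s[0], s[2])),
            allen_relation((r[1], r[3]), (s[1], s[3])))


word_a = (0, 0, 2, 2)
word_b = (3, 0, 5, 2)
relation = rectangle_relation(word_a, word_b)  # ("before", "equals")
```

A grouping rule can then be written as a set of accepted relation pairs, for instance "before" or "meets" on the x axis together with a strongly overlapping relation on the y axis for two characters of the same text line.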
Threshold estimation

Thresholds can be estimated from the statistical distributions of
 - horizontal spaces for character grouping into words and word grouping into text lines
 - vertical spacing for grouping text lines into text blocks
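
One possible way to derive such a threshold from measured gaps is sketched below. It assumes that the gap widths form two clusters (for example inter-character and inter-word spaces) and places the threshold halfway between them, choosing the split that minimizes the within-cluster variance. The function names `sum_squared_error` and `estimate_gap_threshold` and the sample values are hypothetical and only serve as an illustration.

```python
def sum_squared_error(values):
    """Sum of squared deviations from the mean of the values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def estimate_gap_threshold(gaps):
    """Split the gaps into two clusters and return the midpoint between them."""
    values = sorted(gaps)
    best_cost, best_threshold = None, None
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue  # only split between distinct values
        cost = sum_squared_error(values[:i]) + sum_squared_error(values[i:])
        if best_cost is None or cost < best_cost:
            best_cost = cost
            best_threshold = (values[i - 1] + values[i]) / 2
    return best_threshold  # None when all gaps are identical


# Hypothetical horizontal gaps (in pixels) measured between neighboring components.
gaps = [1, 2, 1, 2, 2, 1, 8, 9, 1, 2, 10, 2]
threshold = estimate_gap_threshold(gaps)  # 5.0
```

Gaps below the threshold would then be treated as spaces inside a word and gaps above it as spaces between words; the same idea applies to vertical line spacing.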
Distributions of component sizes

Components can be classified into
 - symbols
 - letters
 - hairlines
 - punctuation
according to their size
Region grouping
Docstrum

The docstrum method [O'Gorman] uses a graph that connects each connected component to its k nearest neighbors
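
The neighbor graph can be sketched with a brute-force search over component centroids. The sketch assumes that each component is represented by its centroid (x, y); the function name `k_nearest_neighbors` is hypothetical. Each edge also carries the distance and the angle between the two components, the quantities on which the docstrum analysis is based.

```python
import math


def k_nearest_neighbors(centroids, k):
    """Return edges (i, j, distance, angle in degrees) from each point to its k nearest neighbors."""
    edges = []
    for i, (xi, yi) in enumerate(centroids):
        candidates = sorted(
            (math.hypot(xj - xi, yj - yi), j)
            for j, (xj, yj) in enumerate(centroids) if j != i
        )
        for distance, j in candidates[:k]:
            xj, yj = centroids[j]
            angle = math.degrees(math.atan2(yj - yi, xj - xi))
            edges.append((i, j, distance, angle))
    return edges


# Hypothetical centroids: three characters of one word and a distant component.
centroids = [(0, 0), (10, 0), (20, 0), (100, 0)]
edges = k_nearest_neighbors(centroids, k=1)
```

The brute-force search is quadratic in the number of components; a spatial index would be preferable on full pages. Histograms of the edge angles and distances then give estimates of the text orientation and of the character and line spacing.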
Model driven layout analysis [Azokly95]
Generic macrostructures

In a model-driven approach, generic macrostructures are used
 - a formal language describes margins and separators
Formal description of macrostructures
VOLUME Article IS
  WIDTH = 160; HEIGHT = 240;
  PAGE Garde IS ... END;
  PAGE Paire IS
    HSEP hs1 = (4, 3, LEFT, RIGHT, BLANK);
    LAYER Principal IS
      VSEP vs1 = (40, 65, TOP, hs1, BLANK);
      VSEP vs2 = ([50,60], 4, hs1, BOTTOM, BLANK);
      REGION Centre = (vs2, RIGHT, hs1, BOTTOM, ANY, NORMAL);
      REGION Marge = (LEFT, vs2, hs1, BOTTOM, TEXT, SMALL);
      ...
    END;
    LAYER Secondaire IS
      HSEP hs2 = ([10,220], 2, LEFT, RIGHT, BLANK) SUBST hs1;
      HSEP hs3 = ([20,240], 2, LEFT, RIGHT, BLANK) SUBST BOTTOM;
      REGION Figure = (LEFT, RIGHT, hs2, hs3, {TABLE, GRAPHICS});
    END;
  END;
  PAGE Impaire IS ... END;
END;
Evaluation of segmentation results

Segmentation is rarely perfect; it generates
 - undersegmentation: real components are merged
 - oversegmentation: a single component is split

Special metrics have been developed to evaluate a segmentation result
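
To make the two error types concrete, the sketch below counts them by comparing ground-truth regions with detected regions. It is a deliberately simplified illustration and not one of the published evaluation metrics: regions are axis-aligned boxes (top, left, bottom, right), two regions are said to match when their overlap covers at least a fraction `min_overlap` of the smaller one, and all names are hypothetical.

```python
def area(box):
    """Area of a box (top, left, bottom, right), bottom/right exclusive."""
    return (box[2] - box[0]) * (box[3] - box[1])


def overlap_area(a, b):
    """Area of the intersection of two boxes (0 if they are disjoint)."""
    height = min(a[2], b[2]) - max(a[0], b[0])
    width = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, height) * max(0, width)


def segmentation_errors(truth, detected, min_overlap=0.1):
    """Return (oversegmented truth regions, undersegmenting detected regions)."""
    def matches(box, others):
        return [o for o in others
                if overlap_area(box, o) >= min_overlap * min(area(box), area(o))]

    # A true region covered by several detected regions has been split.
    over = sum(1 for t in truth if len(matches(t, detected)) > 1)
    # A detected region covering several true regions has merged them.
    under = sum(1 for d in detected if len(matches(d, truth)) > 1)
    return over, under


truth = [(0, 0, 10, 10), (0, 20, 10, 30)]
merged = [(0, 0, 10, 30)]                  # one detected block spans both regions
errors = segmentation_errors(truth, merged)  # (0, 1)
```

Real evaluation schemes additionally weight the errors by region size and type and handle partial misses and false detections.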

Scientific contests on segmentation were organized at ICDAR'03 and ICDAR'05
Conclusion

Segmentation is a crucial step in document analysis

Segmentation is almost solved for
 - printed documents with regular layout
 - form analysis

Results are rarely perfect
 - Contextual knowledge may improve the results
 - Advanced pattern recognition methods are required

Segmentation remains an open problem for unconstrained handwriting and graphical documents
Component hierarchy