
An experimental evaluation of cluster k estimation methods on deep learned vector embeddings for page stream segmentation

Submitted in partial fulfillment for the degree of Master of Science
Erick Alfaro
11814918
Master Information Studies
Data Science
Faculty of Science
University of Amsterdam
Submitted on 23.12.2022
UvA Supervisor: Dr. Maarten Marx, UvA, maartenmarx@uva.nl
External Supervisor
ABSTRACT
In the task of Page Stream Segmentation, a stream refers to a consecutive set of pages that may split into a collection of k documents, where k may be unknown. We approach the task of Page Stream Segmentation (PSS) as a clustering problem in which we attempt to cluster similar pages into a set of k documents that reflects the true number of clusters/segments of a stream of pages. As in other clustering tasks, a common challenge for the practitioner is identifying the exact value of k present in a given data set. We therefore build on prior work in this clustering approach and evaluate the effectiveness of estimation methods for approximating the true number of documents k in our PSS problem, specifically on the WOO data set. In our approach we use deep learning based representations of each page in a stream, testing both pretrained and finetuned BERT sentence embeddings as well as VGG16 pretrained on ImageNet. We attempt to approximate the number of documents k in a given stream using various knee based methods as well as the Pruned Exact Linear Time (PELT) change point detection method. We evaluate our results on both synthetically generated data and our domain specific WOO corpus. We find that BERT embeddings in combination with PELT on the synthetically generated data provide an RMSE within 20% of the actual k. However, our results for the vector representations of the WOO corpus show an RMSE of 179, a significantly larger error magnitude.
KEYWORDS
Page Stream Segmentation, Agglomerative Clustering, PELT, L-Method
1 GITHUB REPOSITORY
https://github.com/erickalfaro/ealfaro_thesis
2 INTRODUCTION
Page Stream Segmentation (PSS) is the task of splitting a stream
of pages into sets of consecutive pages that belong to a single
document. PSS can play an important role in digitizing physical
documents via a batch scanning process. Batch scanning documents
with disregard for exact cutoffs leads to the generation of a stream
of pages which may represent multiple sets of documents with
no clear separation. In this evaluation we will ultimately assess
the performance of cluster k estimation methods on the recently
released corpus of documents provided by the Dutch government
under the Open Government Act, or Wet Open Overheid (WOO)
[43]. Wet Open Overheid gives citizens/journalists the ability to
request documents related to governmental decision making. For
such requests the information is received in the form of batch-scanned PDFs which may contain 1000+ pages. The documents
can represent a wide range of concatenated document types such as
multi-page governmental reports, electronic communications, and
partially redacted meeting minutes. Thus the task of segmenting the
pages in a stream of pages provided by the Dutch government serves
the purpose of enabling citizens and journalists to effectively audit
the Dutch government as a safeguard for a functioning democracy.
The problem of Page Stream Segmentation can be approached
both as a binary classification problem and as a clustering problem. In the classification approach, a binary classifier attempts to predict for each page whether it is the first page of a document (1) or not (0). In this evaluation we instead use the clustering approach, specifically constrained agglomerative clustering, wherein cluster distances are computed strictly between neighboring pages. This clustering approach builds on prior work by Collins-Thompson [9], wherein document features such as word position, character width, and spacing, as well as text similarity and metadata, are clustered using hierarchical clustering. In our approach we instead use pre-trained and fine-tuned CNN and BERT transfer learning based image and text vector representations. The CNN based approach builds on work introduced by Gallo et al. [17] and further builds on the multi-modal approach by Wiedemann and Heyer [44], wherein image based vectors as well as OCR'ed text based features are used in a binary classification setting.
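To make the constraint concrete, the sketch below shows one way to express neighbor-constrained agglomerative clustering with scikit-learn; the chain connectivity matrix and toy embeddings are our own illustration, not necessarily the exact setup used in this work.

```python
# Hypothetical sketch: agglomerative clustering constrained to neighbors.
# A chain-shaped connectivity matrix only allows merges between adjacent
# (groups of) pages, so every resulting cluster is a contiguous segment.
import numpy as np
from scipy.sparse import diags
from sklearn.cluster import AgglomerativeClustering

page_vectors = np.random.rand(12, 768)  # stand-in for per-page embeddings
n = len(page_vectors)

# Link each page only to its immediate neighbors in the stream.
connectivity = diags([np.ones(n - 1), np.ones(n - 1)], offsets=[-1, 1])

model = AgglomerativeClustering(
    n_clusters=3,              # the k that the estimation methods approximate
    connectivity=connectivity,
    linkage="average",
    metric="cosine",
)
labels = model.fit_predict(page_vectors)
print(labels)  # e.g. [0 0 0 1 1 1 1 2 2 2 2 2]: contiguous segments
```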
Our work extends the work by Busch [8], Groenen [19], and van Heusden [43], in which we attempt to improve PSS clustering performance by approximating the correct cluster k using k estimation methods.
In our evaluation we test four knee approximation methods and one change point detection method. First we implement and test the L-Method [37] and the Refined L-Method, computationally efficient algorithms for finding the "knee" of a graph, which can be defined as its point of maximum curvature. The L-Method fits two lines to a set of cluster distances/similarities and attempts to minimize the weighted RMSE of the two lines' fit. We additionally test the open source Kneedle algorithm as well as a more recent knee approximation method, dynamic first derivative thresholding (DFDT). As our second class of approaches we test the Pruned Exact Linear Time (PELT) [28] algorithm, a commonly used change point detection algorithm whose objective function minimizes a penalized sum of costs. PELT is an exact approach whose computational cost can be considered O(n).
Research Aim. The aim of this research is the evaluation of cluster k estimation methods on linearly constrained data using agglomerative clustering. We use vector embedding representations of the WOO data obtained from fine-tuned and pre-trained deep learning models. We aim to answer the following questions:
• RQ 1: Using RMSE to evaluate the k estimation methods, how effective are our cluster k estimation methods on synthetic data sets?
• RQ 2: Using RMSE to evaluate the k estimation methods, how effective are our cluster k estimation methods on the Wet Open Overheid (WOO) data set? We evaluate transformer based pretrained and fine-tuned embeddings as well as CNN model embeddings.
• RQ 3: Which k estimation method shows the greatest improvement in PSS clustering performance?
3 RELATED WORK
We first discuss prior work strictly related to Page Stream Segmentation. We then discuss work on knee/elbow estimation for the purpose of cluster k approximation, and work on change point detection as it relates to our problem of Page Stream Segmentation.
3.1
and as such may produce poor accuracy when estimating knee/elbow points. Satopaa et al. [38] propose the Kneedle algorithm to
find the point of maximum curvate of a line by re-scaling a set of
discrete points using the first (head) and last (tail) points of a curve.
Salvador et al. [37] propose the L-method which identifies a point
on a error curve that minimizes the weighted RMSE of two lines
fit on the error curve. Antunes et al. [4] introduce dynamic first
derivative thresholding (DFDT) which uses a threshold algorithm
and the first derivative of a curve to find the sharpest angle between
high and low values. Antunes et al. [3] later propose the AL-method
and S-method. The AL-method, similar to the L-method, fits two
lines on an error curve and calculates a scaled score between the
range of [0, 1] combining the RMSE and an angle score based on
the sharpness of the angle identified. The S-method, similar to the
L-method, fits three lines on an error curve.
Prior work on Knee/Elbow estimation methods can be found in a
wide variety of domains. In Franke et al. [16] they estimate fatigue
damage of materials using a knee based approach. Satopaa et al. [38]
test the effectiveness of Kneedle on network latency and congestion
control. Khan et al. [27] use Kneedle to identify the K in the task
of clustering sentence topics in the task of Text Summarization
using an Extractive based K-Means and TFIDF method. Cuevas
[10] evaluates the effectiveness of COVID outbreak detection in
facilities using Kneedle. Salvador et al. [37] evaluate the L-method
specifically for estimating the cluster K for hierarchical clustering
using similarity/distance metrics where K is unknown.
Page Stream Segmentation
Prior work on Page Stream Segmentation broadly falls into two categories: rule based or machine learning based. Rule based systems generally rely on manually engineered features from the layout and text descriptors of a given page to segment pages into sequential documents. Some examples of rule based features include recurring phrases [9] and named entities, page number sequences located on specific parts of a page [9] [31], text box positions on a page [26], and formatting information [26] [9]. Machine learning based approaches tend not to rely on manually engineered features and instead create word or image based embeddings which are later segmented/clustered/classified into documents.
To our knowledge Collins-Thompson [9] first used a combination of hierarchical clustering and classification using handcrafted structure/layout features as well as text similarity features for rule based representations of pages to separate pages into documents. Later, in 2009, Meilender and Belaïd [31] presented a multi-gram based Bakis model. Gordo et al. [18] propose a multipage supervised approach using textual descriptors for classifying document pages in the task of document separation. Daher et al. [11] [12] propose a two model approach that classifies each page to a document and then validates the classifications by predicting confidence scores. Rusiñol et al. [36] propose a multi-modal approach to the Document Image Classification (DIC) problem using TF-IDF and LSA textual features and pixel density descriptors from image data. Harley et al. [22] introduced a then state-of-the-art DIC approach using CNN image vector features. Agin et al. [2] proposed an ML only approach wherein they test the performance of three binary classifiers (SVM, Random Forest, MLP) on a combination of Bag of Visual Words page representations and font information. Noce et al. [33] also leverage a multi-modal approach for the DIC problem using class-specific key-terms along with an image classifier. Gallo et al. [17] leverage deep neural networks, tackling the PSS problem as a result of a DIC process detecting a change of document class between consecutive pages; this process only works under the assumption that consecutive documents in a stream belong to different classes. Hamdi et al. [21] create embedding features using doc2vec and compare these embeddings against page neighbors with cosine similarity to classify the start of a new document; their approach showed difficulty identifying single page documents.
The current state-of-the-art approaches [7] [44] [20] [14] for Page Stream Segmentation use multi-modal deep learning based embeddings and treat PSS as a binary classification problem. Wiedemann and Heyer [44] used the VGG16 architecture for image embeddings and pre-trained FastText for word embeddings. Later, Demirtas et al. [14] improved performance by adding interpage context for 33 semantic classes.
3.2 Knee/Elbow Estimation

Several approaches exist for the task of detecting knee/elbow points in discrete data. Léger [29] defined curvature as the curvature of the circle circumscribed by a set of three discrete points of a function. This method relies on only three points and as such may produce poor accuracy when estimating knee/elbow points. Satopaa et al. [38] propose the Kneedle algorithm to find the point of maximum curvature of a line by re-scaling a set of discrete points using the first (head) and last (tail) points of a curve. Salvador and Chan [37] propose the L-method, which identifies the point on an error curve that minimizes the weighted RMSE of two lines fit on the error curve. Antunes et al. [4] introduce dynamic first derivative thresholding (DFDT), which uses a threshold algorithm and the first derivative of a curve to find the sharpest angle between high and low values. Antunes et al. [3] later propose the AL-method and S-method. The AL-method, similar to the L-method, fits two lines on an error curve and calculates a scaled score in the range [0, 1] combining the RMSE and an angle score based on the sharpness of the identified angle. The S-method, similar to the L-method, fits three lines on an error curve.

Prior work on knee/elbow estimation methods can be found in a wide variety of domains. Franke and Dierkes [16] estimate fatigue damage of materials using a knee based approach. Satopaa et al. [38] test the effectiveness of Kneedle on network latency and congestion control. Khan et al. [27] use Kneedle to identify k when clustering sentence topics for extractive text summarization using K-Means and TF-IDF. Cuevas [10] evaluates the effectiveness of COVID outbreak detection in facilities using Kneedle. Salvador and Chan [37] evaluate the L-method specifically for estimating the cluster k for hierarchical clustering using similarity/distance metrics where k is unknown.

3.3 Change Point Detection
We refer to Truong et al. [42] for a comprehensive survey of algorithms for the task of offline change point detection. The earliest works in change point detection can be found in the domain of industrial quality control, in locating a shift in the mean of independent and identically distributed (iid) Gaussian variables. This type of task has since been actively researched in a variety of other domains such as speech processing, financial analysis, bio-informatics, climatology, and other areas of physical phenomena or human activity that are monitored with sensors.
Change point detection can be divided into two categories: online and offline methods. Online methods, often referred to as event/anomaly detection, are usually applied in real-time settings, whereas offline methods, also known as signal segmentation methods, focus on segmenting a signal after it has been collected. Furthermore, offline change point detection methods fall into two categories: (1) where the number of change points K is known and (2) where the number of change points K is unknown.
For our PSS task, we seek an offline change point detection algorithm that is applicable where the number of change points K is unknown and is computationally efficient. The earliest and most established search method within the change point literature is Binary Segmentation (BS) [40] [39]. The BS method iteratively segments subsets of a sequence and can extend a single change point to multiple change points. The BS method is computationally efficient, O(n log n); however, since all subsequent change points depend wholly on the first detected change point, the BS method is not considered an exact method. Auger and Lawrence [6] propose the Segment Neighbourhood (SN) exact search method, which searches
the entire segmentation space using dynamic programming. SN computes a cost for all possible segmentations between 0 and Q, the maximum number of change points specified. Due to the exhaustive search of SN, the computational cost is significant, O(Qn²), and as Q grows linearly with the observed data the cost approaches O(n³). Yao [46] and Jackson et al. [25] propose the exact Optimal Partitioning (OP) method, which iteratively identifies a change point and computes a cost related to the data prior to the last change point plus the cost for the segment from the last change point to the end of the data. Compared to the SN method, the OP method improves computational efficiency, with computation time being O(n²). Killick et al. [28] introduce a modification to the OP method denoted PELT which, under certain conditions, results in linear computational cost O(n) whilst retaining the exact minimisation of the OP search method. PELT achieves this via a combination of optimal partitioning and the pruning of data points which can never be minima of the minimisation performed at each search iteration.
PELT has been applied on a wide range of data sets such as DNA
sequences [24] [30], physiological signals [23], oceanographic data
[28], and textual data based on term frequencies [32] [47] and
word2vec embeddings [34].
4 METHODOLOGY

Figure 1: Page and Document distributions of WOO data
The methodology is organized as follows. We first introduce our WOO data set as well as our baseline synthetic data set. We then review our knee/elbow k estimation methods, followed by a review of our change point detection approach, PELT. We close this section with a brief overview of the metrics used to evaluate our results.
4.1 Data
4.1.1 Wet Open Overheid. In this evaluation we use a recently released stream of data provided by the Dutch government under the Open Government Act, or Wet Open Overheid (WOO). This data originates from sets of PDF documents containing concatenated emails, reports, WhatsApp conversations, and meeting minutes, which have been annotated by members of Follow the Money [1] and students of the University of Amsterdam. We evaluate a total of 108 streams having at least two documents, excluding 2 from our total of 110 streams.
The average page count in each stream is 827 pages (Fig. 1 top left) and the average document count is 224 documents per stream (Fig. 1 top right). The total number of pages in our data set is 89491 and the total document count is 24181. We observe that 102 streams have a ratio of less than 10 pages per document (Fig. 1 bottom left) and in aggregate we observe an average of 3 pages per document after having removed two outliers from our data set (Fig. 1 bottom right).
4.1.2 Synthetic Text Data. The authors of the Ruptures [42] change point detection Python library introduce a synthetically generated data set of 99 sentences which split into 10 distinct groups of nine to eleven sentences. The sentences are linearly ordered by the group they belong to and show clear semantic variation from group to group. Table 1 shows five sentences in the order they are found in the data set. Sentences 10 and 11 belong to one group and sentences 12, 13, and 14 belong to the next group of sentences. The semantic difference is clear, and our segmentation approach should be able to detect the semantic separation and differentiate between the sentence groups.

Table 1: Brown Corpus Sentence Examples

Line number | Sentence
10 | That could be easily done, but there is little reason in it.
11 | It would come down to saying that Fromm paints with a broad brush, and that, after all, is not a conclusion one must work toward but an impression he has from the outset.
12 | the effect of the digitalis glycosides is inhibited by a high concentration of potassium in the incubation medium and is enhanced by the absence of potassium (Wolff, 1960).
13 | B. Organification of iodine The precise mechanism for organification of iodine in the thyroid is not as yet completely understood.
14 | However, the formation of organically bound iodine, mainly mono-iodotyrosine, can be accomplished in cell-free systems.

To generate baseline results for our k estimation methods, we take inspiration from Truong et al. [42] and refer to the Brown corpus [15] to generate more synthetic samples. We generate synthetic data from the Brown corpus using the paras function available in the NLTK package, which extracts paragraphs from the corpus. We then filter all paragraphs to only paragraphs containing
9-11 sentences, with each sentence being at least 6 words long, yielding 252 unique paragraphs across 15 varying
categories of topics such as ’adventure’, ’government’, ’humor’, etc.
We use these 252 unique paragraphs to iteratively construct 1000
unique combinations of paragraphs such that each paragraph is
followed by a paragraph from a different topic category.
The above synthetic data set can be considered a proxy to our
WOO data set if we consider each sentence as a proxy for a page
and each paragraph as a proxy for a document in the context of
page stream segmentation.
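As a rough sketch of this construction (the exact filtering and sampling in the thesis code may differ), the NLTK calls below extract and filter Brown paragraphs and chain them with alternating topic categories:

```python
# Hypothetical sketch of the synthetic stream construction described above.
import random
import nltk
from nltk.corpus import brown

nltk.download("brown", quiet=True)

# Keep paragraphs of 9-11 sentences whose sentences are >= 6 words long.
paragraphs = []
for category in brown.categories():  # 15 categories: 'adventure', 'humor', ...
    for para in brown.paras(categories=category):
        if 9 <= len(para) <= 11 and all(len(sent) >= 6 for sent in para):
            paragraphs.append((category, para))

def make_stream(n_docs=10):
    """Chain paragraphs so consecutive 'documents' differ in topic."""
    stream, last = [], None
    while len(stream) < n_docs:
        category, para = random.choice(paragraphs)
        if category != last:
            stream.append(para)  # paragraph ~ document, sentence ~ page
            last = category
    return stream
```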
Figure 2: Distance metrics relative to a dendrogram
4.1.3 Embeddings. We evaluate a set of pre-trained and fine-tuned embeddings for both image and text modalities on the WOO data. The image vector representations are generated from the last layer of a pre-trained and fine-tuned VGG16 CNN model, based on the prior works of Noce et al. [33], Gallo et al. [17], and Wiedemann and Heyer [44]. The pre-trained text vectors are generated using the Dutch RobBERT model [13], using SentenceTransformers to generate vectors for a page. The fine-tuned text embeddings are generated from a BERT model trained for sequence classification using Hugging Face [45]. Lastly, we use the pre-trained all-MiniLM-L6-v2 BERT model from Hugging Face to generate text embeddings for our synthetic data.
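For the synthetic data, generating the text vectors reduces to a few lines with SentenceTransformers; the snippet below is a minimal sketch using the all-MiniLM-L6-v2 model named above (treating each page's text as one input is our simplification):

```python
# Sketch: per-page text embeddings via SentenceTransformers.
from sentence_transformers import SentenceTransformer

pages = [
    "Subject: budget meeting. Dear colleagues, ...",   # toy page texts
    "Minutes of the committee meeting, 12 March ...",
    "Continuation of the same minutes ...",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(pages, normalize_embeddings=True)
print(embeddings.shape)  # (3, 384): one 384-dimensional vector per page
```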
4.2 K Segment Estimation Methods

4.2.1 L Method. Though knee/elbow estimation of a curve is considered a heuristic process, a definition of curvature for continuous functions has been previously defined [38] [37] as equation 1. For continuous functions the knee/elbow point is the point of maximum curvature. The maximum curvature can be found by measuring how much a function differs from a straight line. The closed form K_f(x) in equation 1 defines this curvature of f(x) at any given point as a function of its first and second derivatives. The maximum of the first derivative is also known as the inflection point of the curve. This is not representative of the knee and instead only captures where the rate of increase reaches a maximum. In contrast, the maximum curvature definition precisely matches the concept of a knee.

K_f(x) = \frac{f''(x)}{(1 + f'(x)^2)^{3/2}}    (1)

However, as shown in [5], fitting a continuous function to discrete curvatures produces poor results. Thus for our PSS task we instead seek to detect the knee/elbow points using methods more appropriate for discrete data.

4.2.2 L-method. Salvador and Chan [37] propose an efficient knee finding algorithm to automatically determine the number of clusters using the discrete distances returned by hierarchical clustering (these are the same distances used to create a dendrogram, Fig. 2). In hierarchical clustering each cluster is associated with a greedily computed distance metric that specifies the distance between itself and the nearest cluster after each single iteration. The L-method uses a two dimensional evaluation graph (Fig. 3) where the x-axis plots the count of clusters and the y-axis plots an evaluation metric such as cluster distance. The method iteratively fits two lines, starting at each end of the evaluation graph, to all points on the evaluation graph with the goal of minimizing the weighted RMSE of the fit between the two lines. The two lines L_c and R_c are fit at each c, where c is the iterative cluster cutoff value and b is the total count of clusters.

RMSE_c = \frac{c - 1}{b - 1} \cdot RMSE(L_c) + \frac{b - c}{b - 1} \cdot RMSE(R_c)    (2)

Equation 2 defines the weighted RMSE, where the partition of L_c and R_c is at x = c. The L-method seeks the value of c such that RMSE_c is minimized: \hat{c} = \arg\min_c RMSE_c.

4.2.3 Refined L Method. As discussed in [37], the L-method evaluation graph plots each distance data point in an effort to then find the knee of the series of distances. In cases where the practitioner fits the L-method to a long array of distances, the evaluation graph can contain a long tail of irrelevant data points, which skews the knee finding process. Salvador and Chan [37] additionally propose the Refined L Method, an iterative pruning process in which a knee is identified using the classic L-method and each subsequent iteration reduces the total count of data points b and recomputes the L-method knee in an attempt to find a more accurate knee cutoff. The iterative process stops after the first instance in which a reduction in b does not lead to a change in c.
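A minimal sketch of the L-method search, implementing equation 2 directly, is shown below; the helper and toy interface are our own illustration rather than code from the thesis repository.

```python
# Hypothetical sketch of the L-method: fit a left and a right regression
# line at every candidate split c and keep the c minimizing equation 2.
import numpy as np

def _line_rmse(x, y):
    slope, intercept = np.polyfit(x, y, 1)        # least-squares line
    return np.sqrt(np.mean((y - (slope * x + intercept)) ** 2))

def l_method(distances):
    """distances: merge distances plotted against cluster counts 2..b."""
    x = np.arange(2, len(distances) + 2)
    y = np.asarray(distances, dtype=float)
    b = x[-1]
    best_c, best_rmse = None, np.inf
    for c in range(3, b - 1):                     # >= 2 points on each side
        left, right = x <= c, x > c
        total = ((c - 1) / (b - 1)) * _line_rmse(x[left], y[left]) \
              + ((b - c) / (b - 1)) * _line_rmse(x[right], y[right])
        if total < best_rmse:
            best_c, best_rmse = c, total
    return best_c                                 # estimated k
```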
4.2.4 Kneedle. Kneedle [38] approximates maximum curvature (the knee) as the set of points in a curve that are local maxima if the curve is rotated θ degrees clockwise about (x_min, y_min) through the line formed by the points (x_min, y_min) and (x_max, y_max). The Kneedle algorithm takes into account the head and tail points of an error curve and approximates the knee of a curve by identifying points where the curve becomes inherently more "flat". The first step of Kneedle is to use a smoothing spline to smooth the shape of the input data. Then the ranges of x and y values are normalized to the unit square. Next, perpendicular distances are calculated between each discrete data point and the diagonal line between the first and last data points. The distances from the prior step are then used to find a local maximum of the new set of normalized points, which is the knee point derived by Kneedle. It is additionally possible to adjust the sensitivity S of the Kneedle algorithm, which defines how aggressive/conservative the knee detection should be. Lower values of S lead to more aggressive knee detection; in online scenarios a higher S may be preferred.
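The open source kneed package implements this procedure; a minimal usage sketch on a toy merge-distance curve (our own example data) looks as follows:

```python
# Sketch: locating a knee with the open source Kneedle implementation.
from kneed import KneeLocator

distances = [10, 5, 3, 2, 1.5, 1.2, 1.0, 0.9]  # toy decreasing curve
cluster_counts = range(2, 2 + len(distances))

locator = KneeLocator(
    cluster_counts, distances,
    S=1.0,                     # the sensitivity parameter discussed above
    curve="convex",
    direction="decreasing",
)
print(locator.knee)            # estimated k at the point of maximum curvature
```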
4.2.5 DFDT. Antunes et al. [4] propose DFDT for the task of cluster k estimation. DFDT is a computationally efficient knee estimation method with the aim of improving knee estimation on discrete curves with long tails. DFDT is a hybrid between the Menger curvature [41] and the L-method. DFDT estimates the first derivative of a given curve, which represents the slope of the tangent line. Using this tangent line, DFDT identifies the point where the function has a sharp angle. Instead of fitting two straight lines like the L-method, DFDT uses the IsoData [35] threshold algorithm to find the ideal split between high and low values of the first derivative. The advantage of this approach is that the threshold algorithm splits data based on the actual values of the discrete curve and not the quantity of values, mitigating the effect of long tails.
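Under our reading of Antunes et al. [4], a bare-bones version can be sketched as follows: take the first derivative of the curve, split its values with the iterative IsoData threshold, and report the first point falling on the low-slope side. This is an illustrative reconstruction, not the reference implementation.

```python
# Hypothetical sketch of DFDT: IsoData threshold over the first derivative.
import numpy as np

def dfdt(y, tol=1e-6):
    slopes = np.abs(np.diff(np.asarray(y, dtype=float)))
    t = slopes.mean()                      # IsoData: start from the mean
    while True:
        low, high = slopes[slopes <= t], slopes[slopes > t]
        if low.size == 0 or high.size == 0:
            break
        new_t = (low.mean() + high.mean()) / 2.0
        if abs(new_t - t) < tol:
            break
        t = new_t
    # Knee: first index whose slope falls below the converged threshold.
    return int(np.argmax(slopes <= t))
```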
4.2.6 PELT. Change point detection (CPD) detects shifts in time series trends. CPD can be used to detect anomalous sequences/states both in real-time (online) and retroactively (offline). Pruned Exact Linear Time (PELT) builds on Optimal Partitioning (OP) with an added step to prune candidate change points. OP aims to minimize the cost function shown in equation 3 using dynamic programming. The algorithm works by optimising at each time step based on the optimal solutions at all previous time steps. At each step t, the algorithm considers all time steps after the last change point up to the current time step as candidates for a new change point. Starting with F(0) = 0 and given a value of the penalty β, F(t) can be calculated for all values of t, with a resulting optimal cost. Equation 3 is iterated for each step t = 1, ..., n, which results in O(n²) time.

F(t) = \min_{0 \le \tau < t} \left[ F(\tau) + C(y_{(\tau+1):t}) + \beta \right]    (3)

While ensuring the optimality of OP and improving computational efficiency to O(n), Killick et al. [28] propose a pruning step as shown in equation 4. If equation 4 holds at τ' ≤ t − 1, then τ' can be pruned for all future time steps.

F(\tau') + C(y_{(\tau'+1):t}) < F(t)    (4)
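In practice this search is available in the Ruptures library of [42]; the snippet below is a minimal sketch on stand-in embeddings, with the cost model and penalty value chosen for illustration only.

```python
# Sketch: PELT over a stream's page embeddings with the ruptures library.
import numpy as np
import ruptures as rpt

embeddings = np.random.rand(50, 384)   # stand-in multivariate page signal

algo = rpt.Pelt(model="rbf").fit(embeddings)
breakpoints = algo.predict(pen=5.0)    # penalty beta from equation 3
# predict() returns the end index of each segment, the last being
# len(signal), so the number of segments is a direct estimate of k.
k_estimate = len(breakpoints)
print(k_estimate, breakpoints)
```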
Figure 3: All possible best-fits with respective RMSE for synthetic example data

4.3 Metrics

For the evaluation of our cluster k approximation methods we refer to metrics that measure our ability to label a page as the start of a new document (we interchange "sentences" for "pages" and "paragraphs" for "documents" as it relates to our synthetic data). For this purpose we measure the Macro F1 score and the Weighted Block F1 score, and use RMSE to measure the error of the estimated k with respect to the true k.

4.3.1 The Macro F1 score, also referred to as Page F1 by Busch and Marx [8], can be defined by the formulas below. True and false positives and negatives are denoted by the standard TP, FP, TN, and FN respectively, each explicitly referring to whether the given page is labeled as a "start page".

Precision = \frac{TP}{TP + FP}    (5)

Recall = \frac{TP}{TP + FN}    (6)

PageF1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}    (7)

PageF1 = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}    (8)

4.3.2 The Weighted Block F1 score [43], also referred to as Doc F1 by Busch and Marx [8], can be defined by the formulas below. The ground truth block is denoted as D_t and a predicted block as D_p. A block is a set of connected pages. The intersection over union of D_t and D_p, IoU(D_t, D_p), is calculated using Jaccard similarity. A pair of blocks (D_t, D_p) is considered a true positive if more than half of the pages overlap, i.e. IoU(D_t, D_p) > 0.5. FP is defined as the set of all predicted blocks D_p that are not in a TP pair, and FN as the set of all ground truth blocks D_t that are not in a TP pair. This error metric approach essentially allows for weighted true positives, denoted WTP.

WTP = \sum \{ IoU(D_t, D_p) \mid (D_t, D_p) \in TP \}    (9)

DocPrecision = \frac{WTP}{|TP| + |FP|}    (10)

DocRecall = \frac{WTP}{|TP| + |FN|}    (11)

DocF1 = \frac{2 \cdot DocPrecision \cdot DocRecall}{DocPrecision + DocRecall}    (12)

DocF1 = \frac{WTP}{|TP| + 0.5(|FP| + |FN|)}    (13)
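As a worked sketch of equations 9-13 (with blocks represented as inclusive (start, end) page ranges of our own choosing):

```python
# Hypothetical sketch of the Weighted Block F1 from equations 9-13.
def iou(a, b):
    """Jaccard similarity of two inclusive (start, end) page ranges."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union

def weighted_block_f1(truth, pred):
    tp_pairs = [(t, p) for t in truth for p in pred if iou(t, p) > 0.5]
    wtp = sum(iou(t, p) for t, p in tp_pairs)       # equation 9
    fp = len(pred) - len(tp_pairs)                  # predicted, unmatched
    fn = len(truth) - len(tp_pairs)                 # ground truth, unmatched
    return wtp / (len(tp_pairs) + 0.5 * (fp + fn))  # equation 13

truth = [(0, 3), (4, 9), (10, 11)]
pred = [(0, 4), (5, 9), (10, 11)]
print(round(weighted_block_f1(truth, pred), 3))     # 0.878
```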
4.3.3 RMSE. RMSE is also computed in order to evaluate the performance of our cluster k approximation methods. We use RMSE to compare the estimated number of clusters to the ground truth number of clusters in our dataset. To calculate the RMSE, we first compute the difference between the ground truth and estimated number of clusters for each sample in our dataset. The lower the RMSE, the closer the estimated number of clusters is to the ground truth number of clusters, indicating better performance of the clustering algorithm.
5 RESULTS

We first review our k approximation baseline results on our synthetically generated data. Then we review our results on the WOO dataset.

5.1 Baseline Results

To set a baseline for our evaluation of the WOO data we use the average document length of our streams as shown in Figure 1. Our approach is simply to take the average document length of k = 3 and evaluate our metrics on the assumption that every third page is the start of a new document. With this approach we see the results shown in Table 2.

Table 2: Baseline results for the Naive mean on all streams.

Page F1 | Doc F1 | RMSE
0.04 | 0.22 | 174

As evident from the Naive mean approach, our Page F1 is very poor, which suggests that this approach is very poor at capturing a "document start page". The Doc F1 is also poor but slightly higher than the Page F1, and suggests that the Naive mean approach is able to overlap with the ground truth blocks in a small number of cases. Lastly, the high RMSE for the Naive mean suggests that on average the count of documents k in a stream is off by 174 using this approach.

5.2 Synthetic Data Results

Next we evaluate our k approximation methods on the synthetic data. We perform this evaluation in the same manner for both the synthetically generated data and the WOO data. We create vector representations of each segment within our corpus and look to approximate a k which matches the ground truth k as closely as possible. In the case of the knee based approximation methods we first obtain the cosine distances calculated by clustering the vector embeddings. Next we attempt to find a knee point on the series of distances, with the hypothesis that the knee point will most closely match the ground truth k. In the case of the PELT change point detection approach, we are able to pass the multivariate embeddings directly into the PELT algorithm. PELT will then provide an exact k as well as the specific change points in the given data.

In Table 3 we show the performance of our k approximation methods on synthetic data using both simple count vector embeddings and BERT based embeddings. For the baseline results we used four knee based k approximation methods as well as PELT. Additionally, we focus on the BERT embeddings as they are the topic of the WOO evaluation as well. These baselines show the relevance of our k approximation methods for the task of PSS and set the scene for our research question related to the PSS evaluation of the CNN and BERT embeddings on the WOO data. Referring to our research questions, we find that the PELT change point segmentation approach on our synthetic data using BERT embeddings is by far the best performing k approximation method, with an RMSE of 2.23. The remaining k approximation methods yield RMSEs that are significantly worse, with the next best approximation method being PELT using CountVector embeddings.

Table 3: Performance of k approximation methods on synthetic data, n = 1000

Embedding | Method | Page F1 | Doc F1 | RMSE
BERT | PELT | 0.608 | 0.743 | 2.23
CountVector | PELT | 0.117 | 0.392 | 8.19
BERT | Refined L-Method | 0.0633 | 0.123 | 54.1
BERT | L-Method | 0.056 | 0.116 | 54.2
BERT | DFDT | 0.055 | 0.107 | 49.9
BERT | Kneedle | 0.054 | 0.091 | 8.36

5.3 WOO Results

In Table 4 we show the results of our k approximation methods on the WOO data. We abbreviate fine-tuned and pre-trained embeddings to FT and PT respectively. The results are representative of all 110 streams in the WOO dataset for PELT, Kneedle, and DFDT. For the L-Method and Refined L-Method there is a 20 sample minimum, as suggested by the original authors [37]. Our results on the WOO data show clear deterioration, with Page F1, Doc F1, and RMSE being far worse than in the experiments on the synthetic data.

Table 4: Performance of k approximation methods on WOO data, n = 100

Embedding | Method | Page F1 | Doc F1 | RMSE
Image FT | PELT | 0.1829 | 0.1860 | 1037
Image FT | Kneedle | 0.0295 | 0.0555 | 343
Image FT | Refined L-Method | 0.0443 | 0.1511 | 165
Image FT | L-Method | 0.0547 | 0.1947 | 155
Image FT | DFDT | 0.0350 | 0.0828 | 374
Image PT | PELT | 0.1806 | 0.1842 | 1037
Image PT | Kneedle | 0.0205 | 0.0666 | 340
Image PT | Refined L-Method | 0.0349 | 0.0788 | 500
Image PT | L-Method | 0.0538 | 0.1317 | 448
Image PT | DFDT | 0.0679 | 0.1075 | 428
Text FT | PELT | 0.0423 | 0.2582 | 148
Text FT | Kneedle | 0.0107 | 0.0463 | 350
Text FT | Refined L-Method | 0.0200 | 0.2364 | 165
Text FT | L-Method | 0.0203 | 0.2363 | 166
Text FT | DFDT | 0.0353 | 0.2505 | 157
Text PT | PELT | 0.0187 | 0.0527 | 331
Text PT | Kneedle | 0.0251 | 0.0523 | 341
Text PT | Refined L-Method | 0.0238 | 0.0518 | 353
Text PT | L-Method | 0.0615 | 0.1318 | 136
Text PT | DFDT | 0.0458 | 0.0811 | 342
6 DISCUSSION

6.1 RQ 1: How effective are our cluster k estimation methods on synthetic data sets?
We tested all of our cluster π‘˜ estimation methods on synthetically
generated data from the Brown corpus [15]. Our evaluation on
1000 synthetically generated samples showed very clearly the most
viable solution which we found to be the PELT changepoint detection approach. Using BERT embeddings in combination with PELT
yielded a Page F1 at 0.608 which was almost 6x the same approach
with CountVector embeddings and Doc F1 at 0.743 which is double
that of the CountVector embeddings. The BERT+PELT embedding
approach also yielded an average RMSE at 2.23 which shows that
the change points by PELT have an error rate on average of around
±2 π‘˜ for our 𝑛 = 1000.
The results on the synthetic data using the BERT vector representation of the text are well adapted to the synthetic data. This
proves the viability of using a change point detection approach
within the task of text segmentation.
6.2 RQ 2: How effective are our cluster k estimation methods on the Wet Open Overheid (WOO) data?
We tested all of our cluster π‘˜ estimation methods on the WOO data
and observe significantly different results from the baseline results
on the synthetic data. The lowest RMSE we observe is 136 using the
L-method on the pretrained BERT embeddings closely followed by
PELT on the fine-tuned BERT embeddings. The PELT approach has
highest Page F1 and Doc F1 on the pretrained image embeddings
however has a RMSE over 1000. This is due to PELT identifying
almost every page as a new page for the image embeddings for both
the pretrained and fine tuned models.
6.3 RQ 3: Which k estimation method shows the greatest improvement in PSS clustering performance?
Across our evaluation, we found the BERT vector embeddings of the synthetic data in combination with PELT change point detection to be a very compelling approach for identifying text based segments. The application of this approach on the synthetic data and its positive results suggest that it is relevant to our task of PSS and, under certain circumstances, may provide a viable approach.

However, upon testing the approach on our research specific WOO corpus, we find that the vector embeddings we used in combination with PELT and the other knee based k approximation methods performed significantly worse than on the synthetic data. Improvements against the Naive mean baseline are nevertheless observed; most notably, the RMSE is lower for 7 different k approximation approaches.
7 CONCLUSION

In this evaluation we attempt to improve the clustering performance for the task of PSS on the WOO data by using five cluster k approximation techniques. We additionally identified the Brown corpus [15] as a good data source for generating synthetic data to use for baseline results. We use this corpus to generate n = 1000 samples and evaluate our approximation techniques.

In our evaluation we find that our cluster k approximation techniques perform well on the BERT vector representations of the synthetically generated data, with a promising RMSE score of 2.23. Additionally we observe a Page F1 of 0.608 and a Doc F1 of 0.743. Out of the five approximation techniques, the PELT change point detection approach showed the best performance by a large margin, with the next closest approach yielding an RMSE of 8.19.

Next we apply the same approximation techniques to the WOO data set introduced by van Heusden et al. [43]. We find that all five approaches degrade significantly in performance as compared to the synthetic data set. PELT on both pretrained and fine-tuned image embeddings shows large RMSE errors. These large RMSE errors for the image embeddings are due to PELT identifying each page as a change point for 79 streams, which feeds into a very large RMSE error. It is worth noting that the synthetically generated data is by design semantically different at each synthetically generated segment. In other words, each generated paragraph originates from a different category of topics, which allows the cosine distances to be large enough from paragraph to paragraph that the PELT change point detection method picks up on these semantic change points. In contrast, we observe that the WOO corpus contains cases where consecutive pages contain the same text. This may occur where, for example, an email conversation on one page is subsequently repeated verbatim in two separate documents that occur consecutively.

Lastly, for future work we suggest penalty optimization or penalty learning for PELT. As PELT requires a penalty parameter, an area of improvement would be to identify a penalty parameter that works on the corpus within the training set and apply this penalty on the testing set.

REFERENCES

[1] [n. d.]. Follow the money - platform voor onderzoeksjournalistiek. https://www.ftm.nl/
[2] Onur Agin, Cagdas Ulas, Mehmet Ahat, and Can Bekar. 2015. An approach to the segmentation of multi-page document flow using binary classification. In International Conference on Graphic and Image Processing.
[3] Mário Antunes, Henrique Aguiar, and Diogo Nuno Pereira Gomes. 2019. AL and S Methods: Two Extensions for L-Method. 2019 7th International Conference on Future Internet of Things and Cloud (FiCloud) (2019), 371–376.
[4] Mário Antunes, Diogo Gomes, and Rui L. Aguiar. 2018. Knee/elbow estimation based on first derivative threshold. In 2018 IEEE Fourth International Conference on Big Data Computing Service and Applications (BigDataService). IEEE, 237–240.
[5] Mário Antunes, Joana Ribeiro, Diogo Nuno Pereira Gomes, and Rui L. Aguiar. 2018. Knee/Elbow Point Estimation through Thresholding. 2018 IEEE 6th International Conference on Future Internet of Things and Cloud (FiCloud) (2018), 413–419.
[6] I. E. Auger and Charles E. Lawrence. 1989. Algorithms for the optimal identification of segment neighborhoods. Bulletin of Mathematical Biology 51, 1 (1989), 39–54.
[7] Fabricio Ataides Braz, Nilton Correia da Silva, and Jonathan Alis Salgado Lima. 2021. Leveraging effectiveness and efficiency in Page Stream Deep Segmentation. Eng. Appl. Artif. Intell. 105 (2021), 104394.
[8] Lukas Busch and Maarten Marx. 2022. Using deep learned vector representations for page stream segmentation by agglomerative clustering. (2022).
[9] Kevyn Collins-Thompson. 2002. A Clustering-Based Algorithm for Automatic Document Separation.
[10] Erik Cuevas. 2020. An agent-based model to evaluate the COVID-19 transmission risks in facilities. Computers in Biology and Medicine 121 (2020), 103827.
[11] Hani Daher and Abdel Belaïd. 2013. Document flow segmentation for business applications. In Electronic Imaging.
[12] Hani Daher, Mohamed-Rafik Bouguelia, Abdel Belaïd, and Vincent Poulain d’Andecy. 2014. Multipage Administrative Document Stream Segmentation.
2014 22nd International Conference on Pattern Recognition (2014), 966–971.
[13] Pieter Delobelle, Thomas Winters, and Bettina Berendt. 2020. RobBERT: a Dutch
RoBERTa-based Language Model. In Findings of the Association for Computational
Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 3255–
3265. https://doi.org/10.18653/v1/2020.findings-emnlp.292
[14] Mehmet Arif Demirtacs, Berke Oral, Mehmet Yasin Akpinar, and Onur Deniz.
2022. Semantic Parsing of Interpage Relations. ArXiv abs/2205.13530 (2022).
[15] W. N. Francis and H. Kucera. 1979. Brown Corpus Manual. Technical Report.
Department of Linguistics, Brown University, Providence, Rhode Island, US.
http://icame.uib.no/brown/bcm.html
[16] Lutz Franke and G. Dierkes. 1999. A non-linear fatigue damage rule with an
exponent based on a crack growth boundary condition. International Journal of
Fatigue 21 (1999), 761–767.
[17] Ignazio Gallo, Lucia Noce, Alessandro Zamberletti, and Alessandro Calefati. 2016.
Deep Neural Networks for Page Stream Segmentation and Classification. 2016
International Conference on Digital Image Computing: Techniques and Applications
(DICTA) (2016), 1–7.
[18] Albert Gordo, Marçal Rusiñol, Dimosthenis Karatzas, and Andrew D. Bagdanov.
2013. Document Classification and Page Stream Segmentation for Digital Mailroom Applications. 2013 12th International Conference on Document Analysis and
Recognition (2013), 621–625.
[19] Pepijn Groenen and Maarten Marx. 2022. Multi-source clustering evaluation
of deep learning page-stream segmentation methods on 2 governmental data.
(2022).
[20] Abhijit Guha, Abdulrahman Alahmadi, D. Samanta, Mohammad Zubair Khan,
and Ahmed H. Alahmadi. 2022. A Multi-Modal Approach to Digital Document
Stream Segmentation for Title Insurance Domain. IEEE Access PP (2022), 1–1.
[21] Ahmed Hamdi, Joris Voerman, Mickaël Coustaty, Aurélie Joseph, Vincent Poulain
d’Andecy, and Jean-Marc Ogier. 2017. Machine Learning vs Deterministic Rule-Based System for Document Stream Segmentation. 2017 14th IAPR International
Conference on Document Analysis and Recognition (ICDAR) 05 (2017), 77–82.
[22] Adam W. Harley, Alex Ufkes, and Konstantinos G. Derpanis. 2015. Evaluation of
deep convolutional nets for document image classification and retrieval. 2015
13th International Conference on Document Analysis and Recognition (ICDAR)
(2015), 991–995.
[23] Kaylea Haynes, Paul Fearnhead, and Idris Arthur Eckley. 2016. A computationally efficient nonparametric approach for changepoint detection. Statistics and
Computing 27 (2016), 1293 – 1305.
[24] Toby Hocking, Gudrun Schleiermacher, Isabelle Janoueix-Lerosey, Valentina
Boeva, Julie Cappo, Olivier Delattre, Francis R. Bach, and Jean-Philippe Vert.
2013. Learning smoothing models of copy number profiles using breakpoint
annotations. BMC Bioinformatics 14 (2013), 164 – 164.
[25] Brad Jackson, Jeffrey D. Scargle, David Barnes, Sundararajan Arabhi, Alina Alt,
Peter Gioumousis, Elyus Gwin, Paungkaew Sangtrakulcharoen, Linda Tan, and
Tun Tao Tsai. 2005. An algorithm for optimal partitioning of data on an interval.
IEEE Signal Processing Letters 12 (2005), 105–108.
[26] Romain Karpinski and Abdel Belaïd. 2016. Combination of Structural and Factual
Descriptors for Document Stream Segmentation. 2016 12th IAPR Workshop on
Document Analysis Systems (DAS) (2016), 221–226.
[27] Rahim Khan, Yurong Qian, and Sajid Naeem. 2019. Extractive based Text Summarization Using KMeans and TF-IDF. International Journal of Information
Engineering and Electronic Business (2019).
[28] Rebecca Killick, Paul Fearnhead, and Idris Arthur Eckley. 2012. Optimal detection
of changepoints with a linear computational cost. J. Amer. Statist. Assoc. 107
(2012), 1590 – 1598.
[29] J. C. Leger. 1999. Menger curvature and rectifiability. Annals of Mathematics 149
(1999), 831–869.
[30] Robert Maidstone, Toby Hocking, Guillem Rigaill, and Paul Fearnhead. 2014. On
optimal multiple changepoint algorithms for large data. Statistics and Computing
27 (2014), 519 – 533.
[31] Thomas Meilender and Abdel Belaïd. 2009. Segmentation of continuous document
flow by a modified backward-forward algorithm. In Electronic Imaging.
[32] Q. Niu, J. Liu, Masashi Kato, T. Aoyama, and Momoko Nagai-Tanima. 2022. Fear
of Infection and Sufficient Vaccine Reservation Information Might Drive Rapid
Coronavirus Disease 2019 Vaccination in Japan: Evidence from Twitter Analysis.
In medRxiv.
[33] Lucia Noce, Ignazio Gallo, Alessandro Zamberletti, and Alessandro Calefati. 2016.
Embedded Textual Content for Document Image Classification with Convolutional Neural Networks. Proceedings of the 2016 ACM Symposium on Document
Engineering (2016).
[34] N. Pedrazzini and Barbara McGillivray. 2022. Machines in the media: semantic
change in the lexicon of mechanization in 19th-century British newspapers. In
NLP4DH.
[35] Tony Ridler. 1978. Picture thresholding using an iterative selection method. IEEE
Transactions on Systems, Man, and Cybernetics 8 (1978), 630–632.
[36] Marçal Rusiñol, Volkmar Frinken, Dimosthenis Karatzas, Andrew D. Bagdanov,
and Josep Lladós. 2014. Multimodal page classification in administrative document image streams. International Journal on Document Analysis and Recognition
(IJDAR) 17 (2014), 331–341.
[37] Stan Salvador and Philip K. Chan. 2004. Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. 16th IEEE International
Conference on Tools with Artificial Intelligence (2004), 576–584.
[38] Ville A. Satopaa, Jeannie R. Albrecht, David E. Irwin, and Barath Raghavan. 2011.
Finding a "Kneedle" in a Haystack: Detecting Knee Points in System Behavior.
2011 31st International Conference on Distributed Computing Systems Workshops
(2011), 166–171.
[39] Alastair Scott and Martin Knott. 1974. A Cluster Analysis Method for Grouping
Means in the Analysis of Variance. Biometrics 30 (1974), 507.
[40] Ashish K. Sen and Muni S. Srivastava. 1975. On Tests for Detecting Change in
Mean. Annals of Statistics 3 (1975), 98–108.
[41] Xavier Tolsa. 2000. Principal values for the Cauchy integral and rectifiability.
[42] Charles Truong, Laurent Oudre, and Nicolas Vayatis. 2019. Selective review of
offline change point detection methods.
[43] Ruben van Heusden, Jaap Kamps, and Maarten Marx. 2022. WooIR: A New Open Page
Stream Segmentation Dataset. Proceedings of the 2022 ACM SIGIR International
Conference on Theory of Information Retrieval (2022).
[44] Gregor Wiedemann and Gerhard Heyer. 2018. Page Stream Segmentation with
Convolutional Neural Nets Combining Textual and Visual Features. ArXiv
abs/1710.03006 (2018).
[45] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue,
Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe
Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu,
Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest,
and Alexander M. Rush. 2019. HuggingFace’s Transformers: State-of-the-art
Natural Language Processing. https://doi.org/10.48550/ARXIV.1910.03771
[46] Yi-Ching Yao. 1984. Estimation of a Noisy Discrete-Time Step Function: Bayes
and Empirical Bayes Approaches. Annals of Statistics 12 (1984), 1434–1447.
[47] Qian Ye, Kaan Ozbay, Fan Zuo, and Xiaohong Chen. 2021. Impact of Social Media
on Travel Behaviors during the COVID-19 Pandemic: Evidence from New York
City. Transportation Research Record: Journal of the Transportation Research Board
(2021).