An experimental evaluation of cluster k estimation methods on deep learned vector embeddings for page stream segmentation

Submitted in partial fulfillment for the degree of Master of Science
Erick Alfaro (11814918)
Master Information Studies, Data Science track
Faculty of Science, University of Amsterdam
Submitted on 23.12.2022
UvA Supervisor: Dr. Maarten Marx, UvA, maartenmarx@uva.nl

ABSTRACT
In the task of Page Stream Segmentation (PSS), a stream refers to a consecutive set of pages that may split into a collection of k documents, where k may be unknown. We approach Page Stream Segmentation as a clustering problem in which we attempt to cluster similar pages into a set of k documents that reflects the true number of clusters/segments in a stream of pages. As in other clustering tasks, a common challenge for the practitioner is identifying the exact number of clusters k present in a given data set. We therefore build on prior work in this clustering approach and evaluate the effectiveness of estimation methods for approximating the true number of documents k in our PSS problem, specifically on the WOO data set. In our approach we use deep learning based representations of each page in a stream, testing both pretrained and finetuned BERT sentence embeddings as well as VGG16 pretrained on ImageNet. We attempt to approximate the number of documents k in a given stream using various knee based methods as well as the Pruned Exact Linear Time (PELT) change point detection method. We evaluate our results on both synthetically generated data and our domain specific WOO corpus. We find that BERT embeddings in combination with PELT on the synthetically generated data yield an RMSE within 20% of the actual k. Our results for the vector representations of the WOO corpus, however, show a significantly larger error magnitude, with an RMSE of 179.

KEYWORDS
Page Stream Segmentation, Agglomerative Clustering, PELT, L-Method

1 GITHUB REPOSITORY
https://github.com/erickalfaro/ealfaro_thesis

2 INTRODUCTION
Page Stream Segmentation (PSS) is the task of splitting a stream of pages into sets of consecutive pages that each belong to a single document. PSS can play an important role in digitizing physical documents via a batch scanning process. Batch scanning documents with disregard for exact cutoffs produces a stream of pages which may represent multiple sets of documents with no clear separation. In this evaluation we assess the performance of cluster k estimation methods on the recently released corpus of documents provided by the Dutch government under the Open Government Act, or Wet Open Overheid (WOO) [43]. Wet Open Overheid gives citizens and journalists the ability to request documents related to governmental decision making. For such requests the information is received in the form of batch-scanned PDFs which may contain 1000+ pages. These can represent a wide range of concatenated document types such as multi-page governmental reports, electronic communications, and partially redacted meeting minutes. Segmenting the pages in a stream provided by the Dutch government thus serves the purpose of enabling citizens and journalists to effectively audit the Dutch government, as a safeguard for a functioning democracy. The problem of Page Stream Segmentation can be approached both as a binary classification problem and as a clustering problem.
In the classification approach a binary classifier attempts to predict whether each page is the first page of a document [1, 0]. In this evaluation we instead use the clustering approach, specifically constrained agglomerative clustering, wherein cluster distances are computed strictly between neighboring pages. This clustering approach builds on prior work by Thompson and Nikolov [9], wherein document features such as word position, character width, and spacing, as well as text similarity and metadata, are clustered using hierarchical clustering. In our approach we instead use pre-trained and fine-tuned CNN and BERT transfer learning based image and text vector representations. The CNN based approach builds on work introduced by Gallo et al. [17] and further builds on the multi-modal approach by Wiedemann et al. [44], wherein image based vectors as well as OCR'ed text based features are used in a binary classification setting. Our work extends the work by Busch [8], Groenen [19], and van Heusden [43]: we attempt to improve PSS clustering performance by approximating the correct cluster count k using k estimation methods. In our evaluation we test four knee approximation methods and one change point detection method. First we implement and test the L-Method [37] and the Refined L-Method, computationally efficient algorithms for finding the "knee" of a graph, where the knee can be defined as the point of maximum curvature. The L-Method fits two lines on a set of cluster distances/similarities and attempts to minimize the weighted RMSE of the two fits. We additionally test the open source Kneedle algorithm as well as a more recent knee approximation method, dynamic first derivative thresholding (DFDT). As our second approach we test the Pruned Exact Linear Time (PELT) [28] algorithm, a commonly used change point detection algorithm with an objective function that minimizes a penalized sum of costs. PELT is an exact approach whose computational cost can be considered O(n).
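The knee based estimators evaluated in this work operate on the merge distances produced by the neighbor-constrained clustering step described above. Below is a minimal sketch of that step, assuming scikit-learn; the embeddings and the distance cutoff are placeholder values, not the thesis configuration (see the repository above for the actual code).

```python
# Minimal sketch of neighbor-constrained agglomerative clustering.
# Assumption: page embeddings are stacked row-wise in `X`; the thesis uses
# BERT/VGG16 vectors, here we substitute random data for illustration.
import numpy as np
from scipy.sparse import diags
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 384))  # 12 pages, 384-dim embeddings (placeholder)

# Connectivity matrix that only allows merging page i with page i+1,
# so every cluster stays a run of consecutive pages.
n_pages = X.shape[0]
connectivity = diags([1, 1], offsets=[-1, 1], shape=(n_pages, n_pages))

model = AgglomerativeClustering(
    n_clusters=None,            # with a known k we would set n_clusters=k
    distance_threshold=30.0,    # placeholder cutoff; k estimation replaces this
    linkage="ward",
    connectivity=connectivity,
)
labels = model.fit_predict(X)
print("segment label per page:", labels)
```

Because the connectivity graph is a chain, the resulting clusters are contiguous page runs, which is exactly the document structure PSS requires.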
Research Aim. The aim of this research is the evaluation of cluster k estimation methods on linearly constrained data using agglomerative clustering. We use vector embedding representations of the WOO data obtained from fine-tuned and pre-trained deep learning models. We aim to answer the following questions:
• RQ 1: Using RMSE to evaluate the k estimation methods, how effective are our cluster k estimation methods on synthetic data sets?
• RQ 2: Using RMSE to evaluate the k estimation methods, how effective are our cluster k estimation methods on the Wet Open Overheid (WOO) data set? We evaluate transformer based pretrained and fine-tuned embeddings as well as CNN model embeddings.
• RQ 3: Which k estimation method shows the greatest improvement in PSS clustering performance?

3 RELATED WORK
We first discuss prior work related strictly to Page Stream Segmentation. We then discuss work on knee/elbow estimation for the purpose of cluster k approximation, and on change point detection as it relates to our Page Stream Segmentation problem.

3.1 Page Stream Segmentation
Prior work on Page Stream Segmentation broadly falls into two categories: rule based and machine learning based. Rule based systems generally rely on manually engineered features from the layout and text descriptors of a given page to segment pages into sequential documents. Examples of rule based features include recurring phrases [9] and named entities, page number sequences located on specific parts of a page [9] [31], text box positions on a page [26], and formatting information [26] [9]. Machine learning based approaches tend not to rely on manually engineered features and instead create word or image based embeddings which are later segmented, clustered, or classified into documents. To our knowledge Thompson and Nikolov [9] were the first to use a combination of hierarchical clustering and classification, with handcrafted structure/layout features as well as text similarity features as rule based representations of pages, to separate pages into documents. Later, in 2009, Meilender and Belaïd [31] presented a multi-gram based Bakis model. Gordo et al. [18] propose a multipage supervised approach using textual descriptors for classifying document pages in the task of document separation. Daher et al. [11] [12] propose a two model approach that classifies each page to a document and then validates the classifications by predicting confidence scores. Rusiñol et al. [36] propose a multi-modal approach to the Document Image Classification (DIC) problem using TF-IDF and LSA textual features and pixel density descriptors from image data. Harley et al. [22] introduced a then state-of-the-art DIC approach using CNN image vector features. Agin et al. [2] proposed an ML-only approach wherein they test the performance of three binary classifiers (SVM, Random Forest, MLP) on a combination of Bag of Visual Words page representations and font information. Noce et al. [33] also leverage a multi-modal approach for the DIC problem using class-specific key-terms along with an image classifier. Gallo et al. [17] leverage deep neural networks, tackling the PSS problem as the result of a DIC process that detects a change of document class between consecutive pages; this process only works under the assumption that document classes alternate in the sequential streams. Hamdi et al. [21] create embedding features using doc2vec and compare these embeddings against page neighbors with cosine similarity to classify the start of a new page; their approach showed difficulty in identifying single page documents. The current state-of-the-art approaches [7] [44] [20] [14] for Page Stream Segmentation use multi-modal deep learning based embeddings and treat PSS as a binary classification problem. Wiedemann et al. [44] used the VGG16 architecture for image embeddings and pre-trained FastText for word embeddings. Later, Demirtaş et al. [14] improved performance by adding interpage context for 33 semantic classes.

3.2 Knee/Elbow Estimation
Several approaches exist for detecting knee/elbow points in discrete data. Léger [29] defined curvature as the curvature circumscribed by a set of three discrete points of a function. This method relies on only three points to estimate knee/elbow points and as such may produce poor accuracy. Satopaa et al. [38] propose the Kneedle algorithm to find the point of maximum curvature of a line by re-scaling a set of discrete points using the first (head) and last (tail) points of a curve. Salvador and Chan [37] propose the L-method, which identifies the point on an error curve that minimizes the weighted RMSE of two lines fit to the curve. Antunes et al. [4] introduce dynamic first derivative thresholding (DFDT), which uses a threshold algorithm and the first derivative of a curve to find the sharpest angle between high and low values. Antunes et al. [3] later propose the AL-method and S-method. The AL-method, similar to the L-method, fits two lines on an error curve and calculates a scaled score in the range [0, 1] combining the RMSE and an angle score based on the sharpness of the identified angle. The S-method, similar to the L-method, fits three lines on an error curve. Prior work on knee/elbow estimation methods can be found in a wide variety of domains. Franke and Dierkes [16] estimate fatigue damage of materials using a knee based approach. Satopaa et al. [38] test the effectiveness of Kneedle on network latency and congestion control. Khan et al. [27] use Kneedle to identify the k for clustering sentence topics in extractive text summarization using K-Means and TF-IDF. Cuevas [10] evaluates the effectiveness of Kneedle for COVID outbreak detection in facilities. Salvador and Chan [37] evaluate the L-method specifically for estimating the cluster k in hierarchical clustering using similarity/distance metrics where k is unknown.
3.3 Change Point Detection
We refer to Truong et al. [42] for a comprehensive survey of algorithms for offline change point detection. The earliest works in change point detection can be found in the domain of industrial quality control, locating a shift in the mean of independent and identically distributed (iid) Gaussian variables. The task has since been actively researched in a variety of other domains such as speech processing, financial analysis, bio-informatics, climatology, and other areas of physical phenomena or human activity monitored with sensors. Change point detection methods divide into two categories: online and offline. Online methods, often referred to as event/anomaly detection, are usually applied in a real-time setting, whereas offline methods, also known as signal segmentation methods, segment the signal after it has been collected. Offline change point detection methods in turn fall into two categories: (1) those where the number of change points K is known and (2) those where K is unknown. For our PSS task we seek an offline change point detection algorithm that is applicable when the number of change points K is unknown and that is computationally efficient. The earliest and most established search method in the change point literature is Binary Segmentation (BS) [40] [39]. The BS method iteratively segments subsets of a sequence and can extend a single change point to multiple change points. The BS method is computationally efficient at O(n log n); however, since all subsequent change points depend wholly on the first change point, the BS method is not considered an exact method. Auger and Lawrence [6] propose the Segment Neighbourhood (SN) exact search method, which searches the entire segmentation space using dynamic programming. SN computes a cost for all possible segmentations between 0 and Q, the maximum number of change points specified. Due to its exhaustive search the computational cost of SN is significant at O(Qn^2), and as the observed data increases linearly the computation cost approaches O(n^3). Yao [46] and Jackson et al. [25] propose the Optimal Partitioning (OP) exact method, which iteratively identifies a change point and computes a cost for the data prior to the last change point plus the cost for the segment from the last change point to the end of the data. Compared to the SN method, the OP method improves computational efficiency, with computation time being O(n^2). Killick et al.
[28] introduce a modification to the OP method, denoted PELT, which under certain conditions results in linear computational cost O(n) whilst retaining the exact minimisation of the OP search method. PELT achieves this via a combination of optimal partitioning and pruning of candidate points which can never be minima of the minimisation performed at each search iteration. PELT has been applied to a wide range of data sets such as DNA sequences [24] [30], physiological signals [23], oceanographic data [28], and textual data based on term frequencies [32] [47] and word2vec embeddings [34].

4 METHODOLOGY
The methodology is organized as follows. We first introduce our WOO data set as well as our baseline synthetic data set. We then review our knee/elbow k estimation methods, followed by a review of our change point detection approach, PELT. We close this section with a brief overview of the metrics used to evaluate our results.

4.1 Data
4.1.1 Wet Open Overheid. In this evaluation we use a recently released stream of data provided by the Dutch government under the Open Government Act, or Wet Open Overheid (WOO). This data originates from sets of PDF documents containing concatenated emails, reports, WhatsApp conversations, and meeting minutes, which have been annotated by members of Follow the Money [1] and students of the University of Amsterdam. We evaluate a total of 108 streams having at least two documents, excluding 2 from our total of 110 streams. The average page count per stream is 827 pages (Fig. 1, top left) and the average document count is 224 documents per stream (Fig. 1, top right). The total number of pages in our data set is 89,491 and the total document count is 24,181. We observe that 102 streams have a ratio of less than 10 pages per document (Fig. 1, bottom left), and in aggregate we observe an average of 3 pages per document after removing two outliers from our data set (Fig. 1, bottom right).

[Figure 1: Page and Document distributions of WOO data]

4.1.2 Synthetic Text Data. The authors of the Ruptures [42] change point detection Python library introduce a synthetically generated data set of 99 sentences which split into 10 distinct groups of nine to eleven sentences. The sentences are linearly ordered within the group they belong to and show clear semantic variation from group to group. Table 1 shows five sentences in the order they are found in the data set. Sentences 10 and 11 belong to one group, and sentences 12, 13, and 14 belong to the next group. The semantic difference is clear, and our segmentation approach should be able to detect the semantic separation and differentiate between the sentence groups.

Table 1: Brown Corpus Sentence Examples
Line number | Sentence
10 | That could be easily done, but there is little reason in it.
11 | It would come down to saying that Fromm paints with a broad brush, and that, after all, is not a conclusion one must work toward but an impression he has from the outset.
12 | the effect of the digitalis glycosides is inhibited by a high concentration of potassium in the incubation medium and is enhanced by the absence of potassium (Wolff, 1960).
13 | B. Organification of iodine The precise mechanism for organification of iodine in the thyroid is not as yet completely understood.
14 | However, the formation of organically bound iodine, mainly mono-iodotyrosine, can be accomplished in cell-free systems.

To generate baseline results for our k estimation methods, we take inspiration from Truong et al. [42] and refer to the Brown corpus [15] to generate more synthetic samples. We generate synthetic data from the Brown corpus using the paras function available in the NLTK package, which extracts paragraphs from the corpus. We then filter all paragraphs to only those containing at a minimum 9-11 sentences, with each sentence being at least 6 words long, yielding 252 unique paragraphs across 15 varying categories of topics such as 'adventure', 'government', 'humor', etc. We use these 252 unique paragraphs to iteratively construct 1000 unique combinations of paragraphs such that each paragraph is followed by a paragraph from a different topic category. This synthetic data set can be considered a proxy for our WOO data set if we consider each sentence as a proxy for a page and each paragraph as a proxy for a document in the context of page stream segmentation.
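As an illustration of this generation procedure, below is a minimal sketch using NLTK's paras function, under the assumption that the filters mirror the description above (at least nine sentences per paragraph, at least six words per sentence); the shuffling and the ten-documents-per-sample choice are illustrative rather than the thesis code.

```python
# Sketch of the Brown-corpus synthetic stream generation described above.
import random
import nltk
from nltk.corpus import brown

nltk.download("brown", quiet=True)

# Collect (category, paragraph) pairs that pass the length filters.
candidates = []
for category in brown.categories():
    for para in brown.paras(categories=category):  # para = list of sentences
        if len(para) >= 9 and all(len(sent) >= 6 for sent in para):
            candidates.append((category, para))

# Build one synthetic "stream": consecutive paragraphs from distinct topic
# categories, so every paragraph boundary is a ground-truth segment boundary.
random.seed(0)
stream, last_category = [], None
for category, para in random.sample(candidates, len(candidates)):
    if category != last_category:
        stream.append(para)       # one paragraph stands in for one document
        last_category = category
    if len(stream) == 10:         # ten documents per synthetic sample
        break

pages = [" ".join(sent) for para in stream for sent in para]
print(f"{len(stream)} documents, {len(pages)} pages (sentences)")
```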
4.1.3 Embeddings. We evaluate a set of pre-trained and fine-tuned embeddings for both image and text modalities on the WOO data. The image vector representations are generated from the last layer of a pre-trained and fine-tuned VGG16 CNN model, following the prior works of Noce et al. [33], Gallo et al. [17], and Wiedemann et al. [44]. The pre-trained text vectors are generated using the Dutch RobBERT model [13], using SentenceTransformers to generate a vector per page. The fine-tuned text embeddings are generated from a BERT model trained for sequence classification using Hugging Face [45]. Lastly, we use the pre-trained all-MiniLM-L6-v2 BERT model from Hugging Face to generate text embeddings for our synthetic data.
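For the synthetic data, the text embedding step can be sketched as follows with the sentence-transformers package; the example sentences are adapted from Table 1, and the 384 dimensional output is a property of the all-MiniLM-L6-v2 model named above.

```python
# Sketch of the per-page text embedding step for the synthetic data.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# `pages` stands in for the sentences (proxy pages) of one synthetic stream.
pages = [
    "That could be easily done, but there is little reason in it.",
    "The effect of the digitalis glycosides is inhibited by a high "
    "concentration of potassium in the incubation medium.",
]
embeddings = model.encode(pages)   # ndarray of shape (n_pages, 384)
print(np.asarray(embeddings).shape)
```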
4.2 K Segment Estimation Methods
4.2.1 L-Method. Though knee/elbow estimation of a curve is considered a heuristic process, a definition of curvature for continuous functions has previously been given [38] [37] as Equation 1. For continuous functions the knee/elbow point is the point of maximum curvature, which can be found by measuring how much the function differs from a straight line. The closed form K_f(x) in Equation 1 defines this curvature of f(x) at any given point as a function of its first and second derivatives:

K_f(x) = f''(x) / (1 + f'(x)^2)^(3/2)    (1)

The maximum of the first derivative, by contrast, is known as the inflection point of the curve; it is not representative of the knee and only captures where the rate of increase reaches its maximum, whereas the maximum curvature definition precisely matches the concept of a knee. However, as shown in [5], fitting a continuous function to discrete curvatures produces poor results. Thus for our PSS task we instead seek to detect knee/elbow points with methods that are more appropriate for discrete data.

The L-method, introduced by Salvador and Chan [37], is an efficient knee finding algorithm that automatically determines the number of clusters from the discrete distances returned by hierarchical clustering (the same distances used to create a dendrogram, Fig. 2). In hierarchical clustering each cluster is associated with a greedily computed distance metric that specifies the distance between itself and the nearest cluster after a single iteration.

[Figure 2: Distance metrics relative to a dendrogram]

The L-method uses a two dimensional evaluation graph (Fig. 3) where the x-axis plots the number of clusters and the y-axis plots an evaluation metric such as cluster distance. The method iteratively fits two lines, starting at each end of the evaluation graph, to all points on the graph, with the goal of minimizing the weighted RMSE of the two fits. The two lines L_c and R_c are fit at each candidate cutoff c, where c is the iterative cluster cutoff value and b is the total number of points on the graph:

RMSE_c = ((c - 1) / (b - 1)) * RMSE(L_c) + ((b - c) / (b - 1)) * RMSE(R_c)    (2)

Equation 2 defines the weighted RMSE, where the partition between L_c and R_c lies at x = c. The L-method seeks the value c* = argmin_c RMSE_c.

[Figure 3: All possible best-fits with respective RMSE for synthetic example data]

4.2.2 Refined L-Method. As discussed in [37], the L-method evaluation graph plots every distance data point in order to find the knee of the series of distances. When the L-method is fit to a long array of distances, the evaluation graph can contain a long tail of irrelevant data points which skews the knee finding process. Salvador and Chan [37] therefore additionally propose the Refined L-method, an iterative pruning process in which a knee is identified using the classic L-method and each subsequent iteration reduces the total count of data points b and recomputes the L-method knee in an attempt to find a more accurate cutoff. The iterative process stops at the first iteration in which a reduction of b does not lead to a change in the knee.
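As an illustration, below is a minimal sketch of the L-method's weighted two-line fit from Equation 2, assuming numpy; the merge-distance series is placeholder data with a knee placed near x = 5, not output from the thesis pipeline.

```python
# Minimal sketch of the L-method (Equation 2).
import numpy as np

def line_rmse(x, y):
    """RMSE of the best least-squares line through (x, y)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return np.sqrt(np.mean((y - (slope * x + intercept)) ** 2))

def l_method(y):
    """Return the cutoff c minimizing the weighted RMSE of two line fits."""
    b = len(y)
    x = np.arange(1, b + 1)
    best_c, best_score = None, np.inf
    for c in range(2, b - 1):                 # each side needs >= 2 points
        score = ((c - 1) / (b - 1)) * line_rmse(x[:c], y[:c]) \
              + ((b - c) / (b - 1)) * line_rmse(x[c:], y[c:])
        if score < best_score:
            best_c, best_score = c, score
    return best_c

# Placeholder curve: distances fall steeply, then flatten out.
y = np.array([9.0, 7.0, 5.0, 3.0, 1.2, 1.0, 0.9, 0.85, 0.8, 0.78])
print("estimated knee:", l_method(y))
```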
4.2.3 Kneedle. Kneedle [38] approximates the point of maximum curvature (the knee) as the set of points in a curve that are local maxima when the curve is rotated θ degrees clockwise about (x_min, y_min) through the line formed by the points (x_min, y_min) and (x_max, y_max). The Kneedle algorithm takes into account the head and tail points of an error curve and approximates the knee by identifying the points at which the curve becomes inherently more "flat". The first step of Kneedle is to smooth the shape of the input data with a smoothing spline. The ranges of the x and y values are then normalized to the unit square. Next, perpendicular distances are calculated between each discrete data point and the diagonal line between the first and last data points. These distances are then used to find a local maximum of the new set of normalized points, which is the knee point derived by Kneedle. It is additionally possible to adjust the sensitivity S of the Kneedle algorithm, which defines how aggressive or conservative the knee detection should be. Lower values of S lead to more accurate knee points; in online scenarios a higher S may be preferred.

4.2.4 DFDT. Antunes et al. [4] propose DFDT for the task of cluster k estimation. DFDT is a computationally efficient knee estimation method that aims to improve knee estimation for discrete curves with long tails. DFDT is a hybrid between the Menger curvature [41] and the L-method. DFDT estimates the first derivative of a given curve, which represents the slope of the tangent line. Using this tangent line, DFDT identifies the point where the function has a sharp angle. Instead of fitting two straight lines like the L-method, DFDT uses the IsoData [35] threshold algorithm to find the ideal split between high and low values of the first derivative. The advantage of this approach is that the threshold algorithm splits the data based on the actual values of the discrete curve and not on the quantity of values, mitigating the effect of long tails.

4.2.5 PELT. Change point detection (CPD) detects shifts in time series trends. CPD can be used to detect anomalous sequences/states both in real time (online) and retroactively (offline). Pruned Exact Linear Time (PELT) builds on Optimal Partitioning (OP) with an added step that prunes candidate change points. OP aims to minimize the cost function shown in Equation 3 using dynamic programming. The algorithm works by optimising at each time step based on the optimal solutions at all previous time steps. At each step t, the algorithm considers the time steps after the last change point up to the current time step as candidates for a new change point. Starting with F(0) = 0 and given a value of the penalty β, F(t) can be calculated for all values of t, with a resulting optimal cost:

F(t) = min_{0 <= s < t} [ F(s) + C(y_{(s+1):t}) + β ]    (3)

Equation 3 is iterated for each step t = 1, ..., n, which results in O(n^2) time. While ensuring the optimality of OP and improving its computational efficiency to O(n), Killick et al. [28] propose a pruning step as demonstrated in Equation 4. If Equation 4 holds at s' <= t - 1, then s' can be pruned for all future time steps:

F(s') + C(y_{(s'+1):t}) >= F(t)    (4)
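To show how these estimators are driven in practice, below is a minimal sketch using the open source kneed and ruptures packages; the distance curve, the synthetic embeddings, and the penalty value (pen=3) are illustrative assumptions rather than settings from this evaluation.

```python
# Sketch: estimating k with Kneedle (kneed) and with PELT (ruptures).
import numpy as np
import ruptures as rpt
from kneed import KneeLocator

# Kneedle on a decreasing merge-distance curve: the knee index suggests k.
x = np.arange(1, 11)
y = np.array([9.0, 7.0, 5.0, 3.0, 1.2, 1.0, 0.9, 0.85, 0.8, 0.78])
knee = KneeLocator(x, y, S=1.0, curve="convex", direction="decreasing")
print("Kneedle k estimate:", knee.knee)

# PELT directly on multivariate page embeddings: every detected change
# point is a document boundary, so k equals the number of segments.
rng = np.random.default_rng(0)
signal = np.vstack([
    rng.normal(loc=0.0, size=(30, 16)),   # "document" 1
    rng.normal(loc=3.0, size=(30, 16)),   # "document" 2
    rng.normal(loc=-3.0, size=(30, 16)),  # "document" 3
])
algo = rpt.Pelt(model="rbf").fit(signal)
breakpoints = algo.predict(pen=3)   # segment end indices, last one = n
print("PELT k estimate:", len(breakpoints))
```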
4.3 Metrics
For the evaluation of our cluster k approximation methods we use metrics that measure our ability to label a page as the start of a new document (for our synthetic data we interchange "sentence" for "page" and "paragraph" for "document"). For this purpose we measure the Macro F1 score, the Weighted Block F1 score, and the RMSE of the estimated k with respect to the true k.

4.3.1 The Macro F1 score, also referred to as Page F1 by Busch and Marx [8], is defined by the formulas below. True and false positives and negatives are denoted by the standard TP, FP, TN, and FN respectively, each referring explicitly to whether a given page is labeled as a "start page":

Precision = TP / (TP + FP)    (5)
Recall = TP / (TP + FN)    (6)
PageF1 = 2 * Precision * Recall / (Precision + Recall)    (7)
PageF1 = 2 * TP / (2 * TP + FP + FN)    (8)

4.3.2 The Weighted Block F1 score [43], also referred to as Doc F1 by Busch and Marx [8], is defined by the formulas below. A ground truth block is denoted D_t and a predicted block D_p, where a block is a set of connected pages. The intersection of D_t and D_p, the Intersection over Union IoU(D_t, D_p), is calculated using Jaccard similarity. A pair of blocks (D_t, D_p) is considered a True Positive if more than half of the pages overlap, IoU(D_t, D_p) > 0.5. FP is defined as the set of all predicted blocks D_p that are not in a TP pair, and FN as the set of all ground truth blocks D_t that are not in a TP pair. This error metric essentially allows for weighted true positives, denoted WTP:

WTP = Σ { IoU(D_t, D_p) | (D_t, D_p) ∈ TP }    (9)
DocPrecision = WTP / (|TP| + |FP|)    (10)
DocRecall = WTP / (|TP| + |FN|)    (11)
DocF1 = 2 * DocPrecision * DocRecall / (DocPrecision + DocRecall)    (12)
DocF1 = WTP / (|TP| + 0.5 * (|FP| + |FN|))    (13)

4.3.3 RMSE. The RMSE is computed to evaluate the performance of our cluster k approximation methods: we compare the estimated number of clusters to the ground truth number of clusters in our dataset. To calculate the RMSE, we first compute the difference between the ground truth and the estimated number of clusters for each sample in our dataset. The lower the RMSE, the closer the estimated number of clusters is to the ground truth, indicating better performance of the clustering algorithm.
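To make the block level metric concrete, below is a minimal sketch of Equation 13 together with the RMSE computation, under the assumption that blocks are represented as sets of consecutive page indices; the toy segmentation is illustrative only.

```python
# Sketch of the Weighted Block F1 (Equation 13) and the RMSE metric.
import numpy as np

def iou(a, b):
    """Jaccard similarity between two blocks (sets of page indices)."""
    return len(a & b) / len(a | b)

def weighted_block_f1(truth_blocks, pred_blocks):
    truth_blocks = [frozenset(b) for b in truth_blocks]
    pred_blocks = [frozenset(b) for b in pred_blocks]
    # A (truth, prediction) pair is a TP when over half of the pages overlap.
    tp = [(t, p) for t in truth_blocks for p in pred_blocks if iou(t, p) > 0.5]
    wtp = sum(iou(t, p) for t, p in tp)                      # Equation 9
    fp = len(pred_blocks) - len({p for _, p in tp})
    fn = len(truth_blocks) - len({t for t, _ in tp})
    return wtp / (len(tp) + 0.5 * (fp + fn))                 # Equation 13

truth = [range(0, 4), range(4, 9)]   # two ground-truth documents
pred = [range(0, 3), range(3, 9)]    # predicted segmentation of 9 pages
print("Doc F1:", round(weighted_block_f1(truth, pred), 3))

# RMSE between estimated and true k over a set of streams.
k_true, k_est = np.array([10, 8, 12]), np.array([9, 8, 15])
print("RMSE:", np.sqrt(np.mean((k_true - k_est) ** 2)))
```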
5 RESULTS
We first review our k approximation baseline results on our synthetically generated data. We then review our results on the WOO dataset.

5.1 Baseline Results
To set a baseline for our evaluation of the WOO data we use the average document length of our streams as shown in Figure 1. Our approach is simply to take the average document length of k = 3 and evaluate our metrics under the assumption that every third page is the start of a new document. With this approach we obtain the results shown in Table 2.

Table 2: Baseline results for the naive mean on all streams
Page F1 | Doc F1 | RMSE
0.04 | 0.22 | 174

As evident from the naive mean approach, the Page F1 is very poor, which suggests that this approach rarely captures a true "document start page". The Doc F1 is also poor but slightly higher than the Page F1, suggesting that the naive mean approach is able to overlap with the ground truth blocks in a small number of cases. Lastly, the high RMSE for the naive mean indicates that on average the estimated document count k of a stream is off by 174 with this approach.

5.2 Synthetic Data Results
Next we evaluate our k approximation methods on the synthetic data. We perform this evaluation in the same manner for both the synthetically generated data and the WOO data. We create vector representations of each segment within our corpus and attempt to approximate a k which matches the ground truth k as closely as possible. For the knee based approximation methods we first calculate the cosine distances obtained by clustering the vector embeddings, then attempt to find a knee point on the series of distances, with the hypothesis that the knee point will most closely match the ground truth k. For the PELT change point detection approach we can pass the multivariate embeddings directly into the PELT algorithm; PELT will then provide an exact k as well as the specific change points in the given data. In Table 3 we show the performance of our k approximation methods on the synthetic data, using both simple count vector embeddings and BERT based embeddings. For the baseline results we used the four knee based k approximation methods as well as PELT. We focus on the BERT embeddings as they are the topic of the WOO evaluation as well. These baselines show the relevance of our k approximation methods for the task of PSS and set the scene for our research questions related to the PSS evaluation of the CNN and BERT embeddings on the WOO data. Referring to our research questions, we find that the PELT change point segmentation approach on our synthetic data using BERT embeddings is by far the best performing k approximation method, with an RMSE of 2.23. The remaining k approximation methods yield significantly worse RMSE, with the next best approximation method being PELT using CountVector embeddings.

Table 3: Performance of k approximation methods on synthetic data (n = 1000)
Embedding | Method | Page F1 | Doc F1 | RMSE
BERT | PELT | 0.608 | 0.743 | 2.23
CountVector | PELT | 0.117 | 0.392 | 8.19
BERT | Refined L-Method | 0.0633 | 0.123 | 54.1
BERT | L-Method | 0.056 | 0.116 | 54.2
BERT | DFDT | 0.055 | 0.107 | 49.9
BERT | Kneedle | 0.054 | 0.091 | 8.36

5.3 WOO Results
In Table 4 we show the results of our k approximation methods on the WOO data. We abbreviate fine-tuned and pre-trained embeddings as FT and PT respectively. The results are representative of all 110 streams in the WOO dataset for PELT, Kneedle, and DFDT. For the L-Method and Refined L-Method there is a 20 sample minimum, as suggested by the original author [37]. Our results on the WOO data show clear deterioration, with Page F1, Doc F1, and RMSE all far worse than in the experiments on the synthetic data.

Table 4: Performance of k approximation methods on WOO data (n = 100)
Embedding | Method | Page F1 | Doc F1 | RMSE
Image FT | PELT | 0.1829 | 0.1860 | 1037
Image FT | Kneedle | 0.0295 | 0.0555 | 343
Image FT | Refined L-Method | 0.0443 | 0.1511 | 165
Image FT | L-Method | 0.0547 | 0.1947 | 155
Image FT | DFDT | 0.0350 | 0.0828 | 374
Image PT | PELT | 0.1806 | 0.1842 | 1037
Image PT | Kneedle | 0.0205 | 0.0666 | 340
Image PT | Refined L-Method | 0.0349 | 0.0788 | 500
Image PT | L-Method | 0.0538 | 0.1317 | 448
Image PT | DFDT | 0.0679 | 0.1075 | 428
Text FT | PELT | 0.0423 | 0.2582 | 148
Text FT | Kneedle | 0.0107 | 0.0463 | 350
Text FT | Refined L-Method | 0.0200 | 0.2364 | 165
Text FT | L-Method | 0.0203 | 0.2363 | 166
Text FT | DFDT | 0.0353 | 0.2505 | 157
Text PT | PELT | 0.0187 | 0.0527 | 331
Text PT | Kneedle | 0.0251 | 0.0523 | 341
Text PT | Refined L-Method | 0.0238 | 0.0518 | 353
Text PT | L-Method | 0.0615 | 0.1318 | 136
Text PT | DFDT | 0.0458 | 0.0811 | 342
6 DISCUSSION
6.1 RQ 1: How effective are our cluster k estimation methods on synthetic data sets?
We tested all of our cluster k estimation methods on synthetically generated data from the Brown corpus [15]. Our evaluation on 1000 synthetically generated samples showed very clearly that the most viable solution is the PELT change point detection approach. Using BERT embeddings in combination with PELT yielded a Page F1 of 0.608, almost 6x the same approach with CountVector embeddings, and a Doc F1 of 0.743, double that of the CountVector embeddings. The BERT+PELT approach also yielded an average RMSE of 2.23, which shows that the change points found by PELT are off by around ±2 on average over our n = 1000 samples. The results show that the BERT vector representations of the text are well adapted to the synthetic data, and they demonstrate the viability of using a change point detection approach for the task of text segmentation.

6.2 RQ 2: How effective are our cluster k estimation methods on the Wet Open Overheid (WOO) data?
We tested all of our cluster k estimation methods on the WOO data and observe significantly different results from the baseline results on the synthetic data. The lowest RMSE we observe is 136, using the L-method on the pretrained BERT embeddings, closely followed by PELT on the fine-tuned BERT embeddings. The PELT approach has the highest Page F1 and Doc F1 on the pretrained image embeddings but an RMSE over 1000. This is due to PELT identifying almost every page as a new document for the image embeddings, for both the pretrained and fine-tuned models.

6.3 RQ 3: Which k estimation method shows the greatest improvement in PSS clustering performance?
After all our evaluation we found the BERT vector embeddings of the synthetic data in combination with PELT change point detection to be a very compelling approach for identifying text based segments. The application of this approach to the synthetic data and the positive results suggest that it is relevant to our task of PSS and, under certain circumstances, may provide a viable approach. However, upon testing the approach on our research specific WOO corpus, we find that the vector embeddings we used, in combination with PELT and the other knee based k approximation methods, performed significantly worse than on the synthetic data. Improvements against the naive mean baseline are nevertheless observed; most notably, the RMSE is lower in 7 different k approximation approaches.
7 CONCLUSION
In this evaluation we attempt to improve clustering performance for the task of PSS on the WOO data by using five cluster k approximation techniques. We additionally identified the Brown corpus [15] as a good data source from which to generate synthetic data for baseline results. We use this corpus to generate n = 1000 samples and evaluate our approximation techniques. In our evaluation we find that our cluster k approximation techniques perform well on the BERT vector representations of the synthetically generated data, with a promising RMSE score of 2.23. Additionally we observe a Page F1 of 0.608 and a Doc F1 of 0.743. Out of the five approximation techniques, the PELT change point detection approach showed the best performance by a large margin, with the next closest approach yielding an RMSE of 8.19. Next we applied the same approximation techniques to the WOO data set introduced by van Heusden et al. [43]. We find that all five approaches degrade significantly in performance compared to the synthetic data set. PELT on both the pretrained and fine-tuned image embeddings shows large RMSE errors; these arise because PELT identifies each page as a change point for 79 streams, which feeds into a very large RMSE. It is worth noting that the synthetically generated data is by design semantically different at each generated segment: each paragraph originates from a different category of topics, which makes the cosine distances from paragraph to paragraph large enough for the PELT change point detection method to pick up on these semantic change points. In contrast, the WOO corpus contains cases where consecutive pages contain the same text, for example where an email conversation on one page is referred to verbatim in two separate documents that occur consecutively. Lastly, for future work we suggest penalty optimization or penalty learning for PELT: since PELT requires a penalty parameter, an area of improvement would be to identify a penalty parameter that works on a training set of the corpus and apply this penalty to the testing set.

REFERENCES
[1] [n. d.]. Follow the money - platform voor onderzoeksjournalistiek. https://www.ftm.nl/
[2] Onur Agin, Cagdas Ulas, Mehmet Ahat, and Can Bekar. 2015. An approach to the segmentation of multi-page document flow using binary classification. In International Conference on Graphic and Image Processing.
[3] Mário Antunes, Henrique Aguiar, and Diogo Nuno Pereira Gomes. 2019. AL and S Methods: Two Extensions for L-Method. 2019 7th International Conference on Future Internet of Things and Cloud (FiCloud) (2019), 371–376.
[4] Mário Antunes, Diogo Gomes, and Rui L. Aguiar. 2018. Knee/elbow estimation based on first derivative threshold. In 2018 IEEE Fourth International Conference on Big Data Computing Service and Applications (BigDataService). IEEE, 237–240.
[5] Mário Antunes, Joana Ribeiro, Diogo Nuno Pereira Gomes, and Rui L. Aguiar. 2018. Knee/Elbow Point Estimation through Thresholding. 2018 IEEE 6th International Conference on Future Internet of Things and Cloud (FiCloud) (2018), 413–419.
[6] I. E. Auger and Charles E. Lawrence. 1989. Algorithms for the optimal identification of segment neighborhoods. Bulletin of Mathematical Biology 51, 1 (1989), 39–54.
[7] Fabricio Ataides Braz, Nilton Correia da Silva, and Jonathan Alis Salgado Lima. 2021. Leveraging effectiveness and efficiency in Page Stream Deep Segmentation. Eng. Appl. Artif. Intell. 105 (2021), 104394.
[8] Lukas Busch and Maarten Marx. 2022. Using deep learned vector representations for page stream segmentation by agglomerative clustering. (2022).
[9] Kevyn Collins-Thompson. 2002. A Clustering-Based Algorithm for Automatic Document Separation.
[10] Erik Cuevas. 2020. An agent-based model to evaluate the COVID-19 transmission risks in facilities. Computers in Biology and Medicine 121 (2020), 103827.
[11] Hani Daher and Abdel Belaïd. 2013. Document flow segmentation for business applications. In Electronic Imaging.
[12] Hani Daher, Mohamed-Rafik Bouguelia, Abdel Belaïd, and Vincent Poulain d'Andecy. 2014. Multipage Administrative Document Stream Segmentation. 2014 22nd International Conference on Pattern Recognition (2014), 966–971.
[13] Pieter Delobelle, Thomas Winters, and Bettina Berendt. 2020. RobBERT: a Dutch RoBERTa-based Language Model. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 3255–3265. https://doi.org/10.18653/v1/2020.findings-emnlp.292
[14] Mehmet Arif Demirtaş, Berke Oral, Mehmet Yasin Akpinar, and Onur Deniz. 2022. Semantic Parsing of Interpage Relations. ArXiv abs/2205.13530 (2022).
[15] W. N. Francis and H. Kucera. 1979. Brown Corpus Manual. Technical Report. Department of Linguistics, Brown University, Providence, Rhode Island, US. http://icame.uib.no/brown/bcm.html
[16] Lutz Franke and G. Dierkes. 1999. A non-linear fatigue damage rule with an exponent based on a crack growth boundary condition. International Journal of Fatigue 21 (1999), 761–767.
[17] Ignazio Gallo, Lucia Noce, Alessandro Zamberletti, and Alessandro Calefati. 2016. Deep Neural Networks for Page Stream Segmentation and Classification. 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA) (2016), 1–7.
[18] Albert Gordo, Marçal Rusiñol, Dimosthenis Karatzas, and Andrew D. Bagdanov. 2013. Document Classification and Page Stream Segmentation for Digital Mailroom Applications. 2013 12th International Conference on Document Analysis and Recognition (2013), 621–625.
[19] Pepijn Groenen and Maarten Marx. 2022. Multi-source clustering evaluation of deep learning page-stream segmentation methods on 2 governmental data. (2022).
[20] Abhijit Guha, Abdulrahman Alahmadi, D. Samanta, Mohammad Zubair Khan, and Ahmed H. Alahmadi. 2022. A Multi-Modal Approach to Digital Document Stream Segmentation for Title Insurance Domain. IEEE Access (2022), 1–1.
[21] Ahmed Hamdi, Joris Voerman, Mickaël Coustaty, Aurélie Joseph, Vincent Poulain d'Andecy, and Jean-Marc Ogier. 2017. Machine Learning vs Deterministic Rule-Based System for Document Stream Segmentation. 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) 05 (2017), 77–82.
[22] Adam W. Harley, Alex Ufkes, and Konstantinos G. Derpanis. 2015. Evaluation of deep convolutional nets for document image classification and retrieval. 2015 13th International Conference on Document Analysis and Recognition (ICDAR) (2015), 991–995.
[23] Kaylea Haynes, Paul Fearnhead, and Idris Arthur Eckley. 2016. A computationally efficient nonparametric approach for changepoint detection. Statistics and Computing 27 (2016), 1293–1305.
[24] Toby Hocking, Gudrun Schleiermacher, Isabelle Janoueix-Lerosey, Valentina Boeva, Julie Cappo, Olivier Delattre, Francis R. Bach, and Jean-Philippe Vert. 2013. Learning smoothing models of copy number profiles using breakpoint annotations. BMC Bioinformatics 14 (2013), 164.
[25] Brad Jackson, Jeffrey D. Scargle, David Barnes, Sundararajan Arabhi, Alina Alt, Peter Gioumousis, Elyus Gwin, Paungkaew Sangtrakulcharoen, Linda Tan, and Tun Tao Tsai. 2005. An algorithm for optimal partitioning of data on an interval. IEEE Signal Processing Letters 12 (2005), 105–108.
[26] Romain Karpinski and Abdel Belaïd. 2016. Combination of Structural and Factual Descriptors for Document Stream Segmentation. 2016 12th IAPR Workshop on Document Analysis Systems (DAS) (2016), 221–226.
[27] Rahim Khan, Yurong Qian, and Sajid Naeem. 2019. Extractive based Text Summarization Using K-Means and TF-IDF. International Journal of Information Engineering and Electronic Business (2019).
[28] Rebecca Killick, Paul Fearnhead, and Idris Arthur Eckley. 2012. Optimal detection of changepoints with a linear computational cost. J. Amer. Statist. Assoc. 107 (2012), 1590–1598.
[29] J. C. Léger. 1999. Menger curvature and rectifiability. Annals of Mathematics 149 (1999), 831–869.
[30] Robert Maidstone, Toby Hocking, Guillem Rigaill, and Paul Fearnhead. 2014. On optimal multiple changepoint algorithms for large data. Statistics and Computing 27 (2014), 519–533.
[31] Thomas Meilender and Abdel Belaïd. 2009. Segmentation of continuous document flow by a modified backward-forward algorithm. In Electronic Imaging.
[32] Q. Niu, J. Liu, Masashi Kato, T. Aoyama, and Momoko Nagai-Tanima. 2022. Fear of Infection and Sufficient Vaccine Reservation Information Might Drive Rapid Coronavirus Disease 2019 Vaccination in Japan: Evidence from Twitter Analysis. In medRxiv.
[33] Lucia Noce, Ignazio Gallo, Alessandro Zamberletti, and Alessandro Calefati. 2016. Embedded Textual Content for Document Image Classification with Convolutional Neural Networks. Proceedings of the 2016 ACM Symposium on Document Engineering (2016).
[34] N. Pedrazzini and Barbara McGillivray. 2022. Machines in the media: semantic change in the lexicon of mechanization in 19th-century British newspapers. In NLP4DH.
[35] Tony Ridler. 1978. Picture thresholding using an iterative selection method. IEEE Transactions on Systems, Man, and Cybernetics 8 (1978), 630–632.
[36] Marçal Rusiñol, Volkmar Frinken, Dimosthenis Karatzas, Andrew D. Bagdanov, and Josep Lladós. 2014. Multimodal page classification in administrative document image streams. International Journal on Document Analysis and Recognition (IJDAR) 17 (2014), 331–341.
[37] Stan Salvador and Philip K. Chan. 2004. Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. 16th IEEE International Conference on Tools with Artificial Intelligence (2004), 576–584.
[38] Ville A. Satopaa, Jeannie R. Albrecht, David E. Irwin, and Barath Raghavan. 2011. Finding a "Kneedle" in a Haystack: Detecting Knee Points in System Behavior. 2011 31st International Conference on Distributed Computing Systems Workshops (2011), 166–171.
[39] Alastair Scott and Martin Knott. 1974. A Cluster Analysis Method for Grouping Means in the Analysis of Variance. Biometrics 30 (1974), 507.
[40] Ashish K. Sen and Muni S. Srivastava. 1975. On Tests for Detecting Change in Mean. Annals of Statistics 3 (1975), 98–108.
[41] Xavier Tolsa. 2000. Principal values for the Cauchy integral and rectifiability.
[42] Charles Truong, Laurent Oudre, and Nicolas Vayatis. 2019. Selective review of offline change point detection methods.
[43] Ruben van Heusden, J. Kamps, and Maarten Marx. 2022. WooIR: A New Open Page Stream Segmentation Dataset. Proceedings of the 2022 ACM SIGIR International Conference on Theory of Information Retrieval (2022).
[44] Gregor Wiedemann and Gerhard Heyer. 2018. Page Stream Segmentation with Convolutional Neural Nets Combining Textual and Visual Features. ArXiv abs/1710.03006 (2018).
[45] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. https://doi.org/10.48550/ARXIV.1910.03771
[46] Yi-Ching Yao. 1984. Estimation of a Noisy Discrete-Time Step Function: Bayes and Empirical Bayes Approaches. Annals of Statistics 12 (1984), 1434–1447.
[47] Qian Ye, Kaan Ozbay, Fan Zuo, and Xiaohong Chen. 2021. Impact of Social Media on Travel Behaviors during the COVID-19 Pandemic: Evidence from New York City. Transportation Research Record: Journal of the Transportation Research Board (2021).