Received March 3, 2020, accepted March 25, 2020, date of publication March 31, 2020, date of current version April 13, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.2984503
A Novel Web Scraping Approach Using the
Additional Information Obtained
From Web Pages
ERDINÇ UZUN
Department of Computer Engineering, Çorlu Faculty of Engineering, Tekirdağ Namık Kemal University, 59860 Tekirdağ, Turkey
e-mail: erdincuzun@nku.edu.tr
ABSTRACT Web scraping is a process of extracting valuable and interesting text information from web
pages. Most of the current studies targeting this task are about automated web data extraction. In the
extraction process, these studies first create a DOM tree and then access the necessary data through this
tree. The construction process of this tree increases the time cost depending on the data structure of the
DOM Tree. In the current web scraping literature, it is observed that time efficiency is ignored. This study
proposes a novel approach, namely UzunExt, which extracts content quickly using the string methods and
additional information without creating a DOM Tree. The string methods consist of the following consecutive
steps: searching for a given pattern, then calculating the number of closing HTML elements for this pattern,
and finally extracting content for the pattern. In the crawling process, our approach collects additional
information, including the starting position for enhancing the searching process, the number of inner tags
for improving the extraction process, and the tag repetition for terminating the extraction process. The string
methods of this novel approach are about 60 times faster than extracting with the DOM-based method.
Moreover, using this additional information improves the extraction time by 2.35 times compared to using only
the string methods. Furthermore, this approach can easily be adapted to other DOM-based studies/parsers in
this task to enhance their time efficiencies.
INDEX TERMS Computational efficiency, algorithm design and analysis, web crawling and scraping,
document object model.
I. INTRODUCTION
In terms of the information that users need, a web page contains unnecessary content, such as menus, advertisements, banners, footers, and sitemaps, and necessary content, such as the title, summary, main text, price, and description. With the increase
of unnecessary content on the web pages, it has become
essential to eliminate unnecessary content and to extract necessary content that can be used by the text processing applications such as search engines, question-answering systems,
recommendation systems, trend detection/monitoring, sentiment analysis, and e-commerce market monitoring. Many
studies [1] in this task focus on determining the extraction of
the data/pattern automatically. However, the time efficiency
of this process is not taken into consideration in these studies. Therefore, this study deals with the acceleration of the
extraction process rather than deciding the extraction pattern.
Accordingly, this study introduces a novel approach, namely UzunExt, which accomplishes the extraction process in a shorter time and uses fewer resources. Moreover, how this approach can be adapted to other studies is explained thoroughly.

(The associate editor coordinating the review of this manuscript and approving it for publication was Dongxiao Yu.)
Web pages are made of HTML elements and data between
these elements. Web scraping is a process of extracting specific data between these elements for providing data for other
applications such as online price change monitoring, product
review analyzing, weather data monitoring, tracking online
presence, gathering articles and so on. The internet is a rich
source of ‘‘big data’’ for these applications. Most of the
studies [2] in ‘‘big data’’ concentrate on the time efficiency
of deep learning models. However, increasing the time efficiency of obtaining data is also an important issue. The UzunExt
approach consists of two main components: crawling web
pages and extracting data from them. It uses the additional
information obtained from web pages during the crawling
process for increasing the extraction time efficiency.
Most of the academic studies described in Section II
for this task are based on creating a DOM tree that is
a tree-based hierarchy for representing a web document.
Moreover, HTML parsers that can be used for web scraping
are also based on the DOM tree. The following list presents
several parsers for four different programming languages.
• .Net: HAP, AngleSharp, Microsoft HtmlDocument
• Java: jsoup, Lagarto, HtmlCleaner
• Python: HTML Parser of The Standard Library, html5lib, html5-parser, lxml, AdvancedHTMLParser
• Node.js: Cheerio, Jsdom, Htmlparser2, Parse5
In the DOM tree construction, all HTML elements in a
web document are taken into consideration. However, constructing a DOM Tree is not convenient for efficient web
scraping. A few studies, which do not require creating a DOM
Tree, have been proposed in the literature. Uzun et al. [3], [4]
and Uçar et al. [5] eliminate the construction of the DOM
tree in their algorithms. Wu [6] divides a news page into
text fragments, prepares a set of content features and layout
features from these fragments and finally uses a Bayesian
algorithm to find appropriate content for online newspapers.
All approaches described above make a generalization for a
web document. These studies are mostly about automatically
extracting the desired data from web pages and generating
an extraction pattern for this data rather than making an
effective extraction. This study is based on developing an efficient web scraper. Three kinds of additional information obtained from
the extraction process of web pages are used to improve the
following extraction processes. Additionally, this approach
can also be adapted to the methods of previous studies as a
module.
Different HTML parsers in programming languages can be
used for creating a DOM tree and extracting content from this
tree. Uzun et al. [7] try three different well-known parsers
in .Net. In our study, we use these parsers for understanding
the time consumption of DOM creation and extraction of content. However, our expectation is that a simple text processing
on the required elements can speed up the extraction process.
The most important issue in text processing is the string
matching step of an extraction algorithm. Buluş et al. [8]1
compared 31 different matching algorithms for searching
HTML elements in web pages. Their experiments show that
the Skip Search algorithm is the best algorithm for this task. Our
study takes the best search results of these algorithms into
account and additionally uses the IndexOf method of the .Net
String class. One piece of the additional information of our
approach is to predict the starting position information of the
searched element on a web page of a website for accelerating
the extraction process. Moreover, Grubbs’ test [9] is utilized
to eliminate the invalid position information obtained from
web pages for a website. The second additional information is the number of inner tags in the desired element for
finishing the extraction process in a shorter time. Finally,
the last additional information is the repetition of the desired
HTML element in a web document. Taking this repetition
into account, a part of the document instead of the whole document can be used for terminating the extraction process.

1 All algorithms are open-source and available via the web page https://github.com/erdincuzun/SMA.NET.
Web pages are collected via a web crawler that discovers
and retrieves all website pages automatically. Some information obtained from the web pages can be used for improving
the crawling process [10], [11]. A focused crawler downloads
web pages that are relevant by analyzing the data on the
pages [12]. There are many studies on this task that use
information in web pages to improve this task [13]–[17]. Web
pages of a website contain similar elements and have a similar
hierarchy. Uzun et al. [4] improve the crawling process by
predicting these elements. In our study, the extraction process is enhanced by analyzing three kinds of additional information
obtained from previous web pages of a website.
The rest of the study is organized as follows. The second
section gives information about the literature on web scraping. The third section introduces the details of the problem.
The fourth section presents web content extraction methods
using DOM Tree. The fifth section covers the novel approach
used for this task. The sixth section is dedicated to the experiments for comparing our approach with other methods and
assessing the effectiveness of the algorithm. The last section
provides our conclusion and future studies.
II. STUDIES IN WEB SCRAPING
The most studied issue for web scraping is to determine the
extraction patterns automatically. On the other hand, these
patterns are created manually by developers/experts in the
implementation of web scraping tools that use HTML
parsers. Determining the extraction pattern manually is
explained in detail in Section III.
Automated techniques can be grouped into supervised,
unsupervised and hybrid approaches. In a supervised
approach, extraction patterns are learned from labeled data.
Early studies for this approach are mainly based on wrapper induction methods [18]–[21]. Recent studies are about
machine learning techniques [22]–[25] that focus on predictions based on a learning model obtained from the training
dataset. Researchers in this field experiment with different machine learning algorithms and obtain new features for improving their learning models.
An unsupervised approach automatically finds extraction
patterns from single or multiple web pages and eliminates
the manual labelling task. This approach can be classified into three groups: statistical-based, template detection, and assumption techniques. Statistical-based techniques
utilize statistical information including content ratio [26],
tag ratio [27], link density [28], [29], quantity of parent nodes [30], quantity of HTML/word node tokens [31],
link/text percentage of nodes [32], ratio of non-hyperlink
character number to character number that hyperlinks contain [33] and ratio of words/leafs [34]. Template detection
techniques [35]–[38] find the main content blocks in all
blocks. A template is a master web page or shell document
that is used as a basis for creating new web pages. Blocks
are parts of a template including main content blocks, menus,
advertisements, banners, footers and so on. Assumption techniques [39]–[47] use the structure of the web document,
content nodes, main content block, some elements, style
information, etc. to extract relevant content.
A hybrid approach tries to exploit the advantages of both
approaches. Uzun et al. [3], [4] propose a hybrid approach
that has two steps: the first step uses a DOM (Document
Object Model) tree for learning extraction patterns and
the second step eliminates DOM Tree creation for efficient
extraction. The first step has three procedures including creating a DOM Tree, preparing features and predicting appropriate elements with an appropriate machine learning method.
The second step uses a simple string matching algorithm to
extract content by employing an extraction pattern that can be
used on other web pages of the related website. This second
step eliminates the first step and gives time efficiency to the
algorithm. These two studies are convenient for extracting
online newspapers, one-page blogs, and programming solution/example descriptions. Uçar et al. [5] developed a similar
algorithm for extracting review parts of a web document.
The second step of this study is utilized as one of the baselines
in our experiments.
There are many studies in determining the extraction pattern automatically. Besides, developers/experts prepare this
pattern manually in web scraping tools. All of the techniques,
approaches, and tools are DOM-based and ignore time efficiency in the extraction process. The main contribution of
this study is to increase the time efficiency of the extraction
process. Moreover, our approach collects three kinds of additional
information obtained from web pages of a website and
improves time efficiency in the extraction process.
III. PROBLEM DEFINITION
A web browser renders a display from a web document.
In the background of this rendering process, there are three
additional operations including construction of a DOM Tree,
processing of CSS (Cascading Style Sheets) and execution
of JavaScript. Fig. 1 shows a simple web page and the DOM
Tree constructed from a document.
A web document, D, contains characters, CH = {ch1, ch2, . . ., chn}. HTML parsers find HTML elements, E = {e1, e2, . . ., en}, from CH. An HTML element is typically made up of two tags: an opening tag (OT) and a closing tag (CT). Content, C, is written between these tags.
C may also contain inner HTML elements, Ee. All tags, T =
{t1 , t2 , . . ., tn }, are determined by W3C (World Wide Web
Consortium) that is an international community responsible for developing the Web standards. The web page has
tags such as <html>, <head>, <title>, <body>, <div>, <h1>,
<h2>, and <p> that are given in Fig. 1. Moreover, many
different tags can be utilized in web pages. The opening tag
of an element (OTe ) may have attributes and their values
that are defined as A = {a1 = v1 , a2 = v2 , . . ., an = vn }.
Attributes such as id, class, and itemprop that are shown
in Fig. 1 help to provide extra information about elements.
This extra information can be used for formatting the desired
elements in a web document. Some start tags such as <html>,
<head>, <body> and <title> in Fig. 1 may not contain any
attributes that are represented as A = { }. On the other
hand, a tag may contain one or more attributes like <div> tag
in Fig. 1. For example, the first <div> tag has the attribute
that contains id as the attribute name and ‘‘bigcontainer’’ as
attribute value that is case-insensitive and separated by an
equal sign. As a result, an HTML element can be expressed as
follows:
e = (OTe(te, Ae), (Ce, Ee), CTe)   (1)
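For example, the headline element in Fig. 1 instantiates (1) as e = (OTe(h1, {itemprop = ‘‘headline’’}), (Ce = the visible title text, Ee = { }), </h1>); that is, its opening tag carries a single attribute, its content C is the title text, and it contains no inner elements.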
A DOM tree is utilized to represent HTML elements and
their hierarchies. In web scraping, OTe can be used for preparing an extraction pattern to find C by searching on D. The
desired elements are determined by an expert programmer
as needed. For example, the following opening tags can be
employed to eliminate the unnecessary content and to extract
necessary content in Fig. 1.
• <div class = ‘‘contentAll’’>: All necessary content
• <h1 itemprop = ‘‘headline’’>: Title of the web page
• <h2 itemprop = ‘‘Summary’’>: Summary of the web
page
• <div itemprop = ‘‘articleBody’’>: Main text of the web
page
• <p>: Inner texts of the web page
A web developer can prepare an extraction pattern by
using the tag of an element (te ) and specifying desired
attributes (Ae ) in OTe . Moreover, this pattern is used to
produce different web pages for a website. Therefore, when
required patterns are resolved for a website, the extraction
process can be easily applied by using resolved elements.
A software developer may choose one or more than one of
these extraction patterns for her / his purpose. For instance,
if a developer is developing a search engine project, instead
of storing the entire HTML document, a developer can
store the text in a particular element such as <div class =
‘‘contentAll’’>. On the other hand, if there is not enough
storage space for this task, the developer can store only the texts of
two elements including <h1 itemprop = ‘‘headline’’> and
<h2 itemprop = ‘‘Summary’’>.
IV. EXTRACTION METHODS
Extraction methods in web scraping can be classified as
DOM-based and string-based methods. DOM-based methods
are based on creating a DOM-tree for a web page and searching a specific element in this tree. String-based methods are
focused on directly extracting content from CH.
In a web browser, a web developer can access, add, alter
and delete all elements in a web page by using methods of
JavaScript-based on the DOM tree. Parsers of programming
languages in this task are usually built on this structure. For
example, these parsers have similar methods that create a
DOM tree and access the elements of this tree. Extraction
methods are mostly used to perform web scraping by traversing
the DOM tree.
FIGURE 1. An example of an HTML document and the DOM tree of this document.
In JavaScript, there are some methods including getElementsByTagName, getElementByID, getElementsByClassName, and getElementsByName for traversing a DOM tree
and extracting content (Ce ) using a tag (te ) or attribute of
id, class, and name (Ae ) in the OTe, respectively. However,
these methods are not appropriate for accessing attribute of
itemprop. querySelector() method returns the first element
that matches a specified CSS selector(s) in an HTML document. For example, querySelector(‘‘[itemprop = headline]’’)
method can be used to access the element of <h1 itemprop =
‘‘headline’’>. In this study, MS_HtmlDocument in .Net, which has methods similar to those of JavaScript, is selected for comparison with the other methods. Although such methods are available for traversing the DOM tree, the W3C, which sets official Web standards, has introduced a new standard, XPath, for navigating through elements and attributes
in the DOM tree. XPath uses path expressions to access
an element. These path expressions look very much like
the path expressions in computer file systems. For selecting
<h1 itemprop = ‘‘headline’’> element, XPath expression
is ’/html/body/div/div/h1[@itemprop = ‘‘headline’’]’. XPath
expressions can also be utilized in JavaScript, Java, Python,
Ruby, C# and many other languages.

FIGURE 2. The flowchart of the UzunExt.

In this study, the .Net
parser HAP2 is preferred to implement XPath expressions.
Some programming languages have general query facilities for applying to all sources of information. In .Net, LINQ
(Language-Integrated Query) supports these facilities. In this
study, the AngleSharp parser3 that allows the use of LINQ is
selected.
The three methods described above are based on the DOM tree
that is a convenient structure for manipulating all elements.
However, an expert programmer only needs a certain number of elements for web scraping. In this situation, regular
expressions, the well-known text processing technique, can
be considered as a solution for obtaining the content of certain
elements. For example, the following expressions can be used
for extraction:
• <h1.itemprop =.headline.>(.*?)</h1>
• <h2.itemprop =.Summary.>(.*?)</h2>
• <p>(.*?)</p>
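As a minimal sketch of this regex-based extraction in C# (the sample HTML and the slightly more explicit patterns are illustrative, not taken from the paper's dataset), the idea could look as follows:

using System;
using System.Text.RegularExpressions;

class RegexExtractionDemo
{
    static void Main()
    {
        // Illustrative page fragment; the tags mirror the extraction patterns listed above.
        string html = "<html><body><h1 itemprop=\"headline\">Example title</h1>"
                    + "<p>First paragraph.</p><p>Second paragraph.</p></body></html>";

        // The non-greedy group (.*?) captures the text between the opening and closing tags.
        Match title = Regex.Match(html,
            "<h1[^>]*itemprop\\s*=\\s*\"headline\"[^>]*>(.*?)</h1>", RegexOptions.Singleline);
        Console.WriteLine(title.Groups[1].Value);            // Example title

        // Repetitive elements such as <p> need Matches instead of Match.
        foreach (Match p in Regex.Matches(html, "<p>(.*?)</p>", RegexOptions.Singleline))
            Console.WriteLine(p.Groups[1].Value);            // both paragraphs
    }
}

As the following paragraph explains, such patterns are only safe for elements like <h1> and <p> that are not nested; a nested <div> requires counting inner tags.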
The form of the <h1>, <h2> and <p> elements is suitable for a regular language, which is a formal language [48] that can be expressed using a regular expression. However, the form of the <div> element is not appropriate for using regular expressions because this element can be represented in a nested form. That is, the form of the <div> element is context-free because the number of inner <div> elements is not known. If the number of inner <div> elements is known, the problem can be expressed with a regular expression.
Moreover, the time complexity can be improved because regular languages can be recognized by a finite state machine of m states, where m is a finite number. This means it requires O(m) space and O(n) time, where n is the length of the data. On the other hand, context-free languages can be recognized by a non-deterministic pushdown automaton, for example with the CYK algorithm [49]. The time complexity of this algorithm is O(n³ · G), where n is the length of the data and G is the number of rules. This study aims to improve the time efficiency of web scraping by converting it to a regular language level.
There are few studies that have attempted to improve the
time efficiency of the web scraping process. Uzun et al. [50]
propose a simple parser, namely SET Parser, using basic
string matching for efficient extraction. This parser has been
used by some studies [3]–[5] for eliminating DOM tree
construction. This study introduces a much more efficient
approach using the additional information obtained from a
crawling process.
2 HAP (HTML Agility Pack) is an agile HTML parser that builds a DOM tree and allows parsing of this DOM tree. For more information: http://htmlagility-pack.net/.
3 The ultimate angle brackets parser library (AngleSharp) constructs a DOM tree based on the official W3C specifications. For more information: https://github.com/AngleSharp/AngleSharp.

TABLE 1. An example extraction pattern file holding the OTe, type of content and additional information.

V. APPROACH OVERVIEW
Our approach, namely UzunExt, is two-fold: extracting content from a web page and predicting suitable values of the stored additional data. Fig. 2 presents the flowchart of the approach. According to Fig. 2, after a web page is downloaded, extraction patterns are obtained from the rule file for
a website and hyperlinks are parsed from this page. The rule
file holds rules as given Table 1 for each website. A hyperlink
starts with an anchor opening tag <a, and includes a hyperlink
reference href = ‘‘URL for the page’’. The following regular expression <a.href..(.*?).>.*?</a> can easily be used to
extract all hyperlinks in a web page.
In Table 1, the OTe and type of content are determined
by an expert web programmer. The OTe is a pattern that
is utilized for searching in the downloaded web page. The
additional information is estimated during a crawling process
for speeding up search and extraction operations. A conventional search algorithm starts at the beginning of the
document. However, the search algorithm in our approach
uses the starting position information, that is, the initial
index. The number of inner tag information holds the number
of tag occurrences for the content of the element. A value
of −1 in the number of inner tag information indicates that the
number of tag occurrences in the element is not fixed. If this
value is fixed, this situation reduces the duration time of the
extraction process because of time complexity. The repetition
information is used for breaking the extraction process after
the first extraction. In this case, the rest of the document is
not checked.
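As an illustration only (the field names below are assumptions; the paper does not publish the rule file format beyond Table 1), one rule might be represented in .Net as:

// Sketch of one row of the rule file described above (names are illustrative).
public sealed class ExtractionRule
{
    public string OpeningTag { get; set; }          // OTe, e.g. "<div class=\"contentAll\">"
    public string ContentType { get; set; }         // e.g. "main text", "title"

    // Additional information predicted during the crawling process:
    public int StartingPosition { get; set; } = 0;  // initial search index; 0 = start of the page
    public int InnerTagCount { get; set; } = -1;    // -1 = the number of inner tags is not fixed
    public bool Repetition { get; set; } = true;    // false = stop after the first extraction
}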
After obtaining rules, the extraction step is applied to each
rule for a web page. This process uses additional information
loaded from the rule file. It extracts the content of a web
page and finds the starting position, the number of inner
element and the repetition as data. The content of a web page
and the additional data are stored in the storage. After the storing
process, the data are evaluated in the prediction step for finding
values of the starting position, the number of inner element
and the repetition. Finally, these values are saved to the rule
file as additional information.
The rest of this section gives more information about the
approach. Firstly, the extraction step is described through an
algorithm. The second subsection covers the prediction step
of the additional information. Thirdly, the outputs of these
steps are examined through an example. The next subsection
is about searching algorithms that can be used in the extraction step. The final subsection describes how the approach can easily be
adapted to the algorithms of other studies to improve their
time efficiencies.
Algorithm 1 Initialization of the Extraction Step
Data: OTe: searching pattern, source (CH): the source of a web page, startingPosition, innerTagCount, repetition, sa: a search algorithm
Result: List: a string list that holds the result of extraction, data_startingPosition: holds the first index value of searching
1  start_tn = resolveStartTag(OTe);
2  end_tn = resolveEndTag(OTe);
3  start_index = Search(OTe, source, startingPosition, sa);
4  if start_index == −1 then
5      start_index = Search(OTe, source, 0, sa);
6  end
7  data_startingPosition = start_index;
8  if start_index != −1 then
9      end_index = Search(end_tn, source, start_index, sa) − start_index + end_tn.Length;
10     start_tag = 0;
11     if innerTagCount > −1 then
12         start_tag = innerTagCount;
13     end
14     end_tag = 0;
15     result_temp = ‘‘’’;
16     call Algorithm 2;
17 end
18 return List, data_startingPosition
A. EXTRACTION STEP
The main algorithm of the UzunExt extracts web content
from web pages by using string methods. It has six parameters as input, including the opening tag (pattern - OTe ),
source, starting position (startingPosition), number of nested
tag name (innerTagCount), repetition of tag name (repetition)
and a search algorithm (sa). The OTe determines the pattern
to be searched in the source text (CH). The starting position
specifies an index that is the starting position to search in
the CH. The number of nested tag name parameter defines whether the formal language is a regular language or a context-free language. When this parameter value is ‘‘−1’’, it means that a context-free language is selected. On the other hand, if this parameter is greater than −1, it means a regular language. The repetition of an element name is a Boolean value
to determine whether the element name has one item or more
than one item in the source. The starting position, number
of nested tag name and repetition parameters are taken from
the rule file. The search algorithm allows the use of different
search algorithms. The initialization of the Extraction step is
given in Algorithm 1.
In Algorithm 1, the tag name (start_tn) and the closing
tag name (end_tn) are parsed for the OTe (Line 01, 02).
For instance, when the OTe is <div class = ‘‘contentAll’’>,
the start_tn is <div and the end_tn is </div>. The Search
Algorithm 2 The Extraction Step of UzunExt
1  while True do
2      sub_temp = source.Substring(start_index, end_index);
3      result_temp += sub_temp;
4      if innerTagCount == −1 then
5          start_tag += Count(start_tn, sub_temp, sa);
6      end
7      end_tag++;
8      if start_tag != end_tag then
9          start_index += end_index;
10         end_index = Search(end_tn, source, start_index, sa) − start_index + end_tn.Length;
11     else
12         add result_temp to List;
13         if repetition == false then
14             break;
15         end
16         start_index = Search(OTe, source, start_index + end_index, sa);
17         if start_index == −1 then
18             break;
19         end
20         end_index = Search(end_tn, source, start_index, sa) − start_index + end_tn.Length;
21         result_temp = ‘‘’’;
22         if innerTagCount > −1 then
23             start_tag = innerTagCount;
24         else
25             start_tag = 0;
26         end
27         end_tag = 0;
28     end
29 end
method returns the index of the first occurrence (start_index)
of the OTe in a source (Line 3). If the appropriate value
for the initial index (startingPosition) is known, the number of comparisons in the loop of the search algorithm is
reduced, thus making the search more efficient. Values of
data_startingPosition obtained from web pages are stored to
be used in the prediction step of our approach. However, if the startingPosition is not selected correctly, the algorithm continues searching until the end of the CH, and the
OTe cannot be found. Although this situation is not desirable,
the algorithm also searches for the text before the starting
position (Line 5). The search method is utilized to find both
the start_index and the end_index. After the values are found
in the initial process, the extraction step is started as given
Algorithm 2.
In Algorithm 2, Substring method (Line 2) extracts the
characters from the CH, between two specified indices
(start_index, end_index) determined in Algorithm 1, and
returns the new substring. This substring is appended to
result_temp (Line 3). If the value of innerTagCount is −1,
it means that innerTagCount is not known. In this case,
the Count method (Line 5) calculates and returns the number of the start_tn for a given string (sub_temp), and the
algorithm works like a context-free language. On the other
side, this method is not needed if the algorithm knows the
number of occurrences of the start_tn in a source. In this case,
the algorithm works like a regular language, and the start_tag
is set to a fixed value that is taken from innerTagCount.
The element <div class = ‘‘contentAll’’> in Figure 1 has two
elements including itself and the element <div itemprop =
‘‘articleBody’’>. If the algorithm knows the innerTagCount
as two, it does not need to use the Count method. In other
cases, the algorithm calculates the number of start_tn and
updates the start_tag (Line 5).
The last additional information in Algorithm 2 is the repetition. If the value of repetition is False, the algorithm breaks
the loop after the first extraction (Line 14). However, when a
web page can have multiple content like the elements of <div
class = ‘‘adv’’> and <p> in Figure 1, the value of repetition
must be True for obtaining all the results. Otherwise, the algorithm continues to search for the next OTe in the CH (Line 16).
If the search result is ‘‘−1’’ (Line 17), the algorithm breaks
the loop. If the search result is not −1, the end_index is
found by the Search method (Line 20). Then, the variables
are updated and the loop continues to extract the other content
(goto Line 2).
Algorithm 1 returns values including a List that is a string
list of extraction results and data_startingPosition that is the
index of the first occurrence of OTe in the CH. Values of
data_startingPosition are collected in order to be utilized in the
prediction step.
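To make the two algorithms concrete, the following is a simplified C# sketch of the same idea (it relies on IndexOf/Substring, hard-codes ordinal comparison instead of the pluggable search algorithm sa, and omits the bookkeeping of data_startingPosition); it is a re-implementation for illustration, not the author's released code:

using System;
using System.Collections.Generic;

static class UzunExtSketch
{
    // Extracts the content of every element that starts with openingTag, counting
    // occurrences of the tag name so that the matching closing tag is found even
    // when elements of the same name are nested.
    public static List<string> Extract(string source, string openingTag,
                                       int startingPosition = 0,
                                       int innerTagCount = -1,
                                       bool repetition = true)
    {
        string tagName = openingTag.Substring(0, openingTag.IndexOfAny(new[] { ' ', '>' })); // "<div"
        string closingTag = "</" + tagName.Substring(1) + ">";                               // "</div>"
        var results = new List<string>();

        int start = source.IndexOf(openingTag, Math.Min(startingPosition, source.Length),
                                   StringComparison.Ordinal);
        if (start == -1)                                   // predicted position was wrong: retry from 0
            start = source.IndexOf(openingTag, 0, StringComparison.Ordinal);

        while (start != -1)
        {
            int pos = start, open = 0, close = 0;
            do
            {
                int end = source.IndexOf(closingTag, pos, StringComparison.Ordinal);
                if (end == -1) return results;             // malformed page: give up
                end += closingTag.Length;
                open = innerTagCount > -1
                     ? innerTagCount                       // fixed count: the regular-language case
                     : open + CountOccurrences(source, tagName, pos, end);
                close++;
                pos = end;
            } while (open != close);                       // balance opening and closing tags

            results.Add(source.Substring(start, pos - start));
            if (!repetition) break;                        // only the first occurrence is needed
            start = source.IndexOf(openingTag, pos, StringComparison.Ordinal);
        }
        return results;
    }

    static int CountOccurrences(string s, string pattern, int from, int to)
    {
        int count = 0, i = s.IndexOf(pattern, from, StringComparison.Ordinal);
        while (i != -1 && i < to)
        {
            count++;
            i = s.IndexOf(pattern, i + pattern.Length, StringComparison.Ordinal);
        }
        return count;
    }
}

A call such as Extract(html, "<div class = \"contentAll\">", startingPosition, innerTagCount: 2, repetition: false) then mirrors one row of the rule file in Table 1.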
B. PREDICTION STEP
In this section, the prediction of the startingPosition,
innerTagCount, and repetition are separately examined.
1) PREDICTION OF THE startingPosition
Each extraction pattern has its own value for the startingPosition
as given in Table 1. For finding this value, Algorithm 1 stores
all extraction positions in the storage. After the storing process,
the minimum value of these values can be selected as the
starting position, to speed up the search process. The value of
starting position can be discovered while crawling web pages
belonging to a website. Algorithm 3 indicates the prediction
of the StartingPosition.
In Algorithm 3, the Min method returns the smallest value
in a list. The Len method returns the length (the number of
items) of a list. The UzunExt approach uses Grubbs’ test to
assess whether one of the values in the startingPosition_List
is a significant outlier from the rest before returning the
minimum value. However, at least three values are needed
for this test to be applied (Line 02). There are two versions
as maximum and minimum of this test. Our algorithm uses
TABLE 2. The outputs of the algorithms of the approach with an example.
Algorithm 3 Finding an Appropriate Value for the StartingPosition
Data: List_startingPosition: all data obtained from Algorithm 1 for each rule
Result: min: an appropriate value for using as the StartingPosition
1  min = Min(List_startingPosition);
2  if Len(List_startingPosition) >= 3 then
3      mean = Calculate_Mean(List_startingPosition);
4      sd = Calculate_SD(List_startingPosition);
5      if sd == 0 then
6          return List_startingPosition[0];
7      min = List_startingPosition[0];
8      foreach value in List_startingPosition do
9          G = |mean − value| / sd;
10         if G < CriticalZ[Len(List_startingPosition)] and min > value then
11             min = value;
12 return min;
to test whether the minimum value is an outlier, with the following one-sided test:

G = (Ȳ − Ymin) / s   (2)

with Ȳ, s, and Ymin denoting the sample mean (Line 3), the standard deviation (Line 4: sd), and the minimum value of the List_startingPosition, respectively. This test detects outliers from normal distributions. The value of G is compared with the Grubbs’ critical value table (Line 10: CriticalZ), which is used to determine whether the minimum value is the outlier. Thanks to Grubbs’ test, the values furthest away from the mean are ignored in order to estimate a value close to the mean.
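A compact C# reading of Algorithm 3 might look as follows (the few critical values in the table are illustrative placeholders; a real implementation would carry the full Grubbs table):

using System;
using System.Collections.Generic;
using System.Linq;

static class StartingPositionPredictor
{
    // Illustrative subset of Grubbs' critical values indexed by sample size.
    static readonly Dictionary<int, double> CriticalZ = new Dictionary<int, double>
    {
        { 3, 1.15 }, { 4, 1.48 }, { 5, 1.71 }, { 6, 1.89 }, { 7, 2.02 }, { 8, 2.13 }
    };

    public static int Predict(IReadOnlyList<int> startingPositions)
    {
        int min = startingPositions.Min();
        if (startingPositions.Count < 3) return min;            // Grubbs' test needs >= 3 samples

        double mean = startingPositions.Average();
        double sd = Math.Sqrt(startingPositions.Sum(v => (v - mean) * (v - mean))
                              / startingPositions.Count);
        if (sd == 0) return startingPositions[0];               // all observations identical

        // Keep the smallest value whose Grubbs statistic does not mark it as an outlier.
        min = startingPositions[0];
        int n = Math.Min(startingPositions.Count, 8);           // clamp to the sketch's small table
        foreach (int value in startingPositions)
        {
            double g = Math.Abs(mean - value) / sd;
            if (g < CriticalZ[n] && value < min)
                min = value;
        }
        return min;
    }
}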
C. PREDICTION OF THE innerTagCount AND REPETITION
In the prediction step, the innerTagCount and repetition can
be determined by examining at least two web pages of a
website. For the opening tag <div id = ‘‘bigcontainer’’>
in Fig. 1, the number of inner tags is six. If these numbers
are equal in two web pages, this value can be used as the
innerTagCount. If the number of repetitions is more than one
for two web pages, it means that the value of the repetition
is true. For the opening tag <div class = ‘‘adv’’> in Fig. 1,
the value of repetition is true. On the other hand, the value
of repetition is false for elements <div id = ‘‘bigcontainer’’>,
<div class = ‘‘header’’>, <div class = ‘‘menu’’>, <div class =
‘‘contentAll’’>, <h1 itemprop = ‘‘headline’’>, and so on.
Two web pages are enough for most websites, but some
websites and some extraction patterns need more than two
web pages. For example, the number of inner tag in the
opening tag <div id = ‘‘bigcontainer’’> is six in the first
five pages, but this value may change on other web pages.
Therefore, simple tests are done for every website in order
to find out how many pages are enough for the prediction
step. Simple heuristic tests show that six/seven web pages are
enough for prediction of the innerTagCount and repetition.
In the study, ten web pages are selected for the prediction step
by considering the number of errors that may occur on other
websites. For these two parameters, these values should be
rechecked at a certain time because a web designer can make
changes in the structural design of his/her website.
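The heuristic above can be sketched as follows (the helper and parameter names are hypothetical; the paper only describes the rule informally):

using System.Collections.Generic;
using System.Linq;

static class TagStatisticsPredictor
{
    // innerCounts:  number of inner tags observed per page for one extraction pattern.
    // matchCounts:  number of extracted elements observed per page for the same pattern.
    public static (int innerTagCount, bool repetition) Predict(
        IReadOnlyList<int> innerCounts, IReadOnlyList<int> matchCounts)
    {
        // Fix the inner-tag count only if it was identical on every observed page;
        // otherwise keep -1 so that the count is recomputed during extraction.
        int innerTagCount = innerCounts.Distinct().Count() == 1 ? innerCounts[0] : -1;

        // If the pattern ever matched more than once on a page, treat it as repetitive.
        bool repetition = matchCounts.Any(c => c > 1);

        return (innerTagCount, repetition);
    }
}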
D. OUTPUTS OF THE ALGORITHMS
In this section, the outputs returned from 15 web pages of
an example website are examined to understand the approach
better. Table 2 indicates the outputs obtained from steps of the
extraction and prediction.
In Table 2, the algorithm of the extraction step takes the
additional information of startingPosition, innerTagCount,
and repetition from the prediction step. In the first web page,
these values are 0, −1, and True, respectively. The value
of 0 in startingPosition means that the search process starts
from the beginning of the web page. The prediction step finds
the most appropriate value for startingPosition by looking at
the previous data obtained from the extraction step. If the
value of the prediction step is not appropriate, the search
process starts again at the beginning of the web page. This
situation can be considered as a second search (SS) that
occurred three times in Table 2.
The value of −1 in innerTagCount indicates that the number of nested tags should be calculated. The value of
True in repetition means that the extraction process continues
until the end of the source of a web page. These values are
calculated throughout the first ten web pages. After ten web
pages, the prediction step checks the data obtained from the
extraction step for determining the additional information on
innerTagCount and repetition. The data of ITC (innerTagCount) in the first
ten web pages are 5 and equal for all extraction processes.
Therefore, the ITC information is set to 5. On the other hand,
the data of repetition for the first web pages is False, and the repetition information is set to False. The effect of each additional
information is examined in the experiment section.
E. SEARCHING ALGORITHMS
One of the most crucial issues in web data extraction is the searching process. There are many string searching algorithms, sometimes called string matching algorithms, in the literature. Moreover, popular programming languages such as Java, C#, Python, Ruby, and JavaScript have indexOf methods that return the position of the first occurrence of a specified value in a string. Buluş et al. [8] try to find suitable pattern matching algorithms for web pages. In this study, seven of these algorithms and the indexOf methods are used in the experiments. Moreover, the Count method in Algorithm 2 makes use of the Search method.

F. APPLYING THE UzunExt APPROACH TO OTHER STUDIES
According to the experiments, it is necessary to use a unique opening tag for efficient extraction. Moreover, this unique tag improves the time efficiency of the UzunExt approach. The methods of most of the other published studies may recommend any tag for this task. However, there is no effective extraction over repetitive tags, and they can also cause errors. For example, in Fig. 1, although the main text can be obtained via the <p> tag, the same text can be acquired through the <div class = ‘‘contentAll’’> tag. Since the <p> tag can be repeated throughout a web page, the entire web page is scanned. That is, the UzunExt performs an extraction process for each <p> tag. Additionally, the <p> tag can also contain unnecessary content of a web page. Therefore, Algorithm 4, which finds a unique opening tag for the appropriate extraction pattern, is developed. Algorithm 4 contains pseudocode that can easily be adapted to previous studies in this task and allows for faster extraction of web content.

Algorithm 4 Discovering an Appropriate Element for Efficient Extraction
Data: a web page: an HTML document, a specific element: the resulting element from any extraction algorithm
Result: theElement: the opening tag
1 HTMLElements = CreateDOM(a web page);
2 theElement = HTMLElements.FindElement(a specific element);
3 while true do
4     if theElement == Body or theElement.Unique then
5         return theElement;
6     else
7         theElement = theElement.parentElement;

In Algorithm 4, a DOM tree including all the HTML elements and their hierarchy is created (Line 01). The location of a specific HTML element in the DOM tree is found via the FindElement method (Line 02). If the element found is the <body> element or a unique element, the algorithm does not need to traverse the other elements (Line 05). All elements which have the same attributes in the DOM tree are searched to understand whether the element is unique. Otherwise, the parent of the element becomes the new element for the loop (Line 07).
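Under the assumption that a DOM-based study hands over an XPath to the chosen element, a hedged sketch of Algorithm 4 on top of the HAP parser could be written as follows (the uniqueness test via tag name plus class is my own simplification, not code from the paper):

using System.Linq;
using HtmlAgilityPack;

static class UniqueOpeningTagFinder
{
    // Walks up from the element chosen by any extraction algorithm until it reaches
    // <body> or an element whose tag/class combination occurs only once on the page.
    public static HtmlNode FindUniqueAncestor(string html, string xpathOfSpecificElement)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);                                                           // Line 01
        HtmlNode element = doc.DocumentNode.SelectSingleNode(xpathOfSpecificElement); // Line 02

        while (element != null)
        {
            if (element.Name == "body" || IsUnique(doc, element))                    // Line 04
                return element;                                                       // Line 05
            element = element.ParentNode;                                             // Line 07
        }
        return null;                                                                  // nothing suitable found
    }

    static bool IsUnique(HtmlDocument doc, HtmlNode element)
    {
        string cls = element.GetAttributeValue("class", "");
        return doc.DocumentNode.Descendants(element.Name)
                  .Count(n => n.GetAttributeValue("class", "") == cls) == 1;
    }
}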
VI. EXPERIMENTS
In this section, firstly, information about the dataset and extraction patterns prepared for this task is given. The second subsection covers the average extraction time of three DOM-based parsers and SET Parser. In the last subsection, the time efficiency of the UzunExt approach is examined from several different perspectives. All experiments are carried out on an Intel Core i5-3.2 GHz 8 GB RAM computer with the Windows 10 operating system.

TABLE 3. Information about dataset.
A. DATASET AND EXTRACTION PATTERNS
The approach presented in this study requires more than
one web page per website. Unfortunately, since we do not
have an appropriate dataset for our requirements, a simple crawler
has been developed to create this dataset. This crawler was
used to download 30 web pages for 100 websites containing
the text of different languages including English, Russian,
German, French, Italian, Spanish, Turkish, Chinese, and
Japanese. The datasets of previous studies focus only on
a single category such as main content, comment content
and repetitive content. This study contains web pages in
many different categories including article, dictionary, health,
movie, newspaper, shopping, and trip. After downloading
3000 web pages, 2-4 extraction patterns were determined for
each website. The statistical results obtained from web pages
and extraction patterns are given in Table 3 and Table 4,
respectively.
In Table 3, websites are divided into seven different categories in order to examine the differences between websites.
Table 3 indicates the correlation between the average web
page sizes, the average number of characters in web pages and
the average number of elements in the DOM tree for different
categories in our dataset. The average size of web pages in the
seven categories is different from each other. When the web
pages in the categories of articles, dictionary, and health are
examined in detail, it appears that these web pages contain
fewer advertisements and hyperlinks than other web pages.
Therefore, the average web page sizes of these categories are
smaller than the other web pages. On the other hand, the web
pages in the categories of movie, newspaper, shopping, and trip contain increasingly more of this additional content.

TABLE 4. Information about extraction patterns.

In the
following subsections, the correlation between the effects on
the average response time of the algorithms and the increase
of the average file size, the number of characters, the number
of elements is examined.
Table 4 gives the statistical information about 247 extraction patterns. These rules include 13 different tags and may
include one or more attributes such as ID, class, and others.
Some elements have a fixed number of inner tags in web
pages of a website. However, for 49% of the <div> tags this number is not fixed.
Determining the number of inner tags can speed up the
extraction process of the UzunExt approach. For some extraction patterns, after completing the first extraction, there is no
need to perform further extractions. That is, some extraction
patterns are not repetitive for a web page. Only 3% of elements are repetitive in the prepared rules. In the following
sections, the effect of these situations on the UzunExt is
analyzed.
B. EXTRACTION WITH THE DOM-BASED ALGORITHMS
AND SET PARSER
Most programming languages have several parsers for
DOM-based operations. In this study, three different parsers
including MS_HtmlDocument, AngleSharp, and HAP are
used to create a DOM Tree and find a specific element on this
tree. In the DOM tree, finding the specific element is enough
to obtain the content. Moreover, SET Parser, which does not
need to create a DOM tree, performs the extraction process
using regular expressions and string methods. Table 5 indicates the average duration time obtained from the dataset and
extraction patterns for these three parsers and SET Parser.
In Table 5, the DOM-based parsers including MS_HtmlDocument, AngleSharp, and HAP take an average creation time
of 16.461 ms, 6.921 ms, and 10.377 ms, respectively to create
a DOM tree of all web pages. Additionally, these parsers give
an average finding time of 2.542 ms, 1.064 ms, and 0.486 ms,
respectively, in order to reach the content of the opening
tag. Finally, the average extraction time results of these
algorithms are about 19.004 ms, 7.985 ms, and 10.863 ms,
respectively. The MS_HtmlDocument parser gives the worst
result in both creating and finding.

TABLE 5. The average duration time of the DOM-based parsers and SET Parser.

The average duration of the AngleSharp parser in the creation of a DOM tree is
better than the use of the HAP parser. On the other hand,
the average finding time of the HAP parser is better than the
average finding time of other DOM-based parsers. Although
the HAP parser has a longer average time for creating a
DOM tree, it gives faster results in the finding of the element.
The AngleSharp parser can be selected if there is only one
search on a DOM tree. On the other hand, the HAP parser
is suitable if more than one search is performed for this
task. SET Parser has an average extraction time of 4.150 ms.
In the DOM-based algorithms, the number of elements is an
important issue, but in SET Parser, the web page size and
the number of characters are crucial issues for the extraction
process. SET Parser searches in the text of a web page without performing the DOM creation process. Table 5 indicates that
SET Parser allows for fast extraction when compared to the
DOM-based parsers. The UzunExt approach is intended to
make this extraction faster than other extraction parsers.
C. EXTRACTION WITH THE UzunExt APPROACH
In this section, firstly the determination of the most suitable
search algorithm for the search part of the UzunExt approach
is investigated. Then, the effects of different extraction ways,
the effects of the additional information, efficiency of different ways during the crawling process and the effects of
tags are examined for a better understanding of the UzunExt
approach.
TABLE 6. The average duration time of searching algorithms.
1) SEARCHING ALGORITHMS
Searching algorithms, which are one of the crucial parts of
the UzunExt approach, have been studied many times [51].
Buluş et al. [8] have observed the performances of 31 different searching algorithms on web pages. Seven of these
algorithms including Skip Search [52], Tuned Hume and
Sunday [53], Horspool [54], Backward Nondeterministic
Dawg Matching [55], Optimal Mismatch [56], Boyer and
Moore [57] and Raita [58] are used in our study. Moreover,
the IndexOf method from the .NET base class library
is compared with these algorithms. This method performs a
culture-sensitive search. For optimizing string performance,
the culture-sensitive search is disabled with StringComparison.Ordinal (Indexof_Ordinal), one of the enumeration
values used to specify how the pattern is compared during the search.
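A small illustration of the difference (the sample string is illustrative; only the comparison mode changes):

using System;

class OrdinalSearchDemo
{
    static void Main()
    {
        string page = "<html><body><div class=\"contentAll\">...</div></body></html>";
        string pattern = "<div class=\"contentAll\">";

        int cultureAware = page.IndexOf(pattern);                        // culture-sensitive (default)
        int ordinal = page.IndexOf(pattern, StringComparison.Ordinal);   // plain ordinal comparison

        // Both return the same index for HTML tags; the ordinal search simply avoids
        // the culture-sensitive comparison machinery and is therefore faster.
        Console.WriteLine($"{cultureAware} {ordinal}");
    }
}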
A search algorithm starts from the beginning of the text
and continues until the pattern is found in this text. That is,
the starting index of a search algorithm is zero, and the ending
index is the length of the web page. Determining the starting
index can speed up the search by reducing the number of
comparisons in the search. However, determining this index
is a difficult task. The UzunExt approach selects and uses the
appropriate index for a website and its extraction patterns.
The appropriate index is the minimum index obtained from
30 web pages in the experiments for a website and its extraction patterns. Table 6 shows the average duration time of the
search algorithms and the effects of selecting the appropriate
index for these algorithms.
Most search algorithms have two main phases: the preprocessing phase and the searching phase. The preprocessing
phase of an algorithm consists of collecting some information like statistical data or character ordering about a pattern and making a model to search. The searching phase
finds the index of a pattern in a text by using information
obtained from the preprocessing phase. Nevertheless, some
algorithms do not need a preprocessing phase. In Table 6,
the Indexof_Ordinal method, which does not require a preprocessing phase, gives the best result with 0.039 ms. The
Indexof_Ordinal method is much more effective when compared to the Indexof method. The algorithm that makes the
pattern searchable as soon as possible is the Skip Search
algorithm with a preprocessing time of 0.001 ms.

TABLE 7. The average extraction time of three different ways in the UzunExt for each category.

The most
appropriate starting index for each website has been selected
for understanding the effect of starting from a forward index
(see the appropriate index in Table 6). According to Table 6,
using the appropriate index has been a positive influence on
the searching phase. After determining the appropriate index,
all of the average searching times are improved. Table 6 indicates that the algorithms, which spend much time on preprocessing, are not suitable for this task. In this study,
the Indexof_Ordinal method is used for the extraction step
of the approach.
2) EFFECTS OF DIFFERENT WAYS OF THE UzunExt
APPROACH
The UzunExt approach can be adjusted in three different
ways. (1) The approach can extract relevant content without
using any additional information. In this case, the search
is performed from the top of the document, our algorithm
extracts each element found by the searching process, and
the algorithm continues this search until the end index of a
document. This way is the simple extraction of the approach.
(2) The second way, which is the default way of the approach,
predicts the additional information including startingPosition, innerTagCount, and repetition during the crawling process. (3) The third way, which is created to understand the
effectiveness of the default way better, uses predefined constants for additional information. These constants can be
set manually by an expert programmer. The third way can
be considered as a gold standard for the approach. In this
study, the optimal values obtained from 30 web pages of
each website's rules are utilized as constants. Table 7 indicates
the average extraction time of the three different ways and the
average number of characters obtained from the extraction results
for different categories.
The average extraction time of the first way, which does
not use any additional information, is about 0.130 ms. The
AngleSharp parser, which has the best average extraction
time, is about 7.985 ms. That is, the first way
provides an improvement of 98.37% compared to the AngleSharp parser. Moreover, the average extraction time of SET
Parser is about 4.150 ms. The first way provides an improvement of 48.03% compared to SET Parser. In the second way,
the startingPosition information starts to be learned from the first web page of a website. The information of InnerTagCount and repetition is decided after obtaining ten web pages, and this information is used on the following web pages. In Table 7, the average extraction time with only the position information is 0.119 ms on the first ten web pages. On the next twenty web pages, the other two pieces of information improve the average time to 0.055 ms. The average duration time of the extraction process is influenced by the length of the extraction results as well as by the size of the web page. As a result, the second way provides a significant improvement over the first way. The third way uses the most appropriate values as additional information. This way yields the best average extraction time of 0.056 ms. However, this technique needs already downloaded web pages for obtaining this additional information, whereas the second way continues to improve the time efficiency of the extraction step during the crawling process. The second way and the third way provide very close results after the first ten web pages.

3) EFFECTS OF THE ADDITIONAL INFORMATION IN THE UzunExt APPROACH

TABLE 8. The effects of the additional information for all extraction tasks and for appropriate extraction tasks.

In this experiment, the effects of each piece of information on the approach are examined separately for understanding the improvements. The startingPosition information is useful for all extraction tasks. Nevertheless, the information of innerTagCount and repetition is not appropriate for all extractions. Therefore, the contribution of this information is also examined with respect to the number of appropriate extraction tasks. Table 8 shows the effects of this information for all extractions and for appropriate extractions.

There are 247 extraction patterns, and 30 web pages for a website are utilized in the experiments. In other words, 7410 extraction tasks have been completed. The startingPosition information improves the average extraction time from 0.130 ms to 0.095 ms. Thus, there is a significant improvement of about 27.01% when using only the startingPosition information. The average extraction time with the innerTagCount information fell from 0.130 ms to 0.129 ms. This information provides a slight improvement of about 1.42%. When using only the repetition information in the extraction task, the average extraction time improves from 0.130 ms to 0.093 ms. This improvement is about 28.58%. The number of appropriate extraction tasks for the information of innerTagCount and repetition is 5010 and 6570, respectively. That is, this information can be useful in these extraction tasks. The innerTagCount information provides a small improvement of about 3.08% in the 5010 extraction tasks. The repetition information provides a significant improvement of about 30.32% in the 6570 extraction tasks. Consequently, the information of startingPosition and repetition yields a significantly better extraction time than the innerTagCount information.

4) COMPARISON OF THREE DIFFERENT WAYS DURING THE CRAWLING PROCESS
Figure 3 indicates the average extraction time for three different ways adjusted in the UzunExt approach. The average
extraction time for each web page of a website is considered in this experiment. In three different ways, the average minimum extraction times are 0.099 ms, 0.023 ms, and
0.026 ms, respectively. On the other hand, the average maximum extraction times are 0.343 ms, 0.257 ms, and 0.259 ms,
respectively. Depending on the file size and the number of
characters resulting from extractions, all of the ways are
affected similarly. The contribution of the additional information in Figure 3 is noticeable. The average time results of
the second and third ways are very close, especially after the
10th web page.
5) EFFECTS OF TAGS FOR THE UzunExt APPROACH
In this experiment, the third way of the UzunExt approach,
which has the most appropriate additional information,
is used for understanding the effects of the tags. Table 9 indicates the effects of tags on the average extraction time,
the average number of characters for content obtained from
the extraction step of the UzunExt approach and the number
of extraction tasks for tags.
In Table 9, the best average extraction time is 0.008 ms
with the <h1>, <h2>, <h3>, and <header> (H). The average
number of characters for content in these tags is about 173.26.
This value is significantly less than the average number of
characters of other tags. These tags allow for faster extraction
as expected because they contain fewer characters. When
the number of characters increases, the extraction time slows
down as seen in the <p>, <span>, and <ul> tags. These tags
have an average extraction time of 0.028 ms and the content
of these tags contains an average of 173.26 characters. These
results show the correlation between the extraction time and
the number of characters. However, the tags of <article>,
<section>, <table>, <field>, <main> (ASTFM) and the <div>
tag don’t support the expected findings. The average extraction time of the ASTFM tags is about 0.041 ms, and the
number of characters in the content of these tags is 17391.97.
The number of characters in the content of the <div> tag
is less than the number of characters in the content of the
ASTFM tags. However, the average extraction time of the
<div> tag is about 0.081 ms which is slower than the average
extraction time of the ASTFM tags.

FIGURE 3. Three different ways of the UzunExt approach.

TABLE 9. Effects of tags.

The reason for this slowness is that a <div> tag can contain a number of other <div> tags. That is,
this tag is a nested structure. Therefore, the number of inner
tag name (innerTagCount) is also examined in Table 9.
The <div> tag is popular for laying out a web page, so the
number of extraction tasks is 4530 for this tag. Knowing
this number can provide efficiency in the extraction step.
When the appropriate number is known, the average extraction time is only 0.012 ms for 2310 extraction tasks. On the
other hand, the average extraction time increases to 0.151 ms
for 2220 extraction tasks. However, another reason for this
increase is the number of characters in the content. The
number of characters is about 1952.48 and 10350.47 for the
average extraction time of 0.012 ms and 0.151 ms, respectively. The <div> tag is mostly used nowadays, but we expect
that the usage rate of the tags of <article>, <section>, and
<main> will increase in the future and these elements can
take the place of the <div> tag. These tags are new tags that were introduced in HTML5, so the usage rate of these elements on web pages is still low.
The following suggestions may be offered to the programmer who prepares extraction patterns for web data extraction. The H tags are especially useful for obtaining the title of a web page. The <p>, <span>, and <ul> tags can be used to extract content containing few characters. The <article>, <section>, and <main> tags can be suggested for getting the main content. However, the <div> tag is highly preferred by web designers; according to the experiments, we do not recommend a <div> tag that contains many inner <div> tags as an extraction pattern. Moreover, an extraction pattern that has a unique value, i.e., one that occurs only once on a page, may be preferred in terms of fast extraction. In general, the most appropriate extraction pattern is a tag that does not contain inner tags, is non-repetitive, and has few characters in its content.
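As a brief illustration of these suggestions, the pattern strings below are hypothetical examples for a single news website; the actual tag choices and attribute values depend entirely on the target site.

```python
# Hypothetical extraction patterns for one news website (attribute values invented).
title_pattern   = "<h1"                          # short content, fastest to extract
summary_pattern = '<p class="article-summary"'   # unique value, non-repetitive pattern
body_pattern    = "<article"                     # preferred over a deeply nested <div>
risky_pattern   = '<div class="content"'         # may contain many inner <div> tags
```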
VII. CONCLUSION
Most of the previous studies on web scraping focus on automatically extracting web content while ignoring time efficiency. These studies use traditional extraction methods based
on a DOM tree which takes all elements in a web page into
account. However, such an approach increases the time cost
in the extraction step. In this study, a novel approach, which
provides time efficiency by employing the string methods
and additional information obtained from web pages of a
website during a crawling process, is introduced. The average extraction time of the simple way of this approach is approximately 60 times better than that of the AngleSharp parser. Moreover, this approach collects additional information, including the starting position, the number of nested tags, and the repetition of the tag. Thanks to this information, the average extraction time is about 2.35 times better than that of the simple way of the approach. Simply put, since these two factors multiply (60 × 2.35 ≈ 141), while an extraction
process is completed in about 140 days with the DOM-based
parser, this process can be finished in approximately one day
by use of the UzunExt approach. From another perspective,
if one wishes to benefit from parallel computing to gain
time, a cluster of 140 cores and an enormous amount of
memory will be needed, while the same process can be done
with only a single, moderately equipped computer using the
proposed approach. Notably, the additional information of
starting position and repetition in this approach dramatically
improves the time efficiency of the extraction step.
A web designer/developer determines elements, attributes
for these elements, and a hierarchy between these elements
for creating a website. A researcher tries to discover the
desired ones from these elements by developing algorithms,
techniques, methods, and approaches. Recently, determining the attributes of these elements has gained importance. There is a community activity, schema.org, that defines the attributes of elements for the use of web designers. This community offers vocabularies covering entities, relationships between entities, and actions for different
fields. Many applications from Google, Microsoft, Pinterest,
Yandex, and over 10 million websites use these vocabularies
to identify their elements. Besides, these companies support
the creation of these vocabularies. When elements and their
attributes are known, it will be much easier to prepare an
extraction pattern, especially for the UzunExt approach.
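For instance, on a page marked up with schema.org microdata, the known attribute names translate almost directly into extraction patterns; the values below are hypothetical.

```python
# Hypothetical patterns built from schema.org microdata attributes on a product page.
name_pattern  = '<h1 itemprop="name"'
price_pattern = '<span itemprop="price"'
```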
The UzunExt approach is an open-source project and
available via the web page https://github.com/erdincuzun/
UzunExt. It can be easily integrated into other projects that
need web content. Moreover, it can be used to enhance the
time efficiency of automatic web content extraction studies
by determining an appropriate element. In particular, it can be
adapted to crawler applications in order to increase the time
efficiency of the content extraction step.
In this study, we deal with the text of a web page. However, this text can be changed with Ajax requests. Traditional
solutions [59] use an embedded browser to render web pages
and execute these requests. However, these solutions are
expensive in terms of time and network traffic. In the future,
we will try to detect these requests without rendering a web
page. Moreover, we also plan to develop an effective and
efficient web scraper that can create datasets automatically
for different purposes. Besides, we will also aim to develop
an algorithm that automatically obtains the desired element
without creating a DOM tree.
ACKNOWLEDGMENT
The author would like to offer his special thanks to
Assist. Prof. Dr. Heysem Kaya for comments, support, and
encouragement.
REFERENCES
[1] E. Ferrara, P. De Meo, G. Fiumara, and R. Baumgartner, ‘‘Web data
extraction, applications and techniques: A survey,’’ Knowl.-Based Syst.,
vol. 70, pp. 301–323, Nov. 2014.
[2] M. M. Najafabadi, F. Villanustre, T. M. Khoshgoftaar, N. Seliya, R. Wald,
and E. Muharemagic, ‘‘Deep learning applications and challenges in big
data analytics,’’ J. Big Data, vol. 2, no. 1, p. 1, Dec. 2015.
[3] E. Uzun, H. V. Agun, and T. Yerlikaya, ‘‘A hybrid approach for extracting
informative content from Web pages,’’ Inf. Process. Manage., vol. 49, no. 4,
pp. 928–944, Jul. 2013.
[4] E. Uzun, E. S. Güner, Y. Kılıçaslan, T. Yerlikaya, and H. V. Agun,
‘‘An effective and efficient Web content extractor for optimizing the
crawling process,’’ Softw., Pract. Exper., vol. 44, no. 10, pp. 1181–1199,
Oct. 2014.
[5] E. Uçar, E. Uzun, and P. Tüfekci, ‘‘A novel algorithm for extracting the
user reviews from Web pages,’’ J. Inf. Sci., vol. 43, no. 5, pp. 696–712,
Oct. 2017.
[6] Y.-C. Wu, ‘‘Language independent Web news extraction system based on
text detection framework,’’ Inf. Sci., vol. 342, pp. 132–149, May 2016.
[7] E. Uzun, H. N. Buluş, A. Doruk, and E. Özhan, ‘‘Evaluation of Hap,
AngleSharp and HtmlDocument in Web content extraction,’’ in Proc. Int.
Sci. Conf. (UNITECH), vol. 2, 2017, pp. 275–278.
[8] H. N. Buluş, E. Uzun, and A. Doruk, ‘‘Comparison of string matching
algorithms in Web documents,’’ in Proc. Int. Sci. Conf. (UNITECH), vol. 2,
2017, pp. 279–282.
[9] F. E. Grubbs, ‘‘Procedures for detecting outlying observations in samples,’’
Technometrics, vol. 11, no. 1, pp. 1–21, Feb. 1969.
[10] S. Chakrabarti, M. van den Berg, and B. Dom, ‘‘Focused crawling:
A new approach to topic-specific Web resource discovery,’’ Comput. Netw.,
vol. 31, nos. 11–16, pp. 1623–1640, May 1999.
[11] K. Shchekotykhin, D. Jannach, and G. Friedrich, ‘‘XCrawl: A high-recall
crawling method for Web mining,’’ Knowl. Inf. Syst., vol. 25, no. 2,
pp. 303–326, Nov. 2010.
[12] B. Pinkerton, ‘‘Finding what people want: Experiences with the
WebCrawler,’’ in Proc. 2nd Int. World Wide Web Conf., vol. 94, 1994,
pp. 17–20.
[13] S. Chakrabarti, K. Punera, and M. Subramanyam, ‘‘Accelerated focused
crawling through online relevance feedback,’’ in Proc. 11th Int. Conf.
World Wide Web (WWW), 2002, pp. 148–159.
[14] M. Diligentit, F. M. Coetzee, S. Lawrence, C. L. Giles, and M. Gori,
‘‘Focused crawling using context graphs,’’ in Proc. 26th Int. Conf. Very
Large Data Bases (VLDB), 2000, pp. 527–534.
[15] H. Dong and F. K. Hussain, ‘‘SOF: A semi-supervised ontology-learning-based focused crawler,’’ Concurrency Comput., Pract. Exper., vol. 25,
no. 12, pp. 1755–1770, Aug. 2013.
[16] M. Kumar, R. Bhatia, and D. Rattan, ‘‘A survey of Web crawlers for
information retrieval,’’ WIREs Data Mining Knowl. Discovery, vol. 7,
no. 6, p. e1218, Nov. 2017.
[17] F. Menczer, G. Pant, and P. Srinivasan, ‘‘Topical Web crawlers: Evaluating adaptive algorithms,’’ ACM Trans. Internet Technol., vol. 4, no. 4,
pp. 378–419, Nov. 2004.
[18] N. Kushmerick, Wrapper Induction for Information Extraction. Seattle,
Washington: Univ. of Washington, 1997.
[19] L. Liu, C. Pu, and W. Han, ‘‘XWRAP: An XML-enabled wrapper construction system for Web information sources,’’ in Proc. 16th Int. Conf.
Data Eng., 2000, pp. 611–621.
[20] R. Das and I. Turkoglu, ‘‘Creating meaningful data from Web logs for
improving the impressiveness of a Website by using path analysis method,’’
Expert Syst. Appl., vol. 36, no. 3, pp. 6635–6644, Apr. 2009.
[21] B. Fazzinga, S. Flesca, and A. Tagarelli, ‘‘Schema-based Web wrapping,’’
Knowl. Inf. Syst., vol. 26, no. 1, pp. 127–173, Jan. 2011.
[22] H.-Y. Kao, S.-H. Lin, J.-M. Ho, and M.-S. Chen, ‘‘Mining Web informative
structures and contents based on entropy analysis,’’ IEEE Trans. Knowl.
Data Eng., vol. 16, no. 1, pp. 41–55, Jan. 2004.
[23] M. Zachariasova, R. Hudec, M. Benco, and P. Kamencay, ‘‘Automatic
extraction of non-textual information in Web document and their classification,’’ in Proc. 35th Int. Conf. Telecommun. Signal Process. (TSP),
Jul. 2012, pp. 753–757.
[24] Z. Li, W. K. Ng, and A. Sun, ‘‘Web data extraction based on structural
similarity,’’ Knowl. Inf. Syst., vol. 8, no. 4, pp. 438–461, Nov. 2005.
[25] H. S. Maghdid, ‘‘Web news mining using new features: A comparative
study,’’ IEEE Access, vol. 7, pp. 5626–5641, 2019.
[26] T. Gottron, ‘‘Content code blurring: A new approach to content extraction,’’
in Proc. 19th Int. Conf. Database Expert Syst. Appl. (DEXA), Sep. 2008,
pp. 29–33.
[27] T. Weninger, W. H. Hsu, and J. Han, ‘‘CETR: Content extraction via
tag ratios,’’ in Proc. 19th Int. Conf. World wide Web (WWW), 2010,
pp. 971–980.
[28] S. Gupta, G. Kaiser, D. Neistadt, and P. Grimm, ‘‘DOM-based content
extraction of HTML documents,’’ in Proc. 12th Int. Conf. World Wide Web,
2003, pp. 207–214.
[29] P. A. R. Qureshi and N. Memon, ‘‘Hybrid model of content extraction,’’
J. Comput. Syst. Sci., vol. 78, no. 4, pp. 1248–1257, Jul. 2012.
[30] C. Mantratzis, M. Orgun, and S. Cassidy, ‘‘Separating XHTML content from navigation clutter using DOM-structure block analysis,’’ in
Proc. 16th ACM Conf. Hypertext Hypermedia (HYPERTEXT), Jan. 2005,
pp. 145–147.
[31] A. Finn, N. Kushmerick, and B. Smyth, ‘‘Fact or fiction: Content classification for digital libraries,’’ in Proc. Joint DELOS-NSF Workshop, Personalization Recommender Syst. Digit. Libraries, 2001. [Online]. Available:
http://citeseerx.ist.psu.edu/viewdoc/citations;jsessionid=8E0FC70BEE7
DFA696487A2F7C6B622FA?doi=10.1.1.21.3834
[32] G. Adam, C. Bouras, and V. Poulopoulos, ‘‘CUTER: An efficient useful
text extraction mechanism,’’ in Proc. Int. Conf. Adv. Inf. Netw. Appl.
Workshops (AINA), May 2009, pp. 703–708.
[33] R. Gunasundari, ‘‘A study of content extraction from Web pages based
on links,’’ Int. J. Data Mining Knowl. Manage. Process, vol. 2, no. 3,
pp. 230–236, May 2012.
[34] D. Insa, J. Silva, and S. Tamarit, ‘‘Using the words/leafs ratio in the
DOM tree for content extraction,’’ J. Log. Algebr. Program., vol. 82, no. 8,
pp. 311–325, Nov. 2013.
[35] Z. Bar-Yossef and S. Rajagopalan, ‘‘Template detection via data mining
and its applications,’’ in Proc. 11th Int. Conf. World Wide Web (WWW),
2002, pp. 580–591.
[36] D. Chakrabarti, R. Kumar, and K. Punera, ‘‘Page-level template detection
via isotonic smoothing,’’ in Proc. 16th Int. Conf. World Wide Web (WWW),
2007, p. 61.
[37] L. Fu, Y. Meng, Y. Xia, and H. Yu, ‘‘Web content extraction based on
Webpage layout analysis,’’ in Proc. 2nd Int. Conf. Inf. Technol. Comput.
Sci., Jul. 2010, pp. 40–43.
[38] Y. Wang, B. Fang, X. Cheng, L. Guo, and H. Xu, ‘‘Incremental Web
page template detection by text segments,’’ in Proc. IEEE Int. Workshop
Semantic Comput. Syst. (WSCS), Jul. 2008, pp. 174–180.
[39] D. Cai, S. Yu, J. R. Wen, and W. Y. Ma, ‘‘Extracting content structure
for Web pages based on visual representation,’’ in Web Technologies
and Applications (Lecture Notes in Computer Science: Lecture Notes in
Artificial Intelligence: Lecture Notes in Bioinformatics), vol. 2642. Berlin,
Germany: Springer, 2003, pp. 406–417.
[40] J. L. Hong, E.-G. Siew, and S. Egerton, ‘‘ViWER-data extraction for search
engine results pages using visual cue and DOM tree,’’ in Proc. Int. Conf.
Inf. Retr. Knowl. Manage. (CAMP), Mar. 2010, pp. 167–172.
[41] D. Yang and J. Song, ‘‘Web content information extraction approach based
on removing noise and content-features,’’ in Proc. Int. Conf. Web Inf. Syst.
Mining (WISM), vol. 1, Oct. 2010, pp. 246–249.
[42] B. Zhang and X.-F. Wang, ‘‘Content extraction from Chinese Web page
based on title and content dependency tree,’’ J. China Universities Posts
Telecommun., vol. 19, pp. 147–189, Oct. 2012.
[43] L. N. L. Figueiredo, G. T. de Assis, and A. A. Ferreira, ‘‘DERIN: A data
extraction method based on rendering information and n-gram,’’ Inf. Process. Manage., vol. 53, no. 5, pp. 1120–1138, Sep. 2017.
[44] J. Zeleny, R. Burget, and J. Zendulka, ‘‘Box clustering segmentation:
A new method for vision-based Web page preprocessing,’’ Inf. Process.
Manage., vol. 53, no. 3, pp. 735–750, May 2017.
[45] D. Song, F. Sun, and L. Liao, ‘‘A hybrid approach for content extraction
with text density and visual importance of DOM nodes,’’ Knowl. Inf. Syst.,
vol. 42, no. 1, pp. 75–96, Jan. 2015.
[46] Y. Fang, X. Xie, X. Zhang, R. Cheng, and Z. Zhang, ‘‘STEM: A suffix tree-based method for Web data records extraction,’’ Knowl. Inf. Syst., vol. 55,
no. 2, pp. 305–331, May 2018.
[47] Z. Tan, C. He, Y. Fang, B. Ge, and W. Xiao, ‘‘Title-based extraction of news
contents for text mining,’’ IEEE Access, vol. 6, pp. 64085–64095, 2018.
[48] N. Chomsky, ‘‘Three models for the description of language,’’ IEEE Trans.
Inf. Theory, vol. IT-2, no. 3, pp. 113–124, Sep. 1956.
[49] M. Sipser, ‘‘Introduction to the theory of computation,’’ ACM Sigact News,
vol. 27, no. 1, pp. 27–29, 1996.
[50] E. Uzun, T. Yerlikaya, and M. Kurt, ‘‘A lightweight parser for extracting
useful contents from Web pages,’’ in Proc. 2nd Int. Symp. Comput. Sci.
Eng. (ISCSE), Kuşadasi, Turkey, 2011, pp. 67–73.
[51] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to
Algorithms, 3rd ed. Cambridge, MA, USA: MIT Press, 2009.
[52] C. Charras, T. Lecroq, and J. D. Pehoushek, ‘‘A very fast string matching algorithm for small alphabets and long patterns,’’ in Combinatorial
Pattern Matching (Lecture Notes in Computer Science: Lecture Notes in
Artificial Intelligence: Lecture Notes in Bioinformatics), vol. 1448. Berlin,
Germany: Springer, 1998, pp. 55–64.
[53] A. Hume and D. Sunday, ‘‘Fast string searching,’’ Softw., Pract. Exper.,
vol. 21, no. 11, pp. 1221–1248, Nov. 1991.
[54] R. N. Horspool, ‘‘Practical fast searching in strings,’’ Softw., Pract. Exper.,
vol. 10, no. 6, pp. 501–506, Jun. 1980.
[55] G. Navarro and M. Raffinot, ‘‘A bit-parallel approach to suffix automata:
Fast extended string matching,’’ Combinatorial Pattern Matching (Lecture
Notes in Computer Science: Lecture Notes in Artificial Intelligence: Lecture Notes in Bioinformatics), vol. 1448. Berlin, Germany: Springer, 1998,
pp. 14–33.
[56] D. M. Sunday, ‘‘A very fast substring search algorithm,’’ Commun. ACM,
vol. 33, no. 8, pp. 132–142, Aug. 1990.
[57] R. S. Boyer and J. S. Moore, ‘‘A fast string searching algorithm,’’ Commun.
ACM, vol. 20, no. 10, pp. 762–772, Oct. 1977.
[58] T. Raita, ‘‘Tuning the Boyer-Moore-Horspool string searching algorithm,’’
Softw., Pract. Exper., vol. 22, no. 10, pp. 879–884, Oct. 1992.
[59] R. R. Fayzrakhmanov, E. Sallinger, B. Spencer, T. Furche, and G. Gottlob,
‘‘Browserless Web data extraction,’’ in Proc. World Wide Web Conf. World
Wide Web (WWW), 2018, pp. 1095–1104.
ERDINÇ UZUN received the bachelor’s, master’s, and Ph.D. degrees from the Department of
Computer Engineering, Trakya University, Edirne,
Turkey, in 2001, 2003, and 2007, respectively.
He is currently supervising five graduate students. He is interested in developing a career that
combines teaching and research while maintaining
his interest in the field of information retrieval,
data mining, and natural language processing.
After graduating in 2001, he started his academic
career at the Department of Computer Engineering, Trakya University,
in 2001, where he worked as a Research Assistant from 2001 to 2007. In 2007, he completed his doctoral thesis on the development of
a web-based system that can automatically learn subcategorization frames.
It is a thesis that combines the fundamental fields of computer science such
as information retrieval, machine learning, and natural language processing. Later in 2007, he started his career at the Department of Computer
Engineering, Tekirdağ Namık Kemal University. He was the vice dean in
his faculty, from 2008 to 2010. After 2007, he worked in web search, web
mining, and web content extraction. His work, ‘‘A hybrid approach for
extracting informative content from web pages,’’ in 2013, was one of the
most cited works in the field of web content extraction with over 50 citations.
He supports education not only with courses but also with lecture notes and
blog posts shared on erdincuzun.com. He has more than 30 publications and
more than 150 citations. He has also been a member of the Board of Directors
of the Faculty since 2017. In 2019, he started to work as the Director of the
Çerkezköy Vocational School of his University. He worked as a referee and
panelist in various TÜBİTAK programs.