Using Style for Evaluation of Readability of Health Documents-Thesis by Freddy N. Bafuka

Beyond Text Analysis: Image-Based Evaluation of
Health-Related Text Readability Using Style Features

by

Freddy Nole Bafuka
S.B., Computer Science & Electrical Engineering, M.I.T., 2006
Research Fellow, Decision Systems Group (DSG), Harvard Medical School

Submitted to the Department of Electrical Engineering and Computer Science
in Partial Fulfillment of the Requirements for the Degree of
Master of Engineering in Electrical Engineering and Computer Science
at the Massachusetts Institute of Technology

May 2009

Copyright 2009 Freddy Nole Bafuka. All rights reserved.

The author hereby grants to M.I.T. permission to reproduce and to distribute
publicly paper and electronic copies of this thesis document in whole and in
part in any medium now known or hereafter created.

Author
Department of Electrical Engineering and Computer Science
May 27, 2009

Certified by
William J. Long
Principal Research Associate, Computer Science & Art. Int. Lab, MIT
Thesis Supervisor

Certified by
Dorothy Curtis
Research Scientist, Computer Science & Art. Int. Lab, MIT; DSG Affiliate
Thesis Supervisor

Accepted by
Arthur C. Smith
Professor of Electrical Engineering
Chairman, Department Committee on Graduate Theses
Beyond Text Analysis: Image-Based Evaluation of
Health-Related Text Readability Using Style Features
by
Freddy N. Bafuka
Submitted to the
Department of Electrical Engineering and Computer Science
May 28, 2009
In Partial Fulfillment of the Requirements for the Degree of
Master of Engineering in Electrical Engineering and Computer Science
Abstract
Many studies have shown that the readability of health documents presented to
consumers does not match their reading levels. An accurate assessment of the
readability of health-related texts is an important step in providing material that
matches readers' literacy. Current readability measurements depend heavily on text
analysis (NLP) but neglect style (text layout). In this study, we show that style
properties are important predictors of documents' readability. In particular, we built
an automated computer program that uses a document's style to predict its
readability score. The style features are extracted by analyzing only one page of the
document as an image. The scores produced by our system were tested against
scores given by human experts. Our tool shows stronger correlation with experts'
scores than the Flesch-Kincaid readability grading method. We provide an end-user
program, VisualGrader, which provides a Graphical User Interface to the scoring
model.
Thesis Supervisors:
William J. Long,
Title: Principal Research Associate, Computer Science & Art. Int. Lab, MIT
Dorothy Curtis
Title: Research Scientist, Computer Science & Art. Int. Lab, MIT; DSG Affiliate
Table of Contents
1. Introduction and Motivation ........ 4
2. Background ........ 5
3. Feature Extraction ........ 12
4. Machine Learning Models Used ........ 22
5. Results ........ 30
6. Discussion ........ 61
7. Real-World Usage ........ 65
8. Conclusion ........ 68
9. Acknowledgments ........ 69
10. References ........ 70
1. Introduction and Motivation
Readability is defined as the ease with which a document can be read[1]. Many studies have shown that
the readability of the health information provided to consumers does not match their reading levels[2].
Even though healthcare providers and writers have tried to make more readable materials, most patient-oriented web sites, pamphlets, drug labels, and discharge instructions still require a tenth-grade reading level or higher[3]. More than half of consumer-oriented web pages present college-level material[3]. A
study by Doak et al found that patients, who may be more stressed, read on average five grades lower
than the last year completed in school[4]. Misunderstandings of health information have been linked to
higher risk of consumers making unwise health decisions, which in turn leads to poorer health and higher
health care costs[5].
To provide more readable health texts, the Decision Support Group under Qing Zeng-Treitler at Brigham and Women's Hospital will develop a computer program to translate texts to a readability level appropriate to several consumer reading levels. This program will be based on statistical natural language processing techniques. Providing health texts of appropriate readability to consumers should help improve comprehension, self-management and, potentially, clinical outcomes[3]. We envision the following scenario:
Gary is a diabetes patient with poor metabolic control. Laura, a nurse educator, talks with Gary about exercise and weight control. During their conversation, Laura senses that Gary's literacy level is inadequate for use of the latest teaching materials on the importance of exercise, which are written for average (seventh to ninth grade) reading ability. With the help of a readability adjustment software tool, she quickly generates a simplified version for Gary to take home. Because Gary can understand the materials, he is motivated to follow their advice and exercise, which in turn helps to control his illness and prevent complications.
In order to translate a text from a higher to a lower literacy level, we must be able to correctly assess its
readability. Having an accurate evaluation of a document's readability level provides guidance as to which
tools or algorithms will be most appropriate for its translation to a more easily readable target. The goal of
this study was to develop and evaluate a new approach for assessing the readability of health-related
documents.
2. Background
2.1 Previous Works
Several well-known word processing software products, such as Microsoft Word, WordPerfect, and Lotus, provide generalized readability evaluation tools using text analysis methods. Some of the features used in text analysis are extracted using Natural Language Processing (NLP) tools. In this thesis, we use the terms NLP and text analysis interchangeably.
Among the most widely used methods based on text analysis are the Simple Measure of Gobbledygook (SMOG) formula[6], the Flesch-Kincaid grading formula[7], and the Gunning Fog Index (GFI)[8]. These methods compute readability scores based on text unit length and yield scores that can be interpreted as the number of years of education needed to easily read the document. The Flesch-Kincaid method converts Flesch Reading Ease scores[9] into a grade level. The SMOG formula computes readability scores using the number of sentences and the number of polysyllabic words, that is, words with more than 3 syllables. The GFI method uses sentence length and the percentage of polysyllabic words.
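For concreteness, the three formulas above can be written out directly. The Python sketch below is an illustration only (this thesis's own tool is not formula-based, and a full implementation would also need to tokenize sentences and count syllables, which is nontrivial); the coefficients are the standard published ones.

```python
import math

def flesch_kincaid_grade(words, sentences, syllables):
    # Flesch-Kincaid grade level: converts text-unit lengths to a school grade.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def smog_grade(polysyllables, sentences):
    # SMOG: polysyllabic-word count normalized to a 30-sentence sample.
    return 1.043 * math.sqrt(polysyllables * 30.0 / sentences) + 3.1291

def gunning_fog_index(words, sentences, complex_words):
    # GFI: average sentence length plus the percentage of polysyllabic words.
    return 0.4 * ((words / sentences) + 100.0 * complex_words / words)
```

Each result reads as "years of schooling needed"; for example, a 100-word, 5-sentence passage with 150 syllables scores roughly grade 9.9 on Flesch-Kincaid.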
While these methods perform well for general use, several studies have shown they are often inadequate for health-related documents, as they fail to capture many important features unique to health documents[10],[11]. In addition, some of the measurements used by these methods become inappropriate for the evaluation of some health-related fields[12],[13]. In particular, they do not measure text cohesion, or sentence coherence, which studies have found to be an essential part of easily understanding the English language[14],[15]. A study by Rosemblat et al[10] suggests that the "ability to communicate the main point" and familiarity with terminology should be considered as additional properties in measuring health text readability. In addition, a study by Ownby[13] suggests that vocabulary complexity, sentence complexity, and use of passive voice are the appropriate measures of text readability. Zeng-Treitler et al[11] pointed out that electronic health records (EHRs), consumer health materials, and scientific journal articles exhibit many syntactic and semantic properties that are unaccounted for by existing readability measurements. These are examples of text properties not measured by formulas such as Flesch-Kincaid, SMOG, and GFI. Hence the need for specialized evaluation tools for health-related documents.
Several such specialized systems have been developed to evaluate the readability of health documents using a Natural Language Processing approach. These systems are based on features such as frequency of certain parts of speech, sentence lengths, parse tree structures, text cohesion, and so forth. Kim et al[16], for instance, developed a new readability measurement for health-related text, based on differences in semantic and syntactic features, in addition to the text unit length already used by the general methods mentioned earlier. Zeng-Treitler et al[17] developed a new measurement of consumer familiarity with health terminology, which can be used as an additional predictor in assessing the readability of health-related documents. These new types of measurements are providing better assessment of health-related documents, where generalized methods might not be optimal. For instance, a common feature used by the generalized methods mentioned earlier is the number of polysyllabic terms. A word with more than three syllables is considered a difficult word, and thereby increases the difficulty of the text that contains it, whereas words with fewer syllables are considered easier[18]. While this might be the case for general texts (in the English language), it is not the case for some health-related documents. Many words with the same number of syllables can have varying levels of difficulty[17]. The word "diabetes", for instance, has 4 syllables but is generally a term with which a healthcare consumer would be familiar. The words "aspirin", "Anbesol", and "aplisol" have the same number of syllables but varying difficulty levels[17].
2.2 A New Approach: Style-based Features
While many text evaluation systems have used the Natural Language Processing approach, the effect of the layout of the text on readability has been little explored. In this study we show that a strong relation exists between certain textual layout features of a document and its readability level. We extract these features using an image-based approach. Rather than examine what the text says and how it says it, we consider how the text looks. Throughout this document we use the term style to refer to text layout, and use both terms interchangeably.
2.3 Advantages of Image-Based over NLP-Based Evaluation of Text Readability
This image-based evaluation, which converts the text into an image, presents several advantages
over the traditional NLP approach.
Many of the features used in Natural Language Processing techniques for evaluation of text difficulty fail when applied to health documents[10]. As mentioned above, the number of syllables per word may not always be an optimal indicator of the difficulty level of health-related text.
Secondly, health documents come in a variety of formats: printed journals, pamphlets, medical records, web pages, etc. An NLP-based system depends heavily and entirely on accessing the actual text. Such a system would have to be able to parse HTML code, for instance, to extract the text of a web page. It would also have to be able to receive a PDF as input and have the appropriate tools to parse that type of file as well. This need for flexibility in type of input adds a great overhead for an NLP-based system intended for general use. A commonly encountered challenge with healthcare-related NLP tools is that they are usually difficult to adapt, generalize, and reuse[19]. Very few NLP-based systems developed by one healthcare institution have successfully been adapted for use by an unrelated institution[19]. One reason is that medical NLP tools are often overly customized to domain- or institution-specific document formats and other text characteristics[19].

Moreover, some medical documents are not available electronically, but only in printed form. Some medical records, or handwritten notes by doctors and nurses, are common examples. An NLP-based system would not be able to process such document formats; the document's text would have to be extracted first. In an image-based system, however, the document can simply be scanned and the system can work with its image.
Lastly, the features used in NLP-based systems are not consistent across all natural languages. For instance, some languages have, on average, more syllables per word. An NLP-based system that uses such features would have to be retrained in each specific language in order to perform accurately. While natural languages differ widely in content, and in features such as the length of words and sentences, they are by and large similar in text style: a journal publication will almost always be formatted in columns, for instance, and a title will often be in bold and bigger than the rest of the text. Therefore, it is unlikely that a system based on style will need retraining for use in another language. This language-independent aspect of the style-based system is a great asset in the health field, where patients come from various language and cultural backgrounds.
2.4 Previous Work on Text Evaluation Using Text Layout Features
While many studies have acknowledged the importance of text layout in assessing the readability of health documents, very little exploration of it has been done[17]. Two previous studies, Mosenthal and Kirsch[20], and Doak and Root[21], have explored text layout as part of a readability scoring scheme. Mosenthal and Kirsch developed the PMOSE/IKIRSCH method for measuring the readability of graphs, tables, and illustrations, in turn giving a measure of the readability of a document. Doak and Root developed the Suitability Assessment of Materials (SAM), which also attempts to measure text organization, layout, and document design. However, these two systems are complex and not computerized. In this
study we provide a fully automated, artificially intelligent approach to extracting various style features,
evaluating them and constructing a regression model to map documents' style features to their readability
level. As far as we know, there is no other computer tool of the kind we developed and present here.
2.5 Components of this Thesis
The general goal of the thesis is to assess the readability of health texts, using only the style of the
document. More specifically, we would like to perform two tasks:
(i) Build a computerized model which, given an arbitrary health-related document as input, will output a readability score for that document, using its style properties. We would like the score given by the model to be as close as possible to the score that a human expert in readability would give to the same document. Throughout the rest of this thesis, we will refer to this task as the score prediction problem, or the regression problem. We also use the terms score and grade interchangeably, to refer to the numerical measure of a document's readability. We use the terms gold standard and target to refer to the score given to a document by a human expert.
(ii) Build a computerized model which, given an arbitrary health-related document as input, will classify it as either easy-to-read or hard-to-read, using its style properties. We would like the classification decision of this model to agree as often as possible with the decision of a human reader on the same documents. Throughout the rest of this thesis, we will refer to this task as the classification problem. We also use the terms target or ground truth to refer to the class ("easy" or "hard") given to a document by a human.
In order to rightly relate documents' style to readability scores, or to the easy- or hard-to-read classes, we will need three main components:

(i) A numerical representation of stylistic properties of documents. Chapter 3 details which properties (features) we chose and how we quantize them. We refer to this part of the project as the feature extraction step. Because we approach each document as an image, and use image processing techniques to extract the style features, we also refer to this part of the project as the image processing step. We will often use the term feature to refer to an actual property (right margin, for instance) and feature value to refer to the actual measurement of that property for a specific document (e.g., "Document 1 has a margin of 0.1"). We also refer to the features as variables or predictors and use both terms interchangeably.
(ii) A machine learning model, which provides a mathematical relationship between the feature values of documents and their score or class. A machine learning model is built using a set of examples. In this project, an example is a document whose feature values and human-expert-given score or class are used as part of building the machine learning model with the optimal parameters that define the relation between document features and scores or classes. We refer to this building process as the training of the model. We refer to the set of documents used in training as the training set. We evaluate the performance of a model by comparing its output (score or class) for new documents (not used in the training set) to the score or class given by the human observers for the same documents. We refer to this group of new documents used for the purpose of evaluating a model's performance as the testing set (or simply test set). Chapter 4 describes the various machine learning models we chose to use, and the processes by which we evaluate their performance. Chapter 5 reports and analyzes the results.
(iii) A set of documents from which we can create training and testing sets. More specifically, we needed three data sets: (1) a set of documents with scores given by human expert reviewers, to be used for the score prediction task; (2) a set of easy-to-read documents; and (3) a set of hard-to-read documents. We consider the last two sets as one data set for the classification task. Throughout this thesis we use the terms data or dataset to refer to documents, and in particular, the set of feature values extracted from them. We provide more information about the data sets in this section.
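As a sketch of how the three components connect, the Python fragment below (illustrative only: the feature values and expert grades are made up, and a one-feature least-squares line stands in for the models of Chapter 4) trains on one subset of documents and evaluates against expert scores on a held-out test set.

```python
def fit_linear(xs, ys):
    # Least-squares fit of score = a * feature + b on the training set.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def pearson(xs, ys):
    # Correlation between predicted and expert-given scores.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# Hypothetical data: (style feature value, expert 1-7 grade) per document.
docs = [(0.10, 2.0), (0.30, 3.1), (0.50, 4.2), (0.70, 5.0), (0.90, 6.1),
        (0.20, 2.4), (0.60, 4.8), (0.40, 3.6)]
train, test = docs[:5], docs[5:]          # training set vs. testing set
a, b = fit_linear([f for f, _ in train], [g for _, g in train])
preds = [a * f + b for f, _ in test]      # model output on unseen documents
r = pearson(preds, [g for _, g in test])  # agreement with the gold standard
```

The same train/fit/test loop applies unchanged when the single feature is replaced by the eleven-plus style features of Chapter 3 and the linear fit by the models of Chapter 4.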
2.5.1. Feature Extraction and Image Processing
This project can be thought of as having an image processing step, which provides input to the machine learning step. The feature extraction part of this project uses some common image processing techniques. In particular, many of the features are extracted by detecting boundaries between text and background (or whitespace) regions. This process of breaking an image into meaningful regions for further analysis, using pixel similarities, is referred to as image segmentation[22]. Shi and Malik[23] presented a mathematical representation of image segmentation as the segmentation of a connected graph with pixels acting as nodes connected to neighboring pixels by weighted edges. The segmentation task then consists of minimizing the weights between pixels belonging to different objects and maximizing them for pixels belonging to the same object. That is also the basic concept we use to differentiate between text and background areas. Essentially, we attempt to extract some information on the presence of text on a page's image by carefully canvassing the page and detecting regions with sharp changes in color intensity. The resulting feature values are then fed into the building of a machine learning model, which makes a decision about the type of document (easy vs. hard, or score). This technique is also similar to the one used in some object recognition methods, such as the well-known face-detection system developed by Viola and Jones[24]. The Viola-Jones method searches a gray-scale image for rectangular regions of sharp changes in color intensity. These features are then used by a machine learning model to recognize color-intensity changes that correspond to the set of boundaries needed to form a human face. The Viola-Jones method, however, uses human subjects to label hundreds of images of faces to form a training set. In this project, the feature extraction is fully automated.
Image segmentation is also used in Optical Character Recognition (OCR) to detect text in an image. OCR is very heavily used in the processing and routing of mail[25]. OCR systems also depend on canvassing an image to find text boundaries, and the resulting features are used by a machine learning model to detect whether a character is present and which character it is.
In the medical field, image segmentation is used widely in medical imaging for diagnosis,
measuring tissue volumes, computer-guided surgeries and location of pathologies[26].
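The shared idea in all of these applications, detecting where intensity drops below the background, can be shown with a minimal sketch (Python, illustrative only; the real extraction is described in Chapter 3). A 1-D intensity profile, such as per-column averages of a page, is split into alternating background and text runs:

```python
def segment_runs(profile, background, tol=1):
    # Split a 1-D gray-intensity profile into (is_text, length) runs:
    # a value below (background - tol) is treated as text, else background.
    runs = []
    for v in profile:
        is_text = v < background - tol
        if runs and runs[-1][0] == is_text:
            runs[-1] = (is_text, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((is_text, 1))              # start a new run
    return runs
```

For example, `segment_runs([255, 255, 100, 100, 255], background=255)` yields `[(False, 2), (True, 2), (False, 1)]`: margin, text, margin.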
2.5.2. Expert-Rated Data Set
To provide a reliable set of documents for this and other health-related readability projects, the
Decision Support Group under Qing Zeng-Treitler at Brigham and Women's Hospital carefully compiled
a corpus (dataset) of 324 health-related documents[27]. To obtain a diverse sample, 6 different types of
documents were collected: consumer education materials, news stories, clinical trial records, clinical
reports, scientific journal articles and consumer education material targeted at kids. Table 2.1 shows the
number of documents taken from each type.
Document Type                                    Count
Consumer education material                        142
News report                                         34
Clinical trial record                               39
Scientific journal article                          38
Physicians' notes                                   38
Consumer education material targeted at kids        33

Table 2.1: Sample size of each document type used to form the 324-document data set with human expert scores.
The consumer education material documents were obtained from MedlinePlus, the National Institute of Diabetes and Digestive and Kidney Diseases, and the Mayo Clinic. The news report documents were
taken from The New York Times, CNN, BBC, and TIME. The clinical trial documents were obtained
from ClinicalTrials.gov. Scientific journal articles originated from Diabetes Care, the Annals of Internal Medicine, Circulation (Journal of the American Heart Association), the Journal of Clinical Endocrinology
and Metabolism and the British Medical Journal. Physicians' notes were obtained from the Brigham and
Women's Hospital internal records. Lastly, the consumer education material for kids came from the
American Diabetes Association.
With the data collected, a panel of 5 health literacy and clinical experts and a patient representative was assembled to assess the readability of the 324 documents. Each expert was asked to grade the documents on a 1-7 scale. Each expert carefully reviewed and graded the documents independently. Many documents were graded by more than one expert. In those cases, we used the average of the scores as the final gold standard for the document.
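The gold-standard computation is a simple average per document; as a one-function Python sketch (illustrative; the document identifiers are hypothetical):

```python
def gold_standard(grades_by_doc):
    # Average the independent 1-7 expert grades per document; a document
    # graded by a single expert keeps that expert's grade.
    return {doc: sum(grades) / len(grades)
            for doc, grades in grades_by_doc.items()}
```

For instance, a document graded 3 by one expert and 4 by another receives a gold standard of 3.5.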
2.5.3 Easy and Hard-to-Read Data Set
The set of easy- and hard-to-read documents used in the project is a subset of the one used by Kim et al[16]. The easy data set consisted of 195 self-labeled easy-to-read health materials from various
web information resources including MedlinePlus and the Food and Drug Administration consumer
information pages. They covered various topics on disease, wellness, and health policy. The hard dataset
consisted of 172 scientific biomedical journal and medical textbook articles on several topics, including
various diseases, wellness, biochemistry, and policy issues.
3. Feature Extraction Method
The first implementation work was done in MATLAB R2007a, on a Dell Pentium Dual-Core, Windows Vista machine. In an effort to make the system more freely accessible, a Java implementation was also done, on the same machine. We present the details of both implementations later in this document. There are slight variations in the two implementations. We mention those when necessary throughout the description of the method. Given that the Java implementation is the latest and contains more features, we consider it the main implementation of this paper, and the one that we expect others to use.
3.1 Conversion to Image Format
There were 324 expert-rated documents used, with readability scores ranging from 1 (easiest to
read) to 7 (hardest to read). Additionally, there was a set of 195 documents simply considered "easy to
read", compared to another set of 172 documents considered "hard to read".
Each document was converted, from its original format, to a set of images, with one image for each page. However, we used style features from just one page: the "first page" of the document. Cover pages, table-of-contents pages, and title pages were ignored. Hence we refer to the "first page" of the document as the page on which the actual content of the document begins. The size of the images obtained from the documents differed, depending on the original format and size of the document. All images were converted to gray scale, with possible pixel values of 0-255.
Feature Extraction and Evaluation Pipeline

For each document, the following steps are followed to extract the features:

* The document is converted from its original format to an image format.
* Each page is converted to one image.
* The image is converted to gray-scale.
* Each resulting image is preprocessed and a new image is returned.
* The preprocessed image is sent as input to the DocFeatureExtractor module, which returns a numerical value for each feature, for that image.
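The steps above can be sketched end to end as follows. This is an illustrative Python rendering only (the thesis implementations are in MATLAB and Java); the preprocessing here is a trivial clamp, and the stand-in extractor computes a single feature, the average gray-scale value (AGS).

```python
def to_grayscale(rgb_page):
    # Luminosity conversion: 0.299 R + 0.587 G + 0.114 B per pixel.
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb_page]

def preprocess(gray):
    # Stand-in preprocessing: clamp pixel values to the 0-255 range.
    return [[min(255, max(0, p)) for p in row] for row in gray]

def doc_feature_extractor(gray):
    # Stand-in extractor: returns one feature, the average gray-scale value.
    flat = [p for row in gray for p in row]
    return {"AGS": sum(flat) / len(flat)}

def pipeline(rgb_page):
    # One page image in, one feature vector (here a dict) out.
    return doc_feature_extractor(preprocess(to_grayscale(rgb_page)))
```

A real extractor would return all of the features listed in the next subsection rather than AGS alone.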
Features Extracted
From each image, eleven features were extracted, to create an eleven-variable observation. The features
Using Style for Evaluation of Readability of Health Documents-Thesis by Freddy N Bafuka
extracted were:
-
Average white space (WSR)
-
Number of columns (CLC)
-
Number of lines per column (LNC)
-
Left margin size (LMS)
-
Right margin size (RMS)
-
Average gray-scale value (AGS)
-
Top margin size (TMS)
-
Bottom margin size (BMS)
-
Interline Space to Line Size Ratio (LSR)
-
Interline Space Ratio (ISR)
-
Maximum to Minimum Line Size Ratio (MMR)
-
Number of Colors (CLRC)
-
Number of Page (PGC)
The extraction of these 13 features is detailed in the following subsections. Not all features where
used in testing the MATLAB implementation. In particular, the interline space ratio (ISR), the number
colors and the number of pages, were not used.
A few assumptions are made throughout the feature extraction stage. One is that the background color is lighter than the letters' color. The background value is determined by taking the average pixel value of the 4 leftmost columns of pixels. Another assumption is that all pages have a left margin with a width of at least 4 pixels.
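Under these assumptions, the background estimate is just the mean of a 4-pixel-wide left strip; as a Python sketch (illustrative):

```python
def background_value(gray):
    # Mean of the 4 leftmost pixel columns; valid only under the stated
    # assumption that every page has a left margin at least 4 pixels wide.
    vals = [row[c] for row in gray for c in range(4)]
    return sum(vals) / len(vals)
```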
3.1.1 Number of Columns
The number of columns in a page is determined by scanning the image of the page from left to right, examining a vertical strip that extends from the top to the bottom of the image and has a small width (equal to 1% of the total width of the image). The average value of the pixels within this vertical strip determines whether the strip falls within a margin region or a column region. If the average value is less than the value of the background minus 1, the vertical strip is considered within a column region; otherwise, it is within a margin region. The number of transitions from margin to column and vice versa indicates the number of columns in the document. Note that in the Java implementation the value of the background is always 255, on a traditional 0-255 gray-scale (please refer to the preprocessing steps described later in this chapter). This step was not taken in the MATLAB implementation.
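A compact Python rendering of this scan (illustrative; it takes the Java implementation's background value of 255 as a default and counts margin-to-column transitions):

```python
def count_columns(gray, background=255, tol=1):
    # Slide a vertical strip (1% of page width) left to right; a strip whose
    # mean intensity is below background - tol lies in a column region.
    height, width = len(gray), len(gray[0])
    strip_w = max(1, width // 100)
    columns, in_column = 0, False
    for x0 in range(0, width, strip_w):
        strip = [gray[y][x]
                 for y in range(height)
                 for x in range(x0, min(x0 + strip_w, width))]
        is_col = sum(strip) / len(strip) < background - tol
        if is_col and not in_column:
            columns += 1           # margin-to-column transition: new column
        in_column = is_col
    return columns
```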
3.1.2 Number of Lines Per Column
The lines within a column are detected with the same algorithm used for column detection. The image is rotated 90 degrees counter-clockwise, and each line is detected as if it were a column. However, the values of some of the input parameters to the algorithm are modified to detect lines. The width of the vertical strip is exactly equal to 1 pixel, given that spaces between lines are much smaller than spaces between columns. The 1-pixel width was determined by trial and error.
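The rotation trick can be sketched as follows (Python, illustrative; after a 90-degree counter-clockwise rotation, each text line becomes a "column" that 1-pixel-wide strips can detect):

```python
def count_lines(gray, background=255, tol=1):
    # Rotate the page 90 degrees counter-clockwise, then scan 1-pixel-wide
    # vertical strips; each margin-to-text transition is one line of text.
    w = len(gray[0])
    rotated = [[row[c] for row in gray] for c in range(w - 1, -1, -1)]
    lines, in_line = 0, False
    for x in range(len(rotated[0])):
        strip = [r[x] for r in rotated]
        is_line = sum(strip) / len(strip) < background - tol
        if is_line and not in_line:
            lines += 1
        in_line = is_line
    return lines
```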
3.1.3 Average White Space
The average white space was computed by measuring the width of the space unoccupied by letters at the beginning and at the end of each line in a column of text. The sum of these widths for all lines, divided by the sum of the total widths of all lines, gives the average white space ratio. To determine the space unoccupied by letters in a line, the line is scanned from left to right until a non-background color is detected. A similar technique is used, scanning from right to left, to find the width of the white space at the end of a line.
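The ratio can be sketched directly from that description (Python, illustrative; each entry of `line_rows` is assumed to be the row of gray values for one detected line of text):

```python
def whitespace_ratio(line_rows, background=255, tol=1):
    # WSR: summed unoccupied width at both ends of every line, divided by
    # the summed total width of all lines.
    unused, total = 0, 0
    for row in line_rows:
        left = 0
        while left < len(row) and row[left] >= background - tol:
            left += 1              # scan left to right until text appears
        right = 0
        while right < len(row) - left and row[len(row) - 1 - right] >= background - tol:
            right += 1             # scan right to left likewise
        unused += left + right
        total += len(row)
    return unused / total
```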
Meal Planning
Some people with diabetes use carbohydrate counting to balance their food and
insulin. Carbohydrates, or "carbs," are what our bodies use for fuel. The more
carbs you eat, the higher your blood glucose goes. And the higher your blood
glucose, the more insulin you need to move the sugar into your cells
Figure 3.1: An illustration of the whitespace feature, which captures how much of the width of each line of text is unoccupied on both the right and left sides. In this example, the gray lines show the whitespace on the right side.
3.1.4 Margins and Gray-Scale Value
The left margin is determined by the width (number of pixels) of the space between the left edge of the image and the start of the leftmost column. The resulting value is divided by the total width of the image, and gives the margin value used as a feature. The right margin is computed similarly, using the width of the space between the end of the rightmost column and the right edge of the image.
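A sketch of the left- and right-margin features (Python, illustrative; a pixel column counts as text when its mean intensity falls below the background value, matching the column-detection threshold):

```python
def left_right_margins(gray, background=255, tol=1):
    # LMS/RMS: widths of the blank strips at the page edges, each divided
    # by the total page width.
    h, w = len(gray), len(gray[0])
    def col_is_text(x):
        return sum(gray[y][x] for y in range(h)) / h < background - tol
    left = 0
    while left < w and not col_is_text(left):
        left += 1                  # width of the left margin in pixels
    right = 0
    while right < w - left and not col_is_text(w - 1 - right):
        right += 1                 # width of the right margin in pixels
    return left / w, right / w
```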
Likewise, the top margin is determined by the length of the space between the top of the first line of text and the upper edge of the image, after preprocessing. Since the various columns of text might not start at the same line (one might be shifted down vertically, as is sometimes the case with the first column especially), the "first" line of text refers here to the first line of the column with the highest top edge. As lines within the columns are detected, the value of the highest line is updated as necessary. Similarly, the bottom margin is determined by the length of the space between the bottom edge of the lowest line and the bottom edge of the image, after preprocessing. We also account for the fact that one column may be "shorter", or end higher, than others, as is often the case with the last column of text. Therefore, the value of the lowest line is updated as the columns are scanned through during line detection. The line with the highest Y-value (lowest line) is retained for bottom-margin computation.
The average gray-scale value of the page is a simple average of the value of all pixels in the
image. It gives an indication of how "dense" the document is.
Figure 3.2 shows the margin features, as well as the inter-column space.
Figure 3.2a: The widths of the orange rectangles are the left and right margins. The heights of the light green rectangles are the top and bottom margins. The blue rectangle is the inter-column space, which we use to detect columns.
Figure 3.2b. The original image of the page used in figure 3.2a. Note the footer was ignored. The bottom margin is the height of the space between the bottom edge and the last line of actual text detected.
3.1.7 Interline Space Ratio
The purpose of this feature is to capture the sparsity of the text. It captures how much of the vertical space is occupied by lines of text, and how much is background. It is analogous to the "average white space" feature, which computes horizontal white space. It is computed by adding up the heights of all lines of text within each column and dividing that value by the total height of the image. The resulting value, subtracted from 1, gives the interline space ratio.
Figure 3.3 shows the interline space.
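The computation can be sketched as follows. This is an illustrative Python fragment, not the thesis's code; it assumes the line heights for a single column have already been measured:

```python
def interline_space_ratio(line_heights, image_height):
    """1 minus the fraction of the vertical space covered by text.

    `line_heights` holds the pixel heights of the detected lines of
    text; the names here are illustrative, not the thesis's API.
    """
    return 1 - sum(line_heights) / image_height

# Four 10-pixel-tall lines on a 100-pixel-tall page:
print(interline_space_ratio([10, 10, 10, 10], 100))  # 0.6
```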
Figure 3.3a. The interline spaces (heights of the gray rectangles) shown above. As can be seen, the interline space can vary greatly between different pairs of adjacent lines.
Figure 3.3b. The original image used for illustration in figure 3.3a.
3.1.8 Interline-to-Line Size Ratio
This feature further captures the sparsity of the page. It complements the previous feature in that it captures the difference between a single-spaced and a double-spaced document. For instance, a page with two lines of text using a 60-point font can yield the same value for the previous feature as 12 lines using a 10-point font: the total spacing between the lines might not have changed. This feature, therefore, finds the ratio between the average size of the lines of text and the average size of the interline spaces. Scientific papers, for instance, when published in a journal, tend to be single-spaced and denser. The line-to-whitespace ratio is computed by adding up all the line sizes and all the interline spaces (heights), and taking the ratio of the two numbers. In figure 3.4, we show the line sizes. The interline spaces are shown in figure 3.3a.
Figure 3.4. In red, 3 different line sizes are pointed out.
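The ratio can be sketched as below (illustrative Python with assumed inputs, not the thesis's code): two tall lines separated by one gap give a very different value than many small lines separated by same-sized gaps, even when the interline-space ratio is similar.

```python
def line_to_interline_ratio(line_heights, gap_heights):
    """Ratio of total text-line height to total interline space."""
    return sum(line_heights) / sum(gap_heights)

# Two 60-pt lines with one 20-px gap, versus twelve 10-pt lines
# with eleven 20-px gaps:
print(line_to_interline_ratio([60, 60], [20]))        # 6.0
print(line_to_interline_ratio([10] * 12, [20] * 11))  # ~0.545
```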
3.1.8 Maximum-to-Minimum Line Size Ratio
The maximum-to-minimum line size ratio tries to capture the variation in font size in a document.
A value of 1 will mean that the same font size has been used throughout the page. A value of 6 will mean
that the biggest line of text is 6 times larger than the smallest. Such a large ratio will probably indicate
that several font sizes were used between the biggest and smallest one, which, in turn, indicates that the document is complex: the number of fonts used gives an indication of the number of subheadings used in the document.
3.1.9 Color Count
The number of colors in each document was extracted by searching through each pixel of the page's image (before its conversion to gray-scale). Each color was assigned a unique number by multiplying the color channels (red, green, blue) by 1, 10³, and 10⁶, respectively, and summing the results. From each pixel of the image, the color was extracted as a unique number. That number was added to a set. Once colors had been extracted from all pixels and added to the set, the number of elements in the set indicated the number of colors in the image.
3.1.10 Number of Pages
This is the only feature that is a property of the whole document, and not of one page. The number of pages per document was extracted by searching through the directory of images and counting the number of images associated with the same document. Each image is interpreted as a page. Documents with more than 3 pages were given a page count of 3.
3.2 Preprocessing Stages
Before the above features are actually extracted and evaluated from a page's image, a few
preprocessing steps occur.
3.2.1 Pixel Value Extrapolation
To further create a clear demarcation between the text line areas and the background, darker pixels are given a value of 0 on a 0-255 gray scale, while lighter colors are extrapolated to 255. This extrapolation eliminates unwanted side effects, such as anti-aliasing caused by graphic renderers.
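A minimal sketch of the extrapolation (illustrative Python; the cutoff of 128 is an assumption, as the thesis does not state the exact threshold used):

```python
def extrapolate(img, cutoff=128):
    """Push darker pixels to 0 and lighter ones to 255, removing
    anti-aliased intermediate gray values."""
    return [[0 if v < cutoff else 255 for v in row] for row in img]

print(extrapolate([[12, 130, 250]]))  # [[0, 255, 255]]
```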
3.2.2 Horizontal Slide Preprocessing
Even within the text regions of a document's page, a large amount of space is background. As a result, the average pixel value within a text region can still be very close to that of a background region. We therefore add a preprocessing step to increase the number of dark pixels within the text regions. The step consists of "sliding" the image horizontally, on itself, which causes a thickening of the letters in the horizontal direction. The result is greater contrast between margin (background-only) regions and text regions. See figure 3.5.
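One way to realize the slide is to keep, at each pixel, the darkest value within a small horizontal window, which thickens letters. This is a sketch of the idea only; the actual shift width and direction used by the MATLAB code are not stated in the text:

```python
def horizontal_slide(img, shift=2):
    """Thicken dark letters horizontally: each output pixel takes the
    darkest value among itself and its `shift` left neighbors."""
    out = []
    for row in img:
        out.append([min(row[max(0, x - shift):x + 1])
                    for x in range(len(row))])
    return out

row = [255, 255, 0, 255, 255, 255]
print(horizontal_slide([row], shift=2))  # [[255, 255, 0, 0, 0, 255]]
```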
Figure 3.5a: Original image before horizontal slide.
Figure 3.5b: Image after horizontal slide, accentuating letter pixels.
3.2.3 Non-Content Text Removal
Often, documents come with text that should not be considered part of the document. For instance, when printing a web page, the web browser usually introduces a header or footer containing the URL of the web page being printed. Such a header would cause a misrepresentation of the top margin, for instance. Therefore, as an extra preprocessing step, we remove all text that is "too" close to the edges.
3.2.4 Differences between the Java and MATLAB Preprocessing
The Java implementation does not use the horizontal-slide preprocessing, for usability reasons: the implementation is very slow, whereas the operation is fairly rapid in MATLAB. Pixel value extrapolation was the best alternative for accentuating foreground vs. background differences. The MATLAB implementation, on the other hand, does not use pixel extrapolation, as the horizontal slide provides adequate accentuation of the areas. Our results later showed that either method provided adequate ability to extract features correctly. We recommend that any implementation use at least one of these two preprocessing steps.
3.3 Extraction
The feature extraction was done on all documents without failure, although with some warnings
from the ColumnDetector module. Two types of situations can cause the ColumnDetector to trigger a
warning. First, there are cases in which the right boundary detected by the system happens to be such that
a few lines in the column extend slightly beyond it. This normally happens just before the right margin of
the document, when the text is left-justified, for instance. As the scanning progresses, the portions of lines
extending beyond the column boundary can trigger a false column-start event. Usually, no end will be
detected for such a false column, until the scanning reaches the right edge of the document. In these cases,
the system infers that a "false end column" was started after the right boundary of the column, and
ignores it. The Java implementation prints a warning on the standard error stream, stating "Warning: false end column. Ignoring..." Note that in this case, there is no failure, but rather a prevention of
one. The second type of warning occurs when a column was started much earlier in the document (closer to the left edge) and yet no end to the column is detected until the scanning reaches the right edge of the
document. In this case, the end of the page (right edge) is considered the end of the column. The Java
implementation prints a warning on standard error, stating "Warning: column end was not found. Assuming end of page..." The MATLAB implementation throws a warning as well, but does not use a different warning statement to distinguish between the two cases. Of the 324 page
images used, less than 30 generated a warning, and virtually all were warnings of the first type.
4. Machine Learning Algorithms Used
To predict the scores of expert-rated documents, we needed a regression algorithm, since the score values are continuous, not discrete. We chose to train and test two different regression models:
(i) Linear Regression
(ii) k-Nearest Neighbor
To classify between easy-to-read and hard-to-read documents, we needed a classification
algorithm. We chose to train and test two classification models:
(i) SVM
(ii) Logistic Regression
From each document, one page was used to extract style features-except, of course, for the page
count, a property of the entire document. The Java feature-extractor module extracts 13 features per
document, while the MATLAB implementation extracts 10, as discussed earlier. For the purpose of this
section, we will refer to m as the number of features extracted per document. For the Java
implementation, m = 13 and for the MATLAB counterpart, m = 10. We refer to the set of m feature values
extracted from a particular document as the input from that document.
Let D be the vector of inputs D = (D1, D2, ..., Dn) obtained from a particular data set. For the expert-rated data set, n = 324. For the easy-to-read and hard-to-read data set, n = 195 + 172 = 367. Each input Di = (di1, di2, ..., dim) is a vector of feature values extracted from the i-th document in a given data set. Therefore, D can be thought of as an n-by-m matrix.
Let Gi be the grade given to the i-th document by the experts, in the expert-rated data set. G can therefore be thought of as an n-by-1 vector, with n = 324. And let Li be the label (0 for easy, 1 for hard) given to the i-th document in the easy- and hard-to-read data set. L can therefore be considered an n-by-1 vector, with n = 367.
4.1 Regression Models for Prediction of Experts' Scores
In the prediction of the experts' scores, our goal is, ideally, to create a score function S such that S(Di) = Gi. In practice, however, we expect some level of error, which we define as: error(i) = |Gi - S(Di)|.
4.1.1 Linear Regression
Linear regression is a mathematical model that represents the output (score in this case), as a
linear function of the variables. It often provides an adequate interpretation of the effect of the inputs on
the output values[28]. In the next section, we consider an alternative model that does not make the
assumption of a linear relationship between input and output.
Linear regression attempts to formulate S(Di) as follows:
S(Di) = β0 + Σ_{j=1}^{m} βj dij    (4.1)

where the m+1 constant coefficients (also called parameters) β0, β1, ..., βm are unknown. Taking all inputs, or observations, into account, we can rewrite equation 4.1 as follows:

S(D) = β0 + Σ_{j=1}^{m} βj D:,j    (4.2)

where D:,j represents the j-th column of D.
The constant coefficients make S a linear model. The essential part of training this linear regression model is finding the best set of βj parameters. There exist several criteria for determining the "best" set of βj values. We chose to use the residual sum of squares (RSS), which is the most widely used measure for comparing such sets of parameters[29]. Given a vector of parameters β = (β0, β1, ..., βm), we define the residual sum of squares as the sum of the squares of the errors:
RSS(β) = Σ_{i=1}^{n} (Gi - S(Di))²    (4.3)

RSS(β) = Σ_{i=1}^{n} (Gi - β0 - Σ_{j=1}^{m} βj dij)²    (4.5)

We therefore define the best value for the vector of parameters β as follows:

β̂ = argmin_β RSS(β)    (4.6)
We found the best set of parameter values by using MATLAB's glmfit function. The glmfit
function takes three main inputs:
(i) An n-by-m matrix, representing n observations (in our case, document inputs), each having m
predictor values (in our case, feature values). For this input, we provided a K-by-m matrix
constructed from K rows of the matrix D described above. We describe this further in the cross
validation section later in the paper.
(ii) An n-by-1 vector, representing the target values. In our case, the targets are the experts' grades.
For this input, we provide a K-by-1 vector constructed from the K corresponding elements of the G vector described above. We give more details in the cross-validation section.
(iii) A specification for a distribution. We used a "normal" distribution, which causes glmfit to apply linear regression as formulated in equation 4.1. We used the default values for all other inputs to glmfit.
The glmfit function outputs a vector of m+1 elements, containing the values for β0, β1, ..., βm that minimize the residual sum of squares.
Given a new document, Dt, never seen before by the system, our model will predict its score by evaluating S(Dt), using formula 4.1, with the parameter values obtained from glmfit.
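For intuition, the RSS criterion has a simple closed form in the single-feature case (m = 1). The sketch below is illustrative Python, not the glmfit call used in the thesis:

```python
def fit_simple(xs, ys):
    """Least-squares line for one feature:
    beta1 = cov(x, y) / var(x), beta0 = mean(y) - beta1 * mean(x).
    This minimizes the residual sum of squares of equation 4.3."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    beta0 = my - beta1 * mx
    return beta0, beta1

# Data generated from S(d) = 1 + 2d; the fit recovers the line.
b0, b1 = fit_simple([0, 1, 2, 3], [1, 3, 5, 7])
print(b0 + b1 * 4)  # 9.0
```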
4.1.2 k-Nearest Neighbor
The k-Nearest Neighbor algorithm is a memory-based method and does not require any search for best-fit parameters. Each input Di can be thought of as a point in an m-dimensional space. Given a new input Dt, never before seen by the system, we find the k points closest in distance to Dt and compute its score by taking the average of its neighbors' grades. We use Euclidean distance. Let K be the set of the k closest points to Dt. We formulate S(Dt) for k-nearest neighbor as follows:
S(Dt) = (1/k) Σ_{i ∈ K} Gi    (4.6)
Throughout this project, we used k= 3.
Despite its simplicity, k-nearest neighbor is known to perform well, and has been used successfully in many applications, including some image-processing related tasks[30],[31]. We use it as a second alternative to linear regression, as it does not make any assumption about the linearity of the input-to-output relationship.
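The scoring rule can be sketched in Python as follows (the names are illustrative, not the thesis's code; the thesis uses k = 3 and Euclidean distance):

```python
import math

def knn_score(train_inputs, train_grades, d, k=3):
    """Predict a score as the mean grade of the k training inputs
    closest to `d` in Euclidean distance (equation 4.6)."""
    by_distance = sorted(
        (math.dist(x, d), g) for x, g in zip(train_inputs, train_grades))
    return sum(g for _, g in by_distance[:k]) / k

X = [(0, 0), (1, 0), (0, 1), (10, 10)]
G = [2.0, 4.0, 6.0, 9.0]
print(knn_score(X, G, (0.2, 0.2)))  # 4.0
```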
4.1.3 Training and Testing Using Cross-Validation
For both the linear regression and k-nearest-neighbor models, we used a 5-fold cross-validation to
run a training-testing cycle.
(i) Linear Regression
The 324 inputs in D were randomly placed into 5 folds (or partitions) of approximately
equal size (4 of the 5 folds had 65 inputs, while one had 64). The first fold, F1, was used as the test set, while the other 4 partitions were combined to form a training set. Given the inputs in the
training set, and their corresponding grades, we used the glmfit function to find the best set of parameters for a linear regression model, as discussed in section 4.1.1 above. The resulting parameters were then used to apply the score function S to each of the inputs in F1, according to equation 4.1. We then repeated the experiment (second cycle) using F2, the second fold, as the test set, and joined the remaining 4 folds to form the training set. We did the same for folds 3, 4 and 5, running 5 cycles total.
For each of those cycles, we recorded the average error. Let Ri be the set of inputs used for the training set when Fi is used as the test set, and let Si be the score function with the beta values obtained by using Ri as input to glmfit. We define the average error for a cycle i, avgErri, as follows:
avgErri = (1/|Fi|) Σ_{Dj ∈ Fi} |Gj - Si(Dj)|    (4.7)
We define the final average error for linear regression as follows:
avgErr = (1/5) Σ_{i=1}^{5} avgErri    (4.8)
We repeated the whole experiment several times, randomly generating 5 new folds each time. We found that the results were essentially the same.
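The random fold assignment can be sketched as follows (illustrative Python; the seed and helper name are assumptions, not the thesis's code):

```python
import random

def make_folds(n, k=5, seed=0):
    """Randomly split indices 0..n-1 into k folds of near-equal size.
    With n = 324 and k = 5, four folds get 65 items and one gets 64."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = make_folds(324)
print(sorted(len(f) for f in folds))  # [64, 65, 65, 65, 65]
```

Each cycle then trains on four folds and applies equation 4.7 to the held-out fold.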
(ii) k-Nearest Neighbor
In a similar manner to the process described in part (i) above, the 324 inputs in D were randomly placed into 5 folds. The KNN model was tested on each of the 5 folds, while using the remaining 4 partitions as a training set. For KNN, a "training" set is simply the set used as the memory from which the k closest points to a test point are taken. There is no actual training stage comparable to the one for linear regression. More formally, in this experiment, if Ri is the set of inputs used for the training set when Fi is used as the test set, then we define Si as in equation 4.6, with the added constraint that K is strictly a subset of Ri. With Si thus defined, we evaluate Si(Dj) for all Dj in Fi. We perform this routine for all 5 folds. We also compute and record the average error exactly as described in part (i) above (but using Si as defined in this section).
4.2 Classification Model for Recognition of Easy vs. Hard Documents
To determine whether a document is easy or hard to read, our ideal goal would be to create a
classification function C such that C(Di) = Li. In practice, however, we can expect that some documents
could be misclassified. Given a test set, we will define the error of the classification model as the number of documents misclassified.
4.2.1 SVM
We use the Support Vector Machine (SVM) algorithm to build the first classification model.
SVM treats the inputs as data points in an m-dimensional space. As mentioned earlier, m is the number of
features (predictors) of each input. Given data points from two different classes, the SVM algorithm
attempts to find a boundary line, plane, or hyperplane (depending on the value of m) that spatially separates
the points of the two classes into two regions. The SVM method uses just a few points from each class,
called support vectors, that lie at the boundary of the classes[32]. The assumption is that points at the
boundary are the ones that are critical for defining a separating line. The line (or plane, or hyperplane) is
chosen so as to maximize its distance from the support vectors on both regions.
Given a new, unseen data point, an SVM model will classify it based on which side of the
boundary it lies on. Therefore, the output of our SVM model is binary: either a 0 (easy-to-read) or a 1
(hard-to-read). It is possible that a test point would fall on the separating hyperplane. In that case, the
point's class is ambiguous, and it is not classified. Such a point is also not included in the classification
accuracy measurements we describe later. It is also possible for a point to fall on the wrong side of the separating hyperplane, and in that case, it is considered misclassified.
We used MATLAB's implementation of the SVM algorithm, provided through the svmtrain
function. The svmtrain function uses a linear kernel by default, which means that a simple dot product is
used to determine the location of a test point with respect to the separating line, and the input is not
transformed into another space (which is useful in some applications). The svmtrain function takes two
arguments:
(i) An n-by-m matrix containing n observations, each having m predictors. In our case, we provided
a subset of the matrix D of document inputs described earlier in this section.
(ii) An n-by-1 vector containing the ground truth values for the inputs given as the first argument
above. For this argument, we provided the corresponding subset from the L vector of labels,
described earlier.
The svmtrain function returns the bias, and the slope values for each dimension, of the separating hyperplane.
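Given the bias and slopes returned by svmtrain, the decision rule itself is just a sign test on which side of the hyperplane the point falls. A sketch in Python (illustrative only; which class maps to the positive side is an assumption, and the training step itself is not reproduced):

```python
def svm_classify(w, bias, d):
    """Classify by the side of the hyperplane w . d + bias = 0:
    1 (hard) on the positive side, 0 (easy) on the negative side,
    None when the point lies exactly on the boundary."""
    s = sum(wj * dj for wj, dj in zip(w, d)) + bias
    if s > 0:
        return 1
    if s < 0:
        return 0
    return None  # ambiguous: on the separating hyperplane

# A hypothetical 2-feature hyperplane x1 + x2 - 1 = 0:
print(svm_classify([1.0, 1.0], -1.0, (0.9, 0.9)))  # 1
print(svm_classify([1.0, 1.0], -1.0, (0.1, 0.2)))  # 0
```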
4.2.2 Logistic Regression
Given an input, the SVM model gives us only its class label as output. It would be useful to have
a measure of confidence for the classification of a document. Logistic regression is a mathematical model
that provides a real value as raw output. The output, which is always between 0 and 1, can be interpreted
as the probability that a given input is of the positive class ("hard to read", in our case). The higher the value, the more likely the document is hard to read; the smaller the value, the more likely it is to be easy to read. How high the value of the output must be in order for the input to be classified as hard-to-read is a
question that we analyze, in a search for the appropriate cutoff threshold. Logistic regression is regarded
as an efficient supervised learning algorithm for estimating the probability of an outcome or class
variable. In spite of its simplicity, logistic regression has shown successful performance in a range of
fields. It is widely used in a many fields because its results are easy to interpret[33].
Let Di be a document from the easy- and hard-to-read data set, as described earlier. We define a
variable z as follows:

z = β0 + β1 di1 + β2 di2 + ... + βm dim    (4.9)

where β0, β1, ..., βm are the unknown parameter values.
The logistic regression model is described as follows:

l(z) = 1 / (1 + e^(-z))    (4.10)
The output from this function, l(z), is what we define as the "raw output" of our logistic regression model. From equation 4.10, we deduce that higher z values will produce an output closer to 1, while smaller z values will yield raw output closer to 0. Hence, the key part of building the appropriate logistic regression model is to find the best set of parameters that will produce raw output close to 1 for the
hard documents and close to 0 for the easy documents. We can also use the residual sum of squares to
evaluate the best parameter values, as we did for linear regression earlier. Given a set R of documents
labeled either 0 (easy) or 1 (hard to read) and a vector β of parameters, we can define the RSS formally, for this case, as follows:

RSS(β) = Σ_{Di ∈ R} (Li - l(zi))²    (4.11)
We define the best set of parameters as in equation 4.6 (but using the RSS definition given in formula 4.11 above).
We found the best parameters again by using MATLAB's solution implemented in the glmfit
function. The arguments passed to glmfit in this case are similar to those described in section 4.1.1,
except that the observations are taken from the easy- and hard-to-read data set, the target values are
correspondingly taken from the L vector described earlier, and we use "binomial" for the distribution
argument (so that glmfit computes the parameters based on equations 4.9-4.11).
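The raw output of equations 4.9-4.10 can be sketched as follows (illustrative Python; the parameter values shown are hypothetical):

```python
import math

def logistic_output(beta, d):
    """Raw output l(z) = 1 / (1 + e^(-z)), with
    z = beta0 + beta1*d1 + ... + betam*dm (equations 4.9-4.10)."""
    z = beta[0] + sum(b * x for b, x in zip(beta[1:], d))
    return 1.0 / (1.0 + math.exp(-z))

beta = [0.0, 1.0]  # hypothetical parameters for one feature
print(logistic_output(beta, [0.0]))  # 0.5: maximally uncertain
print(logistic_output(beta, [4.0]) > 0.9)  # True: likely "hard"
```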
4.2.3 Training and Testing Classification Algorithms
(i) SVM
We used a 3-fold cross-validation scheme to train and test the SVM model. The 367 easy- and hard-to-read documents were randomly assigned to 3 folds of approximately equal sizes: 2
of the folds having 122 distinct documents each, and one having 123. We performed the first
iteration (or cycle) of training and testing by using the first fold, F1, as a test set. The remaining
two folds were combined to form a training set, R1. R1, being therefore a 244- or 245-by-m matrix,
was used as the first argument to svmtrain, as discussed in section 4.2.1. The corresponding
ground-truth labels, a vector of 244 (or 245) elements, was passed as the second argument to
svmtrain.
The testing stage consisted of using the hyperplane obtained from svmtrain to classify the 122 (or 123) inputs in F1. The number of misclassified inputs was noted. We define that number as the error for the iteration. We performed a second and third iteration in the same manner, using folds F2 and F3, respectively, as the test set. For each iteration, we also noted its error. The total error was obtained by adding the 3 error values obtained from the 3 iterations. We define the average error of the SVM model as the total error divided by the total number of inputs tested through all iterations, 367. We report the correct rate, which we define as 1 - average error.
(ii) Logistic Regression
To build a logistic regression model, we also use a 3-fold cross-validation scheme, as
described in the previous subsection. The 367 data points were again randomly placed into 3 folds
of approximately equal sizes. During each one of the 3 cross-validation iterations, one of the folds
was used as a test set, while the other two were combined to form a training set. The training set
was used as the first argument to glmfit (the observations), as described in the previous section.
The corresponding labels, a 244- or 245-element vector, were used as the second argument, the vector of targets. We obtained the best set of parameters from the glmfit function's output. These parameter values were substituted into equation 4.9. We then obtained raw output values for the inputs in the test set for the iteration, using equation 4.10. So, once all cross-validation iterations had been run, we had three logistic regression models, each with its set of optimal parameters, and a raw output value for each of the 367 inputs, corresponding to their use as part of the test set associated with their fold.
In order to test the performance of the logistic regression model, we needed a threshold to
use as cutoff between the two classes. A document whose raw output value was above or equal to
the threshold was considered of class 1 (hard to read). A raw value below the threshold classified
the document as easy to read (class 0). The threshold could be any value in the [0 1] range.
However, given the finite number of test inputs, only a finite number of threshold values would
make a difference in performance. Hence, we considered potential thresholds only at the raw
output values. That is, each raw output value was tested as a potential threshold. Therefore, we
tested a total of 367 threshold values.
For each threshold value, we computed the true positive rate and the false positive rate.
The true positive rate, also called the sensitivity, is the fraction of the positive test points that were
correctly classified (as positive). In our case, the true positive rate is the number of hard
documents whose raw output values were equal to or above the threshold (i.e., correctly
classified) divided by the total number of hard documents in the test set. The false positive rate is
the fraction of incorrectly classified negative samples. In this testing stage, the false positive rate
is the number of easy documents whose raw output values were equal to or above the threshold
(i.e., classified as hard) divided by the total number of easy documents in the test set. The specificity is defined as 1 - false positive rate. In addition to the true and false positive rates, we also computed the overall accuracy, or correct rate, of the model for each threshold value. We define the correct rate as the number of test documents correctly classified divided by the total number of documents tested (367).
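The three quantities can be computed per threshold as follows (an illustrative Python sketch with toy raw outputs, not the thesis's data):

```python
def rates_at_threshold(raw, labels, t):
    """True positive rate, false positive rate, and correct rate when
    raw outputs >= t are classified as hard (label 1)."""
    tp = sum(1 for r, l in zip(raw, labels) if r >= t and l == 1)
    fp = sum(1 for r, l in zip(raw, labels) if r >= t and l == 0)
    correct = sum(1 for r, l in zip(raw, labels) if (r >= t) == (l == 1))
    return tp / labels.count(1), fp / labels.count(0), correct / len(labels)

raw = [0.9, 0.8, 0.3, 0.2]
labels = [1, 0, 1, 0]
print(rates_at_threshold(raw, labels, 0.5))  # (0.5, 0.5, 0.5)
```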
5. Results
The results reported focus on the Java implementation of the system, as we expect it to be the most accessible implementation. We do, however, provide a summary of the results of the MATLAB implementation at the end of this chapter.
5.1 Prediction of Expert Scores
For the linear regression and k-nearest neighbor models, we report the correct rate, which we
define as follows:

correct rate = 1 - avgErr/7    (5.1)

We defined avgErr, the average error, in section 4.1.3; it is simply a measure of the average
difference between the output of the regression models and the gold standard from the experts. The
division by 7 normalizes the measure, guaranteeing that the correct rate is between 0 and 1. A correct
rate of 1 would mean that a model's output matched the experts' grades perfectly. Table 5.1 presents the
correct rates for the two models, as the main result of this project.
Regression Model       Correct Rate
Linear Regression      0.875
k-Nearest Neighbor     0.856

Table 5.1: The correct rates of the regression models.
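Equation 5.1 is straightforward to compute. The sketch below assumes avgErr is the mean absolute difference between predictions and expert grades, which the normalization by the 7-point grade range suggests (illustrative Python, not the thesis code):

```python
def correct_rate(predictions, gold_scores):
    """Correct rate = 1 - avgErr/7, where avgErr is taken to be the mean
    absolute difference between model output and expert grade (1-7 scale)."""
    n = len(gold_scores)
    avg_err = sum(abs(p - g) for p, g in zip(predictions, gold_scores)) / n
    return 1 - avg_err / 7
```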
To visualize the performance of the two models, we provide plots that compare the output of the
models to the gold standard. As explained in the previous chapter, after all cross validation iterations have
been performed, every input in the data set D must have been used exactly once as part of a test set. We
recorded these output values, and hence, had a score for each of the 324 inputs from the testing stage. We
present a plot of the experts-given grades versus the output values of our model. More specifically, we
begin by sorting the 324 gold standard scores in ascending order, to create a set of x-axis values. We order
the models' outputs on the y-axis correspondingly. To reduce the number of points on the graph and
eliminate outlier effects, we take the average of every 10 consecutive values, obtaining 33 new values on
each axis. We chose a group size of 10 arbitrarily. The resulting plot, which we call here a calibration
plot, visualizes how closely our models' predictions track the experts'. The calibration plot for an
ideal model would yield points lying on a straight line through the origin with a slope of 1; a
completely random model would produce points scattered across the graph, following no line. The green
line in each plot represents the ideal case. Figures 5.1 and 5.2 show the calibration plots for the
linear regression and k-nearest neighbor models, respectively. To compare the two models visually,
figure 5.3 plots the output of both.
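The grouping step behind the calibration plots can be sketched as follows (illustrative Python; the trailing partial group is averaged as well, which is how 324 points reduce to 33 with a group size of 10):

```python
def calibration_points(gold_scores, model_outputs, group_size=10):
    """Sort (gold, output) pairs by gold score, then average every
    group_size consecutive values on each axis to smooth out outliers."""
    pairs = sorted(zip(gold_scores, model_outputs))
    xs, ys = [], []
    for i in range(0, len(pairs), group_size):
        chunk = pairs[i:i + group_size]          # last chunk may be partial
        xs.append(sum(g for g, _ in chunk) / len(chunk))
        ys.append(sum(o for _, o in chunk) / len(chunk))
    return xs, ys
```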
[Figure: Calibration Plot for LR Model (Gsize = 10); x-axis: Expert Scores]
Figure 5.1: Linear Model's predictions compared to the gold standard.
[Figure: Calibration Plot for Knn Model (Gsize = 10); x-axis: Expert Scores]
Figure 5.2: k-Nearest-Neighbor Model's predictions compared to the gold standard.
[Figure: Calibration Plot for both LR and Knn Models (Gsize = 10); x-axis: Expert Scores; legend: LR, KNN, Ideal]
Figure 5.3: LR and KNN predictions compared to the gold standard.
5.1.1 Variable Selection
We were interested to know which features were most useful in forming the model. Several
methods for variable selection exist. We chose to examine each feature as a one-predictor model and
use linear regression to map its values to the gold standard values. The lower the deviance, the more
useful the feature. More specifically, given D, the n-by-m matrix of n inputs Di = (di1, di2, ..., dim), we
considered each column j of D as a dataset of its own, D:,j = (d1j, d2j, ..., dnj). We use the equation:

Sj(Di) = β0 + dij β1    (5.2)

and solve for the set of parameters that minimizes the deviance between the outputs of Sj and the gold
values, G. We found the minimal deviance for each feature j = 1, 2, ..., m. The feature with the smallest
minimal deviance was considered the best. Using the deviance to compare two models is a common
method for determining the relevance of variables. For normal linear regression models (which is what we
use in equation 5.2), the deviance is equivalent to the residual sum of squares, RSS [34]. We computed the
RSS using the same tools described for linear regression in the previous chapter. We ranked the features,
from best to worst, based on the RSS of their models. The features, ranked from best to worst, were:
1. Left Margin (LMS)
2. Page Count (PGC)
3. Maximum-to-Minimum Line Size Ratio (MMR)
4. Bottom Margin (BMS)
5. Average Gray Scale Pixel Value (AGS)
6. Average Number of Lines per Column (LNC)
7. Top Margin (TMS)
8. Line-to-Space Ratio (LSR)
9. Number of Columns (CLC)
10. Right Margin (RMS)
11. Whitespace Ratio (WSR)
12. Color Count (CLRC)
13. Interline Space Ratio (ISR)
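The single-predictor ranking can be sketched as follows (illustrative Python, not the thesis implementation; each one-feature model is fit with the closed-form least-squares solution and ranked by its RSS, the deviance of a normal linear model):

```python
def rss_one_predictor(feature_values, gold):
    """Fit g ~ b0 + b1*d by ordinary least squares and return the
    residual sum of squares (the deviance for a normal linear model)."""
    n = len(gold)
    mean_d = sum(feature_values) / n
    mean_g = sum(gold) / n
    sxx = sum((d - mean_d) ** 2 for d in feature_values)
    sxy = sum((d - mean_d) * (g - mean_g) for d, g in zip(feature_values, gold))
    b1 = sxy / sxx if sxx else 0.0
    b0 = mean_g - b1 * mean_d
    return sum((g - (b0 + b1 * d)) ** 2 for d, g in zip(feature_values, gold))

def rank_features(D, gold):
    """D is an n-by-m matrix (list of rows); return column indices
    ordered by ascending RSS of their one-predictor model (best first)."""
    m = len(D[0])
    cols = [[row[j] for row in D] for j in range(m)]
    return sorted(range(m), key=lambda j: rss_one_predictor(cols[j], gold))
```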
We then built 13 different models, each on a subset of D, using only the columns corresponding to
the k best features, k = 1, 2, ..., 13. We computed the deviance of each of those models. The first model
included only the best feature, the second included the top two, and so on. In figure 5.4, we report a graph
of the change in deviance as more features are added to the model, in the order determined by the variable
selection step.
[Figure: Variables Improvement of Model; x-axis: Number of Best Variables Used; y-axis: deviance]
Figure 5.4: The comparison of 13 models, each using the 1-13 best features, respectively. The
deviance decreases as more features are added.
We labeled the first 5 features on the graph. As expected, the model improves as more features
are used, showing a reduction in deviance at each step. We note that there is no significant improvement
in the model beyond the addition of the sixth best feature.
5.1.2 Typical Feature Values
In this section, we report the typical values of some features. Figures 5.5-5.8 show the values of
the 4 best features, as determined in the previous subsection. The feature values are plotted as a function
of the experts' grades. It should be noted that since many documents have the same gold standard scores,
and of those, many have the same feature values, there are few distinct points on some of the plots, as
many points are identical. Each graph, however, does have a point for each of the 324 documents.
[Figure: Best Feature 1 Values Across Scores; x-axis: Experts Scores]
Figure 5.5: Values for the left margin, the best feature.
[Figure: Best Feature 2 Values Across Scores; x-axis: Experts Scores]
Figure 5.6: Values for the page count, the 2nd best feature.
[Figure: Best Feature 3 Values Across Scores; x-axis: Experts Scores]
Figure 5.7: Values for the maximum-to-minimum line size ratio, the 3rd best feature. One outlier point is not shown.
[Figure: Best Feature 4 Values Across Scores; x-axis: Experts Scores]
Figure 5.8: Values for the bottom margin, the 4th best feature.
Because many points on some of the feature-value plots are identical, it is difficult to
distinguish the points that represent typical feature values from the outliers. We therefore also provide
plots of the average feature values per score bucket. Each bucket, bi, is defined as the group of
documents whose gold standard scores round to i. That is, a document with an experts-given grade of 4.4
is considered part of bucket 4, whereas a score of 4.5 would make a document part of bucket 5. In
Figures 5.9 through 5.14, we show the average feature value per bucket for the best 6 variables.
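The bucketing rule can be sketched as follows (illustrative Python; note that Python's built-in round() would send 4.5 to 4 under round-half-to-even, so the sketch rounds half up explicitly to match the rule in the text):

```python
import math

def bucket(score):
    """Round a gold-standard grade to its bucket: 4.4 -> 4, 4.5 -> 5.
    math.floor(x + 0.5) rounds halves up, unlike the built-in round()."""
    return math.floor(score + 0.5)

def average_by_bucket(scores, feature_values):
    """Average a feature's values over the documents in each score bucket."""
    sums, counts = {}, {}
    for s, v in zip(scores, feature_values):
        b = bucket(s)
        sums[b] = sums.get(b, 0.0) + v
        counts[b] = counts.get(b, 0) + 1
    return {b: sums[b] / counts[b] for b in sums}
```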
[Figure: Average Value of Best Feature; x-axis: Experts Score Buckets]
Figure 5.9: Average values for the left margin, the best feature.
[Figure: Average Value of 2nd Best Feature; x-axis: Experts Score Buckets]
Figure 5.10: Average values for the page count, the second best feature.
[Figure: Average Value of 3rd Best Feature; x-axis: Experts Score Buckets]
Figure 5.11: Average values for the maximum-to-minimum line size ratio, the 3rd best feature.
[Figure: Average Value of 4th Best Feature; x-axis: Experts Score Buckets]
Figure 5.12: Average values for the bottom margin, the 4th best feature.
[Figure: Average Value of 5th Best Feature; x-axis: Experts Score Buckets]
Figure 5.13: Average values for the average gray scale pixel value, the 5th best feature.
[Figure: Average Value of 6th Best Feature; x-axis: Experts Score Buckets; y-axis: lines per column]
Figure 5.14: Average values for the number of lines per column, the 6th best feature.
To provide further visual comparison between the features, we plotted the average value per
bucket of several features on the same graph. Because the values of the various features have different
ranges, we normalized all values. Each feature's values are normalized by its maximum value. Figure
5.15 shows a plot of the normalized average values by score category (bucket) for the best 4 features.
[Figure: The Best 4 Feature Average Values, Normalized; LMS (blue), PGC (red), MMR (green), BMS (cyan); x-axis: Gold Standard Score Buckets]
Figure 5.15: A comparison of the average values per grade category for the best 4
features: left margin (LMS), page count (PGC), max-to-min line size ratio (MMR), and
bottom margin (BMS).
Figure 5.16 shows the fifth and sixth best features together on a similar plot. However,
because we have only two ranges to plot, the values are not normalized. Instead, we use the left y-axis for
one feature and the right y-axis for the other.
[Figure: Average Values of 5th and 6th Best Feature; x-axis: Gold Standard Score Buckets; left y-axis: average gray-scale pixel value; right y-axis: lines per column]
Figure 5.16: Average values per grade category for the average gray-scale pixel
values (the 5th best feature) and the number of lines per column (the 6th best feature).
5.2 Classification of Easy vs. Hard-to-Read Documents
For both the SVM and logistic regression models, we report the accuracy (or correct rate),
which is the number of correctly classified documents divided by the total number of documents
classified. We also report the sensitivity and specificity, as discussed in section 4.2. We report the value
of these three indicators (accuracy, sensitivity, and specificity) for each of the 3 iterations of cross-validation. We also report the overall performance, which is the result computed from using the 3 cross-validation test sets as one test set of 367 documents.
The sensitivity and specificity reported for the logistic regression model are those that correspond
to the best threshold. We define the best threshold (or optimal threshold) as the one that yields the highest
accuracy. We present the three performance indicators in tables 5.2, 5.3 and 5.4, as the main results for
the classification experiment of this project. Table 5.2 presents the performance of the SVM classifier in
each of the cross-validation folds, as well as the overall results. Table 5.3 presents the corresponding
results for the logistic regression classifier. Table 5.4 summarizes a comparison of the two types of
classifiers.
             Fold 1   Fold 2   Fold 3   Overall
Accuracy     0.8852   0.8862   0.9016   0.8910
Sensitivity  0.8596   0.8793   0.8596   0.8663
Specificity  0.9077   0.8923   0.9385   0.9128

Table 5.2: SVM performance in the 3-fold cross-validation and overall.
             Fold 1   Fold 2   Fold 3   Overall
Accuracy     0.9187   0.9180   0.9016   0.9128
Sensitivity  0.8966   0.9474   0.8246   0.8895
Specificity  0.9385   0.8923   0.9692   0.9333

Table 5.3: Logistic regression performance in the 3-fold cross-validation and overall.
Model                 Accuracy   Sensitivity   Specificity
SVM                   0.886      0.892         0.878
Logistic Regression   0.913      0.895         0.928

Table 5.4: Accuracy, sensitivity (recall), and specificity of the Support Vector
Machine and Logistic Regression models.
5.2.1 Further Evaluation of the Logistic Regression Classifier
Table 5.3 shows the performance of the logistic regression model at only the most accurate threshold
value. To get a more general measure of the logistic regression model, considering all 367 threshold values
obtained from the cross-validation testing steps, we present two plots:
(i) an accuracy plot
(ii) an ROC curve
The accuracy plot shows the correct rate for each threshold value. To find the accuracy at a
particular threshold, we simply classify all output values equal to or above the threshold as class 1
("hard-to-read") and all others as class 0 ("easy-to-read"). Therefore, we expect low accuracy at the extreme
threshold values of 0 and 1, as the model will classify all points as positive or negative, respectively. We
expect higher accuracy in the middle region. In figure 5.17, we provide an accuracy plot. The red marker
indicates the best threshold value. As can be seen, this threshold value corresponds to the highest point on
the graph (maximum accuracy).
[Figure: Accuracy Plot for Logistic Regression; x-axis: Thresholds; y-axis: correct rate]
Figure 5.17: Accuracy of various classifiers as a function of their threshold value.
The red marker indicates the best threshold, having the highest accuracy.
To further visualize the results, we plot the receiver operating characteristic (ROC) curve [35],[36]. On the
x-axis of an ROC curve are the false positive rates corresponding to increasing threshold values. On the
y-axis are the corresponding true positive rates. If we think of this task as recognizing hard documents and
rejecting non-hard (easy) documents, the true positive rate measures the classifier's ability to recognize
the desired object (hard documents); this is also called the recall. The false positive rate measures the
classifier's vulnerability to false alarms (an easy document having some resemblance to a hard document);
it is the complement of the model's specificity. An ideal model would have a true positive rate of 1, indicating
that the model recognized all hard documents presented to it. The ideal model would have a false positive
rate of 0, meaning that no easy document fooled the classifier. In practice, however, a low false positive
rate is achieved by using a higher threshold value, making the classifier more conservative in accepting
documents as hard. The downside of making the classifier's acceptance criteria stricter is that some
positive objects (hard documents) will be rejected. Conversely, a lower threshold value gives a
more permissive classifier. Such a classifier is less likely to incorrectly reject a hard document, but having
"lowered the bar," it also becomes more likely to accept an easy document as hard. Hence the need to
find the right trade-off threshold value. The true positive rate is equivalent to the sensitivity, and the false
positive rate is 1 - specificity.
The area under the ROC curve, or simply area under the curve (AUC), gives an estimate of how
good the overall model is [37]. An ideal model will have an area of 1, while a random model will have an
area of 0.5, as its ROC curve will be close to a straight line from (0,0) to (1,1). The AUC for the logistic
regression model we trained was 0.938. Figure 5.18 shows the ROC curve.
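Given the per-threshold false and true positive rates, the AUC can be estimated with the trapezoidal rule (an illustrative Python sketch, not the thesis implementation):

```python
def auc_trapezoid(fpr, tpr):
    """Area under an ROC curve given matched false/true positive rates,
    estimated by the trapezoidal rule. Points are sorted by FPR first."""
    pts = sorted(zip(fpr, tpr))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

An ideal classifier's curve, which rises straight to (0,1) before crossing to (1,1), yields an area of 1; the diagonal of a random classifier yields 0.5.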
[Figure: ROC for classification by logistic regression; x-axis: False positive rate; y-axis: True positive rate]
Figure 5.18: The ROC curve for the logistic regression model. The red marker
indicates the threshold with highest accuracy.
Finally, we looked at the raw output of the logistic regression model to find the typical ranges of
the two types of documents. Figure 5.19 shows all 367 test inputs evaluated in cross-validation testing
stages, using equations 4.9 and 4.10. The horizontal green line shows the best threshold.
[Figure: Raw Output of Logistic Regression Classifier; x-axis: z = b0 + d1*b1 + d2*b2 + ... + d13*b13; legend: Hard, Easy, Threshold]
Figure 5.19: The raw output of the logistic regression classifier for all 367
cross-validation test inputs. The horizontal green line indicates the best threshold.
Multiple Optimal Thresholds
The outputs from the 3 cross-validation models yielded one best threshold value. However, it is
important to mention that this is not always the case. Because the sensitivity and specificity are computed
from the number of correctly classified positive and negative inputs, respectively, while the accuracy
depends on the correct classification of all inputs (positive and negative), it is possible to have more
than one optimal threshold. Such thresholds will have the same accuracy, but varying sensitivity and
specificity values. Since we found this to be a common occurrence, we present here an example from a
logistic regression model trained on half of the data set (184 randomly selected documents) and tested
on the remaining half (183 documents). The resulting model had 3 "best threshold" values. We refer to
them as Logistic Regression Model-1, -2, and -3. In table 5.5, we present the performance of the 3 models.
Model        Accuracy   Sensitivity   Specificity
LR Model-1   0.913      0.895         0.928
LR Model-2   0.913      0.884         0.938
LR Model-3   0.913      0.872         0.949

Table 5.5: Accuracy, sensitivity (recall), and specificity of the three Logistic
Regression models having the highest accuracy.
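That ties of this kind can arise is easy to demonstrate: the sketch below returns every maximum-accuracy threshold with its sensitivity and specificity (illustrative Python; the toy data in the test is ours, not from the corpus):

```python
def optimal_thresholds(raw_outputs, labels):
    """Return every threshold attaining the maximum accuracy, as tuples
    (threshold, accuracy, sensitivity, specificity). Tied thresholds
    share the accuracy but can differ in the other two indicators."""
    pos = sum(labels)
    neg = len(labels) - pos
    results = []
    for t in sorted(set(raw_outputs)):
        tp = sum(1 for r, y in zip(raw_outputs, labels) if r >= t and y == 1)
        fp = sum(1 for r, y in zip(raw_outputs, labels) if r >= t and y == 0)
        acc = (tp + (neg - fp)) / len(labels)
        results.append((t, acc, tp / pos, 1 - fp / neg))
    best = max(acc for _, acc, _, _ in results)
    return [r for r in results if r[1] == best]
```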
The trade-off between sensitivity and specificity, mentioned earlier, can be seen in table 5.5 with the
three logistic regression models. LR Model-1 has the highest sensitivity but the lowest specificity, while LR
Model-3 has the lowest sensitivity but the highest specificity; Model-2 is in between. In figure 5.20, we show the
accuracy plot for this model. In figure 5.21, we show the ROC curve. The red markers in both graphs
indicate the points corresponding to the best thresholds. The AUC for this model was 0.969.
[Figure: Accuracy Plot for Logistic Regression; x-axis: Thresholds; legend: Accuracy, Best Threshold]
Figure 5.20: Accuracy plot for a logistic regression model with several optimal
thresholds. Each of the optimal thresholds gives the same accuracy.
[Figure: ROC for classification by logistic regression; x-axis: False positive rate]
Figure 5.21: The ROC curve for a logistic regression model with three optimal
thresholds, shown by the red markers. Each of the best thresholds yields maximal
accuracy, but different true positive and false positive rates.
In figure 5.22, we show the raw output from the 183 test points. Three horizontal lines are shown
to indicate the threshold values for LR Model-1, -2, and -3, which had the best accuracy.
[Figure: Raw Output of Logistic Regression on Test Data; x-axis: z = b0 + d1*b1 + d2*b2 + ... + d13*b13]
Figure 5.22: The raw output of a logistic regression model with three optimal
thresholds, showing the evaluation of 97 easy (circle, red) and 86 hard (+, blue)
documents in the test set. The three horizontal lines show the three best-accuracy
threshold values.
5.2.2 The Support Vector Machine Model
The SVM model performed worse than the logistic regression model operating at its
best threshold. All test points were classified, with none falling on the separating hyperplane. Some were
misclassified, as shown by the correct rates in table 5.2. As part of the cross-validation, each of the 367
data points was used once as a test point. We can use the distance of each data point to the separating
hyperplane as a measure of confidence in the classification. Figure 5.23 shows all 367 documents,
grouped by cross-validation fold (each fold having been used as a test set). On the y-axis is the distance to
the hyperplane. On the x-axis, we simply spread the points evenly, per fold, for better viewing.
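This confidence measure is simply the signed distance to the hyperplane (an illustrative sketch; here w and b stand for the trained SVM's weight vector and bias, whose values are not listed in the thesis):

```python
def signed_distance(x, w, b):
    """Signed distance of point x to the hyperplane w.x + b = 0.
    Positive values fall on the 'hard' side of the boundary; a larger
    magnitude means the classification is made with more confidence."""
    norm = sum(wi * wi for wi in w) ** 0.5
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
```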
[Figure: Distance to Hyperplane; x-axis: Cross-validation folds (Fold 1, Fold 2, Fold 3)]
Figure 5.23: The distance of the test points to the separating hyperplane of the
SVM classifier. The hard-to-read documents (blue, '+') lie generally at a positive
distance, while the easy-to-read documents (red, '*') are generally at a negative
distance. The vertical lines show fold grouping.
5.2.3 Variable Selection
While the same 13 predictors used for predicting expert-given scores were also used for the
classification experiments, the data set is different, and the outcome is of a different
type as well--binary,
rather than continuous. Hence, we performed another experiment of variable selection, using a different
approach and of course, using the easy- and hard-to-read data set.
As described in section 5.1, we built 13 one-predictor models, one for each feature value. The
relative performance of each of these single models determines the significance ranking of the feature. In
this experiment, each sub-model, c1, c2, ..., c13, is a logistic regression classifier. Each classifier ci is
trained with half of the values of the i-th feature, randomly chosen as the training set, and tested on the
remaining half. Each classifier's area under the curve (AUC) is recorded. The features are ranked
from most to least significant by descending AUC values.
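For a one-feature logistic regression, the fitted score is a monotone function of the feature, so the classifier's AUC equals the AUC of the raw feature itself (or its complement, when the coefficient is negative). The ranking can therefore be sketched without fitting anything, via the Mann-Whitney statistic (illustrative Python, not the thesis code):

```python
def feature_auc(values, labels):
    """AUC of a single feature used as a raw score: the fraction of
    (positive, negative) pairs in which the positive document has the
    larger value, counting ties as half (the normalized Mann-Whitney U)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rank_by_auc(columns, labels):
    """Rank feature columns from most to least discriminative, taking
    whichever direction of each feature scores better (AUC or 1 - AUC)."""
    scored = []
    for j, col in enumerate(columns):
        a = feature_auc(col, labels)
        scored.append((max(a, 1 - a), j))
    return [j for _, j in sorted(scored, reverse=True)]
```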
We found the feature ranking in this classification problem to be different from the regression
problem discussed in section 5.1.1. The top six were all different in the two experiments. We notice that the
most significant variables are those that capture line spacing and white space. The difference between
the rankings in the classification problem versus the prediction of expert scores could be due to the fact
that some features might be significant in determining whether a particular document's score is likely to
fall above or below 4, for instance, but might not be significant in predicting the exact score. Another
feature could be more significant only in determining whether a particular document is of class 7, for
instance, as opposed to 1-6. The former type of feature could be more useful in distinguishing between an
easy or hard document, whereas the latter type would be more useful in predicting one score bucket. Table
5.6 presents the full ranking of features for the classification of easy vs. hard documents and provides the
equivalent rankings for the regression problem (predicting expert scores). We do note that the data sets
used are different, so table 5.6 is not intended to be interpreted as the definitive comparison of the features'
significance in the two models. Further study is needed to reach a more definite conclusion on that point.
We also note that some of the features were close matches. Therefore, different runs of the
experiment yielded slight changes in ranking (some features moving up or down one rank), depending on
the training and test sets obtained from randomization.
Rank   Classification Problem         Regression Problem
1      Line-to-Space Ratio            Left Margin
2      Whitespace Ratio               Page Count
3      Number of Lines per Column     Max-to-Min Line Size Ratio
4      Interline Space Ratio          Bottom Margin
5      Color Count                    Average Gray Scale Value
6      Average Gray Scale Value       Number of Lines per Column
7      Right Margin Size              Top Margin Size
8      Top Margin Size                Line-to-Space Ratio
9      Max-to-Min Line Size Ratio     Number of Columns
10     Bottom Margin                  Right Margin Size
11     Left Margin                    Whitespace Ratio
12     Number of Columns              Color Count
13     Page Count                     Interline Space Ratio

Table 5.6: The feature ranking in the classification of easy vs. hard documents,
compared to the ranking in predicting experts' readability scores.
To further check the rankings obtained by using the AUC of the single-feature models as the criterion,
we performed rankings using three additional criteria:
(i) Using the accuracy of 13 single-feature models, each being a logistic regression model trained
with half of the data and tested on the other half. Figure 5.24 shows the accuracy of each feature,
in order of ranking, while figure 5.23 shows the equivalent for the AUC criterion.
(ii) Using a two-way t-test, with pooled variance [38]. Essentially, for each feature, we test the null
hypothesis that the feature values for the easy-to-read and hard-to-read documents come from the
same distribution, at the 5% significance level. Figure 5.25 shows the result of the t-test, from most
significant to least significant feature.
(iii) Using the deviance, a generalization of the residual sum of squares. In this method, we start by
selecting the single-feature logistic regression classifier which yields the smallest deviance. This
will be the most significant feature. We then add a second feature to the model, by trying each of
the remaining 12 features. The feature whose addition gives the 2-feature model with the least
deviance is the second most significant. We choose the 3rd, 4th, 5th, and all other feature ranks in the
same manner. Figure 5.26 shows the model's improvement as the significant features are added to
it.
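The pooled-variance t statistic of criterion (ii) can be sketched as follows (illustrative Python; the full test would compare |t| against the t distribution with na + nb - 2 degrees of freedom at the 5% level):

```python
import math

def pooled_t_statistic(a, b):
    """Two-sample t statistic with pooled variance, for testing whether
    the feature values of easy and hard documents share a common mean."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)  # pooled variance
    return (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))
```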
[Figure: Individual Performance of Variables by AUC; x-axis: Variables (Most-to-Least Significant by AUC); y-axis: AUC; labeled points include LSR, WSR, ISR, LNC, CLRC, AGS, PGC]
Figure 5.23: The area under the ROC curve for each single-feature classifier.
[Figure: Variables Individual Performance by Accuracy; x-axis: Variables (Most-to-Least Significant); labeled points include LSR, LNC, WSR, ISR, CLRC, AGS, PGC, CLC]
Figure 5.24: The accuracy of the 13 single-feature classifiers used for feature ranking.
[Figure: Variable Significance by T-test; x-axis: Variables (Most-to-Least Significant); y-axis: t statistic]
Figure 5.25: The two-way t-test used for feature ranking.
[Figure: Variables Improvement of Model; x-axis: Number of Variables Used (in order of significance); y-axis: deviance; labeled points include LSR, PGC, LMS, AGS, RMS, CLRC, CLC]
Figure 5.26: The logistic regression model improves (decrease in deviance) as more
significant features are added to it.
The results of the first three criteria (AUC, accuracy, and t-test) largely agree, especially for the 6
most significant features. The deviance criterion gives half of the same top 6 features. One of the most
notable differences is the page count feature, which dramatically changes ranking position when deviance
is used as the ranking criterion. We note that the first 3 graphs (figures 5.23-5.25) show several consistent,
relatively flat regions. These represent features of nearly equal significance. Indeed, for the criteria that depend
on randomization (AUC and accuracy), a change of ranking tends to occur among the features forming
the flat regions under different randomizations. For instance, the whitespace ratio (WSR) and the number
of lines per column (LNC), which appear in the first flat region, often swap positions under different re-runs
of the same experiment.
We further conducted an analysis to determine the subset of features that contribute most to the
model's performance and those that add only marginal improvement. Figure 5.26 above gives one
indication already. In addition, we built 13 logistic regression classifiers, s1, s2, ..., s13, where each
si is a model that uses the i most significant classification variables. Each model is built using half of the
data, randomly selected as a training set, and tested on the remaining half, in a manner analogous to the
description given in the previous chapter. The exact same training and test sets are used to evaluate all 13
models to ensure a fair comparison. However, these training and testing sets are different from the ones
we used to evaluate the single-feature models (c1 through c13). Each model's accuracy (correct rate) was
evaluated. In figure 5.27, we show a plot of the accuracy values as a function of the number of most
significant variables used. As would be expected, the accuracy increases as more features are included in
the model. The occasional drops in the graph are due to the closely matching features performing
differently under different randomly selected test sets than the one used to rank the features. Figure 5.28 is
similar, showing the AUC value for the model in a generally increasing trend as more features are
included in the classifier. The occasional drops occur for the reason already mentioned.
[Figure: Accuracy for Classifier Using the n Most Significant Features; x-axis: Number of most significant features used]
Figure 5.27: Accuracy of the classifiers increasing as more significant features
are added.
[Figure: AUC for Classifier Using the n Most Significant Features; x-axis: Number of most significant features used]
Figure 5.28: The area under the ROC curve increases as more significant features
are added. The local minimums (in this plot as well as in figure 5.27) are due to
closely matching features behaving differently under a new test set.
5.2.4 Typical Feature Values
Finally, we give a brief visualization of the data, which shows how separable it is. Knowing the
separability of the data gives us a sense of how appropriate a separation model (such as SVM) is
for this problem. Given the high dimensionality of the data, we look at projections of the feature values
onto various 2-D planes. More specifically, we show the documents plotted on axes representing pairs of
features, as ranked in the previous subsection. In figures 5.20a-d, we show the documents from a potential
test set (half the dataset, chosen randomly) plotted by their values for the most significant features
(MSF).
Using Style for Evaluation of Readability ofHealth Documents-Thesis by FreddyN Bafuka
[Figures 5.20a-d: four 2-D scatter plots of the documents, projected onto pairs of the most significant features (MSF 1 vs. MSF 2, and MSF 3 vs. MSF 4).]
Figure 5.20a-d: Randomly selected documents, plotted on the feature space. Each axis MSFi
represents the i-th most significant feature. The easy documents are shown as red '*' signs, the
hard documents as blue '+' signs. As the features become less significant, the boundary between
the two classes becomes less obvious.
5.3 MATLAB Implementation
The MATLAB version of the system, being three features short of the Java implementation, performed
worse. For the prediction of scores on expert-rated documents, the linear regression model of
the MATLAB implementation achieved a correct rate of 0.87, while the KNN model had a correct rate
of 0.85. In the classification of easy- vs. hard-to-read documents, the SVM model achieved a
correct rate of 0.82. A logistic regression model was not explored in the MATLAB implementation. We
must note that without the 3 additional features, the Java implementation's performance shows no
measurable distinction from its MATLAB counterpart. These results show two things: first, as far as the
features common to both implementations are concerned, there is no visible difference between the two
implementations, and there ought not to be, if both implementations correctly extract the feature values.
Secondly, the fact that the Java implementation's performance increases significantly with the addition of
3 new features indicates that at least one of those 3 additional features is very significant. Indeed, two of those
3 features are the interline space ratio (ISR) and the color count (CLRC), both of which were ranked
among the 6 most significant features by most of the criteria used in the classification problem's variable
selection (see section 5.2.3). This also explains why we see the greatest improvement of the Java
implementation over the MATLAB one mainly in the classification problem.
6. Discussion
Overall, our system (which we will refer to here as the style-based grading tool) performed well
in predicting the readability level of documents. We compared the scores given by the style-based grading
tool and by the Flesch-Kincaid readability test to the scores given by human experts. We used 286
documents for which all three scores were available. The scores from our system were obtained from the
linear regression model. The Flesch-Kincaid scores were obtained from the GNU Style (v1.10-rc4)
program. The Flesch-Kincaid method is a standard algorithm widely used for measuring
readability [14], including in popular word processing programs such as Microsoft Word and
Lotus. The tests showed that our system's grades correlated better with the human experts' scores than
Flesch-Kincaid's did. Table 6.1 provides the correlation values.
                    Human Experts    SB Grading Tool
SB Grading Tool     0.7131
Flesch-Kincaid      0.6495           0.6506

Table 6.1: The correlation between readability scores given by
human experts, Flesch-Kincaid, and the Style-Based Grading Tool.
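The correlations in table 6.1 are ordinary Pearson correlations between two score vectors. A minimal implementation (an illustrative sketch, not the code used in the thesis) is:

```java
// Minimal Pearson correlation between two score vectors, as used to
// compare grader outputs with expert scores (illustrative sketch).
class Pearson {
    static double correlation(double[] x, double[] y) {
        if (x.length != y.length || x.length == 0)
            throw new IllegalArgumentException("vectors must match and be non-empty");
        // Means of both vectors.
        double mx = 0, my = 0;
        for (int i = 0; i < x.length; i++) { mx += x[i]; my += y[i]; }
        mx /= x.length;
        my /= y.length;
        // Covariance and variances around the means.
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - mx, dy = y[i] - my;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }
}
```

A value of 0.7131, as for the style-based tool against the experts, indicates a fairly strong positive linear relationship between the two sets of scores.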
In predicting document scores, we noticed that many feature values were most distinct in the
lower and higher score ranges (1-2 and 6-7). This can be seen in figure 5.14. The middle range (3-5) had
more ambiguous feature values. To investigate further, we performed an experiment in which we
tested classification of score brackets. For instance, we wanted to determine whether a document's score
would be in the [1-3] bracket or the [4-7] bracket. Hence, we trained six classifiers, one for each such
bracket pair. Each classifier was an SVM model, trained using 5-fold cross-validation. The results
confirmed that the lower and higher range brackets were more correctly classifiable, while the middle range
was more ambiguous. Table 6.2 presents the correct rate of the classifiers trained for each bracket pair.
Score Brackets    Correct Rate
1 vs. 2-7         0.98
1-2 vs. 3-7       0.91
1-3 vs. 4-7       0.76
1-4 vs. 5-7       0.79
1-5 vs. 6-7       0.87
1-6 vs. 7         0.82

Table 6.2: The correct rates of 6 SVM classifiers trained
to determine a document's bracket. The results are in line
with figure 5.14.
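Each bracket classifier is trained on the same documents, relabeled by a cut point that splits the 1-7 scores into a lower and an upper bracket. The relabeling step can be sketched as (not the thesis code):

```java
// Sketch of how the six bracket classifiers' training labels are derived:
// a cut point c in 1..6 splits expert scores into [1..c] vs. [c+1..7].
class Brackets {
    // true if the score falls in the lower bracket [1..cut].
    static boolean lowerBracket(int score, int cut) {
        if (score < 1 || score > 7)
            throw new IllegalArgumentException("score must be in 1..7");
        return score <= cut;
    }

    // Produce binary labels for a list of expert scores, one per document.
    static boolean[] labels(int[] scores, int cut) {
        boolean[] y = new boolean[scores.length];
        for (int i = 0; i < scores.length; i++) {
            y[i] = lowerBracket(scores[i], cut);
        }
        return y;
    }
}
```

With cut = 3, for example, documents scored 1-3 form the positive class and documents scored 4-7 the negative class, which is the "1-3 vs. 4-7" row of table 6.2.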
While the 6 most significant features varied between the regression and classification problems,
they were all features with a large number of potential values. The only exception is the page count
(which has only 3 possible values, yet was the second most significant feature in the regression problem).
Features with more variable values, such as margins, seem more useful than discrete variables with a very
small set of values, such as the number of columns. The feature selection and ranking (in the regression
problem) gave an overall sense of the significance of each feature. However, for different score ranges,
different features might be more significant. This could further explain the difference in the rankings
between the regression and classification tasks. However, further study will be needed for a more definite
conclusion.
It is also important to point out that the deviance criterion (used for variable selection in the
classification problem) produced a noticeably different feature ranking than the other 3 criteria (AUC,
accuracy, and t-test). We noted that the page count, in particular, was ranked the 4th most significant
feature by the deviance criterion, whereas the other three criteria ranked it as one of the two least
significant features. The left margin also moved dramatically from a lower significance rank (by the
AUC, accuracy, and t-test criteria) to a higher significance position according to the deviance criterion.
The deviance criterion (in the classification problem) gave a ranking closer to the one from the regression
problem. Since the ranking for the regression problem was also done using deviance, we suggest that a
possible explanation for the difference shown in table 5.6 lies in the different criteria and methods used to
obtain the rankings, rather than in the features themselves, the models, or the data. That is, using deviance
in the classification problem yielded a ranking more similar to the deviance-based ranking of the
regression problem. However, further study will be needed to reach a more definite conclusion.
The question arises as to whether our system actually captures readability information or whether
it is simply learning different bins of data. This is a common question for any machine learning system.
However, several facts indicate that our system does capture some real information. First, we can think of
health documents as items generated by certain underlying processes: health documents for children are
generated by one process, and medical journals by another. Each of these processes imparts distinct
attributes to the items it produces, and some of those attributes are stylistic. We could therefore argue that
our system learned the attributes of the underlying process that generates health documents. Secondly, an
experiment was done in which the stylistic features described in this paper were added to the NLP-based
system developed by our team at the Decision Support Group at the Brigham & Women's Hospital. This
NLP-based system uses several text analysis features to find the probability that a document belongs to
the easy or hard sample. With that probability, a final score in the 1-7 range is computed. The results
showed that adding stylistic features improved this NLP-based system only slightly (using the correlation
of the system's output values with the experts' grades as the performance measure). In other words, little
new information came from the stylistic features overall. We infer from this that the stylistic features on
their own captured much of the same information captured by the NLP-based features. This experiment
was run at an earlier stage of this project and is still ongoing (a full description of the experiment and
results will be released in a publication in the near future). However, some of the stylistic features
presented in this thesis were not used in that experiment, as they were not yet implemented. Since some
of those omitted features are significant ones, future testing may reveal a much more significant
improvement of the NLP-based system when these style features are added to it.
It is also important to note that while readability can be defined as the level of difficulty of the
actual text of a document, it can also refer to how willing a reader would be to read the document. A
reader might be more willing to read instructions on a well-laid-out web page in a 12-point font than to
read the same text in fine print on the back of a bottle, and more willing to read a one-page summary on a
health topic than a 100-page report on the same topic. Our style-based grading tool is, in large part, an
attempt to capture that second dimension of readability.
For the tool that we are releasing to end users with this thesis, we use the average of the outputs of
all the linear regression models built as the final output. More specifically, given a new document, we
obtain 5 different scores (one from each of the models trained through cross-validation) and return the
average as the final score from our tool. Similarly, we use all three logistic regression models from
cross-validation to obtain three raw output values for a new document. The average output value and the
threshold are used to classify the new document as hard or easy.
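The averaging rule described above can be sketched as follows; the threshold value here is a stand-in, not the one used by the released tool:

```java
// Sketch of the released tool's final-output rule: average the outputs of
// the cross-validation models (5 linear-regression models for grading,
// 3 logistic-regression models for easy/hard classification).
class EnsembleOutput {
    // Final readability grade: mean of the per-model scores.
    static double finalScore(double[] modelScores) {
        double sum = 0;
        for (double s : modelScores) sum += s;
        return sum / modelScores.length;
    }

    // Easy/hard decision: mean raw logistic output compared to a
    // threshold (the threshold value passed in is illustrative only).
    static boolean isHard(double[] logisticOutputs, double threshold) {
        return finalScore(logisticOutputs) >= threshold;
    }
}
```

Averaging over the cross-validation models gives a single deterministic output rather than depending on which of the five training folds a model happened to see.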
Our implementation of the image-based style evaluation system does have some known
limitations. Most importantly, the system will not work properly when the background is darker than
the foreground. In such a case, the interline spaces will be treated as lines and the inter-column spaces
as columns, and vice versa.
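One possible mitigation, not implemented in the thesis system, would be to detect a dark background and invert the image before feature extraction. A sketch on a grayscale pixel array (the 128 cutoff and majority rule are assumptions):

```java
// Sketch of a possible fix for dark-background pages (not part of the
// thesis implementation): if most pixels are dark, assume an inverted
// page and flip the grayscale values before feature extraction.
class Polarity {
    // pixels: grayscale values 0 (black) .. 255 (white).
    static boolean backgroundIsDark(int[] pixels) {
        int dark = 0;
        for (int p : pixels) {
            if (p < 128) dark++; // assumed mid-gray cutoff
        }
        return dark * 2 > pixels.length; // majority of pixels are dark
    }

    // Return a copy of the page with normal (light-background) polarity.
    static int[] normalize(int[] pixels) {
        if (!backgroundIsDark(pixels)) return pixels.clone();
        int[] out = new int[pixels.length];
        for (int i = 0; i < pixels.length; i++) {
            out[i] = 255 - pixels[i]; // invert grayscale
        }
        return out;
    }
}
```

Since most of a typical page is background, a simple majority vote on pixel brightness is usually enough to decide the polarity before the line- and column-detection steps run.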
6.1 Further Work
There are several areas of possible interest in pursuing this project.
6.1.1 Newer Features
To capture more of the style, it would be helpful to include new features. One feature that we
believe could bring a great improvement is the number of letters per unit area. This would
provide a better measurement of the range of font sizes used in the text. From this feature, we could
potentially create a table of contents for a document or infer its structural hierarchy. The complexity of
the hierarchy would give an indication of the complexity of the document.
However, such a feature, like several others we considered, requires the use of OCR
software to get reasonably accurate measurements. As far as we searched, we were not able to find a reliable
open-source OCR tool. In our effort to keep our system open source, freely and easily accessible, we
decided not to implement this feature. Currently, our best approximation of this feature is the maximum-to-minimum
line size ratio, described in section 3 above.
6.1.2 Newer Training Scheme
We also thought of a more specialized method of training and evaluating the regression model.
As mentioned earlier, we can think of health documents as being generated by various underlying
processes, each of which produces documents of a specific difficulty level. For the purpose of this study,
we could posit 7 different processes generating documents, and therefore train 7 different binary
classifiers, each of which learns to recognize documents from just one readability level. Having trained 7
such systems, one for each of the 7 classes, we can evaluate a specific document by inputting it to each of
the 7 classifiers. The classifier that classifies the document with the highest score determines the class the
document belongs to. Such a system would be similar to those used for determining the topic of a
document in well-known NLP problems.
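The decision rule of this proposed one-vs-rest scheme can be sketched as follows (the classifier responses here are stand-ins for trained models):

```java
// Sketch of the proposed 7-classifier scheme: each binary classifier
// scores a document for one readability level, and the level whose
// classifier responds most strongly is returned.
class OneVsRest {
    // Given the raw responses of the 7 level classifiers for one
    // document, return the 1-based readability level (1..7) with the
    // strongest response.
    static int predictLevel(double[] classifierScores) {
        int best = 0;
        for (int k = 1; k < classifierScores.length; k++) {
            if (classifierScores[k] > classifierScores[best]) best = k;
        }
        return best + 1;
    }
}
```

Each of the 7 classifiers would be trained with documents of its own level as positives and all other documents as negatives; the argmax over their responses then acts as the 7-way grader.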
7. Real-World Usage
As part of this thesis, we developed a program with a Graphical User Interface (GUI) that
provides a user-friendly tool for evaluating the readability of documents using style properties, as
described in this paper. We call this tool VisualGrader. VisualGrader allows users to load the image
of a document's page from the computer's file system. With just one button click, the program extracts all
feature values for the page. The page count feature, a property of the entire document, is obtained by
searching the directory the loaded image came from for files with similar names, differing only by a
number identifier. More specifically, the tool expects file names of the format
name_wxyz.ext, where name can be any valid sub-string for a file name, w, x, y, and z are the digits
associated with each page, and ext is any valid image file extension. All files with the same name part
(differing only in the 4-digit number before the file extension) are considered to be from the same
document. The tool computes and displays the document's score. Figure 7.1 shows an example: a
health-related document loaded in the grader. Besides giving the document's score, the program
also shows some of the features. More specifically, the blue lines denote columns and text lines, while the
orange lines indicate white space. Other features, such as the interline-to-line ratio, can already be
visualized qualitatively in the original document.
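The file-grouping convention described above can be sketched with a regular expression; the example file names below are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the page-grouping convention: files named name_wxyz.ext,
// where wxyz is a 4-digit page number, belong to the same document.
class PageGrouping {
    // group(1) = document name, group(2) = 4-digit page number,
    // group(3) = file extension.
    static final Pattern PAGE = Pattern.compile("(.+)_(\\d{4})\\.(\\w+)");

    // Return the document name a page file belongs to, or null if the
    // file does not follow the convention.
    static String documentOf(String fileName) {
        Matcher m = PAGE.matcher(fileName);
        return m.matches() ? m.group(1) : null;
    }

    // Return the page number, or -1 for files outside the convention.
    static int pageNumberOf(String fileName) {
        Matcher m = PAGE.matcher(fileName);
        return m.matches() ? Integer.parseInt(m.group(2)) : -1;
    }
}
```

Counting the files that map to the same document name yields the page count feature without opening any of the images.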
We assume that the health-care professional using our system will convert the document from its
original format to an image or set of images. There are several options for obtaining images from other
formats. For hard copies, a scanner can be used. For PDF formats, we suggest using the application
"Smart PDF Converter". A free version of Smart PDF Converter is available, but it will convert only the
first 3 pages of a document. We used Smart PDF Converter as part of our image data gathering, and
successfully converted hundreds of documents to images or sets of images, up to the first 3 pages. For
websites and other HTML-based formats, we recommend CutePDF Printer, a free software program that
can create PDF documents from text selected on a web page. We used CutePDF to obtain several of the
easy- and hard-to-read documents from various websites. Once in PDF format, we used the program
mentioned earlier to convert the document to image format.
The tool comes with two look-and-feel options. It can be made to mimic the standard look-and-feel of
the operating system it is running on, or it can have the default Java look. Due to well-known problems
with Java Swing's JFileChooser, choosing files with the native look (standard OS look) is very slow,
especially on first use. We provide both versions. The one shown in figure 7.1 uses the Windows Vista
native look.
Brief Implementation Overview
The Document Feature Viewer is entirely implemented using Java Swing. It is based on a
Model-View-Controller design pattern. The model consists of the Java implementation of the feature
extraction back-end described in this paper.
The view consists of a specialized JPanel that we call ImageView. The ImageView component
can show an image, given a file and a zooming factor. Also, given a set of feature values, such as a list of
line start and end positions and column start and end positions, it can draw them appropriately at any
zoom value.
The controller of the Document Feature Viewer comprises a JFrame top-level window, a
Browse button that allows the user to browse directories and files and select an image file, and a Feature
button that triggers the scanning and feature extraction in the back-end. The contents of the JFrame
consist of an ImageView widget (described briefly above). The Zoom-In and Zoom-Out buttons allow
the user to examine the image and feature tracing more closely or from a farther viewpoint.
[Figure 7.1: screenshot of VisualGrader (Readability Evaluation Tool) with a "Dealing With Diabetes" document loaded, showing the Browse and Zoom In controls and the computed score.]
Figure 7.1: A document loaded in VisualGrader. The document received a rating of 3.6.
8. Conclusion
The methods described in this paper demonstrate that health documents within a particular
readability level have distinct stylistic features. We have been able to correctly extract style-based features
that provide a measurable distinction between documents of various readability levels. Moreover, we
were able to do this using only one page from each document. We provide a user-friendly tool, with a
Graphical User Interface, allowing health-care professionals to easily use the scoring algorithms we have
developed and presented in this thesis.
9. Acknowledgments
I would like to express my deepest gratitude to the following people: Dr. Lucila Ohno-Machado,
who brought me to DSG and whose course (Medical Decision Support, fall 2006) taught me many of the
machine learning fundamentals I used in this project. Dr. Qing Zeng-Treitler, for giving guidance that led
to the start and development of this project. Ms. Dorothy Curtis, for helping in many practical ways all
along this project, and for reviewing this document. Dr. Bill Long, who ensured that the findings of this
project were reported in a manner that meets MIT's thesis standards. Anne Hunter, for being the best
academic advisor and the most useful person to students in the EECS department. Prof. Arthur Smith, of
the EECS Graduate Admission. Mr. Peter Majeed, MIT '06, for helping early in this project with the
time-consuming task of converting many web pages to PDF format in order to form one of the data sets.
This work was supported by NIH grant R01-DK-075837-01A, and by grants from the
Department of Electrical Engineering and Computer Science at MIT.
10. References
[1] Zakaluk BL, Samuels SJ, editors. Readability: Its Past, Present, and Future. Newark: International Reading Association; 1988.
[2] Root J, Stableford S. Easy-to-Read Consumer Communications: A Missing Link in Medicaid Managed Care. Journal of Health Politics, Policy, and Law. 1999;24:1-26.
[3] Leroy G, Eryilmaz E, Laroya BT. Characteristics of Health Text. AMIA Annu Symp Proc. 2006;2006:479-483.
[4] Doak CC, Doak LG, Root JH. Teaching Patients With Low Literacy Skills. Lippincott Williams & Wilkins; 1996.
[5] Ad Hoc Committee on Health Literacy for the Council on Scientific Affairs, AMA. Health Literacy: Report of the Council on Scientific Affairs. JAMA. 1999;281:552-557.
[6] McLaughlin GH. SMOG Grading: A New Readability Formula. Journal of Reading. 1969.
[7] Readability formula: Flesch-Kincaid grade. http://csep.psyc.memphis.edu/cohmetrix/readabilitvresearch.htm Retrieved Feb 23, 2007.
[8] The Fog index: a practical readability scale. http://www.as.wvu.cdu/-tmiles/fog.htmi Retrieved Feb 23, 2007.
[9] Flesch R. A New Readability Yardstick. Journal of Applied Psychology. 1948.
[10] Rosemblat G, Loga R, Tse T, Graham L. Text Features and Readability: Expert Evaluation of Consumer Health Text. MEDNET 2006. In press.
[11] Zeng-Treitler Q, Kim H, Goryachev S, Keselman A, Slaughter L, Smith AC. Text Characteristics of Clinical Reports and their Implications for the Readability of Personal Health Records. Medinfo 2007.
[12] Gemoets D, Rosemblat G, Tse T, Loga R. Assessing readability of consumer health information: an exploratory study. Medinfo 2004;11(Pt 2):869-73.
[13] Ownby RL. Influence of vocabulary and sentence complexity and passive voice on the readability of consumer-oriented mental health information on the internet. AMIA. 2005:585-8.
[14] Kim H, Goryachev S, Rosemblat G, Browne A, Keselman A, Zeng-Treitler Q. Beyond Surface Characteristics: A New Health Text-Specific Readability Measurement. AMIA. 2007.
[15] Zeng-Treitler Q, Goryachev S, Tse T, Keselman A, Boxwala A. Estimating Consumer Familiarity with Health Terminology: A Context-based Approach. Journal of the American Medical Informatics Association. 2008 May-Jun;15(3):349-356.
[16] Osborne H. Health Literacy From A To Z: Practical Ways To Communicate Your Health. Sudbury, MA: Jones and Bartlett; 2004.
[17] Halliday MA, Hasan R. Cohesion in English. London: Longman; 1976.
[18] McNamara DS, Kintsch E, Songer NB, Kintsch W. Are Good Texts Always Better? Interactions of Text Coherence, Background Knowledge, and Levels of Understanding in Learning from Text. Cognition and Instruction. 1996;(14):1-43.
[19] Zeng Q, Goryachev S, Weiss S, Sordo M, Murphy S, Lazarus R. Extracting principal diagnosis, co-morbidity and smoking status for asthma research: evaluation of a natural language processing system. BMC Medical Informatics and Decision Making. 2006;6:30.
[20] Mosenthal P, Kirsch I. A New Measure of Assessing Document Complexity: The PMOSE/IKIRSCH Document Readability Formula. Journal of Adolescent & Adult Literacy. 1998;41(8):638-57.
[21] Doak CC, Doak LG, Root JH. Teaching Patients With Low Literacy Skills. 2nd ed. Lippincott Williams & Wilkins; 1996.
[22] Shapiro LG, Stockman GC. Computer Vision. New Jersey: Prentice-Hall; 2001. pp. 279-325.
[23] Shi J, Malik J. Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI). 2000;22(8):888-905.
[24] Viola P, Jones M. Robust Real-time Object Detection. Second International Workshop on Statistical and Computational Theories of Vision: Modeling, Learning, Computing and Sampling. 2001.
[25] Wang C-H, Srihari SN. Object Recognition in Structured and Random Environments: Locating Address Blocks on Mail Pieces. AAAI-86. 1986.
[26] Pham DL, Xu C, Prince JL. Current Methods in Medical Image Segmentation. Annual Review of Biomedical Engineering. 2000;2:315-337.
[27] Kandula S, Zeng-Treitler Q. Creating a Gold Standard for the Readability Measurement of Health Texts. AMIA Annu Symp Proc. 2008.
[28] Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics; 2001. p. 41.
[29] Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics; 2001. p. 42.
[30] Russell S, Norvig P. Artificial Intelligence: A Modern Approach. Prentice Hall; 2003. pp. 735, 888.
[31] Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics; 2001. p. 417.
[32] Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics; 2001. pp. 371-377.
[33] Robles V, Bielza C, Larrañaga P, Gonzalez S, Ohno-Machado L. Optimizing logistic regression coefficients for discrimination and calibration using estimation of distribution algorithms. TOP (Springer Berlin/Heidelberg). 2008;16:345-366.
[34] Aydın D. A Comparison of the Sum of Squares in Linear and Partial Linear Regression Models. Proceedings of World Academy of Science, Engineering and Technology. 2008;32.
[35] Fawcett T. ROC Graphs: Notes and Practical Considerations for Researchers. 2004.
[36] Zweig M, Campbell G. Receiver-Operating Characteristic (ROC) Plots: A Fundamental Evaluation Tool in Clinical Medicine. Clin Chem. 1993;39(4):561-577.
[37] Theodoridis S, Koutroumbas K. Pattern Recognition. Academic Press; 1999. pp. 341-342.
[38] Liu H, Motoda H. Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers; 1998.