Using Style for Evaluation of Readability of Health Documents - Thesis by Freddy N. Bafuka

Beyond Text Analysis: Image-Based Evaluation of Health-Related Text Readability Using Style Features

by Freddy Nole Bafuka
S.B., Computer Science & Electrical Engineering, M.I.T., 2006
Research Fellow, Decision Systems Group (DSG), Harvard Medical School

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, May 2009.

Copyright 2009 Freddy Nole Bafuka. All rights reserved. The author hereby grants to M.I.T. permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole and in part in any medium now known or hereafter created.

Author: Freddy Nole Bafuka, Department of Electrical Engineering and Computer Science, May 27, 2009
Certified by: William J. Long, Principal Research Associate, Computer Science & Artificial Intelligence Lab, MIT, Thesis Supervisor
Certified by: Dorothy Curtis, Research Scientist, Computer Science & Artificial Intelligence Lab, MIT; DSG Affiliate, Thesis Supervisor
Accepted by: Arthur C. Smith, Professor of Electrical Engineering, Chairman, Department Committee on Graduate Theses

Abstract

Many studies have shown that the readability of health documents presented to consumers does not match their reading levels.
An accurate assessment of the readability of health-related texts is an important step in providing material that matches readers' literacy. Current readability measurements depend heavily on text analysis (NLP), but neglect style (text layout). In this study, we show that style properties are important predictors of documents' readability. In particular, we build an automated computer program that uses documents' style to predict their readability score. The style features are extracted by analyzing only one page of the document as an image. The scores produced by our system were tested against scores given by human experts. Our tool shows stronger correlation to experts' scores than the Flesch-Kincaid readability grading method. We provide an end-user program, VisualGrader, which provides a Graphical User Interface to the scoring model.

Thesis Supervisors:
William J. Long, Principal Research Associate, Computer Science & Artificial Intelligence Lab, MIT
Dorothy Curtis, Research Scientist, Computer Science & Artificial Intelligence Lab, MIT; DSG Affiliate

Table of Contents
1. Introduction and Motivation 4
2. Background 5
3. Feature Extraction 12
4. Machine Learning Models Used 22
5. Results 30
6. Discussion 61
7. Real-World Usage 65
8. Conclusion 68
9. Acknowledgments 69
10. References 70

1. Introduction and Motivation

Readability is defined as the ease with which a document can be read[1]. Many studies have shown that the readability of the health information provided to consumers does not match their reading levels[2]. Even though healthcare providers and writers have tried to make more readable materials, most patient-oriented web sites, pamphlets, drug labels, and discharge instructions still require a tenth grade reading level or higher[3].
More than half of consumer-oriented web pages present college-level material[3]. A study by Doak et al found that patients, who may be more stressed, read on average five grades lower than the last year completed in school[4]. Misunderstandings of health information have been linked to higher risk of consumers making unwise health decisions, which in turn leads to poorer health and higher health care costs[5]. To provide more readable health texts, the Decision Support Group under Qing Zeng-Treitler at Brigham and Women's Hospital will develop a computer program to translate texts to a readability level appropriate to several consumer reading levels. This program will be based on statistical natural language processing techniques. Providing health texts of appropriate readability to consumers should help improve comprehension, self-management and, potentially, clinical outcomes[3]. We envision the following scenario: Gary is a diabetes patient with poor metabolic control. Laura, a nurse educator, talks with Gary about exercise and weight control. During their conversation, Laura senses that Gary's literacy level is inadequate for use of the latest teaching materials on the importance of exercise, which are written for average (seventh to ninth grade) reading ability. With the help of a readability adjustment software tool, she quickly generates a simplified version for Gary to take home. Because Gary can understand the materials, he is motivated to follow their advice and exercise, which in turn helps to control his illness and prevent complications. In order to translate a text from a higher to a lower literacy level, we must be able to correctly assess its readability. Having an accurate evaluation of a document's readability level provides guidance as to which tools or algorithms will be most appropriate for its translation to a more easily readable target.
The goal of this study was to develop and evaluate a new approach for assessing the readability of health-related documents.

2. Background

2.1 Previous Work

Several well-known word processing software products, such as Microsoft Word, WordPerfect, and Lotus, provide generalized readability evaluation tools using text analysis methods. Some of the features used in text analysis are extracted using Natural Language Processing (NLP) tools. In this thesis, we use the terms NLP and text analysis interchangeably. Among the most widely used methods based on text analysis are the Simple Measure of Gobbledygook (SMOG) formula[6], the Flesch-Kincaid grading formula[7], and the Gunning Fog Index (GFI)[8]. These methods compute readability scores based on text unit length, and yield scores that can be interpreted as the number of years of education needed to easily read the document. The Flesch-Kincaid method converts Flesch Reading Ease scores[9] into a grade level. The SMOG formula computes readability scores using the number of sentences and the number of polysyllabic words, that is, words with three or more syllables. The GFI method uses sentence length and the percentage of polysyllabic words. While these methods perform well for general use, several studies have shown they are often inadequate for health-related documents, as they fail to capture many important features unique to health documents[10],[11]. In addition, some of the measurements used by these methods become inappropriate for the evaluation of some health-related fields[12],[13]. In particular, they do not measure text cohesion or sentence coherence, which studies have found to be an essential part of easily understanding the English language[14],[15].
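For concreteness, the three classic formulas can be sketched in a few lines of Python. This is a minimal illustration, not any of the published tools: the vowel-group syllable counter is a crude stand-in for the syllable rules the formulas assume.

```python
import math
import re

def count_syllables(word: str) -> int:
    """Crude heuristic: count groups of consecutive vowels (min. 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability_scores(text: str) -> dict:
    """Compute the three text-unit-length formulas discussed above."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    n = len(words)
    syllables = sum(count_syllables(w) for w in words)
    poly = sum(1 for w in words if count_syllables(w) >= 3)  # polysyllabic words
    return {
        # Flesch-Kincaid grade level[7]
        "flesch_kincaid": 0.39 * n / sentences + 11.8 * syllables / n - 15.59,
        # SMOG grade[6], normalized to a 30-sentence sample
        "smog": 1.043 * math.sqrt(poly * 30 / sentences) + 3.1291,
        # Gunning Fog Index[8]
        "gfi": 0.4 * (n / sentences + 100 * poly / n),
    }
```

All three depend only on surface counts (sentence length, word length in syllables), which is precisely the limitation the studies cited above point out for health-related text.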
A study by Rosemblat et al[10] suggests that the "ability to communicate the main point" and familiarity with terminology should be considered as additional properties in measuring health text readability. In addition, a study by Ownby[13] suggests that vocabulary complexity, sentence complexity and use of passive voice are the appropriate measures of text readability. Zeng-Treitler et al[11] pointed out that electronic health records (EHRs), consumer health materials, and scientific journal articles exhibit many syntactic and semantic properties that are unaccounted for by existing readability measurements. These are examples of text properties not measured by formulas such as Flesch-Kincaid, SMOG and GFI. Hence the need for specialized evaluation tools for health-related documents. Several such specialized systems have been developed to evaluate the readability of health documents using a Natural Language Processing approach. These systems are based on features such as frequency of certain parts of speech, sentence lengths, parse tree structures, text cohesion and so forth. Kim et al[16], for instance, developed a new readability measurement for health-related text, based on differences in semantic and syntactic features, in addition to the text unit length already used by the general methods mentioned earlier. Zeng-Treitler et al[17] developed a new measurement of consumer familiarity with health terminology, which can be used as an additional predictor in assessing the readability of health-related documents. These new types of measurements are providing better assessment of health-related documents, where generalized methods might not be optimal. For instance, a common feature used by the generalized methods mentioned earlier is the number of polysyllabic terms.
A word with three or more syllables is considered a difficult word, and thereby increases the difficulty of the text that contains it, whereas words with fewer syllables are considered easier[18]. While this might be the case for general texts (in the English language), it is not the case when it comes to some health-related documents. Many words with the same number of syllables can have varying levels of difficulty[17]. The word "diabetes", for instance, has 4 syllables but is generally a term with which a healthcare consumer would be familiar. The words "aspirin", "Anbesol" and "aplisol" have the same number of syllables but varying difficulty levels[17].

2.2 A New Approach: Style-based Features

While many text evaluation systems have used the Natural Language Processing approach, the effect of the layout of the text on readability has been little explored. In this study we show that a strong relation exists between certain textual layout features of a document and its readability level. We extract these features using an image-based approach. Rather than explore what the text says and how it says it, we consider how the text looks. Throughout this document we use the term style to refer to text layout, and use both terms interchangeably.

2.3 Advantages of Image-Based over NLP-Based Evaluation of Text Readability

This image-based evaluation, which converts the text into an image, presents several advantages over the traditional NLP approach. Many of the features used in Natural Language Processing techniques for evaluation of text difficulty fail when applied to health documents[10]. As mentioned above, the number of syllables per word may not always be an optimal indicator of the difficulty level of health-related text. Secondly, health documents come in a variety of formats: printed journals, pamphlets, medical records, web pages, etc.
An NLP-based system depends heavily and entirely on accessing the actual text. Such a system would have to be able to parse HTML code, for instance, to extract the text of a web page. It would also have to be able to receive a PDF as input and have the appropriate tools to parse that type of file as well. This need for flexibility in the type of input adds a great overhead for an NLP-based system intended for general use. A commonly encountered challenge with healthcare-related NLP tools is that they are usually difficult to adapt, generalize and reuse[19]. Very few NLP-based systems developed by one healthcare institution have successfully been adapted for use by an unrelated institution[19]. One reason is that medical NLP tools are often overly customized to domain- or institution-specific document formats and other text characteristics[19]. Moreover, some medical documents are not available electronically, but only in printed form. Some medical records, or handwritten notes by doctors and nurses, are common examples. An NLP-based system would not be able to process such document formats; the document's text would have to be extracted first. In an image-based system, however, the document can simply be scanned and the system can work with its image. Lastly, the features used in NLP-based systems are not consistent across all natural languages. For instance, some languages have, on average, more syllables per word. An NLP-based system that uses such features would have to be retrained in each specific language in order to perform accurately. While natural languages differ widely in content, and in features such as the length of words and sentences, they are by and large similar in text style: a journal publication will almost always be formatted in columns, for instance; a title will often be in bold and bigger than the rest of the text, etc. Therefore, it is unlikely that a system based on style will need retraining for use in another language.
This language-independent aspect of the style-based system is a great asset in the health field, where patients come from various language and cultural backgrounds.

2.4 Previous Work on Text Evaluation Using Text Layout Features

While many studies have acknowledged the importance of text layout in assessing the readability of health documents, very little exploration of it has been done[17]. Two previous studies, Mosenthal and Kirsch[20], and Doak and Root[21], have explored text layout as part of a readability scoring scheme. Mosenthal and Kirsch developed the PMOSE/IKIRSCH method for measuring the readability of graphs, tables and illustrations, in turn giving a measure of the readability of a document. Doak and Root developed the Suitability Assessment of Materials (SAM), which also attempts to measure text organization, layout and document design. However, these two systems are complex and not computerized. In this study we provide a fully automated, artificially intelligent approach to extracting various style features, evaluating them and constructing a regression model to map documents' style features to their readability level. As far as we know, there is no other computer tool of the kind we developed and present here.

2.5 Components of this Thesis

The general goal of the thesis is to assess the readability of health texts, using only the style of the document. More specifically, we would like to perform two tasks: (i) Build a computerized model which, given an arbitrary health-related document as input, will output a readability score for that document, using its style properties. We would like the score given by the model to be as close as possible to the score that a human expert in readability would give to the same document. Throughout the rest of this thesis, we will refer to this task as the score prediction problem, or the regression problem.
We also use the terms score and grade interchangeably, to refer to the numerical measure of a document's readability. We use the terms gold standard and target to refer to the score given to a document by a human expert. (ii) Build a computerized model which, given an arbitrary health-related document as input, will classify it as either easy-to-read or hard-to-read, using its style properties. We would like the classification decision of this model to agree as often as possible with the decision of a human reader on the same documents. Throughout the rest of this thesis, we will refer to this task as the classification problem. We also use the terms target or ground truth to refer to the class ("easy" or "hard") given to a document by a human. In order to correctly relate documents' style to a readability score, or to the easy- or hard-to-read classes, we will need three main components: (i) A numerical representation of stylistic properties of documents. Chapter 3 details which properties (features) we chose and how we quantify them. We refer to this part of the project as the feature extraction step. Because we approach each document as an image, and use image processing techniques to extract the style features, we also refer to this part of the project as the image processing step. We will often use the term feature to refer to an actual property (right margin, for instance) and feature value to refer to the actual measurement of that property for a specific document (e.g., "Document 1 has a margin of 0.1"). We also refer to the features as variables or predictors and use both terms interchangeably. (ii) A machine learning model, which provides a mathematical relationship between the feature values of documents and their score or class. A machine learning model is built using a set of examples.
In this project, an example is a document whose feature values and human-expert-given score or class are used as part of building the machine learning model with the optimal parameters that define the relation between document features and scores or classes. We refer to this building process as the training of the model. We refer to the set of documents used in training as the training set. We evaluate the performance of a model by comparing its output (score or class) for new documents (not used in the training set) to the score or class given by the human observers for the same documents. We refer to this group of new documents used for the purpose of evaluating a model's performance as the testing set (or simply test set). Chapter 4 describes the various machine learning models we chose to use, and the processes by which we evaluate their performance. Chapter 5 reports and analyzes the results. (iii) A set of documents from which we can create training and testing sets. More specifically, we needed three data sets: (1) a set of documents with scores given by human expert reviewers, to be used for the score prediction task; (2) a set of easy-to-read documents; and (3) a set of hard-to-read documents. We consider the last two sets as one data set for the classification task. Throughout this thesis we use the terms data or dataset to refer to documents, and in particular, to the set of feature values extracted from them. We provide more information about the data sets in the following subsections.

2.5.1 Feature Extraction and Image Processing

This project can be thought of as having an image processing step, which provides input to the machine learning step. The feature extraction part of this project uses some common image processing techniques. In particular, many of the features are extracted by detecting boundaries between text and background (or whitespace) regions.
This process of breaking an image into meaningful regions for further analysis, using pixel similarities, is referred to as image segmentation[22]. Shi and Malik[23] presented a mathematical representation of image segmentation as the partitioning of a connected graph, with pixels acting as nodes connected to neighboring pixels by weighted edges. The segmentation task, then, consists of minimizing the weights between pixels belonging to different objects and maximizing them for pixels belonging to the same object. That is also the basic concept we use to differentiate between text and background areas. Essentially, we attempt to extract some information on the presence of text in a page's image by carefully scanning through the page and detecting regions with sharp changes in color intensity. The resulting feature values are then fed into the building of a machine learning model, which makes a decision about the type of document (easy vs. hard, or score). This technique is also similar to the one used in some object recognition methods, such as in the well-known face-detection system developed by Viola and Jones[24]. The Viola and Jones system searches a gray-scale image for rectangular regions of sharp changes in color intensity. These features are then used by a machine learning model to recognize color-intensity changes that correspond to the set of boundaries needed to form a human face. The Viola and Jones method, however, uses human subjects to label hundreds of images of faces to form a training set. In this project, the feature extraction is fully automated. Image segmentation is also used in Optical Character Recognition (OCR) to detect text in an image. OCR is very heavily used in the processing and routing of mail[25].
OCR systems also depend on scanning an image to find the text boundaries, and the resulting features are used by a machine learning model to detect whether a character is present and which character it is. In the medical field, image segmentation is used widely in medical imaging for diagnosis, measuring tissue volumes, computer-guided surgeries and location of pathologies[26].

2.5.2 Expert-Rated Data Set

To provide a reliable set of documents for this and other health-related readability projects, the Decision Support Group under Qing Zeng-Treitler at Brigham and Women's Hospital carefully compiled a corpus (dataset) of 324 health-related documents[27]. To obtain a diverse sample, 6 different types of documents were collected: consumer education materials, news stories, clinical trial records, clinical reports, scientific journal articles and consumer education material targeted at kids. Table 2.1 shows the number of documents taken from each type.

Document Type / Count
Consumer education material: 142
News report: 34
Clinical trial record: 39
Scientific journal article: 38
Physicians' notes: 38
Consumer education material targeted at kids: 33

Table 2.1: Sample size of each document type used to form the 324-document data set with human expert scores.

The consumer education material documents were obtained from MedlinePlus, the National Institute of Diabetes and Digestive and Kidney Diseases, and the Mayo Clinic. The news report documents were taken from The New York Times, CNN, BBC, and TIME. The clinical trial documents were obtained from ClinicalTrials.gov. Scientific journal articles originated from Diabetes Care, Annals of Internal Medicine, Circulation (Journal of the American Heart Association), the Journal of Clinical Endocrinology and Metabolism and the British Medical Journal. Physicians' notes were obtained from the Brigham and Women's Hospital internal records.
Lastly, the consumer education material for kids came from the American Diabetes Association. With the data collected, a panel of 5 health literacy and clinical experts and a patient representative was assembled to assess the readability of the 324 documents. Each expert was asked to grade the documents using a 1-7 scale. Each expert carefully reviewed and graded the documents independently. Many documents were graded by more than one expert. In those cases, we used the average of the scores as the final gold standard for the document.

2.5.3 Easy and Hard-to-Read Data Set

The set of easy- and hard-to-read documents used in the project is a subset of the one used by Kim et al[16]. The easy data set consisted of 195 self-labeled easy-to-read health materials from various web information resources, including MedlinePlus and the Food and Drug Administration consumer information pages. They covered various topics on disease, wellness, and health policy. The hard data set consisted of 172 scientific biomedical journal and medical textbook articles on several topics, including various diseases, wellness, biochemistry, and policy issues.

3. Feature Extraction

The first implementation work was done in MATLAB R2007a, on a Dell Pentium Dual-Core, Windows Vista machine. In an effort to make the system more freely accessible, a Java implementation was also developed on the same machine. We present the details of both implementations later in this document. There are slight variations in the two implementations. We mention those when necessary, throughout the description of the method.
Given that the Java implementation is the most recent and contains more features, we consider it the main implementation of this thesis, and the one that we expect others to use.

3.1 Conversion to Image Format

There were 324 expert-rated documents used, with readability scores ranging from 1 (easiest to read) to 7 (hardest to read). Additionally, there was a set of 195 documents simply considered "easy to read", compared to another set of 172 documents considered "hard to read". Each document was converted, from its original format, to a set of images, with one image for each page. However, we used style features from just one page: the "first page" of the document. Cover pages, table-of-contents pages, and title pages were ignored. Hence we refer to the "first page" of the document as the page on which the actual content of the document begins. The size of the images obtained from the documents differed, depending on the original format and size of the document. All images were converted to gray scale, with possible pixel values of 0-255.

Feature Extraction and Evaluation Pipeline

For each document, the following steps are followed to extract the features.
* The document is converted from its original format to an image format.
* Each page is converted to one image.
* The image is converted to gray-scale.
* Each resulting image is preprocessed and a new image is returned.
* The preprocessed image is sent as input to the DocFeatureExtractor module, which returns a numerical value for each feature, for that image.

Features Extracted

From each image, thirteen features were extracted, to create a thirteen-variable observation. The features
extracted were:

- Average white space (WSR)
- Number of columns (CLC)
- Number of lines per column (LNC)
- Left margin size (LMS)
- Right margin size (RMS)
- Average gray-scale value (AGS)
- Top margin size (TMS)
- Bottom margin size (BMS)
- Interline Space to Line Size Ratio (LSR)
- Interline Space Ratio (ISR)
- Maximum to Minimum Line Size Ratio (MMR)
- Number of Colors (CLRC)
- Number of Pages (PGC)

The extraction of these 13 features is detailed in the following subsections. Not all features were used in testing the MATLAB implementation. In particular, the interline space ratio (ISR), the number of colors and the number of pages were not used. A few assumptions are made throughout the feature extraction stage. One is that the background color is lighter than the letters' color. The background value is determined by taking the average pixel value of the 4 leftmost columns of pixels. Another assumption is that all pages have a left margin with a width of at least 4 pixels.

3.1.1 Number of Columns

The number of columns in a page is determined by scanning the image of the page from left to right, examining a vertical strip that extends from the top to the bottom of the image and has a small width (equal to 1% of the total width of the image). The average value of the pixels within this vertical strip determines whether the strip falls within a margin region or within a column region. If the average value is less than the value of the background minus 1, the vertical strip is considered within a column region; otherwise, it is within a margin region. The number of transitions from margin to column and vice versa indicates the number of columns in the document.
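The column-detection scan just described can be sketched as follows. This is a simplified illustration, not the thesis's MATLAB or Java code: it assumes the page is a 2-D list of 0-255 gray values and counts margin-to-column transitions as the column count.

```python
def count_columns(img, background=255):
    """Count text columns by sweeping a thin vertical strip left to right.

    img: 2-D list of gray values (0-255), indexed as img[row][col].
    A strip whose mean pixel value is below (background - 1) is taken to be
    inside a column; each margin-to-column transition starts a new column.
    """
    height, width = len(img), len(img[0])
    strip_w = max(1, width // 100)  # strip width = 1% of the page width
    columns = 0
    in_column = False
    for x in range(0, width - strip_w + 1, strip_w):
        total = sum(img[y][x + dx] for y in range(height) for dx in range(strip_w))
        mean = total / (height * strip_w)
        dark = mean < background - 1
        if dark and not in_column:
            columns += 1  # margin -> column transition
        in_column = dark
    return columns
```

For example, a synthetic white page with two solid dark bands yields a count of 2; rotating the image 90 degrees and narrowing the strip to 1 pixel turns the same sweep into the line detector of section 3.1.2.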
Note that in the Java implementation the value of the background is always 255, on a traditional 0-255 gray scale (please refer to the preprocessing steps described later in this chapter). This step was not taken in the MATLAB implementation.

3.1.2 Number of Lines Per Column

The lines within a column are detected with the same algorithm used for column detection. The image is rotated 90 degrees counter-clockwise, and each line is detected as if it were a column. However, the values of some of the input parameters to the algorithm are modified to detect lines. The width of the vertical strip is exactly equal to 1 pixel, given that spaces between lines are much smaller than spaces between columns. The 1-pixel width was determined by trial and error.

3.1.3 Average White Space

The average white space was computed by measuring the width of the space unoccupied by letters at the beginning and at the end of each line in a column of text. The sum of this width for all lines, divided by the sum of the total width of all lines, gives the average white space ratio. To determine the space unoccupied by letters in a line, the line is scanned from left to right until a non-background color is detected. A similar technique is used by scanning from right to left, to find the width of the white space at the end of a line.

Figure 3.1. An illustration of the whitespace feature, which captures how much of the width of each line of text is unoccupied on both the right and left sides. In this example, the gray lines show the whitespace on the right side. (The sample page shown is a consumer passage on meal planning and carbohydrate counting.)
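The per-line scan described above can be sketched as follows. As a simplifying assumption, each detected text line is represented here by a single 1-D row of gray values rather than a full band of pixels.

```python
def average_white_space(lines, background=255):
    """Whitespace ratio over the text lines of a column (section 3.1.3).

    lines: list of 1-D lists of gray values (0-255), one per text line.
    Returns the summed leading and trailing whitespace width divided by
    the summed total line width.
    """
    blank_total = 0
    width_total = 0
    for line in lines:
        width = len(line)
        # scan left to right until a non-background pixel is found
        left = next((i for i, v in enumerate(line) if v < background), width)
        # scan right to left the same way
        right = next((i for i, v in enumerate(reversed(line)) if v < background), width)
        blank_total += min(left + right, width)
        width_total += width
    return blank_total / width_total if width_total else 0.0
```

A line whose text occupies the middle half of its width contributes a ratio near 0.5, matching the gray regions illustrated in Figure 3.1.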
3.1.4 Margins and Gray-Scale Value

The left margin is determined by the width (number of pixels) of the space between the left edge of the image and the start of the leftmost column. The resulting value is divided by the total width of the image, and gives the margin value used as a feature. The right margin is computed similarly, using the width of the space between the end of the rightmost column and the right edge of the image. Likewise, the top margin is determined by the length of the space between the top of the first line of text and the upper edge of the image, after preprocessing. Since the various columns of text might not start at the same line (one might be shifted down vertically, as is sometimes the case with the first column especially), the "first" line of text refers here to the first line of the column with the highest top edge. As lines within the columns are detected, the value of the highest line is updated as necessary. Similarly, the bottom margin is determined by the length of the space between the bottom edge of the lowest line and the bottom edge of the image, after preprocessing. We also account for the fact that one column may be "shorter", or end higher, than others, as is often the case with the last column of text. Therefore, the value of the lowest line is updated as the columns are scanned through during line detection. The line with the highest Y-value (lowest line) is retained for bottom-margin computation. The average gray-scale value of the page is a simple average of the values of all pixels in the image. It gives an indication of how "dense" the document is. Figure 3.2 shows the margin features, as well as the inter-column space.

Figure 3.2a. The widths of the orange rectangles are the left and right margins.
The heights of the light green rectangles are the top and bottom margins. The blue rectangle is the inter-column space, which we use to detect columns.
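The margin features can be sketched in a few lines. The thesis implementations are in Java and MATLAB; the Python sketch below is illustrative only, and assumes the page has already been binarized to 0 (text) and 255 (background) values, as in the preprocessing of section 3.2:

```python
def left_margin_ratio(img):
    """Width of the space between the left edge and the first dark
    pixel, divided by the total image width (the feature value)."""
    width = len(img[0])
    first_dark = min((row.index(0) for row in img if 0 in row), default=width)
    return first_dark / width

def average_gray(img):
    """Simple average of all pixel values: an indication of density."""
    return sum(sum(row) for row in img) / (len(img) * len(img[0]))
```

On a 10-pixel-wide page whose leftmost text pixel sits in column 2, the left-margin feature is 0.2; the right, top, and bottom margins follow the same pattern from their respective edges.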
Figure 3.2b. The original image of the page used in figure 3.2a. Note that the footer was ignored: the bottom margin is the height of the space between the bottom edge of the image and the last line of actual text detected.

3.1.7 Interline Space Ratio

The purpose of this feature is to capture the sparsity of the text: it measures how much of the vertical space is occupied by lines of text, and how much is background. It is analogous to the "average white space" feature, which measures horizontal white space. It is computed by adding up the heights of all lines of text within each column and dividing that sum by the total height of the image. The resulting value, subtracted from 1, gives the interline space ratio. Figure 3.3 shows the interline space.
Figure 3.3a. The interline spaces (heights of the gray rectangles) are shown above. As can be seen, the interline space can vary greatly between different pairs of adjacent lines.

Figure 3.3b. The original image used for illustration in figure 3.3a.

3.1.8 Interline-to-Line Size Ratio

This feature further captures the sparsity of the page. It complements the previous feature in that it captures the difference between a single-spaced and a double-spaced document. For instance, a page with two lines of text in a 60-point font can yield the same interline space ratio as a page with 12 lines in a 10-point font, because the fraction of vertical space taken up by the spacing between lines need not change. This feature, therefore, is the ratio between the average height of the lines of text and the average height of the interline spaces. Scientific papers, for instance, tend to be single-spaced and denser when published in a journal. The line-to-whitespace ratio is computed by adding up all the line heights and all the interline space heights, and taking the ratio of the two sums. Figure 3.4 shows the line sizes; the interline spaces are shown in figure 3.3a.
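Both sparsity features reduce to sums over the detected line heights. A minimal Python sketch (illustrative only; `line_heights` and `gap_heights` are assumed outputs of the line-detection stage):

```python
def interline_space_ratio(line_heights, image_height):
    """1 minus the fraction of the page height occupied by text lines."""
    return 1 - sum(line_heights) / image_height

def line_to_whitespace_ratio(line_heights, gap_heights):
    """Ratio of total text-line height to total interline-space height."""
    return sum(line_heights) / sum(gap_heights)
```

Three 10-pixel-tall lines on a 100-pixel-tall page give an interline space ratio of 0.7; the same lines separated by 5-pixel gaps give a line-to-whitespace ratio of well above 1, flagging a dense, single-spaced layout.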
Figure 3.4. In red, 3 different line sizes are pointed out.

3.1.9 Maximum-to-Minimum Line Size Ratio

The maximum-to-minimum line size ratio tries to capture the variation in font size in a document. A value of 1 means that the same font size has been used throughout the page. A value of 6 means that the biggest line of text is 6 times taller than the smallest. Such a large ratio probably indicates that several font sizes were used between the biggest and the smallest, which, in turn, indicates that the document is complex: the number of fonts used gives an indication of the number of subheadings in the document.

3.1.10 Color Count

The number of colors in each document was extracted by searching through each pixel of the page's image (before its conversion to gray-scale). Each color was assigned a unique number by multiplying its color channels (red, green, blue) by 1, 10^3, and 10^6, respectively, and summing the results. The color of each pixel was thus encoded as a unique number and added to a set. Once the colors of all pixels had been added to the set, the number of elements in the set gave the number of colors in the image.

3.1.11 Number of Pages

This is the only feature that is a property of the whole document, rather than of one page. The number of pages per document was extracted by searching through the directory of images and counting the number of images associated with the same document. Each image is interpreted as a page. Documents with more than 3 pages were given a page count of 3.

3.2 Preprocessing Stages

Before the above features are actually extracted and evaluated from a page's image, a few preprocessing steps occur.
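Before turning to the preprocessing stages, the color-count encoding just described is worth a quick illustration: because each 0-255 channel value is below 10^3, the weights 1, 10^3, and 10^6 map every RGB triple to a distinct integer (a Python sketch of the described procedure):

```python
def color_count(pixels):
    """Count distinct colors by encoding each (r, g, b) triple as
    r*1 + g*10**3 + b*10**6 and collecting the codes in a set."""
    return len({r + g * 10**3 + b * 10**6 for (r, g, b) in pixels})
```

For example, a page containing black, white, and one accent color yields a count of 3, regardless of how many pixels carry each color.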
3.2.1 Pixel Value Extrapolation

To create a clearer demarcation between the text-line areas and the background, darker pixels are given a value of 0 on a 0-255 gray scale, while lighter pixels are extrapolated to 255. This extrapolation eliminates unwanted side effects, such as the anti-aliasing introduced by graphic renderers.

3.2.2 Horizontal Slide Preprocessing

Even within the text regions of a document's page, a large amount of space is background. As a result, the average pixel value within a text region can still be very close to that of a background region. We therefore add a preprocessing step to increase the number of dark pixels within the text regions. The step consists of "sliding" the image horizontally, on itself, which thickens the letters in the horizontal direction. The result is greater contrast between margin (background-only) regions and text regions. See figure 3.5.

Figure 3.5a: Original image before the horizontal slide.

Figure 3.5b: Image after the horizontal slide, accentuating letter pixels.

3.2.3 Non-Content Text Removal

Often, documents come with text that should not be considered part of the document. For instance, when printing a web page, the web browser usually introduces a header or footer containing the URL of the page being printed. Such a header would cause a misrepresentation of the top margin, for instance. Therefore, as an extra preprocessing step, we remove all text that is "too" close to the edges.
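The two pixel-level steps above can be sketched together as follows (Python for illustration; the threshold of 128 and the slide distance of 2 are assumptions, as the thesis does not state the exact values used):

```python
def extrapolate(img, threshold=128):
    """Push darker pixels to 0 and lighter ones to 255, removing
    anti-aliasing artifacts (the pixel value extrapolation step)."""
    return [[0 if p < threshold else 255 for p in row] for row in img]

def horizontal_slide(img, slide=2):
    """Thicken letters horizontally: each pixel takes the darkest
    value among itself and up to `slide` left neighbors, increasing
    the contrast between text regions and margins."""
    return [[min(row[max(0, x - slide):x + 1]) for x in range(len(row))]
            for row in img]
```

A single dark pixel in a row of background thus spreads into a short dark run, so a text region's average value drops well below that of a pure-margin region.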
3.2.4 Differences between the Java and MATLAB Preprocessing

The Java implementation does not use the horizontal-slide preprocessing, for usability reasons: the operation is very slow in that implementation, whereas it is fairly rapid in MATLAB. For the Java implementation, pixel value extrapolation was the best alternative for accentuating foreground-background differences. The MATLAB implementation, on the other hand, does not use pixel extrapolation, as the horizontal slide provides adequate accentuation of the text areas. Our results later showed that either method provides adequate ability to extract features correctly. We recommend that any implementation use at least one of these two preprocessing steps.

3.3 Extraction

The feature extraction completed on all documents without failure, although with some warnings from the ColumnDetector module. Two types of situations can cause the ColumnDetector to emit a warning.

First, there are cases in which the detected right boundary of a column is such that a few lines in the column extend slightly beyond it. This normally happens just before the right margin of the document when the text is left-justified, for instance. As the scanning progresses, the portions of lines extending beyond the column boundary can trigger a false column-start event. Usually, no end will be detected for such a false column until the scanning reaches the right edge of the document. In these cases, the system infers that a "false end column" was started after the right boundary of the column, and ignores it. The Java implementation prints a warning on standard error stating "Warning: false end column. Ignoring..." Note that in this case there is no failure, but rather the prevention of one.
The second type of warning occurs when a column was started much earlier in the document (closer to the left edge), and yet no end to the column is detected by the time the scanning reaches the right edge of the document. In this case, the end of the page (right edge) is considered the end of the column. The Java implementation prints a warning on standard error stating "Warning: column end was not found. Assuming end of page..." The MATLAB implementation emits a warning as well, but does not use different warning messages to distinguish between the two cases. Of the 324 page images used, fewer than 30 generated a warning, and virtually all were warnings of the first type.

4. Machine Learning Algorithms Used

To predict the scores of the expert-rated documents, we needed a regression algorithm, since the score values are continuous, not discrete. We chose to train and test two different regression models: (i) Linear Regression and (ii) k-Nearest Neighbor.

To classify between easy-to-read and hard-to-read documents, we needed a classification algorithm. We chose to train and test two classification models: (i) SVM and (ii) Logistic Regression.

From each document, one page was used to extract style features, except, of course, for the page count, a property of the entire document. The Java feature-extractor module extracts 13 features per document, while the MATLAB implementation extracts 10, as discussed earlier. For the purpose of this section, we will refer to m as the number of features extracted per document: for the Java implementation, m = 13, and for the MATLAB counterpart, m = 10. We refer to the set of m feature values extracted from a particular document as the input from that document. Let D be the vector of inputs D = (D_1, D_2, ..., D_n) obtained from a particular data set. For the expert-rated data set, n = 324. For the easy-to-read and hard-to-read data set, n = 195 + 172 = 367.
Each input D_i = (d_i1, d_i2, ..., d_im) is a vector of feature values extracted from the i-th document in a given data set. Therefore, D can be thought of as an n-by-m matrix. Let G_i be the grade given to the i-th document by the experts in the expert-rated data set; G can therefore be thought of as an n-by-1 vector, with n = 324. And let L_i be the label (0 for easy, 1 for hard) given to the i-th document in the easy- and hard-to-read data set; L can therefore be considered an n-by-1 vector, with n = 367.

4.1 Regression Models for Prediction of Experts' Scores

In the prediction of the experts' scores, our goal is, ideally, to create a score function S such that S(D_i) = G_i. In practice, however, we expect some level of error, which we define as: error(i) = |G_i − S(D_i)|.

4.1.1 Linear Regression

Linear regression is a mathematical model that represents the output (the score, in this case) as a linear function of the input variables. It often provides an adequate interpretation of the effect of the inputs on the output values[28]. In the next section, we consider an alternative model that does not make the assumption of a linear relationship between input and output. Linear regression formulates S(D_i) as follows:

    S(D_i) = B_0 + SUM_{j=1..m} B_j * d_ij    (4.1)

where the m+1 constant coefficients (also called parameters) B_0, B_1, ..., B_m are unknown. Taking all inputs (observations) into account, we can rewrite equation 4.1 as follows:

    S(D) = B_0 + SUM_{j=1..m} B_j * D_:j    (4.2)

where D_:j represents the j-th column of D. The constant coefficients make S a linear model. The essential part of training this linear regression model is finding the best set of B_j parameters. There exist several criteria for determining the "best" set of B_j values. We chose to use the residual sum of squares (RSS), which is the most widely used measure for comparing such sets of parameters[29]. Given a vector of parameters B = (B_0, B_1, ..., B_m), we define the residual sum of squares as the sum of the squares of the errors:

    RSS(B) = SUM_{i=1..n} (G_i − S(D_i))^2    (4.3)
           = SUM_{i=1..n} (G_i − B_0 − SUM_{j=1..m} B_j * d_ij)^2    (4.5)

We therefore define the best value for the vector of parameters as follows:

    B* = argmin_B RSS(B)    (4.6)

We found the best set of parameter values by using MATLAB's glmfit function. The glmfit function takes three main inputs:

(i) An n-by-m matrix, representing n observations (in our case, document inputs), each having m predictor values (in our case, feature values). For this input, we provided a K-by-m matrix constructed from K rows of the matrix D described above. We describe this further in the cross-validation section later in the paper.

(ii) An n-by-1 vector, representing the target values. In our case, the targets are the experts' grades. For this input, we provided a K-by-1 vector constructed from the K corresponding elements of the G vector described above. We give more details in the cross-validation section.

(iii) A specification for a distribution. We used a "normal" distribution, which causes glmfit to apply linear regression as formulated in equation 4.1.

We used the default values for all other inputs to glmfit. The glmfit function outputs a vector of m+1 elements, containing the values for B_0, B_1, ..., B_m that minimize the residual sum of squares. Given a new document D_new, never seen before by the system, our model predicts its score by evaluating S(D_new), using formula 4.1, with the parameter values obtained from glmfit.

4.1.2 k-Nearest Neighbor

The k-Nearest Neighbor algorithm is a memory-based method and does not require any search for best-fit parameters. Each input D_i can be thought of as a point in an m-dimensional space. Given a new input D_new, never before seen by the system, we find the k points closest to it in distance and compute its score by taking the average of its neighbors' scores. We use Euclidean distance.
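In code, the prediction just described is compact (a Python sketch; the thesis uses its own Java and MATLAB implementations, and the names below are illustrative):

```python
import math

def knn_score(train_inputs, train_grades, d_new, k=3):
    """Average the grades of the k training inputs nearest to d_new
    under Euclidean distance."""
    nearest = sorted(range(len(train_inputs)),
                     key=lambda i: math.dist(train_inputs[i], d_new))[:k]
    return sum(train_grades[i] for i in nearest) / k
```

For a query at (0, 0) with training points (0, 0), (1, 0), (0, 1), and (10, 10) graded 2, 4, 6, and 7, the three nearest neighbors are the first three, so the predicted score is their average, 4.0.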
Let K be the set of the k closest points to D_new. We formulate S(D_new) for k-nearest neighbor as follows:

    S(D_new) = (1/k) * SUM_{i in K} G_i    (4.6)

Throughout this project, we used k = 3. Despite its simplicity, k-nearest neighbor is known to perform well, and it has been used successfully in many applications, including some image-processing tasks[30],[31]. We use it as a second alternative to linear regression, as it makes no assumption about the linearity of the input-to-output relationship.

4.1.3 Training and Testing Using Cross-Validation

For both the linear regression and k-nearest-neighbor models, we used 5-fold cross-validation to run a training-testing cycle.

(i) Linear Regression

The 324 inputs in D were randomly placed into 5 folds (or partitions) of approximately equal size (4 of the 5 folds had 65 inputs, while one had 64). The first fold, F_1, was used as the test set, while the other 4 partitions were combined to form a training set. Given the inputs in the training set and their corresponding grades, we used the glmfit function to find the best set of parameters for a linear regression model, as discussed in section 4.1.1 above. The resulting parameters were then used to apply the score function S to each of the inputs in F_1, according to equation 4.1. We then repeated the experiment (second cycle) using F_2, the second fold, as the test set, joining the remaining 4 folds to form the training set. We did the same for folds 3, 4, and 5, running 5 cycles in total. For each of those cycles, we recorded the average error. Let R_i be the set of inputs used as the training set when F_i is used as the testing set, and let S_i be the score function with the beta values obtained by using R_i as the input to glmfit.
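The fold construction and per-cycle bookkeeping can be sketched generically (Python for illustration; the `fit` and `predict` arguments stand in for the glmfit- or KNN-based machinery):

```python
import random

def cross_validated_avg_errors(inputs, grades, fit, predict, folds=5, seed=0):
    """Randomly partition the inputs into `folds` folds; each fold
    serves once as the test set while the rest form the training set.
    Returns the per-cycle average absolute errors."""
    idx = list(range(len(inputs)))
    random.Random(seed).shuffle(idx)
    parts = [idx[i::folds] for i in range(folds)]
    cycle_errors = []
    for test in parts:
        train = [i for i in idx if i not in test]
        model = fit([inputs[i] for i in train], [grades[i] for i in train])
        errs = [abs(grades[i] - predict(model, inputs[i])) for i in test]
        cycle_errors.append(sum(errs) / len(errs))
    return cycle_errors
```

The final average error is then the mean of the returned per-cycle errors, matching equations 4.7 and 4.8.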
We define the average error for a cycle i, avgErr_i, as follows:

    avgErr_i = (1/|F_i|) * SUM_{D_j in F_i} |G_j − S_i(D_j)|    (4.7)

We define the final average error for linear regression as follows:

    avgErr = (1/5) * SUM_{i=1..5} avgErr_i    (4.8)

We repeated the whole experiment several times, randomly generating 5 new folds each time, and found that the results were essentially the same.

(ii) k-Nearest Neighbor

In a similar manner to the process described in part (i) above, the 324 inputs in D were randomly placed into 5 folds. The KNN model was tested on each of the folds, while using the remaining 4 partitions as a training set. For KNN, a "training" set is simply the set used as the memory from which the k points closest to a test point are taken; there is no actual training stage comparable to the one for linear regression. More formally, in this experiment, if R_i is the set of inputs used as the training set when F_i is used as the testing set, then we define S_i as in equation 4.6, with the added constraint that K is strictly a subset of R_i. With S_i thus defined, we evaluate S_i(D_j) for all D_j in F_i. We perform this routine for all 5 folds. We also compute and record the average error exactly as described in part (i) above (but using S_i as defined in this section).

4.2 Classification Model for Recognition of Easy vs. Hard Documents

To determine whether a document is easy or hard to read, our ideal goal would be to create a classification function C such that C(D_i) = L_i. In practice, however, we can expect that some documents will be misclassified. Given a test set, we define the error of the classification model as the number of documents misclassified.

4.2.1 SVM

We use the Support Vector Machine (SVM) algorithm to build the first classification model. SVM treats the inputs as data points in an m-dimensional space. As mentioned earlier, m is the number of features (predictors) of each input.
Given data points from two different classes, the SVM algorithm attempts to find a boundary line, plane, or hyperplane (depending on the value of m) that spatially separates the points of the two classes into two regions. The SVM method uses just a few points from each class, called support vectors, that lie at the boundary of the classes[32]. The assumption is that the points at the boundary are the critical ones for defining a separating line. The line (or plane, or hyperplane) is chosen so as to maximize its distance from the support vectors in both regions. Given a new, unseen data point, an SVM model classifies it based on which side of the boundary it lies on. Therefore, the output of our SVM model is binary: either a 0 (easy to read) or a 1 (hard to read). It is possible for a test point to fall on the separating hyperplane. In that case, the point's class is ambiguous, and it is not classified; such a point is also not included in the classification accuracy measurements we describe later. It is also possible for a point to fall on the wrong side of the separating hyperplane, in which case it is considered misclassified.

We used MATLAB's implementation of the SVM algorithm, provided through the svmtrain function. The svmtrain function uses a linear kernel by default, which means that a simple dot product is used to determine the location of a test point with respect to the separating line, and the input is not transformed into another space (a transformation that is useful in some applications). The svmtrain function takes two arguments:

(i) An n-by-m matrix containing n observations, each having m predictors. In our case, we provided a subset of the matrix D of document inputs described earlier in this section.

(ii) An n-by-1 vector containing the ground-truth values for the inputs given as the first argument. For this argument, we provided the corresponding subset of the L vector of labels, described earlier.
Using Style for Evaluation of Readability of Health Documents-Thesis by Freddy N Bafuka The svmtra in function returns the bias and slope values for each dimension, for the separating hyperplane. 4.2.2 Logistic Regression Given an input, the SVM model gives us only its class label as output. It would be useful to have a measure of confidence for the classification of a document. Logistic regression is a mathematical model that provides a real value as raw output. The output, which is always between 0 and 1, can be interpreted as the probability that a given input is of the positive class--"hard to read", in our case. The higher the value, the more likely the document is hard to read. The smaller the value, the more like it is to be easy to read. How high the value of the output must be in order for the input to be classified as hard-to-read is a question that we analyze, in a search for the appropriate cutoff threshold. Logistic regression is regarded as an efficient supervised learning algorithm for estimating the probability of an outcome or class variable. In spite of its simplicity, logistic regression has shown successful performance in a range of fields. It is widely used in a many fields because its results are easy to interpret[33]. Let Di be a document from the easy- and hard-to-read data set, as described earlier. We define a variable z as follow: Z = B 0+ 8dil + B2 di+...+ m,d (4.9) where the for fo, i, ... , fl,,, are the unknown parameter values. The logistic regression model is described as follow: +(z)= l+e-' (4.10) The output from this function, 1(z), is what we define as the "raw output" of our logistic regression model. From equation 4.10, we deduce that higher z values will produce an output closer to 1, while smaller z values will yield raw output closer to 0. 
Hence, the key part of building the appropriate logistic regression model is to find the set of parameters that produces raw output close to 1 for the hard documents and close to 0 for the easy documents. We can again use the residual sum of squares to evaluate parameter values, as we did for linear regression earlier. Given a set R of documents labeled either 0 (easy) or 1 (hard to read) and a vector B of parameters, we define the RSS for this case as follows:

    RSS(B) = SUM_{D_i in R} (L_i − l(z_i))^2    (4.11)

We define the best set of parameters as in equation 4.6 (but using the RSS definition given in formula 4.11 above). We again found the best parameters by using MATLAB's glmfit function. The arguments passed to glmfit in this case are similar to those described in section 4.1.1, except that the observations are taken from the easy- and hard-to-read data set, the target values are correspondingly taken from the L vector described earlier, and we use "binomial" for the distribution argument (so that glmfit computes the parameters based on equations 4.9-4.11).

4.2.3 Training and Testing Classification Algorithms

(i) SVM

We used a 3-fold cross-validation scheme to train and test the SVM model. The 367 easy- and hard-to-read documents were randomly assigned to 3 folds of approximately equal size: 2 of the folds had 122 distinct documents each, and one had 123. We performed the first iteration (or cycle) of training and testing by using the first fold, F_1, as the test set. The remaining two folds were combined to form a training set, R_1. R_1, being therefore a 244- or 245-by-m matrix, was used as the first argument to svmtrain, as discussed in section 4.2.1. The corresponding ground-truth labels, a vector of 244 (or 245) elements, were passed as the second argument to svmtrain.
The testing stage consisted of using the hyperplane obtained from svmtrain to classify the 122 (or 123) inputs in F_1. The number of misclassified inputs was noted; we define that number as the error for the iteration. We performed a second and a third iteration in the same manner, using folds F_2 and F_3, respectively, as the test set. For each iteration, we also noted its error. The total error was obtained by adding the 3 error values from the 3 iterations. We define the average error of the SVM model as the total error divided by the total number of inputs tested across all iterations, 367. We report the correct rate, which we define as 1 − average error.

(ii) Logistic Regression

To build a logistic regression model, we also used a 3-fold cross-validation scheme, as described in the previous subsection. The 367 data points were again randomly placed into 3 folds of approximately equal size. During each of the 3 cross-validation iterations, one of the folds was used as a test set, while the other two were combined to form a training set. The training set was used as the first argument to glmfit (the observations), as described in the previous section. The corresponding labels of the training documents were used as the second argument (the vector of targets). We obtained the best set of parameters from the glmfit function's output. These parameter values were substituted into equation 4.9. We then obtained raw output values for the inputs in the iteration's test set, using equation 4.10. So, once all cross-validation iterations had been run, we had three logistic regression models, each with its own set of optimal parameters, and a raw output value for each of the 367 inputs, corresponding to its use as part of the test set associated with its fold.
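The threshold search applied to these raw outputs can be sketched as a sweep over the observed values, computing the sensitivity, the false positive rate, and the correct rate at each candidate cutoff (a Python sketch with illustrative names):

```python
def evaluate_thresholds(raw_outputs, labels):
    """For each candidate threshold (taken at the observed raw output
    values), return (threshold, true positive rate, false positive
    rate, correct rate). Label 1 = hard to read, 0 = easy."""
    pos = sum(labels)
    neg = len(labels) - pos
    results = []
    for t in sorted(set(raw_outputs)):
        tp = sum(1 for r, y in zip(raw_outputs, labels) if r >= t and y == 1)
        fp = sum(1 for r, y in zip(raw_outputs, labels) if r >= t and y == 0)
        results.append((t, tp / pos, fp / neg,
                        (tp + neg - fp) / len(labels)))
    return results
```

Plotting the true positive rate against the false positive rate over all thresholds yields the familiar ROC curve for the classifier.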
In order to test the performance of the logistic regression model, we needed a threshold to use as the cutoff between the two classes. A document whose raw output value was greater than or equal to the threshold was classified as class 1 (hard to read); a raw value below the threshold classified the document as easy to read (class 0). The threshold could be any value in the [0, 1] range. However, given the finite number of test inputs, only a finite number of threshold values make a difference in performance. Hence, we considered potential thresholds only at the observed raw output values; that is, each raw output value was tested as a potential threshold, for a total of 367 threshold values.

For each threshold value, we computed the true positive rate and the false positive rate. The true positive rate, also called the sensitivity, is the fraction of the positive test points that were correctly classified (as positive). In our case, the true positive rate is the number of hard documents whose raw output values were equal to or above the threshold (i.e., correctly classified) divided by the total number of hard documents in the test set. The false positive rate is the fraction of negative samples incorrectly classified. In this testing stage, the false positive rate is the number of easy documents whose raw output values were equal to or above the threshold (i.e., classified as hard) divided by the total number of easy documents in the test set. The specificity is defined as 1 − false positive rate. In addition to the true and false positive rates, we also computed the overall accuracy, or correct rate, of the model for each threshold value. We define the correct rate as the number of test documents correctly classified divided by the total number of documents tested (367).

5.
Results

The results reported here focus on the Java implementation of the system, as we expect it to be the most accessible implementation. We do, however, provide a summary of the results of the MATLAB implementation at the end of this chapter.

5.1 Prediction of Expert Scores

For the linear regression and k-nearest neighbor models, we report the correct rate, which we define as follows:

    correct rate = 1 − avgErr / 7    (5.1)

We defined avgErr, the average error, in section 4.1.3; it is simply a measure of the average difference between the output of the regression models and the gold standard from the experts. The division by 7 normalizes the measure, guaranteeing that the correct rate is between 0 and 1. A correct rate of 1 would mean that a model's output matched the experts' grades perfectly. Table 5.1 presents the correct rates for the two models, as the main result of this project.

    Regression Model      Correct Rate
    Linear Regression     0.875
    k-Nearest Neighbor    0.856

    Table 5.1. The correct rates of the regression models.

To visualize the performance of the two models, we provide plots that compare the output of the models to the gold standard. As explained in the previous chapter, after all cross-validation iterations have been performed, every input in the data set D has been used exactly once as part of a test set. We recorded these output values and hence had a score for each of the 324 inputs from the testing stage. We present a plot of the expert-given grades versus the output values of our models. More specifically, we begin by sorting the 324 gold standard scores in ascending order, to create a set of x-axis values. We order the models' outputs on the y-axis correspondingly. To reduce the number of points on the graph and dampen the effect of outliers, we take the average of every 10 consecutive values, obtaining 33 new values on each axis. We chose a group size of 10 arbitrarily.
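The grouping step for the calibration plots can be sketched as follows (a Python sketch; averaging the final, smaller group on its own is an assumption consistent with the 33 values obtained from 324 points):

```python
def group_averages(values, group_size=10):
    """Average every `group_size` consecutive values; a final smaller
    group (e.g. 4 values when 324 points are grouped by 10) is
    averaged on its own, yielding 33 values in that case."""
    return [sum(values[i:i + group_size]) / len(values[i:i + group_size])
            for i in range(0, len(values), group_size)]
```

Applying the same grouping to both the sorted gold scores and the correspondingly ordered model outputs produces the paired axis values for the plots.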
The resulting plot, which we call here a calibration plot, gives a visualization of how closely related our models' predictions are to the experts'. The calibration plot for an ideal model would yield points lying on a straight line passing through the origin Using Style for Evaluation of Readability ofHealth Documents-Thesis by FreddyN Bafuka 31 with a slope of 1. A completely random model will produce points scattered randomly on the graph, following no lines. The green line represents the ideal case. Figures 5.1 and 5.2 show the calibration plot for the linear regression and the k-nearest neighbor models, respectively. In order to compare both models visually, figure 5.3 plots output of both models. Calibration Plot for LR Model (Gsize = 10) 0o o0 00 00 011 00 3.5 /- 3 t , 4 , i | I I 3 3.5 4 , i i 4.5 5 Expert Scores 5.5 6 6.5 I 7 Figure 5.1: Linear Model'spredictions compared to the gold standard. Using Style for Evaluationof Readabilityof Health Documents-Thesis by Freddy N Bafuka Calibration Plot for Knn Model (Gsize=10) 7 0O 0 O 6 O O 0 0 0 0o 3 0 0 0 0 o 0o z 0 , -0 0 0 0 2 2.5 3.5 3 4 4.5 5 Expert Scores 5.5 6 6.5 7 Figure 5.2: k-Nearest-NeighborModel's predictions compared to the gold standard. Calibration Plot for both LR and Knn Models (Gsize=10) O D- O 0000 0zO O O O O O 0 D. O o O to o LR El KNN Ideal O EI 3 I 1 3.5 4 I I 4.5 5 Expert Scores I I I 5.5 6 6.5 7 Figure 5.3: LR and KNN predictions compared to the gold standard. Using Style for Evaluation of Readabilityof Health Documents-Thesis by Freddy N Bafuka 5.1.1 Variable Selection We were interested to know which features were most useful in forming the model. There exist several methods for variable selection. We chose to examine each feature as a one-predictor model and use linear regression to map its values to the gold standards values. The lower the deviance, the more useful the feature. More specifically, given D, the n-by-m matrix of n inputs Di= (dil, d 2, ... 
dim), we considered each column j of D as a data set of its own, D:j = (d1j, d2j, ..., dnj). We use the equation

    Sj(Di) = β0 + β1·dij    (5.2)

and solve for the set of parameters that minimizes the deviance between the outputs of Sj and the gold values, G. We found the minimal deviance for each feature j = 1, 2, ..., m; the feature with the smallest minimal deviance was considered the best. Using the deviance to compare two models is a common method for determining the relevance of a variable. For normal linear regression models, which is what we use here, the deviance is equivalent to the residual sum of squares, RSS [34]. We computed the RSS using the same tools described for linear regression in the previous chapter, and ranked the features from best to worst based on the RSS of their models:

1. Left Margin (LMS)
2. Page Count (PGC)
3. Maximum-to-Minimum Line Size Ratio (MMR)
4. Bottom Margin (BMS)
5. Average Gray Scale Pixel Value (AGS)
6. Average Number of Lines per Column (LNC)
7. Top Margin (TMS)
8. Line-to-Space Ratio (LSR)
9. Number of Columns (CLC)
10. Right Margin (RMS)
11. Whitespace Ratio (WSR)
12. Color Count (CLRC)
13. Interline Space Ratio (ISR)

We then built 13 different models, each on a subset of D using only the columns corresponding to the k best features, k = 1, 2, ..., 13, and computed the deviance of each. The first model included only the best feature, the second included the top two, and so on. Figure 5.4 graphs the change in deviance as more features are added to the model, in the order determined by the variable selection step.

Figure 5.4: A comparison of 13 models, each using the 1-13 best features, respectively. The deviance decreases as more features are added.
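The single-predictor ranking procedure above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the closed-form least-squares fit stands in for the regression tools of the previous chapter, and the feature names and data are invented.

```python
# Sketch of the single-predictor variable selection of section 5.1.1:
# fit S_j(d) = b0 + b1*d for each feature column j by least squares and
# rank features by the residual sum of squares (RSS) of the fit.
# Illustrative only; data and names are not the thesis data set.

def simple_fit_rss(x, y):
    # Closed-form least squares for one predictor; returns the RSS.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx if sxx else 0.0
    b0 = my - b1 * mx
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

def rank_features(columns, gold):
    # columns: {feature name: list of values}; lower RSS ranks higher.
    return sorted(columns, key=lambda name: simple_fit_rss(columns[name], gold))
```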
We labeled the first 5 features on the graph. As we expected, the model improves as more features are used, showing a reduction in deviance at each step. We note that there is no significant improvement in the model beyond the addition of the sixth best feature.

5.1.2 Typical Feature Values

In this section, we report the typical values of some features. Figures 5.5-5.8 show the values of the 4 best features, as determined in the previous subsection, plotted as a function of the experts' grades. It should be noted that since many documents have the same gold standard scores, and of those, many have the same feature values, some of the plots show few distinct points, as many points coincide. Each graph, however, does have a point for each of the 324 documents.

Figure 5.5: Values for the left margin, the best feature.

Figure 5.6: Values for the page count, the 2nd best feature.

Figure 5.7: Values for the maximum-to-minimum line size ratio, the 3rd best feature. One outlier point is not shown.

Figure 5.8: Values for the bottom margin, the 4th best feature.

Because many points on some of the feature value plots are identical, it is difficult to distinguish the points that represent typical feature values from the outliers.
We therefore also provide plots of the average feature values per score bucket. Each bucket, bi, is defined as the group of documents whose gold standard scores round to i. That is, a document with an expert-given grade of 4.4 is considered part of bucket 4, whereas a score of 4.5 would make a document part of bucket 5. In figures 5.9 through 5.14, we show the average feature value per bucket for the best 6 variables.

Figure 5.9: Average values for the left margin, the best feature.

Figure 5.10: Average values for the page count, the second best feature.

Figure 5.11: Average values for the maximum-to-minimum line size ratio, the 3rd best feature.

Figure 5.12: Average values for the bottom margin, the 4th best feature.

Figure 5.13: Average values for the average gray scale pixel value, the 5th best feature.
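The bucketing rule above (4.4 goes to bucket 4, 4.5 to bucket 5) can be sketched as follows; this is an illustrative helper, not the thesis code, and the example values are invented.

```python
# Sketch of the score buckets used in figures 5.9-5.14: a document whose
# expert grade rounds to i goes into bucket b_i, and each bucket reports
# its mean feature value. Note: int(score + 0.5) is used instead of
# Python's round(), because round() uses banker's rounding
# (round(4.5) == 4), while the thesis rule rounds 4.5 up to 5.

from collections import defaultdict

def bucket_averages(scores, values):
    sums = defaultdict(float)
    counts = defaultdict(int)
    for score, value in zip(scores, values):
        b = int(score + 0.5)      # half-up rounding, per the bucket rule
        sums[b] += value
        counts[b] += 1
    return {b: sums[b] / counts[b] for b in sums}
```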
Figure 5.14: Average values for the number of lines per column, the 6th best feature.

To provide a further visual comparison between the features, we plotted the average value per bucket of several features on the same graph. Because the values of the various features have different ranges, we normalized all values; each feature's values are normalized by its maximum value. Figure 5.15 shows a plot of the normalized average values by score category (bucket) for the best 4 features.

Figure 5.15: A comparison of the average values per grade category for the best 4 features: left margin (LMS), page count (PGC), max-to-min line size ratio (MMR), and bottom margin (BMS).

Figure 5.16 shows a similar plot for the fifth and sixth best features. However, because we have only two ranges to plot, the values are not normalized; instead, we use the left y-axis for one feature and the right y-axis for the other.

Figure 5.16: Average values per grade category for the average gray-scale pixel value (the 5th best feature) and the number of lines per column (the 6th best feature).

5.2 Classification of Easy vs.
Hard-to-Read Documents

For both the SVM and logistic regression models, we report the accuracy (or correct rate), which is the number of correctly classified documents divided by the total number of documents classified. We also report the sensitivity and specificity, as discussed in section 4.2. We report the values of these three indicators (accuracy, sensitivity, and specificity) for each of the 3 iterations of cross-validation. We also report the overall performance, which is the result computed from using the 3 cross-validation test sets as one test set of 367 documents. The sensitivity and specificity reported for the logistic regression model are those that correspond to the best threshold. We define the best (or optimal) threshold as the one that yields the highest accuracy.

We present the three performance indicators in tables 5.2, 5.3, and 5.4, as the main results for the classification experiment of this project. Table 5.2 presents the performance of the SVM classifier in each of the cross-validation folds, as well as the overall results. Table 5.3 presents the corresponding results for the logistic regression classifier. Table 5.4 summarizes a comparison of the two types of classifiers.

                 Fold 1    Fold 2    Fold 3    Overall
    Accuracy     0.8852    0.8862    0.9016    0.8910
    Sensitivity  0.8596    0.8793    0.8596    0.8663
    Specificity  0.9077    0.8923    0.9385    0.9128

Table 5.2: SVM performance in the 3-fold cross-validation and overall.

                 Fold 1    Fold 2    Fold 3    Overall
    Accuracy     0.9187    0.9180    0.9016    0.9128
    Sensitivity  0.8966    0.9474    0.8246    0.8895
    Specificity  0.9385    0.8923    0.9692    0.9333

Table 5.3: Logistic regression performance in the 3-fold cross-validation and overall.

    Model                 Accuracy   Sensitivity   Specificity
    SVM                   0.886      0.892         0.878
    Logistic Regression   0.913      0.895         0.928

Table 5.4: Accuracy, sensitivity (recall), and specificity of the support vector machine and logistic regression models.
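The three indicators reported in tables 5.2-5.4 can be computed from a confusion matrix in which "hard-to-read" is the positive class. The sketch below is illustrative, not the thesis code; the labels in the test are invented.

```python
# Sketch of the accuracy, sensitivity, and specificity computation of
# section 5.2, with class 1 = "hard-to-read" and class 0 = "easy-to-read".

def classification_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,  # hard docs caught
        "specificity": tn / (tn + fp) if tn + fp else 0.0,  # easy docs passed
    }
```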
5.2.1 Further Evaluation of Logistic Regression Classifiers

Table 5.3 shows the performance of the logistic regression model at only the most accurate threshold value. To get a more general measure of the logistic regression model, considering all 367 threshold values obtained from the cross-validation testing steps, we present two plots: (i) an accuracy plot and (ii) an ROC curve.

The accuracy plot shows the correct rate for each threshold value. To find the accuracy at a particular threshold, we simply classify all output values equal to or above the threshold as class 1 ("hard-to-read") and all others as class 0 ("easy-to-read"). We therefore expect low accuracy at the extreme threshold values of 0 and 1, as the model will classify all points as positive or negative, respectively, and higher accuracy in the middle region. Figure 5.17 shows the accuracy plot; the red marker indicates the best threshold value, which, as can be seen, corresponds to the highest point on the graph (maximum accuracy).

Figure 5.17: Accuracy of the classifiers as a function of their threshold value. The red marker indicates the best threshold, having the highest accuracy.

To further visualize the results, we plot the receiver operating characteristic (ROC) curve [35],[36]. On the x-axis of an ROC curve are the false positive rates corresponding to increasing threshold values; on the y-axis are the corresponding true positive rates. If we think of this task as recognizing hard documents and rejecting non-hard (easy) documents, the true positive rate measures the classifier's ability to recognize the desired object (hard documents).
This is also called the recall. The false positive rate measures the classifier's vulnerability to false alarms (an easy document having some resemblance to a hard document); it is the complement of the model's specificity. An ideal model would have a true positive rate of 1, indicating that the model recognized all hard documents presented to it, and a false positive rate of 0, meaning that no easy document fooled the classifier. In practice, however, a low false positive rate is achieved by using a higher threshold value, making the classifier more conservative in accepting documents as hard. The downside of making the classifier's acceptance criteria stricter is that some positive objects (hard documents) will be rejected. Conversely, a lower threshold value gives a more permissive classifier. Such a classifier is less likely to incorrectly reject a hard document, but having "lowered the bar," it also becomes more likely to accept an easy document as hard. Hence the need to find the right trade-off threshold value.

The true positive rate is equivalent to the sensitivity, and the false positive rate is 1 - specificity. The area under the ROC curve, or simply area under the curve (AUC), gives an estimate of how good the overall model is [37]. An ideal model will have an area of 1, while a random model will have an area of 0.5, as its ROC curve will be close to a straight line from (0,0) to (1,1). The AUC for the logistic regression model we trained was 0.938. Figure 5.18 shows the ROC curve.

Figure 5.18: The ROC curve for the logistic regression model. The red marker indicates the threshold with the highest accuracy.
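The ROC construction and the AUC estimate described above can be sketched as follows. This is an illustrative sketch, not the thesis code; the scores and labels in the test are invented, and the AUC here is estimated with the trapezoidal rule.

```python
# Sketch of the ROC curve of figure 5.18: true positive rate (sensitivity)
# against false positive rate (1 - specificity) as the threshold sweeps,
# with the area under the curve estimated by the trapezoidal rule.

def roc_points(scores, labels):
    pos = sum(labels)
    neg = len(labels) - pos
    # Sweep thresholds from high to low so the curve runs (0,0) -> (1,1).
    thresholds = sorted(set(scores), reverse=True)
    points = [(0.0, 0.0)]
    for th in thresholds:
        tp = sum(1 for s, l in zip(scores, labels) if s >= th and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= th and l == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    # Trapezoidal rule over the ROC points.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

A perfectly separating model yields an AUC of 1, while a model that ranks half of the pairs wrongly drifts toward 0.5, matching the discussion above.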
Finally, we looked at the raw output of the logistic regression model to find the typical output ranges of the two types of documents. Figure 5.19 shows all 367 test inputs evaluated in the cross-validation testing stages, using equations 4.9 and 4.10. The horizontal green line shows the best threshold.

Figure 5.19: The raw output of the logistic regression classifier on the 367 test inputs, plotted against z = b0 + d1*b1 + d2*b2 + ... + d13*b13, with hard and easy documents marked. The horizontal green line indicates the best threshold.

Multiple Optimal Thresholds

The outputs from the 3 cross-validation models yielded one best threshold value. It is important to mention, however, that this is not always the case. Because the sensitivity and specificity are computed from the number of correctly classified positive and negative inputs, respectively, while the accuracy depends on the correct classification of all inputs (positive and negative), it is possible to have more than one optimal threshold. Such thresholds will have the same accuracy, but varying sensitivity and specificity values. Since we found this to be a common occurrence, we present here an example from a logistic regression model trained on half of the data set (184 randomly selected documents) and tested on the remaining half (183 documents). The resulting model had 3 "best threshold" values. We refer to them as Logistic Regression Model-1, -2, and -3. Table 5.5 presents the performance of the 3 models.

    Model        Accuracy   Sensitivity   Specificity
    LR Model-1   0.913      0.895         0.928
    LR Model-2   0.913      0.884         0.938
    LR Model-3   0.913      0.872         0.949

Table 5.5: Accuracy, sensitivity, and specificity of the three logistic regression models having the highest accuracy.
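The threshold sweep behind the best-threshold search, and the possibility of several thresholds tying on accuracy, can be sketched as follows. The outputs and labels in the test are invented, chosen so that two thresholds share the same accuracy while trading sensitivity against specificity, as in table 5.5; this is not the thesis implementation.

```python
# Sketch of the best-threshold search of section 5.2: sweep the model's own
# output values as candidate thresholds and keep every threshold achieving
# the maximal accuracy. Several "optimal" thresholds may tie on accuracy
# while differing in sensitivity and specificity.

def sweep(probs, labels):
    # Returns {threshold: (accuracy, sensitivity, specificity)}.
    results = {}
    for th in sorted(set(probs)):
        preds = [1 if p >= th else 0 for p in probs]
        tp = sum(p and t for p, t in zip(preds, labels))
        tn = sum((not p) and (not t) for p, t in zip(preds, labels))
        pos = sum(labels)
        neg = len(labels) - pos
        results[th] = ((tp + tn) / len(labels), tp / pos, tn / neg)
    return results

def optimal_thresholds(probs, labels):
    results = sweep(probs, labels)
    best = max(acc for acc, _, _ in results.values())
    return [th for th, (acc, _, _) in results.items() if acc == best]
```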
The trade-off between sensitivity (recall) and specificity, mentioned earlier, can be seen in table 5.5 with the three logistic regression models. LR Model-1 has the highest sensitivity but the lowest specificity, while LR Model-3 has the lowest sensitivity but the highest specificity; Model-2 is in between. Figure 5.20 shows the accuracy plot for this model, and figure 5.21 shows the ROC curve. The red markers in both graphs indicate the points corresponding to the best thresholds. The AUC for this model was 0.969.

Figure 5.20: Accuracy plot for a logistic regression model with several optimal thresholds. Each of the optimal thresholds gives the same accuracy.

Figure 5.21: The ROC curve for a logistic regression model with three optimal thresholds, shown by the red markers. Each of the best thresholds yields maximal accuracy, but different true positive and false positive rates.

In figure 5.22, we show the raw output for the 183 test points. Three horizontal lines indicate the threshold values for LR Model-1, -2, and -3, which had the best accuracy.

Figure 5.22: The raw output of a logistic regression model with three optimal thresholds, showing the evaluation of 97 easy (red circles) and 86 hard (blue '+') documents in the test set. The three horizontal lines show the three best-accuracy threshold values.
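The raw output shown in figures 5.19 and 5.22 is the linear score z = b0 + d1*b1 + ... + d13*b13 passed through the logistic function. The sketch below illustrates this mapping; the coefficients and feature values are invented, and the exact coefficient handling in the thesis follows equations 4.9 and 4.10.

```python
# Sketch of the logistic regression scoring behind figures 5.19 and 5.22:
# the linear score z = b0 + d1*b1 + ... + d13*b13 is mapped through the
# logistic function to a probability, which is compared to the threshold.
# Coefficients here are illustrative placeholders.

import math

def logistic_output(features, b0, coeffs):
    z = b0 + sum(d * b for d, b in zip(features, coeffs))
    return 1.0 / (1.0 + math.exp(-z))

def classify(features, b0, coeffs, threshold):
    # Class 1 = "hard-to-read", class 0 = "easy-to-read".
    return 1 if logistic_output(features, b0, coeffs) >= threshold else 0
```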
5.2.2 The Support Vector Machine Model

The SVM model performed worse than the logistic regression model operating at the best threshold. All test points were classified, with none falling on the separating hyperplane; some were misclassified, as shown by the correct rates in table 5.2. As part of the cross-validation, each of the 367 data points was used once as a test point. We can use the distance of each data point to the separating hyperplane as a measure of confidence in the classification. Figure 5.23 shows all 367 documents, grouped by cross-validation fold (each fold having been used as a test set). The y-axis is the distance to the hyperplane; on the x-axis, we simply spread the points evenly, per fold, for better viewing.

Figure 5.23: The distance of the test points to the separating hyperplane of the SVM classifier. The hard-to-read documents (blue '+') lie generally at a positive distance, while the easy-to-read documents (red '*') are generally at a negative distance. The vertical lines show fold grouping.

5.2.3 Variable Selection

While the same 13 predictors used for predicting expert-given scores were also used for the classification experiments, the data set is different, and the outcome is of a different type as well: binary, rather than continuous. Hence, we performed another experiment in variable selection, using a different approach and, of course, the easy- and hard-to-read data set. As described in section 5.1, we built 13 one-predictor models, one for each feature value. The relative performance of each of these single models determines the significance ranking of the feature. In this experiment, each sub-model, c1, c2, ..., c13, is a logistic regression classifier.
Each classifier ci is trained with half of the values of the i-th feature, randomly chosen as the training set, and tested on the remaining half. Each classifier's area under the curve (AUC) is recorded, and the features are ranked from most to least significant by descending AUC values.

We found the feature ranking in this classification problem to be different from that of the regression problem discussed in section 5.1.1; the top six were entirely different in the two experiments. We notice that the most significant variables are those that capture line spacing and white space. The difference between the rankings in the classification problem versus the prediction of expert scores could be due to the fact that some features might be significant in determining whether a particular document's score is likely to fall above or below 4, for instance, but not in predicting the exact score. Another feature could be significant only in determining whether a particular document is of class 7, as opposed to 1-6. The former type of feature would be more useful in distinguishing between an easy and a hard document, whereas the latter type would be more useful in predicting one score bucket.

Table 5.6 presents the full ranking of features for the classification of easy vs. hard documents and provides the equivalent rankings for the regression problem (predicting expert scores). We do note that the data sets used are different, so table 5.6 is not intended to be interpreted as a definitive comparison of the features' significance in the two models; further study is needed to reach a more definite conclusion on that point. We also note that some of the features were close matches. Therefore, different runs of the experiment yielded slight changes in ranking (some features moved up or down one rank), depending on the training and test sets obtained from randomization.
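The single-feature AUC ranking described above can be sketched as follows. As a simplifying assumption, the sketch scores each document by the raw feature value rather than by a fitted one-feature logistic regression; because the logistic function is monotone, a one-feature logistic model induces the same ranking of documents and hence the same AUC, up to the direction of the effect. The feature names and values in the test are invented.

```python
# Sketch of the single-feature ranking of section 5.2.3: each feature is
# treated as a one-predictor classifier, and features are ranked by
# descending AUC. The AUC here is the probability that a randomly chosen
# hard document has a higher feature value than a randomly chosen easy
# one (ties count half), a stand-in for the fitted logistic model's AUC.

def single_feature_auc(values, labels):
    pos = [v for v, l in zip(values, labels) if l == 1]
    neg = [v for v, l in zip(values, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rank_by_auc(features, labels):
    # features: {name: list of values}; most significant (highest AUC) first.
    return sorted(features,
                  key=lambda f: single_feature_auc(features[f], labels),
                  reverse=True)
```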
    Rank   Classification Problem        Regression Problem
     1     Line-to-Space Ratio           Left Margin
     2     Whitespace Ratio              Page Count
     3     Number of Lines per Column    Max-to-Min Line Size Ratio
     4     Interline Space Ratio         Bottom Margin
     5     Color Count                   Average Gray Scale Value
     6     Average Gray Scale Value      Number of Lines per Column
     7     Right Margin Size             Top Margin Size
     8     Top Margin Size               Line-to-Space Ratio
     9     Max-to-Min Line Size Ratio    Number of Columns
    10     Bottom Margin                 Right Margin Size
    11     Left Margin                   Whitespace Ratio
    12     Number of Columns             Color Count
    13     Page Count                    Interline Space Ratio

Table 5.6: The feature ranking in the classification of easy vs. hard documents, compared to the ranking in predicting experts' readability scores.

To further check the rankings obtained by using the AUC of the single-feature models as the criterion, we performed rankings using three additional criteria:

(i) Using the accuracy of 13 single-feature models, each a logistic regression model trained with half of the data and tested on the other half. Figure 5.24 shows the accuracy of each feature, in order of ranking, while figure 5.23 shows the equivalent for the AUC criterion.

(ii) Using a two-way t-test with pooled variance [38]. Essentially, for each feature, we test the null hypothesis that the feature values for the easy-to-read and hard-to-read documents come from the same distribution, at the 5% significance level. Figure 5.25 shows the results of the t-test, from most to least significant feature.

(iii) Using the deviance, a generalization of the residual sum of squares. In this method, we start by selecting the single-feature logistic regression classifier which yields the smallest deviance; this is the most significant feature. We then add a second feature to the model, trying each of the remaining 12 features. The feature whose addition gives the 2-feature model with the least deviance is the second most significant.
We choose the 3rd, 4th, 5th, and all remaining feature ranks in the same manner. Figure 5.26 shows the model's improvement as the significant features are added to it.

Figure 5.23: The area under the ROC curve for each single-feature classifier.

Figure 5.24: The accuracy of the 13 single-feature classifiers used for feature ranking.

Figure 5.25: The two-way t-test used for feature ranking.

Figure 5.26: The logistic regression model improves (decrease in deviance) as more significant features are added to it.

The results of the first three criteria (AUC, accuracy, and t-test) largely agree, especially for the 6 most significant features. The deviance criterion gives half of the same top 6 features. One of the most notable differences is the page count feature, which dramatically changes ranking position when deviance is used as the ranking criterion. We note that the first 3 graphs (figures 5.23-5.25) show several (relatively) flat regions. These represent features of nearly equal significance.
Indeed, for the criteria that depend on randomization (AUC and accuracy), a change of ranking tends to occur among the features forming the flat regions under different randomizations. For instance, the whitespace ratio (WSR) and the number of lines per column (LNC), which appear in the first flat region, often swap positions under different re-runs of the same experiment.

We further conducted an analysis to determine the subset of features that contribute most to the model's performance, as opposed to those that add only marginal improvement. Figure 5.26 above gives one indication already, but in addition, we built 13 logistic regression classifiers, s1, s2, ..., s13, where each si is a model that uses the i most significant classification variables. Each model is built using half of the data, randomly selected as a training set, and the remaining half as a test set, in a manner analogous to the description given in the previous chapter. The exact same training and test sets are used to evaluate all 13 models to ensure a fair comparison; however, these training and test sets are different from the ones used to evaluate the single-feature models (c1 through c13). Each model's accuracy (correct rate) was evaluated. Figure 5.27 plots the accuracy values as a function of the number of most significant variables used. As would be expected, the accuracy increases as more features are included in the model. The occasional drops in the graph are due to the closely matching features performing differently under randomly selected test sets other than the one used to rank the features. Figure 5.28 is similar, showing the AUC value for the model in a generally increasing trend as more features are included in the classifier; the occasional drops occur for the reason already mentioned.
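The greedy, deviance-driven selection of criterion (iii) can be sketched as follows. Here `fit_deviance` is a caller-supplied function (in the thesis, fitting a logistic regression on the given feature subset and returning its deviance); the stand-in used in the test is invented and serves only to exercise the greedy loop.

```python
# Sketch of the forward selection of criterion (iii): start with the single
# feature giving the smallest deviance, then repeatedly add whichever
# remaining feature lowers the fitted model's deviance the most.

def forward_select(features, fit_deviance):
    remaining = list(features)
    order = []
    while remaining:
        # Try each remaining feature alongside the already-chosen ones.
        best = min(remaining, key=lambda f: fit_deviance(order + [f]))
        order.append(best)
        remaining.remove(best)
    return order
```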
Figure 5.27: Accuracy of the classifiers, increasing as more significant features are added.

Figure 5.28: The area under the ROC curve increases as more significant features are added. The local minima (in this plot as well as in figure 5.27) are due to closely matching features behaving differently under a new test set.

5.2.4 Typical Feature Values

Finally, we give a brief visualization of the data, which shows how separable it is. Knowing the separability of the data gives us a sense of how appropriate a separation model (such as SVM) is for this problem. Given the high dimensionality of the data, we look at the projection of the feature values onto various 2-D planes. More specifically, we show the documents plotted on axes representing pairs of features, as ranked in the previous subsection. In figures 5.20a-d, we show the documents from a potential test set (half the data set, chosen randomly) plotted by their values for the most significant features (MSF).

Figure 5.20a-d: Randomly selected documents, plotted in the feature space. Each axis MSFi represents the i-th most significant feature.
The easy documents are shown as red '*' signs, the hard documents as blue '+' signs. As the features become less significant, the boundary between the two classes becomes less obvious.

5.3 MATLAB Implementation

The MATLAB version of the system, being three features short of the Java implementation, performed somewhat worse. For the prediction of scores on expert-rated documents, the linear regression model of the MATLAB implementation achieved a correct rate of 0.87, while the KNN model had a 0.85 correct rate. In the classification of easy- vs. hard-to-read documents, the SVM model achieved a correct rate of 0.82. A logistic regression model was not explored in the MATLAB implementation. We must note that without the 3 additional features, the Java implementation's performance shows no measurable difference from its MATLAB counterpart.

These results show two things. First, as far as the features common to both implementations are concerned, there is no visible difference between the two implementations, and there ought not to be, if both implementations correctly extract the feature values. Second, the fact that the Java implementation's performance increases significantly with the addition of 3 new features indicates that at least one of those 3 additional features is very significant. Indeed, two of those 3 features are the interline space ratio (ISR) and the color count (CLRC), both of which were ranked among the 6 most significant features by most of the criteria used in the classification problem's variable selection (see section 5.2.3). This also explains why the greatest improvement of the Java implementation over the MATLAB one appears mainly in the classification problem.

6.
Discussion

Overall, our system (which we will refer to here as the style-based grading tool) performed well in predicting the readability level of documents. We compared the scores given by the style-based grading tool and by the Flesch-Kincaid readability test to the scores given by human experts, using 286 documents for which all three scores were available. The scores from our system were obtained from the linear regression model; the Flesch-Kincaid scores were obtained from the GNU Style (v1.10-rc4) program. The Flesch-Kincaid method is a standard algorithm widely used for measuring readability [14], including in popular word processing software such as Microsoft Word and Lotus. The tests showed that our system's grades correlated better with the human experts' scores than Flesch-Kincaid's did. Table 6.1 provides the correlation values.

                      Human Experts   SB Grading Tool
    SB Grading Tool   0.7131
    Flesch-Kincaid    0.6495          0.6506

Table 6.1: The correlation between readability scores given by human experts, Flesch-Kincaid, and the style-based grading tool.

In predicting document scores, we noticed that many feature values were most distinct in the lower and higher score ranges (1-2 and 6-7); this can be seen in figure 5.14. The middle range (3-5) had more ambiguous feature values. To investigate further, we performed an experiment in which we tested classification into score brackets. For instance, we wanted to determine whether a document's score would fall in the [1-3] bracket or the [4-7] bracket. We therefore trained six classifiers, one for each such bracket split. Each classifier was an SVM model, trained using 5-fold cross-validation. The results confirmed that the lower and higher range brackets were more reliably classifiable, while the middle range was more ambiguous. Table 6.2 presents the correct rate of the classifier trained for each bracket pair.
Score Brackets    Correct Rate
1 vs. 2-7         0.98
1-2 vs. 3-7       0.91
1-3 vs. 4-7       0.76
1-4 vs. 5-7       0.79
1-5 vs. 6-7       0.87
1-6 vs. 7         0.82

Table 6.2. The correct rates of 6 SVM classifiers trained to determine a document's bracket. The results seem to be in line with figure 5.14.

While the 6 most significant features varied between the regression and classification problems, they were all features with a large number of potential values. The only exception is the page count (which has only 3 possible values, yet was the second most significant feature in the regression problem). The features with more variable values, such as margins, seem more useful than discrete variables with a very small set of values, such as the number of columns. The feature selection and ranking (in the regression problem) gave an overall sense of the significance of each feature. However, for different score ranges, different features might be more significant. This could further explain the difference in the rankings between the regression and classification tasks. However, further study will be needed for a more definite conclusion. It is also important to point out that the deviance criterion (used for variable selection in the classification problem) produced a noticeably different feature ranking than the other 3 criteria (AUC, accuracy, and t-test). We noted that the page count, in particular, was ranked the 4th most significant feature by the deviance criterion, whereas the other three criteria ranked it as one of the two least significant features. The left margin also moved dramatically from a lower significance rank (by the AUC, accuracy, and t-test criteria) to a higher significance position according to the deviance criterion. The deviance criterion (in the classification problem) gave a ranking closer to the one from the regression problem.
Since the ranking for the regression problem was also done using deviance, we suggest that a possible explanation for the difference shown in table 5.6 lies in the different criteria and methods used to obtain the rankings, rather than in the features themselves, the models, or the data. That is, using deviance in the classification problem yielded a ranking more similar to the deviance-based ranking of the regression problem. However, further study will be needed to reach a more definite conclusion.

The question arises as to whether our system actually captures readability information or whether it is simply learning different bins of data. This is a common question for any machine learning system. However, several facts indicate that our system does capture some real information. First, we can think of health documents as items being generated by certain underlying processes. We can think of health documents for children as being generated by one process and medical journals as being generated by another process. Each of these two processes imparts distinct attributes to the items it produces; some of those attributes are stylistic. We could therefore argue that our system learned the attributes of the underlying process that generates health documents. Secondly, an experiment was done in which the stylistic features described in this paper were added to the NLP-based system developed by our team at the Decision Systems Group (DSG) at the Brigham & Women's Hospital. This NLP-based system uses several text analysis features to find the probability that a document belongs to the easy or hard sample. With that probability, a final score in the 1-7 range is computed.
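As an illustration of that last step, a score can be derived from such a probability with a simple linear mapping. The sketch below is purely hypothetical; the thesis does not give the NLP-based system's actual formula:

```java
/** Hypothetical illustration of mapping P(hard) into a 1-7 readability score.
 *  This is NOT the NLP-based system's actual formula (which is not given here);
 *  it only shows one simple linear interpolation between the two extremes. */
public class ScoreMapper {
    public static double toScore(double pHard) {
        if (pHard < 0.0 || pHard > 1.0) {
            throw new IllegalArgumentException("probability must be in [0, 1]");
        }
        // P(hard) = 0 maps to the easiest score (1); P(hard) = 1 to the hardest (7)
        return 1.0 + 6.0 * pHard;
    }
}
```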
The results showed that adding stylistic features improved this NLP-based system only slightly (using the correlation of the system's output values with the experts' grades as the performance measure). In other words, little new information came from the stylistic features, overall. We infer from this that the stylistic features on their own captured much of the same information captured by the NLP-based features. This experiment was run at an earlier stage of this project and is still ongoing (a full description of the experiment and its results will be released in a publication in the near future). However, some of the stylistic features presented in this thesis were not used in that experiment, as they were not yet implemented. Since some of those omitted features are significant ones, it could be that future testing will reveal a much more significant improvement of the NLP-based system when these style features are added to it. It is also important to note that while readability can be defined as the level of difficulty of the actual text of a document, it can also be used to refer to how willing a reader would be to read the document. A reader might be more willing to read instructions on a well-laid-out web page with a 12-point font than to read the same document in fine print on the back of a bottle. A reader might be more willing to read a one-page summary on a health topic than a 100-page report on the same topic. Our style-based grading tool is more an attempt to capture that second dimension of readability. For the tool that we are releasing to end users with this thesis, we use the average of the outputs of all the linear regression models built as the final output. More specifically, given a new document, we obtain 5 different scores (one from each of the models trained through cross-validation) and return the average as the final score from our tool.
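The averaging step can be sketched as follows; the Model interface here is a hypothetical stand-in for the tool's actual regression classes:

```java
import java.util.List;

/** Sketch of how the tool's final score is formed: average the outputs of the
 *  five linear regression models obtained through cross-validation.
 *  The Model interface is hypothetical; the real implementation differs. */
public class FinalScore {
    public interface Model {
        double score(double[] styleFeatures);
    }

    public static double average(List<Model> models, double[] features) {
        double sum = 0.0;
        for (Model m : models) {
            sum += m.score(features); // one raw score per cross-validation model
        }
        return sum / models.size();   // the mean is reported as the final score
    }
}
```

The same pattern applies to the logistic regression models discussed next, with the averaged raw output compared against a threshold instead of being reported directly.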
Similarly, we use all three logistic regression models from cross-validation to obtain three raw output values for a new document. The average output value and the threshold are used to classify the new document as hard or easy. Our implementation of the image-based style evaluation system does have some known limitations. Most importantly, the system will not work properly when the background is darker than the foreground. In such a case, the interline spaces will be considered lines and the inter-column spaces will be treated as columns, and vice-versa.

6.1 Further Work

There are several areas of possible interest in pursuing this project.

6.1.1 Newer Features

To capture more of the style, it would be helpful to include newer features. One feature that we believe could make a great improvement would be the number of letters per area. This would provide a better measurement of the number of font sizes used in the text. From this feature, we could potentially create a table of contents for the document or infer its structural hierarchy. The complexity of the hierarchy would give an indication of the complexity of the document. However, such a feature, and several others like it that we considered, requires OCR software to obtain reasonably accurate measurements. Despite our searching, we were not able to find a reliable open-source OCR tool. In our effort to keep our system open source, freely and easily accessible, we decided not to implement this feature. Currently, our best approximation of this feature is the maximum-to-minimum line size ratio, described in section 3 above.

6.1.2 Newer Training Scheme

We also thought of a more specialized method of training and evaluating the regression model.
As mentioned earlier, we could think of health documents as being generated by various underlying processes, each of which produces documents of a specific difficulty level. For the purpose of this study, we could think of 7 different processes generating documents. We could therefore train 7 different binary classifiers, each of which would learn to recognize documents from just one readability level. Having trained 7 such systems, one for each of the 7 classes, we could evaluate a specific document by inputting it to each of the 7 classifiers. The classifier that classifies the document with the highest score would determine the class the document belongs to. Such a system would be similar to those used to determine the topic of a document, a well-known NLP problem.

7. Real-World Usage

As part of this thesis we developed a program with a Graphical User Interface (GUI), which provides a user-friendly tool for evaluating the readability of documents using style properties, as described in this paper. We called this tool VisualGrader. VisualGrader allows users to load the image of a document's page from a computer's file system. With just one button click, the program extracts all feature values for the page. The page count feature, a property of the entire document, is obtained by searching the directory from which the loaded image came and looking for files with similar names that differ only by a number identifier. More specifically, the tool expects file names to be of the format name_wxyz.ext, where name can be any valid substring for a file name, w, x, y, and z are digits associated with each page, and ext is any valid image file extension. All files with the same name part (differing only in the 4-digit number before the file extension) are considered to be from the same document. The tool computes and displays the document's score. Figure 7.1 shows an example.
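The file-name convention above can be sketched with a small helper. This is an illustrative reimplementation, not VisualGrader's actual code:

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch (not the tool's actual code) of inferring the page
 *  count from sibling files named name_wxyz.ext, as the tool expects. */
public class PageCounter {
    // group 1: shared "name" part; group 2: 4-digit page number; group 3: extension
    private static final Pattern PAGE = Pattern.compile("(.+)_(\\d{4})\\.([A-Za-z]+)");

    public static int pageCount(String loadedFile, List<String> directoryListing) {
        Matcher m = PAGE.matcher(loadedFile);
        if (!m.matches()) {
            return 1; // no page number in the name: treat as a single-page document
        }
        String base = m.group(1);
        int count = 0;
        for (String f : directoryListing) {
            Matcher fm = PAGE.matcher(f);
            if (fm.matches() && fm.group(1).equals(base)) {
                count++; // same name part, different page number: same document
            }
        }
        return count;
    }
}
```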
The figure shows a health-related document loaded in the grader. Besides giving the document's score, the program also shows some of the features. More specifically, the blue lines denote columns and text lines, while the orange lines indicate white space. Other features, such as the interline-to-line ratio, can already be visualized qualitatively in the original document. We assume that the health-care professional using our system will convert the document from its original format to an image or set of images. There are several options for obtaining images from other formats. For hard copies, a scanner can be used. For PDF formats, we suggest using the application "Smart PDF Converter". A free version of Smart PDF Converter is available, but it will convert only up to the first 3 pages of a document. We used Smart PDF Converter as part of our image data gathering, and successfully converted hundreds of documents to images or sets of images, up to the first 3 pages. For websites and other HTML-based formats, we recommend CutePDF Printer, a free software program that can create PDF documents from text selected on a web page. We used CutePDF to obtain several of the easy- and hard-to-read documents from various websites. Once a document was in PDF format, we used the program mentioned earlier to convert it to image format. The tool comes with two possible look-and-feels. It can be made to mimic the standard look-and-feel of the operating system it is running on, or it can use the default Java look. Due to well-known problems with Java Swing's JFileChooser, choosing files with the native look (standard OS look) is very slow, especially on first use. We provide both versions. The one shown in figure 7.1 uses the Windows Vista native look.

Brief Implementation Overview

The Document Feature Viewer is entirely implemented using Java Swing.
It is based on a Model-View-Controller design pattern. The model consists of the Java implementation of the feature extraction back-end described in this paper. The view consists of a specialized JPanel that we called ImageView. The ImageView component can show an image given a file and a zoom factor. Also, given a set of feature values, such as a list of line start and end positions and column start and end positions, it can draw them appropriately at any zoom level. The controller of the Document Feature Viewer is comprised of a JFrame top-level window, a Browse button that allows the user to browse directories and files and select an image file, and a Feature button that triggers the scanning and feature extraction via the back-end. The contents of the JFrame consist of the ImageView widget (described briefly above). The Zoom In and Zoom Out buttons allow the user to examine the image and feature tracings more closely or from a greater distance.

[Figure 7.1: screenshot of the VisualGrader window ("VisualGrader -- Readability Evaluation Tool by Freddy Bafuka"), with Browse and Zoom In controls, showing a loaded page image (DSGDBTOO8000001.jp) of a "Dealing With Diabetes" document and its computed score.]

Figure 7.1. A document loaded in VisualGrader. The document received a rating of 3.6.

8. Conclusion

The methods described in this paper demonstrate that health documents within a particular readability level have distinct stylistic features. We have been able to correctly extract style-based features that provide a measurable distinction between documents of various readability levels. Moreover, we were able to do this using only one page from each document. We provide a user-friendly tool, with a Graphical User Interface, that allows health-care professionals to easily use the scoring algorithms we have developed and presented in this thesis.

9. Acknowledgments

I would like to express my deepest gratitude to the following people: Dr. Lucila Ohno-Machado, who brought me to DSG and whose course (Medical Decision Support, fall 2006) taught me many of the machine learning fundamentals I used in this project. Dr. Qing Zeng-Treitler, for giving guidance that led to the start and development of this project. Ms. Dorothy Curtis, for helping in many practical ways all along this project, and for reviewing this document. Dr. Bill Long, who ensured that the findings of this project were reported in a manner that meets MIT's thesis standards. Anne Hunter, for being the best academic advisor and the most useful person to students in the EECS department. Prof. Arthur Smith, of EECS Graduate Admissions. Mr. Peter Majeed, MIT '06, for helping early in this project with the time-consuming task of converting many web pages to PDF format in order to form one of the data sets.
This work was supported by NIH grant R01-DK-075837-01A, and by grants from the Department of Electrical Engineering and Computer Science at MIT.

10. References

[1] Zakaluk BL, Samuels SJ, editors. Readability: Its Past, Present, and Future. Newark: International Reading Association; 1988.
[2] Root J, Stableford S. Easy-to-Read Consumer Communications: A Missing Link in Medicaid Managed Care. Journal of Health Politics, Policy, and Law. 1999;24:1-26.
[3] Leroy G, Eryilmaz E, Laroya BT. Characteristics of Health Text. AMIA Annu Symp Proc. 2006;2006:479-483.
[4] Doak CC, Doak LG, Root JH. Teaching Patients With Low Literacy Skills. Lippincott Williams & Wilkins; 1996.
[5] Ad Hoc Committee on Health Literacy for the Council on Scientific Affairs, AMA. Health Literacy: Report of the Council on Scientific Affairs. JAMA. 1999;281:552-557.
[6] McLaughlin GH. SMOG Grading: A New Readability Formula. Journal of Reading. 1969.
[7] Readability formula: Flesch-Kincaid grade. http://csep.psyc.memphis.edu/cohmetrix/readabilityresearch.htm Retrieved Feb 23, 2007.
[8] The Fog index: a practical readability scale. http://www.as.wvu.edu/~tmiles/fog.html Retrieved Feb 23, 2007.
[9] Flesch R. A New Readability Yardstick. Journal of Applied Psychology. 1948.
[10] Rosemblat G, Loga R, Tse T, Graham L. Text Features and Readability: Expert Evaluation of Consumer Health Text. MEDNET 2006. In press.
[11] Zeng-Treitler Q, Kim H, Goryachev S, Keselman A, Slaughter L, Smith AC. Text Characteristics of Clinical Reports and their Implications for the Readability of Personal Health Records. Medinfo 2007.
[12] Gemoets D, Rosemblat G, Tse T, Loga R. Assessing readability of consumer health information: an exploratory study. Medinfo 2004;11(Pt 2):869-73.
[13] Ownby RL.
Influence of vocabulary and sentence complexity and passive voice on the readability of consumer-oriented mental health information on the internet. AMIA. 2005:585-8.
[14] Kim H, Goryachev S, Rosemblat G, Browne A, Keselman A, Zeng-Treitler Q. Beyond Surface Characteristics: A New Health Text-Specific Readability Measurement. AMIA. 2007.
[15] Zeng-Treitler Q, Goryachev S, Tse T, Keselman A, Boxwala A. Estimating Consumer Familiarity with Health Terminology: A Context-based Approach. Journal of the American Medical Informatics Association. 2008 May-Jun;15(3):349-356.
[16] Osborne H. Health Literacy From A To Z: Practical Ways To Communicate Your Health. Sudbury, MA: Jones and Bartlett; 2004.
[17] Halliday MA, Hasan R. Cohesion in English. London: Longman; 1976.
[18] McNamara DS, Kintsch E, Songer NB, Kintsch W. Are Good Texts Always Better? Interactions of Text Coherence, Background Knowledge, and Levels of Understanding in Learning from Text. Cognition and Instruction. 1996;(14):1-43.
[19] Zeng Q, Goryachev S, Weiss S, Sordo M, Murphy S, Lazarus R. Extracting principal diagnosis, co-morbidity and smoking status for asthma research: evaluation of a natural language processing system. BMC Medical Informatics and Decision Making. 2006;6:30.
[20] Mosenthal P, Kirsch I. A New Measure of Assessing Document Complexity: The PMOSE/IKIRSCH Document Readability Formula. Journal of Adolescent & Adult Literacy. 1998;41(8):638-57.
[21] Doak CC, Doak LG, Root JH. Teaching Patients With Low Literacy Skills. 2nd ed. Lippincott Williams & Wilkins; 1996.
[22] Shapiro LG, Stockman GC (2001). Computer Vision, pp. 279-325. New Jersey: Prentice-Hall.
[23] Shi J, Malik J (2000). Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 22(8), 888-905.
[24] Viola P, Jones M. Robust Real-time Object Detection.
Second International Workshop on Statistical and Computational Theories of Vision: Modeling, Learning, Computing and Sampling. 2001.
[25] Wang C-H, Srihari SN. Object Recognition in Structured and Random Environments: Locating Address Blocks on Mail Pieces. AAAI-86. 1986.
[26] Pham DL, Xu C, Prince JL (2000). Current Methods in Medical Image Segmentation. Annual Review of Biomedical Engineering, volume 2, pp. 315-337.
[27] Kandula S, Zeng-Treitler Q. Creating a Gold Standard for the Readability Measurement of Health Texts. AMIA Annu Symp Proc. 2008.
[28] Hastie T, Tibshirani R, Friedman J (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, p. 41. Springer Series in Statistics.
[29] Hastie T, Tibshirani R, Friedman J (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, p. 42. Springer Series in Statistics.
[30] Russell S, Norvig P (2003). Artificial Intelligence: A Modern Approach. pp. 735, 888. Prentice Hall.
[31] Hastie T, Tibshirani R, Friedman J (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, p. 417. Springer Series in Statistics.
[32] Hastie T, Tibshirani R, Friedman J (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, pp. 371-377. Springer Series in Statistics.
[33] Robles V, Bielza C, Larrañaga P, Gonzalez S, Ohno-Machado L. Optimizing logistic regression coefficients for discrimination and calibration using estimation of distribution algorithms. TOP, Springer Berlin/Heidelberg, Vol 16, pp. 345-366. 2008.
[34] Aydın D. A Comparison of the Sum of Squares in Linear and Partial Linear Regression Models. Proceedings of World Academy of Science, Engineering and Technology. Vol 32, 2008.
[35] Fawcett T. ROC Graphs: Notes and Practical Considerations for Researchers. 2004.
[36] Zweig M, Campbell G. Receiver-Operating Characteristic (ROC) Plots: A Fundamental Evaluation Tool in Clinical Medicine. Clin.
Chem. 39/4, 561-577, 1993.
[37] Theodoridis S, Koutroumbas K (1999). Pattern Recognition. Academic Press, pp. 341-342.
[38] Liu H, Motoda H (1998). Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers.