What is readability?
A characteristic of text documents: "the sum total of all those elements within a given piece of printed material that affect the success a group of readers have with it. The success is the extent to which they understand it, read it at an optimal speed, and find it interesting." (Dale & Chall, 1949)
"Ease of understanding or comprehension due to the style of writing." (Klare, 1963)

Readability encompasses a number of areas:
▪ Syntactic complexity of the text: the grammatical arrangement of words within a sentence (e.g., active vs. passive constructions have been shown to affect readability); simple vs. compound vs. complex sentences.
▪ Organization of the text: discourse structure and textual cohesion.
▪ Semantic complexity of the text.

Applications
▪ Improving literacy rates
▪ Improving instruction delivery
▪ Judging technical manuals
▪ Matching a text to the appropriate grade level
▪ And many more...

Readability formulas
A readability formula assigns a score to a text based on textual cues, e.g., average sentence length. Over 200 formulas existed by the 1980s (DuBay, 2004). Typical textual cues: sentence length, percentage of familiar words, word length, syllables per word, etc. Validity is tested by correlating the predicted score with reading comprehension scores.

Flesch Reading Ease score
$$\text{Score} = 206.835 - (1.015 \times \text{ASL}) - (84.6 \times \text{ASW})$$
▪ Score lies in $[0, 100]$
▪ ASL = average sentence length
▪ ASW = average number of syllables per word

Dale-Chall formula
Maintains a list of "easy words".
$$\text{Score} = 0.1579\,\text{PDW} + 0.0496\,\text{ASL} + 3.6365$$
▪ PDW = percentage of difficult words (words not on the easy-word list)
Other examples include the FOG index and the Lexile scale. (A code sketch of the Flesch and Dale-Chall formulas appears at the end of this section.)

Commonality among the formulas: each is a linear regression over some predictor variables.

Limitations
Traditional readability measures are robust for large samples (textbooks and essays), but not for short, concise web documents, which are also generally noisy.

Resource: Predicting Reading Difficulty With Statistical Language Models, Kevyn Collins-Thompson and Jamie Callan.

Why statistical language models?
▪ A language model can encode more complex relationships than the simple linear regression of traditional readability measures.
▪ It yields a probability distribution over all grade levels.
▪ The relative difficulty of words is obtained statistically, rather than hard-coded as in traditional measures.
▪ Earlier-grade readers tend to use more concrete words (e.g., "red"); later-grade readers use more abstract words (e.g., "determine"). The same observation holds in web documents.
▪ Syntactic features are ignored: this is a word-based (semantic) model.

A classification framework
For a given text passage $T$, predict the semantic difficulty of $T$ relative to a specific grade level $G_i$: the likelihood that the words of $T$ were generated from a representative language model of $G_i$. The text is scored against every grade model $LM_{G_1}, LM_{G_2}, \ldots, LM_{G_n}$, producing a difficulty score for each grade. Each grade model is a unigram distribution over the word types $w_1, \ldots, w_k$:
$$LM_{G_i} = \{P(w_1 \mid G_i), P(w_2 \mid G_i), \ldots, P(w_k \mid G_i)\}$$

An aside: the multinomial distribution
"In a recent three-way election for a large country, candidate A received 20% of the votes, candidate B received 30% of the votes, and candidate C received 50% of the votes. If six voters are selected randomly, what is the probability that there will be exactly one supporter for candidate A, two supporters for candidate B and three supporters for candidate C in the sample?"
$$P(A=1, B=2, C=3) = \frac{6!}{1!\,2!\,3!}\, 0.2^1\, 0.3^2\, 0.5^3 = 0.135$$
The multinomial distribution models $n$ independent trials, each of which leads to a success of exactly one of $k$ categories, with each category having a fixed success probability. Its probability mass function is
$$f(x_1, \ldots, x_k; n, p_1, \ldots, p_k) = P(X_1 = x_1, \ldots, X_k = x_k) = \frac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k}$$
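The two classic formulas above are easy to compute directly. Below is a minimal sketch: the syllable counter is a crude vowel-group heuristic and EASY_WORDS is a tiny stand-in for the real Dale-Chall list of roughly 3,000 words, so the numbers are illustrative rather than official.

```python
import re

EASY_WORDS = {"the", "red", "ball", "is", "here", "dog"}  # stand-in; the real list has ~3,000 words

def count_syllables(word):
    # Crude heuristic: count groups of consecutive vowels (at least one per word).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    asl = len(words) / len(sentences)                                # average sentence length
    asw = sum(count_syllables(w) for w in words) / len(words)        # syllables per word
    return 206.835 - 1.015 * asl - 84.6 * asw

def dale_chall(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    pdw = 100 * sum(w.lower() not in EASY_WORDS for w in words) / len(words)
    asl = len(words) / len(sentences)
    return 0.1579 * pdw + 0.0496 * asl + 3.6365

text = "The red ball is here. The dog runs fast."
print(flesch_reading_ease(text), dale_chall(text))
```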
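The election example can be verified in a few lines using only the standard library; multinomial_pmf below is a direct transcription of the probability mass function above.

```python
from math import factorial

def multinomial_pmf(counts, probs):
    """P(X_1=x_1, ..., X_k=x_k) for n = sum(counts) independent trials."""
    n = sum(counts)
    coeff = factorial(n)
    for x in counts:
        coeff //= factorial(x)      # n! / (x_1! ... x_k!)
    p = 1.0
    for x, p_i in zip(counts, probs):
        p *= p_i ** x               # p_1^x_1 ... p_k^x_k
    return coeff * p

# Election example: 6 voters, P(A=1, B=2, C=3)
print(multinomial_pmf([1, 2, 3], [0.2, 0.3, 0.5]))  # 0.135
```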
Unigram language model
A hypothetical author generates the tokens of $T$ as follows:
▪ Choose a grade language model $G_i$ according to a prior probability distribution $P(G_i)$: "I will write for grade level 4." [explicit]
▪ Choose a passage length $|T|$ according to a probability distribution $P(|T|)$: "I will write no more than 100 words." [explicit/implicit]
▪ Sample $|T|$ tokens from $G_i$'s multinomial word distribution: "I will pick words with a certain distribution." [implicit]

We need to compute $P(G_i \mid T)$, the probability that $T$ was generated from language model $G_i$. By Bayes' theorem:
$$P(G_i \mid T) = \frac{P(G_i)\, P(T \mid G_i)}{P(T)}$$

Computing $P(T \mid G_i)$: the probability of choosing a text of length $|T|$, times the multinomial distribution of the unigrams in $G_i$:
$$P(T \mid G_i) = P(|T|)\, |T|! \prod_{w \in V} \frac{P(w \mid G_i)^{C(w)}}{C(w)!}$$
where $C(w)$ is the count of word $w$ in $T$.

Classification model
$$\arg\max_i P(G_i \mid T) = \arg\max_i \frac{P(G_i)\, P(|T|)\, |T|! \prod_{w \in V} P(w \mid G_i)^{C(w)} / C(w)!}{P(T)} = \arg\max_i P(G_i)\, P(|T|)\, |T|! \prod_{w \in V} \frac{P(w \mid G_i)^{C(w)}}{C(w)!}$$
since $P(T)$ does not depend on $i$. Taking logs:
$$\arg\max_i \log P(G_i \mid T) = \arg\max_i \Big[ \sum_{w \in V} C(w) \log P(w \mid G_i) - \sum_{w \in V} \log C(w)! + \log P(G_i) + \log P(|T|) + \log |T|! \Big]$$

Simplifying assumptions: all grades are equally likely a priori, and all passage lengths are equally likely. The remaining terms outside the first sum do not depend on $i$, so the simplified classification model is
$$\arg\max_i \log P(G_i \mid T) = \arg\max_i \sum_{w \in V} C(w) \log P(w \mid G_i)$$

Example 1: passage $T$ = "the red ball"
$L(G_1 \mid T) = \log 0.0600 + \log 0.0008 + \log 0.00010 = -8.319$
$L(G_5 \mid T) = \log 0.0700 + \log 0.0004 + \log 0.00005 = -8.854$
$L(G_{12} \mid T) = \log 0.0800 + \log 0.0002 + \log 0.00001 = -9.796$
Grade 1 yields the highest likelihood. (A classifier sketch appears at the end of this section.)

Example 2: passage $T$ = "the red perimeter"
$L(G_1 \mid T) = -9.319$, $L(G_5 \mid T) = -8.076$, $L(G_{12} \mid T) = -9.097$

Example 3: passage $T$ = "the perimeter was optimal"
$L(G_1 \mid T) = -12.523$, $L(G_5 \mid T) = -11.678$, $L(G_{12} \mid T) = -11.097$

Smoothing
What if a word does not belong to the language model of a grade level? It will be assigned probability 0. The remedy is to redistribute part of the probability mass of known words to rare and unseen words:
▪ Smooth each individual grade-based language model with Good-Turing smoothing. This gives an estimate of the total probability mass of all unseen words.
▪ We still need each unseen word's share of this total probability mass. A uniform probability distribution? No: the usage of discriminative words clusters around grade levels, so instead we borrow probability mass from neighboring grade classes.

If word type $w$ occurs in one or more grade models (which may or may not include $G_i$):
$$P(w \mid G_i) = \frac{\sum_k \alpha_k P_k}{\sum_k \alpha_k}$$
where $P_k = P(w \mid G_k)$ and $\alpha_k = \phi(i, k, \sigma)$ is a kernel distance function between grades $i$ and $k$, e.g., the Gaussian kernel
$$\phi(i, k, \sigma) = \exp\left(-\frac{(i-k)^2}{\sigma^2}\right)$$

A regression view of readability
Given training documents with assigned readability scores and predictor variables $p_1, p_2, \ldots, p_n$, train a regression model
$$R = \beta_0 + \beta_1 p_1 + \cdots + \beta_n p_n$$
and apply it to compute the readability score of a new document.

Resource: Revisiting Readability: A Unified Framework for Predicting Text Quality, Emily Pitler and Ani Nenkova.

Many different predictor variables can indicate a readability score. What is the contribution of each individual predictor variable? Testing methodology: collect a readability corpus, extract the predictor variable, and measure the correlation between readability score and predictor variable.

Pearson product-moment correlation coefficient ($r$)
Captures the relationship between two variables that are linearly related ($Y = \beta_0 + \beta_1 X$):
$$r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2 \sum_i (Y_i - \bar{Y})^2}}, \qquad -1 \le r \le +1$$
A positive $r$ means the two variables increase together; a negative $r$ means one decreases as the other increases. How statistically significant is a given $r$ value?
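A minimal sketch of the simplified classifier, hard-coding the hypothetical per-grade unigram probabilities from Example 1; with base-10 logs it reproduces the scores above.

```python
import math

# Hypothetical per-grade unigram probabilities from Example 1 (illustrative, not real data).
GRADE_LMS = {
    "G1":  {"the": 0.0600, "red": 0.0008, "ball": 0.00010},
    "G5":  {"the": 0.0700, "red": 0.0004, "ball": 0.00005},
    "G12": {"the": 0.0800, "red": 0.0002, "ball": 0.00001},
}

def grade_log_likelihood(tokens, lm):
    # Simplified model: sum over tokens of C(w) * log P(w | G_i).
    return sum(math.log10(lm[t]) for t in tokens)

tokens = "the red ball".split()
scores = {g: grade_log_likelihood(tokens, lm) for g, lm in GRADE_LMS.items()}
print(scores)                       # {'G1': -8.319, 'G5': -8.854, 'G12': -9.796}
print(max(scores, key=scores.get))  # G1: the predicted grade level
```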
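A sketch of the kernel-weighted interpolation across neighboring grades; the toy grade models below are invented to show an unseen word borrowing probability mass from adjacent grades.

```python
import math

def gaussian_kernel(i, k, sigma):
    # phi(i, k, sigma) = exp(-(i - k)^2 / sigma^2)
    return math.exp(-((i - k) ** 2) / sigma ** 2)

def smoothed_prob(word, grade_i, grade_lms, sigma=1.0):
    """Kernel-weighted average of P(word | G_k) over all grades k."""
    num, den = 0.0, 0.0
    for k, lm in grade_lms.items():
        alpha = gaussian_kernel(grade_i, k, sigma)
        num += alpha * lm.get(word, 0.0)
        den += alpha
    return num / den

# Toy grade models: "perimeter" is unseen in grade 1 but occurs in grades 2 and 3.
grade_lms = {1: {"red": 0.0008}, 2: {"perimeter": 0.0001}, 3: {"perimeter": 0.0004}}
print(smoothed_prob("perimeter", 1, grade_lms))  # nonzero: mass borrowed from grades 2-3
```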
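The combination $R = \beta_0 + \beta_1 p_1 + \cdots + \beta_n p_n$ is ordinary least-squares regression; a sketch with scikit-learn on invented predictor values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Rows: documents; columns: predictor variables p1..p3 (hypothetical values).
P = np.array([[12.0, 4.1, 0.30],
              [25.0, 5.0, 0.10],
              [18.0, 4.4, 0.22],
              [30.0, 5.6, 0.05]])
R = np.array([4.5, 2.1, 3.8, 1.5])     # assigned readability scores

model = LinearRegression().fit(P, R)   # learns beta_0 (intercept) and beta_1..beta_n
new_doc = np.array([[20.0, 4.8, 0.15]])
print(model.predict(new_doc))          # predicted readability score for a new document
```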
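As a quick sketch on toy data, scipy.stats.pearsonr computes both $r$ and the $p$-value discussed next:

```python
from scipy.stats import pearsonr

# Toy data: average sentence length vs. human readability rating (hypothetical).
avg_sentence_length = [8, 12, 15, 20, 24, 30]
readability_rating  = [4.8, 4.5, 3.9, 3.1, 2.6, 2.0]

r, p_value = pearsonr(avg_sentence_length, readability_rating)
print(f"r = {r:.3f}, p = {p_value:.4f}")  # strong negative correlation
```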
t-test for statistical significance
▪ Significance is expressed through a $p$-value, computed under a null hypothesis (e.g., "the use of drug X to treat disease Y is no better than not using any drug").
▪ A $p$-value of 0.001 signifies that there is a 1 in 1,000 chance that we would have seen these observations if the variables were unrelated.
▪ If the $p$-value computed for a dataset is less than a predefined limit (say $p < 0.001$), the null hypothesis is rejected: the correlation is statistically significant.

Methodology
▪ Create a readability dataset: "On a scale of 1 to 5, how well written is this text?"
▪ Identify a group of predictor variables.
▪ Measure the correlation between readability scores and the values of each predictor variable.
▪ Decide on the effectiveness of the predictor variables based on the correlation score and $p$-value. (Limit on $p$-value: 0.05.)

Surface features (sketches of these feature extractors appear at the end of this section)
▪ Average characters/word: the average number of characters per word
▪ Average words/sentence: the average number of words per sentence
▪ Max words/sentence: the maximum number of words per sentence
▪ Text length

Vocabulary features
A unigram model gives the probability of an article, $\prod_w P(w \mid M)^{C(w)}$, where $M$ is a language model built from a background corpus (the Wall Street Journal and AP News corpora). Its log-likelihood is
$$\sum_w C(w) \log P(w \mid M)$$
This model is biased towards shorter articles. Why? Every additional token contributes another negative log-probability term, so longer articles get lower log-likelihoods regardless of their vocabulary. Compensation: linear regression with the log-likelihood and the number of words in the article as predictor variables. The resulting features:
▪ Log-likelihood, WSJ: article likelihood estimated from a WSJ language model
▪ Log-likelihood, NEWS: article likelihood according to a unigram language model from NEWS
▪ LL with length, WSJ: linear regression of the WSJ unigram log-likelihood and article length
▪ LL with length, NEWS: linear regression of the NEWS unigram log-likelihood and article length

Syntactic features
▪ Average parse tree height
▪ Average number of noun phrases per sentence
▪ Average number of verb phrases per sentence
▪ Average number of subordinate clauses per sentence (counting SBAR nodes in the parse tree)

The curious case of average verb phrases: since more verb phrases per sentence might increase text complexity, average verb phrases should have a negative correlation with readability. But consider the following examples:
(1) It was late at night, but it was clear. The stars were out and the moon was bright.
(2) It was late at night. It was clear. The stars were out. The moon was bright.
Version (1) packs more verb phrases into each sentence, yet it reads better than the choppy version (2): combining related clauses is a mark of well-planned discourse.

Discourse features
Aspects of well-written discourse include cohesive devices such as pronouns, definite descriptions, and topic continuity. Features:
▪ Number of pronouns per sentence
▪ Number of definite articles per sentence
▪ Average cosine similarity between adjacent sentences
▪ Word overlap, and word overlap restricted to nouns and pronouns

An entity-based approach to local coherence
Discourse coherence is achieved in view of the way discourse entities are introduced and discussed. Some entities are more salient than others: salient entities are more likely to appear in prominent syntactic positions (such as subject or object), and to be introduced in a main clause. Centering theory models this continuity of discourse.
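A sketch of the surface features above; tokenization here is a simple regular-expression split, which is an assumption rather than the paper's exact preprocessing.

```python
import re

def surface_features(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    words_per_sentence = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    return {
        "avg_chars_per_word": sum(len(w) for w in words) / len(words),
        "avg_words_per_sentence": len(words) / len(sentences),
        "max_words_per_sentence": max(words_per_sentence),
        "text_length": len(words),
    }

print(surface_features("It was late at night. It was clear. The stars were out."))
```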
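Counting SBAR nodes presupposes constituency parses; the sketch below uses a hand-written bracketed tree with nltk (in practice the trees would come from a parser).

```python
from nltk import Tree

# A hand-written parse; real trees would come from a constituency parser.
parse = Tree.fromstring(
    "(S (NP (PRP It)) (VP (VBD was) (ADJP (JJ clear)) "
    "(SBAR (IN because) (S (NP (DT the) (NN moon)) (VP (VBD was) (JJ bright))))))"
)

def count_sbar(tree):
    # Number of subordinate clauses = number of SBAR nodes in the parse tree.
    return sum(1 for sub in tree.subtrees() if sub.label() == "SBAR")

print(count_sbar(parse))   # 1 subordinate clause
print(parse.height())      # parse tree height, another syntactic feature
```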
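A sketch of the average adjacent-sentence cosine similarity feature, using raw word-count vectors as a simplification:

```python
import math
import re
from collections import Counter

def cosine(c1, c2):
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def avg_adjacent_cosine(sentences):
    # One word-count vector per sentence; average similarity of adjacent pairs.
    vecs = [Counter(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    sims = [cosine(a, b) for a, b in zip(vecs, vecs[1:])]
    return sum(sims) / len(sims)

print(avg_adjacent_cosine(["The stars were out.", "The moon was bright."]))
```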
Entity-grid discourse representation
Each text is represented by an entity grid: a two-dimensional array that captures the distribution of entities across the sentences of the text.

Optional resource: Modeling Local Coherence: An Entity-Based Approach, Regina Barzilay and Mirella Lapata.

Grid entries:
▪ S: the entity appears in a subject phrase
▪ O: the entity appears in an object phrase
▪ X: the entity appears in any other phrase
▪ −: the entity does not appear

If a noun phrase appears more than once in a sentence, we resort to a grammatical-role-based ranking (S > O > X). For example, if in sentence 1 'Microsoft' appears both as subject (S) and in an other-phrase (X) position, the grid entry for Microsoft is marked S.

A local entity transition is a sequence in $\{S, O, X, -\}^n$ that represents an entity's occurrences and syntactic roles in $n$ adjacent sentences. Each transition type has a certain probability given a grid, e.g.
$$P(S, -) = \frac{6}{75} = 0.08$$
if the transition $(S, -)$ occurs 6 times among the 75 transitions in the grid. A text thus maps to a distribution over transition types. (A code sketch appears at the end of this section.)

Feature vector
The features are probabilities for a fixed set of transition types. For each grid rendering $j$ of document $d_i$:
$$\Phi(x_{ij}) = (P_1(x_{ij}), P_2(x_{ij}), \ldots, P_m(x_{ij}))$$
where $m$ is the number of predefined transition types and $P_t(x_{ij})$ is the probability of transition $t$ in grid $x_{ij}$.

The sentence-ordering task
Determine an optimal sequence in which to present a pre-selected set of information-bearing items, as in concept-to-text generation and multi-document summarization. A simpler task is to rank alternative sentence orderings: which of a pair of orderings ($d_{o1}$ vs. $d_{o2}$) is better in terms of coherence?

Training set: ordered pairs of alternative renderings $(x_{ij}, x_{ik})$ of the same document $d_i$, where the degree of coherence of $x_{ij}$ is greater than that of $x_{ik}$.
Training objective: find a parameter vector $\mathbf{w}$ yielding a ranking score function that minimizes the number of violations of the pairwise rankings provided in the training set.
Modelling:
▪ $\forall (x_{ij}, x_{ik}) \in r^*: \mathbf{w} \cdot \Phi(x_{ij}) > \mathbf{w} \cdot \Phi(x_{ik})$
▪ equivalently, $\mathbf{w} \cdot (\Phi(x_{ij}) - \Phi(x_{ik})) > 0$
▪ a constraint-optimization problem, solvable with a Support Vector Machine

Discourse relations as features
Consider a document as a bag of discourse relations, with a language model defined over relations instead of words. The probability of a document with $n$ relation tokens drawn from $k$ relation types is multinomial, so the log-likelihood of a document based on its discourse relations is
$$\log P(n) + \log n! + \sum_{i=1}^{k} \big( x_i \log p_i - \log x_i! \big)$$
where $x_i$ is the count of relation type $i$ and $p_i$ its probability. An increase in the number of discourse relations in a document lowers this log-likelihood, so the number of relations in a document is itself used as a feature.

Closing questions
There are 200+ readability measures, and still counting. Are they really looking at deeper aspects of language comprehension? Are they tuned to individual reading abilities? Is the reader in the loop? How do we comprehend sentences? How do we store and access words? How do we resolve ambiguities?
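Returning to the entity-grid transitions above, a sketch that computes the distribution over bigram transition types from a toy grid (in practice the grid comes from parsing and coreference resolution; these rows are invented):

```python
from collections import Counter
from itertools import product

# Toy entity grid: rows are entities, columns are sentences (roles S, O, X, -).
grid = {
    "Microsoft": ["S", "O", "S", "-"],
    "market":    ["X", "-", "-", "O"],
    "earnings":  ["-", "X", "-", "-"],
}

def transition_probs(grid, n=2):
    """Distribution over length-n entity transitions across all grid rows."""
    counts = Counter()
    for roles in grid.values():
        for i in range(len(roles) - n + 1):
            counts[tuple(roles[i:i + n])] += 1
    total = sum(counts.values())
    # Fixed set of transition types, including those with probability 0.
    return {t: counts[t] / total for t in product("SOX-", repeat=n)}

probs = transition_probs(grid)
print(probs[("S", "O")], probs[("S", "-")])  # feature values P_t(x_ij)
```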
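And returning to the pairwise ranking constraint $\mathbf{w} \cdot (\Phi(x_{ij}) - \Phi(x_{ik})) > 0$: one standard reduction trains a linear SVM on difference vectors, sketched below with scikit-learn on invented feature vectors. This is a common simplification, not necessarily the exact solver used in the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical transition-probability feature vectors Phi(x); in each pair,
# the first rendering is more coherent than the second.
pairs = [
    (np.array([0.30, 0.10, 0.05]), np.array([0.10, 0.25, 0.15])),
    (np.array([0.28, 0.12, 0.04]), np.array([0.12, 0.20, 0.18])),
]

# Reduce ranking to binary classification on difference vectors:
# label +1 for Phi(better) - Phi(worse), -1 for the reverse.
X = np.array([a - b for a, b in pairs] + [b - a for a, b in pairs])
y = np.array([1] * len(pairs) + [-1] * len(pairs))

w = LinearSVC(C=1.0).fit(X, y).coef_[0]

def coherence_score(phi):
    return float(np.dot(w, phi))  # rank orderings by w . Phi(x)

a, b = pairs[0]
print(coherence_score(a) > coherence_score(b))  # True: a ranked more coherent
```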