hw5pr4 - Computer Science Division

Homework 5, Problem 4: A program that reads
[20 extra points; individual or pair]
Submission: Submit your hw5pr4.py file to the submission server
The Flesch Index (FI) is a numerical measure of the readability of a
particular piece of text. Textbook publishers and editors often use it
to ensure that material is written at a level appropriate for the
intended audience.
For a given piece of text, the Flesch readability, or Flesch index (FI),
is given by:
FI = 206.835 - 84.6 * (numSyls/numWords) - 1.015 * (numWords/numSents)
where:
1. numSyls is the total number of syllables in the text
2. numWords is the total number of words in the text
3. numSents is the total number of sentences in the text
The FI is simply a linear combination of two text-related ratios,
subtracted from a constant offset. FI is usually reported as an
integer, and casting the floating-point result to an integer is
completely fine for this purpose.
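As a rough illustration of the formula above, here is a minimal sketch that turns three counts into an integer score. The name fleschIndex and the sample counts are made up for this example; they are not part of the required interface, and the sketch assumes numWords and numSents are both nonzero.

```python
def fleschIndex(numSyls, numWords, numSents):
    """Return the Flesch index as an int (hypothetical helper name).

    Assumes numWords and numSents are both nonzero.
    """
    # float() keeps the ratios from being truncated by integer division
    sylsPerWord = float(numSyls) / numWords
    wordsPerSent = float(numWords) / numSents
    fi = 206.835 - 84.6 * sylsPerWord - 1.015 * wordsPerSent
    return int(fi)   # int() drops the fractional part


# Made-up counts: 30 syllables, 10 words, 1 sentence
# 206.835 - 253.8 - 10.15 = -57.115, so score is -57
score = fleschIndex(30, 10, 1)
```

Note that int() truncates toward zero, so a negative result such as -57.115 becomes -57 rather than -58.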
Here are some resulting readability scores for a few example
publications or styles of writing (cited from Cay Horstmann's book,
p. 266). It is, in fact, possible to have a negative readability score:

95 - Comics

82 - Advertisements

65 - Sports Illustrated

57 - Time Magazine

39 - New York Times

10 - Auto insurance policy

-6 - Internal Revenue Code
In this problem you will write a program to compute the Flesch
readability score or Flesch Index (FI) for different pieces of text.
Because the rules for counting words, sentences, and syllables can
be tricky, it is important to allow the user to play with various
inputs and see how they decompose into syllables, sentences, and
words. For this reason, you will structure this program in a function
named flesch() that provides a menu of options for analyzing text.
Example run from a complete flesch program
Here is what the user should see when they run your flesch()
function -- changes to the text are OK, but please keep to the
option numbers listed here and their functionality, because it will
make your program much easier to grade!
Welcome to the text readability calculator!
Your options include:
(1) Count sentences
(2) Count words
(3) Count syllables in one word
(4) Calculate readability
(9) Quit
What option would you like?
The first three options will help you troubleshoot errors in
computing the three components of the Flesch readability Index. In
addition, they suggest three functions that you should write to
support this menu:

sentences( text ), which takes in any string of text and
returns the number of sentences present, according to the
rules below

words( text ), which takes in any string of text and returns
the number of words present, according to the rules below

syllables( oneword ), which takes in one word of text,
called oneword and returns the number of syllables present,
according to the rules below
You are welcome to write additional helper functions, as well!
Punctuation, whitespace, and some functions to get you started
The input text to options 1, 2, and 4 in the readability menu may be
any string at all. Because of this, there need to be some guidelines
that define what constitutes a sentence, a word, and a syllable
within a word.
We will use the term raw word to represent a space-separated
string of characters that, itself, contains no spaces. Python's string
objects contain a useful method (function) that can split a string
into space-separated raw words: it is called split(). For example,
>>> s = "This is a sentence."
>>> s.split()
['This', 'is', 'a', 'sentence.']
>>> s = "This    is \n a sentence."    # \n is a newline
>>> print s
This    is 
 a sentence.
>>> s.split()
['This', 'is', 'a', 'sentence.']
Thus split returns a list of raw words, with all whitespace removed.
The following function might be useful -- feel free to copy it to your
hw5pr4.py file and use it in order to extract a list of raw words from
a string of input text:
def makeListOfWords( text ):
    """ returns a list of words in the input text """
    L = text.split()
    return L
Admittedly, you could avoid using this function simply by calling
split as needed.
After whitespace, punctuation is a second cue that we will use to
define what constitutes a sentence and a word.
The following function will be useful in stripping non-alphabetic
characters from raw words -- you should also use this in your
hw5pr4.py file:
def dePunc( rawword ):
    """ de-punctuationifies the input string """
    L = [ c for c in rawword if 'A' <= c <= 'Z' or 'a' <= c <= 'z' ]
    # L is now a list of alphabetic characters
    word = ''.join(L)   # this _strings_ the elements of L together
    return word
Because most of the non-alphabetic characters we need to remove
will be punctuation, the function is called dePunc. It uses a list
comprehension that creates a list of all and only alphabetic
characters (hence the if, which is allowed in list comprehensions).
That list L, however, is not usable as a string, so the "magic" line
word = ''.join(L) converts the list L into a string held by the
variable word. That line is really not magic -- join is simply a
method of all string objects.
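To make the two steps concrete, here is a small sketch of the same filter-then-join pattern applied to the raw word one-way. The variable names letters and dashed are only for illustration.

```python
# Step 1: the list comprehension keeps only the alphabetic characters
letters = [c for c in "one-way" if 'A' <= c <= 'Z' or 'a' <= c <= 'z']
# letters is now ['o', 'n', 'e', 'w', 'a', 'y']

# Step 2: join glues the list elements together, using the string
# it is called on as the separator between them
word = ''.join(letters)      # empty separator gives 'oneway'
dashed = '-'.join(letters)   # a dash separator gives 'o-n-e-w-a-y'
```

The empty string '' is used as the separator in dePunc because the letters should be glued together with nothing in between.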
Definition of "a word":
For this problem, we will define a word to be a raw word with all of
the non-alphabetic characters removed. For example, if the raw
word was one-way, then the corresponding word (with punctuation
removed) would be oneway. A raw word will never create more than
one punctuation-removed word.
However, a raw word could have only non-alphabetic characters.
For example, the raw word 42! will disappear entirely when the
non-alphabetic characters are removed. We will insist that a word has at
least one alphabetic character (thus, the empty string is not a
word).
Here's an example of dePunc in action:
>>> dePunc( "I<342,don'tYou?" )
'IdontYou'
Counting words
So, with these two functions as background, the number of words in
a body of text is defined to be the number of non-empty,
alphabet-only, space-separated raw words in that text. Using
makeListOfWords and dePunc will help to write a function words(
text ) that takes any string as input and returns the number of
words, as defined above, as output.
Here are some examples -- note that this is not an exhaustive list of
possibilities! You may want to test some other cases, as well.
>>> words( 'This sentence has 4 words.' )
4
>>> words( 'This sentence has five words.' )
5
>>> words( '42 > 3.14159 > 2.71828' )
0
Note that these rules have their limitations! The first example
probably should be considered to have 5 words, but because it
would take a person to disambiguate all of the many possibilities
(and different people might disagree!), we will stick with this
definition, despite its limitations.
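One possible way to put the definition above into code is sketched below. It is only one reading of the rules, not the required solution; dePunc is repeated here in condensed form so that the sketch runs on its own.

```python
def dePunc(rawword):
    """Keep only the alphabetic characters of rawword."""
    return ''.join([c for c in rawword if 'A' <= c <= 'Z' or 'a' <= c <= 'z'])


def words(text):
    """Return the number of words in text: raw words that are
    still non-empty after their non-alphabetic characters are removed."""
    count = 0
    for raw in text.split():      # split() yields the raw words
        if dePunc(raw) != '':     # skip raw words such as 4 or 3.14159
            count += 1
    return count
```

Under this sketch, the raw word 4 in the first example is dropped because nothing is left of it after dePunc, which is why that sentence counts as 4 words.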
Counting sentences
The rules used for counting sentences within a string are:

We will say that a sentence has occurred any time that one of
its raw words ends in a period (.), question mark (?), or
exclamation point (!). Note that this means that a raw word that is
just a period, question mark, or exclamation point counts as a sentence.

The empty string is the only string that has 0 sentences. Any
other string should be considered to have at least one
sentence.
Thus, as long as there is at least one sentence in the text, an
unpunctuated fragment at the end of the text does not count as an
additional sentence. This may seem a bit much, but in fact it means
you don't have to create a special case for the last raw word of the
text.
Here are some examples -- again, this is not an exhaustive list of
possibilities! You may want to test some other cases, as well.
>>> sentences( 'This sentence has 4 words.' )
1
>>> sentences( 'This sentence has no final punctuation' )
1
>>> sentences( 'Hi. This sentence has no final punctuation' )
1
# Note! Fragments don't count unless there are no sentences at all
>>> sentences( 'Wow!?! No way.' )
2
>>> sentences( 'Wow! ? ! No way.' )
4
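The two rules above could be sketched as follows. This is a minimal illustration under the stated rules, not the only acceptable implementation.

```python
def sentences(text):
    """Return the number of sentences in text, using the two rules above."""
    if text == '':
        return 0                  # only the empty string has 0 sentences
    count = 0
    for raw in text.split():      # raw words are never empty
        if raw[-1] in '.?!':      # the raw word ends a sentence
            count += 1
    return max(count, 1)          # any non-empty string has at least one
```

The final max is what handles text with no sentence-ending punctuation at all; an unpunctuated fragment after a real sentence adds nothing because only raw words ending in . ? or ! are counted.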
Counting syllables
You will only need to count syllables in punctuation-stripped words
(not raw words with non-alphabetic characters). When writing your
code that counts syllables in a word, you should assume:

A vowel is a capital or lowercase a, e, i, o, u, or y.

A syllable occurs in a punctuation-stripped word whenever:
o Rule 1: a vowel is at the start of a word
o Rule 2: a vowel follows a consonant in a word
o Rule 3: there is one exception: if a lone vowel e or E is at
the end of a (punctuation-stripped) word, then that
vowel does not count as a syllable.
o Rule 4: finally, everything that is a word must always
count as having at least one syllable.
Here are some examples -- this definitely is not an exhaustive list of
possibilities! You will want to test some other cases, as well.
>>> syllables( 'syllables' )
3
>>> syllables( 'one' )
1
>>> syllables( 'science' ) # it's not always correct...
1
As with words and sentences, these rules do not always match the
number of syllables that English speakers would agree on. For
computing readability, however, the errors tend not to impact the
final score too much, since they vary in both directions.
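Here is one way the four rules might be turned into code. The sketch makes an assumption about Rule 3: a "lone" final e is taken to mean a final e or E that is not immediately preceded by another vowel. The constant name VOWELS is also just a choice made for this example.

```python
VOWELS = 'aeiouyAEIOUY'


def syllables(oneword):
    """Return the number of syllables in a punctuation-stripped word."""
    count = 0
    for i in range(len(oneword)):
        # Rules 1 and 2: a vowel at the start, or a vowel after a consonant
        if oneword[i] in VOWELS and (i == 0 or oneword[i-1] not in VOWELS):
            count += 1
    # Rule 3 (assumed reading): a final e that does not follow a vowel
    # was counted above, so take it back off
    if oneword != '' and oneword[-1] in 'eE':
        if len(oneword) == 1 or oneword[-2] not in VOWELS:
            count -= 1
    # Rule 4: every word has at least one syllable
    return max(count, 1)
```

Rule 4 matters for words such as the, where the only vowel is a final e: Rule 3 brings the count down to zero and the max brings it back up to one.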
How should the input be collected?
For this problem, use input in order to take in the numeric menu
choices from the user.
However, use raw_input to take in the text for options 1, 2, 3, and
4. This way, the user will not have to type quotes. It is possible to
paste in large amounts of text at the raw_input prompt -- I tried
the entirety of Romeo and Juliet and it worked fine on a PC and
Mac, though you may have to hit return an additional time.
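The sketch below shows one possible shape for the menu loop. It departs from the instructions above in two labeled ways so that it can run without a keyboard and under either Python 2 or Python 3 (where raw_input was renamed input): the menu choice is read as a line and converted with int, and the read and write functions are passed in as parameters. The names menuLoop, actions, read, and write are all hypothetical.

```python
import sys

try:
    readLine = raw_input    # Python 2: returns the typed line as a string
except NameError:
    readLine = input        # Python 3: raw_input was renamed to input


def menuLoop(actions, read=readLine, write=sys.stdout.write):
    """Ask for an option repeatedly until the user picks 9 (Quit).

    actions maps an option number to a function that takes one string.
    read and write are parameters only so the loop can be tried out
    without a keyboard; that is an assumption of this sketch.
    """
    while True:
        # int() raises ValueError on non-numeric input; not handled here
        choice = int(read("What option would you like? "))
        if choice == 9:
            write("Goodbye!\n")
            return
        if choice not in actions:
            write("Unknown option: %d\n" % choice)
            continue
        text = read("Type some text: ")
        write("Result: %s\n" % actions[choice](text))
```

In a full program, actions would map 1, 2, and 3 to the sentences, words, and syllables functions, and 4 to whatever function produces the readability report.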
Sample readability scores
In addition to the counting capabilities mentioned above, you should
also implement the overall Flesch readability index as option #4.
A few details:

In addition to the overall readability score, be sure to
print the number of sentences, words, and total number
of syllables in the input text.

Cast the floating-point Flesch readability score to an int.

If the denominator numWords is zero, you should print a
warning and continue to provide the readability menu.
Here are two example runs (just option #4):
Choose an option: 4
Type some text: The cow is of the bovine ilk;
one end is moo, the other milk.
Number of sentences: 1
Number of words: 14
Number of syllables: 16
Readability index: 95
Choose an option: 4
Type some text: Fourscore and seven years ago our fathers
brought forth on this continent a new nation conceived in
Liberty and dedicated to the proposition that all men are
created equal.
Number of sentences: 1
Number of words: 29
Number of syllables: 48
Readability index: 37
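The counts in the two runs above can be checked against the formula by hand or with a short sketch like the one below, which also shows one way to guard against a zero denominator. The name readabilityReport and the wording of the warning are made up for this example.

```python
def readabilityReport(numSents, numWords, numSyls):
    """Return the lines option 4 might print (hypothetical helper)."""
    if numWords == 0 or numSents == 0:
        # avoid dividing by zero; the menu can simply continue afterwards
        return ["Warning: no words found, so readability is undefined."]
    fi = (206.835
          - 84.6 * numSyls / float(numWords)
          - 1.015 * numWords / float(numSents))
    return ["Number of sentences: %d" % numSents,
            "Number of words: %d" % numWords,
            "Number of syllables: %d" % numSyls,
            "Readability index: %d" % int(fi)]


# First run: 206.835 - 84.6*16/14 - 1.015*14/1 is about 95.94, so 95
cowLines = readabilityReport(1, 14, 16)
# Second run: 206.835 - 84.6*48/29 - 1.015*29/1 is about 37.37, so 37
gettysburgLines = readabilityReport(1, 29, 48)
```

Returning the lines instead of printing them is just a convenience for this sketch; printing each line directly would work equally well.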
Computing the readability of one of your papers
Finally, copy-and-paste one of your own papers (or other works)
into the readability scorer. Include a comment or triple-quoted
string at the TOP of your file that mentions what you tested (you
don't need to include all the text, just what text it was) and what
its score was... we look forward to seeing the results!
If you have gotten to this point, you have completed problem 4! You
should submit your hw5pr4.py file at the Submission Site.