Building a Markov Model
Patterns observed in data often give rise to insight. Unfortunately, in many common situations, the
volume and complexity of the data make it impossible for a human to perceive and interpret such
patterns. The goal of this exercise is to implement a program that enables a computer to quantify
structure in data, so that the structure may be used to characterize the generator of the data, compare
that generator with others, or predict other elements of the context from which the data were sampled.
As an example, consider the set W = {steam, teams, meets, teems, eat, ate, state, tease, test, mast,
mates} to be a random sample of words taken from frequently used vocabulary in some context. The
key elements of this example are:
1. Letters, which are primitive names for sounds, and which may be viewed as representative of
arbitrary tokens;
2. Words, which express or suggest relationships between tokens (letters), such as the observation
that "e" often, but not always, appears near "a"; and
3. Transitions from one token to another within a word.
In order to quantify the structure inherent within W, we first observe the set of starting letters S = {a, e,
m, s, t}. There are 11 words in W. Thus, if the sample W were all that was known about the language
from which it was drawn, one might hypothesize the following probabilities:
1. P(“a” is the first letter of a new word) = 1/11
2. P(“e” is the first letter of a new word) = 1/11
3. P(“m” is the first letter of a new word) = 3/11
4. P(“s” is the first letter of a new word) = 2/11
5. P(“t” is the first letter of a new word) = 4/11
Similarly, one may tally the token transitions that occur within the words and determine the probabilities
of transition from one token to another. For example, "a" is followed by another letter 8 times in W: by
"t" four times, by "m" twice, and by "s" twice, so the "at" transition has probability 4/8 = 0.5 and the
"am" and "as" transitions each have probability 0.25.
Transition Probabilities
Starting Letter Probabilities
Sample output of a Python program is given in this section. The numeric output values have been
truncated to no more than 4 digits after the decimal point. The first block of output, produced by code
that calculates the first-letter probabilities, is the following dictionary:
{' e': 0.0909, ' t': 0.3636, ' s': 0.1818, ' m': 0.2727, ' a': 0.0909}
Note the leading blank spaces in each key.
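A dictionary such as this can be produced by logic along the following lines. This is a minimal sketch,
not the actual program that generated the output: the names W and first_letter_probabilities are
illustrative, round is used where the display above truncates, and the key order may differ.

W = ["steam", "teams", "meets", "teems", "eat", "ate",
     "state", "tease", "test", "mast", "mates"]

def first_letter_probabilities(words):
    # Tally blank-to-letter transitions; the leading blank marks word start.
    counts = {}
    for word in words:
        key = " " + word[0]
        counts[key] = counts.get(key, 0) + 1
    return {key: round(n / len(words), 4) for key, n in counts.items()}

print(first_letter_probabilities(W))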
Letter-to-letter and Letter-to-end Transition Probabilities
The second block of output is a list of dictionaries, which provide the transition probabilities for each
letter in W. The dictionaries are presented in alphabetic order of the first symbol of each key, though
the content of any given dictionary may not be in alphabetic order.
{'at': 0.5, 'am': 0.25, 'as': 0.25}
{'em': 0.0769, 'e ': 0.2307, 'ea': 0.3077, 'es': 0.1538, 'et': 0.0769, 'ee': 0.1538}
{'m ': 0.1666, 'ma': 0.3333, 'ms': 0.3333, 'me': 0.1666}
{'st': 0.4444, 'se': 0.1111, 's ': 0.4444}
{'ts': 0.0769, 'ta': 0.0769, 't ': 0.2307, 'te': 0.6154}
In the displays, the start-of-word and end-of-word marks are blank spaces.
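The tables above can be computed in a similar fashion, as in the sketch below. Again the names are
illustrative, W is reused from the previous sketch, and because round is used rather than truncation a
trailing digit may occasionally differ from the values shown.

def transition_probabilities(words):
    # counts[letter] maps each two-character pair to its tally.
    counts = {}
    for word in words:
        padded = word + " "            # trailing blank = end-of-word mark
        for a, b in zip(padded, padded[1:]):
            counts.setdefault(a, {})
            counts[a][a + b] = counts[a].get(a + b, 0) + 1
    # One dictionary per letter, in alphabetic order of the first symbol.
    tables = []
    for letter in sorted(counts):
        total = sum(counts[letter].values())
        tables.append({pair: round(n / total, 4)
                       for pair, n in counts[letter].items()})
    return tables

for table in transition_probabilities(W):
    print(table)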
Generating Words
The next block of program output, shown below, gives the results of running a "word" generator five
times. For each trial, the generator produces a word together with the probability of that word occurring,
under the assumption that the sample words are representative of the words in the language. Note that
actual words, such as "mate" during this run, may be produced, but many nonsense words may also be
generated.
Also note that an asterisk, "*", is prepended to each word and a hash mark, "#", is appended to each
word. These were added to the display in order to highlight the start and, particularly, the finish of each
word, which includes a blank space in the implementation being used.
Probability= 0.006455083378160301
*mate #
Probability= 0.0034002085284136566
*stes #
Probability= 1.5176308961160885e-08
*ateateastesteates #
Probability= 0.0808080808080808
*s #
Probability= 0.011475703783396091
*ste #
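A generator might be sketched as follows, reusing the two sketches above. The name generate_word is
illustrative, and since the table entries were rounded, the reported probabilities will differ slightly from
the exact values shown in the run above.

import random

def generate_word(first_probs, transition_tables):
    # Merge the per-letter tables into one pair -> probability map.
    pair_probs = {}
    for table in transition_tables:
        pair_probs.update(table)
    # Pick the first letter using the blank-to-letter probabilities.
    start = random.choices(list(first_probs),
                           weights=list(first_probs.values()))[0]
    letter, prob, word = start[1], first_probs[start], start[1]
    # Walk letter to letter until the end-of-word blank is reached.
    while letter != " ":
        pairs = [p for p in pair_probs if p[0] == letter]
        pair = random.choices(pairs,
                              weights=[pair_probs[p] for p in pairs])[0]
        prob *= pair_probs[pair]
        letter = pair[1]
        word += letter
    return word, prob

word, prob = generate_word(first_letter_probabilities(W),
                           transition_probabilities(W))
print("Probability=", prob)
print("*" + word + "#")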
State Diagram
Thanks to Mr. Mark Mason, here is a diagram showing the states {Start, End, A, E, M, S, T} and the
transition probabilities arising from the samples provided in W.
Assignment (Demonstration due by the end of the last lab meeting of the
semester)
Create and apply a Python program that will:
1. Given a text file (20 points):
a. Read the contents;
b. Assemble the contents into a single string, with punctuation deleted and the letters in
each of the words mapped to a single case;
c. Provide a leading blank for the first word, a trailing blank for the last word, and a
separating blank between consecutive words;
d. Close the file (a minimal sketch of this step appears after this list).
2. Determine the transition probabilities (30 points) for:
a. First letters (these are blank-to-letter transitions);
b. Letter-to-letter and letter-to-blank pairs
3. Present the probabilities in a formatted and organized order (30 points);
4. Use the transition probabilities to generate 10-1000 hypothetical words in the language
represented by the sample and for each of the words (20 points):
a. Display the word;
b. Calculate and display the probability of occurrence of the word.
5. Compare writing samples of three authors with an unknown author and determine which
known author’s work is most nearly similar to that of the unknown author (30 points).
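For step 1, the file handling and normalization might be sketched as follows. This is one possible
approach rather than a required one, and "sample.txt" is a placeholder file name.

import string

def load_text(path):
    # Read the contents (1a); the with-block closes the file (1d).
    with open(path) as f:
        raw = f.read()
    # Delete punctuation and map the letters to a single case (1b).
    cleaned = raw.translate(str.maketrans("", "", string.punctuation)).lower()
    # Leading, trailing, and single separating blanks (1c).
    return " " + " ".join(cleaned.split()) + " "

text = load_text("sample.txt")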
Applications
1. Online shopping suggestions
2. Authorship comparisons
3. Word completion applications
4. Natural language processing