
Data Mining Project

- Jimmy Sit, Kristopher Tadlock, Kevin Mack

Title: Understanding key words used in any kind of election

Team Member(s): Jimmy Sit, Kristopher Tadlock, Kevin Mack

Primary Objective: Using our local data, determine whether word associations can be used to predict election voting results.

Introduction:

Data mining is defined as the computational process of discovering patterns in large data sets using methods from artificial intelligence, machine learning, statistics, and database systems. For our data mining term project, we focused on text analytics of our election ballot questions. Text analytics refers to the process of deriving information from text.

Our motivation for choosing text mining as our data mining project stems from wanting to know the frequency of each word used in every ballot question. Using these words, we wanted to find association rules describing how one word is related to another. An example of such a text association pairing would be that whenever the word “school” is used, “oversight” and “rate” are used as well.

Detailed information:

For our project, we used RapidMiner as our framework for discovering and creating word associations.

Using RapidMiner, we were able to break down the ballot questions into their individual words and create a word count. Using this data, we were able to determine how often each word is used. Displayed below is the process used to text mine in RapidMiner.

In order to text mine, first we need to “Retrieve” the data from our dataset. After retrieving the data, we “Set the Role” of our data – in other words, the column that we plan to use from our data. In our case, we set our role as the Ballot Question column field. The next process we used was a conversion from “Nominal to Text”. This conversion converts the nominal attribute of our data to text so that our text mining operation can proceed. The next process converts our text questions into their individual words. This process is called “Process Documents from Data”.
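
For readers who want to follow along outside of RapidMiner, these first three steps can be sketched roughly in Python with pandas. The file name and column name below are placeholders standing in for our actual dataset, not the real ones.

    # Rough pandas analogue of the "Retrieve", "Set Role", and
    # "Nominal to Text" steps (file and column names are placeholders).
    import pandas as pd

    # "Retrieve": load the dataset from disk.
    ballots = pd.read_csv("ballot_questions.csv")

    # "Set Role": select the column we plan to mine.
    questions = ballots["Ballot Question"]

    # "Nominal to Text": make sure every value is a plain string.
    questions = questions.astype(str)

    print(questions.head())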

Within our “Process Documents from Data” operator, we have a sub-process that handles the breakdown of our text.

Shown below is our process.

The “Tokenize” operation breaks each question into its individual words. We then use “Filter Stopwords”, “Filter Tokens”, and “Transform Cases”. The “Filter Stopwords” operation removes common words that are found in everyday sentences, such as ‘a’, ‘the’, ‘I’, etc. The “Filter Tokens” operation only keeps words that are at least 3 characters long, and “Transform Cases” sets all the words to lowercase.
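
A minimal sketch of this sub-process in plain Python is shown below; the stop-word set is only a small illustrative subset, not the full list RapidMiner applies.

    # Sketch of the Tokenize / Filter Stopwords / Filter Tokens /
    # Transform Cases sub-process (stop-word list is an illustrative subset).
    import re

    STOPWORDS = {"a", "an", "the", "i", "of", "to", "and", "or", "in", "for"}

    def preprocess(question):
        # Tokenize: split the question into individual words.
        tokens = re.findall(r"[A-Za-z]+", question)
        # Transform Cases: lowercase every token.
        tokens = [t.lower() for t in tokens]
        # Filter Stopwords: drop common everyday words.
        tokens = [t for t in tokens if t not in STOPWORDS]
        # Filter Tokens (by length): keep words of 3 or more characters.
        return [t for t in tokens if len(t) >= 3]

    print(preprocess("Shall the city issue school bonds at a fixed rate?"))
    # -> ['shall', 'city', 'issue', 'school', 'bonds', 'fixed', 'rate']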

When running this entire module, we retrieve the results as a word frequency table showing how many times each word was used within our data. Displayed below is our result.
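
The frequency table itself is simple to reproduce once the questions have been filtered into tokens; the short sketch below does it with Python’s Counter, using two made-up token lists as stand-ins for our real output.

    # Word-frequency table over the filtered tokens from each question
    # (the two token lists are made-up stand-ins for our real output).
    from collections import Counter

    tokenized_questions = [
        ["city", "issue", "school", "bonds", "fixed", "rate"],
        ["city", "increase", "school", "oversight", "budget"],
    ]

    counts = Counter(word for tokens in tokenized_questions for word in tokens)

    # Most frequent words first, like the RapidMiner word list.
    for word, freq in counts.most_common(10):
        print(word, freq)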

After receiving the word count, we added additional processes in order to determine word associativity. Basically, we wanted to know how one word is related to another. Shown below is the updated process for creating word associativity.

The additional processes use the “FP-Growth” algorithm to find frequent word combinations given a minimum support. Support and confidence are the measures used to determine whether a rule is valid: support is the fraction of ballot questions that contain all the words in a rule, and confidence is the fraction of questions containing the premise words that also contain the conclusion word. The “Create Association Rules” operator then uses the minimum confidence to create the associative rules for our data. Shown below are the results of our word associations.
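
To make the support and confidence idea concrete, below is a brute-force sketch in plain Python. It checks every candidate rule directly rather than using FP-Growth (which finds the same frequent word sets far more efficiently), and the four question word sets and the thresholds are illustrative, not our real data.

    # Brute-force illustration of the support/confidence idea behind
    # FP-Growth and Create Association Rules (illustrative data only).
    from itertools import combinations

    # Each "question" is the set of filtered words from one ballot question.
    questions = [
        {"school", "bonds", "issue", "oversight"},
        {"school", "bonds", "issue", "rate"},
        {"city", "tax", "rate"},
        {"city", "school", "bonds", "issue"},
    ]
    n = len(questions)

    def support(itemset):
        # Fraction of questions containing every word in the itemset.
        return sum(itemset <= q for q in questions) / n

    min_support, min_confidence = 0.5, 0.8
    vocabulary = set().union(*questions)

    # Candidate rules of the form {premise words} -> conclusion word.
    for size in (1, 2):
        for premise in combinations(sorted(vocabulary), size):
            premise = set(premise)
            if support(premise) < min_support:
                continue
            for conclusion in vocabulary - premise:
                rule_support = support(premise | {conclusion})
                confidence = rule_support / support(premise)
                if rule_support >= min_support and confidence >= min_confidence:
                    print(sorted(premise), "->", conclusion,
                          "support=%.2f confidence=%.2f" % (rule_support, confidence))

With these made-up questions, the rule from “bonds, issue” to “school” comes out with confidence 1.0, mirroring the kind of rule shown in our results.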

On the left-hand pane, we are able to select the conclusion words and display their relationships. Using the example displayed above, we can see that the premise words “bonds” and “issue” have a rule that links them to “school”. Thinking about the English language and past election information, this relationship makes sense. Word associativity helps determine how one word relates to another.

Learning experience of the project:

The text mining portion of the project exposed us to a different subset of data mining. By using the English language, past election data, and just regular common sense, we were able to understand how one word related to another. We believe that if this were a national election, we would see other key words occur, such as “states”, “country”, “federal”, etc. Remember, our data came from local county elections, so seeing words such as “city” and “school” made sense.

The biggest learning experience, and the most time-consuming part, came from learning and using the data mining framework, RapidMiner. Before we could create any modules to text mine, we had to learn how to use the framework. Once we had learned it, the framework opened us up to different ways of seeing our data.

Summary:

Although we were unable to complete our primary objective due to time constraints, we were able to complete a portion of it through text mining word frequencies and word associations. If time had permitted, our next step would have been to take the word associations generated from our data mining portion and tie them into our data mart to find correlations between the words used and election results over a given time frame.

From our data, we were able to see that the word “city” was used the most in our dataset. We believe the reason for this is that our data is local. We surmise that if we had election data from a higher level, we would find words such as “state” and “federal” leading the frequency count.

