Ashton Friedman
Charles McCoy
Jarett Miller
Political Alignment Algorithm
Abstract
The goal of this paper is to provide an
examination of an algorithm designed for
matching prospective voters with like-minded
candidates. In this paper we examine the
intricacies of our algorithm and compare it
side-by-side with known proprietary programs,
gauging the results using root-mean-square
deviations. This analysis includes our weighted
matching system, why we chose this particular
data set and its source, and our implementation
of fuzzy logic. Algorithm explanations are
provided first. This is followed by analysis and
comparison of the chosen algorithms as well as
those we ultimately discarded.
Introduction
Deciding whom to vote for is never easy.
Keeping up with speeches and the dynamic views
of politicians can be time consuming. This can
lead to a population of voters that do not have
enough knowledge about candidates to vote for
an individual who will represent the voter’s views
in Congress. Our algorithm provides users a
simple way to learn about the candidates that
support or oppose the same issues as the user and
will thus best represent that user’s views. A set of
20 issues is presented to the user. We have
obtained politician data for the same issues. Once
this data is in quantifiable terms, we use weighted
matching and fuzzy logic to determine a
percentage-based match from a piecewise
comparison. The design for our algorithm is our
own, but is closely related to generic matching
algorithms. The biggest difference between our
algorithms and simpler matching algorithms is
weighting and fuzzy logic implementation. We
chose to implement fuzzy logic in two of our
algorithms due to the imprecise nature of human
opinion. Fuzzy systems were designed to handle
data sets where membership is not a crisp value
but instead belongs to a particular set to a certain
degree. This made it much simpler to quantify a
value such as “Strongly Agree.” We compare our
results to two other websites that implement a
political matching system: VoteSmart and
VoteMatch. Both use proprietary matching
algorithms. Thus, we are able to generate data
that can largely be compared consistently across
all three platforms. Based on our observations
and experimentation with VoteSmart, we chose it
to be our primary benchmark for comparison.
Background for the Algorithms
Our project focused on making an
application for users that would help them find
which politicians their political views most
closely match. We started by finding some of the
major political issues that voters typically
consider when voting. We came across several
websites which contained top issues and decided
to use those found on the VoteMatch quiz.
Initially, we had intended to look at the voting
records of North Carolina politicians and decide
their stance on the top issues based on the votes
they cast in support or opposition of those issues.
This proved to be an insurmountable task for a
semester-long project. Thus, we began looking
for reputable sites that had already implemented
a grading system based on voting record.
VoteMatch had a well-organized and
documented system in place. Upon selecting a
candidate and issue, the site listed all resources
used in calculating the ranking on that issue for
that particular politician. VoteMatch went further
and included a quiz for users that listed the same
issues. Thus, we chose VoteMatch data for our
politician database as well as our question bank
due to the ease of comparing our algorithm’s
performance against the VoteMatch algorithm.
We also compared our findings against the
algorithm used by VoteSmart. Both applications
give their best politician matches based on users'
responses, and although VoteSmart does not
make its algorithm public, it gives us another
basis for comparison.
Weighted Algorithm
Our algorithm uses weights for each
answer given by the user, and then gives a
percentage match of the user to each politician
contained in our database. We use the same
questions given on the VoteMatch quiz. If a user
strongly favors an issue the weighting factor is
four. An issue which the user favors is given a
three. A neutral stance is weighted with a value
of two. If the user opposes or strongly opposes
an issue the weighting value is one or zero,
respectively. The politician stances, gathered
from Ballotpedia, are given the same weights
within the database. It is possible for a politician
to have an unknown stance on a particular issue
due to a lack of evidence to claim support or
opposition on the issue. In this case the weighted
value will be five, and our algorithm will not
include that issue in the arithmetic process for
finding the percentage match. After the user has
answered all questions, our algorithm then takes
the absolute difference of each of the user and
politician weights for each answer. If the
absolute difference is four, then the user and
politician are completely opposite on their
stances for that issue. If the absolute difference
is zero then they both agree with their stances.
Our algorithm then computes the total
possible value by which to divide the sum of the
absolute differences: one hundred, minus four for
each question on which the politician's weighted
value is greater than four. If all of the politician's
stances are known, the total possible value is one
hundred. Knowing the sum of the absolute
differences and the total possible value, our
algorithm divides the sum of the absolute
differences by the total possible value, multiplies
that quotient by one hundred, and subtracts the
result from one hundred. This gives the
percentage match of the user to the politician.
Our algorithm runs through this process for each
politician in our database, and then gives a listing
of each politician with the percent that they match
the user.
For a more concrete understanding of
how our algorithm works, the following is a
formal mathematical description of the
algorithm’s processes. Several variables are used
in the equations: Total Value Obtained (TVO) is
the sum of the absolute differences, Total
Possible (TP) is the possible total of all weighted
values, Percentage Match (PM) is the percentage
of how much a user’s stances match with a
politician’s stances, User Value (UV) is the
weighted value of a user’s stance on an issue,
Politician Value (PV) is the weighted value of a
politician’s stance on an issue.
TVO = Σ |UV − PV|
TP = 100 − 4 × (number of issues with PV > 4)
PM = 100 − (TVO / TP) × 100
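The steps above can be sketched in Java, the language we used for our tests. The class and method names here are illustrative; the constants assume the twenty-question quiz, with stances weighted zero to four and a politician value of five marking an unknown stance:

```java
// Sketch of the weighted matching algorithm described above.
// Assumes the 20-question quiz; user and politician stances are
// weighted 0-4, and a politician value of 5 marks an unknown stance.
public class WeightedMatch {

    // Percentage match between one user and one politician.
    public static double percentageMatch(int[] user, int[] politician) {
        int tvo = 0;      // Total Value Obtained: sum of absolute differences
        int unknown = 0;  // questions with no known politician stance
        for (int i = 0; i < user.length; i++) {
            if (politician[i] > 4) {
                unknown++;  // unknown stance: excluded from the arithmetic
            } else {
                tvo += Math.abs(user[i] - politician[i]);
            }
        }
        double tp = 100 - 4 * unknown;  // Total Possible value
        return 100 - (tvo / tp) * 100;  // Percentage Match
    }

    public static void main(String[] args) {
        int[] user = new int[20];
        int[] politician = new int[20];
        for (int i = 0; i < 20; i++) { user[i] = 4; politician[i] = 0; }
        // Completely opposite stances on all 20 issues:
        System.out.println(percentageMatch(user, politician)); // 20.0
    }
}
```

Note that because the total possible value is one hundred while the largest possible sum of differences over twenty questions is eighty, completely opposite stances still yield a 20% match rather than 0%.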
Analysis of the Weighted Algorithm
By running several tests we found that
our algorithm runs in O(n) time, where n can
represent the number of politicians in the
database or the number of questions asked. For
the first set of tests, based on the number of
politicians, we wrote a program that fabricated a
number of imaginary politicians ranging from ten
to one-million. The program generated these
politicians, and then gave them weighted values
for stances on each of the twenty questions in the
questions list, ranging from zero to four. We did
not allow unknown stances in the test, to ensure
that all tests were run with the same total possible
value of one hundred. The weighted values for
each politician were generated using the Random
utility in Java. After generating a list of ten
politicians in this manner, we then used that list
of politicians as our politician database and ran
through our algorithm using user inputs for
strongly favoring all issues, strongly opposing all
issues, neutrality on all issues, and random user
stances for each issue. This process was done for
politician lists of ten, one-hundred, one-thousand,
ten-thousand, one-hundred thousand, two
hundred fifty thousand, five-hundred thousand,
seven hundred fifty thousand, and one-million
politicians. All tests gave run times of O(n). The
following is an image of the graph showing the
results of the test with random user stances. The
blue line is our algorithm's run time, while the red
line is a linear trend line showing that our
algorithm runs in linear time.
[Figure: algorithm time (ns) versus an increasing number of "politicians," from 0 to 1,000,000, with a linear trend line.]

For the second set of tests we increased the number of questions for the user to answer, as well as the number of politicians against which the user was compared. We found that an increase in both questions and politicians still gave a run time of O(n), and a graph of these results again showed a linear trend.

Extending the number of politicians past a mark of ten thousand is purely for testing purposes, since there are not many more than ten thousand politicians in the United States. The number of questions needs to be kept to a minimum that allows for valid results while making it worthwhile for a user to answer all of the questions, without hitting a point where they no longer care how they answer.

Fuzzy Logic with Clustering Implementation

We implemented a second comparison algorithm utilizing fuzzy logic and k-means clustering. This algorithm takes in a database of politicians in the same way our previous algorithm does. However, instead of simply comparing the raw politician values against the raw user values, both sets are used to generate fuzzy sets. In the case of our program, we created a fuzzy support set, meaning the more strongly a user or politician supported an issue, the more their value belonged to the fuzzy support set. The membership function is as follows:

A = {(x, μ_A(x)) | x ∈ X}

μ_A(x) = (raw value) / 4

This generates values in the range of 0 to 1 correlating to the amount of support a politician or user has for a particular issue. Once these fuzzy sets are generated, the user's set is compared against each of the politician sets using clustering. However, rather than partitioning the politician database into a few cluster sets, each politician is his own cluster, and the user's set is assigned as the center of this "cluster." A Euclidean distance between the center and the politician is then calculated that correlates to a measure of dissimilarity. Writing c_k for the user's (center) membership value and x_k for the politician's membership value on issue k, the distance formula used in the algorithm is:

D = Σ_{k=1}^{n} ||x_k − c_k||²

A maximum distance value is also created by taking, for each user dimension value, the maximum of its distance from one and its distance from zero. This generates the greatest difference a user could have from any politician. These differences are squared and summed as in the previous formula:

maxD = Σ_{k=1}^{n} ||max(1 − c_k, c_k)||²
Dividing the distance by the maximum distance yields a measure of difference between the politician and the user. Multiplying this value by 100 (to convert it to a percentage) and subtracting it from 100 yields the percentage by which the user and the politician match.

Percentage = 100 − (D / maxD) × 100
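Under the same conventions, the fuzzification and per-politician comparison can be sketched as follows. The class and method names are illustrative, and reading the distance as a single per-politician sum over the issues is our reconstruction of the formulas above:

```java
// Sketch of the fuzzy clustering comparison described above.
// Raw stances (0-4) are fuzzified to membership values in [0, 1];
// the user's fuzzy set acts as the cluster center for each politician.
public class FuzzyCluster {

    // Membership function: mu_A(x) = (raw value) / 4.
    public static double[] fuzzify(int[] raw) {
        double[] mu = new double[raw.length];
        for (int i = 0; i < raw.length; i++) mu[i] = raw[i] / 4.0;
        return mu;
    }

    // Squared Euclidean distance between a politician set and the center.
    public static double distance(double[] politician, double[] center) {
        double d = 0;
        for (int k = 0; k < center.length; k++) {
            double diff = politician[k] - center[k];
            d += diff * diff;
        }
        return d;
    }

    // Greatest distance any politician could have from this user.
    public static double maxDistance(double[] center) {
        double d = 0;
        for (int k = 0; k < center.length; k++) {
            double diff = Math.max(1 - center[k], center[k]);
            d += diff * diff;
        }
        return d;
    }

    // Percentage = 100 - (distance / maxDistance) * 100.
    public static double percentageMatch(int[] user, int[] politician) {
        double[] center = fuzzify(user);
        double d = distance(fuzzify(politician), center);
        return 100 - (d / maxDistance(center)) * 100;
    }

    public static void main(String[] args) {
        int[] user = {4, 0, 2, 4};
        // Identical fuzzy sets have zero distance:
        System.out.println(percentageMatch(user, user)); // 100.0
    }
}
```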
Analysis of Fuzzy Logic with Clustering
This algorithm's complexity is O(n³)
due to the nested for loops used to access the
politician and user fuzzy sets multiple times to
calculate the distance. The outermost loop scans
each politician in the database. The middle loop
scans each politician response. The innermost
loop scans the user’s responses. The distance
calculation is the most costly calculation of the
entire algorithm due to the need to compare each
element in the user array to each element in the
politician arrays. The remainder of the
algorithm’s complexity is similar to the weighted
algorithm because it is simply calculating a
percentage from the two distances computed.
Fuzzy Logic with Set Operations
We implemented a third comparison
algorithm using set operations on the fuzzy sets
generated utilizing the same methods as the
second algorithm. The user set and politician
sets were compared by taking the union of the
two sets and the intersection of the two sets.
Union:
μ_C(x) = max(μ_A(x), μ_B(x)) = μ_A(x) ∨ μ_B(x)

Intersection:
μ_D(x) = min(μ_A(x), μ_B(x)) = μ_A(x) ∧ μ_B(x)
The union yields the "smallest" fuzzy set that contains both A and B. The intersection, on the other hand, yields the "largest" fuzzy set contained in both A and B. These two sets can be compared using the distance formula as was applied in the clustering algorithm. That distance can then be compared to the maximum distance, also as in the clustering algorithm; dividing the two distances again results in a percentage of dissimilarity, which can be subtracted from 100 to yield a value of sameness for the user.

Analysis of Fuzzy Logic with Set Operations

This algorithm has the same complexity as the clustering algorithm because it also computes a distance between two sets, here the union set and the intersection set. Thus, it also operates in O(n³).
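The set-operations comparison can be sketched as follows, under the same membership conventions. The names are illustrative, and the normalization of the union-versus-intersection distance (by the set length, its largest possible value) is our assumption, since the exact maximum-distance construction for this variant is not spelled out above:

```java
// Sketch of the fuzzy set-operations comparison described above.
// Membership values are in [0, 1]; union takes the pointwise max and
// intersection the pointwise min of the user and politician sets.
public class FuzzySetOps {

    public static double[] union(double[] a, double[] b) {
        double[] c = new double[a.length];
        for (int k = 0; k < a.length; k++) c[k] = Math.max(a[k], b[k]);
        return c;
    }

    public static double[] intersection(double[] a, double[] b) {
        double[] c = new double[a.length];
        for (int k = 0; k < a.length; k++) c[k] = Math.min(a[k], b[k]);
        return c;
    }

    // Squared Euclidean distance, as in the clustering algorithm.
    public static double distance(double[] a, double[] b) {
        double d = 0;
        for (int k = 0; k < a.length; k++) d += (a[k] - b[k]) * (a[k] - b[k]);
        return d;
    }

    // Sameness: identical sets give 100, and the match drops as the
    // union and intersection drift apart. Dividing by the set length
    // (the largest possible distance) is our assumed normalization.
    public static double percentageMatch(double[] user, double[] politician) {
        double d = distance(union(user, politician),
                            intersection(user, politician));
        return 100 - (d / user.length) * 100;
    }

    public static void main(String[] args) {
        double[] u = {1.0, 0.0, 0.5};
        // Identical sets: union equals intersection, distance is zero.
        System.out.println(percentageMatch(u, u)); // 100.0
    }
}
```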
The Other Algorithms
As stated prior, there are two other
applications which seek to accomplish the same
goal as our algorithm, and which we used as
comparisons. The first was VoteSmart; for this we
used VoteSmart's VoteEasy application. Since
VoteSmart's algorithm is not public knowledge,
we had to rely on black-box testing in order to
come up with results for comparison. Another
issue we encountered was that their questions do
not map exactly to ours. They present ten issues to the
user. To overcome this potential pitfall, we had
to introduce some measure of bias in translating
our question set to map onto theirs. [3] They also
use a different answering scheme, which we
matched to our weighted answers as best we
could: if the user's value was two or above, we
answered yes to the correlating VoteSmart
question, and otherwise no. A two indicated
"somewhat" of an interest in the issue; a three (or,
for a no answer, a one) indicated the middle
importance option; and a four (or a zero for a no
answer) indicated the highest
importance option. After going through this
process with eighteen randomly generated
samples from our algorithm, we compared the
output of the algorithms.
The other application that we compared
our algorithm with is Vote Match. Vote Match’s
weighting scheme is publicly available on their
website and is as follows. The first step involves
gathering a base score for each user answer: a
question where the user's answer exactly matches
the politician's answer gets a value of one
hundred and fifty; one level away from the
politician's answer gives a value of seventy-five;
and if the stance is unknown it gets a value of
zero. The second step is to multiply the values
from the first step by percentage weights based
on the importance levels chosen by the user: an
issue that is extremely important gets a percent
multiplier of one hundred and seventy-five, a
very important issue gets one hundred and fifty,
a somewhat important issue gets one hundred and
twenty-five, a slightly important issue gets one
hundred percent, and if a value is not known there
is no multiplier. The algorithm multiplies the
base values from step one by the multipliers from
step two. [2]
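As we read that published scheme, a single question's contribution can be sketched as follows. The names are illustrative, and the handling of answers more than one level apart is not stated on the site, so treating them as a zero base score here is an assumption:

```java
// Sketch of our reading of the published VoteMatch weighting scheme.
// Answers more than one level apart get a base score of zero here;
// the site does not spell that case out, so it is our assumption.
public class VoteMatchScore {

    // Step one: base score for one question.
    public static int baseScore(int userAnswer, int politicianAnswer,
                                boolean known) {
        if (!known) return 0;                 // unknown stance
        int gap = Math.abs(userAnswer - politicianAnswer);
        if (gap == 0) return 150;             // exact match
        if (gap == 1) return 75;              // one level away
        return 0;                             // assumption: further apart
    }

    // Step two: percent multiplier from the user's importance level.
    // 3 = extremely, 2 = very, 1 = somewhat, 0 = slightly important.
    public static double multiplier(int importance) {
        switch (importance) {
            case 3:  return 1.75;
            case 2:  return 1.50;
            case 1:  return 1.25;
            default: return 1.00;
        }
    }

    public static double score(int userAnswer, int politicianAnswer,
                               boolean known, int importance) {
        return baseScore(userAnswer, politicianAnswer, known)
               * multiplier(importance);
    }

    public static void main(String[] args) {
        // Exact match on an extremely important issue: 150 * 1.75.
        System.out.println(score(2, 2, true, 3)); // 262.5
    }
}
```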
In order to understand how the other
algorithms compared to ours, we used the Root
Mean Squared Deviation (RMSD) of the
percentage matches from each. The RMSD
equation is:
RMSD = √( Σ (x₁ − x₂)² / n )

where x₁ is our percentage match, x₂ is the other
application's percentage match, and n is the total
number of politicians we were able to match, 169.
With this equation we found a deviation of 20.10
percent between our weighted algorithm and
VoteSmart, while VoteMatch had a deviation of
27.32 percent from the weighted algorithm. [1]
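The RMSD computation over two algorithms' percentage matches can be sketched as:

```java
// Root Mean Squared Deviation between two algorithms' percentage
// matches over the same list of politicians.
public class Rmsd {

    public static double rmsd(double[] ours, double[] theirs) {
        double sum = 0;
        for (int i = 0; i < ours.length; i++) {
            double diff = ours[i] - theirs[i];
            sum += diff * diff;  // squared deviation per politician
        }
        return Math.sqrt(sum / ours.length);  // root of the mean
    }

    public static void main(String[] args) {
        double[] ours   = {80.0, 60.0, 40.0};
        double[] theirs = {70.0, 70.0, 50.0};
        System.out.println(rmsd(ours, theirs)); // 10.0
    }
}
```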
Both of the tests above were done with
different politicians, because our politician
database contains some politicians from the
VoteMatch database and different ones from the
VoteSmart database. There were also several
politicians that were not contained in either the
VoteMatch database used for testing or the
VoteSmart database used for testing. The
reason for the deviations found is that our
algorithm must differ from the algorithms
used by both VoteSmart and VoteMatch. It is
difficult to declare which algorithms are better,
since we do not know exact algorithms for the
other two applications, but we are confident that
our algorithm supplies a good match. This is
because it takes into account the fact that if a
person favors support of an issue, while a
politician opposes support on that same issue,
there still may be some similarities in their
thinking. This does not appear to be taken into
account with the other algorithms. In our
weighted algorithm, the only way for a user to be
completely different from a politician is for the
two to sit at completely opposite ends of our
spectrum.
Results of Algorithm RMSD:

VoteMatch vs Fuzzy Set Operations: 30.66%
Weighted vs Fuzzy Set Operations: 15.63%
VoteSmart vs Fuzzy Set Operations: 20.43%
Fuzzy Set Operations vs Fuzzy Cluster: 12.15%
VoteSmart vs Fuzzy Cluster: 16.72%
VoteMatch vs Fuzzy Cluster: 25.33%
VoteSmart vs Weighted: 20.10%
VoteMatch vs Weighted: 27.32%
Discarded Algorithms
The knapsack problem appeared to be a
promising start as a base for our algorithm. If you
imagine the ‘knapsack’ to instead be a
comparison of user data and politician data, and
the ‘stones’ of various weights to be issues, the
two problems map over nicely. The idea was to
assign a 20-digit binary number to each
politician and user, with each bit representing a
question and a 1 or 0 representing yes or no,
respectively. This way, the more significant a bit
is in the number, the higher-priority the question
it represents (e.g., 1011 would represent yes, no,
yes, yes, with the first 'yes' holding a weight of 8,
the 'no' a weight of 4, and the last two 'yes'
answers weights of 2 and 1). If a user cared a lot
about gun control, for example, it would occupy
a high bit in the 20-bit number, while another
issue of less importance would sit lower. As long
as the user's data and the politicians' data were
lined up, we could use an inverted XOR to keep
only the issues on which they match (both a 1 or
both a 0), then total up the corresponding weights
on those matched issues, resulting in a number
that could be used for comparison analysis.
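The bit-matching idea can be sketched as follows, with illustrative names and a four-question example padded into the 20-bit scheme. Conveniently, the value of the masked inverted-XOR result is itself the sum of the weights of the matched issues:

```java
// Sketch of the discarded knapsack-style bit-matching idea. Each of
// the 20 bits encodes a yes (1) / no (0) answer, with more significant
// bits holding higher-priority questions. An inverted XOR keeps only
// the bits where the user and politician agree, so the resulting
// number is the sum of the weights of the matched issues.
public class BitMatch {

    static final int MASK = (1 << 20) - 1;  // keep only the 20 question bits

    // Agreement score: sum of the bit weights on matched issues.
    public static int matchScore(int user, int politician) {
        return ~(user ^ politician) & MASK;
    }

    public static void main(String[] args) {
        // Four-question example: user 0b1011 (yes no yes yes)
        // versus politician 0b1001 (yes no no yes).
        int score = matchScore(0b1011, 0b1001);
        // They agree on bits 3, 2, and 0, so the low four bits
        // sum the weights 8 + 4 + 1.
        System.out.println(score & 0b1111); // 13
    }
}
```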
With this implementation, we found that
it was easy to determine the priority of an issue,
but getting user data synced up with our databases
was a big obstacle. It was also difficult to capture
the magnitude of a question (how polarized a
person is on an issue), which was to be factored
in alongside how important the question is as a
whole to the user. This means that if a user was in
favor (but not strongly in favor) of an issue, and
it was important to them that the politician also
have a moderate view on that issue, this algorithm
would have a hard time deciphering that without
involving numbers upwards of 2^20.
Next we considered using the stable
marriage problem. The stable marriage problem
looked very promising at first. Instead of the
‘women and men’ in the problem, we would have
users and politicians; instead of each
user/politician having a list of preferred partners
that would ideally match, we’d have lists of
weighted views that would ideally match. In this
case, the user would be the ‘female’ from the
original stable marriage problem, and the
politicians would be the proposing ‘men’. The
user would cycle through the list of politicians
until a match was found. Failing that, the next
most suitable politician would be paired with the
user. The calculations involved in computing the
match would consist of taking the values that
were a match between user and politician and
taking a weighted average. We discovered that a
drawback of using the stable marriage problem
was that in the original problem, the criteria for a
match are thrown out with every match. We
found a slightly different implementation of the
problem called the intern assignment problem.
This is essentially the same as the stable marriage
problem, only with interns and businesses instead
of men and women; and is more “polygamous”
since each ‘business’ can have multiple ‘interns’.
However, even with this implementation,
weighting was still a problem. Difficulties such
as keeping the questions synced with the users
(not all politicians had answered the same
questions) hindered us, since the order of the
questions was important for the weighting system
to work. Also, the stable marriage problem and its
derivatives are geared toward situations in which
a large number of variables are matched against
an equal number of different variables, with each
having exactly one match. In our program, we
needed one to match against many. The last
problem stems directly from the previous. The
stable marriage problem is designed to make one
match, stop, and then strike that matched item
from the list of criteria in order to continue
making matches. Our implementation would
require the same data be compared multiple times
without disregarding any data.
A candidate algorithm that generates a
reliable match needs a method to weight issues
based on the question's relevance to the user and
on the magnitude of how strongly they agree or
disagree with the issue. Since both the knapsack
problem and the stable marriage problem fall
short in different departments, we made our
own algorithm, capable of weighting questions
by importance and magnitude. This algorithm
implements some fuzzy logic combined with
more generic methods of sorting our weighted
values.
As stated earlier, we compare our program to
results from VoteSmart and VoteMatch. Since we
used the same questions as VoteMatch, it was
easy to transpose the data set. For VoteSmart,
we had to re-word a few questions to make them
match up while ensuring the gist of the questions
did not change.
These results are fairly consistent in the
sense that our results never matched the
VoteMatch results as well as they did the
VoteSmart results. The deviation from the
VoteMatch results to ours was roughly 30%,
while the deviation from the VoteSmart results
to ours was roughly 20%. The fuzzy logic set
operations and clustering implementations
showed the best results, followed by the original
weighted algorithm.
Conclusion
Objectively quantifying views that are
fundamentally subjective is challenging. Perhaps
the most difficult part is compiling the data into a
universally manageable format. Turning feelings
into 1’s and 0’s while retaining their significance
is no mean feat. However, a modicum of success
was achieved with a simple matching algorithm.
The fuzzy logic implementation better captured
the complex nature of opinion. Although a
deviation of 20% may sound like a lot, these
deviations occur because of differing systems
for weighing data. It should be noted that
when our algorithm, VoteSmart, and VoteMatch
returned the percentages, they were in near
identical order. The deviations came from
comparing the actual percent match of the user
and each politician. Although those values may
be off, the ranking for compatible politicians for
each user stayed very consistent.
Our challenges with the knapsack
problem and the stable marriage problem
were our basis for utilizing fuzzy logic solutions.
Each algorithm we implemented garnered a
slightly closer result to our benchmark and
broadened our overall view of the study of
algorithms.
References
1. Vote Match.
http://www.ontheissues.org/Quiz/Quiz2014.asp?quiz
=Pres2016. GoVote.com. 2000-2014. Accessed
November 26, 2014.
2. Vote Match.
http://www.ontheissues.org/quizeng/how_it_works.as
p?Dir=. GoVote.com. 2000-2014. Accessed
November 26, 2014.
3. Vote Smart. http://votesmart.org/voteeasy/.
Project Vote Smart. 2014. Accessed November 26,
2014.
4. Jang, J.-S. R.; Sun, C.-T.; Mizutani, E.
Neuro-Fuzzy and Soft Computing. Prentice Hall;
1997.
5. Kruse, Gebhardt, Klawonn. Foundations of
Fuzzy Systems. John Wiley & Sons Ltd.; 1994.