Ashton Friedman, Charles McCoy, Jarett Miller

Political Alignment Algorithm

Abstract

The goal of this paper is to examine an algorithm designed for matching prospective voters with like-minded candidates. We examine the intricacies of our algorithm and compare it side-by-side with known proprietary programs, gauging the results using root mean squared deviations. This analysis includes our weighted matching system, why we chose this particular data set and its source, and our implementation of fuzzy logic. Algorithm explanations are provided first, followed by analysis and comparison of the chosen algorithms as well as those we ultimately discarded.

Introduction

Deciding whom to vote for is never easy. Keeping up with speeches and the dynamic views of politicians can be time-consuming. This can leave a population of voters without enough knowledge about the candidates to vote for the individual who will best represent their views in Congress. Our algorithm provides a user a simple way to learn which candidates support or oppose the same issues as the user and will thus best represent that user's views. A set of 20 issues is presented to the user, and we have obtained politician data for the same issues. Once this data is in quantifiable terms, we use weighted matching and fuzzy logic to determine a percentage-based match from a piecewise comparison. The design of our algorithm is our own, but it is closely related to generic matching algorithms. The biggest difference between our algorithms and simpler matching algorithms is the weighting and fuzzy logic implementation. We chose to implement fuzzy logic in two of our algorithms due to the imprecise nature of human opinion. Fuzzy systems were designed to handle data sets where membership is not a crisp value but instead belongs to a particular set to a certain degree.
This made it much simpler to quantify a value such as "Strongly Agree." We compare our results to two other websites that implement a political matching system, VoteSmart and VoteMatch, both of which use proprietary matching algorithms. Thus, we are able to generate data that can largely be compared consistently across all three platforms. Based on our observations and experimentation with VoteSmart, we chose it as our primary benchmark for comparison.

Background for the Algorithms

Our project focused on making an application that would help users find which politicians most closely match their political views. We started by identifying some of the major political issues that voters typically consider when voting. We came across several websites listing top issues and decided to use those found on the VoteMatch quiz. Initially, we had intended to look at the voting records of North Carolina politicians and decide their stances on the top issues based on the votes they cast in support or opposition of those issues. This proved to be an insurmountable task for a semester-long project, so we began looking for reputable sites that had already implemented a grading system based on voting record. VoteMatch had a well-organized and documented system in place: upon selecting a candidate and issue, the site lists all resources used in calculating that politician's ranking on that issue. VoteMatch goes further and includes a quiz for users covering the same issues. Thus, we chose VoteMatch data for our politician database as well as our question bank, due to the ease of comparing our algorithm's performance against the VoteMatch algorithm. We also compared our findings against the algorithm used by VoteSmart. Both applications give their best politician matches based on users' responses, and although VoteSmart does not make its algorithm public, it gives us another base for comparison.
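As a small illustration of quantifying an opinion such as "Strongly Agree," the following Python sketch (not taken from our implementation, which was written in Java) maps the five answer levels onto degrees of membership in a fuzzy "supports the issue" set, using the 0-4 answer scale and the (raw value)/4 membership function this paper adopts. The label names are illustrative.

```python
# Sketch: quantifying Likert-style answers as degrees of membership
# in a fuzzy "supports the issue" set, per the 0-4 scale used in this paper.
ANSWER_SCALE = {
    "Strongly Oppose": 0,
    "Oppose": 1,
    "Neutral": 2,
    "Favor": 3,
    "Strongly Favor": 4,  # the top level, e.g. "Strongly Agree"
}

def support_membership(answer: str) -> float:
    """Degree (0.0 to 1.0) to which an answer belongs to the support set."""
    return ANSWER_SCALE[answer] / 4

print(support_membership("Strongly Favor"))  # 1.0: a full member of the set
print(support_membership("Favor"))           # 0.75: a partial member
```

Rather than forcing each answer into a crisp "supports" or "opposes" category, every answer belongs to the support set to some degree.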
Weighted Algorithm

Our algorithm uses weights for each answer given by the user, and then gives a percentage match between the user and each politician contained in our database. We use the same questions given on the VoteMatch quiz. If a user strongly favors an issue, the weighting factor is four. An issue which the user favors is given a three. A neutral stance is weighted with a value of two. If the user opposes or strongly opposes an issue, the weighting value is one or zero, respectively. The politician stances, gathered from Ballotpedia, are given the same weights within the database. It is possible for a politician to have an unknown stance on a particular issue due to a lack of evidence to claim support or opposition; in this case the weighted value is five, and our algorithm does not include that issue in the arithmetic for finding the percentage match. After the user has answered all questions, our algorithm takes the absolute difference between the user's and the politician's weights for each answer. If the absolute difference is four, the user and politician have completely opposite stances on that issue; if the absolute difference is zero, their stances agree. Our algorithm then computes the total possible value of all questions: one hundred, minus four for each question for which the politician's weighted value is greater than four. This is the value by which the sum of the absolute differences is divided; if all of the politician's stances are known, the total possible value is one hundred. Knowing the sum of the absolute differences and the total possible value, our algorithm divides the former by the latter, multiplies the quotient by one hundred, and subtracts the result from one hundred. This gives the percentage match between the user and the politician.
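The process just described can be sketched in a few lines of Python (a minimal illustrative sketch, not our actual Java implementation; the stance lists below are hypothetical):

```python
UNKNOWN = 5       # sentinel weight for an unknown politician stance
NUM_QUESTIONS = 20

def percentage_match(user, politician):
    """Percentage match between a user's and a politician's weighted
    stances (0-4 each; 5 marks an unknown politician stance), following
    the paper's TVO/TP/PM formulas for a 20-question quiz."""
    assert len(user) == len(politician) == NUM_QUESTIONS
    # Sum of absolute differences over issues with a known politician stance.
    tvo = sum(abs(u - p) for u, p in zip(user, politician) if p != UNKNOWN)
    # Total possible value: 100, minus 4 for each unknown politician stance.
    tp = 100 - 4 * sum(1 for p in politician if p == UNKNOWN)
    return 100 - (tvo / tp) * 100

# Hypothetical stances: the user strongly favors everything; the politician
# agrees on all but one issue, and one of their stances is unknown.
user = [4] * NUM_QUESTIONS
politician = [4] * 18 + [0, UNKNOWN]
print(round(percentage_match(user, politician), 2))  # 95.83
```

Note that under these formulas a politician completely opposite to the user on every issue still scores 20 percent, reflecting the view (discussed later) that opposing stances can retain some similarity in thinking.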
Our algorithm runs through this process for each politician in our database, and then gives a listing of each politician with the percentage by which they match the user. For a more concrete understanding of how our algorithm works, the following is a formal mathematical description of its processes. Several variables are used in the equations: Total Value Obtained (TVO) is the sum of the absolute differences; Total Possible (TP) is the possible total of all weighted values; Percentage Match (PM) is the percentage by which a user's stances match a politician's stances; User Value (UV) is the weighted value of a user's stance on an issue; and Politician Value (PV) is the weighted value of a politician's stance on an issue.

TVO = Σ |UV − PV|  (summed over issues with a known politician stance)

TP = 100 − 4 × (number of issues where PV > 4)

PM = 100 − (TVO / TP) × 100

Analysis of the Weighted Algorithm

By running several tests we found that our algorithm runs in O(n) time, where n can represent either the number of politicians in the database or the number of questions asked. For the first set of tests, based on the number of politicians, we wrote a program that fabricated a number of imaginary politicians ranging from ten to one million. The program generated these politicians and gave them weighted values, ranging from zero to four, for their stances on each of the twenty questions in the question list. We did not allow unknown stances in the test, to ensure that all tests were run with the same total possible value of one hundred. The weighted values for each politician were generated using the Random utility in Java. After generating a list of politicians in this manner, we used it as our politician database and ran through our algorithm using user inputs for strongly favoring all issues, strongly opposing all issues, neutrality on all issues, and random user stances for each issue.
This process was done for politician lists of ten, one hundred, one thousand, ten thousand, one hundred thousand, two hundred fifty thousand, five hundred thousand, seven hundred fifty thousand, and one million politicians. All tests gave run times of O(n). [Figure: "Increasing N Politicians versus Runtime" — algorithm time (ns) plotted against the number of "politicians" (0 to 1,000,000) for random user stances. The blue line is our algorithm's run time, while the red line is a linear trend line showing that our algorithm follows a linear time line.]

For the second set of tests we increased the number of questions for the user to answer, as well as the number of politicians against which the user was compared. We found that an increase in both questions and politicians still gave a run time of O(n). [Figure: graph of these findings.]

Extending the number of politicians past the ten-thousand mark is purely for testing purposes, since there are not many more than ten thousand politicians in the United States. The number of questions needs to be kept to the minimum that allows for valid results while making it worthwhile for a user to answer all of the questions, without hitting a point where they no longer care how they answer.

Fuzzy Logic with Clustering Implementation

We implemented a second comparison algorithm utilizing fuzzy logic and k-means clustering. This algorithm takes in a database of politicians in the same way our previous algorithm does. However, instead of simply comparing the raw politician values against the raw user values, both sets are used to generate fuzzy sets. In the case of our program, we created a fuzzy support set, meaning the more strongly a user or politician supported an issue, the more their value belonged to the fuzzy support set. The membership function is as follows:

A = {(x, μA(x)) | x ∈ X}

μA(x) = (raw value) / 4

This generates values in the range 0 to 1 correlating to the amount of support a politician or user has for a particular issue. Once these fuzzy sets are generated, the user's set is compared against each of the politician sets using clustering. However, rather than partitioning the politician database into a few cluster sets, each politician is his own cluster. The user set is assigned as the center of this "cluster." A Euclidean distance between the center and the politician is then calculated that correlates to a measure of dissimilarity. The distance formula used in the algorithm, for each politician, is:

D = Σ (x_k − c_k)²

where the sum runs over all issues, x_k is the politician's membership value for issue k, and c_k is the user's (center) value for that issue. A maximum distance value is also created by taking, for each user dimension value, the maximum of its distance from one and its distance from zero. This yields the greatest difference a politician could have from that user. These differences are squared and summed as in the previous formula:

maxD = Σ (max(1 − c_k, c_k))²

Dividing the distance by the maximum distance yields a value of dissimilarity between the politician and the user. Multiplying this value by 100 (to convert it to a percentage) and subtracting it from 100 yields the percentage by which the user and the politician match:

Percentage = 100 − (D / maxD) × 100

Analysis of Fuzzy Logic with Clustering

This algorithm's complexity is O(n³) due to the nested for loops used to access the politician and user fuzzy sets multiple times while calculating the distances. The outermost loop scans each politician in the database. The middle loop scans each politician response. The innermost loop scans the user's responses. The distance calculation is the most costly part of the entire algorithm due to the need to compare each element in the user array to each element in the politician arrays. The remainder of the algorithm's complexity is similar to the weighted algorithm's, because it is simply calculating a percentage from the two distances computed.

Fuzzy Logic with Set Operations

We implemented a third comparison algorithm using set operations on fuzzy sets generated with the same methods as in the second algorithm. The user set and politician sets were compared by taking the union and the intersection of the two sets:

Union: μC(x) = max(μA(x), μB(x)) = μA(x) ∨ μB(x)

Intersection: μD(x) = min(μA(x), μB(x)) = μA(x) ∧ μB(x)

The union yields the "smallest" fuzzy set that contains both A and B. The intersection, on the other hand, yields the "largest" fuzzy set contained in both A and B. These two sets can be compared using the distance formula as applied in the clustering algorithm. This distance can then be compared to the maximum distance, also as in the clustering algorithm. Dividing these two distances again results in a percentage of dissimilarity, which can be subtracted from 100 to yield a value of sameness for the user.

Analysis of Fuzzy Logic with Set Operations

This algorithm has the same complexity as the clustering algorithm because it also computes a distance between two sets, the union set and the intersection set. Thus, it also operates in O(n³).

The Other Algorithms

As stated prior, there are two other applications which seek to accomplish the same goal as our algorithm, and which we used for comparison. The first was VoteSmart; for this we used VoteSmart's VoteEasy application. Since VoteSmart's algorithm is not public knowledge, we had to rely on black-box testing in order to produce results for comparison. Another issue we encountered was that their questions do not map exactly to ours: they present 10 issues to the user. To overcome this potential pitfall, we had to introduce some measure of bias in translating our question set to map onto theirs. [3] They also use a different answering scheme.
We attempted to match our weighted answers as best we could to their scheme: if the user's weighted value was two or above, we answered yes to the corresponding VoteSmart question. A two indicated "somewhat" of an interest in the issue. A three (or, if the answer was no, a one) indicated the middle option for importance. A four (or a zero for a no answer) indicated the highest importance option. After going through this process with eighteen randomly generated samples from our algorithm, we compared the output of the algorithms.

The other application that we compared our algorithm with is VoteMatch. VoteMatch's weighting scheme is publicly available on their website and is as follows. The first step gathers a base score for each user answer: a question where the user's answer exactly matches the politician's answer gets a value of one hundred fifty; an answer one level away from the politician's answer gets a value of seventy-five; and if the stance is unknown it gets a value of zero. The second step multiplies the values from the first step by percentage weights based on the importance levels chosen by the user: an extremely important issue gets a multiplier of one hundred seventy-five percent, a very important issue one hundred fifty percent, a somewhat important issue one hundred twenty-five percent, a slightly important issue one hundred percent, and if a value is not known there is no multiplier. The algorithm then multiplies the base values from step one by the multipliers from step two. [2]

In order to understand how the other algorithms compared to ours, we used the Root Mean Squared Deviation (RMSD) of the percentage matches from each. The RMSD equation is:

RMSD = √( Σ (x1 − x2)² / n )

where x1 is our percentage match, x2 is the other application's percentage match, and n is the total number of politicians we were able to match, 169.
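The RMSD computation can be sketched as follows (a minimal Python sketch with made-up percentage values rather than the actual comparison data):

```python
from math import sqrt

def rmsd(ours, theirs):
    """Root mean squared deviation between two lists of percentage matches."""
    n = len(ours)
    return sqrt(sum((x1 - x2) ** 2 for x1, x2 in zip(ours, theirs)) / n)

# Hypothetical percentage matches for three politicians on two platforms.
our_matches = [85.0, 60.0, 40.0]
their_matches = [75.0, 65.0, 55.0]
print(rmsd(our_matches, their_matches))
```

A deviation of zero would mean the two algorithms produced identical percentage matches for every politician.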
With this equation we found a deviation of 27.32 percent between our weighted algorithm and VoteSmart, while VoteMatch had a deviation of 20.10 percent. [1] Both of the tests above were done with different politicians, because our politician database contains some politicians from the VoteMatch database and different ones from the VoteSmart database. There were also several politicians that were not contained in either the VoteMatch database used for testing or the VoteSmart database used for testing. The deviations found imply that our algorithm must differ from the algorithms used by both VoteSmart and VoteMatch. It is difficult to declare which algorithms are better, since we do not know the exact algorithms behind the other two applications, but we are confident that our algorithm supplies a good match. This is because it takes into account the fact that if a person favors an issue while a politician opposes it, there may still be some similarity in their thinking. This does not appear to be taken into account by the other algorithms. In our weighted algorithm, the only way to be completely different from the politician is to be at completely opposite ends of our spectrum.

Results of Algorithm RMSD:

VoteMatch vs Fuzzy Set Operations: 30.66%
Weighted vs Fuzzy Set Operations: 15.63%
VoteSmart vs Fuzzy Set Operations: 20.43%
Fuzzy Op vs Fuzzy Cluster: 12.15%
VoteSmart vs Fuzzy Cluster: 16.72%
VoteMatch vs Fuzzy Cluster: 25.33%
VoteSmart vs Weighted: 27.32%
VoteMatch vs Weighted: 20.10%

Discarded Algorithms

The knapsack problem appeared to be a promising start as a base for our algorithm. If you imagine the "knapsack" to instead be a comparison of user data and politician data, and the "stones" of various weights to be issues, the two problems map over nicely.
The idea was to assign a 20-bit binary number to each politician and user, with each bit representing a question and 1 or 0 representing yes or no, respectively. This way, the more significant a bit is in the number, the higher the priority of its question (e.g., 1011 would represent yes, no, yes, yes, with the first "yes" holding a weight of 8, the "no" a weight of 4, and the last two yeses weights of 2 and 1). If a user cared a lot about gun control, for example, it would occupy a high bit in the 20-bit number, while another issue of less importance would sit lower. As long as the user's data and the politicians' data were lined up, we could use an inverted XOR to keep only the issues that match up (both a 1 or both a 0) and then sum the corresponding weights of those matched issues, resulting in a number that could be used for comparison analysis. With this implementation we found that it was easy to determine the priority of an issue, but getting user data synced up with our databases was a big obstacle. It was also difficult to capture the magnitude of a question (how polarized a person is on an issue, which was to be factored in alongside how important the question is to the user as a whole). This means that if a user was in favor (but not strongly in favor) of an issue, and it was important to them that the politician also hold a moderate view on that issue, this algorithm would have a hard time expressing that without involving numbers upwards of 2^20.

Next we considered the stable marriage problem, which looked very promising at first. Instead of the "women and men" in the problem, we would have users and politicians; instead of each user/politician having a list of preferred partners to match, we would have lists of weighted views to match. In this case, the user would be the "woman" from the original stable marriage problem, and the politicians would be the proposing "men."
The user would cycle through the list of politicians until a match was found; failing that, the next most suitable politician would be paired with the user. The calculations involved in computing the match would consist of taking the values that matched between user and politician and computing a weighted average. We discovered that a drawback of the stable marriage problem is that, in the original problem, the criteria for a match are thrown out with every match made. We found a slightly different formulation called the intern assignment problem. This is essentially the same as the stable marriage problem, only with interns and businesses instead of men and women, and it is more "polygamous," since each "business" can take multiple "interns." Even with this formulation, however, weighting was still a problem. Difficulties such as keeping the questions synced between users and politicians (not all politicians had answered the same questions) hindered us, since the order of the questions was important for the weighting system to work. Also, the stable marriage problem and its derivatives are geared toward situations in which a large number of variables are matched against an equal number of different variables, with each then having exactly one match; in our program, we needed one to match against many. The last problem stems directly from the previous one: the stable marriage problem is designed to make one match, stop, and then strike that matched item from the list of criteria in order to continue making matches, whereas our implementation would require the same data to be compared multiple times without disregarding any of it. A candidate algorithm that generates a reliable match needs a method to weigh issues based on both the question's relevance to the user and the magnitude of how strongly they agree or disagree with the issue.
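For illustration, the bitwise matching idea from the discarded knapsack-style scheme can be sketched as follows (a hypothetical Python sketch using a 4-bit example rather than the full 20-bit encoding; this approach was not used in our final system). Note that the weighted sum of matched positions, with weight 2^i at bit i, is simply the masked XNOR value read as a binary number.

```python
def weighted_bit_match(user_bits, politician_bits, n_bits=20):
    """Score agreement between two n-bit yes/no encodings: an inverted XOR
    (XNOR) keeps a 1 wherever both answers agree, and interpreting the
    result as a binary number sums the positional weights (2^i) of the
    matched issues."""
    mask = (1 << n_bits) - 1  # keep only the n question bits
    return ~(user_bits ^ politician_bits) & mask

user = 0b1011        # yes, no, yes, yes (highest bit = most important issue)
politician = 0b1001  # disagrees only on the issue weighted 2
print(weighted_bit_match(user, politician, n_bits=4))  # 13 = weights 8 + 4 + 1
```

As the sketch shows, the scheme captures question priority cleanly but has no room for the magnitude of an opinion, which is the shortcoming that led us to discard it.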
Since both the knapsack problem and the stable marriage problem fall short in different departments, we made our own algorithm capable of weighing questions based on importance and magnitude. This algorithm implements some fuzzy logic combined with more generic methods of sorting our weighted values. As stated earlier, we compare our program's results to those from VoteSmart and VoteMatch. Since we used the same questions as VoteMatch, it was easy to transpose the data set. To compare with VoteSmart, we had to re-word a few questions to make them match up while ensuring the gist of the questions did not change. The results were not uniform across algorithms: our weighted results deviated less from VoteMatch (roughly 20%) than from VoteSmart (roughly 27%), while both fuzzy implementations deviated less from VoteSmart than from VoteMatch. The fuzzy logic set operations and clustering implementations showed the best results overall, followed by the original weighted algorithm.

Conclusion

Objectively quantifying views that are fundamentally subjective is challenging. Perhaps the most difficult part is compiling the data into a universally manageable format: turning feelings into 1's and 0's while retaining their significance is no mean feat. However, a modicum of success was achieved with a simple matching algorithm, and the fuzzy logic implementations better captured the complex nature of opinion. Although a deviation of 20% may sound like a lot, these deviations occur because of differing systems for weighing data. It should be noted that when our algorithm, VoteSmart, and VoteMatch returned their percentages, the politicians were in near-identical order. The deviations came from comparing the actual percent match between the user and each politician; although those values may differ, the ranking of compatible politicians for each user stayed very consistent.
Our struggles with the knapsack problem and the stable marriage problem were the basis for utilizing fuzzy logic solutions. Each algorithm we implemented garnered a slightly closer result to our benchmark and broadened our overall view of the study of algorithms.

References

1. Vote Match. http://www.ontheissues.org/Quiz/Quiz2014.asp?quiz=Pres2016. GoVote.com, 2000-2014. Accessed November 26, 2014.
2. Vote Match. http://www.ontheissues.org/quizeng/how_it_works.asp?Dir=. GoVote.com, 2000-2014. Accessed November 26, 2014.
3. Vote Smart. http://votesmart.org/voteeasy/. Project Vote Smart, 2014. Accessed November 26, 2014.
4. Jang, J.-S. R.; Sun, C.-T.; Mizutani, E. Neuro-Fuzzy and Soft Computing. Prentice Hall; 1997.
5. Kruse, R.; Gebhardt, J.; Klawonn, F. Foundations of Fuzzy Systems. John Wiley & Sons Ltd.; 1994.