Link Prediction Analysis through Machine Learning, Supervised and Unsupervised
Shashank Golla, Aaron Kemmer

1. What is Link Prediction?

Problem Definition
• Let the graph G(V, E) represent a network of vertices (V) and edges (E). Consider G₁ and G₂ to be two snapshots of G at different times (which may contain different edges, or “relationships”), such that Time(G₁) < Time(G₂). Using only information from G₁, generate a set of edges E′ such that every e ∈ E′ is absent from G₁ but is expected to exist in G₂.

Link Prediction Example
◎ Consider Facebook friend recommendations: the “people you may know” list. It is essentially a collection of edges between you and other users that the system believes belong in the graph representing the social network.

Example: Twitter Sample
A directed, observed network for a single user, or “Ego” node, with a small follow network.

Methodology
We’ll show you the right people to follow!

Basic Algorithm
The basic algorithm is simple: remove edges from your observed graph, rank candidate new edges using heuristics, pick the top k edges, and evaluate how well those picks recover the removed edges.

Comparison of Heuristics

Testing Link Prediction
Here, E″ represents the edges that exist together with the edges you have deleted, and E_new represents the edges you have predicted in your network. The success region is the overlap between the two.

Our Process
• Get Twitter data sets that have the feature sets we feel are important.
• Run different heuristics/algorithms on those data sets.
• Test whether the results from the algorithms are accurate, and quantify that accuracy.

Feature Sets
• Proximity features - text analysis (data permitting).
• Aggregate features - number of similar “Following”, with more weight given to users with more Followers.
• Topological features - clustering index.

Data Sets (Twitter)
• We are using Twitter data sets from Stanford SNAP and ASU for research purposes. We are only using an observed network of the data available.
• We used the Twitter API to gather more data ourselves. We intend to use data from our own accounts for testing.

Results/Plan
• So far, we have mined all the data sets we need using the Twitter API.
• We are in the process of writing the code necessary for the algorithms.
• We already have a process in place to test and benchmark each of the algorithms.

Conclusion
• We feel that link prediction is a very important problem that needs to be approached in a more domain-oriented, less generalized way.
• For Twitter specifically, we had to do a lot of tweaking of the feature sets.
• In the future, we hope to be able to do this across different social networks.

Citations
1. “Lecture 24 – Santa Fe Institute.” 2013. 30 Mar. 2016 <http://tuvalu.santafe.edu/~aaaronc/courses/5454/csci5454_sprint2013_CSL10.pdf>
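
Sketch: Basic Algorithm and Testing
The steps from the Basic Algorithm and Testing Link Prediction slides (remove edges, rank candidate edges, pick the top k, evaluate) might look roughly like the Python sketch below. It is an illustration under stated assumptions, not the implementation used in this project: the graph is treated as undirected even though the Twitter follow network is directed, the ranking heuristic is a plain shared-neighbor count (loosely echoing the "number of similar Following" aggregate feature), and every function name is hypothetical.

```python
import random
from itertools import combinations


def hold_out_edges(edges, fraction, seed=0):
    """Split an edge set into (kept edges, removed edges)."""
    rng = random.Random(seed)
    ordered = sorted(edges)  # sort first so the shuffle is reproducible
    rng.shuffle(ordered)
    n_removed = int(len(ordered) * fraction)
    return set(ordered[n_removed:]), set(ordered[:n_removed])


def build_adjacency(edges):
    """Map each node to its neighbor set (assumption: undirected edges)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def shared_neighbor_score(adj, u, v):
    """Stand-in heuristic: number of neighbors u and v have in common."""
    return len(adj.get(u, set()) & adj.get(v, set()))


def predict_top_k(nodes, kept_edges, k):
    """Rank every node pair not in kept_edges and return the k best.

    Assumes each edge is stored as a sorted pair (u, v) with u < v.
    """
    adj = build_adjacency(kept_edges)
    scored = [
        (shared_neighbor_score(adj, u, v), (u, v))
        for u, v in combinations(sorted(nodes), 2)
        if (u, v) not in kept_edges
    ]
    # Highest score first; ties are broken by pair order for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [pair for _, pair in scored[:k]]


def precision(predicted, removed_edges):
    """Fraction of predicted edges that fall in the success region."""
    if not predicted:
        return 0.0
    return len(set(predicted) & removed_edges) / len(predicted)


if __name__ == "__main__":
    # Toy graph, invented purely for illustration.
    all_edges = {("a", "b"), ("a", "c"), ("a", "d"), ("b", "d"), ("c", "d")}
    nodes = {n for edge in all_edges for n in edge}
    kept, removed = hold_out_edges(all_edges, fraction=0.2)
    guesses = predict_top_k(nodes, kept, k=2)
    print("removed:", removed)
    print("predicted:", guesses)
    print("precision:", precision(guesses, removed))
```

Swapping `shared_neighbor_score` for another scoring function is the natural way to compare heuristics with the same hold-out split and the same precision measure.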