CS 8803 AIA (Advanced Internet Application) Project Proposal: Mining Social Networks

Project Members: Abhishek Saxena, Ankit Kharadi, Chirag Rajan

Motivation

With the rapid growth in the number of users of social networks and online communities, researchers have begun to investigate the vast amount of information these networks contain. So far there has been tremendous interest in understanding the characteristics of communities: how they form, what causes a community to grow or shrink, and why some communities are more active than others. For our project we aim to gather user data from social networks (such as user profiles) and then apply statistical learning techniques to gain new insight into aspects such as community dynamics and user behavior.

Our objective is twofold: first, to show the scope of classification/prediction algorithms in social networks, and second, to study which classification/prediction algorithms are well suited to this purpose. We aim to see whether we can provide useful insight into social network behavior. As users of Google would know, its AdSense feature is effective at placing advertisements relevant to a user's interests. Our project tries to use some of the same techniques to classify users into relevant groups. A simple example of how a classification algorithm can be useful here is to examine a user's profile and determine whether the user is a good fit for a community that goes hiking on weekends. If so, we could recommend this community to the user.

Objectives

As frequent users of social networking sites, we feel that the interactivity of these sites could be improved. Today's social networking sites offer a vast amount of information, but users are expected to explore the myriad connections and find relevant links on their own.
Through our project we hope to find some of these links for users and make users aware of them, rather than leaving users to do their own research. As another example, if a user's profile says that he or she is a senior in college, it can be inferred with high probability that the user is looking for a job, so the system could categorize that user as a likely good fit for a community related to job searches.

We would like our project to serve as a basis for future implementations. For example, if users want to mine social network data in the future, we hope our project will provide useful insight into which algorithms to use.

Georgia Institute of Technology

Proposed Work

The implementation roadmap for the project consists of gathering social network data, possibly from varied sources, with the consideration that the data ought to be rich enough to support interpretations beyond what is apparent from a typical social network user's perspective. We intend to apply state-of-the-art classification/prediction algorithms to the gathered data. The techniques will include Support Vector Machines, Bayesian methods, and others. Comparisons among them are expected to yield insights from several viewpoints. First, successful application of the techniques would point to the wealth of information that can be gained and may offer new perspectives on the field of social network data mining. Second, if one algorithm performs consistently better than its peers on this data, that provides clues about the nature of the data and hence suggests directions for further exploration. Additionally, from a machine learning research perspective, testing the techniques in novel domains can be valuable, as it may expose limitations in existing approaches.
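The kind of comparison described above can be sketched in a few lines. The following is only a minimal, dependency-free illustration under stated assumptions, not the project's implementation: it assumes each profile has already been reduced to a set of keywords, it uses made-up training and test data, and it compares a naive Bayes classifier (one Bayesian method) against a k-nearest-neighbors classifier, which is chosen here only because it is easy to write without a library. A Support Vector Machine would in practice come from an existing library. All names in the sketch (TRAIN, TEST, predict_knn, and so on) are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: (set of profile keywords, community label).
TRAIN = [
    ({"hiking", "camping", "outdoors"}, "hiking"),
    ({"trails", "hiking", "photography"}, "hiking"),
    ({"outdoors", "climbing", "trails"}, "hiking"),
    ({"senior", "resume", "internship"}, "jobs"),
    ({"senior", "interview", "career"}, "jobs"),
    ({"career", "resume", "networking"}, "jobs"),
]
TEST = [
    ({"hiking", "trails"}, "hiking"),
    ({"senior", "career"}, "jobs"),
    ({"camping", "climbing"}, "hiking"),
    ({"resume", "interview"}, "jobs"),
]


def train_naive_bayes(examples):
    """Count labels and per-label keyword occurrences."""
    label_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in examples:
        word_counts[label].update(words)
        vocab |= words
    return label_counts, word_counts, vocab


def predict_naive_bayes(model, words):
    """Pick the label with the highest log-posterior (Laplace smoothing)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def predict_knn(examples, words, k=3):
    """Vote among the k training profiles most similar by Jaccard index."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    ranked = sorted(examples, key=lambda ex: jaccard(ex[0], words), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def accuracy(predict, test_set):
    """Fraction of held-out profiles whose label is predicted correctly."""
    hits = sum(1 for words, label in test_set if predict(words) == label)
    return hits / len(test_set)


nb_model = train_naive_bayes(TRAIN)
nb_accuracy = accuracy(lambda w: predict_naive_bayes(nb_model, w), TEST)
knn_accuracy = accuracy(lambda w: predict_knn(TRAIN, w), TEST)
print(f"naive Bayes accuracy: {nb_accuracy:.2f}")
print(f"k-NN accuracy:        {knn_accuracy:.2f}")
```

On data this small and this cleanly separated, both classifiers label all four held-out profiles correctly, which says nothing about their behavior on real profiles. The point of the sketch is the shape of the experiment: train each algorithm on the same examples, score each on the same held-out set, and compare.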
As for the details of the implementation roadmap, we intend to use relevant APIs, such as the Facebook API, which allow us to obtain the underlying social network graph structure and/or access individual users' information. Another aim of the project, beyond those stated above, is to identify important classes of data that allow subtle inferences to be made.

The metrics that the team plans to use to evaluate algorithm performance include:

1. Number of training examples required
2. Classification/generalization accuracy
3. Overfitting
4. Performance with large feature spaces
5. Versatility across different classes of data
6. Algorithm-specific issues such as minimization of errors, ability to model feature dependence, and predictive power, along with an assessment of how far these capabilities prove useful in the considered context

Related Work

A glance at relevant work applying machine learning techniques to social networks reveals two branches of ongoing research. Some approaches aim at detecting the existence of connected networks of users that qualify as social networks and cliques, while others analyze pre-existing social networks to model their characteristics or make predictions. Our work relates more to the latter. Active research projects in this area include the MIT Reality Mining project, which aims to use stochastic techniques to learn models of user behavior in social networks; these models are in turn expected to allow predicting what a user or a specific group of users is likely to do next. Several other projects, including work at UMass and Cornell, identify potential challenges in social network data mining and suggest customized approaches to
handle relational data.

Plan of Action

From an architecture point of view, we plan to divide the system into four modules:

1. FQL engine: FQL is the query language we will use to extract data from users' profiles.
2. Front End/GUI: This will be the interface that informs users of the classification results.
3. Database: All the information extracted using FQL will be stored in a relational database.
4. Algorithms: Open source libraries will be used to implement the various classification algorithms.

To implement this project, we will use the following languages/tools:

Facebook API: To extract the relevant data we plan to use the Facebook API. It provides a proprietary query language, FQL, which will allow us to query a user's profile. Once we get this profile information, we will store the data in tables and then apply classification algorithms to mine the data for patterns. Tentatively, we have chosen to focus only on the Facebook social network because its readily available APIs make the data easy to extract.

We have decided to use classification algorithms such as k-nearest neighbors, Support Vector Machines (SVMs), and Bayesian methods, as stated earlier. Several open source libraries, such as LibSVM, SVMLight, and JaHMM, are available for implementing classifiers based on statistical models.

The database we will use to store the data will be MySQL.

The timeline we have set for the project so far is as follows:

Project proposal                                  Feb 15
Feedback from proposal                            Feb 22
Studying of classification algorithms             Feb 28
Design of modules                                 March 7
  (Query Module, Front End/GUI, Database, Algorithms)
Implementation                                    March 31
Testing of application                            April 7
Project report and presentation                   April 14

Evaluation and Testing Methods

For evaluation we plan to favor qualitative measures; that is,
we would like to emphasize the usefulness of the project work rather than the quantitative accuracy of the resulting implementations. Our approach to evaluation will therefore center on meaningful interpretation of our findings. In addition to the algorithm-specific analysis discussed in the section on proposed work, we intend to compare our results with those obtained from ongoing and previous research and to provide explanations where our results diverge from established ones. A particularly useful characteristic of this kind of work may be the various ways it can complement related endeavors.

Another approach is to elicit feedback from users of a social network. Consider an application along the lines of our focus that suggests interesting communities to a user based on her profile data and activities: user feedback is one of the most useful indicators of the system's performance and utility.

Bibliography

1. Alessandro Acquisti and Ralph Gross (2006), Imagined Communities: Awareness, Information Sharing, and Privacy on the Facebook, Carnegie Mellon University
2. Christopher J. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition
3. David Jensen and Jennifer Neville, Data Mining in Social Networks
4. N. Eagle and A. Pentland (2006), Reality Mining: Sensing Complex Social Systems, Personal and Ubiquitous Computing
5. N. Eagle, A. Pentland, and D. Lazer (2007), Inferring Social Network Structure using Mobile Phone Data
6. P. Koutsourelakis (2007), Unsupervised Group Discovery and Link Prediction in Relational Datasets: A Nonparametric Bayesian Approach, Technical Report UCRL-TR-230743, Lawrence Livermore National Laboratory
7. Lafferty et al., Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
8. Hady W. Lauw et al.,
Mining Social Network from Spatio-Temporal Events
9. Brian Skyrms and Robin Pemantle, A Dynamic Model of Social Network Formation, School of Social Sciences, University of California, Irvine
10. Tanzeem Choudhury (2003), Sensing and Modeling Human Networks, PhD Thesis, MIT