Research Statement

Nachiketa Sahoo
Assistant Professor of Information Systems
Boston University School of Management
Boston, MA

Introduction

Mining large-scale corporate data to gather business intelligence has become a part of doing business competitively. It involves identifying patterns in data collected from a variety of sources for economic gain, and firms that develop methods to do so gain a competitive advantage. The increase in computing power and the expansion of connectivity in recent years have made it possible to gather and analyze unprecedented amounts of information. These are therefore exciting times for business intelligence research in enterprise settings.

Research Interest

My research interests fall at the intersection of machine learning and social science. I am particularly motivated to make methodological contributions that help identify interesting patterns in large enterprise datasets. To this end, I have primarily focused on two topics: information filtering and online social network analysis. I have worked to develop data mining techniques that are informed by a multi-disciplinary social science perspective. These approaches have led to effective and practical methods. Each of my projects focuses on an important aspect of business operations, as described below.

Information Filtering

The availability of large amounts of data has increased the need to carefully select the information relevant to a given purpose. This task is accomplished by information filtering. When tailored to individuals with different needs and interests, it becomes personalized information filtering, with applications in product marketing, knowledge management, expertise finding, etc. When there are more products than customers can possibly evaluate, filtering is important for both the customers and the merchants.
By matching customers to the products they would enjoy, businesses can provide a better experience to their customers and be more profitable. Over the last decade, personalized recommender systems have emerged as important tools for this purpose. These systems assist merchants in shaping demand and help end users find interesting products and services.

Collaborative filtering is the most popular approach for generating recommendations because it needs only a customer’s purchase history or product ratings. Traditionally, collaborative filtering approaches have been designed to work with only a simple one-dimensional rating (e.g. “Rate this product on a scale of 1 to 10”). In recent years, interest in multi-component ratings, which rate a product along multiple dimensions, has been increasing. My work in “Multi-component rating collaborative filtering” provides one of the first algorithms to make use of such ratings [1]. In this research, I use a unique multi-component rating dataset from Yahoo! Movies. Working in the probabilistic graphical model framework, I discover the dependency structure among the rating components through a constrained structure search and validate it against psychometric theories of the halo effect. I use this dependency structure to develop a mixture-model-based collaborative filtering algorithm that performs better than one using only a single rating component, especially when only a small amount of training data is available. The model can also be used to fill in the component ratings and thus serve in a rater support system. The project therefore brings together techniques from statistical machine learning and ideas from the psychometric literature to design a collaborative filter that can be used as a marketing tool and as a decision support system.

In a subsequent research project I address the problem of generating personalized article recommendations when user preferences are changing.
This is a common phenomenon when we observe the behavior of users over a long period of time. The movies or books people like as teenagers differ from those they like in their twenties. However, the vast majority of the recommender systems literature has focused on static models of user behavior that do not take this phenomenon into account. It therefore provides little guidance on how best to interpret ratings generated under changing user preferences when making personalized recommendations. For example, under the assumption of static user preferences, two items can be considered similar if they have received similar ratings from a user. When user preferences change, however, two items liked by a person at very different points in time cannot be considered similar.

I address this problem for the case of implicit ratings generated by users when they select blog articles to read. I model the users’ selection of items with a hidden Markov model (HMM). The emission component of the HMM is novel in that it models a variable number of observations in each time period: both the number and the selection of articles read by each user in each time period are explicitly modeled. Through this design I am able to observe changes in a user’s preferences over time and use them to generate better personalized recommendations. A comparison with several static collaborative filters and a recently proposed dynamic collaborative filter shows that the proposed algorithm significantly outperforms the existing ones [2].

Social Media Mining

Blogs have been popular on the Internet for a number of years. Today, many organizations use blogs to interact with their employees. Intra-organizational blogs also function as auxiliary social networks for casual conversation among employees. Many organizations have therefore begun using blogs as part of their corporate strategy.
In a project in collaboration with a Fortune 500 IT services firm, I have explored a number of phenomena that affect employees’ behavior in the enterprise blogosphere. I have also developed new ways of mining employees’ activities to detect patterns that can be useful to the firm.

In the first part of an exploratory study, I examined the factors that affect the citation and reply ties in the blog network. Topics of conversation spread through a blog network by citations and are sustained by replies from blog readers. In a two-period study I find that the citation and reply links between users have little correlation with their demographic similarities. Instead, a number of online characteristics of the users are correlated with their future citation and reply links. Some of the interesting observations are: co-commenting links between pairs of individuals are correlated with their direct comments to each other in the future; an individual who cites another is less often seen to reply directly to that person, suggesting a substitution of ties; Simmelian ties show a strong positive correlation with future direct interaction between pairs of bloggers; and there is a negative correlation between the membership of a pair of bloggers in multiple cliques and a reduced future interaction between the pair [3, 4]. Thus, the study confirms some existing theories of social networks for online networks while also presenting a few surprising findings.

In a second study we explored the blog reading behavior of employees at a firm. This is important because reading each other’s blogs is the most prevalent tie between people in the blogosphere; every other observed relationship in the blog network is based on it. The number of regular readers a blogger has is also a measure of the blogger’s power. However, people’s reading habits have been largely invisible to researchers.
By collecting and analyzing data on employees’ blog reading through an industry partnership, we have studied the factors that explain different types of blog reading behavior. Using a state space model to examine this behavior, we found that several textual factors, such as the difficulty of the writing, grammatical errors, and the sentiments expressed in the article, all play key roles in shaping employees’ blog reading behavior [5].

It is possible to take the pulse of employee sentiment and obtain a summary of employee expertise by analyzing blog conversations. Blog posts can be analyzed to identify opinion developments and opinion leaders. Such analyses can also reveal the nature of expertise within a firm. I approach this task of expertise discovery from the view that opinions and expertise take interaction to develop, and that the messages exchanged between bloggers are not only documents but also represent interaction between them. This information is lost if we treat the blog data simply as a collection of text documents. On the other hand, representing the relation between bloggers by the number of citations and replies between them, as is done in network analysis, loses the information in the text of the messages. A new approach is therefore needed.

I start with a generalization of the ties formed by citations and replies. Each tie is represented by a vector of term weights for the text in the post, so the network matrix becomes a tensor. By adding a time dimension to the tensor, one can capture information on the evolution of the blog network. One can compute importance in such a tensor based on the premise that important people, important content and important time periods in the development of a topic will be associated with each other. None of these importance scores is available a priori.
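As an illustration only, the mutual-reinforcement premise can be sketched with a rank-1 higher-order power iteration on a small three-way tensor. This is not the factorization used in [6, 7]; it assumes a simplified nonnegative, nonzero person x term x time tensor, and the function name rank1_importance and the toy data are hypothetical.

```python
import numpy as np


def rank1_importance(tensor, n_iter=100, tol=1e-9):
    """Rank-1 higher-order power iteration on a 3-way tensor.

    Assumes a nonnegative, nonzero tensor indexed as
    (person, term, time period). Each score vector is recomputed
    from the other two, so a person scores highly when associated
    with high-scoring terms in high-scoring periods, and vice versa.
    """
    n_people, n_terms, n_periods = tensor.shape
    # Start from uniform unit-length score vectors.
    people = np.ones(n_people) / np.sqrt(n_people)
    terms = np.ones(n_terms) / np.sqrt(n_terms)
    periods = np.ones(n_periods) / np.sqrt(n_periods)
    for _ in range(n_iter):
        previous = people
        people = np.einsum("ijk,j,k->i", tensor, terms, periods)
        people /= np.linalg.norm(people)
        terms = np.einsum("ijk,i,k->j", tensor, people, periods)
        terms /= np.linalg.norm(terms)
        periods = np.einsum("ijk,i,j->k", tensor, people, terms)
        periods /= np.linalg.norm(periods)
        # Stop once the people scores no longer change.
        if np.linalg.norm(people - previous) < tol:
            break
    return people, terms, periods


# Toy data (made up): 3 people, 4 terms, 2 time periods.
toy = np.zeros((3, 4, 2))
toy[0, 1, 0] = 5.0  # person 0 uses term 1 heavily in period 0
toy[0, 1, 1] = 1.0
toy[1, 1, 0] = 2.0
toy[2, 3, 1] = 0.5
people, terms, periods = rank1_importance(toy)
```

On this toy tensor the largest scores fall on person 0, term 1 and period 0, the combination that carries most of the weight. A full factorization would extract several such components, one per topic, rather than only the dominant one.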
I have shown that, based on the outlined premise, these scores can be deduced by a tensor factorization. The factorization outputs not only the significant topics of conversation but also the important people in each topic and the time trend of each topic. Knowing these can provide important advantages to a firm [6, 7]. This work brings together and extends ideas from network analysis and techniques from text data mining to develop a method for tracking significant developments and identifying important people inside an organization from the online conversations of its employees.

Future Research Directions

Recommendations for Users in a Flux

Personalized recommendation in the context of changing user preferences is a relatively new topic with a number of very interesting open research problems. Research in recommender systems has shown that uniquely modeling each user’s preferences leads to more relevant recommendations for each user. The natural next step is thus to uniquely model how the preferences change for each user. Traditionally, one resorts to stochastic approximations to estimate such models. However, such approaches have high computational complexity, which can render them unsuitable for collaborative filtering tasks. I am working on developing deterministic approximation methods based on the variational Bayes framework to solve this problem. The advantage of methods based on this framework is that they provide well-defined bounds on the approximation quality and often lead to scalable methods.

Yet another interesting research direction is to model an individual’s preference shifts as a gradual change instead of an abrupt stochastic jump in the user’s behavior. This approach is intuitive and could be a better model of changes in user preferences.

Extending collaborative filtering in the face of changing user preferences to incomplete datasets of explicit ratings is another important open research problem.
There has been some work on this problem in the context of the Netflix Prize competition. However, there is an opportunity to address this problem using state space models, which provide a flexible framework for modeling each user’s behavior.

Personalized Marketing Using Online Social Networks

Effectively advertising to people on online social networks is an open research problem. This could be an effective method of advertising, given the common experience of users seeking opinions on products from people they know and trust. With people spending an increasing amount of time on online social networking sites, there is a large advertising opportunity waiting to be developed.

The online advertising industry has devised ways to personalize advertisements for web users. Search engines achieve this by inferring users’ interests from their search queries or from the keywords present on the web pages they read. Users’ visits to different websites can also be monitored and profiled by ad networks. However, this model of advertising treats a user as an isolated agent who is online to seek information. Online personalized advertising has been successful because it effectively matches end users’ information needs with advertisers’ offerings. Increasingly, people are spending time online on inter-personal activities as opposed to seeking impersonal information. Such activities include checking the status of a friend, exchanging messages, broadcasting one’s own status, etc. Because of this fundamental difference in purpose, old advertising models have generally been ineffective on online social networks. A new model of advertising is therefore needed for online social networks.

Over the last year I have developed a project in collaboration with a large Internet search and advertising firm that also maintains a large online social network among its users.
Some of the questions I am trying to answer in this project are: Do users’ online ties have any influence on their response to the advertisements shown in conjunction with their online interaction? Can we infer premium advertising opportunities from a user’s expressed opinions on online social networks and the user’s relationships with other users in the network? Does the model of social communication popularized by online social networks like Facebook lead to opportunities for group-based advertisements? These are among many other interesting research questions.

Conclusion

With the ever-increasing amount of product and customer data, filtering it for particular purposes is becoming more and more important. In addition, with the emergence of the Web 2.0 phenomenon, more and more Internet users are participating in online social activities. This provides us with more information for understanding their needs and preferences, and thus for creating better filters for them. I plan to carry out research in these important areas, taking an interdisciplinary approach whenever it offers useful insights. I want to make methodological contributions through my work. I also plan to partner with industry and government agencies to obtain funding and unique datasets and to identify important topics for further research.

References

1. Sahoo, N., et al., The Halo Effect in Multi-component Ratings and its Implications for Recommender Systems: The Case of Yahoo! Movies. Information Systems Research, Forthcoming.
2. Sahoo, N., P.V. Singh, and T. Mukhopadhyay, A Hidden Markov Model for Collaborative Filtering. 2010.
3. Sahoo, N., R. Krishnan, and J. Callan, Link formation over intra-organizational blog network, in Fourth Symposium on Statistical Challenges in Electronic Commerce Research. 2008.
4. Sahoo, N., R. Krishnan, and J. Callan, Formation of Citation and Reply ties over intra-organizational blog network, in Conference on Information Systems and Technology. 2008.
5. Singh, P.V., N. Sahoo, and T. Mukhopadhyay, Seeking Variety: A Dynamic Model of Employee Blog Reading Behavior. 2010.
6. Sahoo, N. and R. Krishnan, Expertise Discovery from Blogs by Tensor Factorization, in Conference on Information Systems and Technology. 2010.
7. Sahoo, N., R. Krishnan, and C. Faloutsos, Socio-temporal analysis of conversations in intra-organizational blogs, in Workshop on Information Technologies and Systems. 2008.