Research Statement
Nachiketa Sahoo
Assistant Professor of Information Systems
Boston University School of Management
Boston, MA
Mining large scale corporate data to gather business intelligence has become a part of doing competitive
business today. This involves identifying patterns in data collected from a variety of sources for economic
gains. Developing methods to do so provides competitive advantages to the firm. The increase in
computing power and the expansion of connectivity in recent years have made it possible to gather and
analyze an unprecedented amount of information. Therefore, these are exciting times for business
intelligence research in enterprise settings.
Research Interest
My research interest falls at the intersection of machine learning and social science. I am particularly
motivated to make methodological contributions that help us identify interesting patterns in large
enterprise datasets. To this end, I have primarily focused on two topics: information filtering and online
social network analysis. I have made efforts to develop data mining techniques that are informed by a
multi-disciplinary social science perspective. These approaches have led to effective and practical
methods. Each of my projects focuses on an important aspect of business operations, as described below.
Information Filtering
The availability of large amounts of data has increased the need to carefully select the information relevant to
different purposes. This task is accomplished by information filtering. When filtering is tailored to individuals
with different needs and interests, it becomes personalized information filtering, with applications in product
marketing, knowledge management, expertise finding, etc.
When there are more products than what customers can possibly evaluate, filtering is important for both
the customers and the merchants. By matching the customers to the products they would enjoy,
businesses can provide a better experience to their customers and be more profitable. Over the last
decade, personalized recommender systems have emerged as important tools for this purpose. These
systems assist the merchants in shaping demand and help the end users in finding interesting products and
services. Collaborative filtering is the most popular approach for generating recommendations because it
only needs a customer’s purchase history or product ratings. Traditionally, collaborative filtering
approaches have been designed to work with only simple one-dimensional ratings (e.g. “Rate this product
on a scale of 1 to 10”). In recent years, interest in multi-component ratings that capture multiple dimensions of
a product has been increasing. My work in “Multi-component rating collaborative filtering” provides one
of the first algorithms to make use of such ratings [1]. In this research, I use a unique multi-component
rating dataset from Yahoo! Movies. Working in the probabilistic graphical model framework, I discover
the dependency structure between the rating components by a constrained structure search and validate it
against psychometric theories of the halo effect. I use this dependency structure to develop a mixture
model based collaborative filtering algorithm that performs better than one using only a single rating component,
especially when only a small amount of training data is available. The designed model can also be used to fill in
the component ratings and can be used in a rater support system.
Thus, the project brings together techniques from statistical machine learning and ideas from
psychometric literature to design a collaborative filter that can be used as a marketing tool and as a
decision support system.
In a subsequent research project I address the problem of generating personalized article
recommendations when user preferences are changing. This is a common phenomenon when we observe
the behavior of users for a long period of time. The type of movies or books people like when they are
teenagers is different from the type of movies or books they like when they are in their twenties.
However, the vast majority of the recommender systems literature has focused on static models of
user behavior that do not take this phenomenon into account. Therefore, it provides little guidance
about the best way to interpret ratings generated under changing user preferences when making personalized
recommendations. For example, under the assumption of static user preferences, two items could be considered
similar if they have received similar ratings from a user. However, when user preferences change, two
items that are liked by a person at very different points in time cannot be considered similar. I address this
problem for the case of implicit ratings generated by users when they select blog articles to read. I model
the users’ selection of items as a hidden Markov model. The emission component of the HMM is novel in
that it models a variable number of observations in each time period. The number and the selection of
articles read by each user in each time period are explicitly modeled. Through this design I am able to
observe changes in a user’s preferences over time and use them to generate better personalized
recommendations. A comparison with several static collaborative filters and a recently proposed dynamic
collaborative filter shows that the proposed algorithm significantly outperforms the existing ones [2].
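As a rough illustration of the idea (not the model from [2]), the sketch below runs a forward filter over an HMM whose emission in each period is a variable number of reads. The distributional choices are assumptions made for this example only: the number of articles read in a period is taken to be Poisson and the topics of those articles multinomial. The names log_emission, filter_states and next_topic_mix are hypothetical.

```python
import math


def log_emission(counts, rate, topic_probs):
    """Log-probability of one period's reads under one hidden state.

    counts[k] is the number of articles read on topic k in the period.
    Assumed for illustration: the total number of reads is Poisson(rate)
    and their topics are Multinomial(topic_probs), with all probabilities > 0.
    """
    n = sum(counts)
    # Poisson term for how many articles were read.
    log_p = n * math.log(rate) - rate - math.lgamma(n + 1)
    # Multinomial term for which topics were selected.
    log_p += math.lgamma(n + 1)
    for c, p in zip(counts, topic_probs):
        log_p += c * math.log(p) - math.lgamma(c + 1)
    return log_p


def filter_states(periods, init, trans, rates, topics):
    """Forward algorithm: P(state at t | reads up to t), plus log-likelihood."""
    n_states = len(init)
    belief = list(init)
    history, log_lik = [], 0.0
    for t, counts in enumerate(periods):
        if t > 0:
            # Propagate the belief through the transition matrix.
            belief = [sum(belief[i] * trans[i][j] for i in range(n_states))
                      for j in range(n_states)]
        # Weight each state by how well it explains this period's reads.
        weighted = [belief[j] * math.exp(log_emission(counts, rates[j], topics[j]))
                    for j in range(n_states)]
        norm = sum(weighted)
        belief = [w / norm for w in weighted]
        log_lik += math.log(norm)
        history.append(belief)
    return history, log_lik


def next_topic_mix(belief, trans, topics):
    """Expected topic distribution of the next period's reads."""
    n_states = len(belief)
    nxt = [sum(belief[i] * trans[i][j] for i in range(n_states))
           for j in range(n_states)]
    return [sum(nxt[j] * topics[j][k] for j in range(n_states))
            for k in range(len(topics[0]))]


if __name__ == "__main__":
    # Two topics, two hidden preference states; reads per period vary in number.
    periods = [[3, 0], [2, 1], [0, 4], [1, 5]]
    init = [0.5, 0.5]
    trans = [[0.9, 0.1], [0.1, 0.9]]
    rates = [3.0, 5.0]
    topics = [[0.8, 0.2], [0.2, 0.8]]
    history, log_lik = filter_states(periods, init, trans, rates, topics)
    print(history[-1], next_topic_mix(history[-1], trans, topics))
```

In this toy setting the filtered belief moves from the first hidden state to the second as the reads shift from one topic to the other, and next_topic_mix then weights the next period's recommendations toward the newer interest.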
Social Media Mining
Blogs have been popular on the Internet for a number of years. Today, many organizations use blogs to
interact with their employees. Intra-organizational blogs also function as auxiliary social networks for
casual conversation among the employees. Therefore, many organizations have begun using blogs as part
of their corporate strategy. In a project in collaboration with a Fortune 500 IT services firm I have
explored a number of interesting phenomena that affect employees’ behavior in the enterprise blogosphere. I
have also developed new ways of mining the activities of the employees to detect patterns that can be
useful for the firm.
In the first part of an exploratory study, I have studied the factors that affect the citation and reply ties in
the blog network. Topics of conversation spread through a blog network by citations and are
sustained by replies from the blog readers. In a two-period study, I find that the citation and reply links
between users have little correlation with their demographic similarities. Instead, a number
of online characteristics of the users are correlated with their future citation and reply links. Some of
the interesting observations are: co-commenting links between pairs of individuals are correlated with
their direct comments to each other in the future; an individual who cites another is less often seen to
reply directly to that person, suggesting a substitution of ties; Simmelian ties are strongly positively
correlated with future direct interaction between pairs of bloggers; and membership of a pair of bloggers
in multiple cliques is negatively correlated with future interaction between the pair
[3, 4]. Thus, the study confirms some of the existing theories of social networks for online
networks while presenting a few surprising findings as well.
In a second study, we explored the blog reading behavior of employees at a firm. This is important
because reading each other’s blogs is the most prevalent tie between people in the blogosphere.
Every other observed relationship in the blog is based on it. The number of regular readers a blogger has
is also a measure of the blogger’s power. However, people’s reading habits have been largely invisible
to researchers. By collecting and analyzing data on employees’ blog reading through an industry
partnership, we have studied the factors that explain different types of blog reading behavior. By using a
state space model to examine the employees’ blog reading behavior, we have found that several textual
factors, such as the difficulty of the writing, grammatical errors, and the sentiments expressed in the article, all
play key roles in shaping employees’ blog reading behavior [5].
It is possible to get a pulse of the employee sentiments and a summary of their expertise by analyzing the
blog conversations. Blog posts can be analyzed to identify opinion developments and opinion leaders.
Such analyses can also reveal the nature of expertise within a firm. I approach this task of expertise
discovery with the view that opinions and expertise take interaction to develop, and that the messages
exchanged between bloggers are not only documents but also represent interactions between them.
This information is lost if we treat the blog data as simply a collection of text documents. On the other
hand, representing the relation between bloggers by the number of citations and replies between them, as
done in network analysis, would lose the information that is in the text of the messages. Therefore, we
need a new approach. I start with a generalization of citation and reply ties. Each tie is represented
by a vector of term weights computed from the text of the post. Thus, the network matrix now becomes a
tensor. By adding a time dimension to the tensor one can capture information on the evolution of the blog
network. One can perform an importance computation in such a tensor based on the premise that
important people, important content and important time periods in the development of a topic will be
associated with each other. None of these importance scores is available a priori. However, I have
shown that based on the outlined premise these scores can be deduced by a tensor factorization. The
factorization not only outputs the significant topics of conversation, but also the important people in each
topic and the time trend of each topic. Knowing these can provide important advantages to a firm [6, 7].
This work brings together and extends ideas from network analysis and techniques from text data mining
to develop a method to track significant developments and identify important people inside an
organization from the online conversations of the employees.
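The importance computation described above can be sketched as a rank-one tensor factorization computed by alternating power iterations. This is a simplified illustration under stated assumptions, not the method of [6, 7]: the recipient of each tie is collapsed so that the tensor is person by term by period, entries are non-negative term weights with at least one non-zero entry, and the function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def rank_one_factor(tensor, n_iter=50):
    """Rank-one approximation of tensor[person][term][period].

    Each score vector is updated from the other two, so that important
    people, terms and periods reinforce one another. Returns the three
    unit-length score vectors and the weight of the component.
    """
    n_people = len(tensor)
    n_terms = len(tensor[0])
    n_periods = len(tensor[0][0])
    people = normalize([1.0] * n_people)
    terms = normalize([1.0] * n_terms)
    periods = normalize([1.0] * n_periods)
    for _ in range(n_iter):
        # A person is important if they use important terms in important periods.
        people = normalize([
            sum(tensor[i][j][k] * terms[j] * periods[k]
                for j in range(n_terms) for k in range(n_periods))
            for i in range(n_people)])
        # A term is important if important people use it in important periods.
        terms = normalize([
            sum(tensor[i][j][k] * people[i] * periods[k]
                for i in range(n_people) for k in range(n_periods))
            for j in range(n_terms)])
        # A period is important if important people use important terms in it.
        periods = normalize([
            sum(tensor[i][j][k] * people[i] * terms[j]
                for i in range(n_people) for j in range(n_terms))
            for k in range(n_periods)])
    weight = sum(tensor[i][j][k] * people[i] * terms[j] * periods[k]
                 for i in range(n_people)
                 for j in range(n_terms)
                 for k in range(n_periods))
    return people, terms, periods, weight


if __name__ == "__main__":
    # Toy data: 3 people, 3 terms, 2 periods of made-up term weights.
    tensor = [
        [[1.0, 2.0], [0.0, 0.0], [1.0, 2.0]],
        [[2.0, 4.0], [0.0, 1.0], [2.0, 4.0]],
        [[3.0, 6.0], [0.0, 0.0], [3.0, 5.0]],
    ]
    people, terms, periods, weight = rank_one_factor(tensor)
    print(people, terms, periods, weight)
```

Only the dominant topic is recovered by this sketch. Extracting several topics, each with its own people and time trend, would need additional components, for example a CP (PARAFAC) decomposition with a chosen rank.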
Future Research Directions
Recommendations for Users in a Flux
The topic of personalized recommendation in the context of changing user preferences is a relatively new
one. There are a number of very interesting open research problems.
Research in recommender systems has shown that uniquely modeling each user’s preferences leads to
more relevant recommendations for each user. Thus, the natural next step is to uniquely model how the
preferences change for each user. To estimate such models, one traditionally resorts to stochastic
approximations. However, such approaches have high computational complexity, which can render them
unsuitable for collaborative filtering tasks. I am working on developing deterministic approximation
methods based on the variational Bayes framework to solve this problem. The advantage of methods
based on this framework is that they provide well-defined bounds on the approximation quality and often
lead to scalable methods. Yet another interesting research direction is to model the preference shifts of an
individual as a gradual change instead of an abrupt stochastic jump in a user’s behavior. This approach is
intuitive and could be a better model of the changes in a user’s preferences.
Extending collaborative filtering in the face of changing user preferences to incomplete datasets of
explicit ratings is another important open research problem. There has been some work on this problem in
the context of the Netflix Prize competition. However, there is an opportunity for addressing this problem
using state space models that provide a flexible framework to model each user’s behavior.
Personalized marketing using online social networks
Effectively advertising to people on online social networks is an open research problem. This could be an
effective method of advertising given the common experience of users seeking opinions on products from
people they know and trust. With people spending an increasing amount of time at online social
networking sites there is a large advertising opportunity waiting to be developed.
The online advertising industry has devised ways to personalize advertisements for the web users. Search
engines achieve this by inferring users’ interests from their search queries or from the keywords present
on web pages being read by them. Users’ visits to different websites can also be monitored and profiled
by ad networks. However, this model of advertising treats a user as an isolated agent who is online to
seek information. Online personalized advertising has been successful due to the effective matching of end
users’ information needs with the advertisers’ offerings. Increasingly, people are spending time online to
perform interpersonal activities as opposed to seeking impersonal information. Such activities include
checking the status of a friend, exchanging messages, broadcasting one’s own status, etc. Due to this
fundamental difference in purpose, old advertising models have generally been ineffective on online
social networks. Therefore, a new model of advertising is needed for online social networks.
Over the last year I have developed a project in collaboration with a large Internet search and advertising
firm that also maintains a large online social network among its users. Some of the questions I am trying
to answer in this project are: Do users’ online ties have any influence on their response to the
advertisements shown in conjunction with their online interaction? Can we infer premium advertising
opportunities from a user’s expressed opinions on online social networks and the user’s relationship with
other users in the network? Does the model of social communication popularized by online social
networks like Facebook lead to opportunities for group-based advertisements? There are many other
interesting research questions as well.
With the ever-increasing amount of product and customer data, filtering it for particular purposes is
becoming more and more important. In addition, with the emergence of the Web 2.0 phenomenon, more and
more Internet users are participating in online social activities. This provides us with more information to
understand their needs and preferences and to create better filters for them. I plan to carry out
research in these important areas by taking an interdisciplinary approach whenever it offers useful
insights. I want to make methodological contributions through my work. I also plan to partner with
industry and government agencies to obtain funding and unique datasets and to identify important topics for
further research.
References
[1] Sahoo, N., et al., The Halo Effect in Multi-component Ratings and its Implications for Recommender
Systems: The Case of Yahoo! Movies. Information Systems Research, Forthcoming.
[2] Sahoo, N., P.V. Singh, and T. Mukhopadhyay, A Hidden Markov Model for Collaborative Filtering. 2010.
[3] Sahoo, N., R. Krishnan, and J. Callan, Link formation over intra-organizational blog network, in Fourth
Symposium on Statistical Challenges in Electronic Commerce Research. 2008.
[4] Sahoo, N., R. Krishnan, and J. Callan, Formation of Citation and Reply ties over intra-organizational blog
network, in Conference on Information Systems and Technology. 2008.
[5] Singh, P.V., N. Sahoo, and T. Mukhopadhyay, Seeking Variety: A Dynamic Model of Employee Blog
Reading Behavior. 2010.
[6] Sahoo, N. and R. Krishnan, Expertise Discovery from Blogs by Tensor Factorization, in Conference on
Information Systems and Technology. 2010.
[7] Sahoo, N., R. Krishnan, and C. Faloutsos, Socio-temporal analysis of conversations in intra-organizational
blogs, in Workshop on Information Technologies and Systems. 2008.