Research Interests
João Sedoc

I am a PhD candidate in the computer science department at the University of Pennsylvania.

Description of Work

Presently, my main research interest is the development and application of machine learning and statistical techniques to natural language processing. The representation of words using vector space models is widely used for a variety of natural language processing (NLP) tasks. The two main categories of word embeddings are cluster-based and dense representations. Brown clustering and other hierarchical clustering methods group similar words based on context. Dense representations such as Latent Semantic Analysis / Latent Semantic Indexing (LSA/LSI), Low Rank Multi-View Learning (LR-MVL), and word2vec have shown state-of-the-art results on syntactic and sentiment tasks. Because current embeddings operate at the word level rather than the phrase level, they do not capture both fine-grained and coarse context. Encoding meaning at the word, phrase, and sentence level is therefore an area of intense research focus. Finally, distributional similarity in context does not imply semantic similarity; we find that antonyms are often distributionally similar.

Research Goals

The purpose of our research is to develop and study novel methods for lower-dimensional representation of words, phrases, and sentences using a structured factorial hidden Markov model (HMM) and signed normalized cut spectral clustering. Understanding meaning at the word, phrase, and sentence level is an active area of research; however, our approach is unique in that we extend singular value decomposition based methods using both fine and coarse context as well as paraphrases. By employing these two novel methods, we are able to address these currently unresolved problems.

We propose a novel multi-scale factorial hidden Markov model (MFHMM) and an associated spectral estimation scheme. The MFHMM provides a statistically sound framework for incorporating both fine and coarse context. Using both scales of context is important for tasks such as sentiment analysis and yields better embeddings for words, phrases, and sentences. Our second novel method is the introduction of polarity clustering, which is not captured in popular word embeddings such as GloVe, word2vec, and eigenwords. Our approach is to apply semi-supervised signed clustering (SSC) by adding negative edges to the graph Laplacian using antonym relations. This technique allows us to generalize to phrases and to discriminate polarity at the phrase level, which is currently largely missing in the field.

Description of Work

In the 2014 academic year, I worked on three main projects. The first project was to understand the correlation between word embeddings and neuroanatomical representations at the sentence level. This project was a collaboration with Murray Grossman's lab, in particular Jeff Philips and Brian Avants, as well as HRL. My main contribution was to provide vector embeddings that represented sentences; this was achieved using the Stanford parser. This work is leading to a paper on which I will be the second author. As we looked toward reducing the dimensionality of the representation, I began working on word embedding clusters, which surfaced a major issue: antonyms like "happy" and "sad" can belong to the same cluster, since the two words occur in the same contexts, for instance, "I am happy today" and "I am sad today" (a short illustration follows below).
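To make this issue concrete, the short sketch below queries publicly available pretrained word vectors for the similarity of this antonym pair. It is only an illustration and not part of the work described above: the gensim library and the word2vec-google-news-300 vectors are assumptions chosen for the example, and the exact similarity values depend on the model used.

import gensim.downloader as api

# Load pretrained distributional word vectors (an illustrative choice;
# any dense embedding trained on context windows shows the same effect).
model = api.load("word2vec-google-news-300")

# Antonyms that appear in the same contexts receive high cosine similarity,
# so purely distributional clustering can place them in the same cluster.
print(model.similarity("happy", "sad"))
print(model.similarity("happy", "glad"))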
This problem led to signed spectral clustering, on which I worked with Jean Gallier, who solved the problem of K-cluster signed clustering (an illustrative sketch of the signed Laplacian construction appears at the end of this statement). This work is also moving toward a paper. Using my prior experience in finance, I worked with Jordan Rodu on stock covariance estimation. We extended spectral estimation methods for hidden Markov models to the MFHMM, and we successfully applied and fit this model. We submitted a paper to NIPS, and I presented my preliminary results at the PGMO-COPI conference. I also applied for a Bloomberg research grant with Dean Foster in order to continue my research on the MFHMM and SSC.

Future Research

In the 2015 academic year, I plan to complete my last WPE I requirement. Next semester I also plan to work with my advisor to select my thesis topic. I intend to finish for publication the paper on the multi-scale factorial HMM, a second paper on signed spectral clustering, and the collaborative paper with Jeff Philips. On the spectral clustering work, a large component of the mathematical theory has already been developed and we have preliminary results; however, many details, such as the statistical perturbation and optimization theory, still need to be analyzed formally. Another possible outcome of this research direction is the extension of semantic similarity and entailment to the paraphrase database. We also believe that we may be able to induce antonym relations in other languages by combining our clustering method with standard machine translation.
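For completeness, the sketch below illustrates one standard signed normalized Laplacian construction for the clustering described above, using absolute-value degrees and negative antonym edges. It is a minimal toy example under assumptions of my own choosing (the word list, the edge weights, and the use of the leading eigenvector's sign pattern), not the K-cluster formulation developed with Jean Gallier.

import numpy as np

# Toy signed similarity graph over four words. Positive weights encode
# distributional (contextual) similarity; negative weights encode antonym
# constraints (all values here are invented for illustration).
words = ["happy", "glad", "sad", "unhappy"]
W = np.array([
    [ 0.0,  0.9, -0.7, -0.6],
    [ 0.9,  0.0, -0.8, -0.5],
    [-0.7, -0.8,  0.0,  0.8],
    [-0.6, -0.5,  0.8,  0.0],
])

# Signed degrees use absolute edge weights, which keeps the signed
# normalized Laplacian L = I - D^{-1/2} W D^{-1/2} positive semidefinite.
d = np.abs(W).sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(len(words)) - D_inv_sqrt @ W @ D_inv_sqrt

# Eigenvectors of the smallest eigenvalues embed the words; for this
# balanced toy graph the sign pattern of the leading eigenvector already
# separates the two polarity groups ("happy"/"glad" vs. "sad"/"unhappy").
eigvals, eigvecs = np.linalg.eigh(L)
labels = (eigvecs[:, 0] > 0).astype(int)
print(dict(zip(words, labels)))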