Possible summer research projects:

1. One proposed project would support the department's efforts with the Carl Wieman Science Education Initiative. Possible areas of research include (a) evaluation of an online homework system to be trialed in the department over summer 2012, (b) development of a concept inventory for STAT 241/251, and (c) assessing student retention from certain key undergraduate courses, such as STAT 241/251, STAT 302 and STAT 305. This project will be funded partially by the Carl Wieman Science Education Initiative and partially by the Faculty of Science. Note that INTERNATIONAL STUDENTS are eligible for this position. If you are interested in this position, please contact Dr. Bruce Dunham: b.dunham@stat.ubc.ca

2. The following topics and projects are proposed by Dr. Alexandre Bouchard-Côté. If you are interested in working with Dr. Bouchard on any of these topics, please contact him directly: bouchard@stat.ubc.ca

-Topic: Statistics on large scale datasets
Context: The field of statistics is currently in the embarrassing situation of having more data than it can handle. Most current statistical computing methods do not scale to the large datasets produced by the web and by large scale scientific projects. In particular, methods that require estimating posterior distributions often rely on Markov chain Monte Carlo (MCMC), a method that is notoriously slow for large datasets. Fortunately, many fast alternatives are emerging.
Projects:
-- Variational inference is one of these alternatives. It works by treating the computation of the posterior as a constrained optimization problem and then relaxing that optimization problem. This project would involve looking at a new way of relaxing the optimization problem that could potentially yield much more accurate results.
-- Another project in this topic is to explore which MCMC alternatives can be effectively computed in parallel. This can have a high impact in large scale Bayesian analysis scenarios.
-- Beyond posterior computation, many other statistical problems need to be scaled up. One project would be to apply some recent randomized algorithms to large scale versions of common parameter estimation problems.

-Topic: Phylogenetics
Context: The goal of the field of phylogenetics is to reconstruct evolutionary histories by studying genetic relatedness among populations. This is a hard and important biological question, but most of the current challenges in this field are statistical and computational.
Projects:
-- One project would be to look at new alternatives to tree models and their statistical and computational properties. These models bring new capabilities, such as gene alignment inference, but also new challenges.
-- Until recently, phylogenetics was generally done by using a single DNA sequence to represent each population, but thanks to new genotyping technologies, we can now look at the frequency of certain mutations across many individuals in each population. Phylogenetic models that can exploit this type of data are currently needed.
-- I am also working on related projects in population and family genetics. Talk to me if you want to hear more about these as well!

-Topic: Statistics in linguistics
Context: The collection of all human languages is one of the most exciting existing datasets, but surprisingly it has not yet been intensely studied by statisticians. As more than half of the world's languages are currently threatened with extinction, more work is needed on the task of recording and analyzing this data before it is too late.
Projects:
-- One project would involve developing a machine learning tool to assist field linguists in their work. This is a more applied project that would be done in collaboration with linguists.
-- A related project is to extract linguistic data from the web, again by applying machine learning techniques. For example, can machine learning algorithms be used to organize semi-structured datasets such as Wiktionary or Wikipedia?
-- Another project would focus on identifying statistical regularities in languages. Given a cross-linguistic dataset of languages and/or language changes, the goal would be to perform a model selection study.

3. The following projects are proposed by Dr. Matias Salibián Barrera. If you are interested in any of these projects, please contact Dr. Salibián directly for more information: matias@stat.ubc.ca

A. Robust inference for linear regression models.
This project involves implementing robust tests of linear hypotheses on the regression coefficients. The tests are based on a high-breakdown and efficient robust scale estimator, and the corresponding p-values are estimated using our robust and fast bootstrap method. Parts of this methodology are already implemented in MATLAB. The summer project consists of translating this code to R, and possibly creating a corresponding package.

B. Robust principal components for functional data.
This project deals with a novel principal components analysis method for functional data. The method is robust to the presence of potential outliers in the data. I am interested in exploring other properties of this method, e.g. its ability to deal with sparse functional observations and different types of atypical observations. We will need to run several numerical experiments comparing this new proposal with existing ones in the literature.

C. Re-sampling methods for Support Vector Machines.
This project concerns studying the properties and performance of a fast bootstrap method for SVMs. The motivation for this proposal is to be able to build ensembles or to perform statistical inference (e.g. point-wise confidence bands) in a computationally efficient and feasible way. The summer project involves performing a thorough literature review and running numerical experiments to study the properties of this new proposal.

D. Sparse Kernel K-Means.
This project is concerned with extending the sparse k-means algorithm to the corresponding kernel k-means setup. The summer project involves performing a thorough literature review, implementing this new proposal in R and running numerical experiments to study its properties.
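
For students unfamiliar with the kernel k-means setup mentioned in project D, the following is a minimal Python sketch of plain (non-sparse) kernel k-means. It is only background illustration under stated assumptions, not the sparse method proposed for the project and not the R implementation the project calls for. The Gaussian (RBF) kernel, the random initial assignment, and the function names rbf_kernel and kernel_kmeans are all choices made here for illustration.

```python
import numpy as np


def rbf_kernel(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) (assumed kernel choice)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def kernel_kmeans(K, n_clusters, n_iter=50, seed=0):
    """Plain (non-sparse) kernel k-means on a precomputed Gram matrix K."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(n_clusters, size=n)  # random initial assignment
    for _ in range(n_iter):
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            members = labels == c
            size = members.sum()
            if size == 0:
                continue  # empty cluster: its distance stays at infinity
            # Squared feature-space distance to the cluster mean m_c:
            # ||phi(x_i) - m_c||^2 = K_ii - 2 * mean_j K_ij + mean_{j,l} K_jl
            dist[:, c] = (
                np.diag(K)
                - 2.0 * K[:, members].sum(axis=1) / size
                + K[np.ix_(members, members)].sum() / size ** 2
            )
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
    return labels
```

Given a data matrix X with one observation per row, calling kernel_kmeans(rbf_kernel(X, gamma=0.1), n_clusters=2) returns one cluster label per observation. Because the algorithm only touches the data through the Gram matrix, any positive semi-definite kernel could be substituted for the RBF kernel assumed here.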