THE 30th ANNUAL ALBERTA STATISTICIANS' MEETING

Saturday, October 25, 2008
University of Alberta, Central Academic Building, Room 265

The 30th Annual Alberta Statisticians' Meeting is sponsored by the Statistics Centre of the Department of Mathematical and Statistical Sciences and by PIMS.

PROGRAM:

12:30 to 13:30: Reception and registration in CAB 269
13:30 to 14:00: Professor Jingjing Wu, "The application of MHD estimation in a semiparametric model"
14:00 to 14:30: Professor Deniz Sezer, "Quantitative bounds for Markov chain convergence: Wasserstein and total variation distances"
14:30 to 15:00: Professor Gordon Fick, Department of Community Health Sciences, Faculty of Medicine, "Modifying a Modifier, Confounding a Modifier, Confounding a Confounder"
15:00 to 16:00: Coffee break in CAB 269
16:00 to 16:30: Dr. Wanhua Su, "Efficient Kernel Methods for Statistical Detection"
16:30 to 17:00: Professor Pengfei Li, "Large Hypothesis Test for Normal Mixture Models: the EM Approach" (joint work with Professor Jiahua Chen)
17:00 to 17:30: Professor Peter Hooper, "Bayesian inference for belief net responses"
18:30: Dinner at the home of Professor Doug Wiens (details below)

EVENING:

Conference dinner (starting 18:30) at Professor Doug Wiens' house, 9702 89 AVE NW, Edmonton, AB. Directions will be provided.

REGISTRATION:

The registration fee is expected to be about $10 per person attending the dinner. Those attending only the talks need not pay. The fee will be waived for graduate students and, upon request, for those without research grants. Registration fees can be paid in cash only.

Abstracts for the talks:

1. Professor Jingjing Wu, "The application of MHD estimation in a semiparametric model."

2. Professor Deniz Sezer, "Quantitative bounds for Markov chain convergence: Wasserstein and total variation distances."

In this talk I will present recent results on Markov chain convergence, based on joint work with Neal Madras. Let P_n^x and \pi be, respectively, the n-step transition probability kernel and the stationary distribution of a Markov chain. In many applications it is desirable to have a quantitative bound for the convergence of P_n^x to \pi, i.e., a bound of the form d(P_n^x, \pi) < g(x, n), where d is a metric on the space of probability measures and g is a function that can be computed explicitly. In continuous state spaces, one way to obtain a quantitative bound is to formulate the Markov chain as an iterated system of random maps and to apply David Steinsaltz's local contractivity convergence theorem. When its conditions are satisfied, this theorem yields a quantitative bound in terms of the Wasserstein distance. We first develop a systematic framework for checking the conditions of Steinsaltz's theorem, and then show how one can obtain a quantitative bound in terms of the total variation distance from a quantitative bound in terms of the Wasserstein distance.
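A minimal Python sketch of the iterated-random-maps idea in abstract 2. The specific map X_{n+1} = a*sin(X_n) + eps_n, the rate, and the coupling argument are illustrative assumptions, not the construction from the talk: each map is Lipschitz with constant |a| < 1, so two chains driven by the same noise satisfy the explicit Wasserstein-type bound d(P_n^x, P_n^y) <= |a|^n * |x - y|.

    # Two coupled copies of the random map X -> a*sin(X) + eps.
    # Sharing the noise ("coupling") contracts their distance at rate |a|,
    # so the mean coupled distance certifies a Wasserstein-type bound.
    import numpy as np

    rng = np.random.default_rng(0)
    a = 0.5                      # common Lipschitz constant of the maps
    x0, y0 = 10.0, -3.0          # two starting points
    n_steps, n_paths = 15, 10_000

    x = np.full(n_paths, x0)
    y = np.full(n_paths, y0)
    for n in range(1, n_steps + 1):
        eps = rng.standard_normal(n_paths)   # the SAME noise for both chains
        x = a * np.sin(x) + eps
        y = a * np.sin(y) + eps
        if n % 5 == 0:
            # E|X_n - Y_n| upper-bounds the Wasserstein distance between
            # the two n-step distributions; compare with |a|^n * |x0 - y0|.
            print(f"n={n:2d}  E|X-Y| = {np.mean(np.abs(x - y)):.2e}"
                  f"  bound = {abs(a)**n * abs(x0 - y0):.2e}")

The snippet only illustrates the globally contractive case; the point of Steinsaltz's theorem is that a local version of such contractivity, under further conditions, still yields a quantitative bound.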
3. Professor Gordon Fick, Department of Community Health Sciences, Faculty of Medicine, "Modifying a Modifier, Confounding a Modifier, Confounding a Confounder."

Objectives:
1) A paradigm in epidemiology will be applied to modeling.
2) This modeling application will suggest some interesting interpretations.
3) The interpretations will be advanced by appropriate graphs.

Abstract: When an epidemiologist considers the assessment of a disease/exposure relationship, a key part of this scientific paradigm entails the assessment of potential modifiers first and then, in the absence of modification, the assessment of potential confounders. Statistical modeling can enable this assessment in many settings. Take a scenario with interest in a disease (D) / exposure (E) relationship, with age (A) and gender (G) as possible confounders/modifiers. For illustration, let's suppose we have a study that enables the consideration of the outcome p = Pr(D). The concepts detailed in this seminar could apply to any outcome: dichotomous, ordinal, or interval. In a sense, all the concepts relate to the 'right-hand' side of the appropriate regression equation. We can take, to illustrate the relevant concepts, a study for which the (log) odds of disease is the primary outcome. Suppose that we have planned a backward elimination modeling approach but that we wish to incorporate the scientific paradigm from epidemiology as well. An investigation using age groupings is often considered by epidemiologists. This leads to a stratified analysis: the classic tests for modification via the consideration of possible heterogeneity of the stratum-specific odds ratios and then, if appropriate, the assessment of possible confounding. It is suggested in this seminar that a model-based approach ought to proceed, at least in part, in the same manner: consideration of modification and then consideration of confounding. We will see through some illustrations that there are interesting combinations of the notions of confounding and modification that can be considered via the modeling approach. (A toy logistic-model sketch of this modification-then-confounding workflow appears after abstract 4.)

4. Professor Pengfei Li, "Large Hypothesis Test for Normal Mixture Models: the EM Approach" (joint work with Professor Jiahua Chen)

Normal mixture distributions are arguably the most important mixture models, and also the most challenging technically. The likelihood function of the normal mixture model is unbounded based on a set of random samples unless an artificial bound is placed on its component variance parameter. Moreover, the model is not strongly identifiable, so it is hard to differentiate between over-dispersion caused by the presence of a mixture and that caused by a large variance; and it has infinite Fisher information with respect to the mixing proportions. There has been extensive research on finite normal mixture models, but much of it addresses merely the consistency of point estimation or useful practical procedures, and many results require undesirable restrictions on the parameter space. We show that an EM-test for homogeneity is effective at overcoming many challenges in the context of finite normal mixtures. We find that the limiting distribution of the EM-test is chi-square(2) when the variances are unequal and unknown. Simulations show that the limiting distribution approximates the finite-sample distribution satisfactorily. A real example is used to illustrate the application of the EM-test.
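The logistic-model sketch promised in abstract 3, on simulated data. The data-generating model, the likelihood-ratio test for the E*G interaction, and the change-in-estimate check for confounding are illustrative assumptions, not the seminar's analysis (age A is omitted to keep the toy small):

    # Hypothetical "modification first, then confounding" workflow for a
    # logistic model of disease D on exposure E, with gender G as a
    # potential modifier/confounder.  Data are simulated.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 2000
    G = rng.integers(0, 2, n)                        # gender
    E = rng.binomial(1, np.where(G == 1, 0.5, 0.3))  # exposure depends on G
    logit = -1.0 + 0.8 * E + 0.5 * G                 # truth: no E*G interaction
    D = rng.binomial(1, 1 / (1 + np.exp(-logit)))    # disease indicator

    # Step 1: test modification via the E*G interaction (likelihood ratio).
    X_full = sm.add_constant(np.column_stack([E, G, E * G]))
    X_red = sm.add_constant(np.column_stack([E, G]))
    full = sm.Logit(D, X_full).fit(disp=0)
    red = sm.Logit(D, X_red).fit(disp=0)
    lr = 2 * (full.llf - red.llf)                    # ~ chi-square(1) under H0
    print(f"LR statistic for modification by G: {lr:.2f}")

    # Step 2: absent modification, assess confounding by comparing the E
    # coefficient with and without G (e.g., a >10% change rule of thumb).
    crude = sm.Logit(D, sm.add_constant(E)).fit(disp=0)
    print(f"log-OR for E: crude={crude.params[1]:.3f}, "
          f"G-adjusted={red.params[1]:.3f}")

In the simulated truth there is no modification but G does confound, so the interaction test should be insignificant while the crude and G-adjusted log-odds ratios differ.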
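As background for abstract 4, a generic EM fit of a two-component normal mixture with unequal variances. This is not the EM-test itself (the EM-test modifies this likelihood, e.g. with penalties and restricted mixing proportions, to obtain the chi-square(2) limit for the homogeneity statistic, whereas the plain likelihood here is unbounded); the data and starting values are invented:

    # Generic EM for a two-component normal mixture, unequal variances.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)
    x = np.concatenate([rng.normal(0, 1, 300), rng.normal(2.5, 0.5, 100)])

    # Parameters: mixing weight w and component means/sds.
    w, m1, m2, s1, s2 = 0.5, x.min(), x.max(), x.std(), x.std()
    for _ in range(200):
        # E-step: posterior probability each point came from component 2.
        p1 = (1 - w) * norm.pdf(x, m1, s1)
        p2 = w * norm.pdf(x, m2, s2)
        r = p2 / (p1 + p2)
        # M-step: weighted updates of weight, means, and sds.
        w = r.mean()
        m1 = np.sum((1 - r) * x) / np.sum(1 - r)
        m2 = np.sum(r * x) / np.sum(r)
        s1 = np.sqrt(np.sum((1 - r) * (x - m1) ** 2) / np.sum(1 - r))
        s2 = np.sqrt(np.sum(r * (x - m2) ** 2) / np.sum(r))

    print(f"w={w:.2f}, means=({m1:.2f}, {m2:.2f}), sds=({s1:.2f}, {s2:.2f})")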
5. Professor Peter Hooper, "Quantifying the Uncertainty of a Belief Net Response"

A Bayesian belief network models a joint distribution over variables, using a directed acyclic graph to represent variable dependencies and network parameters to represent the conditional distributions of variables given their immediate parents. From a Bayesian perspective, parameters are random variables with distributions reflecting uncertainty. Belief networks are commonly used to compute responses to queries, i.e., to return a number for P(H=h | E=e). Parameter uncertainty induces uncertainty about query responses. I will describe theory and methods, both exact and approximate, for quantifying this type of uncertainty. The discussion will include a new "network doubling" technique used to obtain a highly accurate approximation of the variance of a query response. (A toy Monte Carlo illustration of parameter-induced query uncertainty appears at the end of these abstracts.)

6. Dr. Wanhua Su, "Efficient Kernel Methods for Statistical Detection"

This research is motivated by a drug discovery problem: the AIDS anti-virus database from the National Cancer Institute. The objective of the study is to develop effective statistical methods to model the relationship between the chemical structure of a compound and its activity against the HIV-1 virus. The resulting structure-activity model can then be used to predict the activity of new compounds and thus to help identify active chemical compounds that can serve as drug candidates. Since active compounds are generally rare in a compound library, we treat the drug discovery problem as an application of the so-called statistical detection problem. In a typical statistical detection problem, we have data {y_i, x_i}, where x_i is the predictor vector of the i-th observation and y_i ∈ {0, 1} is its class label. The objective of a statistical detection problem is to identify class-1 observations, which are extremely rare. Besides drug discovery, other applications of the detection problem include direct marketing and fraud detection.

We have proposed a computationally efficient detection method called LAGO, which stands for "locally adjusted GO estimator". The original idea was inspired by an ancient game known today as "GO". The construction of LAGO consists of two steps. In the first step, we estimate the density of class 1 with an adaptive-bandwidth kernel density estimator. The kernel functions are located at, and only at, the class-1 observations. The bandwidth of the kernel function centered at a given class-1 observation is calculated as the average distance between this observation and its K nearest class-0 neighbors. In the second step, we adjust the density estimated in the first step locally according to the density of class 0. It can be shown that the amount of adjustment in the second step is approximately inversely proportional to the bandwidth calculated in the first step. Application to the NCI data demonstrates that LAGO is superior to methods such as K-nearest neighbors and support vector machines.

One drawback of the existing LAGO is that it provides only a point estimate of a test point's probability of being class 1, ignoring the uncertainty of the model. In the second part of this work, we present a Bayesian framework for LAGO, referred to as BLAGO. This Bayesian approach enables the quantification of uncertainty. Non-informative priors are adopted. The posterior distribution is calculated over a grid of (K, α) pairs by integrating out β0 and β1 using the Laplace approximation, where K and α are the two parameters used to construct the LAGO score, and β0 and β1 are the coefficients of the logistic transformation that converts the LAGO score to the probability scale. BLAGO provides proper probabilistic predictions with support on (0, 1) and captures the uncertainty of the predictions as well.
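A toy sketch of the two-step LAGO construction described in abstract 6. The Gaussian kernel, the simulated data, and the constants are illustrative assumptions, not Dr. Su's implementation; the 1/bw factor in the score stands in for the step-2 adjustment, which the abstract notes is approximately inversely proportional to the step-1 bandwidth:

    # Toy LAGO: rare class-1 points in an abundant class-0 background.
    import numpy as np

    rng = np.random.default_rng(3)
    x0 = rng.normal(0.0, 1.0, (500, 2))   # class-0 (background) points
    x1 = rng.normal(1.5, 0.3, (10, 2))    # class-1 (rare, "active") points
    K = 5

    # Step 1: adaptive bandwidth at each class-1 point = average distance
    # to its K nearest class-0 neighbors.
    d = np.linalg.norm(x1[:, None, :] - x0[None, :, :], axis=2)  # (10, 500)
    bw = np.sort(d, axis=1)[:, :K].mean(axis=1)                  # (10,)

    def lago_score(t):
        """Kernel density of class 1 with adaptive bandwidths (step 1),
        locally adjusted by a 1/bandwidth factor (step 2)."""
        dist = np.linalg.norm(x1 - t, axis=1)
        kern = np.exp(-0.5 * (dist / bw) ** 2)  # kernels at class-1 points only
        return float(np.sum(kern / bw))         # 1/bw = step-2 adjustment

    print("score near class 1:  ", lago_score(np.array([1.5, 1.5])))
    print("score in background: ", lago_score(np.array([-2.0, -2.0])))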
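Finally, the toy Monte Carlo illustration promised in abstract 5. An invented two-node network H -> E with Beta-distributed parameters shows how parameter uncertainty induces a distribution, and hence a variance, for the query response P(H=1 | E=1); this is brute force, not the exact or "network doubling" methods of the talk:

    # Parameter uncertainty inducing query-response uncertainty in a
    # two-node belief net H -> E.  Priors and counts are invented.
    import numpy as np

    rng = np.random.default_rng(4)
    n_draws = 100_000

    # Beta posteriors, e.g. Beta(1,1) priors updated with imagined counts.
    p_h = rng.beta(1 + 6, 1 + 4, n_draws)      # P(H=1)
    p_e_h1 = rng.beta(1 + 8, 1 + 2, n_draws)   # P(E=1 | H=1)
    p_e_h0 = rng.beta(1 + 3, 1 + 7, n_draws)   # P(E=1 | H=0)

    # Query response as a function of the random parameters (Bayes' rule):
    # P(H=1 | E=1) = P(E=1|H=1)P(H=1) / P(E=1).
    num = p_e_h1 * p_h
    resp = num / (num + p_e_h0 * (1 - p_h))

    print(f"mean query response: {resp.mean():.4f}")
    print(f"std induced by parameter uncertainty: {resp.std():.4f}")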