Dawit Tadesse
Research Statement
257 S. Gay St, Apt A124, Auburn, AL 36830
Phone: 334 559 4954
Email: dgt0001@auburn.edu
Overview
In August 2010, I was admitted to the statistics PhD program in the Department of Mathematics
and Statistics at Auburn University. Prior to my PhD, I obtained master's degrees in
mathematics at both Auburn University, Auburn, AL, and the African University of Science and Technology (AUST), Abuja, Nigeria. My strong mathematics
background, combined with my statistics training at Auburn University, has
spurred my interest in both methodological and applied statistical and mathematical
research, which I have demonstrated through my PhD research, my collaborations
with scientists from other disciplines, and my participation in multiple applied
research and continuing-education workshops at universities across the nation.
My research philosophy is to develop novel statistical and mathematical techniques
and to improve existing ones in an effort to solve the various problems that
humanity faces. Among other areas, I have developed an interest in multivariate statistical methods for high-dimensional classification and their applications in text and data
mining. In my research I provide a survey of two-class linear discriminant methods,
and their associated feature selection algorithms, that have recently been developed
to address extremely high-dimensional data when the signal is sparse and weak. I
also conduct a rigorous comparative performance analysis for existing methods, such
as the Features Annealed Independence Rule (FAIR) and the Regularized Optimal Affine
Discriminant (ROAD), both in theory and simulation, and adapt and develop these
methods for text (latent semantic analysis) and data mining (predictive analytics)
applications. In addition, I develop general theory and new methods that provide
a unified approach towards this important application. I have shown that for my
methods, under certain regularity conditions, the probability of selecting all signal
features (without selecting unessential variables with no signal) converges to one and
the model-building algorithm leads to an optimal linear discriminant. I also currently
serve as the data analyst for the heart plant research group in the Industrial Engineering
department at Auburn University. In my mathematics master's thesis at Auburn University, I gave a new proof that generalized numerical ranges are convex. I
believe that my quantitative and analytical skills and my programming and computational competencies, acquired through coursework and real-world research experience, together with
the oral and written communication skills developed over seven years of teaching
at Auburn University and Haramaya University, Ethiopia, have prepared me
to succeed in an interdisciplinary research and teaching environment.
Current Research
• Generalized Feature Selection for High-dimensional Classification in
Sparse and Low-Signal Vectors: High-dimensional data analysis has become
increasingly frequent and important in diverse fields of science, engineering, and
the humanities, ranging from genomics and the health sciences to economics, finance, and
machine learning. Noise accumulation in high-dimensional prediction has long been
recognized in statistics and computer science. An explicit characterization of this phenomenon is
well known for high-dimensional regression problems. The quantification of the impact of dimensionality on classification was not well understood until Fan and Fan
(2008), who give a simple expression for how dimensionality affects misclassification
rates. Hall, Pittelkow and Ghosh (2008) study a similar problem for distance-based classifiers and implicitly observe the adverse impact of dimensionality. As shown in
Fan and Fan (2008), even for the independence classification rule, classification using
all features can be as bad as a random guess due to noise accumulation in estimating
the population centroids in high dimensional feature space. Therefore, variable selection is fundamentally important to high dimensional statistical modeling, including
regression and classification. A popular method for independence feature selection is
the two-sample t-test (Tibshirani et al., 2002; Fan and Fan, 2008), which is a specific
case of marginal screening in Fan and Lv (2008). Other componentwise tests such
as the rank-sum test are also popular. Fan and Fan (2008) give a condition under which
the two-sample t-test picks up all $s$ important features with probability one. Ash
Abebe and Shuxin Yin (PhD dissertation, 2010) give a condition under which the
Wilcoxon-Mann-Whitney test can pick up all the important features with probability
one. I give a generalized condition under which any two-sample componentwise test
$T_j$, defined below, can pick up all the important features with probability one; a computational
sketch of this type of screening follows this item. My statistic $T_j$ for feature $j$ is defined as
$$T_j = \frac{\sum_{k=1}^{n_1} w_{1kj} - \sum_{k=1}^{n_0} w_{0kj}}{\mathrm{SE}\left(\sum_{k=1}^{n_1} w_{1kj} - \sum_{k=1}^{n_0} w_{0kj}\right)}$$
where $w_{ikj}$, $i = 0, 1$, is the statistic for feature $j$ in class $i$. We assume that the
standard error of any statistic $X$ satisfies $\mathrm{SE}(X) \xrightarrow{P} \mathrm{SD}(X)$, and that
for some interval on $x > 0$ we have $P(|T_j - \eta_j| \geq x) = 2(1 - \Phi(x))(1 + f(x, n))$,
where $f(x, n) = f_1(x, n) + f_2(-x, n) = o(x)$ and where we define
$$\eta_j := \frac{E\left(\sum_{k=1}^{n_1} w_{1kj}\right) - E\left(\sum_{k=1}^{n_0} w_{0kj}\right)}{\mathrm{SE}\left(\sum_{k=1}^{n_1} w_{1kj} - \sum_{k=1}^{n_0} w_{0kj}\right)}$$
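As a concrete illustration of this kind of componentwise screening, the following is a minimal sketch in Python, instantiated with the ordinary two-sample t-statistic (the specific case noted above). The threshold value and the simulated data are assumptions chosen only for demonstration and are not part of my method.

```python
# Minimal sketch (illustrative assumptions, not the full method): componentwise
# two-sample screening with the ordinary two-sample t-statistic as w_{ikj}.
import numpy as np

def componentwise_scores(X, y):
    """Return |T_j| for every feature using the two-sample t-statistic."""
    X1, X0 = X[y == 1], X[y == 0]
    n1, n0 = X1.shape[0], X0.shape[0]
    mean_diff = X1.mean(axis=0) - X0.mean(axis=0)
    # estimated standard error of the difference in class means, per feature
    se = np.sqrt(X1.var(axis=0, ddof=1) / n1 + X0.var(axis=0, ddof=1) / n0)
    return np.abs(mean_diff / se)

def select_features(X, y, threshold):
    """Indices of features whose absolute score exceeds the threshold."""
    return np.where(componentwise_scores(X, y) > threshold)[0]

# Toy example: p = 1000 features, only the first 10 carry signal.
rng = np.random.default_rng(0)
n, p, s = 200, 1000, 10
y = rng.integers(0, 2, size=n)
X = rng.standard_normal((n, p))
X[y == 1, :s] += 1.0                           # shift the s signal features in class 1
print(select_features(X, y, threshold=3.0))    # should recover most of 0..9
```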
• Application to Text and Data Mining: When the feature space dimension p is
very high compared to the sample size n, the Fisher discriminant rule performs poorly
due to diverging spectra as demonstrated by Bickel and Levina (2004). These authors showed that the independence rule in which the covariance structure is ignored
performs better than the naive Fisher rule (NFR) in the high dimensional setting.
Fan and Fan (2008) demonstrated further that even for the independence rules, a
procedure using all the features can be as poor as random guessing due to noise
accumulation in estimating population centroids in high-dimensional feature space.
As a result, Fan and Fan (2008) proposed the Features Annealed Independence Rule
(FAIR) that selects a subset of important features for classification. Dudoit et al.
(2002) reported that for microarray data, ignoring correlations between genes leads to
better classification results. Tibshirani et al. (2002) proposed the Nearest Shrunken
Centroid (NSC), which likewise employs the working independence structure. Similar
problems have also been studied in the machine learning community, for example by Domingos and
Pazzani (1997) and Lewis (1998). In microarray studies, correlation among different
genes is an essential characteristic of the data and is usually not negligible. Other examples include proteomics and metabolomics data, where correlation among biomarkers
is commonplace. More details can be found in Ackermann and Strimmer (2009).
Intuitively, the independence assumption among genes leads to a loss of critical information and is therefore suboptimal. We believe that in many cases the crucial point
is not whether to consider correlations, but how to incorporate the covariance
structure into the analysis while guarding against diverging spectra and
substantial noise accumulation. To overcome the problems with the independence rule, Fan et al. (2012) proposed the regularized optimal affine discriminant
(ROAD) and its two variants, S-ROAD1 and S-ROAD2. They first perform feature selection
based on the two-sample t-test and then apply a generalized linear discriminant function,
choosing the optimal discriminant function by minimizing the misclassification error
rate under some assumptions. Using simulation and real data analysis, they showed
that S-ROAD1 and S-ROAD2 perform better than FAIR. For text mining, however, I
applied the singular value decomposition (SVD) to the term-document matrix and then
ranked the SVD components by the absolute value of their two-sample t-test statistics. I
showed that my new method outperforms ROAD; a sketch of this procedure follows this item.
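The following is a minimal Python sketch of that procedure as described: project the documents onto the leading SVD (latent semantic) components of the term-document matrix and rank the components by the absolute two-sample t-statistic between the two document classes. The number of components k, the toy term-document matrix, and the function names are illustrative assumptions, not my exact implementation.

```python
# Sketch (assumed setup): rank latent semantic (SVD) components of a
# term-document matrix by the absolute two-sample t-statistic of the
# document projections in the two classes.
import numpy as np

def rank_svd_components(term_doc, labels, k=20):
    """term_doc: terms x documents matrix; labels: 0/1 class label per document."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    doc_coords = (s[:k, None] * Vt[:k]).T         # documents x k component scores
    Z1, Z0 = doc_coords[labels == 1], doc_coords[labels == 0]
    n1, n0 = Z1.shape[0], Z0.shape[0]
    se = np.sqrt(Z1.var(axis=0, ddof=1) / n1 + Z0.var(axis=0, ddof=1) / n0)
    t = (Z1.mean(axis=0) - Z0.mean(axis=0)) / se
    order = np.argsort(-np.abs(t))                # most discriminative component first
    return order, t[order]

# Toy example with a random term-document matrix (illustrative only).
rng = np.random.default_rng(1)
terms, docs = 500, 80
labels = rng.integers(0, 2, size=docs)
term_doc = rng.poisson(1.0, size=(terms, docs)).astype(float)
term_doc[:20, labels == 1] += 2.0                 # class-1 documents overuse some terms
order, scores = rank_svd_components(term_doc, labels, k=10)
print(order[:5], np.round(scores[:5], 2))
```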
Future Research
• General Theory and Methods: I am working toward developing new methods
for the high-dimensional classification problem with sparse and low-signal vectors. My new
methods will cover the exponential family of distributions.
• Extending the Two-class Classification Problem to the Multi-class Classification
Problem: Since the multi-class classification problem arises in many contemporary
statistical applications, such as document classification for author identification, I will be
working on extending my new methods to multi-class classification problems.
References
• Fan, J. and Fan, Y. (2008). High dimensional classification using features annealed
independence rules. Ann. Statist., 36, 2605-2637.
• Bickel, P. J. and Levina, E. (2004). Some theory for Fisher’s linear discriminant
function, "naive Bayes", and some alternatives when there are many more variables
than observations. Bernoulli 10, 989-1010.
• Mai, Q., Zou, H., and Yuan, M. (2012). A direct approach to sparse discriminant
analysis in ultra-high dimensions. Biometrika, 99, 29-42.
• Fan, J., Feng, Y., and Tong, X. (2012). A road to classification in high dimensional
space: the regularized optimal affine discriminant. J. R. Statist. Soc. B. 74, 745-771.
• Cao, H. (2007). Moderate deviations for two-sample t-statistics. ESAIM: Probability and Statistics, 11, 264-271.