Dawit Tadesse
Research Statement
257 S. Gay St, Apt A124, Auburn, AL 36830
Phone: 334 559 4954
Email: dgt0001@auburn.edu
Overview
In August 2010, I was admitted to the statistics PhD program in the Department of Mathematics
and Statistics at Auburn University. Prior to my PhD, I obtained master's degrees in
mathematics at both Auburn University, Auburn, AL, and the African University of Science and Technology (AUST), Abuja, Nigeria. My strong mathematics
background, combined with my statistics training at Auburn University, has
spurred my interest in both methodological and applied statistical and mathematical
research, which I have demonstrated through my PhD research, my collaborations
with scientists from other disciplines, and my participation in multiple applied
research and continuing-education workshops at universities across the nation.
My research philosophy is to develop novel statistical and mathematical techniques
and to improve existing ones in an effort to solve the various problems that
humanity faces. Among other areas, I have developed an interest in multivariate statistical methods for high-dimensional classification and their applications in text and data
mining. In my research I provide a survey of two-class linear discriminant methods,
and their associated feature selection algorithms, that have recently been developed
to address extremely high-dimensional data when the signal is sparse and weak. I
also conduct a rigorous comparative performance analysis for existing methods, such
as the Features Annealed Independence Rule (FAIR) and the Regularized Optimal Affine
Discriminant (ROAD), both in theory and simulation, and adapt and develop these
methods for text (latent semantic analysis) and data mining (predictive analytics)
applications. In addition, I develop general theory and new methods that provide
a unified approach towards this important application. I have shown that for my
methods, under certain regularity conditions, the probability of selecting all signal
features (without selecting unessential variables with no signal) converges to one and
the model-building algorithm leads to an optimal linear discriminant. I also currently
serve as the data analyst for the heart plant research group in the Industrial Engineering
department at Auburn University. In my mathematics master's thesis at Auburn University, I gave a new proof that generalized numerical ranges are convex. I
believe that my quantitative and analytical skills and my programming and computational competencies, acquired through coursework and real-world research experience, together with
the oral and written communication skills developed over seven years of teaching
at Auburn University and Haramaya University, Ethiopia, have prepared me
to succeed in an interdisciplinary research and teaching environment.
Current Research
• Generalized Feature Selection for High-dimensional Classification in
Sparse and Low-Signal Vectors: High-dimensional data analysis has become
increasingly frequent and important in diverse fields of science, engineering, and
the humanities, ranging from genomics and the health sciences to economics, finance, and
machine learning. Noise accumulation in high-dimensional prediction has long been
recognized in statistics and computer science. An explicit characterization of this phenomenon is
well known for high-dimensional regression problems. The quantification of the impact of dimensionality on classification was not well understood until Fan and Fan
(2008), who give a simple expression for how dimensionality affects misclassification
rates. Hall, Pittelkow and Ghosh (2008) study a similar problem for distance-based classifiers and implicitly observe the adverse impact of dimensionality. As shown in
Fan and Fan (2008), even for the independence classification rule, classification using
all features can be as bad as a random guess due to noise accumulation in estimating
the population centroids in high dimensional feature space. Therefore, variable selection is fundamentally important to high dimensional statistical modeling, including
regression and classification. A popular method for independence feature selection is
the two-sample t-test (Tibshirani et al., 2002; Fan and Fan, 2008), which is a specific
case of marginal screening in Fan and Lv (2008). Other componentwise tests such
as the rank-sum test are also popular. Fan and Fan (2008) give a condition under which
the two-sample t-test picks up all $s$ important features with probability one. Ash
Abebe and Shuxin Yin (PhD dissertation, 2010) give a condition under which the
Wilcoxon-Mann-Whitney test can pick up all the important features with probability
one. I give a generalized condition under which any two-sample componentwise test
$T_j$, defined below, can pick up all the important features with probability one; a computational
sketch of this type of screening follows this item. My statistic $T_j$ for feature $j$ is defined as
$$T_j = \frac{\sum_{k=1}^{n_1} w_{1kj} - \sum_{k=1}^{n_0} w_{0kj}}{\mathrm{SE}\left(\sum_{k=1}^{n_1} w_{1kj} - \sum_{k=1}^{n_0} w_{0kj}\right)}$$
where $w_{ikj}$, $i = 0, 1$, is the statistic for feature $j$ in class $i$. We assume that the
standard error of any statistic $X$ satisfies $\mathrm{SE}(X) \xrightarrow{P} \mathrm{SD}(X)$, and that
for some interval on $x > 0$ we have $P(|T_j - \eta_j| \geq x) = 2(1 - \Phi(x))(1 + f(x, n))$,
where $f(x, n) = f_1(x, n) + f_2(-x, n) = o(x)$ and where we define
$$\eta_j := \frac{E\left(\sum_{k=1}^{n_1} w_{1kj}\right) - E\left(\sum_{k=1}^{n_0} w_{0kj}\right)}{\mathrm{SE}\left(\sum_{k=1}^{n_1} w_{1kj} - \sum_{k=1}^{n_0} w_{0kj}\right)}$$
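As a concrete illustration of this kind of componentwise screening, the following is a minimal sketch in Python, instantiated with the ordinary two-sample t-statistic (the specific case noted above). The threshold value and the simulated data are assumptions chosen only for demonstration and are not part of my method.

```python
# Minimal sketch (illustrative assumptions, not the full method): componentwise
# two-sample screening with the ordinary two-sample t-statistic as w_{ikj}.
import numpy as np

def componentwise_scores(X, y):
    """Return |T_j| for every feature using the two-sample t-statistic."""
    X1, X0 = X[y == 1], X[y == 0]
    n1, n0 = X1.shape[0], X0.shape[0]
    mean_diff = X1.mean(axis=0) - X0.mean(axis=0)
    # estimated standard error of the difference in class means, per feature
    se = np.sqrt(X1.var(axis=0, ddof=1) / n1 + X0.var(axis=0, ddof=1) / n0)
    return np.abs(mean_diff / se)

def select_features(X, y, threshold):
    """Indices of features whose absolute score exceeds the threshold."""
    return np.where(componentwise_scores(X, y) > threshold)[0]

# Toy example: p = 1000 features, only the first 10 carry signal.
rng = np.random.default_rng(0)
n, p, s = 200, 1000, 10
y = rng.integers(0, 2, size=n)
X = rng.standard_normal((n, p))
X[y == 1, :s] += 1.0                           # shift the s signal features in class 1
print(select_features(X, y, threshold=3.0))    # should recover most of 0..9
```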
• Application to Text and Data Mining: When the feature space dimension p is
very high compared to the sample size n, the Fisher discriminant rule performs poorly
due to diverging spectra as demonstrated by Bickel and Levina (2004). These authors showed that the independence rule in which the covariance structure is ignored
performs better than the naive Fisher rule (NFR) in the high dimensional setting.
Fan and Fan (2008) demonstrated further that even for the independence rules, a
procedure using all the features can be as poor as random guessing due to noise
accumulation in estimating population centroids in high-dimensional feature space.
As a result, Fan and Fan (2008) proposed the Features Annealed Independence Rule
(FAIR) that selects a subset of important features for classification. Dudoit et al.
(2002) reported that for microarray data, ignoring correlations between genes leads to
better classification results. Tibshirani et al. (2002) proposed the Nearest Shrunken
Centroid (NSC), which likewise employs the working independence structure. Similar
problems have also been studied in the machine learning community, for example by Domingos and
Pazzani (1997) and Lewis (1998). In microarray studies, correlation among different
genes is an essential characteristic of the data and is usually not negligible. Other examples include proteomics and metabolomics data, where correlation among biomarkers
is commonplace. More details can be found in Ackermann and Strimmer (2009).
Intuitively, the independence assumption among genes leads to a loss of critical information and is therefore suboptimal. We believe that in many cases the crucial point
is not whether to consider correlations, but how to incorporate the covariance
structure into the analysis while guarding against diverging spectra and
substantial noise accumulation. To overcome the problems with the independence rule, Fan et al. (2012) proposed the regularized optimal affine discriminant
(ROAD) and its two variants, S-ROAD1 and S-ROAD2. They first perform feature selection
based on the two-sample t-test and then apply a generalized linear discriminant function,
choosing the optimal discriminant function by minimizing the misclassification error
rate under some assumptions. Using simulation and real data analysis, they showed
that S-ROAD1 and S-ROAD2 perform better than FAIR. For text mining, however, I
applied the singular value decomposition (SVD) to the term-document matrix and then
ranked the SVD components by the absolute value of their two-sample t-test statistics. I
showed that my new method outperforms ROAD; a sketch of this procedure follows this item.
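The following is a minimal Python sketch of that procedure as described: project the documents onto the leading SVD (latent semantic) components of the term-document matrix and rank the components by the absolute two-sample t-statistic between the two document classes. The number of components k, the toy term-document matrix, and the function names are illustrative assumptions, not my exact implementation.

```python
# Sketch (assumed setup): rank latent semantic (SVD) components of a
# term-document matrix by the absolute two-sample t-statistic of the
# document projections in the two classes.
import numpy as np

def rank_svd_components(term_doc, labels, k=20):
    """term_doc: terms x documents matrix; labels: 0/1 class label per document."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    doc_coords = (s[:k, None] * Vt[:k]).T         # documents x k component scores
    Z1, Z0 = doc_coords[labels == 1], doc_coords[labels == 0]
    n1, n0 = Z1.shape[0], Z0.shape[0]
    se = np.sqrt(Z1.var(axis=0, ddof=1) / n1 + Z0.var(axis=0, ddof=1) / n0)
    t = (Z1.mean(axis=0) - Z0.mean(axis=0)) / se
    order = np.argsort(-np.abs(t))                # most discriminative component first
    return order, t[order]

# Toy example with a random term-document matrix (illustrative only).
rng = np.random.default_rng(1)
terms, docs = 500, 80
labels = rng.integers(0, 2, size=docs)
term_doc = rng.poisson(1.0, size=(terms, docs)).astype(float)
term_doc[:20, labels == 1] += 2.0                 # class-1 documents overuse some terms
order, scores = rank_svd_components(term_doc, labels, k=10)
print(order[:5], np.round(scores[:5], 2))
```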
Future Research
• General Theory and Methods: I am working toward developing new methods
for the high-dimensional classification problem with sparse and low-signal vectors. My new
methods will cover the exponential family of distributions.
• Extending the Two-class Classification Problem to the Multi-class Classification
Problem: Since the multi-class classification problem arises in many contemporary
statistical applications, such as document classification for author identification, I will be
working on extending my new methods to multi-class classification problems.
References
• Fan, J. and Fan, Y. (2008). High dimensional classification using features annealed
independence rules. Ann. Statist., 36, 2605-2637.
• Bickel, P. J. and Levina, E. (2004). Some theory for Fisher’s linear discriminant
function, "naive Bayes", and some alternatives when there are many more variables
than observations. Bernoulli 10, 989-1010.
• Mai, Q., Zou, H., and Yuan, M. (2012). A direct approach to sparse discriminant
analysis in ultra-high dimensions. Biometrika, 99, 29-42.
• Fan, J., Feng, Y., and Tong, X. (2012). A road to classification in high dimensional
space: the regularized optimal affine discriminant. J. R. Statist. Soc. B. 74, 745-771.
• Cao, H. (2007). Moderate deviations for two-sample t-statistics. ESAIM: Probability and Statistics, 11, 264-271.