CS 545 - H.W. 1
Hal Finkel
01/30/2008

Data Mining in High-Energy Particle Physics

High-energy particle physics experiments, especially those carried out at large particle accelerators, generate huge amounts of data. The ATLAS detector, part of the forthcoming LHC experiment at CERN, is predicted to generate between 200 and 400 MB/s while running, resulting in a total output measured in petabytes per year [1]. While automated analysis techniques have become commonplace within the experimental physics community, computers processing the data from ATLAS will not only perform a majority of the data analysis, but will use data-mining techniques to determine what to analyze. It has been realized that only a computer will be able to process enough data to appropriately select regions of interest, a job which in past experiments fell within the purview of human specialists.

The primary challenge facing the physicists designing and implementing the ATLAS data-analysis software is that the software must be able to quickly classify physically interesting events, while rejecting most background signals, based only on input variables which are available very soon after the event occurs. While a full event-reconstruction analysis could easily accomplish this task, such an analysis is very complicated and, as a result, orders of magnitude too slow to be used for event tagging. Additionally, the on-site storage capacity does not exist to store all of the raw data, so some mechanism is necessary to tag interesting events for later analysis before the events can be fully analyzed. For example, the data from ATLAS, even after being filtered by the level-1 trigger, will arrive with a frequency between 75 and 100 kHz on 1600 readout links, each of which delivers data at 160 MB/s.
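As a rough consistency check on these figures (a sketch assuming all 1600 links run at the full 160 MB/s simultaneously):

```python
# Aggregate post-level-1 data rate implied by the figures quoted above.
links = 1600                 # readout links after the level-1 trigger
bytes_per_link = 160e6       # 160 MB/s per link
total = links * bytes_per_link
print(total / 1e9, "GB/s")   # → 256.0 GB/s, consistent with the ~250 GB/s figure

# A reduction factor of 1000 brings this down to a storable rate.
print(total / 1000 / 1e6, "MB/s")  # → 256.0 MB/s
```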
This means that the raw data rate of 250 GB/s must be reduced by a factor of 1000 using a highly accurate method of discriminating interesting events from uninteresting and background events.

A number of different classification techniques have been developed and evaluated for the purpose of classifying the detector events. Different methods are needed in order to efficiently handle different sub-tasks: some are linear and some are non-linear; some classifiers need to be fast and simple enough to be encoded in hardware, while others can run on complex computer clusters. Almost all of the input variables to the classification algorithms exhibit non-trivial correlations, and all are noisy. Examples of such input variables are: total mass, total momenta, decay angles, total charge, and d[Energy]/dx in the detectors. In some cases, some information regarding the spatial distribution of momenta and charge is also available.

[1] Shawn McKee, "The ATLAS computing model: status, plans and future possibilities," Computer Physics Communications, Volume 177, Issues 1-2 (Proceedings of the Conference on Computational Physics 2006, CCP 2006), July 2007, Pages 231-234.

Data Preprocessing:

Prior to input into one of the classification algorithms, the data is often fed through a preprocessing step. One such step is called linear decorrelation, which, essentially, rotates the data so that it is aligned with its principal components. Linear cuts can also be applied to the data, where the cut boundaries are determined by Monte Carlo sampling, genetic algorithms, or simulated annealing.

Classification Techniques [2]:

Projective Likelihood Estimator: This technique provides a likelihood estimator for each class by computing, under the assumption that the input variables are independent, the conditional probability that an input belongs to a given class. This procedure uses the probability density function (PDF) estimated from the training data.
The estimated PDF is determined using non-parametric fitting: specifically, fitting spline functions of orders 1, 2, 3, and 5 to binned training data, or using a technique known as unbinned adaptive kernel density estimation with Gaussian smearing.

Fisher's Linear Discriminant Analysis: This technique classifies events by taking a linear combination of the input variables. The coefficients of the linear combination are the "Fisher coefficients" and are computed from the covariance matrices of the classes in the training data. This classifier is optimal for linearly correlated, Gaussian-distributed input variables. A more general variant of this algorithm, known as function discriminant analysis, which uses an arbitrary classification function, has also been implemented.

Artificial Neural Networks: This technique trains an artificial neural network with a configurable number of layers using the back-propagation algorithm. This algorithm is not limited to linear classification problems, and so has been used in cases where a fast non-linear classifier is needed. The disadvantage of this approach is generally considered to be that the resulting neural network is difficult to "understand."

Decision Trees: This technique recursively cuts the input-variable domain into smaller and smaller classification regions. It can handle highly non-linear training sets, and while classification can proceed with reasonable efficiency, the training time is often very long. Also implemented are Boosted Decision Trees, which use a forest of weighted decision trees to classify inputs; the classification is accomplished by majority vote. Several techniques are used to improve the structural stability of the trees under additions to the training set.
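The majority-vote idea behind a forest of decision trees can be sketched with depth-one trees ("decision stumps"). This is a deliberately simplified illustration, not the TMVA implementation: real boosted trees weight both the training events and the trees, while the stumps below are unweighted and the tiny dataset is invented (feature 0 playing the role of, say, total mass).

```python
def train_stump(data):
    """Find the single (feature, threshold) cut that best separates the classes."""
    best = None
    for feat in range(len(data[0][0])):
        for thr in (x[feat] for x, _ in data):
            # Predict class 1 when the feature value is >= the threshold.
            correct = sum((x[feat] >= thr) == (label == 1) for x, label in data)
            acc = max(correct, len(data) - correct) / len(data)
            sign = 1 if correct >= len(data) - correct else -1
            if best is None or acc > best[0]:
                best = (acc, feat, thr, sign)
    return best[1:]  # (feature, threshold, orientation)

def stump_predict(stump, x):
    feat, thr, sign = stump
    raw = 1 if x[feat] >= thr else 0
    return raw if sign == 1 else 1 - raw

def forest_predict(forest, x):
    """Classify by unweighted majority vote over the forest."""
    votes = sum(stump_predict(s, x) for s in forest)
    return 1 if votes * 2 >= len(forest) else 0

# Invented toy events: class 1 ("signal") has a larger feature 0 than class 0.
data = [([4.8, 0.1], 1), ([5.2, -0.3], 1), ([4.5, 0.7], 1), ([5.9, 0.0], 1),
        ([1.9, 0.2], 0), ([2.4, -0.5], 0), ([1.5, 0.9], 0), ([2.8, 0.1], 0)]

# Train each stump on an overlapping subset, a crude stand-in for reweighting.
subsets = [data[:6], data[2:], data]
forest = [train_stump(s) for s in subsets]

print(forest_predict(forest, [5.0, 0.0]))  # → 1 (signal-like)
print(forest_predict(forest, [2.0, 0.0]))  # → 0 (background-like)
```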
Predictive Learning via Rule Ensembles: This technique is an enhancement to Fisher's linear method which, in addition to using Fisher's linear coefficients, also uses a linear combination of geometric cut functions which return 1 or 0 depending on which portion of the input domain contains their argument. This technique can handle non-linear classification problems.

Support Vector Machines: This is a linear classification technique which finds the set of cuts separating the input classes with the largest margins. In the case of non-linearly-separable inputs, the variables are transformed into a higher-dimensional space in which the data is linearly separable. The variables are not, however, transformed explicitly. Instead, the transform is handled implicitly using a kernel function which computes the inner product between pairs of higher-dimensional vectors. Gaussian, polynomial, and sigmoid kernels are all available.

These classification-technique implementations are contained in the Toolkit for Multivariate Data Analysis (TMVA), which is open source and freely available from http://tmva.sourceforge.net/. It will be used to analyze the data from the ATLAS experiment. While most parts of this toolkit can be used in a standalone configuration, the toolkit has been integrated into the data storage and analysis package ROOT. ROOT is also open source and is freely available from http://root.cern.ch/. Both TMVA and ROOT are C++-based, and ROOT comes integrated with CINT, a C++ interpreter. A Python binding, called PyROOT, is also available.

[2] Andreas Hoecker, CERN, "Machine Learning Techniques for HEP Data Analysis." ATLAS Top Physics Workshop, October 19, 2007, LPC-Grenoble, France.
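The kernel trick described in the Support Vector Machines section can be illustrated with a toy kernel perceptron. This is a hedged sketch, not TMVA's SVM: a real SVM maximizes the separating margin, whereas the perceptron below merely finds some separating function. The dataset is invented: points near the origin versus points far from it, which no single linear cut in the original two variables can separate, but which a degree-2 polynomial kernel separates easily.

```python
def poly_kernel(x, y):
    """Degree-2 polynomial kernel: the inner product of the two points after an
    implicit quadratic feature map, computed without ever constructing the
    higher-dimensional vectors."""
    return (1 + x[0] * y[0] + x[1] * y[1]) ** 2

def decision(alpha, data, kernel, x):
    """Dual-form decision function: a kernel-weighted sum over training points."""
    return sum(a * label * kernel(z, x) for a, (z, label) in zip(alpha, data))

def train(data, kernel, epochs=150):
    """Kernel perceptron: bump a point's dual coefficient whenever it is
    misclassified. Converges because the classes are separable in the
    implicit feature space (though not in the original one)."""
    alpha = [0] * len(data)
    for _ in range(epochs):
        for i, (x, label) in enumerate(data):
            pred = 1 if decision(alpha, data, kernel, x) >= 0 else -1
            if pred != label:
                alpha[i] += 1
    return alpha

# Invented toy data: class +1 lies near the origin, class -1 far away.
data = [([0.1, 0.2], 1), ([-0.3, 0.1], 1), ([0.2, -0.2], 1), ([-0.1, -0.3], 1),
        ([2.0, 0.1], -1), ([-1.9, 0.3], -1), ([0.2, 2.1], -1), ([0.1, -2.0], -1)]

alpha = train(data, poly_kernel)
correct = sum((1 if decision(alpha, data, poly_kernel, x) >= 0 else -1) == label
              for x, label in data)
print(f"training accuracy: {correct}/{len(data)}")  # → training accuracy: 8/8
```

In TMVA the same implicit-feature-space idea is used with the Gaussian, polynomial, and sigmoid kernels mentioned above, with margin maximization in place of the simple perceptron update shown here.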