CS 545 - H.W. 1
Hal Finkel
Data Mining in High-Energy Particle Physics
High-energy particle physics experiments, especially those carried out at large particle
accelerators, generate huge amounts of data. The ATLAS detector, part of the forthcoming LHC
experiment at CERN, is predicted to generate between 200 and 400 MB/s while running,
resulting in a total output measured in petabytes per year [1]. While automated analysis techniques
have become commonplace within the experimental physics community, computers processing
the data from ATLAS will not only perform a majority of the data analysis, but will use
data-mining techniques to determine what to analyze. It has been realized that only a computer
will be able to process enough data to appropriately select regions of interest, a task that in past
experiments fell within the purview of human specialists.
The primary challenge facing the physicists designing and implementing the ATLAS data
analysis software is that the software needs to be able to quickly classify physically-interesting
events, while rejecting most background signals, based only on input variables which are
available very soon after the event occurs. While a full event-reconstruction analysis would be
able to easily accomplish this task, such an analysis is very complicated, and as a result is orders
of magnitude too slow to be used for event tagging. Additionally, there is not enough on-site
storage capacity to hold all of the raw data, so some mechanism is needed to tag interesting
events for later analysis before they can be fully analyzed.
For example, the data from ATLAS, even after being filtered by the level-1 trigger, will arrive
at a rate between 75 and 100 kHz on 1600 readout links, each delivering data at 160 MB/s. This
means that the raw data rate of roughly 250 GB/s must be reduced by a factor of
1000 using a highly-accurate method of discriminating interesting events from uninteresting and
background events.
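The quoted figures can be cross-checked with a few lines of arithmetic (a sketch using only the numbers given in the text; 1 GB = 1000 MB is assumed):

```python
# Sanity check on the figures quoted above (all numbers are from the text;
# 1 GB = 1000 MB is assumed).
n_links = 1600                 # level-1 readout links
link_rate_mb_s = 160           # MB/s delivered per link
raw_rate_gb_s = n_links * link_rate_mb_s / 1000
print(raw_rate_gb_s)           # 256.0, i.e. roughly the 250 GB/s quoted

reduction_factor = 1000
stored_rate_mb_s = raw_rate_gb_s * 1000 / reduction_factor
print(stored_rate_mb_s)        # 256.0 MB/s
```

Note that the post-selection rate of about 256 MB/s is consistent with the 200-400 MB/s output estimate given in the introduction.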
A number of different classification techniques have been developed and evaluated for the
purpose of classifying the detector events. Different methods are needed in order to efficiently
handle different sub-tasks: some are linear and some are non-linear; some classifiers need to be
fast and simple enough to be encoded in hardware while some can run on complex computer
clusters. Almost all of the input variables to the classification algorithms exhibit non-trivial
correlations, and all are noisy. Examples of such input variables are: total mass, total momenta,
decay angles, total charge, and d[Energy]/dx in the detectors. In some cases, some information
regarding the spatial distribution of momenta and charge is also available.
[1] Shawn McKee, "The ATLAS computing model: status, plans and future possibilities," Computer
Physics Communications, Volume 177, Issues 1-2 (Proceedings of the Conference on
Computational Physics 2006, CCP 2006), July 2007, Pages 231-234.
Data Preprocessing:
Prior to input into one of the classification algorithms, the data is often fed through a
preprocessing step. One such step is linear decorrelation, which essentially rotates the
data so that it is aligned with its principal components. Linear cuts can also be applied to the
data where the cut boundaries are determined by Monte Carlo sampling, genetic algorithms or
simulated annealing.
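The decorrelation step can be illustrated with a short sketch: rotate a toy correlated sample onto the eigenvectors of its covariance matrix. This shows only the basic idea, not TMVA's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sample of two correlated input variables.
raw = rng.multivariate_normal(mean=[0.0, 0.0],
                              cov=[[2.0, 1.2], [1.2, 1.0]], size=5000)

# Linear decorrelation: rotate the (centred) data onto the eigenvectors of
# its covariance matrix, i.e. align it with its principal components.
cov = np.cov(raw, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
decorrelated = (raw - raw.mean(axis=0)) @ eigvecs

# After the rotation the off-diagonal covariance vanishes.
print(np.cov(decorrelated, rowvar=False).round(6))
```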
Classification Techniques [2]:
Projective Likelihood Estimator:
This technique provides a likelihood estimator for each class by computing, under the assumption
that the input variables are independent, the conditional probability that an input belongs to a given class.
This procedure uses the probability density function (PDF) estimated from the training data. The
estimated PDF is determined using non-parametric fitting: specifically, fitting spline functions
of orders 1, 2, 3, and 5 to binned training data or using a technique known as unbinned adaptive
kernel density estimation with Gaussian smearing.
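A minimal sketch of the idea follows, with plain histograms standing in for the spline and kernel-density fits described above; the toy samples and binning are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy training samples for two classes and two input variables.
signal = rng.normal(loc=[1.0, 0.5], scale=1.0, size=(10000, 2))
background = rng.normal(loc=[-1.0, -0.5], scale=1.0, size=(10000, 2))

# One binned 1-D PDF per variable and class (histograms stand in for the
# spline or kernel-density estimates used in practice).
bins = np.linspace(-5.0, 5.0, 51)
sig_pdfs = [np.histogram(signal[:, j], bins=bins, density=True)[0] for j in range(2)]
bkg_pdfs = [np.histogram(background[:, j], bins=bins, density=True)[0] for j in range(2)]

def likelihood(event, pdfs):
    # Independence assumption: the joint density is the product of the
    # per-variable densities.
    idx = np.clip(np.digitize(event, bins) - 1, 0, len(bins) - 2)
    return np.prod([pdfs[j][idx[j]] for j in range(len(event))])

def signal_probability(event):
    ls, lb = likelihood(event, sig_pdfs), likelihood(event, bkg_pdfs)
    return ls / (ls + lb + 1e-30)

print(signal_probability(np.array([1.5, 1.0])))    # near 1: signal-like
print(signal_probability(np.array([-1.5, -1.0])))  # near 0: background-like
```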
Fisher’s Linear Discriminant Analysis:
This technique classifies events by taking a linear combination of the input variables. The
coefficients of the linear combination are the "Fisher coefficients" and are computed from the
covariance matrices of the classes in the training data. This classifier is optimal for
linearly-correlated, Gaussian-distributed input variables. A more general variant of this
algorithm, known as function discriminant analysis, which uses an arbitrary classification
function, has also been implemented.
Artificial Neural Networks:
This technique trains an artificial neural network of a variable number of layers using the
back-propagation algorithm. This algorithm is not limited to linear classification problems, and
so has been used in cases where a fast non-linear classifier is needed. The disadvantage of this
approach is generally considered to be that the resulting neural network is difficult to interpret.
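A minimal back-propagation sketch on a toy non-linear problem is shown below; the network size, learning rate, and data are illustrative assumptions, and TMVA's multilayer perceptron is considerably more elaborate.

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy non-linear problem: the class is the sign of the product of two input
# variables, which no linear classifier can separate.
X = rng.uniform(-1, 1, size=(2000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)

# One hidden layer of tanh units, trained by plain back-propagation
# (full-batch gradient descent on the cross-entropy loss).
W1, b1 = rng.normal(0, 1, (2, 16)), np.zeros(16)
W2, b2 = rng.normal(0, 1, (16, 1)), np.zeros(1)
sigmoid = lambda z: 1 / (1 + np.exp(-z))
lr = 2.0

for _ in range(8000):
    h = np.tanh(X @ W1 + b1)                 # forward pass
    p = sigmoid(h @ W2 + b2).ravel()
    grad_out = (p - y)[:, None] / len(X)     # backward pass
    grad_h = (grad_out @ W2.T) * (1 - h**2)
    W2 -= lr * (h.T @ grad_out)
    b2 -= lr * grad_out.sum(axis=0)
    W1 -= lr * (X.T @ grad_h)
    b1 -= lr * grad_h.sum(axis=0)

print(f"training accuracy: {((p > 0.5) == y).mean():.3f}")
```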
Decision Trees:
This technique recursively cuts the input variable domain into smaller and smaller classification
regions. It can handle highly-non-linear training sets, and while classification can proceed with
reasonable efficiency, the training time is often very long. Also implemented are Boosted
Decision Trees, which use a forest of weighted decision trees to classify inputs. The
classification is accomplished by majority vote. Several techniques are used to improve the
structural stability of the trees under additions to the training set.
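The boosting idea can be sketched with depth-1 "stumps" standing in for full decision trees, and AdaBoost as the (assumed) weighting scheme; the ring-shaped toy sample is illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
# Toy non-linear sample: "signal" events inside a disk, "background" outside.
X = rng.uniform(-2, 2, size=(4000, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 2.0, 1, -1)

def best_stump(X, y, w):
    # A depth-1 tree: one cut on one input variable, chosen to minimise the
    # weighted misclassification error.
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for thr in np.quantile(X[:, j], np.linspace(0.05, 0.95, 19)):
            for sign in (1, -1):
                pred = np.where(X[:, j] > thr, sign, -sign)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, thr, sign, err)
    return best

# AdaBoost: grow a forest of weighted stumps, re-weighting misclassified
# events after each round; classify by the weighted majority vote.
w = np.full(len(y), 1.0 / len(y))
forest = []
for _ in range(50):
    j, thr, sign, err = best_stump(X, y, w)
    alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
    pred = np.where(X[:, j] > thr, sign, -sign)
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()
    forest.append((alpha, j, thr, sign))

score = sum(a * np.where(X[:, j] > t, s, -s) for a, j, t, s in forest)
accuracy = (np.sign(score) == y).mean()
print(f"training accuracy: {accuracy:.3f}")
```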
Predictive Learning via Rule Ensembles:
This technique is an enhancement of Fisher's linear method which, in addition to using Fisher's
linear coefficients, also uses a linear combination of geometric cut functions that return 1 or 0
depending on which portion of the input domain contains their argument. This technique can
handle non-linear classification problems.

[2] Andreas Hoecker, CERN, "Machine Learning Techniques for HEP Data Analysis," ATLAS Top
Physics Workshop, October 19, 2007, LPC-Grenoble, France.
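The combination of linear terms and 0/1 cut functions can be sketched as follows; a plain least-squares fit stands in for the regularised fit used by the actual method, and the rectangular rules are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
# Toy sample: signal inside a rectangular region, background outside.
X = rng.uniform(-2, 2, size=(3000, 2))
y = np.where((np.abs(X[:, 0]) < 1) & (np.abs(X[:, 1]) < 1), 1.0, -1.0)

# Base features: the linear terms of Fisher's method...
features = [X[:, 0], X[:, 1]]
# ...plus rule terms: geometric cut functions returning 1 or 0 depending on
# whether the event falls inside a rectangular region of the input domain.
rules = [((-1, 1), (-1, 1)), ((-2, 0), (-2, 2)), ((-2, 2), (0, 2))]
for (x0, x1), (y0, y1) in rules:
    inside = ((X[:, 0] >= x0) & (X[:, 0] < x1)
              & (X[:, 1] >= y0) & (X[:, 1] < y1))
    features.append(inside.astype(float))

# Fit one linear combination over linear + rule terms (least squares here;
# the actual method uses a regularised fit).
F = np.column_stack([np.ones(len(y))] + features)
coef, *_ = np.linalg.lstsq(F, y, rcond=None)
accuracy = (np.sign(F @ coef) == y).mean()
print(f"training accuracy: {accuracy:.3f}")
```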
Support Vector Machines:
This is a linear classification technique which finds the separating boundary between input
classes that has the largest margin. For non-linearly-separable inputs, the variables are
transformed into a higher-dimensional space in which the data is linearly separable. The
variables are not, however,
transformed explicitly. Instead, the transform is approximated using a Kernel function which
approximates the inner product between higher-dimensional vector pairs. Gaussian, Polynomial
and Sigmoid kernels are all available.
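The three kernels can be written down directly; for the polynomial case, the implicit higher-dimensional mapping is small enough to verify explicitly. The parameter names (`sigma`, `degree`, `kappa`, `theta`) are conventional choices, not TMVA's option names.

```python
import numpy as np

# The three kernels mentioned above, each computing an inner product between
# vectors implicitly mapped into a higher-dimensional space.
def gaussian_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma**2))

def polynomial_kernel(a, b, degree=2, c=1.0):
    return (a @ b + c) ** degree

def sigmoid_kernel(a, b, kappa=1.0, theta=0.0):
    return np.tanh(kappa * (a @ b) + theta)

# For the degree-2 polynomial kernel in 2-D with c = 1, the mapping is
# explicit: phi(x) = (x1^2, x2^2, sqrt(2) x1 x2, sqrt(2) x1, sqrt(2) x2, 1),
# and k(a, b) = phi(a) . phi(b).
def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1],
                     np.sqrt(2) * x[0], np.sqrt(2) * x[1], 1.0])

a, b = np.array([0.5, -1.0]), np.array([2.0, 0.3])
print(np.isclose(polynomial_kernel(a, b), phi(a) @ phi(b)))  # True
```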
These classification-technique implementations are collected in the Toolkit for Multivariate Data
Analysis (TMVA), which is open source and freely available from http://tmva.sourceforge.net/.
It will be used to analyze the data from the ATLAS experiment. While most parts of this toolkit
can be used in a standalone configuration, the toolkit has been integrated into the data storage
and analysis package ROOT. ROOT is also open source and is freely available from
http://root.cern.ch/. Both TMVA and ROOT are C++-based, and ROOT comes integrated with
CINT, which is a C++ interpreter. A python binding, called PyROOT, is also available.