The Higgs Boson Machine Learning Challenge in the CERN Large Hadron Collider

Brighton Nkomo
11 min read · Oct 11, 2021
The Large Hadron Collider

1. Introduction

The European Organization for Nuclear Research, known as CERN, is a European research organization that operates the largest particle physics laboratory in the world. CERN is also the birthplace of the World Wide Web (WWW). The LHC experiments produce about 90 petabytes of data per year, and an additional 25 petabytes per year come from the other (non-LHC) experiments at CERN. In total, CERN hosts more than 20 experiments, each run by an international collaboration.

As you might know already, CERN is where the famous Higgs boson particle was discovered. It is the particle that explains why things have mass; without it, things would have no mass, which means no stars, no galaxies and possibly no life as we know it.

1.1 The Problem

A simple definition of the problem is that we are trying to classify events into two categories:

  • (1) the signal event and
  • (2) background event.

The “signal” event here is the tau-tau decay of a Higgs boson, and a “background” event is any other Higgs boson decay that occurs, i.e. anything that is not tau-tau. It is essentially a binary classification challenge, just like the classic machine learning task of classifying whether a tumor is benign or malignant.

As we can see, the decay of the Higgs boson into two tau leptons (tau-tau) accounts for just 6.3% of Higgs boson decays; in fact, the Higgs boson decays into two bottom quarks (bb) almost 60% of the time. The dataset, however, does not consist of 6% signal events and 94% background events: the background is far less dominant than that. Roughly a third of the events are labeled as signal and the remaining two thirds are labeled as background.

1.2 The Dataset

The dataset was obtained from the CERN Open Data Portal. In the notebook I provide a list of what each feature means.

However, here are some details to get started:

  • all variables are floating point, except PRI_jet_num which is integer
  • variables prefixed with PRI (for PRImitives) are “raw” quantities about the bunch collision as measured by the detector.
  • variables prefixed with DER (for DERived) are quantities computed from the primitive features, which were selected by the physicists of ATLAS
  • it can happen that for some entries some variables are meaningless or cannot be computed; in this case, their value is −999.0, which is outside the normal range of all variables

It appears that missing or meaningless values (NaN entries, if there were any) were replaced by the CERN scientists with a sentinel value that lies outside the normal range of every feature: −999.
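If you want to treat these sentinels as genuine missing values, a minimal pandas sketch such as the one below can do it (the CSV file name is an assumption; point it at your local copy of the CERN Open Data file):

```python
import numpy as np
import pandas as pd

# Load the challenge data (file name is an assumption) and turn the -999.0
# sentinel into a proper NaN so that pandas and any imputation step can see it.
df = pd.read_csv("atlas-higgs-challenge-2014-v2.csv")
df = df.replace(-999.0, np.nan)

# Columns with the most formerly-sentinel values.
print(df.isna().sum().sort_values(ascending=False).head())
```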

To turn the vast amount of data collected in particle collider experiments into precise measurements of nature, particle physicists have to find subtle patterns in high-dimensional data. To this end, particle physicists use complex computer simulations which encode our understanding of elementary particle collisions, additional radiation patterns, particle decays, hadronization, interactions with the detector, and sensor readout. Unfortunately, the analysis of particle-collision data with computer simulations often faces a fundamental challenge.

Simulations implement a forward process: given a value of the physical parameters, they can generate synthetic observations. In the scientific process, however, physicists are interested in the inverse problem: given observed data, what is the most likely value of the physical parameters? The key quantity required for this inference, both in a frequentist and a Bayesian statistical framework, is the likelihood function, or the probability density of observed data as a function of the parameters. However, in general it is not possible to compute the likelihood function — this intractability is an essential problem for data analysis.

Over the years, different methods for parameter inference have been developed that do not require a tractable likelihood function. These methods are collectively known as simulation-based, or likelihood-free, inference techniques.

The main stages of a particle physics experiment at the LHC are:

  • Create protons from hydrogen and accelerate them to 0.99999999 c (here “c” is the speed of light)
  • Smash a bunch of protons together (the collisions are called ‘events’)
  • Convert raw detector information into ‘hits’ (pixels)
  • Reconstruct particle trajectories — ‘tracks’, ‘jets’, ‘showers’
  • Identify types of particles
  • Reconstruct decay structure
  • Filter meaningful events
  • Analyse statistical properties

These main stages are also highlighted by the right half of the parabola, starting from the turning point, in the picture below:

Picture based on F. Carminati, K. Cranmer and V. Innocente

Note that if you follow the parabola to the right from the ‘Raw Data’ turning point, you follow the path described above this picture: from the raw information that the detector records and CERN scientists collect → reconstructed points → track segments → track candidates and vertices → decays of a certain kind. At the same time, there is powerful software that can simulate every step that happens at the detector level and the subatomic level. In fact, there are event generators that produce signal-like patterns and allow you to estimate what kind of response you can get from the detector. Below is a picture I put together of the common simulation packages I have heard of: Geant, Genie and Pythia.

Pictures obtained from the simulation package links: Genie — http://www.genie-mc.org/, Pythia — https://pythia.org/, Geant — https://geant4.web.cern.ch/. More info about simulation packages at https://arxiv.org/pdf/1101.2599.pdf

The methodology developed on simulated data can also be applied to the real data.

Methodology

In this section I discuss the exploratory data analysis (EDA) I did, the inferential statistical testing I performed, and which machine learning models were used and why.

The data used to be available on Kaggle, and the metric that was supposed to be maximized was the approximate median significance (AMS):

AMS = \sqrt{2\left((s + b + b_r)\,\ln\!\left(1 + \frac{s}{b + b_r}\right) - s\right)},

where s and b are the weighted sums of selected signal (true positive) and background (false positive) events, and b_r = 10 is a regularization term.

I chose to use the log-loss metric instead of the AMS metric intended for the competition. Log loss is one of the most important classification metrics based on predicted probabilities. Although raw log-loss values are hard to interpret, log loss is still a good metric for comparing models: for any given problem, a lower log-loss value means better predictions. For a single prediction it is

\mathrm{LogLoss} = -\left(y \log(p) + (1 - y)\log(1 - p)\right),

where y is the true label and p is the predicted probability that y = 1.

Note that, unlike the AMS metric, the log loss heavily punishes a model for getting predictions very wrong. For example, if y = 1 and a model predicts that p is approximately zero, then only the first term inside the brackets is left (because 1 − y = 0), and that term becomes very large as p approaches zero: log(p) tends to negative infinity, so the minus sign outside the brackets sends −log(p) to positive infinity.
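To see this behaviour numerically, here is a small sketch using scikit-learn (the numbers are toy values, not taken from the dataset):

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy illustration of how log loss punishes a confidently wrong prediction.
y_true = np.array([1, 0, 1, 1])
p_good = np.array([0.90, 0.10, 0.80, 0.70])  # reasonable probabilities
p_bad = np.array([0.01, 0.10, 0.80, 0.70])   # first event predicted ~0 although y = 1

print(log_loss(y_true, p_good))  # small value
print(log_loss(y_true, p_bad))   # much larger, dominated by the -log(0.01) term
```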

Exploratory Data Analysis (EDA)

From the EDA I performed I found that there were no missing values, most likely because the CERN scientists replaced missing or meaningless values (if there were any) with a sentinel that lies outside the range of the features' values: −999.

To summarize, each column has 818238 rows of data, and the eleven columns that contain values replaced with the outlier −999 are:

  1. DER_mass_MMC: 124602 rows equal to −999
  2. DER_deltaeta_jet_jet: 580253 rows equal to −999
  3. DER_mass_jet_jet: 580253 rows equal to −999
  4. DER_prodeta_jet_jet: 580253 rows equal to −999
  5. DER_lep_eta_centrality: 580253 rows equal to −999
  6. PRI_jet_leading_pt: 327371 rows equal to −999
  7. PRI_jet_leading_eta: 327371 rows equal to −999
  8. PRI_jet_leading_phi: 327371 rows equal to −999
  9. PRI_jet_subleading_pt: 580253 rows equal to −999
  10. PRI_jet_subleading_eta: 580253 rows equal to −999
  11. PRI_jet_subleading_phi: 580253 rows equal to −999
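These counts can be reproduced with a short pandas sketch like the one below (the CSV file name is an assumption):

```python
import pandas as pd

# Count, for each column, how many rows carry the -999.0 sentinel.
df = pd.read_csv("atlas-higgs-challenge-2014-v2.csv")
sentinel_counts = (df == -999.0).sum()
print(sentinel_counts[sentinel_counts > 0].sort_values(ascending=False))
```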

As one might expect, this dataset has a class imbalance; real-life datasets almost always do, and the Higgs boson decaying into two tau leptons is no exception, as we can see in the picture below:

Class frequency plot: this class imbalance suggests that we should use a stratified-shuffle splitting validation strategy.

I also calculated and plotted the mutual information scores.

Mutual information scores: what's surprising here is that most of the variables prefixed with DER (for DERived) are the variables that contain the most outliers (the −999 sentinel). Some of these variables carry the sentinel in about two thirds of their rows, yet they contain far more information about the target variable than the variables with no outliers or missing values at all. This is a somewhat surprising result, and luckily I did not drop these columns just because they consist mostly of outliers!

As we can see, the ‘Weight’ feature contains much more information about the target variable than any other feature, which can be taken as a hint that there is a data leak (more on this later, under the section ‘What I Found’).
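The mutual-information ranking can be computed with scikit-learn roughly as follows; this is a sketch on a subsample (the file name, the subsample size and the presence of an ‘EventId’ identifier column are assumptions based on the challenge data):

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Subsample to keep the mutual-information estimate quick.
df = pd.read_csv("atlas-higgs-challenge-2014-v2.csv").sample(50_000, random_state=0)

y = (df["Label"] == "s").astype(int)                        # signal = 1, background = 0
X = df.drop(columns=["Label", "EventId"], errors="ignore")  # 'Weight' kept on purpose to expose the leak
X = X.select_dtypes("number")

mi = mutual_info_classif(X, y, random_state=0)
print(pd.Series(mi, index=X.columns).sort_values(ascending=False).head(10))
```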

Thinking About Models

Due to the evident class imbalance, I used a stratified shuffle split so that the training and validation data keep the same ratio of classes before I trained my models.
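A minimal sketch of that split with scikit-learn (the file name and the 80/20 split are assumptions):

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

df = pd.read_csv("atlas-higgs-challenge-2014-v2.csv")
y = (df["Label"] == "s").astype(int)
X = df.drop(columns=["Label", "Weight"]).select_dtypes("number")  # 'Weight' dropped, see the data-leak section

# One stratified 80/20 split that preserves the signal/background ratio.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, valid_idx = next(splitter.split(X, y))
X_train, X_valid = X.iloc[train_idx], X.iloc[valid_idx]
y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx]
```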

There are essentially four common types of machine learning models: linear, tree-based, kNN-based and neural network models. As my first attempt I tried out linear models, so I had to check the correlations between features, i.e. check for what is called multicollinearity.

Multicollinearity happens when one predictor variable in a multiple regression model can be linearly predicted from the others with a high degree of accuracy. This can lead to skewed or misleading results. Luckily, decision tree and boosted tree algorithms are naturally robust to multicollinearity: when deciding where to split, a tree will pick only one of a set of perfectly correlated features. Other algorithms like logistic regression or linear regression, however, are not immune to the problem.

The absolute correlation here is just the absolute value of the correlation.

As we can see, most feature pairs have close to zero correlation with each other, which is good news for my linear models, but I found 496 pairs of features that show strong positive or negative correlation (i.e. absolute correlation values greater than 0.8).
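The pair count can be reproduced with a sketch like this (the file name is an assumption; the 0.8 threshold follows the text above):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("atlas-higgs-challenge-2014-v2.csv")
X = df.drop(columns=["Label", "Weight"]).select_dtypes("number")

# Absolute correlation matrix; keep only the upper triangle (k=1 excludes the
# diagonal) so that every pair is counted exactly once.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

pairs = upper.stack()
strong = pairs[pairs > 0.8]
print(len(strong))                                  # number of strongly correlated pairs
print(strong.sort_values(ascending=False).head())   # the most correlated ones
```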

Another model that seemed worth trying was KNN. However, it soon became apparent that it would take ages to determine the best k-value if, say, I trained KNN models for every k from 1 to 40. The downside of KNN models is that making predictions (and therefore validating each k) takes a long time because of the distance calculations. Despite this disadvantage, I still used a KNN model, and it actually did better than all of the linear models.
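For reference, a single KNN fit on a subsample might look like the sketch below (the file name, subsample size and k = 15 are arbitrary assumptions, not tuned values; feature scaling matters because KNN is distance-based):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("atlas-higgs-challenge-2014-v2.csv").sample(50_000, random_state=0)
y = (df["Label"] == "s").astype(int)
X = df.drop(columns=["Label", "Weight"]).select_dtypes("number")

X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)

# Scale features, then fit a KNN classifier with an arbitrary k.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))
knn.fit(X_tr, y_tr)
print(knn.score(X_va, y_va))
```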

What I Found

Data Leakage and Quality Metrics

The 'Weight' feature contains information about the target variable 'Label', so including it in the models would cause data leakage. If the Weight feature is not removed before training, the result is near-perfect but overly optimistic classifiers. In particular, a logistic regression model with either L1 regularization (lasso) or L2 regularization (ridge) becomes an essentially perfect classifier. The data leakage can be explained by the following two formulas defined at the beginning of my notebook:

As we can see from the two formulas above, the weight w_i tells you something about the target variable (whether the event Label is s or b). For this reason, the Weight column was simply removed from the list of features fed to my models. Here’s the proper confusion matrix without data leakage:

As we can see, the logistic regression model with L1 regularization (lasso) performed slightly better than the one with L2 regularization (ridge), and both regularized models were a slight improvement over the plain logistic regression model (Lr). The linear models could be improved with a dimensionality reduction technique such as principal component analysis (PCA), but most of our features are only weakly correlated, and PCA is mainly useful when there is strong correlation between features, so I suspected PCA would not have a large impact. The main focus of the project therefore quickly turned to tree-based models and neural networks.
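For reference, a sketch of the leak-free linear baselines discussed above, with ‘Weight’ dropped before fitting L1- and L2-regularized logistic regression (the file name, subsample size and hyperparameters are assumptions):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Subsample only to keep the sketch quick; the full dataset works the same way.
df = pd.read_csv("atlas-higgs-challenge-2014-v2.csv").sample(100_000, random_state=0)
y = (df["Label"] == "s").astype(int)
X = df.drop(columns=["Label", "Weight"]).select_dtypes("number")  # no 'Weight' -> no leak

X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)

for penalty, solver in [("l1", "liblinear"), ("l2", "lbfgs")]:
    clf = make_pipeline(StandardScaler(),
                        LogisticRegression(penalty=penalty, solver=solver, max_iter=1000))
    clf.fit(X_tr, y_tr)
    print(penalty, clf.score(X_va, y_va))
```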

As it stands, I used a tree-based model, gradient boosting (GB), and a simple neural network (NN). The gradient boosting model is the best model at the time of writing.

The error, in this case the log loss, started to plateau between 200 and 400 trees.
The gradient boosting classifier confusion matrix.
The ROC curve of the gradient boosted classifier. NB: in the machine learning community, the ROC curve is plotted as the true positive rate (TPR) against the false positive rate (FPR). In the high energy physics community, however, the ROC curve is plotted as one minus the false positive rate (1 − FPR) against the true positive rate. The 1 − FPR is called the background rejection and the TPR is called the signal efficiency.

As we can see from the ROC curve, the gradient boosted classifier performs well, far better than the linear models, even without any hyperparameter fine-tuning or 4-fold cross-validation. The neural network does not do as well as the gradient boosted model, but it is much better than any of the linear models I tried. Perhaps some preprocessing, feature engineering, additional hidden layers and hyperparameter tuning could improve my Keras neural network's performance, maybe even beyond the gradient boosted model.
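Here is a sketch of the gradient boosting baseline and the quantities that go into the ROC curve (the file name and hyperparameters are assumptions; roughly 300 trees reflects the plateau mentioned above, and the subsample is only there to keep the sketch fast):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

df = pd.read_csv("atlas-higgs-challenge-2014-v2.csv").sample(100_000, random_state=0)
y = (df["Label"] == "s").astype(int)
X = df.drop(columns=["Label", "Weight"]).select_dtypes("number")

X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)

gb = GradientBoostingClassifier(n_estimators=300, learning_rate=0.1, max_depth=3, random_state=0)
gb.fit(X_tr, y_tr)

p_va = gb.predict_proba(X_va)[:, 1]
fpr, tpr, _ = roc_curve(y_va, p_va)          # ML convention: TPR vs FPR
print("AUC:", roc_auc_score(y_va, p_va))
# The HEP convention would plot tpr (signal efficiency) against 1 - fpr (background rejection).
```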

Conclusion: the gradient boosted classifier was the best classifier for determining whether an event is background or signal, with an accuracy of 0.84. The second-best classifier was the Keras neural network with two hidden layers, one with 100 neurons and the other with 60 neurons; both hidden layers used the ReLU activation function and the accuracy was 0.83.
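For completeness, a Keras network with that architecture can be sketched as follows (the file name, optimizer and training settings are assumptions; the two ReLU hidden layers of 100 and 60 units follow the description above):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

df = pd.read_csv("atlas-higgs-challenge-2014-v2.csv")
y = (df["Label"] == "s").astype(int).values
X = df.drop(columns=["Label", "Weight"]).select_dtypes("number").values

X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_va = scaler.transform(X_tr), scaler.transform(X_va)

# Two hidden layers (100 and 60 ReLU units) and a sigmoid output for binary classification.
model = keras.Sequential([
    keras.layers.Input(shape=(X_tr.shape[1],)),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(60, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_tr, y_tr, validation_data=(X_va, y_va), epochs=10, batch_size=256)
```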

Next Steps

  • Improving the keras neural network.
  • Feature engineering (requires some particle physics knowledge).

What I Did Not Do

  • Model ensembling. Although gradient boosting is itself an ensemble method, I did not try ensembling different types of models.
  • Try to find the best k-value for my kNN-based model. I wanted to find the best k in the range 1 to 40, which means training 40 kNN models, and evaluating them takes a lot of time because of the distance calculations. Perhaps correctly implementing the steps from this article would help speed up the process of determining the best k-value: Make kNN 300 times faster than Scikit-learn’s in 20 lines!. Alternatively, one might try to implement the code and ideas from the KNN (K-Nearest Neighbors) is Dead! blog post, whose author claims that:

“Long live ANNs (Approximate Nearest Neighbors) for their whopping 380X speedup over sklearn’s KNN while delivering 99.3% similar results.”

Thank you for reading my blog post. Click here to view my notebook on GitHub.
