To obtain the code, clone the GitHub repository:

```sh
git clone https://github.com/gputtley/INNFER.git
```
Then, to install conda, run the command below. You will need to press enter and type yes through all of the prompts. If you already have conda installed, this step is not needed.

```sh
source setup.sh conda
```
To set up the environment, run:

```sh
source setup.sh env
```
At the beginning of every session you will need to run the following command to activate the environment:

```sh
source env.sh
```
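Putting these steps together, a typical first-time installation looks like the following sketch (it assumes the setup scripts are run from the repository root):

```sh
# One-time setup
git clone https://github.com/gputtley/INNFER.git
cd INNFER               # assumption: setup scripts are run from the repository root
source setup.sh conda   # skip if conda is already installed
source setup.sh env

# At the start of every session
source env.sh
```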
INNFER is run through the `scripts/innfer.py` script, with an accompanying yaml config file passed with the `--cfg` option (or a benchmark name with the `--benchmark` option). You also need to specify the step you want to run with the `--step` option. More information about these options is given further below. An example command is shown here:

```sh
python3 scripts/innfer.py --cfg="example_cfg.yaml" --step="PreProcess"
```
As some steps may take some time, jobs can be parallelised and submitted to a batch system such as HTCondor by adding `--submit="condor.yaml"`, which points to a configuration file for submission to the batch. Running INNFER on a batch is highly recommended in all cases.
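For example, to run the PreProcess step on HTCondor rather than locally:

```sh
python3 scripts/innfer.py --cfg="example_cfg.yaml" --step="PreProcess" --submit="condor.yaml"
```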
Example workflows are available in the `workflows` folder.
The input configuration file is how you give key information to INNFER about the input datasets and their preprocessing, how you want to validate the models, and how you want to build the likelihood. Examples are shown in the `configs/run` directory. The structure of the config is as follows:

```yaml
name: example_config # Name for all data, plot and model folders
variables: # Variables, parameters of interest (shape only - rate parameters are added later) and nuisance parameters
  - var_1
  - var_2
pois:
  - poi_1
nuisances:
  - nui_1
  - nui_2
  - nui_3
data_file: data.root # Data file - this can be a root or a parquet file
inference: # Settings for inference
  nuisance_constraints: # Nuisance parameters to add Gaussian constraints for
    - nui_1
    - nui_2
    - nui_3
  rate_parameters: # Processes to add rate parameters for
    - signal
  lnN: # Log-normal uncertainties by name, rate (this can be asymmetric, passed as [0.98,1.02] for example) and the processes they affect
    nui_1:
      rate: 1.02
      files: ["signal", "background"]
default_values: # Set default values - nuisances default to 0 and rates to 1 if not set here
  poi_1: 125.0
models: # Define information about the models needed - this is split by process
  signal:
    density_models: # The nominal density model that is trained. This is given as a list but typically only the first element is used, as each element of the list is varied separately
      - parameters: ["poi_1", "nui_2"] # Parameters in the model
        file: base_signal # Base file to load from
        shifts: # Parameters to shift from the base file
          nui_2:
            type: continuous # This can also be discrete or fixed
            range: [-3.0, 3.0] # Range to vary the parameter in
            n_copies: 10 # Number of copies to make of the base dataset
    regression_models: # Regression models to vary the nominal density for weight variations. This is given as a list and each element of the list will be a new regression model
      - parameter: "nui_3"
        file: base_signal
        shifts:
          nui_3:
            type: continuous
            range: [-3.0, 3.0]
            n_copies: 5
    yields: {"file": "base_signal"} # Base file to calculate the yield from; the default yield is computed at the defined default parameter values
  background:
    density_models:
      - parameters: ["nui_2"]
        file: base_background
        shifts:
          nui_2:
            type: continuous
            range: [-3.0, 3.0]
            n_copies: 10
    regression_models: []
    yields: {"file": "base_background"}
validation: # Options for the validation files
  loop: # Loop of sets of parameters to validate. This is done for all processes individually and combined. For an individual file with a subset of the parameters, a unique subset of the loop is formed.
    - {"poi_1": 125.0, "nui_1": 0.0, "nui_2": 0.0, "nui_3": 0.0, "mu_signal": 1.0}
    - {"poi_1": 124.0, "nui_1": 0.0, "nui_2": 0.0, "nui_3": 0.0, "mu_signal": 1.0}
    - {"poi_1": 126.0, "nui_1": 0.0, "nui_2": 0.0, "nui_3": 0.0, "mu_signal": 1.0}
    - {"poi_1": 125.0, "nui_1": -1.0, "nui_2": 0.0, "nui_3": 0.0, "mu_signal": 1.0}
    - {"poi_1": 125.0, "nui_1": 1.0, "nui_2": 0.0, "nui_3": 0.0, "mu_signal": 1.0}
    - {"poi_1": 126.0, "nui_1": 0.0, "nui_2": -1.0, "nui_3": 0.0, "mu_signal": 1.0}
    - {"poi_1": 126.0, "nui_1": 0.0, "nui_2": 1.0, "nui_3": 0.0, "mu_signal": 1.0}
    - {"poi_1": 126.0, "nui_1": 0.0, "nui_2": 0.0, "nui_3": -1.0, "mu_signal": 1.0}
    - {"poi_1": 126.0, "nui_1": 0.0, "nui_2": 0.0, "nui_3": 1.0, "mu_signal": 1.0}
    - {"poi_1": 126.0, "nui_1": 0.0, "nui_2": 0.0, "nui_3": 0.0, "mu_signal": 0.8}
    - {"poi_1": 126.0, "nui_1": 0.0, "nui_2": 0.0, "nui_3": 0.0, "mu_signal": 1.2}
  files: # Base files to generate the validation datasets from
    ttbar: base_signal
    other: base_background
preprocess: # Preprocessing parameters
  train_test_val_split: 0.8:0.1:0.1 # How to split the dataset into train/test/validation
  save_extra_columns: {} # Extra columns to save during preprocessing - useful for custom modules
files: # Base dataset inputs
  base_signal: # Base dataset name
    inputs: # Input files
      - signal_123.root
      - signal_124.root
      - signal_125.root
      - signal_126.root
      - signal_127.root
    add_columns: # Extra columns to add to the dataset, one value per input file
      poi_1:
        - 123.0
        - 124.0
        - 125.0
        - 126.0
        - 127.0
    tree_name: "tree" # ROOT TTree name
    selection: "(var_1>50)" # Initial selection
    weight: "wt" # Name of the event weights, scaled correctly
    parameters: ["poi_1"] # Parameters in the dataset before any shifts
    pre_calculate: # Extra variables to be calculated - including the ability to add feature-morphed shifted variables
      var_2: var_2_uncorr * (1 + (nui_3 * sigma_3))
    post_calculate_selection: "(var_2 > 30)" # Selection applied after the pre_calculate variables are computed
    weight_shifts: # Weight shifts
      nui_3: "(1 + (nui_2 * sigma_2))"
  base_background:
    inputs:
      - background.root
    tree_name: "tree"
    selection: "(var_1>50)"
    weight: "wt"
    parameters: []
    pre_calculate:
      var_2: var_2_uncorr * (1 + (nui_3 * sigma_3))
    post_calculate_selection: "(var_2 > 30)"
    weight_shifts:
      nui_3: "(1 + (nui_2 * sigma_2))"
```
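With a config like the one above in place, the first steps of a typical run might look like the following sketch (the step names are described in detail below; the exact ordering is an assumption, and `example_cfg.yaml` stands in for your own config):

```sh
# Build the base datasets and preprocess them for training
python3 scripts/innfer.py --cfg="example_cfg.yaml" --step="LoadData"
python3 scripts/innfer.py --cfg="example_cfg.yaml" --step="PreProcess"

# Train the density and regression models defined under "models"
python3 scripts/innfer.py --cfg="example_cfg.yaml" --step="TrainDensity"
python3 scripts/innfer.py --cfg="example_cfg.yaml" --step="TrainRegression"
```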
The framework is set up to work with the Snakemake workflow manager. Firstly, this needs to be set up for the local batch service. To do this for HTCondor, run:

```sh
source setup.sh snakemake_condor
```

You should call the profile `htcondor`, and if you wish to look at the condor submission logs then set this directory to somewhere accessible.
To use condor workflows you can set the required steps and submission options in the Snakemake configuration file. Example files, covering the core steps of the INNFER package, are in the `configs/snakemake` directory. You can then run with Snakemake by passing `--step="SnakeMake" --snakemake-cfg="example_snakemake_cfg.yaml"`. Other Snakemake submission options are not set up; please contact us if you wish for this to be added.

Snakemake workflows are defined by a yaml file detailing the steps to run, the options to pass for each step and the submission options. Examples of these are in the `configs/snakemake` directory. It is recommended that you run the innfer command with Snakemake inside a `tmux` session, so that the run is not killed if your terminal disconnects.
The Snakemake chain can then be run with the following command:

```sh
python3 scripts/innfer.py --cfg="example_cfg.yaml" --step="SnakeMake" --snakemake-cfg="example_snakemake_cfg.yaml"
```
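Following the tmux recommendation above, a minimal recipe looks like this (the session name is your choice):

```sh
# Start a named tmux session so the run survives a dropped connection
tmux new -s innfer

# Inside the session, launch the Snakemake chain
python3 scripts/innfer.py --cfg="example_cfg.yaml" --step="SnakeMake" --snakemake-cfg="example_snakemake_cfg.yaml"

# Detach with Ctrl-b d; re-attach later with:
tmux attach -t innfer
```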
The available steps, specified with the `--step` option, are described below.

**MakeBenchmark**: Creates a dataset and a yaml config file to run from for the specified benchmark scenario (with `--benchmark="Dim5"` for example), as defined in the `python/worker/benchmarks.py` class.
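For example:

```sh
# Build the benchmark dataset and an accompanying config to run from
python3 scripts/innfer.py --benchmark="Dim5" --step="MakeBenchmark"
```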
**LoadData**: Loads data to create the base datasets.

**PreProcess**: Prepares datasets for training, testing, and validation. Includes standardisation, train/test/validation splitting, and optional binned-fit inputs.

**Custom**: Runs a custom module specified by the `--custom-module` option. Additional options can be passed using `--custom-options`.
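A sketch of an invocation is shown below; the module name and option string here are purely illustrative, as their expected format is defined by the custom module itself:

```sh
# Hypothetical custom module "my_module" with an illustrative option string
python3 scripts/innfer.py --cfg="example_cfg.yaml" --step="Custom" --custom-module="my_module" --custom-options="key=value"
```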
**InputPlotTraining**: Plots training and testing datasets after preprocessing.

**InputPlotValidation**: Plots validation datasets after preprocessing.

**TrainDensity**: Trains density networks using the specified architecture. Supports logging with wandb and saving models per epoch.

**SetupDensityFromBenchmark**: Sets up density models based on a benchmark scenario.

**TrainRegression**: Trains regression networks for specific parameters using the specified architecture.

**EvaluateRegression**: Evaluates regression models on validation datasets, and also builds the normalisation spline.

**PlotRegression**: Plots regression results as the average binned value vs the average regressed value.

**MakeAsimov**: Generates Asimov datasets for validation. The number of events can be specified using `--number-of-asimov-events`, or you can use the `--use-asimov-scaling` option to scale the number of Asimov events from the predicted number of events.
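For example, assuming the option takes an event count as its value (the number here is illustrative):

```sh
# Generate an Asimov dataset with a fixed number of events
python3 scripts/innfer.py --cfg="example_cfg.yaml" --step="MakeAsimov" --number-of-asimov-events=100000
```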
**DensityPerformanceMetrics**: Computes performance metrics for trained density networks. Supports metrics such as the loss, histogram comparisons, and multidimensional metrics.

**EpochPerformanceMetricsPlot**: Plots performance metrics as a function of training epoch.

**PValueSimVsSynth**: Performs a p-value dataset-comparison test between simulated and synthetic datasets.

**PValueSynthVsSynth**: Performs a p-value dataset-comparison test between bootstrapped synthetic datasets.

**PValueSynthVsSynthCollect**: Collects results from the synthetic vs. synthetic p-value tests.

**PValueDatasetComparisonPlot**: Plots the p-value dataset-comparison results.

**HyperparameterScan**: Performs a hyperparameter scan based on a scan architecture file given by `--architecture="scan.yaml"`, for example. If you want to track your trainings with wandb, you first need to set up an account at https://wandb.ai/. Then you must connect your terminal to your account with the command `wandb login`; you will need to enter the API key given on the website. You can then add `--use-wandb` to the command to track the training online.
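For example:

```sh
# One-time: connect the terminal to your wandb account (paste the API key when prompted)
wandb login

# Run the scan defined in scan.yaml and track it online
python3 scripts/innfer.py --cfg="example_cfg.yaml" --step="HyperparameterScan" --architecture="scan.yaml" --use-wandb
```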
**HyperparameterScanCollect**: Collects the best-performing model from a hyperparameter scan.

**BayesianHyperparameterTuning**: Performs Bayesian hyperparameter tuning using a specified architecture. Supports logging with wandb in the same way as HyperparameterScan.

**Flow**: Visualises the flow of the network through its coupling layers.

**Generator**: Uses the network as a generator to create plots comparing network outputs to the input simulation. Supports 1D and 2D unrolled plots.

**GeneratorSummary**: Creates summary plots comparing simulations and network-generated outputs across all unique Y values.

**LikelihoodDebug**: Debugs likelihood calculations for specific parameter values. It allows you to evaluate the likelihood at the parameter values given by the `--other-input` option.

**InitialFit**: Performs minimisation of the negative log-likelihood to find the best fit. Whenever calling the infer module you can specify `--likelihood-type`, `--data-type` and numerous other options. Supports specifying the minimisation method with `--minimisation-method`.
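A minimal invocation is shown below; the values accepted by `--likelihood-type`, `--data-type` and `--minimisation-method` are framework-specific and not listed here:

```sh
# Best fit with the default likelihood type, data type and minimiser
python3 scripts/innfer.py --cfg="example_cfg.yaml" --step="InitialFit"
```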
**ApproximateUncertainty**: Calculates approximate uncertainties by interpolating the likelihood crossings.

**Hessian**: Computes the Hessian matrix of the likelihood.

**HessianParallel**: Computes the Hessian matrix in parallel for efficiency.

**HessianCollect**: Collects results from the parallel Hessian computations.

**HessianNumerical**: Computes the Hessian matrix numerically.

**Covariance**: Calculates the covariance matrix from the inverse of the Hessian.

**DMatrix**: Computes the D-matrix for asymptotic corrections, as derived here.

**CovarianceWithDMatrix**: Computes the corrected covariance matrix using the Hessian and the D-matrix.

**ScanPointsFromApproximate** and **ScanPointsFromHessian**: Identify points for likelihood scans based on the approximate uncertainties or the Hessian matrix. The number of scan points can be set with the `--number-of-scan-points` option.
**Scan**: Performs a profiled likelihood scan over the identified points.

**ScanCollect**: Collects results from the likelihood scans.

**ScanPlot**: Plots results from the likelihood scans.
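Put together, a likelihood scan might run as the following sequence (the number of scan points here is illustrative):

```sh
# Choose scan points from the Hessian, run the scan, then collect and plot
python3 scripts/innfer.py --cfg="example_cfg.yaml" --step="ScanPointsFromHessian" --number-of-scan-points=20
python3 scripts/innfer.py --cfg="example_cfg.yaml" --step="Scan"
python3 scripts/innfer.py --cfg="example_cfg.yaml" --step="ScanCollect"
python3 scripts/innfer.py --cfg="example_cfg.yaml" --step="ScanPlot"
```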
**MakePostFitAsimov**: Generates post-fit Asimov datasets using the best-fit parameter values.

**PostFitPlot**: Plots distributions for the best-fit results compared to the data used for fitting.

**SummaryChiSquared**: Computes the chi-squared value between the best-fit and truth values for the validation parameters.

**SummaryAllButOneCollect**: Collects results for "all-but-one" parameter summaries.

**Summary**: Plots a summary of the best-fit and truth values for the validation parameters.

**SummaryPerVal**: Plots summaries of the results for each validation index.