Module reference

Each module encapsulates a unit of functionality in the parameter estimation project. The following is a brief description of each module and its most important classes, functions, and variables.

Note

Some downloaded data and results of long-running functions are saved on disk in the cache directory, which will be created as needed. However, there is no logic for updating the cache if the code is modified, so cache files may occasionally need to be deleted manually.

Package init file

Source code: src/__init__.py

Project initialization and common objects.

src.systems = ['PbPb2760', 'PbPb5020']

Sets the collision systems for the entire project, where each system is a string of the form '<projectile 1><projectile 2><beam energy in GeV>', such as 'PbPb2760', 'AuAu200', 'pPb5020'. Even if the project uses only a single system, this should still be a list of one system string.

src.parse_system(system)[source]

Parse a system string into a pair of projectiles and a beam energy.
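
A minimal sketch of the parsing, assuming the '<projectile 1><projectile 2><energy>' string form described above (the actual implementation may differ in details):

import re

def parse_system(system):
    # Split e.g. 'PbPb2760' into ('Pb', 'Pb') and 2760,
    # or 'pPb5020' into ('p', 'Pb') and 5020.
    match = re.fullmatch('([A-Z]?[a-z]*)([A-Z]?[a-z]*)([0-9]+)', system)
    return match.group(1, 2), int(match.group(3))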

Experimental data

Source code: src/expt.py

Downloads, processes, and stores experimental data. Prints all data when run as a script.

class src.expt.HEPData(inspire_rec, table, reverse=False)[source]

Interface to a HEPData YAML data table.

Downloads and caches the dataset specified by the INSPIRE record and table number. The web UI for the record may be found at https://hepdata.net/record/ins<inspire_rec>.

If reverse is true, reverse the order of the data table (useful for tables that are given as a function of Npart).

Note

Datasets are assumed to be a function of centrality. Other kinds of datasets will require code modifications.

x(name, case=True)[source]

Get an independent variable (“x” data) with the given name.

If case is false, perform case-insensitive matching for the name.

cent

The centrality bins as a list of (low, high) tuples.

y(name=None, **quals)[source]

Get a dependent variable (“y” data) with the given name and qualifiers.

dataset(name=None, maxcent=70, ignore_bins=[], **quals)[source]

Return a dict containing:

  • cent: list of centrality bins
  • x: numpy array of centrality bin midpoints
  • y: numpy array of y values
  • yerr: subdict of numpy arrays of y errors

name and quals are passed to HEPData.y().

Missing y values are skipped.

Centrality bins whose upper edge is greater than maxcent are skipped.

Centrality bins in ignore_bins [a list of (low, high) tuples] are skipped.
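
A hedged usage sketch (the INSPIRE record number, table number, observable name, and qualifier below are placeholders, not a real dataset reference):

>>> table = HEPData(1234567, 1)               # placeholder record and table
>>> d = table.dataset('dN/dy', maxcent=70)    # observable name is illustrative
>>> d['x'], d['y']                            # centrality midpoints and values
>>> d['yerr']['stat'], d['yerr']['sys']       # error arrays, where available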

src.expt._data()[source]

Curate the experimental data using the HEPData class and return a nested dict whose levels are collision system, observable, and subobservable.

For example, data['PbPb2760']['dN_dy']['pion'] retrieves the dataset for pion dN/dy in Pb+Pb collisions at 2.76 TeV.

Some observables, such as charged-particle multiplicity, don’t have a natural subobservable, in which case the subobservable is set to None.

The best way to understand the nested dict structure is to explore the object in an interactive Python session.

src.expt.data = <nested dict object>

A nested dict containing all the experimental data, created by the _data() function.

src.expt.cov(system, obs1, subobs1, obs2, subobs2, stat_frac=0.0001, sys_corr_length=100, cross_factor=0.8, corr_obs={frozenset({'dET_deta', 'dNch_deta', 'dN_dy'})})[source]

Estimate a covariance matrix for the given system and pair of observables, e.g.:

>>> cov('PbPb2760', 'dN_dy', 'pion', 'dN_dy', 'pion')
>>> cov('PbPb5020', 'dN_dy', 'pion', 'dNch_deta', None)

For each dataset, stat and sys errors are used if available. If only “summed” error is available, it is treated as sys error, and stat_frac sets the fractional stat error.

Systematic errors are assumed to have a Gaussian correlation as a function of centrality percentage, with correlation length set by sys_corr_length.

If obs{1,2} are the same but subobs{1,2} are different, the sys error correlation is reduced by cross_factor.

If obs{1,2} are different and uncorrelated, the covariance is zero. If they are correlated, the sys error correlation is reduced by cross_factor. Two different obs are considered correlated if they are both a member of one of the groups in corr_obs (the groups must be set-like objects). By default {Nch, ET, dN/dy} are considered correlated since they are all related to particle / energy production.
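
The Gaussian centrality correlation described above amounts to something like the following sketch, where x1 and x2 are arrays of centrality bin midpoints in percent (illustrative only, not the actual code):

import numpy as np

def sys_correlation(x1, x2, sys_corr_length=100):
    # Gaussian correlation between centrality bins, decaying with the
    # distance between bin midpoints (in percent centrality).
    return np.exp(-.5*(np.subtract.outer(x1, x2)/sys_corr_length)**2)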

Model data

Source code: src/model.py

Computes model observables to match experimental data. Prints all model data when run as a script.

Model data files are expected to have the file structure model_output/design/system/design_point.dat, where design is a design type, system is a system string, and design_point is a design point name.

For example, the structure of my model_output directory is

model_output
├── main
│   ├── PbPb2760
│   │   ├── 000.dat
│   │   └── 001.dat
│   └── PbPb5020
│       ├── 000.dat
│       └── 001.dat
└── validation
    ├── PbPb2760
    │   ├── 000.dat
    │   └── 001.dat
    └── PbPb5020
        ├── 000.dat
        └── 001.dat

I have two design types (main and validation), two systems, and my design points are numbered 000-499 (most numbers omitted for brevity).

Data files are expected to have the binary format created by my heavy-ion collision event generator.

Of course, if you have a different data organization scheme and/or format, that’s fine. Modify the code for your needs.

class src.model.ModelData(*files)[source]

Helper class for event-by-event model data. Reads binary data files and computes centrality-binned observables.

dtype = dtype([('initial_entropy', '<f8'), ('nsamples', '<i8'), ('dNch_deta', '<f8'), ('dET_deta', '<f8'), ('dN_dy', [('pion', '<f8'), ('kaon', '<f8'), ('proton', '<f8'), ('Lambda', '<f8'), ('Sigma0', '<f8'), ('Xi', '<f8'), ('Omega', '<f8')]), ('mean_pT', [('pion', '<f8'), ('kaon', '<f8'), ('proton', '<f8'), ('Lambda', '<f8'), ('Sigma0', '<f8'), ('Xi', '<f8'), ('Omega', '<f8')]), ('pT_fluct', [('N', '<i8'), ('sum_pT', '<f8'), ('sum_pTsq', '<f8')]), ('flow', [('N', '<i8'), ('Qn', '<c16', (8,))])])

The expected binary data type.
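
For example, a single design-point file can be read into a structured numpy array (the path is illustrative, and the actual class does more than a plain np.fromfile):

>>> import numpy as np
>>> events = np.fromfile('model_output/main/PbPb2760/000.dat', dtype=ModelData.dtype)
>>> events['dNch_deta']        # charged-particle multiplicity, event by event
>>> events['dN_dy']['pion']    # identified yields are nested fields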

observables_like(data, *keys)[source]

Compute the same centrality-binned observables as contained in data with the same nested dict structure.

This function calls itself recursively, each time prepending to keys.
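
Schematically, the recursion looks something like this (a sketch only; compute_observable is a hypothetical stand-in for the actual centrality-binning code):

def observables_like(self, data, *keys):
    # Mirror the nested dict structure of `data`, prepending each dict key
    # as the recursion descends, and compute the observable at the leaves.
    if isinstance(data, dict):
        return {
            k: self.observables_like(v, k, *keys)
            for k, v in data.items()
        }
    return self.compute_observable(*keys)  # hypothetical helper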

src.model.data

A nested dict of model data with the same structure as src.expt.data.

Design

Source code: src/design.py

Generates Latin-hypercube parameter designs.

When run as a script, writes input files for use with my heavy-ion collision event generator. Run python -m src.design --help for usage information.

Warning

This module uses the R lhs package to generate maximin Latin-hypercube samples. As far as I know, there is no equivalent library for Python (I am aware of pyDOE, but that uses a much more rudimentary algorithm for maximin sampling).

This means that R must be installed with the lhs package (run install.packages('lhs') in an R session).

src.design.generate_lhs(npoints, ndim, seed)[source]

Generate a maximin Latin-hypercube sample (LHS) with the given number of points, dimensions, and random seed.
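
One way to call the R lhs package from Python is to run R in batch mode and parse its output; a rough sketch (the actual implementation may differ):

import subprocess
import numpy as np

def generate_lhs(npoints, ndim, seed):
    # Ask R to print a maximin LHS matrix, then read it back as a
    # numpy array with shape (npoints, ndim).
    r_code = (
        'library(lhs); '
        'set.seed({}); '
        'write.table(maximinLHS({}, {}), col.names=FALSE, row.names=FALSE)'
    ).format(seed, npoints, ndim)
    output = subprocess.run(
        ['Rscript', '-e', r_code],
        stdout=subprocess.PIPE, check=True
    ).stdout
    return np.array(
        [line.split() for line in output.decode().splitlines()],
        dtype=float
    )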

class src.design.Design(system, npoints=500, validation=False, seed=None)[source]

Latin-hypercube model design.

Creates a design for the given system with the given number of points. Creates the main (training) design if validation is false (default); creates the validation design if validation is true. If seed is not given, a default random seed is used (different defaults for the main and validation designs).

Public attributes:

  • system: the system string
  • projectiles, beam_energy: system projectile pair and beam energy
  • type: ‘main’ or ‘validation’
  • keys: list of parameter keys
  • labels: list of parameter display labels (for TeX / matplotlib)
  • range: list of parameter (min, max) tuples
  • min, max: numpy arrays of parameter min and max
  • ndim: number of parameters (i.e. dimensions)
  • points: list of design point names (formatted numbers)
  • array: the actual design array

The class also implicitly converts to a numpy array.

This is probably the worst class in this project, and certainly the least generic. It will probably need to be heavily edited for use in any other project, if not completely rewritten.

write_files(basedir)[source]

Write input files for each design point to basedir.

Emulator

Source code: src/emulator.py

Trains Gaussian process emulators.

When run as a script, allows retraining emulators, specifying the number of principal components, and other options (however it is not necessary to do this explicitly — the emulators will be trained automatically when needed). Run python -m src.emulator --help for usage information.

Uses the scikit-learn implementations of principal component analysis (PCA) and Gaussian process regression.

class src.emulator.Emulator(system, npc=10, nrestarts=0)[source]

Multidimensional Gaussian process emulator using principal component analysis.

The model training data are standardized (subtract mean and scale to unit variance), then transformed through PCA. The first npc principal components (PCs) are emulated by independent Gaussian processes (GPs). The remaining components are neglected, which is equivalent to assuming they are standard zero-mean unit-variance GPs.

This class has become a bit messy but it still does the job. It would probably be better to refactor some of the data transformations / preprocessing into modular classes, to be used with an sklearn pipeline. The classes would also need to handle transforming uncertainties, which could be tricky.
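
For orientation, the standardize / PCA / GP training step described above looks roughly like this sketch with toy data (the real class also handles inverse transforms and predictive uncertainties):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessRegressor, kernels
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the design matrix and model output (names illustrative).
rng = np.random.default_rng(0)
design = rng.random((50, 5))    # 50 design points, 5 parameters
Y = rng.random((50, 20))        # 20 observable / centrality bins
npc = 10

# Standardize, transform to principal components, fit one GP per PC.
Z = PCA().fit_transform(StandardScaler().fit_transform(Y))
kernel = 1. * kernels.RBF(length_scale=np.ones(design.shape[1])) + kernels.WhiteKernel()
gps = [
    GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=0).fit(design, z)
    for z in Z[:, :npc].T
]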

observables = [('dNch_deta', [None]), ('dET_deta', [None]), ('dN_dy', ['pion', 'kaon', 'proton']), ('mean_pT', ['pion', 'kaon', 'proton']), ('pT_fluct', [None]), ('vnk', [(2, 2), (3, 2), (4, 2)])]

Observables to emulate as a list of 2-tuples (obs, [list of subobs]).

classmethod from_cache(system, retrain=False, **kwargs)[source]

Load the emulator for system from the cache if available, otherwise train and cache a new instance.

predict(X, return_cov=False, extra_std=0)[source]

Predict model output at X.

X must be a 2D array-like with shape (nsamples, ndim). It is passed directly to sklearn GaussianProcessRegressor.predict().

If return_cov is true, return a tuple (mean, cov), otherwise only return the mean.

The mean is returned as a nested dict of observable arrays, each with shape (nsamples, n_cent_bins).

The covariance is returned as a proxy object which extracts observable sub-blocks using a dict-like interface:

>>> mean, cov = emulator.predict(X, return_cov=True)
>>> mean['dN_dy']['pion']
<mean prediction of pion dN/dy>
>>> cov[('dN_dy', 'pion'), ('dN_dy', 'pion')]
<covariance matrix of pion dN/dy>
>>> cov[('dN_dy', 'pion'), ('mean_pT', 'kaon')]
<covariance matrix between pion dN/dy and kaon mean pT>

The extracted covariance blocks have shape (nsamples, n_cent_bins_1, n_cent_bins_2).

NB: the covariance is only computed between observables and centrality bins, not between sample points.

extra_std is additional uncertainty which is added to each GP’s predictive uncertainty, e.g. to account for model systematic error. It may either be a scalar or an array-like of length nsamples.

sample_y(X, n_samples=1, random_state=None)[source]

Sample model output at X.

Returns a nested dict of observable arrays, each with shape (n_samples_X, n_samples, n_cent_bins).

MCMC

Source code: src/mcmc.py

Markov chain Monte Carlo model calibration using the affine-invariant ensemble sampler (emcee).

This module must be run explicitly to create the posterior distribution. Run python -m src.mcmc --help for complete usage information.

On first run, the number of walkers and burn-in steps must be specified, e.g.

python -m src.mcmc --nwalkers 500 --nburnsteps 100 200

would run 500 walkers for 100 burn-in steps followed by 200 production steps. This will create the HDF5 file mcmc/chain.hdf (default path).

On subsequent runs, the chain resumes from the last point and the number of walkers is inferred from the chain, so only the number of production steps is required, e.g.

python -m src.mcmc 300

would run an additional 300 production steps (total of 500).

To restart the chain, delete (or rename) the chain HDF5 file.

class src.mcmc.Chain(path=Path('mcmc/chain.hdf'))[source]

High-level interface for running MCMC calibration and accessing results.

Currently all design parameters except for the normalizations are required to be the same at all beam energies. It is assumed (NOT checked) that all system designs have the same parameters and ranges (except for the norms).

observables = [('dNch_deta', [None]), ('dET_deta', [None]), ('dN_dy', ['pion', 'kaon', 'proton']), ('mean_pT', ['pion', 'kaon', 'proton']), ('pT_fluct', [None]), ('vnk', [(2, 2), (3, 2), (4, 2)])]

Observables to calibrate as a list of 2-tuples (obs, [list of subobs]). Each observable is checked for each system and silently ignored if not found.

log_posterior(X, extra_std_prior_scale=0.05)[source]

Evaluate the posterior at X.

extra_std_prior_scale is the scale parameter for the prior distribution on the model sys error parameter:

prior ~ sigma^2 * exp(-sigma/scale)

random_pos(n=1)[source]

Generate n random positions in parameter space.

run_mcmc(nsteps, nburnsteps=None, nwalkers=None, status=None)[source]

Run MCMC model calibration. If the chain already exists, continue from the last point, otherwise burn-in and start the chain.

open(mode='r')[source]

Return a handle to the chain HDF5 file.

dataset(mode='r', name='chain')[source]

Context manager for quickly accessing a dataset in the chain HDF5 file.

>>> with Chain().dataset() as dset:
        # do something with dset object

load(*keys, thin=1)[source]

Read the chain from file. If keys are given, read only those parameters. Read only every thin’th sample from the chain.
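
For example (the thinning factor is arbitrary; any keys would be design parameter keys, none of which are shown here):

>>> chain = Chain()
>>> X = chain.load(thin=10)    # every 10th sample of every parameter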

samples(n=1)[source]

Predict model output at n parameter points randomly drawn from the chain.

src.mcmc.credible_interval(samples, ci=0.9)[source]

Compute the highest-posterior density (HPD) credible interval (default 90%) for an array of samples.
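
A common approach is to take the shortest window containing the requested fraction of the sorted samples; a minimal sketch of that idea (the actual implementation may differ):

import numpy as np

def credible_interval(samples, ci=0.9):
    # Shortest contiguous window of sorted samples containing a fraction ci.
    sorted_samples = np.sort(samples)
    nci = int(ci * len(sorted_samples))
    widths = sorted_samples[nci:] - sorted_samples[:-nci]
    i = np.argmin(widths)
    return sorted_samples[i], sorted_samples[i + nci]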

Plots and figures

Source code: src/plots.py

Generates plots / figures when run as a script. Plot files are placed in the plots directory.

By default, simply running python -m src.plots generates ALL plots, which may not be desired. Instead, one can pass a list of plots to generate: python -m src.plots plot1 plot2 .... The full list of plots is shown in the usage information python -m src.plots --help.

Typing can be reduced by using shell brace expansion, e.g. python -m src.plots observables_{design,posterior} for both observables_design and observables_posterior. In addition, plots may be given as paths to plot filenames, which enables shell globbing, e.g. python -m src.plots plots/observables_*.

In the code, each plot is generated by a function tagged with the @plot decorator.
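
Such a decorator can be quite small; schematically (names and output format are illustrative, not the project's exact code):

from pathlib import Path
import matplotlib.pyplot as plt

plot_functions = {}

def plot(f):
    # Register the function by name and save its figure to the plots directory.
    def wrapper(*args, **kwargs):
        f(*args, **kwargs)
        plotdir = Path('plots')
        plotdir.mkdir(exist_ok=True)
        plt.savefig(plotdir / '{}.pdf'.format(f.__name__))
        plt.close()
    plot_functions[f.__name__] = wrapper
    return wrapper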