Module reference
Each module encapsulates a unit of functionality in the parameter estimation project. The following is a brief description of each module and its most important classes, functions, and variables.
Note
Some downloaded data and results of long-running functions are saved on disk in the cache
directory, which will be created as needed.
However, there is no logic for updating the cache if the code is modified, so cache files may occasionally need to be deleted manually.
Package init file
Source code: src/__init__.py
Project initialization and common objects.
src.systems = ['PbPb2760', 'PbPb5020']
    Sets the collision systems for the entire project. Each system is a string of the form '<projectile 1><projectile 2><beam energy in GeV>', such as 'PbPb2760', 'AuAu200', or 'pPb5020'. Even if the project uses only a single system, this should still be a list containing one system string.
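    As an illustration of the naming convention (a hypothetical helper, not necessarily part of the package), a system string can be split into its projectiles and beam energy with a small regular expression:

        import re

        def parse_system(system):
            """Split a system string such as 'PbPb2760' into
            (projectile 1, projectile 2, beam energy in GeV)."""
            proj1, proj2, energy = re.fullmatch(
                r'([A-Za-z][a-z]?)([A-Za-z][a-z]?)([0-9]+)', system).groups()
            return proj1, proj2, int(energy)

        parse_system('pPb5020')   # ('p', 'Pb', 5020)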
Experimental data
Source code: src/expt.py
Downloads, processes, and stores experimental data. Prints all data when run as a script.
class src.expt.HEPData(inspire_rec, table, reverse=False)
    Interface to a HEPData YAML data table.

    Downloads and caches the dataset specified by the INSPIRE record and table number. The web UI for inspire_rec may be found at https://hepdata.net/record/ins<inspire_rec>.

    If reverse is true, reverse the order of the data table (useful for tables that are given as a function of Npart).

    Note
    Datasets are assumed to be a function of centrality. Other kinds of datasets will require code modifications.
    x(name, case=True)
        Get an independent variable ("x" data) with the given name.
        If case is false, perform case-insensitive matching for the name.
    cent
        The centrality bins as a list of (low, high) tuples.
    y(name=None, **quals)
        Get a dependent variable ("y" data) with the given name and qualifiers.
    dataset(name=None, maxcent=70, ignore_bins=[], **quals)
        Return a dict containing:

        - cent: list of centrality bins
        - x: numpy array of centrality bin midpoints
        - y: numpy array of y values
        - yerr: subdict of numpy arrays of y errors

        name and quals are passed to HEPData.y(). Missing y values are skipped.
        Centrality bins whose upper edge is greater than maxcent are skipped.
        Centrality bins in ignore_bins [a list of (low, high) tuples] are skipped.
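    A minimal usage sketch (the INSPIRE record and table numbers below are placeholders, not a real dataset):

        from src.expt import HEPData

        table = HEPData(1234567, 1)       # hypothetical record / table numbers
        print(table.cent)                 # list of (low, high) centrality tuples
        dset = table.dataset(maxcent=70)  # dict with 'cent', 'x', 'y', 'yerr'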
src.expt._data()
    Curate the experimental data using the HEPData class and return a nested dict with levels

    - system
    - observable
    - subobservable
    - dataset (created by HEPData.dataset())

    For example, data['PbPb2760']['dN_dy']['pion'] retrieves the dataset for pion dN/dy in Pb+Pb collisions at 2.76 TeV.

    Some observables, such as charged-particle multiplicity, don't have a natural subobservable, in which case the subobservable is set to None.

    The best way to understand the nested dict structure is to explore the object in an interactive Python session.
src.expt.data = <nested dict object>
    A nested dict containing all the experimental data, created by the _data() function.
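    For instance, a quick loop lists every available observable (an illustrative snippet; the 'y' key follows the HEPData.dataset() layout above):

        from src.expt import data

        for system, observables in data.items():
            for obs, subobservables in observables.items():
                for subobs, dset in subobservables.items():
                    print(system, obs, subobs, dset['y'].shape)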
src.expt.cov(system, obs1, subobs1, obs2, subobs2, stat_frac=0.0001, sys_corr_length=100, cross_factor=0.8, corr_obs={frozenset({'dET_deta', 'dNch_deta', 'dN_dy'})})
    Estimate a covariance matrix for the given system and pair of observables, e.g.:

    >>> cov('PbPb2760', 'dN_dy', 'pion', 'dN_dy', 'pion')
    >>> cov('PbPb5020', 'dN_dy', 'pion', 'dNch_deta', None)

    For each dataset, stat and sys errors are used if available. If only a "summed" error is available, it is treated as sys error, and stat_frac sets the fractional stat error.

    Systematic errors are assumed to have a Gaussian correlation as a function of centrality percentage, with correlation length set by sys_corr_length.

    If obs{1,2} are the same but subobs{1,2} are different, the sys error correlation is reduced by cross_factor.

    If obs{1,2} are different and uncorrelated, the covariance is zero. If they are correlated, the sys error correlation is reduced by cross_factor. Two different obs are considered correlated if they are both members of one of the groups in corr_obs (the groups must be set-like objects). By default {Nch, ET, dN/dy} are considered correlated, since they are all related to particle / energy production.
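    For intuition, the Gaussian sys error correlation could be assembled roughly as follows (an illustrative sketch only, not the actual implementation of cov()):

        import numpy as np

        def gaussian_sys_cov(x, sys_err, corr_length=100):
            """Covariance of sys errors with Gaussian correlation in
            centrality percentage (illustrative only)."""
            x, sys_err = np.asarray(x), np.asarray(sys_err)
            corr = np.exp(-.5*(np.subtract.outer(x, x)/corr_length)**2)
            return np.outer(sys_err, sys_err) * corr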
Model data
Source code: src/model.py
Computes model observables to match experimental data. Prints all model data when run as a script.
Model data files are expected with the file structure model_output/design/system/design_point.dat, where design is a design type, system is a system string, and design_point is a design point name.
For example, the structure of my model_output directory is
model_output
├── main
│ ├── PbPb2760
│ │ ├── 000.dat
│ │ └── 001.dat
│ └── PbPb5020
│ ├── 000.dat
│ └── 001.dat
└── validation
├── PbPb2760
│ ├── 000.dat
│ └── 001.dat
└── PbPb5020
├── 000.dat
└── 001.dat
I have two design types (main and validation), two systems, and my design points are numbered 000-499 (most numbers omitted for brevity).
Data files are expected to have the binary format created by my heavy-ion collision event generator.
Of course, if you have a different data organization scheme and/or format, that’s fine. Modify the code for your needs.
class src.model.ModelData(*files)
    Helper class for event-by-event model data. Reads binary data files and computes centrality-binned observables.

    dtype = dtype([('initial_entropy', '<f8'), ('nsamples', '<i8'), ('dNch_deta', '<f8'), ('dET_deta', '<f8'), ('dN_dy', [('pion', '<f8'), ('kaon', '<f8'), ('proton', '<f8'), ('Lambda', '<f8'), ('Sigma0', '<f8'), ('Xi', '<f8'), ('Omega', '<f8')]), ('mean_pT', [('pion', '<f8'), ('kaon', '<f8'), ('proton', '<f8'), ('Lambda', '<f8'), ('Sigma0', '<f8'), ('Xi', '<f8'), ('Omega', '<f8')]), ('pT_fluct', [('N', '<i8'), ('sum_pT', '<f8'), ('sum_pTsq', '<f8')]), ('flow', [('N', '<i8'), ('Qn', '<c16', (8,))])])
        The expected binary data type.
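    For reference, a single data file of this type can be read directly with numpy (the path below is only an example; ModelData handles the reading internally):

        import numpy as np
        from src.model import ModelData

        events = np.fromfile('model_output/main/PbPb2760/000.dat', dtype=ModelData.dtype)
        print(events.shape, events['dNch_deta'].mean())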
src.model.data
    A nested dict of model data with the same structure as src.expt.data.
Design
Source code: src/design.py
Generates Latin-hypercube parameter designs.
When run as a script, writes input files for use with my
heavy-ion collision event generator.
Run python -m src.design --help
for usage information.
Warning
This module uses the R lhs package to generate maximin Latin-hypercube samples. As far as I know, there is no equivalent library for Python (I am aware of pyDOE, but that uses a much more rudimentary algorithm for maximin sampling).
This means that R must be installed with the lhs package (run
install.packages('lhs')
in an R session).
src.design.generate_lhs(npoints, ndim, seed)
    Generate a maximin Latin-hypercube sample (LHS) with the given number of points, dimensions, and random seed.
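    If installing R is not an option, scipy.stats.qmc can draw a plain Latin-hypercube sample as a rough stand-in; note that this is not the maximin algorithm of the R lhs package, so it is a sketch of an alternative rather than a drop-in replacement:

        from scipy.stats import qmc

        def generate_lhs_scipy(npoints, ndim, seed):
            """Plain LHS on the unit hypercube (no maximin optimization)."""
            sampler = qmc.LatinHypercube(d=ndim, seed=seed)
            return sampler.random(n=npoints)   # shape (npoints, ndim), values in [0, 1)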
class src.design.Design(system, npoints=500, validation=False, seed=None)
    Latin-hypercube model design.

    Creates a design for the given system with the given number of points. Creates the main (training) design if validation is false (default); creates the validation design if validation is true. If seed is not given, a default random seed is used (different defaults for the main and validation designs).

    Public attributes:

    - system: the system string
    - projectiles, beam_energy: system projectile pair and beam energy
    - type: 'main' or 'validation'
    - keys: list of parameter keys
    - labels: list of parameter display labels (for TeX / matplotlib)
    - range: list of parameter (min, max) tuples
    - min, max: numpy arrays of parameter min and max
    - ndim: number of parameters (i.e. dimensions)
    - points: list of design point names (formatted numbers)
    - array: the actual design array

    The class also implicitly converts to a numpy array.

    This is probably the worst class in this project, and certainly the least generic. It will probably need to be heavily edited for use in any other project, if not completely rewritten.
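    A short usage sketch (the final comment describes the presumed relationship between a unit-hypercube sample and the physical parameter ranges; check the source before relying on it):

        import numpy as np
        from src.design import Design

        design = Design('PbPb2760')
        print(design.ndim, len(design.points))   # number of parameters and design points

        X = np.array(design)                     # implicit conversion to a numpy array
        assert X.shape == (len(design.points), design.ndim)

        # presumably the unit-hypercube LHS maps onto the physical ranges as
        #     X = design.min + (design.max - design.min) * lhs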
Emulator
Source code: src/emulator.py
Trains Gaussian process emulators.
When run as a script, allows retraining emulators, specifying the number of
principal components, and other options (however it is not necessary to do this
explicitly — the emulators will be trained automatically when needed). Run
python -m src.emulator --help
for usage information.
Uses the scikit-learn implementations of principal component analysis (PCA) and Gaussian process regression.
class src.emulator.Emulator(system, npc=10, nrestarts=0)
    Multidimensional Gaussian process emulator using principal component analysis.

    The model training data are standardized (subtract mean and scale to unit variance), then transformed through PCA. The first npc principal components (PCs) are emulated by independent Gaussian processes (GPs). The remaining components are neglected, which is equivalent to assuming they are standard zero-mean unit-variance GPs.

    This class has become a bit messy, but it still does the job. It would probably be better to refactor some of the data transformations / preprocessing into modular classes, to be used with an sklearn pipeline. The classes would also need to handle transforming uncertainties, which could be tricky.
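    The general shape of such a pipeline with scikit-learn looks roughly like this (an illustrative sketch with stand-in data and an arbitrary kernel choice, not the actual implementation of this class):

        import numpy as np
        from sklearn.decomposition import PCA
        from sklearn.gaussian_process import GaussianProcessRegressor, kernels
        from sklearn.preprocessing import StandardScaler

        rng = np.random.default_rng(0)
        X = rng.random((100, 5))                # stand-in design array (npoints, ndim)
        Y = rng.random((100, 30))               # stand-in model output (npoints, nobs)

        Z = StandardScaler().fit_transform(Y)   # standardize each observable
        W = PCA().fit_transform(Z)              # rotate into principal components

        npc = 10
        gps = [                                 # one independent GP per retained PC
            GaussianProcessRegressor(
                kernel=kernels.RBF(length_scale=np.ones(X.shape[1])),
                n_restarts_optimizer=0,
            ).fit(X, W[:, i])
            for i in range(npc)
        ]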
    observables = [('dNch_deta', [None]), ('dET_deta', [None]), ('dN_dy', ['pion', 'kaon', 'proton']), ('mean_pT', ['pion', 'kaon', 'proton']), ('pT_fluct', [None]), ('vnk', [(2, 2), (3, 2), (4, 2)])]
        Observables to emulate as a list of 2-tuples (obs, [list of subobs]).
    classmethod from_cache(system, retrain=False, **kwargs)
        Load the emulator for system from the cache if available, otherwise train and cache a new instance.
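        For example (a minimal usage line, with the system string taken from src.systems):

            from src.emulator import Emulator

            emu = Emulator.from_cache('PbPb2760')   # trains and caches on first use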
    predict(X, return_cov=False, extra_std=0)
        Predict model output at X.

        X must be a 2D array-like with shape (nsamples, ndim). It is passed directly to sklearn GaussianProcessRegressor.predict().

        If return_cov is true, return a tuple (mean, cov), otherwise return only the mean.

        The mean is returned as a nested dict of observable arrays, each with shape (nsamples, n_cent_bins).

        The covariance is returned as a proxy object which extracts observable sub-blocks using a dict-like interface:

        >>> mean, cov = emulator.predict(X, return_cov=True)
        >>> mean['dN_dy']['pion']
        <mean prediction of pion dN/dy>
        >>> cov[('dN_dy', 'pion'), ('dN_dy', 'pion')]
        <covariance matrix of pion dN/dy>
        >>> cov[('dN_dy', 'pion'), ('mean_pT', 'kaon')]
        <covariance matrix between pion dN/dy and kaon mean pT>

        The shape of the extracted covariance blocks is (nsamples, n_cent_bins_1, n_cent_bins_2).

        NB: the covariance is only computed between observables and centrality bins, not between sample points.

        extra_std is additional uncertainty which is added to each GP's predictive uncertainty, e.g. to account for model systematic error. It may be either a scalar or an array-like of length nsamples.
MCMC
Source code: src/mcmc.py
Markov chain Monte Carlo model calibration using the affine-invariant ensemble sampler (emcee).
This module must be run explicitly to create the posterior distribution.
Run python -m src.mcmc --help
for complete usage information.
On first run, the number of walkers and burn-in steps must be specified, e.g.
python -m src.mcmc --nwalkers 500 --nburnsteps 100 200
would run 500 walkers for 100 burn-in steps followed by 200 production steps.
This will create the HDF5 file mcmc/chain.hdf
(default path).
On subsequent runs, the chain resumes from the last point and the number of walkers is inferred from the chain, so only the number of production steps is required, e.g.
python -m src.mcmc 300
would run an additional 300 production steps (total of 500).
To restart the chain, delete (or rename) the chain HDF5 file.
class src.mcmc.Chain(path=Path('mcmc/chain.hdf'))
    High-level interface for running MCMC calibration and accessing results.

    Currently, all design parameters except for the normalizations are required to be the same at all beam energies. It is assumed (NOT checked) that all system designs have the same parameters and ranges (except for the norms).
    observables = [('dNch_deta', [None]), ('dET_deta', [None]), ('dN_dy', ['pion', 'kaon', 'proton']), ('mean_pT', ['pion', 'kaon', 'proton']), ('pT_fluct', [None]), ('vnk', [(2, 2), (3, 2), (4, 2)])]
        Observables to calibrate as a list of 2-tuples (obs, [list of subobs]). Each observable is checked for each system and silently ignored if not found.
    log_posterior(X, extra_std_prior_scale=0.05)
        Evaluate the posterior at X.

        extra_std_prior_scale is the scale parameter for the prior distribution on the model sys error parameter:

        prior ~ sigma^2 * exp(-sigma/scale)
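        In log form, that prior contributes a term like the following, up to an additive constant (a direct translation of the formula above, shown only for illustration):

            import numpy as np

            def log_extra_std_prior(sigma, scale=0.05):
                # log of sigma^2 * exp(-sigma/scale), up to a constant
                return 2*np.log(sigma) - sigma/scale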
    run_mcmc(nsteps, nburnsteps=None, nwalkers=None, status=None)
        Run MCMC model calibration. If the chain already exists, continue from the last point; otherwise burn in and start the chain.
    dataset(mode='r', name='chain')
        Context manager for quickly accessing a dataset in the chain HDF5 file.

        >>> with Chain().dataset() as dset:
        ...     pass  # do something with the dset object
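    For example, the production chain can be pulled into memory and flattened across walkers (the dataset name follows the default name='chain' above; the (walkers, steps, parameters) layout is an assumption to verify against the source):

        import numpy as np
        from src.mcmc import Chain

        with Chain().dataset() as dset:
            chain = np.array(dset)                     # assumed shape: (nwalkers, nsteps, ndim)

        samples = chain.reshape(-1, chain.shape[-1])   # flatten walkers and steps
        print(samples.mean(axis=0))                    # posterior mean of each parameter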
Plots and figures
Source code: src/plots.py
Generates plots / figures when run as a script.
Plot files are placed in the plots
directory.
By default, simply running python -m src.plots generates ALL plots, which may not be desired. Instead, one can pass a list of plots to generate: python -m src.plots plot1 plot2 .... The full list of plots is shown in the usage information (python -m src.plots --help).
Typing can be reduced by using shell brace expansion, e.g. python -m src.plots observables_{design,posterior} for both observables_design and observables_posterior. In addition, plots may be given as paths to plot filenames, which enables shell globbing, e.g. python -m src.plots plots/observables_*.
In the code, each plot is generated by a function tagged with the @plot
decorator.
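A rough idea of what such a decorator might do (an illustrative sketch, not the project's actual implementation; the registry name and output format are made up):

    import functools
    import matplotlib.pyplot as plt

    plot_functions = {}

    def plot(f):
        """Register a plot function and save its figure under the plots directory."""
        @functools.wraps(f)
        def wrapper(*args, **kwargs):
            f(*args, **kwargs)
            plt.savefig('plots/{}.pdf'.format(f.__name__))
            plt.close()
        plot_functions[f.__name__] = wrapper
        return wrapper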