Data reprocessing (old injection method via script)
This page explains how to perform data reprocessing with the tools available in CMSSW. Adapted from a twiki page by Francesco Fabozzi.
In general, we identify several types of data reprocessings:
The light reprocessings, where only a subset of the primary datasets and runs are taken into account, without skims. In general these reprocessings are motivated by particular needs and are performed upon request from a particular area of the collaboration.
The complete reprocessings, which consider all the primary datasets (or at least all those of interest for physics analysis, as agreed at coordination level), all the runs (collision and commissioning), and all the active skims.
We keep track of all reprocessings in the PdmVDataReprocessing twiki. There are dedicated twiki pages for each reprocessing campaign, where one can access all details about datasets, configurations and status.
The tool to inject a reprocessing workflow is the wmcontrol.py script. Detailed instructions on the tool are provided in the PdmVProductionManagerInstructions twiki.
In this twiki we collect some practical recipes for the preparation and injection of reprocessing workflows. The examples described here are based on the legacy reprocessing of 2016 data and the first reprocessings of 2017 data.
When we open a new data reprocessing campaign, we notify comp-ops via a JIRA ticket and ask to enable one workflow as a pilot (inject a workflow with a limited set of runs so that it does not take too long). We will use the JIRA ticket to follow up on the pilot workflow. Example here.
REMEMBER: make sure to add the UnifiedOfficer label and assign the ticket to Marc Gabriel Weinberg.
If the pilot is OK, then the campaign will be enabled and all the workflows submitted in the campaign will be automatically assigned.
The command to inject a re-reco workflow in the computing system is:
wmcontrol.py --req_file=master.conf
In the above command, master.conf is a configuration file where the parameters of the workflows are specified. Before the actual injection, it is advised to print out and check the dictionary of the parameters sent to computing with:
wmcontrol.py --test --req_file=master.conf
You can find below an example of a master.conf file. Several sections can be identified. The first section ([DEFAULT]) specifies the parameters that are attributed by default to each workflow, if not overwritten in the other sections. The other sections (one for each workflow, such as [Run2016D-v2-MuonEG-18Apr2017]) define the parameters that are unique to each workflow (such as the input dataset and the run list) and may overwrite the parameters of the default section.
Some of the parameters are self-explanatory. The settings of the processing_string (the string that identifies that particular re-reco and will enter the re-recoed dataset name) and of the campaign name (the era name) reflect the present convention. The cfg_path specifies the cmsRun re-reco configuration file, which is different for each dataset. The requestID identifies the workflow sent to computing and must be unique.
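To make the structure concrete, here is a minimal hand-written sketch of such a file and of how the [DEFAULT] values propagate to the workflow sections. It is only an illustration: the key names follow the conventions described in this page, all values and the dset_run_dict format are placeholders, and it assumes that wmcontrol.py reads the file with standard ConfigParser semantics (entries in [DEFAULT] are inherited by every section unless overridden there).

# Illustration only: placeholder values, not an official master.conf.
from configparser import ConfigParser

sample_conf = """
[DEFAULT]
release = CMSSW_X_Y_Z
globaltag = GT_used_in_the_cmsDrivers
acquisition_era = Run2016D
campaign = Run2016D
processing_string = 18Apr2017
multicore = 8
time_event = 3.0
size_event = 1500
size_memory = 14000
priority = 85000

[Run2016D-v2-MuonEG-18Apr2017]
cfg_path = recoskim_Run2016D_MuonEG_cfg.py
request_id = Run2016D-v2-MuonEG-18Apr2017
dset_run_dict = {'/MuonEG/Run2016D-v2/RAW': [276315, 276317]}
"""

config = ConfigParser()
config.read_string(sample_conf)

# Every [DEFAULT] key is visible in the workflow section unless it is overridden there.
for section in config.sections():
    print(section)
    for key, value in config[section].items():
        print('  %s = %s' % (key, value))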
Injection of a data reprocessing workflow is not much different from the injection of any other workflow (for instance a relval workflow).
It requires basically:
A CMSSW production release
The cmsRun configuration file(s) of the steps of the workflow, that are produced by cmsDriver(s)
A set of parameters associated with the workflow and needed by computing, which have to be compliant with the WMAgent specifications
The configuration files and the computing parameters are sent to the computing system at injection time.
Like runTheMatrix for relvals, which can be considered a user-interface script to prepare and inject the workflows, for data re-reco we also have a user-interface script: wmcontrol.py. The script uses a .conf configuration file to specify:
location of the cmsRun configuration files
parameters for computing that will be used to create the dictionary for WMAgent
The usage is: wmcontrol.py --req_file=master.conf
The CMSSW release working areas and configurations for official reprocessings are stored in a common PdmV area: /afs/cern.ch/cms/PPD/PdmV/work/
Here we have
reprocessingW13 (for reprocessings up to 2016 data)
reprocessing2017 (for reprocessing 2017 data)
reprocessing2018 (for reprocessing 2018 data)
The files used for this tutorial are in:
In this example we consider a full data re-reco of the SingleMuon PD of 2017 data for a few runs of the 2017B era. A full data re-reco starts from the RAW input dataset and re-does RECO, SKIM (if needed), ALCARECO (if needed), MiniAOD and DQM. The HLT is never re-done in data reprocessing.
Create a CMSSW working area (the release for a reprocessing is established in advance: it is usually a production release; only in very exceptional cases is it a pre-release)
Source the usual script to setup PdmV tools
Let’s work in a sub-directory
The starting point to set up a re-reco cmsDriver is of course the cmsDriver used for data relvals + any needed special GT or customisation. For this tutorial you can start from the cfg here:
and read the cmsDriver used to create it. Now, you can re-create the file by yourself. In the same area there are a few more examples of cfg files.
Produce the cfg file for the harvesting step. This is needed if you have included the DQM step in the cmsDriver for re-reco. Again you can take the cfg file from here:
Test the re-reco cmsDriver locally to make sure that it does not crash, before injecting into computing. For the local test you need to specify in the cmsDriver the number of events you want to run (e.g. 10 events) and the input file (which must be accessible, of course). There is an example of a cfg for a local test here:
When we start a new reprocessing campaign we also need to run a local test for another reason: we want to estimate the parameters (time_event, size_event, size_memory) to be written in the master.conf file. The time_event parameter is used by computing for the job splitting. The size_memory parameter refers to the RSS memory of the job and is used by computing to assign the workflow to computing resources with enough memory. In this case it is suggested to run on at least 100 events to get reliable estimates for these parameters. Also, a monitoring utility has to be added to the cmsDriver, and the local test run in this way:
The output XML file will be used to find, for instance, the PeakValueRss value for the memory.
Note: test recoskim_Run2017B_SingleMuon.py locally (cmsRun recoskim_Run2017B_SingleMuon.py) to check the output files and the parameter values (the produced ROOT files are needed to estimate size_event, and the measured run time to estimate time_event).
Once we have the cfg files for cmsRun, we have to edit the master.conf file for wmcontrol.py. You can take the master.conf file from here and change it where needed:
The file content appears like this:
Note that release must match the release you are working in, and globaltag must match the GT used in the cmsDrivers of the cfg files.
Note that for the 2017 data re-reco we now distinguish the campaign parameter (a conventional name to identify reprocessings of the same typology), the acquisition_era parameter (to specify the era of the RAW dataset), and the processing_string parameter (a conventional date to specify a certain reprocessing setup within a campaign).
NOTE: acquisition_era and processing_string will appear in the re-recoed dataset name, while campaign will not.
Up to the 2016 data reprocessings, only acquisition_era was specified and by default campaign = acquisition_era. In the 2017 data reprocessings we distinguish the two parameters.
Other parameters are multicore, which must match --nThreads in the cmsDriver of the RECO cfg, and harvest_cfg, if you have a harvesting cfg file.
The parameters in the [DEFAULT] section are common to all the workflows.
Then, under [Run2017B-v1-SingleMuon-23Jun2017], we specify the parameters that are specific to each workflow. These are basically dset_run_dict, to specify the input dataset and the run numbers, cfg_path, which is the cfg file to be used for that workflow, and a request_id string to identify that workflow in the computing system.
To calculate the time/event, the size/event and the memory peak, you can use the script here, which runs on an XML file called report.xml:
https://github.com/pgunnell/ReRecoProcessing/blob/master/scriptforcheck.sh
The RSS memory in the XML report is given in MB => size_memory 14000 = 14 GB (all CMS resources now provide at least 14 GB, so there is no need to change this unless PeakValueRss returns a bigger value)
size_event is in kB: ROOT file size / number of test events, so if you run 100 events (better to run at least 500) and the ROOT file is 2.6 MB (ls -lh in the directory), then size_event = 26 kB
time_event = runtime in seconds / number of events
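As a rough cross-check of these numbers you can also redo the arithmetic by hand. The minimal Python sketch below applies the formulas above; the event count, run time and file names are placeholders, and it assumes that the report.xml written by the local test exposes the PeakValueRss metric (mentioned above) as a Metric element with Name and Value attributes.

# Sketch only: applies the formulas above to the output of a local test.
import os
import xml.etree.ElementTree as ET

n_events = 100            # events processed in the local test
run_time = 450.0          # measured wall-clock time of the test job, in seconds
root_file = 'step1.root'  # output ROOT file of the test (placeholder name)

time_event = run_time / n_events                              # seconds per event
size_event = os.path.getsize(root_file) / 1024.0 / n_events   # kB per event

# Peak RSS (in MB) from the job report, if present (XML layout assumed).
peak_rss = None
for metric in ET.parse('report.xml').iter('Metric'):
    if metric.get('Name') == 'PeakValueRss':
        peak_rss = float(metric.get('Value'))

print('time_event = %.2f s, size_event = %.1f kB, PeakValueRss = %s MB'
      % (time_event, size_event, peak_rss))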
Before injecting, it is important to print out the set of parameters associated with the workflow:
You can compare with printout of a relval workflow to see that they are similar.
For the final injection:
For major reprocessings of CMS data we need to inject tens of workflows for each era, each of them with a specific CMSSW cfg file and specific computing parameters.
Therefore it is useful to prepare the CMSSW cfg files and the master.conf file via a script. The example described below is based on the script employed for the 2016 legacy re-reco.
You can take the script prepare_conf.py, as well as the main script prepare_conf_2016H.py, from the following directory:
In order to run this example you also need a local copy of autoSkim.py, because we do not want to run all the possible skims defined in the release (as explained in the sections above):
Instead, the autoAlca matrix is taken from the release, so you do not need a local copy.
The main script prepare_conf_2016H.py simply sources prepare_conf.py and launches a prepare() function with the acquisition era, the proc_string and the GT as arguments:
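A hypothetical sketch of what such a main script boils down to (the actual prepare() signature is defined in prepare_conf.py; the values below are placeholders):

# Load prepare_conf.py and call prepare() with era, processing string and global tag.
exec(open('prepare_conf.py').read())
prepare('Run2016H', '18Apr2017', 'GT_for_this_rereco')  # acquisition era, proc_string, GT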
You can launch the main script as follows:
You can also use the -i option if you do not want to exit the ipython environment at the end.
The output will be a directory ("Run2016H" in this case) containing:
cmsRun cfg files for each dataset: recoskim_Run2016H_xxx_cfg.py
a harvesting cfg harvesting.py
the master.conf file for the Run2016H workflows injection: master_Run2016H.conf
a summary table of all datasets and runs to be copied in the re-reco twiki twiki_Run2016H.twiki
In the same repository you can also find the updated versions employed for the 2017 re-reco:
In the following we briefly explain the main features of the prepare_conf.py script, so that you can adapt it to your needs.
We first import some libraries:
In particular, the DbsApi will be used to make queries for the datasets, runs, etc.
We also import the (local) autoSkim and the (release) autoAlca:
We specify a dictionary of customisations to be used in the reco cmsDriver, since the customisation can be different for each era:
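For illustration, such a dictionary could look like the sketch below; the customise fragments are placeholders, not the ones used in the official script.

# Placeholder customisation fragments, keyed by acquisition era.
recoCustomise = {
    'Run2017A': 'SomePackage/SomeModule.customiseForRun2017A',
    'Run2017B': 'SomePackage/SomeModule.customiseForRun2017B',
}

# Later, when the reco cmsDriver command is built for a given era:
era = 'Run2017B'
customise_option = '--customise ' + recoCustomise[era]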
We specify the json file for the run selection and the default priority of the workflows:
We specify the list of PDs that must NOT be reprocessed:
We can use DbsApi queries to append more datasets to the above list. For instance, we append all the ZeroBias[N] datasets:
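A sketch of such a query with the DBS3 Python client is shown below; it assumes the usual global DBSReader instance, and the dataset pattern and the list name are illustrative.

# Append all ZeroBias[N] RAW datasets of the era to the list of datasets to process.
from dbs.apis.dbsClient import DbsApi

dbs = DbsApi(url='https://cmsweb.cern.ch/dbs/prod/global/DBSReader')

datasets_to_process = []
for entry in dbs.listDatasets(dataset='/ZeroBias*/Run2016H*/RAW'):
    datasets_to_process.append(entry['dataset'])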
We can specify priorities different from the default in a dictionary and also other parameters for the master.conf:
Here we extract the runs from the json:
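A minimal sketch of this step, assuming the usual certification JSON format (run number, as a string, mapped to a list of lumi-section ranges); the file name is a placeholder.

import json

# Only the run numbers are needed to define the workflows.
with open('Cert_Run2016_goldenJSON.txt') as f:   # placeholder file name
    good_lumis = json.load(f)

good_runs = sorted(int(run) for run in good_lumis)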
And here we extract all the RAW and AOD datasets for the acquisition era we are considering (below we will only consider datasets that have an AOD counterpart produced in Prompt):
Here we open a new master.conf and write the DEFAULT section:
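The sketch below shows one way this step could look; the key names follow the conventions described earlier in this page, the values are placeholders, and the real prepare_conf.py may format the file differently.

# Open a new master.conf and write the [DEFAULT] section (placeholder values).
default_params = {
    'release': 'CMSSW_X_Y_Z',
    'globaltag': 'GT_used_in_the_cmsDrivers',
    'acquisition_era': 'Run2016H',
    'campaign': 'Run2016H',
    'processing_string': '18Apr2017',
    'multicore': 8,
    'time_event': 3.0,
    'size_event': 1500,
    'size_memory': 14000,
    'priority': 85000,
    'harvest_cfg': 'harvesting.py',
}

with open('master_Run2016H.conf', 'w') as conf:
    conf.write('[DEFAULT]\n')
    for key, value in default_params.items():
        conf.write('%s = %s\n' % (key, value))
    conf.write('\n')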
And here we run the cmsDriver for the harvesting step (only if the file does not already exist):
We loop over the RAW input datasets and consider them only if they have an AOD counterpart and are not blacklisted:
For each dataset, we consider only the runs that are in the json file (with the exception of the NoBPTX dataset):
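The selection described in the last few paragraphs can be sketched as follows. This is not the official code: the blacklist entry, the era and the json file name are illustrative, and the DbsApi parameter names and the listRuns return format are assumptions about the DBS3 client.

from dbs.apis.dbsClient import DbsApi
import json

dbs = DbsApi(url='https://cmsweb.cern.ch/dbs/prod/global/DBSReader')
era = 'Run2016H'

# All RAW datasets of the era, and the primary datasets with a Prompt AOD counterpart.
raw_datasets = [d['dataset'] for d in dbs.listDatasets(acquisition_era_name=era, data_tier_name='RAW')]
aod_primaries = set(d['dataset'].split('/')[1]
                    for d in dbs.listDatasets(acquisition_era_name=era, data_tier_name='AOD'))

blacklist = ['TestEnablesEcalHcal']   # illustrative entry, not the official blacklist
certified_runs = set(int(r) for r in json.load(open('Cert_Run2016_goldenJSON.txt')))  # placeholder name

selected = {}
for raw in raw_datasets:
    primary = raw.split('/')[1]
    if primary in blacklist or primary not in aod_primaries:
        continue
    # runs contained in the dataset (listRuns return format assumed)
    runs = sorted(r for entry in dbs.listRuns(dataset=raw) for r in entry['run_num'])
    # keep only certified runs, except for NoBPTX which takes all runs
    keep = runs if primary == 'NoBPTX' else [r for r in runs if r in certified_runs]
    if keep:
        selected[raw] = keep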
Note: for COSMICS, changes are required in the run selection and in the cmsDriver.
If the dataset has at least one good run, we add its section to the master.conf:
And we prepare the cmsDriver to write the reco cfg:
If the reco cfg does not already exist in the local area, we write it:
And finally we also write the dataset info to the twiki table file:
You can run this example and look at the printout and the outputs. You can also try to run the 2017 script to look at the differences. Also, for re-MiniAOD reprocessing the script must be changed (e.g. we start from AOD inputs, the reco cmsDriver is different, etc.). In the official re-reco repository you can have a look at the script employed for the re-MiniAOD.
The SKIMs and ALCARECOs to be run for each primary dataset are defined in the release in the Skim Matrix and the AlcaReco Matrix. NOTE: in a reprocessing we do not necessarily run all the SKIMs associated with a primary dataset, but only the ones agreed in advance with the groups. Similarly, for the ALCARECOs, AlCa can provide an AlcaRecoMatrix for the reprocessing that is more up to date than the one in the release.
Produce the cfg file for the reprocessing with cmsDriver. The cmsDriver to be used must have been inspected in advance (not at the last minute, but possibly weeks in advance) by all the relevant experts: offline, AlCa, MiniAOD, etc. This is done via an explicit request on the prep-ops hypernews, with the experts in CC. An example is here (many more examples are in the prep-ops hypernews).
Priority is a number that specifies the priority of the workflow in computing. A higher number gives higher priority. There are some conventional numbers (blocks) that are used for MC production, but in principle you can use any number.