The DZERO Data Model
Preliminary notes...
Updates:
- 5/11 CAF size is 20 kB, not 40 kB.
Data Tiers
I'm concentrating mostly on those needed for analysis; so I've skipped a few
for MC production, for example.
| Name | Description |
| RAW | Level 3 takes raw data (from each readout crate) and assembles them into a d0event. The result of the Level 3 trigger decision is added to this, and this is written out to tape. |
| TMB++ | Short for Thumbnail "++"; output of the reconstruction program. This contains hit information from tracking, etc., enough to refit the tracks. It also contains cell-level calorimeter information. All of this data is very compressed. About 60 kB/event. |
| CAF | ROOT-based format. Contains everything required for analysis. About 20 kB/event. |
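To get a feel for what the per-event sizes in the table above mean in practice, here is a small arithmetic sketch. The event count is an assumption chosen only for illustration, not a DZERO number, and RAW is left out because these notes do not quote its size.

```python
# Per-event sizes quoted in the table above, in kB/event.
# RAW is omitted because the notes do not give its size.
TIER_SIZE_KB = {"TMB++": 60, "CAF": 20}


def tier_volume_tb(n_events, size_kb):
    """Total volume in terabytes, using decimal units (1 TB = 1e9 kB)."""
    return n_events * size_kb / 1e9


# ASSUMPTION: an illustrative event count, not taken from the notes.
N_EVENTS = 1_000_000_000

volumes = {tier: tier_volume_tb(N_EVENTS, kb) for tier, kb in TIER_SIZE_KB.items()}
for tier, tb in volumes.items():
    print(f"{tier}: {tb:.0f} TB for {N_EVENTS:.0e} events")
```

Under that assumed event count the CAF tier is a third of the TMB++ volume, which is the ratio that matters regardless of the actual number of events.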
Evolution
The sequence above used to include a DST between RAW and TMB++. The DST was the
output of the reconstruction program. Its data volume was huge and it was hardly
used, so it was dropped.
Data Access and SAM
All data access is via SAM. SAM is a huge meta-data system: it contains trace
information on every file and its parentage, and even event-level information
(triggers fired, for example). It is possible to define data sets (collections
of files) based on SQL queries. These are used to define a list of TMB++ files,
for example, that span a run range and reconstruction version. SAM has been
hugely successful from a production point of view. For an individual analyzer,
however, the overhead of using it is still large, and most individual analyzers
avoid it if they can.
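The idea of a data set defined by a query over file meta-data can be sketched as follows. This is not the SAM interface or its schema: the `files` table, its columns, the file names, and `define_dataset` are all hypothetical, and an in-memory SQLite database stands in for the real catalogue.

```python
import sqlite3

# HYPOTHETICAL schema: one row of meta-data per file. The real SAM
# catalogue is far richer (parentage, event-level information, ...).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files (name TEXT, tier TEXT, run INTEGER, reco_version TEXT)"
)
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?, ?)",
    [
        ("a.tmb", "TMB++", 1000, "v1"),
        ("b.tmb", "TMB++", 1500, "v1"),
        ("c.tmb", "TMB++", 1500, "v2"),
        ("d.caf", "CAF", 1200, "v1"),
        ("e.tmb", "TMB++", 2500, "v1"),
    ],
)


def define_dataset(conn, tier, run_min, run_max, reco_version):
    """Return the file names matching a tier, run range and reco version."""
    rows = conn.execute(
        "SELECT name FROM files "
        "WHERE tier = ? AND run BETWEEN ? AND ? AND reco_version = ? "
        "ORDER BY name",
        (tier, run_min, run_max, reco_version),
    )
    return [row[0] for row in rows]


# TMB++ files spanning a run range for one reconstruction version.
dataset = define_dataset(conn, "TMB++", 1000, 2000, "v1")
print(dataset)
```

The point of the sketch is that a data set is a stored selection over meta-data, not a fixed list of files, so the same definition can pick up different files as the catalogue changes.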
Streams
Nearly all analyses (99%) start with streamed data.
- Stream definitions are based on both trigger and reconstructed object cuts,
so events will move in and out of a stream depending upon the reconstruction
version.
- Done to keep streams small for fast access.
- Streams do overlap. There are major efforts underway to reduce overlaps
and thus decrease the total size of the data.
- Streams are handled by a central group, the Common Samples Group (CSG).
  - Responsible for streaming the events.
  - Responsible for making sure the data has the latest corrections and
  fixes applied.
  - Responsible for generating the CAF and TMB stream formats and
  putting them into SAM with appropriate metadata.
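The streaming behaviour described in this list (definitions that mix trigger and reconstructed-object cuts, overlapping streams, and events changing stream between reconstruction versions) can be illustrated with a toy model. The stream names, trigger names, event fields, and thresholds below are all made up for the example; they are not DZERO stream definitions.

```python
# HYPOTHETICAL stream definitions: each combines a trigger requirement
# with a cut on a reconstructed quantity.
STREAMS = {
    "EM": lambda ev: "EM_TRIG" in ev["triggers"] and ev["em_pt"] > 5.0,
    "MU": lambda ev: "MU_TRIG" in ev["triggers"] and ev["n_muons"] >= 1,
}


def assign_streams(event):
    """Return the set of streams whose definition the event satisfies."""
    return {name for name, passes in STREAMS.items() if passes(event)}


events = [
    {"id": 1, "triggers": {"EM_TRIG"}, "em_pt": 12.0, "n_muons": 0},
    {"id": 2, "triggers": {"MU_TRIG"}, "em_pt": 0.0, "n_muons": 1},
    {"id": 3, "triggers": {"EM_TRIG", "MU_TRIG"}, "em_pt": 8.0, "n_muons": 2},
    {"id": 4, "triggers": {"EM_TRIG"}, "em_pt": 3.0, "n_muons": 0},
]

membership = {ev["id"]: assign_streams(ev) for ev in events}

# Overlap: an event in two streams is stored twice.
streamed_copies = sum(len(streams) for streams in membership.values())
unique_events = sum(1 for streams in membership.values() if streams)
print(f"{streamed_copies} stored copies for {unique_events} distinct events")

# A new reconstruction version changes a reconstructed quantity, so the
# same event can enter a stream it previously failed.
rereco_event = dict(events[3], em_pt=6.0)
print(assign_streams(events[3]), "->", assign_streams(rereco_event))
```

In this toy model, reducing overlap means tightening the definitions so that fewer events satisfy more than one of them, which directly lowers the number of stored copies.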
Typical Data Access Scenario
- Physics group submits skim criteria to Common Samples Group (CSG).
- CSG checks that the skim has only a small overlap with an already existing
skim (otherwise the physics group must re-skim).
- The CSG modifies a huge skimming executable with the new skim criteria.
Mostly this means modifying text files which specify what criteria an event
must satisfy (triggers fired, EM object pT>5, medium quality muon, etc.).
- CSG manages running on the full dataset (or requested sub-set) and TMB++
skim files are produced. They are stored in SAM.
- CSG then fixes the data (if required) and writes out the ROOT format,
CAF. These files are usually stored in SAM, but may also be pinned to disk for
easy analyzer access.
- Depending on the case, the physics group may re-skim the resulting data
files and make a local copy. Sometimes this involves adding branches to the CAF
format (though no one has done this yet).
- The analyzer will often copy the resulting CAF files, or a subset of them,
locally and set up their analysis job. Once it is coded up and ready to
go...
  - If the # of CAF files is large, and thus they reside in SAM and not all
  on disk, one of the large analysis farms will be used. There are scripts
  which automate the submission of multiple parallel batch jobs on a single
  dataset (as defined in SAM).
  - If the # of CAF files is small, or they reside on disk, then perhaps the
  user will submit regular batch jobs to the user computer cluster (clued0) or
  even run interactively.
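The step in which skim criteria live in text files can be sketched as below. The file syntax, field names, and the integer encoding of muon quality are assumptions invented for this example; the actual format used by the skimming executable is not described in these notes.

```python
import operator

OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}

# HYPOTHETICAL criteria file: one requirement per line.
# muon_quality is assumed to be an integer code where 2 means "medium".
CRITERIA_TEXT = """
# example skim definition
trigger EM_TRIG
em_pt > 5
muon_quality >= 2
"""


def parse_criteria(text):
    """Parse the text into required triggers and (field, op, value) cuts."""
    triggers, cuts = [], []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        parts = line.split()
        if parts[0] == "trigger":
            triggers.append(parts[1])
        else:
            name, op, value = parts
            cuts.append((name, OPS[op], float(value)))
    return triggers, cuts


def passes(event, triggers, cuts):
    """True if the event fired every required trigger and passes every cut."""
    if not all(t in event["triggers"] for t in triggers):
        return False
    return all(op(event[name], value) for name, op, value in cuts)


triggers, cuts = parse_criteria(CRITERIA_TEXT)
kept = passes({"triggers": {"EM_TRIG"}, "em_pt": 7.0, "muon_quality": 2}, triggers, cuts)
rejected = passes({"triggers": {"EM_TRIG"}, "em_pt": 4.0, "muon_quality": 2}, triggers, cuts)
print(kept, rejected)
```

Keeping the criteria in text rather than in code is what lets a new skim be added by editing a definition file instead of changing the selection logic itself.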
Differences
AOD Back Access
DZERO data formats do not know about each other. If there is data missing in
the CAF format, the analyzer is out of luck.
- It is more crucial for the CAF format to be carefully tested in DZERO to
make sure it is possible to do physics with it. A staged roll-out was the way
around this problem.
- Because AOD -> ESD access is bound to be slow, it will be crucial that
analyzers can write a new AOD bit that contains the ESD information. In short,
they can run once in the very expensive mode where they access the ESD, and
after that just access the data locally.
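The "run once in the expensive mode, then read locally" pattern from the last bullet can be sketched as a simple local cache. Nothing here is a real AOD or ESD interface: `fetch_from_esd` is a hypothetical stand-in for the slow back-navigation, and a JSON file stands in for the augmented local copy.

```python
import json
import os
import tempfile

calls = {"slow": 0}  # counts how often the expensive path is taken


def fetch_from_esd(event_id):
    """HYPOTHETICAL stand-in for slow AOD -> ESD back access."""
    calls["slow"] += 1
    return {"event": event_id, "hits": [event_id, event_id + 1]}


def load_with_local_copy(event_ids, cache_path):
    """Use the local copy if present; otherwise fetch once and write it."""
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    records = [fetch_from_esd(i) for i in event_ids]
    with open(cache_path, "w") as f:
        json.dump(records, f)
    return records


cache_path = os.path.join(tempfile.mkdtemp(), "augmented.json")
first = load_with_local_copy([1, 2, 3], cache_path)   # expensive pass
second = load_with_local_copy([1, 2, 3], cache_path)  # local read only
print(calls["slow"], first == second)
```

The second call never touches the slow path, which is the property the bullet asks for: the cost of back access is paid once per sample, not once per analysis pass.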
TAG
- TAG -- This functionality is part of the SAM meta-data. But I'm not
aware of people using this as a method of skimming or selecting data.
- Further, DZERO's stream definitions do include reconstructed objects, so
data will move from stream to stream.
Data Distribution and Geography...
- There is no back linking so this isn't as much of an issue.
- CAF samples are of order terabytes at most, which makes them easy to
replicate.
- Many institutions keep local copies.
- User groups contribute to the user computing cluster (clued0) and run there
on everyone's systems under a fair use system.
Minor Notes
- The production executables are such a big task that they regularly appear
in reports to Fermilab and also in yearly budget proposals (d0reco, d0sim,
etc.).