The Data Model

 
The DZERO Data Model

Preliminary notes...

Updates:

  • 5/11 CAF size is 20 kB, not 40 kB.

Data Tiers

I'm concentrating mostly on the tiers needed for analysis, so I've skipped a few, such as those used only for MC production.

Name Description
RAW Level 3 takes the raw data from each readout crate and assembles it into a d0event. The result of the Level 3 trigger decision is added, and the event is written out to tape.
TMB++ Short for Thumbnail "++"; output of the reconstruction program. This contains hit information from tracking, etc., enough to refit the tracks. It also contains cell-level calorimeter information. All of this data is heavily compressed. About 60 kB/event.
CAF ROOT-based format. Contains everything required for analysis. About 20 kB/event.
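
To get a feel for what the per-event sizes in the table mean in total, here is a minimal sketch of the arithmetic. The event count is a made-up round number for a single skim, not a DZERO figure, and the function name is mine.

```python
# Per-event sizes taken from the table above, in kB.
TIER_SIZE_KB = {"TMB++": 60, "CAF": 20}

def dataset_size_tb(n_events, tier):
    """Total size in TB (decimal units: 1 TB = 1e9 kB) of n_events in a tier."""
    return n_events * TIER_SIZE_KB[tier] / 1e9

# Hypothetical event count, chosen only to make the arithmetic easy.
N_EVENTS = 50_000_000

for tier in TIER_SIZE_KB:
    print(f"{tier}: {dataset_size_tb(N_EVENTS, tier):.1f} TB")
```

Whatever the real event count, the CAF copy of a sample comes out at a third of the TMB++ size.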

Evolution

The sequence above used to include a DST between RAW and TMB++. This was the output of the reconstruction program. It was huge and hardly used, and was therefore dropped.

Data Access and SAM

All data access is via SAM. SAM is a huge metadata system containing trace information on every file and its parentage, and even event-level information (triggers fired, for example). It is possible to define data sets (collections of files) based on SQL queries. These are used to define, for example, a list of TMB++ files that span a run range and reconstruction version. SAM has been hugely successful from a production point of view. The overhead for a single analyzer, however, is still large, and most single analyzers avoid it if they can.
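
The idea of a data set defined by a metadata query can be sketched with an in-memory SQLite table. This is only an illustration: the table layout, column names, and file names are invented, and none of it is SAM's actual schema or interface.

```python
import sqlite3

# Hypothetical, much-simplified file catalogue (not SAM's real schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files (name TEXT, tier TEXT, run INTEGER, reco_version TEXT)"
)
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?, ?)",
    [
        ("a.tmb", "TMB++", 1001, "v1"),
        ("b.tmb", "TMB++", 1500, "v1"),
        ("c.tmb", "TMB++", 1500, "v2"),
        ("d.caf", "CAF", 1200, "v1"),
    ],
)

def define_dataset(tier, run_min, run_max, reco_version):
    """Return the file names matching a tier, run range and reco version."""
    rows = conn.execute(
        "SELECT name FROM files "
        "WHERE tier = ? AND run BETWEEN ? AND ? AND reco_version = ? "
        "ORDER BY name",
        (tier, run_min, run_max, reco_version),
    )
    return [name for (name,) in rows]

# A data set is just the list of files the query selects.
dataset = define_dataset("TMB++", 1000, 2000, "v1")
print(dataset)
```

Because the data set is a query rather than a fixed list, the same definition can be re-evaluated later to pick up newly catalogued files.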

Streams

Almost all analyses (99%) start with streamed data.

  1. Stream definitions are based on both trigger and reconstructed object cuts, so events will move in and out of a stream depending upon the reconstruction version.
    • Done to keep streams small for fast access.
  2. Streams do overlap. There are major efforts underway to reduce overlaps and thus decrease the total size of the data.
  3. Streams are handled by a central group, the Common Samples Group (CSG).
    1. Responsible for streaming the events.
    2. Responsible for making sure the data has latest corrections and fixes applied.
    3. Responsible for generating the CAF and TMB stream formats and putting them into SAM with appropriate metadata.
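
Points 1 and 2 above can be illustrated with a small sketch. The stream names, trigger names, and cut values are invented for the example and are not real DZERO stream definitions.

```python
# Hypothetical stream definitions: each combines a trigger requirement
# with a cut on reconstructed objects.
STREAMS = {
    "EM": lambda ev: "EM_TRIG" in ev["triggers"] and ev["em_pt"] > 5.0,
    "MU": lambda ev: "MU_TRIG" in ev["triggers"] and ev["n_muons"] >= 1,
}

def streams_for(event):
    """Return the sorted names of every stream the event belongs to."""
    return sorted(name for name, passes in STREAMS.items() if passes(event))

# The same event as seen by two (made-up) reconstruction versions.
# The triggers are identical; only the reconstructed EM pT differs.
event_v1 = {"triggers": {"EM_TRIG", "MU_TRIG"}, "em_pt": 4.8, "n_muons": 1}
event_v2 = {"triggers": {"EM_TRIG", "MU_TRIG"}, "em_pt": 5.2, "n_muons": 1}

print(streams_for(event_v1))  # MU only
print(streams_for(event_v2))  # EM and MU: the event now sits in two streams
```

The second version moves the event into the EM stream without removing it from the MU stream, which is exactly the kind of overlap that inflates the total data size.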

Typical Data Access Scenario

  1. Physics group submits skim criteria to Common Samples Group (CSG).
  2. CSG decides the skim has only small overlap with an already existing skim (otherwise the physics group must re-skim).
  3. The CSG modifies a huge skimming executable with the new skim's criteria. Mostly this means modifying text files which specify what criteria an event must satisfy (triggers fired, EM object pT>5, medium quality muon, etc.).
  4. CSG manages running on the full dataset (or requested sub-set) and TMB++ skim files are produced. They are stored in SAM.
  5. CSG then fixes the data (if required) and writes out the ROOT format, CAF. These files are usually stored in SAM, but may also be pinned to disk for easy analyzer access.
  6. Depending on the analysis, the physics group may re-skim the resulting data files and make a local copy. Sometimes this involves adding branches to the CAF format (though no one has done this yet).
  7. The analyzer will often copy the resulting CAF files, or a subset of them, locally and set up their analysis job. Once it is coded up and ready to go...
  8. If the # of CAF files is large, and thus they reside in SAM and not all on disk, one of the large analysis farms will be used. There are scripts which automate the submission of multiple parallel batch jobs on a single dataset (as defined in SAM).
  9. If the # of CAF files is small, or they reside on disk, then perhaps the user will submit regular batch jobs to the user computer cluster (clued0) or even run interactively.
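
Two decision points in the scenario above, the overlap check in step 2 and the choice of where to run in steps 8 and 9, can be sketched as follows. The function names, the file-count threshold, and the event lists are all assumptions made for illustration; the notes do not say how the CSG actually measures overlap or what counts as "large".

```python
def overlap_fraction(new_skim, existing_skim):
    """Fraction of the new skim's events already present in an existing skim."""
    if not new_skim:
        return 0.0
    return len(new_skim & existing_skim) / len(new_skim)

def choose_run_mode(n_files, all_on_disk, max_local_files=100):
    """Steps 8 and 9: use a farm only for large samples not fully on disk.

    max_local_files is a hypothetical threshold for "small".
    """
    if n_files <= max_local_files or all_on_disk:
        return "clued0 batch or interactive"
    return "analysis farm"

# Events identified by (run, event) pairs; all numbers are made up.
existing_skim = {(1, 1), (1, 2), (1, 3)}
new_skim = {(1, 3), (1, 4), (1, 5), (1, 6)}

print(overlap_fraction(new_skim, existing_skim))
print(choose_run_mode(n_files=5000, all_on_disk=False))
print(choose_run_mode(n_files=20, all_on_disk=False))
```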

Differences

AOD Back Access

DZERO data formats do not know about each other. If there is data missing in the CAF format, the analyzer is out of luck.

  • It is more crucial for the CAF format to be carefully tested in DZERO to make sure it is possible to do physics. A staged roll-out was the way around this problem.
  • Because AOD -> ESD access is bound to be slow, it will be crucial that analyzers can write new AOD bits that contain the ESD information. In short, they can run once in the very expensive mode where they access the ESD, and after that just access the data locally.
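
The "run once in expensive mode, then read locally" pattern in the last bullet is essentially a cache. Below is a minimal sketch of that idea only; the class, the lookup function, and the stored quantity are hypothetical stand-ins and do not correspond to any real AOD or ESD interface.

```python
class AugmentedAOD:
    """Hypothetical local AOD copy that stores ESD quantities on first use."""

    def __init__(self, fetch_from_esd):
        self._fetch_from_esd = fetch_from_esd  # slow back-access call
        self._extra = {}                       # ESD information kept locally

    def esd_info(self, event_id):
        """Return ESD information, going back to the ESD only once per event."""
        if event_id not in self._extra:
            self._extra[event_id] = self._fetch_from_esd(event_id)
        return self._extra[event_id]

esd_calls = []

def slow_esd_lookup(event_id):
    """Stand-in for an expensive AOD -> ESD access; records each call."""
    esd_calls.append(event_id)
    return {"n_hits": 10 * event_id}

aod = AugmentedAOD(slow_esd_lookup)
first_pass = [aod.esd_info(i) for i in (1, 2, 3)]   # expensive pass
second_pass = [aod.esd_info(i) for i in (1, 2, 3)]  # served locally
print(len(esd_calls))
```

In a real analysis the stored information would be written out as part of a new file rather than held in memory, but the access pattern is the same.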

TAG

  • TAG -- This functionality is part of the SAM meta-data. But I'm not aware of people using this as a method of skimming or selecting data.
  • Further, DZERO's stream definitions do include reconstructed objects, so data will move from stream to stream.

Data Distribution and Geography...

  • There is no back linking so this isn't as much of an issue.
  • CAF samples are of order terabytes at most, which makes them easy to replicate.
  • Many institutions keep local copies.
  • User groups contribute machines to the user computing cluster (clued0) and run there on everyone's systems under a fair use system.

Minor Notes

  • The production executables (d0reco, d0sim, etc.) are such a big task that they regularly appear in reports to Fermilab and also in yearly budget proposals.

References

  • DZERO Note 4616