Single-Top Classification Tree Analysis

Toby Burnett
Last update: 28 Apr 2005 06:44 -0700

 

 


19-Apr-05

Boosting and resolution, cont.

Apply the resolution analysis to all four channels. Note that muon-tqb is truncated since the training failed. Conclude that we need 50 cycles for the tqb channels, and 90 for tb. In all cases, the boosted resolution is about 2.6, which, using R=sqrt(1+b/s), corresponds to an effective background/signal ratio of 5.8.

 

 


16-Apr-05

Boosting and resolution

Examine the quantitative effect of boosting on the resolution for a measurement of a signal in the presence of background. The resolution function is,

     

 

 


 

where s and b are the signal and background contributions (measured by the weights) to each node, and the sums are over the nodes. This represents the expected standard deviation for Nsignal=1. It should be multiplied by sqrt(Nsignal) for the signal standard deviation, and is directly related to the expected limit. Note that perfect separation, in which bins are either signal or background, corresponds to R=1. The other extreme, no separation, in which all bins have the same b/s ratio, gives R=sqrt(1+b/s).
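The expression itself is not reproduced in this copy of the notes. As a hedged sketch, one form that reproduces both limits quoted above is 1/R^2 = (1/S) * sum_i s_i^2/(s_i+b_i), with S = sum_i s_i. This form is an assumption, not necessarily the one used in the analysis, and the function name `resolution` and the node contents below are illustrative only.

```python
import math

def resolution(nodes):
    """Resolution R for a list of (s, b) weight sums, one pair per node.

    ASSUMED form (not confirmed by the notes):
        1/R^2 = (1/S) * sum_i s_i^2 / (s_i + b_i),  S = sum_i s_i
    """
    total_s = sum(s for s, _ in nodes)
    # Nodes without signal contribute nothing to the sum.
    info = sum(s * s / (s + b) for s, b in nodes if s > 0.0)
    return math.sqrt(total_s / info)

# Perfect separation: every node is pure signal or pure background.
perfect = [(1.0, 0.0), (0.33, 0.0), (0.0, 68.62)]
# No separation: every node has the same b/s ratio (totals s=1.33, b=68.62).
flat = [(1.33 * f, 68.62 * f) for f in (0.2, 0.3, 0.5)]

r_perfect = resolution(perfect)
r_flat = resolution(flat)
print(r_perfect, r_flat)
```

With the totals s=1.33 and b=68.62, the no-separation case should come out at sqrt(1+b/s), independent of how the weight is shared among the nodes.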

As usual, the trees were trained with even events only, combining the 1- and 2-tag samples. Then testing is performed using odd, even, and all events. The value for no separation is sqrt(1+b/s) = 7.25, for s=1.33, b=68.62. It looks like the training limit is reached at around 85 cycles.

Here is a comparison with the NN equivalent from the D0 Note.

 


15-Apr-05

Check the effect of training vs. testing samples on some of the 2-d trees generated for the D0 note:

 

Clearly there is an effect, but it is not apparently large, especially since in this case


5-Apr-05

Add a link to the doxygen documentation of the current version of the classifier package, which implements boosting.

 


25-Feb-05

Study filter sensitivity

During the course of this analysis, we have noticed that minor changes in the training data sets seem to produce changes in measured limits from DT or NN methods of several pb. To understand the sensitivity of all the limits to small changes in the filter characteristics, we have taken advantage of how easy it is to generate DT filters. Since we were already training on only half of the training data, we made a study of the variation with respect to filters which were trained on randomly selected halves of the training data.

The first step was to update the trainer to be able to select entries randomly, with 0.5 probability. This is to be compared with the normal procedure, which is to select the even entries. Then we chose a particular channel for the variability study, tqb muon, combining 1 and >1 tags, and created 99 independent pairs of decision trees to separate the tqb data from the background sources wbb and lepjets.
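The two selection rules can be sketched as follows. This is a minimal illustration, not the trainer's code: the function names, the sample size, and the use of one seed per filter are assumptions.

```python
import random

def select_even(n_entries):
    """Normal procedure: train on the even-numbered entries."""
    return [i for i in range(n_entries) if i % 2 == 0]

def select_random_half(n_entries, rng):
    """Variability study: keep each entry independently with probability 0.5."""
    return [i for i in range(n_entries) if rng.random() < 0.5]

n_entries = 1000   # illustrative sample size, not taken from the analysis
n_filters = 99     # number of independently trained filter pairs

# One independent random selection per filter (seeding scheme is an assumption).
training_sets = [select_random_half(n_entries, random.Random(seed))
                 for seed in range(n_filters)]
sizes = [len(t) for t in training_sets]
print(min(sizes), max(sizes))
```

Unlike the even-entry rule, the random rule gives training samples whose size fluctuates around half of the entries, which is part of the variation being studied.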

Each tree is characterized by the efficiency plot: for a given cut on signal efficiency, what is the background contamination? The following plots show all 198 trees.

The full distribution is on the left, with a histogram of the background variation for a cut at 0.7 on the right. The discrete nature of our DT filters is apparent. As is known, the wbb separation is harder than for lepjets, and the fluctuations are apparently slightly larger.

Then the 99 pairs of trees were used to make 99 2-d likelihood histograms in the usual way, and each histogram analyzed to obtain expected (from the MC data only) and actual (from the actual data) limits. The distributions for expected, expected with systematics, and actual with systematics are shown next.

The numbers from the DT used in this note are respectively 8.4, 10.9, and 7.9 pb, each quite consistent with the range. The variation of the actual corresponds to our initial observations.


08-Feb-05

Here are results of the "1D" tree analysis. Use all 25 variables for each of the 4 cases.

Table of Variables used with ratings
name muon-tb electron-tb muon-tqb electron-tqb
BTaggedTopMass 0.58 0.39 6.18 4.65
BestTopMass 1.03 1.68 0.62 1.10
Cos_BTaggedJetAllJets_AllJets 0.47 0.60 0.93 1.34
Cos_LeptonQZ_BestTop 1.26 0.46 1.08 0.60
Cos_NotBestJetAllJets_AllJets 0.27 0.30 0.55 0.28
Cos_UntaggedJetLepton_BTaggedTop 0.31 1.00 2.72 1.79
DeltaRJet1Jet2 0.61 0.76 1.04 0.87
HT_AllJets 7.27 0.88 0.83 3.63
HT_AllJets_MinusBTaggedJet 0.54 0.42 0.91 0.39
HT_AllJets_MinusBestJet 0.42 0.37 1.54 3.16
H_AllJets_MinusBTaggedJet 0.56 0.46 1.03 1.10
H_AllJets_MinusBestJet 0.05 0.36 0.60 0.51
InvariantMass_AllJets 0.66 0.38 5.26 7.08
InvariantMass_AllJets_MinusBTaggedJet 1.35 7.18 0.58 1.24
InvariantMass_AllJets_MinusBestJet 1.12 1.57 0.60 1.39
Jet1Pt_NotBest 0.27 0.03 0.81 0.62
Jet2Pt_NotBest 0.56 0.61 1.13 0.41
LeadingBTaggedJetPt 2.19 8.16 2.31 1.34
LeadingUntaggedJetPt 1.65 2.15 0.36 0.50
Pt_AllJets_MinusBTaggedJet 1.29 2.22 0.71 1.10
Pt_Jet1Jet2 1.49 1.35 1.30 0.84
QTimesEta 0.38 0.39 5.88 5.13
SecondUntaggedJetPt 8.24 1.65 0.62 0.68
Shat 1.36 0.43 0.84 0.54
TransverseMass_Jet1Jet2 1.88 3.12 0.81 0.46
Total 35.80 36.91 39.25 40.75

 


27-Jan-05

Made this diagram to explain how decision trees work:

Caption:

A graphical representation of a portion of one of the eight trees generated for this analysis, the tb electron wbb tree. Six branch and three end nodes are shown.

Descriptive text.

 


 

26-Jan-05

Update the classification management code to make a table of the variables used for each training.

Table of Variables used with ratings
name mu-tb-wbb eCC-tb-wbb mu-tb-lepjets eCC-tb-lepjets mu-tqb-wbb eCC-tqb-wbb mu-tqb-lepjets eCC-tqb-lepjets
LeadingBTaggedJetPt 0.0 0.2 0.8 0.5 0.4 0.9 - -
LeadingUntaggedJetPt - - - - 0.2 0.3 1.5 1.5
SecondUntaggedJetPt - - - - - - 0.2 0.4
Jet1Pt_NotBest 0.4 0.5 0.4 0.4 - - - -
Jet2Pt_NotBest 0.5 0.2 3.1 1.8 - - - -
Pt_Jet1Jet2 2.0 1.9 - - 2.5 1.1 - -
Pt_AllJets_MinusBTaggedJet - - 0.1 0.3 - - 0.1 0.4
HT_AllJets_MinusBestJet - - 2.8 3.9 - - - -
H_AllJets_MinusBestJet - - 0.5 0.1 - - - -
H_AllJets_MinusBTaggedJet - - 0.1 0.4 - - 1.2 0.6
HT_AllJets_MinusBTaggedJet - - - - - - 33.3 31.4
HT_AllJets 11.8 16.0 - - 0.7 2.6 - -
TransverseMass_Jet1Jet2 2.4 3.5 - - - - - -
InvariantMass_AllJets 1.1 1.3 0.8 0.9 20.2 22.9 0.7 0.4
InvariantMass_AllJets_MinusBestJet - - 55.6 52.5 - - - -
InvariantMass_AllJets_MinusBTaggedJet - - - - - - 8.0 9.5
BestTopMass 3.3 1.6 - - - - - -
BTaggedTopMass 0.2 0.6 0.1 0.2 5.7 4.4 4.8 4.7
Shat 1.5 0.2 - - 0.2 0.5 0.3 0.5
DeltaRJet1Jet2 0.9 1.4 - - 0.5 0.6 - -
QTimesEta - - - - 1.9 0.8 5.6 5.2
Cos_LeptonQZ_BestTop 0.6 0.5 - - - - - -
Cos_UntaggedJetLepton_BTaggedTop - - - - 2.3 1.8 - -
Cos_BTaggedJetAllJets_AllJets - - - - 0.1 1.5 0.6 1.5
Cos_NotBestJetAllJets_AllJets - - 0.1 0.5 - - - -
totals 24.7 27.7 63.6 60.9 34.3 36.5 56.3 56.1

 


13-Jan-05

Data: load all files without systematics, from Gordon's list:

The tree training output, with tree definitions, for all channels, and combined tags, is all here.

I used the same set of variables. A summary of the Gini improvement from each variable, for each of the 4 channels, is:

Name muon-tb muon-tqb electron-tb electron-tqb
InvariantMass_AllJets 0.013 0.060 0.020 0.071
BTaggedTopMass 0.014 0.073 0.013 0.075
Cos_UntaggedJetLepton_BTaggedTop 0.016 0.035 0.012 0.028
Pt_Jet1Jet2 0.022 0.022 0.017 0.015
QTimesEta 0.008 0.065 0.039 0.062
Shat 0.054 0.012 0.035 0.010
LeadingBTaggedJetPt 0.084 0.027 0.094 0.020
LeadingUntaggedJetPt 0.074 0.015 0.039 0.019
HT_AllJets 0.033 0.022 0.037 0.056
DeltaRJet1Jet2 0.018 0.024 0.034 0.026
Cos_BTaggedJetAllJets_AllJets 0.010 0.011 0.006 0.013

 The efficiency table graph:

A separate application analyzes this, and uses top_statistics to estimate the limits:


23-Dec-04

The missing EqOneTag files were generated, start again with electron[4]. Not very different.

 


21-Dec-04

New splits for electron: call it electron[3]

Training pair
file events weights
tb-lepjets
tb/electron/EqOneTag/MC_SystShiftPlusMinusOneSigma/tb_EqOneTag_eCC_tb_Sigma_zero_skim 9639 1.13832
tb/electron/GeqTwoTag/MC_SystShiftPlusMinusOneSigma/tb_GeqTwoTag_eCC_tb_Sigma_zero_skim 8149 0.266071
---------background----------    
tb/electron/EqOneTag/MC_SystShiftPlusMinusOneSigma/tb_EqOneTag_eCC_ttbar_lepjets_Sigma_zero_skim 8916 12.468
tb/electron/GeqTwoTag/MC_SystShiftPlusMinusOneSigma/tb_GeqTwoTag_eCC_ttbar_lepjets_Sigma_zero_skim 18264 3.10615
tb-wbb
tb/electron/EqOneTag/MC_SystShiftPlusMinusOneSigma/tb_EqOneTag_eCC_tb_Sigma_zero_skim 9639 1.13832
tb/electron/GeqTwoTag/MC_SystShiftPlusMinusOneSigma/tb_GeqTwoTag_eCC_tb_Sigma_zero_skim 8149 0.266071
---------background----------    
tb/electron/EqOneTag/MC_SystShiftPlusMinusOneSigma/tb_EqOneTag_eCC_wbb_higgs_Sigma_zero_skim 1634 9.20989
tb/electron/GeqTwoTag/MC_SystShiftPlusMinusOneSigma/tb_GeqTwoTag_eCC_wbb_higgs_Sigma_zero_skim 488 2.87815
tqb-lepjets
tqb/electron/GeqTwoTag/MC_SystShiftPlusMinusOneSigma/tqb_GeqTwoTag_eCC_tqb_Sigma_zero_skim 7095 0.128802
---------background----------    
tqb/electron/GeqTwoTag/MC_SystShiftPlusMinusOneSigma/tqb_GeqTwoTag_eCC_ttbar_lepjets_Sigma_zero_skim 15658 2.87971
tqb-wbb
tqb/electron/GeqTwoTag/MC_SystShiftPlusMinusOneSigma/tqb_GeqTwoTag_eCC_tqb_Sigma_zero_skim 7095 0.13
---------background----------    
tqb/electron/GeqTwoTag/MC_SystShiftPlusMinusOneSigma/tqb_GeqTwoTag_eCC_wbb_higgs_Sigma_zero_skim 111 0.67

  


11-Dec-04

Aran submits the "final" version of the muon files:

"Daekwang has produced the final skim files for muons:p

/work/husky-clued0/aran/Daekwang_NN/"

 Results are here.

 

 

Also in the above folder are postscript files and a root file containing all the histograms necessary for the 2-d likelihood analysis designed for NN. Since it may not be possible to load the root file over the Web, I've copied it to clued0: ~burnett/links/work/muon_classification.root. Note that rather than creating a different root file for each background component, they are in directories.

root [0] TFile f("muon_classification.root")
root [1] f.ls()
TFile** muon_classification.root
TFile* muon_classification.root
KEY: TDirectory s_channel;1 s_channel
KEY: TDirectory t_channel;1 t_channel
root [2] f.cd("s_channel")
(Bool_t)1
root [3] f.ls()
TFile** muon_classification.root
TFile* muon_classification.root
TDirectory* s_channel s_channel
KEY: TDirectory data;1 data
KEY: TDirectory dilep;1 dilep
KEY: TDirectory lepjets;1 lepjets
KEY: TDirectory tb;1 tb
KEY: TDirectory tqb;1 tqb
KEY: TDirectory wbb;1 wbb
KEY: TDirectory wjj;1 wjj
KEY: TDirectory wwlnujj;1 wwlnujj
KEY: TDirectory wzlnujj;1 wzlnujj
KEY: TDirectory QCD;1 QCD
KEY: TDirectory s_channel;1 s_channel
KEY: TDirectory t_channel;1 t_channel

05-Dec-04 - 20:00

Thanks to Aran (see the 12:00 entry below), get the latest electron data set (which I call electron[2]). On his advice, I combine the 1TAG and 2TAG files. Results are here.


05-Dec-04 - 16:00

Grab the latest files from Aran, in /work/husky-clued0/aran/Daekwang_NN. I'm calling this set muon[3]. Results are here. Looks essentially identical.

 

 


05-Dec-04 - 12:00

Automate training/testing

Rewrite the training/testing code to automate it, and to create a single efficiency table, which is easy to plot all at once.

The parameters are now determined by files in a dedicated folder:

title.txt
Descriptive title
files.txt
First line is a comma-delimited list of signal files; the next line is the same for background
variables.txt
List, one per line, of the variables to use. First one is the weight (# in col 1 means line is ignored)
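A minimal sketch of reading such a folder is shown below. It is not the TrainingInfo class: the function name `load_training_info`, the returned dictionary layout, and the example file and variable names are all hypothetical, and the background line is assumed to be comma-delimited like the signal line.

```python
import os
import tempfile

def load_training_info(folder):
    """Read title.txt, files.txt and variables.txt from a training folder."""
    with open(os.path.join(folder, "title.txt")) as f:
        title = f.read().strip()
    with open(os.path.join(folder, "files.txt")) as f:
        lines = [ln.strip() for ln in f if ln.strip()]
    # First line: signal files; next line: background files (assumed comma-delimited).
    signal = [name.strip() for name in lines[0].split(",")]
    background = [name.strip() for name in lines[1].split(",")]
    with open(os.path.join(folder, "variables.txt")) as f:
        # A '#' in column 1 marks a line to ignore.
        names = [ln.strip() for ln in f if ln.strip() and not ln.startswith("#")]
    # The first variable listed is the weight.
    return {"title": title, "signal": signal, "background": background,
            "weight": names[0], "variables": names[1:]}

# Demonstration with made-up contents written to a temporary folder.
folder = tempfile.mkdtemp()
contents = {
    "title.txt": "tqb-wbb muon\n",
    "files.txt": "tchannel.root\nwbb_s.root, lepjets.root\n",
    "variables.txt": "weight\nInvariantMass_AllJets\n#Shat\nQTimesEta\n",
}
for name, text in contents.items():
    with open(os.path.join(folder, name), "w") as f:
        f.write(text)

info = load_training_info(folder)
print(info)
```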

The class TrainingInfo manages this, providing access to classification parameters for the class Trainer.

After training, files summarizing the results are written to the same folder:

log.txt
output from classes doing the training
dtree.txt
Definition of the Decision Tree, see DecisionTree in the classifier package
test.txt
Result of testing with the odd events from the training sample
 

An example of the results of this procedure, for the old data set, is in this folder. Documentation of the classes is here.

A performance file is generated, see here. It can be imported and plotted in Excel with a few clicks:

 

Try electron data

Get the files collected from Philip, with help from Aran, and run the same code, changing only the path to the data and the classification specification/output. Use Aran's variables for now (are they the same?). There is basically no wbb data (?), so show only the lepjets.

Note: Aran says to use the files at /rooms/cafe/SingleTop_SKIMS/Data/Electron_Jets/Philips_Stradivarius/MC_SKIMS/NN_SKIMS/


29-Nov-04

Take a look at tqb-lepjets, the only other separation that seems to be needed. As with the wbb study, start with the same variables as were used for the NN training,  then train with even, and test with odd events. Combine the plot, as Aran does.

Name improvement
HT_AllJets_MinusBTaggedJet 0.354
InvariantMass_AllJets_MinusBTaggedJet 0.106
BTaggedTopMass 0.047
QTimesEta 0.043
LeadingUntaggedJetPt 0.011
H_AllJets_MinusBTaggedJet 0.010
Cos_BTaggedJetAllJets_AllJets 0.010
InvariantMass_AllJets 0.006
Shat 0.005
SecondUntaggedJetPt 0.004
Pt_AllJets_MinusBTaggedJet 0.004

 

Now examine the tb (s-channel). Again use the same variables as Aran's NN analysis, and display the separation for the odd events, after training with the even ones.

Name improvement

TransverseMass_Jet1Jet2 0.1043
Pt_Jet1Jet2 0.0386
InvariantMass_AllJets 0.0353
BestTopMass 0.0217
DeltaRJet1Jet2 0.0124
BTaggedTopMass 0.0083
LeadingBTaggedJetPt 0.0077
Shat 0.0072
Cos_LeptonQZ_BestTop 0.0040
LeadingUntaggedJetPt 0.0021
SecondBTaggedJetPt 0.0000
tb-lepjets
Name improvement

InvariantMass_AllJets_MinusBestJet 0.5946
SecondUntaggedJetPt 0.0320
HT_AllJets_MinusBestJet 0.0208
InvariantMass_AllJets 0.0105
Jet2Pt_NotBest 0.0091
H_AllJets_MinusBestJet 0.0062
LeadingUntaggedJetPt 0.0033
Jet1Pt_NotBest 0.0029
H_AllJets_MinusBTaggedJet 0.0023
Cos_NotBestJetAllJets_AllJets 0.0016
Pt_AllJets_MinusBTaggedJet 0.0015
BTaggedTopMass 0.0003
 

21-Nov-04

Implement the possibility to choose all, even, or odd events from the training sample, and apply it to the tqb-wbb muon study, with the best 6 variables.

            Training sample Testing sample
file records weights records weights
tchannel 17636 1.976 17640 1.991
         
wbb_s 12096 7.098 12100 6.959
Variable
Name improvement
InvariantMass_AllJets 0.203
BTaggedTopMass 0.068
Cos_UntaggedJetLepton_BTaggedTop 0.036
Pt_Jet1Jet2 0.027
QTimesEta 0.021
Shat 0.010

 

The plot is as expected: the separation is better with the training sample (even events) than with the other half of the data.


20-Nov-04

Implement persistence of decision trees (the models created by classification). See documentation of the package here.

Using new set of files from Thomas:

  s-channel t-channel
file records weights records weights
top 29448 2.62 35273 3.97
         
dilep 27814 8.62 22768 7.81
lepjets 45938 27.13 41937 26.94
qcd 222 19.10 222 19.10
wbb 24192 14.06 18130 12.61
wjj 52303 62.12 40165 61.56
ww 6053 0.70 4401 0.70
wz 5431 0.18 3993 0.18

The "t-channel" files are apparently experimental, stick with the s-channel guys.

Applying the tree to tqb-wbb separation, with the variables used by Aran's NN study,

Variable summary
Name improvement
InvariantMass_AllJets 0.19588
BTaggedTopMass 0.05332
Cos_UntaggedJetLepton_BTaggedTop 0.03604
Pt_Jet1Jet2 0.02268
QTimesEta 0.02128
Shat 0.01929
LeadingBTaggedJetPt 0.00605
LeadingUntaggedJetPt 0.00599
HT_AllJets 0.00567
DeltaRJet1Jet2 0.00521
Cos_BTaggedJetAllJets_AllJets 0.00339

I get the following plots:

 

The variables are sorted according to the Gini reduction. Using only the first six,

there is little difference.
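For reference, the Gini improvement of a single cut on weighted events can be sketched as below. The normalisation used by the classifier package is not stated in these notes, so the weighted form G = s*b/(s+b) is an assumption, and `gini`, `best_cut`, and the toy events are illustrative names and values only.

```python
def gini(s, b):
    """Weighted Gini impurity W*p*(1-p) for signal weight s and background weight b."""
    total = s + b
    return s * b / total if total > 0.0 else 0.0

def best_cut(events):
    """Scan thresholds on one variable; events are (value, weight, is_signal).

    Returns (improvement, threshold) for the split value < threshold vs >= threshold.
    """
    events = sorted(events)
    s_tot = sum(w for _, w, sig in events if sig)
    b_tot = sum(w for _, w, sig in events if not sig)
    parent = gini(s_tot, b_tot)
    best = (0.0, None)
    s_left = b_left = 0.0
    for i in range(len(events) - 1):
        value, w, sig = events[i]
        if sig:
            s_left += w
        else:
            b_left += w
        next_value = events[i + 1][0]
        if next_value == value:
            continue  # cannot cut between equal values
        gain = parent - gini(s_left, b_left) - gini(s_tot - s_left, b_tot - b_left)
        if gain > best[0]:
            best = (gain, 0.5 * (value + next_value))
    return best

# Toy sample: three signal events below three background events, unit weights.
toy = [(1.0, 1.0, True), (2.0, 1.0, True), (3.0, 1.0, True),
       (4.0, 1.0, False), (5.0, 1.0, False), (6.0, 1.0, False)]
improvement, threshold = best_cut(toy)
print(improvement, threshold)
```

Summing the improvements of all branches that use a given variable gives a per-variable ranking of the kind tabulated above.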


14-Nov-04

Continue with Thomas' variable set.

I now have an analysis of the tree that ranks the nodes in order of purity, using Classifier::purityMap. The class BackgroundVsEfficiency will print a table in order of the node purity, with columns for the cumulative efficiency and background content. It is then easy to make plots of the number of background vs efficiency.
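The construction of such a table can be sketched as follows. This is not the BackgroundVsEfficiency class: the function name, the row layout, and the node contents are assumptions made for illustration.

```python
def background_vs_efficiency(nodes):
    """nodes: list of (signal_weight, background_weight) for the end nodes.

    Returns rows (purity, cumulative signal efficiency, cumulative background),
    ordered from the purest node down.
    """
    total_s = sum(s for s, _ in nodes)
    ranked = sorted(nodes, key=lambda sb: sb[0] / (sb[0] + sb[1]), reverse=True)
    rows, cum_s, cum_b = [], 0.0, 0.0
    for s, b in ranked:
        cum_s += s
        cum_b += b
        rows.append((s / (s + b), cum_s / total_s, cum_b))
    return rows

# Made-up node contents, for illustration only.
nodes = [(1.0, 9.0), (2.0, 2.0), (1.0, 0.25)]
table = background_vs_efficiency(nodes)
for purity, eff, bkg in table:
    print(f"{purity:.2f} {eff:.2f} {bkg:.2f}")
```

Each row corresponds to accepting all nodes at or above a given purity, so the discrete steps in the plots reflect the finite number of end nodes.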

Here are the variables used, and their Gini improvement ranking:

Variable summary
Name improvement
_HT_AllJetsLeptonMET 10.0109
_WTransverseMassPrime 6.49468
_Jet3Pt 4.40732

The following shows the result, for total signal vs. background (with total weights = 130), for various subsets of the above variables:

The "HT1cut" represents a single branch: the full "HT only" shows how subsequent branches in the same variable create new alternatives for signal vs. background. One sees also how the addition of variables for branching improves the performance.

A simple criterion to choose an efficiency is to maximize S/sqrt(B), appropriate for small signal. The plot of this for the single HT cut and the full 3-variable tree follows:
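The criterion itself is easy to apply to a background-vs-efficiency curve. The sketch below assumes the curve is available as (efficiency, background) pairs; the function name and the numbers are illustrative, not taken from the analysis.

```python
import math

def best_working_point(points, total_signal):
    """points: (signal efficiency, background weight) pairs along the curve.

    Returns the pair that maximizes S/sqrt(B), with S = efficiency * total_signal.
    Points with zero background are skipped to avoid dividing by zero.
    """
    usable = [(eff, bkg) for eff, bkg in points if bkg > 0.0]
    return max(usable, key=lambda p: p[0] * total_signal / math.sqrt(p[1]))

# Made-up curve, for illustration only.
points = [(0.25, 1.0), (0.75, 2.25), (1.0, 11.25)]
choice = best_working_point(points, total_signal=4.0)
print(choice)
```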


30-Oct-04

Sent this mail off to uw-top group:

Hi folks,

 

I’m building C++ classes to deal with the classification trees, playing with the latest data set generated by Thomas, and concentrating on the variable HT_AllJetsLeptonMET, since Insightful Miner seemed to prefer it for separating the unweighted data, with an initial cut at 270 GeV. To my surprise when I started applying the weights, the cut shifted significantly, and the function that is optimized to determine the cut, the Gini improvement, developed two peaks, with essentially zero in between. I show the three plots here:

 

where I’ve rescaled the background for comparison. The shapes of the signal and background are very different, and I can understand why it now wants to cut at 165 GeV, since there is very little signal there.

 

Since I’m still developing the tools (and I have a lot to do), I’m not yet concerned with the physics, but the dramatic difference in the threshold behavior of the signal and background is determined solely by the weighting procedure: see the following plot for the corresponding unweighted Gini and event counts:

--

So I’m a little suspicious that the weighting should so distinguish between signal and background, if my plots are correct. Thoughts?

 

A summary of the data is:

  file   records   weight sum

  ---------signal------------

  schannel  7065   2.40322

  tchannel  7164   3.57781

---------background----------

     dilep  5182   7.89733

   lepjets  4124   24.8404

       qcd   223   19.2116

     wjets  3265   73.7715

 


26-Oct-04

New data set from Thomas: encapsulate with this script. Note that there is now a qcd file; was it left off before?

# set up sym links

tpath=/rooms/cafe/SingleTop_SKIMS

tag1="Muon_Jets/p14Stradivarius/Tagged/p14Stradivarius_"
tag2="_TightIsolation_HighMissingEt_Tag/p14Stradivarius_"
tag3="_TightIsolation_HighMissingEt_Tag_RGS_SKIM.root"

rm -f *.root

ln -s $tpath/Data/${tag1}DATA${tag2}DATA$tag3 data.root
ln -s $tpath/Data/${tag1}WJETS${tag2}WJETS$tag3 wjets.root
ln -s $tpath/Data/${tag1}QCD${tag2}QCD$tag3 qcd.root

ln -s $tpath/MonteCarlo/${tag1}LEPJETS${tag2}LEPJETS$tag3 lepjets.root
ln -s $tpath/MonteCarlo/${tag1}DILEP${tag2}DILEP$tag3 dilep.root

ln -s $tpath/MonteCarlo/${tag1}SCHANNEL${tag2}SCHANNEL$tag3 schannel.root
ln -s $tpath/MonteCarlo/${tag1}TCHANNEL${tag2}TCHANNEL$tag3 tchannel.root

The root files are different: all the variables are in the TopTree. 

Modify the extraction program as follows, since RGS_Variables is not a branch now. Generate the .txt tab-delimited files and read in to IM. Check the files

File #events <weight> Wt sum
bkg. wjets 3264 0.0226 73.8
lepjets 4123 0.006 24.7
dilep 5181 0.0015 7.8
qcd 222 0.086 19.1
signal schannel 7064 0.00034 2.4
tchannel 7163 0.0005 3.6
data 78 1 135

 

 

 


21-Oct-04

Files used for tables:

# set up sym links

tpath=/rooms/cafe/SingleTop_SKIMS
tag=Muon_Jets/Preselection_SLV_TAG
ln -s $tpath/Data/$tag/DATA_Preselection_SLV_TAG/MUQCD_DQ_PRESELECTION_TIGHTMUON_SLV_TAG_SKIM.root data.root
ln -s $tpath/Data/$tag/WJETS_Preselection_SLV_TAG/WJETS_DQ_PRESELECTION_TIGHTHIGH_SLV_0TAG_TRF_SKIM.root wjets.root
ln -s $tpath/MonteCarlo/$tag/SCHANNEL_Preselection_SLV_TAG/MUNBB_MC_PRESELECTION_TIGHTHIGH_SLV_TRF_SKIM.root schannel.root
ln -s $tpath/MonteCarlo/$tag/TCHANNEL_Preselection_SLV_TAG/MUNBB_MC_PRESELECTION_TIGHTHIGH_SLV_TRF_SKIM.root tchannel.root
ln -s $tpath/MonteCarlo/$tag/DILEP_Preselection_SLV_TAG/TTBAR_DILEP_MC_PRESELECTION_TIGHTHIGH_SLV_TRF_SKIM.root dilep.root
ln -s $tpath/MonteCarlo/$tag/LEPJETS_Preselection_SLV_TAG/TTBAR_LEPJETS_MC_PRESELECTION_TIGHTHIGH_SLV_TRF_SKIM.root lepjets.root

First step: read root files, create tab-delimited text files from the RGS_Variables branch, simplify variable names, using this code

Next: read the 6 files into Insightful Miner 2, combine and tag signal and background, then run a classification tree; export the tree to predict the composition of the data.

This does not make much sense, since the background events have rather different weights:

File #events <weight> Wt sum
bkg. wjets 2282 0.02242 51.2
lepjets 3606 0.00408 14.7
dilep 4275 0.00103 4.4
signal schan 6162 0.00023 1.4
tchan 6700 0.00033 2.2
data 78 1 78.0
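The mismatch can be quantified from the table: an unweighted classification tree sees each background source in proportion to its event count, while the physics is in proportion to the weight sum. A short check, using only the background numbers above (the variable names are illustrative):

```python
# (events, weight sum) per background file, copied from the table above
background = {"wjets": (2282, 51.2), "lepjets": (3606, 14.7), "dilep": (4275, 4.4)}

n_total = sum(n for n, _ in background.values())
w_total = sum(w for _, w in background.values())

# Share of each source by event count and by weight.
shares = {name: (n / n_total, w / w_total) for name, (n, w) in background.items()}
for name, (by_count, by_weight) in shares.items():
    print(f"{name:8s} events {by_count:5.1%}  weights {by_weight:5.1%}")
```

The wjets sample is under a quarter of the background events but well over two thirds of the background weight, so an unweighted tree is tuned mostly against ttbar.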

 

Given the attempt anyway, the variables most useful for classification are:

And the cross-tab to show the separation is:

   
                  PREDICT.class
type          background  signal  Totals
background          5119    5044   10163
signal              1801   11061   12862
Totals              6920   16105   23025

Note that it does not classify the background well: only 5119 of the 10163 background events (about 50%) are predicted as background.

 


See http://www-d0.hef.kun.nl///askArchive.php?base=agenda&categ=a041527&id=a041527s1t3/transparencies/091504_NeuralNetwork.pdf for a similar NN analysis.