ROOT grid analysis

D3PD analysis for MC and data using Ganga on the Panda backend, based on TTree::MakeClass()

Requirements: a grid certificate, membership in the ATLAS virtual organization, and an athena setup.


A) MC analysis on the grid


a) Prepare your analysis code


a.1) Generate a .C and a .h file by applying TTree::MakeClass() to the "physics" tree in one of your D3PD files, here e.g. "group10.phys-sm.mc09_7TeV.106280.AlpgenJimmyWbbNp0_pt20.merge.AOD.e524_s765_s767_r1302_r1306.WZphys.100612.01.D3PD.D3PD._00001.root":

root [1] TFile f("group10.phys-sm.mc09_7TeV.106280.AlpgenJimmyWbbNp0_pt20.merge.AOD.e524_s765_s767_r1302_r1306.WZphys.100612.01.D3PD.D3PD._00001.root") 
root [2] physics->MakeClass("MyAnalysis")
Info in <TTreePlayer::MakeClass>: Files: MyAnalysis.h and MyAnalysis.C generated from TTree: physics
(Int_t)0
root [3]

a.2) Modify MyAnalysis.h

include the following lines:

#include <iostream>
#include "TFileCollection.h"

using namespace std;

replace:

TTree          *fChain;   //!pointer to the analyzed TTree or TChain

by:

TChain         *fChain;  //!pointer to the analyzed TTree or TChain

replace:

MyAnalysis(TTree *tree=0);

by:

TChain* chain;
MyAnalysis(const char* fileName);

include the declaration of the functions you need in your .C file, e.g.

int get_electron(int ielec);

replace the constructor:

MyAnalysis::MyAnalysis(TTree *tree)
  {
  // if parameter tree is not specified (or zero), connect the file
  // used to generate this class and read the Tree.
    if (tree == 0) {
	TFile *f = (TFile*)gROOT->GetListOfFiles()->FindObject("group10.phys-sm.mc09_7TeV.106280.AlpgenJimmyWbbNp0_pt20.merge.AOD.e524_s765_s767_r1302_r1306.WZphys.100612.01.D3PD.D3PD._00001.root");
	if (!f) {
	  f = new TFile("group10.phys-sm.mc09_7TeV.106280.AlpgenJimmyWbbNp0_pt20.merge.AOD.e524_s765_s767_r1302_r1306.WZphys.100612.01.D3PD.D3PD._00001.root");
	}
	tree = (TTree*)gDirectory->Get("physics");

    }
    Init(tree);
  }

by:

MyAnalysis::MyAnalysis(const char* inputFile)
{
   TChain* chain = new TChain("physics","");
   TFileCollection* fc = new TFileCollection("mylist","mylist",inputFile);
   chain->AddFileInfoList((TCollection*)fc->GetList());
   std::cout << "Total number of entries in chain (after all the files): " << chain->GetEntries() << std::endl;

   Init(chain);
}

replace:

void MyAnalysis::Init(TTree *tree)

by:

void MyAnalysis::Init(TChain *tree)

replace:

virtual void     Init(TTree *tree);

by:

virtual void     Init(TChain *tree);

a.3) Modify MyAnalysis.C

include the headers you need for your analysis:

e.g.

#include <TROOT.h>
#include <TChain.h>
#include <TFile.h>
                                                              
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <fstream>
#include <math.h>
                                                               
#include <vector>
#include <list>
#include <string>

#include "TApplication.h"            //mandatory
#include "Cintex/Cintex.h"           //mandatory within athena environment. Comment out when running standalone root on your local machine.

#include "egamma/robustIsEMDefs.C"
using namespace egammaPID;

#include "OTX/checkOQ.C"
using namespace egammaOQ;

...

implement your cuts, histograms, functions, ...
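
For orientation, here is a minimal sketch of what the event loop in MyAnalysis::Loop() could look like; the histogram h1 and the branch names el_n/el_pt are hypothetical placeholders and have to be adapted to your D3PD:

void MyAnalysis::Loop()
{
   if (fChain == 0) return;

   // hypothetical example histogram (needs #include <TH1F.h>);
   // adapt name, title, binning and units to your analysis
   TH1F h1("h1","electron p_{T};p_{T} [MeV];entries",100,0.,100000.);

   Long64_t nentries = fChain->GetEntriesFast();
   for (Long64_t jentry=0; jentry<nentries; jentry++) {
      Long64_t ientry = LoadTree(jentry);
      if (ientry < 0) break;
      fChain->GetEntry(jentry);

      // hypothetical example cut: fill all electrons with pT > 20 GeV;
      // el_n and el_pt are placeholder branch names, check your D3PD
      for (int i = 0; i < el_n; i++) {
         if (el_pt->at(i) > 20000.) h1.Fill(el_pt->at(i));
      }
   }
   // declare the output file and write the histograms here (see below)
}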

declare the output file and write the histograms in void MyAnalysis::Loop() after the event loop:

   TFile outputfile("MyAnalysis_output.root","RECREATE");
   h1.Write();
   ...
   outputfile.Close();

include the following function:

int main(int argc, char **argv)
{
  TApplication App(argv[0],&argc,argv);
  ROOT::Cintex::Cintex::Enable();           //mandatory within athena environment. Comment out when running standalone root on your local machine.
  gROOT->ProcessLine("#include <vector>");

  MyAnalysis a("input.txt");
  a.Loop();

  return 0;
}

Remark: The file "input.txt" is generated on the grid and contains the files of the dataset specified in your python script (see b.2 and c.1).

a.4) Make a test directory, e.g. D3PDgrid_localtest

a.5) Go to this directory, make a new text file called "Makefile" and copy the following lines into it:

ROOTCFLAGS   := $(shell root-config --cflags)
ROOTLIBS     := $(shell root-config --libs) -lMinuit -lEG -lCintex
# -lCintex is mandatory within athena environment. Comment out when running standalone root on your local machine.
ROOTGLIBS	= $(shell root-config --glibs)

CXX		= g++
CXXFLAGS	=-I$(ROOTSYS)/include -O -Wall -fPIC
LD		= g++
LDFLAGS		= -g
SOFLAGS		= -shared

CXXFLAGS	+= $(ROOTCFLAGS)
LIBS		= $(ROOTLIBS)
GLIBS		= $(ROOTGLIBS)
#  $(warning LIBS is $(LIBS))		#for debugging
#  $(warning GLIBS is $(GLIBS))		#for debugging
OBJS		= MyAnalysis.o

MyAnalysis: $(OBJS)
	$(CXX) -o $@ $(OBJS) $(CXXFLAGS) $(LIBS)

# suffix rule (note: the analysis source file has the .C suffix)
.C.o:
	$(CXX) -c $(CXXFLAGS) $(GDBFLAGS) $<

# clean
clean:
	rm -f *~ *.o *.o~ core

a.6) Make a new text file called "cppmake.sh" and copy the following lines into it:

#!/bin/bash
    
make -f Makefile

a.7) Make a new text file called "input.txt" and copy the names of the test dataset files into it, one file per line:

e.g.,

group10.phys-sm.mc09_7TeV.106280.AlpgenJimmyWbbNp0_pt20.merge.AOD.e524_s765_s767_r1302_r1306.WZphys.100612.01.D3PD.D3PD._00001.root
group10.phys-sm.mc09_7TeV.106280.AlpgenJimmyWbbNp0_pt20.merge.AOD.e524_s765_s767_r1302_r1306.WZphys.100612.01.D3PD.D3PD._00002.root

a.8) Copy the corresponding dataset files to the test directory

a.9) Copy the files MyAnalysis.C and MyAnalysis.h to the test directory

a.10) Copy the files/directories necessary for electron definitions, OTX cleaning, AtlasStyle ... to the test directory

a.11) Open a shell and go to the test directory. Compile MyAnalysis.C:

 > source cppmake.sh

a.12) Having compiled successfully, run your analysis:

 > ./MyAnalysis


b) Starting on the grid


b.1) After a successful test, copy MyAnalysis.C, MyAnalysis.h, cppmake.sh, Makefile and the files/directories from a.10) to your athena testarea.

b.2) Go to the athena testarea, make a new text file MyAnalysis_script.py and copy the following lines into it:

j = Job()

j.application=Athena()
j.application.atlas_exetype='EXE'
j.application.option_file='MyAnalysis'
j.application.prepare()

j.inputdata=DQ2Dataset()
j.inputdata.dataset=['group10.phys-sm.mc09_7TeV.106280.AlpgenJimmyWbbNp0_pt20.merge.AOD.e524_s765_s767_r1302_r1306.WZphys.100612.01.D3PD']
#j.inputdata.exclude_pattern=['*.D3PD.GLOBAL.*.root'] #to exclude processing of empty files

j.outputdata=DQ2OutputDataset()
j.outputdata.outputdata=['MyAnalysis_output.root']

j.splitter=DQ2JobSplitter()
j.splitter.numsubjobs=1

j.backend=Panda()				#job has to be submitted to the Panda backend
#j.backend.site = 'ANALY_NIKHEF-ELPROD'		#specify the backend site
j.backend.bexec='cppmake.sh'
j.submit()

Remarks: The output file name in MyAnalysis_script.py and MyAnalysis.C has to be identical! Don't put any blanks at the beginning of your python commands! It can happen that your MyAnalysis.h file is too large and is therefore not automatically submitted with your job. This problem can be solved by adding

config.Athena.EXE_MAXFILESIZE=2097152

just before

j.application.prepare()

b.3) set up your athena environment

b.4) go to your athena testarea and set up the latest ganga version:

 > source /afs/cern.ch/sw/ganga/install/etc/setup-atlas.sh

b.5) start ganga:

 > ganga

b.6) send your job to the grid:

In [1]: execfile("MyAnalysis_script.py")

Remark: For further information on ganga see: https://twiki.cern.ch/twiki/bin/viewauth/Atlas/DistributedAnalysisUsingGanga, and never forget our local ganga experts.

b.7) get your output data using dq2

set up dq2 in a new shell or leave ganga.
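
e.g. on lxplus (the location of the DQ2 client setup script is an assumption here and may differ at your site):

 > source /afs/cern.ch/atlas/offline/external/GRID/ddm/DQ2Clients/setup.sh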

search the output data:

 > dq2-ls "user.MyGridNickName*jobID*/"

get the output data:

 > dq2-get user.MyGridNickName. ... /
 > dq2-get -f  *.root*  user.MyGridNickName. ... /		#if you want to get rid of the log files and are interested only in *.root* files


c) Sending many datasets using AnaTasks


c.1) Instead of the script built under b.2), make the following python script "MyAnalysis_AnaTask_script.py":

t = AnaTask()
t.name = "MyAnalysis_task"

t.analysis.application=Athena()
t.analysis.application.atlas_exetype='EXE'
t.analysis.application.option_file='MyAnalysis'
t.analysis.application.prepare()

t.analysis.outputdata=DQ2OutputDataset()
t.analysis.outputdata.outputdata=['MyAnalysis_output.root']

t.analysis.files_per_job = 10		#control the number of files processed by one job

be = Panda()				#task has to be submitted to the Panda backend
be.bexec = 'cppmake.sh'
t.setBackend(be)

t.analysis.outputdata.datasetname = t.name

#for tf in t.transforms: tf.backend.site = 'ANALY_NIKHEF-ELPROD'		#specify the site

for tf in t.transforms: tf.inputdata.exclude_pattern=['*.D3PD.GLOBAL.*.root']   #prevents the processing of empty D3PD's
for tf in t.transforms: tf.inputdata.exclude_pattern+=['*log.tgz*']             #prevents the processing of the log files

t.float = 10									#specify how many jobs run in parallel
#t.overview()									#watch the processing

t.initializeFromDatasets([							#specify here all your datasets you wish to run over
'group10.phys-sm.mc09_7TeV.106280.AlpgenJimmyWbbNp0_pt20.merge.AOD.e524_s765_s767_r1302_r1306.WZphys.100612.01.D3PD' ,
'group10.phys-sm.mc09_7TeV.106281.AlpgenJimmyWbbNp1_pt20.merge.AOD.e524_s765_s767_r1302_r1306.WZphys.100612.01.D3PD' , 
'group10.phys-sm.mc09_7TeV.106282.AlpgenJimmyWbbNp2_pt20.merge.AOD.e524_s765_s767_r1302_r1306.WZphys.100612.01.D3PD' ,
'group10.phys-sm.mc09_7TeV.106283.AlpgenJimmyWbbNp3_pt20.merge.AOD.e524_s765_s767_r1302_r1306.WZphys.100612.01.D3PD'
])

#t.info() 									#Check here if settings are correct
t.run()

Remark: Please check the AnaTasks twiki for further information: https://twiki.cern.ch/twiki/bin/viewauth/Atlas/GangaTasks, or ask Johannes Ebke :-)

c.2) send your task to the grid:

execfile("MyAnalysis_AnaTask_script.py")

c.3) task babysitting. Type the following during a ganga session:

In [1]: tasks                        #at the start of a new ganga session prints tasks.table(); afterwards it gives a summary of the state of your tasks
In [2]: tasks.overview()             #gives detailed information about your submitted tasks
In [3]: tasks(task_id).overview()    #gives detailed information about the specified task

c.4) get your output data using dq2

As outlined in b.7), using:

dq2-ls "user.MyGridNickName*task_taskID*/"

c.5) merge the output files of one dataset using the hadd utility shipped with ROOT
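
hadd takes the target file as its first argument, followed by the input files; the file names below are placeholders:

 > hadd MyAnalysis_merged.root MyAnalysis_output_1.root MyAnalysis_output_2.root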

c.6) AnaTasks troubleshooting

What can I do when:

I) the backend status of a certain job is "running"/"finished" (check on the Panda monitoring page) but the status in ganga is stuck in "submitted"/"running":

restart ganga and wait a little bit (~5 min)

if the problem is not solved:

In [2]: j.subjobs[x]._impl.updateStatus("running")     #or "completed", ...

II) a job completely fails because of backend site problems:

for tf in tasks(myTaskID).transforms:
   ....:     if tf.status != 'completed':
   ....:         tf._impl.rebrokerPanda()    #for all transforms (subtasks) belonging to the task myTaskID that are not in the "completed" status a new backend site is chosen

and then:

for tf in tasks(myTaskID).transforms:
   ....:     if tf.status != 'completed':
   ....:         tf.retryFailed()     #the jobs of the corresponding transforms are resubmitted


d) Sending ganga jobs without any build job

Since Ganga-5-5-17, the build job is no longer mandatory.

d.1) compile your code locally in your test directory and copy the compiled code to your athena test area.

d.2) add the following lines to your "MyAnalysis_script.py" or "MyAnalysis_AnaTask_script.py", respectively:

j.application.athena_compile=False                                                  #for a job
j.backend.nobuild=True                                                              #for a job; put this line directly after the "j.backend=Panda()" command
t.analysis.application.athena_compile=False                                         #for a task
be.nobuild=True                                                                     #for a task; put this line directly after the "be = Panda()" command

d.3) Comment out the following lines in your "MyAnalysis_script.py" or "MyAnalysis_AnaTask_script.py", respectively, as the build script is no longer needed:

#j.backend.bexec='cppmake.sh'            #for a job
#be.bexec = 'cppmake.sh'                 #for a task

e) Slimming/Skimming D3PDs on the grid

e.1) Make a MyAnalysis.C and MyAnalysis.h file as outlined in a) and modify them:

activate only the branches you need for your analysis, e.g.:

void MyAnalysis::Loop()
{

   if (fChain == 0) return;

   Long64_t nentries = fChain->GetEntriesFast();
   
   fChain->SetBranchStatus("*",0);
   
   fChain->SetBranchStatus("EventNumber",1);   
   fChain->SetBranchStatus("RunNumber",1);
   
   fChain->SetBranchStatus("mcevt_weight",1);
   
   fChain->SetBranchStatus("vxp_n",1);  
   fChain->SetBranchStatus("vxp_nTracks",1);   
   fChain->SetBranchStatus("vxp_z",1);

    ...

make a new TFile and clone the structure of the original tree:

   TFile *newfile = new TFile("MyAnalysis_output.root","recreate");
   TTree *newtree = fChain->CloneTree(0);

fill the tree within the loop:

   newtree->Fill();

save the tree after the loop:

  newtree->AutoSave();
  delete newfile;
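
Putting these pieces together, a complete skimming/slimming Loop() could look like the following sketch; the event selection is a placeholder to be replaced by your cuts:

void MyAnalysis::Loop()
{
   if (fChain == 0) return;

   fChain->SetBranchStatus("*",0);             // slimming: deactivate all branches ...
   fChain->SetBranchStatus("EventNumber",1);   // ... then re-activate only those you need
   fChain->SetBranchStatus("RunNumber",1);     // (variables used in the selection must be active, too)

   TFile *newfile = new TFile("MyAnalysis_output.root","recreate");
   TTree *newtree = fChain->CloneTree(0);      // clone the structure of the active branches, no entries yet

   Long64_t nentries = fChain->GetEntriesFast();
   for (Long64_t jentry=0; jentry<nentries; jentry++) {
      if (LoadTree(jentry) < 0) break;
      fChain->GetEntry(jentry);
      bool passed = true;                      // placeholder: replace by your event selection
      if (passed) newtree->Fill();             // skimming: keep only the selected events
   }

   newtree->AutoSave();
   delete newfile;
}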



B) Data analysis on the grid


Proceed analogously to the procedure described above, starting from a data D3PD file.

Using the GRL standalone package (tested with release 15.6.12 and GoodRunsLists-00-00-76):

After setting up your athena environment, go to your testarea and check the tag of your GoodRunsLists package:

/15.6.12/testarea/15.6.12> cmt show versions /DataQuality/GoodRunsLists/
/DataQuality/GoodRunsLists/ GoodRunsLists-00-00-76 /afs/cern.ch/atlas/software/releases/15.6.12/AtlasEvent/15.6.12

check out the corresponding package, or download the tarball from svn directly to your athena testarea and extract it there:

/15.6.12/testarea/15.6.12> cmt co -r GoodRunsLists-00-00-76 DataQuality/GoodRunsLists

or

 https://svnweb.cern.ch/cern/wsvn/atlasoff/DataQuality/GoodRunsLists/tags/#path_DataQuality_GoodRunsLists_tags_

simplify the path from "DataQuality/GoodRunsLists" to "GoodRunsLists", i.e. move the package directory up one level

in GoodRunsLists/cmt: rename "Makefile.Standalone" to "Makefile" (building with this Makefile generates the standalone GoodRunsLists library for ROOT, "StandAlone/GoodRunsListsLib.so")
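
In shell commands, these two steps read:

 > mv DataQuality/GoodRunsLists GoodRunsLists
 > mv GoodRunsLists/cmt/Makefile.Standalone GoodRunsLists/cmt/Makefile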

in your testarea make a directory "GRL_header"

copy all header files from "/GoodRunsLists/GoodRunsLists/" to "GRL_header"
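
e.g.:

 > mkdir GRL_header
 > cp GoodRunsLists/GoodRunsLists/*.h GRL_header/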

go to "GRL_header" and make the following changes in the header files

Change "DQHelperFunctions.h"

#include "TGoodRunsList.h"

instead of

#include "GoodRunsLists/TGoodRunsList.h"

Change "GoodRunsListSelectorTool.h"

#include "IGoodRunsListSelectorTool.h"
#include "AthenaBaseComps/AthAlgTool.h"
#include "AthenaKernel/IAthenaEvtLoopPreSelectTool.h"
#include "RegularFormula.h"

instead of

#include "GoodRunsLists/IGoodRunsListSelectorTool.h"
#include "AthenaBaseComps/AthAlgTool.h"
#include "AthenaKernel/IAthenaEvtLoopPreSelectTool.h"
#include "GoodRunsLists/RegularFormula.h"

Change "TGoodRun.h"

#include "TLumiBlockRange.h"

instead of

#include "GoodRunsLists/TLumiBlockRange.h"

change "TGoodRunsList.h"

#include "TGoodRun.h"

instead of

#include "GoodRunsLists/TGoodRun.h"

change "TGoodRunsListReader.h"

#include "TMsgLogger.h"
#include "TGRLCollection.h"

instead of

#include "GoodRunsLists/TMsgLogger.h"
#include "GoodRunsLists/TGRLCollection.h"

change "TGoodRunsListWriter.h"

#include "TMsgLogger.h"
#include "TGoodRunsList.h"
#include "TGRLCollection.h"

instead of

#include "GoodRunsLists/TMsgLogger.h"
#include "GoodRunsLists/TGoodRunsList.h"
#include "GoodRunsLists/TGRLCollection.h"

change "TGRLCollection.h"

#include "TGoodRunsList.h"

instead of

#include "GoodRunsLists/TGoodRunsList.h"

change "TriggerRegistryTool.h"

#include "ITriggerRegistryTool.h"

instead of

#include "GoodRunsLists/ITriggerRegistryTool.h"

go to your testarea and use the following "cppmake.sh":

#!/bin/bash

# build the standalone GoodRunsLists library
cd GoodRunsLists/cmt
make -f Makefile

cd ../StandAlone
ln -sf GoodRunsListsLib.so libGoodRunsListsLib.so
cd ../../

# make the library visible to the linker and at runtime
libGoodRunsLists="GoodRunsLists/StandAlone"
export LD_LIBRARY_PATH=${libGoodRunsLists}:${LD_LIBRARY_PATH}
export GoodRunsListLib="-L${libGoodRunsLists} -lGoodRunsListsLib"

# build the analysis
make -f Makefile

use the following "Makefile":

# --- External configuration ----------------------------------
include $(ROOTSYS)/test/Makefile.arch


ROOTCFLAGS   := $(shell root-config --cflags)
#  $(warning ROOTCFLAGS is $(ROOTCFLAGS))
ROOTLIBS     := $(shell root-config --libs) -lMinuit -lEG -lCintex

#  -lg2c
#  $(warning ROOTLIBS is $(ROOTLIBS))
ROOTLIBS  += $(shell echo ${GoodRunsListLib})
ROOTGLIBS	= $(shell root-config --glibs)
ROOTGLIBS  += $(shell echo ${GoodRunsListLib})

CXX		= g++43
CXXFLAGS	=-I$(ROOTSYS)/include -O -Wall -fPIC
LD		= g++43
LDFLAGS		= -g
SOFLAGS		= -shared

CXXFLAGS	+= $(ROOTCFLAGS)
LIBS		= $(ROOTLIBS)
GLIBS		= $(ROOTGLIBS)
#  $(warning LIBS is $(LIBS))
#  $(warning GLIBS is $(GLIBS))
OBJS		= MyAnalysis.o

MyAnalysis: $(OBJS)
	$(CXX) -o $@ $(OBJS) $(CXXFLAGS) $(LIBS)

# suffix rule (note: the analysis source file has the .C suffix)
.C.o:
	$(CXX) -c $(CXXFLAGS) $(GDBFLAGS) $<

# clean
clean:
	rm -f *~ *.o *.o~ core

Finally, go to your "MyAnalysis.C" and add the following lines:

#include "GRL_header/TGoodRunsListReader.h"
#include "GRL_header/TGoodRunsList.h"
using namespace std;
using namespace Root;

add the following within the method MyAnalysis::Loop(), but outside the event loop itself:

TGoodRunsListReader* grlR;
Root::TGoodRunsList grl;
grlR = new TGoodRunsListReader();
string sname = "MyGoodRunList.xml";
cout << "XML to load: " << sname.c_str() << endl;
grlR->SetXMLFile(sname.c_str());
grlR->Interpret();
grl = grlR->GetMergedGoodRunsList();
grl.Summary(false);
cout << endl;

and the following within the loop

if (grl.HasRunLumiBlock(RunNumber,lbn)){

...

}

Copy the corresponding GRL to your testarea (e.g. from https://espace.cern.ch/atlas-project-sm-wzjets/GRL/Forms/AllItems.aspx for W/Z+jets analyses) and adapt the file name "MyGoodRunList.xml" accordingly.

That's it.