Dynamic Proof Cluster


PoD

Our colleagues at GSI have developed Proof on Demand (PoD), a handy tool that allows each user to set up a dynamic Proof cluster of their own.

PoD can use ssh, batch submission, and even the Grid to submit and start slave jobs.

For LMU/etp we have set it up both on the etp desktop cluster and at the LRZ. The basic usage is straightforward:

ETP cluster

To set up a PoD environment, start a clean shell on any Ubuntu 10.04 64-bit node (if you are still on a 32-bit machine, log in to e.g. etppc25) and run:

source /project/etpsw/Common/PoD/setup.sh

Start PoD, giving the requested number of slaves as an argument (default: 15):

PoD_Start [n-slaves]  

PoD_Start first runs a script (lproofnodes.sh, by Otto S.) which selects suitable nodes among our recent, more powerful 4-core 64-bit desktops (a sketch of the load test is shown after the list below).

  • It only takes nodes with
    • Ubuntu 10.04/x86-64
    • a current CPU load below 1.5
  • Only 3 out of 4 cores will be used by Proof, so that normal interactive usage should not be affected
  • You can get up to 50 Proof slots (status: January 2011), but it can be fewer, depending on the current load of the desktop nodes
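
The load criterion presumably refers to the usual Unix load average. Purely as an illustration (this is not the actual lproofnodes.sh; PoD_Start runs the real check for you), such a test could look like the following ROOT/C++ sketch, reading the 1-minute load average from /proc/loadavg:

// loadCheck.C: illustrative sketch only; PoD_Start/lproofnodes.sh do this for you
#include "TSystem.h"
#include "TString.h"
#include <cstdio>

void loadCheck() {
   // /proc/loadavg starts with the 1-minute load average
   TString loadavg = gSystem->GetFromPipe("cat /proc/loadavg");
   Double_t load1  = TString(loadavg(0, loadavg.First(' '))).Atof();
   if (load1 < 1.5)
      printf("node eligible: 1-min load %.2f is below 1.5\n", load1);
   else
      printf("node skipped: 1-min load %.2f is 1.5 or above\n", load1);
}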

In your ROOT session or script you can then connect to the Proof cluster with:

TProof *proof = TProof::Open(gSystem->GetFromPipe("pod-info -c"));
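
For example, a minimal analysis macro could look like the following sketch; the tree name, file pattern and selector (MySelector.C) are placeholders you need to adapt to your own analysis:

// runPod.C: minimal sketch; tree name, file pattern and selector are placeholders
#include "TProof.h"
#include "TChain.h"
#include "TSystem.h"

void runPod() {
   // connect to the PoD-started master; "pod-info -c" prints its connection string
   TProof *proof = TProof::Open(gSystem->GetFromPipe("pod-info -c"));
   if (!proof || !proof->IsValid()) return;

   TChain chain("myTree");               // your tree name
   chain.Add("/path/to/data/*.root");    // your input files
   chain.SetProof();                     // route Process() through the Proof cluster
   chain.Process("MySelector.C+");       // selector compiled with ACLiC
   proof->Print();                       // short summary of the session
}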

When you are done with your PoD/Proof session you should clean up:

PoD_End

Please note that PoD automatically stops and kills your Proof slaves when they have been inactive for about 30 minutes. To check whether the session is still active, call e.g.

pod-info -nl 

This returns the list and number of active Proof slaves.
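
If you already have a ROOT session open, you can also do the check from there; a small sketch (gProof points to the current Proof session, if any):

gSystem->Exec("pod-info -nl");                      // list and count of active workers
if (gProof && gProof->IsValid()) gProof->Print();   // status of the Proof session itself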

Monitoring

The etp nodes used for PoD are included in the LMU Ganglia monitoring. Please check them there (in particular the load and network usage) if you observe problems starting PoD/Proof or if you think that your own node is overloaded.

LRZ Tier-3

The basic PoD usage at LRZ is similar, except that the Proof slaves are started via the batch system and many more slaves are available. The LRZ nodes run SUSE Linux Enterprise Server 11 (SLES11), so it is recommended to rebuild binaries/libraries at LRZ rather than just copying them from a Scientific Linux 5 environment.


  • Log in to one of the LRZ login nodes:
ssh -X your-LRZ-login@lxlogin1.lrz.de # or lxlogin2...
  • From the login nodes one cannot submit directly to the LCG batch part, therefore move on to one of the LCG nodes:
ssh -X lxe16 # or lxe31
  • Then set up the PoD (+ROOT) environment:
# set up ROOT first
source /home/grid/lcg/sw/root_setup.sh
# then PoD
source /home/grid/lcg/sw/PoD/setup.sh

Start PoD, giving the requested number of slaves as an argument (default: 30):

PoD_Start [n-slaves]  

At LRZ, PoD_Start first submits a so-called array job to the Slurm batch system, which starts one job per Proof slave.

Depending on how heavily the LRZ cluster is loaded, the startup can take just a few seconds, but it might also take longer. Before starting Proof you have to check that the jobs are actually running:

squeue -u $USER

If the jobs are still waiting, the output looks like:

2904629 lcg_serial  PoD    lu57gup7  PENDING       0:00 2:00:00      1 (Priority)

If the jobs have started, you get something like:

2905126 lcg_serial  PoD      lu57gup7  RUNNING       0:02   2:00:00      1 lx64e192
2905126 lcg_serial  PoD      lu57gup7  RUNNING       0:02   2:00:00      1 lxe19
...

Then you are ready to go! Sometimes it takes a while until all jobs are running. You can start your Proof session earlier, but it will only see and use the jobs that are already running.
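
If you prefer to wait until a minimum number of workers is up before submitting work, a small helper like the following can be used. This is only a sketch; it assumes that pod-info -n prints just the number of PROOF workers currently online:

// waitForWorkers.C: sketch that polls pod-info until enough workers are online
#include "TSystem.h"
#include "TString.h"
#include <cstdio>

void waitForWorkers(Int_t minWorkers = 10, Int_t timeoutSec = 300) {
   for (Int_t waited = 0; waited < timeoutSec; waited += 10) {
      Int_t n = gSystem->GetFromPipe("pod-info -n").Atoi();   // assumed to print a plain number
      if (n >= minWorkers) {
         printf("%d PROOF workers online, ready to go\n", n);
         return;
      }
      gSystem->Sleep(10000);   // check again in 10 seconds
   }
   printf("timeout: fewer than %d workers came up within %d s\n", minWorkers, timeoutSec);
}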

Start Proof as above:

TProof *proof = TProof::Open(gSystem->GetFromPipe("pod-info -c"));

When you are done with your PoD/Proof session you should clean up:

PoD_End

As above, PoD automatically stops and kills your Proof slaves when they have been inactive for about 30 minutes. Call e.g.

pod-info -nl 

to check whether the session is still active (this returns the list and number of active Proof slaves).

Important: Always rebuild/recompile your ROOT C++ programs when you move your code between systems running different Linux flavours, e.g. Ubuntu, Scientific Linux, SUSE Linux.
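
With ROOT's ACLiC this simply means loading your macro or selector with a trailing "++", which forces a fresh build of the shared library on the current platform instead of reusing a binary copied from another system (MySelector.C is a placeholder for your own code):

// at the ROOT prompt: the "++" suffix forces ACLiC to rebuild the library
.L MySelector.C++
// or from within a macro/script:
gROOT->LoadMacro("MySelector.C++");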