Tivoli LoadLeveler
All jobs on the BG/P, the P690 machines and the Cluster are scheduled through the Tivoli LoadLeveler software.
A local copy of the user's guide, LoadLeveler v3.4: Using and Administering (am2ug305.pdf), is available as a local PDF, from IBM, or in an HTML/JSP version. It may also be found on the CHPC Unix machine at chpcln:/CHPC/usr/local/doc/QUICKSTART.
By default, LoadLeveler uses the 1 Gigabit Ethernet adapters over TCP/IP rather than the faster InfiniBand adapters over the Voltaire drivers, regardless of which MPI implementation is used (MVAPICH or MPICH).
A worked example using LoadLeveler command scripts is given on the Hello World page.
Essential LoadLeveler commands
llclass: Listing the available classes (queues)
Use this to determine which of the par*_* classes to submit to. Below is a typical result:
| Name | MaxJobCPU (d+hh:mm:ss) | MaxProcCPU (d+hh:mm:ss) | Free Slots | Max Slots | Description |
|---|---|---|---|---|---|
| smp_1 | undefined | undefined | 32 | 32 | 32 CPUs (1 nodes), 3 day P690 |
| par16_2d | undefined | undefined | 123 | 576 | 64 CPUs (16 nodes), 2 day class |
| par64_1 | undefined | undefined | 123 | 576 | 256 CPUs (64 nodes), 1 hour class |
| par128_12 | undefined | undefined | 123 | 576 | 512 CPUs (128 nodes), 12 hour class |
| par128_6 | undefined | undefined | 123 | 576 | 512 CPUs (128 nodes), 6 hour class |
| par128_3 | undefined | undefined | 123 | 576 | 512 CPUs (128 nodes), 3 hour class |
| par32_1w | undefined | undefined | 0 | 576 | 128 CPUs (32 nodes), 1 week class |
| cspeed | undefined | undefined | 32 | 32 | Test ClearSpeed class |
| compile | undefined | undefined | 0 | 36 | 4 CPUs (1 node), 1 hour class for compile |
| UAT | undefined | undefined | 123 | 612 | 640 CPUs (160 nodes), 12 hour class for UAT |
| par1_12 | undefined | undefined | 155 | 644 | 4 CPUs (1 nodes), 12 hour class |
| par2_12 | undefined | undefined | 123 | 612 | 8 CPUs (2 nodes), 12 hour class |
| par4_12 | undefined | undefined | 123 | 612 | 16 CPUs (4 nodes), 12 hour class |
| par2_2w | undefined | undefined | 0 | 612 | 8 CPUs (2 nodes), 2 week class |
| par160_1 | undefined | undefined | 123 | 612 | 640 CPUs (160 nodes), 1 hour class |
| par160_2 | undefined | undefined | 123 | 612 | 640 CPUs (160 nodes), 2 hour class |
| par160_1d | undefined | undefined | 123 | 612 | 640 CPUs (160 nodes), 1 day class |
| par160_2d | undefined | undefined | 123 | 612 | 640 CPUs (160 nodes), 2 day class |

"Free Slots" values of the classes "par16_2d", "par64_1", "par128_12", "par128_6", "par128_3", "par32_1w", "compile", "UAT", "par1_12", "par2_12", "par4_12", "par2_2w", "par160_1", "par160_2", "par160_1d", "par160_2d" are constrained by the MAX_STARTERS limit(s).
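The par class names in the table above appear to follow a pattern that can be read off the Description column: par<nodes>_<limit>, where a bare number is a limit in hours, a d suffix means days and a w suffix means weeks. The hypothetical helper below is a minimal POSIX shell sketch that assumes this naming convention holds for every par class; it decodes a class name into its node count and wall-clock limit in hours, which can make it easier to match a job to a class.

```shell
# Hypothetical helper: decode a class name of the form par<nodes>_<limit>
# and print "<nodes> <hours>". The suffix rules (none = hours, d = days,
# w = weeks) are inferred from the llclass descriptions, not from LoadLeveler.
class_limits() {
    name=$1
    case $name in
        par*_*) ;;                                 # looks like a par class
        *) echo "not a par*_* class: $name" >&2; return 1 ;;
    esac
    nodes=${name#par}          # "par160_2d" -> "160_2d"
    nodes=${nodes%%_*}         # "160_2d"    -> "160"
    limit=${name#*_}           # "par160_2d" -> "2d"
    case $limit in
        *d) hours=$(( ${limit%d} * 24 )) ;;       # days
        *w) hours=$(( ${limit%w} * 24 * 7 )) ;;   # weeks
        *)  hours=$limit ;;                        # plain number of hours
    esac
    echo "$nodes $hours"
}
```

For example, class_limits par160_2d prints 160 48 and class_limits par32_1w prints 32 168. To see the full definition of one class, llclass -l par16_2d gives the long listing.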
List all available clusters:
List available classes on Blue Gene/P:
Cluster bgp
| Name | MaxJobCPU (d+hh:mm:ss) | MaxProcCPU (d+hh:mm:ss) | Free Slots | Max Slots | Description |
|---|---|---|---|---|---|
| BGPCOMP | undefined | undefined | 15 | 16 | |
| BGP | undefined | undefined | 15 | 16 | |

"Free Slots" value of the class "BGPCOMP" is constrained by the MAX_STARTERS limit(s).
llsubmit: Submitting a job
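llsubmit takes a job command file: a shell script whose # @ lines are LoadLeveler keywords, with # @ queue closing the job step. As a minimal sketch, the hypothetical make_job function below prints such a file for an MPI program. The keywords used (job_name, job_type, class, node, tasks_per_node, output, error, queue) are standard LoadLeveler keywords; the mpirun launch line, including its use of the LOADL_HOSTFILE variable, is an assumption for an MPICH-style mpirun, so check the Hello World example for the launch command actually used on this system.

```shell
# Hypothetical helper: print a minimal LoadLeveler job command file.
# Arguments: class, number of nodes, tasks per node, program path.
make_job() {
    class=$1 nodes=$2 tasks_per_node=$3 program=$4
    ntasks=$(( nodes * tasks_per_node ))
    # Unescaped $vars are filled in now; \$(...) and \$LOADL_HOSTFILE are
    # left for LoadLeveler and the job's shell to expand at run time.
    cat <<EOF
#!/bin/sh
# @ job_name = ${program##*/}
# @ job_type = parallel
# @ class = $class
# @ node = $nodes
# @ tasks_per_node = $tasks_per_node
# @ output = \$(job_name).\$(jobid).out
# @ error = \$(job_name).\$(jobid).err
# @ queue
# Assumption: MPICH-style mpirun reading the host list LoadLeveler provides.
mpirun -np $ntasks -machinefile "\$LOADL_HOSTFILE" $program
EOF
}
```

A possible use is make_job par16_2d 16 4 ./hello > hello.cmd followed by llsubmit hello.cmd. Four tasks per node matches the 4 CPUs per node implied by the par classes (for example 64 CPUs on 16 nodes for par16_2d).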
llq and llstatus: Querying the status of the queues
Usage:
 llq [ -? ] [ -H ] [ -v ] [ -W ] [ -x [ -d ] ] [ -s ] [ -l ] [ -w ]
     [ -X {cluster_list | all} ] [ -j joblist | joblist ]
     [ -u userlist ] [ -h hostlist ] [ -c classlist ] [ -R reservationlist ]
     [ -f category_list ] [ -r category_list ] [ -b ]
Usage:
 llstatus [-?] [-H] [-v] [-W] [-R] [-F] [-M] [-l] [-a] [-C] [-b]
     [-B {base_partition_list | all}] [-P {partition_list | all}]
     [-X {cluster_list | all}] [-f category_list] [-r category_list]
     [-h hostlist | hostlist]
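The options most often needed day to day are a small subset of the usage summaries above: -u restricts llq to one user, -c to one class, -l gives a long listing, and llq -s reports why a job is still waiting (Idle, Deferred or NotQueued). The wrappers below are a sketch with hypothetical function names; they only pass those documented flags through to llq and llstatus.

```shell
# Hypothetical convenience wrappers; each flag comes from the usage text above.
my_jobs()     { llq -u "${USER:?USER is not set}"; }  # only your own jobs
class_jobs()  { llq -c "$1"; }        # jobs in one class, e.g. par16_2d
job_detail()  { llq -l "$1"; }        # long (detailed) listing of one job
why_waiting() { llq -s "$1"; }        # why a job is still Idle/Deferred
node_detail() { llstatus -l "$@"; }   # long listing of given hosts (all if none)
```

Job IDs for job_detail and why_waiting are taken from the first column of plain llq output.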
llcancel: Cancelling a job
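llcancel takes one or more job IDs as listed by llq. The sketch below wraps it in two hypothetical functions; the -u userlist option used in cancel_all_mine comes from the LoadLeveler command reference rather than from this page, so it is worth confirming with llcancel -H before relying on it.

```shell
# Hypothetical wrappers around llcancel.
cancel_job()      { llcancel "$@"; }    # one or more job IDs as listed by llq
# Assumption: llcancel -u <user> cancels all of that user's jobs.
cancel_all_mine() { llcancel -u "${USER:?USER is not set}"; }
```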
Email notification of job status
Add the following two lines to the header of the LoadLeveler job command file to receive LoadLeveler email notifications:
# @ notification = always
# @ notify_user = user@email.domain
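Keywords that appear after # @ queue belong to the next job step, so the two notification lines must come before it. The hypothetical filter below is a minimal sketch using POSIX awk: it copies a job command file from standard input and inserts both lines just before the first # @ queue line. Besides always, the notification keyword also accepts error, start, complete and never, with complete as the default.

```shell
# Hypothetical filter: read a job command file on stdin and write it to stdout
# with notification keywords inserted before the first "# @ queue" line.
add_notification() {
    awk -v who="$1" '
        # Match "# @ queue", allowing optional blanks around "@".
        !done && /^#[ \t]*@[ \t]*queue[ \t]*$/ {
            print "# @ notification = always"
            print "# @ notify_user = " who
            done = 1
        }
        { print }
    '
}
```

For example: add_notification user@email.domain < hello.cmd > hello_notify.cmd.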
Debugging
mpirun_dbg.dbx, mpirun_dbg.ddd, mpirun_dbg.gdb
Monitoring
To monitor activity on the nodes, use one of:
- nmon
- vmstat
- top
- xloadl (X11)
as well as ps and free.
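For a quick, non-interactive record of what a node is doing, for example from inside a job script, some of the tools listed above can be run in sequence. This is a minimal sketch: node_snapshot and run_if_present are hypothetical names, the flags are common Linux procps options rather than anything LoadLeveler-specific, and tools that are not installed are reported rather than treated as errors. nmon, top and xloadl are interactive (xloadl needs an X11 display) and are better run by hand.

```shell
# Run a tool with arguments if it is installed; report it otherwise.
run_if_present() {
    tool=$1; shift
    if command -v "$tool" >/dev/null 2>&1; then
        echo "== $tool $* =="
        "$tool" "$@" || echo "($tool exited with status $?)"
    else
        echo "== $tool: not installed =="
    fi
}

# Hypothetical one-shot load snapshot of the current node.
node_snapshot() {
    echo "== $(uname -n) at $(date) =="
    run_if_present vmstat 1 3     # three samples one second apart (first is since boot)
    run_if_present free -m        # memory and swap in MiB
    run_if_present ps -u "${USER:-$(id -un)}" -o pid,pcpu,pmem,etime,comm
}
```

Redirecting the output, for example node_snapshot >> snapshot.log, keeps a simple history that can be compared between runs.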