Wednesday, September 16, 2015

Control Groups: CPU Shares explained

Almost two years back I implemented Control Groups on our HPC cluster login nodes. Even though we gave equal shares to each user on those systems, it always intrigued me how cpu shares actually work. Eventually, I got a chance to test it on a host I was building for our new cluster. Anyway, let's dive in.

Let's say we want to create groups with a share structure like this:

Soon I will come to cgconfig.conf and cgrules.conf, which show the configuration and how we tie the configuration to users. In the meantime, let me explain what this picture means. There is a control group "sm4082" within another control group "users". Then there are two groups, "important" and "normal" (by mistake I typed it as Normal in the picture), within "sm4082". Finally, there are two more groups, "high" and "low", within "normal". Now let's see how shares work and how numbers like 1024 and 512 get translated into percentages of cpu time. By the way, I forgot to mention the most important thing: the default cpu share is 1024, and a single group with 1024 shares owns 100% of the cpu. Whatever number you give for a cpu share gets translated into a percentage in relation to 1024.
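To convince yourself of the 1024 default, you can create a throwaway group on a box where the hierarchy is already mounted (the path below matches the cgconfig.conf shown later in this post; adjust it to your own mount point):

# a brand new group starts at the default weight of 1024
mkdir /cgroup/cpu_and_mem/demo
cat /cgroup/cpu_and_mem/demo/cpu.shares    # prints 1024
rmdir /cgroup/cpu_and_mem/demo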

Let's do the first triangle, the one at the top. That'd be

Total=1024+512+512=2048

Let's say there are 3 processes running: one in the "sm4082" group, one in "important", and one in "normal". Importantly, all 3 processes are on the same core. Their percentages of cpu time would be

For the process in sm4082: 1024/2048=0.5 or 50%

For the process in important: 512/2048=0.25 or 25%

For the process in normal: 512/2048=0.25 or 25%

All good. But what happens when two more processes start at the same time, one in "high" and one in "low"?

Now what happens with the percentages? Let me show this in a picture.

Let's see whether we can come up with the same percentages in our calculations. We know that sm4082 has 50% and that important and normal have 25% each. Now all three of normal, high, and low share this 25% of cpu time. So what would their individual shares be? Let's assume each of these groups has one process in it.

Total=512+1024+512=2048

For a process in normal: 512/2048=1/4 or 25%

This is 25% of the cpu time normal gets, and normal gets 25% of total cpu time as we showed above. So it'd be

25% of 25% or (1/4)*(1/4)=1/16 or 6.25% of total cpu time when there is a single process in all the groups.

For a process in high: 1024/2048=1/2 or 50%

50% of 25% or (1/2)*(1/4)=1/8 or 12.5% of total cpu time when there is a single process in all the groups.

Similarly, for a process in low: 512/2048=1/4 or 25%

25% of 25% or (1/4)*(1/4)=1/16 or 6.25% of total cpu time when there is a single process in all the groups.

But the picture shows different percentages. How come? By the way, the percentages in the picture were taken from the system while I ran the jobs, so they are correct. Then where did our calculations go wrong?

It seems that for each triangle, whatever value sits at the top has to be treated as 1024 in that triangle's calculation. What? Yes, that's right. In fact it makes sense: the groups below are sharing the top group's share, which is 100% within its own triangle, and as you know 100% corresponds to 1024 shares. Put another way, a process running directly in "normal" competes with the child groups below it using the default task weight of 1024. So in the calculation we need to put 1024 instead of 512. Let's see what values we get then:


Total=1024+1024+512=2560

For a process in normal: 1024/2560=2/5 or 40%

This is 40% of the cpu time normal gets, and normal gets 25% of total cpu time as we showed above. So it'd be

40% of 25% or (2/5)*(1/4)=1/10 or 10% of total cpu time when there is a single process in all the groups.

For a process in high: 1024/2560=2/5 or 40%

40% of 25% or (2/5)*(1/4)=1/10 or 10% of total cpu time when there is a single process in all the groups.

Similarly, for a process in low: 512/2560=1/5 or 20%

20% of 25% or (1/5)*(1/4)=1/20 or 5% of total cpu time when there is a single process in all the groups.

Now all of our numbers match the actual numbers I got from the system by running the jobs, i.e. the percentages shown in the picture above.
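Just to double-check the arithmetic, here is a small bash sketch (mine, not part of the original setup) that redoes the two levels of the tree, assuming one busy process per group and treating the process sitting directly in a group as carrying the default weight of 1024:

#!/bin/bash
# level 1 inside sm4082: the task in sm4082 (1024) vs important (512) vs normal (512)
l1=$((1024 + 512 + 512))
sm=$(echo "scale=2; 100*1024/$l1" | bc)        # 50.00
imp=$(echo "scale=2; 100*512/$l1" | bc)        # 25.00
nor=$(echo "scale=2; 100*512/$l1" | bc)        # 25.00

# level 2 inside normal: the task in normal (1024) vs high (1024) vs low (512),
# scaled by normal's 25% slice of the machine
l2=$((1024 + 1024 + 512))
nor_task=$(echo "scale=2; $nor*1024/$l2" | bc) # 10.00
high=$(echo "scale=2; $nor*1024/$l2" | bc)     # 10.00
low=$(echo "scale=2; $nor*512/$l2" | bc)       # 5.00

echo "sm4082=$sm% important=$imp% normal=$nor_task% high=$high% low=$low%"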

Now, let's look at cgconfig.conf and cgrules.conf

[root@login-0-0 ~]# cat /etc/cgconfig.conf 
#
#  Copyright IBM Corporation. 2007
#
#  Authors: Balbir Singh <balbir@linux.vnet.ibm.com>
#  This program is free software; you can redistribute it and/or modify it
#  under the terms of version 2.1 of the GNU Lesser General Public License
#  as published by the Free Software Foundation.
#
#  This program is distributed in the hope that it would be useful, but
#  WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# See man cgconfig.conf for further details.
#
# By default, mount all controllers to /cgroup/<controller>

#mount {
# cpuset = /cgroup/cpuset;
# cpu = /cgroup/cpu;
# cpuacct = /cgroup/cpuacct;
# memory = /cgroup/memory;
# devices = /cgroup/devices;
# freezer = /cgroup/freezer;
# net_cls = /cgroup/net_cls;
# blkio = /cgroup/blkio;
#}

mount {
 cpuset = /cgroup/cpu_and_mem;
 cpu = /cgroup/cpu_and_mem;
 cpuacct = /cgroup/cpu_and_mem;
 memory = /cgroup/cpu_and_mem;
}

group users {
 cpuset {
                cpuset.mems="0";
                cpuset.cpus="0";
        }
        cpu {
                cpu.shares = "1024";
        }
 cpuacct {
  cpuacct.usage = "0";
 }
 memory {
  memory.limit_in_bytes = "140G";
  memory.memsw.limit_in_bytes = "140G";
  memory.use_hierarchy = "1";
 }
}

group commands {
 cpuset {
                cpuset.mems="0-1";
                cpuset.cpus="0-17";
        }
        cpu {
                cpu.shares = "1024";
        }
 cpuacct {
  cpuacct.usage = "0";
 }
 memory {
  memory.limit_in_bytes = "40G";
  memory.memsw.limit_in_bytes = "40G";
  memory.use_hierarchy = "1";
 }
}

template commands/%u {
 cpuset {
                cpuset.mems="0-1";
                cpuset.cpus="0-17";
        }
        cpu {
                cpu.shares = "100";
        }
 cpuacct {
  cpuacct.usage = "0";
 }
 memory {
  memory.limit_in_bytes = "10G";
  memory.memsw.limit_in_bytes = "10G";
 }
}

group users/sm4082 {
 cpuset {
                cpuset.mems="0";
                cpuset.cpus="0";
        }
 cpu {
  cpu.shares = "1024";
 }
 cpuacct {
  cpuacct.usage = "0";
 }
 memory {
  memory.limit_in_bytes = "10G";
  memory.memsw.limit_in_bytes = "10G";
 }
}

group users/sm4082/important {
 cpuset {
                cpuset.mems="0";
                cpuset.cpus="0";
        }
 cpu {
  cpu.shares = "512";
 }
}

group users/sm4082/normal {
 cpuset {
                cpuset.mems="0";
                cpuset.cpus="0";
        }
 cpu {
  cpu.shares = "512";
 }
}

group users/sm4082/normal/high {
 cpuset {
                cpuset.mems="0";
                cpuset.cpus="0";
        }
 cpu {
  cpu.shares = "1024";
 }
}

group users/sm4082/normal/low {
 cpuset {
                cpuset.mems="0";
                cpuset.cpus="0";
        }
 cpu {
  cpu.shares = "512";
 }
}

[root@login-0-0 ~]# cat /etc/cgrules.conf 
# /etc/cgrules.conf
#The format of this file is described in cgrules.conf(5)
#manual page.
#
# Example:
#<user>  <controllers> <destination>
#@student cpu,memory usergroup/student/
#peter  cpu  test1/
#%  memory  test2/
# End of file
*:scp  cpuset,cpu,cpuacct,memory  commands/%u
#@users:matho-primes cpuset,cpu,cpuacct,memory               commands/%u 
sm4082  cpuset,cpu,cpuacct,memory  users/sm4082


As user sm4082:

-bash-4.1$ echo $USER
sm4082
-bash-4.1$ matho-primes 0 9999999999 >/dev/null &
[1] 68686
-bash-4.1$ matho-primes 0 9999999999 >/dev/null &
[2] 68687
-bash-4.1$ matho-primes 0 9999999999 >/dev/null &
[3] 68688
-bash-4.1$ matho-primes 0 9999999999 >/dev/null &
[4] 68689
-bash-4.1$ matho-primes 0 9999999999 >/dev/null &
[5] 68691
-bash-4.1$ 
As user root:

[root@login-0-0 ~]# echo 68687 > /cgroup/cpu_and_mem/users/sm4082/important/tasks 
[root@login-0-0 ~]# echo 68688 > /cgroup/cpu_and_mem/users/sm4082/normal/tasks 
[root@login-0-0 ~]# echo 68689 > /cgroup/cpu_and_mem/users/sm4082/normal/high/tasks 
[root@login-0-0 ~]# echo 68691 > /cgroup/cpu_and_mem/users/sm4082/normal/low/tasks 
[root@login-0-0 ~]# top -u sm4082

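A side note: instead of echoing PIDs into the tasks files by hand, libcgroup's cgclassify does the same move in a single command; since all four controllers are mounted on the one cpu_and_mem hierarchy here, the effect is identical. For example:

cgclassify -g cpuset,cpu,cpuacct,memory:users/sm4082/important 68687
cgclassify -g cpuset,cpu,cpuacct,memory:users/sm4082/normal/high 68689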

Finally, I would like to mention a scenario where there are 4 processes in total. Of these 4, the first two are in sm4082 and the other two are in important and normal, one in each group. There are no processes in high and low.

What happens to percentages now?

[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/sm4082/tasks 
68702
68705
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/sm4082/important/tasks 
68687
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/sm4082/normal/tasks    
68688
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/sm4082/normal/high/tasks 
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/sm4082/normal/low/tasks  
[root@login-0-0 ~]#


At first I thought the two processes in sm4082 would get 25% each (half of sm4082's 50%), and the process in important and the process in normal would each get their usual 25%. But as you can see in the picture, it wasn't like that. So what's happening here?

It turns out cpu share percentages are never computed by treating the whole group as a single unit; it's always about the individual processes in that group. Here there are two processes in sm4082. These two processes have the same priority, and each of them gets its own share of 1024 per cpu iteration. This changes our calculation completely. Let's see how.

Total=2*1024+512+512=3072

There are 2 processes running in the "sm4082" group, one in "important", and one in "normal". Their percentages of cpu time would be

For each process in sm4082: 1024/3072=1/3 or 33.3%

For the process in important: 512/3072=1/6 or 16.6%

For the process in normal: 512/3072=1/6 or 16.6%

These numbers are pretty much the same as the ones in the picture above (small discrepancies are normal).
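If you'd rather measure these percentages than read them off top, cpuacct.usage (the total CPU time used by a group's tasks, in nanoseconds, including its children) gives a quick way to do it. A rough sketch I'd use, not part of the original run:

#!/bin/bash
# sample cpuacct.usage twice, 30 seconds apart, and turn the deltas into percentages;
# a parent's counter includes its children, so sm4082's own tasks are parent minus children
base=/cgroup/cpu_and_mem/users/sm4082
snap() { cat $base/cpuacct.usage $base/important/cpuacct.usage $base/normal/cpuacct.usage; }

read p1 i1 n1 <<< "$(snap | xargs)"
sleep 30
read p2 i2 n2 <<< "$(snap | xargs)"

own=$(( (p2-p1) - (i2-i1) - (n2-n1) ))
imp=$(( i2-i1 ))
nor=$(( n2-n1 ))
tot=$(( own + imp + nor ))

echo "sm4082 (own tasks): $(echo "scale=1; 100*$own/$tot" | bc)%"
echo "important         : $(echo "scale=1; 100*$imp/$tot" | bc)%"
echo "normal            : $(echo "scale=1; 100*$nor/$tot" | bc)%"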

This is it. I hope all of it makes sense now; it certainly does for me.

Tuesday, September 15, 2015

Bash wrapper around ibswitches that shows switch names rather than their GUIDs

Please look at the blog entry right before this one to understand what this script does and where it comes in handy.


The Wrapper Script

[2015-09-15 22:59:02:7798 root@master post]# cat /usr/local/sbin/ibswitches_wrapper
#!/bin/bash

# by Sreedhar Manchu

sw_guid=(0xf45214030095b2c0 0xf4521403009564e0 0x0002c9020048d260 0x0002c9020048d9b8 0x0002c9020048d940 0x0002c9020048d8e8 0x0002c9020048d240 0x0002c90200489d18 0x0002c902004a7998 0x0002c902004b5d00 0xf452140300f61d20 0xf4521403009571c0 0x0002c902004100f0 0x0002c9020040ff28 0x0002c90200422938 0x0002c902004239c8 0x0002c90200423950 0x0002c90200423a60 0x0002c9020040fe80 0x0002c9020040d668 0xf452140300868de0 0xf452140300680100 0xf452140300680180 0xf45214030067f800 0xf452140300680080 0x0002c9020041e098 0x0002c9020040c868 0x0002c90200422498 0x0002c90200422258 0x0002c9020041e0c0 0x0002c9020041e108 0x0002c903006be3f0 0x0002c903006bfa70 0x0002c903006bfb70 0x0002c903006bfe70 0x0002c903006be6f0 0x0002c903006be7f0 0x0002c903007b6a30 0x0002c903006bfaf0 0x0002c903006be670 0x0002c903006bf970)

sw_name=(ibswcore0 ibswcore1 ibswspine0 ibswspine1 ibswspine2 ibswspine3 ibswspine4 ibswspine5/spmercerib0 ibswspine6 ibswspine7/spmercerib1 ibswspine8/spmercerib3 ibswspine9/spmercerib4 ibswspine10/spboweryib ibswspine11 ibswspine12 ibswspine13 ibswspine14 ibswspine15 ibswspine16/splibb ibswspine17 ibswspine18/spmercerib2 ibswedge0 ibswedge1 ibswedge2 ibswedge3 ibswedge4 ibswedge5 ibswedge6 ibswedge7 ibswedge8 ibswedge9 ibswedge14 ibswedge15 ibswedge16 ibswedge17 ibswedge18 ibswedge19 ibswedge20 ibswedge21 ibswedge22 ibswedge23)

# run ibswitches once and work from the captured output
/usr/sbin/ibswitches > /tmp/ibswitches_wrapper_$$

# for each known GUID, print its line with the quoted switch description
# (field 2 when splitting on double quotes) replaced by our own name
for ((i=0;i<${#sw_guid[@]};i++));do
 awk -F'"' -v OFS=\" -v ss="${sw_guid[$i]}" -v rs="${sw_name[$i]}" '$0 ~ ss {$2 = rs; print }' /tmp/ibswitches_wrapper_$$
done

rm -f /tmp/ibswitches_wrapper_$$

Without Wrapper

[2015-09-15 22:55:25:7796 root@master post]# ibswitches
Switch : 0x0002c9020040fe80 ports 36 "splibb SW-1" enhanced port 0 lid 2 lmc 0
Switch : 0xf452140300680180 ports 32 "SwitchX -  Mellanox Technologies" base port 0 lid 376 lmc 0
Switch : 0xf452140300680100 ports 32 "SwitchX -  Mellanox Technologies" base port 0 lid 375 lmc 0
Switch : 0xf452140300680080 ports 32 "SwitchX -  Mellanox Technologies" base port 0 lid 379 lmc 0
Switch : 0xf45214030067f800 ports 32 "SwitchX -  Mellanox Technologies" base port 0 lid 377 lmc 0
Switch : 0x0002c90200422938 ports 36 "Infiniscale-IV Mellanox Technologies" base port 0 lid 301 lmc 0
Switch : 0x0002c902004239c8 ports 36 "Infiniscale-IV Mellanox Technologies" base port 0 lid 424 lmc 0
Switch : 0xf4521403009571c0 ports 36 "MF0;switch-a8fcc2:SX6036/U1" enhanced port 0 lid 373 lmc 0
Switch : 0xf452140300f61d20 ports 36 "MF0;switch-534ae0:SX6036/U1" enhanced port 0 lid 374 lmc 0
Switch : 0x0002c9020040d668 ports 36 "Infiniscale-IV Mellanox Technologies" base port 0 lid 355 lmc 0
Switch : 0x0002c9020040c868 ports 32 "Infiniscale-IV Mellanox Technologies" base port 0 lid 354 lmc 0
Switch : 0x0002c9020041e098 ports 32 "Infiniscale-IV Mellanox Technologies" base port 0 lid 395 lmc 0
Switch : 0x0002c90200422258 ports 32 "Infiniscale-IV Mellanox Technologies" base port 0 lid 334 lmc 0
Switch : 0x0002c90200422498 ports 32 "Infiniscale-IV Mellanox Technologies" base port 0 lid 394 lmc 0
Switch : 0x0002c9020041e108 ports 32 "Infiniscale-IV Mellanox Technologies" base port 0 lid 260 lmc 0
Switch : 0x0002c9020041e0c0 ports 32 "Infiniscale-IV Mellanox Technologies" base port 0 lid 418 lmc 0
Switch : 0x0002c90200423a60 ports 36 "Infiniscale-IV Mellanox Technologies" base port 0 lid 342 lmc 0
Switch : 0x0002c902004100f0 ports 36 "spboweryib SW-1" enhanced port 0 lid 442 lmc 0
Switch : 0x0002c9020040ff28 ports 36 "Infiniscale-IV Mellanox Technologies" base port 0 lid 289 lmc 0
Switch : 0x0002c90200423950 ports 36 "Infiniscale-IV Mellanox Technologies" base port 0 lid 325 lmc 0
Switch : 0x0002c902004b5d00 ports 36 "MF0;spmercerib1:IS5030/U1" enhanced port 0 lid 18 lmc 0
Switch : 0x0002c902004a7998 ports 36 "ibswspine6__" base port 0 lid 150 lmc 0
Switch : 0x0002c90200489d18 ports 36 "MF0;spmercerib0:IS5030/U1" enhanced port 0 lid 99 lmc 0
Switch : 0x0002c9020048d240 ports 36 "ibswspine4__" base port 0 lid 22 lmc 0
Switch : 0x0002c9020048d8e8 ports 36 "ibswspine3__" base port 0 lid 33 lmc 0
Switch : 0x0002c9020048d940 ports 36 "ibswspine2__" base port 0 lid 23 lmc 0
Switch : 0x0002c9020048d9b8 ports 36 "ibswspine1__" base port 0 lid 32 lmc 0
Switch : 0xf452140300868de0 ports 36 "MF0;spmercerib2:SX6036/U1" enhanced port 0 lid 130 lmc 0
Switch : 0x0002c903006be3f0 ports 36 "SwitchX -  Mellanox Technologies" base port 0 lid 34 lmc 0
Switch : 0x0002c903006bfa70 ports 36 "SwitchX -  Mellanox Technologies" base port 0 lid 26 lmc 0
Switch : 0x0002c903006bfb70 ports 36 "SwitchX -  Mellanox Technologies" base port 0 lid 27 lmc 0
Switch : 0x0002c903006bfaf0 ports 36 "SwitchX -  Mellanox Technologies" base port 0 lid 154 lmc 0
Switch : 0x0002c903006bfe70 ports 36 "SwitchX -  Mellanox Technologies" base port 0 lid 28 lmc 0
Switch : 0xf45214030095b2c0 ports 36 "MF0;ibswcore0:SX6036/U1" enhanced port 0 lid 463 lmc 0
Switch : 0xf4521403009564e0 ports 36 "MF0;ibswcore1:SX6036/U1" enhanced port 0 lid 464 lmc 0
Switch : 0x0002c903006be670 ports 36 "SwitchX -  Mellanox Technologies" base port 0 lid 24 lmc 0
Switch : 0x0002c903006bf970 ports 36 "SwitchX -  Mellanox Technologies" base port 0 lid 25 lmc 0
Switch : 0x0002c903007b6a30 ports 36 "SwitchX -  Mellanox Technologies" base port 0 lid 20 lmc 0
Switch : 0x0002c903006be7f0 ports 36 "SwitchX -  Mellanox Technologies" base port 0 lid 36 lmc 0
Switch : 0x0002c903006be6f0 ports 36 "SwitchX -  Mellanox Technologies" base port 0 lid 35 lmc 0
Switch : 0x0002c9020048d260 ports 36 "ibswspine0__" base port 0 lid 120 lmc 0


With Wrapper

[2015-09-15 22:58:40:7797 root@master post]# bash /usr/local/sbin/ibswitches_wrapper 
Switch : 0xf45214030095b2c0 ports 36 "ibswcore0" enhanced port 0 lid 463 lmc 0
Switch : 0xf4521403009564e0 ports 36 "ibswcore1" enhanced port 0 lid 464 lmc 0
Switch : 0x0002c9020048d260 ports 36 "ibswspine0" base port 0 lid 120 lmc 0
Switch : 0x0002c9020048d9b8 ports 36 "ibswspine1" base port 0 lid 32 lmc 0
Switch : 0x0002c9020048d940 ports 36 "ibswspine2" base port 0 lid 23 lmc 0
Switch : 0x0002c9020048d8e8 ports 36 "ibswspine3" base port 0 lid 33 lmc 0
Switch : 0x0002c9020048d240 ports 36 "ibswspine4" base port 0 lid 22 lmc 0
Switch : 0x0002c90200489d18 ports 36 "ibswspine5/spmercerib0" enhanced port 0 lid 99 lmc 0
Switch : 0x0002c902004a7998 ports 36 "ibswspine6" base port 0 lid 150 lmc 0
Switch : 0x0002c902004b5d00 ports 36 "ibswspine7/spmercerib1" enhanced port 0 lid 18 lmc 0
Switch : 0xf452140300f61d20 ports 36 "ibswspine8/spmercerib3" enhanced port 0 lid 374 lmc 0
Switch : 0xf4521403009571c0 ports 36 "ibswspine9/spmercerib4" enhanced port 0 lid 373 lmc 0
Switch : 0x0002c902004100f0 ports 36 "ibswspine10/spboweryib" enhanced port 0 lid 442 lmc 0
Switch : 0x0002c9020040ff28 ports 36 "ibswspine11" base port 0 lid 289 lmc 0
Switch : 0x0002c90200422938 ports 36 "ibswspine12" base port 0 lid 301 lmc 0
Switch : 0x0002c902004239c8 ports 36 "ibswspine13" base port 0 lid 424 lmc 0
Switch : 0x0002c90200423950 ports 36 "ibswspine14" base port 0 lid 325 lmc 0
Switch : 0x0002c90200423a60 ports 36 "ibswspine15" base port 0 lid 342 lmc 0
Switch : 0x0002c9020040fe80 ports 36 "ibswspine16/splibb" enhanced port 0 lid 2 lmc 0
Switch : 0x0002c9020040d668 ports 36 "ibswspine17" base port 0 lid 355 lmc 0
Switch : 0xf452140300868de0 ports 36 "ibswspine18/spmercerib2" enhanced port 0 lid 130 lmc 0
Switch : 0xf452140300680100 ports 32 "ibswedge0" base port 0 lid 375 lmc 0
Switch : 0xf452140300680180 ports 32 "ibswedge1" base port 0 lid 376 lmc 0
Switch : 0xf45214030067f800 ports 32 "ibswedge2" base port 0 lid 377 lmc 0
Switch : 0xf452140300680080 ports 32 "ibswedge3" base port 0 lid 379 lmc 0
Switch : 0x0002c9020041e098 ports 32 "ibswedge4" base port 0 lid 395 lmc 0
Switch : 0x0002c9020040c868 ports 32 "ibswedge5" base port 0 lid 354 lmc 0
Switch : 0x0002c90200422498 ports 32 "ibswedge6" base port 0 lid 394 lmc 0
Switch : 0x0002c90200422258 ports 32 "ibswedge7" base port 0 lid 334 lmc 0
Switch : 0x0002c9020041e0c0 ports 32 "ibswedge8" base port 0 lid 418 lmc 0
Switch : 0x0002c9020041e108 ports 32 "ibswedge9" base port 0 lid 260 lmc 0
Switch : 0x0002c903006be3f0 ports 36 "ibswedge14" base port 0 lid 34 lmc 0
Switch : 0x0002c903006bfa70 ports 36 "ibswedge15" base port 0 lid 26 lmc 0
Switch : 0x0002c903006bfb70 ports 36 "ibswedge16" base port 0 lid 27 lmc 0
Switch : 0x0002c903006bfe70 ports 36 "ibswedge17" base port 0 lid 28 lmc 0
Switch : 0x0002c903006be6f0 ports 36 "ibswedge18" base port 0 lid 35 lmc 0
Switch : 0x0002c903006be7f0 ports 36 "ibswedge19" base port 0 lid 36 lmc 0
Switch : 0x0002c903007b6a30 ports 36 "ibswedge20" base port 0 lid 20 lmc 0
Switch : 0x0002c903006bfaf0 ports 36 "ibswedge21" base port 0 lid 154 lmc 0
Switch : 0x0002c903006be670 ports 36 "ibswedge22" base port 0 lid 24 lmc 0
Switch : 0x0002c903006bf970 ports 36 "ibswedge23" base port 0 lid 25 lmc 0

Bash wrapper around iblinkinfo that shows IB switch names rather than their GUIDs

We have many IB switches in the InfiniBand network on our HPC cluster. I use iblinkinfo a lot to find out which node is connected to which port on which switch, and so on. The problem with iblinkinfo is that it doesn't show any names for unmanaged switches, and most of our switches are unmanaged. One of the Mellanox representatives gave us a script that made it possible to give names to a few of these unmanaged switches, but it didn't work for the switches in the Dell chassis. So I decided to put in a bit of work: I came up with a list of GUIDs for all the switches and then matched them with the names I wanted to see (I put labels with the same names on the physical switches as well). A quick way to harvest the GUIDs is shown right below, followed by the script itself. Below the script you can find the output with plain iblinkinfo and with the wrapper around it. It makes life much, much easier.
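Harvesting the GUID list in the first place is easy, since ibswitches prints the GUID as the third field of every line; something like this gives a starting point for the sw_guid array (the ordering and the matching names still have to be curated by hand):

/usr/sbin/ibswitches | awk '{print $3}'                # one GUID per line
sw_guid=($(/usr/sbin/ibswitches | awk '{print $3}'))   # or straight into a bash array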
[2015-09-15 22:01:35:7793 root@soho post]# cat /usr/local/sbin/iblinkinfo_wrapper 
#!/bin/bash

# by Sreedhar Manchu

sw_guid=(0xf45214030095b2c0 0xf4521403009564e0 0x0002c9020048d260 0x0002c9020048d9b8 0x0002c9020048d940 0x0002c9020048d8e8 0x0002c9020048d240 0x0002c90200489d18 0x0002c902004a7998 0x0002c902004b5d00 0xf452140300f61d20 0xf4521403009571c0 0x0002c902004100f0 0x0002c9020040ff28 0x0002c90200422938 0x0002c902004239c8 0x0002c90200423950 0x0002c90200423a60 0x0002c9020040fe80 0x0002c9020040d668 0xf452140300868de0 0xf452140300680100 0xf452140300680180 0xf45214030067f800 0xf452140300680080 0x0002c9020041e098 0x0002c9020040c868 0x0002c90200422498 0x0002c90200422258 0x0002c9020041e0c0 0x0002c9020041e108 0x0002c903006be3f0 0x0002c903006bfa70 0x0002c903006bfb70 0x0002c903006bfe70 0x0002c903006be6f0 0x0002c903006be7f0 0x0002c903007b6a30 0x0002c903006bfaf0 0x0002c903006be670 0x0002c903006bf970)

sw_name=(ibswcore0 ibswcore1 ibswspine0 ibswspine1 ibswspine2 ibswspine3 ibswspine4 ibswspine5:spmercerib0 ibswspine6 ibswspine7:spmercerib1 ibswspine8:spmercerib3 ibswspine9:spmercerib4 ibswspine10:spboweryib ibswspine11 ibswspine12 ibswspine13 ibswspine14 ibswspine15 ibswspine16:splibb ibswspine17 ibswspine18:spmercerib2 ibswedge0 ibswedge1 ibswedge2 ibswedge3 ibswedge4 ibswedge5 ibswedge6 ibswedge7 ibswedge8 ibswedge9 ibswedge14 ibswedge15 ibswedge16 ibswedge17 ibswedge18 ibswedge19 ibswedge20 ibswedge21 ibswedge22 ibswedge23)

# run iblinkinfo once and work from the captured output
/usr/sbin/iblinkinfo > /tmp/iblinkinfo_wrapper_$$

echo
echo
# print an index of the "Switch:" header lines, with our name appended after each GUID
for ((i=0;i<${#sw_guid[@]};i++));do
 awk -F: -v OFS=: -v ss="${sw_guid[$i]}" -v rs="${sw_name[$i]}" '$0 ~ ss {$2 = ss" "rs; print }' /tmp/iblinkinfo_wrapper_$$
done
echo
echo

# look up each switch's LID via smpquery and append our name after its GUID
# everywhere it appears in the dump
for ((i=0;i<${#sw_guid[@]};i++));do
 lids[$i]=$(smpquery NI -G ${sw_guid[$i]} | awk 'NR == 1 {print $5}')
 sed -i "/${sw_guid[$i]}/s//& ${sw_name[$i]}/" /tmp/iblinkinfo_wrapper_$$
done

IFS=$'\n'
# rewrite each link line: when the remote LID matches one of our switches,
# replace the quoted peer description with that switch's name
while read -r i
do
 slid1=$(echo "$i" | awk '{print $10}')
 slid2=$(echo "$i" | awk '{print $11}')
 for ((j=0;j<${#sw_guid[@]};j++))
 do
  if [ "$slid1" = '(' ]
  then
   echo $i
   break
  elif [ "$slid1" = "${lids[$j]}" -o "$slid2" = "${lids[$j]}" ]
  then
   echo $i|awk -F'"' -v OFS=\" -v ss="${sw_name[$j]}" '{$2 = ss; print}'
   break
  elif [ $((j+1)) -eq ${#lids[@]} ]
  then
   echo $i
  fi
 done
done< /tmp/iblinkinfo_wrapper_$$

rm -f /tmp/iblinkinfo_wrapper_$$
As you can see, at the beginning of the script I put a list of switch names and matched them with GUIDs. All the script does is replace the generic Mellanox names for these switches with my names by matching on the GUIDs. Without the wrapper, I am going to paste only part of the output here.
[2015-09-15 22:51:28:7795 root@master post]# iblinkinfo
Switch: 0x0002c9020041e098 Infiniscale-IV Mellanox Technologies:
         395    1[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     342   15[  ] "Infiniscale-IV Mellanox Technologies" ( )
         395    2[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     342   17[  ] "Infiniscale-IV Mellanox Technologies" ( )
         395    3[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     342   13[  ] "Infiniscale-IV Mellanox Technologies" ( )
         395    4[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     325   13[  ] "Infiniscale-IV Mellanox Technologies" ( )
         395    5[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     325   15[  ] "Infiniscale-IV Mellanox Technologies" ( )
         395    6[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     325   17[  ] "Infiniscale-IV Mellanox Technologies" ( )
         395    7[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     424   15[  ] "Infiniscale-IV Mellanox Technologies" ( )
         395    8[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     424   13[  ] "Infiniscale-IV Mellanox Technologies" ( )
         395    9[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     424   17[  ] "Infiniscale-IV Mellanox Technologies" ( )
         395   10[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     301   13[  ] "Infiniscale-IV Mellanox Technologies" ( )
         395   11[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     301   15[  ] "Infiniscale-IV Mellanox Technologies" ( )
         395   12[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     301   17[  ] "Infiniscale-IV Mellanox Technologies" ( )
         395   13[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     289   26[  ] "Infiniscale-IV Mellanox Technologies" ( )
         395   14[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     289   25[  ] "Infiniscale-IV Mellanox Technologies" ( )
         395   15[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     442   26[  ] "spboweryib SW-1" ( )
         395   16[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     442   25[  ] "spboweryib SW-1" ( )
         395   17[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     276    1[  ] "compute-4-0 HCA-1" ( )
         395   18[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     279    1[  ] "compute-4-1 HCA-1" ( )
         395   19[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     277    1[  ] "compute-4-2 HCA-1" ( )
         395   20[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     283    1[  ] "compute-4-3 HCA-1" ( )
         395   21[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     281    1[  ] "compute-4-4 HCA-1" ( )
         395   22[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     278    1[  ] "compute-4-5 HCA-1" ( )
         395   23[  ] ==(                Down/Disabled)==>             [  ] "" ( )
         395   24[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     284    1[  ] "compute-4-7 HCA-1" ( )
         395   25[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     286    1[  ] "compute-4-8 HCA-1" ( )
         395   26[  ] ==(                Down/ Polling)==>             [  ] "" ( )
         395   27[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     290    1[  ] "compute-4-10 HCA-1" ( )
         395   28[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     292    1[  ] "compute-4-11 mlx4_0" ( )
         395   29[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     287    1[  ] "compute-4-12 HCA-1" ( )
         395   30[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     288    1[  ] "compute-4-13 HCA-1" ( )
         395   31[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     293    1[  ] "compute-4-14 HCA-1" ( )
         395   32[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     291    1[  ] "compute-4-15 HCA-1" ( )
With the wrapper, here is the same part of the output I pasted above.
[2015-09-15 22:51:28:7795 root@master post]# bash /usr/local/sbin/iblinkinfo_wrapper
Switch: 0x0002c9020041e098 ibswedge4 Infiniscale-IV Mellanox Technologies:
         395    1[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     342   15[  ] "ibswspine15" ( )
         395    2[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     342   17[  ] "ibswspine15" ( )
         395    3[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     342   13[  ] "ibswspine15" ( )
         395    4[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     325   13[  ] "ibswspine14" ( )
         395    5[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     325   15[  ] "ibswspine14" ( )
         395    6[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     325   17[  ] "ibswspine14" ( )
         395    7[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     424   15[  ] "ibswspine13" ( )
         395    8[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     424   13[  ] "ibswspine13" ( )
         395    9[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     424   17[  ] "ibswspine13" ( )
         395   10[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     301   13[  ] "ibswspine12" ( )
         395   11[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     301   15[  ] "ibswspine12" ( )
         395   12[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     301   17[  ] "ibswspine12" ( )
         395   13[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     289   26[  ] "ibswspine11" ( )
         395   14[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     289   25[  ] "ibswspine11" ( )
         395   15[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     442   26[  ] "ibswspine10:spboweryib" ( )
         395   16[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     442   25[  ] "ibswspine10:spboweryib" ( )
         395   17[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     276    1[  ] "compute-4-0 HCA-1" ( )
         395   18[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     279    1[  ] "compute-4-1 HCA-1" ( )
         395   19[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     277    1[  ] "compute-4-2 HCA-1" ( )
         395   20[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     283    1[  ] "compute-4-3 HCA-1" ( )
         395   21[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     281    1[  ] "compute-4-4 HCA-1" ( )
         395   22[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     278    1[  ] "compute-4-5 HCA-1" ( )
         395   23[  ] ==(                Down/Disabled)==>             [  ] "" ( )
         395   24[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     284    1[  ] "compute-4-7 HCA-1" ( )
         395   25[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     286    1[  ] "compute-4-8 HCA-1" ( )
         395   26[  ] ==(                Down/ Polling)==>             [  ] "" ( )
         395   27[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     290    1[  ] "compute-4-10 HCA-1" ( )
         395   28[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     292    1[  ] "compute-4-11 mlx4_0" ( )
         395   29[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     287    1[  ] "compute-4-12 HCA-1" ( )
         395   30[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     288    1[  ] "compute-4-13 HCA-1" ( )
         395   31[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     293    1[  ] "compute-4-14 HCA-1" ( )
         395   32[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>     291    1[  ] "compute-4-15 HCA-1" ( )
As you can see, it took out the generic names and replaced them with the names we have given the switches (and labelled them with). It comes in pretty handy.
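The core trick in both wrappers is the same awk one-liner: split on double quotes so the switch description becomes field 2, then overwrite that field and print the line back out with the quotes restored. A standalone illustration, using one of the lines from the output above:

echo 'Switch : 0x0002c9020041e098 ports 32 "Infiniscale-IV Mellanox Technologies" base port 0 lid 395 lmc 0' | awk -F'"' -v OFS='"' -v rs="ibswedge4" '{$2 = rs; print}'
# Switch : 0x0002c9020041e098 ports 32 "ibswedge4" base port 0 lid 395 lmc 0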

Implementation of Control Groups (cgroups) on Rocks cluster (HPC) Login nodes

A few years ago, our HPC clusters ran CentOS 5.x. Users would log in to the login nodes and then submit jobs to the batch system. But sometimes users start running jobs on the login nodes themselves. If it is a heavy job, this can crash the login node, thereby blocking cluster access for everyone (since the only way to access the cluster is through the login nodes). Since CentOS 5.x didn't support cgroups, I had to put in cron jobs that ran every minute to kill processes that used more than a certain amount of memory.

Then, back in 2013, I finally got a chance to build a new cluster and we ended up with CentOS 6.3. This version supports cgroups, and I wanted to implement them on the login nodes. My goals were:

1) Every user gets only two cores and 2 to 4GB of memory, depending on which login node they're on
2) A way to couple cgroups with PAM
3) Send an email, and print the same message to the terminal, whenever a user's job running on a login node gets killed

This is what I did. Just to let you know, I use Rocks as a provisioning tool on our clusters.

[root@login-0-0 ~]# uname -a
Linux login-0-0.local 2.6.32-431.29.2.el6.x86_64 #1 SMP Tue Sep 9 21:36:05 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux


We need to install two packages. You can simply do yum install libcgroup libcgroup-pam

[root@login-0-0 ~]# rpm -qa | grep libcgroup
libcgroup-pam-0.40.rc1-6.el6_5.1.x86_64
libcgroup-0.40.rc1-6.el6_5.1.x86_64

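One more note for anyone reproducing this: besides the libraries, libcgroup ships the cgconfig and cgred init scripts. cgconfig mounts the hierarchy and applies /etc/cgconfig.conf, while cgred is the rules daemon that enforces /etc/cgrules.conf, so both need to be enabled and running:

chkconfig cgconfig on
chkconfig cgred on
service cgconfig start
service cgred start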

Next, we need to update cgconfig.conf in /etc. This is what I have

[root@login-0-0 ~]# more /etc/cgconfig.conf 
#
#  Copyright IBM Corporation. 2007
#
#  Authors: Balbir Singh <balbir@linux.vnet.ibm.com>
#  This program is free software; you can redistribute it and/or modify it
#  under the terms of version 2.1 of the GNU Lesser General Public License
#  as published by the Free Software Foundation.
#
#  This program is distributed in the hope that it would be useful, but
#  WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# See man cgconfig.conf for further details.
#
# By default, mount all controllers to /cgroup/<controller>

mount {
 cpuset = /cgroup/cpu_and_mem;
 cpu = /cgroup/cpu_and_mem;
 cpuacct = /cgroup/cpu_and_mem;
 memory = /cgroup/cpu_and_mem;
}

group users {
 cpuset {
                cpuset.mems="0-1";
                cpuset.cpus="0-17";
        }
        cpu {
                cpu.shares = "1024";
        }
 cpuacct {
  cpuacct.usage = "0";
 }
 memory {
  memory.limit_in_bytes = "130G";
  memory.memsw.limit_in_bytes = "130G";
  memory.use_hierarchy = "1";
 }
}

group commands {
 cpuset {
                cpuset.mems="0-1";
                cpuset.cpus="0-17";
        }
        cpu {
                cpu.shares = "1024";
        }
 cpuacct {
  cpuacct.usage = "0";
 }
 memory {
  memory.limit_in_bytes = "40G";
  memory.memsw.limit_in_bytes = "40G";
  memory.use_hierarchy = "1";
 }
}

group admins {
 cpuset {
                cpuset.mems="0-1";
                cpuset.cpus="18-19";
        }
        cpu {
                cpu.shares = "1024";
        }
 cpuacct {
  cpuacct.usage = "0";
 }
 memory {
  memory.limit_in_bytes = "10G";
  memory.memsw.limit_in_bytes = "10G";
  memory.use_hierarchy = "1";
 }
}

group users/sm4082 {
 cpuset {
  cpuset.mems="0";
  cpuset.cpus="0,2";
 }
 cpu {
  cpu.shares = "100";
 }
 cpuacct {
  cpuacct.usage = "0";
 }
 memory {
  memory.limit_in_bytes = "4G";
  memory.memsw.limit_in_bytes = "4G";
 }
}

group users/xx77 {
 cpuset {
  cpuset.mems="1";
  cpuset.cpus="1,3";
 }
 cpu {
  cpu.shares = "100";
 }
 cpuacct {
  cpuacct.usage = "0";
 }
 memory {
  memory.limit_in_bytes = "4G";
  memory.memsw.limit_in_bytes = "4G";
 }
}

group users/xx151 {
 cpuset {
  cpuset.mems="0";
  cpuset.cpus="2,4";
 }
 cpu {
  cpu.shares = "100";
 }
 cpuacct {
  cpuacct.usage = "0";
 }
 memory {
  memory.limit_in_bytes = "4G";
  memory.memsw.limit_in_bytes = "4G";
 }
}

[root@login-0-0 ~]# more /etc/cgrules.conf
# /etc/cgrules.conf
#
#Each line describes a rule for a user in the forms:
#
#<user>   <controllers>  <destination>
#<user>:<process name> <controllers>  <destination>
#
#Where:
# <user> can be:
#        - an user name
#        - a group name, with @group syntax
#        - the wildcard *, for any user or group.
#        - The %, which is equivalent to "ditto". This is useful for
#          multiline rules where different cgroups need to be specified
#          for various hierarchies for a single user.
#
# <process name> is optional and it can be:
#  - a process name
#  - a full command path of a process
#
# <controller> can be:
#   - comma separated controller names (no spaces)
#   - * (for all mounted controllers)
#
# <destination> can be:
#   - path with-in the controller hierarchy (ex. pgrp1/gid1/uid1)
#
# Note:
# - It currently has rules based on uids, gids and process name.
#
# - Don't put overlapping rules. First rule which matches the criteria
#   will be executed.
#
# - Multiline rules can be specified for specifying different cgroups
#   for multiple hierarchies. In the example below, user "peter" has
#   specified 2 line rule. First line says put peter's task in test1/
#   dir for "cpu" controller and second line says put peter's tasks in
#   test2/ dir for memory controller. Make a note of "%" sign in second line.
#   This is an indication that it is continuation of previous rule.
#
#
#<user>   <controllers>   <destination>
#
#john          cpu  usergroup/faculty/john/
#john:cp       cpu  usergroup/faculty/john/cp
#@student      cpu,memory usergroup/student/
#peter        cpu  test1/
#%        memory  test2/
#@root      *  admingroup/
#*  *  default/
# End of file
#@wheel  cpuset,cpu,cpuacct,memory               admins/%u
@wheel  cpuset,cpu,cpuacct,memory               admins
#@users:scp cpuset,cpu,cpuacct,memory               commands/%u
@users:scp cpuset,cpu,cpuacct,memory               commands
#@users:sftp cpuset,cpu,cpuacct,memory               commands/%u
@users:sftp cpuset,cpu,cpuacct,memory               commands
#@users:rsync cpuset,cpu,cpuacct,memory               commands/%u
@users:rsync cpuset,cpu,cpuacct,memory               commands
#@users:tar cpuset,cpu,cpuacct,memory               commands/%u
@users:tar cpuset,cpu,cpuacct,memory               commands
#@users:gzip cpuset,cpu,cpuacct,memory               commands/%u
@users:gzip cpuset,cpu,cpuacct,memory               commands
#@users:pigz cpuset,cpu,cpuacct,memory               commands/%u
@users:pigz cpuset,cpu,cpuacct,memory               commands
#@users:bzip cpuset,cpu,cpuacct,memory               commands/%u
@users:bzip cpuset,cpu,cpuacct,memory               commands
#@users:bzip2 cpuset,cpu,cpuacct,memory               commands/%u
@users:bzip2 cpuset,cpu,cpuacct,memory               commands
#@users:globus-url-copy cpuset,cpu,cpuacct,memory               commands/%u
@users:globus-url-copy cpuset,cpu,cpuacct,memory               commands
sm4082  cpuset,cpu,cpuacct,memory  users/sm4082
xx77  cpuset,cpu,cpuacct,memory  users/xx77
xx151  cpuset,cpu,cpuacct,memory  users/xx151



As you can see, I'm using 4 controllers (cpuset, cpu, cpuacct, and memory), all mounted into a single hierarchy at /cgroup/cpu_and_mem.

There are 20 cores on this machine, but users are confined to the first 18 cores (0-17). This is defined under cpuset in the group users section. This is a NUMA machine with two memory nodes and two processor sockets. Then I defined a group for every user, named after their userid, under group users; I went with the userid just to make things easier. The group name doesn't have to be the userid, you can give it whatever you want, but you need to tie it back to the userid in cgrules.conf.

For memory, I gave 130GB for memory.limit_in_bytes. Since I didn't want to give any swap, I gave the same number for memory.memsw.limit_in_bytes. We have 50 to 60 active users at any given time on any given login node (out of the 4 available), and I gave 4GB to each user. Most importantly, I had to set memory.use_hierarchy to 1. Otherwise, the cumulative memory usage of all users could go beyond 130GB and crash the system. With this value set to 1, the cumulative usage cannot go beyond 130GB.

Then, for the cpu controller, I gave the default share of 1024 to each of users, admins, and commands. Among the individual users, though, it is set to 100 for everyone, which means every user gets an equal share of cpu resources. For cpuset, every user gets only two cores out of 0-17. Of course, I made sure that both of those cores are on the same memory node and processor socket. You can check the layout with this command:

[root@login-0-0 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18
node 0 size: 98258 MB
node 0 free: 53554 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19
node 1 size: 98304 MB
node 1 free: 82944 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10  


Let's move on to cgrules.conf. This is the file that maps users to the groups defined in cgconfig.conf. For example, for user sm4082 there is a group users/sm4082 in cgconfig.conf. In cgrules.conf, there is a line that says user sm4082 is under the controllers cpuset, cpu, cpuacct, and memory, and that the destination group is users/sm4082. The last field tells cgred which group from cgconfig.conf to use.

Now, what about admins and commands? Sometimes commands like bzip2, gzip, tar, etc. take a lot of memory and we didn't want cgroups to kill these. So what I did was create another group called commands. Here, the total memory usage of all processes under this group can go up to 40GB; this value has nothing to do with the 130GB given to group users. Then I defined in cgrules.conf which commands fall under the group commands.

Let's take a look at the command rules in cgrules.conf. @users:scp means that any user who belongs to the linux group users and runs the system command scp falls into the group commands. So how does it work? There is a daemon, cgred, that keeps track of processes. As soon as a user starts using scp, this daemon moves the pid of scp into the group commands. Since this is outside of group users/userid and group users, the memory usage doesn't get counted against the user.

Finally, I defined another group called admins just for admins. All the admins that belong to linux group wheel come under this group.

You may ask why in the world I defined a section for each and every user in cgconfig.conf when I could have simply defined something like @users cpuset,cpu,cpuacct,memory users/%u. I wanted to give different cores to each user, and unfortunately there is no way to do that with just one line. If I didn't care about the cpuset section being different for each user, then I could use just one line. Even then, it didn't work in CentOS 6.3; it worked fine in CentOS 6.6 though. A sketch of that simpler setup follows.
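For reference, here is roughly what that simpler setup looks like, using a template section in cgconfig.conf together with a %u destination in cgrules.conf (this is a sketch; the limits are identical for everyone, which is exactly the restriction I wanted to avoid):

template users/%u {
        cpuset {
                cpuset.mems="0-1";
                cpuset.cpus="0-17";
        }
        cpu {
                cpu.shares = "100";
        }
        memory {
                memory.limit_in_bytes = "4G";
                memory.memsw.limit_in_bytes = "4G";
        }
}

and in cgrules.conf:

@users  cpuset,cpu,cpuacct,memory  users/%u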

Now, let's move on to how I configured PAM for sshd so that users come under cgroups right after they log in to the system via ssh. I appended this to /etc/pam.d/sshd:

session    required     pam_limits.so
session    optional     pam_cgroup.so
session    optional     pam_exec.so seteuid /usr/local/bin/oom_notification.sh


Interesting. What is that last line? It tells the PAM module pam_exec to execute a script at every PAM session event, such as a successful login. Here it runs the script oom_notification.sh. Let's see what I put in that script.

[root@login-0-0 ~]# cat /usr/local/bin/oom_notification.sh
#!/bin/bash

# Written by Sreedhar Manchu (Sreedhar@nyu.edu)
# June 3rd, 2014

if [ "$PAM_USER" = "root" ]
then
  exit 0
else
 if [ "$PAM_TYPE" = "open_session" ]
 then
   {
  if [ $(ps -ef | grep "/usr/local/bin/oom_notification /cgroup/cpu_and_mem/users/$PAM_USER/cgroup.event_control /cgroup/cpu_and_mem/users/$PAM_USER/memory.oom_control $PAM_USER" | grep -v grep > /dev/null) $? -ne 0 ];then
   /usr/local/bin/oom_notification /cgroup/cpu_and_mem/users/$PAM_USER/cgroup.event_control /cgroup/cpu_and_mem/users/$PAM_USER/memory.oom_control $PAM_USER &
  fi
   }
 elif [ "$PAM_TYPE" = "close_session" ]
 then
  if [ $(who | awk '{print $1}' | grep "^$PAM_USER$" | wc -l) -eq 0 ];then
   kill -9 $(ps -ef | grep "/usr/local/bin/oom_notification /cgroup/cpu_and_mem/users/$PAM_USER/cgroup.event_control /cgroup/cpu_and_mem/users/$PAM_USER/memory.oom_control $PAM_USER" | grep -v grep | awk '{print $2}')
  fi
 fi 
fi
exit 0


So, what am I doing in this script? Simple. It checks whether this is an open session, meaning the user just logged in, and also checks whether an oom_notification process is already watching his/her cgroup memory usage (I will get back to this command in a bit). If both checks pass, it starts the oom_notification executable that watches the user's memory usage.

If it is a close session, meaning the user is logging out, it checks whether the user has another session going on; if not, it simply kills the oom_notification executable that was watching the memory usage. Simple enough.

Now let's talk about the oom_notification executable. As an admin, I wanted to send notifications to users whenever one of their processes gets killed by a cgroup (when their usage touches 4GB), so I had to use cgroup event notifications. The Red Hat documentation has example code written in C, and I modified it a bit so that it runs one of my bash scripts. Here is the C code.

[2015-09-15 20:39:55:7787 root@master post]# cat oom_notification.c
/* Copied from access.redhat.com and modified by Sreedhar Manchu (Sreedhar@nyu.edu) */
/* June 3rd, 2014 */
/* Compile it simply by running gcc oom_notification.c -o oom_notification */

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/eventfd.h>
#include <errno.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>   /* uint64_t */
#include <unistd.h>   /* read, write, close */

static inline void die(const char *msg)
{
 fprintf(stderr, "error: %s: %s(%d)\n", msg, strerror(errno), errno);
 exit(EXIT_FAILURE);
}

static inline void usage(void)
{
 fprintf(stderr, "usage: oom_notification <cgroup.event_control> <memory.oom_control> <user>\n");
 exit(EXIT_FAILURE);
}

#define BUFSIZE 256

int main(int argc, char *argv[])
{
 char cmd[80]={0x0};
 char buf[BUFSIZE];
 int efd, cfd, ofd, wb;
 uint64_t u;

 if (argc != 4)
  usage();

 if ((efd = eventfd(0, 0)) == -1)
  die("eventfd");

 if ((cfd = open(argv[1], O_WRONLY)) == -1)
  die("cgroup.event_control");

 if ((ofd = open(argv[2], O_RDONLY)) == -1)
  die("memory.oom_control");

 if ((wb = snprintf(buf, BUFSIZE, "%d %d", efd, ofd)) >= BUFSIZE)
  die("buffer too small");

 if (write(cfd, buf, wb) == -1)
  die("write cgroup.event_control");

 if (close(cfd) == -1)
  die("close cgroup.event_control");

 for (;;) {
  if (read(efd, &u, sizeof(uint64_t)) != sizeof(uint64_t))
   die("read eventfd");

  sprintf(cmd,"bash /usr/local/bin/emailUserUponOOM.sh %s",argv[3]);
  system(cmd);
 }

 return 0;
}
[2015-09-15 22:01:35:7793 root@master post]# gcc -o oom_notification oom_notification.c


As you see, as soon as a user's memory usage touches 4GB (or whatever you have in cgconfig.conf), the kernel OOM-kills a process in the group (the one using the most memory, I believe), the eventfd notification fires, and the program runs the bash script /usr/local/bin/emailUserUponOOM.sh with its third argument. As shown above, that third argument is the userid (it is $PAM_USER, the userid of the user who logged in via ssh).

/usr/local/bin/oom_notification /cgroup/cpu_and_mem/users/$PAM_USER/cgroup.event_control /cgroup/cpu_and_mem/users/$PAM_USER/memory.oom_control $PAM_USER &

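Incidentally, you can peek at the notification side of things directly: memory.oom_control in each group shows whether the OOM killer is enabled for it (oom_kill_disable) and whether the group is currently out of memory (under_oom).

cat /cgroup/cpu_and_mem/users/sm4082/memory.oom_control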

Now, let's check what emailUserUponOOM.sh does.

[root@login-0-0 ~]# cat /usr/local/bin/emailUserUponOOM.sh 
#!/bin/bash

# Written by Sreedhar Manchu (Sreedhar@nyu.edu)
# June 3rd, 2014

screenMessage="
Hello HPC user,

Please do not run jobs on the login nodes as it destabilizes the nodes, and can 
cause issues for other users on the system. For short jobs and for debugging 
programs, please use the interactive queue:
qsub -I -X -l nodes=1:ppn=8,walltime=04:00:00

For more information, please refer to the wiki:
https://wikis.nyu.edu/display/NYUHPC/Running+jobs

For information on assessing job requirements, please see:
https://wikis.nyu.edu/display/NYUHPC/Finding+usage+information

Thank you,
NYU HPC team
"

for tscreen in $(who | grep "^$1 " | awk '{print $2}');do
 /usr/bin/write $1 /dev/$tscreen << END_L
$screenMessage
END_L
done

    /usr/sbin/sendmail -t -i << END_L
To: $1@nyu.edu
Bcc: hpc@nyu.edu
From: NYU HPC team <hpc@nyu.edu>
Subject: $(hostname -s): Please do not run jobs on login nodes
$screenMessage
END_L


As you can see, it simply sends an email explaining how to submit an interactive job rather than running it on a login node, and then prints the same message on the terminal screen. The reason I'm dumping this on the screen is that people simply don't bother to check email, or they send it to trash using filters. My hope is that it annoys them so much that they submit a batch job instead of running it on the login node itself.

Ok, let's test it. I'm going to run a job that uses a lot of memory. Here is the program I'm going to use.

[sm4082@login-0-1 ~]$ cat mem-hog.c 
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define KB (1024)
#define MB (1024 * KB)
#define GB (1024 * MB)

int main(int argc, char *argv[])
{
 char *p;

again:
 while ((p = (char *)malloc(GB)))
  memset(p, 0, GB);

 while ((p = (char *)malloc(MB)))
  memset(p, 0, MB);

 while ((p = (char *)malloc(KB)))
  memset(p, 0, KB);

 sleep(1);

 goto again;

 return 0;
}


Let's compile and run it.

[sm4082@login-0-1 ~]$ gcc -o mem-hog mem-hog.c 
[sm4082@login-0-1 ~]$ ./mem-hog

Message from sm4082@login-0-1.local (as root) on <no tty> at 20:50 ...

Hello HPC user,

Please do not run jobs on the login nodes as it destabilizes the nodes, and can 
cause issues for other users on the system. For short jobs and for debugging 
programs, please use the interactive queue:
qsub -I -X -l nodes=1:ppn=8,walltime=04:00:00

For more information, please refer to the wiki:
https://wikis.nyu.edu/display/NYUHPC/Running+jobs

For information on assessing job requirements, please see:
https://wikis.nyu.edu/display/NYUHPC/Finding+usage+information

Thank you,
NYU HPC team

EOF
Killed


That was from the terminal. I got an email as well, and it was copied to the hpc group. Cool, ha!

Let me show you the cgroup structure itself.

[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/cpuset.mems
0-1
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/cpuset.cpus
0-17
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/cpu.shares 
1024
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/memory.limit_in_bytes 
139586437120
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/memory.memsw.limit_in_bytes 
139586437120
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/memory.usage_in_bytes       
4559634432
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/sm4082/tasks 
12500
42269
42289
42290
42291
42292
42293
42294
[root@login-0-0 ~]# ps -u sm4082
  PID TTY          TIME CMD
42289 ?        00:00:00 sshd
42294 pts/29   00:00:00 bash
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/sm4082/cpu.shares 
100
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/sm4082/cpuset.mems
0
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/sm4082/cpuset.cpus
0,2
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/sm4082/memory.limit_in_bytes 
4294967296
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/sm4082/memory.memsw.limit_in_bytes 
4294967296
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/sm4082/memory.usage_in_bytes       
2334720


Let's check what happens when the user runs an scp command.

[sm4082@login-0-0 sm4082]$ scp Matlab-2013b1.tar butinah.xxx.nyu.edu:/scratch/sm4082
sm4082@xxx.abudhabi.nyu.edu's password: 


[root@login-0-0 ~]# ps -fu sm4082
UID        PID  PPID  C STIME TTY          TIME CMD
sm4082    1292 42294  0 21:34 pts/29   00:00:00 scp Matlab-2013b1.tar xxx.abudhabi.nyu.edu:/scratch/sm4082
sm4082    1294  1292  0 21:34 pts/29   00:00:00 /usr/bin/ssh -x -oForwardAgent=no -oPermitLocalCommand=no -oClearAllForwardings=yes -- xxx.abudhabi.nyu.edu scp -t /scratch/sm4082
sm4082   42289 42269  0 20:59 ?        00:00:00 sshd: sm4082@pts/29
sm4082   42294 42289  0 20:59 pts/29   00:00:00 -bash
[root@login-0-0 ~]# grep 1292 /cgroup/cpu_and_mem/users/sm4082/tasks
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/commands/tasks 
1292
13068
35142 


As you see, the process was moved out of users/sm4082 and into commands.

Ok. Now let's check which cores the processes are running on. In cgconfig.conf, I have cpuset.mems="0"; cpuset.cpus="0,2";. This means all my processes should be on cores 0 and 2, which are on memory node 0. Let's check.

[root@login-0-0 ~]# top -u sm4082 -c
top - 21:41:58 up 31 days,  4:15, 49 users,  load average: 0.33, 0.42, 0.40
Tasks: 1104 total,   1 running, 1088 sleeping,  14 stopped,   1 zombie
Cpu(s):  2.4%us,  1.7%sy,  0.0%ni, 95.4%id,  0.5%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  198441680k total, 58340200k used, 140101480k free,   249720k buffers
Swap:  1048572k total,   592560k used,   456012k free, 47002028k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+   P COMMAND                                                                                                                   
42289 sm4082    20   0  162m 3760 1312 S  0.0  0.0   0:00.07  2 sshd: sm4082@pts/29                                                                                                        
42294 sm4082    20   0  105m 1980 1464 S  0.0  0.0   0:00.08  0 -bash 


As you can see (the P column), the processes are on cores 0 and 2.
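Another quick way to confirm the pinning, without going through top, is to ask the kernel for the allowed CPU list of one of the user's processes; both of the commands below should report 0,2 for the bash process shown above:

taskset -cp 42294
grep Cpus_allowed_list /proc/42294/status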

Finally, I would like to show how I automated cgconfig.conf and cgrules.conf on the master node. These files are pushed onto the login nodes. Every time a user gets created on the master node, my wrapper script around useradd runs and appends the lines these two files need for cgroups.

[2015-09-15 21:45:27:7789 root@master post]# cat /usr/sbin/useradd
#!/bin/bash

# By Sreedhar Manchu (Sreedhar@nyu.edu)

args=("$@")
# only treat this as a regular user account when useradd is called with -g 100
# (the users group) and the login name looks like a netid (letters then digits)
for((arg=0;arg<$#;arg++))
do
 if [ "${args[$arg]}" = "-g" -a "${args[$((arg+1))]}" = "100" ]
 then
  if [ $(echo ${args[9]} | egrep "^[a-z]{1,9}[0-9]{1,4}$" > /dev/null 2>&1) $? -eq 0 ]
  then 
   if [ $(egrep "^${args[9]}:" /etc/passwd >/dev/null 2>&1) $? -ne 0 ]
   then
    isUser="yes"
    sleep $(awk -v seed=$RANDOM 'BEGIN{srand(seed); print rand()}')
    source /usr/sbin/cgVariables.txt
    p10=$p1;q10=$q1;p20=$p2;q20=$q2;k10=$k1;k20=$k2
    p1=$[p1+1];q1=$[q1+1];
    if [ $q1 -eq 18 ];then
     p1=0;q1=2
    fi
    p2=$[p2+1];q2=$[q2+1];
    if [ $q2 -eq 10 ];then
     p2=0;q2=2
    fi
    if [ $((q1%2)) -eq 0 ];then
     k1=0
    else
     k1=1
    fi
    if [ $((q2%2)) -eq 0 ];then
     k2=0
    else
    k2=1
    fi
    sed -i "s/p1=$p10;q1=$q10;p2=$p20;q2=$q20;k1=$k10;k2=$k20/p1=$p1;q1=$q1;p2=$p2;q2=$q2;k1=$k1;k2=$k2/" /usr/sbin/cgVariables.txt
    break
   fi
  fi
 fi
done

if [ "$isUser" = "yes" ]
then
 set -- "${@:1:7}" "${@:8:1}" "${@:10}"
 /usr/sbin/useradd.original "$@"
 /usr/sbin/usermod -p '!!' ${args[9]}
 /usr/bin/timeout -s KILL 2 /bin/df -h 2>/dev/null | grep ' /scratch$' >/dev/null 2>&1
 if [ $? -eq 0 ];then
  if [ ! -d /scratch/${args[9]} ];then
   /bin/mkdir /scratch/${args[9]}
   /bin/chown ${args[9]}:users /scratch/${args[9]}
   /bin/chmod 700 /scratch/${args[9]}
   /usr/bin/lfs setquota -u ${args[9]} -b 5368709120 -B 6442450944 -i 1000000 -I 1001000 /scratch
  fi
 fi

 /bin/df -h /home/${args[9]} >/dev/null 2>&1
 if [ $? -ne 0 ];then
  sleep 600
 fi
 /bin/df -h /home/${args[9]} >/dev/null 2>&1
 if [ $? -eq 0 ];then
  /bin/cp -r /etc/skel/. /home/${args[9]}/
  /bin/chown -R ${args[9]}:users /home/${args[9]}
 fi

 echo -e "group users/${args[9]} {\n\
\tcpuset {\n\
\t\tcpuset.mems=\"${k1}\";\n\
\t\tcpuset.cpus=\"${p1},${q1}\";\n\
\t}\n\
\tcpu {\n\
\t\tcpu.shares = \"100\";\n\
\t}\n\
\tcpuacct {\n\
\t\tcpuacct.usage = \"0\";\n\
\t}\n\
\tmemory {\n\
\t\tmemory.limit_in_bytes = \"4G\";\n\
\t\tmemory.memsw.limit_in_bytes = \"4G\";\n\
\t}\n\
}\n" >> /var/411/groups/login01/etc/cgconfig.conf

 echo -e "group users/${args[9]} {\n\
\tcpuset {\n\
\t\tcpuset.mems=\"${k2}\";\n\
\t\tcpuset.cpus=\"${p2},${q2}\";\n\
\t}\n\
\tcpu {\n\
\t\tcpu.shares = \"100\";\n\
\t}\n\
\tcpuacct {\n\
\t\tcpuacct.usage = \"0\";\n\
\t}\n\
\tmemory {\n\
\t\tmemory.limit_in_bytes = \"2G\";\n\
\t\tmemory.memsw.limit_in_bytes = \"2G\";\n\
\t}\n\
}\n" >> /var/411/groups/login23/etc/cgconfig.conf

 echo -e "${args[9]}\t\tcpuset,cpu,cpuacct,memory\t\tusers/${args[9]}" >> /var/411/groups/Login/etc/cgrules.conf

 /opt/rocks/bin/rocks run host login "/bin/cgcreate -g cpuset,cpu,cpuacct,memory:/users/${args[9]}"
 /opt/rocks/bin/rocks run host login-0-0 login-0-1 "/bin/cgset -r cpuset.mems=${k1} users/${args[9]}"
 /opt/rocks/bin/rocks run host login-0-2 login-0-3 "/bin/cgset -r cpuset.mems=${k2} users/${args[9]}"
 /opt/rocks/bin/rocks run host login-0-0 login-0-1 "/bin/cgset -r cpuset.cpus=${p1},${q1} users/${args[9]}"
 /opt/rocks/bin/rocks run host login-0-2 login-0-3 "/bin/cgset -r cpuset.cpus=${p2},${q2} users/${args[9]}"
 /opt/rocks/bin/rocks run host login "/bin/cgset -r cpu.shares=100 users/${args[9]}"
 /opt/rocks/bin/rocks run host login-0-0 login-0-1 "/bin/cgset -r memory.limit_in_bytes=4294967296 users/${args[9]}"
 /opt/rocks/bin/rocks run host login-0-0 login-0-1 "/bin/cgset -r memory.memsw.limit_in_bytes=4294967296 users/${args[9]}"
 /opt/rocks/bin/rocks run host login-0-2 login-0-3 "/bin/cgset -r memory.limit_in_bytes=2147483648 users/${args[9]}"
 /opt/rocks/bin/rocks run host login-0-2 login-0-3 "/bin/cgset -r memory.memsw.limit_in_bytes=2147483648 users/${args[9]}"

 #/opt/rocks/bin/rocks sync users

 exit 0;
else
 /usr/sbin/useradd.original "$@"
fi

[2015-09-15 21:45:34:7790 root@master post]# cat /usr/sbin/cgVariables.txt
p1=1;q1=3;p2=1;q2=3;k1=1;k2=1


We have 4 login nodes, and two pairs of them are similar to each other in terms of configuration. That is why I'm creating two different cgconfig.conf files: on one pair of login nodes each user gets 2GB, on the other it's 4GB. Sometimes bulk account creation happens on the master node. Since I wanted every account to get different cores in a cyclic order, I needed to introduce latency between useradd calls, and I had to update the core numbers in a file after each call to this wrapper. That is exactly what the first part of this script does.

With every new user, the values in cgVariables.txt keep changing. Using $RANDOM, I'm introducing a random fraction of a second of latency between calls. The variables in this file are then used to print the core numbers into cgconfig.conf.

At the beginning, whenever I had a new user addition, I pushed cgconfig.conf and cgrules.conf onto the login nodes and then tried to restart the cgconfig daemon. That didn't go well at all: it simply killed all the oom_notification executables that were watching users' memory usage and sent out emails. My bash script probably wasn't robust enough to handle that. Due to time constraints, I didn't bother to fix it, because I found another, more elegant solution that I think is much better than restarting the daemon all the time.

That elegant solution was to add the required lines to cgconfig.conf and cgrules.conf for new users and then create the required cgroup directory structure on the fly. This is taken care of by the last part of the script above. The configuration is still always there, so it comes into effect again on a system reboot.

For user deletion, I put a similar wrapper around userdel and that deletes the configuration from both files.

[2015-09-15 21:50:04:7791 root@master post]# cat /usr/sbin/userdel
#!/bin/bash

# By Sreedhar Manchu (Sreedhar@nyu.edu)

args=("$@")
GIDS=$(id -G ${args[$#-1]})
for GID in $GIDS;do 
 if [ $GID -eq 100 ];then
  # remove mail
  rm -f /var/spool/mail/${args[$#-1]}
  # cleanup cgroup config files
  sed -i "/users\/${args[$#-1]} {/I,+16 d" /var/411/groups/login01/etc/cgconfig.conf
  sed -i "/users\/${args[$#-1]} {/I,+16 d" /var/411/groups/login23/etc/cgconfig.conf
  #sed -i "/users\/${args[$#-1]} {/{N;N;N;N;N;N;N;N;N;N;N;N;d}" /var/411/groups/Login/etc/cgconfig.conf
  sed -i "/users\/${args[$#-1]}$/d" /var/411/groups/Login/etc/cgrules.conf
  # mv scratch directory
  mv /scratch/${args[$#-1]} /scratch/expired/${args[$#-1]}
  break
 fi
done
/usr/sbin/userdel.original "$@"


But here I'm not bothering to delete the cgroup directories that were created on demand, because at some point we're going to reboot the login nodes and that will simply take care of it, since the configuration no longer has these groups.

I am using rocks 411 service to push the configuration files onto login nodes.

Friday, May 8, 2015

Consolidating wtmp log across all login nodes on Rocks clusters: run a command only for the first login in a given day across all the login nodes

Our stakeholders wanted to see their quotas and usage on the different filesystems as soon as they logged in to the cluster. We have 4 login nodes and users get distributed across them in round-robin fashion (round-robin DNS).

So I wrote a script in bash and added a few lines of code to /etc/profile and /etc/csh.login to run it whenever users log in. Everything was great, except that the script ran for everyone, every time they logged in. Not ideal.

Then I wanted to make sure this script would not run for user root, nor when root becomes another user via su. I also wanted to show the quota only once a day, so I changed the code in /etc/profile and /etc/csh.login again. Finally, users would see it once a day on any given login node. Still, I wasn't happy, because a user who logs in multiple times lands on different login nodes; that means s/he could see the quota and usage information up to 4 times after logging in 4 times onto 4 different login nodes.

So I wanted to consolidate the wtmp logs from all the login nodes and use that information instead. After some subtle changes, everything seems to work well. Since I am consolidating wtmp every minute via a cron job, the only time this doesn't work is when a user logs in multiple times within a minute and ends up on different login nodes; in that case s/he will still see the quota and usage information more than once.

On all login nodes we need to add this code at the end of /etc/profile to make myquota run only for the first login in a day for any user.

# skip root, anyone whose primary group is gid 10 (wheel/admins), and sessions
# where root has su'ed to the user (who am i would not match $USER)
if [ $USER != root -a `id -G $USER | cut -f1 -d" "` -ne 10 -a $(who am i | awk '{print $1}') = $USER ];then
 if [ -z "$(last -2 -f /share/apps/admins/wtmp.merged $USER | awk 'NR==2')" ];then
  myquota
  echo -e "See this report at any time with `tput bold`'myquota'`tput sgr0`\n"
 else
  # the newest login in the merged wtmp is from a different day, so double-check
  # the local wtmp (the merged copy can be up to a minute stale)
  if [ $(last -F -f /share/apps/admins/wtmp.merged $USER | sort -k1.48,1.49n | awk 'END{print $6}') -ne $(date '+%_d') ];then
   if [ -z "$(last -2 $USER | awk 'NR==2')" ];then
    myquota
    echo -e "See this report at any time with `tput bold`'myquota'`tput sgr0`\n"
   else
    if [ $(last -2 $USER | awk 'NR==2{print $6}') -ne $(date '+%_d') ];then
     myquota
     echo -e "See this report at any time with `tput bold`'myquota'`tput sgr0`\n"
    fi
   fi
  fi
 fi
fi


On all login nodes we need to add this code to end of /etc/csh.login to make myquota run only for the first login in a day for any user.

if ( $USER != root && `id -G $USER | cut -f1 -d" "` != 10 && `who am i | awk '{print $1}'` == $USER ) then
 if ( "`last -2 -f /share/apps/admins/wtmp.merged $USER | awk 'NR==2'`" == "" ) then
  myquota
  echo "See this report at any time with `tput bold`'myquota'`tput sgr0`"
  echo
 else
  if ( `last -F -f /share/apps/admins/wtmp.merged $USER | sort -k1.48,1.49n | awk 'END{print $6}'` != `date '+%_d'` ) then
   if ( "`last -2 $USER | awk 'NR==2'`" == "" ) then
    myquota
    echo "See this report at any time with `tput bold`'myquota'`tput sgr0`"
    echo
   else
    if ( `last -2 $USER | awk 'NR==2{print $6}'` != `date '+%_d'` ) then
     myquota
     echo "See this report at any time with `tput bold`'myquota'`tput sgr0`"
     echo
    endif
   endif
  endif
 endif
endif


On the master node, I have this in my crontab.

* * * * * (/opt/rocks/bin/tentakel -glogin 'cp /var/log/wtmp /share/apps/admins/wtmp.$(hostname -s)';/bin/cat /share/apps/admins/wtmp.login-0-[0-3] > /share/apps/admins/wtmp.merged)>/dev/null 2>&1
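
If tentakel isn't available, the same consolidation can be done from the master node with a plain loop over the login nodes; a rough equivalent of the one-liner above (node names as used throughout this post):

#!/bin/bash
# pull each login node's wtmp and glue the copies together for 'last -f' to read
for n in login-0-0 login-0-1 login-0-2 login-0-3; do
 scp -q $n:/var/log/wtmp /share/apps/admins/wtmp.$n
done
cat /share/apps/admins/wtmp.login-0-[0-3] > /share/apps/admins/wtmp.merged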