Wednesday, September 16, 2015

Control Groups: CPU Shares explained

Almost two years back, I implemented control groups on our HPC cluster login nodes. Even though we gave equal shares to each user on those systems, it always intrigued me how CPU shares actually work. Eventually, I got a chance to test them on a host I was building for our new cluster. Anyway, let's dive in.

Let's say we want to create groups with a share structure like this:

Soon, I will come to cgconfig.conf and cgrules.conf, which show the configuration and how we connect it to users. In the meantime, let me explain what this picture means. There is a control group "sm4082" within another control group "users". Within "sm4082" there are two groups, "important" and "normal" (by mistake I typed it as Normal in the picture). Within "normal" there are two more groups, "high" and "low".

Now, let's see how shares work and how numbers like 1024 and 512 get translated into percentages of CPU time. By the way, I forgot to mention the most important thing: the default CPU share is 1024, and a lone group with 1024 shares gets 100% of the CPU. Shares are relative weights, so whatever number you give for a group's CPU share, it gets translated into a percentage in relation to 1024 and to the shares of the other groups competing with it.

Let's do the first triangle at the top. That'd be

Total=1024+512+512=2048

Let's say there are 3 processes running: one in the "sm4082" group, another in "important", and the last one in "normal". Importantly, all 3 processes are on the same core. Their percentages of CPU time would be

For the process in sm4082: 1024/2048=0.5 or 50%

For the process in important: 512/2048=0.25 or 25%

For the process in normal: 512/2048=0.25 or 25%
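The arithmetic above can be sketched in a few lines of Python (illustrative only; cgroups don't expose any such function, this just mirrors the weight math):

```python
# CPU shares are relative weights: an entity's slice of the CPU is its
# share divided by the total shares of everything competing at the
# same level.
def cpu_fraction(shares):
    total = sum(shares.values())
    return {name: s / total for name, s in shares.items()}

# One process each in sm4082 itself, important, and normal, all pinned
# to the same core (the first triangle):
print(cpu_fraction({"sm4082": 1024, "important": 512, "normal": 512}))
# {'sm4082': 0.5, 'important': 0.25, 'normal': 0.25}
```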

All good. But what happens when two more processes run at the same time, one in "high" and the other in "low"?

Now what happens with the percentages? Let me show this in a picture.

Let's see whether we can come up with the same percentages in our calculations. We know that sm4082 gets 50% and that important and normal get 25% each. Now all three of normal, high, and low share that 25% of CPU time. So, what would be their individual shares? Let's assume each of these groups has one process in it.

Total=512+1024+512=2048

For a process in normal: 512/2048=1/4 or 25%

This is 25% of the CPU time "normal" gets, and "normal" itself gets 25% of total CPU time, as we showed above. So, it'd be

25% of 25% or (1/4)*(1/4)=1/16 or 6.25% of total cpu time when there is a single process in all the groups.

For a process in high: 1024/2048=1/2 or 50%

50% of 25% or (1/2)*(1/4)=1/8 or 12.5% of total cpu time when there is a single process in all the groups.

Similarly, for a process in low: 512/2048=1/4 or 25%

25% of 25% or (1/4)*(1/4)=1/16 or 6.25% of total cpu time when there is a single process in all the groups.

But the picture shows different percentages. How come? By the way, the percentages in the picture came from the system while I ran the jobs, so they are correct. Then where did our calculations go wrong?

It seems that for each triangle, whatever the value at its top, we need to treat it as 1024. What? Yes, that's right. In fact, it makes sense: the groups below are dividing up their parent's allocation, which is 100% from their point of view, and 100% corresponds to 1024 shares. So, in our calculations we need to put 1024 instead of 512. Let's see what values we get:


Total=1024+1024+512=2560

For a process in normal: 1024/2560=2/5 or 40%

This is 40% of the CPU time "normal" gets, and "normal" itself gets 25% of total CPU time, as we showed above. So, it'd be

40% of 25% or (2/5)*(1/4)=1/10 or 10% of total cpu time when there is a single process in all the groups.

For a process in high: 1024/2560=2/5 or 40%

40% of 25% or (2/5)*(1/4)=1/10 or 10% of total cpu time when there is a single process in all the groups.

Similarly, for a process in low: 512/2560=1/5 or 20%

20% of 25% or (1/5)*(1/4)=1/20 or 5% of total cpu time when there is a single process in all the groups.
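The corrected rule—treat a group's own tasks as a 1024-weight entity among its children—can be checked with the same kind of sketch (illustrative Python, not anything cgroups ships):

```python
# Within the "normal" triangle, a task running directly in "normal"
# competes against the child groups "high" and "low" with the default
# weight of 1024, regardless of normal's configured cpu.shares of 512.
total = 1024 + 1024 + 512              # normal's own task + high + low
inner = {"normal": 1024 / total,       # 2/5 = 0.40
         "high":   1024 / total,       # 2/5 = 0.40
         "low":     512 / total}       # 1/5 = 0.20

# "normal" as a whole still gets 25% of the core (from the first
# triangle), so scale each inner fraction by 0.25:
overall = {name: frac * 0.25 for name, frac in inner.items()}
# normal: 10%, high: 10%, low: 5% of total CPU time
```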

Now all of our numbers match the actual numbers I got from the system by running jobs. That is

Now, let's look at cgconfig.conf and cgrules.conf

[root@login-0-0 ~]# cat /etc/cgconfig.conf 
#
#  Copyright IBM Corporation. 2007
#
#  Authors: Balbir Singh <balbir@linux.vnet.ibm.com>
#  This program is free software; you can redistribute it and/or modify it
#  under the terms of version 2.1 of the GNU Lesser General Public License
#  as published by the Free Software Foundation.
#
#  This program is distributed in the hope that it would be useful, but
#  WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# See man cgconfig.conf for further details.
#
# By default, mount all controllers to /cgroup/<controller>

#mount {
# cpuset = /cgroup/cpuset;
# cpu = /cgroup/cpu;
# cpuacct = /cgroup/cpuacct;
# memory = /cgroup/memory;
# devices = /cgroup/devices;
# freezer = /cgroup/freezer;
# net_cls = /cgroup/net_cls;
# blkio = /cgroup/blkio;
#}

mount {
 cpuset = /cgroup/cpu_and_mem;
 cpu = /cgroup/cpu_and_mem;
 cpuacct = /cgroup/cpu_and_mem;
 memory = /cgroup/cpu_and_mem;
}

group users {
        cpuset {
                cpuset.mems = "0";
                cpuset.cpus = "0";
        }
        cpu {
                cpu.shares = "1024";
        }
        cpuacct {
                cpuacct.usage = "0";
        }
        memory {
                memory.limit_in_bytes = "140G";
                memory.memsw.limit_in_bytes = "140G";
                memory.use_hierarchy = "1";
        }
}

group commands {
        cpuset {
                cpuset.mems = "0-1";
                cpuset.cpus = "0-17";
        }
        cpu {
                cpu.shares = "1024";
        }
        cpuacct {
                cpuacct.usage = "0";
        }
        memory {
                memory.limit_in_bytes = "40G";
                memory.memsw.limit_in_bytes = "40G";
                memory.use_hierarchy = "1";
        }
}

template commands/%u {
        cpuset {
                cpuset.mems = "0-1";
                cpuset.cpus = "0-17";
        }
        cpu {
                cpu.shares = "100";
        }
        cpuacct {
                cpuacct.usage = "0";
        }
        memory {
                memory.limit_in_bytes = "10G";
                memory.memsw.limit_in_bytes = "10G";
        }
}

group users/sm4082 {
        cpuset {
                cpuset.mems = "0";
                cpuset.cpus = "0";
        }
        cpu {
                cpu.shares = "1024";
        }
        cpuacct {
                cpuacct.usage = "0";
        }
        memory {
                memory.limit_in_bytes = "10G";
                memory.memsw.limit_in_bytes = "10G";
        }
}

group users/sm4082/important {
        cpuset {
                cpuset.mems = "0";
                cpuset.cpus = "0";
        }
        cpu {
                cpu.shares = "512";
        }
}

group users/sm4082/normal {
        cpuset {
                cpuset.mems = "0";
                cpuset.cpus = "0";
        }
        cpu {
                cpu.shares = "512";
        }
}

group users/sm4082/normal/high {
        cpuset {
                cpuset.mems = "0";
                cpuset.cpus = "0";
        }
        cpu {
                cpu.shares = "1024";
        }
}

group users/sm4082/normal/low {
        cpuset {
                cpuset.mems = "0";
                cpuset.cpus = "0";
        }
        cpu {
                cpu.shares = "512";
        }
}

[root@login-0-0 ~]# cat /etc/cgrules.conf 
# /etc/cgrules.conf
#The format of this file is described in cgrules.conf(5)
#manual page.
#
# Example:
#<user>  <controllers> <destination>
#@student cpu,memory usergroup/student/
#peter  cpu  test1/
#%  memory  test2/
# End of file
*:scp  cpuset,cpu,cpuacct,memory  commands/%u
#@users:matho-primes cpuset,cpu,cpuacct,memory               commands/%u 
sm4082  cpuset,cpu,cpuacct,memory  users/sm4082


As user sm4082:

-bash-4.1$ echo $USER
sm4082
-bash-4.1$ matho-primes 0 9999999999 >/dev/null &
[1] 68686
-bash-4.1$ matho-primes 0 9999999999 >/dev/null &
[2] 68687
-bash-4.1$ matho-primes 0 9999999999 >/dev/null &
[3] 68688
-bash-4.1$ matho-primes 0 9999999999 >/dev/null &
[4] 68689
-bash-4.1$ matho-primes 0 9999999999 >/dev/null &
[5] 68691
-bash-4.1$ 
As user root:

[root@login-0-0 ~]# echo 68687 > /cgroup/cpu_and_mem/users/sm4082/important/tasks 
[root@login-0-0 ~]# echo 68688 > /cgroup/cpu_and_mem/users/sm4082/normal/tasks 
[root@login-0-0 ~]# echo 68689 > /cgroup/cpu_and_mem/users/sm4082/normal/high/tasks 
[root@login-0-0 ~]# echo 68691 > /cgroup/cpu_and_mem/users/sm4082/normal/low/tasks 
[root@login-0-0 ~]# top -u sm4082


Finally, I would like to mention a scenario where there are 4 processes in total. Of these 4, the first two are in sm4082 and the other two are in important and normal, one in each. There are no processes in high or low.

What happens to percentages now?

[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/sm4082/tasks 
68702
68705
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/sm4082/important/tasks 
68687
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/sm4082/normal/tasks    
68688
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/sm4082/normal/high/tasks 
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/sm4082/normal/low/tasks  
[root@login-0-0 ~]#


At first, I thought the two processes in sm4082 would get 25% each (half of sm4082's 50%), and the processes in important and normal would get their usual 25% each. But as you can see in the picture, it wasn't like that. So, what's happening here?

It turns out CPU share percentages are never computed by treating the whole group as a single entity; they are computed per process. So here, there are two processes in sm4082. These two processes have the same priority, and each of them competes with its own weight of 1024 at that level. This changes our calculation completely. Let's see how.

Total=2*1024+512+512=3072

There are 2 processes running in the "sm4082" group, one in "important", and one in "normal". Their percentages of CPU time would be

For each process in sm4082: 1024/3072=1/3 or 33.3%

For the process in important: 512/3072=1/6 or 16.6%

For the process in normal: 512/3072=1/6 or 16.6%

These numbers match the ones in the picture above pretty well (small discrepancies are normal).
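The per-process weighting can be sketched the same way as before (again just illustrative arithmetic, not a cgroups API):

```python
# Each task sitting directly in sm4082 is its own 1024-weight entity,
# so two tasks contribute 2 * 1024 to the total at this level.
total = 2 * 1024 + 512 + 512      # two sm4082 tasks + important + normal
per_sm4082_task = 1024 / total    # 1/3, ~33.3% each
important = 512 / total           # 1/6, ~16.6%
normal = 512 / total              # 1/6, ~16.6%
```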

This is it. I hope all of it makes sense now. It does for me, for sure.
