Tuesday, September 15, 2015

Implementation of Control Groups (cgroups) on Rocks cluster (HPC) Login nodes

A few years ago, our HPC clusters ran CentOS 5.x. Users would log in to the login nodes and then submit jobs to the batch system. But sometimes users would start running jobs on the login nodes themselves. A heavy job could crash a login node, thereby blocking cluster access for everyone (since the only way to reach the cluster is through the login nodes). Since CentOS 5.x didn't support cgroups, I had to set up cron jobs that ran every minute to kill processes using more than a certain amount of memory.

Then, back in 2013, I finally got a chance to build a new cluster, and we ended up with CentOS 6.3, which supports cgroups. I wanted to implement cgroups on the login nodes. My goals were:

1) Every user gets only two cores and 2 to 4GB of memory, depending on which login node they're on
2) A way to couple cgroups with PAM
3) Send an email, and print the same message to the terminal, whenever a user's job running on a login node gets killed

This is what I did. Just to let you know, I use Rocks as a provisioning tool on our clusters.

[root@login-0-0 ~]# uname -a
Linux login-0-0.local 2.6.32-431.29.2.el6.x86_64 #1 SMP Tue Sep 9 21:36:05 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux


We need to install two packages. You can simply do yum install libcgroup libcgroup-pam

[root@login-0-0 ~]# rpm -qa | grep libcgroup
libcgroup-pam-0.40.rc1-6.el6_5.1.x86_64
libcgroup-0.40.rc1-6.el6_5.1.x86_64
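
With the packages in place, the cgconfig and cgred services (the latter is the rules-engine daemon that cgrules.conf relies on) need to be running and enabled at boot. A minimal sketch of that on CentOS 6, assuming the stock libcgroup service names:

# install the libcgroup userspace tools and the PAM module
yum install -y libcgroup libcgroup-pam

# start the cgroup config and rules-engine daemons and enable them at boot
service cgconfig start
service cgred start
chkconfig cgconfig on
chkconfig cgred on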


Next, we need to update cgconfig.conf in /etc. This is what I have

[root@login-0-0 ~]# more /etc/cgconfig.conf 
#
#  Copyright IBM Corporation. 2007
#
#  Authors: Balbir Singh <balbir@linux.vnet.ibm.com>
#  This program is free software; you can redistribute it and/or modify it
#  under the terms of version 2.1 of the GNU Lesser General Public License
#  as published by the Free Software Foundation.
#
#  This program is distributed in the hope that it would be useful, but
#  WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# See man cgconfig.conf for further details.
#
# By default, mount all controllers to /cgroup/<controller>

mount {
 cpuset = /cgroup/cpu_and_mem;
 cpu = /cgroup/cpu_and_mem;
 cpuacct = /cgroup/cpu_and_mem;
 memory = /cgroup/cpu_and_mem;
}

group users {
 cpuset {
                cpuset.mems="0-1";
                cpuset.cpus="0-17";
        }
        cpu {
                cpu.shares = "1024";
        }
 cpuacct {
  cpuacct.usage = "0";
 }
 memory {
  memory.limit_in_bytes = "130G";
  memory.memsw.limit_in_bytes = "130G";
  memory.use_hierarchy = "1";
 }
}

group commands {
 cpuset {
                cpuset.mems="0-1";
                cpuset.cpus="0-17";
        }
        cpu {
                cpu.shares = "1024";
        }
 cpuacct {
  cpuacct.usage = "0";
 }
 memory {
  memory.limit_in_bytes = "40G";
  memory.memsw.limit_in_bytes = "40G";
  memory.use_hierarchy = "1";
 }
}

group admins {
 cpuset {
                cpuset.mems="0-1";
                cpuset.cpus="18-19";
        }
        cpu {
                cpu.shares = "1024";
        }
 cpuacct {
  cpuacct.usage = "0";
 }
 memory {
  memory.limit_in_bytes = "10G";
  memory.memsw.limit_in_bytes = "10G";
  memory.use_hierarchy = "1";
 }
}

group users/sm4082 {
 cpuset {
  cpuset.mems="0";
  cpuset.cpus="0,2";
 }
 cpu {
  cpu.shares = "100";
 }
 cpuacct {
  cpuacct.usage = "0";
 }
 memory {
  memory.limit_in_bytes = "4G";
  memory.memsw.limit_in_bytes = "4G";
 }
}

group users/xx77 {
 cpuset {
  cpuset.mems="1";
  cpuset.cpus="1,3";
 }
 cpu {
  cpu.shares = "100";
 }
 cpuacct {
  cpuacct.usage = "0";
 }
 memory {
  memory.limit_in_bytes = "4G";
  memory.memsw.limit_in_bytes = "4G";
 }
}

group users/xx151 {
 cpuset {
  cpuset.mems="0";
  cpuset.cpus="2,4";
 }
 cpu {
  cpu.shares = "100";
 }
 cpuacct {
  cpuacct.usage = "0";
 }
 memory {
  memory.limit_in_bytes = "4G";
  memory.memsw.limit_in_bytes = "4G";
 }
}

[root@login-0-0 ~]# more /etc/cgrules.conf
# /etc/cgrules.conf
#
#Each line describes a rule for a user in the forms:
#
#<user>   <controllers>  <destination>
#<user>:<process name> <controllers>  <destination>
#
#Where:
# <user> can be:
#        - an user name
#        - a group name, with @group syntax
#        - the wildcard *, for any user or group.
#        - The %, which is equivalent to "ditto". This is useful for
#          multiline rules where different cgroups need to be specified
#          for various hierarchies for a single user.
#
# <process name> is optional and it can be:
#  - a process name
#  - a full command path of a process
#
# <controller> can be:
#   - comma separated controller names (no spaces)
#   - * (for all mounted controllers)
#
# <destination> can be:
#   - path with-in the controller hierarchy (ex. pgrp1/gid1/uid1)
#
# Note:
# - It currently has rules based on uids, gids and process name.
#
# - Don't put overlapping rules. First rule which matches the criteria
#   will be executed.
#
# - Multiline rules can be specified for specifying different cgroups
#   for multiple hierarchies. In the example below, user "peter" has
#   specified 2 line rule. First line says put peter's task in test1/
#   dir for "cpu" controller and second line says put peter's tasks in
#   test2/ dir for memory controller. Make a note of "%" sign in second line.
#   This is an indication that it is continuation of previous rule.
#
#
#<user>   <controllers>   <destination>
#
#john          cpu  usergroup/faculty/john/
#john:cp       cpu  usergroup/faculty/john/cp
#@student      cpu,memory usergroup/student/
#peter        cpu  test1/
#%        memory  test2/
#@root      *  admingroup/
#*  *  default/
# End of file
#@wheel  cpuset,cpu,cpuacct,memory               admins/%u
@wheel  cpuset,cpu,cpuacct,memory               admins
#@users:scp cpuset,cpu,cpuacct,memory               commands/%u
@users:scp cpuset,cpu,cpuacct,memory               commands
#@users:sftp cpuset,cpu,cpuacct,memory               commands/%u
@users:sftp cpuset,cpu,cpuacct,memory               commands
#@users:rsync cpuset,cpu,cpuacct,memory               commands/%u
@users:rsync cpuset,cpu,cpuacct,memory               commands
#@users:tar cpuset,cpu,cpuacct,memory               commands/%u
@users:tar cpuset,cpu,cpuacct,memory               commands
#@users:gzip cpuset,cpu,cpuacct,memory               commands/%u
@users:gzip cpuset,cpu,cpuacct,memory               commands
#@users:pigz cpuset,cpu,cpuacct,memory               commands/%u
@users:pigz cpuset,cpu,cpuacct,memory               commands
#@users:bzip cpuset,cpu,cpuacct,memory               commands/%u
@users:bzip cpuset,cpu,cpuacct,memory               commands
#@users:bzip2 cpuset,cpu,cpuacct,memory               commands/%u
@users:bzip2 cpuset,cpu,cpuacct,memory               commands
#@users:globus-url-copy cpuset,cpu,cpuacct,memory               commands/%u
@users:globus-url-copy cpuset,cpu,cpuacct,memory               commands
sm4082  cpuset,cpu,cpuacct,memory  users/sm4082
xx77  cpuset,cpu,cpuacct,memory  users/xx77
xx151  cpuset,cpu,cpuacct,memory  users/xx151



As you can see, I've mounted four controllers (cpuset, cpu, cpuacct, memory) into a single hierarchy, /cgroup/cpu_and_mem, and defined groups under it.

There are 20 cores on this machine, but all users share the first 18 cores (0-17). This is defined in the cpuset section of the users group. This is a NUMA machine with two memory nodes and two processor sockets. Then I defined a group for every user, named after their userid, under the users group. I went with the userid just to make things easier; the group name doesn't have to be the userid. You can name it whatever you want, but you then need to relate it to the userid in cgrules.conf.

For memory, I gave 130GB for memory.limit_in_bytes. Since I didn't want to allow any swap, I gave the same number for memory.memsw.limit_in_bytes. We have 50 to 60 active users at any given time on any of the 4 available login nodes, and I gave each user 4GB. Most importantly, I had to set memory.use_hierarchy to 1. Otherwise, the cumulative memory usage of all users could go beyond 130GB and crash the system. With this value set to 1, the cumulative usage cannot exceed 130GB.
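
If you want to verify that the limits and the hierarchy flag took effect after (re)starting cgconfig, libcgroup's cgget can read them back; a quick sketch, using the group name defined above:

# read back the effective memory settings on the parent "users" group;
# 130G shows up as 139586437120 bytes
cgget -r memory.limit_in_bytes -r memory.memsw.limit_in_bytes -r memory.use_hierarchy users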

Then, for the cpu controller I gave a default share of 1024 to each of users, admins and commands, but within users it was set to 100 for everyone, so every user gets an equal share of CPU time. For cpuset, every user gets only two cores out of 0-17. Of course, I made sure that both of those cores are on the same memory node and processor socket. You can check this with this command.

[root@login-0-0 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18
node 0 size: 98258 MB
node 0 free: 53554 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19
node 1 size: 98304 MB
node 1 free: 82944 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10  
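
Another quick way to see which cores belong to which memory node (so that a user's two cores land on the same socket) is to read the per-node cpulist files in sysfs:

# CPUs owned by each NUMA node
cat /sys/devices/system/node/node0/cpulist
cat /sys/devices/system/node/node1/cpulist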


Let's move on to cgrules.conf. This is the file that maps each user to a group defined in cgconfig.conf. For example, for user sm4082 there is a group users/sm4082 in cgconfig.conf. In cgrules.conf, there is a line that says user sm4082 is under the controllers cpuset, cpu, cpuacct, and memory, and the destination group is users/sm4082. The last field names the group to look up in cgconfig.conf.
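
An easy way to confirm the rules are actually applied is to log in as a regular user and check which cgroups the shell landed in; /proc/self/cgroup lists the group for each mounted hierarchy:

# run as the logged-in user; with the single cpu_and_mem hierarchy above,
# the controllers should all point at users/<userid>
cat /proc/self/cgroup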

Now, what about admins and commands? Sometimes commands like bzip2, gzip, tar, etc. take a lot of memory, and we didn't want cgroups to kill these. So I created another group called commands. Here, the total memory usage of all processes under this group can go up to 40GB; this value has nothing to do with the 130GB given to the users group. Then, in cgrules.conf, I defined which commands fall under this commands group.

Let's take a look at the command rules in cgrules.conf. @users:scp means that for every user belonging to the Linux group users, the scp command falls into the commands group. So, how does it work? There is a daemon, cgred, that keeps track of processes. As soon as a user starts scp, this daemon moves the scp pid into the commands group. Since this takes the process out of groups users/userid and users, its memory usage doesn't get counted against the user.

Finally, I defined another group called admins just for admins. All the admins that belong to the Linux group wheel come under this group.

You might ask why in the world I defined a section for each and every user in cgconfig.conf when I could have simply used a single rule like @users cpuset,cpu,cpuacct,memory users/%u. I wanted to give each user different cores, and unfortunately there is no way to do that with just one line. If I hadn't cared about the cpuset section being different for each user, then I could have used that one line. Even then, it didn't work in CentOS 6.3; it did work fine in CentOS 6.6 though.

Now, let's move on to how I configured PAM for sshd so that users land in their cgroups right after they log in via ssh. I appended this to /etc/pam.d/sshd:

session    required     pam_limits.so
session    optional     pam_cgroup.so
session    optional     pam_exec.so seteuid /usr/local/bin/oom_notification.sh


Interesting. What is the last line? It tells the pam_exec module to run a script on every PAM session event, such as a successful login. Here it runs the script oom_notification.sh. Let's see what I put in that script.

[root@login-0-0 ~]# cat /usr/local/bin/oom_notification.sh
#!/bin/bash

# Written by Sreedhar Manchu (Sreedhar@nyu.edu)
# June 3rd, 2014

if [ "$PAM_USER" = "root" ]
then
  exit 0
else
 if [ "$PAM_TYPE" = "open_session" ]
 then
   {
  if [ $(ps -ef | grep "/usr/local/bin/oom_notification /cgroup/cpu_and_mem/users/$PAM_USER/cgroup.event_control /cgroup/cpu_and_mem/users/$PAM_USER/memory.oom_control $PAM_USER" | grep -v grep > /dev/null) $? -ne 0 ];then
   /usr/local/bin/oom_notification /cgroup/cpu_and_mem/users/$PAM_USER/cgroup.event_control /cgroup/cpu_and_mem/users/$PAM_USER/memory.oom_control $PAM_USER &
  fi
   }
 elif [ "$PAM_TYPE" = "close_session" ]
 then
  if [ $(who | awk '{print $1}' | grep "^$PAM_USER$" | wc -l) -eq 0 ];then
   kill -9 $(ps -ef | grep "/usr/local/bin/oom_notification /cgroup/cpu_and_mem/users/$PAM_USER/cgroup.event_control /cgroup/cpu_and_mem/users/$PAM_USER/memory.oom_control $PAM_USER" | grep -v grep | awk '{print $2}')
  fi
 fi 
fi
exit 0


So, what am I doing in this script? Simple. It checks whether this is an open session, meaning the user just logged in, and also checks whether an oom_notification process is already watching that user's cgroup memory usage. I'll get back to this executable in a bit. If both checks pass, it starts the oom_notification executable, which watches the user's memory usage.

If it is a close session, meaning the user is logging out, it checks whether the user still has another session open, and if not, it simply kills the oom_notification executable that was watching the memory usage. Simple enough.

Now let's talk about the oom_notification executable. Since, as an admin, I wanted to notify users whenever one of their processes gets killed by the cgroup (when their usage hits the 4GB limit), I had to use cgroup event notifications. The Red Hat documentation has sample code written in C, and I modified it a bit so that it runs one of my bash scripts. Here is the C code.

[2015-09-15 20:39:55:7787 root@master post]# cat oom_notification.c
/* Copied from access.redhat.com and modified by Sreedhar Manchu (Sreedhar@nyu.edu) */
/* June 3rd, 2014 */
/* Compile it simply by running gcc oom_notification.c -o oom_notification */

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <errno.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static inline void die(const char *msg)
{
 fprintf(stderr, "error: %s: %s(%d)\n", msg, strerror(errno), errno);
 exit(EXIT_FAILURE);
}

static inline void usage(void)
{
 fprintf(stderr, "usage: oom_eventfd_test <cgroup.event_control> <memory.oom_control>\n");
 exit(EXIT_FAILURE);
}

#define BUFSIZE 256

int main(int argc, char *argv[])
{
 char cmd[80]={0x0};
 char buf[BUFSIZE];
 int efd, cfd, ofd, rb, wb;
 uint64_t u;

 if (argc != 4)
  usage();

 if ((efd = eventfd(0, 0)) == -1)
  die("eventfd");

 if ((cfd = open(argv[1], O_WRONLY)) == -1)
  die("cgroup.event_control");

 if ((ofd = open(argv[2], O_RDONLY)) == -1)
  die("memory.oom_control");

 if ((wb = snprintf(buf, BUFSIZE, "%d %d", efd, ofd)) >= BUFSIZE)
  die("buffer too small");

 if (write(cfd, buf, wb) == -1)
  die("write cgroup.event_control");

 if (close(cfd) == -1)
  die("close cgroup.event_control");

 for (;;) {
  if (read(efd, &u, sizeof(uint64_t)) != sizeof(uint64_t))
   die("read eventfd");

  sprintf(cmd,"bash /usr/local/bin/emailUserUponOOM.sh %s",argv[3]);
  system(cmd);
 }

 return 0;
}
[2015-09-15 22:01:35:7793 root@master post]# gcc -o oom_notification oom_notification.c


As you can see, as soon as a user's memory usage hits 4GB (or whatever limit you have in cgconfig.conf), the kernel's OOM killer kills one of their processes, the one using the most memory I believe, and the oom_notification executable wakes up and runs the bash script /usr/local/bin/emailUserUponOOM.sh with its third argument. If you look above, that third argument is a userid ($PAM_USER, the userid of the user logging in via ssh).

/usr/local/bin/oom_notification /cgroup/cpu_and_mem/users/$PAM_USER/cgroup.event_control /cgroup/cpu_and_mem/users/$PAM_USER/memory.oom_control $PAM_USER &
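
One thing worth noting: this only behaves as described while oom_kill_disable stays 0 (the default) in memory.oom_control, so the kernel still kills the offending task and the eventfd notification fires on top of that. You can peek at that file, and at whether the group is currently under OOM, directly:

# oom_kill_disable should be 0; under_oom shows the current OOM state
cat /cgroup/cpu_and_mem/users/sm4082/memory.oom_control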


Now, let's check what emailUserUponOOM.sh does.

[root@login-0-0 ~]# cat /usr/local/bin/emailUserUponOOM.sh 
#!/bin/bash

# Written by Sreedhar Manchu (Sreedhar@nyu.edu)
# June 3rd, 2014

screenMessage="
Hello HPC user,

Please do not run jobs on the login nodes as it destabilizes the nodes, and can 
cause issues for other users on the system. For short jobs and for debugging 
programs, please use the interactive queue:
qsub -I -X -l nodes=1:ppn=8,walltime=04:00:00

For more information, please refer to the wiki:
https://wikis.nyu.edu/display/NYUHPC/Running+jobs

For information on assessing job requirements, please see:
https://wikis.nyu.edu/display/NYUHPC/Finding+usage+information

Thank you,
NYU HPC team
"

for tscreen in $(who | grep "^$1 " | awk '{print $2}');do
 /usr/bin/write $1 /dev/$tscreen << END_L
$screenMessage
END_L
done

    /usr/sbin/sendmail -t -i << END_L
To: $1@nyu.edu
Bcc: hpc@nyu.edu
From: NYU HPC team <hpc@nyu.edu>
Subject: $(hostname -s): Please do not run jobs on login nodes
$screenMessage
END_L


As you can see, it simply sends an email explaining how to submit an interactive job rather than running it on a login node, and then prints the same message on the terminal screen. The reason I'm also dumping this on the screen is that people simply don't bother to check email, or they filter it straight into the trash. My hope is that it annoys them enough that they submit a batch job instead of running it on the login node itself.

Ok. Let's test it. I'm going to run a job that uses a lot of memory. Here is the program I'm going to use.

[sm4082@login-0-1 ~]$ cat mem-hog.c 
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define KB (1024)
#define MB (1024 * KB)
#define GB (1024 * MB)

int main(int argc, char *argv[])
{
 char *p;

again:
 while ((p = (char *)malloc(GB)))
  memset(p, 0, GB);

 while ((p = (char *)malloc(MB)))
  memset(p, 0, MB);

 while ((p = (char *)malloc(KB)))
  memset(p, 0, KB);

 sleep(1);

 goto again;

 return 0;
}


Let's compile and run it.

[sm4082@login-0-1 ~]$ gcc -o mem-hog mem-hog.c 
[sm4082@login-0-1 ~]$ ./mem-hog

Message from sm4082@login-0-1.local (as root) on <no tty> at 20:50 ...

Hello HPC user,

Please do not run jobs on the login nodes as it destabilizes the nodes, and can 
cause issues for other users on the system. For short jobs and for debugging 
programs, please use the interactive queue:
qsub -I -X -l nodes=1:ppn=8,walltime=04:00:00

For more information, please refer to the wiki:
https://wikis.nyu.edu/display/NYUHPC/Running+jobs

For information on assessing job requirements, please see:
https://wikis.nyu.edu/display/NYUHPC/Finding+usage+information

Thank you,
NYU HPC team

EOF
Killed


That was on the terminal. I got an email as well, and it was bcc'ed to the HPC group. Cool, huh?

Let me show you the cgroup structure itself.

[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/cpuset.mems
0-1
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/cpuset.cpus
0-17
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/cpu.shares 
1024
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/memory.limit_in_bytes 
139586437120
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/memory.memsw.limit_in_bytes 
139586437120
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/memory.usage_in_bytes       
4559634432
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/sm4082/tasks 
12500
42269
42289
42290
42291
42292
42293
42294
[root@login-0-0 ~]# ps -u sm4082
  PID TTY          TIME CMD
42289 ?        00:00:00 sshd
42294 pts/29   00:00:00 bash
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/sm4082/cpu.shares 
100
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/sm4082/cpuset.mems
0
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/sm4082/cpuset.cpus
0,2
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/sm4082/memory.limit_in_bytes 
4294967296
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/sm4082/memory.memsw.limit_in_bytes 
4294967296
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/users/sm4082/memory.usage_in_bytes       
2334720


Let's check what happens when a user runs the scp command.

[sm4082@login-0-0 sm4082]$ scp Matlab-2013b1.tar butinah.xxx.nyu.edu:/scratch/sm4082
sm4082@xxx.abudhabi.nyu.edu's password: 


[root@login-0-0 ~]# ps -fu sm4082
UID        PID  PPID  C STIME TTY          TIME CMD
sm4082    1292 42294  0 21:34 pts/29   00:00:00 scp Matlab-2013b1.tar xxx.abudhabi.nyu.edu:/scratch/sm4082
sm4082    1294  1292  0 21:34 pts/29   00:00:00 /usr/bin/ssh -x -oForwardAgent=no -oPermitLocalCommand=no -oClearAllForwardings=yes -- xxx.abudhabi.nyu.edu scp -t /scratch/sm4082
sm4082   42289 42269  0 20:59 ?        00:00:00 sshd: sm4082@pts/29
sm4082   42294 42289  0 20:59 pts/29   00:00:00 -bash
[root@login-0-0 ~]# grep 1292 /cgroup/cpu_and_mem/users/sm4082/tasks
[root@login-0-0 ~]# cat /cgroup/cpu_and_mem/commands/tasks 
1292
13068
35142 


As you can see, the scp process was moved out of users/sm4082 into commands.

Ok. Now, let's check which cores the processes are running on. In cgconfig.conf, I have cpuset.mems="0"; cpuset.cpus="0,2"; for this user. This means all my processes should be on cores 0 and 2, which are on memory node 0. Let's check.

[root@login-0-0 ~]# top -u sm4082 -c
top - 21:41:58 up 31 days,  4:15, 49 users,  load average: 0.33, 0.42, 0.40
Tasks: 1104 total,   1 running, 1088 sleeping,  14 stopped,   1 zombie
Cpu(s):  2.4%us,  1.7%sy,  0.0%ni, 95.4%id,  0.5%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  198441680k total, 58340200k used, 140101480k free,   249720k buffers
Swap:  1048572k total,   592560k used,   456012k free, 47002028k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+   P COMMAND                                                                                                                   
42289 sm4082    20   0  162m 3760 1312 S  0.0  0.0   0:00.07  2 sshd: sm4082@pts/29                                                                                                        
42294 sm4082    20   0  105m 1980 1464 S  0.0  0.0   0:00.08  0 -bash 


As you can see (the P column in the top output), the processes are on cores 0 and 2.
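
Another quick way to confirm the pinning is to ask for a process's CPU affinity with taskset, using the bash PID from the top output above:

# prints the affinity list imposed by the cpuset, e.g. "0,2"
taskset -cp 42294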

Finally, I would like to show how I automated updating cgconfig.conf and cgrules.conf on the master node; these files are then pushed onto the login nodes. Every time a user gets created on the master node, my wrapper script around useradd runs and generates the lines these two files need for cgroups.

[2015-09-15 21:45:27:7789 root@master post]# cat /usr/sbin/useradd
#!/bin/bash

# By Sreedhar Manchu (Sreedhar@nyu.edu)

args=("$@")
for((arg=0;arg<$#;arg++))
do
 if [ "${args[$arg]}" = "-g" -a "${args[$((arg+1))]}" = "100" ]
 then
  # username matches the expected pattern: letters followed by digits
  if echo ${args[9]} | egrep "^[a-z]{1,9}[0-9]{1,4}$" > /dev/null 2>&1
  then
   # and the account doesn't already exist
   if ! egrep "^${args[9]}:" /etc/passwd > /dev/null 2>&1
   then
    isUser="yes"
    sleep $(awk -v seed=$RANDOM 'BEGIN{srand(seed); print rand()}')
    source /usr/sbin/cgVariables.txt
    p10=$p1;q10=$q1;p20=$p2;q20=$q2;k10=$k1;k20=$k2
    p1=$[p1+1];q1=$[q1+1];
    if [ $q1 -eq 18 ];then
     p1=0;q1=2
    fi
    p2=$[p2+1];q2=$[q2+1];
    if [ $q2 -eq 10 ];then
     p2=0;q2=2
    fi
    if [ $((q1%2)) -eq 0 ];then
     k1=0
    else
     k1=1
    fi
    if [ $((q2%2)) -eq 0 ];then
     k2=0
    else
    k2=1
    fi
    sed -i "s/p1=$p10;q1=$q10;p2=$p20;q2=$q20;k1=$k10;k2=$k20/p1=$p1;q1=$q1;p2=$p2;q2=$q2;k1=$k1;k2=$k2/" /usr/sbin/cgVariables.txt
    break
   fi
  fi
 fi
done

if [ "$isUser" = "yes" ]
then
 set -- "${@:1:7}" "${@:8:1}" "${@:10}"
 /usr/sbin/useradd.original "$@"
 /usr/sbin/usermod -p '!!' ${args[9]}
 /usr/bin/timeout -s KILL 2 /bin/df -h 2>/dev/null | grep ' /scratch$' >/dev/null 2>&1
 if [ $? -eq 0 ];then
  if [ ! -d /scratch/${args[9]} ];then
   /bin/mkdir /scratch/${args[9]}
   /bin/chown ${args[9]}:users /scratch/${args[9]}
   /bin/chmod 700 /scratch/${args[9]}
   /usr/bin/lfs setquota -u ${args[9]} -b 5368709120 -B 6442450944 -i 1000000 -I 1001000 /scratch
  fi
 fi

 /bin/df -h /home/${args[9]} >/dev/null 2>&1
 if [ $? -ne 0 ];then
  sleep 600
 fi
 /bin/df -h /home/${args[9]} >/dev/null 2>&1
 if [ $? -eq 0 ];then
  /bin/cp -r /etc/skel/. /home/${args[9]}/
  /bin/chown -R ${args[9]}:users /home/${args[9]}
 fi

 echo -e "group users/${args[9]} {\n\
\tcpuset {\n\
\t\tcpuset.mems=\"${k1}\";\n\
\t\tcpuset.cpus=\"${p1},${q1}\";\n\
\t}\n\
\tcpu {\n\
\t\tcpu.shares = \"100\";\n\
\t}\n\
\tcpuacct {\n\
\t\tcpuacct.usage = \"0\";\n\
\t}\n\
\tmemory {\n\
\t\tmemory.limit_in_bytes = \"4G\";\n\
\t\tmemory.memsw.limit_in_bytes = \"4G\";\n\
\t}\n\
}\n" >> /var/411/groups/login01/etc/cgconfig.conf

 echo -e "group users/${args[9]} {\n\
\tcpuset {\n\
\t\tcpuset.mems=\"${k2}\";\n\
\t\tcpuset.cpus=\"${p2},${q2}\";\n\
\t}\n\
\tcpu {\n\
\t\tcpu.shares = \"100\";\n\
\t}\n\
\tcpuacct {\n\
\t\tcpuacct.usage = \"0\";\n\
\t}\n\
\tmemory {\n\
\t\tmemory.limit_in_bytes = \"2G\";\n\
\t\tmemory.memsw.limit_in_bytes = \"2G\";\n\
\t}\n\
}\n" >> /var/411/groups/login23/etc/cgconfig.conf

 echo -e "${args[9]}\t\tcpuset,cpu,cpuacct,memory\t\tusers/${args[9]}" >> /var/411/groups/Login/etc/cgrules.conf

 /opt/rocks/bin/rocks run host login "/bin/cgcreate -g cpuset,cpu,cpuacct,memory:/users/${args[9]}"
 /opt/rocks/bin/rocks run host login-0-0 login-0-1 "/bin/cgset -r cpuset.mems=${k1} users/${args[9]}"
 /opt/rocks/bin/rocks run host login-0-2 login-0-3 "/bin/cgset -r cpuset.mems=${k2} users/${args[9]}"
 /opt/rocks/bin/rocks run host login-0-0 login-0-1 "/bin/cgset -r cpuset.cpus=${p1},${q1} users/${args[9]}"
 /opt/rocks/bin/rocks run host login-0-2 login-0-3 "/bin/cgset -r cpuset.cpus=${p2},${q2} users/${args[9]}"
 /opt/rocks/bin/rocks run host login "/bin/cgset -r cpu.shares=100 users/${args[9]}"
 /opt/rocks/bin/rocks run host login-0-0 login-0-1 "/bin/cgset -r memory.limit_in_bytes=4294967296 users/${args[9]}"
 /opt/rocks/bin/rocks run host login-0-0 login-0-1 "/bin/cgset -r memory.memsw.limit_in_bytes=4294967296 users/${args[9]}"
 /opt/rocks/bin/rocks run host login-0-2 login-0-3 "/bin/cgset -r memory.limit_in_bytes=2147483648 users/${args[9]}"
 /opt/rocks/bin/rocks run host login-0-2 login-0-3 "/bin/cgset -r memory.memsw.limit_in_bytes=2147483648 users/${args[9]}"

 #/opt/rocks/bin/rocks sync users

 exit 0;
else
 /usr/sbin/useradd.original "$@"
fi

[2015-09-15 21:45:34:7790 root@master post]# cat /usr/sbin/cgVariables.txt
p1=1;q1=3;p2=1;q2=3;k1=1;k2=1


We have 4 login nodes, and they form two pairs with identical configurations. This is why I'm creating two different cgconfig.conf files: on one pair each user gets 4GB and on the other it's 2GB. Sometimes bulk account creation happens on the master node. Since I wanted every account to get different cores in a cyclic order, I needed to introduce a small delay between useradd commands, and I had to update the core numbers in a file after each call to the wrapper. That is exactly what the first part of this script does.

The values in cgVariables.txt change for every user. Using $RANDOM, I introduce a sub-second delay between calls. The variables in this file are then used to print the core numbers into cgconfig.conf.

In the beginning, whenever a new user was added, I pushed cgconfig.conf and cgrules.conf onto the login nodes and restarted the cgconfig daemon. That didn't go well at all: it simply killed all the oom_notification executables that were watching users' memory usage and sending emails. I suppose my bash script was not robust enough to handle that. Due to time constraints I didn't bother to fix it, because I found another solution that I think is much more elegant than restarting the daemon all the time.

That solution was to add the required lines to cgconfig.conf and cgrules.conf for new users and then create the required cgroup directory structure on the fly. This is taken care of by the last part of the script above. The configuration is still written out, so it also comes into effect on a system reboot.
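
In other words, for a single new user the on-the-fly part of the wrapper boils down to something like this on a login node (jdoe123 and the core/node numbers are placeholders; the real values come from cgVariables.txt):

# create the per-user group under the existing hierarchy
cgcreate -g cpuset,cpu,cpuacct,memory:/users/jdoe123

# pin the user to two cores on one memory node and apply the limits
cgset -r cpuset.mems=0 users/jdoe123
cgset -r cpuset.cpus=0,2 users/jdoe123
cgset -r cpu.shares=100 users/jdoe123
cgset -r memory.limit_in_bytes=4294967296 users/jdoe123
cgset -r memory.memsw.limit_in_bytes=4294967296 users/jdoe123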

For user deletion, I put a similar wrapper around userdel and that deletes the configuration from both files.

[2015-09-15 21:50:04:7791 root@master post]# cat /usr/sbin/userdel
#!/bin/bash

# By Sreedhar Manchu (Sreedhar@nyu.edu)

args=("$@")
GIDS=$(id -G ${args[$#-1]})
for GID in $GIDS;do 
 if [ $GID -eq 100 ];then
  # remove mail
  rm -f /var/spool/mail/${args[$#-1]}
  # cleanup cgroup config files
  sed -i "/users\/${args[$#-1]} {/I,+16 d" /var/411/groups/login01/etc/cgconfig.conf
  sed -i "/users\/${args[$#-1]} {/I,+16 d" /var/411/groups/login23/etc/cgconfig.conf
  #sed -i "/users\/${args[$#-1]} {/{N;N;N;N;N;N;N;N;N;N;N;N;d}" /var/411/groups/Login/etc/cgconfig.conf
  sed -i "/users\/${args[$#-1]}$/d" /var/411/groups/Login/etc/cgrules.conf
  # mv scratch directory
  mv /scratch/${args[$#-1]} /scratch/expired/${args[$#-1]}
  break
 fi
done
/usr/sbin/userdel.original "$@"


Note that I'm not bothering to delete the cgroup directories that were created on demand: at some point we'll reboot the login nodes, and that takes care of it, since the configuration no longer contains these groups.

I am using the Rocks 411 service to push the configuration files onto the login nodes.
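
If you're not familiar with 411: after the wrapper appends to the files under /var/411/groups/, publishing them from the master is typically just a rebuild of the 411 repository (the commented-out rocks sync users in the wrapper accomplishes much the same thing); a rough sketch, assuming a stock Rocks setup:

# re-encrypt and publish the 411 files so the login nodes pick up the
# updated cgconfig.conf and cgrules.conf
cd /var/411 && make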
