Vallari Agrawal

Build your own docker


Containers

For this exercise, I named my container runtime “turtle” because containers live in a self-contained environment, just like a turtle does - and both can live in many different kinds of environments. While a turtle’s shell is made of bone, let’s see what containers are made of…

[Drawing of a turtle]

A container image is simply a tarball of a filesystem. Running a container means downloading this tarball, unpacking it into a directory, and then running a program as if that directory were its whole filesystem. Containers are also isolated and restricted in their resources and access to the rest of the system, so they can only see their own environment and not the rest of the machine.

The first step in building our own container runtime is to get a small filesystem that can become our container’s root filesystem.

Below, I’ll fetch alpine-minirootfs and extract it into /tmp/turtle-os:

[vallari@fedora turtle]$ wget https://dl-cdn.alpinelinux.org/alpine/v3.20/releases/x86_64/alpine-minirootfs-3.20.2-x86_64.tar.gz
[vallari@fedora turtle]$ mkdir /tmp/turtle-os
[vallari@fedora turtle]$ tar xzf alpine-minirootfs-3.20.2-x86_64.tar.gz --directory=/tmp/turtle-os/

[vallari@fedora turtle]$ ls /tmp/turtle-os/
bin  dev  etc  home  lib  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var

To make our turtle container work, we need to isolate it and restrict which resources of the host system it is allowed to use. Here is a list of things we want to achieve:

  1. use the /tmp/turtle-os directory as the container’s rootfs (not the host’s rootfs at /)
  2. use /tmp/turtle-os/bin/ls when we run ls inside the container
  3. show only the container’s processes in procfs (ls /proc does not include the host’s processes)
  4. give the container its own hostname (without affecting the host’s hostname)
  5. make it rootless - so root inside the container is not actually the host’s root user
  6. restrict how much memory/CPU the container can use

To make all this possible, let’s learn a little about three kernel features that containers use:

  1. chroot
  2. Namespaces
  3. cgroups

chroot

chroot changes the root directory / of the calling process to a specified path.

This allows the container process and its child processes to have their own filesystem.

# an actual docker container's rootfs:
root@8244007e2d1b:/# ls   
bin  boot  dev  etc  home  lib  lib32  lib64  libx32  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var

A user needs the CAP_SYS_CHROOT capability to call chroot. Here, I restrict the turtle container’s root filesystem to the extracted alpine-minirootfs directory:

[vallari@fedora turtle]$ touch /tmp/turtle-os/TEST            # 1. create a TEST file to verify chroot is working 
[vallari@fedora turtle]$ sudo chroot /tmp/turtle-os /bin/sh   # 2. change rootfs for /bin/sh process 
/ # ls /                                     # 3. verify - rootfs of this shell has TEST file
TEST   bin    dev    etc    home   lib    media  mnt    opt    proc   root   run    sbin   srv    sys    tmp    usr    var
/ # exit

[vallari@fedora turtle]$ ls /                # 4. host's rootfs does not have TEST file
afs  bin  boot  dev  etc  home  lib  lib64  lost+found  media  mnt  opt  proc  root  run  sbin  snap  srv  sys  tmp  usr  var
[vallari@fedora turtle]$ 

Real containers use pivot_root instead of chroot. pivot_root achieves the same result but is more secure. Why? Because there are a few ways for a superuser to break out of a “chroot jail”, while pivot_root changes the root mount in the mount namespace, so it properly jails processes inside a directory. The man page of chroot clearly says “it is not intended to be used for any kind of security purpose”.

But for this article, we’ll use chroot.
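Though we’ll stick with chroot, here’s a minimal sketch of the pivot_root approach for the curious. Everything here is illustrative (the /tmp/newroot path is made up for the demo), and it assumes unprivileged user namespaces are enabled so we can skip sudo:

```shell
# build a throwaway new root and swap the root mount with pivot_root
unshare --user --map-root-user --mount /bin/sh -c '
  export PATH="/usr/sbin:/sbin:$PATH"        # pivot_root often lives in sbin
  mkdir -p /tmp/newroot/old_root
  mount --bind /tmp/newroot /tmp/newroot     # new root must be a mount point
  cd /tmp/newroot                            # enter via path to land on the mount
  pivot_root . old_root                      # "." becomes /, old root moves to /old_root
  echo /*                                    # shell builtin; prints /old_root
'
```

A real runtime would unpack the container image into the new root first and then umount -l /old_root, so the host filesystem cannot be re-entered - this is the proper jail that chroot cannot guarantee.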


Namespaces

Linux namespaces are a kernel feature for isolating resources: processes within a namespace see their own set of resources instead of all global resources.

There are 8 types of namespaces on Linux; you can read about them on the namespaces man page. I will only create these namespaces in this article:

  1. UTS namespace - isolate hostnames/domainnames!
  2. Mount namespace - isolate mounts!
  3. PID namespace - isolate process ID numbering!
  4. User namespace - isolate users/groups IDs, helps to make rootless containers!

Creating and listing namespaces

To learn about creating namespaces, I’ll use the UTS namespace as my example because it’s the easiest to understand - it lets the container have its own hostname. Processes in the same UTS namespace share the same hostname and domain name.

Now, let’s create a namespace! We’ll use unshare command to create new namespaces. It follows the syntax:

unshare <options> <program>

This command creates new namespaces (based on <options>) and then executes <program> within those namespaces. Example: unshare --uts /bin/sh creates a UTS namespace and executes the /bin/sh process in that UTS namespace (and not in the host’s default UTS namespace).

Note: namespaces can also be created with the clone syscall. The difference is that clone spawns a new child process inside the new namespaces, while unshare creates new namespaces and executes the program within them.

To verify that the namespace was successfully created, we can check the list of all namespaces on the system:

# listing namespaces in a system 
$ lsns             # list all namespaces
$ lsns --type uts  # list only 'uts' namespaces - filter by namespace type

# enter a namespace (here we enter the UTS namespace that process $PID belongs to) 
$ sudo nsenter --uts=/proc/$PID/ns/uts 
$ sudo nsenter -t $PID -u

# see namespaces of a process in procfs 
$ ls -l /proc/$PID/ns/
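Before reaching for docker, here’s a quick unprivileged sanity check (my own aside; it assumes unprivileged user namespaces are enabled). The symlinks under /proc/self/ns/ encode a namespace’s identity, so we can confirm that unshare really creates a new one:

```shell
# compare the UTS namespace identity outside vs. inside unshare;
# the two printed uts:[inode] values differ
readlink /proc/self/ns/uts
unshare --user --map-root-user --uts readlink /proc/self/ns/uts
```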

To see this in practice, let’s test it with a docker container.

Let’s start a docker container and note the PID of its shell from outside the container (afterwards, we’ll see this PID associated with the namespaces created by docker).

# Start a container + check hostname
[vallari@fedora turtle]$ docker run -it ubuntu:latest
root@8244007e2d1b:/#  
root@8244007e2d1b:/# hostname
8244007e2d1b

# Find PID of our docker's bash process: it's 36600 here! 
[vallari@fedora turtle]$ ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
...
vallari    36543  0.1  0.1 1920808 26752 pts/8   Sl+  23:51   0:00 docker run -it ubuntu:latest
root       36600  0.2  0.0   4624  3712 pts/0    Ss+  23:51   0:00 /bin/bash

Listing all namespaces created by the above docker process (associated with docker process’ PID):

# Now let's look at all the namespaces this bash process belongs to:
[vallari@fedora turtle]$ sudo lsns
        NS TYPE   NPROCS   PID USER            COMMAND
...
4026533017 mnt         1 36600 root            /bin/bash
4026533018 uts         1 36600 root            /bin/bash
4026533019 ipc         1 36600 root            /bin/bash
4026533020 pid         1 36600 root            /bin/bash
4026533021 net         1 36600 root            /bin/bash
4026533091 cgroup      1 36600 root            /bin/bash


# Look at namespaces of this process in procfs! 
[vallari@fedora turtle]$ sudo ls -l /proc/36600/ns/
total 0
lrwxrwxrwx 1 root root 0 Mar  7 23:56 cgroup -> 'cgroup:[4026533091]'
lrwxrwxrwx 1 root root 0 Mar  7 23:56 ipc -> 'ipc:[4026533019]'
lrwxrwxrwx 1 root root 0 Mar  7 23:56 mnt -> 'mnt:[4026533017]'
lrwxrwxrwx 1 root root 0 Mar  7 23:51 net -> 'net:[4026533021]'
lrwxrwxrwx 1 root root 0 Mar  7 23:56 pid -> 'pid:[4026533020]'
lrwxrwxrwx 1 root root 0 Mar  7 23:56 pid_for_children -> 'pid:[4026533020]'
lrwxrwxrwx 1 root root 0 Mar  7 23:56 time -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 Mar  7 23:56 time_for_children -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 Mar  7 23:56 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Mar  7 23:54 uts -> 'uts:[4026533018]'

Entering the UTS namespace created and used by docker:

# Enter UTS namespace of the docker bash process:
[vallari@fedora turtle]$ sudo nsenter --uts=/proc/36600/ns/uts 
[root@8244007e2d1b turtle]# hostname
8244007e2d1b    
# ^ that's the same hostname as our docker's bash we saw above!

# BONUS: another way to enter the UTS namespace for that process:
[vallari@fedora turtle]$ sudo nsenter -t 36600 -u 
[root@8244007e2d1b turtle]# hostname
8244007e2d1b

Removing namespaces

Namespaces are automatically destroyed when the last process in that namespace terminates.

They can be made persistent, existing even after the last process has exited (except the PID namespace, where an init process must be running for the namespace to stay usable).

For example, a UTS namespace is made persistent by bind mounting a file with unshare --uts=<file>.

We can remove persistent namespaces with umount (knowing this was helpful for cleaning up stray namespaces while experimenting for this article!):

$ sudo touch /root/uts-ns
$ sudo unshare --uts=/root/uts-ns hostname FOO    # persistent UTS namespace 
$ sudo nsenter --uts=/root/uts-ns hostname        # we can enter the persistent UTS namespace (even though it has no processes running!)
FOO
$ sudo mount | grep "uts-ns"                      # find our namespace's bind mount! 
nsfs on /root/uts-ns type nsfs (rw)
$ sudo umount /root/uts-ns                        # destroy the namespace by removing the bind mount

Let’s understand each type of namespace.

1. UTS namespace

The UTS namespace isolates two system identifiers: the hostname and the NIS domain name. These identifiers can be read/set with the system calls gethostname, sethostname, getdomainname, and setdomainname.

We can create a new UTS namespace using unshare --uts. Creating UTS namespaces requires the CAP_SYS_ADMIN capability (see here). So, we’ll create these namespaces as the root user or with sudo.

Any changes made to the hostname/domainname are local to each UTS namespace. If I create a new UTS namespace and set a new hostname, the hostname inside that UTS namespace changes, but the hostname outside it remains the same.

[vallari@fedora turtle]$ sudo unshare --uts /bin/sh
sh-5.2# hostname
fedora
sh-5.2# hostname uts-name   # change hostname inside the namespace 
sh-5.2# hostname
uts-name                    # it's changed!
sh-5.2# 
sh-5.2# 

# outside that UTS namespace, the hostname is unchanged! 
[vallari@fedora turtle]$ hostname
fedora
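Side note: we can get the same isolation without sudo by pairing --uts with an unprivileged user namespace - the mapped root user holds CAP_SYS_ADMIN inside the new user namespace (this assumes unprivileged user namespaces are enabled on the host; more on user namespaces below):

```shell
# set a hostname inside the namespace without ever being host root
unshare --user --map-root-user --uts sh -c 'hostname uts-demo; hostname'
# prints: uts-demo
hostname    # outside: the host's hostname is unchanged
```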

We can see the new namespace in lsns:

[vallari@fedora turtle]$ sudo unshare --uts /bin/sh
sh-5.2# lsns --type uts                        # this uts namespace listed here
        NS TYPE NPROCS   PID USER        COMMAND
....
4026532703 uts       2 32547 root        /bin/sh
sh-5.2# exit 

[vallari@fedora turtle]$ sudo lsns --type uts    # above uts namespace gone!  

Reference: man page

2. Mount namespace

A brief explanation of mounts from the mount man page:

All files accessible in a Unix system are arranged in one big tree, the file hierarchy, rooted at /. These files can be spread out over several devices. The mount command serves to attach the filesystem found on some device to the big file tree.

Conversely, the umount(8) command will detach it again.

The filesystem is used to control how data is stored on the device or provided in a virtual way by network or other services.

Mount namespaces isolate the list of mounts visible to the processes in that namespace.

In docker, we often bind mount a directory:

[vallari@fedora turtle]$ docker run -v /tmp/turtle-os/:/turtle -it ubuntu:22.04 /bin/bash
root@8201381039c9:/# ls /
bin  boot  dev  etc  home  lib  lib32  lib64  libx32  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  turtle  usr  var

root@8201381039c9:/# mount | grep "turtle"
tmpfs on /turtle type tmpfs (rw,nosuid,nodev,nr_inodes=1048576,inode64)

What is a bind mount? It makes a file or directory subtree visible at another path within the same tree - similar in effect to a symlink. Example: after mount --bind /bin/ /tmp/mybins, /tmp/mybins/cat works like /bin/cat.

When a new mount namespace is created with unshare, we can mount and umount filesystems without affecting the host’s filesystem.

This means that our mount at “/turtle” above should not be visible to the host:

[vallari@fedora ~]$ mount | grep "turtle"
[vallari@fedora ~]$ 

Let’s create a new mount namespace and bind mount a filesystem. Both actions - creating a new mount namespace and mounting filesystems - require the CAP_SYS_ADMIN capability, so we’ll use sudo for now!

[vallari@fedora turtle]$ sudo unshare --mount /bin/sh  # create new mount namespace 
sh-5.2# mount --bind /bin/ /tmp/mybins/                # bind mount 
sh-5.2# ls /tmp/mybins/cat                             # the bind mount works! 
/tmp/mybins/cat
sh-5.2# 
exit

# the bind mount did not affect the host filesystem:
[vallari@fedora turtle]$ ls /tmp/mybins/cat
ls: cannot access '/tmp/mybins/cat': No such file or directory
[vallari@fedora turtle]$

This isolation is possible because unshare (by default) sets mount propagation to PRIVATE in the new mount namespace, so mount/umount events stay private to that namespace. There are other propagation types: SHARED (mount/umount events propagate between peer mount namespaces - they all affect each other), SLAVE (events propagate from a master mount namespace to its slaves, but events from slave mounts do not propagate back to the master), etc.

Using mount namespace for containers

When we chroot into a new root filesystem inside a mount namespace, /proc from the host is no longer accessible. Mount points do not propagate, so we need to mount the proc pseudo-filesystem at the new /proc to restore process visibility.

Let’s understand this by adding a mount namespace to the turtle implementation…


[vallari@fedora turtle]$ sudo unshare --mount /bin/sh      # without chroot 
/ # mount
..........
.......... (all host mounts)
/ # ps aux
PID   USER     TIME  COMMAND
..........
.......... (all host processes)


[vallari@fedora turtle]$ sudo unshare --mount chroot /tmp/turtle-os /bin/sh    # with chroot 
/ # mount
mount: no /proc/mounts
/ # ps aux
PID   USER     TIME  COMMAND

Now, let’s remount the procfs:

[vallari@fedora turtle]$ sudo unshare --mount chroot /tmp/turtle-os /bin/sh
/ # 
/ # /bin/mount proc /proc -t proc                 <- remount proc


/ # mount
proc on /proc type proc (rw,relatime)
/ # ps aux
PID   USER     TIME  COMMAND
    1 root      0:10 /usr/lib/systemd/systemd --switched-root --system --deserialize=35 rhgb
.......
.......
.......
# list of all host processes

The command /bin/mount proc /proc -t proc mounts procfs at /proc of the new root filesystem.

References: mount namespace man page, mount command, unix.stackexchange explanation

3. PID namespace

The PID namespace isolates process IDs. The man page explains it as:

PID namespaces isolate the process ID number space, meaning that processes in different PID namespaces can have the same PID.

A process in a new PID namespace starts with PID 1, as if it were part of its own system. The PID of the same process differs inside and outside the namespace.

Let’s understand PID namespaces by observing PIDs in a docker container.

# inside container - in a new PID namespace 
[vallari@fedora turtle]$ docker run -it ubuntu:latest /bin/bash
root@4dbeabacee71:/# sleep 3000 &
[1] 10
root@4dbeabacee71:/# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   4624  3712 pts/0    Ss   19:00   0:00 /bin/bash      # init process of the PID namespace
root          10  0.0  0.0   2788  1536 pts/0    S    19:05   0:00 sleep 3000
root          11  0.0  0.0   7060  2944 pts/0    R+   19:05   0:00 ps aux

# outside container - in default PID namespace 
[vallari@fedora turtle]$ ps aux 
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
........
root           1  0.0  0.0 169200 14900 ?        Ss   Mar04   0:11 /usr/lib/systemd/systemd --switched-root --system --deserialize=35 rhgb    # init process of the host!
root       51936  0.9  0.0   4624  3712 pts/0    Ss+  00:30   0:00 /bin/bash
root       51997  0.0  0.0   2788  1536 pts/0    S    00:35   0:00 sleep 3000

The first process (PID 1) is called the “init” process. If the init process terminates, the kernel sends a SIGKILL signal to all processes in that namespace. From the above example, we can see that /bin/bash with PID 1 is the init process of the container, and /usr/lib/systemd/systemd is the init process of the host system.

When creating a new PID namespace, we also need to “fork” the process, i.e. use --fork. The forked child process then starts in the new PID namespace as PID 1. Why? Because unshare doesn’t move the calling process into the new PID namespace - only its children are created inside it - so we fork a child to act as the namespace’s first process.
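A quick way to see the new numbering is the shell builtin $$, which gets the PID from the kernel rather than reading /proc (my aside; assuming unprivileged user namespaces are enabled - otherwise use sudo unshare --pid --fork):

```shell
# the forked child becomes PID 1 of the new PID namespace
unshare --user --map-root-user --pid --fork sh -c 'echo $$'
# prints: 1
```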

Let’s create a new PID namespace:

[vallari@fedora ~]$ sudo unshare -p --fork /bin/bash
[root@fedora vallari]# ps
    PID TTY          TIME CMD
  29265 pts/1    00:00:00 sudo
  29266 pts/1    00:00:00 unshare
  29267 pts/1    00:00:00 bash
  29352 pts/1    00:00:00 ps

But notice how in this new PID namespace, ps still shows the host PIDs of these processes. Why don’t the PIDs start from 1? Because tools like ps read from the /proc pseudo-filesystem, and in the above example the host’s procfs is still visible to ps. To see the new PID numbering with ps, we should use the new rootfs with chroot and remount procfs (as we did in the mount namespace section above).

# (with PID namespace)
[vallari@fedora turtle]$ sudo unshare --pid --fork --mount chroot /tmp/turtle-os /bin/sh
/ # mount proc /proc -t proc
/ # mount
proc on /proc type proc (rw,relatime)
/ # ps aux                   # new PID namespace! 
PID   USER     TIME  COMMAND
    1 root      0:00 /bin/sh
    4 root      0:00 ps aux
/ # exit 


# (without PID namespace)
[vallari@fedora turtle]$ sudo unshare --mount chroot /tmp/turtle-os /bin/sh
/ # mount proc /proc -t proc
/ # ps aux                # host's PID namespace 
PID   USER     TIME  COMMAND
    1 root      0:11 /usr/lib/systemd/systemd --switched-root --system --deserialize=35 rhgb
........

We can also nest PID namespaces, allowing a parent namespace to see all processes within its child and grandchild namespaces. But a PID namespace cannot see any processes from its ancestor namespaces.

References: https://www.redhat.com/en/blog/pid-namespace

4. User namespace

We will use this namespace to run the container without root privileges.

The user namespace isolates security-related identifiers - user IDs, group IDs, and capabilities. A process in a new user namespace can have a different user and group ID from the host.

[vallari@fedora ~]$ id
uid=1000(vallari) gid=1000(vallari) groups=1000(vallari)

[vallari@fedora ~]$ sudo unshare --user /bin/sh
sh-5.2$ id
uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)

We can map the container’s root (user ID 0) to a normal unprivileged user on the host (for example, ID 1000). A simple way to do this is to pass --map-root-user when creating a new user namespace with unshare.

[vallari@fedora turtle]$ cat /proc/self/uid_map
         0          0 4294967295
[vallari@fedora turtle]$ cat /proc/self/gid_map
         0          0 4294967295

[vallari@fedora turtle]$ unshare --user --map-root-user /bin/bash
[root@fedora turtle]# cat /proc/self/uid_map
         0       1000          1
[root@fedora turtle]# cat /proc/self/gid_map
         0       1000          1

Now that the user is “root” inside the new user namespace, it has root privileges inside that namespace! The user_namespaces docs explain it:

a process can have a normal unprivileged user ID outside a user namespace while at the same time having a user ID of 0 inside the namespace;

in other words, the process has full privileges for operations inside the user namespace, but is unprivileged for operations outside the namespace.

This means we can use the container root’s capabilities to create new mount/PID namespaces inside that user namespace (no need for the host’s root capabilities via sudo)!

 
[vallari@fedora ~]$ unshare --user --map-root-user --pid --fork --mount --uts chroot /tmp/turtle-os/ /bin/sh
/ # /bin/mount proc /proc -t proc
/ # /bin/hostname turtle
/ # whoami
root
/ # /bin/hostname
turtle
/ # /bin/ps aux
PID   USER     TIME  COMMAND
    1 root      0:00 /bin/sh
    6 root      0:00 /bin/ps aux
/ # /bin/mount
proc on /proc type proc (rw,relatime)
/ #  

The user namespace governs every other namespace: a namespace’s capabilities are determined by the capabilities of its owning user namespace.

Another way to set up these user/group ID mappings is to write to the pseudo-files /proc/self/uid_map and /proc/self/gid_map. Example:

#include <fstream>
#include <iostream>
#include <sched.h>     // unshare(), CLONE_* flags
#include <unistd.h>    // getuid(), getgid()

using std::cout;
using std::endl;

void create_ns() {
    auto uid = getuid();
    auto gid = getgid();
    int rt = unshare(CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNS);
    if (rt != 0) {
        cout << "creating namespaces failed!" << endl;
        return;
    }
    // map root (0) inside the namespace to our unprivileged uid outside it
    std::ofstream uid_file("/proc/self/uid_map");
    uid_file << "0   " << uid << "   1";
    uid_file.close();

    // "deny" must be written to setgroups before gid_map becomes writable
    std::ofstream setgroups("/proc/self/setgroups");
    setgroups << "deny" << endl;
    setgroups.close();

    std::ofstream gid_file("/proc/self/gid_map");
    gid_file << "0  " << gid << "  1";
    gid_file.close();

    cout << "=> new userid: " << getuid() << ", new gid: " << getgid() << "\n";
}

References: user namespace man page

Note: there are more types of namespaces (Network namespace, cgroup namespace, IPC namespace, and Time namespace) which I haven’t covered in this article.


Putting together these namespaces

Alpine uses busybox, so we’ll use that to run our commands.

BusyBox combines tiny versions of many common UNIX utilities into a single small executable.

For our turtle container, we can setup namespaces like this:


[vallari@fedora turtle]$ unshare --user --map-root-user --uts --pid --fork --mount \
    chroot /tmp/turtle-os /bin/busybox sh -c \
    "/bin/mount proc /proc -t proc && /bin/hostname turtle && /bin/busybox sh"
/ # 
/ # PS1="\\u@\\h ~ "
root@turtle ~

This accomplishes results similar to a <runtime> run -it <image> /bin/sh command.


cgroups

Control groups (cgroups) are a Linux kernel feature that limits resource usage for processes - for example, setting memory or CPU limits on a process. If processes inside a cgroup exceed its memory limit, they can be killed.

Processes are organized into hierarchical cgroups: each cgroup can have multiple child cgroups, forming a tree. This can be observed in a pseudo-filesystem, cgroupfs (at /sys/fs/cgroup). The root cgroup is where all processes belong by default. We can create new cgroups by creating directories (and remove them by deleting directories) in the cgroupfs subtree.

Example of cgroup hierarchy:

/sys/fs/cgroup/          # root cgroup
/sys/fs/cgroup/child     # child cgroup named "child"
/sys/fs/cgroup/child/grandchild           # nested cgroup called "grandchild"
/sys/fs/cgroup/child/secondgrandchild     # another nested cgroup called "secondgrandchild"

Create a new cgroup:

mkdir /sys/fs/cgroup/child             # new cgroup in root cgroup 
mkdir /sys/fs/cgroup/child/grandchild  # new child cgroup inside a cgroup

Removing a cgroup (if it has no active processes and no children):

rmdir /sys/fs/cgroup/child/grandchild

Each process belongs to a cgroup. You can check which cgroups a process belongs to in /proc/$PID/cgroup:

[vallari@fedora turtle]$ cat /proc/self/cgroup
1:net_cls:/
0::/user.slice/user-1000.slice/session-2.scope

[vallari@fedora turtle]$ ls /sys/fs/cgroup/user.slice/user-1000.slice/session-2.scope/
cgroup.controllers      cgroup.procs            cpu.max.burst          cpu.stat         io.prio.class        memory.high       memory.pressure      memory.swap.peak
cgroup.events           cgroup.stat             cpu.pressure           cpu.weight       io.stat              memory.low        memory.reclaim       memory.zswap.current
cgroup.freeze           cgroup.subtree_control  cpuset.cpus            cpu.weight.nice  io.weight            memory.max        memory.stat          memory.zswap.max
cgroup.kill             cgroup.threads          cpuset.cpus.effective  io.bfq.weight    irq.pressure         memory.min        memory.swap.current  pids.current
cgroup.max.depth        cgroup.type             cpuset.cpus.partition  io.latency       memory.current       memory.numa_stat  memory.swap.events   pids.events
cgroup.max.descendants  cpu.idle                cpuset.mems            io.max           memory.events        memory.oom.group  memory.swap.high     pids.max
cgroup.pressure         cpu.max                 cpuset.mems.effective  io.pressure      memory.events.local  memory.peak       memory.swap.max      pids.peak

Terminology:

A cgroup is a collection of processes bound to a set of resource limits, defined through the cgroup filesystem.

cgroup subsystems, or controllers, are kernel components that control a specific resource. Examples of cgroup controllers: cpu, memory, etc.

How to limit resources using a new cgroup?

It can be done in 3 steps using cgroupfs:

  1. Define active controllers in our cgroup
  2. Set limits
  3. Move the process into our cgroup

For this explanation, let’s say we want to limit a process using a cgroup named “child”, whose hierarchy is root > “parent” > “child”.

1. Define active controllers

Each cgroup contains these two files:

  1. cgroup.subtree_control - the list of controllers enabled for the cgroup’s children. This is the file we edit to enable controllers (cpu, memory, etc.) for child cgroups.
  2. cgroup.controllers - the list of controllers available in this cgroup. Its content matches the parent’s cgroup.subtree_control file. This is a read-only file.

So the controllers that can be enabled in the ‘child’ cgroup (listed in /sys/fs/cgroup/parent/child/cgroup.controllers) are determined by the active controllers in the parent cgroup (listed in /sys/fs/cgroup/parent/cgroup.subtree_control).

# new cgroup 
[vallari@fedora turtle]$ sudo mkdir /sys/fs/cgroup/parent/

# list of available controllers 
[vallari@fedora turtle]$ cat /sys/fs/cgroup/parent/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc

# enabling controllers for child cgroups 
[vallari@fedora turtle]$ cat /sys/fs/cgroup/parent/cgroup.subtree_control 
[vallari@fedora turtle]$ sudo bash -c "echo +cpu > /sys/fs/cgroup/parent/cgroup.subtree_control"
[vallari@fedora turtle]$ sudo bash -c "echo +memory > /sys/fs/cgroup/parent/cgroup.subtree_control"
[vallari@fedora turtle]$ cat /sys/fs/cgroup/parent/cgroup.subtree_control 
cpu memory 

# creating the child cgroup 
[vallari@fedora turtle]$ sudo mkdir /sys/fs/cgroup/parent/child

# listing available controllers (see, the content is the same as /sys/fs/cgroup/parent/cgroup.subtree_control)
[vallari@fedora turtle]$ cat /sys/fs/cgroup/parent/child/cgroup.controllers 
cpu memory

2. Set limits

Now, let’s set memory and cpu limit in our cgroup.

# echo 50 > /sys/fs/cgroup/parent/child/cpu.weight
# echo "500M" > /sys/fs/cgroup/parent/child/memory.max

Here, I will define limits for ‘child’ cgroup:

[vallari@fedora turtle]$ cat /sys/fs/cgroup/parent/cgroup.subtree_control 
cpu memory 
[vallari@fedora turtle]$ ls /sys/fs/cgroup/parent/child 
cgroup.controllers  cgroup.max.descendants  cgroup.threads  cpu.pressure     irq.pressure         memory.low        memory.peak          memory.swap.events    memory.zswap.max
cgroup.events       cgroup.pressure         cgroup.type     cpu.stat         memory.current       memory.max        memory.pressure      memory.swap.high
cgroup.freeze       cgroup.procs            cpu.idle        cpu.weight       memory.events        memory.min        memory.reclaim       memory.swap.max
cgroup.kill         cgroup.stat             cpu.max         cpu.weight.nice  memory.events.local  memory.numa_stat  memory.stat          memory.swap.peak
cgroup.max.depth    cgroup.subtree_control  cpu.max.burst   io.pressure      memory.high          memory.oom.group  memory.swap.current  memory.zswap.current

[vallari@fedora turtle]$ cat /sys/fs/cgroup/parent/child/cpu.weight
100
[vallari@fedora turtle]$ cat /sys/fs/cgroup/parent/child/memory.max
max

[vallari@fedora turtle]$ sudo bash -c "echo 50 > /sys/fs/cgroup/parent/child/cpu.weight"
[vallari@fedora turtle]$ sudo bash -c "echo 500M > /sys/fs/cgroup/parent/child/memory.max"
[vallari@fedora turtle]$ cat /sys/fs/cgroup/parent/child/memory.max
524288000
[vallari@fedora turtle]$ cat /sys/fs/cgroup/parent/child/cpu.weight
50

3. Move process to cgroup

We can restrict resources of a process by moving it into the defined cgroup:

echo $PID > /sys/fs/cgroup/parent/child/cgroup.procs

Manage cgroup for turtle

Here’s what I’ll do to setup cgroup for turtle:

  1. Create a “test” cgroup and define limits for its children in cgroup.subtree_control
  2. Create a “tasks” child cgroup and limit cpu/memory for that cgroup
  3. Add the turtle container’s PID to the “tasks” cgroup

CGROUP_NAME="test"
# create new cgroup for our container
sudo mkdir /sys/fs/cgroup/$CGROUP_NAME
sudo bash -c "echo +cpu > /sys/fs/cgroup/$CGROUP_NAME/cgroup.subtree_control"
sudo bash -c "echo +memory > /sys/fs/cgroup/$CGROUP_NAME/cgroup.subtree_control"
# setting limits to cgroup
sudo mkdir /sys/fs/cgroup/$CGROUP_NAME/tasks
sudo bash -c "echo 50 > /sys/fs/cgroup/$CGROUP_NAME/tasks/cpu.weight"
sudo bash -c "echo 500M > /sys/fs/cgroup/$CGROUP_NAME/tasks/memory.max"

# limit a PID to the cgroup limits
TURTLE_PID=$$   # PID of the current shell
sudo bash -c "echo $TURTLE_PID > /sys/fs/cgroup/$CGROUP_NAME/tasks/cgroup.procs"

Bonus!

There’s another way to manage cgroups (besides cgroupfs!): using the commands cgcreate (create a cgroup), cgset (set limits), and cgexec (run a process in the cgroup).

cgcreate -g "cpu,memory:$CGROUP_NAME"
cgset -r cpu.weight=50 $CGROUP_NAME
cgset -r memory.max=500M $CGROUP_NAME
cgexec -g "cpu,memory:$CGROUP_NAME" ...<process cmd>...

References: kernel docs, man page, redhat docs

Build your own docker

Putting everything together, here’s the full script implementing our turtle container runtime.

# Step 1: setup minifs - root filesystem of our container  
wget https://dl-cdn.alpinelinux.org/alpine/v3.20/releases/x86_64/alpine-minirootfs-3.20.2-x86_64.tar.gz
mkdir /tmp/turtle-os
tar xzf alpine-minirootfs-3.20.2-x86_64.tar.gz --directory=/tmp/turtle-os/

export PATH=$PATH:/bin
export PS1="\\u@\\h ~ "

cgroup_setup(){
    CGROUP_NAME="test"
    if [ ! -d "/sys/fs/cgroup/$CGROUP_NAME/tasks" ]; then
        echo ">> setting cgroup..."
        if [ ! -d "/sys/fs/cgroup/$CGROUP_NAME" ]; then
            echo ">> creating '$CGROUP_NAME' cgroup"
            sudo mkdir /sys/fs/cgroup/$CGROUP_NAME
            sudo bash -c "echo +cpu > /sys/fs/cgroup/$CGROUP_NAME/cgroup.subtree_control"
            sudo bash -c "echo +memory > /sys/fs/cgroup/$CGROUP_NAME/cgroup.subtree_control"
        fi
        echo ">> creating $CGROUP_NAME/tasks"
        sudo mkdir /sys/fs/cgroup/$CGROUP_NAME/tasks
    fi
    sudo bash -c "echo 50 > /sys/fs/cgroup/$CGROUP_NAME/tasks/cpu.weight"
    sudo bash -c "echo 500M > /sys/fs/cgroup/$CGROUP_NAME/tasks/memory.max"

    TURTLE_PID=$$   # PID of the current shell
    sudo bash -c "echo $TURTLE_PID > /sys/fs/cgroup/$CGROUP_NAME/tasks/cgroup.procs"
}

# Step 2: Setup cgroup 
cgroup_setup
# Step 3: Setup namespaces
unshare --user --map-root-user --uts --pid --fork --mount \
    chroot /tmp/turtle-os /bin/busybox sh -c \
    "/bin/mount proc /proc -t proc && /bin/hostname turtle && /bin/busybox $@"

Run this script as:

./turtle.sh sh    # exec into the container
./turtle.sh ls    # run arbitrary commands


#tech #tools
