Build your own docker
Containers
For this exercise, I named my container runtime “turtle” because containers live in a self-contained environment, just like a turtle does, and both can live in many different kinds of environments. A turtle’s shell is made of bone; let’s see what containers are made of…
A container image is simply a tarball of a filesystem. Running a container means downloading that tarball, unpacking it into a directory, and then running a program as if that directory were its whole filesystem. Containers are also isolated and restricted in what they can see and use of the rest of the system, so a container only sees its own environment and not the rest of the machine.
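If you want to see this for yourself, docker export dumps a running container’s filesystem as exactly such a tarball (the container ID below is just a placeholder for any running container):
$ docker export 8244007e2d1b -o rootfs.tar
$ tar tf rootfs.tar | head    # plain files and directories, nothing magical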
The first step in building our own container runtime is to get a small filesystem that can become our container’s root filesystem.
Below, I fetch alpine-minirootfs and extract it into /tmp/turtle-os:
[vallari@fedora turtle]$ wget https://dl-cdn.alpinelinux.org/alpine/v3.20/releases/x86_64/alpine-minirootfs-3.20.2-x86_64.tar.gz
[vallari@fedora turtle]$ mkdir /tmp/turtle-os
[vallari@fedora turtle]$ tar xzf alpine-minirootfs-3.20.2-x86_64.tar.gz --directory=/tmp/turtle-os/
[vallari@fedora turtle]$ ls /tmp/turtle-os/
bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
To make our turtle container work, we need to isolate and restrict the host resources that the container is allowed to use. Here is the list of things we want to achieve:
- use the /tmp/turtle-os directory as the container’s rootfs (not the host’s rootfs at /)
- use /tmp/turtle-os/bin/ls when we run ls inside the container
- show only the container’s processes in procfs (ls /proc does not include the host’s processes)
- give the container its own hostname (without affecting the host’s hostname)
- make it rootless - so running sudo inside the container is not actually the host’s root user
- restrict how much memory/CPU the container can use
To make all this possible, let’s learn a little about three kernel features that containers use:
- chroot
- Namespaces
- C-groups
chroot
chroot changes the root directory / of the calling process to a given path.
This allows the container process and its child processes to have their own filesystem.
# an actual docker container's rootfs:
root@8244007e2d1b:/# ls
bin boot dev etc home lib lib32 lib64 libx32 media mnt opt proc root run sbin srv sys tmp usr var
A user needs the CAP_SYS_CHROOT capability to call chroot. Here, I restrict the turtle container’s root filesystem to the extracted alpine-minirootfs directory:
[vallari@fedora turtle]$ touch /tmp/turtle-os/TEST # 1. create a TEST file to verify chroot is working
[vallari@fedora turtle]$ sudo chroot /tmp/turtle-os /bin/sh # 2. change rootfs for /bin/sh process
/ # ls / # 3. verify - rootfs of this shell has TEST file
TEST bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
/ # exit
[vallari@fedora turtle]$ ls / # 4. host's rootfs does not have TEST file
afs bin boot dev etc home lib lib64 lost+found media mnt opt proc root run sbin snap srv sys tmp usr var
[vallari@fedora turtle]$
Real containers use pivot_root instead of chroot. pivot_root achieves the same result but does it properly. Why? Because there are a few ways for a superuser to break out of a “chroot jail”, whereas pivot_root changes the root mount in the mount namespace, so it genuinely jails processes inside a directory. The chroot man page clearly says “it is not intended to be used for any kind of security purpose”.
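For the curious, here is roughly what the pivot_root dance looks like - a sketch based on the pivot_root man pages, run inside a new mount namespace (we won’t use this below):
$ sudo unshare --mount /bin/sh -c '
    mount --bind /tmp/turtle-os /tmp/turtle-os  # the new root must be a mount point
    cd /tmp/turtle-os
    mkdir -p put_old
    pivot_root . put_old                        # swap / with the current directory
    cd /
    umount -l /put_old                          # detach the old root so the host filesystem is unreachable
    exec /bin/sh'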
But for this article, we’ll use chroot.
Namespaces
Linux namespaces are a kernel feature for isolating resources, so that processes within a namespace see only their own set of resources instead of all global resources.
There are around 8 types of namespaces on Linux; you can read about them in the namespaces man page. I will only create these namespaces in this article:
- UTS namespace - isolate hostnames/domain names!
- Mount namespace - isolate mounts!
- PID namespace - isolate process ID numbering!
- User namespace - isolate user/group IDs, helps to make rootless containers!
Creating and listing namespaces
To learn about creating namespaces, I’ll use the UTS namespace as the example because it’s the easiest to understand - it lets the container have its own hostname. Processes in the same UTS namespace share the same hostname and domain name.
Now, let’s create a namespace! We’ll use the unshare command to create new namespaces. It follows the syntax: unshare <options> <program>. This command creates new namespaces (based on <options>) and then executes <program> within those namespaces. Example: unshare --uts /bin/sh creates a UTS namespace and executes the /bin/sh process in that UTS namespace (instead of the host’s default UTS namespace).
Note: namespaces can also be created with the clone syscall. The only difference is that clone spawns a new child process inside the namespaces, whereas unshare creates new namespaces and executes the process within them.
To verify that a namespace was successfully created, we can check the list of all namespaces on the system:
# listing namespaces in a system
$ lsns # list all namespaces
$ lsns --type uts # list only 'uts' namespaces - print one type of namespace
# enter namespace (here we enter UTS namespace to which process $PID belongs to)
$ sudo nsenter --uts=/proc/$PID/ns/uts
$ sudo nsenter -t $PID -u
# see namespaces of a process in procfs
$ ls -l /proc/$PID/ns/
To see this in practice, let’s test it with a docker container.
We’ll start a docker container and find its PID outside the container (then we’ll see this PID associated with the namespaces docker created).
# Start a container + check hostname
[vallari@fedora turtle]$ docker run -it ubuntu:latest
root@8244007e2d1b:/#
root@8244007e2d1b:/# hostname
8244007e2d1b
# Find PID of our docker's bash process: it's 36600 here!
[vallari@fedora turtle]$ ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
...
vallari 36543 0.1 0.1 1920808 26752 pts/8 Sl+ 23:51 0:00 docker run -it ubuntu:latest
root 36600 0.2 0.0 4624 3712 pts/0 Ss+ 23:51 0:00 /bin/bash
Listing all namespaces created by the above docker process (associated with docker process’ PID):
# Now let's look at all the namespaces this bash process belongs to:
[vallari@fedora turtle]$ sudo lsns
NS TYPE NPROCS PID USER COMMAND
...
4026533017 mnt 1 36600 root /bin/bash
4026533018 uts 1 36600 root /bin/bash
4026533019 ipc 1 36600 root /bin/bash
4026533020 pid 1 36600 root /bin/bash
4026533021 net 1 36600 root /bin/bash
4026533091 cgroup 1 36600 root /bin/bash
# Look at namespaces of this process in procfs!
[vallari@fedora turtle]$ sudo ls -l /proc/36600/ns/
total 0
lrwxrwxrwx 1 root root 0 Mar 7 23:56 cgroup -> 'cgroup:[4026533091]'
lrwxrwxrwx 1 root root 0 Mar 7 23:56 ipc -> 'ipc:[4026533019]'
lrwxrwxrwx 1 root root 0 Mar 7 23:56 mnt -> 'mnt:[4026533017]'
lrwxrwxrwx 1 root root 0 Mar 7 23:51 net -> 'net:[4026533021]'
lrwxrwxrwx 1 root root 0 Mar 7 23:56 pid -> 'pid:[4026533020]'
lrwxrwxrwx 1 root root 0 Mar 7 23:56 pid_for_children -> 'pid:[4026533020]'
lrwxrwxrwx 1 root root 0 Mar 7 23:56 time -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 Mar 7 23:56 time_for_children -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 Mar 7 23:56 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Mar 7 23:54 uts -> 'uts:[4026533018]'
Entering the UTS namespace created and used by docker:
# Enter UTS namespace of the docker bash process:
[vallari@fedora turtle]$ sudo nsenter --uts=/proc/36600/ns/uts
[root@8244007e2d1b turtle]# hostname
8244007e2d1b
# ^ that's the same hostname as our docker's bash we saw above!
# BONUS: another way to enter the UTS namespace for that process:
[vallari@fedora turtle]$ sudo nsenter -t 36600 -u
[root@8244007e2d1b turtle]# hostname
8244007e2d1b
Removing namespaces
Namespaces are automatically destroyed when the last process in that namespace terminates.
They can be made persistent so they exist even after the last process has exited (except the PID namespace, where an init process must keep running for the namespace to exist).
For example, a UTS namespace is made persistent by bind mounting a file with unshare --uts=<file>.
We can remove persistent namespaces with umount (this was helpful for cleaning out stray namespaces while experimenting for this article!):
$ sudo touch /root/uts-ns
$ sudo unshare --uts=/root/uts-ns hostname FOO # persistent UTS namespace
$ sudo nsenter --uts=/root/uts-ns hostname # we can enter the persistent UTS namespace (even though it has no processes running!)
FOO
$ sudo mount | grep "uts-ns" # find our namespace's bind mount!
nsfs on /root/uts-ns type nsfs (rw)
$ sudo umount /root/uts-ns # destroy the namespace by removing the bind mount
Let’s understand each type of namespace.
1. UTS namespace
The UTS namespace isolates two system identifiers: the hostname and the NIS domain name. These identifiers can be read and set with the system calls sethostname, gethostname, setdomainname, and getdomainname.
We can create a new namespace using unshare --uts. Creating new UTS namespaces requires the CAP_SYS_ADMIN capability (see here). So, we’ll create these namespaces as the root user or with sudo.
Any changes made to the hostname/domain name are local to each UTS namespace. If I create a new UTS namespace and set a new hostname, the hostname of that UTS namespace changes, but the hostname outside that namespace remains the same.
[vallari@fedora turtle]$ sudo unshare --uts /bin/sh
sh-5.2# hostname
fedora
sh-5.2# hostname uts-name # change hostname inside the namespace
sh-5.2# hostname
uts-name # it's changed!
sh-5.2#
sh-5.2#
# outside that UTS namespace, the hostname is unchanged!
[vallari@fedora turtle]$ hostname
fedora
We can see the new namespace in lsns:
[vallari@fedora turtle]$ sudo unshare --uts /bin/sh
sh-5.2# lsns --type uts # this uts namespace listed here
NS TYPE NPROCS PID USER COMMAND
....
4026532703 uts 2 32547 root /bin/sh
sh-5.2# exit
[vallari@fedora turtle]$ sudo lsns --type uts # above uts namespace gone!
Reference: man page
2. Mount namespace
A brief explanation of mounts from the mount man page:
All files accessible in a Unix system are arranged in one big tree, the file hierarchy, rooted at /. These files can be spread out over several devices. The mount command serves to attach the filesystem found on some device to the big file tree.
Conversely, the umount(8) command will detach it again.
The filesystem is used to control how data is stored on the device or provided in a virtual way by network or other services.
Mount namespaces isolate the list of mounts visible to the processes in that namespace.
In docker, we often bind mount a directory:
[vallari@fedora turtle]$ docker run -v /tmp/turtle-os/:/turtle -it ubuntu:22.04 /bin/bash
root@8201381039c9:/# ls /
bin boot dev etc home lib lib32 lib64 libx32 media mnt opt proc root run sbin srv sys tmp turtle usr var
root@8201381039c9:/# mount | grep "turtle"
tmpfs on /turtle type tmpfs (rw,nosuid,nodev,nr_inodes=1048576,inode64)
What is a bind mount? It makes a file or directory subtree visible at another path within the same tree - similar in effect to a symlink. Example: after mount --bind /bin/ /tmp/mybins, /tmp/mybins/cat works just like /bin/cat.
When a new mount namespace is created with unshare, we can mount and umount filesystems without affecting the host’s filesystem.
This means that our mount at “/turtle” above should not be visible to the host:
[vallari@fedora ~]$ mount | grep "turtle"
[vallari@fedora ~]$
Let’s create a new mount namespace and bind mount a filesystem. Both actions - creating a new mount namespace and mounting filesystems - require the CAP_SYS_ADMIN capability, so we’ll use sudo for now!
[vallari@fedora turtle]$ sudo unshare --mount /bin/sh # create new mount namespace
sh-5.2# mkdir -p /tmp/mybins && mount --bind /bin/ /tmp/mybins/ # bind mount (the target directory must exist)
sh-5.2# ls /tmp/mybins/cat # mount bind works!
/tmp/mybins/cat
sh-5.2# exit
# the bind mount did not affect the host filesystem:
[vallari@fedora turtle]$ ls /tmp/mybins/cat
ls: cannot access '/tmp/mybins/cat': No such file or directory
[vallari@fedora turtle]$
This isolation is possible because unshare, by default, marks all mounts in the new mount namespace as PRIVATE, so mount/umount events stay private to that mount namespace. There are other propagation types: SHARED (mount/umount events are propagated into peer mount namespaces - they all affect each other), SLAVE (mount/umount events propagate from master mount namespaces to slave mount namespaces, but events from slave mounts do not propagate back to master mounts), etc.
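We can see and change the propagation type from the shell; a small sketch (findmnt has a PROPAGATION column, and container runtimes typically switch everything to private first):
$ sudo unshare --mount /bin/sh
sh-5.2# findmnt -o TARGET,PROPAGATION | head   # in the new namespace, mounts show up as "private"
sh-5.2# mount --make-rshared /                 # flip the whole tree to shared: events would propagate to peers again
sh-5.2# mount --make-rprivate /                # and back to private - what we want for containers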
Using mount namespace for containers
When we chroot into a new root filesystem inside a mount namespace, the host’s /proc is no longer accessible. This is because existing mount points do not carry over into the new root, so we need to remount the proc pseudo-filesystem at the new /proc to restore process visibility.
Let’s understand this by adding a mount namespace to the turtle implementation…
[vallari@fedora turtle]$ sudo unshare --mount /bin/sh # without chroot
/ # mount
..........
.......... (all host mounts)
/ # ps aux
PID USER TIME COMMAND
..........
.......... (all host processes)
[vallari@fedora turtle]$ sudo unshare --mount chroot /tmp/turtle-os /bin/sh # with chroot
/ # mount
mount: no /proc/mounts
/ # ps aux
PID USER TIME COMMAND
Now, let’s remount the procfs:
[vallari@fedora turtle]$ sudo unshare --mount chroot /tmp/turtle-os /bin/sh
/ #
/ # /bin/mount proc /proc -t proc # remount proc
/ # mount
proc on /proc type proc (rw,relatime)
/ # ps aux
PID USER TIME COMMAND
1 root 0:10 /usr/lib/systemd/systemd --switched-root --system --deserialize=35 rhgb
.......
.......
.......
# list of all host processes
The command /bin/mount proc /proc -t proc remounts procfs at /proc of the new root filesystem.
References: mount namespace man page, mount command, unix.stackexchange explanation
3. PID namespace
The PID namespace isolates process IDs. The man page explains it as:
PID namespaces isolate the process ID number space, meaning that processes in different PID namespaces can have the same PID.
A process in a new PID namespace starts with PID 1, as if it were part of its own system. The PID of the same process is different inside and outside the namespace.
Let’s understand PID namespaces by observing PIDs in a docker container.
# inside container - in a new PID namespace
[vallari@fedora turtle]$ docker run -it ubuntu:latest /bin/bash
root@4dbeabacee71:/# sleep 3000 &
[1] 10
root@4dbeabacee71:/# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 4624 3712 pts/0 Ss 19:00 0:00 /bin/bash # init process of the PID namespace
root 10 0.0 0.0 2788 1536 pts/0 S 19:05 0:00 sleep 3000
root 11 0.0 0.0 7060 2944 pts/0 R+ 19:05 0:00 ps aux
# outside container - in default PID namespace
[vallari@fedora turtle]$ ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
........
root 1 0.0 0.0 169200 14900 ? Ss Mar04 0:11 /usr/lib/systemd/systemd --switched-root --system --deserialize=35 rhgb # init process of the host!
root 51936 0.9 0.0 4624 3712 pts/0 Ss+ 00:30 0:00 /bin/bash
root 51997 0.0 0.0 2788 1536 pts/0 S 00:35 0:00 sleep 3000
The first process (PID 1) is called the “init” process. If the init process terminates, the kernel sends a SIGKILL signal to all processes in that namespace. From the example above, we can see that /bin/bash with PID 1 is the init process of the container, and /usr/lib/systemd/systemd is the init process of the host system.
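We can also see both PIDs at once from the host: the NSpid line in /proc/<pid>/status lists a process’s PID in every PID namespace it belongs to. Using the sleep process from the example above (host PID 51997, PID 10 inside the container):
$ grep NSpid /proc/51997/status # host PID first, then the PID inside the container's PID namespace
NSpid:  51997   10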
When creating a new PID namespace, we also need to “fork” the process, i.e. use --fork. The forked child process then starts in the new PID namespace as PID 1. Why? Because unsharing a PID namespace doesn’t move the calling process into it - only the children it creates afterwards end up in the new namespace. So we fork, and the forked child runs inside the new namespace as PID 1.
Let’s create a new PID namespace:
[vallari@fedora ~]$ sudo unshare -p --fork /bin/bash
[root@fedora vallari]# ps
PID TTY TIME CMD
29265 pts/1 00:00:00 sudo
29266 pts/1 00:00:00 unshare
29267 pts/1 00:00:00 bash
29352 pts/1 00:00:00 ps
But notice how, in this new PID namespace, ps still shows the host PIDs of these processes. Why don’t the PIDs start from 1? Because tools like ps read from the /proc pseudo-filesystem, and in the example above the host’s procfs is still what ps sees. To see the new PID namespace reflected in ps, we should use the new rootfs with chroot and remount procfs (like we did above in the mount namespace section).
# (with PID namespace)
[vallari@fedora turtle]$ sudo unshare --pid --fork --mount chroot /tmp/turtle-os /bin/sh
/ # mount proc /proc -t proc
/ # mount
proc on /proc type proc (rw,relatime)
/ # ps aux # new PID namespace!
PID USER TIME COMMAND
1 root 0:00 /bin/sh
4 root 0:00 ps aux
/ # exit
# (without PID namespace)
[vallari@fedora turtle]$ sudo unshare --mount chroot /tmp/turtle-os /bin/sh
/ # mount proc /proc -t proc
/ # ps aux # host's PID namespace
PID USER TIME COMMAND
1 root 0:11 /usr/lib/systemd/systemd --switched-root --system --deserialize=35 rhgb
........
We can also nest PID namespaces, allowing a parent namespace to see all processes within its child and grandchild namespaces. But a PID namespace cannot see any processes from its ancestor namespaces.
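A quick sketch of that parent-side view (the --mount-proc option of unshare mounts a fresh /proc, and <host-pid-of-sleep> is a placeholder for whatever PID the host assigns):
$ sudo unshare --pid --fork --mount-proc sleep 3000 &
$ ps -ef | grep "sleep 3000"                        # the host (an ancestor namespace) still sees it, under a host PID
$ sudo nsenter -t <host-pid-of-sleep> -p -m ps aux  # viewed from inside its own namespace, the same sleep is PID 1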
References: https://www.redhat.com/en/blog/pid-namespace
4. User namespace
We will use this namespace to run the container without root privileges.
The user namespace isolates security-related identifiers - user IDs, group IDs, and capabilities. A process in a new user namespace can have a different user and group ID than it has on the host.
[vallari@fedora ~]$ id
uid=1000(vallari) gid=1000(vallari) groups=1000(vallari)
[vallari@fedora ~]$ sudo unshare --user /bin/sh
sh-5.2$ id
uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)
We can map the container’s root (user ID 0) to a normal unprivileged user on the host (for example, ID 1000). A simple way to do this is to pass --map-root-user when creating a new user namespace with unshare.
[vallari@fedora turtle]$ cat /proc/self/uid_map
0 0 4294967295
[vallari@fedora turtle]$ cat /proc/self/gid_map
0 0 4294967295
[vallari@fedora turtle]$ unshare --user --map-root-user /bin/bash
[root@fedora turtle]# cat /proc/self/uid_map
0 1000 1
[root@fedora turtle]# cat /proc/self/gid_map
0 1000 1
Now that the user is “root” inside the new user namespace, it has root privileges inside that user namespace! user_namespace docs explain it:
a process can have a normal unprivileged user ID outside a user namespace while at the same time having a user ID of 0 inside the namespace;
in other words, the process has full privileges for operations inside the user namespace, but is unprivileged for operations outside the namespace.
This means we can use the container’s root capabilities to create new mount/PID namespaces inside that user namespace (no need to use the host’s root capabilities with sudo)!
[vallari@fedora ~]$ unshare --user --map-root-user --pid --fork --mount --uts chroot /tmp/turtle-os/ /bin/sh
/ # /bin/mount proc /proc -t proc
/ # /bin/hostname turtle
/ # whoami
root
/ # /bin/hostname
turtle
/ # /bin/ps aux
PID USER TIME COMMAND
1 root 0:00 /bin/sh
6 root 0:00 /bin/ps aux
/ # /bin/mount
proc on /proc type proc (rw,relatime)
/ #
User namespaces govern every other namespace: each namespace is owned by a user namespace, and a process’s capabilities in that owning user namespace determine what it can do with the namespace.
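A one-line demonstration: with no sudo at all, we can change the hostname, because the new UTS namespace is owned by the user namespace in which we are mapped to root (the hostname here is just an example value):
$ unshare --user --map-root-user --uts /bin/sh -c 'hostname not-fedora && hostname'
not-fedora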
Another way to set up these user/group ID mappings is to write the pseudo files /proc/self/uid_map and /proc/self/gid_map directly. Example:
#include <fstream>
#include <iostream>
#include <sched.h>    // unshare(), CLONE_NEW* flags
#include <unistd.h>   // getuid(), getgid()

using std::cout;
using std::endl;

void create_ns() {
    auto uid = getuid();
    auto gid = getgid();

    // create new user, PID, UTS and mount namespaces for this process
    int rt = unshare(CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNS);
    if (rt != 0) {
        cout << "creating namespaces failed!" << endl;
    }

    // map root (uid 0) inside the namespace to our unprivileged uid on the host
    std::ofstream uid_file("/proc/self/uid_map");
    uid_file << "0 " << uid << " 1";
    uid_file.close();

    // writing "deny" to setgroups is required before gid_map can be written
    std::ofstream setgroups("/proc/self/setgroups");
    setgroups << "deny" << endl;
    setgroups.close();

    std::ofstream gid_file("/proc/self/gid_map");
    gid_file << "0 " << gid << " 1";
    gid_file.close();

    cout << "=> new userid: " << getuid() << ", new gid: " << getgid() << "\n";
}
References: user namespace man page
Note: there are more types of namespaces (Network namespace, cgroup namespace, IPC namespace, and Time namespace) which I haven’t covered in this article.
Putting together these namespaces
Alpine uses busybox, so we’ll use that to run our commands.
BusyBox combines tiny versions of many common UNIX utilities into a single small executable.
For our turtle container, we can set up the namespaces like this:
[vallari@fedora turtle]$ unshare --user --map-root-user --uts --pid --fork --mount \
chroot /tmp/turtle-os /bin/busybox sh -c \
"/bin/mount proc /proc -t proc && /bin/hostname turtle && /bin/busybox sh"
/ #
/ # PS1="\\u@\\h ~ "
root@turtle ~
This accomplishes results similar to a <runtime> run -it <image> /bin/sh command.
C-groups
Control groups (cgroups) are a Linux kernel feature that limits resource usage for processes - for example, setting memory or CPU limits on a process. If a process inside a cgroup exceeds its memory limit, the process is killed.
Processes are organized into hierarchical cgroups: each cgroup can have multiple child cgroups, forming a tree. This tree can be observed in a pseudo-filesystem, cgroupfs (at /sys/fs/cgroup). The root cgroup is where all processes belong by default. We can create new cgroups by creating directories in the cgroupfs subtree (and remove them by deleting those directories).
Example of cgroup hierarchy:
/sys/fs/cgroup/ # root cgroup
/sys/fs/cgroup/child # child cgroup named "child"
/sys/fs/cgroup/child/grandchild # nested cgroup called "grandchild"
/sys/fs/cgroup/child/secondgrandchild # another nested cgroup called "secondgrandchild"
Create a new cgroup:
mkdir /sys/fs/cgroup/child # new cgroup in root cgroup
mkdir /sys/fs/cgroup/child/grandchild # new child cgroup inside a cgroup
Removing a cgroup (if it has no active processes and no children):
rmdir /sys/fs/cgroup/child/grandchild
Each process belongs to a cgroup. You can check which cgroups a process belongs to at /proc/$PID/cgroup:
[vallari@fedora turtle]$ cat /proc/self/cgroup
1:net_cls:/
0::/user.slice/user-1000.slice/session-2.scope
[vallari@fedora turtle]$ ls /sys/fs/cgroup/user.slice/user-1000.slice/session-2.scope/
cgroup.controllers cgroup.procs cpu.max.burst cpu.stat io.prio.class memory.high memory.pressure memory.swap.peak
cgroup.events cgroup.stat cpu.pressure cpu.weight io.stat memory.low memory.reclaim memory.zswap.current
cgroup.freeze cgroup.subtree_control cpuset.cpus cpu.weight.nice io.weight memory.max memory.stat memory.zswap.max
cgroup.kill cgroup.threads cpuset.cpus.effective io.bfq.weight irq.pressure memory.min memory.swap.current pids.current
cgroup.max.depth cgroup.type cpuset.cpus.partition io.latency memory.current memory.numa_stat memory.swap.events pids.events
cgroup.max.descendants cpu.idle cpuset.mems io.max memory.events memory.oom.group memory.swap.high pids.max
cgroup.pressure cpu.max cpuset.mems.effective io.pressure memory.events.local memory.peak memory.swap.max pids.peak
Terminology:
A cgroup is a collection of processes that are bound to a set of resource limits defined via the cgroup filesystem.
A cgroup subsystem, or controller, is a kernel component that controls a specific resource. Examples of cgroup controllers: cpu, memory, etc.
How to limit resources using a new cgroup?
It can be done in 3 steps using cgroupfs:
- Define the active controllers in our cgroup
- Set limits
- Move the process into our cgroup
For this explanation, let’s say we want to limit a process using a cgroup named “child”, whose hierarchy is root > “parent” > “child”.
1. Define active controllers
Each cgroup contains these two files:
- cgroup.subtree_control - the list of controllers enabled for this cgroup’s children. This is the file we edit to enable/disable controllers (cpu, memory, etc.) for child cgroups.
- cgroup.controllers - the list of controllers available in this cgroup. Its content matches the parent’s cgroup.subtree_control file. This is a read-only file.
So the controllers that can be enabled in the ‘child’ cgroup (listed in /sys/fs/cgroup/parent/child/cgroup.controllers) are determined by the active controllers of the parent cgroup (set in /sys/fs/cgroup/parent/cgroup.subtree_control).
# new cgroup
[vallari@fedora turtle]$ sudo mkdir /sys/fs/cgroup/parent/
# list of available controllers
[vallari@fedora turtle]$ cat /sys/fs/cgroup/parent/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
# setting child controllers
[vallari@fedora turtle]$ cat /sys/fs/cgroup/parent/cgroup.subtree_control
[vallari@fedora turtle]$ sudo bash -c "sudo echo +cpu > /sys/fs/cgroup/parent/cgroup.subtree_control"
[vallari@fedora turtle]$ sudo bash -c "sudo echo +memory > /sys/fs/cgroup/parent/cgroup.subtree_control"
[vallari@fedora turtle]$ cat /sys/fs/cgroup/parent/cgroup.subtree_control
cpu memory
## creating child cgroup
[vallari@fedora turtle]$ sudo mkdir /sys/fs/cgroup/parent/child
# listing available controllers (note: content is the same as /sys/fs/cgroup/parent/cgroup.subtree_control)
[vallari@fedora turtle]$ cat /sys/fs/cgroup/parent/child/cgroup.controllers
cpu memory
2. Set limits
Now, let’s set the memory and CPU limits in our cgroup.
# echo 50 > /sys/fs/cgroup/parent/child/cpu.weight
# echo "500M" > /sys/fs/cgroup/parent/child/memory.max
Here, I will define limits for ‘child’ cgroup:
[vallari@fedora turtle]$ cat /sys/fs/cgroup/parent/cgroup.subtree_control
cpu memory
[vallari@fedora turtle]$ ls /sys/fs/cgroup/parent/child
cgroup.controllers cgroup.max.descendants cgroup.threads cpu.pressure irq.pressure memory.low memory.peak memory.swap.events memory.zswap.max
cgroup.events cgroup.pressure cgroup.type cpu.stat memory.current memory.max memory.pressure memory.swap.high
cgroup.freeze cgroup.procs cpu.idle cpu.weight memory.events memory.min memory.reclaim memory.swap.max
cgroup.kill cgroup.stat cpu.max cpu.weight.nice memory.events.local memory.numa_stat memory.stat memory.swap.peak
cgroup.max.depth cgroup.subtree_control cpu.max.burst io.pressure memory.high memory.oom.group memory.swap.current memory.zswap.current
[vallari@fedora turtle]$ cat /sys/fs/cgroup/parent/child/cpu.weight
100
[vallari@fedora turtle]$ cat /sys/fs/cgroup/parent/child/memory.max
max
[vallari@fedora turtle]$ sudo bash -c "echo 50 > /sys/fs/cgroup/parent/child/cpu.weight"
[vallari@fedora turtle]$ sudo bash -c "echo 500M > /sys/fs/cgroup/parent/child/memory.max"
[vallari@fedora turtle]$ cat /sys/fs/cgroup/parent/child/memory.max
524288000
[vallari@fedora turtle]$ cat /sys/fs/cgroup/parent/child/cpu.weight
50
3. Move process to cgroup
We can restrict resources of a process by moving it into the defined cgroup:
echo $PID > /sys/fs/cgroup/parent/child/cgroup.procs
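To convince ourselves the limit actually bites, we can move a shell into the ‘child’ cgroup from above and try to use more memory than memory.max allows - tail /dev/zero keeps allocating memory until something stops it (a sketch):
$ sudo bash -c 'echo $$ > /sys/fs/cgroup/parent/child/cgroup.procs && tail /dev/zero'
# tail is OOM-killed by the kernel once the cgroup crosses the 500M memory.max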
Manage cgroup for turtle
Here’s what I’ll do to set up the cgroup for turtle:
- Create a “test” cgroup and define which controllers its children can use in cgroup.subtree_control
- Create a “tasks” child cgroup and set cpu/memory limits for it
- Add the turtle container’s PID to the “tasks” cgroup
CGROUP_NAME="test"
# create new cgroup for our container
sudo mkdir /sys/fs/cgroup/$CGROUP_NAME
sudo bash -c "echo +cpu > /sys/fs/cgroup/$CGROUP_NAME/cgroup.subtree_control"
sudo bash -c "echo +memory > /sys/fs/cgroup/$CGROUP_NAME/cgroup.subtree_control"
# setting limits to cgroup
sudo mkdir /sys/fs/cgroup/$CGROUP_NAME/tasks
sudo bash -c "echo 50 > /sys/fs/cgroup/$CGROUP_NAME/tasks/cpu.weight"
sudo bash -c "echo 500M > /sys/fs/cgroup/$CGROUP_NAME/tasks/memory.max"
# limit a PID to the cgroup limits
TURTLE_PID=$(echo $$)
sudo bash -c "echo $TURTLE_PID > /sys/fs/cgroup/$CGROUP_NAME/tasks/cgroup.procs"
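After running this, the current shell (and anything it starts) is accounted to the new cgroup; a quick sanity check, assuming the “test/tasks” names used here:
$ grep "0::" /proc/$$/cgroup # the cgroup v2 entry should point at our cgroup
0::/test/tasks
$ cat /sys/fs/cgroup/test/tasks/cgroup.procs # our shell's PID should be listed here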
Bonus!
There’s another way to manage cgroups (besides cgroupfs!): the cgcreate (create a cgroup), cgset (set limits), and cgexec (run a process in the cgroup) commands, which come from the libcgroup tools.
cgcreate -g "cpu,memory:$CGROUP_NAME"
cgset -r cpu.weight=50 $CGROUP_NAME
cgset -r memory.max=500M $CGROUP_NAME
cgexec -g "cpu,memory:$CGROUP_NAME" ...<process cmd>...
References: kernel docs, man page, redhat docs
Build your own docker
Putting together everything, here’s the full script to implement our turtle container runtime.
# Step 1: setup minifs - root filesystem of our container
wget https://dl-cdn.alpinelinux.org/alpine/v3.20/releases/x86_64/alpine-minirootfs-3.20.2-x86_64.tar.gz
mkdir /tmp/turtle-os
tar xzf alpine-minirootfs-3.20.2-x86_64.tar.gz --directory=/tmp/turtle-os/
export PATH=$PATH:/bin
export PS1="\\u@\\h ~ "
cgroup_setup(){
CGROUP_NAME="test"
if [ ! -d "/sys/fs/cgroup/$CGROUP_NAME/tasks" ]; then
echo ">> setting cgroup..."
if [ ! -d "/sys/fs/cgroup/$CGROUP_NAME" ]; then
echo ">> creating '$CGROUP_NAME' cgroup"
sudo mkdir /sys/fs/cgroup/$CGROUP_NAME
sudo bash -c "echo +cpu > /sys/fs/cgroup/$CGROUP_NAME/cgroup.subtree_control"
sudo bash -c "echo +memory > /sys/fs/cgroup/$CGROUP_NAME/cgroup.subtree_control"
fi
echo ">> creating $CGROUP_NAME/tasks"
sudo mkdir /sys/fs/cgroup/$CGROUP_NAME/tasks
fi
sudo bash -c "echo 50 > /sys/fs/cgroup/$CGROUP_NAME/tasks/cpu.weight"
sudo bash -c "echo 500M > /sys/fs/cgroup/$CGROUP_NAME/tasks/memory.max"
TURTLE_PID=$(echo $$)
sudo bash -c "echo $TURTLE_PID > /sys/fs/cgroup/$CGROUP_NAME/tasks/cgroup.procs"
}
# Step 2: Setup cgroup
cgroup_setup
# Step 3: Setup namespaces
unshare --user --map-root-user --uts --pid --fork --mount \
chroot /tmp/turtle-os /bin/busybox sh -c \
"/bin/mount proc /proc -t proc && /bin/hostname turtle && /bin/busybox $@"
Run this script as:
./turtle.sh sh # exec into container
./turtle.sh ls # run random commands
References
- Julia Evans’s containers zine: How containers works
- Coding Challenge: Build Your Own Docker
- Talk by Jérôme Petazzoni: ‘Cgroups, namespaces, and beyond: what are containers made from?’
- And man pages!