Namespaces and Cgroups in Linux

Arian Fm
Jul 25, 2024


Introduction

Namespaces are a powerful feature in the Linux kernel that allow for the creation of isolated environments, giving each process or group of processes its own isolated set of system resources. This isolation is a cornerstone of containerization technologies like Docker and LXC. The six primary types of namespaces in Linux are: Mount (mnt), Process ID (pid), Network (net), Inter-process Communication (ipc), UTS (uts), and User (user). Additionally, Control Groups (cgroups) are used to manage and limit resources like CPU, memory, and I/O for groups of processes.
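
You can inspect the namespaces in use on a running system with the lsns utility from util-linux:

$ lsns # list the namespaces visible to the current user

$ lsns -t net # filter by a single namespace type, e.g. network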

Types of Namespaces

  1. Mount Namespace (mnt): Isolates the set of filesystem mount points seen by a group of processes.
  2. PID Namespace (pid): Isolates the process ID number space, allowing processes in different namespaces to have the same PID.
  3. Network Namespace (net): Isolates network resources such as network interfaces, routing tables, and firewall rules.
  4. IPC Namespace (ipc): Isolates inter-process communication resources, such as System V IPC objects and POSIX message queues.
  5. UTS Namespace (uts): Isolates the hostname and NIS domain name, allowing containers to have unique hostnames.
  6. User Namespace (user): Isolates user and group ID numbers, allowing a process to have different user IDs in different namespaces.
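
Each of these namespaces is exposed as a symlink under /proc/<pid>/ns, so you can check which namespaces any process belongs to (newer kernels also show extra entries such as cgroup and time):

$ ls -l /proc/$$/ns # inspect the current shell's namespaces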

Creating and Managing Namespaces

To demonstrate the creation and management of namespaces, we will use the unshare and ip commands along with various system files and settings.

1. Creating a Network Namespace

$ ip netns add mynetns

$ ip link add veth0 type veth peer name veth1 # Create a Virtual Ethernet Pair

$ ip link set veth1 netns mynetns # Move One End of the Pair to the Namespace

$ ip addr add 192.168.1.1/24 dev veth0 # Assign an IP Address to veth0

$ ip netns exec mynetns ip addr add 192.168.1.2/24 dev veth1 # Assign an IP Address to veth1

$ ip link set veth0 up # Bring Up veth0

$ ip netns exec mynetns ip link set veth1 up # Bring Up veth1 Inside the Namespace

$ ip netns exec mynetns ip link set lo up # Set Up Loopback Interface

$ ip netns exec mynetns ping -c 3 192.168.1.1 # Test Connectivity
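
Once you are done experimenting, you can list and remove network namespaces; deleting a namespace destroys the veth end that was moved into it (and with it the pair):

$ ip netns list # show existing network namespaces

$ ip netns delete mynetns # remove the namespace and its interfaces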

2. Creating an IPC Namespace

$ unshare --ipc --fork /bin/bash # Unshare IPC Namespace

$ ipcs # list the IPC objects visible in the namespace

$ ipcmk -Q # create a message queue

$ ipcmk -M 4096 # create a 4 KiB shared memory segment (-M requires a size)
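
To confirm the isolation, run ipcs from another shell on the host; the message queue and shared memory segment created above won't be listed there:

$ ipcs -q # in the parent shell: the namespace's message queue does not appear

$ ipcs -m # nor does its shared memory segment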

3. Creating a PID Namespace

$ unshare --pid --fork --mount-proc /bin/bash # Unshare PID Namespace
# --mount-proc: Mounts the /proc filesystem in the new namespace.

$ ps -ef # Verify PID Namespace
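
Because the shell was started as the first process of the new namespace, it sees itself as PID 1:

$ echo $$ # prints 1 inside the new PID namespace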

4. Creating a UTS Namespace

$ unshare --uts --fork /bin/bash # Unshare UTS Namespace

$ hostname mycontainer # set a hostname

$ hostname # Verify Hostname
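
The change is confined to the namespace; in another shell on the host, hostname still prints the original name:

$ hostname # run in the parent shell: still shows the original hostname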

5. Creating a User Namespace

User namespaces allow a process to have different user and group IDs inside the namespace than those outside. This is especially useful for running processes with root privileges inside a container while the actual process on the host runs with a non-root user.

$ unshare --user --map-root-user --fork /bin/bash # Create a User Namespace and Map User IDs

$ id # Verify User IDs

Inside the new shell, you will see that the user ID is 0 (root), but outside, the process still runs with the original user’s privileges.
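
The kernel records this mapping in /proc/<pid>/uid_map. With --map-root-user, UID 0 inside the namespace is mapped to your original UID outside:

$ cat /proc/self/uid_map # e.g. "0 1000 1" if your host UID is 1000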

6. Creating a Mount Namespace

Mount namespaces allow a group of processes to have a different view of the filesystem hierarchy. This is particularly useful for containerization, as each container can have its own filesystem layout.

$ unshare --mount --fork /bin/bash # Create a Mount Namespace

$ mount -t tmpfs none /mnt # Mount a New Filesystem

$ mount | grep /mnt # Verify the Mount

$ findmnt # run in the parent shell to check its mount points

This shows that the /mnt directory is now a separate tmpfs mount point inside the namespace, isolated from the host.

If you run the findmnt command in your parent shell, you won't see the namespace's mount point.
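
You can also confirm that the two shells live in different mount namespaces by comparing their namespace identifiers; the inode numbers in the output will differ:

$ readlink /proc/$$/ns/mnt # run in both shells and compare the mnt:[...] values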

Let’s run the following command:

$ unshare --user --pid --map-root-user --mount-proc --fork chroot $HOME/test /bin/bash

What exactly are we doing here? What are we telling the Kernel to do? Let’s go step by step to try to understand what we are doing.

  • --user: Create a new user namespace.
  • --pid: Create a new PID namespace. (Fails if --fork is not specified.)
  • --map-root-user: Wait to start the process until the current user (running the unshare command) has been mapped to the superuser in the new namespace. This grants root privileges within the namespace, but not outside of it.
  • --mount-proc: Mount the /proc filesystem in the new namespace (creating a new mount namespace as a side effect), so that processes in different namespaces can have the same process IDs without conflict.
  • --fork: Run the command as a child process of unshare instead of executing it directly. This is required when creating a new PID namespace.
  • chroot: Change the root directory once the namespaces are created; in this case, the new root becomes the test directory under the parent user's home.
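
Note that the chroot only works if $HOME/test already contains a minimal root filesystem providing /bin/bash and its libraries. As a rough sketch, one way to populate it on a Debian-based system (assuming debootstrap is installed and run as root):

$ mkdir -p $HOME/test

$ debootstrap stable $HOME/test http://deb.debian.org/debian # build a minimal Debian rootfs (run as root)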

Control Groups (cgroups)

Control groups (cgroups) allow you to allocate resources — such as CPU time, system memory, network bandwidth, or combinations of these resources — among user-defined groups of tasks (processes) running on a system.

Viewing cgroups

Cgroups are accessed through the filesystem rather than through the system-call interface traditionally used to interact with the kernel. To investigate a shell's cgroup configuration, read the /proc/self/cgroup file, which reveals the shell's cgroup. Then, by navigating to the /sys/fs/cgroup (or /sys/fs/cgroup/unified) directory and locating the directory that shares the cgroup's name, you can inspect the settings and resource-usage information for that cgroup.
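
For example, on a cgroup v2 system:

$ cat /proc/self/cgroup # prints a single line such as 0::/user.slice/user-1000.slice/session-1.scope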

root@debian:/sys/fs/cgroup# ls
cgroup.stat cpuset.mems.effective io.cost.model memory.pressure sys-kernel-debug.mount
cgroup.controllers cgroup.subtree_control cpu.stat io.cost.qos memory.stat sys-kernel-tracing.mount
cgroup.max.depth cgroup.threads dev-hugepages.mount io.pressure -.mount system.slice
cgroup.max.descendants cpu.pressure dev-mqueue.mount io.stat sys-fs-fuse-connections.mount user.slice
cgroup.procs cpuset.cpus.effective init.scope memory.numa_stat sys-kernel-config.mount

To create a cgroup, make a directory under /sys/fs/cgroup.

For example, let's create a cgroup named test:

root@debian:/sys/fs/cgroup# mkdir test

root@debian:/sys/fs/cgroup# cd test

root@debian:/sys/fs/cgroup/test# ls
cgroup.controllers cgroup.stat cpuset.cpus cpu.weight memory.current memory.min memory.swap.events
cgroup.events cgroup.subtree_control cpuset.cpus.effective cpu.weight.nice memory.events memory.numa_stat memory.swap.high
cgroup.freeze cgroup.threads cpuset.cpus.partition io.max memory.events.local memory.oom.group memory.swap.max
cgroup.max.depth cgroup.type cpuset.mems io.pressure memory.high memory.pressure pids.current
cgroup.max.descendants cpu.max cpuset.mems.effective io.stat memory.low memory.stat pids.events
cgroup.procs cpu.pressure cpu.stat io.weight memory.max memory.swap.current pids.max

As you can see, the kernel automatically generates all of the cgroup interface files inside the new directory.

For example, let's set a limit on the number of processes in the test cgroup:

root@debian:/sys/fs/cgroup/test# echo 20 > pids.max
# this limits the cgroup to at most 20 processes
# now let's add a process to this cgroup

root@debian:/sys/fs/cgroup/test# echo $PID > cgroup.procs # replace $PID with the ID of the process you want to add
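
You can then verify the accounting; for example, if you added the current shell with echo $$ > cgroup.procs:

root@debian:/sys/fs/cgroup/test# cat pids.current # number of processes currently in the cgroup

root@debian:/sys/fs/cgroup/test# cat pids.max # the limit we set above: 20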

Controlling the distribution of CPU time for applications by adjusting CPU weight:

To regulate how CPU time is distributed among applications under a specific cgroup tree, assign values to the relevant interface files of the cpu controller, as sketched below.
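
A minimal sketch, assuming cgroup v2 with the cpu controller available: first enable the controller for child cgroups at the root, then raise the weight of the test cgroup relative to the default of 100:

root@debian:/sys/fs/cgroup# echo "+cpu" > cgroup.subtree_control # enable the cpu controller for child cgroups

root@debian:/sys/fs/cgroup# echo 200 > test/cpu.weight # give test twice the default share of CPU time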

For more info, see: https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/8/html/managing_monitoring_and_updating_the_kernel/using-cgroups-v2-to-control-distribution-of-cpu-time-for-applications_managing-monitoring-and-updating-the-kernel#preparing-the-cgroup-for-distribution-of-cpu-time_using-cgroups-v2-to-control-distribution-of-cpu-time-for-applications

Memory Controller

Memory utilization is a key area where resource control can make big efficiency improvements. In this section we’ll look in detail at the cgroup2 memory controller, and how to get started configuring its interface files for controlling system memory resources.

Like all cgroup controllers, the memory controller creates a set of interface files in its child cgroups whenever it's enabled. You adjust the distribution of memory resources by modifying these interface files, often within a Chef recipe or container job description. Here are some of the memory controller's core interface files. Amounts in these files are expressed in bytes.

memory.current: The total amount of memory currently being used by the cgroup and its descendants. It includes page cache, in-kernel data structures such as inodes, and network buffers.

memory.high: The memory usage throttle limit, and the main mechanism for controlling a cgroup's memory use. If a cgroup's memory use goes over the high boundary specified here, the cgroup's processes are throttled and put under heavy reclaim pressure. The default is max, meaning there is no limit.

memory.max: The memory usage hard limit, acting as the final protection mechanism: if a cgroup's memory usage reaches this limit and can't be reduced, the system OOM killer is invoked on the cgroup. Under certain circumstances, usage may temporarily exceed the memory.high limit. When the high limit is used and monitored properly, memory.max serves mainly as the final safety net. The default is max.

memory.low: Best-effort memory protection, a "soft guarantee": if the cgroup and all its descendants are below this threshold, the cgroup's memory won't be reclaimed unless memory can't be reclaimed from any unprotected cgroups.

memory.min: Hard memory protection, a minimum amount of memory the cgroup always retains; memory within this boundary is never reclaimed by the system. If no unprotected reclaimable memory is available elsewhere, the OOM killer is invoked.

memory.swap.current: The total amount of swap currently used by the cgroup and its descendants.

memory.swap.max: The swap usage hard limit. If a cgroup's swap usage reaches this limit, anonymous memory of the cgroup will not be swapped out; memory that already has swap slots allocated to it can continue to be swapped out. The default is max.

# Set the memory limit to 40 MiB
root@debian:/sys/fs/cgroup/test# echo $((40 * 1024 * 1024)) > /sys/fs/cgroup/test/memory.max
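
Reading the file back confirms the new limit:

root@debian:/sys/fs/cgroup/test# cat memory.max # prints 41943040 (40 * 1024 * 1024)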
