
Build Containers From Scratch in Go
In the last few years, the use of containers has increased significantly. The concept of containers has been around for several years, but it was Docker’s easy-to-use command line that started to popularize containers among developers in 2013.
In this series, I will try to demonstrate how containers work underneath and how I developed vessel.
What is vessel?
vessel is an educational project of mine that implements a tiny version of Docker to manage containers. It uses neither containerd nor runc; instead, it relies on a set of Linux features to create containers.
vessel is neither production-ready nor well-tested software. It’s just a simple project to learn more about containers.
Let’s start: reading about Docker!
I found it useful to take a look at Docker docs and gain insight into containers first, before starting to code.
Docker, according to its documentation, takes advantage of several features of the Linux kernel and combines them into a wrapper called a container format. Those features are:
- Namespaces
- Control groups
- Union file systems
Now let’s briefly go through the list above and understand what each item is.
What is a namespace?
Linux namespaces are the underlying technology behind most modern container implementations. A namespace determines what a process can see of the system around it. Namespaces allow global system resources to be isolated within a group of processes. The network namespace, for example, isolates the networking stack, which means processes within that network namespace can have their own independent routes, firewall rules, and network devices.
So without namespaces, processes in a container could, for example, unmount a file system or bring down a network interface in another container.
What kinds of resources can be isolated using namespaces?
In the current Linux kernel (5.9), there are 8 different types of namespaces. Each namespace can isolate certain global system resources.
- Cgroup: This namespace isolates the Control Groups root directory. I will explain what cgroups are in part 2. In short, cgroups allow the system to apply resource restrictions to a group of processes. Note, however, that a cgroup namespace only controls which cgroups are visible within the namespace. The namespace cannot assign resource restrictions. We will explain this in depth soon.
- IPC: This namespace isolates inter-process communication mechanisms such as System V and POSIX message queues. IPC is not hard to understand, but it is outside the scope of this post.
- Network: This namespace isolates routes, firewall rules, and network devices that a group of processes within the namespace can see.
- Mount: This namespace isolates the list of mount points in each namespace. Processes running in separate mount namespaces can mount and unmount without affecting other namespaces.
- PID: This namespace isolates process ID number space. It enables functions such as suspending/resuming processes within the namespace.
- Time: This namespace isolates the CLOCK_MONOTONIC and CLOCK_BOOTTIME system clocks, which affects APIs that measure against these clocks, such as system uptime.
- User: This namespace isolates user IDs, group IDs, the root directory, keys, and capabilities. This allows a process to be root within the namespace but not outside of it (on the host, for example).
- UTS: This namespace isolates the hostname and the domain name.
An important note about namespaces
Namespaces do nothing but isolation. For example, joining a new network namespace won’t give you a set of isolated network devices; you have to create them on your own. The same goes for the UTS namespace: it won’t change your hostname. The only thing it does is isolate hostname-related system calls. We are going to do these things together throughout this series.
Namespace lifetime
A namespace is automatically torn down when the last process in the namespace leaves it. However, there are a number of exceptions that keep a namespace alive without any member processes. We will explain one of these exceptions when creating a network namespace for vessel.
Namespace system calls
Now that we briefly know what namespaces are, it is time to see how to interact with them. In Linux, there is a set of system calls for creating, joining, and discovering namespaces.
- clone: This system call creates a new process. With the aid of the flags argument, the new process can be created in its own new namespaces.
- setns: This system call allows the running process to join an existing namespace.
- unshare: This system call is similar to clone; the difference is that unshare creates new namespaces and moves the current process into them, whereas clone creates a new process in new namespaces.
Bonus point: Inside the kernel, the fork and vfork syscalls are implemented by the same routine as clone(), just called with different arguments.
Namespace Flags
The system calls mentioned above take flags that specify which namespaces you want.
CLONE_NEWCGROUP Cgroup namespaces
CLONE_NEWIPC IPC namespaces
CLONE_NEWNET Network namespaces
CLONE_NEWNS Mount namespaces
CLONE_NEWPID PID namespaces
CLONE_NEWTIME Time namespaces
CLONE_NEWUSER User namespaces
CLONE_NEWUTS UTS namespaces
For example, if you want to create a new Network namespace for the current process, you should call unshare with the CLONE_NEWNET flag, and if you want to create a new process with new User and UTS namespaces, you should call clone with CLONE_NEWUSER|CLONE_NEWUTS. The vertical bar represents bitwise OR, which combines the two flags.
Namespace file
Above, I mentioned that the setns syscall moves a running process between namespaces. But how can we specify which namespace we want to move to? Well, after a namespace is created, its member processes have a symbolic link to the namespace file.
In Unix, everything is a file.
In your shell, for example, you can see a process’s namespaces by listing the files under the /proc/[pid]/ns directory. Here you can see the current namespaces of the running shell (self refers to the process that reads it, here ls, which inherits its namespaces from the shell):
$ ls -l /proc/self/ns | cut -d ' ' -f 10-12
cgroup -> cgroup:[4026531835]
ipc -> ipc:[4026531839]
mnt -> mnt:[4026531840]
net -> net:[4026532008]
pid -> pid:[4026531836]
pid_for_children -> pid:[4026531836]
time -> time:[4026531834]
time_for_children -> time:[4026531834]
user -> user:[4026531837]
uts -> uts:[4026531838]
Using the lsns command, you can also see a list of namespaces and the processes in them:
# lsns
NS TYPE NPROCS PID USER COMMAND
4026531834 time 244 1 root /sbin/init
4026531835 cgroup 244 1 root /sbin/init
4026531836 pid 199 1 root /sbin/init
4026531837 user 198 1 root /sbin/init
4026531838 uts 241 1 root /sbin/init
4026531839 ipc 244 1 root /sbin/init
4026531840 mnt 234 1 root /sbin/init
What the setns syscall actually does is move the calling process into the namespace behind one of these files; the links under the /proc/[pid]/ns directory then point to the new namespace.
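The same information is available from Go by reading the symbolic links directly. The sketch below is an illustration rather than vessel's code: it lists /proc/self/ns and splits each link target into the namespace type and its inode number. The function name parseNamespaceLink is hypothetical. Two processes whose links carry the same inode number are in the same namespace.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseNamespaceLink splits a link target such as "net:[4026532008]" into
// the namespace type ("net") and its inode number (4026532008).
func parseNamespaceLink(target string) (string, uint64, error) {
	parts := strings.SplitN(target, ":[", 2)
	if len(parts) != 2 || !strings.HasSuffix(parts[1], "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", target)
	}
	inode, err := strconv.ParseUint(strings.TrimSuffix(parts[1], "]"), 10, 64)
	if err != nil {
		return "", 0, err
	}
	return parts[0], inode, nil
}

func main() {
	const dir = "/proc/self/ns"
	entries, err := os.ReadDir(dir)
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read "+dir+":", err)
		return
	}
	for _, e := range entries {
		// Each entry is a symbolic link like "uts -> uts:[4026531838]".
		target, err := os.Readlink(filepath.Join(dir, e.Name()))
		if err != nil {
			continue
		}
		kind, inode, err := parseNamespaceLink(target)
		if err != nil {
			continue
		}
		fmt.Printf("%-18s %-6s %d\n", e.Name(), kind, inode)
	}
}
```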
Enough talk, LET’S CODE!
Now we know everything we need. It is time to write our first code that runs in a separate namespace. For our first try, let’s see how unshare works. The code below uses the Unshare function from the syscall package to create a new namespace for the currently running Go program (line 1), then sets the hostname to “container” (line 5), and finally creates a new command and runs it (line 9). Run starts the command and waits for it to finish.
Creating namespaces requires the CAP_SYS_ADMIN capability, except for user namespaces. Thus, you need to run the program as root.








