The docker group and privilege escalation

By default the Docker daemon will only accept commands from the root user. Running as root should be avoided unless strictly necessary, as the root user has enormous powers over the system.

The problem

Say you are working on a shared computer, and users on that computer need to use docker. To avoid having to give all your users root access you can add them to the docker group, and suddenly they can run docker without root. Neat, right? A lot of you will probably stop reading here, as you have found the solution to your problem. But please read on: the problem with root access is not yet solved! If you are impatient and don't care about the full explanation you can skip to the section called "The fix".

Namespaces

On Linux, Docker is built on top of namespaces and cgroups. For this particular problem we have to talk briefly about the user namespace.

A user namespace comes with its own set of user and group ids. What I mean by this is that the user and group ids seen inside of a user namespace might map to completely different user and group ids in the parent namespace, or on the docker host machine if you prefer. This has a direct effect on file permissions inside of the container.

Let's first have a look at how the namespace uids are managed. I'm ignoring the gids from now on, as the same applies for them, except for the file used to set and check the mappings.

Every process has a uid_map file under /proc (readable by the process itself as /proc/self/uid_map), as well as a gid_map file for group ids. These files hold the id mapping for that process, and are inherited by child processes. The default mapping can be seen by looking at the uid_map for your normal shell:

$ cat /proc/self/uid_map
       0          0 4294967295

This mapping can be read as: uid 0 for this process is mapped to uid 0 in the parent namespace, and the same holds for a range of 4294967295 consecutive uids starting there. We can see this as an identity mapping where every uid in the namespace is mapped to the same uid in the parent namespace. So how does this look for a process inside of a user namespace?

$ id
uid=1000(martin) gid=1000(martin) groups=1000(martin),14(uucp),102(docker)
$ unshare -U /bin/bash # Create and enter a user namespace
$ id
uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)
$ cat /proc/self/uid_map
$ echo $$
12345
$

As you can see, the default mapping is empty, and Linux tells us we are nobody because our uid and gid are not mapped to anything inside of the namespace. If we, as root in the parent namespace, do echo "0 0 10000" > /proc/12345/uid_map, we set the identity mapping for the first 10000 uids for the shell in the user namespace. This changes the id of the user:

$ id
uid=1000(martin) gid=65534(nobody) groups=65534(nobody)
$ cat /proc/self/uid_map
0 0 10000
$ 

So we are now shown as uid 1000 again. If we do the same with /proc/12345/gid_map we also get the correct group permissions, but right now they are not set.

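If we instead write 0 100 1000 to the map (in a fresh namespace, since a uid_map can only be written once), uid 0 inside of the namespace will be uid 100 in the parent namespace. Our own uid 1000 in the parent namespace then shows up as uid 900 inside of the namespace, so we will instead get:

$ id
uid=900 gid=65534(nobody) groups=65534(nobody)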
$ cat /proc/self/uid_map
0 100 1000

This means that if we are uid 0 inside of the namespace we are uid 100 in the parent namespace, and we are no longer able to get true root inside of the namespace.
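
The way the three columns of a uid_map line are read can be sketched in a few lines of shell. This is only an illustration of the arithmetic, not how the kernel implements it, and the function name ns_to_parent_uid is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: translate a uid as seen inside a namespace into the
# uid it has in the parent namespace. Reads uid_map-style lines on stdin:
#   <first uid inside> <first uid in parent> <length of range>
ns_to_parent_uid() {
    uid=$1
    while read -r inside outside length; do
        # The uid falls inside this range: shift it by the range offset.
        if [ "$uid" -ge "$inside" ] && [ "$uid" -lt $((inside + length)) ]; then
            echo $((outside + uid - inside))
            return 0
        fi
    done
    # No range matched: the uid has no counterpart in the parent namespace.
    echo unmapped
}

printf '0 100 1000\n' | ns_to_parent_uid 0      # prints 100
printf '0 100 1000\n' | ns_to_parent_uid 999    # prints 1099
printf '0 100 1000\n' | ns_to_parent_uid 1000   # prints unmapped
```

On a real system the same function can be fed a live map, for example ns_to_parent_uid 0 < /proc/self/uid_map.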

This maps directly to file permissions in Linux. To see if a user has access to a file, the kernel has to resolve the true uid and gids for the process trying to access it. This is done by mapping the uid the process is running as to the uid in the parent namespace. This mapping is repeated until we are in the root user namespace. At this point we have the true uid and gids of this user, and those ids are used to see if we have access to this file.
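
To make the repeated mapping concrete, here is a rough sketch that walks a uid up through a chain of namespaces. It assumes each namespace has a single-line map, and the function name resolve_uid is invented for the example; the kernel does not work with strings like this.

```shell
#!/bin/sh
# Hypothetical helper: resolve a uid through nested user namespaces.
# $1 is the uid in the innermost namespace; every following argument is one
# "inside outside length" map line, ordered from innermost to outermost.
resolve_uid() {
    uid=$1
    shift
    for map in "$@"; do
        # Split the map line into its three fields.
        read -r inside outside length <<EOF
$map
EOF
        # Stop as soon as one level has no mapping for the uid.
        if [ "$uid" -lt "$inside" ] || [ "$uid" -ge $((inside + length)) ]; then
            echo unmapped
            return 0
        fi
        uid=$((outside + uid - inside))
    done
    echo "$uid"
}

# uid 0 in a nested namespace is uid 500 in its parent, and that parent
# maps its own uid 0 to uid 100 in the root namespace.
resolve_uid 0 "0 500 10" "0 100 1000"   # prints 600
```

The final number is the uid that is compared against the owner of the file.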

Docker is built on top of namespaces in Linux. When docker sets up the mapping it will, by default, use an identity mapping. To see the mapping you can run $ docker container run -it --rm alpine cat /proc/self/uid_map. To see the group mapping you can cat /proc/self/gid_map instead. This mapping means that root inside of the container is actually true root.

The exploit

The effect of this mapping is that, by default, any user who is allowed to run docker can escalate their privileges to full root access on the host. The trick here is the ability docker has to make a bind mount of a file or directory from the host to the container. All they need is an image with root access to do something like:

$ docker container run -it --volume /etc/shadow:/tmp/shadow alpine:latest sh
# echo "root:$(mkpasswd -m sha-512 foo foobarbaz):12345:0:::::" > /tmp/shadow

This switches the root password to foo and, because the whole file is overwritten, removes the password entry of every other account. This command is extremely crude, and one can be more sophisticated to avoid breaking every other account on the system, but the issue is the same.
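
Since membership of the docker group is what grants this access, it can be worth checking who is in it. A minimal sketch, assuming the group is called docker as in the id output earlier. The function name group_members and the user alice are made up; the input is a group(5)-style line such as the one getent group docker prints.

```shell
#!/bin/sh
# Hypothetical helper: print the members of a group, one per line, from a
# group(5)-style line on stdin: name:password:gid:member1,member2,...
group_members() {
    cut -d: -f4 | tr ',' '\n' | sed '/^$/d'
}

# Canned example line. On a real host: getent group docker | group_members
echo 'docker:x:102:martin,alice' | group_members
```

Everyone on that list should be treated as having root on the host. Users who have docker as their primary group are not listed in this field, so they would need a separate check.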

The fix

So how can you give your developers access to docker without giving them root access? The problem here is the uid mapping of the user namespace that is created. We need the second number in the uid/gid_map files to change, so root inside the container is mapped to something non-root. There are several ways of achieving this, but here I'm only going to present the one I found easiest.

To do this you need to change some files on your system. The first two files are standard: add the line dockremap:100:65536 to both /etc/subuid and /etc/subgid. This limits the uid and gid ranges the user dockremap can map. In this case root inside of the container will get uid 100 on the host, and there will be a maximum of 65536 uids inside of the container.
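
The range such an entry hands out is simple arithmetic, sketched below. The function name subuid_range is made up, and the entry is assumed to follow the name:start:count layout of subuid(5).

```shell
#!/bin/sh
# Hypothetical helper: print the first and last host uid granted by a
# subuid(5)-style entry of the form name:start:count.
subuid_range() {
    IFS=: read -r name start count <<EOF
$1
EOF
    echo "${name}: host uids ${start}-$((start + count - 1))"
}

subuid_range 'dockremap:100:65536'   # prints "dockremap: host uids 100-65635"
```

Note that with a start value of 100 the range overlaps ordinary accounts on the host: container uid 900 lands on host uid 1000. If that matters on your system, a start value above every real uid on the host may be a safer choice.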

You then need to start your docker daemon with the flag --userns-remap=default. The simplest way to do that is to add the setting to your /etc/docker/daemon.json file and restart the daemon.

{
  "userns-remap": "default"
}

This tells docker to use the default docker user for remapping the user namespace. The default user docker will use is the dockremap user, which will be created for you. If you want to use another username you have to create that user, and change dockremap to that user in both the subuid/subgid files, as well as in the docker daemon arguments.
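
A quick way to confirm that the setting is actually present in the file is a small grep. This is a sketch, not a real JSON parser, and the function name has_userns_remap is made up; it is demonstrated on a throwaway file instead of the real /etc/docker/daemon.json.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given daemon.json sets "userns-remap"
# to a non-empty string. A plain grep, so unusual formatting may fool it.
has_userns_remap() {
    grep -Eq '"userns-remap"[[:space:]]*:[[:space:]]*"[^"]+"' "$1"
}

# Try it on a temporary copy of the config shown above.
tmp=$(mktemp)
printf '{\n  "userns-remap": "default"\n}\n' > "$tmp"
if has_userns_remap "$tmp"; then
    echo "userns-remap is set"
fi
rm -f "$tmp"
```

On a real host you would run has_userns_remap /etc/docker/daemon.json. After restarting the daemon, the security options listed by docker info should also mention userns.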

When I do this and run the same docker command as before, I see:

$ docker run -it --rm --volume /etc/shadow:/tmp/shadow alpine sh
/ # cat /proc/self/uid_map 
         0        100      65536
/ # cat /proc/self/gid_map 
         0        100      65536
/ # cat /tmp/shadow 
cat: can't open '/tmp/shadow': Permission denied
/ # ls -l /tmp/shadow 
-rw-r-----    1 nobody   nobody         833 Jun 17 12:58 /tmp/shadow
/ # 

What we see here is that the shadow file, owned by root on the host, appears to be owned by nobody. The reason for that is that uid 0 on the host is not mapped inside of the container. When we try to access the file, the Linux kernel maps root inside of the container to the host uid and sees that the real uid of this user is 100. Since uid and gid 100 on the host lack the privileges to read the shadow file, we get "Permission denied" when trying to read it.
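
Why the owner shows up as nobody can be sketched by running the map in the other direction, from host uid to the uid shown inside of the container. The function name parent_to_ns_uid is made up. The fallback value 65534 is the default overflow uid on Linux, which is what an unmapped id is displayed as.

```shell
#!/bin/sh
# Hypothetical helper: translate a host uid into the uid shown inside the
# namespace. Reads uid_map-style lines on stdin:
#   <first uid inside> <first uid on host> <length of range>
parent_to_ns_uid() {
    uid=$1
    while read -r inside outside length; do
        # The host uid is covered by this range: shift it back.
        if [ "$uid" -ge "$outside" ] && [ "$uid" -lt $((outside + length)) ]; then
            echo $((inside + uid - outside))
            return 0
        fi
    done
    # Unmapped ids are displayed as the overflow uid (nobody).
    echo 65534
}

printf '0 100 65536\n' | parent_to_ns_uid 0     # prints 65534: host root is unmapped
printf '0 100 65536\n' | parent_to_ns_uid 100   # prints 0: host uid 100 is container root
```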

The caveat

One important thing to notice is that this will change all of your mappings. If you depend on the user inside of your container having the same uid as your host user, you need to change the uid of the user inside of the container to fit the new mapping.

SELinux

Another way to protect /etc/shadow is to use SELinux and make an allowlist of the directories that you allow to be mounted inside of docker. I would suggest you also look into SELinux, as it protects against a lot more than we have covered here. I would still set up the user namespace remapping as well, both as an extra level of protection and in case I mess up my SELinux configuration.

Hope this was useful for you. If you have any comments or questions feel free to comment below. I'll read through the comments, but might not be able to answer quickly.