Running multiple Linux graphical environments using containers

Dividing a single machine into multiple Linux environments (e.g. private and work-related) can be challenging, especially with the usual approaches of dual-booting and virtualization. In this post I will describe a different approach in which the user can run multiple graphical environments without sacrificing convenience (as with dual-booting) or performance (as with virtualization or local VNC connections).

Introduction

This specific setup was achieved on an Arch Linux host with either Gentoo or Arch Linux as the containerized system. The configuration process should be fairly similar for any other distribution and for various other system layouts.

The target machine contains a 500G NVMe disk which has been partitioned into two GPT partitions:

  1. EFI System around 2G in size for the kernel and UEFI bootloader

  2. LVM partition which spans the remaining disk space, carrying LUKS volumes (LUKS-on-LVM)

    1. LVM swap partition, 16G in size
    2. LUKS-encrypted work rootfs partition, around 230G in size
    3. LUKS-encrypted private rootfs partition, remaining partition space

The LightDM display manager is started automatically when booting both Linux systems and is used to log in to the i3 window manager environment.

Running a container using systemd nspawn

It is relatively trivial to create and start a Linux container. The underlying technology, Linux namespaces, has allowed the proliferation of various containerization methodologies such as Docker, Kubernetes, Linux Containers, and others. However, for the purpose of containerizing an entire root filesystem we may choose between Linux Containers (LXC) and systemd-nspawn.

Given that systemd is the init system of choice on Arch Linux and given that it provides a containerization mechanism by default — the systemd-nspawn(1) tool — it will be used as the containerization tool of choice in this post.

We begin by decrypting the LUKS volume and mounting the underlying filesystem under the /var/lib/machines directory:

~ # cryptsetup open /dev/vgstrthinpad/work rootfs-work
Enter passphrase for /dev/vgstrthinpad/work:

~ # mount /dev/mapper/rootfs-work /var/lib/machines/work
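
To avoid repeating these steps on every boot, the mapping and mount can optionally be declared in crypttab(5) and fstab(5). This is only a sketch using the device names from above; the ext4 filesystem type is an assumption, and the passphrase is still prompted for during boot:

```
# /etc/crypttab — open the work LV as /dev/mapper/rootfs-work at boot
rootfs-work  /dev/vgstrthinpad/work  none  luks

# /etc/fstab — mount the decrypted volume below /var/lib/machines (ext4 assumed)
/dev/mapper/rootfs-work  /var/lib/machines/work  ext4  defaults  0  2
```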

A systemd installation usually provides the systemd-nspawn@.service template unit which can be used to run containers as background services. Since it is a template unit, starting the systemd-nspawn@work service would indeed create a new container based on the contents of /var/lib/machines/work. However, the default nspawn configuration performs certain destructive operations, so it is necessary to configure nspawn appropriately first.

Therefore, we create the /etc/systemd/nspawn directory (if it does not exist) and place the following systemd.nspawn(5) configuration in /etc/systemd/nspawn/work.nspawn:

[Exec]
Boot=yes
Parameters="systemd.legacy_systemd_cgroup_controller=0 systemd.mask=docker.service"

Ephemeral=no
ProcessTwo=no
PrivateUsers=no

Capability=all
SystemCallFilter=add_key keyctl openat

Hostname=strthinpad

[Files]
ReadOnly=no
Volatile=no
Bind=/sys/fs/cgroup/unified

[Network]
Private=yes
VirtualEthernet=yes

As mentioned, the PrivateUsers option and its systemd-nspawn(1) equivalent --private-users= perform a destructive operation when combined with --private-users-chown, which is the default behaviour of the systemd-nspawn@.service unit file. Although the operation is destructive, there is a way for it to be undone as explained in the man page:

--private-users-chown

	If specified, all files and directories in the container's directory tree
	will be adjusted so that they are owned by the appropriate UIDs/GIDs selected
	for the container (see above). This operation is potentially expensive, as it
	involves iterating through the full directory tree of the container. Besides
	actual file ownership, file ACLs are adjusted as well.

	This option is implied if --private-users=pick is used. This option has no
	effect if user namespacing is not used.

-U
   Note: it is possible to undo the effect of --private-users-chown (or -U)
   on the file system by redoing the operation with the first UID of 0:

       systemd-nspawn ... --private-users=0 --private-users-chown

Note that we have configured both the Bind=/sys/fs/cgroup/unified bind mount and the systemd.legacy_systemd_cgroup_controller=0 init process parameter. This is a recent development stemming from the fact that the systemd v233 release introduced a new hybrid control group mode. These options directly influence whether the systemd init process of the container will be able to start successfully. Here we concern ourselves exclusively with the default system configuration, i.e. the “unified” cgroups-v2 mode.
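
Which mode a given host is running can be checked by inspecting the filesystem type mounted at /sys/fs/cgroup:

```shell
# cgroup2fs => pure "unified" cgroups-v2 hierarchy;
# tmpfs     => hybrid or legacy mode
stat -fc %T /sys/fs/cgroup
```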

To actually start the containerized system we use the systemctl(1) tool:

~ # systemctl start systemd-nspawn@work
~ # systemctl status systemd-nspawn@work
● systemd-nspawn@work.service - Container work
     Loaded: loaded (/usr/lib/systemd/system/systemd-nspawn@.service; disabled; vendor preset: disabled)
     Active: active (running) since Tue 2021-03-16 17:22:28 CET; 9s ago
       Docs: man:systemd-nspawn(1)
   Main PID: 11023 (systemd-nspawn)
     Status: "Container running: Startup finished in 254ms."
      Tasks: 21 (limit: 16384)
     Memory: 41.3M
     CGroup: /machine.slice/systemd-nspawn@work.service
             ├─payload
             │ ├─init.scope
             │ │ └─11025 /usr/lib/systemd/systemd systemd.legacy_systemd_cgroup_controller=0 systemd.mask=docker.service

Since it is a registered container, we may use the machinectl(1) tool to open a shell into the system:

nyqcd@private ~ $ sudo machinectl shell nyqcd@work /bin/bash
Connected to machine work. Press ^] three times within 1s to exit session.
[nyqcd@work ~]$

Excellent, now we can move over to configuring the target container system.

Running X11 on different TTY

The LightDM instance starting inside the container attempts to spawn an X11 process onto the vt7 virtual console:

root@work ~ # lightdm --debug
...
[+0.00s] DEBUG: Using VT 7
[+0.00s] DEBUG: Seat seat0: Starting local X display on VT 7
[+0.00s] DEBUG: XServer 0: Logging to /var/log/lightdm/x-0.log
[+0.00s] DEBUG: XServer 0: Writing X server authority to /run/lightdm/root/:0
[+0.00s] DEBUG: XServer 0: Launching X Server
[+0.00s] DEBUG: Launching process 314: /usr/bin/X :0 -seat seat0 -auth /run/lightdm/root/:0 -nolisten tcp vt7 -novtswitch
[+0.00s] DEBUG: XServer 0: Waiting for ready signal from X server :0
...

However, the tty7 console is not only inaccessible due to the nspawn configuration above, it is also used by the X11 server on the host system. To resolve this issue, we simply instruct LightDM to spawn X11 on another virtual terminal, in this case tty8, by changing the /etc/lightdm/lightdm.conf file:

#
# General configuration
#
...
[LightDM]
#start-default-seat=true
#greeter-user=lightdm
#minimum-display-number=0
minimum-vt=8 # Setting this to a value < 7 implies security issues, see FS#46799
...

To actually allow container access to the /dev/tty8 virtual terminal device, we must stop the container and make several changes to the nspawn mechanism. First, we allow the systemd-nspawn process itself to access the device by creating an /etc/systemd/system/systemd-nspawn@work.service.d/override.conf drop-in file with the following contents (remember to run systemctl daemon-reload afterwards):

[Service]
Environment=SYSTEMD_NSPAWN_USE_CGNS=0
DevicePolicy=auto
DeviceAllow=/dev/tty8 rwm
DeviceAllow=/dev/dri/card0 rwm
DeviceAllow=char-drm rwm

Second, the nspawn configuration file at /etc/systemd/nspawn/work.nspawn must be changed to expose the relevant tty8 device using bind mounts:

diff --git a/etc/systemd/nspawn/work.nspawn b/etc/systemd/nspawn/work.nspawn
index c4a9e06..85b7b39 100644
--- a/etc/systemd/nspawn/work.nspawn
+++ b/etc/systemd/nspawn/work.nspawn
@@ -16,6 +16,10 @@ ReadOnly=no
 Volatile=no
 Bind=/sys/fs/cgroup/unified

+Bind=/dev/tty8
+Bind=/dev/fb0
+Bind=/dev/dri
+
 [Network]
 Private=yes
 VirtualEthernet=yes

Starting the container should now spawn a new X11 environment and immediately switch to it. However, keyboard and mouse have not been configured yet, so it would not be possible to interact with the system in this state. Moreover, switching back to the host system graphical interface (or any other virtual terminal) would be similarly difficult.

Enabling keyboard and mouse input

Although a keyboard is usually used for switching between virtual terminals, this can also be achieved on the command line using the chvt(1) tool. If keyboard input on the host machine is currently unavailable, chvt can be invoked over remote shell access such as OpenSSH. For example, to switch back to the host system graphical interface invoke the following command:

~ # chvt 7

In order to allow keyboard input inside the container X11 display session, both the nspawn configuration file and the unit drop-in configuration file must be modified. The drop-in file is changed as follows:

diff --git a/etc/systemd/system/systemd-nspawn@work.service.d/override.conf b/etc/systemd/system/systemd-nspawn@work.service.d/override.conf
index 52a01e4..4f97e5a 100644
--- a/etc/systemd/system/systemd-nspawn@work.service.d/override.conf
+++ b/etc/systemd/system/systemd-nspawn@work.service.d/override.conf
@@ -4,3 +4,4 @@ DevicePolicy=auto
 DeviceAllow=/dev/tty8 rwm
 DeviceAllow=/dev/dri/card0 rwm
 DeviceAllow=char-drm rwm
+DeviceAllow=char-input rwm

Intuitively, we must also expose the /dev/input device inside the container. However, the system is usually made aware of various input devices once they are scanned by udev(7). Therefore we must make the following modifications to the nspawn configuration:

diff --git a/etc/systemd/nspawn/work.nspawn b/etc/systemd/nspawn/work.nspawn
index bec6a3c..5395d5d 100644
--- a/etc/systemd/nspawn/work.nspawn
+++ b/etc/systemd/nspawn/work.nspawn
@@ -20,6 +20,9 @@ Bind=/dev/tty8
 Bind=/dev/fb0
 Bind=/dev/dri

+BindReadOnly=/run/udev/data
+Bind=/dev/input
+
 [Network]
 Private=yes
 VirtualEthernet=yes

Now both keyboard and mouse input should be enabled in the container graphical interface.

Enabling audio in the container using PulseAudio

Giving the container direct access to audio peripherals would be quite challenging since they are in use by the host system. However, if PulseAudio is used on both the host and the container system, it is fairly trivial to connect the container PulseAudio stack to the PulseAudio server on the host. This approach is not only easier from a configuration standpoint, it also allows transparent passthrough of e.g. Bluetooth microphone input to container applications such as web browsers.
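
One way to wire this up — a sketch only, assuming the host PulseAudio daemon runs under UID 1000 and exposes its socket at the usual runtime path; the target path inside the container is arbitrary — is to bind the host PulseAudio socket directory into the container and point the container PulseAudio clients at it:

```
# Host: addition to /etc/systemd/nspawn/work.nspawn, under [Files].
# Read-write, since connecting to a UNIX socket requires write access.
Bind=/run/user/1000/pulse:/run/host-pulse

# Container: /etc/pulse/client.conf
default-server = unix:/run/host-pulse/native
```

Depending on the host configuration, the PulseAudio authentication cookie (usually ~/.config/pulse/cookie) may also need to be shared with the container, or authentication relaxed on the host side.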

Network setup

Networking should be configured automatically by systemd-networkd, both in the host network namespace and in the container network namespace.
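
With Private=yes and VirtualEthernet=yes, nspawn creates a virtual Ethernet pair: ve-work on the host and host0 inside the container. When systemd-networkd is enabled on both sides, its shipped 80-container-ve.network and 80-container-host0.network policies handle addressing via DHCP and IP masquerading. The host side is roughly equivalent to the following (a simplified sketch of the shipped default):

```
# Matches the host half of the veth pair (e.g. ve-work)
[Match]
Name=ve-*
Driver=veth

[Network]
# Pick a free address range, hand out a lease to the container
# via the built-in DHCP server, and NAT its outgoing traffic
Address=0.0.0.0/28
DHCPServer=yes
IPMasquerade=yes
```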

Conclusion

Running multiple X11 servers as shown is indeed quite easy. I was taken by surprise just how simple it is to pass along keyboard and mouse input onto the X11 display of the container. Switching between the environments seems to be quite convenient!

This setup has not proven problematic so far, although some bugs are quite interesting:

  • If the keyboard and mouse are disconnected and reconnected while in the container display environment, no further input is possible until chvt is used.

Further experiments should explore Wayland and more security-oriented setups.

2020-04-13 update issue

An update broke this configuration recently:

systemd-nspawn[156347]: Failed to stat /sys/fs/cgroup/unified: No such file or directory
systemd[1]: systemd-nspawn@work.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: systemd-nspawn@work.service: Failed with result 'exit-code'.
systemd[1]: Failed to start Container work.
systemd[1]: Starting Container work...
systemd[1]: Started Container work.

As mentioned previously, this can be resolved by dropping the unified portion of the cgroup bind mount:

diff --git a/etc/systemd/nspawn/work.nspawn b/etc/systemd/nspawn/work.nspawn
index 6ad7865..0420878 100644
--- a/etc/systemd/nspawn/work.nspawn
+++ b/etc/systemd/nspawn/work.nspawn
@@ -14,7 +14,7 @@ Hostname=strthinpad
 [Files]
 ReadOnly=no
 Volatile=no
-Bind=/sys/fs/cgroup/unified
+Bind=/sys/fs/cgroup

 Bind=/dev/tty8
 Bind=/dev/fb0