Sandboxing Systemd Services

Systemd has numerous powerful security features that tend to go underutilised in Linux systems. Service files in particular offer many sandboxing features that can be used to contain and isolate a service. These tend to be little known (at least I hadn’t heard of them until quite recently), so I thought it’d be a good idea to share some knowledge about them in case someone else isn’t aware of them either.

Options for Hardening the Services

So, if you’ve used systemd before, you know that it launches services using service files. These files define things like the command to start the service, when to start the service, and restart conditions. However, as usual with systemd, this simple feature can do a lot more than just the expected basic things.

From the security point-of-view, the provided sandboxing and isolation mechanisms are handy and numerous. They help to reduce the blast radius if a vulnerable (or just poorly developed) process attempts to wreak havoc in the system. Here are some of the security-related features to give you an idea of what is possible:

User

Systemd makes it easy to define the user the service runs as (User=), and the same is possible for the group (Group=). There is also an option to create temporary users (DynamicUser=). As the user is generated dynamically, this enables quite a few other security options because many of the traditional user-based assumptions are not applicable anymore.
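As a minimal sketch of these options, a [Service] section could look roughly like the following. The unit name, the binary path, and the "example" account are hypothetical placeholders, not something taken from a real package.

```ini
# example.service (hypothetical unit; names and paths are placeholders)
[Service]
ExecStart=/usr/bin/example-daemon

# Option 1: run as a fixed, pre-created system account
User=example
Group=example

# Option 2: let systemd allocate a transient user for the lifetime of the
# service instead (User=/Group= above are then not needed)
#DynamicUser=yes
# With DynamicUser=, persistent data should live in a systemd-managed
# directory, in this case /var/lib/example
#StateDirectory=example
```

According to the systemd.exec documentation, DynamicUser=yes also implies a number of other sandboxing settings (for example ProtectSystem=strict and PrivateTmp=yes), so it is worth checking that the service still works with those before relying on it.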

Capabilities

Systemd allows limiting the capabilities for the services that need root but are not smart enough to drop unused capabilities by themselves (CapabilityBoundingSet=). There are also ambient capabilities that can be used to achieve the opposite: grant certain capabilities to services that run under a non-root user (AmbientCapabilities=). NoNewPrivileges= can be used to ensure that the service and its children cannot get new privileges, for example by executing sudo.
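To illustrate the non-root case, here is a sketch of a service that runs as an unprivileged user but is still allowed to bind to privileged ports. The service name, binary, and user are made up for the example.

```ini
# Hypothetical non-root web service that only needs to bind to port 80/443
[Service]
ExecStart=/usr/bin/example-httpd
User=example

# Grant exactly one capability to the unprivileged process...
AmbientCapabilities=CAP_NET_BIND_SERVICE
# ...and make sure nothing outside this set can ever be acquired
CapabilityBoundingSet=CAP_NET_BIND_SERVICE

# Block privilege gains through setuid/setgid binaries or file capabilities
NoNewPrivileges=yes
```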

Networking

There are numerous networking isolation mechanisms in systemd. The IP addresses the service can connect to can be limited (IPAddressAllow=/IPAddressDeny=), address families (IPv4, IPv6, Unix, etc.) can be limited (RestrictAddressFamilies=), and the service can be completely isolated from the networks (PrivateNetwork=). In addition, the bound ports (SocketBindAllow=/SocketBindDeny=) and network interfaces (RestrictNetworkInterfaces=) can be restricted. So yeah, a lot, which makes sense because network access can be challenging from a security point of view.
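A rough sketch of how these could be combined for a service that only talks to localhost and one private subnet is shown below. The addresses, port, and interface names are purely illustrative assumptions.

```ini
# Hypothetical service; the addresses, port and interfaces are illustrative
[Service]
# Only allow IP traffic to/from localhost and one private subnet
IPAddressDeny=any
IPAddressAllow=localhost 10.0.0.0/8

# Only allow IPv4, IPv6 and Unix domain sockets to be created
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX

# Only allow binding to TCP port 8080
SocketBindAllow=tcp:8080
SocketBindDeny=any

# Only allow the loopback interface and eth0 to be used
RestrictNetworkInterfaces=lo eth0

# For a service that needs no network at all, this alone would do:
#PrivateNetwork=yes
```

The IP address, socket bind, and interface restrictions rely on cgroup and BPF support, and the latter two exist only in relatively recent systemd versions, so they may not be available on every system.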

File System Sandboxing

Systemd also has quite a few mechanisms for securing files in the system. The access to the file systems and paths can be flexibly set to read-only, read-write, no-exec, or even inaccessible (ReadWritePaths=, etc.), automatic directory creation to common locations can be enabled (RuntimeDirectory=, etc.), temporary filesystems can be created (TemporaryFileSystem=), and /tmp can be made private (PrivateTmp=). At the core of the file system sandboxing is ProtectSystem=. This allows setting the core directories of the system as read-only: depending on the value, it covers /usr and the boot loader directories (yes), additionally /etc (full), or the entire file system hierarchy apart from /dev, /proc, and /sys (strict).
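As a sketch, a fairly strict file system setup could look something like this. The "example" directory names and the chosen paths are placeholders for whatever the service actually needs.

```ini
# Hypothetical service; the "example" directory names are placeholders
[Service]
# Mount the whole file system hierarchy read-only for this service
ProtectSystem=strict
# Make /home, /root and /run/user inaccessible
ProtectHome=yes

# Re-open only the path the service must write to
# (a leading "-" ignores the path if it does not exist)
ReadWritePaths=-/var/log/example
# Hide a path from the service completely
InaccessiblePaths=-/srv

# Create /run/example and /var/lib/example for the service
RuntimeDirectory=example
StateDirectory=example

# Give the service its own private /tmp and /var/tmp
PrivateTmp=yes
```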

And More

If that is not enough, you can also make the inter-process communication private (PrivateIPC=), filter syscalls (SystemCallFilter=), set umask (UMask=), protect kernel tunables and modules (ProtectKernelTunables=/ProtectKernelModules=), prevent creation of memory regions that are both writable and executable (MemoryDenyWriteExecute=), and control access to the devices (DeviceAllow=). And yes, there are still even more security options.
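Put together, these could be sketched along the following lines. The device node is a made-up example, and the @system-service group is just one possible starting point for a syscall allow-list.

```ini
# Hypothetical service; the device node is illustrative
[Service]
PrivateIPC=yes

# Start from systemd's predefined allow-list for typical services
SystemCallFilter=@system-service
# Return EPERM for blocked calls instead of killing the process
SystemCallErrorNumber=EPERM

# New files are readable and writable by the owner only
UMask=0077

ProtectKernelTunables=yes
ProtectKernelModules=yes
MemoryDenyWriteExecute=yes

# Deny device access except for standard pseudo devices and listed nodes
DevicePolicy=closed
DeviceAllow=/dev/ttyS0 rw
```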

It’s worth noting that not all of the options are necessarily available on all of the systems, as they may require kernel features that are not enabled. Some options are also architecture-specific, so read the documentation carefully. Many options take negations (for example, you can define capabilities that a service does not have), and complement each other (like limiting both capabilities and syscalls), meaning that the configuration can get quite complex.
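To make the negation and the complementing concrete, here is a small sketch. The chosen capabilities are arbitrary examples, not a recommendation for any particular service.

```ini
[Service]
# Allow-list: the service may hold only these capabilities
#CapabilityBoundingSet=CAP_NET_BIND_SERVICE CAP_DAC_READ_SEARCH

# Deny-list: a leading "~" inverts the meaning, so the service keeps
# everything except the listed capabilities
CapabilityBoundingSet=~CAP_SYS_ADMIN CAP_SYS_MODULE

# Complementary syscall filter covering kernel modules from another angle
SystemCallFilter=~@module
```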

Analysing the System

Let’s do something useful with this information. The first step of fixing anything is figuring out what’s wrong. Systemd provides a tool for this, named systemd-analyze. You can use the following command to analyse the security of the service files in your system:

ShellScript
systemd-analyze security

This typically outputs a lot of sad emojis (or old-school smileys if you don’t have support for emojis). This does not actually mean that the services are dangerous or vulnerable; it just means that they’re not utilising the security features systemd provides. Here’s an example of what the output looks like on Sulka, the Yocto distro that I am currently building:

Analysis output
UNIT                                 EXPOSURE PREDICATE HAPPY
auditd.service                            9.4 UNSAFE    :-{
crond.service                             9.6 UNSAFE    :-{
dbus.service                              9.6 UNSAFE    :-{
emergency.service                         9.5 UNSAFE    :-{
getty@tty1.service                        9.6 UNSAFE    :-{
rc-local.service                          9.6 UNSAFE    :-{
rescue.service                            9.5 UNSAFE    :-{
serial-getty@ttyS0.service                9.6 UNSAFE    :-{
serial-getty@ttyS1.service                9.6 UNSAFE    :-{
syslog-ng@default.service                 9.6 UNSAFE    :-{
systemd-ask-password-console.service      9.4 UNSAFE    :-{
systemd-ask-password-wall.service         9.4 UNSAFE    :-{
systemd-initctl.service                   9.4 UNSAFE    :-{
systemd-journald.service                  4.3 OK        :-)
systemd-logind.service                    2.8 OK        :-)
systemd-networkd.service                  2.6 OK        :-)
systemd-resolved.service                  2.2 OK        :-)
systemd-timesyncd.service                 2.1 OK        :-)
systemd-udevd.service                     7.0 MEDIUM    :-|
systemd-userdbd.service                   2.3 OK        :-)
user@1200.service                         9.4 UNSAFE    :-{

Well, that’s a bit embarrassing for a distro that claims to be secure. In my defence, for example, on a fairly fresh installation of Linux Mint, 42 services of 62 are considered unsafe, so this unfortunately seems to be a bit of a norm. To get more information on what exactly is wrong, we can add the service name to the end of the command, like so:

Detailed analysis
serviceuser@qemux86-64:~$ systemd-analyze security auditd
  NAME                                                        DESCRIPTION                                                             EXPOSURE
  PrivateTmp=                                                 Service runs in special boot phase, option is not appropriate
  ProtectHome=                                                Service runs in special boot phase, option is not appropriate
  ProtectSystem=                                              Service runs in special boot phase, option is not appropriate
  RootDirectory=/RootImage=                                   Service runs in special boot phase, option is not appropriate
  SupplementaryGroups=                                        Service runs as root, option does not matter
  RemoveIPC=                                                  Service runs as root, option does not apply
- User=/DynamicUser=                                          Service runs as root user                                                    0.4
+ RestrictRealtime=                                           Service realtime scheduling access is restricted
- CapabilityBoundingSet=~CAP_SYS_TIME                         Service processes may change the system clock                                0.2
- NoNewPrivileges=                                            Service processes may acquire new privileges                                 0.2
+ AmbientCapabilities=                                        Service process does not receive ambient capabilities
- PrivateDevices=                                             Service potentially has access to hardware devices                           0.2
- ProtectClock=                                               Service may write to the hardware clock or system clock                      0.2
- CapabilityBoundingSet=~CAP_SYS_PACCT                        Service may use acct()                                                       0.1
- CapabilityBoundingSet=~CAP_KILL                             Service may send UNIX signals to arbitrary processes                         0.1
- ProtectKernelLogs=                                          Service may read from or write to the kernel log ring buffer                 0.2
- CapabilityBoundingSet=~CAP_WAKE_ALARM                       Service may program timers that wake up the system                           0.1
- CapabilityBoundingSet=~CAP_(DAC_*|FOWNER|IPC_OWNER)         Service may override UNIX file/IPC permission checks                         0.2
- ProtectControlGroups=                                       Service may modify the control group file system                             0.2
- CapabilityBoundingSet=~CAP_LINUX_IMMUTABLE                  Service may mark files immutable                                             0.1
- CapabilityBoundingSet=~CAP_IPC_LOCK                         Service may lock memory into RAM                                             0.1
- ProtectKernelModules=                                       Service may load or read kernel modules                                      0.2
- CapabilityBoundingSet=~CAP_SYS_MODULE                       Service may load kernel modules                                              0.2
- CapabilityBoundingSet=~CAP_BPF                              Service may load BPF programs                                                0.1
- CapabilityBoundingSet=~CAP_SYS_TTY_CONFIG                   Service may issue vhangup()                                                  0.1
- CapabilityBoundingSet=~CAP_SYS_BOOT                         Service may issue reboot()                                                   0.1
- CapabilityBoundingSet=~CAP_SYS_CHROOT                       Service may issue chroot()                                                   0.1
- PrivateMounts=                                              Service may install system mounts                                            0.2
- SystemCallArchitectures=                                    Service may execute system calls with all ABIs                               0.2
- CapabilityBoundingSet=~CAP_BLOCK_SUSPEND                    Service may establish wake locks                                             0.1
- RestrictNamespaces=~user                                    Service may create user namespaces                                           0.3
- RestrictNamespaces=~pid                                     Service may create process namespaces                                        0.1
- RestrictNamespaces=~net                                     Service may create network namespaces                                        0.1
- RestrictNamespaces=~uts                                     Service may create hostname namespaces                                       0.1
- RestrictNamespaces=~mnt                                     Service may create file system namespaces                                    0.1
- CapabilityBoundingSet=~CAP_LEASE                            Service may create file leases                                               0.1
- CapabilityBoundingSet=~CAP_MKNOD                            Service may create device nodes                                              0.1
- RestrictNamespaces=~cgroup                                  Service may create cgroup namespaces                                         0.1
- RestrictSUIDSGID=                                           Service may create SUID/SGID files                                           0.2
- RestrictNamespaces=~ipc                                     Service may create IPC namespaces                                            0.1
- ProtectHostname=                                            Service may change system host/domainname                                    0.1
- CapabilityBoundingSet=~CAP_(CHOWN|FSETID|SETFCAP)           Service may change file ownership/access mode/capabilities unrestricted      0.2
- CapabilityBoundingSet=~CAP_SET(UID|GID|PCAP)                Service may change UID/GID identities/capabilities                           0.3
- ProtectKernelTunables=                                      Service may alter kernel tunables                                            0.2
- RestrictAddressFamilies=~AF_PACKET                          Service may allocate packet sockets                                          0.2
- RestrictAddressFamilies=~AF_NETLINK                         Service may allocate netlink sockets                                         0.1
- RestrictAddressFamilies=~AF_UNIX                            Service may allocate local sockets                                           0.1
- RestrictAddressFamilies=~…                                  Service may allocate exotic sockets                                          0.3
- RestrictAddressFamilies=~AF_(INET|INET6)                    Service may allocate Internet sockets                                        0.3
- CapabilityBoundingSet=~CAP_MAC_*                            Service may adjust SMACK MAC                                                 0.1
- CapabilityBoundingSet=~CAP_SYS_RAWIO                        Service has raw I/O access                                                   0.2
- CapabilityBoundingSet=~CAP_SYS_PTRACE                       Service has ptrace() debugging abilities                                     0.3
- CapabilityBoundingSet=~CAP_SYS_(NICE|RESOURCE)              Service has privileges to change resource use parameters                     0.1
- DeviceAllow=                                                Service has no device ACL                                                    0.2
- CapabilityBoundingSet=~CAP_NET_ADMIN                        Service has network configuration privileges                                 0.2
- ProtectProc=                                                Service has full access to process tree (/proc hidepid=)                     0.2
- ProcSubset=                                                 Service has full access to non-process /proc files (/proc subset=)           0.1
- CapabilityBoundingSet=~CAP_NET_(BIND_SERVICE|BROADCAST|RAW) Service has elevated networking privileges                                   0.1
- CapabilityBoundingSet=~CAP_AUDIT_*                          Service has audit subsystem access                                           0.1
- CapabilityBoundingSet=~CAP_SYS_ADMIN                        Service has administrator privileges                                         0.3
- PrivateNetwork=                                             Service has access to the host's network                                     0.5
- PrivateUsers=                                               Service has access to other users                                            0.2
- CapabilityBoundingSet=~CAP_SYSLOG                           Service has access to kernel logging                                         0.1
+ KeyringMode=                                                Service doesn't share key material with other services
+ Delegate=                                                   Service does not maintain its own delegated control group subtree
- SystemCallFilter=~@clock                                    Service does not filter system calls                                         0.2
- SystemCallFilter=~@cpu-emulation                            Service does not filter system calls                                         0.1
- SystemCallFilter=~@debug                                    Service does not filter system calls                                         0.2
- SystemCallFilter=~@module                                   Service does not filter system calls                                         0.2
- SystemCallFilter=~@mount                                    Service does not filter system calls                                         0.2
- SystemCallFilter=~@obsolete                                 Service does not filter system calls                                         0.1
- SystemCallFilter=~@privileged                               Service does not filter system calls                                         0.2
- SystemCallFilter=~@raw-io                                   Service does not filter system calls                                         0.2
- SystemCallFilter=~@reboot                                   Service does not filter system calls                                         0.2
- SystemCallFilter=~@resources                                Service does not filter system calls                                         0.2
- SystemCallFilter=~@swap                                     Service does not filter system calls                                         0.2
- IPAddressDeny=                                              Service does not define an IP address allow list                             0.2
+ NotifyAccess=                                               Service child processes cannot alter service state
+ MemoryDenyWriteExecute=                                     Service cannot create writable executable memory mappings
+ LockPersonality=                                            Service cannot change ABI personality
- UMask=                                                      Files created by service are world-readable by default                       0.1

-> Overall exposure level for auditd.service: 9.4 UNSAFE :-{

This provides a comprehensive list of things to consider for our service to make it more secure. The exposure score makes it easy to prioritise the items.

Putting It Into Practice

Let’s try to turn that frown upside down and harden the auditd service. The analysis gave some quite good ideas for the hardening, so let’s consider what could make sense here. In addition to relying on our own guesswork, we could also use some external resources, like this systemd service hardening guide from Linux Audit.

auditd needs to run with quite high privileges, so we need to have at least some root-level capabilities. In this situation, it makes more sense to run as the root user and limit the capabilities as required. The service is designed to run as root, the configuration files are owned by root, and they’re stored in privileged directories in /etc, so changing the assumptions drastically could be problematic if something changes upstream. If I were doing hardening for my own service that I fully control, I’d try the opposite approach first (run non-root & use AmbientCapabilities).

So, let’s begin by considering what capabilities we want to disable with CapabilityBoundingSet. Since I’m hardening a third-party service, this step mostly consists of grepping the sources and making educated guesses about whether the service needs some capability or not. It is also a good idea to use common sense and ask questions like “does an auditing service really need raw I/O capabilities”, and even if it does need them, would I want to allow such behaviour?

Next, setting up SystemCallFilter. This is a bit simpler. The Linux Audit systemd service hardening guide demonstrates a method using strace to list the used syscalls. However, I took a slightly simpler/dumber approach. Systemd provides some pre-defined syscall groups that can be used to set up the filter without having to go through all the individual syscalls. The groups and the syscalls they contain are documented at least here, and running systemd-analyze syscall-filter lists them on the system itself.

So, instead of using strace, I just checked the groups I wanted to disable, and then grepped the source code to see if it contained syscalls from the groups. If the syscalls aren’t used, it should be safe to filter out the group of syscalls. Note that the SystemCallFilter and CapabilityBoundingSet deal with similar things, like limiting reboots or kernel module interactions, but they operate differently, so it is worthwhile to disable similar features in both places.

RestrictAddressFamilies is quite simple: grep for AF_ in your code and check which address families are used. NoNewPrivileges should typically be enabled, as being able to elevate privileges during service execution is always a risk. RestrictSUIDSGID is also a good idea; rarely does a service need to set SUID or SGID bits. ProtectHostname, at least in the case of auditd, makes sense, as does ProtectClock. In addition, you want to consider restricting some of the namespaces with RestrictNamespaces.

After some hours of digging into source code, trying out different hardening options, and rebooting my device countless times to ensure that the service still worked as it should, I came up with the following list. It could still be expanded further, but it is a start. Note that this works for my auditd configuration, but there may be some auditing rules that do not work with these sandboxing options. For example, there seems to be a plug-in that performs a reboot in case it detects a violation, so that wouldn’t work anymore.

auditd additions
UMask=077

# These do not necessarily have to be defined one by one
CapabilityBoundingSet=~CAP_SYS_BOOT
CapabilityBoundingSet=~CAP_SYS_CHROOT
CapabilityBoundingSet=~CAP_SYS_TTY_CONFIG
CapabilityBoundingSet=~CAP_SYS_TIME
CapabilityBoundingSet=~CAP_SYS_PACCT
CapabilityBoundingSet=~CAP_WAKE_ALARM
CapabilityBoundingSet=~CAP_LINUX_IMMUTABLE
CapabilityBoundingSet=~CAP_SYS_MODULE
CapabilityBoundingSet=~CAP_SYS_PTRACE
CapabilityBoundingSet=~CAP_SYS_RAWIO
CapabilityBoundingSet=~CAP_SETUID CAP_SETGID CAP_SETPCAP
CapabilityBoundingSet=~CAP_MAC_ADMIN CAP_MAC_OVERRIDE
CapabilityBoundingSet=~CAP_MKNOD
CapabilityBoundingSet=~CAP_LEASE
CapabilityBoundingSet=~CAP_SYS_ADMIN

NoNewPrivileges=yes

SystemCallFilter=~@clock
SystemCallFilter=~@cpu-emulation
SystemCallFilter=~@debug
SystemCallFilter=~@module
SystemCallFilter=~@mount
SystemCallFilter=~@obsolete
SystemCallFilter=~@raw-io
SystemCallFilter=~@reboot
SystemCallFilter=~@swap

RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX AF_NETLINK
RestrictSUIDSGID=yes
RestrictNamespaces=~cgroup
RestrictNamespaces=~mnt
RestrictNamespaces=~uts

ProtectClock=yes
ProtectHostname=yes
ProtectControlGroups=yes

PrivateMounts=yes

With these additions to the [Service] section in the service file, the security analysis looks much happier:

Happy happy joy joy
auditd.service                            4.8 OK        :-)
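Instead of editing the packaged service file directly, the same settings can also live in a drop-in file, which systemd merges into the unit. The sketch below assumes the standard drop-in location under /etc/systemd/system; the file name hardening.conf is an arbitrary choice, and only a subset of the options is shown, in a more compact form.

```ini
# /etc/systemd/system/auditd.service.d/hardening.conf
# (the file name hardening.conf is an arbitrary choice)
[Service]
# Several values can share one line; the leading "~" applies to the whole list
CapabilityBoundingSet=~CAP_SYS_BOOT CAP_SYS_CHROOT CAP_SYS_TTY_CONFIG
SystemCallFilter=~@clock @cpu-emulation @debug
RestrictNamespaces=~cgroup mnt uts
NoNewPrivileges=yes
```

After adding or changing a drop-in, systemd needs to reload its configuration (systemctl daemon-reload) and the service has to be restarted before the analysis reflects the change.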

The slightly frustrating part about this process is that this has to be done for each service individually, and there aren’t that many (if any) rules that apply to all the services. The process is a combination of trial-and-error, and actually doing some research and using a brain.

The Yocto-Specific Part

Now, since I usually talk about Yocto in this blog, I’ll add a paragraph on how to achieve this with Yocto. The only thing specific to Yocto is realising that the systemd-analyze package has to be installed separately, so to enable the command in your distro, you’ll need to add the following:

Yocto
IMAGE_INSTALL:append = " systemd-analyze"

After that, it’s pretty much the same. To fix the service file in your build, you can either use a patch (if the service file is stored in a version control system) or override it with your own service file (if the original service file is stored in the meta-layer).

Conclusion

Thanks for reading this introduction to systemd service hardening & sandboxing. As mentioned, there unfortunately aren’t that many easy shortcuts that can be taken here, as service hardening requires a bit more careful planning and research (especially if you’re not hardening your own service). However, the hardening features provided by systemd are quite extensive and can be invaluable if a service ends up having exploitable vulnerabilities, especially for services that are public-facing or interact with users.
