Adding Key-Based SSH Authentication to Yocto

I recently needed to add SSH keys to a Yocto image, but realised that I had no idea how to do it. When I tried to figure it out, I realised that I didn’t even fully know what I actually wanted to achieve. This text is supposed to be a quick crash course on the different keys used in SSH servers, and how to generate and use them in Yocto.

Step zero of this SSH process is adding either ssh-server-dropbear or ssh-server-openssh to your IMAGE_FEATURES to select the SSH server you want to use. After the choice is made, we are good to go and ready to start logging in.

I feel like 90% of my blog posts either talk about how to get into a system, or how to prevent people from getting into systems.

Host Key

The host key is the key that identifies the SSH server. When you connect to an SSH server for the first time, you’ll get a warning that the identity of the server isn’t known. Once you accept the server identity, it gets added to the client’s known_hosts file. If the server generates a new host key every time it boots up, you’ll get a new warning that the identity doesn’t match the expected one. A server’s identity suddenly changing can also mean that something malicious is going on, so the connection is usually aborted.

The host key consists of public and private portions, and both are stored on the server. The public key is given to the client so that it can check it during subsequent connection attempts, and the private key is used to sign messages that the client can then verify with the public key.

Generating Keys

OpenSSH

First, we need to generate the keys. This step varies a bit, depending on the SSH server you’re using. To generate keys for OpenSSH, you can run the following:

ssh-keygen -t rsa -b 4096 -f ssh_host_rsa_key
ssh-keygen -t ed25519 -f ssh_host_ed25519_key

Having multiple key types offers flexibility: backwards compatibility with older clients and future-proofing for newer ones. RSA is quite old, but with longer keys it should still be an acceptable fallback for older clients without ed25519 support. ed25519 is a more modern algorithm that’s faster and hopefully harder to crack, but it isn’t supported by all SSH clients. ecdsa is another modern algorithm that can be used with OpenSSH. If you know that you won’t need the fallback for older clients, you can skip the RSA keys.
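
The warning a client shows on the first connection contains the fingerprint of the host key, so it can be handy to print the fingerprint right after generating the key. Below is a minimal sketch of that, assuming ssh-keygen is available on the build host. The temporary directory is only for illustration, and -N "" is used because a host key has to be stored without a passphrase:

```shell
#!/bin/sh
# Sketch: generate a throwaway ed25519 host key and print its fingerprint.
# Assumes ssh-keygen is installed; the temporary directory is illustrative.
KEY_DIR=$(mktemp -d)
KEY_FILE="$KEY_DIR/ssh_host_ed25519_key"

if command -v ssh-keygen >/dev/null 2>&1; then
    # -N "" sets an empty passphrase, -q silences the progress output
    ssh-keygen -q -t ed25519 -N "" -f "$KEY_FILE"
    # -l prints the fingerprint of the given public key file
    ssh-keygen -l -f "$KEY_FILE.pub"
else
    echo "ssh-keygen not found, skipping" >&2
fi
```

The printed value can be compared against the fingerprint the SSH client displays when it asks whether to trust the server.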

Dropbear

Creating keys for Dropbear is quite similar, but we need to use the dropbearkey command instead:

dropbearkey -t rsa -s 4096 -f dropbear_rsa_host_key
dropbearkey -t ed25519 -f dropbear_ed25519_host_key

It’s worth noting that the OpenSSH command creates two separate files, one for the public key and another for the private key, while Dropbear creates only one file that contains both. That’s because Dropbear uses its own key format.

Yocto Recipe

In Poky there already is a recipe that installs pre-generated host keys to the system: ssh-pregen-hostkeys. We can use that recipe as a basis, copy the keys to a location where the recipe can find them and write a recipe like this:

SUMMARY = "Pre-generated host keys"

SRC_URI = "file://dropbear_rsa_host_key \
           file://dropbear_ed25519_host_key \
           file://ssh_host_rsa_key \
           file://ssh_host_rsa_key.pub \
           file://ssh_host_ed25519_key \
           file://ssh_host_ed25519_key.pub"

LICENSE = "MIT"
LIC_FILES_CHKSUM = "file://${COMMON_LICENSE_DIR}/MIT;md5=0835ade698e0bcf8506ecda2f7b4f302"

INHIBIT_DEFAULT_DEPS = "1"

do_install () {
    install -d ${D}${sysconfdir}/dropbear
    install -m 0600 ${WORKDIR}/dropbear_*_host_key ${D}${sysconfdir}/dropbear/

    install -d ${D}${sysconfdir}/ssh
    install ${WORKDIR}/ssh_host_*_key* ${D}${sysconfdir}/ssh/
    chmod 0600 ${D}${sysconfdir}/ssh/*
    chmod 0644 ${D}${sysconfdir}/ssh/*.pub
}

It’s open source, I don’t have to explain anything.

Note that you should remove the parts that do not apply to your use-case because you most likely do not want to have both Dropbear and OpenSSH keys in the system.

What if I want to generate the keys during the build?

I’d recommend creating a pre-build script that runs the key-generation commands before starting the build. The SSH packages in Yocto don’t provide native versions, meaning that ssh-keygen and dropbearkey won’t get compiled for the build host. This in turn means that you cannot call these commands in recipes without host contamination. Instead of trying to call the commands in a recipe, a shell script like this could work:

ssh-keygen -t rsa -b 4096 -f ssh_host_rsa_key -N ""
ssh-keygen -t ed25519 -f ssh_host_ed25519_key -N ""
dropbearkey -t rsa -s 4096 -f dropbear_rsa_host_key
dropbearkey -t ed25519 -f dropbear_ed25519_host_key
cp ssh_host_*_key* <RECIPE_SRC_URI_DIR>
cp dropbear_*_host_key <RECIPE_SRC_URI_DIR>
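
If a script like this runs before every build, the keys should only be created once so that the host identity stays constant between builds. Here is a minimal sketch of such a guard for the OpenSSH ed25519 key. The ensure_key helper and the KEY_DIR variable are names made up for this example, and KEY_DIR would point to the directory the recipe reads its files from:

```shell
#!/bin/sh
# Sketch: create an OpenSSH host key only when it does not exist yet.
# KEY_DIR is a hypothetical staging directory (defaults to a temp dir here).
KEY_DIR="${KEY_DIR:-$(mktemp -d)}"

ensure_key() {
    key_type=$1
    key_file="$KEY_DIR/ssh_host_${key_type}_key"
    if [ -f "$key_file" ]; then
        # Reuse the existing key so the host identity does not change
        echo "keeping existing $key_file"
        return 0
    fi
    if ! command -v ssh-keygen >/dev/null 2>&1; then
        echo "ssh-keygen not found, cannot create $key_file" >&2
        return 1
    fi
    ssh-keygen -q -t "$key_type" -N "" -f "$key_file"
}

ensure_key ed25519 || true
```

The same pattern should work for the Dropbear keys by swapping the command and the file name.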

Authentication Key

The second type of key we’re usually interested in is the authentication key. This can be used for SSH login without having to enter a password. It’s also usually recommended over password login because passwords are quite a poor method of authentication. In key-based authentication, the public key gets stored on the server (the Yocto image in our case), and the client uses the private key it has to prove it should be allowed in.

Generating Keys

OpenSSH

I recommend using ssh-keygen, because the public keys generated with it are compatible with both Dropbear and OpenSSH. The command to create the keys is the same as before. I’ll stick to ed25519 for security:

ssh-keygen -t ed25519 -f ssh_auth_ed25519_key

Dropbear

Dropbear is a bit trickier, in the sense that it requires TWO commands instead of one. First, we create the key pair:

dropbearkey -t ed25519 -f dropbear_ed25519_auth_key

Then we’ll extract the public key from the key file:

dropbearkey -y -f dropbear_ed25519_auth_key > ./dropbear_ed25519_auth_key.pub

This may create a file with multiple lines that contain some unnecessary information. You should remove all the other lines except the one that begins with ssh-ed25519.
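
The clean-up can also be done with grep instead of editing the file by hand. This is a minimal sketch that runs on fabricated sample text rather than real dropbearkey output, so the exact surrounding lines are an assumption. Only the rule of keeping the line that begins with ssh-ed25519 comes from the step above:

```shell
#!/bin/sh
# Sketch: keep only the public key line from `dropbearkey -y` style output.
# The sample text below is illustrative, not real key material.
WORK_DIR=$(mktemp -d)
RAW="$WORK_DIR/raw_output.txt"
PUB="$WORK_DIR/dropbear_ed25519_auth_key.pub"

cat > "$RAW" <<'EOF'
Public key portion is:
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIExampleKeyDataOnly user@host
Fingerprint: SHA256:ExampleFingerprintOnly
EOF

# Keep only the line that starts with the key type.
# In real use the same filter can sit in a pipe:
#   dropbearkey -y -f <PRIVATE_KEY_FILE> | grep '^ssh-ed25519 ' > <PUBLIC_KEY_FILE>
grep '^ssh-ed25519 ' "$RAW" > "$PUB"

cat "$PUB"
```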

Yocto Recipe

On the Yocto side, we want to add a generated public key to the ~/.ssh/authorized_keys file for each of the users that should be allowed to log in. Adding the following snippet to an image recipe does that:

# Add example user
inherit extrausers

# Hashed password, unhashed value is "password"
PASSWD = "\$6\$vRcGS0O8nEeug1zJ\$YnRLFm/w1y/JtgGOQRTfm57c1.QVSZfbJEHzzLUAFmwcf6N72tDQ7xlsmhEF.3JdVL9iz75DVnmmtxVnNIFvp0"

EXTRA_USERS_PARAMS:append = "\
    useradd -u 1200 -d /home/serviceuser -p '${PASSWD}' -s /bin/sh serviceuser; \
"

# Location of the authentication keys
AUTH_KEYS_DIR ??= "${TOPDIR}/../auth-keys"

# Function to install the auth keys
configure_ssh_auth_key() {
    AUTH_KEYS=$(ls ${AUTH_KEYS_DIR}/*-auth-key.pub)
    for auth_key in ${AUTH_KEYS}; do
        user=$(basename "$auth_key" | cut -d'-' -f1)
        # -p keeps this from failing when a user has more than one key
        mkdir -p ${IMAGE_ROOTFS}/home/${user}/.ssh
        cat ${auth_key} >> ${IMAGE_ROOTFS}/home/${user}/.ssh/authorized_keys
    done
}

# :append adds no separator, so the leading ; is needed when the existing value doesn't end with one
ROOTFS_POSTPROCESS_COMMAND:append = ";configure_ssh_auth_key;"

This approach assumes that the keys have the following naming scheme: <USER_NAME>-<ALGORITHM>-auth-key.pub. It also assumes that the keys are located in a directory named auth-keys one level above the build directory, and that the user’s home directory has already been created when the key is copied. That’s a lot of assumptions, but assuming you follow the key file naming scheme things should work just fine. You can configure the key folder to a custom location using the AUTH_KEYS_DIR variable.
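
To make the naming scheme concrete, here is a small sketch of the same basename and cut combination the function uses, run on made-up file names. The user_from_key helper exists only for this illustration:

```shell
#!/bin/sh
# Sketch: derive the user name from <USER_NAME>-<ALGORITHM>-auth-key.pub.
# The paths are made-up examples; the files do not need to exist.
user_from_key() {
    # basename drops the directory, cut keeps everything before the first '-'
    basename "$1" | cut -d'-' -f1
}

user_from_key /tmp/auth-keys/serviceuser-ed25519-auth-key.pub
user_from_key /tmp/auth-keys/build-bot-ed25519-auth-key.pub
```

The first call prints serviceuser, while the second prints build instead of build-bot, so with this scheme user names containing a hyphen end up truncated.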

The example above adds a new user named serviceuser to the system for demonstration purposes. So, if the AUTH_KEYS_DIR points to a directory that contains serviceuser-ed25519-auth-key.pub, the key should get copied to the user’s authorized_keys and it should be possible to log in as them using SSH. If you want to read more about adding users to your Yocto images, you can check out my other blog text.

The first thought I had was using SRC_URI and installing the files from WORKDIR, but that doesn’t work because image.bbclass defines do_fetch as noexec. According to this StackOverflow answer, you can define those tasks as executable with a Python function, make bitbake fetch the keys from SRC_URI, and install them in a custom task, but it’s a bit risky to edit the core classes like that, so I’d advise against it.

Also, once again, if you want to generate the keys during the build, I recommend using a pre-build script to create and store the keys.

Connecting

The next logical step is of course connecting. I’m going to assume you have a service that starts the SSH server on the image. Such a service should exist by default if you install an SSH server using IMAGE_FEATURES. Note that the OpenSSH and Dropbear clients each expect the private key in their own format, so you can’t use an OpenSSH private key with the Dropbear client or vice versa, unless you convert the key first. More on that a bit later.

One thing to note when attempting an SSH connection: if you get asked for a password even when you shouldn’t, there’s a high chance that the user you’re trying to log in as has been locked. This happens for example if you don’t define a password when creating a user. Check the system log for relevant messages to debug the issue if you encounter it.
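
A locked account can also be spotted from the password database, where a password field that starts with ! marks the account as locked. Below is a minimal sketch of such a check. It runs against a fabricated shadow-style sample file, and the is_locked helper and the user names are only examples. On a real target the data would come from /etc/shadow:

```shell
#!/bin/sh
# Sketch: report whether a user's password field marks the account as locked.
# The shadow-style sample below is fabricated for illustration.
SAMPLE_SHADOW=$(mktemp)
cat > "$SAMPLE_SHADOW" <<'EOF'
root::19000:0:99999:7:::
serviceuser:$6$examplesalt$examplehash:19000:0:99999:7:::
lockeduser:!:19000:0:99999:7:::
EOF

is_locked() {
    # $1 = shadow-style file, $2 = user name
    # Succeeds when the user's password field starts with '!'
    awk -F: -v user="$2" '$1 == user && $2 ~ /^!/ { found = 1 } END { exit !found }' "$1"
}

for user in serviceuser lockeduser; do
    if is_locked "$SAMPLE_SHADOW" "$user"; then
        echo "$user: locked"
    else
        echo "$user: not locked"
    fi
done
```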

OpenSSH

The OpenSSH client is the classic, fan-favourite ssh. To connect to a remote server with your private key, simply run:

ssh -i <PRIVATE_KEY_FILE> <USERNAME>@<IP>

The magic is the -i flag. You may want to add the key to the ssh-agent so that you won’t have to use the flag every time:

# Start ssh-agent
eval "$(ssh-agent -s)"
# Add key
ssh-add <PRIVATE_KEY_FILE>

After that the ssh command “should just work”. I personally don’t like things that “should just work”, but it saves some typing if you’re constantly connecting to a server.

Dropbear

Honestly, I didn’t even know that Dropbear has its own SSH client before writing this text. It turns out it has, which makes sense considering that it has its own key format. The client is called dbclient, and as one can guess, it focuses on being lightweight. One thing that may not be obvious from its name is that it has nothing to do with databases though. To connect to a remote server with the Dropbear format key, use the following command:

dbclient -i <PRIVATE_KEY_FILE> <USERNAME>@<IP>

Converting Keys

If you end up in a situation where you have the wrong type of private keys for your SSH client, you can use dropbearconvert to convert a key from Dropbear format into OpenSSH format:

dropbearconvert dropbear openssh <SOURCE_KEY> <DESTINATION_KEY>

The conversion from OpenSSH to Dropbear isn’t much harder either:

dropbearconvert openssh dropbear <SOURCE_KEY> <DESTINATION_KEY>

It’s worth noting that dropbearconvert converts only private keys. That’s because OpenSSH and Dropbear both use the same kind of public keys, so those need no conversion.

That’s all for this topic. These instructions should help you build an image that has a constant host key, and that you can connect to easily and password-free. Both are quite basic yet important features for a firmware image. If you have questions or comments, let me know in the comments.

Fuzzing Yocto Kernel Modules with Syzkaller

This is a sequel to a similarly named blog post Black-Box Fuzzing Kernel Modules in Yocto. In that blog post, I briefly went through what fuzzing is and presented an easy but fairly naive approach to fuzzing. This time I will present a more refined approach to fuzzing using Syzkaller.

Syzkaller

Syzkaller is an unsupervised coverage-guided kernel fuzzer from Google. Simply put, it creates programs that perform different kernel syscalls in varying order with evolving parameters, and then analyzes the coverage to see which test programs reach new code. These programs are added to the corpus, which can be used to create even more programs with even larger coverage. Sounds simple? Well, the architecture looks like this:

Could be worse. There are two core components in the system, syz-manager and syz-executor. syz-manager runs on the “main fuzzing server”, and syz-executor runs on the fuzzing target. This can either be an actual hardware device or a QEMU emulator system that syz-manager controls. syz-manager is responsible for the fuzzing work, generating the test programs, and storing the corpus & crashes. The executor receives the test programs, runs them and reports back the results. This communication happens over a network interface.

Smoke Test

A good way to smoke test the setup is to run Syzkaller with the default built-in syscall definitions. This ensures that your image is suitable for fuzzing without having to worry that your additions are breaking things. I’m using Yocto to build an image for x86_64 QEMU. You can use pretty much anything to build the image as long as you have the kernel source and object files, a disk image for the rootfs, and of course a kernel. In addition to x86_64 there are plenty of other supported architectures. The setup guide lists arm, arm64, and riscv64 for example.

Building the Image

First, the disk image requirements. The image should allow SSH root login with either key authentication or a passwordless login. This naturally means that there needs to be networking support. A DHCP client for getting an IP address is optional but strongly recommended, because it is needed for the QEMU port forwarding that syz-manager utilizes. Yocto’s default core-image-base image for QEMU does not require any special changes, as it has network capabilities, passwordless root, and a DHCP client.

The kernel should be built with instrumentation and a few other debugging options enabled. The exact options depend on the target architecture. The x86_64 setup guide lists “at the very least” the following as a requirement:

CONFIG_KCOV=y
CONFIG_DEBUG_INFO_DWARF4=y
CONFIG_KASAN=y
CONFIG_KASAN_INLINE=y
CONFIG_CONFIGFS_FS=y
CONFIG_SECURITYFS=y

The full documentation of the kernel configuration can be found here, and it lists plenty of additional configuration flags. I added both the minimum and the maximum configuration to my meta-fuzzing layer. I’ll be using the minimum configuration.
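
Before starting a long fuzzing session, it can be worth confirming that the options really ended up in the built kernel’s .config. This is a minimal sketch of such a check for the minimum option list above. It runs on a fabricated sample file, check_config is a name invented for this example, and the location of the real .config (typically the kernel build directory) is an assumption you should verify for your setup:

```shell
#!/bin/sh
# Sketch: verify the minimum Syzkaller options are enabled in a kernel .config.
REQUIRED="CONFIG_KCOV CONFIG_DEBUG_INFO_DWARF4 CONFIG_KASAN CONFIG_KASAN_INLINE CONFIG_CONFIGFS_FS CONFIG_SECURITYFS"

check_config() {
    config_file=$1
    missing=""
    for opt in $REQUIRED; do
        # An enabled built-in option appears as an exact OPTION=y line
        grep -q "^${opt}=y\$" "$config_file" || missing="$missing $opt"
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "all required options enabled"
}

# Fabricated sample .config with one option left disabled
DEMO_CONFIG=$(mktemp)
cat > "$DEMO_CONFIG" <<'EOF'
CONFIG_KCOV=y
CONFIG_DEBUG_INFO_DWARF4=y
CONFIG_KASAN=y
# CONFIG_KASAN_INLINE is not set
CONFIG_CONFIGFS_FS=y
CONFIG_SECURITYFS=y
EOF

check_config "$DEMO_CONFIG" || true
```

For the sample file this reports CONFIG_KASAN_INLINE as missing. Pointing check_config at the real .config gives the same kind of report for the actual build.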

Writing Syzkaller configuration

Next, we need to write the configuration for Syzkaller. Let’s pick the example QEMU-configuration, and modify it a bit. This configuration defines the web interface address, location of required files, the syscalls to call, and the fuzzing target. We are going to use emulated QEMU targets in this test:

{
	"target": "linux/amd64",
	"http": "<YOUR-IP-HERE>:56741",
	"workdir": "./workdir",
	"kernel_obj": "/poky/build/tmp/work/qemux86_64-poky-linux/linux-yocto/6.6.50+git/linux-qemux86_64-standard-build/",
	"kernel_src": "/poky/build/tmp/work-shared/qemux86-64/kernel-source",
	"image": "/poky/build/tmp/deploy/images/qemux86-64/core-image-base-qemux86-64.rootfs.ext4",
	"syzkaller": ".",
	"disable_syscalls": ["keyctl", "add_key", "request_key"],
	"procs": 4,
	"type": "qemu",
	"vm": {
		"count": 2,
		"cpu": 2,
		"mem": 2048,
		"kernel": "/poky/build/tmp/deploy/images/qemux86-64/bzImage",
		"cmdline": "ip=dhcp"
	}
}

The different configuration items are quite well documented in the manager configuration and QEMU VM configuration files. The number of virtual machines and processes here is quite low because I have a poor PC for running these tests, so you may want to increase them.

Quite literally what happens when I run Syzkaller with 4 QEMU instances.

This configuration assumes you have the Poky folder directly under the root folder, so adjust the paths if necessary. The configuration also assumes that the process is launched in the root of the Syzkaller repo. The ip=dhcp cmdline option is also worth noting. This will be passed to the kernel, and it should ensure that the Ethernet interface uses DHCP to get an IP address from QEMU. I think you can also hardcode the IP address if you know what it should be. tcpdump can be used to check the incoming ARP requests to see what the IP address is expected to be. There may be an easier way, but that’s what I did when poking around.
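
Since the configuration is full of absolute paths, a quick existence check can catch typos before syz-manager does. Below is a minimal sketch that pulls every value starting with / out of the configuration and reports the ones that don’t exist. The check_paths helper and the demo configuration are made up for this example, and the sed pattern assumes one key per line like in the configuration above:

```shell
#!/bin/sh
# Sketch: list absolute paths in a Syzkaller config that do not exist on disk.
check_paths() {
    # Extract values of the form "key": "/absolute/path" (one per line)
    sed -n 's|^[[:space:]]*"[a-z_]*":[[:space:]]*"\(/[^"]*\)".*|\1|p' "$1" |
    while read -r path; do
        [ -e "$path" ] || echo "missing: $path"
    done
}

# Demo on a fabricated config: bzImage exists, kernel-source does not
DEMO_DIR=$(mktemp -d)
touch "$DEMO_DIR/bzImage"
cat > "$DEMO_DIR/config.json" <<EOF
{
	"target": "linux/amd64",
	"workdir": "./workdir",
	"kernel_src": "$DEMO_DIR/kernel-source",
	"vm": {
		"kernel": "$DEMO_DIR/bzImage"
	}
}
EOF

check_paths "$DEMO_DIR/config.json"
```

Relative values such as ./workdir are skipped on purpose, because they depend on the directory syz-manager is launched from.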

There’s one weird hack I had to do though. In my experience (and I may be wrong here), it seems that Syzkaller expects a certain format from the kernel source tree, and that expected format is not the actual structure of the kernel source. There seems to be an expectation that under the path defined in kernel_src there is a usr/src/kernel folder that points to the source, otherwise the coverage information generation will fail. However, if I move the kernel source to a usr/src/kernel folder, the report generation will fail because the scripts folder cannot be found anymore. To create a suitable directory structure, I used the following script:

cd <KERNEL_SOURCE>
mkdir -p usr/src/
cd usr/src/
ln -s ../../. kernel

Again, I’m not sure if this is necessary. It may be that I misconfigured something, but I had to do this to get the syz-manager running, reporting coverage, and formatting error reports.
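
To see what the link trick produces, here is a small sketch that builds the same structure in a temporary stand-in tree and checks that the scripts folder is reachable through usr/src/kernel. The temporary directory and the empty scripts folder are placeholders for the real kernel source:

```shell
#!/bin/sh
# Sketch: build the usr/src/kernel link in a stand-in tree and check it resolves.
# KERNEL_SOURCE is a temporary directory here; use the real kernel_src path in practice.
KERNEL_SOURCE=$(mktemp -d)
mkdir -p "$KERNEL_SOURCE/scripts"   # stand-in for the real scripts folder

(
    cd "$KERNEL_SOURCE" || exit 1
    mkdir -p usr/src
    cd usr/src || exit 1
    # Skip the link creation if it is already there
    [ -e kernel ] || ln -s ../../. kernel
)

if [ -d "$KERNEL_SOURCE/usr/src/kernel/scripts" ]; then
    echo "usr/src/kernel points back to the source root"
else
    echo "link is broken" >&2
fi
```

Because the link points back to the source root, the source stays in its original place while also being reachable under usr/src/kernel.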

Running Syzkaller

Once the configuration is ready, we can start to wonder what to do with it. The first thing is fetching the Syzkaller source from GitHub. Then we can compile Syzkaller using syz-env, a Docker-based script that ensures the build environment is the expected one:

./tools/syz-env make

This of course requires Docker. You can also use make directly, but then you need to take care of the build dependencies yourself. If you’re doing cross-arch testing, you need to define the target variables, so check out the setup guide for those. Once the build completes, copy the configuration created in the previous chapter to the root of the Syzkaller repo, and run:

./bin/syz-manager -config <CONFIG_FILE>

syz-manager should start up, and after a moment the QEMU machines should boot up as well. If you navigate your web browser to http://<YOUR-IP-HERE>:56741, you should see the web interface that shows the status of fuzzing, collected coverage and crashes:

If not, you can try adding the -debug option to the syz-manager command to see what’s going wrong. Also, for some reason my web interface doesn’t load if there are no crashes. There’s just an error message about a missing crashes folder. So, I had to manually create the ./workdir/crashes folder, and then it seemed to work just fine.

Once you get to the web interface, you can leave the fuzzer running for a while to see how it works and explores new coverage paths while increasing corpus. It’s quite unlikely it’ll find a crash, as it’s fuzzing against a mainline kernel, but if it does, you have potentially found a kernel bug!

Adding Custom Syscalls

However, it’s not very interesting to fuzz the mainline kernel. It’s been done (and is constantly being done). What’s more interesting is fuzzing our custom kernel module, and seeing if Syzkaller can find a poorly hidden error in it.

I asked ChatGPT to write a small driver that has an IOCTL interface, and then I added a bug to it. The single command in the interface takes a string as an input. It tokenizes the string, and if the string contains five commas, an invalid free will be performed. There are some extra checks to guide the fuzzer towards the crash.

In-tree vs. Out-of-tree Module Build

While writing this text, I had to consider in-tree vs. out-of-tree kernel module building. The difference is that an in-tree module is built as a part of the kernel build, and an out-of-tree module is built against the kernel headers after the kernel has been built. The out-of-tree method allows compiling modules for pre-built kernels, assuming the headers are available. For example, if you’ve ever built a Hello World module for Ubuntu, you’ve most likely run apt-get install linux-headers-`uname -r`. This means you’ve pulled the development headers for building the out-of-tree module, and you’re not compiling an entire kernel.

Since Yocto builds the kernel, it is quite easy to build the modules in-tree. This has a few advantages. For example, the module can be easily built as a built-in feature. From the fuzzing point of view, obtaining the coverage for the in-tree modules is a lot easier. Getting the correct line numbers for crash reports is also a lot simpler because there’s no need to decode offsets with objdump. So I’d recommend building the modules in-tree for fuzzing.

The example driver has both a module recipe for the out-of-tree build (because that’s what I tried first) and a linux-yocto append for the in-tree build. It’s a somewhat silly way of supporting both methods, but at least it works (until it doesn’t). To add the IOCTL example module to the in-tree kernel build, add this line somewhere in your configuration:

IOCTL_STRING_PARSE_INTREE = "1"

Describing New Syscalls in Syzkaller

This is where the magic happens: defining our own syscalls for Syzkaller so that it knows about the non-default syscalls and can create fuzzing sequences utilizing them. To achieve that, we need to write the syscall definitions in Syzkaller’s syntax, which is kind of simple, but still a bit frustrating to get right. Fortunately, there are plenty of examples. After staring at those, and reading this blog post, here’s what I came up with:

include <linux/fcntl.h>
include <linux/ioctl_string_parse.h>

resource fd_vuln_ioctl[fd]
openat$ioctl_string_parse(fd const[AT_FDCWD], file ptr[in, string["/dev/ioctl_example"]], flags const[0x2], mode const[0x0]) fd_vuln_ioctl
ioctl$IOCTL_STRING_PARSE_CMD(fd fd_vuln_ioctl, cmd const[IOCTL_CMD_PARSE_STRING], arg ptr[in, string])

A file with this content should be added to <PATH_TO_SYZKALLER>/sys/linux. The file name can be arbitrarily chosen, but it should have the .txt suffix.

The includes here are from the kernel source tree. fcntl.h is for the AT_FDCWD macro, and our own ioctl_string_parse.h is for the IOCTL_CMD_PARSE_STRING that contains the IOCTL command to use when making IOCTL calls to our driver. resource defines the file descriptor resource that is shared between the two other calls. openat opens the ioctl_example device in read/write mode with no special mode flags. This call should be quite static as the goal isn’t to fuzz the openat command, so most values are constants.

The final definition is the IOCTL command on the opened device, using the IOCTL_CMD_PARSE_STRING defined in the header as the command, and a random string as an argument for the IOCTL call.

Once the definitions are done, it’s time to compile them into Syzkaller. We first need to compile a tool called syz-extract, then extract the constants used by our .txt file into a .const file, update the generated code, and re-compile the binary. The four commands below do exactly that (for 64-bit Linux systems):

./tools/syz-env make bin/syz-extract
./bin/syz-extract -os linux -arch amd64 -sourcedir /poky/linked-linux-src/usr/src/kernel -builddir /poky/build/tmp/work/qemux86_64-poky-linux/linux-yocto/6.6.50+git/linux-qemux86_64-standard-build <CUSTOM_DEFINITIONS>.txt
./tools/syz-env make generate
./tools/syz-env make

Note how you need to pass the build and source directories for the extraction into the .const file that will be used in the code generation. Again, update paths and file names if necessary.

After that, we should be almost good to go. The configuration file still needs some tweaking. We do want to focus only on our custom syscalls, and we do not want to mutate the openat command. The following configuration should work:

{
	"target": "linux/amd64",
	"http": "<YOUR-IP-HERE>:56741",
	"workdir": "./workdir",
	"kernel_obj": "/poky/build/tmp/work/qemux86_64-poky-linux/linux-yocto/6.6.50+git/linux-qemux86_64-standard-build/",
	"kernel_src": "/poky/build/tmp/work-shared/qemux86-64/kernel-source",
	"image": "/poky/build/tmp/deploy/images/qemux86-64/core-image-base-qemux86-64.rootfs.ext4",
	"enable_syscalls": ["openat$ioctl_string_parse", "ioctl$IOCTL_STRING_PARSE_CMD"],
	"no_mutate_syscalls": ["openat$ioctl_string_parse"],
	"syzkaller": ".",
	"procs": 8,
	"type": "qemu",
	"vm": {
		"count": 2,
		"cpu": 2,
		"mem": 2048,
		"kernel": "/poky/build/tmp/deploy/images/qemux86-64/bzImage",
		"cmdline": "ip=dhcp"
	}
}

After that, the fuzzer can be started with the same command as before:

./bin/syz-manager -config <CONFIG_FILE>

Now, after waiting a few minutes, there should be some crashes visible:

Sometimes the report may get corrupted. For such crashes C repro code cannot be generated, but the logs may still yield some useful information.

Interestingly enough, if we are using the maximum debug kernel configuration these get reported as “potential deadlocks”. I guess enabling 40+ debug flags has some side effects. (As a side note, after enabling all the configuration items the basic runqemu machine boot time slows down from 10 seconds to 5 minutes). Regardless of which configuration we’re using, if we check out the report we can see the expected root cause for the crash:

It’s quite fascinating to watch the coverage information and see how the fuzzer approaches the problematic line when it attempts to increase the coverage. Fascinating in the same sense it’s exciting to watch paint dry:

The number on the left shows how many items in the corpus reach the line. It can be seen that over time new items can get further in the while-loop. In the perfect example two programs would reach line 208 and not just one, but it’s difficult to get perfection with randomness.

Extra Bonus Issue

Can you see what the issue is with this code?

input_buffer = kmalloc(input_len, GFP_KERNEL);
switch (cmd) {
    case IOCTL_CMD_PARSE_STRING:
        ret = copy_from_user(input_buffer, (char *)arg, input_len);
        if (ret != 0) {
            printk(KERN_ALERT "Failed to copy string from user space\n");
            kfree(input_buffer);
            return -EFAULT;
        }

        // Some processing happens here

        kfree(input_buffer);
        break;

    default:
        printk(KERN_ALERT "Invalid IOCTL command\n");
        kfree(input_buffer);
        return -EINVAL;
}

I didn’t, and neither did ChatGPT when it suggested this for the first version of the driver. However, when Syzkaller calls this kind of IOCTL function multiple times in rapid succession, it results in some unexpected invalid-frees. I guess constantly performing allocations and frees in an IOCTL command isn’t the best idea. The second implementation of the driver uses a memory pool to avoid having to allocate memory after initialization, and that seems to work. But yeah, another point for fuzzing for finding a bug in seemingly functional code.

But Wait, There’s More!

In addition to this, I also created a test program with ChatGPT for debugging purposes. Later I realized that this could also be used for black-box fuzzing the example kernel module. So, if you want to, you can run the following script to fuzz the module with Radamsa (assuming you have installed Radamsa, check the black-box fuzzing blog text for more info):

while true; do 
    test-program-ioctl "$(echo ,,, | radamsa)";
done

Mandatory Final Chapter

That should cover everything for now. With these instructions, you can hopefully fuzz your kernel module with Syzkaller. However, this only performs fuzzing on virtual QEMU targets. Sometimes it’d be better to fuzz on the actual hardware, especially if using specialized hardware. I’ll cover that in a follow-up text. If you want to get notified when that text goes out, consider joining my mailing list. Thanks for reading, and happy bug-hunting.

Raspberry Pi 4, LetsTrust TPM and Yocto

As briefly mentioned in the measured boot blog post, I had some issues with a TPM in the emulated environment. In the end, I bought a physical TPM chip for Raspberry Pi and verified that the measured boot worked as intended on the actual hardware, confirming that the issues I encountered were likely due to my somewhat esoteric virtual setup. Getting this LetsTrust TPM module working was fairly simple but there were a few things I learned along the way that may be worth sharing.

LetsTrust TPM Module

First of all, let’s clarify what this LetsTrust TPM module is. It’s a TPM module that connects to the Raspberry Pi’s GPIO pins, and communicates with the Raspberry Pi via SPI (Serial Peripheral Interface). The chip in the module is Infineon’s SLB9672. There are a few other TPM modules for Raspberry Pi, but this LetsTrust module was cheap and readily available, so I decided to get it. I still don’t quite understand the full capabilities of TPM devices, but it seems to work quite well for the little use I have, so I think it’s worth the money. This was my comprehensive hands-on review of it.

There’s a surprisingly high number of resources for integrating the LetsTrust TPM module into the Yocto build: two. First of all, there’s this guide which goes deep into details and is fairly thorough. Secondly, there’s this meta-slb9670-rpi layer that does most of the things required for integrating the module into the Yocto build. SLB9670 in the name of the meta-layer is the earlier version of the Infineon chip that was used in the older versions of the LetsTrust module.

The guide is a bit outdated and not all of the steps in it seem to be mandatory anymore, but here’s an outline of what needs to be done to get the LetsTrust working with Raspberry Pi:

  • Create a device tree overlay that enables the SPI bus and defines the TPM module
  • Configure TPM to be enabled in the kernel and U-Boot
  • Enable SPI TPM drivers in the kernel
  • Add TPM2 software stack to the image
  • (Outdated) Patch U-Boot to communicate with the module via GPIOs using bit-banging
  • (Optional) Add a service to enable TPM in Linux

The meta-slb9670-rpi layer does these things, and a bit more to define a FIT image and the boot script.

Updating meta-slb9670-rpi

However, the layer has not been updated since the Dunfell version of Yocto. The upgrade from Dunfell to Scarthgap is fortunately fairly straightforward as it’s mostly syntax fixes. The meta-layer had a few other issues though. Booting the FIT image didn’t work because the boot script loaded the image too close to the kernel load address, so the image got overwritten when the kernel was loaded. I made the executive decision to load the FIT image to the initrd load address because there’s plenty of space there and the boot is not using initrd anyway. A few other things in the meta-layer required clean-up as well, like removing some unused files and a service that loaded the tpm_tis_spi module even though it was actually being built as a kernel built-in feature.

You can find the updated branch from my fork of meta-slb9670-rpi. Adding that layer with the dependencies should be enough to do the trick. The trick, in this case, is adding the LetsTrust module to Yocto builds. Simple as that. So, what’s the point of this blog post? First, to showcase the updated meta-layer, and second, to introduce the measured boot functionality I added. That was a slightly more complicated affair. But not too much, no need to be scared. However, before proceeding I recommend taking a look at the meta-layer and its contents, because it will be referenced later on.

Adding Measured Boot

First, I recommend reading my blog post about enabling measured boot in Yocto, because we will follow the steps listed there.

Second, I want to say that I’m not a big fan of SystemD, and after writing this text I have once again a bit more repressed anger towards it. Because, for some reason, reading the measured boot event log just does not work on SystemD. tpm2_eventlog command either says that the log cannot be read, or straight up segfaults. Reading PCR registers works just fine though. I still haven’t found the exact root cause for this, but I hope I’ll find it soon because I need some of the SystemD features for the upcoming texts. It may be related to the device management. If I figure out how to fix this I’ll add a link here, but for the time being these instructions are only for SysVinit-style systems. My SEO optimizer complains that this chapter is too long, but in my opinion, SystemD rants can never be too long.

I try really hard not to get on the SystemD hate train, but SystemD itself makes it so difficult at times.

Back to the actual business. The first step of adding the measured boot is enabling the feature in the U-Boot configuration. However, we want to enable measured boot without the devicetree measurement, so the additional configuration looks like this:

CONFIG_MEASURED_BOOT=y
# CONFIG_MEASURE_DEVICETREE is not set

It seems that the PCRs are not constant if the devicetree is measured. I’m not 100% sure why, but my theory is that the proprietary Raspberry Pi bootloader that runs before U-Boot edits the devicetree on-the-fly, and therefore the devicetree measured by U-Boot is never constant. I suppose the RasPi bootloader does this to support the different RAM variants without requiring a separate devicetree for each. The main memory size is defined as zero in the devicetree source and there is a “Will be filled by the bootloader” comment next to the zero value, so at least something happens there.

Next, the memory region for the measurement log needs to be defined. In the measured boot blog post two methods for this were presented: either defining a reserved-memory section or defining sml-base and sml-size. I mentioned in the blog post that I couldn’t get sml-base working with QEMU, but with Raspberry Pi it was the opposite. Defining a reserved-memory block didn’t work, but the linux,sml-base and linux,sml-size did. I have a feeling that this is related to the fact that the memory node in devicetree is dynamically defined by the bootloader, but I cannot prove it.

So, to define the event log location we just add the two required parameters under the TPM devicetree node in the letstrust-tpm-overlay like this:

slb9670: slb9670@0 {
	compatible = "infineon,slb9670", "tis,tpm2-spi", "tcg,tpm_tis-spi";
	reg = <0>;
	gpio-reset = <&gpio 24 1>;
	#address-cells = <1>;
	#size-cells = <0>;
	status = "okay";

	/* for kernel driver */
	spi-max-frequency = <1000000>;
	linux,sml-base = <0x00 0xf6ffa000>;
	linux,sml-size = <0x6000>;
};

Note how I fixed the bad bad thing I did the last time. The event log should now be aligned to the end of the free RAM (non-reserved areas checked from /proc/iomem). This is only the case for my 4GB RAM RPI4 though, with the 2GB and 8GB versions the memory region is either in a wrong or non-existent place. I’m starting to understand the reasoning behind the dynamic memory definition done by the bootloader.
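
The alignment can be sanity-checked with a bit of arithmetic. The sketch below assumes that the free RAM region ends at 0xf6ffffff, which is what the overlay values above imply rather than something read from a real board, and works backwards from that end address. The helper names are made up for illustration.

```python
# Sketch: place the event log at the very end of a free RAM region.
# FREE_RAM_END is an assumption derived from the overlay values;
# on a real board, read the region limits from /proc/iomem.
FREE_RAM_END = 0xF6FFFFFF  # last byte of the free region (inclusive)
LOG_SIZE = 0x6000          # linux,sml-size from the overlay


def sml_base(region_end: int, size: int) -> int:
    """Return the start of a block of `size` bytes that ends at region_end."""
    return region_end + 1 - size


def as_cells(addr: int) -> str:
    """Format a 64-bit address as two 32-bit devicetree cells."""
    return f"<0x{addr >> 32:02x} 0x{addr & 0xFFFFFFFF:08x}>"


base = sml_base(FREE_RAM_END, LOG_SIZE)
print(f"linux,sml-base = {as_cells(base)};")
print(f"linux,sml-size = <0x{LOG_SIZE:x}>;")
```

With a different RAM size the end address changes, so the base has to be recalculated, which is why the values above only fit the 4GB board.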

Et voilà! Just with these two changes tpm2_eventlog and tpm2_pcrread can now be used to read the boot measurements after booting the image. I created scarthgap-measured-boot-raspberrypi4-4gb branch (strong contender for the longest branch name I’ve ever created) for this feature because for now the measured boot works properly only on the 4GB variant using SysVinit, so it shouldn’t be merged into the main scarthgap branch. Maybe in the future there will be rainbows, sunshine and working SystemD-based systems.

Yocto Hardening: Measured Boot

You can find the other Yocto hardening posts from here!

Oh yes, it’s time for more of the security stuff. We are getting into the difficult things now. So far we have mostly been focusing on hardening the kernel and userspace separately, but this time we will zoom out a bit and take a look at securing the entire system. First, we are going to start hardening the boot process to prevent unwanted bootflows and loading undesired binaries.

I know that there is a philosophical and moral question of whether doing this is “right”, potentially locking the devices from the people using them. I’m not going to argue too much in either direction. I’d like everything to be open and easily hackable (in the good sense of the word), but because the real world is the way it is, keeping the embedded devices open doesn’t always make sense. Mostly because of the hacking (in the bad sense of the word). Anyway, I hope you use the power you will learn for good.

What Is Measured Boot

Simply put, the measured boot is a boot feature that hashes different boot components and then stores the hashes in immutable hash chains. The measured boot can perform hashing during different stages of the boot. The hashed items can for example be the kernel binary, devicetree, boot arguments, disk partitions, etc. These calculated hashes usually then get written to the platform configuration registers (PCR) in TPM.

These registers can only be extended, meaning that the existing value in the register and the new value get hashed together, and this combined hash then gets stored in the register. This creates a chain of hashes. This can then be called a “blockchain”, and it can be used to raise unlimited venture capital funding (or at least it was possible before AI became the hype train locomotive). The hash chain can also be used to detect unwanted changes in the chain because if one of the hashes in the chain changes, all the subsequent hashes after that will be changed as well.
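
The extend operation is easy to model in a few lines of Python. This is only a sketch of the idea, assuming a SHA-256 bank and a register that starts out as all zeros; the component contents are made-up placeholders, and a real TPM does this in hardware.

```python
import hashlib


def measure(component: bytes) -> bytes:
    """Hash a boot component (kernel image, bootargs, ...)."""
    return hashlib.sha256(component).digest()


def extend(pcr: bytes, digest: bytes) -> bytes:
    """Extend: new PCR value = SHA-256(old PCR value || digest)."""
    return hashlib.sha256(pcr + digest).digest()


def boot(components: list[bytes]) -> bytes:
    pcr = bytes(32)  # assumption: the register starts as all zeros
    for component in components:
        pcr = extend(pcr, measure(component))
    return pcr


good = boot([b"kernel image", b"console=ttyS0 root=/dev/vda2"])
again = boot([b"kernel image", b"console=ttyS0 root=/dev/vda2"])
tampered = boot([b"kernel image", b"console=ttyS0 root=/dev/vda2 init=/bin/sh"])

print(good == again)     # True: same inputs, same final value
print(good == tampered)  # False: one changed input changes the result
```

Because each new value depends on the previous one, there is no way to "undo" a measurement or to reach a chosen value without knowing the inputs that lead to it.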

To make actual use of these registers and their contents, attestation should be performed. In attestation, it is decided whether the system is in an acceptable state or not for performing some actions. For example, there could be a check that the PCRs contain certain expected values. Then, if the system is considered to be cool, for example filesystems may be decrypted, services could be started, or remote connections may be made.
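
A very small local attestation check could look something like the sketch below. The expected values are hypothetical placeholders (in practice they would be recorded from a known-good boot), and the "unlock" and "refuse" results just stand in for whatever the system would actually do.

```python
import hmac

# Hypothetical "golden" values, recorded once from a known-good boot.
EXPECTED = {
    0: bytes.fromhex("11" * 32),
    1: bytes.fromhex("22" * 32),
}


def attest(current: dict[int, bytes], expected: dict[int, bytes]) -> bool:
    """Return True only if every expected PCR is present and matches."""
    return all(
        index in current and hmac.compare_digest(current[index], value)
        for index, value in expected.items()
    )


def unlock_if_trusted(current: dict[int, bytes]) -> str:
    # Placeholder action: a real system might unseal a disk key here.
    return "unlock" if attest(current, EXPECTED) else "refuse"


print(unlock_if_trusted(dict(EXPECTED)))
print(unlock_if_trusted({0: EXPECTED[0], 1: bytes(32)}))
```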

I asked ChatGPT to create a meme about measured boot, and I’m not sure if this is genius or not.

It’s worth noting that measured boot doesn’t prevent loading or running unwanted binaries or configurations, it just makes a note if such a thing happened. Attestation on the other hand may prevent some unexpected things from happening if it is configured to do so. To prevent loading naughty stuff into your system, you may want to read about the secure boot. There’s even a summary of the differences coming sooner than you think!

Measured Boot vs. Secure Boot

Secure boot is a term that’s often confused with the measured boot, or the trusted boot, or the verified boot, so it is worth clarifying how these differ. This document sums it up nicely, but I’ll briefly summarize the differences in the next paragraphs. The original link contains some more pros and cons explained in an actually professional manner, so I recommend checking it out if you have the time.

In secure boot (also known as verified boot) each boot component checks the signature of the next boot item (e.g. U-Boot checks the Linux kernel, etc.), and if the signature doesn’t match the keys stored in the device, the boot fails. If it matches, the component doing the verification transfers control to the next component in the boot chain. This fairly rigid system gives more control over the boot process, but signature verification and key storage aren’t trivial problems to solve. Also, updates to this kind of system are difficult.

Measured boot (also known as trusted boot) only measures the boot items and stores their hashes to TPM’s PCRs. It is then the responsibility of the attestation process to decide if the event log is acceptable or not for proceeding. This is more flexible and allows more options than a simple “boot or no boot”, but it is quite complex, and in theory, may allow booting some bad configurations if attestation isn’t sufficient. Performing the attestation itself isn’t that easy either. While local attestation is simpler to set up, it’s susceptible to local attacks, and with remote attestation, you have a server to set up and need a secure way of transferring the hashes to the attestation server.

So, despite having quite similar names, secure boot and measured boot are quite different things. Therefore a single system can have both in place. It’s actually a good idea, assuming the performance and complexity hits are acceptable. The performance hit is usually tolerable, as the actions need to be performed once per boot (as opposed to some encrypted filesystems where every filesystem operation takes a hit). Complexity on the other hand, well… In my experience, things surely won’t become easier after implementing these systems. All in all, everything requires more work and makes life miserable (but hopefully for the bad actors as well).

Considering the nature of the human meme culture, I’m not sure if ChatGPT was actually that bad.

Adding Measured Boot to Yocto

Now that we know what we’re trying to achieve, we can start working towards that goal. As you can guess, the exact steps vary a lot depending on your hardware and software. Therefore, it’s difficult to give the exact instructions on how to enable measured boot on your device. But, to give some useful advice, I’m going to utilize the virtual QEMU machine I’ve been working on in a few earlier blog texts.

Yocto Emulation: Setting Up QEMU with U-Boot
Yocto Emulation: Setting Up QEMU with TPM

I’ve enabled measured boot also on Raspberry Pi 4 & LetsTrust TPM module combination using almost the same steps as outlined here, so the instructions should work on actual hardware as well. I’ll write a text about this a bit later…

Edit: The text for enabling the measured boot on Raspberry Pi 4 is now available, check it out here.

Configuring U-Boot

You want to start measuring the boot as early as possible to have a long hash chain. On an actual board, this could be something like the boot ROM (if the boot ROM supports that) or SPL/FSBL. In our emulated example, the first piece doing the measurement is the U-Boot bootloader. This is fairly late because we can only measure the kernel boot parameters, but we can’t change the boot ROM and don’t have SPL so it’s the best we can do.

Since we’re using U-Boot, according to the documentation enabling the boot measurement requires CONFIG_MEASURED_BOOT to be added into the U-Boot build configuration. This requires hashing and TPM2 support as well. You’ll most likely also want CONFIG_MEASURE_DEVICETREE to hash the device tree. It should be enabled by default, at least it is in U-Boot 2024.01 which I’m using, but you can add it just in case. The configuration fragment looks like this:

# Dependencies
CONFIG_HASH=y
CONFIG_TPM_V2=y
# The actual stuff
CONFIG_MEASURED_BOOT=y
CONFIG_MEASURE_DEVICETREE=y

Measured boot should be enabled by default in qemu_arm_defconfig used by our virtual machine, so no action is required to enable the measured boot for that device. If you’re using some other device you may need to add the configs. On the other hand, if you’re using something other than U-Boot as the bootloader, you have to consult the documentation of that bootloader. Or, in the worst case, write the boot measurement code yourself. U-Boot measures the OS image, the initial ramdisk image (if present), and the bootargs variable. And the device tree, if the configuration option is enabled.

Editing the devicetree

Next, if you checked out the link to U-Boot documentation, it mentions that we also have to make some changes to our device tree. We need to define where the measurement event log is located in the memory. There are two ways of doing this: either by defining a memory-region of tcg_event_log type for the TPM node, or by adding linux,sml-base and linux,sml-size parameters to the TPM node. We’re going to go with the first option because the second option didn’t work with QEMU for some reason (with the Raspberry Pi 4 it was the other way around: only the linux,sml-base method worked. Go figure).

For this, we first need to decompile our QEMU devicetree binary that has been dumped in the Yocto emulation blog texts (check those out if you haven’t already). The decompilation can be done with the following command:

dtc -I dtb -O dts -o qemu.dts qemu.dtb

Then, you can add memory-region = <&event_log>; to the TPM node in the source so that it looks like the following:

tpm_tis@0 {
    reg = <0x00 0x5000>;
    compatible = "tcg,tpm-tis-mmio";
    memory-region = <&event_log>;
};

After that, add the event log memory region to the root of the device tree. My node looks like this:

reserved-memory {
	#address-cells = <0x01>;
	#size-cells = <0x01>;
	ranges;
	event_log: tcg_event_log {
		#address-cells = <0x01>;
		#size-cells = <0x01>;
		no-map;
		reg = <0x45000000 0x6000>;
	};
};

Commit showing an example of this can be found from here. I had some trouble finding the correct location and addresses for the reserved-memory. In the end, I added reserved-memory node to the root of the device tree. The address is defined to be inside the device memory range, and that range is (usually) defined in the memory node at the root of the devicetree. The size of the event log comes from one of the U-Boot devicetree examples if I remember right.

Note that my reserved memory region is a bit poorly aligned to be in the middle of the memory, causing some fragmentation. You can move it to some other address, just make sure that the address is not inside kernel code or kernel data sections. You can check these address ranges from a live system by reading /proc/iomem. For example, in my emulator device they look like this:

root@qemuarm-uboot:~# cat /proc/iomem
09000000-09000fff : pl011@9000000
09000000-09000fff : 9000000.pl011 pl011@9000000
09010000-09010fff : pl031@9010000
09010000-09010fff : rtc-pl031
09030000-09030fff : pl061@9030000
0a003c00-0a003dff : a003c00.virtio_mmio virtio_mmio@a003c00
0a003e00-0a003fff : a003e00.virtio_mmio virtio_mmio@a003e00
0c000000-0c004fff : c000000.tpm_tis tpm_tis@0
10000000-3efeffff : pcie@10000000
10000000-10003fff : 0000:00:01.0
10000000-10003fff : virtio-pci-modern
10004000-10007fff : 0000:00:02.0
10004000-10007fff : xhci-hcd
10008000-1000bfff : 0000:00:03.0
10008000-1000bfff : virtio-pci-modern
1000c000-1000cfff : 0000:00:01.0
1000d000-1000dfff : 0000:00:03.0
3f000000-3fffffff : PCI ECAM
40000000-4fffffff : System RAM
40008000-40ffffff : Kernel code
41200000-413c108f : Kernel data
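
Checking a candidate address by eye gets tedious, so here is a rough Python sketch that does the same comparison. It works on a pasted excerpt of the listing above; the function names and the second test address are made up for illustration.

```python
# Excerpt of the /proc/iomem listing, pasted in as text.
IOMEM = """\
40000000-4fffffff : System RAM
40008000-40ffffff : Kernel code
41200000-413c108f : Kernel data
"""


def parse_iomem(text: str) -> list[tuple[int, int, str]]:
    """Parse /proc/iomem-style lines into (start, end, name) tuples."""
    regions = []
    for line in text.splitlines():
        span, _, name = line.strip().partition(" : ")
        if not name:
            continue
        start, _, end = span.partition("-")
        regions.append((int(start, 16), int(end, 16), name))
    return regions


def check_reserved(start: int, size: int, regions) -> list[str]:
    """Return a list of problems with a proposed reserved block."""
    end = start + size - 1
    problems = []
    if not any(s <= start and end <= e for s, e, n in regions if n == "System RAM"):
        problems.append("not inside System RAM")
    for s, e, name in regions:
        if name.startswith("Kernel") and start <= e and s <= end:
            problems.append(f"overlaps {name}")
    return problems


regions = parse_iomem(IOMEM)
print(check_reserved(0x45000000, 0x6000, regions))  # the block used in this post
print(check_reserved(0x40800000, 0x6000, regions))  # lands inside kernel code
```

An empty list means the block sits inside System RAM and stays clear of the kernel sections.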

After adding the reserved block of memory, you can check the reserved memory blocks in U-Boot with the bdinfo command:

=> bdinfo
boot_params = 0x00000000
DRAM bank   = 0x00000000
-> start    = 0x40000000
-> size     = 0x10000000
flashstart  = 0x00000000
flashsize   = 0x04000000
flashoffset = 0x000d7074
baudrate    = 115200 bps
relocaddr   = 0x4f722000
reloc off   = 0x4f722000
Build       = 32-bit
current eth = virtio-net#31
ethaddr     = 52:54:00:12:34:02
IP addr     = <NULL>
fdt_blob    = 0x4e6d9160
new_fdt     = 0x4e6d9160
fdt_size    = 0x00008d40
lmb_dump_all:
 memory.cnt = 0x1 / max = 0x10
 memory[0]      [0x40000000-0x4fffffff], 0x10000000 bytes flags: 0
 reserved.cnt = 0x2 / max = 0x10
 reserved[0]    [0x45000000-0x45005fff], 0x00006000 bytes flags: 4
 reserved[1]    [0x4d6d4000-0x4fffffff], 0x0292c000 bytes flags: 0
devicetree  = board
arch_number = 0x00000000
TLB addr    = 0x4fff0000
irq_sp      = 0x4e6d9150
sp start    = 0x4e6d9140
Early malloc usage: 2c0 / 2000

Once you’re done with the device tree, you can compile the source back into binary with the following command (this will print warnings, I guess the QEMU-generated device tree isn’t 100% perfect and my additions most likely didn’t help):

dtc -I dts -O dtb -o qemu.dtb qemu.dts

Booting the Device

That should be the hard part done. Since we have edited the devicetree and the modifications need to be present already in U-Boot, QEMU can’t use the on-the-fly generated devicetree. Instead, we need to pass the self-compiled devicetree with the dtb option. The whole runqemu command looks like this:

BIOS=/<path>/<to>/u-boot.bin \
runqemu \
core-image-base nographic wic.qcow2 \
qemuparams="-chardev \
socket,id=chrtpm,path=/tmp/mytpm1/swtpm-sock \
-tpmdev emulator,id=tpm0,chardev=chrtpm \
-device tpm-tis-device,tpmdev=tpm0 \
-dtb /<path>/<to>/qemu.dtb"

Note that you need to source the Yocto build environment to have access to runqemu command. Also, remember to set up the swtpm TPM as instructed in the Yocto Emulation texts before booting up the system. You can use the same boot script that was used in the QEMU emulation texts.

Now, when the QEMU device boots, U-Boot will perform the measurements, store them into TPM PCRs, and the kernel is aware of this fabled measurement log. To read the event log in the Linux-land, you want to make sure that the securityfs is mounted. If not, you can mount it manually with:

mount -t securityfs securityfs /sys/kernel/security
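
Whether securityfs is already mounted can be checked from /proc/mounts. The sketch below parses a pasted sample so that it runs anywhere; the sample lines are illustrative, and on the target you would feed it the contents of /proc/mounts instead.

```python
# Illustrative sample; on a real system read the text from /proc/mounts.
SAMPLE_MOUNTS = """\
proc /proc proc rw,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
securityfs /sys/kernel/security securityfs rw,relatime 0 0
"""


def securityfs_mounted(mounts_text: str) -> bool:
    """Check /proc/mounts-style text for a securityfs mount."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, ...
        if len(fields) >= 3 and fields[2] == "securityfs":
            return True
    return False


print("securityfs mounted:", securityfs_mounted(SAMPLE_MOUNTS))
```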

If you face issues, make sure CONFIG_SECURITYFS is present in the kernel configuration. Once that is done, you should be able to read the event log with the following command:

tpm2_eventlog /sys/kernel/security/tpm0/binary_bios_measurements

This outputs the event log and the contents of the PCRs. You can also use tpm2_pcrread command to directly read the current values in the PCR registers. If you turn off the emulator and re-launch it, the hashes should stay the same. And if you make a small change to for example the U-Boot bootargs variable and boot the device, register 1 should have a different value.
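
To see which registers changed between two boots, the two tpm2_pcrread outputs can be compared with a small script. This sketch assumes the output consists of a bank header line (such as sha256:) followed by "index : 0xVALUE" lines; that layout is an assumption about the tool's output, and the values below are shortened placeholders, not real digests.

```python
# Placeholder captures from two boots (values shortened for readability).
BOOT_A = """\
sha256:
  0 : 0xAAAA
  1 : 0xBBBB
"""
BOOT_B = """\
sha256:
  0 : 0xAAAA
  1 : 0xCCCC
"""


def parse_pcrs(text: str) -> dict[tuple[str, int], str]:
    """Parse 'bank:' headers and 'index : 0xVALUE' lines into a dict."""
    pcrs = {}
    bank = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            bank = line[:-1]
        elif ":" in line and bank is not None:
            index, _, value = line.partition(":")
            pcrs[(bank, int(index))] = value.strip().lower()
    return pcrs


def changed(before: str, after: str) -> list[tuple[str, int]]:
    """List the (bank, index) pairs whose values differ between captures."""
    a, b = parse_pcrs(before), parse_pcrs(after)
    return sorted(key for key in a.keys() | b.keys() if a.get(key) != b.get(key))


print(changed(BOOT_A, BOOT_B))  # [('sha256', 1)]
```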

The Limitations

Then, the bad news. Rebooting does not quite work as expected. If you reboot the device (as opposed to shutting QEMU down and re-starting it), the PCR values output by tpm2_pcrread change on subsequent boots even though they should always be the same. The binary_bios_measurements on the other hand stays the same after reboot even if the bootargs changes, indicating that it doesn’t get properly updated either.

From what I’ve understood, this happens because PCRs are supposed to be volatile, but the emulated TPM doesn’t really “reset” the “volatile” memory during reboot because the emulator doesn’t get powered off. With the actual hardware Raspberry Pi 4 TPM module this isn’t an issue, and tpm2_pcrread results are consistent between reboots and binary_bios_measurements gets updated on every boot as expected. It took me almost 6 months of banging my head on this virtual wall to figure out that this was most likely an emulation issue. Oh well.
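
The effect is easy to reproduce with a toy model. The sketch below is not a TPM interface, just a single simulated SHA-256 register, and it only illustrates the explanation above: if the register isn't reset between boots, the same measurements land on top of the old value and the result differs.

```python
import hashlib


class ToyPcr:
    """Toy model of a single SHA-256 PCR; not a real TPM interface."""

    def __init__(self) -> None:
        self.value = bytes(32)

    def reset(self) -> None:
        self.value = bytes(32)

    def extend(self, data: bytes) -> None:
        digest = hashlib.sha256(data).digest()
        self.value = hashlib.sha256(self.value + digest).digest()


def boot(pcr: ToyPcr, power_cycle: bool) -> bytes:
    if power_cycle:
        pcr.reset()  # what a real power-on does to the register
    pcr.extend(b"kernel")
    pcr.extend(b"bootargs")
    return pcr.value


pcr = ToyPcr()
cold_1 = boot(pcr, power_cycle=True)
cold_2 = boot(pcr, power_cycle=True)
warm = boot(pcr, power_cycle=False)  # "reboot" without a reset

print(cold_1 == cold_2)  # True
print(cold_2 == warm)    # False: measurements piled on top of old values
```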

Closing Words

Now we have (mostly) enabled measured boot on our example machine. Magnificent! There isn’t any attestation, though, so the measurement isn’t all that useful yet. The measurements could also be extended to the Linux side with IMA. These things will be addressed in future editions of Yocto hardening, so stay tuned!

While waiting for that, you can read the other Yocto hardening posts here!

Making USB Device With STM32 + TinyUSB

Have you ever wondered how USB devices are made? I sure have. It’s interesting how you can plug in devices using the same type of connector, and the devices work on (almost) any machine and you can get wildly different functionality from them. There are USB sound cards, network adapters, mass memory storages, oscilloscopes, table fans… The list goes on. The only limit is your imagination and five volts.

Introducing coffee cup warmer 3.0. It mines crypto (for me), and uses the generated heat to warm a coffee cup (for you). It’s win-win. And look, it even has a light so that you can drink coffee (and mine crypto) in the dark.

However, creating such a device seems like a big undertaking. There’s the device firmware that needs to be written, and the host side driver for the device, and the USB protocol itself is notoriously difficult to understand. However, with a good USB stack, code examples and some luck the task becomes a lot more manageable. In this blog post, I’m going to port a simple example from the TinyUSB stack to the STM32F446RE Nucleo board. I assume you have at least some basic understanding of USB and that you have completed at least the blinky project on some Nucleo board.

Mandatory Overview of USB Stuff

The first thing to always learn is the basics of the theory. However, if you’ve already tried understanding the USB protocol, you may already know that it’s not as trivial as “plug-n-play” under the hood. The spec is long, confusing and scary. I’m not going to say I understand it, and I barely even understand what’s coming up in the next chapters. But configuring these things is more or less required for getting the firmware code working, so it’s good to have some understanding of the fundamentals.

USB Host & USB Device

This is a fairly straightforward chapter, but written just to make sure that I’m correctly understood in the later chapters: USB devices connect to a USB host. In a typical scenario, the USB host is a computer, and a USB device is for example a keyboard. There may be USB hubs in between the two to increase the number of ports in the USB host. There are also USB composite devices, which combine for example a mouse and a keyboard into a single device. Finally, there’s also a USB root hub, which is a hub in the USB host that the other devices connect to.

I’m not sure how I feel about these dark-mode graphs. Something about them just feels off.

USB Speeds

Different versions of USB specification have defined different maximum transfer speeds for the USB protocol. As one can guess, newer protocol version = faster speed. Interestingly enough, the naming also gets increasingly confusing over time.

  • Low Speed (USB 1.0/1.1) [1996]
    • Data Transfer Rate: 1.5 Mbps
  • Full Speed (USB 1.0/1.1) [1996]
    • Data Transfer Rate: 12 Mbps
  • High Speed (USB 2.0) [2000]
    • Data Transfer Rate: 480 Mbps
  • SuperSpeed (USB 3.0 Gen 1) [2008]
    • Data Transfer Rate: 5 Gbps
  • SuperSpeed+ (USB 3.1 Gen 2) [2013]
    • Data Transfer Rate: 10 Gbps
  • SuperSpeed+ (USB 3.2 Gen 2×2) [2017]
    • Data Transfer Rate: 20 Gbps
  • USB4 (USB4 Gen 3×2) [2019]
    • Data Transfer Rate: 40 Gbps
  • USB4 (USB4 Gen 4) [2022]
    • Data Transfer Rate: 80 Gbps

The speeds listed here are the maximums for each version.
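
To get a feel for the differences, here is a quick back-of-the-envelope calculation in Python. It uses the raw signalling rates from the list and an arbitrary 100 MB payload; real-world throughput is lower because of protocol overhead, so treat the numbers as best-case estimates.

```python
# Raw signalling rates in megabits per second, taken from the list above.
RATES_MBPS = {
    "Low Speed": 1.5,
    "Full Speed": 12,
    "High Speed": 480,
    "SuperSpeed": 5_000,
}


def seconds_for(megabytes: float, rate_mbps: float) -> float:
    """Best-case transfer time: megabytes * 8 bits / rate."""
    return megabytes * 8 / rate_mbps


for name, rate in RATES_MBPS.items():
    print(f"{name:11s} {seconds_for(100, rate):8.2f} s for 100 MB")

# Full speed relative to high speed:
print(f"{RATES_MBPS['Full Speed'] / RATES_MBPS['High Speed']:.1%}")  # 2.5%
```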

Descriptors

So far so good. The descriptors are where things start to become more confusing. This chapter won’t explain all the descriptors, because there are too many of them. However, when we are writing and porting the code we need to write some structs defining the USB descriptors of the device, so it’s good to have a basic understanding of what they are.

The USB device uses descriptors in hierarchical layers to describe itself to the USB host. The topmost descriptor is the device descriptor, which contains for example vendor ID, product ID, and supported USB version. There’s one device descriptor for the device. The device descriptor also contains the number of configuration descriptors. Configuration descriptors contain for example the power requirements and the number of interfaces the configuration contains. The driver can select the device configuration from multiple different configurations.

The interfaces are described using, you guessed it, interface descriptors. Each configuration descriptor contains one or more of these. Interface descriptor contains for example class code, protocol code, and the number of endpoints. Finally, endpoints are defined using endpoint descriptors. Endpoint descriptor defines for example max packet size, polling interval, transfer type and data direction of the endpoint.

The simplest type of device can have one of each of the four basic descriptors. A complex device with different configuration profiles and multiple interfaces may have a lot more. And, to make matters more confusing, there are also extra descriptors. For example, there may be class code-specific descriptors (like in our example there will be), and a string descriptor that contains strings, like for example the human-readable device name.
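
To make the device descriptor a bit more concrete, here is a sketch that packs one into its 18-byte wire format with Python. The field order follows the standard device descriptor layout, but the vendor ID, product ID and string indices are made-up illustrative values, not the ones from any particular project.

```python
import struct

# Standard USB device descriptor: 18 bytes, little-endian.
DEVICE_DESCRIPTOR = struct.Struct("<BBHBBBBHHHBBBB")

descriptor = DEVICE_DESCRIPTOR.pack(
    18,      # bLength: size of this descriptor in bytes
    0x01,    # bDescriptorType: DEVICE
    0x0200,  # bcdUSB: USB 2.0
    0xEF,    # bDeviceClass: miscellaneous (used with interface associations)
    0x02,    # bDeviceSubClass
    0x01,    # bDeviceProtocol
    64,      # bMaxPacketSize0: endpoint 0 packet size
    0xCAFE,  # idVendor (made-up value)
    0x4001,  # idProduct (made-up value)
    0x0100,  # bcdDevice: device release 1.00
    1,       # iManufacturer: string descriptor index
    2,       # iProduct: string descriptor index
    3,       # iSerialNumber: string descriptor index
    1,       # bNumConfigurations
)

print(len(descriptor), descriptor.hex())
```

In firmware the same data is typically written as a C struct, but the bytes that travel to the host are the same.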

This graph shows only the descriptors we need to define for our example device. TinyUSB stack will handle the rest of the descriptors for us.

Creating the Device

Now that was boring, wasn’t it? It’s time to do something interesting. As one can guess from the fairly complex protocol, we’re going to need a microcontroller. I have a STM32F446RE Nucleo board lying around in my drawer, so I’m using that for this project. As far as I know, most of the Nucleo boards should work for this project, as long as they have USB OTG. As a USB stack, I chose TinyUSB which is easy enough to use and integrate. Also, I made an example repository of this project that you can use to follow along if you want.

About TinyUSB

TinyUSB is a cross-platform USB stack, suitable both for USB devices and USB hosts. It supports power management, multiple device classes, and is thread- and memory-safe. Especially the latter two are big promises. ST also provides their USB stack for their devices. However, I usually prefer to use solutions that are not vendor-specific. For example, if we wanted to change the hardware from STM32F4 to Raspberry Pi Pico, it’d be a lot easier with TinyUSB. Granted, it may not necessarily be perfectly optimized for all of the devices, but having portable code and transferrable knowledge is always good.

Configuring Project and STM32 Nucleo Board

First, we’re going to integrate TinyUSB and configure the Nucleo board in the STM32CubeIDE. We are going to follow the instructions from this GitHub comment. Big thanks to the person who made it, this step would have been a nightmare otherwise.

I’m using STM32CubeIDE version 1.15.1, so the TinyUSB integration steps below apply to that version of the IDE. I’ve also tried this on version 1.9.0 and TinyUSB seems to work with that version as well, but some labels and menus may have different texts, so keep that in mind if you’re using some other version of the IDE.

The first thing to do is to create a new STM32CubeIDE project and set the target board. In my case, it’s the F446RE. The rest of the defaults (C project, STM32Cube target, etc.) should be fine.

The next step is adding the TinyUSB stack. This consists of adding headers and sources. First, clone the TinyUSB repository to some location where it doesn’t get added to the build sources automatically. I created Libs folder at the root of the project and cloned the repository there. After that is done, add the src and hw include folders to the project and src as the source folder by right-clicking the project in the project explorer and setting them in the project properties.

Here’s what the includes should look like…

…and here are the sources

Then, we can configure the chip. Open the chip configuration ioc-file, and from the menu open Connectivity->USB_OTG_FS to set up the full-speed USB port. There may be a USB_OTG_HS option as well, but the high-speed USB requires an external PHY (unless you have an F7 board). OTG_HS can be configured as full speed, but let’s just stick to OTG_FS. As a side note, I find the name “full speed” ironic considering that the “maximum full speed” is less than 3% of the “maximum high speed”.

I digress. Once you’ve selected USB_OTG_FS, set the “Mode” to “Device Only”. In the Configuration window below the Mode window, under NVIC settings, enable “USB On The Go FS global interrupt”. Finally, double-check that the ST USB middleware is not enabled. Scroll down the menu, select “Middleware”, and make sure that USB_DEVICE and USB_HOST related middleware is set to “Disabled”.

USB configuration is now done, but you may still need to open the Clock Configuration tab to resolve clock issues. Just open the tab, click “Resolve Clock Issues”, and hope for the best.

USB config should look something like this.

Generating the code after integrating TinyUSB has one irritating side effect: it opens main.c of all the TinyUSB examples, resulting in quite a few new tabs being opened in the IDE. I’m not sure how to fix this. I can just say that this happens.

Wiring

Wiring is simple. It consists just of connecting the relevant Nucleo headers to a USB connector with jumper wires. If you don’t happen to have a spare USB connector, you can salvage one from a USB cable.

To power the board you can either use the power from the USB host coming through the USB connector, or you can plug in a cable to the Nucleo’s USB port. For development and debugging purposes, I’d recommend using Nucleo’s USB port and leaving out the power wire because Nucleo port is used for programming the device. However, the schematic below shows how to power the board using the USB connector because it’s a bit more complicated.

Notice that if you’re using the Nucleo’s USB port to power the board, the U5V rail needs to be active, and if you want to power the board with your custom USB connector, the E5V rail should be active. The active rail is controlled with the jumper visualized with a blue wire in the schematic.

Making full use of that $8 Fritzing license by creating the second schematic this year.

If you’re facing issues with the USB enumeration, for example if Windows complains that the device could not be recognized, try swapping D+ and D-. The usual mistake is to get those the wrong way around, and then wonder for too long what could be the issue. To me, it feels like these are always printed incorrectly on the silkscreen, but I’m not sure how many times I can still use that excuse.

Here’s a picture of the final product. In this setup the USB connector can be plugged into a computer and the USB host is used as the power source. It’s perhaps worth noting that the pin layout of my connector is different than in the schematic above.

Code

To summarize, programming the firmware consists of the following tasks:

  • Writing the TinyUSB configuration header
  • Writing the USB descriptors
  • Replacing the ST USB interrupt with the TinyUSB USB interrupt
  • Adding TinyUSB setup and device task functions
  • Programming the functionality of the USB device

It doesn’t make sense for me to go through these steps line-by-line, so you can check out each point from the example GitHub repository. However, I’ll go through each step on a higher, more hand-wavy level. Most of these steps rely heavily on copy-pasting the relevant code from a TinyUSB example. In this project, we are using the CDC dual ports example.

The CDC Dual Ports example demonstrates a CDC class USB device that creates two serial ports. Users can then write into either one of these two. One of the serial ports will output the written characters in lowercase, and the other port will output the same characters in uppercase.
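
The behaviour is simple enough to model on the host side in a few lines of Python. This is not the firmware code (the real example is written in C), just a sketch of what the two ports do with incoming data; the function names are made up.

```python
def echo(port: int, data: bytes) -> bytes:
    """Port 0 answers in lowercase, port 1 in uppercase."""
    return data.lower() if port == 0 else data.upper()


def on_receive(data: bytes) -> dict[int, bytes]:
    """Whatever arrives on either port is echoed on both ports."""
    return {port: echo(port, data) for port in (0, 1)}


print(on_receive(b"Hello USB"))
```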

Device class is a standardized definition that categorizes devices based on their functionality. CDC stands for “Communications Device Class”. While we’re not creating a device providing typical CDC functionality (e.g. network card, modem, fax), we can use the CDC class to easily create a serial port because that’s what devices in that class typically use for communication.

Let’s now start going through the steps. Note that the headings are links to the relevant commits.

Configuration Header

The heading is quite descriptive of what happens in this step. We need to configure the TinyUSB with the chip we’re using, the root hub we intend to use, the mode of the root hub (host or device), the maximum supported speed, etc. In our F446RE board, the full-speed USB OTG is the root hub number 0.

To write the configuration, we can pretty much copy the configuration header tusb_config.h from the example, add the fields from the GitHub answer linked earlier, and replace the root hub number, USB speed and chip with the values applicable to our project. You can diff the configuration in the TinyUSB example and my example to see what exactly was changed.

USB Descriptors

Time for the infamous USB descriptors. Actually, this is a lot simpler than I made you believe earlier. We can simply copy the usb_descriptors.c from the example folder to our project source folder. Of course, if you were writing a USB device from scratch this step would involve more work to get the device to appear correctly to the host. I still recommend checking out the commit and trying to understand what each of the structs does and contains, as they should make (some) sense after reading about the USB descriptors.

USB Interrupt

This is an easy one. Open the file containing the generated FS USB interrupt, add a call to the TinyUSB interrupt handler, and return early to avoid calling the ST USB interrupt—literally two lines (and one include).

TinyUSB Setup and Task

Before the main code enters the main loop, add a tud_init call, and in the main loop call tud_task. tud stands for “TinyUSB Device” (or so I assume). Some examples have functions with the tuh prefix, and these are host-related functions, so I’m guessing that is the meaning of the last letter. There is also a generic tusb_init for initializing both device and host. Discussion about the differences between the two can be found here, but to summarize, tud_init is a more flexible and newer way of doing the initialization.

Note that TinyUSB and examples have an initialization function called board_setup. We are not going to use that, because the initialization code generated by CubeIDE handles the board setup for us.

Programming Functionality

The actual functionality should be the hardest part. Or at least it would be if we were making an actual device from scratch. Since we’re using a ready-made example, we can just copy the functionality we desire from the example to our project.

The functionality we want to copy over from the example contains one task that is to be run in the main loop, and a few callbacks. Quite simple, especially since TinyUSB abstracts a lot of the hardware stuff away. All we have to do is read and write the serial devices in a fairly typical fashion, nothing too USB-specific is required at this stage anymore.

Of course, this step varies wildly depending on what you’re trying to achieve. But, in our humble example project the commit enabling the functionality is quite small and easy to understand.

Testing the Device

After doing the hard work of copying the code and flashing the board, you can plug your USB connector into a computer (don’t do it on your most expensive gaming rig though, better be cautious). At least on Windows, the device should get recognized, and two new serial ports should be added to the system. If you open the serial ports (baud rate 9600), and write to one of them, you should see the input text magically appear in the other one as well. And it’s all upper- or lower-cased!

But what about the driver? Why does the device work without a driver? Well, since we are creating a CDC class device, we don’t need a special driver for it. The generic CDC driver from the operating system is used for driving the device. With more specialized functionality where we couldn’t use (or wouldn’t want to use) a device class for defining the device a custom driver would of course be required. Maybe writing a driver like that is worth a text of its own.

I’m just joshing, it’s all fun stuff. In a kind of masochistic way.

But for now, you should have your first USB device up and running. Not one of the easiest projects, but considering the complexity of the protocol it was in the end quite simple. Big thanks for this go to the TinyUSB project. Also, thanks to the GitHub answer I linked above as it was a massive help with getting familiar with the TinyUSB and STM32. I’m not sure if this text would have even happened without it. That’s all for now, thanks for reading.

Black-Box Fuzzing Kernel Modules in Yocto

It’s been almost ten years since I wrote my thesis. It was about guided fuzz testing, and as usual, I have done zero days of actual work related to the topic of my thesis. However, I was feeling nostalgic one day and thought that I’d fire up a good ol’ fuzzer and see what I could do with it. In the end, not much. But it was fun to try to break something and relive the golden days of my youth.

To shake things up a bit, this time I tried fuzzing a Linux kernel module in a Yocto image, because it seems that I just can’t help but cram Yocto into every blog post I write. But let’s start from the beginning.

What Is Fuzzing?

Fuzzing is a type of testing where more or less broken input is used to check how a program behaves in unexpected situations. Usually, the process consists of collecting input samples, good or bad, running them through a fuzzer that does “something” to the sample, and then feeding this mystery sample to the program being tested. Well-behaving programs handle the erroneous input gracefully, but badly behaving programs may hang, crash, or even worse, use the bad input like nothing is wrong.

Fuzz testing can be subcategorized into a few different groups: black-box, grey-box and white-box fuzzing. In black-box fuzzing there is no knowledge of the internals of the program, and no test feedback is used to guide the fuzzer. On the other hand, when using the white-box fuzzing the full knowledge of the program flow and protocols is available. In grey-box, there is no “deep” knowledge of the program, but for example code coverage may be used to guide the fuzzer.

As one can guess, a black-box fuzzer is the simplest to set up, but generally it is inefficient. White-box fuzzing is the opposite, where the initial effort may not even be worth it in the end. Grey-box, once again, lands somewhere in the middle. The instrumentation and feedback may require some effort, but it is (usually) worth it in the form of improved results.

Fuzzing on Embedded Target

Even though fuzzing can reveal some fascinating bugs, it’s worth noting that performing fuzzing on an embedded device may not always be a good idea. Usually, the efficiency of the fuzzing is directly proportional to the number of tests being run per second. “Real” computers tend to be more powerful, resulting in more tests getting churned out compared to the embedded systems. The requirement for speed is especially true for black-box fuzzing, which is basically brute forcing bugs out of the system. Therefore, you may want to consider fuzzing high-level application code on a more powerful computer, or in a virtualized environment to reveal more complex issues.
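To get a rough feel for how many tests per second a given machine can churn out, something like the sketch below could be used. The measure_rate helper is a made-up name for this illustration, and it assumes that date +%s is available (it is on GNU coreutils and BusyBox, but it isn’t strictly POSIX). The resolution is one second, so use a large enough count to get a meaningful number.

```shell
#!/bin/sh
# measure_rate COUNT COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND COUNT times and prints the
# approximate number of runs per second (integer division).
measure_rate() {
    count=$1
    shift
    start=$(date +%s)
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" >/dev/null 2>&1
        i=$((i + 1))
    done
    end=$(date +%s)
    elapsed=$((end - start))
    # date +%s has one-second resolution, so avoid dividing by zero
    [ "$elapsed" -gt 0 ] || elapsed=1
    echo $((count / elapsed))
}

# Example, assuming radamsa is installed:
#   measure_rate 200 sh -c 'echo test | radamsa'
```

Running the same command on the target and on the development machine should give an idea of how much speed is lost by fuzzing on the actual hardware.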

Fuzzing on the actual hardware makes the most sense in the following scenarios:

  • The code you’re testing relies on some architecture-specific functionality
  • The code relies on some hardware functionality that cannot be easily simulated
  • The hardware can generate samples and run target programs with “tolerable” efficiency
  • You want to do a quick smoke test type of fuzzing run

However, despite trying to talk you out of fuzzing on the target HW, I personally think it’s a good idea to give a quick black-box fuzzing session at least a try. It can reveal some low-hanging bugs, and setting up a black-box fuzzer takes little to no effort. Just be aware of the limitations, and the fact that it’s not going to be as efficient as it could.

Sometimes the bugs look for you though. Like ants at the picnic.

Finally, it’s worth knowing that things can go really wrong with fuzzing, so consider the potential risks, and if there’s a possibility of some hardware breaking. It’s usually unlikely, but aggressively fuzzing for example a poorly written device driver can result in bricking.

Radamsa

There are plenty of black-box fuzzers available for various purposes. Protocol fuzzers, web-app fuzzers, cloud fuzzers, etc. In this example, I’m using Radamsa. It’s a generic command-line fuzzer that is simple to use yet fairly powerful. Not coincidentally, I also used it 10 years ago when writing my thesis.

Radamsa takes input either from stdin or from a file, and outputs fuzz either to stdout or to a file. This can then either be piped to the tested program, or the tested program can be instructed to open the file. Radamsa can also act as a TCP client or server, but I haven’t tried either of those so I can’t comment much on that. You can read more about Radamsa from its git repo.

The program is written in Owl Lisp, which gets translated into C, so the cross-compilation is quite straightforward once the Owl Lisp is set up. Because we don’t have to do any compilation time instrumentation for grey-box fuzzing guidance, the steps to build the fuzzer and the testable software are quite simple. The testable software in our case is going to be a kernel module. We still want to do some error instrumentation that will be covered in the next chapter, but since we’re fuzzing in kernel, it’s easier than one would guess (for once).

The Yocto recipe for building Radamsa can be found in the meta-fuzzing repo I made to accompany this blog text.

Instrumentation

Breaking stuff with no consideration is rude. Breaking stuff and analyzing the results can be considered science. Therefore, to get something useful out of the fuzzing efforts we should figure out how to get as much information as possible from the system when it’s being bombarded. While black-box fuzzing doesn’t really need instrumentation, it makes fuzzing a lot more useful when we can detect more errors.

So, usually with all types of fuzzing some amount of compile-time instrumentation is used. This allows injecting extra code into the compiled binaries that may provide useful information when things start going wrong. A commonly used tool for this is AddressSanitizer (ASAN) and its fellow sanitizers. AddressSanitizer is a memory error detector that can detect things like use-after-frees, buffer overflows, and double-frees. As the nature of these bugs implies, it’s meant for C and C++ programs.

Sometimes I think I deserve happiness

Of course, this comes with a price. On average, AddressSanitizer tends to slow down the programs 2x. Who would have guessed that injecting code into binaries has some side effects? For debugging purposes, this is still usually acceptable.

The best part of the AddressSanitizer is that it’s readily available in the Linux kernel! To enable the Kernel Address Sanitizer (KASAN), all that needs to be done is to set two configuration flags:

CONFIG_KASAN=y
CONFIG_KASAN_GENERIC=y

You can read more about the different KASAN modes from the KASAN documentation, but in summary, generic is the heaviest, but also the most compatible mode. There are faster modes, but they may be architecture and compiler specific. After enabling these flags, we can detect memory errors not only in the kernel but also in the modules we are building for that kernel.

Linux also has undefined behaviour sanitizer (UBSAN), Kernel concurrency sanitizer (KCSAN), and Kernel memory leak detector (no fun acronym), but let’s leave them out for now. They can be enabled similarly by toggling configuration flags, so no special work is needed from the driver side.
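Before starting a long fuzzing run, it may be worth double-checking that the flags actually ended up in the kernel configuration. The sketch below is one way to do that. The has_kconfig and report_sanitizers helpers are hypothetical names for this example, and the usage comment assumes that the kernel was built with CONFIG_IKCONFIG_PROC so that /proc/config.gz exists. If it wasn’t, the same functions can be pointed at the .config file in the kernel build directory.

```shell
#!/bin/sh
# has_kconfig CONFIG_FILE OPTION
# Hypothetical helper: succeeds if OPTION is set to y in CONFIG_FILE.
has_kconfig() {
    grep -q "^$2=y\$" "$1"
}

# report_sanitizers CONFIG_FILE
# Prints whether the two KASAN flags from this text are enabled.
report_sanitizers() {
    for opt in CONFIG_KASAN CONFIG_KASAN_GENERIC; do
        if has_kconfig "$1" "$opt"; then
            echo "$opt enabled"
        else
            echo "$opt missing"
        fi
    done
}

# Example on the target, assuming CONFIG_IKCONFIG_PROC is enabled:
#   zcat /proc/config.gz > /tmp/running.config
#   report_sanitizers /tmp/running.config
```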

Example Module

To have something to fuzz, I wrote a simple Linux kernel module (with help from ChatGPT). The module creates two sysfs files, one that takes input and one that gives output. Anything written to the first file can be read from the second file. This allows passing data from user space to kernel space, and is a suitable input surface for fuzzing. The sysfs interface may not be the most interesting one, because there is some processing that happens before the input written by the user ends up in the kernel module, but it’s a simple test for verifying that the set-up works.

The code for this module can be found in meta-fuzzing repo as well.

Putting It All Together

The rest of the stuff is quite simple. If you’re using Yocto, add the meta-fuzzing layer to your Yocto build, add the kernel configuration into your kernel config, and install Radamsa (and the test module) to the image. If you’re using something else, then you do the same things but with a different system. Then, run the image, log into it, and run the following:

echo test | radamsa

Most likely something other than test gets printed. If not, give it a few more tries. If the output doesn’t look like t ejSt after a few tries something may be wrong.

To fuzz the actual test kernel module, you can run the following:

modprobe sysfs_attribute_echo
while true
  do cat /sys/kernel/sysfs_attribute_echo/output | radamsa > /sys/kernel/sysfs_attribute_echo/input
done

This loads the module, and then in a never-ending loop reads the output from the kernel module, fuzzes it and passes it back to the input file. As an example of the sample file-based fuzzing, check this out:

mkdir /tmp/samples
echo aaa > /tmp/samples/sample-1
echo bbb > /tmp/samples/sample-2
echo ccc > /tmp/samples/sample-3
while true
  do radamsa -n 1 /tmp/samples/* > /sys/kernel/sysfs_attribute_echo/input
done

We create three sample files, and fuzz randomly one of them. Radamsa can output the fuzzed data into a file, but we still use stdout to send it to the kernel module. The samples in this case are quite trivial, but with more interesting sample files it would be possible to generate quite exotic fuzzed data.

For example, fuzzing a picture of “exotic beach” may result in something like this.

Does this find bugs from our module or kernel? No. Or at least it is highly unlikely. The kernel module itself is simple, and shouldn’t contain bugs (famous last words). Or, if there’s a bug, it’s either in the Linux kernel sysfs or kstrdup functions and those are already quite extensively tested (more famous last words). Unless there’s a regression of course.

However, this script demonstrates one admittedly simple approach of passing fuzzed data into the kernel space. The parsing of the data could be more exciting in a more complex module, which could in turn lead to actual bugs.
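One practical problem with the endless loops above is that when something finally breaks, the input that caused it is already gone. The sketch below is one possible variation that runs a bounded number of iterations and saves every fuzzed input to a file before writing it to the target. The fuzz_run and kasan_reported functions, and the FUZZER variable, are names made up for this example. The only thing assumed about the kernel is that KASAN reports show up in the kernel log with the text "BUG: KASAN".

```shell
#!/bin/sh
# FUZZER can be overridden, e.g. with "cat" for a dry run without Radamsa.
FUZZER=${FUZZER:-radamsa}

# fuzz_run ITERATIONS SAMPLE_DIR TARGET CASE_DIR
# Hypothetical helper: fuzzes the samples ITERATIONS times and stores
# each generated input in CASE_DIR before writing it to TARGET.
fuzz_run() {
    iterations=$1
    sample_dir=$2
    target=$3
    case_dir=$4
    mkdir -p "$case_dir"
    i=0
    while [ "$i" -lt "$iterations" ]; do
        case_file="$case_dir/case-$i"
        # Save the fuzzed input first so it can be replayed later
        $FUZZER "$sample_dir"/* > "$case_file"
        # A rejected write should not stop the loop, so ignore the status
        cat "$case_file" > "$target" 2>/dev/null || true
        i=$((i + 1))
    done
}

# kasan_reported LOG_FILE
# Succeeds if the given log file contains a KASAN report.
kasan_reported() {
    grep -q 'BUG: KASAN' "$1"
}

# Example on the target:
#   fuzz_run 1000 /tmp/samples /sys/kernel/sysfs_attribute_echo/input /home/root/cases
#   dmesg > /tmp/dmesg.log && kasan_reported /tmp/dmesg.log && echo "KASAN report found"
```

Note that /tmp is often a tmpfs on embedded images, so if the goal is to have the cases survive a reboot, the case directory should live on persistent storage.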

Closing Words

That’s all for this time. As shown here, the whole black-box fuzzing of the kernel can be straightforward. As mentioned about a dozen times in this text, the example was quite simple but demonstrates the point. The same ideas apply to more complex setups as well. The advantage of the black-box fuzzing is that it is easy to set up, so I recommend giving it a go and seeing what happens. Hopefully something exciting!

Yocto Emulation: Setting Up QEMU with TPM

As promised, it’s time for the QEMU follow-up. Last time we got Yocto’s runqemu command to launch u-boot, boot up a kernel, and mount a virtual drive with multiple partitions. Quite a lot of stuff. This time we are “just” going to add a TPM device to the virtual machine. As before, you can find the example meta-layer on GitHub. It contains the example snippets presented in this blog text, and should be ready to use.

Why is this virtualized TPM worth the effort? Well, if you have ever been in a painful situation where you’re working with TPMs and you’re writing some scripts or programs using them, you know that the development is not as straightforward as one would hope. The flows tend to be confusing, frustrating, and difficult. Using a virtual environment that’s easy to reset and that’s quite close to the actual hardware is a nice aid for developing and testing these types of applications.

In a nutshell, the idea is to run swtpm TPM emulator on the host machine, and then launch QEMU Arm device emulator that talks with the swtpm process. QEMU has an option for a TPM device that can be passed through to the guest device, so the process is fairly easy. With these systems in place, we can have a virtual TPM chip inside the virtual machine. *insert yo dawg meme here*

TPM Emulation With swtpm

Because I’m terrible at explaining things understandably, I’m going to ask my co-author ChatGPT to summarise in one paragraph what a TPM is:

Trusted Platform Module (TPM) is a hardware-based security feature integrated into computer systems to provide a secure foundation for various cryptographic functions and protect sensitive data. TPM securely stores cryptographic keys, certificates, and passwords, ensuring they remain inaccessible to unauthorized entities. It enables secure boot processes, integrity measurement, and secure storage of credentials, enhancing the overall security of computing devices by thwarting attacks such as tampering, unauthorized access, and data breaches.

I’m not sure if this is easier to understand than my ramblings, but I guess it makes the point clear. It’s a hardware chip that can be used to store and generate secrets. One extra thing worth knowing is that there are two notable versions of the TPM specification: 1.2 and 2.0. When I’m talking about TPM in this blog text, I mean TPM 2.0.

Since we’re using emulated hardware, we don’t have the “hardware” part in the system. Well, QEMU has a passthrough option for hardware TPMs, but for development purposes it’s easier to have an emulated TPM, something that swtpm can be used for. Installing swtpm is straightforward, as it can be found in most of the package repositories. For example, on Ubuntu, you can just run:

sudo apt install swtpm

Building swtpm is also an option. It has quite a few dependencies though, so you may want to consider just fetching the packages. Sometimes taking the easy route is allowed.

Whichever option you choose, once you’re ready you can run the following commands to set up the swtpm and launch the swtpm process:

mkdir /tmp/mytpm1
swtpm_setup --tpmstate /tmp/mytpm1 \
  --create-ek-cert \
  --create-platform-cert \
  --create-spk \
  --tpm2 \
  --overwrite
swtpm socket --tpmstate dir=/tmp/mytpm1 \
  --ctrl type=unixio,path=/tmp/mytpm1/swtpm-sock \
  --tpm2 \
  --log level=20

Once the process launches, it opens a Unix domain socket that listens to the incoming connections. It’s worth knowing that the process gets launched as a foreground job, and once a connected process exits swtpm exits as well. Next, we’re going to make QEMU talk with the swtpm daemon.
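Because swtpm runs in the foreground, a script that launches both swtpm and QEMU would probably start swtpm in the background and then wait until the socket shows up. A minimal sketch of that waiting part is below. wait_for_path is a hypothetical helper, and it only checks that the path exists. Replacing -e with -S would additionally check that the path is a socket.

```shell
#!/bin/sh
# wait_for_path PATH TIMEOUT_SECONDS
# Hypothetical helper: polls once per second until PATH exists,
# and fails if it has not appeared within TIMEOUT_SECONDS.
wait_for_path() {
    path=$1
    remaining=$2
    while [ ! -e "$path" ]; do
        if [ "$remaining" -le 0 ]; then
            echo "timed out waiting for $path" >&2
            return 1
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Example on the host, with the swtpm arguments from this text:
#   swtpm socket --tpmstate dir=/tmp/mytpm1 \
#     --ctrl type=unixio,path=/tmp/mytpm1/swtpm-sock --tpm2 &
#   wait_for_path /tmp/mytpm1/swtpm-sock 10 || exit 1
```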

QEMU TPM

Fortunately, making QEMU communicate with TPM isn’t anything groundbreaking. There’s a whole page of documentation dedicated to this topic, so we’re just going to follow it. For Arm devices, we want to pass the following additional parameters to QEMU:

-chardev socket,id=chrtpm,path=/tmp/mytpm1/swtpm-sock \
-tpmdev emulator,id=tpm0,chardev=chrtpm \
-device tpm-tis-device,tpmdev=tpm0

These parameters should result in the QEMU connecting to the swtpm, and using the emulated software TPM as a TPM in the emulated machine. Simple as.

One thing worth noting though. Since we’re adding a new device to the virtual machine, the device tree changes as well. Therefore, we need to dump the device tree again. This was discussed more in-depth in the first part of this emulation exercise, so I recommend reading that. In summary, you can dump the device tree with the following runqemu command:

BIOS=tmp/deploy/images/qemuarm-uboot/u-boot.bin \
runqemu \
core-image-base nographic wic.qcow2 \
qemuparams="-chardev socket,id=chrtpm,path=/tmp/mytpm1/swtpm-sock \
-tpmdev emulator,id=tpm0,chardev=chrtpm \
-device tpm-tis-device,tpmdev=tpm0 \
-machine dumpdtb=qemu.dtb"

Then, you need to move the dumped binary to a location where it can get installed to the boot partition as a part of the Yocto build. This was also discussed in the first blog text.

TPM2.0 Software Stack

Configuring Yocto

Now that we have the virtualized hardware in order, it’s time to get the software part sorted out. Yocto has a meta-layer that contains security features and programs. That layer is aptly named meta-security. To add the TPM-related stuff into the firmware image, add sub-layer meta-tpm to bblayers.conf. meta-tpm has dependencies to meta-openembedded sub-layers meta-oe and meta-python, so add those as well.

Once the layers are added, we still need to configure the build a bit. The following should be added to your distro.conf, or if you don’t have one, local.conf should suffice:

DISTRO_FEATURES:append = " tpm"

Configuring Linux Kernel

Next, to get the TPM device working together with Linux, we need to configure the kernel. First of all, the TPM feature needs to be enabled, and then the driver for our emulated chip needs to be added. If you were curious enough to decompile the QEMU device tree binary, you may have noticed that the emulated TPM device is compatible with tcg,tpm-tis-mmio. Therefore, we don’t need a specific driver, the generic tpm-tis driver should do. The following two config lines both enable TPM and add the driver:

CONFIG_TCG_TPM=y
CONFIG_TCG_TIS=y

If you’re wondering what TCG means, it stands for Trusted Computing Group, the organization that has developed the TPM standard. TIS on the other hand stands for TPM Interface Specification. There are a lot of TLAs here that begin with the letter T, and we haven’t even seen all of them yet.

Well, here’s the yo dawg meme.

Configuring U-Boot

Configuring TPM support for U-Boot is quite simple. Actually, the U-Boot I built worked straight away with the defconfig. However, if you have issues with TPM in U-Boot, you should ensure that you have the following configuration items enabled:

# Enable TPM 2.0
CONFIG_TPM=y
CONFIG_TPM_V2=y
# Add MMIO interface for device
CONFIG_TPM2_MMIO=y
# Add TPM command
CONFIG_CMD_TPM=y
# This should be enabled automatically if
# CMD_TPM and TPM_V2 are enabled
CONFIG_CMD_TPM_V2=y

Installing tpm2-tools

In theory, we now should have completed the original goal of booting a Yocto image on an emulator that has a virtual TPM. However, there’s still nothing that uses the TPM. To add plenty of packages, tpm2-tools among them, we can add the following to the image configuration:

IMAGE_INSTALL:append = " \
    packagegroup-security-tpm2 \
    libtss2-tcti-device \
"

packagegroup-security-tpm2 contains the following packages:

tpm2-tools
trousers
tpm2-tss
libtss2
tpm2-abrmd
tpm2-pkcs11

For our testing purposes, we are mostly interested in tpm2-tools and tpm2-tss, and libtss2 that tpm2-tools requires. TSS here stands for TPM2 Software Stack. trousers is an older implementation of the stack, tpm2-abrmd (=access broker & resource manager daemon) didn’t work for me (and AFAIK using a kernel-managed device is preferred anyway), and PKCS#11 isn’t required for our simple example. libtss2-tcti-device is required to enable a TCTI (TPM Command Transmission Interface) for communication with Linux kernel TPM device files. These are the last acronyms, so now you can let out a sigh of relief.

Running QEMU

Now you can rebuild the image to compile a suitable kernel and user-space tools. Once the build finishes, you can use the following command to launch QEMU (ensure that swtpm is running):

BIOS=tmp/deploy/images/qemuarm-uboot/u-boot.bin \
runqemu \
core-image-base nographic wic.qcow2 \
qemuparams="-chardev socket,id=chrtpm,path=/tmp/mytpm1/swtpm-sock \
-tpmdev emulator,id=tpm0,chardev=chrtpm \
-device tpm-tis-device,tpmdev=tpm0"

Then, stop the booting process to drop into the U-Boot terminal. We wrote the boot script in the previous blog, but now we can add tpm2 commands to initialize and self-test the TPM. The first three commands of this complete boot script set up and self-test the TPM:

# Initialize TPM
tpm2 init
tpm2 startup TPM2_SU_CLEAR
tpm2 self_test full
# Set boot arguments for the kernel
setenv bootargs root=/dev/vda2 console=ttyAMA0
# Load kernel image
setenv loadaddr 0x40200000
fatload virtio 0:1 ${loadaddr} zImage
# Load device tree binary
setenv loadaddr_dtb 0x49000000
fatload virtio 0:1 ${loadaddr_dtb} qemu.dtb
# Boot the kernel
bootz ${loadaddr} - ${loadaddr_dtb}

Now, once the machine boots up, you should see /dev/tpm0 and /dev/tpmrm0 devices present in the system. tpm0 is a direct access device, and tpmrm0 is a device using the kernel’s resource manager. The latter of these is the alternative to tpm2-abrmd, and we’re going to be using it for a demo.
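If you want to check the device nodes from a script instead of eyeballing the output of ls, a small sketch like the one below could be used. missing_tpm_nodes is a name made up for this example, and it only checks that the nodes exist, not that the TPM behind them actually responds.

```shell
#!/bin/sh
# missing_tpm_nodes [DEV_DIR]
# Hypothetical helper: prints the names of the expected TPM device
# nodes that are absent from DEV_DIR (defaults to /dev).
missing_tpm_nodes() {
    dev_dir=${1:-/dev}
    for node in tpm0 tpmrm0; do
        [ -e "$dev_dir/$node" ] || echo "$node"
    done
}

# Example on the target:
#   missing=$(missing_tpm_nodes)
#   [ -z "$missing" ] || echo "missing TPM nodes: $missing"
```

If tpm0 is missing, the first things to check are most likely the kernel configuration and whether the dumped device tree contains the TPM node.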

TPM Demo

Before we proceed, I warn you that my knowledge of actual TPM usage is a bit shallow. So, the example presented here may not necessarily follow the best practices, but it should perform a simple task that should prove that the QEMU TPM works. We are going to create a key, store it in the TPM, sign a file and verify the signature. When you’ve got the device booted with the swtpm running in the background, you can start trying out these commands:

# Set environment variable for selecting TPM device
# instead of the abrmd.
export TPM2TOOLS_TCTI="device:/dev/tpmrm0"
# Create contexts
tpm2_createprimary -C e -c primary.ctx
tpm2_create -G rsa -u rsa.pub -r rsa.priv -C primary.ctx
# Load and store contexts
tpm2_load -C primary.ctx -u rsa.pub -r rsa.priv -c rsa.ctx
tpm2_evictcontrol -C o -c primary.ctx 0x81010002
tpm2_evictcontrol -C o -c rsa.ctx 0x81010003
# Remove generated files and create message
rm rsa.pub rsa.priv rsa.ctx primary.ctx
echo "my message" > message.dat
# Sign and verify signature with TPM handles
tpm2_sign -c 0x81010003 -g sha256 -o sig.rssa message.dat
tpm2_verifysignature -c 0x81010003 -g sha256 -s sig.rssa -m message.dat

If life goes your way, all the commands should succeed without issues and you can create and verify the signature using the handles in the TPM. Usually, things aren’t that simple. If you see errors related to abrmd, you may need to define the TCTI as the tpmrm0 device. The TPM2TOOLS_TCTI environment variable should do that. However, if that doesn’t work you can try adding -T "device:/dev/tpmrm0" to the tpm2_* commands, so for example the first command looks like this:

tpm2_createprimary -C e -c primary.ctx -T "device:/dev/tpmrm0"

When running the tpm2_* commands, you should see swtpm printing out plenty of information. This information includes requests and responses received and sent by the daemon. To make some sense of these hexadecimal dumps, you can use tpmstream tool.
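One thing to keep in mind is that the demo stores objects at persistent handles, so running it a second time against the same TPM state will likely fail because the handles are already in use. A sketch for cleaning up is below. evict_handles is a hypothetical wrapper, and it relies on tpm2_evictcontrol removing an object when it is given a handle that is already persistent. It also assumes that TPM2TOOLS_TCTI is set as in the demo.

```shell
#!/bin/sh
# evict_handles HANDLE...
# Hypothetical helper: removes the given persistent handles from the
# owner hierarchy, stopping at the first failure.
evict_handles() {
    for handle in "$@"; do
        # Passing an already persistent handle as the object evicts it
        tpm2_evictcontrol -C o -c "$handle" || return 1
    done
}

# Example on the target, using the handles from the demo:
#   tpm2_getcap handles-persistent
#   evict_handles 0x81010002 0x81010003
```

Since we’re using an emulated TPM, the other option is of course to stop swtpm and run swtpm_setup again with --overwrite to start from a clean state.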

That should wrap up my texts about QEMU, Yocto and TPM. Hopefully, these will help you set up a QEMU device that has a TPM in it. I also hope that in the long run this setup helps you to develop and debug secure Linux systems that utilize TPM properly. Perhaps I’ll write more about TPMs in the future, it was quite difficult to find understandable sources and examples utilizing its features. But maybe first I’d need to understand the TPMs a bit better myself.

Yocto Emulation: Setting Up QEMU with U-Boot

I’ve been thinking about the next topic for the Yocto Hardening blog series, and it’s starting to feel like the easy topics are running out. Adding and using non-root users, basic stuff. Running a tool to check kernel configuration, should be simple enough. Firewalls, even your grandma knows what a firewall is.

So, I started to look into things like encryption and secure boot, but turns out they are quite complicated topics. Also, they more or less require a TPM (Trusted Platform Module), and I don’t have a board with such a chip. And even if I did, it’d be more useful to have flexible hardware for future experiments. And for writing blog texts that can be easily followed along it’d be beneficial if that hardware would be easily available for everyone.

Hardware emulation sounds like a solution to all of these problems. Yocto provides a script for using QEMU (Quick EMUlator) in the form of the runqemu wrapper. However, by default that script seems to just boot up the kernel and root file system using whatever method QEMU considers the best (depending on the architecture). Also, runqemu passes just the root file system partition as a single drive to the emulator. Emulating a device with a bootloader and a partitioned disk image is a bit tricky, but that’s exactly what we’re going to do in this text. In the next part we’re going to throw a TPM into the mix, but for now, let’s focus on the basics.

Configuring the Yocto Build

Before we start, I’ll say that you can find a meta-layer containing the code presented here from GitHub. So if you don’t want to copy-paste everything, you can clone the repo. It’ll contain some more features in the future but the basic functionality created in this blog text should be present in the commit cf4372a.

Machine Configuration

To start, we’re going to define some variables related to the image being built. To do that, we will define our machine configuration that is an extension of a qemuarm configuration:

require conf/machine/qemuarm.conf

# Use the same overrides as qemuarm machine
MACHINEOVERRIDES:append = ":qemuarm"

# Set the required entrypoint and loadaddress
# These are usually 00008000 for Arm machines
UBOOT_ENTRYPOINT =       "0x00008000"
UBOOT_LOADADDRESS =      "0x00008000"

# Set the imagetype
KERNEL_IMAGETYPE = "zImage"
# Set kernel loadaddr, should match the one u-boot uses
KERNEL_EXTRA_ARGS += "LOADADDR=${UBOOT_ENTRYPOINT}"

# Add wic.qcow2 image that can be used by QEMU for drive image
IMAGE_FSTYPES:append = " wic.qcow2"

# Add wks file for image partition definition
WKS_FILE = "qemu-test.wks"

# List artifacts in deploy dir that we want to be in boot partition
IMAGE_BOOT_FILES = "zImage qemu.dtb"

# Ensure things get deployed before wic builder tries to access them
do_image_wic[depends] += " \
    u-boot:do_deploy \
    qemu-devicetree:do_deploy \
"

# Configure the rootfs drive options. Biggest difference to original is
# format=qcow2, in original the default format is raw
QB_ROOTFS_OPT = "-drive id=disk0,file=@ROOTFS@,if=none,format=qcow2 -device virtio-blk-device,drive=disk0"

Drive Image Configuration with WIC

Once that is done, we can write the wks file that’ll guide the process that creates the wic image. wic image can be considered as a drive image with partitions and such. Writing wks files is worth a blog text of its own, but here’s the wks file I’ve been using that creates a drive containing two partitions:

part /boot --source bootimg-partition --ondisk vda --fstype=vfat --label boot --active --align 1024
part / --source rootfs --use-uuid --ondisk vda --fstype=ext4 --label platform --align 1024

The first partition is a FAT boot partition where we will store the kernel and device tree so that the bootloader can load them. Second is the ext4 root file system, containing all the lovely binaries Yocto spends a long time building.

Device Tree

We have defined the machine and the image. The only thing that is still missing is the device tree. The device tree defines the hardware of the machine in a tree-like format and should be passed to the kernel by the bootloader. QEMU generates a device tree on-the-fly, based on the parameters passed to it. The generated device tree binary can be dumped by adding -machine dumpdtb=qemu.dtb to the QEMU command. With runqemu, you can use the following command to pass the parameter:

runqemu core-image-base nographic wic.qcow2 qemuparams="-machine dumpdtb=qemu.dtb"

However, here we have a circular dependency. The image depends on the qemu-devicetree recipe to deploy the qemu.dtb, but runqemu cannot be run without an image, so the image needs to be built to dump the device tree. To sort this out, remove the qemu-devicetree dependency from the machine configuration, build once, and dump the device tree. Then re-enable the dependency.

After this, you can give the device tree binary to a recipe and deploy it from there. Or you could maybe decompile it to a source file, and then re-compile the source as a part of kernel build to do things “correctly”. I was lazy and just wrote a recipe that deploys the binary:

SUMMARY = "QEMU device tree binary"
DESCRIPTION = "Recipe deploying the generated QEMU device tree binary blob"
LICENSE = "MIT"
LIC_FILES_CHKSUM = "file://${COMMON_LICENSE_DIR}/MIT;md5=0835ade698e0bcf8506ecda2f7b4f302"

SRC_URI = "file://qemu.dtb"

inherit deploy

do_deploy() {
    install -d ${DEPLOYDIR}
    install -m 0664 ${WORKDIR}/*.dtb ${DEPLOYDIR}
}

addtask do_deploy after do_compile before do_build

Once that is done, you should be able to build the image. I recommend checking out the meta-layer repo if you found this explanation confusing. I’m using core-image-base as the image recipe, but you should be able to use pretty much any image, assuming it doesn’t overwrite variables in machine configuration.

Setting up QEMU

Running runqemu

We should now have an image that contains everything needed to emulate a boot process: it has a bootloader, a kernel and a file system. We just need to get the runqemu to play along nicely. To start booting from the bootloader, we want to pass the bootloader as a BIOS for QEMU. Also, we need to load the wic.qcow2 file instead of the rootfs.ext4 as the drive source so that we have the boot partition present for the bootloader. All this can be achieved with the following command:

BIOS=tmp/deploy/images/qemuarm-uboot/u-boot.bin runqemu core-image-base nographic wic.qcow2

nographic isn’t mandatory if you’re running in an environment that has visual display capabilities. To this day I still don’t quite understand how the runqemu argument parsing works, even though I tried going through the script source. It simultaneously feels like it’s very picky about the order of the parameters, and that it doesn’t matter at all what you pass and at what position. But at least the command above works.

Booting the Kernel

If things go well, you should be greeted with the u-boot log. If you’re quick, spam any key to stop the boot, and if you’re not, spam Ctrl-C to stop the bootloader’s desperate efforts of TFTP booting. I’m not 100% sure why the default boot script fails to load the kernel, but I think the boot script doesn’t like the boot partition being a FAT partition on a virtio interface. To be honest, I would have been more surprised if the stock script had worked out of the box. However, what works is the script below:

# Set boot arguments for the kernel
setenv bootargs root=/dev/vda2 console=ttyAMA0
# Load kernel image
setenv loadaddr 0x40200000
fatload virtio 0:1 ${loadaddr} zImage
# Load device tree binary
setenv loadaddr_dtb 0x49000000
fatload virtio 0:1 ${loadaddr_dtb} qemu.dtb
# Boot the kernel
bootz ${loadaddr} - ${loadaddr_dtb}

This script does exactly what the comments say: it loads the two artefacts from the boot partition and boots the board. We don’t have an init RAM disk, so we skip the second parameter of bootz. I also tried to create a FIT (Flattened Image Tree) image with uImage to avoid having multiple boot files in the boot partition. Unfortunately, that didn’t quite work out. Loading the uImage got the device stuck with a nefarious "Starting kernel ..." message for some reason.

Back to the task at hand: if things went as they should have, the kernel should boot with bootz, and eventually you should be dropped to the login prompt. You can run the mount command to see that the boot partition gets mounted, and cat /proc/cmdline to check that vda2 indeed was the root device that was used.

Closing Words And What’s Next

Congratulations! You got the first part of the QEMU set-up done. The second half with the TPM setup will follow soon. The example presented here could be improved in a few ways, like by adding a custom boot script for u-boot so that the user doesn’t have to input the script manually to boot the device, and by getting that darn FIT image working. But those will be classified as “future work” for now. Until next time!

The second part where the TPM gets enabled is out now!

My First Plug-In: Pastel Distortion

It’s time to finish a project. Lately, I have been mostly interested in embedded tinkering, but I’m also fascinated by audio and DSP programming. Partially because it is an interesting field, but mostly because I make music as a hobby, so it’s interesting to see how the virtual instruments and audio effects work. So, in this text I’m presenting my first full-fledged and complete VST plug-in, Pastel Distortion. In a way it’s my second plug-in, as I used to work on the Delayyyyyy plug-in (that’s mentioned in some older texts of this blog as well), but that project has been abandoned in a state that I can’t quite call complete. However, here’s a screenshot of something that I actually have completed:

tada.wav

In short, VST plug-ins are software used in music production. They create and modify the sound based on the information they’re given by the VST host, which is usually a digital audio workstation. Plug-ins are commonly chained together so that one plug-in’s output is connected to the next one’s input. This is all done in real time, while the music is playing.

There will be free downloads at the end, but first, let’s go through the history of the project, some basic theory, and a six-paragraph subchapter that I like to call “I’m not sponsored by JUCE, but I should be”.

History of the Project

About two and a half years ago I started working on a distortion VST plug-in following this tutorial. Half a year after that, I got distracted when I thought about testing the plug-in (which resulted in this blog text and my first conference talk). As a side note, it may tell something about the development process and schedule when testing is “thought about” six months after starting the project. A year after that I got a Macbook for Macing Mac builds and got distracted by the new shiny laptop. And some time after that I made some overenthusiastic plans for the plug-in that didn’t quite come to reality and then I forgot to develop the plug-in.

The timeline is almost as confusing as Marvel Multiverse and full of delays, detours and time loops. In the end, I’ve just come to a conclusion that I’ll release this Pastel Distortion as it is, and add the new cool features later on if there’s interest in the plug-in. If there’s no interest, I can start working on a new plug-in, so it’s a win-win situation. But let’s finish this first.

What Is Waveshaping Distortion?

The Physics

In the real world that surrounds us all, sound is a change of pressure in a medium. Our ears then receive these changes of pressure and turn them into some sort of electricity in the brain. In short, it’s magic, that’s the best way I can explain it. To translate this into the world of computers, a microphone receives changes of pressure in air and converts them into changes in electricity that an analogue-to-digital converter then turns into ones and zeros understood by a computer. Magic, but of a slightly different kind.

I asked AI to generate an infographic for this section. Hopefully this helps you to understand.

After this transformation, we can process the sound in the digital domain, and then convert it back into an analogue signal and play it out from speakers. Commonly the pressure/voltage changes get mapped into numbers within some range. One common range is [-1.0, 1.0]. -1.0 and 1.0 represent the extreme pressure changes where the microphone’s diaphragm is at its limit positions (= receiving loud sound), while the value of 0 is the position where it receives no pressure (= receiving silence).

Well, I also tried drawing the information myself. I’m not sure which one is better.
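To make the [-1.0, 1.0] range a bit more concrete, here is a minimal Python sketch of how integer samples could be mapped to it and back. It assumes signed 16-bit samples, which is a common format but just an assumption here, and the function names pcm16_to_float and float_to_pcm16 are made up for this example.

```python
def pcm16_to_float(sample: int) -> float:
    """Map a signed 16-bit sample (-32768..32767) into the range [-1.0, 1.0)."""
    return sample / 32768.0


def float_to_pcm16(value: float) -> int:
    """Map a float sample back to a signed 16-bit integer, clamping overshoots."""
    clamped = max(-1.0, min(1.0, value))
    scaled = round(clamped * 32768.0)
    # +1.0 would scale to 32768, which does not fit into 16 bits, so clamp again.
    return max(-32768, min(32767, scaled))
```

With this mapping, an integer sample of 0 is silence, and the largest integers land at or just below the extremes of the float range.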

The Maths

Now we’ve established what sound is. But what is waveshaping distortion? You can think of it as a function that gets applied to the sampled values. Let’s take an example function that does not actually do any shaping, y=x:

This is quite possibly the dullest shaping function. It takes x as an input, and returns it. However, this is in theory what waveshaping does. It takes the input samples from -1.0 to 1.0, puts them into a mathematical function, and uses the output as the new samples. Let’s take another example, y=sign(x):

This takes an input sample and outputs one of the extreme values. You can emulate this effect by turning up the gain of a microphone, shoving the mic in your mouth, and screaming as loud as possible. It’s not really a nice effect. Finally, let’s take a look at a useful function, y=sqrt(x) where x >= 0, y=-sqrt(-x) where x < 0:

Finally, we get a function that does something but isn’t too extreme. This will create a sound that’s more pronounced because the quieter samples get amplified. Or, in other words, values get mapped further away from zero. The neat part is that the waveshaper function can be pretty much anything. It can be a simple square root curve like here. It also can be a quartic equation combined with all of the trigonometric functions (assuming your processor can calculate it fast enough). Maybe it doesn’t sound good, but it’s possible.
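The three example functions are short enough to write out in code. This is only a Python sketch of the maths above, not the code of the actual plug-in, and the names identity, hard_sign, signed_sqrt and waveshape are made up for this example.

```python
import math


def identity(x: float) -> float:
    """y = x: returns the sample unchanged."""
    return x


def hard_sign(x: float) -> float:
    """y = sign(x): pushes every non-zero sample to an extreme value."""
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0


def signed_sqrt(x: float) -> float:
    """y = sqrt(x) for x >= 0 and y = -sqrt(-x) for x < 0."""
    return math.copysign(math.sqrt(abs(x)), x)


def waveshape(samples, shaper):
    """Apply a shaping function to every sample and return the new samples."""
    return [shaper(x) for x in samples]
```

For example, waveshape([-0.25, 0.25], signed_sqrt) maps the quiet samples to -0.5 and 0.5, which is the "further away from zero" behaviour described above.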

But Why Bother?

It’s always a good idea to think why something is done. Why would I want to use my precious processor time to calculate maths when I could be playing DOOM instead? As an engineer, I’m not 100% sure. I think it has something to do with psychoacoustics, which is a field of science about which I know nothing, and to be honest it sounds a bit made up. From a music producer’s point of view, I can say that distortion effects make the sound have more character, warmth, and loudness (and other vague adjectives which don’t mean anything), so it’s a good thing.

Implementing the Distortion

I have talked about JUCE earlier in this blog, but I think that was so long ago that it’s forgotten. So I’ll briefly summarize it again. It’s a framework for creating audio software. It handles input and output routing, VST interfacing, user interface, and all that other boring stuff so that we can focus on what we actually want to do: making the computer go bleep-bloop.

The actual method of audio signal processing may vary between different types of projects. For a VST audio effect like this, there usually is a processBlock function that receives an input buffer periodically. It is then your duty as a plug-in developer to do whatever you want with that input buffer and fill it with values that you deem correct. Doing all this in a reasonable amount of CPU time, of course.

In this Pastel Distortion plug-in, we receive an input buffer filled with values ranging from -1.0 to 1.0, and then we feed those samples to the waveshaping function and replace the buffer contents with the newly calculated values. Sounds simple, and to be honest, that’s exactly what it is because JUCE does most of the heavy work.
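The idea of replacing the buffer contents can be sketched in a few lines. Note that this is Python and not the C++ used with JUCE, so it only illustrates the in-place principle. The process_block function here is a made-up stand-in and not the JUCE processBlock signature, and tanh is just an assumed example curve, not necessarily one that Pastel Distortion uses.

```python
import math


def process_block(buffer, shaper):
    """Overwrite every sample in the buffer with its shaped value, in place."""
    for i, sample in enumerate(buffer):
        buffer[i] = shaper(sample)


# One block of input samples in the range [-1.0, 1.0].
block = [0.0, 0.25, -0.25, 1.0]
# After this call, the same list holds the distorted samples.
process_block(block, math.tanh)
```

The host would call something like this over and over with fresh blocks, which is why the function works on the buffer it was given instead of returning a new one.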

JUCE has a ProcessorChain template class that can be filled with various effects to process the audio. There’s a WaveShaper processor, to which you simply give the mathematical function you want it to perform, and the rest is done almost automatically! As you can guess, the plug-in uses that. In the plug-in there are also some filters, EQs, and compressors to tame the distorted signal a bit more because the distortion can start to sound really ugly really quickly. That doesn’t mean that you can’t create ugly sounds with Pastel Distortion, quite the contrary.

The life of a designer is a life of fight: fight against the ugliness

Another great feature of JUCE is that it has a graphics library built-in. It’s especially good in a sense that an embedded developer like me can create a somewhat professional-looking user interface, even though I usually program small devices where the only human-computer interaction methods are a power switch and a two-colour LED. Although I have to admit, most of the development time went into making the user interface. You wouldn’t believe the amount of hours that went into drawing these little swirls next to the knobs.

Honestly, it was pure luck that I managed to get these things looking even remotely correct. The best part is that in the end they’re barely even visible.

All in all, Pastel Distortion is a completed plug-in that I think is quite polished (at least considering the usual standards for my projects). There’s the distortion effect of course, but in addition to that there’s tone control to shape the distortion and output signal, a dry-wet mixer for blending the distorted and clean signal, and multiple waveshape functions to choose from. Besides the GUI, I also spent quite a lot of time tweaking the distortion parameters, so hopefully that effort can be heard in the final product.
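A dry-wet mixer is conceptually a weighted sum of the clean and the distorted signal. Below is a Python sketch that assumes a plain linear blend. The real plug-in may blend differently, and the name dry_wet_mix and its parameters are made up for this example.

```python
def dry_wet_mix(dry, wet, mix):
    """Blend clean (dry) and distorted (wet) samples.

    mix = 0.0 gives only the dry signal, mix = 1.0 only the wet signal.
    """
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must be between 0.0 and 1.0")
    return [(1.0 - mix) * d + mix * w for d, w in zip(dry, wet)]
```

With mix at 0.5, a dry sample of 0.5 and a wet sample of 1.0 end up as 0.75, halfway between the two.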

There’s still optimization that could be done, but the performance is in fairly good shape already. At least compared to the FL Studio stock distortion plug-in Distructor it seems to have about the same CPU usage. Distructor averages at around 8%, while Pastel Distortion averages at 9%. Considering the fact that my previous delay plug-in used about 20%, I consider this a great success. This good number is most likely a result of the optimizations in JUCE and not because of my programming genius.

But enough talk, let’s get to the interesting stuff. How to try this thing out?

Getting Pastel Distortion

Obtaining the Pastel Distortion plug-in is quite easy. Just click this link to go to the Gumroad page where you can get it. And if you’re quick, you can get it for free! The plug-in costs $0 until the end of February 2024. After that you can get the demo version for free to try it out, or if you ask me I can generate some sort of a discount code for it (I’d like to get feedback on the product in exchange for the discount).

If you don’t want to download Pastel Distortion but want to see it in action, check out the video below. I put all the skills I’ve learned from Windows Movie Maker and years of using Ableton into this one:

That’s all this time. I’ve already started working on the next plug-in, let’s hope that it won’t take another two and a half years. Maybe the next text will be out sooner than that when I get something else ready that’s worth writing about. I’ve been building a Raspberry Pi Pico-based gadget lately, and it got a bit out of hand, but maybe I’ll finish that soon.

Fixing Stability Issue In The Blog Server

The biggest fans of this blog (or just the people usually browsing between 6:00-7:00 UTC) may have noticed a frustrating issue where the site occasionally loads really slowly. Or in the worst-case scenario, refuses to load at all. Only an error page containing a message about a failing database connection gets returned.

Well, at least the message is short and to the point.

Investigating Issue

This issue started to occur sporadically in August and became consistent in October. And I started to consider fixing it in November. This kind of relaxed response time is common for hobby projects. The first obvious step to fix the issue was to check what was going on in the server when load times got longer. Once I noticed that the site was slowing down, I checked the monitoring stats. From the graphs, I saw that both CPU usage and disk reads were spiking. CPU was peaking at 80%, and disk reads were over 100MB/s for over 15 minutes. From the 7-day monitoring graph, it could be seen that this kind of spiking was happening almost daily.

Not every day though, and some spikes are taller than others.

Investigating the system log and comparing it with the time stamps of the peaks revealed the following cycle:

  1. One of the two daily apt package manager upgrade services gets started
  2. The CPU and disk activity starts ramping up
  3. The system starts heavy swapping and the website load times get longer
  4. About 15-20 minutes after the apt service starts, the OOM (out-of-memory) killer kicks in and stops MySQL. A few other services may time out or get killed in this phase as well.
  5. MySQL restarts and the blog works again

I started investigating why the daily apt services seemed to constantly cause the server to run out of memory. The first of the two services downloaded the packages for upgrading, and the second one installed the downloaded upgrades. After trying out a few different things I realized that just installing or removing a package caused the server to randomly run out of memory if either of the apt services was started a few minutes earlier. It’s fun to do tests like this on a live server.

Fix Attempt 1: Installing System Upgrades

Some further investigation into the daily apt services revealed that the unattended upgrades had been failing for a long time. It seemed like the MySQL apt repository was missing keys, causing the apt update to fail. Also, it seemed like some upgrades required input from the user to configure packages. So I took a server backup and started installing the upgrades manually.

This isn’t foreshadowing. At least yet. Let’s see in three months.

Out of 141 packages, 127 wanted an update, which is “quite many” (to put it lightly). Fortunately, I have made no promises about the availability of this site, so I could liberally reboot the server as much as I needed for the upgrades. I was hoping that installing these pending upgrades would clear some cache that would reduce the RAM usage of the apt services. And in the worst-case scenario, it wouldn’t fix the issue but I would get an up-to-date server, so upgrading seemed like a win-win.

In addition to the upgrades I also installed an improved DigitalOcean monitoring service. This actually revealed something that should have been quite obvious from the beginning. The new monitoring service monitored RAM usage (the old one did not), and I could see that the server was using 90% of the RAM when it was idle. In hindsight, checking the RAM usage and monitoring how it gets consumed should have been the very first step when investigating an OOM issue.
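If there is no monitoring service at hand, the same number can be read straight from Linux. The Python sketch below parses the /proc/meminfo format and calculates how much of the RAM is in use, based on the MemTotal and MemAvailable fields. The function names and the sample numbers are made up for this example. The sample is only picked so that it lands on the same 90% figure, it is not data from my server.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo style text into a dict of integer values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # Not a "Key: value" line
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def used_percent(meminfo):
    """Share of RAM in use, counting MemAvailable as free for use."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return 100.0 * (total - available) / total


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the real file. This works only on Linux."""
    with open(path) as meminfo_file:
        return parse_meminfo(meminfo_file.read())


# Made-up sample values for illustration.
SAMPLE = """MemTotal:        1000000 kB
MemFree:           50000 kB
MemAvailable:     100000 kB
"""
```

On the server itself, used_percent(read_meminfo()) gives the current figure, and running it before and after starting a service shows how much that service eats.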

Needless to say, 90% RAM usage is not good. I guess this happens because I’m running this blog on a low-end instance that doesn’t have much RAM (I actually checked the minimum requirements of the OS, and the instance barely meets even those). However, before investigating the insufficient RAM, I wanted to first see if the upgrades would fix the original OOM issue. They did not.

So, the problem started to seem like a case of insufficient RAM. To fix this kind of issue, there are usually two options: scale the server up or scale the services down. In other words, throw money at the problem, or try to optimize the server. Being a cheapskate I chose the latter option. Also, I usually work with embedded things, so “just adding more RAM” feels like cheating. Also, considering the fact that on average I have about 20 daily visitors, beefing up the server seems like the wrong direction.

Fix Attempt 2: Optimizing RAM Usage

I used top to check the biggest memory consumers, and found two RAM gluttons: MySQL and Apache. Both are required for the well-being and existence of WordPress (that is the platform of this blog), but perhaps they could be optimized. At least they used to work on the server before, so perhaps they could be configured to work once again.

In the case of MySQL, there was a single mysqld daemon that was consuming plenty of RAM. Some googling revealed that disabling performance schema could help lower memory consumption. It seems to be a feature that measures the performance of the MySQL database server. Considering the fact that I’m using WordPress and I hope to write zero direct queries to the database, that seemed unnecessary. Perhaps when developing new software using MySQL such stats could be useful. Disabling performance schema lowered the mysqld RAM consumption from 39% to 19%.

In the case of Apache, there were ten worker threads, each consuming about 5%-8% of RAM. If my math is correct, in the last month I had about 0.00083 concurrent visitors on average. With that in mind, ten worker threads felt a bit excessive, and I scaled their number down. I think it could be lowered even more, but I wanted to have enough workers in case there’s a sudden influx of readers.

Aaaany day now.

Conclusion

These actions took the idle RAM usage from 90% down to 60%. After this drop, I haven’t seen the OOM killer get activated in the past seven days, so I hope the issue is fixed. 60% is still a bit more than I’d like, but as long as the server stays stable and the performance doesn’t notably degrade I think that’s an acceptable percentage. Also, using the cheaper virtual machine saves me $6 a month!

The root cause of the increased RAM usage is still a bit of a mystery. I suspect that installing WordPress plugins caused it because I was installing SEO plugins around the time the issue became more prevalent. If there’s one thing I’ve learnt from this, it’s that updates should be checked manually every now and then, and consumption of the system resources should be constantly monitored.