Implementation of Xen

Separation of Responsibilities

The simplest way to envision a machine running Xen is to imagine 3 layers:

1. Guest domains (VMs) at the top
2. The Xen hypervisor in the middle
3. The physical hardware at the bottom

This is roughly accurate but can lead to misconceptions about where responsibilities lie. The important thing to understand is that Xen is CPU-centric: it maintains control over the system by mediating access to the CPU. This is sensible because the CPU is what enables everything else to function. A device driver knows how to talk to a piece of hardware, but the way it literally talks to the hardware is by executing code on the CPU. So by controlling the CPU, Xen controls the whole machine.

The CPU is also closely coupled with physical memory. It contains a Memory Management Unit (MMU) which translates virtual addresses to physical ones. This is particularly relevant for managing peripherals, because sometimes a virtual address maps to a peripheral instead of RAM. So access to peripherals can be managed by controlling access to certain memory addresses, which the hypervisor needs to do anyway. If a guest could modify arbitrary physical addresses, it would be trivial to achieve code execution on adjacent guests. This is the same principle that requires isolating the memory of different processes from each other, even outside of a virtualization context.
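The isolation argument above can be sketched as a toy model (this is illustrative Python, not Xen code; all names are invented): the hypervisor tracks which domain owns each physical frame and rejects any page-table update that maps a frame belonging to another domain.

```python
# Illustrative model: a hypervisor validates guest page-table updates so
# that no guest can map a physical frame it does not own.

class Hypervisor:
    def __init__(self):
        self.owner = {}    # physical frame -> owning domain id
        self.tables = {}   # domain id -> {virtual page: physical frame}

    def create_domain(self, domid, frames):
        for f in frames:
            assert f not in self.owner, "frame already owned"
            self.owner[f] = domid
        self.tables[domid] = {}

    def map_page(self, domid, vpage, frame):
        # The hypervisor mediates every mapping; this check is what keeps
        # one guest from reaching into another guest's memory.
        if self.owner.get(frame) != domid:
            raise PermissionError(f"dom{domid} does not own frame {frame}")
        self.tables[domid][vpage] = frame

hv = Hypervisor()
hv.create_domain(0, frames={0, 1})
hv.create_domain(1, frames={2, 3})
hv.map_page(0, vpage=0x10, frame=1)       # fine: dom0 owns frame 1
try:
    hv.map_page(1, vpage=0x10, frame=1)   # rejected: frame 1 belongs to dom0
except PermissionError as e:
    print(e)
```

The same check is how MMIO access control falls out "for free": a peripheral's addresses are just frames that most domains do not own.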

Certain domains have permission to talk to certain pieces of hardware. Traditionally, there is one guest (dom0) which is the only domain with access to hardware. This is the only guest that Xen launches autonomously; others are requested by dom0. It is possible for dom0 to grant access to a specific piece of hardware to another guest.

The dom0less project adds additional functionality. In particular, it allows the administrator to configure Xen to launch multiple domains at boot. dom0 can still exist to enable control tasks (for example, dynamically creating new VMs). dom0 is also required for paravirtualized devices.

The hyperlaunch project enables configurations that distribute dom0's privileges over multiple domains. For example, on my QubesOS installation lspci shows the USB controller connected to dom0 even though sys-usb implements the USB driver. (IIUC, this does NOT open dom0 to USB-based attacks because dom0 chooses not to interact with this PCI device; but it technically has the capability to.) My assumption is that once hyperlaunch is complete and integrated with QubesOS, sys-usb will be the only guest which is technically capable of communicating with the USB controller. It is not clear to me whether this would protect against a malicious USB device which is plugged into the system during boot, but defending against that attack is fairly trivial (don't plug in untrusted devices before booting...).

The dom0 design offers advantages over solutions that implement equivalent functionality in the hypervisor itself. For example, contrast Xen with KVM, which relies on the Linux kernel it is running in to provide device drivers, etc. Hyperlaunch is a natural progression of this design.

Example: Split Device Drivers

This example assumes a traditional Xen system with a global dom0.

The split device driver model has a front-end for guests seeking to use a hardware device and a back-end for guests that implement the "real" driver (most commonly the back-end is in dom0). The communication happens through a ring buffer, which exists literally as a shared page of memory. The procedures which access the ring buffer are defined in the Xen repository, and guests call those procedures to make the driver work. The "implementation" of this mechanism could be seen as residing only in Xen; in Xen and dom0; or in all 3 layers. The core of the process is the shared memory page, which is provided by Xen without any help from dom0. However, the hardware device is not usable without the driver, and the code implementing the back-end of the API runs in dom0 (and is provided by Xen). And the driver exists to be used, so the code that accesses the API runs in the guest (and is also provided by Xen). Which way of thinking about it is most useful depends on the context.
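The ring-buffer handoff can be sketched as a single-producer/single-consumer ring over a fixed-size buffer standing in for the shared page. This is a simplified model in Python; the real interface lives in Xen's public ring headers, and all names here are invented for the sketch.

```python
# Toy split-driver ring: the front-end produces requests, the back-end
# consumes them. Indices only ever increase; slots are addressed modulo
# the ring size, which must be a power of two in the real scheme.

RING_SIZE = 4

class Ring:
    def __init__(self):
        self.slots = [None] * RING_SIZE   # stands in for the shared page
        self.prod = 0                      # advanced by the front-end
        self.cons = 0                      # advanced by the back-end

    def put_request(self, req):            # front-end side
        if self.prod - self.cons == RING_SIZE:
            raise BufferError("ring full")
        self.slots[self.prod % RING_SIZE] = req
        self.prod += 1                     # publish after the slot is written

    def get_request(self):                 # back-end side
        if self.cons == self.prod:
            return None                    # nothing pending
        req = self.slots[self.cons % RING_SIZE]
        self.cons += 1
        return req

ring = Ring()
ring.put_request({"op": "read", "sector": 42})
print(ring.get_request())  # → {'op': 'read', 'sector': 42}
```

In real Xen the two sides run in different domains and notify each other via event channels rather than polling; the shared indices and slots are the part this sketch models.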

Hardware-Assisted Virtualization

Note: this is based on Intel's virtualization technology; I believe that AMD's works similarly but have not looked at it too closely.

Modern processors directly support virtualization with the concept of root and non-root execution. Note that this is separate from the concept of privilege rings. Privilege rings prevent most harmful actions, but there are some actions that are considered harmful in a virtualization context which rings do not protect against (on x86_64). (Some of these are concerned with the ability for a VM to detect that it is in a virtualized environment. It is true that it can be helpful for guests to know that they are guests, but it is useful for the hardware to support hiding this for certain tasks such as malware analysis.) Xen tells the processor about new guests and when they should be started/paused/restarted/stopped. The processor tells Xen when it needs to handle harmful instructions. The processor manages VM bookkeeping.
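The trap-and-handle relationship between processor and hypervisor can be modeled as a dispatch loop (a toy model, not real VT-x: the instruction names and the sensitive-instruction set are invented, and real exit reasons are hardware-defined).

```python
# Toy model of root/non-root execution: the guest runs directly until it
# hits an instruction the hardware treats as sensitive, at which point
# control "exits" to the hypervisor, which handles it and resumes the guest.

SENSITIVE = {"cpuid", "wrmsr", "io"}   # causes a VM exit in this model

def run_guest(instructions):
    log = []
    for insn in instructions:
        if insn in SENSITIVE:
            # Root mode: the hypervisor emulates or filters the instruction,
            # e.g. hiding virtualization-revealing CPUID bits.
            log.append(f"vmexit: hypervisor emulates {insn}")
        else:
            # Non-root mode: ordinary instructions run at native speed.
            log.append(f"direct: {insn}")
    return log

for line in run_guest(["add", "cpuid", "mov", "io"]):
    print(line)
```

The key point the model captures is that most instructions never involve the hypervisor at all; only the sensitive ones transfer control, which is why hardware-assisted guests run close to native speed.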

Build System

From the README:

make install and make dist differ in that make install does the right things for your local machine (installing the appropriate version of udev scripts, for example), but make dist includes all versions of those scripts, so that you can copy the dist directory to another machine and install from that distribution.

Does running make install make sense for the Guix package? Would it make more sense to set DISTDIR to the output directory? Do I want Xen to handle udev scripts, etc or package these separately in Guix, where the target <operating-system> can be examined (rather than just looking at the host environment)?

--with-linux-backend-modules specifically refers to dom0.

From INSTALL:

Some components of xen and tools will include an unpredictable timestamp into the binaries. To allow reproducible builds the following variables can be used to provide fixed timestamps in the expected format.
XEN_BUILD_DATE=<output of date(1)>
XEN_BUILD_TIME=hh:mm:ss
SMBIOS_REL_DATE=mm/dd/yyyy
VGABIOS_REL_DATE="dd Mon yyyy"

Guix sets these to epoch:

#~(list "XEN_BUILD_DATE=Thu Jan  1 01:00:01 CET 1970"
        "XEN_BUILD_TIME=01:00:01"
        "XEN_BUILD_HOST="
        "ETHERBOOT_NICS="
        "SMBIOS_REL_DATE=01/01/1970"
        "VGABIOS_REL_DATE=01 Jan 1970"
        ;; QEMU_TRADITIONAL_LOC
        ;; QEMU_UPSTREAM_LOC
        "SYSCONFIG_DIR=/tmp/etc/default"
        (string-append "BASH_COMPLETION_DIR=" #$output
                       "/etc/bash_completion.d")
        (string-append "BOOT_DIR=" #$output "/boot")
        (string-append "DEBUG_DIR=" #$output "/lib/debug")
        (string-append "EFI_DIR=" #$output "/lib/efi")
        "MINIOS_UPSTREAM_URL=")

Tools

xenstore

The xenstore is kind of file-system-ish in the sense that it stores data using keys that look like paths (for example, /local/domain/1/name), the leaves of those paths contain data and no subpaths (as they are leaves), and the branches of the paths contain subpaths but conventionally not data. (It is technically possible to store data in non-leaf nodes, and there might be projects that take advantage of this - but there might not. In either case, this is probably a bad practice because xenstore is "file-system-y" enough that this will be counter-intuitive for most people.) The CLI tools used to interact with xenstore are reminiscent of tools that interact with the filesystem, but are not exactly the same. All tools have a xenstore- prefix, and can be accessed as a subcommand of the root command xenstore.
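The path-keyed layout can be modeled with a flat dictionary (a sketch only: the real xenstore also has permissions, transactions, and watches, and the helper names here are invented, loosely mirroring xenstore-write and xenstore-ls).

```python
# Toy model of xenstore: keys look like filesystem paths, leaves hold
# data, and listing a branch shows its immediate children.

store = {}

def write(path, value):
    store[path] = value

def ls(path):
    # List the immediate children of a path, like `xenstore-ls` at one level.
    prefix = path.rstrip("/") + "/"
    return sorted({p[len(prefix):].split("/")[0]
                   for p in store if p.startswith(prefix)})

write("/local/domain/1/name", "sys-usb")
write("/local/domain/1/memory/target", "4096")
print(ls("/local/domain/1"))   # → ['memory', 'name']
```

Note that the branch /local/domain/1 holds no value of its own here, matching the convention that non-leaf nodes conventionally carry no data.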

Services

QubesOS

QubesOS contains at least the following Xen-related services, on dom0 and/or domU (with a Fedora template). There are also some libvirt services in dom0, but they are all disabled (and not present in domU).

Guix

Guix installs the init.d scripts because it does not use systemd.

Comparison

The files that mention dom0 are missing in Guix; my impression is that the Guix package is more domU-focused than dom0-focused (for this and other reasons).

It is not clear where domain launching is handled in the systemd version; there is no "xendomains" service in QubesOS. However, 2 of the service files do mention conflicting with xendomains, and xen-watchdog is "After" it. This might be intentionally not installed in the QubesOS package because they have their own mechanism to handle it? (Pure speculation.)

Conclusions

| Description | dom0 | domU | systemd | init.d | Guix |
|---|---|---|---|---|---|
| Mount /proc/xen | Needed | Ambiguous but anecdotally necessary | proc-xen.mount | shell scripts mount as needed | file-systems declaration |
| Virtual terminal management | Needed | Nonsensical | xenconsoled.service | xencommons | TBD |
| Serve Xen store | Needed | Nonsensical | xenstored.service & xen-init-dom0.service | xencommons | TBD |
| Launch startup domains automatically | Optional | Nonsensical | Unclear | xendomains | TBD |
| Watchdog | Optional | Nonsensical | xen-watchdog.service | xen-watchdog | TBD |
| React to hardware availability | Optional | Optional | xendriverdomain.service | xendriverdomain | TBD |
| Qemu disks (unclear purpose) | Probably Optional | Probably Optional | Unclear | xen-qemu-dom0-disk-backend | TBD |

Differences between the kinds of domains need to be handled at the fragment level, at least for services, because packages aren't responsible for service management on Guix. It might make sense to have separate outputs depending on how the dependency tree looks.

Linux Kernel Configuration

From source tree text search

There are several directories dedicated to Xen code:

There are also non-dedicated directories which have relevant configuration options. This list is based on one simple text search so might not be comprehensive:

From menuconfig search

From exploring make menuconfig (commit 05d3ef8bba77c1b5f98d941d8b2d4aeab8118ef1), these configuration options are relevant (emphasis on QubesOS AppVM in PVH mode):

In-use Configurations

Searching /proc/config.gz for the following symbols:

PARAVIRT
PARAVIRT_SPINLOCKS
XEN_PV
XEN_PVHVM_GUEST
XEN_PVH
XEN_PV_MSR_SAFE
XEN_PCIDEV_FRONTEND
XEN_BALLOON
XEN_SCRUB_PAGES_DEFAULT
XEN_DEV_EVTCHN
XENFS
XEN_COMPAT_XENFS
XEN_SYS_HYPERVISOR
XEN_GNTDEV
XEN_GRANT_DEV_ALLOC
XEN_GRANT_DMA_ALLOC
XEN_PVCALLS_FRONTEND
XEN_VIRTIO
CONFIG_DRM_XEN_FRONTEND
XEN_FBDEV_FRONTEND
HVC_XEN
HVC_XEN_FRONTEND
TCG_XEN
USB_XEN_HCD
INPUT_XEN_KBDDEV_FRONTEND
SND_XEN_FRONTEND
XEN_NETDEV_FRONTEND
XEN_SCSI_FRONTEND
XEN_PRIVCMD

QubesOS Fedora guest. IIUC, QubesOS builds one kernel that is suitable for any kind of guest (AppVM, NetVM, etc) so options are probably liberal:

# CONFIG_DRM_XEN_FRONTEND is not set
CONFIG_HVC_XEN_FRONTEND=y
CONFIG_HVC_XEN=y
CONFIG_INPUT_XEN_KBDDEV_FRONTEND=m
CONFIG_PARAVIRT_CLOCK=y
# CONFIG_PARAVIRT_DEBUG is not set
CONFIG_PARAVIRT_SPINLOCKS=y
CONFIG_PARAVIRT_TIME_ACCOUNTING=y
CONFIG_PARAVIRT_XXL=y
CONFIG_PARAVIRT=y
CONFIG_SND_XEN_FRONTEND=m
# CONFIG_TCG_XEN is not set
CONFIG_USB_XEN_HCD=m
CONFIG_XEN_BALLOON_MEMORY_HOTPLUG=y
CONFIG_XEN_BALLOON=y
CONFIG_XEN_COMPAT_XENFS=y
CONFIG_XEN_DEV_EVTCHN=m
CONFIG_XEN_FBDEV_FRONTEND=y
CONFIG_XENFS=m
CONFIG_XEN_GNTDEV_DMABUF=y
CONFIG_XEN_GNTDEV=m
CONFIG_XEN_GRANT_DEV_ALLOC=m
CONFIG_XEN_GRANT_DMA_ALLOC=y
CONFIG_XEN_NETDEV_FRONTEND=m
CONFIG_XEN_PCIDEV_FRONTEND=m
# CONFIG_XEN_PVCALLS_BACKEND is not set
# CONFIG_XEN_PVCALLS_FRONTEND is not set
CONFIG_XEN_PV_DOM0=y
CONFIG_XEN_PVHVM_GUEST=y
CONFIG_XEN_PVHVM_SMP=y
CONFIG_XEN_PVHVM=y
CONFIG_XEN_PVH=y
CONFIG_XEN_PV_MSR_SAFE=y
CONFIG_XEN_PV_SMP=y
CONFIG_XEN_PV=y
CONFIG_XEN_SCRUB_PAGES_DEFAULT=y
CONFIG_XEN_SCSI_FRONTEND=m
CONFIG_XEN_SYS_HYPERVISOR=y
# CONFIG_XEN_VIRTIO is not set

Guix configuration, default kernel. I have no idea how much effort has been put into making Guix work effectively as a Xen guest:

CONFIG_DRM_XEN_FRONTEND=m
CONFIG_HVC_XEN_FRONTEND=y
CONFIG_HVC_XEN=y
CONFIG_INPUT_XEN_KBDDEV_FRONTEND=m
CONFIG_PARAVIRT_CLOCK=y
# CONFIG_PARAVIRT_DEBUG is not set
CONFIG_PARAVIRT_SPINLOCKS=y
# CONFIG_PARAVIRT_TIME_ACCOUNTING is not set
CONFIG_PARAVIRT_XXL=y
CONFIG_PARAVIRT=y
CONFIG_SND_XEN_FRONTEND=m
CONFIG_TCG_XEN=m
CONFIG_USB_XEN_HCD=m
CONFIG_XEN_BALLOON_MEMORY_HOTPLUG=y
CONFIG_XEN_BALLOON=y
CONFIG_XEN_COMPAT_XENFS=y
CONFIG_XEN_DEV_EVTCHN=m
CONFIG_XEN_FBDEV_FRONTEND=m
CONFIG_XENFS=m
CONFIG_XEN_GNTDEV=m
CONFIG_XEN_GRANT_DEV_ALLOC=m
# CONFIG_XEN_GRANT_DMA_ALLOC is not set
CONFIG_XEN_NETDEV_FRONTEND=y
CONFIG_XEN_PCIDEV_FRONTEND=m
# CONFIG_XEN_PVCALLS_BACKEND is not set
CONFIG_XEN_PVCALLS_FRONTEND=m
CONFIG_XEN_PV_DOM0=y
CONFIG_XEN_PVHVM_GUEST=y
CONFIG_XEN_PVHVM_SMP=y
CONFIG_XEN_PVHVM=y
CONFIG_XEN_PVH=y
CONFIG_XEN_PV_MSR_SAFE=y
CONFIG_XEN_PV_SMP=y
CONFIG_XEN_PV=y
CONFIG_XEN_SCRUB_PAGES_DEFAULT=y
CONFIG_XEN_SCSI_FRONTEND=m
CONFIG_XEN_SYS_HYPERVISOR=y
# CONFIG_XEN_VIRTIO_FORCE_GRANT is not set
CONFIG_XEN_VIRTIO=y

Differences between the two (Qubes on the left, Guix on the right). The "(default ...)" annotations were retrieved manually: I searched menuconfig, made sure all dependencies were satisfied, and treated the setting shown before making any other changes as the "default" (unclear if this is appropriate, but it seems like a reasonable first impression).

1c1
< # CONFIG_DRM_XEN_FRONTEND is not set (default "n")
---
> CONFIG_DRM_XEN_FRONTEND=m
8c8
< CONFIG_PARAVIRT_TIME_ACCOUNTING=y
---
> # CONFIG_PARAVIRT_TIME_ACCOUNTING is not set (default "n")
12c12
< # CONFIG_TCG_XEN is not set (default "n")
---
> CONFIG_TCG_XEN=m
18c18
< CONFIG_XEN_FBDEV_FRONTEND=y
---
> CONFIG_XEN_FBDEV_FRONTEND=m
20d19
< CONFIG_XEN_GNTDEV_DMABUF=y
23,24c22,23
< CONFIG_XEN_GRANT_DMA_ALLOC=y
< CONFIG_XEN_NETDEV_FRONTEND=m
---
> # CONFIG_XEN_GRANT_DMA_ALLOC is not set (default "n")
> CONFIG_XEN_NETDEV_FRONTEND=y
27c26
< # CONFIG_XEN_PVCALLS_FRONTEND is not set (default "n")
---
> CONFIG_XEN_PVCALLS_FRONTEND=m
39c38,39
< # CONFIG_XEN_VIRTIO is not set (default "n")
---
> # CONFIG_XEN_VIRTIO_FORCE_GRANT is not set
> CONFIG_XEN_VIRTIO=y

PVH Booting

TODO: see docs/hypervisor-guide/x86/how-xen-boots.rst, docs/misc/pvh.pandoc, chapter 2 of Definitive Guide, hope the exercise is PVH, figure out if there is a formal spec for PVH booting.

Project Dependencies

QEMU

According to the Xen wiki, QEMU is only used for emulating specific device models. It is required for certain situations, but QEMU is not generally used to run VMs.

OVMF

This enables UEFI on VMs. See the wiki.

SEABIOS

This is a 16-bit BIOS. Not clear why this is needed. I think I remember reading something about early boot environments using legacy code that expects a 16-bit environment. I do not recall if this was VM-specific. This might just be for HVM machines.

References

Intel Vanderpool Technology for IA-32 Processors (VT-x) (University of Wisconsin - Madison, Intel)
An Introduction to IOMMU Infrastructure in the Linux Kernel (Adrian Huang, Lenovo)
Lecture 7: Memory Managements (Prof. Uyeda, University of California - San Diego)
Linux source tree
The Definitive Guide to the Xen Hypervisor by David Chisnall (2008)
QubesOS-distributed artifacts
Summary of the Gains of Xen oxenstored Over cxenstored
Xen and the Art of Virtualization (Barham et al, University of Cambridge)
Xen Project Wiki
Xen source tree

Download the markdown source and signature.