This walkthrough explains one (opinionated) way to test the Linux kernel client against a development cluster. We will try to minimize any assumptions about pre-existing knowledge of how to do kernel builds or any related best practices.
Note
There are many completely valid ways to do kernel development for Ceph. This guide is a walkthrough of the author's own environment. You may decide to do things very differently.
Clone the kernel:
git init linux && cd linux
git remote add torvalds git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
git remote add ceph https://github.com/ceph/ceph-client.git
git fetch && git checkout torvalds/master
Configure the kernel:
make defconfig
Note
You can alternatively use the Ceph Kernel QA Config for building the kernel.
We now have a kernel config with reasonable defaults for the architecture you're building on. The next thing to do is to enable configs which will build Ceph and/or provide functionality we need to do testing.
cat > ~/.ceph.config <<EOF
CONFIG_CEPH_FS=y
CONFIG_CEPH_FSCACHE=y
CONFIG_CEPH_FS_POSIX_ACL=y
CONFIG_CEPH_FS_SECURITY_LABEL=y
CONFIG_CEPH_LIB_PRETTYDEBUG=y
CONFIG_DYNAMIC_DEBUG=y
CONFIG_DYNAMIC_DEBUG_CORE=y
CONFIG_FRAME_POINTER=y
CONFIG_FSCACHE=y
CONFIG_FSCACHE_STATS=y
CONFIG_FS_ENCRYPTION=y
CONFIG_FS_ENCRYPTION_ALGS=y
CONFIG_KGDB=y
CONFIG_KGDB_SERIAL_CONSOLE=y
CONFIG_XFS_FS=y
EOF
Beyond enabling Ceph-related configs, we are also enabling some useful debug configs and XFS (as an alternative to ext4 if needed for our root file system).
Note
It is a good idea to not build anything as a kernel module. Otherwise, you would need to run make modules_install on the root drive of the VM.
Now, merge the configs.
scripts/kconfig/merge_config.sh .config ~/.ceph.config
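Note that merge_config.sh warns about symbols it could not set but still exits successfully, so it is worth spot-checking that the merged options actually landed in .config. A minimal sketch (checking a few of the options from ~/.ceph.config):

```shell
# spot-check that the merged options made it into .config;
# print any requested option that is missing
for opt in CONFIG_CEPH_FS=y CONFIG_FS_ENCRYPTION=y CONFIG_DYNAMIC_DEBUG=y; do
    grep -q "^$opt" .config || echo "missing: $opt"
done
```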
Finally, build the kernel:
make -j
Note
This document does not discuss how to get relevant utilities for your distribution to actually build the kernel, like gcc. Please use your search engine of choice to learn how to do that.
A virtual machine is a good choice for testing the kernel client for a few reasons:
- You can more easily monitor and configure networking for the VM.
- You can very rapidly test a change to the kernel (build -> mount in less than 10 seconds).
- A fault in the kernel won't crash your machine.
- You have a suite of tools available for analysis on the running kernel.
The main decision for you to make is what Linux distribution you want to use. This document uses Arch Linux due to the author's familiarity. We also use LVM to create a volume. You may use partitions or whatever mechanism you like to create a block device. In general, this block device will be used repeatedly in testing. You may want to use snapshots to avoid a VM somehow corrupting your root disk and forcing you to start over.
# create a volume
VOLUME_GROUP=foo
sudo lvcreate -L 256G "$VOLUME_GROUP" -n $(whoami)-vm-0
DEV="/dev/${VOLUME_GROUP}/$(whoami)-vm-0"
sudo mkfs.xfs "$DEV"
sudo mount "$DEV" /mnt
sudo pacstrap /mnt base base-devel vim less jq
sudo arch-chroot /mnt
# # delete root's password for ease of login
# passwd -d root
# mkdir -p /root/.ssh && echo "$YOUR_SSH_KEY_PUBKEY" >> /root/.ssh/authorized_keys
# exit
sudo umount /mnt
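As mentioned above, snapshots can protect against a VM corrupting your root disk and forcing you to start over. A sketch using LVM snapshots (volume names as assumed above; the snapshot size is an arbitrary choice, it only needs to hold the blocks changed while the VM runs):

```shell
# snapshot the freshly installed root volume before handing it to a VM
sudo lvcreate -s -L 16G -n $(whoami)-vm-0-pristine "/dev/${VOLUME_GROUP}/$(whoami)-vm-0"

# if the VM trashes the disk, roll back by merging the snapshot into
# the origin volume (the merge completes on its next activation)
sudo lvconvert --merge "/dev/${VOLUME_GROUP}/$(whoami)-vm-0-pristine"
```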
Once that's done, we should be able to run a VM:
qemu-system-x86_64 -enable-kvm -kernel $(pwd)/arch/x86/boot/bzImage -drive file="$DEV",if=virtio,format=raw -append 'root=/dev/vda rw'
You should see output like:
VNC server running on ::1:5900
You could view that console using:
vncviewer 127.0.0.1:5900
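If you prefer a console in your terminal over VNC, qemu can instead route the VM's console to the emulated serial port. A sketch of the same invocation with -nographic and a console=ttyS0 kernel parameter added:

```shell
# same VM, but with the console on stdio instead of VNC;
# console=ttyS0 sends kernel output to the emulated serial port
qemu-system-x86_64 -enable-kvm \
    -kernel $(pwd)/arch/x86/boot/bzImage \
    -drive file="$DEV",if=virtio,format=raw \
    -nographic \
    -append 'root=/dev/vda rw console=ttyS0'
```

Exit the -nographic console with Ctrl-a x.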
Congratulations, you have a VM running the kernel that you just built.
This is the "hard part" and requires the most customization depending on what you want to do. The author's current development setup looks like:
 sepian netns
 ______________
|              |
|  kernel VM   |                       sepia-bounce VM       vossi04.front.sepia.ceph.com
|   -------    |                           ------                  -------
|  |       |   | 192.168.20.1            |      |                |       |
|  |       |---|---- <- wireguard -> ----|      |<- sepia vpn -> |       |
|  |_______|   |            192.168.20.2 |______|                |_______|
|     br0      |
|______________|
The sepia-bounce VM is used as a bounce box to the sepia lab. It can proxy ssh connections, route any sepia-bound traffic, or serve as a DNS proxy. The use of a sepia-bounce VM is optional but can be useful, especially if you want to create numerous kernel VMs for testing.
I like to use the vossi04 developer playground to build Ceph and set up a vstart cluster. It has sufficient resources to make building Ceph very fast (~5 minutes for a cold build) and enough local disk to run a decent vstart cluster.
To avoid overcomplicating this document with the details of the sepia-bounce VM, I will note the following main configurations used for the purpose of testing the kernel:
- set up a wireguard tunnel between the machine creating kernel VMs and the sepia-bounce VM
- use systemd-resolved as a DNS resolver, listening on 192.168.20.2 (instead of just localhost)
- connect to the sepia VPN and use the systemd-resolved update script to configure systemd-resolved to use the DNS servers acquired via DHCP from the sepia VPN
- configure firewalld to allow wireguard traffic and to masquerade and forward traffic to the sepia VPN
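For reference, the wireguard configuration consumed by wg setconf on the sepia-bounce side might look roughly like the following. This is a hypothetical sketch: the keys, listen port, and peer names are placeholders, and only the 192.168.20.0/24 addressing matches the setup described here.

```ini
# /etc/wireguard/wg-sepian.conf on sepia-bounce (hypothetical sketch)
[Interface]
# sepia-bounce's end of the tunnel (192.168.20.2)
PrivateKey = <sepia-bounce-private-key>
ListenPort = 51820

[Peer]
# the machine hosting the kernel VMs (192.168.20.1)
PublicKey = <kernel-vm-host-public-key>
AllowedIPs = 192.168.20.1/32
```

Note that wg setconf, unlike wg-quick, does not accept Address= lines; addresses are assigned separately with ip addr, as the systemd unit below does.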
The next task is to connect the kernel VM to the sepia-bounce VM. A network namespace is useful here to isolate traffic and routing rules for the VMs. I orchestrate this using a custom systemd one-shot unit that looks like:
# create the net namespace
ExecStart=/usr/bin/ip netns add sepian
# bring lo up
ExecStart=/usr/bin/ip netns exec sepian ip link set dev lo up
# setup wireguard to sepia-bounce
ExecStart=/usr/bin/ip link add wg-sepian type wireguard
ExecStart=/usr/bin/wg setconf wg-sepian /etc/wireguard/wg-sepian.conf
# move the wireguard interface to the sepian netns
ExecStart=/usr/bin/ip link set wg-sepian netns sepian
# configure the static ip and bring it up
ExecStart=/usr/bin/ip netns exec sepian ip addr add 192.168.20.1/24 dev wg-sepian
ExecStart=/usr/bin/ip netns exec sepian ip link set wg-sepian up
# logging info
ExecStart=/usr/bin/ip netns exec sepian ip addr
ExecStart=/usr/bin/ip netns exec sepian ip route
# make wireguard the default route
ExecStart=/usr/bin/ip netns exec sepian ip route add default via 192.168.20.2 dev wg-sepian
# more logging
ExecStart=/usr/bin/ip netns exec sepian ip route
# add a bridge interface for VMs
ExecStart=/usr/bin/ip netns exec sepian ip link add name br0 type bridge
# configure the addresses and bring it up
ExecStart=/usr/bin/ip netns exec sepian ip addr add 192.168.0.1/24 dev br0
ExecStart=/usr/bin/ip netns exec sepian ip link set br0 up
# masquerade/forward traffic to sepia-bounce
ExecStart=/usr/bin/ip netns exec sepian iptables -t nat -A POSTROUTING -o wg-sepian -j MASQUERADE
When using the network namespace, we will run commands via ip netns exec. That command has a handy feature: files under /etc/netns/<name>/ are automatically bind mounted over their counterparts in /etc for commands run in the namespace:
# cat /etc/netns/sepian/resolv.conf
nameserver 192.168.20.2
That file will configure the libc name resolution stack to route DNS requests for applications to the systemd-resolved daemon running on sepia-bounce. Consequently, any application running in that netns will be able to resolve sepia hostnames:
$ sudo ip netns exec sepian host vossi04.front.sepia.ceph.com
vossi04.front.sepia.ceph.com has address 172.21.10.4
Okay, great. We have a network namespace that forwards traffic to the sepia VPN. The next mental step is to connect virtual machines running a kernel to the bridge we have configured. The straightforward way to do that is to create a "tap" device which connects to the bridge:
sudo ip netns exec sepian qemu-system-x86_64 \
-enable-kvm \
-kernel $(pwd)/arch/x86/boot/bzImage \
-drive file="$DEV",if=virtio,format=raw \
-netdev tap,id=net0,ifname=tap0,script="$HOME/bin/qemu-br0",downscript=no \
-device virtio-net-pci,netdev=net0 \
-append 'root=/dev/vda rw'
The new relevant bits here are: (a) executing the VM in the netns we have constructed; (b) a -netdev option to configure a tap device; (c) a virtual network card for the VM. There is also a script, $HOME/bin/qemu-br0, run by qemu to configure the tap device it creates for the VM:
#!/bin/bash
tap=$1
ip link set "$tap" master br0
ip link set dev "$tap" up
That simply plugs the new tap device into the bridge.
This is all well and good but we are now missing one last crucial step. What is the IP address of the VM? There are two options:
- configure a static IP, which requires modifying the network configuration on the VM's root device
- use DHCP, and configure the VMs' root device to always configure its ethernet device via DHCP

The second option is more complicated to set up, since you must now run a DHCP server, but it provides the greatest flexibility for adding more VMs as needed when testing.
The modified (or "hacked") standard dhcpd systemd service looks like:
# cat sepian-dhcpd.service
[Unit]
Description=IPv4 DHCP server
After=network.target network-online.target sepian-netns.service
Wants=network-online.target
Requires=sepian-netns.service

[Service]
ExecStartPre=/usr/bin/touch /tmp/dhcpd.leases
ExecStartPre=/usr/bin/cat /etc/netns/sepian/dhcpd.conf
ExecStart=/usr/bin/dhcpd -f -4 -q -cf /etc/netns/sepian/dhcpd.conf -lf /tmp/dhcpd.leases
NetworkNamespacePath=/var/run/netns/sepian
RuntimeDirectory=dhcpd4
User=dhcp
AmbientCapabilities=CAP_NET_BIND_SERVICE CAP_NET_RAW
ProtectSystem=full
ProtectHome=on
KillSignal=SIGINT
# We pull in network-online.target for a configured network connection.
# However this is not guaranteed to be the network connection our
# networks are configured for. So try to restart on failure with a delay
# of two seconds. Rate limiting kicks in after 12 seconds.
RestartSec=2s
Restart=on-failure
StartLimitInterval=12s

[Install]
WantedBy=multi-user.target
Similarly, the referenced dhcpd.conf:
# cat /etc/netns/sepian/dhcpd.conf
option domain-name-servers 192.168.20.2;
option subnet-mask 255.255.255.0;
option routers 192.168.0.1;
subnet 192.168.0.0 netmask 255.255.255.0 {
  range 192.168.0.100 192.168.0.199;
}
Importantly, this tells the VM to route traffic to 192.168.0.1 (the IP of the bridge in the netns) and that DNS can be provided by 192.168.20.2 (via systemd-resolved on the sepia-bounce VM).
In the VM, the networking looks like:
[root@archlinux ~]# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
3: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/sit 0.0.0.0 brd 0.0.0.0
[root@archlinux ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.100/24 metric 1024 brd 192.168.0.255 scope global dynamic enp0s3
       valid_lft 28435sec preferred_lft 28435sec
    inet6 fe80::5054:ff:fe12:3456/64 scope link proto kernel_ll
       valid_lft forever preferred_lft forever
3: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
    link/sit 0.0.0.0 brd 0.0.0.0
[root@archlinux ~]# systemd-resolve --status
Global
           Protocols: +LLMNR +mDNS -DNSOverTLS DNSSEC=no/unsupported
    resolv.conf mode: stub
Fallback DNS Servers: 1.1.1.1#cloudflare-dns.com 9.9.9.9#dns.quad9.net 8.8.8.8#dns.google
                      2606:4700:4700::1111#cloudflare-dns.com 2620:fe::9#dns.quad9.net
                      2001:4860:4860::8888#dns.google

Link 2 (enp0s3)
    Current Scopes: DNS LLMNR/IPv4 LLMNR/IPv6
         Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 192.168.20.2
       DNS Servers: 192.168.20.2

Link 3 (sit0)
    Current Scopes: none
         Protocols: -DefaultRoute +LLMNR +mDNS -DNSOverTLS DNSSEC=no/unsupported
Finally, some other networking configurations to consider:
- Run the VM on your machine with full access to the host networking stack. If you have the sepia vpn, this will probably work without too much configuration.
- Run the VM in a netns as above but also set up the sepia VPN in the same netns. This can help to avoid using a sepia-bounce VM. You'll still need to configure routing between the bridge and the sepia VPN.
- Run the VM in a netns as above but only use a local vstart cluster (possibly in another VM) in the same netns.
This guide uses a vstart cluster on a machine in the sepia lab. Because the mon addresses change with every new vstart cluster, any static mount configuration on the VM would quickly become invalid. So, we should create a script that fetches the configuration for our vstart cluster prior to mounting:
#!/bin/bash
# kmount.sh -- mount a vstart Ceph cluster on a remote machine
# the cephx client credential, vstart creates "client.fs" by default
NAME=fs
# static fs name, vstart creates an "a" file system by default
FS=a
# where to mount on the VM
MOUNTPOINT=/mnt
# cephfs mount point (root by default)
CEPHFS_MOUNTPOINT=/
function run {
printf '%s\n' "$*" >&2
"$@"
}
function mssh {
run ssh vossi04.front.sepia.ceph.com "cd ceph/build && (source vstart_environment.sh; $1)"
}
# create the minimum config (including mon addresses) and store it in the VM's ceph.conf. This is not used for mounting; we're storing it for potential use with `ceph` commands.
mssh "ceph config generate-minimal-conf" > /etc/ceph/ceph.conf
# get the vstart cluster's fsid
FSID=$(mssh "ceph fsid")
# get the auth key associated with client.fs
KEY=$(mssh "ceph auth get-key client.$NAME")
# dump the v2 mon addresses and format for the -o mon_addr mount option
MONS=$(mssh "ceph mon dump --format=json" | jq -r '.mons[] | .public_addrs.addrvec[] | select(.type == "v2") | .addr' | paste -s -d/)
# turn on kernel debugging (and any other debugging you'd like)
echo "module ceph +p" | tee /sys/kernel/debug/dynamic_debug/control
# do the mount! we use the new device syntax for this mount
run mount -t ceph "${NAME}@${FSID}.${FS}=${CEPHFS_MOUNTPOINT}" -o "mon_addr=${MONS},ms_mode=crc,name=${NAME},secret=${KEY},norequire_active_mds,noshare" "$MOUNTPOINT"
That would be run like:
$ sudo ip netns exec sepian ssh [email protected] ./kmount.sh
...
mount -t ceph [email protected]=/ -o mon_addr=172.21.10.4:40762/172.21.10.4:40764/172.21.10.4:40766,ms_mode=crc,name=fs,secret=AQD0jgln43pBCxAA7cJlZ4Px7J0UmiK4A4j3rA==,norequire_active_mds,noshare /mnt
$ sudo ip netns exec sepian ssh [email protected] df -h /mnt
Filesystem Size Used Avail Use% Mounted on
[email protected]=/ 169G 0 169G 0% /mnt
If you run into difficulties, it may be:
- The firewall on the node running the vstart cluster is blocking your connections.
- Some misconfiguration in your networking stack.
- An incorrect configuration for the mount.
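A quick way to rule out the first two causes is to check name resolution and TCP reachability of a mon from inside the netns. The address and port below are taken from the example output above; substitute the values reported for your cluster by ceph mon dump:

```shell
# check DNS resolution from the netns
sudo ip netns exec sepian host vossi04.front.sepia.ceph.com
# check that a mon's port is reachable (nc -z only probes, sends no data)
sudo ip netns exec sepian nc -zv 172.21.10.4 40762
```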
There are three static branches in the ceph kernel git repository managed by the Ceph team:
- for-linus: A branch managed by the primary Ceph maintainer to share changes with Linus Torvalds (upstream). Do not push to this branch.
- master: A staging ground for patches planned to be sent to Linus. Do not push to this branch.
- testing: A staging ground for miscellaneous patches that need wider QA testing (via nightlies or regular Ceph QA testing). Push patches you believe to be nearly ready for upstream acceptance.
You may also push a wip-$feature branch to the ceph-client.git repository, which will be built by Jenkins. You can then view the results of the build in Shaman.
Once a kernel branch is built, you can test it via the fs CephFS QA suite:
$ teuthology-suite ... --suite fs --kernel wip-$feature --filter k-testing
The k-testing filter selects the fragment which normally sets the testing branch of the kernel for routine QA. That is, the fs suite regularly runs tests against whatever is in the testing branch of the kernel. We are overriding that choice of kernel branch via the --kernel wip-$feature switch.
Note
Without filtering for k-testing, the fs suite will also run jobs using ceph-fuse or the stock kernel, libcephfs tests, and other tests that may not be of interest when evaluating changes to the kernel.
The actual override is controlled using Lua merge scripts in the k-testing.yaml fragment. See that file for more details.