Testing changes to the Linux Kernel CephFS driver

This walkthrough explains one (opinionated) way to test changes to the Linux kernel client against a development cluster. We will try to minimize any assumptions about pre-existing knowledge of kernel builds or related best practices.

Note

There are many completely valid ways to do kernel development for Ceph. This guide is a walkthrough of the author's own environment. You may decide to do things very differently.

Step One: build the kernel

Clone the kernel:

git init linux && cd linux
git remote add torvalds git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
git remote add ceph https://github.com/ceph/ceph-client.git
git fetch --all && git checkout torvalds/master

Configure the kernel:

make defconfig

Note

You can alternatively use the Ceph Kernel QA Config for building the kernel.

We now have a kernel config with reasonable defaults for the architecture you're building on. The next step is to enable the options that build the Ceph driver and/or provide functionality we need for testing.

cat > ~/.ceph.config <<EOF
CONFIG_CEPH_FS=y
CONFIG_CEPH_FSCACHE=y
CONFIG_CEPH_FS_POSIX_ACL=y
CONFIG_CEPH_FS_SECURITY_LABEL=y
CONFIG_CEPH_LIB_PRETTYDEBUG=y
CONFIG_DYNAMIC_DEBUG=y
CONFIG_DYNAMIC_DEBUG_CORE=y
CONFIG_FRAME_POINTER=y
CONFIG_FSCACHE=y
CONFIG_FSCACHE_STATS=y
CONFIG_FS_ENCRYPTION=y
CONFIG_FS_ENCRYPTION_ALGS=y
CONFIG_KGDB=y
CONFIG_KGDB_SERIAL_CONSOLE=y
CONFIG_XFS_FS=y
EOF

Beyond enabling Ceph-related configs, we are also enabling some useful debug configs and XFS (as an alternative to ext4 if needed for our root file system).

Note

It is a good idea not to build anything as a kernel module. Otherwise, you would need to run make modules_install against the root drive of the VM after every build.

Now, merge the configs.

scripts/kconfig/merge_config.sh .config ~/.ceph.config
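If you are curious what merge_config.sh is doing with the fragment, its core rule can be sketched in a few lines: fragments are applied in order and, for any symbol defined more than once, the last definition wins. The snippet below is a simplified illustration with throwaway temp files, not the real script:

```shell
# Sketch of merge_config.sh's core rule: the last definition of a
# CONFIG_ symbol wins. The two files here are throwaway stand-ins
# for .config and ~/.ceph.config.
base=$(mktemp); frag=$(mktemp)
printf 'CONFIG_CEPH_FS=n\nCONFIG_XFS_FS=y\n' > "$base"
printf 'CONFIG_CEPH_FS=y\n' > "$frag"
# keep only the final value seen for each symbol
awk -F= '/^CONFIG_/ {v[$1] = $0} END {for (k in v) print v[k]}' "$base" "$frag" | sort
rm -f "$base" "$frag"
# -> CONFIG_CEPH_FS=y
# -> CONFIG_XFS_FS=y
```

The real script additionally runs the kernel's config machinery afterwards, so it is worth checking its output for warnings about symbols that did not stick.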

Finally, build the kernel:

make -j$(nproc)

Note

This document does not discuss how to get relevant utilities for your distribution to actually build the kernel, like gcc. Please use your search engine of choice to learn how to do that.

Step Two: create a VM

A virtual machine is a good choice for testing the kernel client for a few reasons:

  • You can more easily monitor and configure networking for the VM.
  • You can very rapidly test a change to the kernel (build -> mount in less than 10 seconds).
  • A fault in the kernel won't crash your machine.
  • You have a suite of tools available for analysis on the running kernel.

The main decision for you to make is what Linux distribution you want to use. This document uses Arch Linux due to the author's familiarity. We also use LVM to create a volume. You may use partitions or whatever mechanism you like to create a block device. In general, this block device will be used repeatedly in testing. You may want to use snapshots to avoid a VM somehow corrupting your root disk and forcing you to start over.

# create a volume
VOLUME_GROUP=foo
sudo lvcreate -L 256G "$VOLUME_GROUP" -n $(whoami)-vm-0
DEV="/dev/${VOLUME_GROUP}/$(whoami)-vm-0"
sudo mkfs.xfs "$DEV"
sudo mount "$DEV" /mnt
sudo pacstrap /mnt base base-devel vim less jq
sudo arch-chroot /mnt
# # delete root's password for ease of login
# passwd -d root
# mkdir -p /root/.ssh && echo "$YOUR_SSH_KEY_PUBKEY" >> /root/.ssh/authorized_keys
# exit
sudo umount /mnt

Once that's done, we should be able to run a VM:

qemu-system-x86_64 -enable-kvm -kernel $(pwd)/arch/x86/boot/bzImage -drive file="$DEV",if=virtio,format=raw -append 'root=/dev/vda rw'

You should see output like:

VNC server running on ::1:5900

You could view that console using:

vncviewer 127.0.0.1:5900

Congratulations, you have a VM running the kernel that you just built.

Step Three: Networking the VM

This is the "hard part" and requires the most customization depending on what you want to do. The author's current development setup looks like:

  sepian netns
 ______________
|              |
| kernel VM    |              sepia-bounce VM      vossi04.front.sepia.ceph.com
|  -------  |  |                  ------                    -------
|  |     |  |  | 192.168.20.1     |    |                    |     |
|  |     |--|--|- <- wireguard -> |    |  <-- sepia vpn ->  |     |
|  |_____|  |  |     192.168.20.2 |____|                    |_____|
|          br0 |
|______________|

The sepia-bounce VM is used as a bounce box to the sepia lab. It can proxy ssh connections, route any sepia-bound traffic, or serve as a DNS proxy. The use of a sepia-bounce VM is optional but can be useful, especially if you want to create numerous kernel VMs for testing.

I like to use the vossi04 developer playground to build Ceph and set up a vstart cluster. It has sufficient resources to make building Ceph fast (~5 minutes for a cold build) and enough local disk to run a decent vstart cluster.

To avoid overcomplicating this document with the details of the sepia-bounce VM, I will note the following main configurations used for the purpose of testing the kernel:

  • set up a wireguard tunnel between the machine creating kernel VMs and the sepia-bounce VM
  • use systemd-resolved as a DNS resolver, listening on 192.168.20.2 (instead of just localhost)
  • connect to the sepia VPN and use the systemd-resolved update script to point systemd-resolved at the DNS servers acquired via DHCP from the sepia VPN
  • configure firewalld to allow wireguard traffic and to masquerade and forward traffic to the sepia VPN
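For reference, a minimal sketch of what the wireguard configuration on the kernel-VM host might look like (this is the /etc/wireguard/wg-sepian.conf consumed by wg setconf below; the keys and endpoint are placeholders, not real values):

# /etc/wireguard/wg-sepian.conf
[Interface]
PrivateKey = <kernel-vm-host-private-key>

[Peer]
PublicKey = <sepia-bounce-public-key>
Endpoint = <sepia-bounce-public-ip>:51820
AllowedIPs = 0.0.0.0/0

Note that wg setconf takes no interface address; the 192.168.20.1 address is assigned separately with ip addr add.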

The next task is to connect the kernel VM to the sepia-bounce VM. A network namespace is useful here to isolate traffic and routing rules for the VMs. I orchestrate this with a custom systemd one-shot unit that looks like:

# create the net namespace
ExecStart=/usr/bin/ip netns add sepian
# bring lo up
ExecStart=/usr/bin/ip netns exec sepian ip link set dev lo up
# setup wireguard to sepia-bounce
ExecStart=/usr/bin/ip link add wg-sepian type wireguard
ExecStart=/usr/bin/wg setconf wg-sepian /etc/wireguard/wg-sepian.conf
# move the wireguard interface to the sepian netns
ExecStart=/usr/bin/ip link set wg-sepian netns sepian
# configure the static ip and bring it up
ExecStart=/usr/bin/ip netns exec sepian ip addr add 192.168.20.1/24 dev wg-sepian
ExecStart=/usr/bin/ip netns exec sepian ip link set wg-sepian up
# logging info
ExecStart=/usr/bin/ip netns exec sepian ip addr
ExecStart=/usr/bin/ip netns exec sepian ip route
# make wireguard the default route
ExecStart=/usr/bin/ip netns exec sepian ip route add default via 192.168.20.2 dev wg-sepian
# more logging
ExecStart=/usr/bin/ip netns exec sepian ip route
# add a bridge interface for VMs
ExecStart=/usr/bin/ip netns exec sepian ip link add name br0 type bridge
# configure the addresses and bring it up
ExecStart=/usr/bin/ip netns exec sepian ip addr add 192.168.0.1/24 dev br0
ExecStart=/usr/bin/ip netns exec sepian ip link set br0 up
# masquerade/forward traffic to sepia-bounce
ExecStart=/usr/bin/ip netns exec sepian iptables -t nat -A POSTROUTING -o wg-sepian -j MASQUERADE

When using the network namespace, we will run commands via ip netns exec. There is a handy feature here: files under /etc/netns/<name>/ are automatically bind mounted over their /etc counterparts for commands run via ip netns exec:

# cat /etc/netns/sepian/resolv.conf
nameserver 192.168.20.2

That file will configure the libc name resolution stack to route DNS requests for applications to the systemd-resolved daemon running on sepia-bounce. Consequently, any application running in that netns will be able to resolve sepia hostnames:

$ sudo ip netns exec sepian host vossi04.front.sepia.ceph.com
vossi04.front.sepia.ceph.com has address 172.21.10.4

Okay, great. We have a network namespace that forwards traffic to the sepia VPN. The next step is to connect virtual machines running our kernel to the bridge we configured. The straightforward way to do that is to create a "tap" device which connects to the bridge:

sudo ip netns exec sepian qemu-system-x86_64 \
    -enable-kvm \
    -kernel $(pwd)/arch/x86/boot/bzImage \
    -drive file="$DEV",if=virtio,format=raw \
    -netdev tap,id=net0,ifname=tap0,script="$HOME/bin/qemu-br0",downscript=no \
    -device virtio-net-pci,netdev=net0 \
    -append 'root=/dev/vda rw'

The new relevant bits here are (a) executing the VM in the netns we have constructed; (b) a -netdev command to configure a tap device; (c) a virtual network card for the VM. There is also a script $HOME/bin/qemu-br0 run by qemu to configure the tap device it creates for the VM:

#!/bin/bash
tap=$1
ip link set "$tap" master br0
ip link set dev "$tap" up

That simply plugs the new tap device into the bridge.

This is all well and good, but we are missing one last crucial step: what is the IP address of the VM? There are two options:

  1. Configure a static IP. This requires modifying the networking configuration on the VM's root device.
  2. Use DHCP, and configure the VM's root device to always use DHCP to configure its ethernet device addresses.

The second option is more complicated to set up, since you must now run a DHCP server, but it provides the greatest flexibility for adding more VMs during testing.
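For the first option, assuming the VM's root image uses systemd-networkd, the static configuration might look like this (the file name and address are illustrative; the gateway and DNS values match the bridge and sepia-bounce addresses used throughout this document):

# /etc/systemd/network/20-wired.network (inside the VM)
[Match]
Name=enp0s3

[Network]
Address=192.168.0.50/24
Gateway=192.168.0.1
DNS=192.168.20.2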

The modified (or "hacked") standard dhcpd systemd service looks like:

# cat sepian-dhcpd.service
[Unit]
Description=IPv4 DHCP server
After=network.target network-online.target sepian-netns.service
Wants=network-online.target
Requires=sepian-netns.service

[Service]
ExecStartPre=/usr/bin/touch /tmp/dhcpd.leases
ExecStartPre=/usr/bin/cat /etc/netns/sepian/dhcpd.conf
ExecStart=/usr/bin/dhcpd -f -4 -q -cf /etc/netns/sepian/dhcpd.conf -lf /tmp/dhcpd.leases
NetworkNamespacePath=/var/run/netns/sepian
RuntimeDirectory=dhcpd4
User=dhcp
AmbientCapabilities=CAP_NET_BIND_SERVICE CAP_NET_RAW
ProtectSystem=full
ProtectHome=on
KillSignal=SIGINT
# We pull in network-online.target for a configured network connection.
# However this is not guaranteed to be the network connection our
# networks are configured for. So try to restart on failure with a delay
# of two seconds. Rate limiting kicks in after 12 seconds.
RestartSec=2s
Restart=on-failure
StartLimitInterval=12s

[Install]
WantedBy=multi-user.target

Similarly, the referenced dhcpd.conf:

# cat /etc/netns/sepian/dhcpd.conf
option domain-name-servers 192.168.20.2;
option subnet-mask 255.255.255.0;
option routers 192.168.0.1;
subnet 192.168.0.0 netmask 255.255.255.0 {
    range 192.168.0.100 192.168.0.199;
}

Importantly, this tells the VM to route traffic to 192.168.0.1 (the IP of the bridge in the netns) and DNS can be provided by 192.168.20.2 (via systemd-resolved on the sepia-bounce VM).

In the VM, the networking looks like:

[root@archlinux ~]# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
3: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/sit 0.0.0.0 brd 0.0.0.0
[root@archlinux ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.100/24 metric 1024 brd 192.168.0.255 scope global dynamic enp0s3
valid_lft 28435sec preferred_lft 28435sec
inet6 fe80::5054:ff:fe12:3456/64 scope link proto kernel_ll
valid_lft forever preferred_lft forever
3: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
link/sit 0.0.0.0 brd 0.0.0.0
[root@archlinux ~]# systemd-resolve --status
Global
        Protocols: +LLMNR +mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Fallback DNS Servers: 1.1.1.1#cloudflare-dns.com 9.9.9.9#dns.quad9.net 8.8.8.8#dns.google 2606:4700:4700::1111#cloudflare-dns.com 2620:fe::9#dns.quad9.net 2001:4860:4860::8888#dns.google

Link 2 (enp0s3)
Current Scopes: DNS LLMNR/IPv4 LLMNR/IPv6
        Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 192.168.20.2
DNS Servers: 192.168.20.2

Link 3 (sit0)
Current Scopes: none
        Protocols: -DefaultRoute +LLMNR +mDNS -DNSOverTLS DNSSEC=no/unsupported

Finally, some other networking configurations to consider:

  • Run the VM on your machine with full access to the host networking stack. If you have the sepia vpn, this will probably work without too much configuration.
  • Run the VM in a netns as above, but also set up the sepia VPN in the same netns. This avoids the need for a sepia-bounce VM, though you'll still need to configure routing between the bridge and the sepia VPN.
  • Run the VM in a netns as above but only use a local vstart cluster (possibly in another VM) in the same netns.

Step Four: mounting a CephFS file system in your VM

This guide uses a vstart cluster on a machine in the sepia lab. Because the mon addresses change with every new vstart cluster, any static configuration for mounting the CephFS via the kernel driver would quickly go stale. So we create a script that fetches the current configuration from the vstart cluster prior to mounting:

#!/bin/bash
# kmount.sh -- mount a vstart Ceph cluster on a remote machine

# the cephx client credential, vstart creates "client.fs" by default
NAME=fs
# static fs name, vstart creates an "a" file system by default
FS=a
# where to mount on the VM
MOUNTPOINT=/mnt
# cephfs mount point (root by default)
CEPHFS_MOUNTPOINT=/

function run {
    printf '%s\n' "$*" >&2
    "$@"
}

function mssh {
    run ssh vossi04.front.sepia.ceph.com "cd ceph/build && (source vstart_environment.sh; $1)"
}

# Create the minimal config (including mon addresses) and store it in the
# VM's ceph.conf. This is not used for mounting; we keep it for potential
# use with `ceph` commands.
mssh "ceph config generate-minimal-conf" > /etc/ceph/ceph.conf
# get the vstart cluster's fsid
FSID=$(mssh "ceph fsid")
# get the auth key associated with client.fs
KEY=$(mssh "ceph auth get-key client.$NAME")
# dump the v2 mon addresses and format for the -o mon_addr mount option
MONS=$(mssh "ceph mon dump --format=json" | jq -r '.mons[] | .public_addrs.addrvec[] | select(.type == "v2") | .addr' | paste -s -d/)

# turn on kernel debugging (and any other debugging you'd like)
echo "module ceph +p" | tee /sys/kernel/debug/dynamic_debug/control
# do the mount! we use the new device syntax for this mount
run mount -t ceph "${NAME}@${FSID}.${FS}=${CEPHFS_MOUNTPOINT}" -o "mon_addr=${MONS},ms_mode=crc,name=${NAME},secret=${KEY},norequire_active_mds,noshare" "$MOUNTPOINT"
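The jq pipeline that builds MONS can be exercised standalone. The JSON below is a hand-written stand-in for `ceph mon dump --format=json` output (the addresses are made up), showing how the v2 addresses are extracted and joined with `/` for the mon_addr mount option:

```shell
# extract v2 mon addresses from sample `ceph mon dump` JSON and
# join them with "/" as required by the mon_addr mount option
printf '%s' '{"mons":[
  {"public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.10.4:3300"},
                              {"type":"v1","addr":"172.21.10.4:6789"}]}},
  {"public_addrs":{"addrvec":[{"type":"v2","addr":"172.21.10.4:40762"},
                              {"type":"v1","addr":"172.21.10.4:40763"}]}}]}' |
  jq -r '.mons[] | .public_addrs.addrvec[] | select(.type == "v2") | .addr' |
  paste -s -d/
# -> 172.21.10.4:3300/172.21.10.4:40762
```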

The script is then run like so:

$ sudo ip netns exec sepian ssh root@192.168.0.100 ./kmount.sh
...
mount -t ceph fs@<fsid>.a=/ -o mon_addr=172.21.10.4:40762/172.21.10.4:40764/172.21.10.4:40766,ms_mode=crc,name=fs,secret=AQD0jgln43pBCxAA7cJlZ4Px7J0UmiK4A4j3rA==,norequire_active_mds,noshare /mnt
$ sudo ip netns exec sepian ssh root@192.168.0.100 df -h /mnt
Filesystem         Size  Used Avail Use% Mounted on
fs@<fsid>.a=/      169G     0  169G   0% /mnt
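The device string in that mount line follows the new-style syntax name@fsid.fs=path. You can see how the script assembles it with stand-in values (the fsid below is made up for illustration):

```shell
# new-style CephFS mount device: NAME@FSID.FS=CEPHFS_MOUNTPOINT
NAME=fs FS=a CEPHFS_MOUNTPOINT=/
FSID=4a4b7237-90a2-4ed8-9b76-c9eb44e9e882   # made-up fsid for illustration
echo "${NAME}@${FSID}.${FS}=${CEPHFS_MOUNTPOINT}"
# -> fs@4a4b7237-90a2-4ed8-9b76-c9eb44e9e882.a=/
```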

If you run into difficulties, possible causes include:

  • The firewall on the node running the vstart cluster is blocking your connections.
  • Some misconfiguration in your networking stack.
  • An incorrect configuration for the mount.

Step Five: testing kernel changes in teuthology

There are three static branches in the ceph kernel git repository managed by the Ceph team:

  • for-linus: A branch managed by the primary Ceph maintainer to share changes with Linus Torvalds (upstream). Do not push to this branch.
  • master: A staging ground for patches planned to be sent to Linus. Do not push to this branch.
  • testing: A staging ground for miscellaneous patches that need wider QA testing (via nightlies or regular Ceph QA testing). Push patches you believe to be nearly ready for upstream acceptance.

You may also push a wip-$feature branch to the ceph-client.git repository, which will be built by Jenkins. You can then view the results of the build in Shaman.

Once a kernel branch is built, you can test it via the fs CephFS QA suite:

$ teuthology-suite ... --suite fs --kernel wip-$feature --filter k-testing

The k-testing filter selects the fragment which normally sets the testing branch of the kernel for routine QA. That is, the fs suite regularly runs tests against whatever is in the testing branch of the kernel. We are overriding that choice of kernel branch via the --kernel wip-$feature switch.

Note

Without filtering for k-testing, the fs suite will also run jobs using ceph-fuse or the stock kernel, libcephfs tests, and other tests that may not be of interest when evaluating changes to the kernel.

The actual override is controlled using Lua merge scripts in the k-testing.yaml fragment. See that file for more details.