Zoned Storage is a class of storage devices that enables host and storage devices to cooperate to achieve higher storage capacities, increased throughput, and lower latencies. The zoned storage interface is available through the SCSI Zoned Block Commands (ZBC) and Zoned Device ATA Command Set (ZAC) standards on Shingled Magnetic Recording (SMR) hard disks today and is also being adopted for NVMe Solid State Disks with the upcoming NVMe Zoned Namespaces (ZNS) standard.
This project aims to enable Ceph to work on zoned storage drives and at the same time explore research problems related to adopting this new interface. The first target is to enable non-overwrite workloads (e.g. RGW) on host-managed SMR (HM-SMR) drives and explore cleaning (garbage collection) policies. HM-SMR drives are high capacity hard drives with the ZBC/ZAC interface. The longer term goal is to support ZNS SSDs, as they become available, as well as overwrite workloads.
The first patch in these series enabled writing data to HM-SMR drives. This patch introduces ZonedFreelistManger, a FreelistManager implementation that passes enough information to ZonedAllocator to correctly initialize state of zones by tracking the write pointer and the number of dead bytes per zone. We have to introduce a new FreelistManager implementation because with zoned devices a region of disk can be in three states (empty, used, and dead), whereas current BitmapFreelistManager tracks only two states (empty and used). It is not possible to accurately initialize the state of zones in ZonedAllocator by tracking only two states. The third planned patch will introduce a rudimentary cleaner to form a baseline for further research.
Currently we can perform basic RADOS benchmarks on an OSD running on an HM-SMR drives, restart the OSD, and read the written data, and write new data, as can be seen below.
Please contact Abutalib Aghayev <[email protected]> for questions.
$ sudo zbd report -i -n /dev/sdc Device /dev/sdc: Vendor ID: ATA HGST HSH721414AL T240 Zone model: host-managed Capacity: 14000.520 GB (27344764928 512-bytes sectors) Logical blocks: 3418095616 blocks of 4096 B Physical blocks: 3418095616 blocks of 4096 B Zones: 52156 zones of 256.0 MB Maximum number of open zones: no limit Maximum number of active zones: no limit 52156 / 52156 zones $ MON=1 OSD=1 MDS=0 sudo ../src/vstart.sh --new --localhost --bluestore --bluestore-devs /dev/sdc --bluestore-zoned <snipped verbose output> $ sudo ./bin/ceph osd pool create bench 32 32 pool 'bench' created $ sudo ./bin/rados bench -p bench 10 write --no-cleanup hints = 1 Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects Object prefix: benchmark_data_h0.cc.journaling712.narwhal.p_29846 sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s) 0 0 0 0 0 0 - 0 1 16 45 29 115.943 116 0.384175 0.407806 2 16 86 70 139.949 164 0.259845 0.391488 3 16 125 109 145.286 156 0.31727 0.404727 4 16 162 146 145.953 148 0.826671 0.409003 5 16 203 187 149.553 164 0.44815 0.404303 6 16 242 226 150.621 156 0.227488 0.409872 7 16 281 265 151.384 156 0.411896 0.408686 8 16 320 304 151.956 156 0.435135 0.411473 9 16 359 343 152.401 156 0.463699 0.408658 10 15 396 381 152.356 152 0.409554 0.410851 Total time run: 10.3305 Total writes made: 396 Write size: 4194304 Object size: 4194304 Bandwidth (MB/sec): 153.333 Stddev Bandwidth: 13.6561 Max bandwidth (MB/sec): 164 Min bandwidth (MB/sec): 116 Average IOPS: 38 Stddev IOPS: 3.41402 Max IOPS: 41 Min IOPS: 29 Average Latency(s): 0.411226 Stddev Latency(s): 0.180238 Max latency(s): 1.00844 Min latency(s): 0.108616 $ sudo ../src/stop.sh $ # Notice the lack of "--new" parameter to vstart.sh $ MON=1 OSD=1 MDS=0 sudo ../src/vstart.sh --localhost --bluestore --bluestore-devs /dev/sdc --bluestore-zoned <snipped verbose output> $ sudo ./bin/rados bench -p bench 10 rand hints = 1 sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s) 0 0 0 0 0 0 - 0 1 16 61 45 179.903 180 0.117329 0.244067 2 16 116 100 199.918 220 0.144162 0.292305 3 16 174 158 210.589 232 0.170941 0.285481 4 16 251 235 234.918 308 0.241175 0.256543 5 16 316 300 239.914 260 0.206044 0.255882 6 15 392 377 251.206 308 0.137972 0.247426 7 15 458 443 252.984 264 0.0800146 0.245138 8 16 529 513 256.346 280 0.103529 0.239888 9 16 587 571 253.634 232 0.145535 0.2453 10 15 646 631 252.254 240 0.837727 0.246019 Total time run: 10.272 Total reads made: 646 Read size: 4194304 Object size: 4194304 Bandwidth (MB/sec): 251.558 Average IOPS: 62 Stddev IOPS: 10.005 Max IOPS: 77 Min IOPS: 45 Average Latency(s): 0.249385 Max latency(s): 0.888654 Min latency(s): 0.0103208 $ sudo ./bin/rados bench -p bench 10 write --no-cleanup hints = 1 Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects Object prefix: benchmark_data_h0.aa.journaling712.narwhal.p_64416 sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s) 0 0 0 0 0 0 - 0 1 16 46 30 119.949 120 0.52627 0.396166 2 16 82 66 131.955 144 0.48087 0.427311 3 16 123 107 142.627 164 0.3287 0.420614 4 16 158 142 141.964 140 0.405177 0.425993 5 16 192 176 140.766 136 0.514565 0.425175 6 16 224 208 138.635 128 0.69184 0.436672 7 16 261 245 139.967 148 0.459929 0.439502 8 16 301 285 142.468 160 0.250846 0.434799 9 16 336 320 142.189 140 0.621686 0.435457 10 16 374 358 143.166 152 0.460593 0.436384