Merge pull request ceph#16765 from liewegas/wip-bluestore-docs
doc/rados/configuration: document bluestore
liewegas authored Aug 3, 2017
2 parents 0293149 + f2bcd02 commit ea96265
Showing 4 changed files with 288 additions and 65 deletions.
206 changes: 204 additions & 2 deletions doc/rados/configuration/bluestore-config-ref.rst
@@ -2,11 +2,202 @@
BlueStore Config Reference
==========================

Devices
=======

BlueStore manages either one, two, or (in certain cases) three storage
devices.

In the simplest case, BlueStore consumes a single (primary) storage
device. The storage device is normally partitioned into two parts:

#. A small partition is formatted with XFS and contains basic metadata
for the OSD. This *data directory* includes information about the
OSD (its identifier, which cluster it belongs to, and its private
keyring).
#. The rest of the device is normally a large partition that is managed
directly by BlueStore and contains all of the actual data. This
*main device* is normally identified by a ``block`` symlink in the
data directory.

It is also possible to deploy BlueStore across one or two additional devices:

* A *WAL device* can be used for BlueStore's internal journal or
write-ahead log. It is identified by the ``block.wal`` symlink in
the data directory. A WAL device is only useful if it is faster than
the primary device (e.g., when the WAL device is on an SSD and the
primary device is an HDD).
* A *DB device* can be used for storing BlueStore's internal metadata.
BlueStore (or rather, the embedded RocksDB) will put as much
metadata as it can on the DB device to improve performance. If the
DB device fills up, metadata will spill back onto the primary device
(where it would have been otherwise). Again, it is only helpful to
provision a DB device if it is faster than the primary device.
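
As an illustration only (the OSD id, device names, and abbreviated
listing below are hypothetical), the data directory of an OSD
provisioned with all three devices might contain symlinks like these::

ls -l /var/lib/ceph/osd/ceph-0
block -> /dev/sdb2           # main device
block.db -> /dev/nvme0n1p1   # DB device (optional)
block.wal -> /dev/nvme0n1p2  # WAL device (optional)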

If there is only a small amount of fast storage available (e.g., less
than a gigabyte), we recommend using it as a WAL device. If there is
more, provisioning a DB device makes more sense. The BlueStore
journal will always be placed on the fastest device available, so
using a DB device will provide the same benefit that the WAL device
would while *also* allowing additional metadata to be stored there (if
it will fit).

A single-device BlueStore OSD can be provisioned with::

ceph-disk prepare --bluestore <device>

To specify a WAL device and/or DB device, ::

ceph-disk prepare --bluestore <device> --block.wal <wal-device> --block.db <db-device>
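
For example (device and partition names below are purely illustrative,
and exact partitioning behaviour depends on the ceph-disk release), an
OSD whose data lives on an HDD might place its DB, or just its WAL, on
a pre-existing NVMe partition::

ceph-disk prepare --bluestore /dev/sdb --block.db /dev/nvme0n1p1
ceph-disk prepare --bluestore /dev/sdc --block.wal /dev/nvme0n1p2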

Cache size
==========

The amount of memory consumed by each OSD for BlueStore's cache is
determined by the ``bluestore_cache_size`` configuration option. If
that config option is not set (i.e., remains at 0), there is a
different default value that is used depending on whether an HDD or
SSD is used for the primary device (set by the
``bluestore_cache_size_ssd`` and ``bluestore_cache_size_hdd`` config
options).

BlueStore and the rest of the Ceph OSD currently do the best they can
to stick to the budgeted memory. Note that on top of the configured
cache size, there is also memory consumed by the OSD itself, and
generally some overhead due to memory fragmentation and other
allocator overhead.

The configured cache memory budget can be used in a few different ways:

* Key/Value metadata (i.e., RocksDB's internal cache)
* BlueStore metadata
* BlueStore data (i.e., recently read or written object data)

Cache memory usage is governed by the following options:
``bluestore_cache_meta_ratio``, ``bluestore_cache_kv_ratio``, and
``bluestore_cache_kv_max``. The fraction of the cache devoted to data
is 1.0 minus the meta and kv ratios. The memory devoted to kv
metadata (the RocksDB cache) is capped by ``bluestore_cache_kv_max``
since our testing indicates there are diminishing returns beyond a
certain point.
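
As a sketch only (the values below are illustrative, not tuned
recommendations), these options can be set cluster-wide in
``ceph.conf``::

[osd]
# give each BlueStore OSD a 2 GB cache regardless of device type
bluestore_cache_size = 2147483648
# 1% of the cache for BlueStore metadata, up to 99% for RocksDB ...
bluestore_cache_meta_ratio = .01
bluestore_cache_kv_ratio = .99
# ... but never more than 512 MB of the cache for RocksDB
bluestore_cache_kv_max = 536870912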

``bluestore_cache_size``

:Description: The amount of memory BlueStore will use for its cache. If zero, ``bluestore_cache_size_hdd`` or ``bluestore_cache_size_ssd`` will be used instead.
:Type: Integer
:Required: Yes
:Default: ``0``

``bluestore_cache_size_hdd``

:Description: The default amount of memory BlueStore will use for its cache when backed by an HDD.
:Type: Integer
:Required: Yes
:Default: ``1 * 1024 * 1024 * 1024`` (1 GB)

``bluestore_cache_size_ssd``

:Description: The default amount of memory BlueStore will use for its cache when backed by an SSD.
:Type: Integer
:Required: Yes
:Default: ``3 * 1024 * 1024 * 1024`` (3 GB)

``bluestore_cache_meta_ratio``

:Description: The ratio of cache devoted to metadata.
:Type: Floating point
:Required: Yes
:Default: ``.01``

``bluestore_cache_kv_ratio``

:Description: The ratio of cache devoted to key/value data (rocksdb).
:Type: Floating point
:Required: Yes
:Default: ``.99``

``bluestore_cache_kv_max``

:Description: The maximum amount of cache devoted to key/value data (rocksdb).
:Type: Floating point
:Required: Yes
:Default: ``512 * 1024 * 1024`` (512 MB)


Checksums
=========

BlueStore checksums all metadata and data written to disk. Metadata
checksumming is handled by RocksDB and uses `crc32c`. Data
checksumming is done by BlueStore and can make use of `crc32c`,
`xxhash32`, or `xxhash64`. The default is `crc32c` and should be
suitable for most purposes.

Full data checksumming does increase the amount of metadata that
BlueStore must store and manage. When possible, e.g., when clients
hint that data is written and read sequentially, BlueStore will
checksum larger blocks, but in many cases it must store a checksum
value (usually 4 bytes) for every 4 kilobyte block of data.

It is possible to use a smaller checksum value by truncating the
checksum to two or one byte, reducing the metadata overhead. The
trade-off is that the probability that a random error will not be
detected is higher with a smaller checksum, going from about one in
four billion with a 32-bit (4 byte) checksum to one in 65,536 for a
16-bit (2 byte) checksum or one in 256 for an 8-bit (1 byte) checksum.
The smaller checksum values can be used by selecting `crc32c_16` or
`crc32c_8` as the checksum algorithm.

The *checksum algorithm* can be set either via a per-pool
``csum_type`` property or the global config option
(``bluestore_csum_type``). For example, ::

ceph osd pool set <pool-name> csum_type <algorithm>
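
For instance, to accept the weaker 16-bit checksum on a hypothetical
pool named ``mypool``::

ceph osd pool set mypool csum_type crc32c_16

The cluster-wide default could likewise be changed in ``ceph.conf``
(again, the value here is just an example)::

[osd]
bluestore_csum_type = xxhash64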

``bluestore_csum_type``

:Description: The default checksum algorithm to use.
:Type: String
:Required: Yes
:Valid Settings: ``none``, ``crc32c``, ``crc32c_16``, ``crc32c_8``, ``xxhash32``, ``xxhash64``
:Default: ``crc32c``


Inline Compression
==================

BlueStore supports inline compression using `snappy`, `zlib`, or
`lz4`. Please note that the `lz4` compression plugin is not
distributed in the official release.

Whether data in BlueStore is compressed is determined by a combination
of the *compression mode* and any hints associated with a write
operation. The modes are:

* **none**: Never compress data.
* **passive**: Do not compress data unless the write operation has a
*compressible* hint set.
* **aggressive**: Compress data unless the write operation has an
*incompressible* hint set.
* **force**: Try to compress data no matter what.

For more information about the *compressible* and *incompressible* IO
hints, see :doc:`/api/librados/#rados_set_alloc_hint`.

Note that regardless of the mode, if the size of a data chunk is not
reduced sufficiently by compression, the compressed version will not
be used and the original (uncompressed) data will be stored. For
example, if ``bluestore compression required ratio`` is set to ``.7``,
then the compressed data must be at most 70% of the size of the original.

The *compression mode*, *compression algorithm*, *compression required
ratio*, *min blob size*, and *max blob size* can be set either via a
per-pool property or a global config option. Pool properties can be
set with::

ceph osd pool set <pool-name> compression_algorithm <algorithm>
ceph osd pool set <pool-name> compression_mode <mode>
ceph osd pool set <pool-name> compression_required_ratio <ratio>
ceph osd pool set <pool-name> compression_min_blob_size <size>
ceph osd pool set <pool-name> compression_max_blob_size <size>
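
For example (the pool name and settings are illustrative only), the
following would enable snappy compression for writes that carry a
*compressible* hint on a pool named ``mypool``::

ceph osd pool set mypool compression_algorithm snappy
ceph osd pool set mypool compression_mode passive

The same behaviour could be made the cluster-wide default in
``ceph.conf``, e.g.::

[osd]
bluestore compression algorithm = snappy
bluestore compression mode = passive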

``bluestore compression algorithm``

@@ -33,6 +224,17 @@ the lz4 compression plugin is not distributed in the official release.
:Valid Settings: ``none``, ``passive``, ``aggressive``, ``force``
:Default: ``none``

``bluestore compression required ratio``

:Description: The ratio of the size of the data chunk after
compression relative to the original size must be at
least this small in order to store the compressed
version.

:Type: Floating point
:Required: No
:Default: ``.875``

``bluestore compression min blob size``

:Description: Chunks smaller than this are never compressed.
62 changes: 0 additions & 62 deletions doc/rados/configuration/filesystem-recommendations.rst

This file was deleted.

2 changes: 1 addition & 1 deletion doc/rados/configuration/index.rst
@@ -32,7 +32,7 @@ For general object store configuration, refer to the following:
.. toctree::
:maxdepth: 1

Disks and Filesystems <filesystem-recommendations>
Storage devices <storage-devices>
ceph-conf


83 changes: 83 additions & 0 deletions doc/rados/configuration/storage-devices.rst
@@ -0,0 +1,83 @@
=================
Storage Devices
=================

There are two Ceph daemons that store data on disk:

* **Ceph OSDs** (or Object Storage Daemons) are where most of the
data is stored in Ceph. Generally speaking, each OSD is backed by
a single storage device, like a traditional hard disk (HDD) or
solid state disk (SSD). OSDs can also be backed by a combination
of devices, like an HDD for most data and an SSD (or partition of an
SSD) for some metadata. The number of OSDs in a cluster is
generally a function of how much data will be stored, how big each
storage device will be, and the level and type of redundancy
(replication or erasure coding).
* **Ceph Monitor** daemons manage critical cluster state like cluster
membership and authentication information. For smaller clusters a
few gigabytes is all that is needed, although for larger clusters
the monitor database can reach tens or possibly hundreds of
gigabytes.


OSD Backends
============

There are two ways that OSDs can manage the data they store. Starting
with the Luminous 12.2.z release, the new default (and recommended) backend is
*BlueStore*. Prior to Luminous, the default (and only option) was
*FileStore*.

BlueStore
---------

BlueStore is a special-purpose storage backend designed specifically
for managing data on disk for Ceph OSD workloads. It is motivated by
experience supporting and managing OSDs using FileStore over the
last ten years. Key BlueStore features include:

* Direct management of storage devices. BlueStore consumes raw block
devices or partitions. This avoids any intervening layers of
abstraction (such as local file systems like XFS) that may limit
performance or add complexity.
* Metadata management with RocksDB. We embed RocksDB's key/value database
in order to manage internal metadata, such as the mapping from object
names to block locations on disk.
* Full data and metadata checksumming. By default all data and
metadata written to BlueStore is protected by one or more
checksums. No data or metadata will be read from disk or returned
to the user without being verified.
* Inline compression. Data written may be optionally compressed
before being written to disk.
* Multi-device metadata tiering. BlueStore allows its internal
journal (write-ahead log) to be written to a separate, high-speed
device (like an SSD, NVMe, or NVDIMM) to increase performance. If
a significant amount of faster storage is available, internal
metadata can also be stored on the faster device.
* Efficient copy-on-write. RBD and CephFS snapshots rely on a
copy-on-write *clone* mechanism that is implemented efficiently in
BlueStore. This results in efficient IO both for regular snapshots
and for erasure coded pools (which rely on cloning to implement
efficient two-phase commits).

For more information, see :doc:`bluestore-config-ref`.

FileStore
---------

FileStore is the legacy approach to storing objects in Ceph. It
relies on a standard file system (normally XFS) in combination with a
key/value database (traditionally LevelDB, now RocksDB) for some
metadata.

FileStore is well-tested and widely used in production but suffers
from many performance deficiencies due to its overall design and
reliance on a traditional file system for storing object data.

Although FileStore is generally capable of functioning on most
POSIX-compatible file systems (including btrfs and ext4), we only
recommend that XFS be used. Both btrfs and ext4 have known bugs and
deficiencies and their use may lead to data loss. By default all Ceph
provisioning tools will use XFS.

For more information, see :doc:`filestore-config-ref`.
