Merge pull request ceph#44241 from kamoltat/wip-ksirivad-pool-bulk-flag
mon: osd pool create <pool-name> with --bulk flag

Reviewed-by: Josh Durgin <[email protected]>
neha-ojha authored Dec 22, 2021
2 parents be82f81 + abaab51 commit 9d09a81
Showing 15 changed files with 448 additions and 365 deletions.
7 changes: 7 additions & 0 deletions PendingReleaseNotes
@@ -84,6 +84,13 @@
* LevelDB support has been removed. ``WITH_LEVELDB`` is no longer a supported
build option.

* MON/MGR: Pools can now be created with the `--bulk` flag. Any pool created with `bulk`
will use a profile of the `pg_autoscaler` that provides more performance from the start.
Pools created without the `--bulk` flag will retain the old behavior by default. For more
details, see:

https://docs.ceph.com/en/latest/rados/operations/placement-groups/
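
  For example, a minimal sketch of the new flag (pool names here are hypothetical)::

    ceph osd pool create throughput-pool --bulk   # starts with a full complement of PGs
    ceph osd pool create small-pool               # non-bulk: starts with minimal PGs
    ceph osd pool get throughput-pool bulk        # -> bulk: true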

>=16.0.0
--------
* mgr/nfs: ``nfs`` module is moved out of volumes plugin. Prior using the
46 changes: 24 additions & 22 deletions doc/rados/operations/placement-groups.rst
@@ -43,10 +43,10 @@ the PG count with this command::

Output will be something like::

POOL SIZE TARGET SIZE RATE RAW CAPACITY RATIO TARGET RATIO EFFECTIVE RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE PROFILE
a 12900M 3.0 82431M 0.4695 8 128 warn scale-up
c 0 3.0 82431M 0.0000 0.2000 0.9884 1.0 1 64 warn scale-down
b 0 953.6M 3.0 82431M 0.0347 8 warn scale-down
POOL SIZE TARGET SIZE RATE RAW CAPACITY RATIO TARGET RATIO EFFECTIVE RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE BULK
a 12900M 3.0 82431M 0.4695 8 128 warn True
c 0 3.0 82431M 0.0000 0.2000 0.9884 1.0 1 64 warn True
b 0 953.6M 3.0 82431M 0.0347 8 warn False

**SIZE** is the amount of data stored in the pool. **TARGET SIZE**, if
present, is the amount of data the administrator has specified that
@@ -96,9 +96,12 @@ This factor can be adjusted with::
**AUTOSCALE**, is the pool ``pg_autoscale_mode``
and will be either ``on``, ``off``, or ``warn``.

The final column, **PROFILE** shows the autoscale profile
used by each pool. ``scale-up`` and ``scale-down`` are the
currently available profiles.
The final column, **BULK**, indicates whether the pool has the ``bulk`` flag
and will be either ``True`` or ``False``. A ``bulk`` pool is expected to be
large and therefore starts out with a large number of PGs for performance
purposes. Pools without the ``bulk`` flag, by contrast, are expected to be
smaller, e.g. the .mgr pool or meta pools.


Automated scaling
@@ -126,28 +129,27 @@ example, a pool that maps to OSDs of class `ssd` and a pool that maps
to OSDs of class `hdd` will each have optimal PG counts that depend on
the number of those respective device types.

The autoscaler uses the `scale-up` profile by default,
where it starts out each pool with minimal PGs and scales
up PGs when there is more usage in each pool. However, it also has
a `scale-down` profile, where each pool starts out with a full complements
of PGs and only scales down when the usage ratio across the pools is not even.
The autoscaler uses the `bulk` flag to determine which pools should start
out with a full complement of PGs and scale down only when the usage ratio
across the pools becomes uneven. Pools without the `bulk` flag start out
with minimal PGs and gain PGs only when there is more usage in the pool.

With only the `scale-down` profile, the autoscaler identifies
any overlapping roots and prevents the pools with such roots
from scaling because overlapping roots can cause problems
The autoscaler identifies any overlapping roots and prevents the pools
with such roots from scaling because overlapping roots can cause problems
with the scaling process.

To use the `scale-down` profile::
To create a pool with the `bulk` flag::

ceph osd pool set autoscale-profile scale-down
ceph osd pool create <pool-name> --bulk

To switch back to the default `scale-up` profile::
To set or unset the `bulk` flag of an existing pool::

ceph osd pool set autoscale-profile scale-up
ceph osd pool set <pool-name> bulk <true/false/1/0>

Existing clusters will continue to use the `scale-up` profile.
To use the `scale-down` profile, users will need to set autoscale-profile `scale-down`,
after upgrading to a version of Ceph that provides the `scale-down` feature.
To get the `bulk` flag of an existing pool::

ceph osd pool get <pool-name> bulk
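
For example, to mark an existing pool as bulk and verify the change (a minimal
sketch; ``data-pool`` is a hypothetical pool name)::

    ceph osd pool set data-pool bulk true
    ceph osd pool get data-pool bulk                  # -> bulk: true
    ceph osd pool autoscale-status | grep data-pool   # BULK column shows True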

.. _specifying_pool_target_size:

9 changes: 9 additions & 0 deletions doc/rados/operations/pools.rst
@@ -420,6 +420,15 @@ You may set values for the following keys:
:Valid Range: 1 sets flag, 0 unsets flag
:Version: Version ``FIXME``

.. _bulk:

.. describe:: bulk

Set or unset the bulk flag on a given pool.

:Type: Boolean
:Valid Range: true/1 sets flag, false/0 unsets flag
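
For example (a minimal sketch; ``mypool`` is a hypothetical pool name)::

    ceph osd pool set mypool bulk 1      # set the flag
    ceph osd pool set mypool bulk false  # unset the flag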

.. _write_fadvise_dontneed:

.. describe:: write_fadvise_dontneed
6 changes: 1 addition & 5 deletions qa/suites/rados/singleton/all/pg-autoscaler.yaml
@@ -5,13 +5,9 @@ roles:
- osd.1
- osd.2
- osd.3
- client.0
- - mon.b
- mon.c
- osd.4
- osd.5
- osd.6
- osd.7
- client.0
openstack:
- volumes: # attached to each instance
count: 4
2 changes: 1 addition & 1 deletion qa/workunits/cephtool/test.sh
@@ -2217,7 +2217,7 @@ function test_mon_osd_pool_set()
ceph osd pool get pool_erasure erasure_code_profile
ceph osd pool rm pool_erasure pool_erasure --yes-i-really-really-mean-it

for flag in nodelete nopgchange nosizechange write_fadvise_dontneed noscrub nodeep-scrub; do
for flag in nodelete nopgchange nosizechange write_fadvise_dontneed noscrub nodeep-scrub bulk; do
ceph osd pool set $TEST_POOL_GETSET $flag false
ceph osd pool get $TEST_POOL_GETSET $flag | grep "$flag: false"
ceph osd pool set $TEST_POOL_GETSET $flag true
169 changes: 86 additions & 83 deletions qa/workunits/mon/pg_autoscaler.sh
@@ -17,127 +17,130 @@ function wait_for() {
local cmd=$2

while true ; do
if bash -c "$cmd" ; then
break
fi
sec=$(( $sec - 1 ))
if [ $sec -eq 0 ]; then
echo failed
return 1
fi
sleep 1
if bash -c "$cmd" ; then
break
fi
sec=$(( $sec - 1 ))
if [ $sec -eq 0 ]; then
echo failed
return 1
fi
sleep 1
done
return 0
}

# power2: return 2^round(log2(n)), i.e. the power of two nearest in log space (computed with bc)
function power2() { echo "x=l($1)/l(2); scale=0; 2^((x+0.5)/1)" | bc -l;}

function eval_actual_expected_val() {
local actual_value=$1
local expected_value=$2
if [[ $actual_value = $expected_value ]]
then
echo "Success: " $actual_value "=" $expected_value
else
echo "Error: " $actual_value "!=" $expected_value
exit 1
fi
}

# enable
ceph config set mgr mgr/pg_autoscaler/sleep_interval 5
ceph config set mgr mgr/pg_autoscaler/sleep_interval 60
ceph mgr module enable pg_autoscaler
# ceph config set global osd_pool_default_pg_autoscale_mode on

# pg_num_min
ceph osd pool create a 16 --pg-num-min 4
ceph osd pool create b 16 --pg-num-min 2
ceph osd pool set a pg_autoscale_mode on
ceph osd pool set b pg_autoscale_mode on
ceph osd pool create meta0 16
ceph osd pool create bulk0 16 --bulk
ceph osd pool create bulk1 16 --bulk
ceph osd pool create bulk2 16 --bulk
ceph osd pool set meta0 pg_autoscale_mode on
ceph osd pool set bulk0 pg_autoscale_mode on
ceph osd pool set bulk1 pg_autoscale_mode on
ceph osd pool set bulk2 pg_autoscale_mode on
# set pool size
ceph osd pool set meta0 size 2
ceph osd pool set bulk0 size 2
ceph osd pool set bulk1 size 2
ceph osd pool set bulk2 size 2

# get num pools again since we created more pools
NUM_POOLS=$(ceph osd pool ls | wc -l)

# get profiles of pool a and b
PROFILE1=$(ceph osd pool autoscale-status | grep 'a' | grep -o -m 1 'scale-up\|scale-down' || true)
PROFILE2=$(ceph osd pool autoscale-status | grep 'b' | grep -o -m 1 'scale-up\|scale-down' || true)

# evaluate the default profile a
if [[ $PROFILE1 = "scale-up" ]]
then
echo "Success: pool a PROFILE is scale-up"
else
echo "Error: a PROFILE is scale-down"
exit 1
fi

# evaluate the default profile of pool b
if [[ $PROFILE2 = "scale-up" ]]
then
echo "Success: pool b PROFILE is scale-up"
else
echo "Error: b PROFILE is scale-down"
exit 1
fi
# get bulk flag of each pool through the command ceph osd pool autoscale-status
BULK_FLAG_1=$(ceph osd pool autoscale-status | grep 'meta0' | grep -o -m 1 'True\|False' || true)
BULK_FLAG_2=$(ceph osd pool autoscale-status | grep 'bulk0' | grep -o -m 1 'True\|False' || true)
BULK_FLAG_3=$(ceph osd pool autoscale-status | grep 'bulk1' | grep -o -m 1 'True\|False' || true)
BULK_FLAG_4=$(ceph osd pool autoscale-status | grep 'bulk2' | grep -o -m 1 'True\|False' || true)

# This part of this code will now evaluate the accuracy of
# scale-down profile
# evaluate the accuracy of ceph osd pool autoscale-status, specifically the `BULK` column

# change to scale-down profile
ceph osd pool set autoscale-profile scale-down
eval_actual_expected_val $BULK_FLAG_1 'False'
eval_actual_expected_val $BULK_FLAG_2 'True'
eval_actual_expected_val $BULK_FLAG_3 'True'
eval_actual_expected_val $BULK_FLAG_4 'True'

# get profiles of pool a and b
PROFILE1=$(ceph osd pool autoscale-status | grep 'a' | grep -o -m 1 'scale-up\|scale-down' || true)
PROFILE2=$(ceph osd pool autoscale-status | grep 'b' | grep -o -m 1 'scale-up\|scale-down' || true)

# evaluate that profile a is now scale-down
if [[ $PROFILE1 = "scale-down" ]]
then
echo "Success: pool a PROFILE is scale-down"
else
echo "Error: a PROFILE is scale-up"
exit 1
fi

# evaluate the profile of b is now scale-down
if [[ $PROFILE2 = "scale-down" ]]
then
echo "Success: pool b PROFILE is scale-down"
else
echo "Error: b PROFILE is scale-up"
exit 1
fi
# This part of the code will now evaluate the accuracy of the autoscaler

# get pool size
POOL_SIZE_A=$(ceph osd pool get a size| grep -Eo '[0-9]{1,4}')
POOL_SIZE_B=$(ceph osd pool get b size| grep -Eo '[0-9]{1,4}')

# calculate target pg of each pools
TARGET_PG_A=$(power2 $((($NUM_OSDS * 100)/($NUM_POOLS)/($POOL_SIZE_A))))
TARGET_PG_B=$(power2 $((($NUM_OSDS * 100)/($NUM_POOLS)/($POOL_SIZE_B))))
POOL_SIZE_1=$(ceph osd pool get meta0 size| grep -Eo '[0-9]{1,4}')
POOL_SIZE_2=$(ceph osd pool get bulk0 size| grep -Eo '[0-9]{1,4}')
POOL_SIZE_3=$(ceph osd pool get bulk1 size| grep -Eo '[0-9]{1,4}')
POOL_SIZE_4=$(ceph osd pool get bulk2 size| grep -Eo '[0-9]{1,4}')

# Calculate the target pg_num of each pool.
# The first pool is non-bulk, so we handle it first.
# Since its capacity ratio is 0, the meta pool keeps its current pg_num.

TARGET_PG_1=$(ceph osd pool get meta0 pg_num| grep -Eo '[0-9]{1,4}')
PG_LEFT=$NUM_OSDS*100
NUM_POOLS_LEFT=$NUM_POOLS-1
# The remaining pools are bulk pools of equal size, so the calculations
# are straightforward.
TARGET_PG_2=$(power2 $((($PG_LEFT)/($NUM_POOLS_LEFT)/($POOL_SIZE_2))))
TARGET_PG_3=$(power2 $((($PG_LEFT)/($NUM_POOLS_LEFT)/($POOL_SIZE_3))))
TARGET_PG_4=$(power2 $((($PG_LEFT)/($NUM_POOLS_LEFT)/($POOL_SIZE_4))))
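
# Worked example with illustrative numbers (not part of the test run):
# with 8 OSDs, PG_LEFT = 8 * 100 = 800; with 3 bulk pools of size 2,
# each gets 800 / 3 / 2 = 133 (integer division), which power2 rounds
# to the nearest power of two in log space: 128.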

# evaluate target_pg against the pg_num of each pool
wait_for 120 "ceph osd pool get a pg_num | grep $TARGET_PG_A"
wait_for 120 "ceph osd pool get b pg_num | grep $TARGET_PG_B"
wait_for 300 "ceph osd pool get meta0 pg_num | grep $TARGET_PG_1"
wait_for 300 "ceph osd pool get bulk0 pg_num | grep $TARGET_PG_2"
wait_for 300 "ceph osd pool get bulk1 pg_num | grep $TARGET_PG_3"
wait_for 300 "ceph osd pool get bulk2 pg_num | grep $TARGET_PG_4"

# target ratio
ceph osd pool set a target_size_ratio 5
ceph osd pool set b target_size_ratio 1
sleep 10
ceph osd pool set meta0 target_size_ratio 5
ceph osd pool set bulk0 target_size_ratio 1
sleep 60
APGS=$(ceph osd dump -f json-pretty | jq '.pools[0].pg_num_target')
BPGS=$(ceph osd dump -f json-pretty | jq '.pools[1].pg_num_target')
test $APGS -gt 100
test $BPGS -gt 10

# small ratio change does not change pg_num
ceph osd pool set a target_size_ratio 7
ceph osd pool set b target_size_ratio 2
sleep 10
ceph osd pool set meta0 target_size_ratio 7
ceph osd pool set bulk0 target_size_ratio 2
sleep 60
APGS2=$(ceph osd dump -f json-pretty | jq '.pools[0].pg_num_target')
BPGS2=$(ceph osd dump -f json-pretty | jq '.pools[1].pg_num_target')
test $APGS -eq $APGS2
test $BPGS -eq $BPGS2

# target_size
ceph osd pool set a target_size_bytes 1000000000000000
ceph osd pool set b target_size_bytes 1000000000000000
ceph osd pool set a target_size_ratio 0
ceph osd pool set b target_size_ratio 0
ceph osd pool set meta0 target_size_bytes 1000000000000000
ceph osd pool set bulk0 target_size_bytes 1000000000000000
ceph osd pool set meta0 target_size_ratio 0
ceph osd pool set bulk0 target_size_ratio 0
wait_for 60 "ceph health detail | grep POOL_TARGET_SIZE_BYTES_OVERCOMMITTED"

ceph osd pool set a target_size_bytes 1000
ceph osd pool set b target_size_bytes 1000
ceph osd pool set a target_size_ratio 1
ceph osd pool set meta0 target_size_bytes 1000
ceph osd pool set bulk0 target_size_bytes 1000
ceph osd pool set meta0 target_size_ratio 1
wait_for 60 "ceph health detail | grep POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO"

ceph osd pool rm a a --yes-i-really-really-mean-it
ceph osd pool rm b b --yes-i-really-really-mean-it
ceph osd pool rm meta0 meta0 --yes-i-really-really-mean-it
ceph osd pool rm bulk0 bulk0 --yes-i-really-really-mean-it
ceph osd pool rm bulk1 bulk1 --yes-i-really-really-mean-it
ceph osd pool rm bulk2 bulk2 --yes-i-really-really-mean-it

echo OK

11 changes: 10 additions & 1 deletion src/common/options/global.yaml.in
@@ -2566,6 +2566,15 @@ options:
services:
- mon
with_legacy: true
- name: osd_pool_default_flag_bulk
type: bool
level: advanced
desc: set bulk flag on new pools
fmt_desc: Set the ``bulk`` flag on new pools, which allows the autoscaler to use scale-down mode for them.
default: false
services:
- mon
with_legacy: true
- name: osd_pool_default_hit_set_bloom_fpp
type: float
level: advanced
@@ -6096,4 +6105,4 @@ options:
services:
- rgw
- osd
with_legacy: true
with_legacy: true
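
A minimal sketch of using this new default (assumes the standard ``ceph config`` workflow; the pool name is hypothetical)::

    ceph config set global osd_pool_default_flag_bulk true
    ceph osd pool create analytics-pool    # created with the bulk flag already set
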
3 changes: 0 additions & 3 deletions src/mon/KVMonitor.cc
@@ -53,9 +53,6 @@ void KVMonitor::create_initial()
dout(10) << __func__ << dendl;
version = 0;
pending.clear();
bufferlist bl;
bl.append("scale-up");
pending["config/mgr/mgr/pg_autoscaler/autoscale_profile"] = bl;
}

void KVMonitor::update_from_paxos(bool *need_bootstrap)
5 changes: 3 additions & 2 deletions src/mon/MonCommands.h
@@ -1058,6 +1058,7 @@ COMMAND("osd pool create "
"name=size,type=CephInt,range=0,req=false "
"name=pg_num_min,type=CephInt,range=0,req=false "
"name=autoscale_mode,type=CephChoices,strings=on|off|warn,req=false "
"name=bulk,type=CephBool,req=false "
"name=target_size_bytes,type=CephInt,range=0,req=false "
"name=target_size_ratio,type=CephFloat,range=0|1,req=false",\
"create pool", "osd", "rw")
@@ -1082,11 +1083,11 @@ COMMAND("osd pool rename "
"rename <srcpool> to <destpool>", "osd", "rw")
COMMAND("osd pool get "
"name=pool,type=CephPoolname "
"name=var,type=CephChoices,strings=size|min_size|pg_num|pgp_num|crush_rule|hashpspool|nodelete|nopgchange|nosizechange|write_fadvise_dontneed|noscrub|nodeep-scrub|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|use_gmt_hitset|target_max_objects|target_max_bytes|cache_target_dirty_ratio|cache_target_dirty_high_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|erasure_code_profile|min_read_recency_for_promote|all|min_write_recency_for_promote|fast_read|hit_set_grade_decay_rate|hit_set_search_last_n|scrub_min_interval|scrub_max_interval|deep_scrub_interval|recovery_priority|recovery_op_priority|scrub_priority|compression_mode|compression_algorithm|compression_required_ratio|compression_max_blob_size|compression_min_blob_size|csum_type|csum_min_block|csum_max_block|allow_ec_overwrites|fingerprint_algorithm|pg_autoscale_mode|pg_autoscale_bias|pg_num_min|target_size_bytes|target_size_ratio|dedup_tier|dedup_chunk_algorithm|dedup_cdc_chunk_size|eio",
"name=var,type=CephChoices,strings=size|min_size|pg_num|pgp_num|crush_rule|hashpspool|nodelete|nopgchange|nosizechange|write_fadvise_dontneed|noscrub|nodeep-scrub|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|use_gmt_hitset|target_max_objects|target_max_bytes|cache_target_dirty_ratio|cache_target_dirty_high_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|erasure_code_profile|min_read_recency_for_promote|all|min_write_recency_for_promote|fast_read|hit_set_grade_decay_rate|hit_set_search_last_n|scrub_min_interval|scrub_max_interval|deep_scrub_interval|recovery_priority|recovery_op_priority|scrub_priority|compression_mode|compression_algorithm|compression_required_ratio|compression_max_blob_size|compression_min_blob_size|csum_type|csum_min_block|csum_max_block|allow_ec_overwrites|fingerprint_algorithm|pg_autoscale_mode|pg_autoscale_bias|pg_num_min|target_size_bytes|target_size_ratio|dedup_tier|dedup_chunk_algorithm|dedup_cdc_chunk_size|eio|bulk",
"get pool parameter <var>", "osd", "r")
COMMAND("osd pool set "
"name=pool,type=CephPoolname "
"name=var,type=CephChoices,strings=size|min_size|pg_num|pgp_num|pgp_num_actual|crush_rule|hashpspool|nodelete|nopgchange|nosizechange|write_fadvise_dontneed|noscrub|nodeep-scrub|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|use_gmt_hitset|target_max_bytes|target_max_objects|cache_target_dirty_ratio|cache_target_dirty_high_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|min_read_recency_for_promote|min_write_recency_for_promote|fast_read|hit_set_grade_decay_rate|hit_set_search_last_n|scrub_min_interval|scrub_max_interval|deep_scrub_interval|recovery_priority|recovery_op_priority|scrub_priority|compression_mode|compression_algorithm|compression_required_ratio|compression_max_blob_size|compression_min_blob_size|csum_type|csum_min_block|csum_max_block|allow_ec_overwrites|fingerprint_algorithm|pg_autoscale_mode|pg_autoscale_bias|pg_num_min|target_size_bytes|target_size_ratio|dedup_tier|dedup_chunk_algorithm|dedup_cdc_chunk_size|eio "
"name=var,type=CephChoices,strings=size|min_size|pg_num|pgp_num|pgp_num_actual|crush_rule|hashpspool|nodelete|nopgchange|nosizechange|write_fadvise_dontneed|noscrub|nodeep-scrub|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|use_gmt_hitset|target_max_bytes|target_max_objects|cache_target_dirty_ratio|cache_target_dirty_high_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|min_read_recency_for_promote|min_write_recency_for_promote|fast_read|hit_set_grade_decay_rate|hit_set_search_last_n|scrub_min_interval|scrub_max_interval|deep_scrub_interval|recovery_priority|recovery_op_priority|scrub_priority|compression_mode|compression_algorithm|compression_required_ratio|compression_max_blob_size|compression_min_blob_size|csum_type|csum_min_block|csum_max_block|allow_ec_overwrites|fingerprint_algorithm|pg_autoscale_mode|pg_autoscale_bias|pg_num_min|target_size_bytes|target_size_ratio|dedup_tier|dedup_chunk_algorithm|dedup_cdc_chunk_size|eio|bulk "
"name=val,type=CephString "
"name=yes_i_really_mean_it,type=CephBool,req=false",
"set pool parameter <var> to <val>", "osd", "rw")
