Introduction

Following on from the Ceph Upmap Balancer Lab, this lab runs a similar test but uses the crush-compat balancer mode instead. It can be done either as a standalone lab or as a follow-on to the Upmap lab.

Using the Ceph Octopus lab set up previously with RadosGW nodes, this attempts to simulate a cluster where OSD utilisation is skewed. Each node in the cluster has an extra 50G OSD to help skew the usage percentages across the OSDs.

This is the current configuration of the cluster:

Ceph crush-compat Test Cluster
Ceph crush-compat Test Cluster

Test Setup

In summary, the cluster will have data added to it using s3cmd. Two buckets will be created: one for ISOs and other large images, and another for a large number of photos and other small files.

Requirements

  • Ensure the balancer is off before any test data is added


# ceph balancer off

# ceph balancer status

  • Example output. The key is to ensure "active": false
    
      {
      "active": false,
      "last_optimize_duration": "0:00:00.012292",
      "last_optimize_started": "Fri Jan 15 09:50:40 2021",
      "mode": "upmap",
      "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
      "plans": []
      }
      

Setup s3cmd

  • Install s3cmd


# dnf install s3cmd
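The access and secret keys in the configuration below come from an RGW S3 user. If one has not been created yet, it can be set up with radosgw-admin on a node with Ceph admin access; the uid and display name here are arbitrary examples for this lab:

```shell
# Create an S3 user on the RGW. The JSON output includes the generated
# access_key and secret_key to substitute into the s3cmd config below.
# The uid and display name are arbitrary choices for this lab.
radosgw-admin user create --uid=balancer-lab --display-name="Balancer Lab User"
```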



# cat <<EOF > ~/.ceph_s3cfg

[default]
access_key = 4IMSY2D3RPWW7VB7ECPB
secret_key = 7AuFEvU9HKa4WB6BjfuTlZEDv6t1oHKhQ01zmIDo
host_base = ceph-rgw01.ceph.lab

# If wildcard DNS is configured

# host_bucket = %(bucket)s.ceph-rgw01.ceph.lab

# If no wildcard DNS

host_bucket = ceph-rgw01.ceph.lab

# If a proxy host is needed to reach the RGW nodes

#proxy_host = 192.168.0.15
#proxy_port = 8888

# If SSL has not been enabled on RGW

use_https = False
human_readable_sizes = True
EOF

Modify ~/.ceph_s3cfg as required. Alternatively, modify ~/.s3cfg if this will be the only cluster s3cmd connects to.

  • Create two buckets


# s3cmd -c ~/.ceph_s3cfg mb s3://isos-2

# s3cmd -c ~/.ceph_s3cfg mb s3://photos-2

Seed Test Data

  • Sync local directories via s3cmd

This will probably take a while, depending on the amount of data, the speed of the Ceph drives, etc. It may be useful to set up a custom Grafana dashboard to confirm crush-compat is working as intended (if this has not already been done).
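If there is no suitable local photo collection to sync, a directory of small dummy files can be generated first. This is a sketch; the path, file count, and sizes are arbitrary choices for this lab:

```shell
# Generate 20 small files of random data to stand in for photos.
mkdir -p /tmp/fake-photos
for i in $(seq 1 20); do
    # 512 KiB of random data per file, roughly "small photo" sized
    head -c $((512 * 1024)) /dev/urandom > "/tmp/fake-photos/photo-${i}.jpg"
done
ls /tmp/fake-photos | wc -l
```

The generated directory can then be synced into the photos bucket with s3cmd in the same way as a real photo directory.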



# s3cmd -c ~/.ceph_s3cfg sync /var/lib/libvirt/images/iso/ s3://isos-2/

# s3cmd -c ~/.ceph_s3cfg sync ~/Photos s3://photos-2/

  • Check files are accessible


# s3cmd -c ~/.ceph_s3cfg ls s3://isos-2/

2021-01-14 10:30 858783744 s3://isos-2/CentOS-7-x86_64-GenericCloud.qcow2
  • Check the size of the buckets


# s3cmd -c ~/.ceph_s3cfg du s3://isos-2/

12G 12 objects s3://isos-2/
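The bucket usage can also be cross-checked from the Ceph side, bypassing the S3 API (run on a node with Ceph admin access):

```shell
# Query object count and size statistics for the bucket directly
# from the RGW rather than via s3cmd.
radosgw-admin bucket stats --bucket=isos-2
```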

Setup Custom Grafana Dashboard

Using either the default Grafana setup or the monitoring set up here, log in to Grafana as admin.

  • Import a new dashboard from this JSON

{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": "-- Grafana --",
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
}
]
},
"editable": true,
"gnetId": null,
"graphTooltip": 0,
"id": 16,
"links": [],
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "Dashboard1",
"decimals": 0,
"description": "Space used on each OSD represented as a percentage of total space available",
"fill": 0,
"fillGradient": 0,
"gridPos": {
"h": 7,
"w": 24,
"x": 0,
"y": 0
},
"hiddenSeries": false,
"id": 4,
"legend": {
"alignAsTable": true,
"avg": false,
"current": true,
"max": false,
"min": false,
"rightSide": true,
"show": true,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 1,
"nullPointMode": "null",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "(ceph_osd_stat_bytes_used{instance=~\"$mgr\"} / ceph_osd_stat_bytes{instance=~\"$mgr\"} * 100)",
"instant": false,
"legendFormat": "{{ ceph_daemon }}",
"refId": "A"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "OSD Space Used",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": 25,
"min": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "percent",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"decimals": null,
"format": "short",
"label": "",
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {},
"bars": true,
"dashLength": 10,
"dashes": false,
"datasource": "Dashboard1",
"description": "Each bar indicates the number of OSD's that have a PG count in a specific range as shown on the x axis.",
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 24,
"x": 0,
"y": 7
},
"hiddenSeries": false,
"id": 2,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": false,
"total": false,
"values": false
},
"lines": false,
"linewidth": 1,
"nullPointMode": "null",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "ceph_osd_numpg",
"instant": true,
"legendFormat": "PGs per OSD",
"refId": "A"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Distribution of PGs per OSD",
"tooltip": {
"shared": false,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": 20,
"min": null,
"mode": "histogram",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "short",
"label": "# of OSDs",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
}
],
"refresh": "10s",
"schemaVersion": 22,
"style": "dark",
"tags": [],
"templating": {
"list": [
{
"allValue": null,
"current": {
"text": "ceph-mon02:9283",
"value": "ceph-mon02:9283"
},
"datasource": "Dashboard1",
"definition": "label_values(ceph_osd_stat_bytes_used, instance)",
"hide": 0,
"includeAll": false,
"label": "mgr",
"multi": false,
"name": "mgr",
"options": [],
"query": "label_values(ceph_osd_stat_bytes_used, instance)",
"refresh": 2,
"regex": "",
"skipUrlSync": false,
"sort": 0,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
}
]
},
"time": {
"from": "now-1h",
"to": "now"
},
"timepicker": {
"refresh_intervals": [
"5s",
"10s",
"30s",
"1m",
"5m",
"15m",
"30m",
"1h",
"2h",
"1d"
]
},
"timezone": "",
"title": "Upmap Dashboard",
"uid": "yU-29mBMk",
"version": 5
}
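As an alternative to importing through the UI, the dashboard can be pushed through Grafana's HTTP API. This is a sketch: the Grafana host, port, and admin:admin credentials are assumptions for this lab, and a trimmed stand-in is written to dashboard.json here; use the full JSON from this post instead.

```shell
# Trimmed stand-in for the full dashboard JSON above.
cat > dashboard.json <<'EOF'
{"title": "Upmap Dashboard", "uid": "yU-29mBMk"}
EOF

# Wrap the dashboard JSON in the payload shape that Grafana's
# /api/dashboards/db endpoint expects.
python3 - <<'PY'
import json

with open("dashboard.json") as f:
    dash = json.load(f)

with open("payload.json", "w") as f:
    json.dump({"dashboard": dash, "overwrite": True}, f)
PY

# Then POST it (host and credentials are assumptions for this lab):
#   curl -s -u admin:admin -H 'Content-Type: application/json' \
#        -d @payload.json http://ceph-mon01:3000/api/dashboards/db
```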
  • The new dashboard should look something like this
Grafana Dashboard
Grafana Dashboard

Enable Crush-Compat Balancer

The steps below follow the instructions from the Ceph docs to enable the crush-compat balancer.

Current State of the Cluster

The cluster starts with 3 OSDs above 85% utilisation and the lowest OSDs at ~50% utilisation. This is because the lab is a follow-on from the upmap balancer lab.

Pre CRUSH compat
OSD utilisation before crush-compat is enabled
  • Most utilised OSDs


# ceph osd df | sort -nr -k 17 | head -n 5

ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
14 hdd 0.00490 1.00000 5.0 GiB 4.5 GiB 3.5 GiB 149 KiB 1024 MiB 510 MiB 90.04 1.58 16 up
22 hdd 0.00490 1.00000 5.0 GiB 4.4 GiB 3.4 GiB 164 KiB 1024 MiB 630 MiB 87.69 1.54 19 up
29 hdd 0.00490 1.00000 5.0 GiB 4.3 GiB 3.3 GiB 1.1 MiB 1023 MiB 708 MiB 86.16 1.52 20 up
12 hdd 0.00490 1.00000 5.0 GiB 4.2 GiB 3.2 GiB 661 KiB 1023 MiB 787 MiB 84.63 1.49 21 up
13 hdd 0.00490 1.00000 5.0 GiB 4.2 GiB 3.2 GiB 137 KiB 1024 MiB 829 MiB 83.79 1.47 18 up
  • Least utilised OSDs


# ceph osd df | sort -n -k 17 | head -n 7

ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
34 hdd 0.04880 1.00000 50 GiB 25 GiB 24 GiB 1.9 MiB 1022 MiB 25 GiB 49.85 0.88 155 up
32 hdd 0.04880 1.00000 50 GiB 25 GiB 24 GiB 3.4 MiB 1021 MiB 25 GiB 50.50 0.89 142 up
24 hdd 0.00980 1.00000 10 GiB 5.1 GiB 4.1 GiB 206 KiB 1024 MiB 4.9 GiB 51.04 0.90 26 up
33 hdd 0.04880 1.00000 50 GiB 26 GiB 25 GiB 2.4 MiB 1022 MiB 24 GiB 51.29 0.90 148 up
38 hdd 0.04880 1.00000 50 GiB 26 GiB 25 GiB 1.5 MiB 1023 MiB 24 GiB 52.13 0.92 152 up
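To put a single number on the skew, the %USE column (column 17, as used by the sort commands above) can be summarised with awk. This is a sketch run against two sample rows from the tables above, saved to a file; on the cluster, pipe `ceph osd df` straight into the same awk command instead:

```shell
# Two sample rows from the tables above, saved for illustration.
cat > /tmp/osd-df.txt <<'EOF'
14 hdd 0.00490 1.00000 5.0 GiB 4.5 GiB 3.5 GiB 149 KiB 1024 MiB 510 MiB 90.04 1.58 16 up
34 hdd 0.04880 1.00000 50 GiB 25 GiB 24 GiB 1.9 MiB 1022 MiB 25 GiB 49.85 0.88 155 up
EOF

# Column 17 is %USE; skip any line where it is not numeric
# (this drops the header row automatically).
awk '$17 + 0 == $17 {
    if (min == "" || $17 < min) min = $17
    if ($17 > max) max = $17
} END { printf "min=%.2f max=%.2f spread=%.2f\n", min, max, max - min }' /tmp/osd-df.txt
```

On the sample rows this prints min=49.85 max=90.04 spread=40.19; a shrinking spread over time indicates the balancer is working.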

Starting the crush-compat Balancer

  • Set the balancer mode to crush-compat


# ceph balancer mode crush-compat

  • Confirm this has been set


# ceph balancer status

  • Example output

    
      {
      "active": false,
      "last_optimize_duration": "0:00:00.012292",
      "last_optimize_started": "Fri Jan 15 09:50:40 2021",
      "mode": "crush-compat",
      "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
      "plans": []
      }
      

  • Start the balancer



# ceph balancer on

  • Check the balancer is running


# ceph balancer status

  • Example output
    
      {
      "active": true,
      "last_optimize_duration": "0:00:00.446920",
      "last_optimize_started": "Fri Jan 15 13:42:59 2021",
      "mode": "crush-compat",
      "optimize_result": "Optimization plan created successfully",
      "plans": []
      }
      

At this point the cluster will start rebalancing data. ceph -s will show PGs being remapped to optimise data placement. The grafana graph should start to show the OSD utilisation percentages averaging out.
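While the rebalance runs, progress can also be followed from the CLI alongside Grafana (a sketch; the refresh interval is an arbitrary choice):

```shell
# Refresh the cluster and balancer state every 30 seconds; Ctrl-C to exit.
watch -n 30 'ceph -s; echo; ceph balancer status'
```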

  • Once the balancer has completed, it will show a message similar to the one below. Note that all client I/O on the cluster was stopped during the rebalance.


# ceph balancer status

  • Example output
    
      {
      "active": true,
      "last_optimize_duration": "0:00:00.775223",
      "last_optimize_started": "Fri Jan 15 14:05:14 2021",
      "mode": "crush-compat",
      "optimize_result": "Unable to find further optimization, change balancer mode and retry might help",
      "plans": []
      }
      

Summary

Data utilisation across the cluster is somewhat better, with the highest OSD usage now at 87% and the lowest at 49%.

Post crush-compat
Post crush-compat OSD Utilisation

Although this doesn’t seem as efficient as the upmap balancer, the test may be at fault for not starting from scratch, as changing the balancer mode back to upmap produces this message:


      {
      "active": true,
      "last_optimize_duration": "0:00:00.027171",
      "last_optimize_started": "Fri Jan 15 14:11:23 2021",
      "mode": "upmap",
      "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
      "plans": []
      }

Considering it is unlikely that a production environment would have drives 10x larger than others in the same CRUSH hierarchy, the distribution of data across the available OSDs appears fairly balanced.

Cleanup

To clean up this lab, use the script here.