Upgrading Ceph

This section describes how to upgrade from one version of Ceph to another. The Ceph upgrade procedure is described here.

The Ceph release series is not strictly dependent upon the StackHPC OpenStack release. However, this configuration does define a default Ceph release series and container image tag. The default release series is currently squid.

Known issues

Slow ceph-volume activate

A large slowdown of ceph-volume activate has been reported on version 19.2.3 (bug 73107).

On Reef, a host with 15 OSDs was measured taking around 10 seconds to activate all OSDs while exiting maintenance mode. On Squid 19.2.3, a host with 22 OSDs was measured taking 2 minutes to activate all OSDs.

This bug is resolved in 19.2.4.

Elastic Shared Blob crash

In Ceph Squid versions prior to 19.2.4, there is a known bug causing OSDs created on Squid to crash. To avoid it, disable the Elastic Shared Blob feature before any OSDs are created or replaced:

ceph config set osd bluestore_elastic_shared_blobs 0

Set this option once the upgrade to Squid is complete, and before creating or replacing any OSDs. It cannot be set earlier because the option is not available on Reef.
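Because the workaround applies only to Squid releases older than 19.2.4, it can help to gate the command on the running version. The following is a minimal sketch, not part of this configuration: the function names version_lt and needs_esb_workaround are hypothetical, and the version is assumed to be supplied by the operator in MAJOR.MINOR.PATCH form. It relies on the -V (version sort) option of GNU sort.

```shell
# Hypothetical helpers: decide whether the Elastic Shared Blob workaround
# applies to a given Ceph version (assumed format: MAJOR.MINOR.PATCH).

# Succeed if version $1 is strictly older than version $2.
# Relies on GNU sort's -V (version sort) option.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Succeed if $1 is a Squid (19.x) release older than 19.2.4.
needs_esb_workaround() {
    case "$1" in
        19.*) version_lt "$1" "19.2.4" ;;
        *) return 1 ;;
    esac
}
```

For example, once the cluster is running 19.2.3:

needs_esb_workaround 19.2.3 && ceph config set osd bluestore_elastic_shared_blobs 0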

Prerequisites

Before starting the upgrade, ensure any appropriate prerequisites are satisfied. These will be specific to each deployment, but here are some suggestions:

  • Ensure that expected test suites are passing, e.g. Tempest.

  • Resolve any Prometheus alerts.

  • Check for unexpected ERROR or CRITICAL messages in OpenSearch Dashboards.

  • Check Grafana dashboards.

Consider whether the Ceph cluster needs to be upgraded within or outside of a maintenance/change window.

Preparation

Ensure that the local Kayobe configuration environment is up to date.

If you wish to use a different Ceph release series, set cephadm_ceph_release.

If you wish to use different Ceph container image tags, set the following variables:

  • cephadm_image_tag (tags)

  • cephadm_haproxy_image_tag (tags)

  • cephadm_keepalived_image_tag (tags)

Be sure to use a tag that matches the release series.
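For the Ceph image itself, the match between tag and release series can be checked mechanically, since the major version number identifies the series (17 is quincy, 18 is reef, 19 is squid). The sketch below assumes tags of the form v19.2.3 or 19.2.3. The function names release_for_tag and tag_matches_release are hypothetical. It applies to cephadm_image_tag only, as the HAProxy and Keepalived images follow their own versioning.

```shell
# Hypothetical helper: map a Ceph image tag to its release series.
# Assumes tags of the form v19.2.3 or 19.2.3 (leading "v" optional).
release_for_tag() (
    version="${1#v}"          # strip an optional leading "v"
    major="${version%%.*}"    # keep the part before the first dot
    case "$major" in
        17) echo quincy ;;
        18) echo reef ;;
        19) echo squid ;;
        *) echo unknown; return 1 ;;
    esac
)

# Succeed if the tag ($1) belongs to the release series ($2).
tag_matches_release() {
    [ "$(release_for_tag "$1")" = "$2" ]
}
```

For example, to check a candidate tag against the squid series before setting the variables:

tag_matches_release v19.2.3 squid || echo "tag does not match release series"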

Upgrading Host Packages

Prior to upgrading the Ceph storage cluster, it may be desirable to upgrade system packages on the hosts.

Note that these commands do not affect packages installed in containers, only those installed on the host.

In order to avoid downtime, it is important to control how package updates are rolled out. In general, Ceph monitor hosts should be updated one by one. For Ceph OSD hosts it may be possible to update packages in batches of hosts, provided there is sufficient capacity to maintain data availability.

For each host or batch of hosts, perform the following steps.

Place the host or batch of hosts into maintenance mode:

kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/ceph/ceph-enter-maintenance.yml -l <host>

To update all eligible packages, use *, escaping if necessary:

kayobe overcloud host package update --packages "*" --limit <host>

If the kernel has been upgraded, reboot the host or batch of hosts to pick up the change. While running this playbook, consider setting ANSIBLE_SERIAL to the maximum number of hosts that can safely reboot concurrently.

kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/maintenance/reboot.yml -l <host>

Remove the host or batch of hosts from maintenance mode:

kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/ceph/ceph-exit-maintenance.yml -l <host>

Wait for Ceph health to return to HEALTH_OK:

ceph -s
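Rather than re-running ceph -s by hand, the wait can be scripted. This is a minimal sketch, assuming it runs where the ceph CLI is available (for example inside cephadm shell). The function name wait_for_health_ok and its default timeout and interval are hypothetical. It polls ceph health, whose output begins with the overall status (HEALTH_OK, HEALTH_WARN or HEALTH_ERR).

```shell
# Hypothetical helper: poll "ceph health" until it reports HEALTH_OK.
# $1: timeout in seconds (default 1800)
# $2: seconds between polls (default 30)
# Succeeds once the cluster is healthy, fails if the timeout is reached.
wait_for_health_ok() (
    timeout="${1:-1800}"
    interval="${2:-30}"
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        case "$(ceph health 2>/dev/null)" in
            HEALTH_OK*) return 0 ;;
        esac
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 1
)
```

For example, to wait up to ten minutes, checking every 15 seconds:

wait_for_health_ok 600 15 || echo "cluster did not reach HEALTH_OK in time"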

Wait for Prometheus alerts and errors in OpenSearch Dashboards to resolve, or address them.

Once happy that the system has been restored to full health, move on to the next host or batch of hosts.
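The Kayobe steps for one host or batch can be wrapped in a single function so that they always run in the same order. The following is a sketch only: the function name update_ceph_host and its "reboot" argument are hypothetical, and KAYOBE_CONFIG_PATH is assumed to be set as in the commands above. It stops at the first failing step, which deliberately leaves the host in maintenance mode for inspection. The health checks described above still need to be made before moving on.

```shell
# Hypothetical wrapper around the per-host Kayobe steps.
# $1: host or group to limit to
# $2: pass "reboot" to include the reboot step (e.g. after a kernel upgrade)
# Stops at the first failing step, leaving the host in maintenance mode.
update_ceph_host() (
    host="$1"
    playbooks="$KAYOBE_CONFIG_PATH/ansible"
    kayobe playbook run "$playbooks/ceph/ceph-enter-maintenance.yml" -l "$host" || return 1
    kayobe overcloud host package update --packages "*" --limit "$host" || return 1
    if [ "${2:-}" = "reboot" ]; then
        kayobe playbook run "$playbooks/maintenance/reboot.yml" -l "$host" || return 1
    fi
    kayobe playbook run "$playbooks/ceph/ceph-exit-maintenance.yml" -l "$host"
)
```

For example, for a host that needs a reboot:

update_ceph_host <host> reboot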

Sync container images

If using the local Pulp server to host Ceph images (stackhpc_sync_ceph_images is true), sync the new Ceph images into the local Pulp:

kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-container-{sync,publish}.yml -e stackhpc_pulp_images_kolla_filter=none

Upgrade Ceph services

Start the upgrade. If using the local Pulp server to host Ceph images:

sudo cephadm shell -- ceph orch upgrade start --image <registry>/ceph/ceph:<tag>

Otherwise:

sudo cephadm shell -- ceph orch upgrade start --image quay.io/ceph/ceph:<tag>

The tag should match the cephadm_image_tag variable set in preparation. The registry should be the address and port of the local Pulp server.

Check the upgrade status:

ceph orch upgrade status
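The status check can also be polled until the orchestrator stops reporting an upgrade in progress. This sketch assumes that the status command prints JSON containing an "in_progress" field. Verify this against the output of your Ceph version before relying on it. The function names upgrade_in_progress and wait_for_upgrade are hypothetical.

```shell
# Hypothetical helper: succeed if the JSON on stdin reports an upgrade
# in progress. Assumes the status output contains an "in_progress" field.
upgrade_in_progress() {
    grep -Eq '"in_progress"[[:space:]]*:[[:space:]]*true'
}

# Poll the orchestrator until no upgrade is reported in progress.
# $1: seconds between checks (default 60)
wait_for_upgrade() {
    while ceph orch upgrade status | upgrade_in_progress; do
        sleep "${1:-60}"
    done
}
```

The loop ending does not by itself show that the upgrade succeeded, so the health and log checks that follow still apply.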

Wait for Ceph health to return to HEALTH_OK:

ceph -s

Watch the cephadm logs:

ceph -W cephadm

After completing the upgrade to Squid, Ceph may show the following warning in the output of ceph -s:

all OSDs are running squid or later but require_osd_release < squid

To resolve this, first verify that all OSDs were upgraded to Squid with ceph versions. Once confirmed, run the following command:

ceph osd require-osd-release squid

Finally, verify the value of ceph osd get-require-min-compat-client. On older Ceph deployments, it may still be set to jewel, which prevents use of the upmap balancer mode, as that requires luminous or later. Similarly, the more recent read balancer requires reef.

Run ceph features to identify client versions and consider setting the minimum to an appropriate value:

ceph osd set-require-min-compat-client reef
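Whether a given minimum client release is new enough for a feature comes down to the order of Ceph release names. The sketch below encodes that order from jewel to squid. The function names release_index and release_at_least are hypothetical, and the usage example assumes that ceph osd get-require-min-compat-client prints a bare release name.

```shell
# Hypothetical helper: print the position of a Ceph release name in
# release order, or fail for an unknown name.
release_index() (
    i=0
    for name in jewel kraken luminous mimic nautilus octopus pacific quincy reef squid; do
        i=$((i + 1))
        if [ "$name" = "$1" ]; then
            echo "$i"
            return 0
        fi
    done
    return 1
)

# Succeed if release $1 is the same as or newer than release $2.
release_at_least() (
    have="$(release_index "$1")" || return 1
    want="$(release_index "$2")" || return 1
    [ "$have" -ge "$want" ]
)
```

For example, to check whether the current setting permits the upmap balancer mode:

release_at_least "$(ceph osd get-require-min-compat-client)" luminous || echo "upmap balancer mode unavailable"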

Upgrade Cephadm

Update the Cephadm package:

kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/ceph/cephadm-deploy.yml -e cephadm_package_update=true

Testing

At this point it is recommended to perform a thorough test of the system to catch any unexpected issues. This may include:

  • Check Prometheus, OpenSearch Dashboards and Grafana

  • Smoke tests

  • All applicable Tempest tests

  • Horizon UI inspection

Cleaning up

Prune unused container images:

kayobe overcloud host command run -b --command "docker image prune -a -f" -l ceph