Migrating to Rocky Linux 9
Overview
This document describes how to migrate systems from CentOS Stream 8 to Rocky Linux 9. This procedure must be performed on CentOS Stream 8 OpenStack Yoga systems before it is possible to upgrade to OpenStack Zed. It is possible to perform a rolling migration to ensure service is not disrupted. This section covers the steps required to perform such a migration.
For hosts running CentOS Stream 8, the migration process has a simple structure:
Remove a CentOS Stream 8 host from service
Reprovision the host with a Rocky Linux 9 image
Configure and deploy the host with Rocky Linux 9 containers
While it is technically possible to migrate hosts in any order, it is strongly recommended that migrations for one type of node are completed before moving on to the next, i.e. all compute node migrations are performed before any storage node migrations begin.
The order of node groups is less important; however, it is arguably safest to perform controller node migrations first, given that they are the most complex and it is easiest to revert their state in the event of a failure.
This guide covers the following types of hosts:
Controllers
Compute hosts
Storage hosts
Seed
The following types of hosts will be covered in future:
Seed hypervisor
Ansible control host
Wazuh manager
Resources
Necessary patches
This section lists some patches that are necessary for the migration to complete successfully.
The following patches have been merged to the downstream StackHPC stackhpc/yoga branches:
https://review.opendev.org/c/openstack/kayobe/+/898563 (to fix kayobe overcloud deprovision)
https://review.opendev.org/c/openstack/kayobe/+/898284 (if deployment predates Ussuri)
https://review.opendev.org/c/openstack/kayobe/+/898434 (if seeing slow fact gathering)
https://review.opendev.org/c/openstack/kolla-ansible/+/900034
https://review.opendev.org/c/openstack/kolla-ansible/+/897667
Configuration
Make the following changes to your Kayobe configuration:
Merge in the latest stackhpc-kayobe-config stackhpc/yoga branch.
Set os_distribution to rocky in etc/kayobe/globals.yml.
Set os_release to "9" in etc/kayobe/globals.yml.
Consider using a prebuilt overcloud host image or building an overcloud host image using the standard configuration.
If you are using Kayobe multiple environments, add the following into kayobe-config/etc/kayobe/environments/<env>/kolla/config/nova.conf (as Kolla custom service config environment merging is not supported in Yoga). See this PR for details.
[libvirt]
hw_machine_type = x86_64=q35
num_pcie_ports = 16
This change does not need to be applied before migrating to Rocky Linux 9, but it is likely the best time to do so.
Warning
This change will cause the interface names to change on any new VMs launched with images that do not specify a hw_machine_type already. Existing VMs will not be affected, but a rebuild will have the names changed. Customers should be informed of this in case they have any tooling that relies on interface names within their VMs.
Routing rules
Routing rules referencing tables by name may need adapting to be compatible with NetworkManager, e.g.:
undercloud_prov_rules:
  - from {{ internal_net_name | net_cidr }} table ironic-api
will need to be updated to use numeric IDs:
undercloud_prov_rules:
  - from {{ internal_net_name | net_cidr }} table 1
The error from NetworkManager was:
[1697192659.9611] keyfile: ipv4.routing-rules: invalid value for "routing-rule1": invalid value for "table"
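To pick the numeric ID that corresponds to a table name, the iproute2 table map on an unmigrated host can be consulted. The following is a minimal sketch, assuming the names are defined in /etc/iproute2/rt_tables; the helper name rt_table_id is hypothetical and not part of Kayobe.

```shell
# Print the numeric ID of a named routing table (hypothetical helper).
# Usage: rt_table_id <name> [rt_tables file]
rt_table_id() {
    local name="$1" file="${2:-/etc/iproute2/rt_tables}"
    # Ignore comment lines, then match the table name in the second column.
    awk -v name="$name" '!/^[[:space:]]*#/ && $2 == name { print $1; found = 1; exit }
                         END { exit !found }' "$file"
}

# Example (run on a CentOS Stream 8 host that still has the named table):
#   rt_table_id ironic-api
```

The function exits non-zero when the name is not found, so it can be used as a guard before editing the routing rules.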
Updating the IPA kernel URL
If the enrolment of the overcloud nodes in Bifrost predates Ussuri, the deploy_kernel configuration probably still points to the old ipa.vmlinuz file, resulting in the following error in Bifrost:
Failed to prepare to deploy: Validation of image href http://10.161.0.3:8080/ipa.vmlinuz failed, reason: Got HTTP code 404 instead of 200 in response to HEAD request.
Switch the deployment kernel to ipa.kernel:
(bifrost-deploy) OS_CLOUD=bifrost baremetal node set <node> --driver-info deploy_kernel=http://<seed-ip>:8080/ipa.kernel
Alternatively, the node inspection data can be reprocessed, but this may erase any manual configuration changes applied since the last inspection:
(bifrost-deploy) OS_CLOUD=bifrost baremetal introspection reprocess <node>
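The error above comes from a HEAD request that did not return HTTP 200, so the same check can be made by hand before changing deploy_kernel. This is a rough sketch using curl; the function name url_is_available is hypothetical, and it assumes curl is installed wherever the check is run.

```shell
# Succeed only if a HEAD request for the URL returns HTTP 200 (hypothetical helper).
url_is_available() {
    local code
    # --head sends a HEAD request; --write-out prints only the status code.
    code=$(curl --silent --head --output /dev/null --write-out '%{http_code}' "$1")
    [ "$code" = "200" ]
}

# Example:
#   url_is_available "http://<seed-ip>:8080/ipa.kernel" && echo "kernel image is served"
```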
Switching to iPXE
The pxe boot_interface is currently broken. When provisioning, you will see an error similar to:
Failed to prepare to deploy: Could not link image http://192.168.1.1:8080/ipa.vmlinuz from /httpboot/master_images/99d5b4b4-0420-578a-a327-acd88c1f1ff6.converted to /tftpboot/d6673eaa-17a4-4cd4-a4e7-8e8cbd4fca31/deploy_kernel, error: [Errno 18] Invalid cross-device link: '/httpboot/master_images/99d5b4b4-0420-578a-a327-acd88c1f1ff6.converted' -> '/tftpboot/d6673eaa-17a4-4cd4-a4e7-8e8cbd4fca31/deploy_kernel'
After deprovisioning a node, switch the boot interface to iPXE:
openstack baremetal node set <node> --boot-interface ipxe
Prerequisites
Before starting the upgrade, ensure any appropriate prerequisites are satisfied. These will be specific to each deployment, but here are some suggestions:
Ensure that there is sufficient hypervisor capacity to drain at least one node.
If using Ironic for bare metal compute, ensure that at least one node is available for testing provisioning.
Ensure that expected test suites are passing, e.g. Tempest.
Resolve any Prometheus alerts.
Check for unexpected ERROR or CRITICAL messages in Kibana/OpenSearch Dashboard.
Check Grafana dashboards.
Disable Ansible fact caching for the duration of the migration, or remember to clear hosts from the fact cache after they have been reprovisioned.
Migrate to OpenSearch
Elasticsearch/Kibana should be migrated to OpenSearch.
If necessary, take a backup of the Elasticsearch data.
Ensure kolla_enable_elasticsearch is set to false in etc/kayobe/kolla.yml
If you have a custom Kolla Ansible inventory, ensure that it contains the opensearch and opensearch-dashboards groups. Otherwise, sync with the inventory in Kayobe.
Set kolla_enable_opensearch: true in etc/kayobe/kolla.yml
kayobe overcloud service configuration generate --node-config-dir '/tmp/ignore' --kolla-tags none
kayobe overcloud container image pull -kt opensearch
kayobe kolla ansible run opensearch-migration
If old indices are detected, they may be removed by running
kayobe kolla ansible run opensearch-migration -ke prune_kibana_indices=true
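Before running the migration commands above, it may be worth confirming that both flags are set as described. The sketch below assumes each flag appears as a plain top-level "key: value" line in etc/kayobe/kolla.yml with no trailing comment; the function names are hypothetical.

```shell
# Print the value of a simple top-level "key: value" line (hypothetical helper).
yaml_flag() {
    awk -F': *' -v key="$1" '$1 == key { print $2; exit }' "$2"
}

# Succeed only if Elasticsearch is disabled and OpenSearch is enabled.
check_opensearch_flags() {
    local file="$1"
    [ "$(yaml_flag kolla_enable_elasticsearch "$file")" = "false" ] &&
    [ "$(yaml_flag kolla_enable_opensearch "$file")" = "true" ]
}

# Example:
#   check_opensearch_flags "$KAYOBE_CONFIG_PATH/kolla.yml" || echo "check kolla.yml"
```

This is a plain text match rather than a YAML parse, so values set through environments or Jinja expressions will not be detected.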
Sync Release Train artifacts
New StackHPC Release Train content should be synced to the local Pulp server. This includes host packages (Deb/RPM) and container images.
To sync host packages:
kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-sync.yml
kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-publish.yml
Once the host package content has been tested in a test/staging environment, it may be promoted to production:
kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-promote-production.yml
To sync container images:
kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-container-sync.yml
kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-container-publish.yml
Build locally customised container images
Note
The container images provided by StackHPC Release Train are suitable for most deployments. In this case, this step can be skipped.
In some cases it is necessary to build some or all images locally to apply customisations. In order to do this it is necessary to set stackhpc_pulp_sync_for_local_container_build to true before syncing container images.
To build the overcloud images locally and push them to the local Pulp server:
kayobe overcloud container image build --push
It is possible to build a specific set of images by supplying one or more image name regular expressions:
kayobe overcloud container image build --push ironic- nova-api
Deploy latest CentOS Stream 8 images
Make sure you deploy the latest CentOS Stream 8 containers prior to this migration:
kayobe overcloud service deploy
Controllers
Migrate controllers one by one, ideally migrating the host with the Virtual IP (VIP) last.
Potential issues
MariaDB had serious issues one time during testing, after the first controller was migrated. The solution in that instance was to restart the container on the two original CS8 hosts. The behaviour has not been observed again when running kayobe overcloud database recover between migrations. It is not certain whether this is a genuine fix or whether the bug simply did not recur during testing.
Issues have been seen when attempting to back up the MariaDB database: mariabackup was segfaulting. This was avoided by reverting to an old MariaDB container image by adding the following in etc/kayobe/kolla/globals.yml:
mariabackup_image_full: "{{ docker_registry }}/stackhpc/rocky-source-mariadb-server:yoga-20230310T170929"
When using Octavia load balancers, restarting Neutron causes load balancers with floating IPs to stop processing traffic. See LP#2042938 for details. The issue may be worked around after Neutron has been restarted by detaching then reattaching the floating IP to the load balancer’s virtual IP.
If you are using hyper-converged Ceph, please also note the potential issues in the Storage section below.
Network interface names may change between CentOS Stream 8 and Rocky Linux 9, in which case you will need to update Kayobe configuration. Note that the configuration should remain correct for hosts not yet migrated, otherwise fact gathering may fail. For example, this can be done using group_vars with a temporary group for the updated hosts or host_vars. Once all hosts are migrated, the change can be moved to the original group’s group_vars and the temporary changes reverted.
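One way to spot renamed interfaces is to record the interface names on a host before and after reprovisioning and compare the two lists. The sketch below assumes the names were saved one per line, for example from ls /sys/class/net; the function name missing_interfaces and the file names are hypothetical.

```shell
# Print interface names that existed before the migration but not afterwards.
# Usage: missing_interfaces <before-file> <after-file>
missing_interfaces() {
    # -x: whole-line match, -F: fixed strings, -f: patterns from file, -v: invert.
    grep -vxFf "$2" "$1"
}

# Example:
#   ls /sys/class/net > before.txt    # on the CentOS Stream 8 host
#   ls /sys/class/net > after.txt     # on the same host after reprovisioning
#   missing_interfaces before.txt after.txt
```

Any name printed is one that Kayobe configuration may still reference for that host.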
Full procedure for one host
If using OVN, check OVN northbound DB cluster state on all controllers:
kayobe overcloud host command run --command 'docker exec -it ovn_nb_db ovs-appctl -t /run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound' --show-output -l controllers
If using OVN, check OVN southbound DB cluster state on all controllers:
kayobe overcloud host command run --command 'docker exec -it ovn_sb_db ovs-appctl -t /run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound' --show-output -l controllers
If the controller is running Ceph services:
Set host in maintenance mode:
ceph orch host maintenance enter <hostname>
Check there’s nothing remaining on the host:
ceph orch ps <hostname>
Deprovision the controller:
kayobe overcloud deprovision -l <hostname>
Reprovision the controller:
kayobe overcloud provision -l <hostname>
Host configure:
kayobe overcloud host configure -l <hostname> -kl <hostname>
If the controller is running Ceph OSD services:
Make sure the cephadm public key is in authorized_keys for the stack or root user, depending on your setup. For example, your SSH key may already be defined in users.yml. If in doubt, run the cephadm deploy playbook to copy the SSH key and install the cephadm binary.
kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/cephadm-deploy.yml
Take the host out of maintenance mode:
ceph orch host maintenance exit <hostname>
Make sure that everything is back in working condition before moving on to the next host:
ceph -s
ceph -w
Service deploy on all controllers:
kayobe overcloud service deploy -kl controllers
If using OVN, check OVN northbound DB cluster state on all controllers to see if the new host has joined:
kayobe overcloud host command run --command 'docker exec -it ovn_nb_db ovs-appctl -t /run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound' --show-output -l controllers
If using OVN, check OVN southbound DB cluster state on all controllers to see if the new host has joined:
kayobe overcloud host command run --command 'docker exec -it ovn_sb_db ovs-appctl -t /run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound' --show-output -l controllers
Some MariaDB instability has been observed. The exact cause is unknown but the simplest fix seems to be to run the Kayobe database recovery tool between migrations.
kayobe overcloud database recover
If you are using Wazuh, you will need to deploy the agent again. Note that CIS benchmarks do not run on RL9 out-the-box. See our Wazuh docs for details.
kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-agent.yml -l <hostname>
After each controller has been migrated you may wish to perform some smoke testing, check for alerts and errors etc.
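The core Kayobe steps of the per-host procedure above can be chained so that a failure stops the sequence. This is only a sketch: it omits the OVN checks, the Ceph maintenance steps and the Wazuh agent deployment, and the run and migrate_controller helpers (and the DRY_RUN variable) are hypothetical names. By default it only prints the commands.

```shell
# Run a command, or just print it when DRY_RUN=1 (the default in this sketch).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Reprovision and redeploy one controller, stopping at the first failure.
migrate_controller() {
    local host="$1"
    run kayobe overcloud deprovision -l "$host" &&
    run kayobe overcloud provision -l "$host" &&
    run kayobe overcloud host configure -l "$host" -kl "$host" &&
    run kayobe overcloud service deploy -kl controllers &&
    run kayobe overcloud database recover
}

# Example (prints the commands; set DRY_RUN=0 to execute them):
#   migrate_controller <hostname>
```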
Compute
Compute nodes can be migrated to Rocky Linux 9 in batches. The possible batches depend on a number of things:
willingness for instance reboots and downtime
available spare hypervisor capacity
sizes of groups of compatible hypervisors
Potential issues
Network interface names may change between CentOS Stream 8 and Rocky Linux 9, in which case you will need to update Kayobe configuration. Note that the configuration should remain correct for hosts not yet migrated, otherwise fact gathering may fail. For example, this can be done using group_vars with a temporary group for the updated hosts or host_vars. Once all hosts are migrated, the change can be moved to the original group’s group_vars and the temporary changes reverted.
Full procedure for one batch of hosts
Disable the Nova compute service and drain it of VMs using live migration. If any VMs fail to migrate, they may be cold migrated or powered off:
kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/nova-compute-{disable,drain}.yml --limit <host>
If the compute node is running Ceph OSD services:
Set host in maintenance mode:
ceph orch host maintenance enter <hostname>
Check there’s nothing remaining on the host:
ceph orch ps <hostname>
Deprovision the compute node:
kayobe overcloud deprovision -l <hostname>
Reprovision the compute node:
kayobe overcloud provision -l <hostname>
If the compute node is using Libvirt on the host and you want to transition to containerized Libvirt:
Update kolla.yml:
kolla_enable_nova_libvirt_container: "{{ inventory_hostname != 'localhost' and ansible_facts.distribution_major_version == '9' }}"
Update kolla/globals.yml:
enable_nova_libvirt_container: "{% raw %}{{ ansible_facts.distribution_major_version == '9' }}{% endraw %}"
Note
These settings are needed only during the migration to Rocky Linux 9, while CentOS Stream 8 or Rocky Linux 8 hosts with Libvirt on the host still exist in the environment.
Host configure:
kayobe overcloud host configure -l <hostname> -kl <hostname>
If the compute node is running Ceph OSD services:
Make sure the cephadm public key is in authorized_keys for the stack or root user, depending on your setup. For example, your SSH key may already be defined in users.yml. If in doubt, run the cephadm deploy playbook to copy the SSH key and install the cephadm binary.
kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/cephadm-deploy.yml
Take the host out of maintenance mode:
ceph orch host maintenance exit <hostname>
Make sure that everything is back in working condition before moving on to the next host:
ceph -s
ceph -w
Service deploy:
kayobe overcloud service deploy -kl <hostname>
If you are using Wazuh, you will need to deploy the agent again. Note that CIS benchmarks do not run on RL9 out-the-box. See our Wazuh docs for details.
kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-agent.yml -l <hostname>
Restore the system to full health.
If any VMs were powered off, they may now be powered back on.
Wait for Prometheus alerts and errors in OpenSearch Dashboard to resolve, or address them.
Once happy that the system has been restored to full health, enable the hypervisor in Nova if it is still disabled and then move on to the next host or batch of hosts.
kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/nova-compute-enable.yml --limit <hostname>
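Before deprovisioning a compute node in the procedure above, it can be useful to confirm that the drain step really left no instances behind. The following sketch assumes admin credentials for the overcloud are loaded in the environment; host_is_drained is a hypothetical helper name.

```shell
# Succeed only if Nova reports no instances on the given hypervisor.
host_is_drained() {
    local servers
    # Listing by host across all projects requires admin credentials.
    servers=$(openstack server list --all-projects --host "$1" -f value -c ID) || return 1
    [ -z "$servers" ]
}

# Example:
#   host_is_drained <hostname> || echo "instances remain; do not deprovision yet"
```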
Storage
Potential issues
The procedure for the bootstrap host and the other ceph hosts should be identical, now that the “maintenance mode approach” is being used. It is still recommended to do the bootstrap host last.
Prior to reprovisioning the bootstrap host, it can be beneficial to back up /etc/ceph and /var/lib/ceph, as sometimes the keys, config, etc. stored here will not be moved/recreated correctly.
When a host is taken out of maintenance, you may see errors relating to permissions of /tmp/etc and /tmp/var. These issues should be resolved in Ceph version 17.2.7. See issue: https://github.com/ceph/ceph/pull/50736. In the meantime, you can work around this by running the command below. You may need to omit one or the other of /tmp/etc and /tmp/var. You will likely need to run this multiple times. Run ceph -W cephadm to monitor the logs and see when permissions issues are hit.
kayobe overcloud host command run --command "chown -R stack:stack /tmp/etc /tmp/var" -b -l storage
It has been seen that sometimes the Ceph containers do not come up after reprovisioning. This seems to be related to having /var/lib/ceph persisted through the reprovision (e.g. seen at a customer in a volume with software RAID). (Note: further investigation is needed for the root cause). When this occurs, you will need to redeploy the daemons:
List the daemons on the host:
ceph orch ps <hostname>
Redeploy the daemons, one at a time. It is recommended that you start with the crash daemon, as this will have the least impact if unexpected issues occur.
ceph orch daemon redeploy <daemon name>
Commands starting with ceph are all run on the cephadm bootstrap host in a cephadm shell unless stated otherwise.
Network interface names may change between CentOS Stream 8 and Rocky Linux 9, in which case you will need to update Kayobe configuration. Note that the configuration should remain correct for hosts not yet migrated, otherwise fact gathering may fail. For example, this can be done using group_vars with a temporary group for the updated hosts or host_vars. Once all hosts are migrated, the change can be moved to the original group’s group_vars and the temporary changes reverted.
Full procedure for any storage host
Set host in maintenance mode:
ceph orch host maintenance enter <hostname>
Check there’s nothing remaining on the host:
ceph orch ps <hostname>
Deprovision the storage node:
kayobe overcloud deprovision -l <hostname>
Reprovision the storage node:
kayobe overcloud provision -l <hostname>
Host configure:
kayobe overcloud host configure -l <hostname> -kl <hostname>
Make sure the cephadm public key is in authorized_keys for the stack or root user, depending on your setup. For example, your SSH key may already be defined in users.yml. If in doubt, run the cephadm deploy playbook to copy the SSH key and install the cephadm binary.
kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/cephadm-deploy.yml
Take the host out of maintenance mode:
ceph orch host maintenance exit <hostname>
Make sure that everything is back in working condition before moving on to the next host:
ceph -s
ceph -w
Deploy any services that are required, such as Prometheus exporters.
kayobe overcloud service deploy -kl <hostname>
If you are using Wazuh, you will need to deploy the agent again. Note that CIS benchmarks do not run on RL9 out-the-box. See our Wazuh docs for details.
kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-agent.yml -l <hostname>
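Rather than watching ceph -s by hand after a host leaves maintenance mode, the wait can be scripted. This is a minimal sketch, assuming it runs where the ceph command is available (for example in a cephadm shell on the bootstrap host) and that HEALTH_OK is the state you are waiting for; a cluster with unrelated warnings will never satisfy it. The function name wait_for_ceph_health is hypothetical.

```shell
# Poll "ceph health" until it reports HEALTH_OK or the attempts run out.
# Usage: wait_for_ceph_health [attempts] [delay-seconds]
wait_for_ceph_health() {
    local attempts="${1:-60}" delay="${2:-10}" i
    for i in $(seq "$attempts"); do
        case "$(ceph health 2>/dev/null)" in
            HEALTH_OK*) return 0 ;;
        esac
        sleep "$delay"
    done
    return 1
}

# Example: wait up to 10 minutes before moving on to the next host.
#   wait_for_ceph_health 60 10 || echo "cluster not healthy yet"
```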
Seed
Potential issues
The process depends a lot on the structure of the seed’s volumes. By default two volumes are created (one root volume and one data volume), however only the root volume is actually used. Most deployments have this behaviour overridden so that both the volumes are used and either /var/lib/docker or /var/lib/docker/volumes is mounted to the data volume. This setup makes it considerably easier to migrate the seed, as the root volume can be deleted and the seed can be reprovisioned, leaving the data volume intact throughout. If the deployment is using the default setup, and nothing is stored in the data volume, the first step should be to back up either the docker volumes or the entire docker directory. This should then be restored to the seed after seed host configure.
The mariadb process within the bifrost_deploy container needs to be gracefully stopped. mariadb can’t boot a newer version if the previous version stopped with an error.
Full procedure
On the seed, check the LVM configuration:
lsblk
Use mysqldump to take a backup of the MariaDB database. Copy the backup file to one of the Bifrost container’s persistent volumes, such as /var/lib/ironic/ in the bifrost_deploy container.
If the data volume is not mounted at either /var/lib/docker or /var/lib/docker/volumes, make an external copy of the data somewhere on the seed hypervisor.
On the seed, stop the MariaDB process within the bifrost_deploy container:
sudo docker exec bifrost_deploy systemctl stop mariadb
On the seed, stop docker:
sudo systemctl stop docker
On the seed, shut down the host:
sudo systemctl poweroff
Wait for the VM to shut down:
watch sudo virsh list --all
Back up the VM volumes on the seed hypervisor
sudo mkdir /var/lib/libvirt/images-backup
sudo cp -r /var/lib/libvirt/images /var/lib/libvirt/images-backup
Delete the seed root volume and the configdrive (check the structure & naming conventions first)
sudo virsh vol-delete seed-root --pool default
sudo virsh vol-delete seed-configdrive --pool default
Reprovision the seed
kayobe seed vm provision
Seed host configure
kayobe seed host configure
Rebuild seed container images (if using locally-built rather than release train images)
kayobe seed container image build --push
Service deploy
kayobe seed service deploy
Verify that Bifrost/Ironic is healthy.
If you are using Wazuh, you will need to deploy the agent again. Note that CIS benchmarks do not run on RL9 out-the-box. See our Wazuh docs for details.
kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-agent.yml -l <hostname>
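The procedure above advises checking volume structure and naming before deleting the seed root volume and configdrive. A small sketch of such a check follows, run on the seed hypervisor; it assumes the volumes live in the default pool as in the commands above, and pool_has_volume is a hypothetical helper name.

```shell
# Succeed only if the named volume is listed in the given libvirt storage pool.
# Usage: pool_has_volume <volume-name> [pool]
pool_has_volume() {
    # virsh vol-list prints a table whose first column is the volume name.
    sudo virsh vol-list --pool "${2:-default}" |
        awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Example:
#   pool_has_volume seed-root && pool_has_volume seed-configdrive
```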
Seed hypervisor
TODO
Ansible control host
TODO
Wazuh manager
TODO
In-place upgrades
Sometimes it is necessary to upgrade a system in-place. This may be the case for the seed hypervisor or Ansible control host which are often installed manually onto bare metal. This procedure is not officially recommended, and can be risky, so be sure to back up all critical data and ensure serial console access is available (including password login) in case of getting locked out.
The procedure is performed in two stages:
Migrate from CentOS Stream 8 to Rocky Linux 8
Upgrade from Rocky Linux 8 to Rocky Linux 9
Potential issues
Full procedure
Inspect existing DNF packages and determine whether they are really required.
Use the migrate2rocky.sh script to migrate to Rocky Linux 8.
Disable all DNF modules - they’re no longer used.
sudo dnf module disable "*"
Migrate to NetworkManager. This can be done using a manual process or with Kayobe.
The manual process is as follows:
Ensure that all network interfaces are managed by NetworkManager:
sudo sed -i -e 's/NM_CONTROLLED=no/NM_CONTROLLED=yes/g' /etc/sysconfig/network-scripts/*
Enable and start NetworkManager:
sudo systemctl enable NetworkManager
sudo systemctl start NetworkManager
Migrate Ethernet connections to native NetworkManager configuration:
sudo nmcli connection migrate
Manually migrate non-Ethernet (bonds, bridges & VLAN subinterfaces) network interfaces to native NetworkManager.
Look out for lost DNS configuration after migration to NetworkManager. This may be manually restored using something like this:
nmcli con mod System\ brextmgmt.3003 ipv4.dns "10.41.4.4 10.41.4.5 10.41.4.6"
The following Kayobe process for migrating to NetworkManager has not yet been tested.
Set interfaces_use_nmconnection: true as a host/group variable for the relevant hosts.
Run the appropriate host configure command. For example, for the seed hypervisor:
kayobe seed hypervisor host configure -t network -kt none
Make sure there are no unexpected udev rules left in /etc/udev/rules.d/70-persistent-net.rules (e.g. from cloud-init run on Rocky 9.1).
Inspect networking configuration at this point, ideally reboot to validate correctness.
Upgrade to Rocky Linux 9
Install Rocky Linux 9 repositories and GPG keys:
sudo dnf install -y https://download.rockylinux.org/pub/rocky/9/BaseOS/x86_64/os/Packages/r/rocky-gpg-keys-9.2-1.6.el9.noarch.rpm \
    https://download.rockylinux.org/pub/rocky/9/BaseOS/x86_64/os/Packages/r/rocky-release-9.2-1.6.el9.noarch.rpm \
    https://download.rockylinux.org/pub/rocky/9/BaseOS/x86_64/os/Packages/r/rocky-repos-9.2-1.6.el9.noarch.rpm
Remove the RedHat logos package:
sudo rm -rf /usr/share/redhat-logos
Synchronise all packages with current versions
sudo dnf --releasever=9 --allowerasing --setopt=deltarpm=false distro-sync -y
Rebuild the RPM database:
sudo rpm --rebuilddb
Make a list of EL8 packages to remove:
sudo rpm -qa | grep el8 > el8-packages
Inspect the el8-packages list and ensure only expected packages are included.
Remove the EL8 packages:
cat el8-packages | xargs sudo dnf remove -y
You will need to re-create all virtualenvs afterwards due to system Python version upgrade.
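Because the in-place procedure runs in two stages, it can help to confirm which major release the host reports after each stage. The sketch below reads the standard /etc/os-release file; os_major_version is a hypothetical helper name.

```shell
# Print the major version from an os-release file (default: /etc/os-release).
os_major_version() {
    # Source the file in a subshell so its variables do not leak into the caller.
    ( . "${1:-/etc/os-release}" && echo "${VERSION_ID%%.*}" )
}

# Example: expect 8 after the migrate2rocky.sh stage and 9 after distro-sync.
#   [ "$(os_major_version)" = "9" ] && echo "now on release 9"
```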