Operating Control Plane

Backup of the OpenStack Control Plane

As the backup procedure changes frequently, it is normally best to check the upstream documentation for an up-to-date procedure. Here is a high-level overview of the key things you need to back up:

Controllers

Compute

The compute nodes can largely be thought of as ephemeral, but you do need to make sure you have migrated any instances and disabled the hypervisor before rebooting, decommissioning or making any disruptive configuration change.

Monitoring

Seed

Ansible control host

  • Back up service VMs such as the seed VM

Control Plane Monitoring

This section is a user guide to monitoring the control plane. To see how to configure the monitoring services, read Monitoring Configuration.

The control plane has been configured to collect logs centrally using Fluentd, OpenSearch and OpenSearch Dashboards.

Telemetry monitoring of the control plane is performed by Prometheus. Metrics are collected by Prometheus exporters, which run either on all hosts (e.g. node exporter) or on specific hosts (e.g. controllers for the memcached exporter, or monitoring hosts for the OpenStack exporter). These exporters are scraped by the Prometheus server.

Configuring Prometheus Alerts

Alerts are defined in code and stored in Kayobe configuration. Use the *.rules files in $KAYOBE_CONFIG_PATH/kolla/config/prometheus as a model when adding custom rules.
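As a rough illustration of what a custom rule could look like, the sketch below writes a small rules file. The file name custom.rules, the group name, the alert name HypotheticalHighLoad and the threshold are all assumptions made for this example and are not part of the existing configuration; node_load5 is a standard node exporter metric. Compare the result with the existing *.rules files before relying on it.

```shell
#!/bin/sh
# Sketch only: writes a hypothetical Prometheus alerting rule file.
# File name, group name, alert name and threshold are illustrative assumptions.

write_example_rule() {
    # $1: directory that holds the *.rules files
    rules_dir="$1"
    mkdir -p "$rules_dir"
    cat > "$rules_dir/custom.rules" <<'EOF'
groups:
  - name: custom
    rules:
      - alert: HypotheticalHighLoad
        expr: node_load5 > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "5 minute load average has been above 10 for 10 minutes"
EOF
}

# Example (not run here):
#   write_example_rule "$KAYOBE_CONFIG_PATH/kolla/config/prometheus"
```

The Prometheus service would then need to be reconfigured for the new rule to take effect.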

Silencing Prometheus Alerts

Sometimes alerts must be silenced because the root cause cannot be resolved right away, such as when hardware is faulty. For example, an unreachable hypervisor will produce several alerts:

  • InstanceDown from Node Exporter

  • OpenStackServiceDown from the OpenStack exporter, which reports status of the nova-compute agent on the host

  • PrometheusTargetMissing from several Prometheus exporters

Rather than silencing each alert one by one for a specific host, a single silence can cover multiple alerts by using a reduced list of labels. Log into Alertmanager, click the Silence button next to an alert and adjust the matcher list to keep only the instance=<hostname> label. Then create a second silence matching hostname=<hostname>. This is required because, for the OpenStack exporter, the instance label is the host running the monitoring service rather than the host being monitored.
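The same two silences could also be created from the command line with amtool, the Alertmanager CLI. The sketch below only prints the commands so they can be reviewed first. The ALERTMANAGER_URL variable, its default value, the duration and the comment text are assumptions for illustration, and any authentication your Alertmanager requires is not handled here.

```shell
#!/bin/sh
# Sketch only: prints the two amtool commands that would silence a host.
# ALERTMANAGER_URL, the duration and the comment text are assumptions.

print_host_silences() {
    # $1: hostname as it appears in the alert labels
    # $2: silence duration understood by amtool (defaults to 24h)
    host="$1"
    duration="${2:-24h}"
    url="${ALERTMANAGER_URL:-http://localhost:9093}"
    # One silence per label: instance=<hostname>, then hostname=<hostname>.
    for label in instance hostname; do
        printf 'amtool --alertmanager.url=%s silence add %s=%s --duration=%s --comment=%s\n' \
            "$url" "$label" "$host" "$duration" "'Faulty hardware on $host'"
    done
}

# Example: print the commands for compute0 with a 48 hour silence.
print_host_silences compute0 48h
```

Once the printed commands look right, they can be piped to sh to create the silences.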

Control Plane Shutdown Procedure

Overview

  • Verify integrity of clustered components (RabbitMQ, Galera, Keepalived). They should all report a healthy status.

  • Put the node into maintenance mode in Bifrost to prevent it from automatically powering back on

  • Shut down nodes gracefully, one at a time, using systemctl poweroff

Controllers

If you are restarting the controllers, it is best to do this one controller at a time to avoid the clustered components losing quorum.

Checking Galera state

On each controller perform the following:

[stack@controller0 ~]$ docker exec -i mariadb mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_local_state_comment'"
Variable_name   Value
wsrep_local_state_comment       Synced

The password can be found using:

ansible-vault view $KAYOBE_CONFIG_PATH/kolla/passwords.yml \
        --vault-password-file <Vault password file path> | grep ^database
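When checking several controllers, it may be convenient to turn the status output into a pass or fail result. The helper below is a minimal sketch: the function name galera_is_synced is hypothetical, and it assumes the two-column output format shown above.

```shell
#!/bin/sh
# Sketch only: succeeds when the SHOW STATUS output on stdin reports Synced.
# Assumes the two-column "Variable_name Value" format shown above.

galera_is_synced() {
    awk '$1 == "wsrep_local_state_comment" { state = $2 }
         END { exit (state == "Synced" ? 0 : 1) }'
}

# Example (not run here): pipe the status command into the helper.
#   docker exec -i mariadb mysql -u root -p \
#       -e "SHOW STATUS LIKE 'wsrep_local_state_comment'" \
#       | galera_is_synced && echo "Galera is synced"
```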

Checking RabbitMQ

RabbitMQ health is determined using the command rabbitmqctl cluster_status:

[stack@controller0 ~]$ docker exec rabbitmq rabbitmqctl cluster_status

Cluster status of node rabbit@controller0 ...
[{nodes,[{disc,['rabbit@controller0','rabbit@controller1',
                'rabbit@controller2']}]},
 {running_nodes,['rabbit@controller1','rabbit@controller2',
                 'rabbit@controller0']},
 {cluster_name,<<"rabbit@controller2">>},
 {partitions,[]},
 {alarms,[{'rabbit@controller1',[]},
          {'rabbit@controller2',[]},
          {'rabbit@controller0',[]}]}]
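In the output above, the key things to look for are that every node appears under running_nodes and that the partitions list is empty. As a minimal sketch, the hypothetical helper below checks only the second condition. It assumes the Erlang-term output format shown above; other RabbitMQ releases may print the cluster status in a different layout, in which case this check would not apply.

```shell
#!/bin/sh
# Sketch only: succeeds when cluster_status output on stdin shows no partitions.
# Assumes the Erlang-term format shown above, i.e. a literal {partitions,[]}.

rabbitmq_no_partitions() {
    # Remove whitespace so the match does not depend on line wrapping.
    tr -d ' \n\t' | grep -qF '{partitions,[]}'
}

# Example (not run here):
#   docker exec rabbitmq rabbitmqctl cluster_status | rabbitmq_no_partitions \
#       && echo "no partitions reported"
```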

Checking Keepalived

On (for example) three controllers:

[stack@controller0 ~]$ docker logs keepalived

Two instances should show:

VRRP_Instance(kolla_internal_vip_51) Entering BACKUP STATE

and the other:

VRRP_Instance(kolla_internal_vip_51) Entering MASTER STATE
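Because the logs can contain several state transitions, only the most recent one reflects the current role. The helper below is a minimal sketch that extracts it; the function name keepalived_last_state is hypothetical, and it assumes log lines of the form shown above.

```shell
#!/bin/sh
# Sketch only: prints the most recent VRRP state found in logs read from stdin.
# Assumes log lines containing "Entering <STATE> STATE" as shown above.

keepalived_last_state() {
    # Keep the state word from each matching line, then take the last one.
    sed -n 's/.*Entering \([A-Z]*\) STATE.*/\1/p' | tail -n 1
}

# Example (not run here). docker logs relays the container's stderr on
# stderr, so merge both streams before filtering:
#   docker logs keepalived 2>&1 | keepalived_last_state
```

Running this on each controller should give MASTER on one and BACKUP on the others.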

Ansible Control Host

The Ansible control host is not enrolled in Bifrost. This node may run services such as the seed virtual machine, which will need to be gracefully powered down.

Compute

If you are shutting down a single hypervisor, it is advisable to migrate all of its instances to another machine to avoid downtime for tenants. See Evacuating all instances.

Ceph

The following guide provides a good overview: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/8/html/director_installation_and_usage/sect-rebooting-ceph

Shutting down the seed VM

virsh shutdown <Seed hostname>

Full shutdown

If a full shutdown of the system is required, we advise using the following order:

  • Perform a graceful shutdown of all virtual machine instances

  • Shut down compute nodes

  • Shut down monitoring node (if separate from controllers)

  • Shut down network nodes (if separate from controllers)

  • Shut down controllers

  • Shut down Ceph nodes (if applicable)

  • Shut down seed VM

  • Shut down Ansible control host

Rebooting a node

Use the reboot.yml playbook to reboot nodes. For example, to reboot all compute hosts apart from compute0:

kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/reboot.yml --limit 'compute:!compute0'

Control Plane Power on Procedure

Overview

  • Remove the node from maintenance mode in Bifrost

  • Bifrost should automatically power on the node via IPMI

  • Check that all Docker containers are running

  • Check OpenSearch Dashboards for any messages with log level ERROR or equivalent

Controllers

If all of the servers were shut down at the same time, it is necessary to run a script to recover the database once they have all started up. This can be done with the following command:

kayobe overcloud database recover

Ansible Control Host

The Ansible control host is not enrolled in Bifrost and will have to be powered on manually.

Seed VM

The seed VM (and any other service VM) should start automatically when the seed hypervisor is powered on. If it does not, it can be started with:

virsh start <Seed hostname>

Full power on

Follow the order given in Full shutdown, in reverse.
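To make the reversed order explicit, the sketch below lists the steps from Full shutdown and prints them back to front. The function names are hypothetical and the script does not power anything on; it only prints the sequence.

```shell
#!/bin/sh
# Sketch only: prints the power on order by reversing the shutdown order.

shutdown_order() {
    cat <<'EOF'
Virtual machine instances
Compute nodes
Monitoring node (if separate from controllers)
Network nodes (if separate from controllers)
Controllers
Ceph nodes (if applicable)
Seed VM
Ansible control host
EOF
}

power_on_order() {
    # Buffer every line, then print from the last line back to the first.
    shutdown_order | awk '{ lines[NR] = $0 }
                          END { for (i = NR; i >= 1; i--) print lines[i] }'
}

power_on_order
```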

Shutting Down / Restarting Monitoring Services

Shutting down

Log into the monitoring host(s):

ssh stack@monitoring0

Stop all Docker containers:

monitoring0# for i in $(docker ps -a --format '{{.Names}}'); do systemctl stop kolla-$i-container; done

Shut down the node:

monitoring0# sudo shutdown -h

Starting up

The monitoring service containers will start automatically when the monitoring node is powered back on.

Software Updates

Sync local Pulp server with StackHPC Release Train

The host packages and Kolla container images are distributed from the StackHPC Release Train to ensure that tested and reliable software releases are provided.

New StackHPC Release Train content must be synced to the local Pulp server before updating host packages or Kolla services.

To sync host packages:

kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-sync.yml
kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-publish.yml

If the system is a production environment and you want to use packages that have been tested in a test or staging environment, you can promote them with:

kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-promote-production.yml

To sync container images:

kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-container-sync.yml
kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-container-publish.yml

For more information about the StackHPC Release Train, see the StackHPC Release Train documentation.

Once the sync with the StackHPC Release Train is complete, the new content will be accessible from the local Pulp server.

Update Host Packages on Control Plane

Host packages can be updated with:

kayobe overcloud host package update --limit <node> --packages '*'
kayobe seed host package update --packages '*'

See https://docs.openstack.org/kayobe/latest/administration/overcloud.html#updating-packages

Troubleshooting

Deploying to a Specific Hypervisor

To test creating an instance on a specific hypervisor, as an admin-level user you can specify the hypervisor name.

To see the list of hypervisor names:

# From a host that can reach OpenStack
openstack hypervisor list

To boot an instance on a specific hypervisor:

openstack server create --flavor <flavour name> --network <network name> --key-name <key name> --image <image name> --os-compute-api-version 2.74 --host <hypervisor hostname> <vm name>

OpenSearch index retention

To alter the default rotation values for OpenSearch, edit $KAYOBE_CONFIG_PATH/kolla/globals.yml:

# Duration after which index is closed (default 30)
opensearch_soft_retention_period_days: 90
# Duration after which index is deleted (default 60)
opensearch_hard_retention_period_days: 180

Reconfigure OpenSearch with the new values:

kayobe overcloud service reconfigure --kolla-tags opensearch

For more information see the upstream documentation.