Operating Control Plane¶

Backup of the OpenStack Control Plane¶

As the backup procedure changes frequently, it is normally best to check the upstream documentation for an up-to-date procedure. Here is a high-level overview of the key things you need to back up:

Controllers¶

Compute¶

The compute nodes can largely be thought of as ephemeral, but you do need to make sure you have migrated any instances and disabled the hypervisor before rebooting, decommissioning or making any disruptive configuration change.

Monitoring¶

Seed¶

Back up Bifrost

Ansible control host¶

Back up service VMs such as the seed VM

Control Plane Monitoring¶

This section is a user guide to monitoring the control plane. To see how to configure monitoring services, read Monitoring Configuration.

The control plane has been configured to collect logs centrally using Fluentd, OpenSearch and OpenSearch Dashboards.

Telemetry monitoring of the control plane is performed by Prometheus. Metrics are collected by Prometheus exporters, which run either on all hosts (e.g. the node exporter) or on specific hosts (e.g. controllers for the memcached exporter, or monitoring hosts for the OpenStack exporter). The Prometheus server scrapes these exporters.

Configuring Prometheus Alerts¶

Alerts are defined in code and stored in Kayobe configuration. Use the existing *.rules files in $KAYOBE_CONFIG_PATH/kolla/config/prometheus as a model when adding custom rules.
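As an illustration, a custom rule file could be created as sketched below. The file name custom.rules, the alert name, the threshold and the severity label are all hypothetical; node_load5 is a standard node exporter metric. Compare the result against the existing *.rules files before relying on it.

```shell
# Write a hypothetical custom alert rule file into the given directory.
# In a real deployment the directory would be
# $KAYOBE_CONFIG_PATH/kolla/config/prometheus.
write_custom_rule() {
    local rules_dir="$1"
    mkdir -p "$rules_dir"
    cat > "$rules_dir/custom.rules" <<'EOF'
groups:
  - name: custom
    rules:
      # Hypothetical example: fire when the 5-minute load average
      # has been above 10 for 10 minutes.
      - alert: HostHighLoad
        expr: node_load5 > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "5-minute load average has been above 10 for 10 minutes"
EOF
}

# Usage:
#   write_custom_rule "$KAYOBE_CONFIG_PATH/kolla/config/prometheus"
```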

Silencing Prometheus Alerts¶

Sometimes alerts must be silenced because the root cause cannot be resolved right away, such as when hardware is faulty. For example, an unreachable hypervisor will produce several alerts:

InstanceDown from Node Exporter
OpenStackServiceDown from the OpenStack exporter, which reports status of the nova-compute agent on the host
PrometheusTargetMissing from several Prometheus exporters

Rather than silencing each alert one by one for a specific host, a silence can apply to multiple alerts using a reduced list of labels. Log into Alertmanager, click on the Silence button next to an alert and adjust the matcher list to keep only the instance=<hostname> label. Then, create another silence to match hostname=<hostname> (this is required because, for the OpenStack exporter, the instance is the host running the monitoring service rather than the host being monitored).
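The same two silences could also be created from the command line. The sketch below assumes that amtool (the Alertmanager command-line client) is installed and that ALERTMANAGER_URL points at your Alertmanager; neither is set up by this configuration. The function name, default duration and comment text are illustrative.

```shell
# Create the two host-wide silences described above using amtool.
# Assumptions: amtool is installed and ALERTMANAGER_URL is reachable.
# Set AMTOOL=echo to print the commands instead of running them.
silence_host() {
    local host="$1" duration="${2:-2h}" label
    for label in instance hostname; do
        "${AMTOOL:-amtool}" --alertmanager.url="${ALERTMANAGER_URL}" \
            silence add \
            --duration="${duration}" \
            --comment="Silencing alerts for ${host}" \
            "${label}=${host}"
    done
}

# Usage:
#   ALERTMANAGER_URL=<Alertmanager URL> silence_host <hostname> 24h
```

Matching on both labels mirrors the two silences created in the web interface.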

Control Plane Shutdown Procedure¶

Overview¶

Verify the integrity of clustered components (RabbitMQ, Galera, Keepalived). They should all report a healthy status.
Put the node into maintenance mode in Bifrost to prevent it from automatically powering back on
Shut down nodes gracefully, one at a time, using systemctl poweroff
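The maintenance mode step might be scripted from the seed host as sketched below. This assumes that Bifrost runs in a container named bifrost_deploy, that the baremetal client is available inside it, and that its client configuration defines a cloud named bifrost; check these against your deployment. The function name is hypothetical.

```shell
# Set or unset Ironic maintenance mode for a node, from the seed host.
# Assumptions: container name bifrost_deploy, cloud name "bifrost".
# Set DOCKER=echo to print the command instead of running it.
bifrost_maintenance() {
    local action="$1" node="$2"
    case "$action" in
        set|unset) ;;
        *) echo "usage: bifrost_maintenance set|unset <node>" >&2
           return 2 ;;
    esac
    "${DOCKER:-docker}" exec bifrost_deploy \
        bash -c "OS_CLOUD=bifrost baremetal node maintenance $action $node"
}

# Usage:
#   bifrost_maintenance set <node>     # before shutting the node down
#   bifrost_maintenance unset <node>   # before powering it back on
```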

Controllers¶

If you are restarting the controllers, it is best to do this one controller at a time to avoid the clustered components losing quorum.

Checking Galera state¶

On each controller perform the following:

[stack@controller0 ~]$ docker exec -i mariadb mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_local_state_comment'"
Variable_name   Value
wsrep_local_state_comment       Synced

The password can be found using:

ansible-vault view $KAYOBE_CONFIG_PATH/kolla/passwords.yml \
        --vault-password-file <Vault password file path> | grep ^database
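When checking several controllers, the output can be tested in a script rather than read by eye. The helper below is a minimal sketch (the function name is hypothetical) that reads the SHOW STATUS output on standard input and succeeds only if the node reports Synced.

```shell
# Succeed only if the SHOW STATUS output on stdin reports a Synced node.
galera_is_synced() {
    awk '$1 == "wsrep_local_state_comment" { state = $2 }
         END { exit (state == "Synced" ? 0 : 1) }'
}

# Usage (prompts for the database password, as in the command above):
#   docker exec -i mariadb mysql -u root -p \
#       -e "SHOW STATUS LIKE 'wsrep_local_state_comment'" \
#       | galera_is_synced && echo "Galera node is synced"
```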

Checking RabbitMQ¶

RabbitMQ health is determined using the command rabbitmqctl cluster_status:

[stack@controller0 ~]$ docker exec rabbitmq rabbitmqctl cluster_status

Cluster status of node rabbit@controller0 ...
[{nodes,[{disc,['rabbit@controller0','rabbit@controller1',
                'rabbit@controller2']}]},
 {running_nodes,['rabbit@controller1','rabbit@controller2',
                 'rabbit@controller0']},
 {cluster_name,<<"rabbit@controller2">>},
 {partitions,[]},
 {alarms,[{'rabbit@controller1',[]},
          {'rabbit@controller2',[]},
          {'rabbit@controller0',[]}]}]
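In a healthy cluster, every node appears under running_nodes and the partitions list is empty. The partition check can be automated with a small helper such as the sketch below. It assumes the output format shown above; other RabbitMQ releases may print cluster status in a different layout, in which case the pattern would need adjusting. The function name is hypothetical.

```shell
# Succeed only if cluster_status output on stdin (in the format shown
# above) reports an empty partitions list.
rabbitmq_no_partitions() {
    grep -qF '{partitions,[]}'
}

# Usage:
#   docker exec rabbitmq rabbitmqctl cluster_status \
#       | rabbitmq_no_partitions && echo "No partitions reported"
```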

Checking Keepalived¶

On each controller (three in this example):

[stack@controller0 ~]$ docker logs keepalived

Two instances should show:

VRRP_Instance(kolla_internal_vip_51) Entering BACKUP STATE

and the other:

VRRP_Instance(kolla_internal_vip_51) Entering MASTER STATE
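Because the logs accumulate state changes over time, only the most recent transition on each controller matters. The helper below is a minimal sketch (the function name is hypothetical) that extracts the latest state from log output supplied on standard input.

```shell
# Print the most recent VRRP state (MASTER, BACKUP or FAULT) found in
# keepalived log output read from stdin.
keepalived_state() {
    grep -oE 'Entering (MASTER|BACKUP|FAULT) STATE' \
        | tail -n 1 \
        | awk '{ print $2 }'
}

# Usage (docker logs may write to stderr, hence the redirection):
#   docker logs keepalived 2>&1 | keepalived_state
```

Running this on every controller should yield MASTER exactly once.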

Ansible Control Host¶

The Ansible control host is not enrolled in Bifrost. This node may run services such as the seed virtual machine, which will need to be gracefully powered down.

Compute¶

If you are shutting down a single hypervisor, it is advisable to migrate all of its instances to another machine to avoid downtime for tenants. See Evacuating all instances.

Ceph¶

The following guide provides a good overview: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/8/html/director_installation_and_usage/sect-rebooting-ceph

Shutting down the seed VM¶

virsh shutdown <Seed hostname>

Full shutdown¶

If a full shutdown of the system is required, we advise using the following order:

Perform a graceful shutdown of all virtual machine instances
Shut down compute nodes
Shut down monitoring node (if separate from controllers)
Shut down network nodes (if separate from controllers)
Shut down controllers
Shut down Ceph nodes (if applicable)
Shut down seed VM
Shut down Ansible control host

Rebooting a node¶

Use the reboot.yml playbook to reboot nodes. For example, to reboot all compute hosts apart from compute0:

kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/reboot.yml --limit 'compute:!compute0'

References¶

https://galeracluster.com/library/training/tutorials/restarting-cluster.html

Control Plane Power on Procedure¶

Overview¶

Remove the node from maintenance mode in Bifrost
Bifrost should automatically power on the node via IPMI
Check that all Docker containers are running
Check OpenSearch Dashboards for any messages with log level ERROR or equivalent
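To spot containers that did not come back up after power on, the container list can be filtered by status. The helper below is a minimal sketch (the function name is hypothetical) that reads tab-separated name and status pairs and prints the names of containers whose status does not start with Up.

```shell
# Print the names of containers that are not running, given lines of
# "<name><TAB><status>" on stdin.
containers_not_up() {
    awk -F '\t' '$2 !~ /^Up/ { print $1 }'
}

# Usage on a node (empty output means every container is up):
#   docker ps -a --format '{{.Names}}\t{{.Status}}' | containers_not_up
```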

Controllers¶

If all of the servers were shut down at the same time, it is necessary to run a script to recover the database once they have all started up. This can be done with the following command:

kayobe overcloud database recover

Ansible Control Host¶

The Ansible control host is not enrolled in Bifrost and will have to be powered on manually.

Seed VM¶

The seed VM (and any other service VM) should start automatically when the seed hypervisor is powered on. If it does not, it can be started with:

virsh start <Seed hostname>

Full power on¶

Follow the order given in Full shutdown, but in reverse.

Shutting Down / Restarting Monitoring Services¶

Shutting down¶

Log into the monitoring host(s):

ssh stack@monitoring0

Stop all Docker containers:

monitoring0# for i in $(docker ps -a --format '{{.Names}}'); do systemctl stop kolla-$i-container; done

Shut down the node:

monitoring0# sudo shutdown -h

Starting up¶

The monitoring services containers will automatically start when the monitoring node is powered back on.

Software Updates¶

Sync local Pulp server with StackHPC Release Train¶

The host packages and Kolla container images are distributed from the StackHPC Release Train to ensure that tested and reliable software releases are provided.

New StackHPC Release Train content must be synced to the local Pulp server before updating host packages and/or Kolla services.

To sync host packages:

kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-sync.yml
kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-publish.yml

If the system is a production environment and you want to use packages that have been tested in a test/staging environment, you can promote them with:

kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-promote-production.yml

To sync container images:

kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-container-sync.yml
kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-container-publish.yml

For more information about the StackHPC Release Train, see the StackHPC Release Train documentation.

Once the sync with the StackHPC Release Train is complete, the new content will be accessible from the local Pulp server.

Update Host Packages on Control Plane¶

Host packages can be updated with:

kayobe overcloud host package update --limit <node> --packages '*'
kayobe seed host package update --packages '*'

See https://docs.openstack.org/kayobe/latest/administration/overcloud.html#updating-packages

Troubleshooting¶

Deploying to a Specific Hypervisor¶

To test creating an instance on a specific hypervisor, as an admin-level user you can specify the hypervisor name.

To see the list of hypervisor names:

# From a host that can reach OpenStack
openstack hypervisor list

To boot an instance on a specific hypervisor:

openstack server create --flavor <flavour name> --network <network name> --key-name <key name> --image <image name> --os-compute-api-version 2.74 --host <hypervisor hostname> <vm name>

OpenSearch index retention¶

To alter the default retention values for OpenSearch, edit $KAYOBE_CONFIG_PATH/kolla/globals.yml:

# Duration after which index is closed (default 30)
opensearch_soft_retention_period_days: 90
# Duration after which index is deleted (default 60)
opensearch_hard_retention_period_days: 180

Reconfigure OpenSearch with the new values:

kayobe overcloud service reconfigure --kolla-tags opensearch

For more information see the upstream documentation.