=======================
Operating Control Plane
=======================

Backup of the OpenStack Control Plane
=====================================

As the backup procedure is constantly changing, it is normally best to check
the upstream documentation for an up-to-date procedure. Here is a high-level
overview of the key things you need to back up:

Controllers
-----------

* `Back up SQL databases `__
* `Back up configuration in /etc/kolla `__

Compute
-------

The compute nodes can largely be thought of as ephemeral, but you do need to
make sure you have migrated any instances and disabled the hypervisor before
rebooting, decommissioning or making any disruptive configuration change.

Monitoring
----------

* `Back up InfluxDB `__
* `Back up OpenSearch `__
* `Back up Prometheus `__

Seed
----

* `Back up bifrost `__

Ansible control host
--------------------

* Back up service VMs such as the seed VM

Control Plane Monitoring
========================

This section is a user guide to monitoring the control plane. To see how to
configure monitoring services, read :ref:`monitoring-service-configuration`.

The control plane has been configured to collect logs centrally using Fluentd,
OpenSearch and OpenSearch Dashboards.

Telemetry monitoring of the control plane is performed by Prometheus. Metrics
are collected by Prometheus exporters, which run either on all hosts (e.g.
node exporter) or on specific hosts (e.g. controllers for the memcached
exporter or monitoring hosts for the OpenStack exporter). These exporters are
scraped by the Prometheus server.

Configuring Prometheus Alerts
-----------------------------

Alerts are defined in code and stored in Kayobe configuration. See ``*.rules``
files in ``$KAYOBE_CONFIG_PATH/kolla/config/prometheus`` as a model to add
custom rules.

Silencing Prometheus Alerts
---------------------------

Sometimes alerts must be silenced because the root cause cannot be resolved
right away, such as when hardware is faulty.
For example, an unreachable hypervisor will produce several alerts:

* ``InstanceDown`` from Node Exporter
* ``OpenStackServiceDown`` from the OpenStack exporter, which reports status of
  the ``nova-compute`` agent on the host
* ``PrometheusTargetMissing`` from several Prometheus exporters

Rather than silencing each alert one by one for a specific host, a silence can
apply to multiple alerts using a reduced list of labels. Log into Alertmanager,
click on the ``Silence`` button next to an alert and adjust the matcher list
to keep only the ``instance=`` label. Then, create another silence to match
``hostname=`` (this is required because, for the OpenStack exporter, the
instance is the host running the monitoring service rather than the host being
monitored).

Control Plane Shutdown Procedure
================================

Overview
--------

* Verify integrity of clustered components (RabbitMQ, Galera, Keepalived). They
  should all report a healthy status.
* Put the node into maintenance mode in bifrost to prevent it from
  automatically powering back on
* Shut down nodes gracefully, one at a time, using ``systemctl poweroff``

Controllers
-----------

If you are restarting the controllers, it is best to do this one controller at
a time to avoid the clustered components losing quorum.

Checking Galera state
+++++++++++++++++++++

On each controller perform the following:

.. code-block:: console

   [stack@controller0 ~]$ docker exec -i mariadb mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_local_state_comment'"
   Variable_name   Value
   wsrep_local_state_comment       Synced

The password can be found using:

.. code-block:: console

   ansible-vault view $KAYOBE_CONFIG_PATH/kolla/passwords.yml \
       --vault-password-file | grep ^database

Checking RabbitMQ
+++++++++++++++++

RabbitMQ health is determined using the command ``rabbitmqctl cluster_status``:

.. code-block:: console

   [stack@controller0 ~]$ docker exec rabbitmq rabbitmqctl cluster_status
   Cluster status of node rabbit@controller0 ...
   [{nodes,[{disc,['rabbit@controller0','rabbit@controller1',
                   'rabbit@controller2']}]},
    {running_nodes,['rabbit@controller1','rabbit@controller2',
                    'rabbit@controller0']},
    {cluster_name,<<"rabbit@controller2">>},
    {partitions,[]},
    {alarms,[{'rabbit@controller1',[]},
             {'rabbit@controller2',[]},
             {'rabbit@controller0',[]}]}]

Checking Keepalived
+++++++++++++++++++

On (for example) three controllers:

.. code-block:: console

   [stack@controller0 ~]$ docker logs keepalived

Two instances should show:

.. code-block:: console

   VRRP_Instance(kolla_internal_vip_51) Entering BACKUP STATE

and the other:

.. code-block:: console

   VRRP_Instance(kolla_internal_vip_51) Entering MASTER STATE

Ansible Control Host
--------------------

The Ansible control host is not enrolled in bifrost. This node may run services
such as the seed virtual machine, which will need to be gracefully powered
down.

Compute
-------

If you are shutting down a single hypervisor, it is advisable to migrate all of
its instances to another machine to avoid downtime for tenants. See
:ref:`evacuating-all-instances`.

Ceph
----

The following guide provides a good overview:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/8/html/director_installation_and_usage/sect-rebooting-ceph

Shutting down the seed VM
-------------------------

.. code-block:: console

   virsh shutdown

.. _full-shutdown:

Full shutdown
-------------

If a full shutdown of the system is required, we advise the following order:

* Perform a graceful shutdown of all virtual machine instances
* Shut down compute nodes
* Shut down monitoring node (if separate from controllers)
* Shut down network nodes (if separate from controllers)
* Shut down controllers
* Shut down Ceph nodes (if applicable)
* Shut down seed VM
* Shut down Ansible control host

Rebooting a node
----------------

Use the ``reboot.yml`` playbook to reboot nodes.

Example: Reboot all compute hosts apart from compute0:

.. code-block:: console

   kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/reboot.yml --limit 'compute:!compute0'

References
----------

* https://galeracluster.com/library/training/tutorials/restarting-cluster.html

Control Plane Power on Procedure
================================

Overview
--------

* Remove the node from maintenance mode in bifrost
* Bifrost should automatically power on the node via IPMI
* Check that all docker containers are running
* Check OpenSearch Dashboards for any messages with log level ERROR or
  equivalent

Controllers
-----------

If all of the servers were shut down at the same time, it is necessary to run a
script to recover the database once they have all started up. This can be done
with the following command:

.. code-block:: console

   kayobe overcloud database recover

Ansible Control Host
--------------------

The Ansible control host is not enrolled in Bifrost and will have to be powered
on manually.

Seed VM
-------

The seed VM (and any other service VM) should start automatically when the seed
hypervisor is powered on. If it does not, it can be started with:

.. code-block:: console

   virsh start

Full power on
-------------

Follow the order in :ref:`full-shutdown`, but in reverse.

Shutting Down / Restarting Monitoring Services
----------------------------------------------

Shutting down
+++++++++++++

Log in to the monitoring host(s):

.. code-block:: console

   ssh stack@monitoring0

Stop all Docker containers:

.. code-block:: console

   monitoring0# for i in $(docker ps -a --format '{{.Names}}'); do systemctl stop kolla-$i-container; done

Shut down the node:

.. code-block:: console

   monitoring0# sudo shutdown -h

Starting up
+++++++++++

The monitoring service containers will start automatically when the monitoring
node is powered back on.

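
Once the node is back up, it can be worth confirming that no container failed
to start. The following is a minimal sketch rather than part of the documented
procedure: ``not_running`` is a hypothetical helper name, and it assumes each
input line has the form ``<name> <status>``, where the status begins with the
word ``Up`` for a running container.

.. code-block:: bash

   # Hypothetical helper: print the names of containers that are not running.
   # Reads "<name> <status...>" lines on stdin; a status whose first word is
   # not "Up" (e.g. "Exited", "Created") is treated as not running.
   not_running() {
       awk '$2 != "Up" { print $1 }'
   }

   # Example usage on a monitoring host (uncomment to run against Docker):
   # docker ps -a --format '{{.Names}} {{.Status}}' | not_running

An empty result suggests that every container is up. The sketch does not
inspect health checks, so a container reported as ``Up ... (unhealthy)`` would
not be listed.
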

Software Updates
================

Sync local Pulp server with StackHPC Release Train
--------------------------------------------------

The host packages and Kolla container images are distributed from `StackHPC
Release Train `__ to ensure tested and reliable software releases are provided.

New StackHPC Release Train contents must be synced to the local Pulp server
before updating host packages and/or Kolla services.

To sync host packages:

.. code-block:: console

   kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-sync.yml
   kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-publish.yml

If the system is a production environment and you want to use packages tested
in a test/staging environment, you can promote them with:

.. code-block:: console

   kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-promote-production.yml

To sync container images:

.. code-block:: console

   kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-container-sync.yml
   kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-container-publish.yml

For more information about StackHPC Release Train, see the
:ref:`stackhpc-release-train` documentation.

Once the sync with StackHPC Release Train is done, the new contents will be
accessible from the local Pulp server.

Update Host Packages on Control Plane
-------------------------------------

Host packages can be updated with:

.. code-block:: console

   kayobe overcloud host package update --limit --packages '*'
   kayobe seed host package update --packages '*'

See https://docs.openstack.org/kayobe/latest/administration/overcloud.html#updating-packages

Troubleshooting
===============

Deploying to a Specific Hypervisor
----------------------------------

To test creating an instance on a specific hypervisor, *as an admin-level user*
you can specify the hypervisor name.

To see the list of hypervisor names:

.. code-block:: console

   # From a host that can reach OpenStack
   openstack hypervisor list

To boot an instance on a specific hypervisor:

.. code-block:: console

   openstack server create --flavor --network --key-name --image --os-compute-api-version 2.74 --host

OpenSearch indexes retention
=============================

To alter the default rotation values for OpenSearch, edit
``$KAYOBE_CONFIG_PATH/kolla/globals.yml``:

.. code-block:: yaml

   # Duration after which index is closed (default 30)
   opensearch_soft_retention_period_days: 90
   # Duration after which index is deleted (default 60)
   opensearch_hard_retention_period_days: 180

Reconfigure OpenSearch with the new values:

.. code-block:: console

   kayobe overcloud service reconfigure --kolla-tags opensearch

For more information see the `upstream documentation `__.
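
To make the effect of the two settings concrete, the sketch below classifies an
index by its age in days. It is illustrative only: ``index_state`` is a
hypothetical helper, not part of Kolla or OpenSearch, and the treatment of an
index whose age exactly equals a threshold is an assumption.

.. code-block:: bash

   # Hypothetical helper: report what the retention settings imply for an
   # index of a given age.
   # Usage: index_state AGE_DAYS SOFT_DAYS HARD_DAYS
   index_state() {
       age=$1
       soft=$2
       hard=$3
       if [ "$age" -ge "$hard" ]; then
           echo deleted   # past the hard retention period
       elif [ "$age" -ge "$soft" ]; then
           echo closed    # past the soft retention period only
       else
           echo open      # still within the soft retention period
       fi
   }

   # With the example values above (soft 90, hard 180):
   index_state 100 90 180

With the example values, an index that is 100 days old falls between the two
periods, so it would be closed but not yet deleted.
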