vRA Stops Working at 80% disk usage - VCAC vmo_workflowtokencontent Table Vacuum - vRO 8.x

Mindwatering Incorporated

Author: Tripp W Black

Created: 01/30/2023 at 10:07 AM

 

Category:
VMware
vRA

Issue:
vRA/vRO has "stopped working", and logins fail.
In a cluster, the affected node will be marked "Not Healthy". See VMware KB81387 / KB56114.

Perform the following checks:
- Checking the Kubernetes pods will show one or more with status "Init:ErrImageNeverPull"
root@vra-mw [ ~ ]# kubectl get pods --all-namespaces
or
root@vra-mw [ ~ ]# kubectl -n prelude get pods
In a cluster, there should be 3 instances of each service, each with a status of Completed or Running.
In this failure, one or more will typically show the status Init:ErrImageNeverPull.
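The unhealthy pods can be picked out of the listing with a simple filter. A minimal sketch, using a sample listing in place of live kubectl output (the pod names here are illustrative):

```shell
# Filter a pod listing down to pods that are neither Running nor Completed.
# The sample text stands in for live `kubectl get pods --all-namespaces`
# output; the pod names are illustrative, not actual vRA service names.
sample='NAMESPACE   NAME           READY   STATUS                   RESTARTS   AGE
prelude     vco-app-0      0/3     Init:ErrImageNeverPull   0          5m
prelude     postgres-0     1/1     Running                  0          2d
prelude     db-init-job    0/1     Completed                0          2d'

# Keep the header row plus any pod whose STATUS column is unhealthy.
unhealthy=$(printf '%s\n' "$sample" | awk 'NR==1 || ($4 != "Running" && $4 != "Completed")')
printf '%s\n' "$unhealthy"
```

Against a live cluster, pipe `kubectl get pods --all-namespaces` into the same awk filter; note that with `kubectl -n prelude get pods` (no NAMESPACE column) the STATUS column shifts to $3.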

- The disk space for data_vg-data is greater than 80%.
root@vra-mw [~]# df -h /data

- The vmo_workflowtokencontent table may be growing continually.
template1=# \c vco-db

vco-db=# SELECT relname as "Table", pg_size_pretty(pg_total_relation_size(relid)) as "Size" from pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 5;
Table | Size
-----------------------------+-------
vmo_workflowtokencontent | 50 GB
vmo_subworkflowtokencontent | 40 MB
vmo_statistic | 39 MB
vmo_vroconfiguration | 25 MB
vmo_workflowtokenstatistics | 23 MB


Workarounds:
Notes:
- Ensure there is a cold snapshot and a backup before proceeding. KB81387 says a backup can be taken while the appliance is running.
- You need free space equal to the size of the largest table, plus overhead, for the temporary table rebuild. If the largest table is 50 GB, plan on roughly 55 GB free.
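The headroom rule above is simple arithmetic: VACUUM FULL writes a compacted copy of the table before dropping the old one, so budget roughly the table's own size again, plus ~10%. A quick sketch:

```shell
# VACUUM FULL rebuilds the table as a fresh copy, so free space must cover
# the full size of the largest table plus some headroom.
largest_gb=50                                  # vmo_workflowtokencontent, from the size query
needed_gb=$(( largest_gb + largest_gb / 10 ))  # table size + ~10% headroom
echo "Need ~${needed_gb} GB free to rebuild a ${largest_gb} GB table"
```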

A. Check disk space
a. Optional. Free up old tmp logs:
root@vra-mw [~]# find /var/log -type f -mtime +70 -delete

b. Check disk space; the postgres /data volume is likely on sdb:
root@vra-mw [~]# df -h /data
or
root@vra-mw [~]# vracli disk-mgr
/dev/sda4(/):
Total size: 50 GiB
...
/dev/sdb(/data):
Total size: 100 GiB
Free: 23.9 GiB (23.9%)
Available(for non-superusers): 18.9 GiB (18.9%)
...

c. With only ~19 GiB available, there is not enough room to rebuild the 50 GB table, so we need to grow the disk.
i. Shut down the vRA appliance.

ii. vSphere --> VM --> Edit Settings --> Under Disks, increase the size of Disk2.
(e.g. 100 GB to 160 GB)

iii. Start up the VM.
Surprisingly, we found that we did not need to use GParted to increase the partition.


B. Vacuum Not Working
- If vacuuming is not occurring regularly, see KB56114. It contains troubleshooting steps and corrective actions to confirm and fix the issue.
- Editing the postgres data should be done with the guidance of a VMware support SR ticket.
- A manual vacuum can be done via the steps below:
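To confirm whether autovacuum has been running on the table at all, a standard PostgreSQL statistics query (run from the vco-db psql prompt, once connected as in the steps below) shows dead-tuple counts and the last vacuum times:

```sql
-- Uses the standard PostgreSQL pg_stat_user_tables statistics view.
SELECT relname,
       n_dead_tup,
       n_live_tup,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 5;
```

A large n_dead_tup with an old or NULL last_autovacuum on vmo_workflowtokencontent supports the diagnosis that autovacuum is not keeping up.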

The vacuum can be done while the docker services/images are all running. Note that VACUUM FULL takes an exclusive lock on the table while it rebuilds, so running workflows may block until it finishes.
Log in as the root user, and perform the manual vacuum:
$ ssh root@vra-mw
...

root@vra-mw [~]# vracli dev psql
template1=# \c vco-db
or, in one command:
root@vra-mw [~]# vracli dev psql vco-db
Type yes at the warning that these actions are recorded.

vco-db=# SELECT relname as "Table", pg_size_pretty(pg_total_relation_size(relid)) as "Size" from pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 5;
Table | Size
-----------------------------+-------
vmo_workflowtokencontent | 50 GB
vmo_subworkflowtokencontent | 40 MB
vmo_statistic | 39 MB
vmo_vroconfiguration | 25 MB
vmo_workflowtokenstatistics | 23 MB

vco-db=# VACUUM FULL vmo_workflowtokencontent;
...

vco-db=# SELECT relname as "Table", pg_size_pretty(pg_total_relation_size(relid)) as "Size" from pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 5;
Table | Size
-----------------------------+-------
vmo_workflowtokencontent | 80 MB
vmo_subworkflowtokencontent | 40 MB
vmo_statistic | 39 MB
vmo_vroconfiguration | 25 MB
vmo_workflowtokenstatistics | 23 MB


C. Increase the periodic vacuum frequency as needed.
Still in the postgres prompt, issue the following commands (from the KB81387 document):
vco-db=# alter table vmo_workflowtokencontent set (autovacuum_vacuum_scale_factor = 0.05);
vco-db=# alter table vmo_workflowtokencontent set (autovacuum_vacuum_threshold = 25);
vco-db=# alter table vmo_workflowtokencontent set (autovacuum_vacuum_cost_delay = 10);
vco-db=# alter table vmo_workflowtokencontent set (autovacuum_analyze_threshold = 25);
vco-db=# alter table vmo_workflowtokencontent set (toast.autovacuum_vacuum_scale_factor = 0.05);
vco-db=# alter table vmo_workflowtokencontent set (toast.autovacuum_vacuum_threshold = 25);
vco-db=# alter table vmo_workflowtokencontent set (toast.autovacuum_vacuum_cost_delay = 10);
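To see what these settings mean in practice: PostgreSQL triggers an autovacuum on a table once dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor x reltuples. A sketch of that formula with the values above (the row count is illustrative):

```shell
# PostgreSQL autovacuum trigger point:
#   dead tuples > threshold + scale_factor * reltuples
# With the settings above (threshold=25, scale_factor=0.05) on an
# illustrative table of 100,000 rows:
threshold=25
reltuples=100000
# scale_factor 0.05 = 1/20; integer math avoids needing bc/awk
trigger=$(( threshold + reltuples / 20 ))
echo "Autovacuum fires after ~${trigger} dead tuples"
```

Compare that with the PostgreSQL defaults (threshold 50, scale factor 0.2), which would wait for ~20,050 dead tuples on the same table; the lowered settings make vacuuming of this busy table far more aggressive.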

Validate the new settings, with:
vco-db=# SELECT relname, pg_options_to_table(reloptions) AS reloption FROM pg_class WHERE reloptions IS NOT NULL AND relnamespace = 'public'::regnamespace ORDER BY relname, reloption;

Exit the sql prompt with:
vco-db=# \q


D. If there is still not 20% free after the vRA appliance starts, restart the docker images that did not start with:
(this was not needed in our case)
root@vra-mw [~]# /opt/scripts/restore_docker_images.sh








