[GNC-dev] [MAINT] Unexpected reboot/downtime of code (gnucash server)

Derek Atkins derek at ihtfp.com
Mon Feb 10 10:18:08 EST 2020


TL;DR:  The oVirt VM system rebooted last night, but the VMs didn't come
back up.  They are now back up and running normally, and the cause of
the failed restart has been corrected.

Long Version:

Some of you may have noticed that code was unavailable for the past 12
hours.  Apparently the oVirt host rebooted last night around 10:45pm
local time and the script to start the VMs on reboot didn't work.  I've
spent the past 3 hours debugging and determined that the problem with
the script was that the oVirt engine reports an invalid state
immediately after a reboot.  Specifically, it reports that the storage
domains are "up" even when they are not.  It corrects itself shortly,
but the startup script sees the storage as "up" and then tries to start
the VMs (which fail to start).  This has been fixed by adding a short
delay between when the engine reports as "up" and when the script starts
testing for the storage domains.  I know this works because the script
ran correctly after a clean restart of the oVirt host system.
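
For illustration only, the ordering described above (wait for the engine,
pause, then test the storage domains, then start the VMs) could be
sketched in Python roughly as follows.  This is not the actual startup
script, which is not shown here.  The callables engine_is_up,
storage_is_up and start_vm are hypothetical placeholders for whatever
queries the oVirt engine, and the delay and timeout values are assumed
defaults, not the ones in use.

```python
import time
from typing import Callable, Iterable, List


def wait_until(check: Callable[[], bool], timeout: float, interval: float,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll check() until it returns True or the timeout expires."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def start_vms_after_reboot(engine_is_up: Callable[[], bool],
                           storage_is_up: Callable[[], bool],
                           start_vm: Callable[[str], None],
                           vm_names: Iterable[str],
                           settle_delay: float = 60.0,  # assumed value
                           timeout: float = 600.0,      # assumed value
                           interval: float = 5.0,       # assumed value
                           sleep: Callable[[float], None] = time.sleep,
                           clock: Callable[[], float] = time.monotonic
                           ) -> List[str]:
    """Start VMs only after the engine is up and storage has settled."""
    if not wait_until(engine_is_up, timeout, interval, sleep, clock):
        raise TimeoutError("engine did not come up")

    # The delay: right after a reboot the engine may report the storage
    # domains as "up" when they are not, so give it time to correct
    # itself before trusting the storage status.
    sleep(settle_delay)

    if not wait_until(storage_is_up, timeout, interval, sleep, clock):
        raise TimeoutError("storage domains did not come up")

    started = []
    for name in vm_names:
        start_vm(name)  # hypothetical: ask the engine to start this VM
        started.append(name)
    return started
```

The sleep and clock parameters exist only so the sequence can be
exercised without real waiting; a real script would leave them at their
defaults and supply callables that actually query the engine.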

Still of concern is why the machine rebooted last night in the first
place.  I do not have an answer for that, and the logs don't really show
anything of substance.

I plan to continue to monitor the situation, and I will add some
additional debugging in case it decides to reboot itself yet again.
But at least if it does, we know the VMs will come back!  :)

Sorry for the downtime.


