Unplanned outage of GnuCash server/services last night

Derek Atkins warlord at MIT.EDU
Fri Jun 14 11:05:55 EDT 2013


Good morning,

The short:  there was a power outage from ~8pm until ~6:30am knocking
            out code.gnucash.org; all services were back up and running
            by 10:30am.

If you don't care about the nitty-gritty, you can stop reading now.

The long story:

Many of you may have noticed that code went offline last night.  We had
some pretty severe storms come through the area that knocked out power to
at least 100,000 homes, if not more.  The power went out shortly before
8pm last night, which also knocked out my network at that time (more on
this later).  The UPS lasted for over 2 hours, finally expiring a bit
after 10pm.  

The power did not return until around 6:30 this morning.  When I finally
got out of bed this morning I noticed a few things.  First, my DHCP
server's name server didn't start.  This seems to be a perpetual issue
and I've never been able to figure out why this happens; if I restart it
by hand it works fine.  *shrugs*

I also noticed that the VM server had not come back online, and was
throwing LVM errors about not finding one of the Physical Volumes..  and
that appeared to be due to mdadm not being able to rebuild /dev/md2
because "sdc and sdc1 looked to be the same device".  YIKES!  But when I
booted the server with a rescue CD the volumes all came up just fine.
Apparently there is some issue with mdadm that I've never hit before;
the fix was to adjust the mdadm.conf in the initrd to point directly to
the device partitions instead of the UUID.  I don't particularly like this
solution, but it solved the immediate problem.
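
For illustration only, the kind of change described above might look
like the mdadm.conf sketch below.  The UUID placeholder and /dev/sdb1
are assumptions, not the actual values from this server; only md2 and
sdc1 come from the error described above.

    # Before (assumed form): the array is located by its UUID.
    #ARRAY /dev/md2 UUID=<array-uuid>

    # After (sketch): the array members are named directly as partitions.
    # /dev/sdb1 is a hypothetical second member.
    ARRAY /dev/md2 devices=/dev/sdb1,/dev/sdc1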

This just underscores my need to find a new VM solution.  The VM Host is
still running a base OS of Fedora 13, with a Fedora-10 kernel!  The
reason it's on an F10 kernel is some scheduling and disk IO issues I was
hitting with the F13 kernels, and I've been extremely hesitant to
perform any other upgrades on the system since hitting that one.  One of
these years...

Anyways, I got everything fixed and the VM host booting just before
10am, and then it took ~20-30 minutes for all the VMs to fsck and come
up.  But at this time it appears all services are back online and
running normally.

In the longer term we plan to get a backup generator, but that's
probably still a year or three out.

Also, I was able to get Comcast to acknowledge that there is a real
problem on my node; over 100 people went offline with the power outage,
so hopefully the technicians won't just close the trouble ticket outright
again this time.  *fingers crossed*  I'd love to have my network stay up
for at least 60 minutes during a power outage!

Anyways, I need to try to get some real work done now..  Back to your
regularly scheduled gnucash hacking here..

-derek
-- 
       Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
       Member, MIT Student Information Processing Board  (SIPB)
       URL: http://web.mit.edu/warlord/    PP-ASEL-IA     N1NWH
       warlord at MIT.EDU                        PGP key available

