Unplanned network maintenance/outage

Derek Atkins warlord at MIT.EDU
Sun Mar 17 08:18:23 EDT 2013


Good morning, GnuCashers,

Some (many?) of you may have noticed the outage of 'code.gnucash.org'
starting with a lot of packet loss on Thursday and escalating into a
complete outage by Friday.  This took out our Subversion, Wiki, Email
List, everything server.  Well, as of 2:15pm US/EDT on Saturday
(yesterday) everything should be back to normal and operational.  If you
don't want to hear the gory details of what happened, feel free to stop
reading now.

The issue was multiple simultaneous failures of multiple pieces of
equipment.  What I thought was a power outage turned out to be caused by
a failure in my main network switch.  It started dropping ports, or
causing ports to fail partially (dropping packets).  This was also the
main cause of the packet loss.  However, I didn't discover this until
later.

My main DHCP server was off the net; I swapped Ethernet cables and it
appeared to fix the problem.

My main database server, however, lost its main network controller so I
had to install a new one (I have a few on hand, so it was a relatively
painless operation -- I just had to remember the magic voodoo to get the
system to call the new card 'eth0', but that also took only a few
minutes).

It was only after I got this working that I realized that it was the
switch that had failed -- many of the ports connected to actual hosts
had a 'dead link'.  I also noticed that my main DHCP server was
bouncing.  It would come on the net, stay for a bit, and then go dark.
Luckily I also had a few extra (smaller) switches lying around so I
linked a few of them together and moved all the non-working ports over.
This also fixed the bouncing DHCP server.
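
A bouncing link like the one described above can be spotted by polling
the kernel's view of the link and counting how often it changes state.
The following is only a minimal sketch, not the procedure used during
this outage.  It assumes a Linux host that exposes
/sys/class/net/<iface>/carrier; the interface name 'eth0' and the helper
names count_flaps, read_carrier and watch are hypothetical.

```python
from pathlib import Path
import time


def count_flaps(samples):
    """Count up/down transitions in a sequence of link-state samples."""
    samples = list(samples)
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def read_carrier(iface="eth0"):
    """Return True if the kernel reports link on iface, False otherwise.

    Assumption: Linux sysfs layout.  Reading the file fails when the
    interface is missing or administratively down; both count as "no link".
    """
    try:
        text = Path(f"/sys/class/net/{iface}/carrier").read_text()
    except OSError:
        return False
    return text.strip() == "1"


def watch(iface="eth0", polls=30, interval=2.0):
    """Poll the link state `polls` times and return the number of flaps."""
    samples = []
    for _ in range(polls):
        samples.append(read_carrier(iface))
        time.sleep(interval)
    return count_flaps(samples)
```

Calling watch("eth0") on the affected host returns the number of link
transitions seen over about a minute; anything above zero on a server
that should be sitting still suggests checking the cable or switch port.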

Last, but not least, the VM Server Host's network was wedged, requiring
a complete reboot to reset.  This also required resetting all the VMs,
some of which required a bit of hand-holding to come back (and many of
which required a virtual disk fsck as well, taking even more time).  The
last of the systems returned to service shortly after 2pm.

I do plan to acquire a new switch to replace the failing one, but what I
have now is working, so I'll watch it closely in the meantime.

Thanks,

-derek

-- 
       Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
       Member, MIT Student Information Processing Board  (SIPB)
       URL: http://web.mit.edu/warlord/    PP-ASEL-IA     N1NWH
       warlord at MIT.EDU                        PGP key available

