[MAINT] Server is back online: full report

Fri Apr 8 09:33:40 EDT 2011

Hi,

Here is the promised low-down on the server happenings this week.  If
you don't care, feel free to just delete this message.  I'm partly
writing this up for your bemusement, but also as a note for myself
because I'm sure in 3 years' time when I need to add more disks this
knowledge might come in handy.

The VM server was originally running Fedora 10 and vmware-server-2.  

Back in December I had installed and wired a new SATA controller card in
preparation for adding more disk space to the server.  At the same time
I updated the system to Fedora 13 to try to freshen it up.

Unfortunately I quickly noticed that the 2.6.34 kernels of F13 made the
system unstable.  In particular, any significant disk I/O would cause a
catastrophic I/O slowdown and spikes in system load, and often it would
hang the whole system (including all the VMs).  This in turn would
sometimes cause the VMs to see a disk I/O error and, depending on the
kernel version in the VM, it would make the VM's disk read-only,
effectively killing the VM!

Luckily at the time I still had the Fedora 10 (2.6.27) kernel lying
around and I was able to reboot into that.  With all else being equal
this stabilized the system and it was happily running that way until
this week.

Last month I finally acquired the new disks, added them to the system,
performed various I/O tests to burn them in, and then eventually created
a new RAID1 array to mirror the drives and then added them to the
logical volume group on which the main system obtains its filesystem.
That was all well and good until the power outage.  When the system
tried to reboot it failed to see the new RAID array, so it couldn't
build the logical volume, and as a result it couldn't mount the root
drive!  UGGH.
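
As a reminder of the moving parts, here is a rough sketch of the
"mirror two disks and add them to the volume group" sequence.  It only
prints the commands, since they are destructive, and every device and
volume group name in it is a made-up placeholder, not the actual layout
of this server.  The last command is included on the assumption that
the ARRAY line (with its UUID) it prints is what the boot-time
mdadm.conf has to agree with.

```shell
# Dry-run sketch: prints the commands instead of running them.
# All device and volume group names passed in are hypothetical.
plan_raid1_extend() {
    # $1 = md device, $2 and $3 = member partitions, $4 = volume group
    echo "mdadm --create $1 --level=1 --raid-devices=2 $2 $3"
    echo "pvcreate $1"
    echo "vgextend $4 $1"
    # Prints the ARRAY line, including the UUID, for the new array.
    echo "mdadm --detail --scan"
}

plan_raid1_extend /dev/md3 /dev/sdX1 /dev/sdY1 vg_example
```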

Fedora 13 to the rescue.  It was able to boot up just fine, so that's
what I did.  I upgraded to the most recent F13 kernel and let it fly.
Unfortunately the disk I/O issues from December hadn't gone away, and
indeed seemed to come back with a vengeance.  One of my VMs (my personal
web server) was "offline" for 24 hours because it had gone read-only.  I
realized quickly that I had to revert to the 2.6.27 kernel (which I
painstakingly kept around *just in case*).

After failing to get "mkinitrd" to work on Fedora 13 (Fedora changed
from initrd to initramfs in F12, replacing mkinitrd with dracut) I
decided to go about it the painful way:  manually.  I unpacked the
initrd and looked inside.  Voila, the init script only started md1 and
md2, not md3.  So all I needed to do was add my md3 configuration.  I
re-packaged the initrd, put it in place, and waited for my reboot
window.
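
A rough sketch of that unpack/re-pack cycle, assuming the initrd is a
gzip-compressed cpio archive in "newc" format (worth confirming with
"file" on the image first).  The function names and the paths in the
usage note are made up for illustration.

```shell
# Sketch only.  Assumes a gzip-compressed cpio (newc) initrd image.

abs_path() {
    # Turn a possibly relative path into an absolute one.
    case "$1" in
        /*) printf '%s\n' "$1" ;;
        *)  printf '%s\n' "$PWD/$1" ;;
    esac
}

unpack_initrd() {
    # $1 = initrd image, $2 = directory to unpack into
    img=$(abs_path "$1")
    mkdir -p "$2" || return 1
    ( cd "$2" && gzip -dc "$img" | cpio -idm --quiet )
}

repack_initrd() {
    # $1 = unpacked tree, $2 = new initrd image to write
    out=$(abs_path "$2")
    ( cd "$1" && find . | cpio -o -H newc --quiet | gzip -9 > "$out" )
}
```

Usage would look something like "unpack_initrd initrd-old.img work",
then editing work/init, then "repack_initrd work initrd-new.img".
Writing to a new file name keeps the old image around as a fallback.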

Unfortunately, that didn't work.  Back to the drawing board.  To make a
long story short:
1) I had to make the init script echo its state so I could see what was
   happening.  This was relatively easy: I just commented out the noecho
   command.
2) I had to add the sata_mv driver to the initrd.  This was easy to pull
   from the existing F10 kernel.
3) I had to add 'insmod' to the initrd, because I couldn't get
   "modprobe" to work (probably because I couldn't run 'depmod' within
   the proper context to get proper module dependencies).
4) Then I had to debug why md3 *still* wasn't coming up!  This is what
   took the longest time, and after finally using:
       mdadm --examine --scan -v 
   I was able to determine that I had the WRONG UUID for my RAID!!  I
   have no idea where I got the number I was using, but it was wrong.  I
   was able to verify that in both the initrd configuration and from the
   Fedora 13 state.   After I fixed the UUID in the mdadm.conf, voila,
   md3 started and all was happy!
5) I cleaned up the init script, turned echo back off, and packed up.
   After 14 test iterations I got a clean, working system.
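
The UUID check in step 4 is easy to script.  The sketch below compares
the ARRAY lines of an mdadm.conf-style file against saved output from
"mdadm --examine --scan -v" and reports any configured UUID the scan
did not find.  It assumes each ARRAY line carries the device name as
its second field and a UUID=... token somewhere after it; the function
names and file paths are invented for this example.

```shell
# Sketch only: assumes "ARRAY <device> ... UUID=<value> ..." lines.

array_uuids() {
    # Read mdadm.conf or scan output on stdin; print "device uuid" pairs.
    awk 'toupper($1) == "ARRAY" {
        for (i = 3; i <= NF; i++)
            if (toupper(substr($i, 1, 5)) == "UUID=")
                print $2, tolower(substr($i, 6))
    }' | sort
}

missing_uuids() {
    # $1 = mdadm.conf-style file, $2 = saved scan output.
    # Prints each configured "device uuid" whose UUID is not in the scan.
    array_uuids < "$1" | while read -r dev uuid; do
        array_uuids < "$2" |
            awk -v u="$uuid" '$2 == u { found = 1 } END { exit !found }' ||
            echo "$dev $uuid"
    done
}
```

For example, save the scan with "mdadm --examine --scan -v > scan.txt"
and run "missing_uuids mdadm.conf scan.txt".  No output means every
configured UUID was seen on disk.  Only UUIDs are compared, not device
names, because the same array may be named differently in the two
places.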

So, in conclusion:

1) Always make sure you load all the necessary drivers in your initrd
2) Double-check your UUIDs
3) mdadm --examine --scan -v is your friend

Total downtime was just over two hours, and I missed my midnight
"promised to come to bed" cutoff by only 10 minutes.  :)

Again, I'm sorry for the extended downtime, but I would like to extend a
big thank you to roe_ on IRC who helped talk me through my pain and
suffering as I was iterating through trials to get it working.

-derek

-- 
       Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
       Member, MIT Student Information Processing Board  (SIPB)
       URL: http://web.mit.edu/warlord/    PP-ASEL-IA     N1NWH
       warlord at MIT.EDU                        PGP key available