While digging into the Linux boot process, I suddenly realized that our boot process is not fault-tolerant! We have 100+ servers that boot from SAN using two Qlogic 2460 HBAs. We installed EMC PowerPath 5.0.0 on OEL 4u5 to get multipathing and automatic fail-over in case a path fails. So the OS itself is pretty well protected against hardware faults.
However, the boot process is not! Not even close! In stage 1, the boot loader reads the MBR and loads stage 1.5 so it can read the /boot ext2 partition where the kernel and initrd image are located. In our case, this is /dev/sdb1. But what if my HBA dies and /dev/sdb1 doesn't exist? GRUB may be smart enough to try a device on the other path, but probably not. And since the kernel hasn't loaded yet, PowerPath does not exist and there is no multipath awareness to save the day...
So once systems are installed and the number of disk partitions is stable, I can enhance GRUB with a boot fallback, so it will try /dev/sdb1 and, if that fails, switch to /dev/sdm1 (the actual device name depends on the number of disks (LUNs) the system has).
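A rough sketch of what that fallback could look like in GRUB legacy's /boot/grub/menu.lst. The kernel version, root partitions, and BIOS drive numbers here are placeholders: GRUB addresses disks as (hdN,M) in BIOS enumeration order, not as /dev/sdX, so whether the second path really shows up as (hd12,0) depends on how the BIOS presents the LUNs and on the device.map file.

```
# Boot the first entry (0) by default; if any command in it
# fails (e.g. the root device is gone), fall back to entry 1.
default 0
fallback 1

title OEL 4u5 (primary path, /dev/sdb1)
    root (hd1,0)
    kernel /vmlinuz-2.6.9-55.EL ro root=/dev/sdb2
    initrd /initrd-2.6.9-55.EL.img

title OEL 4u5 (secondary path, /dev/sdm1)
    root (hd12,0)
    kernel /vmlinuz-2.6.9-55.EL ro root=/dev/sdm2
    initrd /initrd-2.6.9-55.EL.img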