Showing posts from October, 2007

Oracle Linux has issues with LVM in anaconda

We've been having a variety of problems with LVM partitioning in anaconda. Mind you, once the system is configured and installed, it runs fine. But during rollout, LVM has been a real headache...
The main issue is dirty disks. Since we boot from SAN, we have no control over the state of the LUNs we get. Although we specify zerombr and clearpart --all in anaconda, LVM complains if there is anything on the disk, especially if non-Linux partitions are left on it. Altiris does its boot "magic" using a DOS partition to create a known environment from which it launches stuff. This gave the anaconda installer headaches.
Our work-around: don't do partitioning from anaconda, but use the pre-install phase (%pre). We force a clear of the local disk (sda) and our boot LUN (sdb) and clear partitions explicitly as well as remove any traces of LVM signatures. We remove all Logical Volumes (LV), Volume Groups (VG) and Physical Volumes (PV). While this is a lot of w…

Heavy usage of NFS: NFS trouble shooting

During the (scripted) updates of ESX using Altiris, we discovered a ton of timeouts on the ESX hosts. The problem was the NFS server was getting very slow and Altiris scripts were failing due to connection losses and timeouts.
Although we updated /etc/init.d/nfs and increased the number of threads/servers from 8 to 16 (default is 8 per core, the VM has 2 cores), this didn't change the behavior we observed on the ESX hosts. Still timeouts...
So I dug a little deeper and found NFS for clusters, an excellent tuning/testing document for NFS servers. I tested the server settings using the suggested nfsstat -s and observed next to nothing. All was fine. However, the /proc/net/rpc/nfsd file (using watch -d cat /proc/net/rpc/nfsd) still showed at least 6 out of 8 processes (aka threads/servers) with 20+ seconds of 100% busy. Clearly something was wrong, but it wasn't the NFS server. More testing showed that performing the scripts' actions manually pushed up to 100MB/s through…
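
For anyone who wants to read that file without eyeballing it, here is a minimal sketch. It assumes the 2.6-era layout of the "th" line (thread count, number of times all threads were busy, then ten histogram buckets in seconds); newer kernels may report this differently. The function name nfsd_thread_summary is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: summarize the "th" (threads) line of /proc/net/rpc/nfsd.
# Assumed format: th <threads> <times-all-busy> <10 histogram buckets in seconds>
nfsd_thread_summary() {
    stats_file="${1:-/proc/net/rpc/nfsd}"
    # Print the thread count, the all-busy counter and the last histogram bucket.
    awk '$1 == "th" {
        printf "threads=%s all_busy=%s top_bucket_secs=%s\n", $2, $3, $NF
    }' "$stats_file"
}
```

Calling nfsd_thread_summary without arguments reads the live file; passing a path lets you run it against a saved copy.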

Enterprise Linux 5: anaconda is multipath aware

While on the anaconda trail: OEL 5 makes the anaconda installer multipath aware when you add the option 'mpath' to the linux boot line. So if you are just starting out setting up Linux, you may want to start with version 5 instead of 4. (5 also has more recent kernels with better hardware support.)

anaconda failing to install Linux over dual paths

During network installs of Linux, we have two HBAs connected to the SAN. Each provides one or more LUNs for the OS and data, as we boot from SAN. However, we have had a lot of problems with the visibility of these LUNs to the (anaconda) installer of Oracle Enterprise Linux (OEL) 4 update 5 (4U5). The current version gets confused by disks being visible twice and can't distinguish between them. So it overwrites over the second path what it already wrote over the first.

Our current work-around is either to shut down the FC port of one of the HBA cards on the FC switch, or to hide the LUNs on the second HBA during install and reactivate them afterwards. However, both cause extra manual work and unnecessarily increase the complexity of the whole setup.

Another solution that we're testing is to only hide the boot LUN on the second HBA, not the whole set of LUNs. Anaconda (wiki) seems to have an issue writing to disks that are visible over more than one path, not reading from them. Appare…

Oracle Critical Patch Update

Oracle has released new patches for a ton of their software. Check the Oracle Critical Patch Update - October 2007 for details...

USB devices mess up Linux device enumeration

Our Dell PowerEdges were also giving us a headache with their built-in USB devices and virtual devices. While they are useful for maintenance and patching, they mess up the Linux kernel's device enumeration because they come first in that list, no matter where you put your local disk or HBA card in your boot order.

The Dell Virtual Flash and Virtual Floppy always become /dev/sda and /dev/sdb, respectively. The local PERC disk then becomes /dev/sdc and our boot LUN on the SAN ends up at /dev/sdd. Turning the Dell devices on or off in the BIOS then shuffles your devices around and suddenly your boot LUN will be sdb or sdc. No real danger, but /etc/fstab will no longer be valid, and booting the system either gives you the maintenance shell or X11 will fail and you get a text login prompt (i.e. runlevel 3).

Two solutions...
One, you can use device labelling or udev naming to have the kernel find the partitions it needs. However, a quick solution is, during production, to turn off USB devices…
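
For the labelling route, a rough sketch of what the /etc/fstab change could look like. It assumes the filesystem was labelled beforehand (for example with e2label /dev/sdd1 /boot); the function name fstab_use_label and the device and label values are purely illustrative. The function only prints the rewritten file and does not touch the original.

```shell
#!/bin/sh
# Hypothetical helper: print an fstab in which one device path is replaced
# by a LABEL= reference, so the entry survives device renumbering.
fstab_use_label() {
    device="$1"                 # e.g. /dev/sdd1
    label="$2"                  # e.g. /boot (must match the filesystem label)
    fstab="${3:-/etc/fstab}"    # file to read; it is never modified
    # Only rewrite lines that start with the exact device name followed by whitespace.
    sed "s|^$device\([[:space:]]\)|LABEL=$label\1|" "$fstab"
}
```

Review the output of fstab_use_label /dev/sdd1 /boot before copying it over the real file.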

Optimizing kickstart files

We've been optimizing our kickstart script for unattended Oracle Linux provisioning and found out some interesting tidbits...
When rolling out Linux to new Dell PowerEdge 2950 boxes that boot from (DMX-3) SAN, the anaconda installer has issues with pre-existing LVM volume groups and physical volume signatures. The zerombr option alone is not enough. Apparently, some traces remain on that disk/LUN and anaconda fails with a cryptic error message.

The solution is to use the kickstart pre-install section to clear your partitions! However, a simple parted won't do. Here is what we ended up putting in kickstart to effectively clear /dev/sda and /dev/sdb, respectively our local disks (PERC) and our boot LUN (qlogic qle2460), of all partitions and LVM housekeeping stuff:

#forcefully remove all primary partitions from sda
parted /dev/sda rm 1
parted /dev/sda rm 2
parted /dev/sda rm 3
parted /dev/sda rm 4
#forcefully remove all primary partitions from sdb
parted /dev/sdb rm 1
parted /dev/sdb …
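
As an aside, the repeated parted calls shown above can also be written as a loop. This is only a sketch of that part of the listing, not the full script: it assumes parted's -s (script mode) option is available in the %pre environment, and both the function name clear_primary_partitions and the RUN dry-run switch are made up for illustration.

```shell
#!/bin/sh
# Sketch only: the same "rm 1..4" calls as above, written as a loop.
# RUN is a hypothetical dry-run switch: with RUN=echo the commands are
# printed instead of executed.
clear_primary_partitions() {
    for disk in "$@"; do
        for part in 1 2 3 4; do
            # -s (script mode) keeps parted from prompting; errors for
            # partitions that do not exist are ignored on purpose.
            ${RUN:-} parted -s "$disk" rm "$part" || true
        done
    done
}
```

With RUN=echo set, clear_primary_partitions /dev/sda /dev/sdb just lists the eight commands it would run, which is a cheap way to check the script before it touches a LUN.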

VM-Based Rootkits Proved Easily Detectable

In light of our large virtualization project, I thought it would be sensible to point out that VM-Based Rootkits Prove Easily Detectable.
With VMs quickly moving into data centers, security is bound to become an issue. And security officers will find you sooner or later...

Anti-virus: ClamAV update

Just wanted to share an update about my clamav setup with you...
I currently have autoupdating of the data files working properly. But I still watch Dag Wieers' site for new packages of the engine and download them from there. I tried crawling his site with wget for changes and automatically copying (aka mirroring) them, but the filenames are different each time. So I rely on the update emails from clamav and then check Dag Wieers' site. He usually has them pretty quickly anyway. (Thanks for that, BTW!)
I roll out the clamav RPMs in a script from a shared NFS mount, install them, copy my custom .conf files over the default and do a freshclam to get the latest data files.
I get the data files from a local web server, where I run a check every 6 hours for updates and put them in the root of the web server. That last part is important, and it took me a while to get right: the .cvd files must be in the web server's root, not in a subdirectory.

Adding new LUN on Linux without reboot

We had a small problem today, where we wanted to add a few LUNs to a (backup) Linux server without rebooting. The problem was that the qlogic qle2460 did see the LUN but the OS wouldn't assign it a SCSI device name (e.g. /dev/sdg or /dev/sdr or so). We needed to figure out a way to rescan the LUNs on the HBA and force a detection of LUNs in the OS layer.
A new colleague solved it by using /sys, issuing a LIP and doing what is also pointed out in the old mail archive from Dell: New LUN available on Linux without reboot. Good to know. Thanks Dell and Koen!
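
For reference, a minimal sketch of that kind of /sys-based rescan. It assumes a 2.6 kernel whose FC driver exposes issue_lip under /sys/class/fc_host and scan under /sys/class/scsi_host; the function name rescan_hba and the SYSFS_ROOT override are made up for illustration, and the exact steps Koen used may have differed.

```shell
#!/bin/sh
# Hypothetical helper: rescan one HBA for new LUNs without a reboot.
# SYSFS_ROOT exists only so the function can be tried against a copy of /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

rescan_hba() {
    host="$1"    # e.g. host1; look in /sys/class/scsi_host for the right name

    # Ask the FC port to issue a LIP, if the driver supports it.
    lip="$SYSFS_ROOT/class/fc_host/$host/issue_lip"
    if [ -e "$lip" ]; then
        echo 1 > "$lip"
    fi

    # Wildcard scan: "- - -" means all channels, all targets, all LUNs.
    echo "- - -" > "$SYSFS_ROOT/class/scsi_host/$host/scan"
}
```

Run as root, for example rescan_hba host1, and then check /proc/scsi/scsi or fdisk -l for the new devices. Keep in mind that a LIP briefly disturbs traffic on that port.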

VMware support for Oracle Enterprise Linux 4 and 5

It took me a while to notice this, but VMware now indirectly supports Oracle "Unbreakable" Enterprise Linux (OEL) 4 (update 5) and 5. Check the VMware Guest Operating System Installation Guide.
Both the 32 bit and 64 bit versions of Red Hat Enterprise Linux (RHEL) 4U5 and 5 are included in ESX server 3.0.2. Since OEL is based on the same code as RHEL, this would imply support for OEL 4U5 and 5 in VI3 using the latest ESX version.