During the scripted updates of ESX using Altiris, we ran into a large number of timeouts on the ESX hosts. The NFS server had become very slow, and the Altiris scripts were failing due to connection losses and timeouts.
Although we updated /etc/init.d/nfs and increased the number of NFS threads/servers from 8 to 16 (the default is 8; at 8 per core with the VM's 2 cores, 16 seemed right), this didn't change the behavior we observed on the ESX hosts. Still timeouts...
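For reference, a minimal sketch of that change. We simply edited /etc/init.d/nfs, but on a RHEL-style system the cleaner place is usually /etc/sysconfig/nfs with the RPCNFSDCOUNT variable (exact file and variable names vary per distribution):

    # raise the nfsd thread count from the default of 8 to 16
    echo 'RPCNFSDCOUNT=16' >> /etc/sysconfig/nfs
    service nfs restart

    # or change it on the fly, without restarting the service
    rpc.nfsd 16

    # verify: the first number on the "th" line is the thread count
    grep '^th' /proc/net/rpc/nfsd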
So I dug a little deeper and found NFS for clusters, an excellent tuning/testing document for NFS servers. I checked the server with the suggested nfsstat -s and saw next to nothing unusual; all looked fine. However, /proc/net/rpc/nfsd (watched with watch -d cat /proc/net/rpc/nfsd) still showed at least 6 out of 8 nfsd processes (a.k.a. threads/servers) spending 20+ seconds at 100% busy. Clearly something was wrong, but it wasn't the NFS server itself. More testing showed that performing the scripts' actions manually pushed up to 100 MB/s through disk and network I/O. So the problem was in the network: the firewall doing packet inspection on the FC network...
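Roughly what those checks looked like. The "th" line in /proc/net/rpc/nfsd shows the thread count followed by a histogram of seconds spent at increasing levels of thread utilisation, so large values in the rightmost buckets mean (nearly) all threads were busy; the mount point in the dd test below is hypothetical:

    # per-operation server statistics; nothing suspicious here
    nfsstat -s

    # watch the nfsd thread utilisation histogram; the rightmost buckets
    # (seconds with ~100% of the threads busy) kept growing
    watch -d cat /proc/net/rpc/nfsd

    # manual throughput test of what the scripts do, over the NFS mount
    # (/mnt/esxnfs is a hypothetical mount point)
    dd if=/dev/zero of=/mnt/esxnfs/testfile bs=1M count=1024 conv=fsync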