Heavy usage of NFS: NFS troubleshooting

During the (scripted) updates of ESX using Altiris, we ran into a large number of timeouts on the ESX hosts. The NFS server appeared to be getting very slow, and the Altiris scripts were failing due to connection losses and timeouts.
We updated /etc/init.d/nfs and increased the number of threads/servers from 8 to 16 (default is 8 per core, the VM has 2 cores), but this didn't change the behavior we observed on the ESX hosts. Still timeouts...
So I dug a little deeper and found NFS for clusters, an excellent tuning/testing document for NFS servers. I tested the server settings using the suggested nfsstat -s and saw next to nothing out of the ordinary. All was fine. However, the /proc/net/rpc/nfsd file (using watch -d cat /proc/net/rpc/nfsd) still showed at least 6 out of 8 processes (aka threads/servers) with 20+ seconds of 100% busy. Clearly something was wrong, but it wasn't the NFS server: when I ran the same actions the scripts perform by hand, they pushed up to 100 MB/s through disk and network I/O. So the problem is in the network, the firewall doing packet inspection of the FC network...
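
For anyone who would rather not squint at watch output, here is a minimal Python sketch that pulls the thread line ("th") out of /proc/net/rpc/nfsd. It assumes the older kernel layout, where the line holds the thread count, the number of times all threads were busy at once, and then ten histogram values (seconds during which 0-10%, 10-20%, ... 90-100% of the threads were busy). Newer kernels report zeros in the histogram or drop the line entirely, so treat this as an illustration only. The function name parse_th_line and the SAMPLE values are made up for this example and are not taken from our server.

```python
import os

# Illustrative content only: these numbers are invented for the example.
SAMPLE = """rc 0 10 20
th 8 594 3733.140 83.850 96.660 52.490 39.460 20.260 12.950 9.250 7.330 21.560
"""


def parse_th_line(text):
    """Return (threads, all_busy_count, histogram) from the 'th' line, or None."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "th":
            threads = int(fields[1])        # number of nfsd threads
            all_busy = int(fields[2])       # times every thread was busy at once
            histogram = [float(v) for v in fields[3:]]  # seconds per busy bucket
            return threads, all_busy, histogram
    return None


def main(path="/proc/net/rpc/nfsd"):
    # Fall back to the sample text when not running on an NFS server.
    if os.path.exists(path):
        with open(path) as handle:
            text = handle.read()
    else:
        text = SAMPLE
    parsed = parse_th_line(text)
    if parsed is None:
        print("no 'th' line found (newer kernels may not provide it)")
        return
    threads, all_busy, histogram = parsed
    print("threads:", threads)
    print("times all threads were busy:", all_busy)
    for index, seconds in enumerate(histogram):
        print("%3d-%3d%% of threads busy: %.3f s" % (index * 10, (index + 1) * 10, seconds))


if __name__ == "__main__":
    main()
```

A large value in the last bucket, or a steadily climbing all-busy counter, would point at the thread pool being saturated. In our case the threads were busy mostly because they were waiting on the network, which is why adding more of them did not help.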
