I have been seeing strange errors in some Linux VMs running on VMware ESX 3.0.x (and perhaps on ESX 3.5 too, as we're migrating to that). The symptoms are always the same: users or a maintenance task run into odd failures, and when you check the VM's console, you discover that the file system has suddenly become read-only.
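To spot this symptom without waiting for a user to complain, a small check over `/proc/mounts` can help. What follows is only a minimal sketch: the function name `read_only_mounts` is made up for illustration, and it assumes the usual `/proc/mounts` layout of device, mount point, file system type and comma-separated options.

```python
def read_only_mounts(mounts_text):
    """Return (device, mount_point) pairs whose mount options include 'ro'.

    mounts_text is the content of /proc/mounts (hypothetical helper,
    not part of any VMware or distribution tooling).
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, _fstype, options = fields[:4]
        # Compare whole options so 'errors=remount-ro' does not count as 'ro'.
        if "ro" in options.split(","):
            hits.append((device, mount_point))
    return hits
```

On a guest you would feed it the real file, for example `read_only_mounts(open("/proc/mounts").read())`. Some mounts are read-only on purpose (CD-ROMs, for instance), so the result still needs a human look.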
My dear friend MrVanes discovered that there is a known issue with the LSI driver (mptscsi) used in VMware guests running Debian:
"...one could get SCSI timeouts when there is massive workload on the host system. Some kernel versions these will get the file systems remounted read-only, which probably makes sense for real hardware, but doesn't make sense for emulated hardware. Instead it should just wait a bit longer."
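How long the guest kernel waits before giving up on a SCSI command is exposed per disk in sysfs, as `/sys/block/sdX/device/timeout` (in seconds). The sketch below only reads those values; the function name `scsi_timeouts` and the `sd*` naming of the disks are assumptions made for illustration.

```python
import glob
import os


def scsi_timeouts(sys_block="/sys/block"):
    """Map each sd* disk name to its SCSI command timeout in seconds.

    Hypothetical helper: it reads <sys_block>/sdX/device/timeout for
    every disk it finds and returns an empty dict if there are none.
    """
    timeouts = {}
    pattern = os.path.join(sys_block, "sd*", "device", "timeout")
    for path in glob.glob(pattern):
        disk = path.split(os.sep)[-3]  # .../sda/device/timeout -> 'sda'
        with open(path) as f:
            timeouts[disk] = int(f.read().strip())
    return timeouts
```

Writing a larger number to the same file as root raises the timeout until the next reboot. Which value is sensible depends on your storage, so treat any number you pick as something to test rather than a fix.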
A savvy user called TuxyTurvy has thoroughly investigated this issue. He was using RHEL 4 and 5 on Dell 1850 servers with a CLARiiON AX150i el-cheapo SAN. When I/O load gets high under VMware, storage contention makes SCSI timeouts and busy conditions far more likely than normal.
TuxyTurvy has a patch for RHEL 4U5 and 5. VMware would support him due to the HCL and Red Hat said that they would go with whatever VMware and LSI decided. I'm not sure whether patches are available by now (his post is from October 2006) or whether VMware has updated the LSI driver through VMware Tools, but I'll certainly look into this and stress test some VMs! The post also has useful comments below it that show the problem is not isolated.
Update:
Check your logs for this message:
"Nov 28 04:43:20 webhost kernel: [157267.731622] mptscsih: ioc0: task abort: SUCCESS (sc=f6840000)
Nov 28 04:43:20 webhost kernel: [157267.731600] mptscsih: ioc0: attempting task abort! (sc=f6840000)
Nov 28 04:43:20 webhost kernel: [157267.731622] mptscsih: ioc0: task abort: SUCCESS (sc=f6840000)"
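If you have more than a handful of VMs, grepping by hand gets old quickly. Here is a rough sketch that picks the mptscsih task abort lines out of a log; the regular expression is modelled only on the three lines quoted above, and the name `find_task_aborts` is invented for this example.

```python
import re

# Matches both 'attempting task abort!' and 'task abort: <RESULT>' lines.
ABORT_RE = re.compile(
    r"mptscsih: (?P<ioc>ioc\d+): "
    r"(?:attempting task abort!|task abort: (?P<result>\w+)) "
    r"\(sc=(?P<sc>[0-9a-fA-F]+)\)"
)


def find_task_aborts(lines):
    """Return (ioc, sc, status) tuples for every task abort line found.

    status is the logged result (e.g. 'SUCCESS'), or 'ATTEMPT' for the
    'attempting task abort!' lines, which carry no result.
    """
    events = []
    for line in lines:
        m = ABORT_RE.search(line)
        if m:
            events.append((m.group("ioc"), m.group("sc"),
                           m.group("result") or "ATTEMPT"))
    return events
```

You could run it over an open log file, since a file object iterates line by line. Where the kernel messages end up differs per distribution (`/var/log/kern.log` or `/var/log/messages` are the usual suspects), so adjust the path to your setup.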