A. If your storage ain't happy, your virtual machines (VMs) won't be happy either. Storage problems are insidious in virtual environments because they can be particularly challenging to track down. Why is your Exchange server running slowly today? Is it getting enough processor attention? Can your storage keep up?

One way to identify storage performance problems is to use the esxtop command, which measures how much storage I/O your virtual environment is using. You can configure it to show storage performance per HBA, per LUN, or per individual VM. It all starts with entering the esxtop command in the Service Console, but figuring out which keys to press to get the right data can get confusing.

Navigate to your ESX host's Service Console and enter the esxtop command. Here's a cheat sheet for which characters you'll want to hit:

  • Monitoring storage performance per HBA: d, f, b, c, d, e, h, j, s, 2, Enter.
  • Monitoring storage performance per LUN: u, f, b, c, f, h, s, 2, Enter.
  • Monitoring storage performance per VM: v, f, b, d, e, h, j, s, 2, Enter.
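If it helps to see why each sequence looks the way it does, here is a rough Python sketch that splits the three sequences above into their parts. It assumes the usual esxtop conventions: the first key picks the view, f opens the field selector, the letters that follow toggle fields, and s followed by a number and Enter sets the refresh interval in seconds. The names KEYS and describe are made up for illustration and are not part of esxtop.

```python
# Keystroke sequences copied from the cheat sheet, keyed by a
# hypothetical short name for each view.
KEYS = {
    "hba": ["d", "f", "b", "c", "d", "e", "h", "j", "s", "2", "Enter"],
    "lun": ["u", "f", "b", "c", "f", "h", "s", "2", "Enter"],
    "vm": ["v", "f", "b", "d", "e", "h", "j", "s", "2", "Enter"],
}


def describe(view):
    """Split one sequence into view key, field toggles, and interval.

    Assumption: the first 'f' opens the field selector and the first
    's' after it starts the refresh-interval prompt.
    """
    keys = KEYS[view]
    start = keys.index("f") + 1      # first toggle after the selector opens
    end = keys.index("s", start)     # 's' ends the toggles
    return {
        "view_key": keys[0],
        "field_toggles": keys[start:end],
        "interval_seconds": int(keys[end + 1]),
    }


if __name__ == "__main__":
    for name in KEYS:
        print(name, describe(name))
```

All three sequences end the same way, so each view refreshes every 2 seconds; only the view key and the toggled fields differ.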

Four columns are important in the data you'll now see:

  • CMDS/s is the number of I/O operations per second (IOPS) going in and out of the element being monitored.
  • DAVG/cmd is the average response time in milliseconds per command being sent to the element.
  • KAVG/cmd is the amount of time the command spends in the VMkernel.
  • GAVG/cmd is the response time as perceived by the guest VM. This number is the sum of DAVG and KAVG.
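The relationship between the three latency columns can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the LatencySample class and the sample values are hypothetical and do not come from real esxtop output.

```python
from dataclasses import dataclass


@dataclass
class LatencySample:
    """One hypothetical reading of the latency columns, in milliseconds."""

    davg_ms: float  # DAVG/cmd: average response time per command
    kavg_ms: float  # KAVG/cmd: time the command spends in the VMkernel

    @property
    def gavg_ms(self) -> float:
        # GAVG/cmd: what the guest VM perceives, the sum of DAVG and KAVG.
        return self.davg_ms + self.kavg_ms


if __name__ == "__main__":
    # Made-up numbers, chosen only to show the sum.
    sample = LatencySample(davg_ms=6.5, kavg_ms=0.5)
    print(f"GAVG/cmd = {sample.gavg_ms} ms")
```

If GAVG/cmd looks high, comparing its two parts shows whether the time is going to DAVG or to the VMkernel.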

This information is further broken down into read and write metrics, with {x}AVG/rd referring to read response time and {x}AVG/wr referring to write response time.

These numbers are best read against a baseline of what counts as good performance in your environment. However, the VMware article "Using esxtop to Identify Storage Performance Issues" suggests that values above 10 ms for DAVG/cmd, KAVG/cmd, and GAVG/cmd may point to problems with switch hardware or limits in an array's ability to handle the necessary load (such as in a spindle contention situation). Response times over 5000 ms will create SCSI aborts in /var/log/vmkernel.
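As a rough sketch, the two thresholds mentioned above could be turned into a quick triage check in Python. The function name, the labels, and the sample readings are all assumptions made for illustration; your own baseline may justify different cutoffs.

```python
# Thresholds taken from the text above, in milliseconds.
WARN_MS = 10.0     # above this, latency may point to a storage problem
ABORT_MS = 5000.0  # above this, expect SCSI aborts in /var/log/vmkernel


def classify_latency(latency_ms: float) -> str:
    """Return a hypothetical triage label for one DAVG/KAVG/GAVG value."""
    if latency_ms > ABORT_MS:
        return "abort-risk"
    if latency_ms > WARN_MS:
        return "investigate"
    return "ok"


if __name__ == "__main__":
    # Made-up readings, not real esxtop output.
    for value in (3.2, 10.0, 42.0, 6000.0):
        print(f"{value} ms -> {classify_latency(value)}")
```

A value of exactly 10 ms counts as "ok" here because the text speaks of values above 10 ms.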
