vCOPs provides me with excellent data and information. The problem is building enough knowledge to understand and translate what I am seeing.
The example shown is peak disk IO being very different from average IO over 7 days. (Commands per second is what vCOPs calls IO.) Drilling into a peak shows that it lasts only a few minutes within a 1 hour window. The metric graphs are then used to add vCenter commands per second for the top 5 VMs. This is repeated for each peak to find a common VM that is causing the peaks in disk IO usage.
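The comparison across peaks can be sketched in a few lines of Python. This is only an illustration, assuming the per-VM commands per second for each peak have already been exported by hand; the VM names and numbers are made up, and `common_top_vms` is a hypothetical helper, not part of vCOPs or vCenter.

```python
def common_top_vms(peaks, top_n=5):
    """Return the VMs that appear in the top_n list of every peak."""
    common = None
    for samples in peaks:
        # Rank VMs in this peak by commands per second, highest first.
        top = set(sorted(samples, key=samples.get, reverse=True)[:top_n])
        common = top if common is None else common & top
    return sorted(common or [])


# Hypothetical commands-per-second readings for three separate peaks.
peaks = [
    {"sql01": 900, "web01": 120, "web02": 110, "app01": 95, "file01": 80, "dc01": 10},
    {"sql01": 870, "app02": 140, "web01": 100, "file01": 90, "app01": 60, "dc01": 12},
    {"sql01": 910, "web02": 150, "app02": 130, "dc01": 70, "file02": 65, "web01": 20},
]

print(common_top_vms(peaks))
```

Only a VM that lands in the top 5 of every peak survives the intersection, which is the same reasoning as comparing the graphs by eye.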
Next we look at the suspicious VM and compare it with the vCenter total IOPS report. The lines align, so this is the VM. The question now: is it read or write intensive? The result: a SQL box with write-intensive peak usage every day. It takes some background knowledge to suspect a SQL agent on the box. Every day at 4:45 this SQL agent is configured to run multiple scheduled jobs. These jobs could be divided to run over multiple time slots at non-peak times.
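As a rough sketch of what dividing the jobs could look like, the snippet below spreads a list of jobs over evenly spaced start times. The job names, the 01:00 first slot, and the 45 minute gap are assumptions for illustration only; the real jobs and the real quiet window would come from your own environment.

```python
from datetime import datetime, timedelta


def stagger_jobs(jobs, first_start, gap_minutes):
    """Give each job its own start time, gap_minutes apart (HH:MM strings)."""
    start = datetime.strptime(first_start, "%H:%M")
    return {
        job: (start + timedelta(minutes=i * gap_minutes)).strftime("%H:%M")
        for i, job in enumerate(jobs)
    }


# Hypothetical job names; previously all of them would start at the same time.
schedule = stagger_jobs(
    ["backup_logs", "rebuild_indexes", "update_stats"],
    first_start="01:00",
    gap_minutes=45,
)
for job, start_time in schedule.items():
    print(start_time, job)
```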
The VMware management blog has the recording of this demo, which you can reuse in your own environments. It was called "analyze and optimize". Click Here for the Blog Post
Several useful dashboards are displayed. These have scoreboards showing things like the total capacity of clusters and current memory and CPU usage. They are colored by health and include reserve space for things like failover or procurement time buffers.
Capacity risk is reported based on your own knobs. I have a general rule I use in my environments: "6, 7, 8, 9, it’s resume time." It is catchy and helps me remember its purpose and value.
60% – Analyze and attempt to reduce.
70% – Begin procurement of additional resources.
80% – Stop provisioning new workloads.
90% – Watch closely, actively move workloads out.
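The rule above can be written down as a small lookup. This is a minimal sketch, assuming utilization arrives as a percentage; the `capacity_action` name and the "No action needed" result below 60% are my own additions, not something vCOPs provides.

```python
# Thresholds from the 6, 7, 8, 9 rule, checked from highest to lowest.
THRESHOLDS = [
    (90, "Watch closely, actively move workloads out"),
    (80, "Stop provisioning new workloads"),
    (70, "Begin procurement of additional resources"),
    (60, "Analyze and attempt to reduce"),
]


def capacity_action(used_percent):
    """Return the action for the highest threshold that has been reached."""
    for floor, action in THRESHOLDS:
        if used_percent >= floor:
            return action
    return "No action needed"  # assumption: below 60% nothing is required


print(capacity_action(72))
```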