Aug 29

vCOPs provides me with excellent data and information. The problem is building enough knowledge to understand and interpret what I am seeing.

One example showed peak disk IO that was very different from the average IO over 7 days. Commands per second is what vCOPs calls IO. Drilling into the peak shows that it lasts only a few minutes within a 1 hour window. The metric graphs are then used to add vCenter commands per second for the top 5 VMs. These are compared at each peak to find a common VM that is causing the peak in disk IO usage.
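
The "find the common VM" step can be sketched in a few lines of Python. This is only an illustration of the reasoning, not anything vCOPs exposes: the VM names and commands-per-second numbers are made up, and `common_top_vms` is a hypothetical helper.

```python
def common_top_vms(peaks, top_n=5):
    """Return the VMs that rank in the top_n by commands/sec at every peak.

    peaks: list of dicts mapping VM name -> commands per second,
           one dict per observed peak (hypothetical sample data).
    """
    common = None
    for samples in peaks:
        # Rank VMs by commands per second, busiest first, and keep the top_n.
        top = set(sorted(samples, key=samples.get, reverse=True)[:top_n])
        common = top if common is None else common & top
    return sorted(common or [])


# Hypothetical commands/sec readings taken at three separate peaks.
peaks = [
    {"sql01": 4200, "web01": 310, "web02": 290, "app01": 650, "file01": 120, "dc01": 40},
    {"sql01": 3900, "web03": 330, "app02": 700, "mail01": 510, "bkp01": 480, "dc01": 35},
    {"sql01": 4100, "web01": 20, "app01": 90, "mail01": 600, "file01": 550, "bkp01": 530},
]
print(common_top_vms(peaks))  # ['sql01']
```

Only the VM that shows up in the top 5 at every peak survives the intersection, which is the same conclusion you reach by eye when comparing the graphs.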

Next we look at the suspicious VM and compare it with the vCenter total IOPS report. The lines align, so this is the VM. The question now: is it read or write intensive? The result: a SQL box with write-intensive peak usage every day. Some knowledge is needed to suspect a SQL agent on the box. Every day at 4:45 this SQL agent is configured to run multiple scheduled jobs. These jobs could be divided to run over multiple time slots at non-peak times.
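
As a rough sketch of what "divide the jobs over multiple time slots" could look like, here is a small Python helper that spreads a list of jobs out at a fixed gap. The job names, the 01:00 start, and the 45 minute gap are all assumptions for illustration; the real schedule would be set in SQL Server Agent itself.

```python
from datetime import datetime, timedelta


def stagger_jobs(jobs, first_start, gap_minutes):
    """Assign each job its own start time, gap_minutes apart.

    jobs:        list of job names (hypothetical)
    first_start: "HH:MM" start time for the first job
    """
    start = datetime.strptime(first_start, "%H:%M")
    return {
        job: (start + timedelta(minutes=i * gap_minutes)).strftime("%H:%M")
        for i, job in enumerate(jobs)
    }


# Instead of everything firing at 4:45, spread the jobs over quieter slots.
schedule = stagger_jobs(["backup", "reindex", "stats"], "01:00", 45)
print(schedule)  # {'backup': '01:00', 'reindex': '01:45', 'stats': '02:30'}
```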

The VMware management blog has a recording of this demo that you can reuse in your own environment. The session was called Analyze and Optimize.

Several useful dashboards are displayed. These have scoreboards showing things like the total capacity of clusters and current memory and CPU usage. They are colored by health and include reserve space for things like failover or procurement time buffers.

Capacity risk is reported based on your own knobs. I use a general rule in my environments: 6, 7, 8, 9, it’s resume time. It is catchy and helps me remember its purpose and value.

    60% – Analyze and attempt to reduce.
    70% – Begin procurement of additional resources.
    80% – Stop provisioning new workloads.
    90% – Watch closely, actively move workloads out.
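
The rule above is easy to turn into a lookup. A minimal Python sketch, assuming usage and capacity are given in the same unit (GB, GHz, whatever you track); `capacity_action` is a hypothetical helper and not part of vCOPs.

```python
# Thresholds from the 6, 7, 8, 9 rule, checked from highest to lowest.
THRESHOLDS = [
    (90, "Watch closely, actively move workloads out."),
    (80, "Stop provisioning new workloads."),
    (70, "Begin procurement of additional resources."),
    (60, "Analyze and attempt to reduce."),
]


def capacity_action(used, total):
    """Return the action for the highest threshold that usage has reached."""
    pct = 100 * used / total
    for limit, action in THRESHOLDS:
        if pct >= limit:
            return action
    return "No action needed."


print(capacity_action(75, 100))  # Begin procurement of additional resources.
```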

Aug 28

Orchestrator is probably the second-best product VMware makes; this is, of course, my opinion. The kicker? It’s free. Free!

vCenter Orchestrator uses drag-and-drop design, is very scalable, has flexible triggers, and is fully integrated with vSphere, vCenter, and vCAC.

There are thousands of out-of-the-box workflows. Eleven VMware products have plugins for vCO.

The Solution Exchange on VMware’s website has a vCenter Operations Remediation Workflow Package. It allows workflows to be launched in response to alerts from vCOPs, and it makes use of SNMP traps.

One use case shown is a datastore nearing capacity. An alert could trigger a workflow that finds powered-off VMs to move off the datastore and sends you an email.

Two powerful examples of self-healing are automating configurations and automating incidents.

Example 1. Enable HA/DRS and ensure they are set correctly.
The workflow calculates the HA % based on the number of hosts and ensures that HA and DRS are enabled. All of this was automated with vCO and some JavaScript.
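
The session did not spell out the formula behind the HA %, so the following is an assumption on my part: a common convention for percentage-based admission control is to reserve the share of the cluster that the tolerated host failures represent, rounded up. The sketch is in Python for illustration (the real workflow used JavaScript inside vCO), and `ha_reserve_percent` is a hypothetical helper.

```python
import math


def ha_reserve_percent(host_count, failures_to_tolerate=1):
    """Percentage of cluster resources to reserve for failover.

    Assumption: reserve failures_to_tolerate / host_count of the cluster,
    rounded up to a whole percent. Not taken from the session slides.
    """
    if host_count < 2:
        raise ValueError("HA needs at least two hosts")
    if not 0 < failures_to_tolerate < host_count:
        raise ValueError("failures_to_tolerate must be between 1 and host_count - 1")
    return math.ceil(100 * failures_to_tolerate / host_count)


print(ha_reserve_percent(4))  # 25
print(ha_reserve_percent(8))  # 13
```

Recalculating this whenever a host is added or removed is exactly the kind of small, repetitive check that suits a scheduled workflow.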

Example 2. Incident response to a filling datastore.
When vCO receives an SNMP trap from vCOPs, it kicks off a vCO workflow for Storage DRS in vCenter.

Full examples are available on v-nick.com

You can take this to the next level. How about a workflow that creates your change order? Instead of just Storage DRS, how about adding a new datastore? Or use SolarWinds and SCCM to expand a disk.

Be stewards of vCO and get other teams involved: network, storage, everyone. Keep it simple, reuse code, and look at existing workflows.

EMC has released a plugin for vCO to orchestrate UIM.

Advice, considerations, and tips.

    Map out your process before you automate.
    Factor in alert storms.
    Know when to give up; the workflow only knows as much as you teach it.
    Establish credibility with the low-hanging fruit.
    Don’t reinvent the wheel.
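
On the alert storm tip: one simple way to factor storms in is a cooldown per alert, so a burst of identical traps launches the workflow once instead of fifty times. This is a sketch of that idea in Python, not something from the session or a vCO feature; `AlertGate`, the alert keys, and the 600 second cooldown are all assumptions.

```python
class AlertGate:
    """Let a workflow run at most once per cooldown window for each alert key."""

    def __init__(self, cooldown_seconds):
        self.cooldown = cooldown_seconds
        self._last_fired = {}  # alert key -> time the workflow last ran

    def should_run(self, alert_key, now):
        """Return True if the workflow should run for this alert at time now."""
        last = self._last_fired.get(alert_key)
        if last is not None and now - last < self.cooldown:
            return False  # still cooling down, treat as part of the storm
        self._last_fired[alert_key] = now
        return True


gate = AlertGate(cooldown_seconds=600)
print(gate.should_run("ds01-full", now=0))    # True
print(gate.should_run("ds01-full", now=30))   # False
print(gate.should_run("ds02-full", now=40))   # True
print(gate.should_run("ds01-full", now=700))  # True
```

Suppressed alerts do not reset the timer here, so a storm that never stops still gets a fresh workflow run once per window.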

Excellent session. Need to find the slide deck from Part 1 and start keeping an eye on v-nick.com.

Aug 30

I recently acquired vCenter Operations Enterprise. I had VMware run a PoC to ensure it met my requirement for a consolidated view of what is really going on in my converged infrastructure. Although I have fairly extensive knowledge of vCOPs, I am very interested in getting an under-the-covers look at how some of the data is calculated. I am also interested in how best to monitor my operation and which KPIs I should really be looking at. Then I want to learn how best to interpret some of them. Let’s see what this session has to offer…

1. Lol, the first slide says it’s not just black magic! I like this session already.
2. With virtualization, capacity is now fluid. I agree.
3. Invisible walls: VM CPU and memory issues may not be resolved by adding more, because contention can play a significant role. Proper troubleshooting is required.
4. With VMware View you have to monitor end users, not VMs. I agree here as well, since a user may move between virtual desktops. End user experience is important.
5. The key thing for vCOPs to do for me is to take the tons of metrics I have and present the end calculation to me. Am I green or red?
6. Dynamic threshold analysis uses competing algorithms, meaning the system actually uses multiple methods to calculate the trend, then checks which one is right more often and uses that method. Genius. These are calculated every night.
7. A version change can cause the normal operation of a system to change. Thresholds may not catch this but trends have a better chance. I liked this idea.
8. Trending noise to determine abnormalities. This is really going to help my environment, since we use a number of tools that all send emails for every little thing. We use our brains today to get a feel for the data center health. I declared this a broken model earlier in the year.
9. Alerts should be an indication of a real problem, yes yes yes! Do not alert on every threshold that is reached. Yes please. May I subscribe to your newsletter?
10. Root cause determination in this product is really root metric determination. It isn’t telling you what the problem was, just which monitored metric was the starting metric of the issue. For example, we saw disk latency go to 100 ms before the app crashed.
11. Workload is demand divided by entitlement.
12. Right-sizing is a concept I always support, but the VM admins and I seem to be alone on the front line of this.
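
To make the "competing algorithms" idea from note 6 concrete for myself, here is a toy Python sketch. It is not how vCOPs actually does it; the two predictors, the tolerance, and the sample series are all assumptions. It only shows the principle: run several methods against history, count how often each one was right, and keep the winner.

```python
def last_value(history):
    """Predict that the next sample equals the most recent one."""
    return history[-1]


def moving_average(history, window=3):
    """Predict the next sample as the mean of the last `window` samples."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def pick_best_model(series, models, tolerance, warmup=3):
    """Backtest each model on the series and return (best_name, hit_counts).

    A prediction counts as a hit when it lands within `tolerance`
    of the value that was actually observed next.
    """
    hits = {name: 0 for name in models}
    for i in range(warmup, len(series)):
        past, actual = series[:i], series[i]
        for name, predict in models.items():
            if abs(predict(past) - actual) <= tolerance:
                hits[name] += 1
    best = max(hits, key=hits.get)
    return best, hits


# Hypothetical metric that swings between two levels.
series = [10, 20, 10, 20, 10, 20, 10, 20]
models = {"last_value": last_value, "moving_average": moving_average}
best, hits = pick_best_model(series, models, tolerance=7)
print(best, hits)  # moving_average {'last_value': 0, 'moving_average': 5}
```

Rerunning the contest every night, as the session described, would let the winning method change as the workload's behavior changes.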