Quick Lesson: VM Performance Checklist for Troubleshooting

Let’s set the stage for this VMware troubleshooting lesson…

It’s 2 o’clock in the morning and you just received a call from the NOC. Hundreds of users from around the world are calling the help desk to report very slow performance [and some are unable to log in].


Welcome to virtualization admin reality!

For many virtualization admins this is a common event, and having a plan or checklist for troubleshooting VM performance is key to getting back to sleep.

5 Basic VM Troubleshooting Steps

Over the years I’ve learned what works best in these situations.

Having a basic VM performance checklist that helps me analyze the stack from the bottom up (environment to the user) or top down (user to the environment) has been the key.

So let’s get the lesson started…

First, Gather Information

1. Before you do anything, get a complete update from the NOC or help desk on what’s happening and who is impacted, and try to narrow the issue down to a location or application pool.

You need to narrow it down because many applications span more than one data center (or cloud) location. The issue could be caused by a patch installed on a server halfway across the world, which would mean the wrong admin was pulled out of bed.

Once you verify the issue is within your realm of responsibility and you understand the impact, you’re ready to start troubleshooting.

  • Is the outage due to too many users trying to use the system? Sometimes systems fail because there aren’t enough system resources to handle hordes of users visiting after an advertisement campaign [this really does happen when marketing and IT are out of sync].
  • Is there an outage with one of the service providers [Internet and telecom providers have outages too]?

Begin Troubleshooting Steps

2. As I said, I start from the bottom of the stack and work upward [depending on your environment you can start at the top and work downward, too].

The first step I usually do is make sure there’s power and check if something has failed.

  • Are there any alarms or alerts in your email from equipment?
  • What alerts has the NOC been seeing on their monitors? Many environments have SiteScope or other monitoring tools.
  • Is the issue due to a hardware failure, such as a power supply, disk or server crash?
  • Is a dependency offline, such as an expired license or a 3rd party vendor issue?

Once you have checked the alarms and verified power and hardware, move up the stack if the problem still exists.

3. Now it’s time to drill down and see if something is going on with network, storage or servers.

  • Is there network congestion caused by backups running or something similar?
  • Are backups thrashing storage and using up all your IOPS?
  • Is your ESXi host maxed because too many VMs are sucking up all the memory and CPU?
  • Is something else going on that is causing performance issues?
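The "is my host maxed" question above can be turned into a quick script once you have utilization numbers in hand. This is only a minimal sketch: it assumes you have already exported CPU and memory percentages from whatever monitoring tool you use, and the `HostSample` class, the `saturated_hosts` function, the host names and the 90% limits are all hypothetical placeholders, not VMware-defined values.

```python
from dataclasses import dataclass


@dataclass
class HostSample:
    """One utilization reading for an ESXi host (values are percentages)."""
    name: str
    cpu_pct: float
    mem_pct: float


def saturated_hosts(samples, cpu_limit=90.0, mem_limit=90.0):
    """Return (host name, reasons) for every host at or above a limit.

    The 90% defaults are illustrative assumptions; tune them to your environment.
    """
    flagged = []
    for s in samples:
        reasons = []
        if s.cpu_pct >= cpu_limit:
            reasons.append(f"CPU {s.cpu_pct:.0f}%")
        if s.mem_pct >= mem_limit:
            reasons.append(f"memory {s.mem_pct:.0f}%")
        if reasons:
            flagged.append((s.name, reasons))
    return flagged


# Hypothetical readings, e.g. copied from a monitoring dashboard export.
samples = [
    HostSample("esxi-01", cpu_pct=45.0, mem_pct=62.0),
    HostSample("esxi-02", cpu_pct=97.0, mem_pct=71.0),
    HostSample("esxi-03", cpu_pct=93.0, mem_pct=95.0),
]

for name, reasons in saturated_hosts(samples):
    print(f"{name}: {', '.join(reasons)}")
```

With the sample data, only the second and third hosts are reported, which tells you where to look first before moving up the stack.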

Once you have finished checking the infrastructure, move up the stack and log into vCenter if the problem still exists.

Next, Start the Deep Dive

4. Log into vCenter and check the VMs that are in the application pool having the issue.

  • Are there any alerts or warnings showing up in vCenter?
  • Are the VMs powered on? Sometimes the server will accidentally get powered off after patching or code updates.
  • How do the performance charts look for memory, CPU, storage latency and network?
  • Do you see anything else going on in vCenter that could be causing the VM performance issue?

Because vSphere shares resources, sometimes a bully VM could be using up all the resources and strangling the virtual servers sharing the same host or datastore. Look for signs of other VMs having issues.
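One rough way to spot a bully VM is to compare each VM's usage with the total for its host. The sketch below assumes you have per-VM usage figures from the vCenter performance charts in any consistent unit; the `find_bully_vms` function, the VM and host names and the 60% share limit are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def find_bully_vms(vm_usage, share_limit=0.6):
    """Flag VMs consuming more than share_limit of their host's measured usage.

    vm_usage: iterable of (host, vm_name, usage) tuples, where usage is any
    consistent unit (MHz of CPU, IOPS, MB of active memory).
    """
    per_host = defaultdict(list)
    for host, vm, usage in vm_usage:
        per_host[host].append((vm, usage))

    bullies = []
    for host, vms in per_host.items():
        total = sum(usage for _, usage in vms)
        if total == 0 or len(vms) < 2:
            continue  # nothing running, or no neighbours to starve
        for vm, usage in vms:
            share = usage / total
            if share > share_limit:
                bullies.append((host, vm, share))
    return bullies


# Hypothetical CPU usage in MHz per VM.
usage = [
    ("esxi-01", "web-01", 800),
    ("esxi-01", "web-02", 700),
    ("esxi-01", "report-01", 6000),
    ("esxi-02", "db-01", 3000),
    ("esxi-02", "app-01", 2800),
]

for host, vm, share in find_bully_vms(usage):
    print(f"{vm} on {host} is using {share:.0%} of measured usage")
```

A high share is not proof of a problem on its own, so treat a flagged VM as a lead to check against the charts rather than a verdict.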

If vCenter shows that performance on the VMs and shared resources looks good, move up the stack and start logging into the VMs if the problem still exists.

Diving Deeper

5. Traditional application stacks have at least 3 servers [database, application and web], but this can scale to hundreds of servers in an app-pool, so becoming familiar with your environment is key to knowing where to start.

Since our example is about users complaining of slowness, let’s check the web server first because it’s at the front end.

  • Did you notice any latency when logging into the server?
  • Or maybe the login failed because the server is frozen and needs to be rebooted?
  • After logging in, how’s the VM performance? Check perfmon [Windows] or run top [Linux].
  • Are there any stopped services that need to be restarted, such as IIS or Apache?
  • Is there a resource issue like memory, CPU or storage space?
  • If all looks good, can you ping the gateway?
  • Is storage latency too high?
  • Is antivirus scanning or backups running?
  • Did you check the logs for errors?
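A couple of the checks above (storage space and log errors) are easy to script so you run them the same way on every server in the stack. This is a minimal sketch using only the Python standard library; the function names, the 90% disk limit, the keyword list and the sample log lines are assumptions for illustration, so swap in your own paths and thresholds.

```python
import shutil


def disk_used_pct(path="/"):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total if usage.total else 0.0


def count_error_lines(lines, keywords=("error", "critical", "fatal")):
    """Count log lines containing any keyword (case-insensitive)."""
    return sum(1 for line in lines if any(k in line.lower() for k in keywords))


def triage(disk_pct, error_count, disk_limit=90.0):
    """Turn raw readings into a list of human-readable findings."""
    findings = []
    if disk_pct >= disk_limit:
        findings.append(f"disk {disk_pct:.0f}% full")
    if error_count:
        findings.append(f"{error_count} error line(s) in log")
    return findings or ["no obvious problems found"]


# Hypothetical log excerpt; in practice read the real file with open(...).
sample_log = [
    "INFO request served in 120 ms",
    "ERROR upstream timed out",
    "CRITICAL worker pool exhausted",
]

print(triage(disk_used_pct("/"), count_error_lines(sample_log)))
```

Keeping the measurement functions separate from `triage` means you can feed it numbers gathered any other way, for example pasted from perfmon or top.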

Now go through each server in the stack and check the same things. If nothing can be found, move up the stack again and get other teams involved.

Warm Hand-off…

Never just assume the NOC or help desk know what is going on when an incident is in progress. Keep them updated as you work your way through the issue.

And once you have ruled out the infrastructure, VM, and OS as the root cause of the performance issue or outage, it’s time to transfer ownership of the issue to the next layer of the stack.

Make sure you do a warm hand-off and someone acknowledges they are picking up the ball now.

For example, if the DBA is now checking the DB stack, make sure they know to do a clean pass-off to the application team next and to keep the NOC updated.

This will save you from getting called again once you have returned to bed because the issue was left hanging unresolved and nobody knew you were done with your part.

Single Point of Support

If you support a small IT shop and everything lands on one virtualization engineer, then the next step will be to check the DB and application.

  • Is the DB blocking or locking?
  • Is SQL running?
  • Has a gigantic report been running for hours and taking up all the resources?
  • Is the application hanging?
  • Was there a code update pushed recently?
  • Did a developer make an unapproved change?
  • Was the system hacked?

Lessons Learned

I did say quick, so this brings us to the end of this VM performance troubleshooting lesson. And yes, there are many other steps you can add to the checklist if I missed something unique to your vSphere cloud.

My goal here was to provide a high-level guide for beginner admins and help them understand the need for a troubleshooting plan or checklist to follow… rather than jumping headfirst into the abyss of a chaotic incident call.


Bonus Content: VM Performance Best Practices Checklist

  1. Split the VM’s VMDK volumes across separate datastores.
  2. Create separate vControllers for each VMDK (especially on a database VM).
  3. Configure Windows or the guest OS to manage the swap file.
  4. Keep VMware Tools updated.
  5. Use 1:1 memory and CPU ratios (don’t over-leverage hardware).
  6. Don’t overprovision memory or vCPUs on a VM (remember it’s not a physical server).
  7. Use golden templates to deploy VMs.
  8. Keep golden templates updated (patches, agents & middleware).
  9. Don’t oversubscribe network and storage devices.
  10. Avoid running file-level backups or antivirus scans inside the local VM OS.
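To see how close a host is to the 1:1 ratio in items 5 and 6, you can add up what has been allocated to its VMs and divide by what the host physically has. The sketch below is just that arithmetic; the `overcommit_ratio` function, the VM list and the host size (32 cores, 512 GB) are made-up numbers for illustration.

```python
def overcommit_ratio(allocated, physical):
    """Ratio of resources promised to VMs versus what the host physically has."""
    if physical <= 0:
        raise ValueError("physical capacity must be positive")
    return allocated / physical


# Hypothetical host: 32 physical cores and 512 GB RAM.
vm_specs = [  # (vm name, vCPUs, memory in GB)
    ("db-01", 16, 256),
    ("app-01", 8, 128),
    ("web-01", 8, 64),
    ("web-02", 8, 64),
]

total_vcpus = sum(vcpus for _, vcpus, _ in vm_specs)
total_mem_gb = sum(mem for _, _, mem in vm_specs)

cpu_ratio = overcommit_ratio(total_vcpus, 32)
mem_ratio = overcommit_ratio(total_mem_gb, 512)

print(f"vCPU:pCPU = {cpu_ratio:.2f}:1")
print(f"memory    = {mem_ratio:.2f}:1")
```

In this example memory sits exactly at 1:1 while CPU is at 1.25:1, so the host has more vCPUs allocated than physical cores.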
