Have Zombies and Bullies Overrun Your Cloud Capacity?

A Lesson on Cloud Resource Management

Isn’t it interesting how technology mimics life (and, in this case, the afterlife)?

If you’re like me, you probably get 10 invites or spam emails a week about new storage products or some new software that will help you manage your cloud better. Obviously marketers know we have cloud capacity and operations problems, or they wouldn’t be sending us these messages so often.

Managing a cloud isn’t easy, even with tools, and the larger the cloud is, the more likely it is that zombies and bullies will become a problem.

Are you wondering what I’m talking about? Well, hang in there and I’ll explain.

First, let’s discuss what a zombie is.

In the movies, zombies are normally flesh-eating dead people that have no purpose except to feed on you. The zombies I’m talking about are virtual machines (servers or desktops) that are left running “unused” in your cloud, feeding on cloud resources.

For example:

Maybe once upon a time there was a project that required building a bunch of servers for a new product, and 50 VMs were spun up, each with 2 vCPUs, 4 GB of memory, and 40 GB of disk space. All of the VMs were joined to the network, DNS names were reserved, and firewall rules and a VIP were created.

Then, as things normally happen in IT, the plan changed or the project was cancelled and everyone went in new directions. Unfortunately, nobody on the infrastructure team was asked to decommission these project VMs, so today (5 weeks, 5 months, or a year later) they are still running and using up valuable resources: 100 vCPUs, 200 GB of memory, 2 TB of storage, and 50+ IP addresses.
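To make the arithmetic concrete, here is a minimal sketch that totals the footprint of the example project. The `VmSpec` class and `fleet_footprint` function are names made up for this illustration, and the one-IP-per-VM count is an assumption (the real number is higher once the VIP and any extra interfaces are counted).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmSpec:
    """Size of one VM in the example project (hypothetical helper type)."""
    vcpus: int
    memory_gb: int
    disk_gb: int


def fleet_footprint(spec: VmSpec, count: int) -> dict:
    """Multiply a single VM's size by the number of identical VMs."""
    return {
        "vcpus": spec.vcpus * count,
        "memory_gb": spec.memory_gb * count,
        "disk_gb": spec.disk_gb * count,
        "ip_addresses": count,  # assumption: at least one IP per VM
    }


# 50 VMs, each with 2 vCPUs, 4 GB of memory, and 40 GB of disk
footprint = fleet_footprint(VmSpec(vcpus=2, memory_gb=4, disk_gb=40), count=50)
print(footprint)
```

At 2 vCPUs each, the 50 VMs hold 100 vCPUs, 200 GB of memory, and 2,000 GB (2 TB) of disk.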

In the example I just gave, the VMs were left running by accident. Another example is when they are left intentionally, just in case someone decides to restart the project somewhere in the future. Really! Two years have passed and you have 50 VMs that are powered off in your vCenter or OpenStack that you keep around just in case. Even powered off, they still hold on to their disk space.

These are Zombies feeding on your cloud!

Do you have “Zombie VMs” eating up expensive system resources and producing no value? Hunt them down and get rid of them. Left unchecked this will happen over and over until a large percentage of your cloud is wasted.
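One way to start the hunt is to scan your utilization data for VMs that have done almost nothing for a long time. The sketch below is only an illustration: the `VmUsage` record, its field names, and the thresholds are all assumptions, and in practice the numbers would come from an export out of whatever monitoring tool you use. Treat the result as a list of candidates to confirm with the owners, not as a delete list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmUsage:
    """One row of a hypothetical utilization export."""
    name: str
    powered_on: bool
    avg_cpu_pct: float      # average CPU utilization over the window
    avg_net_kbps: float     # average network throughput over the window
    days_since_change: int  # days since the last power or config change


def find_zombie_candidates(vms, cpu_pct_max=2.0, net_kbps_max=5.0,
                           min_idle_days=30):
    """Return names of VMs that look unused (thresholds are assumptions)."""
    candidates = []
    for vm in vms:
        if vm.days_since_change < min_idle_days:
            continue  # too recently touched to judge
        idle_but_running = (vm.powered_on
                            and vm.avg_cpu_pct <= cpu_pct_max
                            and vm.avg_net_kbps <= net_kbps_max)
        parked = not vm.powered_on  # powered off and kept "just in case"
        if idle_but_running or parked:
            candidates.append(vm.name)
    return candidates


inventory = [
    VmUsage("proj-web-01", True, 0.8, 1.2, 150),
    VmUsage("proj-db-01", False, 0.0, 0.0, 730),
    VmUsage("billing-app", True, 35.0, 800.0, 400),
    VmUsage("new-build", True, 0.5, 0.5, 3),
]
zombies = find_zombie_candidates(inventory)
print(zombies)
```

Here the idle web server and the long-parked database VM are flagged, while the busy billing VM and the three-day-old build are left alone.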

BTW, Amazon, Azure and Rackspace love charging you monthly for Zombies!

What are Bullies? You guessed it, VMs.

Bully VMs are very different from Zombie VMs, and most of the time they actually have a purpose other than to eat your flesh.

Here’s what we commonly know about bullies.

  • They generally pick on the smaller guy.
  • They take away lunch money.
  • And they make fun of people.

Well, in the cloud, bullies tend to do the same stuff but in a slightly different way. They take away all the system resources from smaller VMs sharing the same host, network, and storage.

For example:

Imagine you have an MSSQL DB VM sharing a host, storage, and network with other App VMs. The majority of the time everything is fine, but every “month end” this DB VM becomes a Raging Hulk and starts hammering the shared resources: IOPS, CPU, and bandwidth.

The person on the front end running month-end reports isn’t aware this is happening, but other departments are feeling it because all of a sudden their application is sluggish, takes much longer to load, or crashes.

There’s a Bully in your cloud!

In a public cloud, bullies are less of a problem (per se) because a provider such as AWS generally absorbs the additional demand so that other VMs are not starved of resources. But in a private cloud this can go on for months before anyone realizes what is causing the slowness, especially if multiple bullies are active at the same time.
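A simple way to spot a bully sooner is to look at how much of a busy host's load a single VM is responsible for. The sketch below assumes you can export per-VM IOPS samples for one time interval as `(host, vm, iops)` tuples. The function name, the 60% share threshold, and the 1,000 IOPS "busy host" floor are all assumptions chosen for illustration, and the same idea applies to CPU or bandwidth.

```python
from collections import defaultdict


def find_bullies(samples, share_threshold=0.6, min_host_iops=1000):
    """Flag VMs that take a dominant share of IOPS on a busy host.

    samples: list of (host, vm, iops) tuples for a single interval.
    Returns a list of (host, vm, share) tuples.
    """
    # First pass: total IOPS per host
    host_totals = defaultdict(float)
    for host, _vm, iops in samples:
        host_totals[host] += iops

    # Second pass: compare each VM against its host's total
    bullies = []
    for host, vm, iops in samples:
        total = host_totals[host]
        if total < min_host_iops:
            continue  # host is quiet, nobody is being starved
        share = iops / total
        if share >= share_threshold:
            bullies.append((host, vm, round(share, 2)))
    return bullies


month_end = [
    ("esx01", "sql-db", 9000),
    ("esx01", "app-1", 500),
    ("esx01", "app-2", 500),
    ("esx02", "app-3", 300),
    ("esx02", "app-4", 200),
]
bullies = find_bullies(month_end)
print(bullies)
```

In this made-up month-end sample, the database VM accounts for 90% of the IOPS on its host and gets flagged, while the lightly loaded second host is ignored even though one VM there has a 60% share.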

Another common cloud capacity management problem is VM sprawl and you can find out more about it in my health check lesson.

Wrap up.

We covered zombie VMs and bully VMs. Both of these types of virtual machines impact the capacity and health of a cloud. And both need to be hunted down and dealt with before they cause untimely downtime and impact your server or application availability SLA.

A well-documented decommissioning process that people actually follow is the best way to minimize zombies, and a capacity strategy for managing workloads will help isolate bullies. On VMware, DRS (Distributed Resource Scheduler) and SDRS (Storage DRS) also help.

Thanks for your time and interest!
