How To Avoid 5 Painful ESXi & Virtual Server Rookie Mistakes!


We’ve all made rookie mistakes before, but some errors stay painful long after we’ve learned our lesson. Ouch…

Warning! These FIVE mistakes will make “Best of Brand” hardware look bad:

  1. Not testing ESXi host configuration settings (DRS, Storage-DRS, HA).
  2. Not testing NEW hardware deployments (server, storage, and networking).
  3. Deploying virtual servers with over-provisioned resources (memory, CPU, and storage).
  4. Not testing virtual hardware configs (vNIC, vControllers, and vDisks).
  5. Not properly managing dependent resources and oversubscribing storage and networking.

Do you want to know why it’s so critical for new vSphere admins to avoid these mistakes? One word, Trust!

And here’s why: the most difficult error to recover from is a breakdown in TRUST between the people counting on compute and the admins taking care of vSphere.

Once developers or service owners have been burned a few times by slow or crashing VMs, which by the way translate into outages for their applications, they will stop trusting “you” or “your vSphere” for their compute needs.

Aside from great tools, this mistake alone is probably the #1 reason developers are using AWS or Google cloud services…no more dealing with rookie mistakes!

Let what I just said sink in before I make my next statement…

When virtual servers crash it’s a long and frustrating journey restoring confidence.

Throughout my career as a VMware admin and Ops manager, I’ve had the unfortunate opportunity to deal with all the vSphere problems noted here [and others], which is why I’m sharing this insight.

Are you ready for more?

Rookie Mistakes Uncovered!

What I’m about to share with you comes from years of my own trial and error, and from mistakes made by others whom I’ve had the luxury of managing or cleaning up after as a consultant…no insult intended.

Mistake #1 – Testing ESXi changes in production.

First, let’s look at why you should be extra careful about how ESXi is set up.

ESXi is the core component of every VMware product. Horizon, vSphere, and even VMware’s own cloud all ride on ESXi, which is why planning and testing should always be included in any new environment, hardware roll-out, or change.

For example: If you are migrating from rack servers to blades, even of the same brand, you should test the storage and networking configurations so you know the best and most efficient settings have been used. And don’t just confirm on the virtual end. Also work with the network and/or storage teams to ensure the settings are right on the physical switch and storage ports, too. This also goes for converged hardware!

Common mistakes happen far too often!

It’s far too easy to mismatch ports between hosts, switches, and/or storage with the various settings for:

  • IP hash
  • Port-channeling
  • MTUs (1500 vs 9000)
  • Active/Passive
  • Multi-Path
  • VLAN tags

The effects of these mistakes can be painful: performance issues, disconnects, or worse, outages.
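A mismatch check like the one described above can be sketched in a few lines of Python. This is only an illustration: the `HostUplink` and `SwitchPort` classes, the port names, and the values are hypothetical, and the data would have to be collected by hand or exported from your own tools. It also assumes the common rule that IP-hash teaming on the host needs a port channel on the physical switch, while active/passive teaming does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostUplink:
    """Intended settings on the ESXi side of a link (hypothetical model)."""
    name: str
    mtu: int
    vlans: frozenset
    teaming: str  # "ip-hash" or "active-passive"


@dataclass(frozen=True)
class SwitchPort:
    """Settings reported by the network team for the matching switch port."""
    name: str
    mtu: int
    allowed_vlans: frozenset
    port_channel: bool


def find_mismatches(host, switch):
    """Return human-readable descriptions of settings that disagree."""
    problems = []
    if host.mtu != switch.mtu:
        problems.append(f"MTU: {host.name}={host.mtu}, {switch.name}={switch.mtu}")
    missing = sorted(host.vlans - switch.allowed_vlans)
    if missing:
        problems.append(f"VLANs not allowed on {switch.name}: {missing}")
    # Assumption: IP-hash teaming needs a port channel; other teaming must not use one.
    needs_channel = host.teaming == "ip-hash"
    if needs_channel != switch.port_channel:
        problems.append(
            f"Teaming '{host.teaming}' on {host.name} does not match "
            f"port_channel={switch.port_channel} on {switch.name}"
        )
    return problems


# Illustrative values only.
host = HostUplink("esx01-vmnic2", mtu=9000, vlans=frozenset({10, 20}), teaming="ip-hash")
switch = SwitchPort("sw1-eth7", mtu=1500, allowed_vlans=frozenset({10}), port_channel=False)
for problem in find_mismatches(host, switch):
    print(problem)
```

Running a comparison like this in a test environment, with the network team supplying the switch side, catches the mismatch before users do.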

Testing changes in production is a no-no!

Mistake #2 – Too much of a good thing.

Then there’s the rookie mistake of enabling all the cool vSphere features, such as DRS, HA, Storage-DRS, and EVC.  I admit, these whistles and bells are all great to have but only as long as they can operate without causing problems.

For example, Storage-DRS and DRS can easily overwhelm a shared network or storage I/O during peak business hours when CPU, memory, or I/O utilization is high. When this happens, multitudes of VMs can be queued in vCenter waiting to be shifted from host to host or datastore to datastore, which, unfortunately, can cause major performance problems and even storage and network disconnects. Not to mention outages!

Recommendation #1 (Covers mistake 1 & 2)

Always test and validate any ESXi change properly before applying new settings or adding new hardware to any of your environments: test, dev, QA or production. In my experience, all environments should be handled with the same best practices because the impact to users or customers is similar: slowness and outages.

Mistake #3 – Having it your way too often.

For the next rookie mistakes, we’ll cover common VM or virtual server danger-zones. These perils happen all the time, even with experienced admins who know better, which is why you should take extra care and avoid them. Let’s drill deeper into this topic.

At the top of my VM bad practice list is waste a.k.a. SPRAWL!

Yes, I am referring to over-allocating or over-provisioning valuable resources, which, unfortunately, once they are issued are nearly impossible to reclaim. This not only includes individual system resources but also refers to deploying too many VMs.

Sprawl happens with the best of admins.

Do you know why you need a plan to stay ahead of sprawl?

Because it doesn’t matter if you can prove the waste with graphs and reports from an Ops tool. Developers and/or service owners don’t want to make changes to memory, CPUs, or storage once their application, website, or database server is active in production.

And another important tip to remember is this!

If you do happen to reclaim excess resources after the VM is active in Prod, you take the risk of getting blamed the minute something goes wrong anywhere near the applications you took memory back from. Even if the memory wasn’t the cause, most likely it will be re-added.

Recommendation #2

While new VMs are still pre-production you can normally make changes to storage, networking, and even memory or CPU, without getting anyone too excited. But tread lightly because this can only happen as long as clear expectations have been set that more resources can be added later, if required, after load testing.

A good VMware admin knows when and how to build big VMs!

Next, we’ll cover the hazards of using cookie cutter templates for everything.

Mistake #4 – No, Golden Templates don’t work for everything!

I know when VMware first came out we went crazy creating golden templates. Heck, I’ve even written posts on how to create a golden template. But now I am asking you to carefully tailor VMs for the use case.

This doesn’t mean you still can’t use a baseline template to deploy a Windows or Linux VM but it does mean afterward you should tune the VM with best practices before it goes live in your environment.

App and web servers should be configured differently from a database VM.

For example: When deploying an app or web server VM you may only want to use 1 CPU and 2 GB of memory until you know it needs more. And for a database VM, you may want to follow best practices for the type of database [MSSQL, MySQL or Oracle] running on the VM (or VMs if it will be a cluster). Depending on the use-case there may be tweaks for virtual hardware settings, such as: vNIC types, splitting disks across multiple vControllers, or adding various types of storage such as RDM, iSCSI or even using NFS mounts via a UNC path.
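The idea of a small baseline that gets tuned per use case can be sketched as a lookup table plus overrides. This is a minimal Python illustration, not a vSphere API: `BASELINES`, `build_vm_spec`, and the override keys are hypothetical names. Only the app and web starting point (1 CPU, 2 GB) comes from the text above. Database sizing is deliberately left without a default, because it should come from the best practices for the specific database.

```python
# Starting points taken from the text above. Database sizing is left out on
# purpose: it depends on the database vendor's best practices.
BASELINES = {
    "app": {"vcpus": 1, "memory_gb": 2},
    "web": {"vcpus": 1, "memory_gb": 2},
}

REQUIRED_KEYS = ("vcpus", "memory_gb")


def build_vm_spec(role, **overrides):
    """Merge the baseline for a role with explicit, per-VM overrides."""
    spec = dict(BASELINES.get(role, {}))  # copy so the baseline is never mutated
    spec.update(overrides)
    missing = [key for key in REQUIRED_KEYS if key not in spec]
    if missing:
        raise ValueError(f"role '{role}' has no baseline; specify {missing}")
    spec["role"] = role
    return spec


print(build_vm_spec("web"))
# Illustrative numbers only; use the values your database guidance calls for.
print(build_vm_spec("database", vcpus=4, memory_gb=32, scsi_controllers=2))
```

Forcing database VMs to be sized explicitly keeps the golden template from quietly becoming the database standard.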

Most VMware admins already know that not all VMs should be created the same, but it’s easy for rookie mistakes to happen when a project is behind schedule and a technical project manager is riding you to hurry up!

Recommendation #3

Always test, test, and test. Then set baseline standards you can repeat, and tweak them as required.

Mistake #5 – There is unlimited bandwidth and storage.

With more IT organizations converging technology and staff resources, over-subscription is not as common as it used to be. But let’s be real. We know everyone hasn’t converted to a UCS converged environment so this rookie mistake is still happening.

Never ending loop…

Years ago, VMware admins, who were once sysadmins, started building VMware servers without really considering the fall-out of overloading networks and storage with VMs. The assumption was that if we need more storage and bandwidth, we’ll just request it and it will appear.

The effects of overloading were briefly solved with 10G networks and deduping/thin-provisioned storage. But then came more workloads running on more virtual servers, and the shared 10G networks and storage were oversubscribed and slow once again.
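Oversubscription is easy to put a number on, which helps when talking to the storage and network teams. The Python sketch below is a simple illustration: the function name and every figure in it are made up for the example, not measurements from a real environment.

```python
def oversubscription_ratio(committed, capacity):
    """Ratio of what has been promised to what physically exists."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return committed / capacity


# Illustrative numbers only.
# Storage: 40 TB thin-provisioned to VMs on 20 TB of usable disk.
storage = oversubscription_ratio(committed=40.0, capacity=20.0)
# Network: 30 VMs that can each peak at 0.5 Gbit/s sharing one 10 Gbit/s uplink.
network = oversubscription_ratio(committed=30 * 0.5, capacity=10.0)

print(f"storage {storage:.1f}:1, network {network:.1f}:1")
```

Anything above 1:1 only works as long as not everyone asks for their share at the same time, which is exactly what happens during peak hours.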

Fix it, break it, fix it again, break it again…

Once again another change is going on: dedicated hardware is now being deployed to separate data, storage, and maintenance network traffic, and dedicated (all flash) storage is being purchased to isolate virtual storage needs from non-virtual needs. But how long will this last?

Recommendation #4

There’s no simple solution for this rookie mistake because we are now dealing with people, ownership, and emotions. My recommendation here is better communication and collaboration between teams who share interlocked responsibilities for furnishing compute resources to meet business needs. Sysadmins can’t just build a vSphere anymore without buy-in and involvement from other teams!

Bonus Tip (Build for Failure)

My Secret vSphere Server Hardware Specs

Since 2007 I’ve been tweaking my secret recipe for the perfect “Build for Failure” vSphere hardware specs. And in this “Bonus Tip” I am going to reveal it to my vBeginners audience!

3 reasons why this simple strategy works…

  1. Tunes server specs for balancing VM performance and density
  2. Builds for failure to occur across individual failure zones
  3. Maximizes ROI on VMware licenses and infrastructure resources

Try it and you’ll get:

  • Great virtual server performance
  • Balanced operational ease
  • ESXi and VM HA and scalability
  • Contained failure with the least amount of impact

Myth Buster!

But first, let’s deal with the myth of running 100’s of VMs per ESXi host. Yes, it can be done but it will become an operational nightmare for support people who deal with managing hardware maintenance, patching, and firmware updates. This is the stuff most vendors leave out because they have never had to deal with supporting overloaded ESXi hosts nor have they had to explain to groups of angry people why their app is slow or has crashed.

Best Server Hardware Config For vSphere

  • 1U pizza box (2U if you are using local disks for VSA or VSAN)
  • Dual socket motherboard
  • Dual CPU (2.0 GHz or greater)
  • 6 or more cores per CPU
  • 192 GB of memory
  • 2.0 TB of disk space per host (SAN or local disk)
  • 15 – 20 ms disk latency
  • 2 – 4 VMs per CPU core
  • 25 – 30 virtual servers per host

At full load (25 VMs), this ESXi server configuration will run at 40-50% CPU with 70-80% memory utilization. Yes, CPU is slightly low and adding more memory will allow a higher VM density. But then the failure zone increases and a host loss could pose a high risk. This resource threshold also gives n+1 for HA, where a host can fail and the other hosts in the same fault zone have the capacity to absorb the VMs when they are restarted by vSphere after the HA event occurs.
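The n+1 claim above can be sanity-checked with simple arithmetic. The Python sketch below assumes identical hosts, load spread evenly across them, and restarted VMs that need the same memory as before. It ignores HA admission control settings and memory overhead. The cluster sizes (4 and 5 hosts) are assumptions for the example, since the right size depends on your fault zone.

```python
def utilization_after_failure(per_host_utilization, hosts, failed=1):
    """Utilization on each surviving host once the failed hosts' VMs restart."""
    survivors = hosts - failed
    if survivors <= 0:
        raise ValueError("at least one host must survive")
    return per_host_utilization * hosts / survivors


def vms_per_core(vms, sockets, cores_per_socket):
    """VM density per physical core."""
    return vms / (sockets * cores_per_socket)


# 75% is the midpoint of the 70-80% memory utilization quoted above.
for hosts in (4, 5):
    after = utilization_after_failure(0.75, hosts)
    print(f"{hosts} hosts at 75% memory: {after:.2%} per host after one failure")

# 25 VMs on a dual-socket host with 6 cores per CPU.
print(f"{vms_per_core(25, sockets=2, cores_per_socket=6):.2f} VMs per core")
```

A result above 100% means the survivors cannot hold every VM without overcommitting memory, so the utilization target and the number of hosts in the fault zone have to be planned together.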

Are there other hardware options? Yes, there are, too many to compare.

But these ESXi server specs have worked perfectly for me more than once! Also, if you’re looking for hardware or virtual software options please see my other posts on white box server options and virtual server software alternatives.

Immediate Results!

What I’ve shared in this post isn’t just for rookies. As a matter of fact, anyone managing a VMware vSphere environment or team can use this advice and get immediate results if they:

  1. Test everything before it goes into production (treat all working environments like they are production).
  2. Do not overallocate resources, or you will not be able to reclaim them without causing disruption.
  3. Take time to develop baseline standards for building ESXi hosts and VMs that fit your various workload types, such as app, web and database servers.
  4. Work closely with other teams to ensure you are all on the same page when configuring storage, networking and server equipment.

I know this is all common sense and most VMware admins with experience already have their own way of managing their vSpheres. But I’m not writing this for them. I am writing this for the new admin. The Rookie who doesn’t know any better. The guy or gal who will overload ESXi to the gills and then wonder why people are complaining.

It’s a long and hard way back once ESXi and VMs start crashing and people lose confidence in you and your skills to fix the problems. You don’t want to be in the shoes of the person who has to explain why “Best of Brand” hardware and software are failing.

To get updated vSphere training we recommend reading The Best VMware Training For Beginners.

Good luck taking action today to avoid these painful rookie mistakes with proper planning, testing, and collaboration…cheers!
