We’ve all made rookie mistakes before but some errors are painful long after we’ve learned our lesson, ouch…
Warning! These FIVE mistakes will make “Best of Brand” hardware look bad:
- Not testing ESXi host configuration settings (DRS, Storage-DRS, HA).
- Not testing NEW hardware deployments (server, storage, and networking).
- Deploying virtual servers with over-provisioned resources (memory, CPU, and storage).
- Not testing virtual hardware configs (vNIC, vControllers, and vDisks).
- Not properly managing dependent resources and oversubscribing storage and networking.
Do you want to know why it’s so critical for new vSphere admins to avoid these mistakes? One word, Trust!
And here’s why: the most difficult error to recover from is a breakdown in TRUST between the people counting on compute and the admins taking care of vSphere.
Once developers or service owners have been burned a few times by slow or crashing VMs, which by the way translates into outages for their applications, they will stop trusting “you” or “your vSphere” for their compute needs.
Aside from great tools, this mistake alone is probably the #1 reason developers are using AWS or Google cloud services…no more dealing with rookie mistakes!
Let what I just said sink in before I make my next statement…
When virtual servers crash it’s a long and frustrating journey restoring confidence.
Throughout my career as a VMware admin and Ops manager, I’ve had the unfortunate opportunity to deal with all the vSphere problems noted here [and others], which is why I’m sharing this insight.
Are you ready for more?
Rookie Mistakes Uncovered!
What I’m about to share with you comes from years of my own trial and error, and from mistakes made by others whom I’ve had the luxury of managing or cleaning up after as a consultant…no insult intended.
Mistake #1 – Testing ESXi changes in production.
First, let’s look at why you should be extra careful about how ESXi is set up.
ESXi is the core component of every VMware product. Horizon, vSphere, and even VMware’s own cloud all ride on ESXi, which is why planning and testing should always be included in any new environment, hardware roll-out, or change.
For example: If you are migrating from rack servers to blades, even of the same brand, you should test the storage and networking configurations so you know the best and most efficient settings have been used. And don’t just confirm on the virtual end; also work with the network and/or storage teams to ensure the settings are right on the physical switch and storage ports, too. This also goes for converged hardware!
Common mistakes happen far too often!
It’s far too easy to mismatch ports between hosts, switches, and/or storage with the various settings for:
- IP hash
- MTUs (1500 vs 9000)
- VLAN tags
The effects of these mistakes can be painful: performance issues, disconnects, or worse – outages.
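Here’s a rough sketch of how a pre-change check for those three settings might look in Python. The PortConfig fields, the port names, and the sample values are all made up for illustration; in practice you would fill them in from whatever your hosts and switches actually report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortConfig:
    name: str          # hypothetical label, e.g. "esx01 vmnic2"
    mtu: int           # 1500 or 9000
    vlans: frozenset   # VLAN IDs tagged on the port
    ip_hash: bool      # True when the link is teamed with IP hash


def find_mismatches(ports):
    """Compare every port in one path against the first and list the differences."""
    reference, problems = ports[0], []
    for port in ports[1:]:
        if port.mtu != reference.mtu:
            problems.append(
                f"MTU: {reference.name}={reference.mtu}, {port.name}={port.mtu}")
        if port.vlans != reference.vlans:
            diff = sorted(port.vlans ^ reference.vlans)
            problems.append(
                f"VLAN: {port.name} differs from {reference.name} on {diff}")
        if port.ip_hash != reference.ip_hash:
            problems.append(
                f"IP hash: {reference.name}={reference.ip_hash}, "
                f"{port.name}={port.ip_hash}")
    return problems


# Made-up sample data: one host uplink, its switch port, and a second host.
path = [
    PortConfig("esx01 vmnic2", 9000, frozenset({100, 200}), True),
    PortConfig("switch-a port5", 1500, frozenset({100, 200}), True),
    PortConfig("esx02 vmnic2", 9000, frozenset({100}), False),
]

for line in find_mismatches(path):
    print(line)
```

An empty result doesn’t prove the path is healthy, it only means these three settings agree, so you still want a real traffic test before go-live.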
Testing changes in production is a no-no!
Mistake #2 – Too much of a good thing.
Then there’s the rookie mistake of enabling all the cool vSphere features, such as DRS, HA, Storage-DRS, and EVC. I admit, these bells and whistles are all great to have but only as long as they can operate without causing problems.
For example, Storage-DRS and DRS can easily overwhelm a shared network or storage I/O during peak business hours when CPU, memory, or I/O utilization is high. When this happens, multitudes of VMs can be queued in vCenter waiting to be shifted around from host-to-host or datastore-to-datastore, which, unfortunately, can cause major performance problems and even storage and network disconnects. Not to mention, outages!
Recommendation #1 (Covers mistake 1 & 2)
Always test and validate any ESXi change properly before applying new settings or adding new hardware to any of your environments: test, dev, QA or production. In my experience, all environments should be handled with the same best practices because the impact to users or customers is similar – slowness and outages.
Mistake #3 – Having it your way too often.
For the next rookie mistakes, we’ll cover common VM or virtual server danger-zones. These perils happen all the time, even with experienced admins who know better, which is why you should take extra care and avoid them. Let’s drill deeper into this topic.
At the top of my VM bad practice list is waste a.k.a. SPRAWL!
Yes, I am referring to over-allocating or over-provisioning valuable resources, which, unfortunately, once they are issued are nearly impossible to reclaim. This not only includes individual system resources but also refers to deploying too many VMs.
Sprawl happens with the best of admins.
Do you know why you need a plan to stay ahead of sprawl?
Because it doesn’t matter if you can prove the waste with graphs and reports from an Ops tool. Developers and/or service owners don’t want to make changes to memory, CPUs, or storage once their application, website, or database server is active in production.
And another important tip to remember is this!
If you do happen to reclaim excess resources after the VM is active in Prod, you take the risk of getting blamed the minute something goes wrong anywhere near the application you took memory back from. Even if the memory wasn’t the cause, most likely it will be re-added.
While new VMs are still pre-production you can normally make changes to storage, networking, and even memory or CPU, without getting anyone too excited. But tread lightly because this can only happen as long as clear expectations have been set that more resources can be added later, if required, after load testing.
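To make that pre-production window count, I find it helps to compare what was allocated against what load testing actually used. Below is a minimal sketch of that comparison. The VM names, the numbers, and the 25% headroom factor are my own assumptions for illustration, not measurements or a VMware recommendation.

```python
import math


def rightsizing_candidates(vms, headroom=1.25):
    """Return (name, allocated_gb, suggested_gb) for VMs whose memory
    allocation exceeds load-test peak usage plus the chosen headroom."""
    results = []
    for vm in vms:
        suggested = math.ceil(vm["peak_mem_gb"] * headroom)
        if suggested < vm["alloc_mem_gb"]:
            results.append((vm["name"], vm["alloc_mem_gb"], suggested))
    return results


# Hypothetical load-test results for three pre-production VMs.
vms = [
    {"name": "web01", "alloc_mem_gb": 16, "peak_mem_gb": 3.1},
    {"name": "db01", "alloc_mem_gb": 32, "peak_mem_gb": 27.5},
    {"name": "app01", "alloc_mem_gb": 8, "peak_mem_gb": 6.0},
]

for name, allocated, suggested in rightsizing_candidates(vms):
    print(f"{name}: allocated {allocated} GB, load test suggests {suggested} GB")
```

Only web01 gets flagged here. The other two are already within the headroom, which is exactly the conversation you want to have with the service owner before go-live, not after.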
A good VMware admin knows when and how to build big VMs!
Next, we’ll cover the hazards of using cookie cutter templates for everything.
Mistake #4 – No, Golden Templates don’t work for everything!
I know when VMware first came out we went crazy creating golden templates. Heck, I’ve even written posts on how to create a golden template. But now I am asking you to carefully tailor VMs for the use case.
This doesn’t mean you can’t still use a baseline template to deploy a Windows or Linux VM, but it does mean that afterward you should tune the VM with best practices before it goes live in your environment.
App and web servers should be configured differently from a database VM.
For example: When deploying an app or web server VM you may only want to use 1 CPU and 2 GB of memory until you know it needs more. And for a database VM, you may want to follow best practices for the type of database [MSSQL, MySQL or Oracle] running on the VM (or VMs if it will be a cluster). Depending on the use-case there may be tweaks for virtual hardware settings, such as: vNIC types, splitting disks across multiple vControllers, or adding various types of storage such as RDM, iSCSI or even using NFS mounts via a UNC path.
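One way to picture “baseline template plus tuning” is a small lookup of per-workload overrides layered on top of the baseline. This is only a sketch: the 1 CPU and 2 GB baseline comes from the example above, but the database numbers are placeholders I picked for illustration (follow your database vendor’s guidance instead), and vmxnet3 is used only as an example vNIC type.

```python
# Baseline every VM starts from (matches the small app/web example above).
BASELINE = {"vcpus": 1, "memory_gb": 2, "vnic": "vmxnet3", "scsi_controllers": 1}

# Per-workload overrides. Database values are placeholders, not recommendations.
OVERRIDES = {
    "web": {},
    "app": {},
    "database": {"vcpus": 4, "memory_gb": 16, "scsi_controllers": 3},
}


def build_spec(workload):
    """Return the baseline spec with the workload's overrides applied."""
    if workload not in OVERRIDES:
        raise ValueError(f"no standard defined for workload type: {workload}")
    return {**BASELINE, **OVERRIDES[workload]}


for workload in ("web", "database"):
    print(workload, build_spec(workload))
```

Raising an error for an unknown workload type is deliberate. It forces a conversation about the use case instead of silently handing out the cookie cutter default.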
Most VMware admins already know all VMs should not be created the same but it’s easy for rookie mistakes to happen when a project is behind schedule and a technical project manager is riding you to hurry up!
Always test, test, and test. Then set baseline standards you can repeat and tweak as required.
Mistake #5 – There is unlimited bandwidth and storage.
With more IT organizations converging technology and staff resources, over-subscription is not as common as it used to be. But let’s be real. We know everyone hasn’t converted to a UCS converged environment so this rookie mistake is still happening.
Never ending loop…
Years ago, VMware admins, who were once sysadmins, started building VMware servers without really considering the fall-out of overloading networks and storage with VMs. The assumption was if we need more storage and bandwidth, we’ll just request it and it will appear.
The effects of overloading were briefly solved with 10G networks and deduping/thin-provisioning storage. But then came more workloads running on more virtual servers, and shared 10G networks and storage were once again oversubscribed and slow.
Fix it, break it, fix it again, break it again…
Once again another change is going on: dedicated hardware is now being deployed to separate data, storage, and maintenance network traffic, and dedicated (all flash) storage is being purchased to isolate virtual storage needs from non-virtual needs. But how long will this last?
There’s no simple solution for this rookie mistake because we are now dealing with people, ownership, and emotions. My recommendation here is better communication and collaboration between teams who share interlocked responsibilities for furnishing compute resources to meet business needs. Sysadmins can’t just build a vSphere anymore without buy-in and involvement from other teams!
Bonus Tip (Build for Failure)
My Secret vSphere Server Hardware Specs
Since 2007 I’ve been tweaking my secret recipe for the perfect “Build for Failure” vSphere hardware specs. And in this “Bonus Tip” I am going to reveal it to my vBeginners audience!
3 reasons why this simple strategy works…
- Tunes server specs for balancing VM performance and density
- Builds for failure to occur across individual failure zones
- Maximizes ROI on VMware licenses and infrastructure resources
Try it and you’ll get:
- Great virtual server performance
- Balanced operational ease
- ESXi and VM HA and scalability
- Contained failure with the least amount of impact
But first, let’s deal with the myth of running 100’s of VMs per ESXi host. Yes, it can be done but it will become an operational nightmare for support people who deal with managing hardware maintenance, patching, and firmware updates. This is the stuff most vendors leave out because they have never had to deal with supporting overloaded ESXi hosts nor have they had to explain to groups of angry people why their app is slow or has crashed.
Best Server Hardware Config For vSphere
- 1U pizza box (2U if you are using local disks for VSA or VSAN)
- Dual socket motherboard
- Dual CPU (2.0 GHz or greater)
- 6 or more cores per CPU
- 192 GB of memory
- 2.0 TB of disk space per host (SAN or local disk)
- 15 – 20 ms disk latency
- 2 – 4 VMs per CPU core
- 25 – 30 virtual servers per host
At full load (25 VMs), this ESXi server configuration will run at 40-50% CPU with 70-80% memory utilization. Yes, CPU is slightly low and adding more memory will allow a higher VM density. But then the failure zone increases and a host loss could pose a high risk. This resource threshold also gives n+1 for HA, where a host can fail and the other hosts in the same fault zone have the capacity to absorb the VMs when they are restarted by vSphere after the HA event occurs.
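If you want to sanity-check the n+1 claim for your own cluster, the arithmetic is simple enough to script. The sketch below makes two simplifying assumptions of mine: memory is the limiting resource, and the failed host’s VMs spread evenly across the survivors. It also ignores HA admission control settings and hypervisor overhead.

```python
def min_hosts_for_n_plus_1(utilization_pct):
    """Smallest cluster size where the surviving hosts can absorb one
    failed host, given per-host memory utilization as a whole percent."""
    if not 0 < utilization_pct < 100:
        raise ValueError("utilization must be between 1 and 99 percent")
    # n hosts at u% carry n*u of load; after one failure the remaining
    # n-1 hosts must hold it: n*u <= (n-1)*100, so n >= 100 / (100 - u).
    # Integer ceiling division avoids floating point rounding surprises.
    return -(-100 // (100 - utilization_pct))


def utilization_after_failure(hosts, utilization_pct):
    """Per-host utilization (percent) once one host's load is redistributed."""
    return utilization_pct * hosts / (hosts - 1)


cores = 2 * 6  # dual socket, 6 cores per CPU, per the spec list above
print(f"VMs per core at 25 VMs: {25 / cores:.2f}")

for pct in (70, 75, 80):
    print(f"{pct}% memory per host -> at least {min_hosts_for_n_plus_1(pct)} hosts")

print(f"5 hosts at 75%: {utilization_after_failure(5, 75)}% after one failure")
```

Under those assumptions, the 70-80% memory range only gives you n+1 once the fault zone has four or five hosts, and the surviving hosts run very hot after a failure, so I would treat these numbers as a floor rather than a target.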
Are there other hardware options? Yes, there are – too many to compare.
But these ESXi server specs have worked perfectly for me more than once! Also, if you’re looking for hardware or virtual software options please see my other posts on white box server options and virtual server software alternatives.
What I’ve shared in this post isn’t just for rookies. As a matter-of-fact, anyone managing a VMware vSphere or team can use this advice and get immediate results if they:
- Test everything before it goes into production (treat all working environments like they are production).
- Do not overallocate resources, or you will not be able to reclaim them without causing disruption.
- Take time to develop baseline standards for building ESXi hosts and VMs that fit your various workload types such as app, web and database servers.
- Work closely with other teams to ensure you are all on the same page when configuring storage, networking and server equipment.
I know this is all common sense and most VMware admins with experience already have their own way of managing their vSpheres. But I’m not writing this for them. I am writing this for the new admin. The Rookie who doesn’t know any better. The guy or gal who will overload ESXi to the gills and then wonder why people are complaining.
It’s a long and hard way back once ESXi and VMs start crashing and people lose confidence in you and your skills to fix the problems. You don’t want to be in the shoes of the person who has to explain why “Best of Brand” hardware and software are failing.
To get updated vSphere training we recommend reading The Best VMware Training For Beginners.
Good luck taking action today to avoid these painful rookie mistakes with proper planning, testing, and collaboration…cheers!