Virtual Server DR Strategy for Beginners: Failure Domain

Are all your eggs in one basket?

Keeping all the business-critical virtual servers in the same datastore, LUN, or storage volume is a common OVERSIGHT for many IT operations teams.

A key DR strategy for beginners is to learn to separate not only virtual servers onto separate storage aggregates…

…but also to separate drives (vmdk) within the virtual storage so data and system state are not lost together in the event of a storage failure.

These separate DR environments are called failure domains and at least two failure domains should always exist within each technology stack.

4 Failure Domain Examples for Beginners that could Save you from Disaster:

1. Network Stack

Within the network stack, we normally have redundant routers and switches configured in a mesh topology in the event a router or switch fails.

Depending on the network architecture and size of the switches, routers and the amount and type of traffic, the network may suffer performance degradation but should continue flowing.

This is normally well architected by a network engineer or team – classic networking 101 stuff.

2. Storage Stack

Within the storage stack the same principle applies.

We use dual filers or storage heads with redundant paths from the IP network or Fibre Channel switches (sometimes on dedicated switch hardware).

Then there are dual paths from the storage controllers to each ESXi host via NIC, GBIC or HBA.

When configured properly with multi-pathing within the ESXi datastore configurations, storage traffic will continue flowing in the event of a switch or storage controller failure.
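A rough way to picture "configured properly" is to check that no single adapter, switch or controller sits in every path to a datastore. The sketch below is plain Python over a hand-written path list. The `Path` record and the device names are hypothetical and are not taken from any VMware API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    """One storage path from an ESXi host to a datastore (hypothetical record)."""
    adapter: str     # host-side NIC or HBA
    switch: str      # IP or Fibre Channel switch
    controller: str  # storage head / controller


def single_points_of_failure(paths):
    """Return the components that every path depends on, keyed by type."""
    spof = {}
    for kind in ("adapter", "switch", "controller"):
        used = {getattr(p, kind) for p in paths}
        # If all paths share one component, losing it cuts off the datastore.
        if len(used) == 1:
            spof[kind] = next(iter(used))
    return spof


# Hypothetical example: two paths that both run through the same switch.
paths = [
    Path(adapter="vmhba1", switch="fc-sw-1", controller="ctrl-a"),
    Path(adapter="vmhba2", switch="fc-sw-1", controller="ctrl-b"),
]
print(single_points_of_failure(paths))  # {'switch': 'fc-sw-1'}
```

Here the adapters and controllers are redundant, but both paths share `fc-sw-1`, so that switch is still a single failure domain.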

There are also other considerations for storage redundancy, such as RAID configuration and aggregate sizes and separation.

These items are normally architected pretty well by the storage engineer or team responsible.

The gotcha here is normally COST, but there should always be budget for multiple storage failure domains.

3. ESXi Clusters (HA)

At the ESXi host level, failure domains are generally created by clustering multiple ESXi hosts together in an N+1 configuration to allow for the failure of at least one host.

N+1 means buffering one additional ESXi host into your capacity and managing the resource thresholds so that there is always enough CPU, memory and storage to fail over all the VMs of one host to the remaining hosts in the same cluster if an ESXi host fails.

This is normally called HA, or high availability, but a cluster is really a failure domain.

Also, if budget permits, it can be N+2 or N+3 for more DR buffer space.
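The N+1 (or N+2) idea can be sanity-checked with simple arithmetic. The sketch below is a simplified illustration that compares total memory in use against the capacity left after the largest hosts fail. The host names and numbers are made up, and it ignores real-world details such as HA admission control and per-VM reservations.

```python
def survives_host_failures(hosts, spare_hosts=1):
    """Check whether all VMs still fit after losing the largest hosts.

    `hosts` maps host name -> (capacity_gb, used_gb) for memory. The same
    check could be repeated for CPU. All names and numbers are illustrative.
    """
    total_used = sum(used for _, used in hosts.values())
    # Worst case: the hosts with the most capacity are the ones that fail.
    capacities = sorted((cap for cap, _ in hosts.values()), reverse=True)
    remaining_capacity = sum(capacities[spare_hosts:])
    return total_used <= remaining_capacity


cluster = {
    "esxi-01": (512, 300),
    "esxi-02": (512, 280),
    "esxi-03": (512, 310),
    "esxi-04": (512, 250),
}
print(survives_host_failures(cluster, spare_hosts=1))  # True: 1140 <= 1536
print(survives_host_failures(cluster, spare_hosts=2))  # False: 1140 > 1024
```

This example cluster is sized for N+1 but not N+2: it can absorb one host failure, while a second failure would leave the VMs short of memory.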

This is normally covered by the Virtualization Engineer or team and is basic VMware for beginners stuff!

4. VM Failure Domains

Here is where most infrastructure managers get into trouble!

We neglect this most critical part of the DR architecture strategy and leave it to chance that the admin deploying the VMs is following a best practice or a well-documented build standard.

BTW, most build standards normally come out after a loss due to what I am about to explain.

For example, let’s take a basic SQL server VM deployment.

It contains an OS, data and logs, right?

Unless the admin deploying the SQL VM has a standard or understands the risk, a VM for a basic SQL server may be configured in one of two ways:

  1. VM-A will have C & D drive (2 VMDK files).
  2. VM-B will have C, D, & E drive (3 VMDK files).

On VM-A, the OS is installed on C drive and the data and logs are on D drive. This configuration has a single failure domain, as data and logs are both lost if D drive is somehow deleted.

To improve this case, the OS and data could share C with logs on D, or the OS and logs could share C with data on D. (Better, but something is still missing!)

Now let’s look at VM-B: OS on C, data on D, and logs on E. This not only creates multiple failure domains but also improves performance. (Even better, but something is still missing!)

Most admins would agree VM-B is better than VM-A, but a key MISSING factor which is at the ROOT cause of many data losses is that all the eggs (VMDK files) are sitting on the same storage LUN or aggregate.

Even if all 3 drives (VMDKs) are provisioned in separate datastores within vCenter, they are still at risk of loss if those datastores sit on the same LUN or aggregate, because that leaves them in a single failure domain at the storage level.
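To make this concrete, here is a minimal sketch that counts the storage aggregates actually backing a VM's VMDKs. The two mappings are a hand-written, hypothetical inventory. In practice you would fill them in from your vCenter and storage array records.

```python
def storage_failure_domains(vmdk_to_datastore, datastore_to_aggregate):
    """Return the set of aggregates that actually back a VM's VMDKs."""
    return {datastore_to_aggregate[ds] for ds in vmdk_to_datastore.values()}


# Hypothetical inventory: three datastores carved from one aggregate.
vm_b = {"C (OS)": "ds-os", "D (data)": "ds-data", "E (logs)": "ds-logs"}
backing = {"ds-os": "aggr1", "ds-data": "aggr1", "ds-logs": "aggr1"}

domains = storage_failure_domains(vm_b, backing)
print(len(domains))  # 1: three datastores, but a single failure domain
```

Three datastores look like separation in vCenter, yet the count of distinct aggregates is what tells you how many storage failure domains the VM really has.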

My point for beginners:

  • It is always critical to architect multiple failure domains into your VM standards so that VMDKs are segmented during VM provisioning.
  • This is the most IMPORTANT VM best practice for database servers since they are at the core of most services.

A third example would be VM-C where the OS is on C drive (VMDK hosted on one storage aggregate), data on D (VMDK hosted on another aggregate), and logs on drive E (VMDK hosted on another aggregate).

If fewer than 3 storage aggregates are available, then data or logs can share a storage aggregate with the OS – but NEVER put data and logs on the same storage aggregate.
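The placement rule above can be sketched as a small helper for a build standard. This is only an illustration: the function name and aggregate names are hypothetical, and with two aggregates it arbitrarily pairs the OS with data (pairing the OS with logs would satisfy the rule just as well).

```python
def plan_vmdk_placement(aggregates):
    """Map OS, data and logs to aggregates, never co-locating data and logs."""
    if len(aggregates) < 2:
        raise ValueError("need at least two aggregates to separate data and logs")
    if len(aggregates) >= 3:
        # Enough aggregates for the VM-C layout: one per VMDK.
        os_aggr, data_aggr, logs_aggr = aggregates[:3]
    else:
        # Only two aggregates: the OS shares with data, logs stay apart.
        data_aggr, logs_aggr = aggregates
        os_aggr = data_aggr
    return {"os": os_aggr, "data": data_aggr, "logs": logs_aggr}


print(plan_vmdk_placement(["aggr1", "aggr2", "aggr3"]))
# {'os': 'aggr1', 'data': 'aggr2', 'logs': 'aggr3'}
print(plan_vmdk_placement(["aggr1", "aggr2"]))
# {'os': 'aggr1', 'data': 'aggr1', 'logs': 'aggr2'}
```

With a single aggregate the helper raises an error instead of returning a plan, since there is no way to keep data and logs apart.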


Hindsight is always 20/20!

I’ve experienced my share of recovering from storage crashes and VMDK files accidentally being deleted.

Now is the time to act and perform a health check to ensure you have proper DR strategies in place before a DR event happens.

What I’ve covered may add Capex to purchase drive space for adding additional failure domains, but it could save you from a nasty outage.

The fundamental DR strategy take-away for beginners is ensuring multiple failure domains exist for network, storage, host, VM and VMDK before a DR event occurs…

…And one more thing – always have BACKUPS of full system states just in case DR strategies fail.
