Virtual Server DR Strategy for Beginners: Failure Domain

Single Failure Domain (All Your Eggs In One Basket)

Are all your eggs in one basket?

Keeping all the business-critical virtual servers in the same datastore, LUN, or storage volume is a common OVERSIGHT for many IT operations teams.

A key DR strategy for beginners is to learn not only to spread virtual servers across separate storage aggregates…

…but also to separate drives (vmdk) within the virtual storage so data and system state are not lost together in the event of a storage failure.

These separate DR environments are called failure domains, and at least two should always exist within each technology stack.

4 Failure Domain Examples for Beginners that could Save you from Disaster:

1. Network Stack

Within the network stack, we normally have redundant routers and switches configured in a mesh topology in the event a router or switch fails.

Depending on the network architecture, the size of the switches and routers, and the amount and type of traffic, the network may suffer performance degradation, but traffic should continue flowing.

This is normally well architected by a network engineer or team – classic networking 101 stuff.

2. Storage Stack

Within the storage stack the same principle applies.

Use dual filer or storage heads with redundant paths from the IP network or Fibre Channel switches (sometimes on dedicated switch hardware).

Then run dual paths from the storage controllers to each ESXi host via NIC, GBIC or HBA.

When configured properly with multi-pathing within the ESXi datastore configurations, storage traffic will continue flowing in the event of a switch or storage controller failure.
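
To make the multi-pathing idea concrete, here is a minimal sketch in Python. It is not a VMware tool or API: the StoragePath record and the adapter, switch and controller names are all made up for illustration, and you would fill them in by hand from your own path inventory. It simply looks for any component that every path to a datastore depends on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoragePath:
    """One path from an ESXi host to a datastore (hypothetical record)."""
    host_adapter: str  # NIC or HBA on the ESXi host
    switch: str        # IP or Fibre Channel switch the path crosses
    controller: str    # storage controller (filer head) the path ends on


def single_points_of_failure(paths: list[StoragePath]) -> list[str]:
    """Return the components that ALL paths share.

    An empty list means no single adapter, switch or controller
    failure would cut off every path to the datastore.
    """
    if not paths:
        return ["no paths configured"]
    shared = []
    for field in ("host_adapter", "switch", "controller"):
        values = {getattr(path, field) for path in paths}
        if len(values) == 1:  # every path uses the same component
            shared.append(f"{field}={values.pop()}")
    return shared


# Hypothetical example: two adapters and two controllers, but one switch.
paths = [
    StoragePath("vmhba1", "fc-sw-a", "ctrl-a"),
    StoragePath("vmhba2", "fc-sw-a", "ctrl-b"),
]
print(single_points_of_failure(paths))
```

In this made-up example both paths cross the same switch, so the switch is reported as the one basket holding all the eggs.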

There are also other considerations for storage redundancy, such as RAID configuration and aggregate sizes and separation.

These items are normally architected pretty well by the storage engineer or team responsible.

The gotcha here is normally COST, but there should always be budget for multiple storage failure domains.

3. ESXi Clusters (HA)

At the ESXi host level, failure domains are generally created by clustering multiple ESXi hosts together in an N+1 configuration to allow for the failure of at least one host.

N+1 means building one additional ESXi host into your capacity and managing the resource thresholds so that there is always enough CPU, memory and storage to fail over all the VMs of one host to the remaining hosts in the same cluster if an ESXi host fails.

This is normally called HA or high availability, but really a cluster is a failure domain.

Also, if budget permits, it can be N+2 or N+3 for more DR buffer space.
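
The N+1 arithmetic can be sketched in a few lines of Python. This is a simplified illustration, not how vSphere HA admission control actually calculates anything: the Host record and the numbers are hypothetical, it only checks CPU and memory (leaving storage out is my simplification), and it assumes the worst case where the largest hosts are the ones that fail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    """Capacity and current VM demand of one ESXi host (hypothetical record)."""
    name: str
    cpu_ghz: float
    mem_gb: float
    cpu_used_ghz: float
    mem_used_gb: float


def tolerates_host_failures(hosts: list[Host], spare_hosts: int = 1) -> bool:
    """True if all VM demand still fits after losing `spare_hosts` hosts.

    Worst case assumed: the hosts with the most capacity fail, and every
    VM in the cluster must run on what is left.
    """
    if spare_hosts >= len(hosts):
        return False
    survivors = len(hosts) - spare_hosts
    cpu_demand = sum(h.cpu_used_ghz for h in hosts)
    mem_demand = sum(h.mem_used_gb for h in hosts)
    # Sorting ascending and keeping the first `survivors` drops the largest.
    cpu_left = sum(sorted(h.cpu_ghz for h in hosts)[:survivors])
    mem_left = sum(sorted(h.mem_gb for h in hosts)[:survivors])
    return cpu_demand <= cpu_left and mem_demand <= mem_left


# Hypothetical cluster: three identical hosts, each a bit over 60% used.
cluster = [Host(f"esx0{i}", 40.0, 256.0, 25.0, 160.0) for i in (1, 2, 3)]
print("N+1:", tolerates_host_failures(cluster, spare_hosts=1))
print("N+2:", tolerates_host_failures(cluster, spare_hosts=2))
```

With these made-up numbers the total demand is 75 GHz and 480 GB. Two surviving hosts offer 80 GHz and 512 GB, so the cluster passes as N+1, but a single survivor with 40 GHz and 256 GB cannot carry the load, so it fails as N+2.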

This is normally covered by the Virtualization Engineer or team and is basic VMware for beginners stuff!

4. VM Failure Domains

Here is where most infrastructure managers get into trouble!

Why?

Because we neglect this most critical part of the DR architecture strategy and leave it up to chance that the admin deploying the VMs is following a best practice or a well-documented build standard.

BTW, most build standards normally come out after a loss due to what I am about to explain.

For example, let’s take a basic SQL server VM deployment.

It contains an OS, data and logs, right?

Unless the admin deploying the SQL VM has a standard or understands the risk, a VM for a basic SQL server may be configured in one of two ways:

VM-A will have C and D drives (2 VMDK files).
VM-B will have C, D, and E drives (3 VMDK files).

On VM-A, the OS is installed on C drive and the data and logs are on D drive. This configuration has a single failure domain, as data and logs are both lost if D drive is somehow deleted.

To improve this case, the OS and data could share C with logs on D, or the OS and logs could share C with data on D. (Better, but something is still missing!)

Now let’s look at VM-B: OS on C, data on D, and logs on E. This not only creates multiple failure domains but also improves performance. (Even better, but something is still missing!)

Most admins would agree VM-B is better than VM-A, but a key MISSING factor, and the ROOT cause of many data losses, is that all the eggs (VMDK files) are sitting on the same storage LUN or aggregate.

Even if each of the 3 drives (VMDKs) is provisioned in a separate datastore within vCenter, they are still at risk of loss if those datastores all sit in a single failure domain at the storage level.

My point for beginners:
It is always critical to architect into your VM standards multiple failure domains for segmenting VMDKs during VM provisioning.
This is the most IMPORTANT VM best practice for database servers since they are at the core of most services.

A third example would be VM-C, where the OS is on C drive (VMDK hosted on one storage aggregate), data on D (VMDK hosted on a second aggregate), and logs on drive E (VMDK hosted on a third aggregate).

If 3 or more storage aggregates are not available then data or logs can share the same storage aggregate with the OS – but NEVER put data and logs on the same storage aggregate.
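
A build standard like this is easy to check with a small script. The sketch below is only an illustration: the datastore and aggregate names are invented, and the datastore-to-aggregate mapping is something I am assuming you can get from your storage team, since separate datastores in vCenter do not by themselves tell you which aggregate they live on.

```python
def aggregates_by_role(
    disks: dict[str, str], datastore_to_aggregate: dict[str, str]
) -> dict[str, str]:
    """Map each disk role ('os', 'data', 'logs') to its storage aggregate."""
    return {role: datastore_to_aggregate[ds] for role, ds in disks.items()}


def data_and_logs_separated(
    disks: dict[str, str], datastore_to_aggregate: dict[str, str]
) -> bool:
    """Apply the one hard rule: data and logs never share an aggregate."""
    aggregates = aggregates_by_role(disks, datastore_to_aggregate)
    return aggregates["data"] != aggregates["logs"]


# Hypothetical inventory: which aggregate backs each datastore.
datastore_to_aggregate = {
    "ds-sql-os": "aggr1",
    "ds-sql-data": "aggr1",
    "ds-sql-logs": "aggr1",
    "ds-sql-data2": "aggr2",
    "ds-sql-logs3": "aggr3",
}

# VM-B style: three datastores, but all backed by the same aggregate.
vm_b = {"os": "ds-sql-os", "data": "ds-sql-data", "logs": "ds-sql-logs"}
# VM-C style: each VMDK on its own aggregate.
vm_c = {"os": "ds-sql-os", "data": "ds-sql-data2", "logs": "ds-sql-logs3"}

print("VM-B ok:", data_and_logs_separated(vm_b, datastore_to_aggregate))
print("VM-C ok:", data_and_logs_separated(vm_c, datastore_to_aggregate))
```

In this made-up inventory VM-B fails the check even though its three VMDKs sit in three different datastores, because all three datastores are carved from aggr1. VM-C passes, and so would a two-aggregate layout where the OS shares an aggregate with either data or logs.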

Conclusion:

Hindsight is always 20/20!

I’ve experienced my share of recovering from storage crashes and VMDK files accidentally being deleted.

Now is the time to act and perform a health check to ensure you have proper DR strategies in place before a DR event happens.

What I’ve covered may add CapEx for the extra drive space that additional failure domains require, but it could save you from a nasty outage.

The fundamental DR strategy take-away for beginners is ensuring multiple failure domains exist for network, storage, host, VM and VMDK before a DR event occurs…

…And one more thing – always have BACKUPS of full system states just in case DR strategies fail.

VMinstall Training