Uptime for Beginners (99.999% Uptime Equals 0.001% Downtime)

Intro to Uptime Monitoring

Do you know which is more important: server uptime or service uptime?

Hang in there while I break this question down into easy-to-understand pieces…

Let’s get going

Whether your IT department is following ITIL, DevOps, or legacy IT practices for managing IT service delivery, uptime and downtime are the metrics you need to understand.

Why?

Because it’s easy to get caught up in the numbers game and start focusing on 9’s instead of what’s really important.

The real measurement is the impact to the customer or user who can’t log into their application, or can’t work because it’s so darn slow they are ready to lose their mind.

What’s on the other side of uptime is downtime (.00001 of the time at five nines), which translates into lost minutes, hours, days, and dollars.
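To put numbers on those nines, here is a minimal Python sketch of the arithmetic. The function name and the 365-day year are my own assumptions for illustration, not part of any monitoring tool.

```python
# Downtime permitted by a given uptime percentage.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def allowed_downtime_minutes(uptime_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the minutes of downtime a given uptime percentage permits."""
    downtime_fraction = 1 - uptime_percent / 100
    return downtime_fraction * period_minutes


for pct in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% uptime -> {minutes:.2f} minutes of downtime per year")
```

At five nines this works out to a little over five minutes of downtime for the whole year.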

Downtime Equals the Impact to Users

I don’t recall ITIL or DevOps criteria for my computer information systems degree, so I’m guessing it’s still not a requirement?

This means bunches of technology graduates every year are making their way into the IT world without knowing what the most important objective of their job is…

(Q) What is your most important objective, you might ask?
(A) Keeping the application user or customer happy!

Let’s cut to the chase of this lesson.

I’m not going to tell you how to reduce downtime, but I am going to tell you why it’s so important for your company to improve uptime so your valuable customers are not angry.

Server vs Service, which is more important?

For most VMware, Windows, or Linux admins, this is a no-brainer – the server is the most important thing in their daily lives.

You take pride in designing and building servers and then give them cool names like Ironman or Thor, right?

Some admins will go as far as to not allow anyone to log into these servers without their permission.

Stay with me because I’m getting to the nugget of uptime for beginners and what I want you to learn.

And the lesson is, the service running on any server is the most important thing you should be worrying about.

The application uptime is the goal for everyone in IT, from sysadmin to developer, and all the way up to the CTO!

Learn this while you are still new, and it will help you become great at your job.

A Story about Server vs. Service Monitoring

Once upon a time I worked with a group of great guys who were supporting a vSphere environment. They had been working on a project for months to set up a new monitoring tool but it just wasn’t happening.

This tool had a cool dashboard and listed all of their servers in little green blocks that filled the screen. If any of their servers were down or having issues, the block would turn red or yellow and then they would start troubleshooting.

One day I was invited to the project meeting, and as I reviewed their plan I could not help but share with them how their way was good, but they could make it better by focusing on service monitoring.

After a whiteboard session, and some debating, I was able to show them how we should be monitoring the service instead of each server. They were finally convinced and we redesigned the monitoring plan with the services on top, instead of Ironman and Thor.

The design was a masterpiece monitoring plan. And we were able to see when a service like Exchange turned red, and how it was due to a storage or network problem that was also impacting multiple other services.

In the end, this was a better monitoring solution than only seeing a red network switch and not knowing the service impact to users.

Example of what to Monitor with Your Tools

Using email communication (Exchange) as an example service, what are all the devices required to make this service always available to the users?

Server uptime & availability
SQL uptime
Website uptime and load
Exchange and other dependent services
Storage uptime and capacity
Network uptime and availability
Internet and MPLS (point to point)
Hardware (end to end)
Power (A & B)

Note: Sometimes a device is up but not available due to performance or other issues.

Every piece in the Exchange service stack needs to be monitored from Outlook installed on the user’s desktop, down to the hardware and power in the data center.

When monitoring is set up right, you are able to see when a service like Exchange has turned red, and what has caused this outage.
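The idea of putting the service on top can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the component names, the green/yellow/red statuses, and the "worst status wins" rule are hypothetical, and a real tool would collect these statuses from its own checks.

```python
# Severity order for the dashboard colors (illustrative assumption).
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def service_status(components):
    """Roll component statuses up into one service status (worst one wins).

    Returns the service status and the sorted names of components that are
    not green, so you can see what is behind the outage.
    """
    worst = max(components.values(), key=lambda status: SEVERITY[status])
    causes = sorted(name for name, s in components.items() if s != "green")
    return worst, causes


# Hypothetical snapshot of the Exchange service stack.
exchange = {
    "server": "green",
    "sql": "green",
    "storage": "red",
    "network": "yellow",
    "power": "green",
}

status, causes = service_status(exchange)
print(f"Exchange is {status}; check: {', '.join(causes) or 'nothing'}")
```

The point of the design is that the dashboard answers "is email working?" first, and only then points you at the storage or network layer underneath.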

99.999 Uptime (aka 5 Nines)

Building big bad servers is not as important anymore now that cloud is making high availability (HA) standard. But don’t mistake high availability for active clustering, which is still better for uptime than HA.

This is a common mistake many new VMware admins make. In our excitement about all the bells and whistles vSphere has, we overlook that failover requires the VM to crash before it is moved over to another ESXi host.

Yes, it’s cool how migrations happen, but remember something still crashed.

And unfortunately, this slight downtime while the VM magically moves to its new location is an outage that could have caused user impact and/or possible data loss.

To achieve 5 nines, you will need redundancy all the way up the stack, from hardware to the application. This also includes power and Internet.
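A rough way to see why every layer needs redundancy is to multiply availabilities together. The Python sketch below uses the standard series and parallel availability formulas and assumes failures are independent, which real outages often are not. The layer names and percentages are made-up figures for illustration, not measurements.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain where every layer must be up (fractions 0-1)."""
    return prod(availabilities)


def redundant_availability(availability, copies):
    """Availability of N independent copies where one working copy is enough."""
    return 1 - (1 - availability) ** copies


# Hypothetical per-layer availabilities (assumed, not measured).
layers = {
    "power": 0.9999,
    "network": 0.9999,
    "storage": 0.9999,
    "server": 0.999,
    "application": 0.999,
}

single = serial_availability(layers.values())
doubled = serial_availability(
    redundant_availability(a, 2) for a in layers.values()
)

print(f"no redundancy:     {single * 100:.3f}%")
print(f"two of everything: {doubled * 100:.5f}%")
```

With these assumed numbers, the stack with no redundancy lands below three nines even though every single layer looks healthy on its own, and doubling up each layer is what moves the total toward five nines.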

Let this sink in!

Have I changed your mind yet? The whole reason for 5 nines uptime on the service is to limit the downtime to customers [they are not paying for servers, unless you are a hosting service, which is different].

Making customers unhappy with your service even 5 times in a month – for one minute at a time – will cause them to look elsewhere for the same service.
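Five one-minute outages may sound small, so here is a quick Python sketch of what they do to a monthly uptime figure. The 30-day month and the function name are assumptions I picked for the example.

```python
def uptime_percent(downtime_minutes, period_minutes):
    """Return uptime as a percentage of the period."""
    return (1 - downtime_minutes / period_minutes) * 100


MINUTES_IN_30_DAYS = 30 * 24 * 60  # 43,200 (assumed 30-day month)

# Five outages of one minute each.
monthly = uptime_percent(5 * 1, MINUTES_IN_30_DAYS)

# Downtime that five nines would permit over the same 30 days.
five_nines_budget = (1 - 0.99999) * MINUTES_IN_30_DAYS

print(f"five 1-minute outages in 30 days: {monthly:.4f}% uptime")
print(f"five-nines budget for 30 days:    {five_nines_budget:.2f} minutes")
```

That month comes in at roughly 99.988% uptime, while five nines leaves a budget of less than half a minute for the same 30 days.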

Consider banks, online education, health care, insurance, and government services; how would you feel if you tried logging into your account and it was slow or you got a 404 error because the website was down? Somewhere downstream a server or other type of device has crashed and maybe VMware is moving it to another host while you WAIT!

Lessons Learned:

So in your excitement to learn your trade as a VMware engineer or systems administrator, keep in mind everything you are doing is to give customers or patrons a great experience.

Delivering dependable services with the highest uptime possible is your goal and the point of this uptime for beginners lesson.

Career tip: Always be willing to take ownership and assist anyone (yes even a developer) to recover an offline service.

If you can do this without thinking it’s not your job because the server is still online, then you have learned this lesson about service delivery and why uptime is so important.

Uptime Monitoring Software to check out:

Nagios – Check server uptime, Full featured monitoring (graphic above from Nagios screenshot)
SolarWinds – Check server uptime, Full featured monitoring
Cacti – Graphing for monitoring

Do you have an uptime comment or tool to share with us?