I was on my way to lunch when I got the call…
When things break, you need someone on the bridge call who not only understands protocol but also has the skills and experience to drive the investigation in the right direction, and to closure.
Putting the pieces back together!
Incident response is critical, which is why incident managers are a must-have for companies running 24/7/365 web services, where a single device going offline can impact millions of users (or customers).
Think for a moment about the impact to customers if the website for your bank account, online school, or even Amazon were to go offline. HUGE!
I’ve worked in places where downtime was measured in dollars, not minutes. And let me tell you, nothing compares to the STRESS of being on a bridge call with VIPs having amygdala hijacks every 5 minutes!
So what is the skill set of an incident manager?
Unlike with VMware interviews and skills, the person you’re looking for needs more than technical expertise and deep hands-on skills. That said, most solid incident managers have grown through the ranks from a hands-on role into an incident manager role.
Ideally, the perfect prospect has a broad perspective across technologies such as networks, servers, virtualization, cloud, databases, applications and web servers. And let’s not forget both Windows and Linux variations.
However, like I said, technical skills are not enough!
Continue reading and I promise to share the secret ingredients for incident managers, all revealed in the 10 interview questions below and intended to help you screen your job applicants for the best candidate(s).
Incident Manager Interview Questions
Note: There are no answers with these questions so you will need to pay close attention to the responses and body language during your interview. Anyone with experience will bubble to the top and have good examples to share. Look for good ‘been there and done that’ examples.
It’s 2 o’clock in the afternoon on Monday. And to make things interesting, it’s month-end. You have just been called and asked to join a bridge call. There’s already a team troubleshooting an outage on your main tier 1 application that is used for invoicing customers. Severity 1!
Explain how you would coordinate a large group of technical contributors during this high-severity incident and retain control of a fast-paced conference call.
Over the last 2 days you’ve been on a number of bridge calls that all seem to have the same root cause. Now it’s time to lead the investigation into how this problem happened in the first place.
Explain how you would lead an incident investigation (a.k.a. Root Cause Analysis or RCA).
You have been called to a bridge that is already in progress and things are not going well. You can sense people on the call are tense and withdrawn because a high-level person has taken over the call and is very frustrated. You ask for a summary of where things are. Immediately, the VIP tells you it’s broken and needs to get fixed ASAP!
How would you maintain a professional demeanor and attitude while assertively letting this person know that you will take it from here?
You have been invited to an important meeting to share your investigation findings for a wide-scale outage that was caused by a storage failure. The CIO and his executive staff are at the meeting to hear your analysis. After introductions, your boss turns the conversation over to you. You have brought your laptop and have your RCA projected on the big screen.
Please give an example of a time when you faced a situation similar to this one and had to show the ability and confidence to act decisively and exercise influence over a wide range of individuals at all levels of technical and business leadership.
You are managing an incident that has been going on for over 2 hours and now things are really starting to heat up. You have the network admin checking a possible switch issue and a DC specialist checking the physical connection on an ESXi host that seems to be having network connection problems, which are impacting 15 VMs. Unfortunately, 2 of the hosted VMs are key systems for separate critical applications that are not redundant, yet. As if this weren’t enough, while all this is happening you are getting text messages from the boss wanting status updates.
Please share an example of a time when you had to multi-task and make sound judgments in a fast-paced, high-stress environment while at the same time keeping people informed.
You have been assigned to the global incident response team, which has staff spread out across the US, India, Mexico and Brazil. When a severity 1 or 2 issue happens, people from each location are asked to join a bridge because the problem can be anywhere due to the distributed workflow designed into the application.
Share an example of a time when you had to interact with people/groups of widely varying disciplines, cultures, and backgrounds. Explain how you influenced them to follow your lead.
In today’s world, most environments use virtualization to host their servers and applications. Many operations also use cloud services such as AWS and Google for IaaS. This creates a new challenge for understanding the infrastructure topology of virtual servers. On top of that, some operations have added Docker containers and PaaS to an already complex world.
This question has 2 parts.
First, briefly explain your technical background and be specific about the technical breadth of your expertise. Explain how you would be able to ask the right questions about a virtual server, and even question the responses from the admin if you thought something didn’t sound right.
Second, if everything points to a bad KVM host, yet the admin is insisting it’s OK, give an example of how you would challenge the admin’s assessment when the overwhelming evidence says it’s the host.
If you’ve been around IT staff for any amount of time then maybe you noticed our level of passion and ownership. Now imagine that you are the new incident manager on the call who has to work through each technology stack from the ground up until you find the problem. You’re working with a diverse group of personalities which includes admins and engineers from the data center, network, server, database and application teams; all wanting to prove it isn’t their problem. And in some special cases you may have a couple of managers, directors, and business partners on the call.
An incident response bridge has been opened for a tier 1 online application that is running in a local vSphere cloud on VMs (IaaS). You have been called to lead the call.
Please demonstrate your telephone and oral communication skills by sharing an example of how you would begin the incident investigation and then move through each technology group.
An incident manager who is new to the team may have problems getting people to follow their lead, which is why building trust by meeting with service and application owners is important. But even more important is building rapport with key players on all the different teams.
This question has 2 parts.
First, if you were selected to be our new incident manager, explain how you would establish strong interpersonal relationships with our technical staff and managers.
Second, provide an example of your social skills and your ability to learn complex systems, along with an estimate of how quickly you would be able to get up to speed.
You’ve just spent all night on a bridge call troubleshooting an application problem that could have been resolved in 10 minutes if the right admin had joined the call. Unfortunately, they did not, and all you had to work with was the junior admin who was on call. It’s a good thing you are technical, because in the end you literally had to log into Windows Server and fix the issue yourself.
This question has 2 parts.
First, how would you handle communication to the senior level staff waiting for the problem to be solved?
Second, if you found out the key person was just not answering the call to join the bridge, how would you handle the communication with the admin’s manager after the incident was resolved?
I’ve given you a list of 10 questions that all have a unique attribute: Leadership, Judgement, Depth, Trust, etc…and here is my point.
Whether you are a small mom and pop shop doing the hiring, or a multi-billion dollar corporation, I understand how costly incidents are!
Which is why these incident manager interview questions come from my own experiences. They are crafted to help hiring managers quickly determine the level of experience an applicant has.
And on the other hand, they will help aspiring incident managers understand what’s expected so they can get training. Here’s a great place where you can find ON DEMAND IT Training in case you want to improve your technical skills in Linux, cloud, web services, etc.
This is a fact! Finding the right person to be your incident manager could save you hundreds or even millions of dollars in penalty fees for breaching your SLA. Let that sink in…
Thank you for your interest and please feel free to contact me if you need someone to screen your candidates.