I was on my way to lunch when I got the call…
When things break, you need someone on the bridge call who not only understands the protocol but also has the skills and experience to drive the investigation in the right direction and to closure.
Putting the pieces back together!
Incident response is critical, which is why incident managers are a must-have for companies that run 24/7/365 web services, where a single device going offline can impact millions of users (or customers).
Think for a moment about the impact to customers if the website for your bank account, online school, or even Amazon were to go offline. HUGE!
I’ve worked in places where downtime was measured in dollars, not minutes. And let me also tell you there isn’t anything to compare to the STRESS of being on a bridge call with VIPs having amygdala hijacks every 5 minutes!
So what is the skill set of an incident manager?
Unlike with VMware interviews and skills, you’re looking for more than technical expertise and in-depth, hands-on skills. Even so, most solid incident managers have grown through the ranks from a hands-on role into an incident manager role.
Ideally, the perfect prospect has a broad perspective across technologies such as networks, servers, virtualization, cloud, databases, applications, and web servers. And let’s not forget both Windows and Linux variations.
However, like I said, technical skills are not enough!
Continue reading, and I promise to share the secret ingredients for incident managers, all revealed in the 10 interview questions below and intended to help you screen your job applicants for the best candidate(s).
Incident Manager Interview Questions
Note: There are no answers to these questions, so you will need to pay close attention to the responses and body language during your interview. Anyone with experience will bubble to the top and have good ‘been there, done that’ examples to share.
It’s 2 o’clock in the afternoon on Monday. And to make things interesting, it’s month-end. You have just been called and asked to join a bridge call. There’s already a team troubleshooting an outage on your central tier 1 application that is used for invoicing customers. Severity 1!
Explain how you would coordinate a large group of technical contributors during this high-severity incident and retain control of a fast-paced conference call.
Over the last 2 days, you’ve been on many bridge calls that all seem to have the same cause. Now it’s time to lead the investigation into how this problem happened in the first place.
Explain how you would lead an incident investigation (a.k.a. Root Cause Analysis or RCA).
You’ve been called to a bridge call that is already in progress, and things are not going well. You can sense that people on the call are tense and withdrawn because a very frustrated high-level person has taken over the call. You ask for a summary of where things are. Immediately, the VIP tells you it’s broken and needs to get fixed ASAP!
How would you maintain a professional demeanor and attitude while assertively letting this person know that you will take it from here?
You have been invited to an important meeting to share your investigation findings for a full-scale outage that was caused by a storage failure. The CIO and his executive staff are at the meeting to hear your analysis. After introductions, your boss turns the conversation over to you. You have brought your laptop and have your RCA projected on the big screen.
Please give an example of a time when you faced a situation similar to this one and had to show the ability and confidence to act decisively and exercise influence over a wide range of individuals at all levels of technical and business leadership.
You are managing an incident that has been going on for over 2 hours, and now things are starting to heat up. You have the network admin checking a possible switch issue and a DC specialist checking the physical connection on an ESXi host that seems to be having network connection problems, which are impacting 15 VMs. Unfortunately, 2 of the hosted VMs are key systems for separate critical applications that are not yet redundant. As if this weren’t enough, while all this is happening you are getting text messages from the boss wanting status updates.
Please share an example of a time when you had to multi-task and make sound judgments in a fast-paced, high-stress environment while keeping people informed.
You have been assigned to the global incident response team, which has staff spread out across the US, India, Mexico, and Brazil. When a severity 1 or 2 issue happens, people from each location are asked to join a bridge because the problem can be anywhere due to the distributed workflow designed into the application.
Share an example of a time when you had to interact with people or groups of widely varying disciplines, cultures, and backgrounds. Explain how you influenced them to follow your lead.
In today’s world, most environments use virtualization for hosting their servers and applications. Many operations also use cloud services such as AWS and Google for IaaS. This creates a new challenge for understanding the infrastructure topology of virtual servers. And that’s not all: some operations have added Docker containers and PaaS to an already complex world.
This question has 2 parts.
First, briefly explain your technical background and be specific about the functional breadth of your expertise. Explain how you would ask the right questions about a virtual server, and even question the responses from the admin if you thought something didn’t sound right.
Second, if everything points to a defective KVM host, yet the admin insists it’s OK, give an example of how you would challenge the admin’s assessment when the overwhelming evidence says it’s the host.
If you’ve been around IT staff for any amount of time, then maybe you’ve noticed our level of passion and ownership. Now imagine that you are the new incident manager on the call who has to work through each technology stack from the ground up until you find the problem. You’re working with a diverse group of personalities, including admins and engineers from the data center, network, server, database, and application teams, all wanting to prove it isn’t their problem. And in some cases, you may have a couple of managers, directors, and business partners on the call.
An incident response bridge has been opened for a tier 1 online application that is running in a local vSphere cloud on VMs (IaaS). You have been called to lead the call.
Please demonstrate your telephone and oral communication skills by sharing an example of how you would begin the incident investigation and then move through each technology group.
An incident manager who is new to the team may have problems getting people to follow their lead, which is why building trust by meeting with service and application owners is essential. But even more important is building rapport with key players on all the different teams.
This question has 2 parts.
First, if you were selected to be our new incident manager, explain how you would establish strong interpersonal relationships with our technical staff and managers.
Second, provide an example of your social skills and your ability to learn complex systems, along with an estimate of how quickly you would be able to get up to speed.
You’ve just spent all night on a bridge call troubleshooting an application problem that could have been resolved in 10 minutes if the right admin had joined the call. Unfortunately, they did not, and all you had to work with was the junior admin who was on call. It’s a good thing you are technical because, in the end, you had to log into Windows Server and fix the issue yourself.
This question has 2 parts.
First, how would you handle communication to the senior level staff waiting for the problem to be solved?
Second, if you found out the key person was just not answering the call to join the bridge, how would you handle the communication with the admin’s manager after the incident was resolved?
I’ve given you a list of 10 questions that each target a unique attribute: Leadership, Judgement, Depth, Trust, etc. And here is my point.
Whether you are a small mom-and-pop shop or a multi-billion-dollar corporation doing the hiring, I understand how costly incidents are!
That is why these incident manager interview questions come from my own experience. They are crafted to help hiring managers quickly determine the level of expertise an applicant has.
On the other hand, if you are an aspiring incident manager, they will help you understand what’s expected so you can get training. Here’s a great place where you can find ON DEMAND IT Training in case you want to improve your technical skills in Linux, cloud, web services, etc.
This is a fact! Finding the right person to be your incident manager could save you hundreds or even millions of dollars in penalty fees for breaching your SLA. Let that sink in…
Free: Download Incident Manager Interview Questions written by ITIL v3 Certified Manager.
Thank you for your interest and please feel free to contact me if you need someone to screen your candidates.