I was on my way to lunch when I got the call…
When things break, you need someone on the bridge call who not only understands the protocol but also has the skills and experience under their belt to drive the investigation in the right direction and to closure.
Putting the pieces back together!
Incident response is critical, which is why incident managers are a must-have for companies running 24/7/365 web services, where a single device going offline can impact millions of users (or customers).
Think for a moment about the impact to customers if the website for your bank account, online school, or even Amazon were to go offline. HUGE!
I’ve worked in places where downtime was measured in dollars, not minutes. And let me also tell you there isn’t anything to compare to the STRESS of being on a bridge call with VIPs having amygdala hijacks every 5 minutes!
So what is the skill set of an incident manager?
Unlike a VMware interview, where technical expertise and in-depth, hands-on skills are the main focus, the person you’re looking for here needs more than that. That said, most solid incident managers have grown through the ranks from a hands-on role into an incident manager role.
Ideally, the perfect prospect has a broad perspective on all technologies such as networks, servers, virtualization, cloud, databases, applications, and web servers. And let’s not forget both Windows and Linux variations.
However, like I said, technical skills are not enough!
Continue reading, and I promise to share the secret ingredients for incident managers, all revealed in the 10 interview questions below and intended to help you screen your job applicants for the best candidate(s).
Incident Manager Interview Questions
Note for Interviewing Managers: There are no answers to these questions so you’ll need to pay close attention to the responses and body language during your interview. Anyone with experience will bubble to the top and have good examples to share. Look for good ‘been there and done that’ examples.
It’s 2 o’clock in the afternoon on Monday. And to make things interesting, it’s month-end. You have just been called and asked to join a bridge call. There’s already a team troubleshooting an outage on your central tier 1 application that is used for invoicing customers. Severity 1!
Explain your ability to coordinate a large group of technical contributors during this high severity incident and retain control of a fast-paced conference call?
Over the last 2 days, you’ve been on many bridge calls that all seem to have the same cause. Now it’s time to lead the investigation into how this problem happened in the first place.
Explain how you would lead an incident investigation (a.k.a. Root Cause Analysis or RCA)?
You’ve been called to a bridge call that is already in progress, and things are not going well. You can sense people on the call are tense and withdrawn because a high-level person has taken over the call and they are very frustrated. You asked for a summary of where things are. Immediately, the VIP tells you it’s broken and needs to get fixed ASAP!
How would you maintain a professional demeanor and attitude while assertively letting this person know that you will take it from here?
You have been invited to an important meeting to share your investigation findings for a full-scale outage that was caused by a storage failure. At the meeting with you are the CIO and his executive staff to hear your analysis. After introductions, your boss has turned the conversation over to you. You have brought your laptop and have your RCA projected on the big screen.
Please give an example of a time when you faced a situation similar to this one and had to exude the ability and confidence to act decisively and exercise influence over a wide range of individuals at all levels of technical and business leadership?
You are managing an incident that has been going on for over 2 hours, and now things are starting to heat up. You have the network admin checking a possible switch issue and a DC specialist checking the physical connection on an ESXi host that seems to have network connection problems impacting 15 VMs. Unfortunately, 2 of the hosted VMs are key systems for separate critical applications that are not redundant, yet. As if this weren’t enough, while all this is happening you are getting text messages from the boss wanting status updates.
Please share an example of a time when you had to multitask and make sound judgments in a fast-paced, high-stress environment, while at the same time keeping people informed?
You have been assigned to the global incident response team, which has staff spread out across the US, India, Mexico, and Brazil. When a severity 1 or 2 issue happens, people from each location are asked to join a bridge because the problem can be anywhere due to the distributed workflow designed into the application.
Share an example of a time when you had to interact with people/groups of widely varying disciplines, cultures, and backgrounds. Explain how you influenced them to follow your lead?
In today’s world, most environments are using virtualization for hosting their servers and applications. Many operations are also using cloud services such as AWS and Google for IaaS. This creates a new challenge for understanding infrastructure topology for virtual servers. And that is not the only new challenge: some operations have added Docker containers and PaaS to an already complex world.
This question has 2 parts.
First, briefly explain your technical background and be specific about the functional breadth of your expertise. Explain how you would be able to ask the right questions about a virtual server, and even question the responses from the admin if you thought something didn’t sound right?
Second, if everything points to a defective KVM host, yet the admin is insisting it’s OK, give an example of how you would challenge the admin’s assessment if the overwhelming evidence says it’s the host?
If you’ve been around IT staff for any amount of time then maybe you noticed our level of passion and ownership. Now imagine that you are the new incident manager on the call who has to work through each technology stack from the ground up until you find the problem. You’re working with a diverse group of personalities, which includes admins and engineers from the data center, network, server, database, and application teams, all wanting to prove it isn’t their problem. And in some cases, you may have a couple of managers, directors, and business partners on the call.
An incident response bridge has been opened for a tier 1 online application that is running in a local vSphere cloud on VMs (IaaS). You have been called to lead the call.
Please demonstrate your telephone and oral skills by sharing an example of how you would begin the incident investigation and then move through each technology group?
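To make "from the ground up" concrete, here is a minimal sketch of how that walk through the technology groups could be tracked during a call. The layer order follows the teams named in the scenario above; the `LayerReport` type and the `first_unhealthy` helper are hypothetical names used for illustration only, not part of any real tool.

```python
from dataclasses import dataclass
from typing import List, Optional

# Layers ordered from the ground up, following the teams named in the scenario.
LAYERS = ["data center", "network", "server", "database", "application"]


@dataclass
class LayerReport:
    """What one team reported on the bridge (hypothetical structure)."""
    layer: str
    healthy: bool
    note: str = ""


def first_unhealthy(reports: List[LayerReport]) -> Optional[LayerReport]:
    """Return the lowest layer that is unhealthy or unverified, else None."""
    by_layer = {r.layer: r for r in reports}
    for layer in LAYERS:
        report = by_layer.get(layer)
        if report is None:
            # A team that has not reported yet is unverified, not healthy.
            return LayerReport(layer, False, "no report yet")
        if not report.healthy:
            return report
    return None


reports = [
    LayerReport("data center", True, "power and cabling OK"),
    LayerReport("network", True, "switch ports up"),
    LayerReport("server", False, "host NIC dropping connections"),
]
suspect = first_unhealthy(reports)
print(suspect.layer)  # server
```

Treating a missing report as "unverified" rather than "healthy" is a deliberate choice in this sketch: it keeps the call from skipping past a team that has not confirmed anything yet.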
An incident manager who is new to the team may have problems getting people to follow their lead, which is why building trust by meeting with service and application owners is essential. But even more important is building rapport with key players on all the different teams.
This question has 2 parts.
First, if you were selected to be our new incident manager, explain how you would establish strong interpersonal relationships with our technical staff and managers?
Second, provide an example of your social skills, ability to learn complex systems, and an estimate of how quickly you would be able to get up-to-speed?
You’ve just spent all night on a bridge call troubleshooting an application problem that could have been resolved in 10 minutes if the right admin had joined the call. Unfortunately, they did not, and all you had to work with was the junior admin who was on-call. It’s a good thing you are technical because, in the end, you had to log in to the Windows server and fix the issue yourself.
This is a 2 part question.
First, how would you handle communication to the senior-level staff waiting for the problem to be solved?
Second, if you found out the key person was just not answering the call to join the bridge, how would you handle the communication with the admin’s manager after the incident was resolved?
I’ve given you a list of 10 questions that all have a unique attribute: Leadership, Judgement, Depth, Trust, etc…and here is my point.
Whether you are a small mom and pop shop doing the hiring, or a multi-billion dollar corporation, I understand how costly incidents are!
This is why these incident manager interview questions come from my own experiences. They are crafted to help managers quickly determine the level of expertise an applicant has.
On the other hand, they will help aspiring incident managers understand what’s expected so they can get the right training. Here’s a great place where you can find ON DEMAND IT Training in case you want to improve your technical skills in Linux, cloud, web services, etc.
(Update) 5 Sample Answers
Alright, I give!
I’ve had a number of requests posted in the comments for examples that you can use to explain your incident management experience during an interview. So without going into all the details, I’ve decided to post 5 samples from my own experience.
These are only examples of many of the incidents I’ve dealt with over the years and the purpose of sharing them is to give you food for thought so you can draw from your own experiences.
Please don’t try to use these samples as your own answers in an interview unless you are being truthful…
- Users are calling the help desk and reporting that CRM is freezing up, which is keeping them from completing their work. After technical staff join the bridge call to investigate the issue, the DBA finds database blocking and kills the blocking SPID.
- Customers are calling technical support to report they cannot log into the HR application from the main URL. After a quick assessment of the issue, Linux admins are brought on the bridge to check the web services. They notice one of the nodes is not making connections and restart Apache to fix the problem.
- The help desk is getting calls about multiple applications that are unavailable. You suspect it’s a network issue and decide to bring a network admin on the bridge. After a few minutes, she reports the network is fine. Upon deeper investigation, you bring a storage admin on the call who finds that one of the main NAS units started failing over to the secondary controller but the failover did not complete. After a forced take-over, the storage is back online.
- There’s a major rainstorm going on and you receive a call that rainwater has flooded the data center and damaged your main core switches. The corporate network has gone offline. After a quick assessment, you get the network and data center staff working to transfer the traffic to the secondary site. This requires a coordinated effort across multiple sites. The bridge goes on for hours while equipment is reconfigured. Testing is completed and traffic is moving again, but staff worked all night.
- There’s been an accident and a construction crew has cut a power line that feeds the main power to your data center. Now, thousands of servers are running on backup power. You assess the problem and have your data center staff check fuel on the backup generators while the power company repairs the circuit. The bridge call remains open and you continue to give updates to VIPs.
Now it’s time for you to come up with your own examples from your experience. Be honest and give the details so the interview panel can see how well you handled the incident.
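If it helps to organize your own examples, notice that the samples above share the same shape: what was reported, who was brought onto the bridge, what they found, and how it was fixed. Here is a small sketch of that shape as a note-taking structure. The `IncidentExample` class and its field names are my own hypothetical labels, filled in with the CRM sample from the list above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentExample:
    """One interview example, in the shape the samples above follow."""
    symptom: str              # what users or customers reported
    teams_engaged: List[str]  # who was brought onto the bridge
    finding: str              # what the investigation turned up
    resolution: str           # what fixed it

    def summary(self) -> str:
        """Build a short talking-point version of the example."""
        teams = ", ".join(self.teams_engaged)
        return (
            f"Symptom: {self.symptom}. Teams engaged: {teams}. "
            f"Finding: {self.finding}. Resolution: {self.resolution}."
        )


crm = IncidentExample(
    symptom="CRM freezing for users",
    teams_engaged=["help desk", "DBA"],
    finding="database blocking",
    resolution="DBA killed the blocking SPID",
)
print(crm.summary())
```

Filling in all four fields for each of your own incidents is a quick way to spot an example that is missing the details an interview panel will ask about.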
This is a fact! Finding the right person to be your incident manager could save you hundreds or even millions of dollars in penalty fees for breaching your SLA. Let that sink in…
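To put a rough number on "breaching your SLA", here is a minimal sketch of the downtime budget arithmetic. The 99.9% availability target and the 30-day measurement window are assumptions picked for illustration; your actual contract terms and penalty amounts will differ, and the function names are hypothetical.

```python
def allowed_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime permitted in the window for a given availability target."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day window
    return total_minutes * (1 - availability_pct / 100)


def breaches_sla(outage_minutes: float, availability_pct: float, days: int = 30) -> bool:
    """True if the outage alone uses up more than the whole downtime budget."""
    return outage_minutes > allowed_downtime_minutes(availability_pct, days)


# Assumed target: 99.9% over 30 days leaves a budget of about 43.2 minutes.
budget = allowed_downtime_minutes(99.9)
print(round(budget, 1))          # 43.2
print(breaches_sla(120, 99.9))   # True: a 2-hour incident blows the budget
print(breaches_sla(30, 99.9))    # False
```

Under that assumed target, a single 2-hour bridge call is already nearly three times the monthly budget, which is why the speed of the person running the call matters so much.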
Free: Download Incident Manager Interview Questions written by ITIL v3 Certified Manager.
Thank you for your interest and please feel free to contact me if you need someone to screen your candidates.