Learning By Interviewing On-Call Engineers
Recently in our organization we’ve been doing interviews with our on-call engineers across the teams. The goal of these interviews is to better understand the collective comfort and “health” of the engineers that take the pager. It is also one way to get a better understanding of what it really means to manage a production service from the people that are directly responsible for doing it. These interviews are 1:1, but answers are anonymized and the information is aggregated to minimize the chance of association back to a specific person. The process also provides a basic way to quantify the largely qualitative experience of being on-call.
One of the direct benefits of having the SRE team drive these interviews is that it fosters direct connections with engineers and cultivates trust that the SREs have agency to make changes for the better. The art in the process comes from asking probing follow-up and clarifying questions to help expose underlying issues and contributing factors. If done well, this process helps assess what is going well and what could be improved, and can better set and drive priorities for the SRE team. The final report-out also spreads sympathy and empathy across the organization by elevating issues that on-call engineers experience that aren’t immediately clear to those not on-call.
What Questions Should I Ask?
In our group, we go through a set of 10 questions that build on each other and help to establish some rapport and pull out subtle issues that typically aren’t present in common metrics like incident counts and Mean Time To Resolve (MTTR). To help build some comfort and rapport with the interviewer, we start out with “1 to 5” rating questions:
Rate your overall on-call experience from 1 (terrible) to 5 (amazing)
Rate the tools you have available to you while on-call from 1 (terrible) to 5 (amazing)
How actionable were the alerts you received while on-call from 1 (not at all actionable) to 5 (very actionable)
How confident are you that the current system will accurately catch and notify you about problems from 1 (not at all confident) to 5 (very confident)
The on-call engineer needs to provide a number (we don’t allow decimals or fractions), and the interviewer immediately follows up with a question like “Why do you say [rating]?” or, if they were waffling between two numbers, “Why did you pick [rating] instead of [other choice]?”. The follow-up questions and getting the engineer to expand are where most of the good stuff is hiding, so keep digging until it’s clear why that rating was given.
After these questions, we move on to a more open-ended question-and-answer format that doesn’t include ratings. These questions include:
What steps did you take to make the system better during your on-call shift?
What work items and/or bugs did you create for you, another team member, or another team to address after your shift was over?
Where do you see the biggest gaps in our monitoring?
Where do you see the biggest gaps in our alerting? (Is the difference between the two clear?)
What do you like about the current on-call process?
What do you dislike about the current on-call process?
Again, probing follow-up questions are critical to pull out the lived experience of an on-call engineer on your team.
What To Do With The Results
All of the interviews should be done in person, with the interviewer taking notes, to reinforce the empathy, investment, and agency for making a change based on the answers. The answers should all be anonymized and aggregated together into 3–5 Key Takeaways for each question. We also include a Notable Quotes section to highlight specific statements that help emphasize some of the takeaways or call out something notable. Things like word clouds can help elevate common aspects of the answers (I like the flexibility of https://wordart.com/create a lot). For the 4 rating questions, we also include an Average Rating and Standard Deviation along with the number of interviews conducted. These are nice to have in the moment but become more useful as deltas the more times you execute this process.
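As a rough illustration, here is a minimal sketch in Python of how the rating summaries might be computed. The question keys and sample scores are invented for the example; only the mean, standard deviation, and interview count mirror what we report.

```python
from statistics import mean, stdev

# Hypothetical ratings for each "1 to 5" question across one round of
# interviews (one whole number per engineer; no decimals or fractions).
ratings = {
    "overall_experience": [3, 4, 2, 4, 3, 5, 3],
    "tooling": [2, 3, 3, 2, 4, 3, 2],
    "alert_actionability": [3, 3, 4, 2, 3, 4, 3],
    "detection_confidence": [4, 3, 3, 4, 2, 4, 3],
}

for question, scores in ratings.items():
    # Report the interview count alongside the stats so later rounds
    # can be compared as deltas.
    print(f"{question}: n={len(scores)}, "
          f"avg={mean(scores):.2f}, stdev={stdev(scores):.2f}")
```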
In addition to a summary for each question, we also include an overall summary across all the questions with about 5 overall takeaways from the process and responses. These don’t need to be copied from the per-question summaries; they can be, and in fact often are, combinations of takeaways and higher-order/system-level takeaways. With these takeaways you can start to provide a specific set of recommended changes to improve or solidify the discovered themes.
Other Things We’ve Learned
While this has been a great experience, we’ve also picked up some valuable lessons along the way:
- Do this every 6–9 months. It is not a trivial process, especially if you have a lot of people in your rotation, so you won’t want to do this frequently. You also need time to implement the recommendations and allow people to rotate through the (hopefully) improved experience. The sweet spot seems to be somewhere between 6 and 9 months depending on your team size.
- Templatize the process. Because of the effort involved, using a web form template to collect information helps keep it organized and formatted for easier integration later (a minimal sketch of such a record follows this list). Having a consistent format for presenting it is useful both for the person responsible for putting it together and for the people consuming the report and comparing it to previous versions.
- Ask for feedback and “anything else?”. These questions have evolved over time because we end each interview with “Is there anything we didn’t ask that you wanted to include in the data?”. This open-ended question has created some great feedback opportunities on the process itself, insight into the phrasing and ordering of the questions, and entirely new questions and areas to explore. You may not get much the first time through, but once people have a better understanding of what this process is and what it can do for them, they will typically be more forthcoming with feedback and constructive criticism.
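For the template itself, here is a minimal sketch of what a per-interview record might look like. The field names are invented for illustration, not a prescribed schema; whatever your web form produces can map onto something similar.

```python
# A hypothetical per-interview record; assumes answers are captured in a
# consistent structure so rounds can be aggregated and compared later.
INTERVIEW_TEMPLATE = {
    "round": "2024-H2",                # which interview cycle this belongs to
    "ratings": {                       # the four "1 to 5" questions
        "overall_experience": None,
        "tooling": None,
        "alert_actionability": None,
        "detection_confidence": None,
    },
    "rating_followups": {},            # "Why do you say [rating]?" notes, keyed by question
    "open_ended": {
        "improvements_made": "",
        "work_items_filed": "",
        "monitoring_gaps": "",
        "alerting_gaps": "",
        "process_likes": "",
        "process_dislikes": "",
    },
    "anything_else": "",               # the closing "anything else?" question
}
```

Keeping one record per interview in a shape like this makes the per-question aggregation shown earlier a simple pass over the collected rounds.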
What do you see that is missing or you would change in the process? Let me know!