penelope.zone - Questions for a prospective employer about on call (1/3)

Hi Folks,

This is the first in a series of three blog posts. These contain questions that I asked prospective employers when I was interviewing recently, and why I asked them. On-call is one of the biggest determinants of whether or not a job is particularly painful. It also seems to me to be becoming a more “normalized” practice. These questions are designed to give you a feel for how seriously engineers and, importantly, engineering leadership take their on-call responsibilities.

The reason you want to determine how engineering leaders think about on-call is that they’re ultimately responsible for how well or badly on-call will go. Things like how much engineering pain they’re aware of, how they’re helping drowning teams, and what they personally do to ensure there is mature operational practice have an outsized impact on how on-call goes.

In this first part, I’ll address questions that have to do with what I’m calling “general maturity”. This is a quick health check on where an organization is at on the design of their on-call practice. Do they have good procedures, are they thinking about who’s on-call and when, do they have an understanding of the uptime requirements of their product?

In the next post, I’ll talk about the specific nitty-gritty of being on-call. In the final one, I’ll cover what the entire incident management program of a company looks like, not just the on-call part.

These questions come from close to three years of serving as an on-call engineer at organizations of various sizes, and from spending the last year as the engineering manager of a team of 3, all of whom were on-call. I’m by no means trying to declare this is the absolute best framework to assess a company by, but it will let you start discussions and gain a lot of insight. I really do hope you enjoy it.

General maturity questions

Does $product have to be online 24/7? Do you expect engineers to be on-call out of hours?

This may seem like a “dumb” question, but I’ve been surprised by companies that have answered no to this question. There are obvious places where the answer to this question is no: retail POS, many types of financial services businesses, etc. If you get a “no” answer to this question, and it doesn’t sit right with you: probe. You might find that this isn’t well considered and that the software, in fact, needs to be online 24 hours a day. If that’s the case, that’s a big red flag.

You’ll often hear a “yes” answer as a simple default. For things like clouds, software that people might need to query information out of at any time of day, communications software, etc, that “yes” is probably well-founded. If you get a “yes” answer and you think that the product doesn’t need to be online 24/7, again, probe. Making sure that you get to a good reason for the availability expectations that the leadership of a company is placing on you is important. A 24/7 bound is basically the starting point for all of the questions in the rest of this post. Pages during “normal business hours” or slightly outside (say 7 am to 7 pm) are significantly less painful to deal with than those in the middle of the night (put your hands up 4 am ops club). If you’re entering into an organisation that expects its engineers to be on-call out of hours, you want to know that.

You want to probe here for uncertainty or bad reasons behind uptime requirements. If you’re expected to do on-call for something that demonstrably has almost no users in the wee hours, that’s a problem. Waking up at 4 am sucks, and you want to make sure it’s for good reasons. This is really the start of the conversation, and your chance to probe both individual contributor engineers and engineering leadership (I’ve asked this question all the way up to CTO level) about how they think about on-call.

Is anyone at this company performance managed on the reliability of the software that is being produced?

No is the most common answer to this question. Most companies are small and require little process. As such, frequently, shipping features is performance managed and reliability often isn’t. If you’re joining a company that’s lucky enough to have a “real” SRE (Site Reliability Engineering) function (not a team of SREs that sit somewhere, but SREs embedded within product teams) the answer might be someone in SRE leadership. Sometimes a VP of engineering or Director will take on this role, and try to manage reliability holistically. That sometimes works, and sometimes doesn’t.

You should ask this question more as a determiner of attitude than a binary yes/no question on which to eliminate a company. If engineering managers and engineering leadership want to improve in this area, that’s a great signal. If it’s something they’re not particularly concerned about, that can be a red flag.

If you hear words or explanations to the effect of “we trust individual engineers to manage the reliability of the software that they write”, that’s a problem. No matter how well-intentioned an engineer is, if the structure of their organization is focused on shipping software, that engineer will eventually tune out their focus on reliability. Non-functional requirements, things like metrics, timeouts, retries, etc, will go to the wind as you naturally incentivize those engineers to head down the “easy” path and write code that “just” works, apart from the myriad failure conditions they didn’t account for.

Is there any kind of cross organisational service level management that all teams aim for (e.g. support engineering on successful ticket submissions per hour, checkouts team on checkouts per hour etc)? If so, when teams miss their prescribed service level objectives, what do you do?

This question is a good follow up to the previous one. Generally speaking, if nobody is being performance managed on service level, then service level won’t be measured either. If service level is being managed, this is a great place to probe into what that practice looks like. Are they measuring functional, business-facing metrics (tickets submitted, emails sent, etc), or non-functional metrics (throughput, latency, error rate)? Does the organisation prefer setting service levels on one kind of metric over the other? Neither is necessarily better, but in my experience organizations that really care about their users tend to focus on the functional metrics.

Asking what organizations do if teams miss their prescribed service level is designed to probe at how the company holistically treats the health of their systems. Some companies will respond to failing service levels by making teams do reliability work until they are back within that service level. Some will take a look at any open postmortem action items and push those to the top of their priority list. They’ll do this to improve their service level rather than continuing to ship features. Some will do nothing at all but insist that they really do care about reliability.
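To make “missing a service level” concrete, here’s a minimal sketch of how a team might check a functional metric against its objective. The function names, the example numbers, and the 99.9% objective are all my own assumptions for illustration; real organisations will define their service levels differently.

```python
def attainment(successes: int, attempts: int) -> float:
    """Fraction of attempts that succeeded, e.g. successful checkouts / attempted checkouts."""
    if attempts == 0:
        # Assumption: a period with no traffic counts as meeting the objective.
        return 1.0
    return successes / attempts


def missed_slo(successes: int, attempts: int, objective: float) -> bool:
    """True when the measured attainment falls below the agreed objective."""
    return attainment(successes, attempts) < objective


# Hypothetical week for a checkouts team with a 99.9% objective.
if missed_slo(successes=9_950, attempts=10_000, objective=0.999):
    print("Below objective: prioritise reliability work and open postmortem actions")
else:
    print("Within objective: carry on with feature work")
```

The interesting interview question is what happens in the first branch. The code only tells you the objective was missed; the organisation decides whether anything changes as a result.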

Use this as a good place to determine if reliability is being measured in any kind of consistent way throughout the organisation, or whether it’s left up to the individual teams and engineers that care. When you’ve got a consistent organisational focus on reliability, that’s a sign that on-call is less likely to be hell.

How do teams at your company manage their on-call rotation?

The most likely answer that you will get to this question is that “we leave it to individual teams to manage their on-call rotations”. That’s a good answer but has a couple of pitfalls. If there’s no holistic review or individual teams within the organisation don’t have strong managers or a strong retrospective process, it’s easy for those teams to get stuck with a static on-call rotation. Being stuck like this makes people less adaptable to changes in circumstance, and can be a red flag.

By far and away the most common rotation I’ve seen is that teams do “on-call primary for 24 hours a day, for a week, and then rotate”, sometimes referred to as a 24 by 7. There’s usually some kind of secondary rotation which is a person who is the backup if the first person fails to respond. Sometimes that person is the lead of the team. As a final step: some teams opt to page everyone on the team if the first page is not responded to. This kind of rotation won’t work for everyone, and it’s important to make sure that a rotation of that form isn’t a blindly followed default. Look to see if the on-call rotation has actually been well considered when interviewing for a team. This is a great question to put to an individual engineering manager. As you go higher into engineering leadership, specific rotation details tend to melt away.
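For reference, the weekly primary/secondary rotation described above can be sketched in a few lines. This is only an illustration: the function name, the engineer names, and the choice to make next week’s primary this week’s secondary are my assumptions, not a description of how any particular paging tool works.

```python
from datetime import date, timedelta


def weekly_rotation(engineers: list[str], start: date, weeks: int) -> list[tuple[date, str, str]]:
    """Return (week_start, primary, secondary) for each week of a 24 by 7 rotation."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers to have a distinct secondary")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        # Assumption: the secondary is whoever is primary the following week.
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule


# Hypothetical team of three.
for week_start, primary, secondary in weekly_rotation(["ana", "bo", "cy"], date(2024, 1, 1), 4):
    print(week_start, "primary:", primary, "secondary:", secondary)
```

Notice how rigid this is. Nothing here accounts for swaps, holidays, or people who can’t take overnight pages, which is exactly why it’s worth asking whether the rotation was chosen deliberately.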

Another line of questioning here is how a team deals with swaps. Emergencies, flights, conferences, and vacations all happen. Determining if the team’s culture is healthy enough to support these kinds of things without scorekeeping can give you a really good lens into how healthy that team is.

A spin on this question for engineering leadership is to see what kind of health checks they’re conducting on their team’s rotations. Do they get any kind of reporting as to who’s getting paged the most frequently in their organisation? Do they insist that managers rotate an individual when that individual gets paged too many times in a week? What kind of high-level on-call rotation reviews are being conducted?
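The kind of reporting I’m asking about doesn’t need to be sophisticated. Here’s a rough sketch, assuming you can export pages as (responder, timestamp) pairs. The function names and the weekly threshold are hypothetical, the 7 am to 7 pm window is the one I mentioned earlier, and weekends are ignored for simplicity.

```python
from collections import Counter
from datetime import datetime


def is_out_of_hours(ts: datetime, day_start: int = 7, day_end: int = 19) -> bool:
    """Assumption: anything outside 7 am to 7 pm counts as out of hours (weekends ignored)."""
    return not (day_start <= ts.hour < day_end)


def page_report(pages: list[tuple[str, datetime]]) -> dict[str, tuple[int, int]]:
    """Map each responder to (total pages, out-of-hours pages)."""
    total: Counter = Counter()
    night: Counter = Counter()
    for responder, ts in pages:
        total[responder] += 1
        if is_out_of_hours(ts):
            night[responder] += 1
    return {responder: (total[responder], night[responder]) for responder in total}


def needs_relief(pages: list[tuple[str, datetime]], max_pages: int) -> list[str]:
    """Responders paged more than max_pages times in the period covered by pages."""
    return sorted(r for r, (count, _) in page_report(pages).items() if count > max_pages)


# Hypothetical week of pages.
week = [
    ("ana", datetime(2024, 1, 1, 4, 0)),
    ("ana", datetime(2024, 1, 1, 10, 0)),
    ("bo", datetime(2024, 1, 2, 19, 0)),
]
print(page_report(week))
print(needs_relief(week, max_pages=1))
```

If engineering leadership can’t produce numbers like these on request, that tells you something about how closely they’re watching on-call health.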

I’ll also note here that some people absolutely cannot deal with an out of hours call. This is for a variety of reasons ranging from being a single parent to mental health issues, and so on.

Sure. My problem is when that’s assumed to be enough flexibility for everyone.

I’m a single parent. My kids school still starts at 8 regardless of whether I was up the night before. There’s nobody else to take him. Getting up after midnight just is not an option.

— Sarah Mei (@sarahmei) December 28, 2018

These people also might be fantastic engineers, and excellent on-call responders, who don’t mind the interruptive work during their day to day. If you can determine that the company has some kind of process for dealing with this, or does 12-hour on-call slots where some people volunteer to take overnights on behalf of their team, that can also work. The point is, you should probe to determine if there’s any kind of flexibility there. Even if you don’t need it now, you may in the future, and it’s good to know it’s available.

Every answer that you hear to this question will be different. One red flag is someone in leadership who can’t tell you holistically how different teams manage their on-call. Another is on-call being dictated across the entire organisation with little to no team specific flexibility. If a mid-level engineering manager has their on-call rotation all laid out but doesn’t have a specific reason as to why the on-call is like that, it might be worth probing to see if they’re open to changing it based on discussions and retros with the team.

Do you have a “Severity Scale”? If so, could you walk me through what the ratings are, at what severity people can be called out of hours, who makes severity determinations, etc?

This is a meaty question, and will likely consume the majority of the conversation with someone who’s in engineering leadership. There’s a lot of data that you can take away from this one. Firstly, you should aim to get the description of what the severity levels are. For example, you might have:

SEV-0: A critical business-ending event, multiple products are unavailable, all customers cannot log in, registrations are failing, the whole site is unavailable, etc. Engineers and engineering leadership paged immediately to coordinate response
SEV-1: A serious impact to a single product area, business function, etc. Entire site responding but outside of usual performance SLO, Product is regionally unavailable in a single country/area of the world. Engineers paged, potentially an engineering leader paged to coordinate incident response if needed
SEV-2: A serious workflow breaking bug in a single product that does not have a workaround, single view or part of site outside of performance SLO. Engineers paged immediately
SEV-3: A workflow breaking bug exists but support has been able to determine a workaround, JIRA ticket filed and engineers expected to fix when they come in the next day
SEV-4: minor bug or piece of functionality not working, either not serious enough to need a workaround or simple workaround exists, JIRA filed, engineers expected to fix as part of usual bug fixing/sprint work

You’re looking for something that is precise. Ideally you will have specific definitions based on the product and business units within the company that you’re interviewing for. The reason that you want something like this is that it clearly distinguishes when you can be interrupted out of hours to fix problems. Having shared definitions like this enables you, your engineering leadership, and your support and operations folks to agree on how to respond to any given problem that they’re seeing.
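One way to see what “precise” means is to try writing the scale down as data. The sketch below encodes my example scale above. The class and field names are made up for illustration, and I’m assuming that “engineers paged” means they can be paged out of hours, which a real scale should state explicitly.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV0 = 0
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class Policy:
    pages_engineers: bool   # engineers are paged rather than picking up a ticket later
    pages_leadership: str   # "always", "if_needed", or "never"
    files_ticket: bool      # a ticket is filed for normal working hours


# Hypothetical encoding of the example scale; adjust to the company's own definitions.
POLICIES = {
    Severity.SEV0: Policy(pages_engineers=True, pages_leadership="always", files_ticket=False),
    Severity.SEV1: Policy(pages_engineers=True, pages_leadership="if_needed", files_ticket=False),
    Severity.SEV2: Policy(pages_engineers=True, pages_leadership="never", files_ticket=False),
    Severity.SEV3: Policy(pages_engineers=False, pages_leadership="never", files_ticket=True),
    Severity.SEV4: Policy(pages_engineers=False, pages_leadership="never", files_ticket=True),
}


def can_page_out_of_hours(sev: Severity) -> bool:
    """Assumption: any severity that pages engineers may do so out of hours."""
    return POLICIES[sev].pages_engineers


print([sev.name for sev in Severity if can_page_out_of_hours(sev)])
```

If the scale you’re told about can’t be reduced to something roughly this unambiguous, people will end up arguing about severity in the middle of an incident.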

As to who makes severity decisions: you’ll want the majority of incidents to be triggered automatically. It’s worth acknowledging, however, that automated alerts can’t cover every possible thing that could go wrong. As such, we usually concede that as a last resort a human is going to make a call to bring in engineers to try and fix the problem. If that’s the case, you’ll want to follow up by probing into who can make severity decisions, and out of hours page decisions. When a decision to page someone is made, who makes that decision: a support individual contributor? Support manager? Is there an operations team that can triage for you? An on-call incident manager who is an engineer from any team? Those things will determine how frequently you get a bad page and have to down-sev or redirect an incident.

It’s worth noting here that not having a sev scale isn’t necessarily a bad thing. If you’re joining a tiny company (n < 15 engineers on staff) you might not have a wide enough range of things that can go wrong as to need this much process and documentation. It’s worth considering whether the answers that you hear map to the amount of process you’d expect the company at which you’re interviewing to have.

A big red flag here is if this document exists but is stale, isn’t collaborated on by all stakeholders, etc. It should be a living document that accounts for new products, launches, teams, etc, not a static document that was dictated by some engineering leader one time and then never changed again.

Is there anything similar to a holistic incident management program at this company? Is someone in charge of that?

This question goes right along with the severity scale question. Answers that you’re looking for here include maintaining the severity scale, postmortem process (if they have one!), pre- and post-incident management, etc. Not everyone has this! Larger companies that care about reliability tend to hire either a technical program manager into this position or make it the responsibility of one of the more senior engineers on staff.

Hearing a “no” here, again, isn’t a hard elimination. Not every company needs something like this, and not every company has a service level requirement on their software that requires this role. If the program does exist, digging into why it was set up, and why the person who is running it is the one running it, can be a really good indicator of a healthy and mature attitude towards on-call.

Generally speaking, is the number of incidents that engineers are dealing with out of hours going up or down?

This is really the start of a discussion. Based on the answer that you hear you’ll want to ask what specific steps are being taken to improve the situation, what new practices the organisation is using, etc. Are your engineering leaders concerned with the state of this at all? Do they know? What are they doing to personally improve the lives of the engineers on the ground? Are those engineering leaders feeling any of the pain? Use this one as a point to establish whether or not you’re likely to encounter increasing or decreasing pain. I’ve not got a whole lot more to say here, because repeatedly asking “why?” at this stage is basically the best way to go.

Conclusion

This first round of questions will give you some insight into the maturity of a specific organization’s on-call practice. If you hear strong, consistent answers that have been clearly thought about through all these questions, that’s a really good sign. If you come away from the discussion worried, that’s something to follow up on in further conversations. I’ve found that when people are on top of on-call, you can bear it out really quickly. I do hope this is useful for you and helps you start to build a framework for what a good on-call practice might look like for you.