Hi Folks,

This is part two in my series about talking to prospective employers about on call. In the first part we covered a bunch of general questions. In this part, we’ll dig down into specific questions about incidents and how they function. An incident is the thing that gets you paged when you’re on call, so asking questions around incidents is pretty important.

How many people are on the smallest team that is on call, and how frequently do they get paged for out-of-hours incidents? Are you doing anything to help alleviate their pain?

By definition, the smallest on-call team has its engineers rotating through on call most frequently. If you get an answer that’s a really small number (say, 2 people in an engineering org of 50), that’s a big problem. Unless the platform you’re working on is ridiculously stable, you’re going to have a bad time. Really, this question is aimed at helping you understand how the engineering leadership in the business thinks about the health of their incident management program. Are they working to remove single points of failure? Are they working to support that small team by sharing on-call with another team? Are they rotating engineers in or cross-training people? These are all good questions to ask to get a feeling for how much pain the worst-off engineers in the company are in. More importantly, they tell you how likely it is that, if you were to end up in the same situation, you’d have a leader you could talk to to make sure something is done about it.

Does your company have distributed employees who can take on call at times that make sense for them?

I was once lucky enough to work on an on-call team that was split between East Coast American and European time zones. The five-to-six-hour time difference meant that we as a group could take on some of each other’s on-call burden. We specifically arranged the schedule so that the most obnoxious hours to receive a page in one time zone were covered during a more reasonable time of day in the other. This gets easier the wider your distribution.

Many companies do not have the luxury of doing this, because everyone works in a single time zone in a single office. If you do have a distributed employee base, taking advantage of it to alleviate some of the pages at the worst hours can be really great. Talk to individual engineers and individual engineering managers about how they deal with remote employees and time zones.

Do teams routinely create and put into place new automated alerts?

This is more common as a practice than it was, say, five years ago. Automated alerts matter most around new services or new features that are going into production. It sort of maps back into the severity scale discussion: are you defining your severity scale by something automated, or relying on a human to have your back?

One of the things this question is designed to hint at, though, is what tooling a company is using to get these alerts in place. Do they have something like Datadog integrated all the way through to PagerDuty? Are they using the CNCF Prometheus/Grafana/Alertmanager stack? How easy is it for an engineer to form a hypothesis about what a good alert is, test it, and then put it into production if the alert is meaningful? Organizations that don’t have a good strategy around automated alerting are generally resistant to adding new alerts, and that’s what we’re trying to tease out here. How much work is it for you, as an engineer, to add a new alert in production?
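To make that last question concrete, here’s a minimal sketch of how an engineer might test an alert hypothesis against Prometheus’s HTTP query API before committing it as an alerting rule. The Prometheus URL, metric names, and threshold are hypothetical, and your stack may look nothing like this:

```python
# A minimal sketch: evaluate a candidate alert expression against Prometheus's
# HTTP query API before codifying it as a rule. URL, query, and threshold are
# hypothetical placeholders.
import requests

PROMETHEUS_URL = "http://prometheus.example.com:9090"  # hypothetical endpoint
# Hypothesis: alert when 5xx responses exceed 5% of requests over 5 minutes.
QUERY = (
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)
THRESHOLD = 0.05


def alert_would_fire() -> bool:
    """Evaluate the candidate expression right now and compare it to the threshold."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:
        return False  # no data; in production you would probably alert on that too
    current_value = float(result[0]["value"][1])
    return current_value > THRESHOLD


if __name__ == "__main__":
    print("Alert would fire right now:", alert_would_fire())
```

If an engineer can iterate on something like this, check it against history, and then land the final rule through a reviewed pull request, adding a new alert is cheap; if it involves filing tickets with another team, it usually doesn’t happen.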

Do you conduct any kind of review on your automated alerts to ensure they are not becoming noisy?

Businesses grow over time, operational characteristics of our systems change, third-party providers change their service levels. The metrics against which a given alert fires are going to drift over time. An alert that was very indicative of a problem six months ago might suddenly become noisy because of this drift. An alert which used to be sensitive and accurate may fail to fire when data starts to change.

Conducting some kind of alert review on a regular cadence is a great way to catch this. Validate that the alert thresholds are still appropriate and that you’re not one small drift away from firing and waking somebody up at 4am when you didn’t need to. If your teams have a reasonable workload, this entire exercise can take less than an hour a month, and that’s totally worth it.
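If it helps to picture what that review might look like, here’s a rough Python sketch that flags potentially noisy alerts from an export of recent firings. The data shape and the “noisy” heuristic are made up for illustration; most paging tools can export something similar:

```python
# A rough sketch of a monthly alert-review helper. The input format and the
# "noisy" heuristic are hypothetical; substitute whatever your paging tool exports.
from collections import Counter
from dataclasses import dataclass


@dataclass
class Firing:
    alert_name: str
    actionable: bool  # did this page lead to any real remediation work?


def noisy_alerts(firings, min_firings=5, max_actionable_ratio=0.2):
    """Flag alerts that fire often but rarely lead to action: candidates for retuning or removal."""
    total = Counter(f.alert_name for f in firings)
    actionable = Counter(f.alert_name for f in firings if f.actionable)
    return [
        name
        for name, count in total.items()
        if count >= min_firings and actionable[name] / count <= max_actionable_ratio
    ]


if __name__ == "__main__":
    sample = [Firing("high_latency", False)] * 6 + [Firing("disk_full", True)] * 2
    print(noisy_alerts(sample))  # -> ['high_latency']
```

Even a list this simple gives the review meeting something concrete to argue about: retune the threshold, route the alert to a ticket queue, or delete it.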

This is one of those questions where a “no” answer might not be a problem. The business might not be changing fast enough to require this kind of review. There may be mitigating circumstances. Here you’re looking, again, for thoughtful leadership responses and/or good mitigations in place, not just a flat-out “this is not a problem for us” with no reasoning.

Do you have a support/operations/customer success team that can trigger a page in lieu of an automated alert?

If something’s going horribly wrong, you want to be notified. Sometimes you haven’t covered your service with enough observability, something gets missed, and you’re down without being automatically woken up. Hopefully, you’ve got a support team who is awake to notice. Hopefully, they’ve got a path to escalate to you.

This can be a double-edged sword. A human deciding between filing a JIRA ticket and waking you up in the middle of the night is in a fraught position. They don’t want to be the jerk who woke you up, and you don’t want to be woken up. At the same time, they’re incentivized to keep customers happy, so filing a ticket and waiting until morning might be a really shitty thing for them to have to deal with on their end. This is where having a severity scale can help, but when it comes down to human judgement, there’s always going to be room for some error. The important thing for you as the engineer to know is that if they do wake you up for something that isn’t as big a deal as it first seemed, that’s a systemic failure, not the fault of that person. You need to collaborate with them to clarify the severity scale, how to triage incidents, or how the system works.

You’re going to get a lot of varied answers to this question because every organization puts these teams together differently and gives them different objectives. This is a good opportunity for you to dig into how the company you’re joining is structured beyond the engineering organization. Engineering doesn’t exist in a vacuum, so when you’re thinking about this, understand how communication from the company’s customers makes its way to engineers. When something is broken for a customer, what exactly happens before an engineer gets involved, and how frequently do humans make the call to bring an engineer in out of hours?

Do you have any kind of 24x7 team (e.g. a NOC, SOC, or “cloud operations” team) that does first-level triage on behalf of engineers?

A yes answer to this question will make your life significantly easier. To take a quick diversion for a second: at DigitalOcean there is a 24x7 operations team called CloudOps (short for cloud operations) that works in three shifts. They’re technical folks who mix engineering, sysadmin, and SRE skills. They exist as a point to which support teams can escalate issues, and they make severity determinations if engineers aren’t already on top of a problem. They coordinate the start of an incident to get everyone needed to resolve it. This team is great (I love you if you’re reading this, CloudOps), and it was one of the best tools in my operational arsenal. They would frequently prevent something from getting through to my team by executing a playbook against an incident for us, and for that, I’m eternally grateful.

Many engineering organizations are too small to have a team like this. If you’re big and you don’t have one, it might be worth considering what the impact of a 24-hour operations team would be. In my experience, having a series of operators and communicators who are continually fresh can significantly reduce the time to resolution on any incident.

Getting a “no” answer to a question like this is pretty standard, and it’s not really a red flag. On top of the usual 24x7 support and success operation that many organizations run, a 24x7 technical team as well can be overkill. If you don’t have a team like this, though, it might be worth considering whether you could form a rotation of this kind. If your business has a significant operational burden (not everyone is a cloud hosting provider or something of that class), it really can be a lifesaver.

Have you ever had an incident where the severity call was set too high, causing people to get paged when they shouldn’t have? How did you remediate that?

This is one that I’d aim squarely at the engineering leadership of a company. Keeping incidents appropriately sev’d is going to show a strong difference between a healthy and an unhealthy engineering organization. When everything’s a SEV-0, the biggest and most important fire in the universe, nothing is. It’s a natural response, though: any little defect in the product is probably a big deal to somebody. Most likely your customer-facing teams are going to take some flak for even the smallest of bugs, and it’s natural to want to fix them as quickly as possible.

Conversely, engineers getting paged too frequently causes them to burn out and leave, so striking a balance is important. When a severity call is made, it is of course open to inspection and review. If anyone genuinely feels that an incident was over- or under-sev’d, they should absolutely speak up about it. It’s then up to the leadership of the company, or whoever is responsible for running the incident management program, to decide how that call ultimately sits.

Usually, when a “bad” call of this kind is made, it’s due to a lack of communication. In practice that means some documentation probably needs updating, how a product works needs to be clarified, or the engineering team needs to improve its automated alerting or metrics gathering so that it’s more obvious what’s going wrong when something is going wrong.

You may find that this hasn’t happened because it isn’t humans who are making these calls, but in general, making sure that someone’s got an eye on this is a good idea.

Conclusion

Incidents are painful for everyone involved. When they trigger, who triggers them, and what happens when they trigger are all really important for you to understand. Healthy engineering organizations have a strategy around alerts that goes beyond “customers start complaining and then we respond”. Digging into the alerting strategy, alert review, and process for support escalations will help you get a better handle on whether or not a company is careful and considered about its on call.