Trying to build a new on-call schedule or fix an existing one that doesn't
work? Here is how to do it the right way.
Many engineers have experienced an on-call setup that simply doesn’t work for
their team. Either it is based on a years-old setup that hasn’t kept pace with
the organization, or it was put together in a rush without the care it
deserves.
Here is how the @techroastshow characterized a poor on-call in one of their
short videos.
Any engineer who has been on-call would probably agree that it is inherently
stressful and that it should be crafted carefully. Once set up, it will quite
literally control hours of valuable engineering time, so let’s make one that
works.
What is an on-call schedule?
On-call is the practice of always having a team member on standby, ready to
respond to an urgent incident, even if it occurs outside regular working hours.
It’s one of the core processes of Incident Management and a key to minimizing
downtime and ensuring a reliable service.
An on-call schedule is a dedicated calendar that lets teams assign and monitor
on-call shifts.
💡 On-call schedule at Google
Google SRE (site reliability engineering) teams usually have a schedule where each engineer is on-call for one week every month. During this week they are ready to respond to incidents at any time of the day or night.
The rest of the month they spend on engineering (ideally 50% of their time) and on other operational, non-project tasks (around 25% of their time).
With 24/7 service availability being a goal of many modern organizations, having
a good on-call process is vital to:
Solve incidents as quickly as possible: On-call scheduling is a key part
of the incident management process, and it directly influences how quickly an
incident is acknowledged and eventually resolved. A well-designed on-call
schedule can radically improve MTTA, MTTR, and other incident metrics.
Keep engineers happy and prevent burnout: On-call engineers are often
under great pressure and make personal sacrifices to fulfill their duties.
Quality scheduling is key to keeping these vital engineers satisfied and
retaining them for the organization.
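To make these metrics concrete, here is a minimal Python sketch of how MTTA and MTTR could be computed from incident timestamps. It assumes MTTA is the mean time from an alert firing to its acknowledgement and MTTR is the mean time from the alert firing to resolution. The Incident fields and function names are hypothetical, and your incident tool may define or export these metrics differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started_at: datetime       # when the alert fired
    acknowledged_at: datetime  # when the on-call engineer acknowledged it
    resolved_at: datetime      # when the incident was resolved


def mean_time_to_acknowledge(incidents: list[Incident]) -> timedelta:
    """Average delay between an alert firing and its acknowledgement."""
    total = sum((i.acknowledged_at - i.started_at for i in incidents), timedelta())
    return total / len(incidents)


def mean_time_to_resolve(incidents: list[Incident]) -> timedelta:
    """Average delay between an alert firing and the incident being resolved."""
    total = sum((i.resolved_at - i.started_at for i in incidents), timedelta())
    return total / len(incidents)


# Made-up sample data: one night-time and one daytime incident.
incidents = [
    Incident(datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 2, 4), datetime(2024, 1, 1, 2, 40)),
    Incident(datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 2), datetime(2024, 1, 2, 14, 20)),
]
mtta = mean_time_to_acknowledge(incidents)
mttr = mean_time_to_resolve(incidents)
print(f"MTTA: {mtta}, MTTR: {mttr}")
```

Tracking these two numbers before and after a schedule change is one simple way to check whether the new schedule actually helps.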
7 ways to build a developer-friendly on-call:
Creating a quality on-call is a challenging task because there is no
one-size-fits-all model. But on-call doesn’t have to be difficult or
sleep-depriving, and it doesn’t have to lead to burnout.
Here are 7 strategies to help you in designing an effective on-call schedule:
1. Start by talking to your team
Every team is different and every team member has different priorities and
preferences. Understand what your team needs and consider different individual
situations to help everyone work more efficiently.
There might be a consensus on what the on-call should look like within your
team, and a complicated setup might not be necessary. For example, if your team
agrees on a week and weekend schedule, there is no need to go for more complex
follow-the-sun schedules with other teams across the world.
🗓️ Common schedule types
Week and Weekend: Engineers are on-call during the workweek and weekend - 7 days in a row (as explained in the Google example). The on-call week is intense, but the rest of the month is on-call free.
Nine to five (and five to nine): One team member covers on-call duty during office hours (9 to 5), and a different team member covers the hours outside of business hours (5 to 9).
Follow-the-sun: This model leverages the timezone difference between team members and allows all on-call engineers to have only business hours duties - ideally avoiding night shifts.
Hybrid: Combinations of different schedules. For example, combining follow-the-sun with week and weekend creates a schedule where team members are on-call for 8 hours every day (during business hours) for a week.
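As an illustration of the week and weekend type, here is a minimal Python sketch that generates a weekly rotation. The team names, the start date, and the weekly_rotation helper are all hypothetical. A real scheduling tool would also handle overrides, vacations, and handoff times.

```python
from datetime import date, timedelta
from itertools import cycle


def weekly_rotation(engineers, first_monday, weeks):
    """Return (shift_start, shift_end, engineer) tuples, one per week."""
    shifts = []
    people = cycle(engineers)  # loop over the team again once everyone had a turn
    for week in range(weeks):
        start = first_monday + timedelta(weeks=week)
        end = start + timedelta(days=6)  # Monday through Sunday
        shifts.append((start, end, next(people)))
    return shifts


team = ["alice", "bob", "carol", "dave"]  # hypothetical team of four
schedule = weekly_rotation(team, date(2024, 1, 1), weeks=8)
for start, end, engineer in schedule:
    print(f"{start} to {end}: {engineer}")
```

With four engineers, each person is on-call one week out of every four, which roughly matches the one week per month cadence from the Google example.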
2. Apply the follow-the-sun model when possible
Nobody likes being woken up by incident alerts at 4 in the morning. Poor sleep
leads to worse performance during the day, and reaction time and
incident-solving capability both suffer when an engineer has just been woken
up. Not to mention the mental and physical health consequences.
Implementing the follow-the-sun model where possible is a commonly used practice
that prevents those issues. If the location of the on-call team members allows
it, it should be considered first.
You can nicely see different working hours across time zones with
worldtimebuddy.
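To check whether your team's locations can actually cover the whole day, a small script can help. The following Python sketch uses three hypothetical sites with fixed UTC offsets (an assumption that ignores daylight saving time) and a 9-to-5 shift, and lists the UTC hours during which nobody is inside business hours.

```python
from datetime import datetime, time, timedelta, timezone

# Hypothetical sites with fixed UTC offsets (assumption: DST is ignored).
SITES = {
    "Prague": timezone(timedelta(hours=1)),
    "San Francisco": timezone(timedelta(hours=-8)),
    "Singapore": timezone(timedelta(hours=8)),
}
SHIFT_START, SHIFT_END = time(9, 0), time(17, 0)


def sites_in_business_hours(now_utc):
    """Return the sites whose local time falls inside the 9-to-5 shift."""
    covered = []
    for name, tz in SITES.items():
        local = now_utc.astimezone(tz).time()
        if SHIFT_START <= local < SHIFT_END:
            covered.append(name)
    return covered


def uncovered_utc_hours():
    """Return the UTC hours of the day during which no site is in business hours."""
    hours = []
    for hour in range(24):
        moment = datetime(2024, 1, 1, hour, tzinfo=timezone.utc)
        if not sites_in_business_hours(moment):
            hours.append(hour)
    return hours


print("Uncovered UTC hours:", uncovered_utc_hours())
```

With these three example sites there is a one-hour gap starting at 16:00 UTC, so one site would need to stretch its shift slightly for full follow-the-sun coverage.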
3. Have primary and backup on-call duties
Being on-call outside business hours can be complicated, and there are
unexpected situations when an engineer can’t respond as fast as necessary. To
prevent any incident from going unsolved and causing damage, it’s best to always
have a backup on-call engineer who is able to step up and fix the issue in case
of an emergency.
Backup on-call carries the same responsibilities as the primary one and it needs
to be treated that way. First, the team needs to understand that being a backup
is no different from being primary - you must be ready to react within minutes.
Second, managers need to treat it that way and consider being a backup on-call
engineer equivalent to regular on-call duties, especially when it comes to
compensation.
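The primary and backup relationship can be expressed as a simple escalation rule. The sketch below is one possible way to do it in Python. The 5-minute acknowledgement timeout and the who_to_page function are assumptions for illustration, not a recommendation from any specific tool.

```python
from datetime import datetime, timedelta

ACK_TIMEOUT = timedelta(minutes=5)  # hypothetical threshold, tune it for your team


def who_to_page(alert_fired_at, acknowledged, now, primary, backup):
    """Return the engineers who should be paged right now."""
    if acknowledged:
        return []  # someone is already on it
    if now - alert_fired_at < ACK_TIMEOUT:
        return [primary]  # still within the primary's response window
    return [primary, backup]  # no acknowledgement in time, escalate to the backup


fired = datetime(2024, 1, 1, 3, 0)
print(who_to_page(fired, False, fired + timedelta(minutes=2), "alice", "bob"))
print(who_to_page(fired, False, fired + timedelta(minutes=6), "alice", "bob"))
```

Because the backup is paged only a few minutes after the primary, the backup has to be just as reachable, which is why both duties deserve equal treatment.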
4. Clearly define the on-call process and responsibilities
The responsibilities of on-call engineers need to be clearly defined - ideally
in written form, so there is a single source of truth for the incident
response process.
Responsibilities are specific to a given organization, but good questions to
answer and write down are for example:
Are developers doing development work during on-call time? And if yes, how are
the deliverables (development work) checked in the context of incidents?
Is there a difference between working hours and non-working hours (night time)
responsibilities?
What’s the maximum time a single person can be on-call every month?
How are shifts rescheduled during vacations?
What is the compensation for on-call employees?
5. Nurture a supportive culture
Creating a supportive culture within a team can significantly improve both
employee happiness and incident response effectiveness. Every once in a while,
personal emergencies or important life events come up.
Maybe an old friend is in town or maybe you just hit that runner’s high.
Encouraging team members to help each other and switch duties to step in for
others makes all the difference. When teams care for each other, the on-call
challenge feels much more manageable.
6. Empower your team with the best tools
Staying on the same page and having a unified source of truth is key to
alerting, collaboration, and overall effective incident response. Using a tool
like Better Uptime allows for easy on-call scheduling, alerting, and incident
collaboration - helping your team maximize its potential and create the best
on-call in terms of both metrics and the team’s well-being.
7. Keep iterating
As products, organizations, and teams develop, there is always a need to iterate
and fine-tune to accommodate changes. Don’t be afraid to revisit old processes
and ask your team for feedback frequently. On-call is not a static process.
Focusing on improving incident metrics is a great place to start. But also look
into what directly influences the well-being of on-call engineers:
Number of false positives: How many alerts were not actionable, and
engineers investigated something that actually wasn’t a problem? How can this
be prevented?
Number of duplicate alerts: How many alerts were duplicated, and what can
we change to prevent engineers from being called multiple times about an
incident they are already aware of?
Number of low-priority alerts: How many alerts didn’t require immediate
reaction from the on-call team and how many of those were outside business
hours?
Number of all alerts: Is the current number of alerts manageable for the
number of people on-call?
In general, the fewer alerts an on-call person receives, the lower the chance of
alert fatigue developing.
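The four numbers above are easy to compute if you can export your alert history. Here is a minimal Python sketch under that assumption. The Alert fields (incident_id, actionable, severity, outside_business_hours) are hypothetical, and duplicates are counted as any alert beyond the first one for the same incident.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Alert:
    incident_id: str               # alerts sharing an id belong to the same incident
    actionable: bool               # False means it turned out to be a false positive
    severity: str                  # "low" or "high"
    outside_business_hours: bool


def alert_health(alerts):
    """Summarize the alert counts that affect on-call well-being."""
    per_incident = Counter(a.incident_id for a in alerts)
    return {
        "total": len(alerts),
        "false_positives": sum(not a.actionable for a in alerts),
        "duplicates": sum(count - 1 for count in per_incident.values()),
        "low_priority": sum(a.severity == "low" for a in alerts),
        "low_priority_off_hours": sum(
            a.severity == "low" and a.outside_business_hours for a in alerts
        ),
    }


# Made-up sample week of alerts.
alerts = [
    Alert("db-1", True, "high", True),
    Alert("db-1", True, "high", True),    # duplicate of the alert above
    Alert("disk-7", False, "low", True),  # false positive during the night
    Alert("cert-2", True, "low", False),
]
report = alert_health(alerts)
print(report)
```

Reviewing a report like this after every rotation gives the team concrete numbers to discuss when tuning alerting rules.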
💤 What is alert fatigue?
Alert fatigue means that people responsible for responding to alerts become numbed to new incoming alerts. This leads to missed or ignored alerts as well as delayed response times.
There are two main reasons for alert fatigue:
Large volume of alerts: Responding to multiple alerts in quick succession is much harder than solving one alert a day. If the number of alerts becomes unmanageable, alert fatigue develops and not all alerts are attended to properly.
Large percentage of false alarms: In these cases, an on-call person starts to recognize a pattern that new alerts are often false. As a result of this mental model, new alerts are treated with less urgency or disregarded altogether.
4 common mistakes when building on-call schedules
Sometimes on-call gets a bad reputation among engineers. The anxiety that it
brings up is caused by poor choices companies make, but it doesn't have to be
that way.
Here are the 4 most common mistakes that companies make when designing on-call:
1. Not allowing flexibility in schedules
On-call teams are the most valuable resource a company has in tackling
incidents. Not allowing them to have a healthy work-life balance will eventually
lead to lower performance (or worse, people leaving) over time.
Of course, flexible and changing schedules are harder for team leads to manage,
but the extra effort is worth it. When people can reschedule on-calls to
accommodate their lives, it creates a great team spirit and empowers everyone to
support each other.
2. Alerting about everything, all the time
Alerting rules need to be designed properly to avoid on-call teams being
overwhelmed with alerts. Being woken up in the middle of the night because of a
minor issue that can easily wait till the morning is something that can be
avoided with proper alerting policies.
On-call responsibilities can vary between the day and night shifts. A minor
incident alert might be useful for the on-call engineer during the middle of the
day, but it is definitely not vital during the middle of the night. Using
severity levels is a best practice when it comes to defining the priority of
incidents.
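A severity-based alerting policy can be as small as a single routing function. The following Python sketch assumes two hypothetical severity levels ("critical" and "minor"), a 9-to-5 business day, and three made-up actions: "page" (wake someone up), "notify" (a non-urgent message), and "queue" (hold until the morning).

```python
from datetime import time

BUSINESS_START, BUSINESS_END = time(9, 0), time(17, 0)  # assumed business hours


def route_alert(severity, local_time):
    """Decide how an alert reaches the on-call engineer."""
    in_business_hours = BUSINESS_START <= local_time < BUSINESS_END
    if severity == "critical":
        return "page"    # critical issues always page, day or night
    if in_business_hours:
        return "notify"  # minor issues are useful to see during the day
    return "queue"       # minor issues at night can wait until the morning


print(route_alert("critical", time(3, 0)))
print(route_alert("minor", time(13, 0)))
print(route_alert("minor", time(3, 0)))
```

The important design choice is that the night-time branch is stricter than the daytime one, so only issues that truly cannot wait interrupt someone's sleep.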
3. Relying just on operational engineers
Having only operational engineers responsible for the reliability of a service
is risky at best. It gives the developers who write the software no incentive to
ship reliable code. This often leads to more bugs and incidents.
To prevent this, enter the “you built it, you maintain it” setup.
💡 What is: You built it, you maintain it?
It’s the idea that engineers who write code should also take care of their systems and fix issues if they arise. Since they are the most familiar with the inner workings of the system, they are in the best position to troubleshoot and should be more efficient than purely operational teams.
An added benefit of this approach is that it gives engineers the incentive to test more before deployment. By giving them a stronger sense of ownership, it aligns the motivation to ship with the motivation to ship reliable code, creating a more resilient system.
4. Disregarding organizational factors
Every team has members with specific preferences, locations, and experience.
When building on-call, those need to be considered and built upon.
For example, in a team of two people in the same city, it’s impossible to go for
a follow-the-sun model, but it allows for having one primary and one backup
on-call person.
Those organizational factors need to be considered during scheduling to create a
good result for the whole team. Most importantly, keeping them in mind prevents
you from implementing “best practices” (or “that’s how it’s done at Google”
practices) that are outside of your organizational capabilities.
What about on-call compensation?
Having developer-first on-call also means paying people what they are worth. To
do so, it’s necessary to pick the right compensation model.
The common types of on-call compensation models are:
1. No additional pay
This is often the default setup for many smaller companies. It’s not inherently
a bad setup. When people don’t spend significant time solving incidents, or when
they only solve incidents during business hours it can be fine.
When solving incidents becomes an inconvenience, this lack of compensation can
become problematic and needs to be communicated clearly to the team.
2. Part of base salary
This model means that on-call responsibilities and expectations are clearly
outlined in the contract and reflected by the salary. For example, an annual
base salary contract should state that it comes with the responsibility to be
on-call for one week every month.
No monthly calculations are necessary with this setup, making it simple to
implement and manage.
On the other hand, it limits scheduling flexibility, since there is not much
motivation to switch duties when there is no extra reward for it. When
switches occur, they tend to be transactional: “I will take a day for you if you
take a day for me”.
3. Paid for time spent on incidents
An alternative model to the base salary pay is to pay only for the time spent
working on an incident. This creates a direct relationship between incidents and
monetary rewards making the compensation easy to understand.
However, this model can be tricky because it creates a financial disincentive to
reduce the number of incidents and to resolve them with the necessary urgency.
This might be an extreme case, but it is a possibility to consider.
Another issue is that when incidents don’t occur, on-call people might feel they
are carrying computers around and giving up their peace of mind for free.
4. Paid for on-call time
This setup is simply paying employees for the time they spend on-call, even if
no issues arise.
Paying for the whole time spent is a great way to justify the need to be ready
and available at a moment’s notice.
The main disadvantage of paying for on-call time is that incidents occur at
random, making some duties easy and some extremely challenging. Naturally, this
can cause some team members to feel this compensation is unfair.
This unfairness can be addressed by paying more for on-call shifts with more
severe incidents, essentially creating different rates for “high” and “low”
workloads.
When a regular compensation model isn’t enough, you can explore other options
that can be added on top of the model you decide to go for. These can include,
for example:
Paying more for off-hour incidents: In the case of on-call schedules that
run during off-hours, you can choose to pay team members more for incidents
during these times. It’s a great way of compensating for the loss of
their personal time.
Paying more for quick response times: Extra compensation can also be tied
to a specific threshold of incident management metrics, like MTTA and
MTTR, rewarding the on-call person for a quick response.
Paying more for severe incidents: Severe incidents can take several hours
to resolve. Compensating these differently from incidents that last only a few
minutes is a way of acknowledging the extra stress and work
connected to those incidents.
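To see how a base model and an add-on combine, here is a minimal Python sketch that pays a flat rate for on-call time plus an hourly rate for time spent on incidents, with a multiplier for off-hour incidents. The function name and all rates and hours are hypothetical example values, not recommendations.

```python
def on_call_pay(standby_hours, incidents, standby_rate, incident_rate, off_hours_multiplier):
    """Compute pay for one on-call shift.

    incidents is a list of (hours_worked, was_off_hours) tuples.
    """
    pay = standby_hours * standby_rate  # paid for on-call time, even if nothing happens
    for hours_worked, was_off_hours in incidents:
        rate = incident_rate * (off_hours_multiplier if was_off_hours else 1)
        pay += hours_worked * rate      # extra pay for time spent on incidents
    return pay


# Hypothetical week: 168 standby hours, one 2-hour night incident,
# and one 30-minute incident during business hours.
week_pay = on_call_pay(
    standby_hours=168,
    incidents=[(2.0, True), (0.5, False)],
    standby_rate=5,
    incident_rate=60,
    off_hours_multiplier=1.5,
)
print(week_pay)
```

Writing the formula down like this, even on paper, makes the compensation easy to explain to the team and easy to audit at the end of the month.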
Final thoughts
Creating a good on-call schedule is an ongoing process, so don’t worry if you
don’t get it 100% right the first time.
Hopefully, you now have a good foundation to start designing and improving your
on-call. If you want to dive a bit deeper, feel free to read more in:
Being on-call: Google SRE book
Jenda leads Growth at Better Stack. For the past 5 years, Jenda has been writing about exciting learnings from working with hundreds of developers across the world. When he's not spreading the word about the amazing software built at Better Stack, he enjoys traveling, hiking, reading, and playing tennis.