How to Create a Developer-Friendly On-Call Schedule in 7 steps

Trying to build a new on-call schedule or fix an existing one that doesn't work? Here is how to do it the right way.

Many engineers have experienced an on-call setup that simply doesn’t work for their team. It is either based on a years-old setup that hasn’t kept up with the growth of the organization, or it was put together in a rush without the care it deserves.

Here is how the @techroastshow characterized a poor on-call in one of their short videos.

Watch on TikTok

Any engineer who has been on-call would probably agree that it is inherently stressful and that it should be crafted carefully. Once set up, it will quite literally control hours of valuable engineering time, so let’s make one that works.

What is an on-call schedule?

On-call is the practice of always having a team member on standby, ready to respond to an urgent incident, even if it occurs outside regular working hours. It’s one of the core processes of incident management and key to minimizing downtime and ensuring a reliable service.

An on-call schedule is a dedicated calendar that lets teams assign and monitor on-call shifts.

💡 On-call schedule at Google

Google SRE (site reliability engineering) teams usually have a schedule where each engineer is on-call for one week every month. During this week they are ready to respond to incidents at any time of the day and night.

The rest of the month they spend on engineering (ideally 50% of their time) and on other operational, non-project tasks (around 25% of their time).

Recommended reading: Google SRE Book: Being On-call

Why is on-call scheduling so important?

With 24/7 service availability being a goal of many modern organizations, having a good on-call process is vital to:

Solve incidents as quickly as possible: On-call scheduling is a key part of the incident management process, and it directly influences how quickly an incident is acknowledged and eventually resolved. A well-designed on-call schedule can radically improve MTTA, MTTR, and other incident metrics.
Keep engineers happy and prevent burnout: On-call engineers are often under great pressure, while also making personal sacrifices when it comes to their duties. Quality scheduling is a key to keeping these vital engineers satisfied and retaining them for the organization.

7 ways to build a developer-friendly on-call:

Creating a quality on-call is a challenging task because there is no one-size-fits-all model. But on-call doesn’t have to be difficult, sleep-depriving, or a sure path to burnout.

Here are 7 strategies to help you in designing an effective on-call schedule:

1. Start by talking to your team

Every team is different and every team member has different priorities and preferences. Understand what your team needs and consider different individual situations to help everyone work more efficiently.

There might already be a consensus within your team on what on-call should look like, and a complicated setup might not be necessary. For example, if your team agrees on a week and weekend schedule, there is no need to go for a more complex follow-the-sun schedule with other teams across the world.

🗓️ Common schedule types

Week and Weekend: Engineers are on-call during the workweek and weekend - 7 days in a row (as explained in the Google example). The on-call week is intense, but the rest of the month is on-call free.
Nine to five (and five to nine): One team member covers office hours (9 to 5), and a different team member covers the time outside business hours (5 to 9).
Follow-the-sun: This model leverages the time zone differences between team members so that all on-call engineers have only business-hours duties - ideally avoiding night shifts.
Hybrid: A combination of different schedules. For example, combining follow-the-sun with week and weekend creates a schedule where team members are on-call for 8 hours a day (during business hours) for a week.
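
To make the week and weekend model concrete, here is a minimal Python sketch that generates a simple round-robin rotation. It is only an illustration: the team names, the start date, and the `weekly_rotation` helper are hypothetical and not tied to any scheduling tool.

```python
from datetime import date, timedelta


def weekly_rotation(engineers, start, weeks):
    """Return one (shift_start, shift_end, engineer) tuple per week.

    Each shift covers 7 consecutive days, and engineers take turns
    in the order they appear in the list (round-robin).
    """
    shifts = []
    for week in range(weeks):
        shift_start = start + timedelta(weeks=week)
        shift_end = shift_start + timedelta(days=6)
        engineer = engineers[week % len(engineers)]
        shifts.append((shift_start, shift_end, engineer))
    return shifts


# Hypothetical team of four: each person is on-call one week out of four.
team = ["Ana", "Ben", "Chen", "Dana"]
schedule = weekly_rotation(team, start=date(2024, 1, 1), weeks=8)

for shift_start, shift_end, engineer in schedule:
    print(f"{shift_start} to {shift_end}: {engineer}")
```

With four engineers, every person gets three on-call-free weeks between shifts, which mirrors the one-week-a-month rhythm from the Google example.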

2. Apply the follow-the-sun model when possible

Nobody likes being woken up by incident alerts at 4 in the morning. Poor sleep leads to worse performance during the day, and both reaction time and incident-solving ability suffer when an engineer has just been woken up. Not to mention the mental and physical health consequences.

Implementing the follow-the-sun model where possible is a commonly used practice that prevents those issues. If the location of the on-call team members allows it, it should be considered first.

You can nicely see different working hours across time zones with worldtimebuddy.
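
If you prefer to check coverage in code, the following Python sketch finds the hours of the day when no site is within business hours. It rests on several assumptions: the three sites are hypothetical, the UTC offsets are fixed (so daylight saving time is ignored), business hours are taken to be 9 to 5 local time, and weekends are not considered.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sites with fixed UTC offsets (assumption: no daylight saving).
SITES = {
    "Singapore": timezone(timedelta(hours=8)),
    "Prague": timezone(timedelta(hours=1)),
    "San Francisco": timezone(timedelta(hours=-8)),
}
WORKDAY_START, WORKDAY_END = 9, 17  # assumed business hours, local time


def sites_in_business_hours(moment_utc):
    """Return the sites whose local time falls within business hours."""
    active = []
    for name, tz in SITES.items():
        local_hour = moment_utc.astimezone(tz).hour
        if WORKDAY_START <= local_hour < WORKDAY_END:
            active.append(name)
    return active


def coverage_gaps(day_utc):
    """Return the UTC hours of the given day that no site covers."""
    return [
        hour
        for hour in range(24)
        if not sites_in_business_hours(day_utc.replace(hour=hour))
    ]


day = datetime(2024, 1, 15, tzinfo=timezone.utc)
print(sites_in_business_hours(day.replace(hour=8)))  # two sites overlap here
print(coverage_gaps(day))  # hours that would still need an off-hours shift
```

Any hour returned by `coverage_gaps` is a slot where someone would still have to be on-call outside their working day, so it shows quickly whether a pure follow-the-sun setup is realistic for your locations.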

3. Have primary and backup on-call duties

Being on-call outside business hours can be complicated, and there are unexpected situations when an engineer can’t respond as quickly as necessary. To prevent an incident from going unresolved and causing damage, it’s best to always have a backup on-call engineer who can step in and fix the issue in an emergency.

Backup on-call carries the same responsibilities as the primary one and it needs to be treated that way. First, the team needs to understand that being a backup is no different from primary - you must be ready to react within minutes. Secondly, managers need to treat it that way and consider being a backup on-call engineer equivalent to regular on-call duties, especially when it comes to compensation.
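
The primary and backup relationship is usually expressed as an escalation rule: page the primary first, and bring in the backup if the alert is not acknowledged in time. Below is a minimal Python sketch of that idea. The `EscalationPolicy` class, the names, and the 10-minute timeout are assumptions made for illustration, not the behavior of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class EscalationPolicy:
    primary: str
    backup: str
    ack_timeout_minutes: int = 10  # assumed threshold, tune for your team

    def who_to_page(self, minutes_since_alert, acknowledged):
        """Return the people who should be paged right now."""
        if acknowledged:
            return []  # someone is already handling the incident
        if minutes_since_alert < self.ack_timeout_minutes:
            return [self.primary]
        # The primary has not acknowledged in time: escalate to the backup too.
        return [self.primary, self.backup]


policy = EscalationPolicy(primary="Ana", backup="Ben")
print(policy.who_to_page(minutes_since_alert=2, acknowledged=False))
print(policy.who_to_page(minutes_since_alert=12, acknowledged=False))
print(policy.who_to_page(minutes_since_alert=12, acknowledged=True))
```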

4. Clearly define the on-call process and responsibilities

The responsibilities of on-call engineers need to be clearly defined, ideally in written form, so there is a single source of truth for the incident response process.

Responsibilities are specific to a given organization, but good questions to answer and write down include:

Are developers doing development work during on-call time? And if yes, how are the deliverables (development work) checked in the context of incidents?
Is there a difference between working hours and non-working hours (night time) responsibilities?

5. Nurture a supportive culture

Creating a supportive culture within a team can significantly improve both employee happiness and incident response effectiveness. Every once in a while personal emergencies or important life events come up.

Maybe an old friend is in town, or maybe you just hit that runner’s high. Encouraging team members to help each other and switch duties to step in for others makes all the difference. When teams care for each other, the on-call challenge feels much more manageable.

6. Empower your team with the best tools

Staying on the same page and having a unified source of truth is key to alerting, collaboration, and overall effective incident response. Using a tool like Better Uptime allows for easy on-call scheduling, alerting, and incident collaboration, helping your team maximize its potential and create the best on-call in terms of both metrics and the team’s well-being.

Recommended reading: Better Uptime vs. Pagerduty vs. Opsgenie

7. Iterate, improve and fine-tune

As products, organizations, and teams develop, there is always a need to iterate and fine-tune to accommodate changes. Don’t be afraid to revisit old processes and ask your team for feedback frequently. On-call is not a static process.

Focusing on improving incident metrics is a great place to start. But also look into what directly influences the well-being of on-call engineers:

Number of false positives: How many alerts were not actionable, and engineers investigated something that actually wasn’t a problem? How can this be prevented?
Number of duplicate alerts: How many alerts were duplicated, and what can we change to prevent engineers from being called multiple times for an incident they are already aware of?
Number of low-priority alerts: How many alerts didn’t require immediate reaction from the on-call team and how many of those were outside business hours?
Number of all alerts: Is the current number of alerts manageable for the number of people on-call?

In general, the fewer alerts an on-call person receives the lower the chance of something like alert fatigue developing.
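
The four numbers above are easy to compute once alerts are labeled after the fact. Here is a minimal Python sketch, assuming each alert has been reviewed and tagged by hand. The `Alert` fields, the `alert_report` helper, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    actionable: bool      # False means it was a false positive
    duplicate: bool       # True if it repeated a known incident
    priority: str         # "high" or "low"
    business_hours: bool  # True if it fired during business hours


def alert_report(alerts, engineers_on_call):
    """Summarize the well-being metrics for one review period."""
    total = len(alerts)
    return {
        "total": total,
        "false_positives": sum(not a.actionable for a in alerts),
        "duplicates": sum(a.duplicate for a in alerts),
        "low_priority": sum(a.priority == "low" for a in alerts),
        "low_priority_off_hours": sum(
            a.priority == "low" and not a.business_hours for a in alerts
        ),
        "alerts_per_engineer": total / engineers_on_call,
    }


# Made-up sample: four alerts from one week, two engineers on-call.
week = [
    Alert(actionable=True, duplicate=False, priority="high", business_hours=True),
    Alert(actionable=False, duplicate=False, priority="low", business_hours=False),
    Alert(actionable=True, duplicate=True, priority="high", business_hours=False),
    Alert(actionable=True, duplicate=False, priority="low", business_hours=True),
]
report = alert_report(week, engineers_on_call=2)
for metric, value in report.items():
    print(f"{metric}: {value}")
```

Tracking the same report from one period to the next shows whether changes to alerting rules are actually reducing noise.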

💤 What is alert fatigue?

Alert fatigue means that people responsible for responding to alerts become numbed to new incoming alerts. This leads to missed or ignored alerts as well as delayed response times.

There are two main reasons for alert fatigue:

Large volume of alerts: Responding to multiple alerts in quick succession is much harder than solving one alert a day. If the number of alerts becomes unmanageable, alert fatigue develops and not all alerts are attended to properly.
Large percentage of false alarms: In these cases, an on-call person starts to recognize a pattern that new alerts are often false. As a result of this mental model, new alerts are treated with less urgency or disregarded altogether.

4 common mistakes when building on-call schedules

Sometimes on-call gets a bad reputation among engineers. The anxiety that it brings up is caused by poor choices companies make, but it doesn't have to be that way.

Here are the 4 most common mistakes that companies make when designing on-call:

1. Not allowing flexibility in schedules

On-call teams are the most valuable resource a company has in tackling incidents. Not allowing them to have a healthy work-life balance will eventually lead to lower performance (or worse, people leaving) over time.

Flexible and changing schedules are harder for team leads to manage of course, but the extra effort is worth it. When people can reschedule on-calls to accommodate their lives, it creates a great team spirit and empowers everyone to support each other.

2. Alerting about everything, all the time

Alerting rules need to be designed properly to avoid on-call teams being overwhelmed with alerts. Being woken up in the middle of the night because of a minor issue that can easily wait till the morning is something that can be avoided with proper alerting policies.

On-call responsibilities can vary between day and night shifts. A minor incident alert might be useful for the on-call engineer in the middle of the day, but it is definitely not vital in the middle of the night. Using severity levels is a best practice when it comes to defining the priority of incidents.
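
A severity-aware alerting policy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three severity levels, the 9 to 5 business hours, and the action names returned by `route_alert` are all hypothetical, so adapt them to your own policy.

```python
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


BUSINESS_HOURS = range(9, 17)  # assumed 9:00 to 16:59 local time


def route_alert(severity, local_hour):
    """Decide what to do with an alert based on severity and time of day."""
    if severity >= Severity.HIGH:
        return "page"  # severe incidents always page the on-call engineer
    if local_hour in BUSINESS_HOURS:
        return "notify"  # minor issues are still useful to see during the day
    return "hold_until_morning"  # minor issues at night can wait


print(route_alert(Severity.HIGH, local_hour=3))
print(route_alert(Severity.LOW, local_hour=3))
print(route_alert(Severity.MEDIUM, local_hour=11))
```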

3. Relying just on operational engineers

Having only operational engineers responsible for the reliability of a service is risky at best. It gives the developers who write the software no incentive to ship reliable code. This often leads to more bugs and incidents.

To prevent this, enter the “you built it, you maintain it” setup.

💡 What is “you built it, you maintain it”?

It’s the idea that engineers who write code should also take care of their systems and fix issues when they arise. Since they are the most familiar with the inner workings of the system, they are in the best position to troubleshoot and should be more efficient than purely operational teams.

An added benefit of this approach is that it gives engineers an incentive to test more before deployment. By giving them a stronger sense of ownership, it aligns the motivation to ship with the motivation to ship reliable code, creating a more resilient system.

4. Disregarding organizational factors

Every team has members with specific preferences, locations, and experience. When building on-call, those need to be considered and built upon. For example, a team of two people in the same city can’t use a follow-the-sun model, but it can have one primary and one backup on-call person.

Those organizational factors need to be considered during scheduling to create a good result for the whole team. Most importantly, keeping them in mind prevents you from implementing “best practices” (or “that’s how it’s done at Google” practices) that are outside of your organizational capabilities.

What about on-call compensation?

Having developer-first on-call also means paying people what they are worth. To do so, it’s necessary to pick the right compensation model.

The common types of on-call compensation models are:

1. No additional pay

This is often the default setup for many smaller companies. It’s not inherently a bad setup. When people don’t spend significant time solving incidents, or when they only solve incidents during business hours, it can be fine.

When solving incidents becomes an inconvenience, this no-compensation model can become problematic and needs to be clearly communicated to the team.

2. Part of base salary

This model means that on-call responsibilities and expectations are clearly outlined in the contract and reflected in the salary. For example, a contract with an annual base salary should state that it comes with the responsibility to be on-call one week every month.

No monthly calculations are necessary with this setup, making it simple to implement and manage.

On the other hand, it limits scheduling flexibility: there is not much motivation to switch duties because there is no extra reward for it. When switches occur, they tend to be transactional: “I will take a day for you if you take a day for me”.

3. Paid for time spent on incidents

An alternative to the base salary model is to pay only for the time spent working on an incident. This creates a direct relationship between incidents and monetary rewards, making the compensation easy to understand.

However, this model can be tricky because it creates a financial disincentive to reduce incidents and to treat incidents with the necessary urgency. Even though this might be an extreme example, it’s a possibility to consider.

Another issue is that when incidents don’t occur, on-call people might feel they are carrying computers around and giving up their peace of mind for free.

4. Paid for on-call time

This setup is simply paying employees for the time they spend on-call, even if no issues arise.

Paying for the whole time spent on-call is a great way to justify the need to be ready and available at a moment’s notice.

The main disadvantage of paying for on-call time is that incidents occur at random, making some duties easy and some extremely challenging. Naturally, this can cause some team members to feel this compensation is unfair.

This unfairness can be addressed by paying more for on-call shifts with more severe incidents, essentially creating different rates for “high” and “low” workloads.

When a regular compensation model isn’t enough, you can explore other options to add on top of the model you choose. These can include, for example:

Paying more for off-hour incidents: If on-call schedules extend into off-hours, you can choose to pay team members more for incidents that happen during these times. It’s a great way of compensating for the loss of their personal time.
Paying more for quick response times: Extra compensation can also be tied to specific thresholds of incident management metrics, like MTTA and MTTR, rewarding the on-call person for a quick response.
Paying more for severe incidents: Severe incidents can take several hours to resolve. Compensating these differently from incidents lasting only a few minutes acknowledges the extra stress and work connected to them.
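
To see how these pieces can add up, here is a minimal Python sketch that combines pay for on-call time with pay for time spent on incidents and a higher rate for off-hour incidents. Every number in it is made up for illustration: the rates, the 1.5x off-hours multiplier, and the `on_call_pay` helper are assumptions, not recommendations.

```python
def on_call_pay(
    standby_hours,
    incident_hours_business,
    incident_hours_off,
    standby_rate,
    incident_rate,
    off_hours_multiplier=1.5,  # assumed premium for off-hour incidents
):
    """Return total on-call pay for one shift under a combined model."""
    # Paid for on-call time, even if no issues arise.
    standby_pay = standby_hours * standby_rate
    # Paid for time spent on incidents, with a premium outside business hours.
    incident_pay = (
        incident_hours_business * incident_rate
        + incident_hours_off * incident_rate * off_hours_multiplier
    )
    return standby_pay + incident_pay


# Hypothetical week: 168 standby hours, 2 incident hours during the day,
# 3 incident hours at night.
total = on_call_pay(
    standby_hours=168,
    incident_hours_business=2,
    incident_hours_off=3,
    standby_rate=5,
    incident_rate=60,
)
print(total)  # 840 standby + 120 daytime + 270 off-hours = 1230.0
```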

Final thoughts

Creating a good on-call schedule is an ongoing process, so don’t worry if you don’t get it 100% right the first time.

Hopefully, you now have a good foundation to start designing and improving your on-call. If you want to dive a bit deeper, feel free to read more in the Google SRE Book: Being On-call.

If you want to read more about how you can improve your incident management process, check out our status page articles:

Article by

Jenda Tovarys

Jenda leads Growth at Better Stack. For the past 5 years, Jenda has been writing about exciting learnings from working with hundreds of developers across the world. When he's not spreading the word about the amazing software built at Better Stack, he enjoys traveling, hiking, reading, and playing tennis.

What Is a Status Page? And Why You Should Have One?

Learn what a status page is, how it works, what its benefits and drawbacks are, and how to set it up.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
