Want to get alerted when your cron jobs don't run correctly?
Go to Better Uptime and start monitoring them in 2 minutes.
Cron job or heartbeat monitoring is an automated way of checking whether scheduled tasks run correctly. When a cron job fails, the monitor spots the issue and alerts the right person on the development team. If your service performs a vital process periodically, this is the ideal monitoring solution.
The cron monitoring process works by setting up a remote monitoring service with a dedicated URL to which the scheduled task sends a GET, HEAD, or POST request after it has run correctly. This tracking of a system's health by sending regular requests (heartbeats) is also called heartbeat monitoring. Cron job and heartbeat monitoring are often used interchangeably.
The heartbeat monitor is set up to expect a heartbeat once every x minutes, hours, or days. There is also a grace period that ensures alerting doesn't start immediately if the job is delayed.
When the monitor receives a heartbeat within the pre-set time window, no action is taken, and the monitoring continues. However, when an expected heartbeat does not arrive, the monitor opens what is called an incident and starts alerting according to the on-call calendar.
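The decision described above can be sketched in a few lines of shell. This is only an illustration of the idea, not how any particular monitoring service implements it. The function name heartbeat_status and the example values (a daily job with a 15-minute grace period) are assumptions made for this sketch.

```shell
# Sketch of the monitor-side check. All times are seconds since the epoch.
# heartbeat_status LAST_BEAT NOW PERIOD GRACE  -> prints "ok" or "incident"
# (hypothetical helper, not part of any real monitoring tool)
heartbeat_status() {
  last_beat=$1; now=$2; period=$3; grace=$4
  # The next heartbeat is due one period after the last one;
  # the grace period pushes the deadline a little further out.
  deadline=$((last_beat + period + grace))
  if [ "$now" -le "$deadline" ]; then
    echo "ok"
  else
    echo "incident"
  fi
}

# Example: daily job (86400 s) with a 15-minute grace period (900 s).
heartbeat_status 0 87000 86400 900   # prints "ok": 87000 <= 87300
heartbeat_status 0 90000 86400 900   # prints "incident": 90000 > 87300
```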
A cron job incident is a period of time during which the given monitor doesn't receive heartbeats from the monitored service. This means that the monitored service didn't run correctly, because every successful run sends a heartbeat to the monitor before finishing, which keeps the monitor from creating an incident.
After an incident is spotted by the cron job monitor, it needs to be communicated to the service admins. This process is called incident alerting or on-call alerting. In case of an incident, the person from a team who is currently on-call (has scheduled duty) receives the incident alert.
The most common ways of getting alerted by a cron job monitor include automated phone calls, SMS, Slack, and Microsoft Teams messages. Ways of alerting depend on factors like the importance of the monitored service, time of the day, and team preference.
The incident alert for cron jobs and heartbeats in general is very basic because the monitoring provides only simple up/down information. Implementing logging in the monitored services and forwarding those logs into a log aggregation tool is a great way of getting in-depth insights about any potential scheduled job incidents.
After an alert is received, it should be acknowledged immediately. If the alert is not acknowledged in a specified time frame (usually 3 minutes), the person next in line on the on-call duty is alerted. This process could continue further until the whole team is alerted. However, the best practice is to have the on-call schedule set up in a way that the first team member is always ready to solve incoming incidents.
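As a rough illustration of the escalation described above, the sketch below computes which position in the on-call order is currently being alerted. The function name escalation_level, the zero-based numbering, and the rule that escalation stops at the last team member are assumptions of this sketch; real on-call tools may escalate differently.

```shell
# escalation_level ELAPSED_MIN WINDOW_MIN TEAM_SIZE
# Prints the zero-based position in the on-call order that is alerted after
# ELAPSED_MIN minutes without acknowledgement (hypothetical helper).
escalation_level() {
  elapsed_min=$1; window_min=$2; team_size=$3
  # Each unacknowledged window moves the alert one person down the list.
  level=$((elapsed_min / window_min))
  # Assumption: escalation stops at the last person in the order.
  if [ "$level" -ge "$team_size" ]; then
    level=$((team_size - 1))
  fi
  echo "$level"
}

# Example: 3-minute acknowledgement window, team of 3.
escalation_level 0 3 3    # prints 0: the first on-call person
escalation_level 4 3 3    # prints 1: the next person in line
escalation_level 30 3 3   # prints 2: capped at the last person
```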
Once the incident is acknowledged, the escalation process is paused and the team can fully focus on solving it. The time it takes to acknowledge an alert is called Time to Acknowledge (TTA). Its average across incidents, called Mean Time to Acknowledge (MTTA), is a widely used incident management metric.
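To make the metric concrete, here is a minimal sketch that computes MTTA from a list of TTA values. The function name mtta and the input format (one TTA in seconds per line) are assumptions for illustration only.

```shell
# mtta: read one TTA value (in seconds) per line from stdin and print
# the arithmetic mean with one decimal place (hypothetical helper).
mtta() {
  awk '{ sum += $1; n++ } END { if (n > 0) printf "%.1f\n", sum / n }'
}

# Example: three incidents acknowledged after 60, 120, and 180 seconds.
printf '%s\n' 60 120 180 | mtta   # prints 120.0
```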
The following steps in the downtime resolution process are individual to different teams and apps. For larger teams, they can include collaborations between a few developers or even teams of developers, delegations of incidents to dedicated team members, and more. There are some best practices that all teams managing incidents should use. These include incident communication (both internal and external) and incident post-mortems.
The heartbeat monitor will create an alert whenever it detects an issue. However, if the monitor sends an alert (for example, SMS or email) to all team members about the same incident ten times every day, they will very likely ignore it.
This situation, in which alerts are ignored or not treated with the necessary care, is called alert fatigue and poses a serious issue. To prevent alert fatigue, only vital services should be connected to on-call alerting and notify the team immediately.
Grace time is a short period after the expected heartbeat time during which no incident is started. This prevents delayed jobs from causing incidents and also helps to decrease the possibility of alert fatigue. However, when the grace period is too long, it also delays alerting in the case of an actual incident, so it needs to be set up carefully.
In many cases, your server running cron jobs will not be in the same timezone as the monitoring service. To prevent timezone mismatches and faulty alerting, both should use the same timezone. The command-line utility timedatectl shows the server timezone, and monitors typically offer the option to change timezones, so both can be synced.
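On servers without timedatectl, the standard date utility gives a quick view of the same information. The sketch below is an assumption-laden illustration: the helper name utc_offset is made up here, and the zone passed to it is only an example.

```shell
# utc_offset [ZONE]: print the UTC offset (+hhmm) of ZONE, or of the
# server's own timezone when no argument is given (hypothetical helper).
utc_offset() {
  if [ -n "${1:-}" ]; then
    TZ="$1" date +%z
  else
    date +%z
  fi
}

# Compare the server's offset with the zone configured in the monitor.
echo "server offset: $(utc_offset)"
echo "UTC offset:    $(utc_offset UTC)"   # prints +0000
```

If the two offsets differ, a cron expression such as 0 0 * * * fires at a different absolute time than the monitor may assume.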
The communication between the service and heartbeat monitor typically uses the HTTP GET or POST methods. The cron job usually includes in each request a unique token assigned by the monitor. The token is an authorisation measure. Without an authorisation token, anyone can send a fake heartbeat and your monitor won't detect an incident. The cron job must also use TLS encryption (HTTPS). Otherwise, anyone able to intercept the traffic can capture your authorisation token.
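One way to guard against accidentally sending the token over plain HTTP is to validate the URL before calling curl. The sketch below assumes the token is appended to a base URL as the last path segment; the function name heartbeat_url, the host monitor.example.com, and the HEARTBEAT_TOKEN variable are placeholders, and your monitor may expect the token in a different place.

```shell
# heartbeat_url BASE TOKEN: build the heartbeat URL, refusing anything that
# is not HTTPS so the token never travels in clear text (hypothetical helper).
heartbeat_url() {
  case "$1" in
    https://*)
      # Strip one trailing slash from the base before appending the token.
      printf '%s/%s\n' "${1%/}" "$2"
      ;;
    *)
      echo "refusing non-HTTPS heartbeat URL: $1" >&2
      return 1
      ;;
  esac
}

# Usage (placeholder host and token variable; commented out so that the
# sketch itself makes no network request):
#   url=$(heartbeat_url "https://monitor.example.com/heartbeat" "$HEARTBEAT_TOKEN") \
#     && curl -fsS --proto '=https' "$url"
```

The --proto '=https' option additionally tells curl itself to reject any other protocol.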
Cron job monitoring is a great addition to the synthetic monitoring toolbox. Ideally, it's combined with regular uptime checks as well as SSL certificate checks and domain expiration checks, which help prevent security issues and the loss of valuable business assets, respectively.
Synthetic monitoring also offers monitoring options like checking an API, DNS or Transaction monitoring.
Better Uptime is an infrastructure monitoring tool that offers cron job monitoring. To show how to get notified whenever a service fails to run correctly, let's set it up to alert us whenever a database backup fails.
24 hours
15 minutes
For more information, explore Better Uptime docs.
Let’s say that to do the database backup you would run the following script:
$ bash /database/backup/script
Now, you can create a cron job by running the crontab utility with the -e parameter:
$ crontab -e
The -e option opens your crontab file in the default text editor of your environment. At the end of the file, append the following line (make sure to replace the placeholder in the line below with your own heartbeat URL):
0 0 * * * bash /database/backup/script && curl https://betteruptime.com/api/v1/heartbeat/<your-heartbeat-monitor-id>
We set the heartbeat interval to 1 day, so the cron job must run at the same interval. The cron expression for that is 0 0 * * *, which runs the job every day at midnight. Because of the && operator, the curl utility sends the heartbeat only if the backup script exits successfully.
Once the cron job sends the first heartbeat to the monitor, the monitoring starts and the monitor expects the next request within 24 hours.
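If the crontab line grows unwieldy, the same logic can live in a small wrapper script. The following is a minimal sketch under stated assumptions: the function names send_heartbeat and run_with_heartbeat are made up for this example, and the timeout and retry values are arbitrary. The curl options used (-f, -s, -S, --max-time, --retry, -o) are standard curl flags.

```shell
# send_heartbeat URL: one GET request to the monitor.
# -f fails on HTTP errors, -sS stays quiet but still prints errors,
# --max-time and --retry keep a flaky network from hanging the job
# or from losing the heartbeat on a single transient failure.
send_heartbeat() {
  curl -fsS --max-time 10 --retry 3 -o /dev/null "$1"
}

# run_with_heartbeat URL COMMAND [ARGS...]
# Runs COMMAND and sends the heartbeat only if it exits with status 0;
# otherwise returns the command's exit status without contacting the monitor.
run_with_heartbeat() {
  hb_url=$1
  shift
  if "$@"; then
    send_heartbeat "$hb_url"
  else
    hb_status=$?
    echo "job exited with status $hb_status; no heartbeat sent" >&2
    return "$hb_status"
  fi
}

# Usage with the backup script from this guide (keep the placeholder
# until you have your own heartbeat URL):
#   run_with_heartbeat "https://betteruptime.com/api/v1/heartbeat/<your-heartbeat-monitor-id>" \
#     bash /database/backup/script
```

Keeping the failure message on stderr means cron's own mail or log output still records why a heartbeat was skipped.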
For more information, explore Better Uptime docs.