TOIL is a Four-Letter Word:

The SRE’s War on Manual, Repetitive Work.

Written by:

Principal Consultant
Sapience Consulting


The word toil carries a particular dread for Site Reliability Engineers (SREs). It reeks of struggle, drudgery, and incessant, tiring work. Understanding toil and eradicating it are fundamental to the SRE philosophy.

What is Toil?

The classical definition of toil, as established by Google, is:

“The kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”

Infographic: The six core characteristics of SRE toil (Manual, Repetitive, Automatable, Tactical, Devoid of Enduring Value, Scales Linearly).

Let’s break down those key characteristics:

  • Manual: A human operator is directly involved in the execution, even if it’s just running a pre-written script.
  • Repetitive: It’s work you do over and over again, like acknowledging the same recurring alert every morning.
  • Automatable: A machine could perform the task just as well as a human, or the need for the task could be designed away. If human judgment is essential, it’s generally not toil.
  • Tactical: It’s reactive, interrupt-driven work, rather than proactive, long-term strategic engineering.
  • Devoid of Enduring Value: The service remains in the same state after the task is finished. It’s maintenance, not improvement.
  • Scales Linearly: As the service grows (more users, more servers, more traffic), the amount of work required also grows proportionally.

Where Does Toil Show Up?

Toil rears its ugly head everywhere in the effort to run production.
Common examples include:

  • Manually deploying code releases.
  • Handling routine resource quota requests.
  • Copying and pasting commands from a runbook to fix a known issue.
  • Performing manual system configuration updates.
  • Repetitive, non-critical alert triaging.

High levels of toil are detrimental to both the individual engineer and the organisation as a whole.

Image: An unbalanced scale, weighed down on one side by '>50% TOIL / Operational Work' and raised on the other by 'QUALITY ENGINEERING / Innovation'.

The Cost of Excessive Toil to Your Business

Excessive toil is a direct route to engineer burnout. When a majority of an SRE’s day is spent on repetitive, manual, non-creative tasks, morale plummets. Engineers feel like their skills are being wasted. They have no time left for meaningful projects, learning new skills, or critical thinking, leading to career stagnation. For the organisation/business, excessive toil is the proverbial albatross around the neck.

Google SREs famously strive to keep operational work (toil) below a 50% threshold. The remaining time is dedicated to engineering project work: building features, improving reliability, and, most importantly, automating toil away. When toil exceeds 50%, there is not enough time to implement the solutions that would reduce it, creating a vicious downward spiral of pain and ineffectiveness.

Manual, repetitive work is prone to human error. Automation requires an initial investment, but it removes most of these slip-ups and delivers higher quality, greater consistency, and more speed.

The SRE’s War on Toil: Three Overarching Steps

The philosophy of SRE essentially declares war on toil.
SREs leverage their software engineering background to solve operational problems. Three simple, overarching steps help to win that war.

1. Identify and Measure

The first step is to objectively track the time spent on operational work. Teams need to define what constitutes toil and use ticketing systems or other tools to log the human-hours spent. This data is crucial for prioritising automation efforts based on the highest return on engineering investment. I call it “mindful automation”.
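To make the measuring step concrete, here is a minimal sketch in Python. It assumes the team already exports its logged work as records with a category, the hours spent, and a toil flag set by the team's own definition. The `WorkLog` fields, the category names, and the sample week are made up for illustration and do not reflect any real ticketing system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkLog:
    """One logged piece of work (hypothetical fields, not a real ticketing schema)."""
    category: str   # e.g. "manual-deploy", "quota-request"
    hours: float    # human-hours spent
    is_toil: bool   # set according to the team's own definition of toil


def toil_fraction(logs):
    """Return the share of logged hours that was toil (0.0 when nothing is logged)."""
    total = sum(log.hours for log in logs)
    toil = sum(log.hours for log in logs if log.is_toil)
    return toil / total if total else 0.0


def toil_by_category(logs):
    """Return (category, hours) pairs for toil only, largest first."""
    hours = defaultdict(float)
    for log in logs:
        if log.is_toil:
            hours[log.category] += log.hours
    return sorted(hours.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Invented sample data for one engineer-week.
    week = [
        WorkLog("manual-deploy", 6.0, True),
        WorkLog("quota-request", 3.0, True),
        WorkLog("alert-triage", 9.0, True),
        WorkLog("capacity-planning-tool", 14.0, False),
        WorkLog("postmortem-actions", 8.0, False),
    ]
    print(f"Toil share: {toil_fraction(week):.0%}")
    for category, hours in toil_by_category(week):
        print(f"{category}: {hours:.1f} h")
```

In this invented week, 18 of the 40 logged hours are toil, so the share is 45%, and alert triage tops the list of automation candidates. Hours alone are only a rough proxy for return on investment; a real prioritisation would also weigh how hard each task is to automate.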

2. Automate Aggressively

Automation is the primary weapon against toil. SREs look at manual tasks and ask, "How can I write code to do this for me?"

This involves:

  • Scripting: Turning runbooks into executable scripts.
  • Tooling: Developing new, reusable tools and platforms (often self-service) to handle common operations like provisioning, deployment, and remediation.
  • Infrastructure as Code (IaC): Managing infrastructure via code (e.g., Terraform, Ansible) to ensure repeatability and consistency, thus reducing manual configuration toil.
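As a rough illustration of the scripting point, here is what a two-step runbook entry might look like once it is turned into code. The runbook itself is hypothetical ("find files older than a cut-off in a scratch directory, then delete them"), and the function names and the `dry_run` default are assumptions made for this sketch, not part of any particular tool.

```python
import time
from pathlib import Path


def find_stale_files(directory, max_age_seconds, now=None):
    """Runbook step 1: list regular files older than max_age_seconds."""
    now = time.time() if now is None else now
    return sorted(
        path for path in Path(directory).iterdir()
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds
    )


def remediate(directory, max_age_seconds, dry_run=True, now=None):
    """Runbook step 2: delete the stale files, or only report them when dry_run is set.

    Returns the names of the files that were (or would be) removed.
    """
    stale = find_stale_files(directory, max_age_seconds, now)
    for path in stale:
        if not dry_run:
            path.unlink()
    return [path.name for path in stale]
```

Two design choices are worth noting. The script defaults to a dry run, so an operator can see what it would do before it changes anything. It is also safe to run twice: a second pass finds nothing left to delete, which is what lets a scheduler or an alert handler call it without a human in the loop.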

3. Design Services for No Toil

The ultimate goal is to design systems that require minimal human intervention in the first place. This means building services to be inherently more robust, observable, and self-healing. A mature system should require human intervention only for genuinely novel problems.
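One way to picture "self-healing" is a small recovery loop that tries a known fix a bounded number of times and pages a person only when that fix does not work. The sketch below is a toy model under stated assumptions: `heal`, `FlakyService`, and the restart-based remedy are all hypothetical, and a real system would normally rely on its orchestrator's health checks rather than hand-rolled code.

```python
def heal(is_healthy, restart, page_human, max_restarts=3):
    """Try automated recovery first; involve a person only when it fails.

    Returns "healthy", "recovered", or "escalated".
    """
    if is_healthy():
        return "healthy"
    for _ in range(max_restarts):
        restart()
        if is_healthy():
            return "recovered"
    # The known remedy did not work, so this may be a novel problem.
    page_human(f"still unhealthy after {max_restarts} automated restarts")
    return "escalated"


class FlakyService:
    """Stand-in for a real service: becomes healthy after a set number of restarts."""

    def __init__(self, restarts_needed):
        self.restarts_needed = restarts_needed
        self.restarts = 0

    def is_healthy(self):
        return self.restarts >= self.restarts_needed

    def restart(self):
        self.restarts += 1


if __name__ == "__main__":
    pages = []
    service = FlakyService(restarts_needed=2)
    print(heal(service.is_healthy, service.restart, pages.append))
    print(f"pages sent: {len(pages)}")
```

The restart limit matters as much as the restart itself: without it, the loop would hide a persistent fault instead of surfacing it to an engineer.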

From Existential Threat to Competitive Edge

Toil is a red flag. In small doses it is a necessary and generally unavoidable evil, but it becomes an existential threat when it dominates the workweek. By calling it out, measuring it, and prioritising its elimination, SREs ensure they spend their time where they deliver the most value: on long-term reliability, scalability, and innovation.
