TOIL Is a Four-Letter Word: The SRE’s War on Manual, Repetitive Work
Written by:
Principal Consultant
Sapience Consulting
The word toil fills Site Reliability Engineers with dread. It evokes struggle, drudgery, and incessant, tiring work. Understanding toil and eradicating it are fundamental to the SRE philosophy.
What is Toil?
The classical definition of toil, as established by Google, is:
“The kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”
Let’s break down those key characteristics:
- Manual: A human operator is directly involved in the execution, even if it’s just running a pre-written script.
- Repetitive: It’s work you do over and over again, like acknowledging the same recurring alert every morning.
- Automatable: A machine could perform the task just as well as a human, or the need for the task could be designed away. If human judgment is essential, it’s generally not toil.
- Tactical: It’s reactive, interrupt-driven work, rather than proactive, long-term strategic engineering.
- Devoid of Enduring Value: The service remains in the same state after the task is finished. It’s maintenance, not improvement.
- Scales Linearly: As the service grows (more users, more servers, more traffic), the amount of work required also grows proportionally.
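One way to apply the six characteristics above is as a simple checklist: the more of them a task matches, the more likely it is to be toil. The sketch below is only an illustration. The `TaskTraits` class, the `toil_score` function, and the two example tasks are hypothetical names invented here, not part of any Google or SRE tooling.

```python
from dataclasses import dataclass, fields


@dataclass
class TaskTraits:
    """One flag per toil characteristic from the list above (hypothetical model)."""
    manual: bool
    repetitive: bool
    automatable: bool
    tactical: bool
    no_enduring_value: bool
    scales_linearly: bool


def toil_score(traits: TaskTraits) -> int:
    """Count how many of the six characteristics the task matches (0 to 6)."""
    return sum(bool(getattr(traits, f.name)) for f in fields(traits))


# Assumed example: restarting a stuck worker by hand matches every trait.
restart_worker = TaskTraits(True, True, True, True, True, True)

# Assumed example: a design review is done by a human but needs judgment
# and leaves the service better than before.
design_review = TaskTraits(
    manual=True,
    repetitive=False,
    automatable=False,
    tactical=False,
    no_enduring_value=False,
    scales_linearly=False,
)

print(toil_score(restart_worker))  # 6
print(toil_score(design_review))   # 1
```

A score is a conversation starter for the team, not a verdict; a task that needs human judgment is generally not toil even if it matches some of the other traits.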
Where Does Toil Show Up?
Toil shows up throughout the work of running production systems.
Common examples include:
- Manually deploying code releases.
- Handling routine resource quota requests.
- Copying and pasting commands from a runbook to fix a known issue.
- Performing manual system configuration updates.
- Repetitive, non-critical alert triaging.
High levels of toil are detrimental to both the individual engineer and the organisation as a whole.
The Cost of Excessive Toil to Your Business
Excessive toil is a direct route to engineer burnout. When a majority of an SRE’s day is spent on repetitive, manual, non-creative tasks, morale plummets. Engineers feel like their skills are being wasted. They have no time left for meaningful projects, learning new skills, or critical thinking, leading to career stagnation. For the organisation/business, excessive toil is the proverbial albatross around the neck.
Google SREs famously strive to keep operational work (toil) below a 50% threshold. The remaining time is dedicated to engineering project work: building features, improving reliability, and, most importantly, automating toil away. When toil exceeds 50%, there is not enough time left to implement the solutions that would reduce it, creating a vicious downward spiral of pain and ineffectiveness.
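As a rough illustration of the 50% threshold, the check below compares logged toil hours against total working hours. It is a minimal sketch: the function names, the 40-hour week, and the sample figures are assumptions made up for this example.

```python
TOIL_BUDGET = 0.50  # the 50% threshold discussed above


def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Share of working time spent on toil, as a value between 0 and 1."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def over_budget(toil_hours: float, total_hours: float,
                budget: float = TOIL_BUDGET) -> bool:
    """True when toil takes up more of the week than the budget allows."""
    return toil_fraction(toil_hours, total_hours) > budget


# Assumed figures for a 40-hour week.
print(over_budget(18, 40))  # False: 45% of the week is toil
print(over_budget(26, 40))  # True: 65% of the week is toil
```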
Manual, repetitive work is prone to human error. Automation, while requiring an initial investment, performs the same steps the same way every time, which greatly reduces these slip-ups and leads to higher quality, greater consistency, and more speed.
The SRE’s War on Toil: Three Overarching Steps
The philosophy of SRE essentially declares war on toil.
SREs leverage their software engineering background to solve operational problems. Three simple, overarching steps help to win that war.

1. Identify and Measure
The first step is to objectively track the time spent on operational work. Teams need to define what constitutes toil and use ticketing systems or other tools to log the human-hours spent. This data is crucial for prioritising automation efforts based on the highest return on engineering investment. I call it “mindful automation”.
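To make the measurement step concrete, the sketch below totals logged hours per toil category and ranks the categories by how quickly automation would pay for itself. Everything in it is assumed for illustration: the ticket data, the category names, the automation cost estimates, and the function names do not come from any particular ticketing system.

```python
from collections import defaultdict

# Hypothetical export from a ticketing system for one month:
# (toil category, human-hours spent on the ticket)
tickets = [
    ("quota request", 0.5),
    ("manual deploy", 2.0),
    ("quota request", 0.5),
    ("cert rotation", 1.0),
    ("manual deploy", 2.0),
    ("quota request", 0.5),
]

# Hypothetical estimates of the engineering hours needed to automate each category.
automation_cost = {
    "quota request": 8.0,
    "manual deploy": 40.0,
    "cert rotation": 4.0,
}


def hours_by_category(tickets):
    """Sum the logged human-hours for each toil category."""
    totals = defaultdict(float)
    for category, hours in tickets:
        totals[category] += hours
    return dict(totals)


def rank_by_payback(totals, costs):
    """Order categories by payback time: automation cost / hours saved per month."""
    payback = {
        category: costs[category] / hours
        for category, hours in totals.items()
        if hours > 0 and category in costs
    }
    return sorted(payback.items(), key=lambda item: item[1])


totals = hours_by_category(tickets)
for category, months in rank_by_payback(totals, automation_cost):
    print(f"{category}: pays back in about {months:.1f} months")
```

With these assumed numbers, certificate rotation comes first even though manual deploys consume the most hours, because it is far cheaper to automate. That is the kind of trade-off the logged data is meant to expose.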

2. Automate Aggressively
Automation is the primary weapon against toil. SREs look at manual tasks and ask, "How can I write code to do this for me?"
This involves:
- Scripting: Turning runbooks into executable scripts.
- Tooling: Developing new, reusable tools and platforms (often self-service) to handle common operations like provisioning, deployment, and remediation.
- Infrastructure as Code (IaC): Managing infrastructure via code (e.g., Terraform, Ansible) to ensure repeatability and consistency, thus reducing manual configuration toil.
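As an example of the scripting approach, consider a runbook entry that says "delete temporary files older than seven days". The sketch below turns that step into a small script. It is illustrative only: the function names, the seven-day cutoff, and the directory path in the usage note are assumptions, and it defaults to a dry run so that nothing is deleted unless explicitly requested.

```python
import time
from pathlib import Path
from typing import List, Optional


def find_stale_files(directory: Path, max_age_days: float,
                     now: Optional[float] = None) -> List[Path]:
    """Return files directly inside `directory` last modified before the cutoff."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    return sorted(
        path for path in directory.iterdir()
        if path.is_file() and path.stat().st_mtime < cutoff
    )


def clean_stale_files(directory: Path, max_age_days: float,
                      dry_run: bool = True) -> List[Path]:
    """Delete stale files, or only report them when dry_run is True."""
    stale = find_stale_files(directory, max_age_days)
    for path in stale:
        if dry_run:
            print(f"would delete {path}")
        else:
            path.unlink()
            print(f"deleted {path}")
    return stale
```

A call such as `clean_stale_files(Path("/var/tmp/myapp"), max_age_days=7)` (a made-up path) lists what would be removed; passing `dry_run=False` performs the deletion. Once the step is a script, it can be scheduled or triggered by an alert instead of being typed by hand.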

3. Design Services for No Toil
The ultimate goal is to design systems that require minimal human intervention in the first place. This means building services to be inherently more robust, observable, and self-healing. A mature system should require human intervention only for genuinely novel problems.
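The idea of reserving humans for novel problems can be sketched as a small dispatcher: known failure signatures get an automated fix, and only unknown or stubborn failures page a person. This is a simplified illustration under stated assumptions. The `handle_failure` function, the failure signatures, and the paging callback are hypothetical, and real self-healing is usually built into the platform rather than into one function.

```python
from typing import Callable, Dict, List

Remediation = Callable[[], bool]  # returns True when the fix worked


def handle_failure(signature: str,
                   known_fixes: Dict[str, Remediation],
                   page_human: Callable[[str], None],
                   max_attempts: int = 3) -> str:
    """Try an automated fix for known failures; escalate everything else."""
    fix = known_fixes.get(signature)
    if fix is None:
        # Novel problem: this is where human judgment is actually needed.
        page_human(f"novel failure: {signature}")
        return "escalated"
    for _ in range(max_attempts):
        if fix():
            return "self-healed"
    # The known fix did not work, so a human should take a look.
    page_human(f"automated fix failed for: {signature}")
    return "escalated"


# Assumed example: one known failure with a fix that always succeeds.
pages: List[str] = []
fixes = {"worker_crashed": lambda: True}

print(handle_failure("worker_crashed", fixes, pages.append))   # self-healed
print(handle_failure("disk_corruption", fixes, pages.append))  # escalated
print(pages)  # ['novel failure: disk_corruption']
```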
From Existential Threat to Competitive Edge
Toil is a red flag. In small doses it is a necessary evil and generally unavoidable, but it becomes an existential threat when it dominates the workweek. By calling it out, measuring it, and prioritising its elimination, SREs ensure they spend their time where they deliver the most value: on long-term reliability, scalability, and innovation.