AI SRE vs AI Platform Eng

Tim Nichols

CEO/Founder

2025-02-28T06:40:54.941Z

"Offense sells tickets, but defense wins championships," Bear Bryant

"The future is already here — it's just not very evenly distributed," William Gibson

We get asked if we’re building an AI SRE, which seems to be a tool that:

  1. Ingests a firehose of observability data
  2. Identifies the root cause of any current or developing incidents
  3. Recommends solutions and accelerates time to recovery

This seems like a useful tool but we’re not building that. Here’s why:

Platform Engineering 101

In the 2010s, running code in the cloud was really hard. A lot of tooling and concepts descended from on high from FAANG, and most engineering teams weren’t ready for it:

  • Cloud → redefined scale
  • Kubernetes → managing scale was incredibly complex
  • Microservices + Squads → knowledge, ownership and responsibility were distributed outside of a central DevOps team

Smart teams took a look at this complexity and realized that the traditional DevOps model couldn’t scale. Instead they needed to build automations, golden paths and self-serve tooling so that everyone could work with the cloud.

Spinnaker was Dylan going Electric. In 2022, Charity Majors christened this movement Platform Engineering, and now in 2025 Google says Platform Engineering is mandatory.

No more Incidents

Proposed:

Platform Engineering has significantly reduced the rate of incidents, and changed the role of SREs.

Hypotheses:

  1. Most incidents are triggered by rollouts
  2. Rollouts are much safer and routine due to modern Platform Engineering
    • High quality, representative development environments
    • Linting, testing, smoke tests and policy checks for every PR
    • Templates for Cloud Resources
    • Blue/green (B/G) deployments and Automated Rollbacks
    • Smarter, standardized observability
    • IDPs and SLOs have solved most of the ownership and knowledge problems
  3. Incidents are less frequent. When they happen, they are more easily classified and understood because of standard open source projects and comparable cloud products.
  4. SREs are becoming coaches or architects. They make sure an organization knows how to manage SLOs, Observability and Incidents.
  5. Staffing Ratios are changing to reflect this. We’re seeing 1 SRE for every 10 Platform Engineers and every 100 Feature Engineers. Budgets follow similar ratios.
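
To make hypothesis 2 a little more concrete, here is a minimal sketch of the kind of check that sits behind a blue/green rollout with automated rollback. It is an illustration only: the names (RolloutWindow, rollout_decision) and every threshold are assumptions invented for this example, not any particular tool's API or defaults.

```python
from dataclasses import dataclass


@dataclass
class RolloutWindow:
    """Traffic observed for one deployment color during a rollout (hypothetical type)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # A window with no traffic is treated as having no observed errors.
        return self.errors / self.requests if self.requests else 0.0


def rollout_decision(blue: RolloutWindow, green: RolloutWindow,
                     slo_max_error_rate: float = 0.01,
                     max_regression: float = 0.005,
                     min_requests: int = 100) -> str:
    """Return "wait", "rollback" or "promote" for the green (new) version.

    All thresholds are illustrative assumptions; a real platform would take
    them from the service's SLO.
    """
    if green.requests < min_requests:
        return "wait"      # not enough traffic to judge the new version yet
    if green.error_rate > slo_max_error_rate:
        return "rollback"  # new version breaches the SLO outright
    if green.error_rate - blue.error_rate > max_regression:
        return "rollback"  # within SLO, but clearly worse than the old version
    return "promote"


if __name__ == "__main__":
    blue = RolloutWindow(requests=10_000, errors=20)
    green = RolloutWindow(requests=1_000, errors=30)
    print(rollout_decision(blue, green))
```

Once a check like this runs on every rollout, the decision to roll back no longer depends on someone noticing a dashboard, which is one way a routine release stops turning into an incident.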

tldr - we’ve learned how to build platforms that make the cloud (or your data center) safe and accessible. SREs are still important but they are now one of many critical roles supporting development and customer experience.

After Incidents, comes Toil

So why do engineers spend only 16% of their time coding applications?

Well, your platform team has a lot of work to do:

  • Maintaining those Developer Environments
  • Writing & Updating linting, testing, smoke-test and policy systems
  • Managing Cloud Resources and Building Abstractions
  • Running B/G deployments and Automated Rollbacks
  • Updating instrumentation and log management
  • Updating the metadata model and SLOs underneath your IDP

Add in migrations, refactors, and GPUs and your platform team is incredibly busy. But they also need to give their stakeholder engineers the self-serve tools to do things like

  • Access cloud resources and use your abstractions
  • Debug, mutate and fork abstractions as needed
  • Release new code, and keep it running
  • Maintain documentation, labels, hygiene, etc.
  • FinOps
  • Compliance

In short, platform engineering has made the cloud productive, safe and accessible, but these capabilities don't come for free. Copilot, Cursor, etc. can’t help you with these tasks because these tools don’t have visibility into observability and orchestration. That’s why you keep hiring Platform Engineers.

Don’t build an AI SRE, build an AI Platform Engineer

So that’s what we’re building … an AI agent that helps you build, maintain and protect the things you do in the cloud.

Flightcrew has many similarities with an AI SRE … we ingest observability data, traverse graphs, classify issues and recommend fixes.

The difference is that we’re not focused on playing whack-a-mole with incidents … we’re playing tower defense by generating code/IAC for reliable, efficient and compliant infrastructure.

Today Flightcrew is performing tasks like:

  • Refactoring hundreds of lines of IAC for a legal tech company
  • Optimizing Kubernetes resources for India’s largest delivery startup
  • Tuning autoscaling, networking and database config at a major digital education company

When we earn it, we’ll call Flightcrew an AI Platform Engineer. Until then, we'll simply say we're building Flightcrew.

If this resonates we’d love to chat.

And if you are building an AI SRE or Codegen tool we’d love to compare notes and integrate. We share a common enemy and the future is bright.

Tim was a Product Manager on Google Kubernetes Engine and led Machine Learning teams at Spotify before starting Flightcrew. He graduated from Stanford University and lives in New York City.
