How we built it: Agentic PRs

Emily Johnson
Head of Infra

Here’s how Flightcrew autonomously fixes code and cloud issues in real time.
Last month we shared that Flightcrew can now generate PRs to fix real-time issues in your cloud infrastructure and codebase.
An "obvious" feature like Agentic PRs can look simple, but it’s a ton of work to get it running in production. We’re sharing more on how we built this feature for the sake of transparency, and give some ideas for building your own AI Agents.
Our Path to Agentic PRs
We launched Flightcrew as a Dependabot for Cloud Infrastructure: it analyzed GitHub PRs for breaking/unsafe changes and generated PRs to fix issues in your live cloud environment.
To deliver this, we developed the ability to:
- Ingest and analyze observability data to evaluate infrastructure health
- Understand the relative and absolute importance of infrastructure issues
- Navigate codebases to identify relevant code and config
- Generate a PR with accurate code changes, analysis and explanations
At launch, we used Dependabot preferences as the model for scheduling and prioritizing GitHub interactions; users declared where and when they wanted Flightcrew to fix reliability or cost issues. While we used a ton of LLMs, you could say that Flightcrew was a workflow rather than a true Agent.
To generate ~real-time PRs, we needed to cross the blurry Rubicon that separates Workflows from Agents, while still relying on deterministic components for core parts of our architecture.
We also couldn’t compromise on core requirements around latency and accuracy. Flightcrew insights don’t have to be immediate (<1s), but they do have to be delivered quickly enough to fit modern developer workflows (<60s). Recommendations don’t have to be scientifically precise on impact, but they do have to be incredibly accurate when it comes to code structure.
Structuring Stochastic Problems
The key to Agentic PRs was breaking down an open-ended problem into well-ordered, verifiable components. So we rewrote our backend around four independently verifiable services.
Assessor
The heart of Agentic PRs is a new service we call the assessor. Its job is to classify problems and delegate the hard work of finding a solution; a minimal sketch follows the list. This service …
- Continuously checks entities for problems against a flexible policy document
- Delegates problems to recommenders based on type, difficulty and priority
- Evaluates potential solutions for relative and absolute value
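To make this concrete, here’s a minimal sketch of that loop. The names (`Problem`, `Policy`, `score`) are illustrative stand-ins, not Flightcrew’s real types, and the production service is far more involved:

```python
from dataclasses import dataclass

# Illustrative stand-ins; not Flightcrew's actual types or API.
@dataclass
class Problem:
    entity: str
    kind: str        # "cost", "reliability", "compliance", ...
    difficulty: int  # 1 = simple config flip ... 5 = IaC refactor
    priority: int

class Policy:
    """A flexible policy document: rules that flag problems on entities."""
    def __init__(self, rules, threshold=0.5):
        self.rules = rules          # each rule: entity -> Problem | None
        self.threshold = threshold

    def check(self, entity):
        return [p for rule in self.rules if (p := rule(entity)) is not None]

def score(problem, solution):
    # Placeholder valuation: weigh estimated impact by priority.
    return solution.get("impact", 0.0) * problem.priority

def assess(entities, policy, recommenders):
    """Classify problems, delegate by type and difficulty, evaluate solutions."""
    for entity in entities:
        for problem in policy.check(entity):
            pool = recommenders[problem.kind]                   # delegate by type...
            rec = pool[min(problem.difficulty, len(pool)) - 1]  # ...then by difficulty
            solution = rec.solve(problem)
            if score(problem, solution) > policy.threshold:     # absolute value check
                yield problem, solution                         # hand off downstream
```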
Recommender(s)
These are reasoning engines which attempt to solve problems and explain their work. Flightcrew deploys different recommender systems based on the type and difficulty of a task. For example, most compliance problems only need a simple rule-based recommender to flip an incorrect config. Refactoring IaC is much more difficult and requires smarter, more expensive recommenders.
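Continuing the sketch, two recommender flavors might look like the following; `call_llm` is a placeholder for whatever hosted-model client you use, not a Flightcrew API:

```python
class RuleBasedRecommender:
    """Cheap and deterministic: enough for most compliance problems."""
    FIXES = {"compliance": {"public_access": False}}  # hypothetical rule table

    def solve(self, problem):
        return {"patch": self.FIXES.get(problem.kind, {}),
                "impact": 1.0,
                "explanation": f"Policy rule flipped an incorrect config on {problem.entity}."}

class LLMRecommender:
    """Smarter and more expensive: reserved for hard tasks like IaC refactors."""
    def __init__(self, call_llm):
        self.call_llm = call_llm  # injected client for a hosted model

    def solve(self, problem):
        prompt = (f"Propose a fix for a {problem.kind} issue on {problem.entity}, "
                  f"and explain your reasoning.")
        return {"patch": self.call_llm(prompt),
                "impact": 0.8,
                "explanation": "Model-generated refactor; reasoning in the PR body."}
```

Routing cheap problems away from expensive recommenders is also what keeps the latency and cost numbers discussed later in check.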
Coding Agent
The coding agent navigates codebases and translates Flightcrew’s favored solution into a format that is both usable (code) and human-interpretable.
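In the same spirit, a toy coding agent could consume a recommender’s solution and emit both the machine-usable change and the human-readable story. The `locate`/`evidence` calls are hypothetical knowledge-base lookups, sketched in the next section:

```python
class CodingAgent:
    """Turns an approved solution into a reviewable patch plus explanation."""
    def __init__(self, kb):
        self.kb = kb  # knowledge base (see below)

    def to_pr(self, problem, solution):
        path = self.kb.locate(problem.entity)    # navigate to the relevant file
        return {
            "files": {path: solution["patch"]},  # usable: the code change itself
            "body": (f"## Why\n{solution['explanation']}\n\n"  # interpretable
                     f"## Evidence\n{self.kb.evidence(problem.entity)}"),
        }
```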
Knowledge Base
The coding agent (and many other services) relies on insights derived from code, cloud and observability data; we call this the knowledge base. Our coding agent uses the KB to navigate a GitHub repo, while a recommender model might look up infrastructure dependencies or a precomputed metric.
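A deliberately simplified view of what such a knowledge base could expose, assuming three precomputed indexes (our real KB is richer than this):

```python
class KnowledgeBase:
    """Indexed insights derived from code, cloud and observability data."""
    def __init__(self, code_index, infra_graph, metrics):
        self.code_index = code_index    # entity -> file path in the repo
        self.infra_graph = infra_graph  # entity -> downstream dependencies
        self.metrics = metrics          # (entity, metric_name) -> value

    def locate(self, entity):           # used by the coding agent to navigate
        return self.code_index[entity]

    def dependencies(self, entity):     # used by recommenders
        return self.infra_graph.get(entity, [])

    def evidence(self, entity):         # precomputed metrics for explanations
        return {m: v for (e, m), v in self.metrics.items() if e == entity}
```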
These services are called independently based on state and workflow. If an issue emerges in live infrastructure (ex: a vendor goes down), the assessor recognizes the problem and kicks off a series of recommendations. To prevent a bad deploy, our coding agent will notice a change to infrastructure and tag in the assessor to evaluate risk.
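Wired together, those two trigger paths might read like this; `open_pull_request` and `flag_risk` are hypothetical helpers, not real API calls:

```python
def on_infra_event(event, policy, recommenders, agent):
    """Live-incident path: assessor -> recommenders -> coding agent -> PR."""
    for problem, solution in assess([event.entity], policy, recommenders):
        open_pull_request(agent.to_pr(problem, solution))  # hypothetical helper

def on_code_change(changed_entities, policy, recommenders):
    """Pre-deploy path: the coding agent noticed an infra change in a diff
    and tags in the assessor to evaluate risk before the deploy lands."""
    for problem, _solution in assess(changed_entities, policy, recommenders):
        flag_risk(problem)  # hypothetical: leave a review comment on the PR
```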
Reflections
Models drive your KPIs
This is common knowledge in 2025, so we’ll simply repeat that when/where/how you call hosted LLMs will make or break your SLOs (and budget). We can generate an end-to-end PR in ~1 minute; the longest steps are when our Coding Agent and Recommenders call LLMs.
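One familiar pattern here (a sketch, not our exact plumbing): put a hard budget on every hosted-model call so a single slow step can’t blow the <60s target.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_llm_pool = ThreadPoolExecutor(max_workers=4)

def call_with_budget(call_llm, prompt, budget_s=20.0, fallback=None):
    """Bound a hosted-LLM call so one slow step can't blow the end-to-end SLO."""
    future = _llm_pool.submit(call_llm, prompt)
    try:
        return future.result(timeout=budget_s)
    except FutureTimeout:
        # The worker thread may still be running; we just stop waiting and
        # degrade: fall back to a cheaper recommender, retry later, or skip.
        return fallback
```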
You’ll also notice model-provider updates, because your tests will magically fail.
Dynamic workflows require a different user experience
With scheduled PRs, context is simple and usually public to an organization. If you declare "Each month, generate a pod sizing PR for all workloads that cost more than $X /month", then the logic, priority and impact are self-evident to most members of an engineering team.
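For illustration, a scheduled preference like that could be declared as simply as the following (a made-up shape, not Flightcrew’s actual config format):

```python
COST_THRESHOLD_USD = 100  # "$X": whatever your team chooses

# Hypothetical declaration of the scheduled example above.
SCHEDULED_POLICY = {
    "schedule": "monthly",
    "action": "pod_sizing_pr",
    "target": {"monthly_cost_usd_gt": COST_THRESHOLD_USD},
}
```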
Agentic PRs arise from dynamic context, so we have to explain what’s new in your code and infrastructure and why Flightcrew is acting on it. Crucially … what did we see that might no longer be the case due to a correction in traffic or code? Context is key.
Dogfooding is your friend
... as it provides reinforcement data and an accurate measure of when you’re ready to release. We released Agentic PRs to customers when we felt comfortable turning on auto-merge in our own environments, with an acceptance rate over 90%.
Try out Agentic PRs
Workflows, Agents ... it's all hard. We're very proud of Agentic PRs and hope you'll use them to prevent bad deploys, respond to roll-forward scenarios or just automate the worst parts of your day.

Emily Johnson
Head of Infra
Emily leads infrastructure at Flightcrew after leading engineering teams at VMware and Pivotal.