Behind the Scenes: Building a GitHub Copilot Extension


Chris Lee

AI Lead

February 21, 2025

tl;dr - We built a Copilot Extension so that everything Flightcrew knows about scaling, maintaining, and securing your cloud is accessible via GitHub Copilot. Here are our learnings from building and integrating with AI products.

Why build a Copilot Extension

We're heavy users of Copilot, so making Flightcrew's intelligence available in Copilot was an obvious move.

From a product perspective…

  • We believe in shifting left: many of the high-volume, low-upside chores (toil) we solve for don't get done unless you solve last-mile problems through integration.
  • Accessing Flightcrew via Copilot is a much better experience than through a CLI. Copilot is easier to learn and opens the door to open-ended problem solving.
  • Customers loved our GitHub app for high-volume, low-difficulty tasks (ex: pod sizing) - but wanted more interactivity for medium-difficulty tasks (ex: setting up a new Kubernetes component). Copilot is a great fit for ambiguous, context-heavy infrastructure tasks.
  • Infrastructure tasks are context-heavy and need a lot of last-mile assistance … placing our insights within the IDE, with help from Copilot's "Apply in Editor" feature, bridges the gap.

Foundations for understanding Cloud Infrastructure

Our Copilot Extension helps engineers understand how a configuration change will impact the state (think SLOs) and posture (think policy) of their cloud environment. The extension builds on the work we've done processing high volumes of cloud and observability data.

  • We've built a lightweight agent that quietly and securely ingests data from a customer's environment.
  • We map files and configurations (ex: a Kubernetes deployment) to the entities they control in the cloud (ex: pods).
  • This semantic data allows us to roll up observability data so that we can reason about similar objects together. For example: Pod X is running application Y, which reads from service Z. (A sketch of this mapping and rollup follows this list.)
  • We've spent a lot of time on data sampling and representativeness. Spiky data could indicate a real problem that needs to be addressed … or could be a data issue that is best smoothed over. SREs spend a lot of time on representativeness, and we think the poor adoption of tools like VPA is because its metric engine isn't sophisticated or transparent enough for production workloads.
  • We've done a lot of work to make sure we only surface high-conviction recommendations with no hallucinations. We use real data for high-risk information (the numbers, the configs, the impact), but allow the (minimized) risk of hallucinations for the 'wrappings' of recommendations that aren't in the critical path (ex: a PR description).
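
To make the mapping-and-rollup idea concrete, here's a minimal sketch in Go. The types and the percentile-based aggregation are illustrative assumptions for this post, not Flightcrew's actual schema or metric engine:

```go
package topology

import "sort"

// ConfigFile is a source-of-truth file in the repo (ex: a Kubernetes deployment).
type ConfigFile struct {
	Path string // ex: "k8s/payments/deployment.yaml"
	Kind string // ex: "Deployment"
}

// Entity is a live object in the cloud controlled by a config (ex: a pod),
// annotated with the observability samples ingested for it.
type Entity struct {
	ID        string
	ReadsFrom []string  // downstream dependencies, ex: services it reads from
	CPUCores  []float64 // sampled CPU usage, in cores
}

// RollUp aggregates usage across all entities a single config controls,
// using a high percentile instead of the max so that one spiky sample
// (often a data artifact) doesn't drive the recommendation.
func RollUp(entities []Entity, pct float64) float64 {
	var samples []float64
	for _, e := range entities {
		samples = append(samples, e.CPUCores...)
	}
	if len(samples) == 0 {
		return 0
	}
	sort.Float64s(samples)
	return samples[int(pct*float64(len(samples)-1))]
}
```

Calling RollUp(pods, 0.95), for instance, sizes against the 95th percentile of usage rather than a single spike.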

Demoing AI is easy, but getting a customer to trust it in production environments is a lot of work!

Building a Copilot Extension

Agent vs Skillset

The first decision when building a Copilot Extension is choosing between an Agent and a Skillset.

  • Skillsets are for well-defined tasks, each handled by a single API call (with a maximum of five per extension).
  • Agents are for complex, open-ended tasks.

We knew we needed to build an agent so that we could act on more complicated tasks, like refactoring a configuration file from YAML to CDK or letting a user dig deeper into a recommendation. If you need help making the decision, check out Anthropic's framework on Agents vs Workflows - it roughly maps to Agents vs Skillsets.

Architecture

[Figure: Copilot Extension Architecture]

Copilot extensions are a way for you to inject custom logic and intelligence for specific queries. Copilot handles a lot of product and technical tasks for you:

  • Categorizing what the user is asking
  • Translating YAML into the user's source file's language
  • Providing explanation on top of what we've already recommended – allowing the user to dig in with sequential queries

Copilot provides a framework / API with good documentation. However, some pieces were missing (e.g. the streaming headers), so we ended up using an open-source library.
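
At its core, an agent is just an HTTP endpoint that receives the chat payload and streams back chat-completion chunks over server-sent events. Here's a minimal hand-rolled sketch in Go; the chunk shape follows the OpenAI-style format Copilot uses, but check the docs or an SDK for the exact contract:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Message mirrors the chat-completions message shape Copilot sends.
type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// AgentRequest is the payload Copilot POSTs to the agent: the full chat
// context for the session, resent with every new message.
type AgentRequest struct {
	Messages []Message `json:"messages"`
}

func agentHandler(w http.ResponseWriter, r *http.Request) {
	var req AgentRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	// Copilot expects the reply as a stream of chat-completion chunks
	// over server-sent events.
	w.Header().Set("Content-Type", "text/event-stream")

	chunk := map[string]any{
		"choices": []map[string]any{
			{"delta": map[string]string{"role": "assistant", "content": "Hello from Flightcrew!"}},
		},
	}
	b, _ := json.Marshal(chunk)
	fmt.Fprintf(w, "data: %s\n\n", b)
	fmt.Fprint(w, "data: [DONE]\n\n")
	if f, ok := w.(http.Flusher); ok {
		f.Flush()
	}
}

func main() {
	http.HandleFunc("/agent", agentHandler)
	http.ListenAndServe(":8080", nil)
}
```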

Authorization

The first entry point for the user is authorizing or installing the GitHub app.

We already have a GitHub app for automating Flightcrew PRs. However, we chose to create a new GitHub app to isolate the permissions a user would agree to. We've found that codebase integrations (vs. platform or observability integrations) are the main hangup for security and compliance approvals. Compare the permissions of our GitHub app with those of our Copilot extension.

The Copilot extension requires user authorization so that the extension has permission to use the GitHub token sent through the chat context.

The Copilot extension requires app installation to grant it permission to view the files in the associated repository. Without this permission you can still use the extension, but all files sent to it are redacted.

After that, we just needed to connect the GitHub user with their Flightcrew account, which meant implementing an OAuth and installation flow in the Flightcrew app. When parsing the chat message context, we go from the GitHub token to the GitHub user to the Flightcrew user.
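
In code, that chain looks roughly like the sketch below. Copilot forwards the user's token in the X-GitHub-Token header; lookupFlightcrewUser is a hypothetical stand-in for the account mapping created during our OAuth flow:

```go
package auth

import (
	"encoding/json"
	"net/http"
)

// resolveUser walks the chain: GitHub token -> GitHub user -> Flightcrew user.
func resolveUser(r *http.Request) (string, error) {
	// Copilot forwards the user's GitHub token with each chat request.
	token := r.Header.Get("X-GitHub-Token")

	req, err := http.NewRequest("GET", "https://api.github.com/user", nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var ghUser struct {
		Login string `json:"login"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&ghUser); err != nil {
		return "", err
	}

	// GitHub login -> Flightcrew account, established during the OAuth flow.
	return lookupFlightcrewUser(ghUser.Login)
}

// lookupFlightcrewUser is hypothetical: the real service consults the
// mapping created when the user linked their Flightcrew account.
func lookupFlightcrewUser(login string) (string, error) {
	return "flightcrew-user-for-" + login, nil
}
```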

Prompting

Copilot routes specific queries to your extension service when your extension's name is invoked.

[Figure: Copilot Extension Prompting]

Copilot sends the entire chat context (both the user's queries with references and the extension's responses) every time a new message is sent in the same session. We found this necessary for the context-heavy, multi-query sessions common in cloud infrastructure. Users can refer back to previous messages to dig in further.

We've built our prompts to respond to keywords (ex: High Priority Reliability issues) that are common to our users and their context. We route everything through Copilot (like a fancy switch statement) to categorize what the user is trying to do.

Using roles, we can shape the flow of the conversation. The system/developer role is how we provide guidelines for how the system responds.

When a user messages our extension, the first thing we do is ask Copilot to categorize the query. For example, we take the user's last message "@flightcrewhq-copilot give me a recommendation!" and append a system/developer prompt "Categorize the user's query into HIGHEST_PRIORITY, RECOMMENDATION, RESILIENCY, EXPLANATION, …" before deciding what to do.

Based on the response, we route to functions that can then choose to use the entire chat context or send more prompts.
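
Here's a sketch of that categorize-then-route step in Go. The category names come from the prompt above, and calling Copilot's chat-completions API with the user's forwarded token is the standard pattern for agents; the route handlers are hypothetical stand-ins:

```go
package routing

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// classify asks Copilot's LLM to bucket the user's last message into one of
// our intents before we decide what to do with it.
func classify(token, lastUserMessage string) (string, error) {
	body, _ := json.Marshal(map[string]any{
		"messages": []map[string]string{
			{"role": "system", "content": "Categorize the user's query into HIGHEST_PRIORITY, RECOMMENDATION, RESILIENCY, EXPLANATION. Reply with the category only."},
			{"role": "user", "content": lastUserMessage},
		},
	})
	req, err := http.NewRequest("POST", "https://api.githubcopilot.com/chat/completions", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var out struct {
		Choices []struct {
			Message struct {
				Content string `json:"content"`
			} `json:"message"`
		} `json:"choices"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	if len(out.Choices) == 0 {
		return "", fmt.Errorf("no choices in response")
	}
	return strings.TrimSpace(out.Choices[0].Message.Content), nil
}

// route is the "fancy switch statement"; the strings stand in for calls to
// the real handlers.
func route(category string) string {
	switch category {
	case "HIGHEST_PRIORITY":
		return "fetch and format the top action items"
	case "RECOMMENDATION":
		return "run the recommendation prompt chain"
	case "EXPLANATION":
		return "resend the full chat context with extra docs"
	default:
		return "fall back to a generic answer"
	}
}
```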

For example:

  • For the highest-priority action items, we fetch the relevant data and format it.
  • For recommendations, we use a series of new prompts to handle translation from JSON/YAML into the user's target file's language.
  • For explanations, we provide the entire chat context to Copilot along with a system prompt containing additional documentation and details, so the user can dig into a previous recommendation's output (sketched below).
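
The explanation path, for instance, can be as simple as prepending an augmented system prompt to the session history before calling the LLM again. In this sketch, buildDocsFor and callCopilot are hypothetical helpers (the latter would wrap the chat-completions call shown earlier):

```go
package routing

// Message mirrors the chat message shape used throughout the extension.
type Message struct {
	Role    string
	Content string
}

// buildDocsFor and callCopilot are hypothetical stand-ins: the first looks
// up documentation for the recommendation under discussion, the second
// wraps the chat-completions call from the earlier sketch.
func buildDocsFor(history []Message) string                    { return "..." }
func callCopilot(token string, msgs []Message) (string, error) { return "", nil }

// explain answers follow-up questions about a previous recommendation by
// resending the whole session with an augmented system prompt on top.
func explain(token string, history []Message) (string, error) {
	system := Message{
		Role:    "system",
		Content: "Use these docs to explain the prior recommendation:\n" + buildDocsFor(history),
	}
	return callCopilot(token, append([]Message{system}, history...))
}
```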

Lilian Weng (currently co-founder of Thinking Machines Lab, previously at OpenAI) wrote a great blog post for understanding what LLMs can and cannot do with tips for how to structure prompts for them.

Next Steps

Though it's a limited release, we're pretty happy with our Copilot Extension. Shifting left into the IDE puts Flightcrew in the hands of every engineer and removes the barrier to entry of learning a new app.

What’s next for us?

  • Surfacing more types of problems
  • Mapping files directly from the extension
  • Assigning workload metadata to make our recommendations smarter
  • Generative IaC workflows (ex: new service creation or refactoring)

Our Copilot Extension is listed in the GitHub Marketplace - note that you still need to register an account on flightcrew.io and integrate Flightcrew with your cloud.


Chris Lee

AI Lead

Chris led feature engineering and MLOps systems at YouTube before joining Flightcrew. She graduated from Carnegie Mellon University and devours burritos in the Bay Area. Follow on Bluesky.
