Build Your Own Validation Pipeline to Ship AI Agents with Confidence


Sam Farid

CTO/Founder

August 27, 2025

Validation platforms help companies building AI agents evaluate performance and sleep better at night. We built our own internal solution that focuses on what actually matters: catching real issues before customers do. This post shares our architectural insights and code snippets so you can build a validation pipeline that actually works.

The State of AI Validation

Tons of AI Agent validation platforms are emerging to solve the problem of testing unpredictable AI features. They're offering:

  • Test orchestration across multiple models
  • Performance benchmarking and regression testing
  • A/B testing between prompts and models

For teams that aren't specializing in AI, or that need enterprise-scale control and customization, these vendors can make sense.

But for an AI-native team, we believe you can and should build your own validation pipeline - and it’s surprisingly easy to get started.

This post walks through how you can build an MVP pipeline in 30 minutes that delivers real insights: meaningful benchmarking, drift detection, and practical A/B testing capabilities.

Why Validate Your AI Features?

LLMs are non-deterministic black boxes. Even with 'temperature' set to 0:

  • Model updates can silently change or degrade performance
  • Prompt changes have unexpected side effects
  • You only discover new edge cases when customers complain (not ideal)

But by combining deterministic checks with strategic LLM validation, you create a surprisingly robust safety net.

Our Approach: 30-minute Setup

We've refined this down to three components:

  1. A test suite that validates LLM calls using both deterministic and LLM-based checks
  2. Automated nightly runs via GitHub Actions
  3. Slack notifications for immediate visibility into regressions

The Foundation: Deterministic Agent Architecture

The key insight we (and others) have discovered is that AI Agents should be designed using small, targeted prompts that address individual parts of the interpretation and response process.

While this might seem to limit the LLM's creativity or add more network calls, we've found repeatedly that this approach:

  • Forces necessary determinism into agents for reliable responses
  • Prevents giving them enough rope to hang themselves
  • Makes validation actually meaningful and actionable
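To make that concrete, here is a minimal sketch of what the decomposition can look like. The step names and prompts are illustrative (not our production agent), and the llmCall type stands in for whatever client wrapper you already have, like the callLLM placeholder used in the test code below:

// agent.go (illustrative sketch, not our production agent)
package agent

import "fmt"

// llmCall stands in for your existing LLM client wrapper.
type llmCall func(prompt string) string

// HandleFeedback chains three small, single-purpose prompts instead of one
// large free-form prompt. Each step has a narrow contract that can be
// validated on its own in the test suite below.
func HandleFeedback(llm llmCall, feedback string) string {
    // Step 1: classify sentiment into a fixed label set.
    sentiment := llm(fmt.Sprintf(
        "Classify this feedback as POSITIVE, NEGATIVE, or MIXED. Reply with one word.\n\n%s", feedback))

    // Step 2: extract concern categories as a comma-separated list.
    concerns := llm(fmt.Sprintf(
        "List the concern categories in this feedback (e.g. shipping, pricing), comma-separated.\n\n%s", feedback))

    // Step 3: draft the reply, constrained by the structured outputs above.
    return llm(fmt.Sprintf(
        "Write a short reply to %s feedback that addresses these concerns: %s. Include relevant documentation links.",
        sentiment, concerns))
}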

Step 1: Create a Test Suite

Start with a "golden set" of test cases representing real usage patterns, especially previous failure modes.

Below are two examples in Go that validate customer feedback interpretation. The first uses a deterministic check, and the second uses an LLM to validate the output quality—a technique we've found incredibly effective.

// tests/ai/validation_test.go
package ai

import (
    "fmt"
    "strings"
    "testing"
)

func TestAIFeatureValidation(t *testing.T) {
    testCases := []struct {
        name     string
        input    string
        validate func(output string) error
    }{
        {
            name:  "extracts_structured_data_correctly",
            input: "Process this customer feedback: 'Great product but shipping was slow'",
            validate: func(output string) error {
                // Deterministic check - catches structural issues
                if !strings.Contains(output, "shipping") {
                    return fmt.Errorf("missing shipping concern")
                }
                return nil
            },
        },
        {
            name:  "handles_response_generation_appropriately",
            input: "Given negative customer feedback, craft a response to send the relevant documentation links",
            validate: func(output string) error {
                // LLM-based validation - catches quality issues
                return validateWithLLM(output, `
                    Check if this generated response:
                    1. Explicitly apologizes to the customer
                    2. Documentation matches the input concern category
                    
                    Respond with only "PASS" or "FAIL: <reason>"
                `)
            },
        },
    }
    
    for _, tc := range testCases {
        t.Run(tc.name, func(t *testing.T) {
            output := callYourAIFeature(tc.input)
            if err := tc.validate(output); err != nil {
                t.Errorf("Validation failed: %v", err)
            }
        })
    }
}

func validateWithLLM(output, validationPrompt string) error {
    result := callLLM(fmt.Sprintf("%s\n\nOutput to validate:\n%s",
        validationPrompt, output))

    if strings.HasPrefix(result, "PASS") {
        return nil
    }
    return fmt.Errorf("%s", result)
}

Our test design principles:

  1. Organic Test Growth - Every issue, whether discovered internally or reported by customers, becomes a test case. The suite evolves with your product's real failure modes.
  2. Hybrid Validation Approach
    • Deterministic tests (temperature=0): Check for specific outputs, formats, or behaviors. Best for objectively correct answers and structured JSON outputs (see the sketch after this list).
    • LLM validation: Validate subjective quality, safety, or correctness. Perfect when the answer is nuanced but certain outcomes are clearly wrong. (For deeper exploration, see LLMs as judges and LLMs as juries)
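For the structured-JSON case, the deterministic check can be as simple as unmarshalling the output into a struct and asserting on its fields. A minimal sketch (the struct and field names are hypothetical) that plugs into the validate function above:

// Requires "encoding/json" in the test file's imports.
func validateFeedbackJSON(output string) error {
    var parsed struct {
        Sentiment string   `json:"sentiment"`
        Concerns  []string `json:"concerns"`
    }
    if err := json.Unmarshal([]byte(output), &parsed); err != nil {
        return fmt.Errorf("output is not valid JSON: %w", err)
    }
    if parsed.Sentiment == "" || len(parsed.Concerns) == 0 {
        return fmt.Errorf("missing sentiment or concerns: %+v", parsed)
    }
    return nil
}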

Step 2: Set Up GitHub Actions

First, add a Makefile target for clean test execution:

.PHONY: test-ai
test-ai:
  go test -parallel=1 -timeout 30m ./tests/ai/... | tee test-output.log

Note: This setup naturally enables A/B testing by running models head-to-head across representative scenarios.
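One lightweight way to wire that up (a sketch; the MODEL_NAME variable and the model-aware wrapper are assumptions, not part of the snippets above) is to parameterize the suite with an environment variable and run the identical golden set once per model:

// In the test file: pick the model from the environment so CI can run the
// same golden set against two models and compare failure counts.
// Requires "os" in the test file's imports.
func modelUnderTest() string {
    if m := os.Getenv("MODEL_NAME"); m != "" {
        return m
    }
    return "default-model" // whatever your agent uses in production
}

Then thread the model through your wrappers (e.g. a model-aware variant of callYourAIFeature) and run 'MODEL_NAME=model-a make test-ai' and 'MODEL_NAME=model-b make test-ai' as separate jobs, comparing the pass counts.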

To create the GitHub action, first add these to your GitHub repository secrets:

  • OPENAI_API_KEY: API key for LLM calls
  • SLACK_WEBHOOK_URL: Slack incoming webhook to post results to

Then, create '.github/workflows/nightly-ai-tests.yaml':

name: Nightly AI Validation

on:
  schedule:
    # Runs at midnight UTC every night
    - cron: "0 0 * * *"
  workflow_dispatch: # Allow manual triggers

jobs:
  validate:
    runs-on: ubuntu-latest
    outputs:
      test_results: ${{ steps.parse_results.outputs.results }}
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      
      - uses: actions/setup-go@v5
        with:
          go-version-file: "go.mod"
      
      - name: Run validation tests
        run: make test-ai
      
      - name: Parse test results
        id: parse_results
        if: always()
        run: |
          # Extract test summary from output
          RESULTS=$(grep -E "(PASS|FAIL)" test-output.log | tail -1)
          echo "results=$RESULTS" >> $GITHUB_OUTPUT

  notify-slack:
    needs: validate
    if: always() && contains(needs.validate.outputs.test_results, 'FAIL')
    runs-on: ubuntu-latest
    steps:
      - name: Send Slack notification
        uses: slackapi/slack-github-action@v2
        with:
          webhook: ${{ secrets.SLACK_WEBHOOK_URL }}
          webhook-type: incoming-webhook
          payload: |
            {
              "text": "🚨 AI Validation Failed",
              "blocks": [{
                "type": "section",
                "text": {
                  "type": "mrkdwn",
                  "text": "*AI Validation Tests Failed*\n${{ needs.validate.outputs.test_results }}\n<${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Details>"
                }
              }]
            }

Our notification philosophy:

  • Alert on ANY single failure—over-notification in Slack beats discovering subtle regressions in production
  • Include specific test results in Slack (e.g. 99 passed, 1 failed), with detailed failure logs in GitHub

Key Takeaways

  1. Performance drift is inevitable and often subtle. A validation pipeline with fast feedback loops is essential. Whether you build or buy, having something beats having nothing. More on eval-driven development.
  2. Deterministic behavior should be the architectural goal. While it's tempting to give LLMs maximum flexibility to leverage their intelligence, constraining them through structured, targeted prompts dramatically reduces unintended behavior and performance degradation.
  3. AI validation is just testing with different tools. By treating LLM outputs as testable units and leveraging LLMs themselves as validators, you can build on existing CI/CD patterns rather than reinventing the wheel.

And if you're building AI-powered developer tools, we're always interested in chatting and sharing approaches. Reach out to us at hello@flightcrew.io.


Sam Farid

CTO/Founder

Before founding Flightcrew, Sam was a tech lead at Google, ensuring the integrity of YouTube viewcount and then advancing network throughput and isolation at Google Cloud Serverless. A Dartmouth College graduate, he began his career at Index (acquired by Stripe), where he wrote foundational infrastructure code that still powers Stripe servers. Find him on Bluesky or holosam.dev.
