Building a Zero-Maintenance Data Collector


Sam Farid

CTO/Founder

February 25, 2025

No engineer will trust your product if it can’t reliably integrate with their infrastructure. Building a robust and lightweight data collector is a major hurdle for many SaaS startups, and in this post we’re sharing our hard-earned lessons and solutions.

The Challenge

Collecting data from user-managed cloud environments is deeply complex, and it’s a single point of failure in your system.

Cloud providers (AWS, GCP, Azure) and observability solutions (Datadog, Prometheus, etc.) all have unique APIs, usage patterns, and security requirements. And every cloud setup has its own quirks and customizations that can break standard integrations.

And yet, data collection is table stakes for your product, and entirely thankless. The “perfect” agent goes unnoticed, while anything that goes wrong ripples through your whole stack and, even worse, bothers your users.

Given this, I wanted to share our end-to-end thought process of how to design and implement an agent that, ideally, no one will need to think about again (in the best way).

Integration Options

Depending on technical requirements, an integration can take many forms. I’ll mainly focus on options for Kubernetes since it tends to be one of the most complex platforms to read from, so the same options should cover nearly all other use cases as well.

1) Direct Instrumentation

Running an agent as a pod or DaemonSet to collect data directly from within a Kubernetes cluster is a great option because it works out of the box with any cluster. User-specific integrations can be bypassed by reading directly from the metrics-server or a kubelet endpoint on the cluster.

However, this option is restricted to use cases where you:

  • only need data from Kubernetes clusters and not other platforms
  • only need built-in metrics such as resource usage

In short, this is the golden path for Kubernetes-native information. Lower-level observability tools like Datadog's agent and Vantage's agent use this method.
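
For a sense of what this looks like in practice, here is a minimal sketch (not our production agent) of an in-cluster reader that pulls built-in resource metrics from the metrics-server via the metrics.k8s.io API with client-go:

package main

import (
    "context"
    "fmt"
    "log"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/rest"
    metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
)

func main() {
    // In-cluster config uses the service account mounted into the pod/DaemonSet,
    // so there is nothing cluster-specific to configure.
    cfg, err := rest.InClusterConfig()
    if err != nil {
        log.Fatal(err)
    }
    mc, err := metricsclient.NewForConfig(cfg)
    if err != nil {
        log.Fatal(err)
    }

    // Read built-in resource metrics for all pods from the metrics-server.
    podMetrics, err := mc.MetricsV1beta1().PodMetricses("").List(context.Background(), metav1.ListOptions{})
    if err != nil {
        log.Fatal(err)
    }
    for _, pm := range podMetrics.Items {
        for _, c := range pm.Containers {
            fmt.Printf("%s/%s %s: cpu=%s mem=%s\n",
                pm.Namespace, pm.Name, c.Name,
                c.Usage.Cpu().String(), c.Usage.Memory().String())
        }
    }
}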

2) Agentless

This option requires the user to create a service account with permissions that allow (often read-only) API access via a private key. A backend process polls the APIs to stay up to date.

This option has a number of benefits including:

  • There is nothing for users to install or maintain, and no resource usage on their side
  • You can upgrade the integration immediately with a backend change, with no old client interfaces to maintain

However, there are some snags:

  • Security-minded users may not be comfortable with uploading these keys externally.
  • Required data must be surfaced by an API, which differs per cloud provider.
  • Observability APIs also differ per provider. On top of that, a Prometheus instance may not be exposed externally at all, and exposing it just for this may pose a security risk.

Tools like Monte Carlo and Aikido both use this option: they have strong credibility with sales prospects, and the data they need is readily available via stable cloud APIs.
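
To make the polling model concrete, here is a minimal backend-side sketch using the AWS SDK for Go with the user-provisioned read-only credentials; the service and call are illustrative, and a real poller would paginate and fan out across regions and APIs:

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/ec2"
)

func main() {
    ctx := context.Background()

    // The read-only credentials the user provisioned are picked up from the
    // environment; a scheduled job re-polls to stay up to date.
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        log.Fatal(err)
    }
    client := ec2.NewFromConfig(cfg)

    out, err := client.DescribeInstances(ctx, &ec2.DescribeInstancesInput{})
    if err != nil {
        log.Fatal(err)
    }
    for _, reservation := range out.Reservations {
        for _, instance := range reservation.Instances {
            fmt.Println(*instance.InstanceId, instance.InstanceType)
        }
    }
}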

3) Hybrid Agent

What I’m calling Flightcrew’s “hybrid” option is an agent installed on a user’s cluster as in (1), but one that also calls the cloud and observability APIs from (2). The point of this hybrid approach is to support the broadest set of integrations while still passing strict security scrutiny.

By running within a cluster, the agent:

  • follows a simple and familiar helm install pattern that users expect from observability tools
  • makes security folks happy by keeping their access keys within their own cluster, and allows for a one-button uninstall with no possibility of leaks
  • is able to access cluster internals, such as an internal Prometheus endpoint with no extra setup steps

And by hitting observability APIs, the agent:

  • can be reused and installed on a variety of non-K8s platforms such as Serverless or VMs with the same code and interface
  • accesses cloud APIs to understand external information such as cluster autoscalers or logging data which may affect or inform on-cluster behavior
  • uses SQL-like queries to create more complex metric derivations and even read custom business metrics defined in external dashboards

This is Flightcrew’s goldilocks option that allows for the most flexibility within our users’ security preferences.
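
As an example of the “cluster internals” point above, the agent can hit an internal Prometheus endpoint directly with the standard Prometheus API client. A rough sketch (the service address and PromQL query are placeholders; in practice the queries are pushed down from the backend):

package main

import (
    "context"
    "fmt"
    "log"
    "time"

    "github.com/prometheus/client_golang/api"
    promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
    // Reachable only from inside the cluster, which is exactly the point.
    client, err := api.NewClient(api.Config{
        Address: "http://prometheus-server.monitoring.svc:9090",
    })
    if err != nil {
        log.Fatal(err)
    }
    promAPI := promv1.NewAPI(client)

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // A PromQL query for per-pod CPU usage over the last five minutes.
    result, warnings, err := promAPI.Query(ctx,
        `sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))`, time.Now())
    if err != nil {
        log.Fatal(err)
    }
    if len(warnings) > 0 {
        log.Printf("query warnings: %v", warnings)
    }
    fmt.Println(result.String())
}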

Core Requirements

Early on, we decided that the main SLA for our data collection would be that users interact with our agent at most once per quarter. This includes upgrades, debugging, and even seeing it pop up in alerting dashboards.

To run an agent with this minimal intervention, we need a strict set of requirements:

  1. Do not crash on runtime errors.
    • Doesn’t matter what it is - unavailable backend, third-party APIs, or weird new configs we’ve never seen.
    • However, fail loudly on setup issues to alert the user immediately if something is misconfigured.
    • If an issue is truly irrecoverable, we should know about it before the user.
  2. Resource usage must be low at any scale
    • Instead of scaling linearly with infrastructure / data size, ensure eventual consistency with the given resources.

Sounds deceptively simple, yet it’s doable. Here’s what we learned making it happen:

Make the Backend Deal With It

Although it’s impossible to preempt every issue that could arise, it’s definitely possible to pare down the number of potential issues by simplifying code and limiting the points of failure.

Take any opportunity to offload logic from the agent:

  1. Be completely stateless: Our initial agent kept a local cache of all known resources in the cluster it was monitoring, so that it could run diffs locally and avoid sending unnecessary data. So in exchange for minimizing network I/O, the agent had a complex protobuf interface, high and variable memory usage, and localized parsing logic. This is a horrible tradeoff! The agent should be stateless, sending raw data to the backend which in turn handles any parsing logic.
  2. Use dynamic configuration: There should be no “magic numbers” for configuration, nor should the agent even know what queries it’s supposed to run on external APIs. This can all be configured by the backend and passed to the agent, allowing for adaptive pushback to moderate resource usage and throughput, as well as on-the-fly updates to metric definitions without code updates.
  3. Handle backend persistence async: The backend itself shouldn’t be a point of failure either, so it should ack any messages from the agent immediately with no dependencies on the critical path. Backend database queries, RabbitMQ messages, etc. should be handled asynchronously, avoiding passing on backend errors or latency to the agent itself.
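
To illustrate the third point, here is a sketch of a backend ingest path that acks immediately and persists in the background; the names (BatchRequest, persistQueue, and so on) are illustrative, not our actual backend code:

package ingest

import (
    "context"
    "log"
)

// BatchRequest stands in for the message the agent streams up.
type BatchRequest struct{ Data [][]byte }

// persistQueue is buffered so the ingest handler never blocks on the database.
var persistQueue = make(chan *BatchRequest, 1024)

// handleBatch is called for each message from the agent. It only enqueues,
// so the agent gets its ack as soon as the message is accepted.
func handleBatch(req *BatchRequest) error {
    select {
    case persistQueue <- req:
    default:
        // Queue full: shed load here rather than passing latency back to the agent.
        log.Printf("persist queue full, dropping batch of %d items", len(req.Data))
    }
    return nil // ack immediately
}

// persistWorker drains the queue and does the slow work (database writes,
// RabbitMQ publishes) off the critical path.
func persistWorker(ctx context.Context, write func(context.Context, *BatchRequest) error) {
    for {
        select {
        case <-ctx.Done():
            return
        case req := <-persistQueue:
            if err := write(ctx, req); err != nil {
                log.Printf("async persist failed: %v", err)
            }
        }
    }
}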

Failure Handling

Even after limiting points of failure, there’s still the inevitability of bugs in code, intermittent network issues, unhandled corner cases, and drifting libraries that often can’t be caught till after thorough battle-testing. Here are some ways we built resiliency to these types of unforeseen issues.

To handle errors and panics, catch them! Everything should be retried or recovered. Golang builtins allow for the following execution structure:

// CatchHandleRecover must be deferred directly: recover() only intercepts a
// panic when called from within the deferred function itself.
func CatchHandleRecover(handlerFn func(error)) {
    if r := recover(); r != nil {
        handlerFn(recoverStacktrace(r))
    }
}

func runTaskAsync(
    ctx context.Context, task func(context.Context) error,
    jobName string, wg *sync.WaitGroup, catcher *errorTracker,
) {
    logger := log.FromContextOrDefault(ctx).WithName(jobName)
    logger.Info("beginning async call")
    startTime := time.Now()

    wg.Add(1)
    go func() {
        defer wg.Done()
        defer CatchHandleRecover(func(err error) {
            catcher.errorChannel <- util.ErrorWrapper{
                Err:     err,
                IsPanic: true,
            }
        })

        // Pass the task itself so RetryOnError can re-invoke it on transient failures.
        err := util.RetryOnError(ctx, task)
        // The catcher receives nil errors too, which updates tolerance counts.
        catcher.errorChannel <- util.ErrorWrapper{Err: err}

        logger.Info("ending async call", "executionTime", time.Since(startTime))
    }()
}

Additionally, the errorTracker (the catcher above) sends an error report to the backend for every issue, so we’re alerted even when the agent recovers automatically.

To handle deadlocks, the best solution we found was a synchronous for loop that controls the execution of all other async calls. The container’s liveness probe endpoint only reports a “healthy” verdict from within this loop, ensuring that a deadlocked agent is marked unhealthy after a couple of minutes. Before the controller restarts it, the agent also attempts a last-ditch error report.
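
A minimal sketch of that pattern, assuming a plain HTTP liveness endpoint and a hypothetical runAllCollectors that kicks off the async jobs shown above:

package main

import (
    "context"
    "net/http"
    "sync/atomic"
    "time"
)

var lastHeartbeat atomic.Int64

// healthzHandler backs the container's liveness probe: it only reports healthy
// if the synchronous control loop has checked in within the last two minutes.
func healthzHandler(w http.ResponseWriter, r *http.Request) {
    if time.Since(time.Unix(lastHeartbeat.Load(), 0)) > 2*time.Minute {
        http.Error(w, "control loop stalled", http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}

// runAllCollectors stands in for kicking off the async collection jobs.
func runAllCollectors(ctx context.Context) { /* ... */ }

func main() {
    ctx := context.Background()
    lastHeartbeat.Store(time.Now().Unix())

    http.HandleFunc("/healthz", healthzHandler)
    go http.ListenAndServe(":8080", nil)

    // Only this loop refreshes the heartbeat, so if it ever deadlocks the
    // liveness probe fails and the kubelet restarts the pod.
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            runAllCollectors(ctx)
            lastHeartbeat.Store(time.Now().Unix())
        }
    }
}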

To handle OOMs, which can cause crashes outside of the standard code paths mentioned above, the best option was to have the agent watch itself via the Kubernetes API and report an error on startup if its previous status was OOMKilled. The more ideal fix, though, is to limit memory usage in code, which is discussed in the next section.
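
A sketch of that startup check with client-go, assuming the pod’s name and namespace are injected via the downward API (the helper name is hypothetical):

package agent

import (
    "context"
    "fmt"
    "os"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// checkPreviousOOM reports an error if this pod's previous run ended in an OOMKill.
func checkPreviousOOM(ctx context.Context, clientset kubernetes.Interface) error {
    pod, err := clientset.CoreV1().Pods(os.Getenv("POD_NAMESPACE")).
        Get(ctx, os.Getenv("POD_NAME"), metav1.GetOptions{})
    if err != nil {
        return err
    }
    for _, cs := range pod.Status.ContainerStatuses {
        if term := cs.LastTerminationState.Terminated; term != nil && term.Reason == "OOMKilled" {
            // Surface this to error reporting so we know before the user does.
            return fmt.Errorf("container %q was OOMKilled on its previous run", cs.Name)
        }
    }
    return nil
}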

Lightweight

The key to not only reducing memory usage but keeping it nearly constant regardless of cluster size is to cap throughput instead of using more memory. Without a requirement for real-time data, the agent can stay very small, and the data will be eventually consistent.

Our best solution here is:

  1. Use client-side gRPC streaming with size-limited requests, to stream small batches of data to the backend as it comes in.
  2. Additionally, pass Go channels to each of the reader threads, allowing for batches of data to be sent during the external API reads, as opposed to caching and forwarding.

Some very simplified code to illustrate this concept:

func (t *ControlTower) CollectAndSendBatches(
    ctx context.Context, stream pb.ControlTower_BatchClient,
) error {
    req := &pb.BatchRequest{}
    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        // This channel is fed by the async API reader threads
        case data, ok := <-t.dataChannel:
            if !ok {
                // Readers are done: flush whatever is left and finish.
                return stream.Send(req)
            }
            req.Data = append(req.Data, data)

            // Flush once the batch reaches the configured size cap.
            if proto.Size(req) >= maxByteSize {
                if err := stream.Send(req); err != nil {
                    return err
                }
                req.Reset()
            }
        }
    }
}

This way, the overall memory pressure is limited to our specified maxByteSize (again, dynamically configured by the backend).
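
For illustration, the backend-pushed configuration might look something like this (field names are hypothetical, not our actual proto definitions):

// AgentConfig is everything the agent needs to run but should never hard-code.
type AgentConfig struct {
    MaxBatchBytes int           // cap on each streamed request (maxByteSize above)
    PollInterval  time.Duration // how often readers hit external APIs
    MetricQueries []string      // PromQL / cloud-API queries defined server-side
}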

Here are some real-world results, showing the resource usage of our agent in a small cluster (5-10 nodes) and a larger cluster (40-50 nodes):

[Chart: Small Cluster Usage]

[Chart: Large Cluster Usage]

Note: these clusters have spot nodes, so the agent gets restarted occasionally when its node scales down, and then picks up where it left off.

Dev Should be Bigger than Prod

To avoid any scaling surprises when installing on a prod environment, we set up a sandbox cluster twice as large as any user’s cloud. It was as simple as installing random open source projects to make it as big as possible and then blasting it with traffic.

If the financial and resourcing cost of setting up and maintaining a sandbox cluster doesn’t seem worth it, consider:

  1. The cost of losing a customer over a crashing agent is higher, both financially and reputationally.
  2. At some point, there will be a user who has a production cluster this large, so you’re just doing the work ahead of time rather than scrambling after scaling becomes an issue.

Conclusion

Designing a zero-maintenance data collector isn’t about eliminating all challenges upfront; it’s about making smart architectural choices that push complexity to where it’s easier to manage and away from users.

There are no silver bullets and every use case is different, so it’s best to start building and testing as early as possible. The effort will pay off when you realize how long it’s been since a user reported an issue, and even you barely notice your own agent anymore.

Have any thoughts or questions? We’re always happy to talk shop. Reach out at hello@flightcrew.io.


Sam Farid

CTO/Founder

Before founding Flightcrew, Sam was a tech lead at Google, ensuring the integrity of YouTube viewcount and then advancing network throughput and isolation at Google Cloud Serverless. A Dartmouth College graduate, he began his career at Index (acquired by Stripe), where he wrote foundational infrastructure code that still powers Stripe servers. Find him on Bluesky or holosam.dev.
