Interview
August 15, 2023

Scaling Reliability at Stripe

By Sam Farid

[We’re building Flightcrew to solve production configuration issues at scale, and we couldn’t think of a better perspective than that of Gautam Raj, Staff Infrastructure Engineer at Stripe.]

What are the most significant challenges for scaling reliability at Stripe? 

Stripe is critical internet infrastructure, and we take reliability very seriously.

We have some unique challenges around the reliability, security, and efficiency of our systems, and we’re in the process of migrating from a Ruby monolith to microservices. We need to ensure reliability across our service fleet without compromising new releases or ongoing migrations. Everything is a tradeoff.

What’s been most effective at navigating these tradeoffs?

At our scale, it’s about building the tools, culture, and policies that keep feature teams from shooting themselves in the foot.

As much as possible, we try to drive down the differences between stacks, so that someone working in Ruby has similar abstractions available when they switch to a Java context, or vice versa.

Maybe the biggest impact has come from gently spreading an SRE mindset and SRE policies across the organization. Every service has to commit to internal SLAs, which we track rigorously with internal tools. And every service has to provide a networking configuration that describes things like timeouts, retry settings, and rate limits; clients are required to abide by these configs.
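
[As a rough sketch of the kind of per-service networking config Gautam describes, here is a minimal Python example; the field names and values are illustrative only, not Stripe’s actual schema:]

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class NetworkingConfig:
        # Hypothetical per-service networking config that clients are expected to obey.
        connect_timeout_ms: int   # how long a client may wait to establish a connection
        request_timeout_ms: int   # end-to-end deadline for a single request
        max_retries: int          # retries allowed on retryable errors (e.g. timeouts)
        retry_backoff_ms: int     # base delay between retries
        rate_limit_rps: int       # requests per second that callers must stay under

    # Example config a service might publish for its callers.
    ledger_config = NetworkingConfig(
        connect_timeout_ms=200,
        request_timeout_ms=2_000,
        max_retries=2,
        retry_backoff_ms=100,
        rate_limit_rps=500,
    )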

With those in place, it becomes about rollout safety, since changes are often the biggest source of production incidents. All code is deployed automatically with blue/green traffic shifting and automated rollback on errors. Services first deploy into a pre-production environment where they receive ambient traffic, and risky changes are wrapped in feature flags so new features roll out slowly across the fleet.
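
[A minimal sketch of a blue/green rollout loop with automated rollback might look like the following; the traffic-shifting and error-rate helpers are hypothetical stand-ins, not Stripe’s deployment tooling:]

    import time

    TRAFFIC_STEPS = [0.01, 0.05, 0.25, 0.50, 1.00]  # fraction of traffic shifted to green
    ERROR_BUDGET = 0.001                             # roll back if green exceeds 0.1% errors

    def rollout(deploy_green, route_traffic, green_error_rate, rollback, bake_time_s=300):
        """Shift traffic to the new (green) release in steps, rolling back on errors."""
        deploy_green()                               # stand up the new version alongside blue
        for fraction in TRAFFIC_STEPS:
            route_traffic(green_fraction=fraction)   # send a slice of traffic to green
            time.sleep(bake_time_s)                  # let the change bake before checking health
            if green_error_rate() > ERROR_BUDGET:
                route_traffic(green_fraction=0.0)    # shift all traffic back to blue
                rollback()                           # tear down the bad release
                return False
        return True                                  # green now serves 100% of traffic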

Finally, we do a lot of load testing for capacity planning, especially ahead of big traffic days like Black Friday / Cyber Monday. We also run chaos tests alongside the load tests, holding gameday exercises and simulating server and network failures.
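
[In the same spirit, here is a toy sketch of running chaos injection alongside a load test; the request-sending and instance-killing hooks are hypothetical, and real gamedays involve far more coordination:]

    import random
    import threading
    import time

    def generate_load(send_request, duration_s, rps):
        """Drive steady synthetic traffic at the service under test."""
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            send_request()
            time.sleep(1.0 / rps)

    def inject_failures(kill_instance, instances, duration_s, interval_s):
        """Periodically terminate a random instance to simulate server failures."""
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            time.sleep(interval_s)
            kill_instance(random.choice(instances))

    def gameday(send_request, kill_instance, instances, duration_s=600):
        # Run load and failures together so the team can watch SLAs hold up under stress.
        load = threading.Thread(target=generate_load, args=(send_request, duration_s, 100))
        chaos = threading.Thread(target=inject_failures, args=(kill_instance, instances, duration_s, 60))
        load.start()
        chaos.start()
        load.join()
        chaos.join()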

Without getting you in trouble - do you have internal tools that have made a difference with production engineering? 

Yes. Our feature teams are in a decent place, so we keep a number of pre-built jobs ready to intervene in case of an incident.

For example, we have homegrown tools to roll back an accidental data model change in production. And when things go very wrong, our fallback is another homegrown tool that lets us securely open a shell into production.
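
[To make the idea of pre-built recovery jobs concrete, here is a hedged sketch of a job registry in Python; the job name, signature, and registry are invented for illustration and are not Stripe’s internal tooling:]

    RECOVERY_JOBS = {}

    def recovery_job(name):
        """Register a callable as a named, pre-approved incident remediation."""
        def register(fn):
            RECOVERY_JOBS[name] = fn
            return fn
        return register

    @recovery_job("revert-data-model-change")
    def revert_data_model_change(model_name, bad_version, good_version):
        # In a real system this would re-point readers and writers at the last
        # known-good schema version and backfill records written under the bad one.
        print(f"reverting {model_name} from v{bad_version} to v{good_version}")

    def run_recovery_job(name, **kwargs):
        # Responders invoke a vetted job by name instead of improvising under pressure.
        return RECOVERY_JOBS[name](**kwargs)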

As a specific example, we had an issue where our recovery job failed in an unexpected way, and we had to drop down into the secure shell. Even that was challenging because we had to grant a number of permissions as we went. Homegrown developer tools are useful, but can also fail at the worst possible time if they aren’t exercised regularly.

Thanks Gautam! Any final thoughts? 

It’s been a big year for the Stripe infra team - thanks everyone!
