Get to know a config
March 20, 2024

The Redis Connection Pool

By Sam Farid

TL;DR: tuning Redis is painful and time-intensive. We share heuristics, methodology, and aggregated data to help you shortcut this process.

Redis can be tricky to maintain because small issues often cascade into larger ones. At the same time, it’s already a cost center and throwing money at Redis resources is often not the best solution to reliability issues.

I encountered and helped fix many Redis incidents plaguing top-tier GCP customers when I was on the Serverless Networking team at Google Cloud, and I want to share some advice on how to ride the knife’s edge between cost and performance.

What is a connection pool?

A connection pool tracks and reuses existing connections instead of opening new ones each time your application wants to speak to Redis. This is useful because storage solutions like Redis have a limited number of connections they can accept at any given time, and it’s inefficient and slow to keep closing and recreating these connections. Therefore, using a connection pool is a best practice at any scale.

What configs to care about

Redis client libraries that manage connection pools expose many, many different configuration settings. For example, the most popular Go library (go-redis) has at least 15. The key settings to pay attention to are:

  • Max Connections: Total connections available to the pool. Lower values can lead to network bottlenecks, while higher values can lead to memory overflows.

  • Idle Timeout: How long to wait before closing an inactive connection. This keeps your application from wasting memory on connections that aren’t being used.

  • Retry Backoff: How long to wait before retrying a failed connection. High values can lead to slow performance, while low values are the easiest way to DoS yourself, since new and retrying traffic blast Redis at the same time. (See the sketch below for how these map onto go-redis options.)
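To make these concrete, here’s a minimal sketch of how those three settings map onto go-redis options. The field names are from go-redis v9; the values are illustrative starting points, not recommendations.

```go
package cache

import (
	"time"

	"github.com/redis/go-redis/v9"
)

// newRedisClient returns a client whose built-in pool reflects the three
// settings above. The numbers are placeholders to tune for your workload.
func newRedisClient() *redis.Client {
	return redis.NewClient(&redis.Options{
		Addr: "localhost:6379",

		// Max Connections: total connections available to this pool.
		PoolSize: 20,

		// Idle Timeout: close connections that sit unused for this long.
		ConnMaxIdleTime: 5 * time.Minute,

		// Retry Backoff: retries grow from the minimum toward the maximum.
		MaxRetries:      3,
		MinRetryBackoff: 8 * time.Millisecond,
		MaxRetryBackoff: 512 * time.Millisecond,
	})
}
```

If you’re on an older go-redis version, the idle setting is called IdleTimeout rather than ConnMaxIdleTime.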

Infrastructure platform configs are equally important to the performance of your application and how it interacts with Redis, especially if your platform has autoscaling enabled. Some key settings are:

  • Min/Max Replicas: the range of replicas the autoscaler is allowed to provision.

  • Resource Requests: how much CPU and Memory are allocated to each replica.

  • Target Utilization: the percentage of resource utilization per replica at which the autoscaler adds another replica to handle the extra load. (See the sketch below for how these look on Kubernetes.)
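If your platform is Kubernetes, these map roughly onto a HorizontalPodAutoscaler plus the Deployment’s resource requests. A hedged sketch, where the names and numbers are illustrative assumptions rather than recommendations:

```yaml
# Min/Max Replicas and Target Utilization live on the HorizontalPodAutoscaler.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa                  # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                    # hypothetical Deployment that talks to Redis
  minReplicas: 2                 # Min Replicas
  maxReplicas: 20                # Max Replicas
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # Target Utilization: scale out past ~70% CPU
```

Resource Requests are set on the Deployment’s container spec (resources.requests.cpu / resources.requests.memory), and they’re what the utilization percentage is measured against.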

So when should I update these configs?

In reactive scenarios, you should take a look if:

  • Redis is down (duh)

  • You’re seeing vague “connection reset” error logs on your services

  • You’re seeing spiky latency graphs that can be attributed to interactions with Redis

On the other hand, if you’re being proactive, you should take a look when:

  • You’re planning a big infrastructure change, such as adding autoscaling or deploying new services that are expected to increase Redis consumption

  • Redis CPU or memory utilization is steadily low, which may mean it’s over-provisioned and unnecessarily expensive

  • The default settings haven’t been revisited in months, even though production workloads or traffic have changed

Defining Success

To ride the knife’s edge, we want to optimize connection settings to maximize availability and performance, while minimizing cost.

The main obstacles: 

  • You’ll need to touch interdependent application and infrastructure configs that are owned by different teams (e.g., max_connections and max_instances)

  • High-dimensional configuration plus dynamic traffic loads means the truly optimal configuration is a moving target, hour by hour.

  • Autoscaling means an even more dynamic environment where the ground shifts under your feet.

  • When a core infrastructure component such as Redis fails, the failures will almost certainly cascade into other parts of the system.

How to tune it

Most teams will use a mix of experimentation, rules of thumb, and metrics to keep the connection pool balanced. If this is your first time, expect to spend at least a week building intuition and getting it right. Unfortunately, a lot of this can only be verified in production environments, so you have to accept some inherent risk to production in order to tune these settings.

Before you start:

  • Pull connection pool configs into environment variables so that you can iterate rapidly (see the sketch after this list).

  • Set up monitoring for each service, and Redis itself. Take a snapshot of each metric as the baseline.

  • If possible, start with a staging environment to get a few safe hypotheses for your production environment. Ideally there is load testing in the staging environment to match prod as closely as possible, including code versions and traffic patterns.
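For the first point, here’s a minimal sketch of reading pool settings from environment variables so they can be changed without a code change. The variable names are hypothetical; the go-redis fields are the same as above.

```go
package cache

import (
	"os"
	"strconv"
	"time"

	"github.com/redis/go-redis/v9"
)

// envInt reads an integer environment variable, falling back to a default.
func envInt(key string, def int) int {
	if v := os.Getenv(key); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	return def
}

// newClientFromEnv builds a client from (hypothetical) environment variables,
// so each experiment is a config rollout rather than a code deploy.
func newClientFromEnv() *redis.Client {
	return redis.NewClient(&redis.Options{
		Addr:            os.Getenv("REDIS_ADDR"),
		PoolSize:        envInt("REDIS_POOL_SIZE", 10),
		ConnMaxIdleTime: time.Duration(envInt("REDIS_IDLE_TIMEOUT_SEC", 300)) * time.Second,
		MaxRetries:      envInt("REDIS_MAX_RETRIES", 3),
	})
}
```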

To run experiments:

  • Vary one config at a time, trying it out in a staging environment and matching the observed metrics against the baseline (a rough sketch of this loop follows the list).

  • Run each configuration for at least half an hour, and ideally longer, since errors won’t show up right away.

  • Note down whether the config had a positive or negative effect on each metric, and repeat the process varying another value. Although there may be complex interactions between configs and non-linear effects, this is a good proxy for building up your intuition about each config’s effect.

  • Then, start testing the most reliable hypotheses in production environments.
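As a rough illustration of that loop (every function here is a hypothetical placeholder for your own deploy and metrics tooling, not a real API):

```go
package tuning

import (
	"fmt"
	"time"
)

// Metrics is a simplified snapshot of whatever you track per service:
// tail latency, error rate, Redis connection counts, memory, and so on.
type Metrics struct {
	P99Latency time.Duration
	ErrorRate  float64
}

// runExperiment applies one config change, soaks it, and reports the delta
// against the baseline. applyConfig and collectMetrics are placeholders for
// your own tooling (e.g. a config rollout and a metrics query).
func runExperiment(
	name, value string,
	baseline Metrics,
	applyConfig func(name, value string) error,
	collectMetrics func() Metrics,
) error {
	if err := applyConfig(name, value); err != nil {
		return err
	}

	// Soak for at least half an hour; errors won't show up right away.
	time.Sleep(30 * time.Minute)

	after := collectMetrics()
	fmt.Printf("%s=%s: p99 %v -> %v, errors %.2f%% -> %.2f%%\n",
		name, value, baseline.P99Latency, after.P99Latency,
		baseline.ErrorRate*100, after.ErrorRate*100)
	return nil
}
```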

This process is incidentally very similar to hyperparameter tuning in ML, so a tool like Ax can help automate it if it needs to be done at a large scale. But it’s still always a good idea to understand the intuition behind the results, since automated tuning isn’t practical during an incident.

Best Practices

At Flightcrew, we’re building the world’s best dataset on “magic numbers” in cloud configs, so we’ve got a sense of which best practices to apply across production workloads. During experimentation, it may be helpful to look for some of the intuitive and unintuitive patterns we’ve learned:

  1. Decrease the number of connections per replica as the number of replicas goes up, to stay under the maximum number of connections Redis can handle (see the sizing sketch after this list). This should also be paired with shrinking resources per replica, to avoid unnecessary costs.

  2. The target resource utilization per replica should increase with the number of replicas, to aim for smaller replicas that run hotter, which will be more conservative about opening precious connections when getting close to the knife’s edge.

  3. Retry backoff intervals should increase as more connections are used, to avoid further overloading the poor server when something first goes wrong.

  4. Idle timeouts should be low at small scales because it’s best to reap unused connections and free up resources when connections are not often reused. The timeout should then increase with scale to enforce stricter reuse of existing connections. However at maximum scale, it’s best to lower it again because any hanging connection is a waste and should be reaped quickly, especially when replicas are scaling down from peak traffic.
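As an illustration of the first pattern, here’s a hedged back-of-the-envelope sketch for sizing each replica’s pool so the whole fleet stays under Redis’s connection limit. The budget factor and the maxclients value are assumptions to adjust for your setup.

```go
package tuning

// poolSizePerReplica splits a Redis server's connection budget across replicas,
// keeping headroom for admin clients, health checks, and scaling churn.
func poolSizePerReplica(redisMaxClients, replicas int, budgetFactor float64) int {
	if replicas < 1 {
		replicas = 1
	}
	budget := int(float64(redisMaxClients) * budgetFactor) // e.g. use 80% of maxclients
	size := budget / replicas
	if size < 1 {
		size = 1
	}
	return size
}
```

For example, with Redis’s default maxclients of 10,000, a budget factor of 0.8 (leaving 20% headroom), and 40 replicas, each replica’s pool gets 8000 / 40 = 200 connections.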

Some global advice when tuning configs:

  1. Application and infrastructure configs are deeply intertwined, so don’t take a siloed perspective when settings need to be changed.

  2. Theoretically, the most optimized solution would be to adjust settings continuously throughout the day and week to match varying traffic. This is impossible for a human and shouldn’t be the goal.

  3. Assuming a fixed configuration that prioritizes availability over cost, you should generally set values that are optimized for the maximum scale, and this will work well at different loads.

  4. In general, running more, smaller replicas will help get the most out of autoscaling. More instances means better availability, and smaller instances means there are smaller step changes during traffic cycles, and therefore less wasted money.

  5. All settings eventually get stale. Leave comments in .yaml files and document experiments for your next tuning session, or when it’s someone else’s turn.

And that’s it! Good luck out there and feel free to reach out to hello@flightcrew.io with any questions on what we've seen, or to find out how Flightcrew can help reduce this process from days to minutes.
