Anatomy of a 36-minute downtime

Last Friday, July 5th, 2019, we experienced a 36-minute downtime of one of our services, our payment channel API. In this article, we explain what happened and what we learned from it.

To better understand the cause of the downtime, we will first share a bit about our hybrid infrastructure.

We run two production systems based on two different technologies: Kubernetes clusters (GKE) on Google Cloud Platform (GCP) and Chef stacks on Amazon Web Services (AWS). For each production system, we have a staging system that is equivalent in architecture, design and configuration.

These pairs of systems must be kept as identical as possible at all times. This allows us to test changes and catch any issues internally before applying them to production where our live merchants could be affected.

For security and compliance reasons, most of our servers do not have direct internet access. Instead, they depend on outbound NAT gateways to send information and on load balancers to receive it. None of our application servers, databases or Elasticsearch nodes has a publicly routable IP address; they only have internal IP addresses, e.g. 10.10.0.1 or 192.168.0.1.
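As a small illustration of what "internal only" means, the sketch below uses Python's standard `ipaddress` module to check that a list of addresses contains no publicly routable IP. The helper name `is_internal_only` is hypothetical and the address 8.8.8.8 is only an example of a public address; this is not part of our tooling.

```python
import ipaddress


def is_internal_only(addresses):
    """Return True if every address lies in a private (non-public) range."""
    return all(ipaddress.ip_address(a).is_private for a in addresses)


# The two example addresses from the text are both private.
print(is_internal_only(["10.10.0.1", "192.168.0.1"]))  # True
# A single public address makes the check fail.
print(is_internal_only(["10.10.0.1", "8.8.8.8"]))      # False
```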

When our servers need to communicate with our other systems or with third-party systems over the internet, they first have to go through a NAT gateway. A NAT gateway is a key component; it works like the router in a typical home or office network, allowing devices to communicate over the internet without being directly exposed to potential threats. Using NAT gateways and firewalls along with multiple private networks allows us to segment our servers securely. This segmentation is a critical part of our architecture and helps keep the credit card information stored with us secure, in accordance with the PCI standards we follow.

NAT Gateway diagram

Advantages of using managed NAT gateways vs own instances:

  • Fast setup: they can be set up in a few minutes, and all the components are managed by GCP or AWS and secured to the highest standards.
  • They don’t require regular OS upgrades and security maintenance.
  • High availability: GCP and AWS both provide high availability service without manual intervention or scripts for fail-over. They can support 5–45Gbps.
  • High performance.
  • Scalability: they automatically add the resources required to accommodate the traffic originating from the instances. It’s also possible to have multiple NAT gateways by adding one gateway per subnet.
  • Better security: There are no users, thus no SSH access. They’re completely managed by AWS and GCP.

Advantages of using Linux servers as NAT gateways over a managed service:

  • Customized Linux parameters: Since you have access to the instance as root, you can tune various “sysctl” parameters to best fit your requirements and choose your Linux OS.
  • Custom logging: Having access to the kernel allows logging of more packet data details.
  • Deep packet inspection: A centralized place to inspect packets is always more convenient. It gives you the ability to plug in network intrusion detection, such as popular open-source IDS (Snort, Suricata, Bro) or paid solutions (e.g. AlienVault), in one or a few central places rather than on every instance.

The Downtime

As mentioned above, every change we make to our production systems is first tested on its staging counterpart. This is typically a very robust process that catches almost all issues before they reach production. This time, it did not.

Our API running on our AWS Chef system was not completely identical to its staging counterpart. The production system was still using Linux servers as its NAT gateways, while all the other systems we run had already been migrated to the managed NAT gateway services in AWS and GCP.
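This kind of drift can be surfaced by comparing a flat description of each environment. The sketch below is only an illustration: the function `config_drift`, the keys and the values are hypothetical and do not reflect how our Chef configuration is actually stored.

```python
def config_drift(staging, production):
    """Return {key: (staging_value, production_value)} for every differing key."""
    keys = staging.keys() | production.keys()
    return {
        key: (staging.get(key), production.get(key))
        for key in sorted(keys)
        if staging.get(key) != production.get(key)
    }


# Hypothetical environment descriptions; a missing key shows up as None.
staging = {"nat_gateway": "managed", "net.core.somaxconn": "1024"}
production = {"nat_gateway": "linux-instance", "net.core.somaxconn": "1024"}

print(config_drift(staging, production))
# {'nat_gateway': ('managed', 'linux-instance')}
```

Running a comparison like this before a change would make it visible that a setting tested on staging may behave differently on production.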

On the 5th of July, we applied a configuration change that updated the “net.core.somaxconn” sysctl parameter in Linux, in the hope of fixing (or at least improving) a TCP latency issue we had been experiencing between our load balancers and application servers.
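On Linux, a sysctl key maps to a file under /proc/sys, with the dots replaced by path separators, so the current value can be inspected before and after a change. The following is a minimal sketch; the helper names `sysctl_path` and `read_sysctl` are hypothetical, and the `root` parameter exists only so the code can be exercised outside a Linux host.

```python
from pathlib import Path


def sysctl_path(key, root="/proc/sys"):
    """Map a sysctl key such as 'net.core.somaxconn' to its file path."""
    return Path(root).joinpath(*key.split("."))


def read_sysctl(key, root="/proc/sys"):
    """Return the current value of a sysctl key as a stripped string."""
    return sysctl_path(key, root).read_text().strip()


# Only attempt the read where the file exists (i.e. on a Linux host).
if sysctl_path("net.core.somaxconn").exists():
    print(read_sysctl("net.core.somaxconn"))
```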

Connection times recorded by the load balancer:

Average (pic 1): the peaks are around 30 ms.

Maximum (pic 2): the peaks are as high as 1 second (1000 ms).

A few changes to “sysctl” parameters and HAProxy timeouts were successfully applied to all staging and production environments.

The change was applied to our production API system as part of the default security “sysctl” parameters. That system also has a secondary Chef recipe that toggles “net.ipv4.ip_forward” from 0 to 1.

This recipe was necessary because, on the day of the incident, we were still using Linux machines for NAT, and a Linux machine only forwards packets on behalf of other servers when this parameter is set to 1.

Once the NAT gateway instances had “net.ipv4.ip_forward” set to 0, they could no longer forward health checks from the instances behind them to AWS. AWS therefore considered those servers unhealthy and rebooted every server in the system that relied on the NAT gateways. This simultaneous reboot brought down our supporting services, including our Elasticsearch cluster, database connection pooler and internal load balancers, in addition to all of our application servers.
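A pre-flight check can catch this class of mistake before a change is applied. The sketch below assumes each host has a role and that some roles have sysctl values they cannot work without; the table `REQUIRED_BY_ROLE`, the role names and the function `violations` are hypothetical illustrations, not our actual Chef recipes.

```python
# Hypothetical table of sysctl values that a given host role depends on.
REQUIRED_BY_ROLE = {
    "nat_gateway": {"net.ipv4.ip_forward": "1"},
}


def violations(role, desired):
    """List the settings in `desired` that would break a host of this role."""
    required = REQUIRED_BY_ROLE.get(role, {})
    return [
        f"{key} must be {value} on {role} hosts, got {desired[key]}"
        for key, value in required.items()
        if key in desired and desired[key] != value
    ]


# A hardening change that disables IP forwarding, as in the incident.
change = {"net.ipv4.ip_forward": "0", "net.core.somaxconn": "1024"}

print(violations("nat_gateway", change))
# ['net.ipv4.ip_forward must be 1 on nat_gateway hosts, got 0']
print(violations("app_server", change))
# []
```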

Recovery:

At the 16-minute mark of the downtime, we identified the problem and applied the correct settings on the NAT gateway servers. However, it took another 20 minutes to fully recover all the services that had been unable to restart properly after rebooting without internet access.

What we have learned:

  • Staging and production should always be identical in their configuration and architecture. Even a small deviation can be catastrophic when changes are applied.

  • Apply changes partially. Instead of pushing changes to all servers at once, select a subset (e.g. one zone at a time) and apply the changes there first. In the worst-case scenario, one availability zone would go offline while the other zone could still serve requests.

  • Always have a good rollback plan for any change, even the small ones. The ability to quickly revert a change is crucial in any production environment.
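The last two lessons can be combined into a zone-by-zone rollout that stops and reverts at the first unhealthy zone. The following is a minimal sketch under stated assumptions: `rollout_by_zone` and the zone names are hypothetical, and the `apply`, `healthy` and `rollback` callbacks stand in for whatever deployment tooling is in use.

```python
def rollout_by_zone(zones, apply, healthy, rollback):
    """Apply a change one zone at a time.

    Stops at the first zone that fails its health check, rolls that zone
    back, and returns the list of zones that kept the change.
    """
    done = []
    for zone in zones:
        apply(zone)
        if not healthy(zone):
            rollback(zone)
            return done
        done.append(zone)
    return done


# Simulated environment: the change breaks the first zone it touches.
state = {"zone-a": "old", "zone-b": "old"}
result = rollout_by_zone(
    ["zone-a", "zone-b"],
    apply=lambda zone: state.update({zone: "new"}),
    healthy=lambda zone: zone != "zone-a",
    rollback=lambda zone: state.update({zone: "old"}),
)

print(result, state)
# [] {'zone-a': 'old', 'zone-b': 'old'}
```

In this simulation the second zone is never touched, so it would have kept serving requests while the first zone was reverted.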
