Why this HA model
A gateway is used in an AWS VPC to control egress traffic. In addition to the NAT gateway, a more feature-rich gateway such as Gateway Transparent mode provides more sophisticated controls.
What is an optimal HA model for critical infrastructure components such as a gateway? The AWS reference HA model for NAT and Gateway is rather dated. It relies on a script running on the instances that ping each other to check health, and it has these potential shortcomings:
- It depends on a continuously running shell script to monitor availability and perform failover. If that process is terminated, no failover occurs.
- Ping provides only a limited indication of health.
- "Split brain" scenario: when connectivity between the NAT instances fails (possibly due to a Security Group) but each of them can still reach the Internet, it is possible that each NAT instance will shut the other one down.
- It does not account for instance termination: a terminated instance is not recreated.
This series describes two alternatives:
1. Basic HA model, with auto recovery of gateway
2. Enhanced HA model, with dynamic route table failover during gateway recovery
The design and implementation of the basic HA model are covered here. See part 2 for enhanced HA.
Design Overview
Health Monitoring and Auto Scaling
In cloud architecture, all instances should be behind an Auto Scaling group (ASG) for resiliency. Here we leverage the ASG to monitor gateway instance health. Auto Scaling health checks use the results of the EC2 status checks to determine the health status of an instance. Auto Scaling marks an instance as unhealthy if its instance status is any value other than running or its system status is impaired.
The gateway is therefore monitored through the standard AWS health monitoring for Auto Scaling instances. Custom health checks are also supported.
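The health rule described above can be sketched in a few lines. This is only an illustration of the stated rule, not AWS's implementation: `Ec2Status`, `is_unhealthy`, and `report_custom_health` are hypothetical names, and the `autoscaling_client` argument is assumed to be a boto3 Auto Scaling client (for example `boto3.client("autoscaling")`), whose `set_instance_health` call is one way to feed in a custom check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ec2Status:
    """Hypothetical container for the two EC2 status values the rule uses."""
    instance_state: str  # e.g. "running", "stopped", "terminated"
    system_status: str   # e.g. "ok", "impaired"


def is_unhealthy(status: Ec2Status) -> bool:
    """Unhealthy if the instance is not running or its system status is impaired."""
    return status.instance_state != "running" or status.system_status == "impaired"


def report_custom_health(autoscaling_client, instance_id: str, healthy: bool) -> None:
    """Push the result of a custom check to the ASG.

    Assumption: autoscaling_client is a boto3 Auto Scaling client.
    """
    autoscaling_client.set_instance_health(
        InstanceId=instance_id,
        HealthStatus="Healthy" if healthy else "Unhealthy",
        ShouldRespectGracePeriod=True,
    )
```

A custom check (for example, a probe that the gateway actually forwards traffic) could call `report_custom_health(client, instance_id, healthy=False)` so that the ASG replaces the instance even when the EC2 status checks still pass.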
Route Target and ENI
In this non-proxy design, internet access goes via the default route, which is defined in a private route table per AZ. In an HA scenario, instances may get replaced, so the route table entry must either 1) point to a target that remains "persistent" outside the instance, or 2) be updated to point to the new instance.
For the first option, what could be a persistent target for the default route to point to? An ELB would be an option, but it is not supported as a routing target. An ENI is a network interface that can be detached from and attached to instances, so it can serve as the persistent target. Although there are some feature limitations and workarounds required, this approach is proven to work.
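As a rough sketch of the first option, the default route of a private route table can be pointed at an ENI once, so that the entry never needs to change when the instance behind it is replaced. The function name and the IDs in the usage note are hypothetical, and `ec2_client` is assumed to be a boto3 EC2 client (for example `boto3.client("ec2")`).

```python
DEFAULT_ROUTE = "0.0.0.0/0"


def point_default_route_at_eni(ec2_client, route_table_id: str, eni_id: str) -> str:
    """Make the ENI the target of the default route; return the action taken.

    Assumption: ec2_client is a boto3 EC2 client.
    """
    tables = ec2_client.describe_route_tables(RouteTableIds=[route_table_id])
    routes = tables["RouteTables"][0]["Routes"]
    has_default = any(r.get("DestinationCidrBlock") == DEFAULT_ROUTE for r in routes)

    params = {
        "RouteTableId": route_table_id,
        "DestinationCidrBlock": DEFAULT_ROUTE,
        "NetworkInterfaceId": eni_id,  # persistent target, independent of any instance
    }
    if has_default:
        # A default route already exists, so it has to be replaced.
        ec2_client.replace_route(**params)
        return "replaced"
    ec2_client.create_route(**params)
    return "created"
```

For example, `point_default_route_at_eni(ec2, "rtb-EXAMPLE", "eni-EXAMPLE")` would be run once per AZ at setup time; both IDs here are placeholders.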
Instance Recovery and Bootstrapping
Another feature that comes with ASG is automated recovery of instances. However, ASG has some limitations: for example, it cannot set instance attributes and it cannot attach an ENI. Those steps are implemented via instance bootstrapping.
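The bootstrapping steps could look roughly like the sketch below, which a new instance would run at first boot. Several things here are assumptions rather than details from this post: the instance attribute being set is taken to be the source/destination check (commonly disabled on NAT-style instances), the ENI is attached as device index 1, and `ec2_client` is a boto3 EC2 client. The function name and parameters are hypothetical.

```python
import time


def bootstrap_gateway(ec2_client, instance_id: str, eni_id: str,
                      device_index: int = 1, timeout_s: int = 300,
                      poll_s: int = 5, sleep=time.sleep) -> str:
    """Set the instance attribute and attach the persistent ENI.

    Returns the attachment ID. Assumption: ec2_client is a boto3 EC2 client.
    """
    # Assumed attribute: disable source/dest check so the instance can forward traffic.
    ec2_client.modify_instance_attribute(
        InstanceId=instance_id, SourceDestCheck={"Value": False}
    )

    # The ENI may still be detaching from the failed instance; wait until it is free.
    waited = 0
    while True:
        eni = ec2_client.describe_network_interfaces(
            NetworkInterfaceIds=[eni_id]
        )["NetworkInterfaces"][0]
        if eni["Status"] == "available":
            break
        if waited >= timeout_s:
            raise TimeoutError(f"{eni_id} not available after {timeout_s}s")
        sleep(poll_s)
        waited += poll_s

    resp = ec2_client.attach_network_interface(
        NetworkInterfaceId=eni_id, InstanceId=instance_id, DeviceIndex=device_index
    )
    return resp["AttachmentId"]
```

The `sleep` parameter exists only so the polling loop can be exercised without real delays; in a real bootstrap script the instance ID would typically come from the instance metadata service.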
Implementation
The diagram shows an architectural view of the new HA model. The gateway is placed in a single-instance ASG, with two interfaces. An ENI is attached to the gateway instance, which keeps the route table entry valid even when a gateway instance fails (the ENI is reattached to the recovered instance).
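A single-instance ASG is simply a group whose minimum, maximum, and desired sizes are all 1, pinned to one subnet. The sketch below builds the parameters that boto3's `create_auto_scaling_group` accepts; the function name, the use of a launch template, and the default grace period are assumptions made for illustration, not settings taken from the sample code.

```python
def single_instance_asg_params(name: str, launch_template_id: str,
                               subnet_id: str, grace_period_s: int = 900) -> dict:
    """Build keyword arguments for a one-gateway-per-AZ Auto Scaling group."""
    return {
        "AutoScalingGroupName": name,
        # Assumption: the gateway image and bootstrap script live in a launch template.
        "LaunchTemplate": {"LaunchTemplateId": launch_template_id, "Version": "$Latest"},
        # Exactly one gateway: the ASG only replaces it, never scales out.
        "MinSize": 1,
        "MaxSize": 1,
        "DesiredCapacity": 1,
        # One subnet, so the replacement lands in the same AZ as the ENI.
        "VPCZoneIdentifier": subnet_id,
        "HealthCheckType": "EC2",
        # Assumed value: long enough for the appliance to install before checks count.
        "HealthCheckGracePeriod": grace_period_s,
    }
```

With a boto3 Auto Scaling client this could be used as `client.create_auto_scaling_group(**single_instance_asg_params("gw-az1", "lt-EXAMPLE", "subnet-EXAMPLE"))`, where all three values are placeholders. Keeping the group in a single subnet matters because an ENI can only be attached to an instance in its own Availability Zone.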
For sample code, please refer to the GitHub repo:
There is a limitation with this HA model: when a gateway instance fails, recovery may take 10-15 minutes (to build a new gateway, then install and configure the appliance). While the gateway is being rebuilt for that zone, traffic is black-holed in the route table until the ENI is attached to the new gateway instance. See part 2 for enhanced HA.