There are many advantages to running your infrastructure inside a VPC on AWS, as opposed to running in general EC2-Classic. While EC2-Classic is basically one giant, shared VPC, isolating your resources into your own VPC gives you greater control over internal communication, networking, and security.
We recently made a full transition from EC2-Classic into a set of VPCs.
I’m going to touch on 4 main points here:
- Why move to a VPC?
- Considerations during setup
- Setting up Internal Routing
- Making the switch over
1. Why move to a VPC?
There are many reasons to prefer a VPC over EC2-Classic; for us, the main ones were:
- PCI Compliance and security considerations
- We are moving to Kubernetes, which would require us to have stricter control over our network addressing
As a side note, many of the decisions we made were influenced by our end goal of getting set up with Kubernetes. Without the same end goal, some of these decisions may not be applicable to another VPC migration. I’ll highlight those decisions as I go.
A requirement for our move to Kubernetes was the ability to easily migrate traffic between our current, Ansible-controlled infrastructure and our new Kubernetes infrastructure. This is fairly trivial if we can get both sets of infrastructure behind the same Load Balancer, which requires both to be in the same VPC (ALBs are limited to a single VPC). While it would also be possible to put a proxy in front of the AWS Load Balancers, the added complexity meant this was not an option we wanted to consider.
2. Considerations during setup
Setting up the VPC properly is likely one of the most important things to get right. This is because changing it after the fact can be extremely difficult. With this in mind, we decided on the following:
- a VPC CIDR Block of
- ‘Legacy’ subnets for our non-Kubernetes infrastructure as
- ‘NAT’ subnets (will be discussed later) as
Let’s discuss each one of these individually:
The VPC CIDR block was chosen as it is a typically reserved private address space.
The ‘Legacy’ subnet blocks were chosen specifically to keep the size small (as we did not plan to have too many nodes in them before the switch), and to avoid conflicting with /20 address spaces under 172.20.0.0. We created 4 of these, in
The ‘NAT’ subnets are used to host NAT instances with private route tables. One was created in each AZ with a ‘legacy’ subnet.
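Since a subnet layout is hard to change later, it can be worth checking a plan for overlaps before creating anything. The sketch below is illustrative only: the VPC CIDR, the number of reserved /20 blocks, and the candidate subnets are all hypothetical placeholders, not the values we used. It relies only on Python’s standard `ipaddress` module.

```python
import ipaddress

# Hypothetical values for illustration; substitute your own plan.
VPC_CIDR = ipaddress.ip_network("172.20.0.0/16")


def reserved_blocks(vpc, prefix_len, count):
    """Return the first `count` blocks of size /prefix_len inside the VPC."""
    return list(vpc.subnets(new_prefix=prefix_len))[:count]


def find_conflicts(candidates, reserved):
    """Return (candidate, other) pairs whose address ranges overlap."""
    conflicts = []
    for i, cand in enumerate(candidates):
        # Check against the blocks we want to stay clear of.
        for block in reserved:
            if cand.overlaps(block):
                conflicts.append((cand, block))
        # Check against every later candidate, so each pair is tested once.
        for other in candidates[i + 1:]:
            if cand.overlaps(other):
                conflicts.append((cand, other))
    return conflicts


if __name__ == "__main__":
    # Assumption: the first eight /20 blocks are set aside for later use.
    reserved = reserved_blocks(VPC_CIDR, 20, 8)
    plan = [
        ipaddress.ip_network("172.20.240.0/24"),  # hypothetical 'legacy' subnet
        ipaddress.ip_network("172.20.241.0/24"),  # hypothetical 'legacy' subnet
        ipaddress.ip_network("172.20.250.0/28"),  # hypothetical 'NAT' subnet
    ]
    print(find_conflicts(plan, reserved))
```

An empty result means none of the planned subnets collide with each other or with the blocks being set aside.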
3. Setting up Internal Routing
Now, let me explain the NATs. We needed these in order to access existing EC2-Classic resources, which we had to whitelist by IP. This approach was more straightforward than monitoring for instance creation/deletion and updating Security Groups to match. By using the NATs, we were able to whitelist one static IP per AZ, and not worry about keeping security groups in sync. By using a NAT per AZ, we were able to avoid inter-zone routing, and isolate each AZ from potential failures in other zones.
The ‘NAT’ subnets exist specifically to hold one NAT instance per availability zone. If the NATs were in the ‘legacy’ subnets, we would end up with an infinite routing loop (outbound traffic from the NAT would be directed back to the NAT). To prevent this, we need a separate ‘NAT’ subnet for each AZ. After this, we can route the IPs of our external services through the zone-appropriate NAT.
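To make the routing loop concrete, here is a toy model of per-subnet route tables. It is a simplified sketch, not how AWS implements routing: the subnet names, the NAT name, and the external service IP (`203.0.113.10`, from a documentation-only range) are all made up for illustration. Each subnet has one route table, the most specific matching route wins, and a NAT re-sends traffic from whichever subnet it lives in.

```python
import ipaddress


def lookup(route_table, dest_ip):
    """Longest-prefix match: return the target of the most specific route."""
    dest = ipaddress.ip_address(dest_ip)
    matches = [(net, target) for net, target in route_table if dest in net]
    if not matches:
        raise LookupError(f"no route to {dest_ip}")
    return max(matches, key=lambda m: m[0].prefixlen)[1]


def trace(route_tables, nat_location, source_subnet, dest_ip):
    """Follow a packet hop by hop; the path ends in 'igw' or 'LOOP'."""
    path = [source_subnet]
    visited = {source_subnet}
    current = source_subnet
    while True:
        target = lookup(route_tables[current], dest_ip)
        path.append(target)
        if target == "igw":
            return path
        # The target is a NAT; it forwards using its own subnet's route table.
        current = nat_location[target]
        if current in visited:
            path.append("LOOP")
            return path
        visited.add(current)
        path.append(current)


DEFAULT = ipaddress.ip_network("0.0.0.0/0")
SERVICE = ipaddress.ip_network("203.0.113.10/32")  # hypothetical whitelisted service

# Routes used by the 'legacy' subnet: service traffic goes via the NAT.
legacy_routes = [(DEFAULT, "igw"), (SERVICE, "nat-a")]

# Broken layout: the NAT sits in the same subnet it serves.
broken = trace({"legacy-a": legacy_routes}, {"nat-a": "legacy-a"},
               "legacy-a", "203.0.113.10")

# Working layout: the NAT has its own subnet with a plain default route.
working = trace({"legacy-a": legacy_routes, "nat-subnet-a": [(DEFAULT, "igw")]},
                {"nat-a": "nat-subnet-a"}, "legacy-a", "203.0.113.10")
```

In the broken layout the NAT’s own outbound traffic matches the same service route and is handed straight back to the NAT; in the working layout the NAT’s subnet only has a default route, so the traffic leaves the VPC.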
4. Making the switch over
Using our existing Ansible automation, we re-created our services within our new VPC, and created new Application Load Balancers (ALBs) (with interfaces in each of the ‘legacy’ subnets) to proxy to our new instances.
Before we could switch traffic over, we needed to:
- Test that all of the services in the VPC were working, which was done by updating hosts files, and hitting all services.
- Ensure our monitoring was sufficient, and we could see traffic migrating between our systems. We did this by creating a Datadog board.
In order to test this, I pulled the IP of the new ALBs, and updated my local hosts file for each of our services, and went through each service to confirm that all setup and networking was working as expected.
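The hosts-file step is easy to script. The helper below is a minimal sketch with hypothetical hostnames and a documentation-range IP; it only formats the lines, and leaves pasting them into the hosts file to you. Keep in mind that the IPs behind an ALB can change over time, so entries like these are only suitable for short-lived testing.

```python
import ipaddress
import socket


def resolve_alb_ips(alb_dns_name):
    """Look up the current IPv4 addresses behind an ALB's DNS name."""
    # gethostbyname_ex returns (hostname, aliases, ip_addresses).
    return socket.gethostbyname_ex(alb_dns_name)[2]


def hosts_entries(alb_ip, hostnames):
    """Return one hosts-file line per hostname, all pointing at alb_ip."""
    ipaddress.ip_address(alb_ip)  # raises ValueError if the IP is malformed
    return [f"{alb_ip}\t{name}" for name in hostnames]


if __name__ == "__main__":
    # Hypothetical service names; replace with the services being tested.
    for line in hosts_entries("198.51.100.7", ["api.example.com", "app.example.com"]):
        print(line)
```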
At this point, we had to switch our DNS to our new ALBs. Because we make use of client-specific subdomains, we were able to do this incrementally.
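An incremental switch like this can be sketched as moving subdomains over in small waves. The example below assumes the records live in Route 53 and are plain CNAMEs, which may not match your setup; the subdomains, ALB DNS name, and TTL are hypothetical. It only builds the change payloads and does not call AWS.

```python
def waves(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_change_batch(subdomains, alb_dns_name, ttl=60):
    """Build a Route 53 ChangeBatch that points each subdomain at the ALB."""
    return {
        "Comment": "Move client subdomains to the VPC ALB",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": name,
                    "Type": "CNAME",
                    "TTL": ttl,  # a short TTL lets clients pick up the change quickly
                    "ResourceRecords": [{"Value": alb_dns_name}],
                },
            }
            for name in subdomains
        ],
    }


# Hypothetical client subdomains and ALB DNS name.
clients = ["acme.example.com", "globex.example.com", "initech.example.com"]
batches = [
    build_change_batch(wave, "my-alb-123.us-east-1.elb.amazonaws.com")
    for wave in waves(clients, 2)
]
```

Each payload could then be passed as the `ChangeBatch` argument to boto3’s `change_resource_record_sets`, one wave at a time, watching the monitoring dashboard between waves.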
After the switch, the timeline was as follows:
- Within a few minutes the majority of traffic (~95%) had switched
- After a day, we started spinning down most of our EC2-Classic instances
- After ~4 days, the only remaining traffic was direct IP hits, which we do not care about
- 5 days after our switch we deleted the old ELBs
A few weeks after moving into the VPC, we migrated our RDS instance into our VPC as well. Since it was our final EC2-Classic resource, we no longer needed the NATs, so we could clean up:
- The service-specific Route Table entries
- The 4 NATs
- The ‘NAT’ Subnets
Also, since all our Route Tables were now the same, we could delete 3 of them and use a single Route Table for all of the ‘Legacy’ subnets.
We’ve now migrated our infrastructure from EC2-Classic into a new VPC, while maintaining our external connections during the migration, and ending up in a position where we can safely start to shift traffic into a Kubernetes setup.