Simplifying networks (and lessons in Engineering tradeoffs)

What does it mean to simplify your network? Here are some examples of simplification and the tradeoffs that came with it.

You’ve probably heard people say that networks are too complex and need to be made simpler. I certainly believe that. I think people build networks that are too complicated and don’t think enough about how hard they will be to operate over time. There are many reasons: people believe vendor hype or like shiny things, the business asks for things that require complexity, growth wasn’t planned for well enough, and so on. The more complex the network, the harder it is to operate, the more fragile it is, and the harder it is to change. Everything comes down to how well you can operate your network over time. Just because a network is complex doesn’t mean that it’s wrong. Remember, engineering is the art of tradeoffs, so the areas that bring complexity are worth examining to see how to make things better.

I think it might be helpful to go through examples I’ve experienced of simplifying networks. Often, as you think about the tradeoffs and want to simplify the network, you have to have a discussion with the people using the network. You must weigh the complexity of operating all the features you’ve been asked for and ask if there is a better way of doing things. You will get different answers depending on the people, companies, and cultures involved in the discussion, but the point is to do the analysis and understand the tradeoffs.

L2/VLANs

When I started at Amazon in late 2002, we had two datacenters and they were filled with L2 and VLANs. There were three main networks: website, database, and corporate. Each host in the datacenter had two NICs and was attached to two of the three networks, depending on the purpose of the host. I wasn’t around when these decisions were made, but I think the idea was to keep the database hosts secure from the internet: they were on the database and corporate networks, but not the website network. Website hosts were on the website and database networks but not corporate. I don’t remember how hosts were assigned to VLANs inside those networks; a large part of it was probably which load balancer they were behind. This is a kind of complexity that I am especially offended by: pseudo-security that doesn’t really make things more secure, but makes you feel like it does while making something else more complicated. However, it was not easy to change. Just because I was offended and thought there was a better way didn’t mean we could make the change; it took work and time.

In the early 2000s, load balancers were almost completely L2 based, at least the ones that Amazon used, which meant that the LB needed to be the default gateway for devices that needed load balancing. IIRC, if one host needed to talk to a service that had hosts on the same VLAN, they couldn’t talk successfully, so that service needed to be moved to another VLAN. This made having lots of services more and more difficult, requiring the networking team to know what software was on what host. If you can avoid it, you never want the networking team to be in the middle of software operations.

As you can imagine: lots of outages and lots of operational nightmares. Hard to keep up and hard to scale. I think most people know to hate L2, VLANs, and spanning tree these days, but that was less true in the early 2000s. This is why we all moved away from L2 and VLANs.

Maybe you’ve heard of Amazon’s famous 2-pizza teams. The concept is a little overblown in that it was almost never that formal or fixed, but it’s true that the culture was set up so that teams could (and did) spin up as many different services as they needed to build the things their customers needed. This all led to an explosion of software services, each interacting with other software services in ways that nobody truly understood or understands.

If we had thought ahead we would have seen that this network couldn’t possibly work over time. Enough of us were offended by L2, and we’d had enough operational issues, that we argued to make it simpler. How could we decouple this knot of complexity? The first step was to move to L4/L7 load balancers. We needed the load balancers to rewrite the destination headers and not be in the middle of everything. We needed the LB to not be such an integral part of our network architecture. We had brought in, for other purposes, some new LBs that could do L4/L7. Getting LBs out of the direct path and no longer having them be the default gateway gave us a lot more flexibility. Flexibility that we wouldn’t have survived without, but at the time, we didn’t know how badly we’d need it. We just knew that the way we were working wasn’t working.
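
To make the contrast concrete, here is a minimal sketch, in Python rather than on a real load balancer, of what an L4-style destination rewrite does: clients only ever talk to a VIP, the LB rewrites the destination on the way in and the source on the way back, and ordinary L3 routing does the rest, so the LB no longer has to be anyone’s default gateway. All names and addresses here are invented for illustration.

```python
# Hypothetical sketch of L4 load balancing via destination rewrite.
VIP = "10.0.0.10"                                  # virtual IP clients connect to
BACKENDS = ["10.1.1.11", "10.1.1.12", "10.1.1.13"]  # real servers, any subnet

def pick_backend(client_ip: str, client_port: int) -> str:
    """Very naive backend selection, just enough for the sketch."""
    return BACKENDS[hash((client_ip, client_port)) % len(BACKENDS)]

def rewrite_inbound(pkt: dict) -> dict:
    """Client -> VIP becomes client -> backend (destination rewrite)."""
    backend = pick_backend(pkt["src_ip"], pkt["src_port"])
    return {**pkt, "dst_ip": backend}

def rewrite_outbound(pkt: dict) -> dict:
    """Backend -> client becomes VIP -> client, so the client only ever sees the VIP."""
    return {**pkt, "src_ip": VIP}

if __name__ == "__main__":
    inbound = {"src_ip": "203.0.113.5", "src_port": 40001, "dst_ip": VIP, "dst_port": 80}
    to_backend = rewrite_inbound(inbound)
    reply = {"src_ip": to_backend["dst_ip"], "src_port": 80,
             "dst_ip": inbound["src_ip"], "dst_port": inbound["src_port"]}
    print(rewrite_outbound(reply))
```

Because the rewrite happens at L3/L4, the backends can live anywhere the routed network can reach, instead of having to share a VLAN with their gateway.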

That got rid of most of the L2 requirement. Except that each datacenter host still had those two NICs that needed to be in two of the three networks, which would require either custom cabling or something dynamic like VLANs as hosts came into the network. Either way required host configuration in the network that depended on the software that would be running on the host. If you can help it, you don’t want networking in the way of the software that is running on the host. How did we solve this? We went to the software teams and asked if we could just have each host have one NIC on one network. We were in the process of building out a new datacenter design because of big changes in Amazon.com. In early 2004 Amazon moved Amazon.com from our Seattle datacenter to datacenters in Virginia. We used that change to clean up the network. We got rid of the database network, kept just a website and a corporate network, and hosts could only be on one of them. It was fairly easy, when hosts were ordered, to know whether they were website or corporate, so assigning them physically when they were installed into the network was not a big burden. This is an example of making the network simpler by understanding what the business really needed. The business originally asked for two NICs, probably in support of better security, but that security wasn’t real, and it led to more operational pain, which led to more outages. It’s certainly easier to make the argument that something didn’t give better security and just led to more problems after you have years of struggling to keep the network running; at that point it’s easier to argue that the network needs less complexity.

After we had these things in place, we could make the host network all L3. This also meant that each host network looked almost the same on the switch, and that we could then automate configuration of all the ToR switches, which happened in 2005 or 2006. This simplification of the network was necessary for the growth that was coming, though we did not know that at the time; it’s just that some of us were really tired of bad on-call.
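
To give a flavor of what that automation can look like: once every ToR serves an all-L3 host network that looks the same, the whole switch config can be derived from a handful of per-device fields. The sketch below is hypothetical, with an invented config syntax; it is not the system we actually built.

```python
# Hypothetical template-driven ToR configuration: everything except the
# hostname, loopback, and uplink addressing is identical across devices.
from textwrap import dedent

TOR_TEMPLATE = dedent("""\
    hostname {hostname}
    interface loopback0
      ip address {loopback}/32
    {uplinks}
    router ospf 1
      network {loopback}/32 area 0
    """)

def render_uplink(port: int, addr: str) -> str:
    return f"interface uplink{port}\n  ip address {addr}/31"

def render_tor(hostname: str, loopback: str, uplinks: dict) -> str:
    uplink_text = "\n".join(render_uplink(p, a) for p, a in sorted(uplinks.items()))
    return TOR_TEMPLATE.format(hostname=hostname, loopback=loopback, uplinks=uplink_text)

print(render_tor("tor-r12-u40", "10.255.1.12", {1: "10.254.0.1", 2: "10.254.0.3"}))
```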

LB policy

Another example is from load balancers in the 2000s, and while I can claim victory, in truth it’s not as clear cut. Load balancers are hard, and we in networking were responsible for them. Software teams wanted more sophisticated web traffic routing and we pushed back. Load balancers are very complicated and we didn’t trust our ability to keep up with the policy changes and keep everything available. Eventually the software teams added a proxy layer, software that they were in charge of, to direct traffic as necessary. This gave them the control that they wanted and fit in with the rest of their development tools, including deployment. Over time this became more and more sophisticated in ways that I don’t think we could have replicated on the load balancers.
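
As a rough illustration of the kind of policy a proxy layer ends up owning, here is a small sketch of rule-based routing with path matches, header matches, and weighted splits. The rules and service names are invented; the point is that this logic lives in software the service teams deploy themselves, not in load balancer configuration.

```python
# Hypothetical proxy-layer routing rules: first match wins, backends can be
# split by weight for gradual rollouts.
import random

ROUTES = [
    {"match": {"path_prefix": "/checkout"},
     "backends": {"checkout-v2": 0.9, "checkout-v1": 0.1}},
    {"match": {"path_prefix": "/", "header": ("x-beta", "true")},
     "backends": {"web-beta": 1.0}},
    {"match": {"path_prefix": "/"},
     "backends": {"web": 1.0}},
]

def route(path: str, headers: dict) -> str:
    for rule in ROUTES:
        m = rule["match"]
        if not path.startswith(m["path_prefix"]):
            continue
        if "header" in m and headers.get(m["header"][0]) != m["header"][1]:
            continue
        names, weights = zip(*rule["backends"].items())
        return random.choices(names, weights=weights)[0]
    raise LookupError("no route matched")

print(route("/checkout/cart", {}))
print(route("/home", {"x-beta": "true"}))
```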

I know that other companies did (and probably still do) sophisticated traffic routing in their load balancers. We chose not to, and it worked out well for us. Part of that is that this separation matched Amazon’s culture much better and allowed the team that cared about traffic routing to own its destiny, rather than relying on a networking team that had a lot of other things to worry about at the same time.

Merchant silicon based routers

In 2009, several of us in AWS Networking were tasked with figuring out how to use merchant silicon, single-ASIC based routers for more than just ToR switches. In the end we came up with a large 3-tier Clos design to replace our aggregation network. As we were getting our heads around what this meant, several things became clear. We’d have hundreds (and then thousands) of these devices in a datacenter, very different from the tens and sometimes low hundreds of devices we had had before. Since the requirements for scale, including growth, were now very clear, we had to change the way we had been doing networking, especially around management and change.
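
To see why the device count jumps like that, here is the textbook arithmetic for a 3-tier Clos (fat-tree) built entirely from one fixed-radix ASIC. This is the standard k-ary fat-tree formula, not the actual AWS design, and the radix values are only illustrative.

```python
# Classic 3-tier fat-tree arithmetic for a k-port switch.
def fat_tree(k: int) -> dict:
    edge = aggregation = k * k // 2      # k pods, each with k/2 edge and k/2 agg switches
    core = (k // 2) ** 2                 # (k/2)^2 core switches
    hosts = k ** 3 // 4                  # k pods * (k/2 edge) * (k/2 hosts per edge)
    return {"edge": edge, "aggregation": aggregation, "core": core,
            "total_switches": edge + aggregation + core, "hosts": hosts}

for radix in (32, 64):
    print(radix, fat_tree(radix))
```

With a 64-port ASIC that is on the order of five thousand switches for a fully built fabric, which is why the management and change practices had to be rethought rather than scaled up.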

We created a configuration generation system that allowed very few options and reliably made every device the same. This set of tradeoffs meant that engineers could no longer just try new features; instead, new things took a long time to test out and then to get into our software systems. The configuration generation system was correct by construction: if a config comes out of the generation system, there is very little chance that it’s broken. This kind of system is less flexible and makes change slower, but with the large number of devices we had, we thought it was the best way to keep everything consistent, which was required.
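
"Correct by construction" can be read narrowly here: the generator accepts only a handful of fields and refuses to emit anything that violates its invariants. The sketch below is hypothetical; the field names, naming scheme, and rules are invented to show the shape of the idea, not the real system.

```python
# Hypothetical correct-by-construction generator: few inputs, hard checks,
# and nothing leaves the generator unvalidated.
import ipaddress
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class DeviceIntent:
    hostname: str     # must match the site naming scheme
    loopback: str     # a /32 inside the allocated loopback block
    uplinks: tuple    # fixed fan-out: exactly four /31 point-to-point subnets

LOOPBACK_BLOCK = ipaddress.ip_network("10.255.0.0/16")

def validate(intent: DeviceIntent) -> None:
    if not re.fullmatch(r"[a-z]+-r\d+-u\d+", intent.hostname):
        raise ValueError(f"bad hostname: {intent.hostname}")
    if ipaddress.ip_address(intent.loopback) not in LOOPBACK_BLOCK:
        raise ValueError(f"loopback outside allocated block: {intent.loopback}")
    if len(intent.uplinks) != 4:
        raise ValueError("every device in this tier has exactly four uplinks")
    for net in intent.uplinks:
        if ipaddress.ip_network(net).prefixlen != 31:
            raise ValueError(f"uplink must be a /31: {net}")

def generate(intent: DeviceIntent) -> str:
    validate(intent)  # refuse to emit anything that breaks an invariant
    lines = [f"hostname {intent.hostname}",
             f"interface lo0 address {intent.loopback}/32"]
    for i, net in enumerate(intent.uplinks):
        addr = next(ipaddress.ip_network(net).hosts())
        lines.append(f"interface up{i} address {addr}/31")
    return "\n".join(lines)

print(generate(DeviceIntent("agg-r3-u17", "10.255.3.17",
                            ("10.254.0.0/31", "10.254.0.2/31",
                             "10.254.0.4/31", "10.254.0.6/31"))))
```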

A controversial approach we took is that (almost) all changes to these devices are a full configuration replace and a reboot of the device. Again, we had to balance the tradeoffs. Rebooting for every change adds time to some changes, and it adds the risk that the device or its links won’t come back up. It also requires software to manage that whole process when you have that many devices. On the other hand, we had many fewer workflows to figure out and program, and we didn’t have to work through all the bugs in NOSes that show up when their state changes. I think this turned out to be a very good decision, but it was mine, so I’m biased. The tradeoffs did cost us; it wasn’t all good with no downside.
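
The attraction is that full-replace-and-reboot collapses everything down to roughly one workflow. Here is a minimal sketch of what that single workflow can look like; the helper functions are hypothetical placeholders, not a real device API.

```python
# Hypothetical single deployment workflow: drain, replace the whole config,
# reboot, verify, undrain. The device and traffic operations are stubs.
import time

def drain(device): print(f"draining traffic away from {device}")
def push_full_config(device, config): print(f"writing {len(config)} bytes to {device}")
def reboot(device): print(f"rebooting {device}")
def is_healthy(device): return True   # stand-in for: links up, adjacencies established
def undrain(device): print(f"returning traffic to {device}")

def deploy(device: str, config: str, timeout_s: int = 900) -> None:
    """The one workflow every change goes through."""
    drain(device)
    push_full_config(device, config)   # always the complete config, never a diff
    reboot(device)                     # every change boots into a known state
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if is_healthy(device):
            undrain(device)
            return
        time.sleep(10)
    raise RuntimeError(f"{device} did not come back healthy; leaving it drained")

deploy("agg-r3-u17", "hostname agg-r3-u17\n...")
```

The design choice is to trade per-change speed for a single, heavily exercised code path instead of many rarely exercised ones.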

Another thing we did (by we, I mean the team; I had nothing to do with this part, and it was wildly effective and important) was to come up with a fully packaged rack of network gear to bring in. We were 10xing the number of devices and 10xing the number of cables, and packing all of that into a small number of racks. We came up with cable harnesses and wiring plans, and worked with systems integrators so that the rack of network gear arrived ready to plug in. This packaging of the physical hardware was incredibly important to the success of this project, this network, and our ability to keep up with AWS growth. In other words, one consequence of choosing ToR-style devices for aggregation was a lot more cables and devices, so we had to engineer a solution to a kind of physical complexity that we had just not worried about before.

Virtual Networking

In the first part of this post there is a section about removing L2 from our network. We made a consistent L3 network. Over time, and especially because of EC2, we had to add network virtualization back into the network. Because operating and scaling networks is hard, we insisted that the virtual networking happen on the host. When EC2 first launched (I think that’s now called EC2 Classic), the networking was simplistic, with no ability to choose your own addresses, but you could isolate your network from others. That was replaced with VPC, which is a much more sophisticated multi-tenant network virtualization. None of this was on the routers, so we could deal with the scale of possibly millions of tenants with many millions of hosts. This means that the physical network doesn’t know anything about the virtual network and can scale independently, which was necessary for AWS. Of course the tradeoff is a lot of sophisticated software on the hosts. I wish this option were more easily available for more networks; I’m no fan of EVPN in datacenters, but there aren’t a lot of options if you need virtual networking in your datacenter.
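
A minimal sketch of the core idea, assuming nothing about VPC’s actual implementation: the physical network only routes between physical hosts, while a mapping from (tenant, virtual address) to physical host lives in software and drives the encapsulation. All names and addresses are invented.

```python
# Hypothetical host-based virtual networking: routers only ever see substrate
# (physical host) addresses; tenant addresses exist only in the mapping and
# inside the encapsulation.
MAPPINGS = {
    ("vpc-a", "10.0.1.5"): "172.16.40.11",
    ("vpc-a", "10.0.2.9"): "172.16.41.23",
    ("vpc-b", "10.0.1.5"): "172.16.40.11",   # same virtual IP, different tenant
}

def encapsulate(vpc_id: str, src_vip: str, dst_vip: str, payload: bytes) -> dict:
    """Wrap a tenant packet in an outer packet the physical network can route."""
    try:
        dst_host = MAPPINGS[(vpc_id, dst_vip)]
    except KeyError:
        raise LookupError(f"{dst_vip} is not in {vpc_id}")  # tenant isolation
    return {
        "outer_dst": dst_host,    # what the physical routers actually see
        "tenant": vpc_id,         # carried in the encapsulation header
        "inner": {"src": src_vip, "dst": dst_vip, "payload": payload},
    }

print(encapsulate("vpc-a", "10.0.2.9", "10.0.1.5", b"hello"))
```

Because tenants can reuse the same virtual addresses and the routers never learn any of it, the physical fabric and the virtual networks scale independently.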

Adding BGP to regions

Sometimes you have to add complexity. Ease of operations is critical but you must also contend with availability.

Amazon and AWS have multiple datacenters (availability zones) per region. Originally, in 2004, when we first did two datacenters in a region for Amazon.com, both were small, and so we just continued to use OSPF in and between these datacenters. This grew over time. Between regions we used BGP, but inside a region it was all OSPF. That was easier to operate, but it didn’t give us the isolation that we needed and had promised to customers. We had bugs in OSPF that broke the whole region. We made mistakes in changes and broke the whole region. We then decided we had to add BGP, and its more sophisticated policy, between the availability zones. This was a looong and complicated process, because by this time there was a lot going on and we couldn’t turn off Amazon.com and AWS for a couple of days while we completely moved to BGP. It did lead to a more available region, at the cost of lots of work and more complexity in policy to manage.
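
To make the isolation argument concrete, here is a small sketch, with invented AZ names and prefixes, of what a BGP boundary between availability zones buys that one region-wide OSPF domain does not: each AZ only exports routes from its own allocation, so a mistake or bug inside one AZ is much less likely to take routes for the whole region with it.

```python
# Hypothetical export policy at an AZ border: announce only prefixes inside
# the AZ's own allocation; drop everything else at the boundary.
import ipaddress

AZ_ALLOCATIONS = {
    "az1": ipaddress.ip_network("10.16.0.0/14"),
    "az2": ipaddress.ip_network("10.20.0.0/14"),
    "az3": ipaddress.ip_network("10.24.0.0/14"),
}

def export_policy(az: str, candidate_routes: list) -> list:
    allowed = AZ_ALLOCATIONS[az]
    exported = []
    for prefix in candidate_routes:
        if ipaddress.ip_network(prefix).subnet_of(allowed):
            exported.append(prefix)
        # anything else is filtered here instead of leaking region-wide
    return exported

# az1 accidentally carries an az2 prefix internally; the boundary filters it out.
print(export_policy("az1", ["10.16.32.0/19", "10.20.0.0/19", "10.17.0.0/16"]))
```

In practice the policy is of course richer than one prefix filter, but the boundary itself is what limits the blast radius.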

Conclusion

There are usually reasons, sometimes even good reasons, for complexity in networks. As an engineer you need to think through the implications of those reasons and what the tradeoffs are. You need to understand enough of what’s going on in the software systems to know where and how the tradeoffs can be made. As your network and requirements change over time, think about how to make things simpler. As you notice what breaks, what is hard to change, and what pages you in the middle of the night, think about how to make the network simpler. Always be thinking about how to make the network operate better, how to make it more available, and how to keep up with the changes necessary. These things can compete. For instance, you can add a lot of complexity so that no matter which way the business goes, you are ready, but that usually makes it much harder to keep the network up, and if something comes along that you didn’t plan for, making changes outside the plan is even harder. Tradeoffs. Simpler, with fewer features, is usually easier to change over time, but you must balance that with other requirements like availability or business needs.

Not all lessons from here are applicable to everyone. First off, of course, L2 is not used very heavily anymore, for all the reasons I talk about here. Second, at Amazon keeping up with business scale became the most important thing, and we had software teams to throw at some of the problems, so we made tradeoffs that aren’t worth it for other people.

Suzieq

Try out Suzieq, the open source, multivendor tool for network observability and understanding. Suzieq collects operational state in your network and lets you find, validate, and explore your network.