Lessons from load balancers and multicast

In general I dislike operating load balancers and IP multicast: I’m a network engineer. Load balancers and IP multicast are very complicated, have a large amount of state, and are hard to understand and debug. They lead to a lot of operational issues and outages. They have directly led to a lot of pages in my life and a lot of lost sleep. They are hard to capacity plan, fail in unintuitive ways, etc. People underestimate how hard it is to operate load balancers, whether they are appliances or software. Don’t you just turn on multicast and it works?

I could go on.

In my ideal datacenter network, I just want to deal with unicast IPv4, OSPF, and BGP. It’s hard enough just getting those basics running. I want a simple, uncomplicated network. Components that add more state into the network make the network fragile and make the whole system less stable. They’ve led to an alarming number of outages in my experience. I don’t think any other technologies or protocols have caused as much trouble as LBs and multicast in the datacenter networks that I’ve been responsible for.

But they’ve been very important for my career because I’ve learned a lot from operating them. In operating LBs and multicast, I had to directly connect with many different teams and software engineers at Amazon, and from those experiences I really learned a lot that is applicable to networks. I’m going to take you along on some of my journey through operating load balancers and multicast and what that led me to. There is a lot to learn from applications in the datacenter to help you think about how to be a better network engineer.

What’s so hard about that? – Load Balancing

Load balancers are crucial for applications, and somebody has to operate them. They provide important functions such as abstracting services apart from each other, doing load balancing, and health checking. Multicast on the other hand, nope. Too hard. Not worth the effort. For a long time I thought it was cool technology that just needed to be worked out, but I haven’t thought that since 2005. We thought we could uniquely solve interesting problems with multicast, but the solutions were much worse than the problem.

We spent a lot of time with load balancer vendors. We ran into scaling issues, feature issues, and we needed help with managing and operating what became a significantly large number of LBs. We thought 10 pairs of load balancers was a lot, then 100 pairs, etc. That is not a lot compared to what Amazon has now. It took a long time to figure out a good pattern, and then build software to manage that. One of the hard parts about load balancers when you have hundreds, then thousands, then tens of thousands of services that need load balancing is that all those applications change a lot, and the teams that manage them need to change their load balancing a lot. Provisioning and change management need sophisticated software because they require precision and are error prone.

We once had a bug with our load balancer in which some of the time on the heaviest VIPs we would get small amounts of connection drops that would affect their 99.9th percentile latency. 99.9th percentile is probably the most important metric to Amazon: it gives an understanding of your outlier performance, but isn’t too biased by a few bad interactions. We couldn’t get the vendor’s software team to take the issue seriously. Finally our sales engineer got me device code. In the middle of the night I woke up and realized what was wrong. And then figured out a solution. We already didn’t trust those LBs, but that was the last straw for me.
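To make the 99.9th percentile point concrete, here is a minimal sketch using the nearest-rank definition of a percentile. The function name and the latency numbers are invented for illustration; they are not Amazon data or tooling.

```python
import math
import statistics

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical VIP: 10,000 requests at 20 ms, except a few that hit a dropped connection (3000 ms).
healthy = [20] * 9995 + [3000] * 5      # 0.05% of requests are slow
dropping = [20] * 9985 + [3000] * 15    # 0.15% of requests are slow

for name, latencies_ms in (("healthy", healthy), ("dropping", dropping)):
    print(name,
          "mean:", statistics.mean(latencies_ms),
          "p50:", percentile(latencies_ms, 50),
          "p99.9:", percentile(latencies_ms, 99.9))
```

In this made-up data the median never moves and the mean shifts by only a few milliseconds, while the 99.9th percentile jumps from 20 ms to 3000 ms once more than 0.1% of requests are affected. A handful of bad interactions below that threshold does not move it at all.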

All of these systems have bugs, especially when you are talking about systems with as much code, complexity, and state as load balancers. But not all vendors will work with you to help you understand and build the best overall systems, and this is often the actual differentiator for me in which vendor to use. At Amazon we had many different hardware load balancer vendors and we talked to even more companies. Some had special hardware, such as NPUs or ASICs, and some ran on CPUs. In my experience, invariably, the CPU ones were much, much better. It’s not because of features or throughput but because of debuggability and the familiarity their engineers had with the platform. This experience is a major reason why I never want things like stateless load balancing in networking switches. Also, it gives you worse actual load balancing, but that’s a different topic.

A critical lesson I learned is that running out of capacity is the worst thing you can do in networking, whether it’s load balancer capacity, or any other capacity. There are more resources that limit capacity than you might realize at first. Your job as an engineer is to find what those are. We used least connections load balancing, and one of our load balancers looked linearly through the servers to find which had the least connections. We thought that we had n*10s of thousands of connections per second, but that was drastically reduced because the highest throughput service had hundreds of servers in it. We didn’t know to test for that, and we had a set of huge outages. The way out of that was testing to destruction and a home-grown metric we made up based on multiple things that we could measure.
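The linear lookup described above can be sketched as follows. This is an illustration of the general technique, not the vendor's implementation; the function names and the comparison budget are assumptions made up for the example.

```python
def pick_least_connections(active_connections):
    """Linear scan for the server with the fewest active connections.

    Returns (index of chosen server, number of servers examined).
    """
    if not active_connections:
        raise ValueError("pool is empty")
    best = 0
    examined = 0
    for i, count in enumerate(active_connections):
        examined += 1
        if count < active_connections[best]:
            best = i
    return best, examined

def max_new_connections_per_second(comparisons_per_second, pool_size):
    """If every new connection scans the whole pool, the connection rate
    is bounded by the comparison budget divided by the pool size."""
    return comparisons_per_second // pool_size

# Hypothetical budget: the device can afford 2,000,000 comparisons per second.
BUDGET = 2_000_000
for pool_size in (10, 100, 400):
    print(pool_size, "servers ->", max_new_connections_per_second(BUDGET, pool_size), "conn/s")
```

The work per new connection grows with the number of servers in the pool, so a connections-per-second figure measured against a small pool says little about a VIP with hundreds of servers behind it. That is the kind of hidden limit that testing to destruction is meant to expose.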

In 2005 and especially in 2006 I owned load balancers, and at that time the biggest problem was understanding capacity. To understand capacity, you have to understand the traffic in aggregate. After some analysis I realized that the top 10 services (of thousands at that point) did about 50% of the traffic. This was a power law, which helped me understand that the focus has to be on those 10 services. You can probably guess what those services are, based on an eCommerce website. We had to spend time with those top services understanding how they scale. Just because we were told that a service would have a 20% growth didn’t mean we weren’t still responsible when it turned out to be 100%. Or when VPs said no more hardware load balancers, but the website demanded 2x, it still remained our problem. Also, just because sales grow by 20% (or whatever the number is) doesn’t mean page hits or calls to search go up by only 20%.
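The aggregate analysis amounts to sorting services by traffic and asking how much the top few carry. A minimal sketch, assuming you already have per-service traffic totals; the data below is synthetic and Zipf-like, and it does not reproduce the 50% figure above.

```python
def top_n_share(traffic_by_service, n):
    """Fraction of total traffic carried by the n busiest services."""
    total = sum(traffic_by_service.values())
    if total <= 0:
        raise ValueError("total traffic must be positive")
    busiest = sorted(traffic_by_service.values(), reverse=True)[:n]
    return sum(busiest) / total

# Invented data: the k-th busiest of 2,000 services gets traffic proportional to 1/k.
traffic = {f"service-{k}": 1.0 / k for k in range(1, 2001)}

for n in (1, 10, 100):
    print(f"top {n:>3} services carry {top_n_share(traffic, n):.0%} of traffic")
```

With a heavy-tailed distribution like this, a tiny fraction of the services carries a large fraction of the load, which is why capacity planning effort is best spent on understanding how those few services scale.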

I learned how to better dive into vendor products and make vendors work at describing their technology. You can prevent a lot of problems if you can deep dive into an architecture and understand its tradeoffs and limitations. Asking every vendor which resources are limited and whether you can measure them is really crucial, and again this is true with any network hardware. If a resource isn’t monitorable and it’s truly critical, I would never again accept using that hardware. In the case of the LBs that had the linear lookup, they had NPUs doing the processing, and we had no way to measure the utilization other than running out of capacity. This is never the position you want to be in when operating anything important. We stayed away from the next generation of this LB because we didn’t trust that vendor to understand how to make things operable.

Pain and Suffering – IP Multicast

The reason multicast was interesting to Amazon in late 2002 and on was that Amazon started using Tibco Rendezvous¹ for messaging for much of the traffic between services. A client would send a request out to a multicast group, which would get picked up by a queuing service, which would then send it to a service host. This is some pretty cool infrastructure magic. Clients, queues, and servers can all be nicely separate from each other. But then we had so many disasters, it became clear that this was not the path to victory. Magic infrastructure is often extremely hard to troubleshoot and debug.

We had issues because our load balancers were our L3 gateway. Sometimes multicast traffic through the load balancer got dropped. I don’t know how the Amazon website ever worked at that point in time. In late 2002 or early 2003 (I can’t remember which month, it was a blur), we hit a period in which, every night at the same time, we would have a network and website outage and we did not know what was going on. I can’t remember how we narrowed things down to specific devices, but at some point we got vendor support to log into our devices and do some debugging². The engineer reported that we were dropping packets on a specific ASIC (Pinnacle – not salty at all that I still remember the name of that #%!##$ ASIC 15+ years later) in our line cards. We’d get drops on the device and not have any counters reporting drops on anything that we could access. Silent drops are the worst, but a topic for another time. What can you do about that? I learned that you have to pay attention to vendor architecture and it’s worth knowing the day-in-the-life-of-a-packet. Now what traffic did we have that caused packet drops? We had a content push at this time of night, which was multicast, every night, and it went to many different switch ports. I can’t remember what the band-aid was to get us back, but it was probably some kind of throttling on the content push. In 2003-4 we changed our network architecture to not have LBs as gateways and used LBs that could do source NAT.

In 2004 and 2005, I believe we had the most S,Gs in any datacenter network in the world, primarily because the hardware available couldn’t handle more than that. I don’t remember exactly, but I believe we had 16K S,Gs, then a code change, and then up to 20K. If you don’t know what S,Gs are, they are source, group pairs, which is the state that multicast routers have to keep to know where to send traffic. In general, it’s not fun being on the edge of what a vendor can support. The vendor is rarely ready to deal with your needs and you will suffer for it, unless you do a lot of testing and design work to mitigate these issues. We didn’t know enough to know to do that.

We were multicasting all log messages from application servers to a fleet of processing hosts. The cool thing is that the clients didn’t need to know anything about the processing, and the processing hosts behind the scenes could change how they broke up processing work. Very cool magic infrastructure. One night I was on call, probably in 2005 (or was it 2006?). Unbeknownst to anyone, somebody had broken the webserver cache on an Amazon website. There was a service that returned image size, and the cache got a greater than 90% hit rate on that data. So the image size service got 10x the traffic it was expecting and started timing out requests. There was a log message for timeouts, and because logs were multicast, we were browning out interfaces all over the datacenter with multicast. Untying that mess took a lot of time and more network engineers than me.
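The jump from a broken cache to 10x traffic is simple arithmetic: with a hit rate h, only a fraction 1 - h of requests normally reaches the backend, so losing the cache multiplies backend load by 1 / (1 - h). A small sketch; the function name is made up for illustration.

```python
def backend_amplification(hit_rate):
    """Factor by which backend traffic grows if a cache with this hit rate stops working."""
    if not 0 <= hit_rate < 1:
        raise ValueError("hit_rate must be in [0, 1)")
    return 1 / (1 - hit_rate)

for hit_rate in (0.50, 0.90, 0.95, 0.99):
    print(f"hit rate {hit_rate:.0%}: backend sees about {backend_amplification(hit_rate):.0f}x")
```

The better the cache, the worse the surprise: a service sized for the traffic it sees behind a 90% cache is sized for roughly a tenth of the real demand.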

Amazon had a set of test servers to test all the services that made up Amazon.com. Because there weren’t very many servers, they had many different services on them. Because they had many different services, they had many different multicast groups on them. Because of reasons I can’t remember, many of these were production multicast groups. Because this small number of test hosts had many services with many multicast groups subscribed to production traffic, they got a lot of multicast traffic and their CPUs were always very hot. This led to packet loss on those hosts, which led to rebroadcast of the multicast, which sometimes led to congestion collapse and outage. One of the key lessons here is that having a host or router that does multiple functions means that these functions can interact with each other in unintended ways. Also, don’t allow test to have any connection to production! :(

There are many more multicast disaster stories, from bugs and limitations in network routers, bugs and limitations in the middleware, and then just running out of capacity, but those will have to wait. Oh, there was the period we had 15-30 one-minute outages a day for weeks because of hardware issues, then the workaround caused multi-hour outages. On and on and on.

Lessons

I learned how to think about scaling networks and how critical it is to make things simpler. How to think about tradeoffs around magic abstractions and understandability. Infrastructure that is magic is often too good to be true, at least when you are scaling and growing very quickly. It requires deep introspection, understanding of what happens under failure, and some great monitoring. I got to learn all about applications and I got to learn all about how Amazon.com worked. In some ways, the best way to learn is through outages and we had a lot of them in the early 2000s (I think one of the critical reasons AWS is so good is because of the learning about distributed systems during this period). I also got to learn about the most important services and how they thought about scale.

Running out of capacity on a shared resource is about the worst sin you can commit in a network. And we ran out of capacity a lot. As I mentioned above, really understanding the resources that are limited and carefully monitoring them is key in any distributed system or network. Think of things like running out of TCAM space in your forwarding table because you weren’t monitoring it (or because there was a software bug).

I learned a lot about the Amazon services architecture. I had the experience of fighting VPs; I won’t say I learned how to do that well. There is so much interesting to learn from distributed systems and software engineering that can be translated into how to scale networks well. For instance, every time you can turn a partial or grey failure into a hard failure, it is a big win. Grey failures are extremely hard to detect. Exponential backoff is a great thing (ok, maybe I already knew that one).
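For readers who haven't implemented it, here is a minimal sketch of retrying with exponential backoff. The randomized ("full jitter") delay is a common variant that I am adding as an assumption; the function names and default values are illustrative only.

```python
import random
import time

def backoff_ceiling(attempt, base=0.1, cap=10.0):
    """Upper bound on the delay before retry number `attempt` (0-based): base * 2^attempt, capped."""
    return min(cap, base * 2 ** attempt)

def retry(operation, max_attempts=5, base=0.1, cap=10.0, sleep=time.sleep):
    """Call `operation` until it succeeds or max_attempts is reached.

    Between attempts, sleep for a random time in [0, ceiling], where the
    ceiling doubles after every failure. The randomness keeps many clients
    from retrying in lockstep and hammering a struggling service together.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(random.uniform(0, backoff_ceiling(attempt, base, cap)))
```

The important property is that each failure makes a client back away further, so the retry load on a constrained resource shrinks instead of feeding a congestion collapse. Passing `sleep` as a parameter is just a convenience so the behavior can be exercised without waiting.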

I met so many interesting people and projects. LBs are how I met many of the people on the S3 team, and over time I got to help them out from time to time (doing non-LB work). I still think S3 is the most amazing technology I’ve ever seen. I remember meeting the first five people on the Amazon Unbox team, and Amazon Video is now over a thousand people. I have some vague and fading knowledge about how eCommerce sites work and how hard it is to have a very large catalog.

Lessons I’ve learned (and talked about in this post)

Networks and Distributed systems have many of the same kinds of failures:
- Congestion collapse – figure out the constrained resources in the network and the distributed system, especially around those things that retry.
- Multicast – just because it’s not IP multicast doesn’t mean you don’t have to understand what happens when the system retries messages.
- Exponential backoff on retry is a beautiful idea when stability is your primary concern.
Infrastructure that looks like magic has the tendency to produce the largest catastrophes. Investigate deeply.
- Magic in networks and distributed systems is a dirty word as far as I’m concerned
Network Engineers must deep dive into device architecture. Get to know the day-in-the-life-of-a-packet.
Measure everything that can drop a packet. Most devices still don’t do this well, though it’s usually not the hardware, but the software on the device that doesn’t know how to make sense of all the counters and report it.
Understanding latency and the effect of outliers is very important. 99.9th percentile is nice for that.
Make things simple, including separating services into their own domains, i.e., don’t run every service on the same server or the same router. Separate things out for blast radius and for ease of understanding.
Capacity management is critical and if/when you fail, you will often introduce congestion collapse and make things much worse than you could imagine.
- Work really hard to understand all the different resources that are constrained.
Do everything you can to avoid congestion collapse, on every level. Think about control plane protocols, and Load Balancers, multicasts, even databases.
Be careful what you ask vendors for, they might give it to you, and it might cause you unending problems.
Better vendors give you better access to their product teams. In load balancers this was the most important aspect in evaluating a vendor. It’s very hard to test for, but you can ask around to see what other people think.
Get your research paper out of my network, I have real customers and real problems. cough cough p4.
If anybody says that load balancing is easy and they can just make their own load balancer in a year (or whatever) they do not know what they are talking about. Some of the best engineers at Amazon have told me that over the years, and I like to rub that in their noses from time-to-time.
Fighting VPs sucks, but it’s a little better if you are right and your company will fail if you aren’t listened to. But only a little better.

I think networking is a very interesting field; that’s why I’m still in it. But I’ve also really enjoyed learning about the technology that got built around me. And I got to learn how to think about applications, engineering, design, and architecture from more than just the network engineers around me.

You might notice all my stories are from 10+ years ago. There are several reasons for that:

1. I haven’t owned operations on load balancers for that long.
2. Unrelated to #1, Amazon is a lot better at operating load balancers than it used to be. :)
3. Amazon stopped production multicast more than a decade ago.
4. The stories are old enough I can’t get in too much trouble.
5. The most important: These are still many of the most important lessons of my career.

¹One of my two most hated products multiplied by vendor interactions. At one point they effectively told us it was our fault we had so much trouble because we ran both 100M and 1G Ethernet. Nope, the product was terrible. And it would have been a bad idea even if the product had been good software. I still feel outrage 17 years later.

²Anybody at Amazon these days hearing about a vendor logging into devices is probably throwing up right now. This has not happened for a long time.
