This article comes from Cloudflare's official blog. We at Nanosek want to keep our clients informed of any news and updates from the Cloudflare team. For more information, and to learn how you can upgrade to Cloudflare's incredible tech, contact us.
A lot of Cloudflare's technology is well documented. For example, how we handle traffic between the eyeballs (clients) and our servers has been discussed many times on this blog: “A brief primer on anycast (2011)”, "Load Balancing without Load Balancers (2013)", "Path MTU discovery in practice (2015)", "Cloudflare's edge load balancer (2020)", "How we fixed the BSD socket API (2022)".
However, we have rarely discussed the second part of our networking setup — how our servers fetch content from the Internet. In this blog post we're going to fill that gap. We'll discuss how we manage the Cloudflare IP addresses used to retrieve data from the Internet, how our egress network design has evolved, and how we optimized it to make the best use of the available IP space.
Brace yourself. We have a lot to cover.
Terminology first!
Each Cloudflare server deals with many kinds of networking traffic, but two rough categories stand out:
Internet sourced traffic - Inbound connections initiated by an eyeball to our servers. In the context of this blog post we'll call these "ingress connections".
Cloudflare sourced traffic - Outgoing connections initiated by our servers to other hosts on the Internet. For brevity, we'll call these "egress connections".
While rarely discussed on this blog, the egress part is critical for our operation. Our servers must initiate outgoing connections to get their jobs done! Like:
In our CDN product, before the content is cached, it's fetched from the origin servers. See "Pingora, the proxy that connects Cloudflare to the Internet (2022)", Argo and Tiered Cache.
For the Spectrum product, each ingress TCP connection results in one egress connection.
Workers often run multiple subrequests to construct an HTTP response. Some of them might be querying servers on the Internet.
We also operate client-facing forward proxy products - like WARP and Cloudflare Gateway. These proxies deal with eyeball connections destined to the Internet. Our servers need to establish connections to the Internet on behalf of our users.
And so on.
Anycast on ingress, unicast on egress
Our ingress network architecture is very different from the egress one. On ingress, the connections sourced from the Internet are handled exclusively by our anycast IP ranges. Anycast is a technology where each of our data centers "announces" and can handle the same IP ranges. With many destinations possible, how does the Internet know where to route the packets? Well, the eyeball packets are routed towards the closest data center based on Internet BGP metrics; often it's also geographically the closest one. Usually, the BGP routes don't change much, and each eyeball IP can be expected to be routed to a single data center.
However, while anycast works well in the ingress direction, it can't operate on egress. Establishing an outgoing connection from an anycast IP won't work. Consider the response packet. It's likely to be routed back to the wrong place - a data center geographically closest to the sender, not necessarily the source data center!
For this reason, until recently, we established outgoing connections in a straightforward and conventional way: each server was given its own unicast IP address. "Unicast IP" means there is only one server using that address in the world. Return packets will work fine and get back to the right server identified by the unicast IP.
Segmenting traffic based on egress IP
Originally connections sourced by Cloudflare were mostly HTTP fetches going to origin servers on the Internet. As our product line grew, so did the variety of traffic. The most notable example is our WARP app. For WARP, our servers operate a forward proxy, and handle the traffic sourced by end-user devices. It's done without the same degree of intermediation as in our CDN product. This creates a problem. Third party servers on the Internet — like the origin servers — must be able to distinguish between connections coming from Cloudflare services and our WARP users. Such traffic segmentation is traditionally done by using different IP ranges for different traffic types (although recently we introduced more robust techniques like Authenticated Origin Pulls).
To work around the trusted vs untrusted traffic pool differentiation problem, we added an untrusted WARP IP address to each of our servers.
Country tagged egress IP addresses
It quickly became apparent that trusted vs untrusted weren't the only tags needed. For the WARP service we also need country tags. For example, UK-based WARP users expect the bbc.com website to work. However, the BBC restricts many of its services to people in the UK only.
It does this by geofencing — using a database mapping public IP addresses to countries and allowing only the UK ones. Geofencing is widespread on today's Internet. To avoid geofencing issues, we must choose specific egress addresses tagged with an appropriate country, depending on the WARP user location. Like many other parties on the Internet, we tag our egress IP space with country codes and publish it as a geofeed (like this one). Notice, the published geofeed is just data. The fact that an IP is tagged as say UK does not mean it is served from the UK, it just means the operator wants it to be geolocated to the UK. Like many things on the Internet, it is based on trust.
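For illustration, a self-published geofeed is just a small CSV file in the format described in RFC 8805. The prefixes below are documentation addresses and the tags are made up, not our real assignments:

```
# prefix,country,region,city
192.0.2.0/25,GB,,
192.0.2.128/25,IE,,
198.51.100.0/24,SA,,
```

Geolocation providers periodically fetch the feed and update their databases, which is why the whole scheme only works if the publisher keeps it accurate.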
Notice, at this point we have three independent geographical tags:
the country tag of the WARP user - the eyeball connecting IP
the location of the data center the eyeball connected to
the country tag of the egressing IP
For best service, we want to choose the egressing IP so that its country tag matches the country from the eyeball IP. But egressing from a specific country tagged IP is challenging: our data centers serve users from all over the world, potentially from many countries! Remember: due to anycast we don't directly control the ingress routing. Internet geography doesn't always match physical geography. For example, our London data center receives traffic not only from users in the United Kingdom, but also from Ireland and Saudi Arabia. As a result, our servers in London need many WARP egress addresses associated with many countries.
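To make that selection concrete, here is a minimal sketch in Go of how such a choice could be made. The type, field names and addresses are hypothetical and purely for illustration; this is not Cloudflare's actual code:

```go
package main

import "fmt"

// egressIP describes one address in a server's local pool, with the tags
// we care about when picking a source address. Field names are illustrative.
type egressIP struct {
	addr    string
	trusted bool   // CDN-style "trusted" pool vs. WARP-style "untrusted" pool
	country string // country tag published in the geofeed, e.g. "GB"
}

// pickEgress returns an address whose tags match the connection: untrusted
// (WARP) traffic must not reuse trusted addresses, and the country tag
// should match the eyeball's country whenever we know it.
func pickEgress(pool []egressIP, trusted bool, country string) (egressIP, error) {
	var fallback *egressIP
	for i, ip := range pool {
		if ip.trusted != trusted {
			continue
		}
		if ip.country == country {
			return ip, nil
		}
		if fallback == nil {
			fallback = &pool[i]
		}
	}
	if fallback != nil {
		return *fallback, nil // right pool, but no country match available
	}
	return egressIP{}, fmt.Errorf("no egress IP for trusted=%v country=%q", trusted, country)
}

func main() {
	pool := []egressIP{
		{"192.0.2.10", true, "GB"},
		{"198.51.100.20", false, "GB"},
		{"198.51.100.21", false, "IE"},
	}
	ip, _ := pickEgress(pool, false, "IE") // a WARP user geolocated to Ireland
	fmt.Println("egress via", ip.addr)
}
```

Even this toy version hints at the cost: every tag combination we want to serve locally needs at least one address in the local pool.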
Can you see where this is going? The problem space just explodes!
Instead of having one or two egress IP addresses for each server, now we require dozens, and IPv4 addresses aren't cheap. With this design, we need many addresses per server, and we operate thousands of servers. This architecture becomes very expensive.
Is anycast a problem?
Let me recap: with anycast ingress we don't control which data center the user is routed to. Therefore, each of our data centers must be able to egress from an address with any conceivable tag. Inside the data center we also don't control which server the connection is routed to. There are potentially many tags, many data centers, and many servers inside a data center.
Maybe the problem is the ingress architecture? Perhaps it's better to use a traditional networking design where a specific eyeball is routed with DNS to a specific data center, or even a server?
That's one way of thinking, but we decided against it. We like our anycast on ingress. It brings us many advantages:
Performance: with anycast, by definition, the eyeball is routed to the closest (by BGP metrics) data center. This is usually the fastest data center for a given user.
Automatic failover: if one of our data centers becomes unavailable, the traffic will be instantly re-routed to the next best place.
DDoS resilience: during a denial of service attack or a traffic spike, the load is automatically balanced across many data centers, significantly reducing the impact.
Uniform software: The functionality of every data center and of every server inside a data center is identical. We use the same software stack on all the servers around the world. Each machine can perform any action, for any product. This enables easy debugging and good scalability.
For these reasons we'd like to keep the anycast on ingress. We decided to solve the issue of egress address cardinality in some other way.
Solving a million dollar problem
Out of the thousands of servers we operate, every single one should be able to use an egress IP with any of the possible tags. It's easiest to explain our solution by first showing two extreme designs.
Each server owns all the needed IPs: each server has all the specialized egress IPs with the needed tags.
One server owns the needed IP: a specialized egress IP with a specific tag lives in one place, other servers forward traffic to it.
Both options have pros and cons:
| Specialized IP on every server | Specialized IP on one server |
| --- | --- |
| Super expensive $$$, every server needs many IP addresses. | Cheap $, only one specialized IP needed for a tag. |
| Egress always local - fast | Egress almost always forwarded - slow |
| Excellent reliability - every server is independent | Poor reliability - introduced chokepoints |
There's a third way
We've been thinking hard about this problem. Frankly, the first extreme option of having every needed IP available locally on every Cloudflare server is not totally unworkable. This is, roughly, what we were able to pull off for IPv6. With IPv6, access to the large needed IP space is not a problem.
However, in IPv4 neither option is acceptable. The first offers fast and reliable egress, but requires great cost — the IPv4 addresses needed are expensive. The second option uses the smallest possible IP space, so it's cheap, but compromises on performance and reliability.
The solution we devised is a compromise between the extremes. The rough idea is to change the assignment unit. Instead of assigning one /32 IPv4 address for each server, we devised a method of assigning a /32 IP per data center, and then sharing it among physical servers.
| Specialized IP on every server | Specialized IP per data center | Specialized IP on one server |
| --- | --- | --- |
| Super expensive | Reasonably priced | Cheap |
| Egress always local - fast | Egress always local - fast | Egress almost always forwarded - slow |
| Excellent reliability - every server is independent | Excellent reliability - every server is independent | Poor reliability - many choke points |
Sharing an IP inside a data center
The idea of sharing an IP among servers is not new. Traditionally this can be achieved by Source-NAT on a router. Sadly, the sheer number of egress IPs we need and the size of our operation prevent us from relying on stateful firewall / NAT at the router level. We also dislike shared state, so we're not fans of distributed NAT installations.
What we chose instead is splitting an egress IP across servers by port range. For a given egress IP, each server owns a small portion of the available source ports - a port slice.
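As a rough sketch of what egressing from a port slice could look like (this is illustrative Go, not our production code; the shared IP and the slice bounds are made-up examples), a server binds the shared address and only uses source ports it owns:

```go
// Minimal sketch: dial out from a shared egress IP using only source
// ports in this server's slice. The IP, slice bounds and retry strategy
// are placeholders for illustration.
package main

import (
	"fmt"
	"net"
)

const (
	sliceStart = 32768 // first source port owned by this server (example: ports 32768-34815)
	sliceSize  = 2048  // ports per slice
)

var egressIP = net.ParseIP("198.51.100.1") // shared egress address (example)

func dialFromSlice(dst string) (net.Conn, error) {
	// Try ports from our slice until one is free. A real implementation
	// would track allocations instead of probing linearly.
	for port := sliceStart; port < sliceStart+sliceSize; port++ {
		d := net.Dialer{
			LocalAddr: &net.TCPAddr{IP: egressIP, Port: port},
		}
		conn, err := d.Dial("tcp4", dst)
		if err == nil {
			return conn, nil
		}
	}
	return nil, fmt.Errorf("no free source port in slice for %s", dst)
}

func main() {
	conn, err := dialFromSlice("203.0.113.7:443")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer conn.Close()
	fmt.Println("connected from", conn.LocalAddr())
}
```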
When return packets arrive from the Internet, we have to route them back to the correct machine. For this task we've customized Unimog, our L4 XDP-based load balancer ("Unimog, Cloudflare's load balancer (2020)"), and it's working flawlessly.
With a port slice of, say, 2,048 ports, we can share one IP among 31 servers. However, there is always a possibility of running out of ports. To address this, we've worked hard to be able to reuse the egress ports efficiently. See "How to stop running out of ports (2022)", "How to share IPv4 addresses (2022)" and our Cloudflare.TV segment.
This is pretty much it. Each server is aware which IP addresses and port slices it owns. For inbound routing Unimog inspects the ports and dispatches the packets to appropriate machines.
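Conceptually, that inbound dispatch boils down to a lookup like the following sketch. The server names and slice numbers are invented, and the real logic lives in Unimog's XDP programs rather than anything like this:

```go
// Simplified model of the inbound dispatch: each server owns a fixed-size
// slice of source ports on the shared egress IP, so the destination port
// of a return packet directly identifies the owning server.
package main

import "fmt"

const portsPerSlice = 2048 // slice size used in the example above

// serverForSlice maps a slice index to a server; contents are made up.
var serverForSlice = map[int]string{
	16: "server-a", // ports 32768-34815
	17: "server-b", // ports 34816-36863
}

// dispatch returns the server that owns the destination port of a return
// packet arriving on the shared egress IP.
func dispatch(dstPort int) (string, bool) {
	srv, ok := serverForSlice[dstPort/portsPerSlice]
	return srv, ok
}

func main() {
	srv, _ := dispatch(33000) // 33000 / 2048 == 16
	fmt.Println("forward to", srv)
}
```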
Sharing a subnet between data centers
This is not the end of the story though: we haven't discussed how we can route a single /32 address into a data center. Traditionally, on the public Internet, it's only possible to route subnets with a granularity of /24, or 256 IP addresses. In our case this would lead to a great waste of IP space.
To solve this problem and improve the utilization of our IP space, we deployed our egress ranges as... anycast! With that in place, we customized Unimog and taught it to forward the packets over our backbone network to the right data center. Unimog maintains a database like this:
198.51.100.1 - forward to LHR
198.51.100.2 - forward to CDG
198.51.100.3 - forward to MAN
...
With this design, it doesn't matter to which data center return packets are delivered. Unimog can always fix it and forward the data to the right place. Basically, while at the BGP layer we are using anycast, by design an IP semantically identifies a data center, and an IP plus a port range identifies a specific machine. It behaves almost like unicast.
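Putting the two lookups together, a toy model of the return-path decision might look like this. The table mirrors the example above; everything else (names, constants) is invented for illustration and is not Unimog's actual code:

```go
// Toy model of the return-path routing: the destination IP picks the data
// center (via the "soft-unicast" table), then the destination port picks
// the machine inside that data center.
package main

import "fmt"

// ipToDataCenter mirrors the example table above.
var ipToDataCenter = map[string]string{
	"198.51.100.1": "LHR",
	"198.51.100.2": "CDG",
	"198.51.100.3": "MAN",
}

const localDataCenter = "LHR" // where this instance runs (example)

// route decides what to do with a return packet that landed here.
func route(dstIP string, dstPort int) string {
	dc, ok := ipToDataCenter[dstIP]
	if !ok {
		return "drop: unknown egress IP"
	}
	if dc != localDataCenter {
		return "forward over backbone to " + dc
	}
	// Inside the right data center: the port slice identifies the server.
	return fmt.Sprintf("deliver to local server owning port slice %d", dstPort/2048)
}

func main() {
	fmt.Println(route("198.51.100.2", 40000)) // forwarded over the backbone to CDG
	fmt.Println(route("198.51.100.1", 33000)) // delivered locally, slice 16
}
```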
We call this technology stack "soft-unicast" and it feels magical. It's like we did unicast in software over anycast in the BGP layer.
Soft-unicast is indistinguishable from magic
With this setup we can achieve significant benefits:
We are able to share a /32 egress IP amongst many servers.
We can spread a single subnet across many data centers, and change it easily on the fly. This allows us to fully use our egress IPv4 ranges.
We can group similar IP addresses together. For example, all the IP addresses tagged with the "UK" tag might form a single continuous range. This reduces the size of the published geofeed.
It's easy for us to onboard new egress IP ranges, like customer IPs. This is useful for some of our products, like Cloudflare Zero Trust.
All this is done at a sensible cost, with no loss of performance or reliability:
Typically, the user is able to egress directly from the closest data center, providing the best possible performance.
Depending on the actual needs we can allocate or release IP addresses. This gives us flexibility with IP cost management: we don't need to overspend upfront.
Since we operate multiple egress IP addresses in different locations, the reliability is not compromised.
The true location of our IP addresses is: “the cloud”
While soft-unicast allows us to gain great efficiency, we've hit some issues. Sometimes we get the question: "Where does this IP physically exist?" But it doesn't have an answer! Our egress IPs don't exist physically anywhere. From a BGP standpoint our egress ranges are anycast, so they live everywhere. Logically each address is used in one data center at a time, but we can move it around on demand.
Content Delivery Networks misdirect users
As another example of problems, here's one issue we've hit with third-party CDNs. As we mentioned before, there are three country tags in our pipeline:
The country tag of the IP the eyeball is connecting from.
The location of our data center.
The country tag of the IP addresses we chose for the egress connections.
The fact that our egress address is tagged as "UK" doesn't always mean it actually is being used in the UK. We’ve had cases when a UK-tagged WARP user, due to the maintenance of our LHR data center, was routed to Paris. A popular CDN performed a reverse-lookup of our egress IP, found it tagged as "UK", and directed the user to a London CDN server. This is generally OK... but we actually egressed from Paris at the time. This user ended up routing packets from their home in the UK, to Paris, and back to the UK. This is bad for performance.
We address this issue by performing DNS requests in the egressing data center. For DNS we use IP addresses tagged with the location of the data center, not the intended geolocation for the user. This generally fixes the problem, but sadly, there are still some exceptions.
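In Go, pinning DNS lookups to a specific source address can be sketched like this. The resolver setup and the data-center-tagged IP are placeholders; this illustrates the idea, not our implementation:

```go
// Sketch: resolve names using a source IP tagged with the data center's own
// location, so third-party CDNs steer us to servers near where we actually
// egress. The address below is a placeholder, not a real Cloudflare IP.
package main

import (
	"context"
	"fmt"
	"net"
)

var dcTaggedIP = net.ParseIP("192.0.2.53") // egress IP tagged with this data center's location (example)

func newLocalResolver() *net.Resolver {
	return &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{}
			// Bind the DC-tagged source address for both UDP and TCP lookups.
			if network == "tcp" || network == "tcp4" || network == "tcp6" {
				d.LocalAddr = &net.TCPAddr{IP: dcTaggedIP}
			} else {
				d.LocalAddr = &net.UDPAddr{IP: dcTaggedIP}
			}
			return d.DialContext(ctx, network, address)
		},
	}
}

func main() {
	addrs, err := newLocalResolver().LookupHost(context.Background(), "example.com")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(addrs)
}
```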
The future is here
Our 2021 experiments with Addressing Agility proved we have plenty of opportunity to innovate with ingress addressing. Soft-unicast shows that we can achieve great flexibility and density on the egress side.
With each new product, the number of tags we need on egress grows - from traffic trustworthiness and product category to geolocation. As the pool of usable IPv4 addresses shrinks, we can be sure there will be more innovation in this space. Soft-unicast is our solution, but it certainly won't be our last development.
For now though, it seems like we're moving away from traditional unicast. Our egress IPs really don't exist in a fixed place anymore, and some of our servers don't even own a true unicast IP nowadays.