Managing network traffic flow for multicore x86 processors at 40/100G, Part 1 of 2
May 24, 2017
Part one of this two-part series takes an in-depth look at the processors that will be used to facilitate the migration to 40G, 100G, and beyond.
Traffic in the enterprise and carrier network has exploded in recent years, driven by consumer broadband, corporate traffic, and newer IP-based services such as mobile connectivity, remote cloud services, IP video, and IPTV.
In addition, the advent of virtualization and the need for higher-performance (up to 100G) secure communication have put tremendous pressure on communication system designs, including the I/O subsystem. These demands, coupled with the success multicore x86 CPUs have had in embedded applications and the data center, have created the need for a coprocessor that can handle packet processing at tens of millions of stateful flows with a glueless, high-performance, virtualized interface to the x86 CPU subsystem.
The pressure of packet processing
Today, service providers offering cloud-based services and enterprise data centers are enabling access to valuable resources anytime, anywhere over wireline and wireless networks. The resulting increase in traffic is putting aggregation switches/routers and intermediate network nodes under constant pressure to meet ever-higher bandwidth demands. These processing elements do not simply switch or route traffic; they must also provide functions such as firewalling with Deep Packet Inspection (DPI) capability and virtualization support for multi-tenant cloud environments.
The underlying Transmission Control Protocol (TCP), User Datagram Protocol (UDP), and Real-time Transport Protocol (RTP) traffic comprises many packets belonging to a network connection. An intermediate node such as a switch, router, or gateway at the network edge must process millions of network connections simultaneously.
Processing each packet in a connection individually prevents the network element from keeping up with ever-increasing line rates. This is further complicated by the need to perform DPI on at least a portion of the traffic. Moreover, an intermediate node located deep in the network must process hundreds of millions of packets.
Viewed individually, arriving packets show no correlation with one another; that is, consecutive packets are not related in space or time. Such asynchronous traffic is better served by grouping packets into flows. A flow is a collection of packets belonging to the same network session, usually between a source-destination pair. Incoming packets must be classified into flows. The processor then deals with all packets belonging to the same flow in the same way, based on rules in a flow state table.
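As a rough illustration of flow classification, the sketch below groups packet records by a 5-tuple key. The article does not specify how a flow is keyed; the 5-tuple (source and destination address, source and destination port, protocol), the `FlowKey` type, the `classify` function, and the sample packets are all assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class FlowKey(NamedTuple):
    """Hypothetical 5-tuple identifying a flow (an assumption, not from the article)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


def classify(packets):
    """Group packet records (dicts) into flows keyed by their 5-tuple."""
    flows = defaultdict(list)
    for pkt in packets:
        key = FlowKey(pkt["src_ip"], pkt["dst_ip"],
                      pkt["src_port"], pkt["dst_port"], pkt["protocol"])
        flows[key].append(pkt)
    return dict(flows)


# Made-up sample traffic: two packets of one TCP session, one UDP packet.
packets = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 40000,
     "dst_port": 80, "protocol": "TCP", "length": 64},
    {"src_ip": "10.0.0.3", "dst_ip": "10.0.0.2", "src_port": 5004,
     "dst_port": 5004, "protocol": "UDP", "length": 200},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 40000,
     "dst_port": 80, "protocol": "TCP", "length": 1500},
]

flows = classify(packets)
for key, members in flows.items():
    print(key, "->", len(members), "packet(s)")
```

Once packets are grouped this way, a per-flow rule can be looked up once and applied to every packet that shares the key.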
Stateful flow processing
All network elements require states for millions of flows, especially when it comes to implementing security processing such as firewalls, intrusion prevention or detection systems, and application-level load balancers. The resulting platform architecture must support flow state management by monitoring packets within a flow, updating a TCP connection, creating and timing out UDP connections, and tracking Virtual Private Network (VPN) connections. State handling is also required to support TCP proxy and TCP splicing.
System software should thus maintain flow state tables supporting millions of flows. Hardware must support software by performing a complex hash and lookup in a flow hash table. Software is responsible for analyzing the flow hash result and managing new flows, updating the hash table and maintaining the flows’ state.
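The division of labor described above can be sketched in software as a hash-indexed flow table with new-flow creation and idle timeouts. This is a minimal sketch, not the architecture's actual design: the `FlowTable` class, the bucket count, the 30-second timeout, and the use of CRC32 as a stand-in for the hardware hash are all assumptions for illustration.

```python
import zlib


class FlowTable:
    """Toy flow state table: hash to a bucket, then search within the bucket."""

    def __init__(self, num_buckets=1024, idle_timeout=30.0):
        self.buckets = [[] for _ in range(num_buckets)]
        self.idle_timeout = idle_timeout  # seconds; hypothetical value

    def _index(self, key):
        # Stand-in for the hardware hash: CRC32 over the key's text form.
        return zlib.crc32(repr(key).encode()) % len(self.buckets)

    def lookup_or_create(self, key, now):
        """Return (entry, created). Updates state for a known flow,
        or installs a new entry when the flow is seen for the first time."""
        bucket = self.buckets[self._index(key)]
        for entry in bucket:
            if entry["key"] == key:
                entry["last_seen"] = now
                entry["packets"] += 1
                return entry, False
        entry = {"key": key, "packets": 1, "last_seen": now}
        bucket.append(entry)
        return entry, True

    def expire(self, now):
        """Remove flows idle for longer than the timeout; return how many."""
        removed = 0
        for i, bucket in enumerate(self.buckets):
            keep = [e for e in bucket if now - e["last_seen"] <= self.idle_timeout]
            removed += len(bucket) - len(keep)
            self.buckets[i] = keep
        return removed


table = FlowTable(num_buckets=8, idle_timeout=30.0)
key = ("10.0.0.1", "10.0.0.2", 40000, 80, "TCP")
first, created_first = table.lookup_or_create(key, now=0.0)
second, created_second = table.lookup_or_create(key, now=1.0)
print(created_first, created_second, second["packets"])
```

In a real system the hash and bucket lookup would be offloaded to hardware, leaving software to handle the miss path (new flows) and state updates, as the text describes.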
System performance requirements at 100G
To meet the stringent system requirements at 100G, both processing and memory architectures must meet the time budget offered by one packet time at the worst case of 64-byte packets, which is only about 6 to 7 ns once the preamble and inter-frame gap are counted.
Processing instruction and memory budgets
Given that most networks continue to use Ethernet frames or packets as the underlying transport, it is important to understand the composition of these frames and how they affect network performance.
The Ethernet frame
A typical Ethernet frame starts with an 8-byte preamble, followed by 12 bytes of addressing information (the destination and source addresses) and a 2-byte type/length field that indicates either the type of data carried or the length of the payload. The payload can be as small as 46 bytes and as large as 1,500 bytes. A 32-bit (4-byte) cyclic redundancy check is computed and appended at the end of the frame (Figure 1).
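The field sizes above add up to the familiar minimum and maximum frame sizes. The short sketch below does that arithmetic; the constant and function names are illustrative, and the frame size is counted without the preamble, which is the usual convention behind the "64-byte" figure.

```python
# Field sizes in bytes, taken from the frame description above.
PREAMBLE = 8        # sent on the wire, but not counted in the frame size
ADDRESSES = 12      # 6-byte destination + 6-byte source address
TYPE_LENGTH = 2
CRC = 4
MIN_PAYLOAD, MAX_PAYLOAD = 46, 1500


def frame_size(payload_len):
    """Frame size in bytes, excluding the preamble."""
    if not MIN_PAYLOAD <= payload_len <= MAX_PAYLOAD:
        raise ValueError("payload must be between 46 and 1,500 bytes")
    return ADDRESSES + TYPE_LENGTH + payload_len + CRC


print(frame_size(MIN_PAYLOAD))  # 64
print(frame_size(MAX_PAYLOAD))  # 1518
```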
Performance calculations at 100 GbE
System throughput is usually expressed in packets per second (pps). The maximum figure is reached when all Ethernet frames are 64 bytes long, the minimum frame size. For 10 GbE, this figure is 14.88 million pps, commonly rounded to 15 Mpps. For 100 GbE, it becomes roughly 150 Mpps.
Smaller packets present a challenge in meeting the short time budget, while large packets present a challenge in meeting the highest line rate. At 150 Mpps, the per-packet time budget for a 64-byte packet is as little as 6 to 7 ns. With a processor running at 1 GHz, the instruction cycle time is 1 ns, so a 64-byte packet must be handled within a budget of roughly six cycles. One way around this constraint is parallel processing with multiple cores and threads. For example, a processor with 100 cores/threads increases the budget to roughly 600 cycles per packet, a far more manageable window.
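The packet rates and cycle budgets quoted here can be reproduced with a few lines of arithmetic. The sketch assumes the standard 12-byte minimum inter-frame gap in addition to the 8-byte preamble; the gap is not mentioned in the text, and the function names are illustrative. With these assumptions, the minimum-size packet time at 100 GbE works out to about 6.7 ns.

```python
PREAMBLE_BYTES = 8
INTERFRAME_GAP_BYTES = 12  # assumption: standard minimum gap, not stated in the text


def packets_per_second(link_bps, frame_bytes=64):
    """Maximum packet rate for back-to-back frames of the given size."""
    wire_bits = (frame_bytes + PREAMBLE_BYTES + INTERFRAME_GAP_BYTES) * 8
    return link_bps / wire_bits


def cycle_budget(link_bps, clock_hz, workers=1, frame_bytes=64):
    """Clock cycles available per packet when work is spread over `workers`
    cores/threads (ideal scaling, no contention; a simplifying assumption)."""
    return clock_hz * workers / packets_per_second(link_bps, frame_bytes)


for name, bps in (("10 GbE", 10e9), ("100 GbE", 100e9)):
    pps = packets_per_second(bps)
    print(f"{name}: {pps / 1e6:.2f} Mpps, {1e9 / pps:.2f} ns per packet")

print(f"1 core at 1 GHz: {cycle_budget(100e9, 1e9):.1f} cycles per packet")
print(f"100 cores/threads: {cycle_budget(100e9, 1e9, workers=100):.0f} cycles per packet")
```

The ideal-scaling assumption is optimistic: in practice, shared memory and flow-table contention eat into the per-worker budget.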
Memory considerations at 100G
The use of specialized memories is not recommended in networking devices. At present, DDR3 memories are the preferred external memories. DDR memories operate well in longer bursts; however, for 64-bit-wide interfaces the transaction rate reaches its maximum at clock rates above 1,666 MHz. Exchanging one 64-bit channel for two 32-bit memory channels can deliver higher transaction rates at clock frequencies of 2,133 MHz and higher.
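One way to see why splitting a channel helps is to count burst slots. The sketch below is a deliberately simple upper-bound model, not a DDR3 timing simulation: it assumes a fixed burst length of 8 transfers, treats the quoted figures as transfer rates in millions of transfers per second, and ignores row-activation and refresh constraints, which in practice cap the rate well below these bounds. The function names are illustrative.

```python
BURST_LENGTH = 8  # transfers per burst (DDR3 BL8)


def peak_transactions_per_second(transfer_rate_mtps, channels=1):
    """Upper bound on burst transactions per second across independent channels."""
    return channels * transfer_rate_mtps * 1e6 / BURST_LENGTH


def bytes_per_transaction(width_bits):
    """Bytes moved by one full burst on a channel of the given width."""
    return width_bits // 8 * BURST_LENGTH


one_wide = peak_transactions_per_second(1666, channels=1)
two_narrow = peak_transactions_per_second(2133, channels=2)

print(f"1 x 64-bit: {one_wide / 1e6:.2f} M transactions/s, "
      f"{bytes_per_transaction(64)} bytes each")
print(f"2 x 32-bit: {two_narrow / 1e6:.2f} M transactions/s, "
      f"{bytes_per_transaction(32)} bytes each")
```

Under this model the two narrower channels offer more independent burst slots, each moving fewer bytes, which suits workloads dominated by small table lookups rather than bulk transfers.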
Current approaches to fulfill 100G requirements
In the early 2000s, many new and established chip vendors started offering multicore CPU products based on standard general-purpose processors, creating Symmetric Multi-Processing (SMP) Linux structures. In leveraging the relatively simple programming model for SMP Operating Systems (OSs), networking vendors were able to introduce products to market in less time. However, this approach was limited to sub-10G levels of performance.
Performance in these processors is limited primarily because traditional general-purpose CPUs rely on caches to work around memory latency. Cache misses leave the CPU cores stalled waiting on main memory, whose latency is far higher than that of cache memory. This so-called “memory wall effect” implies that the SMP multicore model for processors does not scale to the hundreds of processor cores required to flexibly address 100 Gbps solutions. Attempts to minimize cache misses through branch prediction and speculative execution techniques fall short of solving the problem of relatively low cache hit rates.
In an attempt to circumvent the performance bottleneck, vendors began embedding hardware accelerators into multicore processors to handle common performance-intensive functions such as security and DPI (see Figure 2). The resulting single-chip heterogeneous multicore processor has led to proprietary architectures that are not OS friendly, defeating the original intent of having a simple, easy-to-program multicore processor.
Network processors are a category of processors focused on optimizing L2-L4 packet performance. In general, they contain smaller cores that scale reasonably well and can deliver 100 Gbps of performance. Memory performance is addressed through pipeline architectures, and in some cases, Very Long Instruction Word (VLIW) architectures.
Flexibility and intelligent processing are hampered in network processors by complex programming and fixed internal structures focused on packet forwarding. Furthermore, performance in pipelined network processors suffers when traffic consists of several tunnels and/or when deeper tunnels are required.
Ethernet switching chips typically include small pipelines with internal lookup engines and do not support external memory. They were commonly used in enterprise Ethernet wiring closet switches. As usage models grew in complexity with top-of-rack switches, the flexibility requirements also became more pronounced. Larger lookup tables and greater performance levels are now required from Ethernet switches, as are the several deep tunnels needed to support the many layers of virtualization in the data center.
Although some Ethernet switching chips have access to external ternary content-addressable memory for fast table lookups, a typical Ethernet switch can’t access external DDR memory, making it difficult to cater to networking applications that require support for millions of flows.
Ethernet network interface controllers (NICs) are used in server and client environments to connect multiple Ethernet interfaces to the host x86 CPU through a PCI Express interface. These devices can’t be programmed to perform complex networking tasks such as switching or in-line security. They have no access to external memory and hence can’t support millions of flows.
Now that the challenges of processing network traffic at 100G have been identified, it is important to discuss what is needed to address these challenges. Part 2 of this series, which will be featured in the February issue of Embedded Computing Design, will highlight the need for a coprocessor that can meet the challenges that arise with 100G network traffic. Additionally, the second article will discuss how the new coprocessor manages functions such as intelligent L2/L3 switching, flow classification, in-line security processing, virtualization, and load balancing for x86 CPU cores and virtual machines.