With the total number of connected devices now exceeding the world population at 7 billion and counting, the demand for rich content, including video, games, and mobile apps, is skyrocketing. Around the globe, service providers are scrambling to transform their networks to satisfy the overwhelming demand for content bandwidth. Over the next few years, they will be looking to network equipment manufacturers to provide high-performance, cost-effective products that will ultimately fulfill the promise of 100G Ethernet.
100G Ethernet poses challenges not just to transmission, but also to packet processing, quality of service and rapid routing. Network equipment manufacturers must grapple with the reality that line speeds are increasing much faster than processor clock speeds, and both of these are increasing much faster than memory clock speeds. With the aggregate bandwidth (port count × line rate) required by next generation routers, networking architects are finding it difficult to design SoCs that can process incoming packets fast enough to avoid memory overruns. Managing the flood of incoming data is just part of the problem. Within the packet processor, a multitude of tiny tasks must be performed based on the headers and/or content in each packet. Each of these tasks requires multiple independent accesses to memory, further compounding the problem.
Systems designers can crank up processing performance by using multicore processors. However, if memory performance cannot keep up, processors will have to wait for memory requests to execute, which will cause the system to stall. Memory performance must be increased, and the next logical step to accomplish this is to use multi-port memories, which allow multiple memory access requests to be processed in parallel in a single clock cycle.
As line rates approach 400 Gbps, there is no practical, viable memory solution, especially for packet buffering, other than multi-port memories. While conventional multi-port memories have a reputation for being difficult to implement, new technology is available that now makes multi-port memories an attractive choice for high performance networking applications. New Algorithmic Memory™ technology uses commercially available single port memory IP and combines it with algorithms to create a multi-port memory that offers a superior profile in terms of performance, power, area, and versatility. Furthermore, since Algorithmic Memories are created from pre-validated memory IP, the generated multi-port memory cores do not need further verification.
Examining Memory Requirements for Networking Data Path Processing
To understand the challenge of meeting memory requirements for next generation networking applications, consider a typical packet processing data path in a 100 Gb/s network line card as shown in Figure 1. The typical loaded case for packet processing is when 64-byte packets are received. Since each packet requires about another 20 bytes of inter-packet gap on the wire, it works out to about 150M packets/s. Incoming data must be stored in memory that is typically 512 bits (64 bytes) wide. Since a 64-byte packet fits in one memory cell, 150M cell writes/s are needed. However, in terms of memory access, the worst case scenario is 65-byte packets, each of which needs a second cell, and which therefore require 300M writes/s of memory performance. To avoid overruns there must also be at least 300M reads/s. These performance requirements cover the case of a unicast buffer, but if multicasting is required, each packet may need to be written multiple times.
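The arithmetic above can be reproduced with a short calculation. This is only an illustrative sketch: the function name and its parameters are made up for this example, and the 20-byte gap and 64-byte cell are the figures quoted in the text.

```python
import math

def cell_ops_per_second(line_rate_bps, packet_bytes, gap_bytes=20, cell_bytes=64):
    """Return (packets/s, cell writes/s) for back-to-back packets of one size.

    Hypothetical helper: assumes every packet carries `gap_bytes` of
    inter-packet overhead on the wire and is split into fixed-size cells.
    """
    bits_on_wire = (packet_bytes + gap_bytes) * 8
    packets_per_s = line_rate_bps / bits_on_wire
    cells_per_packet = math.ceil(packet_bytes / cell_bytes)
    return packets_per_s, packets_per_s * cells_per_packet

for size in (64, 65):
    pps, writes = cell_ops_per_second(100e9, size)
    print(f"{size}-byte packets: {pps / 1e6:.1f}M packets/s, "
          f"{writes / 1e6:.1f}M cell writes/s")
```

At 100 Gb/s this gives roughly 149M cell writes/s for 64-byte packets and roughly 294M for 65-byte packets, which the text rounds to 150M and 300M.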
Another common type of buffering architecture is Virtual Output Queues ('VOQ'). With VOQs, packets are still buffered on the ingress line card, but arranged into multiple logical queues, segregated by their destination. VOQs avoid the head-of-line blocking problem associated with purely input-buffered architectures, but with the added cost of a higher number of read Memory Operations Per Second (MOPS) required to support the 'speed-up' of switching fabrics (see Note 1) associated with VOQ architecture.
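A toy model may help make the VOQ idea concrete. The class and method names below are hypothetical, and the scheduling policy (serve free outputs in list order) is a simplification chosen only for illustration; real fabric schedulers are considerably more involved.

```python
from collections import deque

class VirtualOutputQueues:
    """One FIFO per egress destination on an ingress line card (toy model)."""

    def __init__(self, num_outputs):
        self.queues = [deque() for _ in range(num_outputs)]

    def enqueue(self, dest, cell):
        # One buffer write per arriving cell.
        self.queues[dest].append(cell)

    def dequeue(self, free_outputs, speedup=2):
        """Read up to `speedup` cells in one time-slot, only for free outputs."""
        sent = []
        for dest in free_outputs:
            if len(sent) == speedup:
                break
            if self.queues[dest]:
                sent.append((dest, self.queues[dest].popleft()))
        return sent

voq = VirtualOutputQueues(num_outputs=3)
voq.enqueue(0, "A")  # output 0 is assumed congested in this example
voq.enqueue(1, "B")
voq.enqueue(2, "C")
# Output 0 is busy, yet the cells for outputs 1 and 2 leave in the same slot.
print(voq.dequeue(free_outputs=[1, 2]))
```

With a single input FIFO, cell "A" would sit at the head and hold back "B" and "C". With a speed-up of 2, the buffer has to serve two reads in the slot in which it accepts one write, which is where the extra read MOPS come from.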
Multicasting, wherein an incoming packet is sent to multiple egress cards, further adds to the MOPS requirements of the memory buffers. One way to implement this easily - to keep buffer pointer management simple - is to write the packet multiple times to different VOQ buffers. This is often referred to as 'Copy Multicasting.' If the packet is to be multicast to 'n' output cards, then it needs to be written 'n' times in the buffer memory.
There are other buffer memory related operations, such as linked list pointer management, related to scheduling and control of the various queues. These require additional memories and associated read/write operations. Pointer multicasting is an alternative packet buffering technique in which the packet is written only once and multiple pointers to it are maintained. However, this requires more complex linked-list management.
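The difference between the two multicast schemes can be sketched by counting writes to the wide data memory. The function names and data structures here are hypothetical, and the model ignores how buffers are freed, which is part of the extra pointer management the text mentions.

```python
def copy_multicast(voq_buffers, cell, dests):
    """Copy multicasting: write the cell into every destination's buffer."""
    for d in dests:
        voq_buffers[d].append(cell)
    return len(dests)  # number of data-memory writes

def pointer_multicast(cell_store, pointer_queues, cell, dests):
    """Pointer multicasting: write the cell once, queue one pointer per destination."""
    cell_store.append(cell)
    ptr = len(cell_store) - 1
    for d in dests:
        pointer_queues[d].append(ptr)  # pointer writes go to separate memories
    return 1  # number of data-memory writes

dests = [0, 2, 3]  # multicast to n = 3 output cards
copy_writes = copy_multicast([[] for _ in range(4)], "cell", dests)
ptr_writes = pointer_multicast([], [[] for _ in range(4)], "cell", dests)
print(copy_writes, ptr_writes)
```

For n = 3 destinations, copy multicasting costs three data-memory writes while pointer multicasting costs one, at the price of three pointer writes and the bookkeeping needed to know when the shared cell can be released.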
A summary of the data path memory requirements for various line rates is shown in Table 1. A 64-byte (512 bit) cell size is assumed. It can be seen that the MOPS requirements increase proportionately with line rates. This suggests that the memory bottleneck problem will get worse over time. It is clear from the preceding example that faster processors alone cannot improve network performance unless we are able to increase the total MOPS.
Algorithmic Memory Delivers Up to 10X MOPS Performance
To address the need for greatly increased memory performance (measured in MOPS), Memoir Systems has pioneered Algorithmic Memory™ technology. Algorithmic Memories use algorithms synthesized in hardware to increase the performance of existing embedded memory macros - up to 10X more MOPS. The RTL implements Memoir's algorithms, which employ a variety of techniques such as caching, address translation, pipelining, and encoding. All of these techniques are transparent to the end user. To the SoC designer, Algorithmic Memories appear as standard multi-port embedded memories (typically with no added clock cycle latency) that can be easily integrated on chip with existing SoC design flows.
In addition, the memory cores are generated mainly from single-port memories in the specified base library and offer two principal advantages:
1. The generated core's area, power and frequencies are far superior compared to any custom multi-port memory library.
2. Algorithmic Memory technology eliminates any of the schedule or yield risks associated with custom multi-port memories, especially for newer process geometries.
Using an automated software tool, users are able to generate custom multi-port memories of any port capability (n read ports and/or m write ports), size (words x bits), and clock frequency.
A Practical Solution to an Impending Problem
Algorithmic Memory can be used to meet next generation networking infrastructure memory requirements. Consider the following example of three router cards sporting aggregate bandwidths of 100 Gbps, 200 Gbps, and 400 Gbps respectively. With the recent launch of 100G Ethernet, there is an impending need for cards supporting, for example, two or four 100G links per card.
To illustrate, let's assume there is a buffer size that supports receiving one packet every 1 ms at the incoming line rates. Consider a buffer architecture which splits variable size incoming packets into constant size 64-byte (512 bit) cells. Assume a VOQ buffer architecture with a speed-up of 2, which will allow up to two cells per cycle to be switched from the ingress line card. As shown in Table 1, such a buffer needs to perform 2 Read operations and 1 Write operation per cycle (2R1W). The area comparisons using 28nm embedded memory for such a buffer built with and without Memoir's Algorithmic Memory technology are given in Table 2. Since a 2R1W memory is not readily available in most standard memory compilers, it is assumed to be made out of two copies of 1R1W (2P) memories. The challenge for a 28nm memory compiler is that it tops out at 1 GHz clock frequency. However, a memory solution for 400 Gbps needs to support 600 million packets per second, i.e., 1200 million cells per second in the worst case. This requires a clock frequency of 1.2 GHz or more depending on the design, which is not possible using conventional memory and can only be realized with Algorithmic Memory technology. Using this technology, we can add extra read and write ports to single port memory to achieve the performance required without hitting the frequency limit of the memories. So for 400 Gbps line cards, we can move from 2R1W multi-port memory to 4R2W multi-port memory running at half the frequency, yet providing the same performance in MOPS.
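The trade-off between port count and clock frequency can be checked with a few lines of arithmetic. The helper names are hypothetical; the 1200 million cells per second and the 1 GHz compiler limit are the figures quoted above, and the model assumes every port completes one operation per clock cycle.

```python
COMPILER_MAX_HZ = 1.0e9        # 28nm memory compiler limit quoted in the text
WORST_CASE_CELLS_PER_S = 1200e6  # 400 Gbps worst case quoted in the text

def min_clock_hz(cell_writes_per_s, write_ports):
    """Lowest clock at which the write ports absorb the incoming cell rate."""
    return cell_writes_per_s / write_ports

def total_mops(clock_hz, read_ports, write_ports):
    """Memory operations per second, assuming one operation per port per cycle."""
    return clock_hz * (read_ports + write_ports)

for reads, writes in ((2, 1), (4, 2)):
    clock = min_clock_hz(WORST_CASE_CELLS_PER_S, writes)
    feasible = clock <= COMPILER_MAX_HZ
    print(f"{reads}R{writes}W: {clock / 1e9:.1f} GHz, "
          f"{total_mops(clock, reads, writes) / 1e9:.1f}G ops/s, "
          f"within compiler limit: {feasible}")
```

Under these assumptions the 2R1W configuration needs 1.2 GHz, above the compiler limit, while 4R2W needs 600 MHz and delivers the same 3.6 billion operations per second.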
Fulfilling the Promise of 100G Ethernet
In summary, memory performance tends to be the weak link in increasing network performance. Networking wire speeds are increasing faster than memory frequencies. Networking gear is very memory intensive and often requires several memory operations per packet. The bottom line is that faster processors alone cannot improve network performance unless we are able to increase the total MOPS. As rates approach 400 Gbps, there is no practical, viable memory solution, especially for packet buffering, other than multi-port memories. The scalability and versatility offered by Memoir's Algorithmic Memory technology are ideally suited to meet these challenges for 100G Ethernet and beyond.
Notes:
1 Switching fabrics in today's routers consist of parallel path crossbars where more than one packet can be switched in parallel in a given time-slot. Speed-up refers to the number of parallel paths in a crossbar fabric. Speed-ups of 2 to 3 are quite commonly used today.
Badawi Dweik is Director of Product Marketing at Memoir Systems, and has over 15 years of memory industry experience in the areas of design, product applications and marketing. He holds a B.S. in Electrical Engineering (Magna Cum Laude) from Northeastern University and a Masters in Business Administration from Regis University.