Introduction
The DDR4 memory interface will double the clock speed of earlier DDR3 devices, but some fundamental DRAM timing parameters will remain at the same number of nanoseconds. As a result, those parameters will take roughly twice as many memory clock cycles to elapse.
How will your system adapt to these new timing parameters? Do your existing architectural assumptions stand up? Can you get the same (or better) performance from DDR4 in spite of the new parameters?
While every other part of the computing infrastructure seems to be getting faster, DRAM latency - as measured in DRAM clock cycles - has steadily increased over the last three generations of DRAM. Read Latency and some other key timing parameters have grown from 2 clock cycles in DDR1 to 11 clocks for high-speed DDR3-1600.
Within the DDR3 product family, the DRAM latency (RL), activate time (tRCD) and precharge time (tRP) have been steadily increasing with clock speed, as illustrated by Figure 1a. Why? Because the fundamental construction of DRAM is not changing very much.
The internal arrangement of DDR DRAM is carefully balanced between performance, power and cost by the standardization efforts of the companies participating in JEDEC (www.jedec.org), and DDR DRAM manufacturers generally conform to those standards. Even though the DRAM manufacturers may have individual process and layout recipes for the DRAM cells on the DRAM die, RL, tRP and tRCD are staying relatively constant at around 13.5ns, as shown by the trendline in Figure 1b.
Figure 1a: RL-tRCD-tRP of DDR3 DRAM by speed grade, with curve-fit prediction for DDR4. Source: Survey of 5 DDR3 manufacturer websites, 9/9/2011, and interpolation.
Figure 1b: As figure 1a, with 13.5ns latency superimposed upon it
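To make the relationship between nanoseconds and clock cycles concrete, the short Python sketch below converts a fixed 13.5ns core timing into clock cycles at several DDR data rates. The helper function and the list of speed grades are illustrative, not taken from the figures:

```python
import math

def timing_in_clocks(t_ns: float, data_rate_mts: int) -> int:
    """Convert a DRAM core timing in ns to whole clock cycles."""
    clock_mhz = data_rate_mts / 2           # DDR clock = half the data rate
    clock_period_ns = 1000.0 / clock_mhz    # clock period in ns
    return math.ceil(t_ns / clock_period_ns)

# A core timing held roughly constant at 13.5ns, per the Figure 1b trendline
for rate in (800, 1066, 1600, 2133, 3200):
    print(f"DDR-{rate}: 13.5ns = {timing_in_clocks(13.5, rate)} clocks")
```

At DDR3-1600 this yields the 11 clocks noted above; at DDR4-3200 it doubles to 22 clocks.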
While the perception is that each successive generation of DDR DRAM is roughly twice as fast as the last, what is actually happening is that DDR core timing stays relatively constant as measured in nanoseconds, and therefore increases as measured in clock cycles. The doubling of frequency and bandwidth while keeping DRAM core timing constant is achieved in DRAM by exploiting parallelism within the DRAM array, as shown in figure 2.
Figure 2: Abstract representation of the internal arrangement of DRAM devices. The arrangement for DDR4 is an extrapolation of the method used for GDDR5 devices and may or may not represent the actual internal arrangement of DDR4.
The System Impact of increasing latency
Increasing latency while keeping all other things in the system equal will generally reduce CPU processing efficiency (as measured by the ratio of useful clock cycles to wait states), because the CPU must insert additional wait states while it waits more clock cycles for DRAM data. This effect is well known, and it forces architectural changes in the CPU and the rest of the system to compensate for increased DRAM latency.
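As a rough illustration of this effect, the toy model below estimates CPU efficiency from the number of wait states. All of its numbers are assumptions chosen for illustration, not measurements:

```python
def cpu_efficiency(useful_cycles: int, dram_accesses: int,
                   latency_clocks: int) -> float:
    """Ratio of useful clock cycles to total cycles including wait states."""
    wait_states = dram_accesses * latency_clocks
    return useful_cycles / (useful_cycles + wait_states)

# Same workload, with DRAM latency doubling in clocks (e.g. 11 -> 22):
print(cpu_efficiency(10_000, 100, 11))  # ~0.90 useful-cycle ratio
print(cpu_efficiency(10_000, 100, 22))  # ~0.82 useful-cycle ratio
```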
Some systems add on-chip cache memory to the CPU with lower latency than external DRAM, and use that cache preferentially over external DRAM. The more cache memory that exists on chip, the fewer external DRAM transactions occur, the less the CPU waits for DRAM, and the better the CPU's efficiency.
The negative aspects of adding cache memory are mainly issues of cost - external DRAM is very inexpensive, with historical prices as low as $0.70 per billion bits, whereas on-chip memory can be significantly more expensive than off-chip DRAM. There is also a practical limit on how many bits of cache memory can fit on the CPU die.
Another commonly used technique to improve the efficiency of the CPU is to add (or increase the size of) an out-of-order execution pipeline in the CPU, such that read data for future commands may be fetched in advance of their execution and write data storage may be delayed. This technique does increase CPU efficiency, but at the expense of increased CPU complexity, area and power.
The problem of DRAM latency is exacerbated by multi-core designs and SoC architectures where a number of clients compete for DRAM bandwidth - any client in the system is likely to experience increased latency simply because other masters are already using the DRAM.
Memory Controller Architecture and its effect on system bandwidth
A theoretical way to reduce the latency seen by the CPU would be simply to reduce the latency of the DRAM controller. While this is correct in theory - and low latency is a design goal of Cadence's memory controller IP solutions - too much simplification in the name of latency can reduce system performance.
If the controller is too simple - for example, a simple in-line queue - there may be a small reduction in the minimum number of clock cycles of latency, but a substantial increase in average latency as well as a reduction in memory bandwidth, degrading overall system performance.
An advanced memory controller - for example, Cadence's DDR4/ DDR3 memory controllers - will include a look-ahead queue or pipeline for upcoming transactions to allow the memory controller to prepare the DRAM for transactions in the pipeline.
DDR DRAM requires a delay of tRCD between activating a page in DRAM and the first access to that page. At a minimum, the controller should store enough transactions that a new transaction entering the queue can issue its activate command immediately and then be delayed by the execution of previously accepted transactions for at least the tRCD of the DRAM. At lower speeds of operation, for example DDR-800, the minimum amount of look-ahead would be two cache lines; however, with the increasing tRCD parameter of high-speed DDR4 at DDR-3200, most memory controllers would need a look-ahead queue storing a minimum of 6 cache line access requests to get full bandwidth out of the memory, as shown in figure 3:
Figure 3: Minimum look-ahead requirement of high-speed DRAM to ensure full bandwidth of DRAM. 16-bit and 32-bit systems assume 32-byte cache line, 64-bit system assumes 64-byte cache lines
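The required queue depth can be estimated with a back-of-the-envelope calculation such as the sketch below, which assumes a constant 13.5ns tRCD, a DDR burst transferring two beats per clock, and the cache-line sizes of Figure 3; the function and its parameters are illustrative, but it reproduces the two-line and six-line figures above for a 64-bit system:

```python
import math

def lookahead_depth(trcd_ns: float, data_rate_mts: int,
                    bus_bytes: int, line_bytes: int) -> int:
    """Cache-line transactions needed in the queue to hide tRCD."""
    clock_mhz = data_rate_mts / 2
    trcd_clocks = math.ceil(trcd_ns * clock_mhz / 1000)
    beats_per_line = line_bytes // bus_bytes   # data-bus beats per cache line
    clocks_per_line = beats_per_line // 2      # DDR: two beats per clock
    return math.ceil(trcd_clocks / clocks_per_line)

# 64-bit system with 64-byte cache lines and a 13.5ns tRCD:
print(lookahead_depth(13.5, 800, 8, 64))    # -> 2 at DDR-800
print(lookahead_depth(13.5, 3200, 8, 64))   # -> 6 at DDR4-3200
```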
Another problem that is exacerbated by high-speed DRAM is the effect of the activate-to-activate delay of the same bank in DRAM - the so-called tRC delay. If the memory controller receives a transaction to a recently-accessed bank, the memory controller must delay the next activate command to that bank such that tRC is not violated.
The problem with tRC is that in a system with multiple memory masters, it becomes very difficult to predict which masters may be accessing which banks within the DRAM at any given time. For this reason, Cadence believes that all DRAM controllers should reorder transactions, as Cadence's DDR4 controller does: multiple transactions to different rows in the same bank can be stored in the DRAM controller, and other commands not restricted by tRC can bypass any transactions delayed by tRC.
tRC is another timing parameter that is not decreasing at the same rate that the clock frequency is increasing. We predict the tRC of DDR4 devices to be around 45ns, as shown by Figure 4:
Figure 4: DRAM tRC trend from SDR to DDR4. Source: Cadence SOMA Models of DRAM Devices, derived from actual DRAM manufacturer datasheets. Sample sizes: SDR: 641 samples, DDR1: 615 samples, DDR2: 302 samples, DDR3: 281 samples, DDR4 prediction by curve fitting. Only devices with tRC expressed in ns were considered.
As we translate the 45ns predicted tRC for DDR4 into clock cycles, we get the table of Figure 5:
Figure 5: tRC reordering requirement of high-speed DRAM to ensure full bandwidth of DRAM. 16-bit and 32-bit systems assume 32-byte cache line, 64-bit system assumes 64-byte cache lines
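Applying the same clock-conversion arithmetic to the predicted 45ns tRC reproduces the numbers discussed below for a 64-bit system with 64-byte cache lines; this is an illustrative sketch, not the paper's own calculation:

```python
import math

# Predicted 45ns tRC at DDR4-3200 (1600MHz memory clock):
trc_clocks = math.ceil(45.0 * 1600 / 1000)      # -> 72 clocks
clocks_per_line = (64 // 8) // 2                # 64-byte line, 64-bit bus: 4 clocks
print(trc_clocks, math.ceil(trc_clocks / clocks_per_line))  # 72 clocks, 18 lines
```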
Figure 5 shows that if there are back-to-back transactions to different rows in the same bank, the system would have to separate the two commands by as many as 72 clock cycles. A reordering controller would be able to improve bandwidth and latency for up to 18 cache line reads or 18 cache line writes in most systems when the system masters issue two back-to-back commands to different rows in the same bank.
Other optimizations
There are many other DRAM traffic optimizations with the potential to increase bandwidth that can be implemented by an advanced DRAM controller such as Cadence's DDR4 controller. Grouping read and write transactions can provide a significant benefit for certain traffic patterns. Using open-page mode for transactions with good locality of access improves bandwidth, average latency and power. Reordering around other DRAM holdoff conditions such as tFAW can also improve bandwidth. Many such optimizations are beyond the scope of this paper and may be addressed in a future publication.
Summary
This paper has shown some of the effects of increasing clock speed on systems using DDR4 devices, and some of the techniques that may be used to mitigate those effects.
Cadence offers a differentiated, integrated and proven set of DDR controllers, PHYs, Verification IP and signal integrity solutions, including a DDR controller specifically redesigned to improve performance with high-speed DDR3 and DDR4 devices.
Explore Cadence IP here