How Innovations in DRAM Memory Architecture Promise to Raise Memory Throughput to 51.2GB/s
By Michael Ching
Director of Product Marketing
Rambus Inc.
Processor designers are working hard
to satisfy demands from system developers and end-users, but increased
processing throughput must be matched by improvements in memory bandwidth
to deliver usable system-level performance gains.
Other constraints also influence designers' choices and shape each
successive generation of performance enhancements: the need to mitigate
electromagnetic effects at multi-GHz switching frequencies, to minimise
the effects of manufacturing tolerances such as variations in board trace
lengths, and to deliver continuous reductions in power consumption.
Shifting the Balance of Power
For over a decade, increasing processor
speeds have spurred development of progressively faster DRAM technologies,
to maintain the performance balance and fully utilise the increased
processor performance. Double Data Rate (DDR) DRAM, for example, supports
faster processor-memory transactions without increasing the Front Side
Bus (FSB) clock rate, by transferring data on both the rising and falling
clock edges. Transaction speeds are dramatically increased without incurring
the EMI, power-consumption and thermal management challenges that usually
accompany a higher clock rate. However, some trade-offs are seen in
other aspects of the memory's behaviour. For example, "double-pumping"
the memory bus has been accompanied by a doubling of the column prefetch
buffer, which in turn doubles the column access granularity.
The transition from DDR to DDR2 memory
continues this trend. The DDR2 interface operates at twice the speed
of the core, which increases the peak transfer rate to 6.4Gbit/s at
200MHz memory bus speed but also introduces a further doubling of the
column prefetch buffer depth. With the advent of DDR3 memory, which
again doubles the interface speed compared to DDR2, the column prefetch
buffer is 8 bits deep, corresponding to a sustained access granularity
of 128 bytes for DDR3 modules with a 64-bit data width.
Although this progression has dramatically
boosted the outright data transfer rate achievable, the accompanying
trend towards higher access granularity restricts the performance of
applications in many of today's fastest-growing and most exciting
market sectors. These include high-resolution graphics, 1080p HDTV,
network communications processing, and multi-core supercomputing, which
are predicated on ultra-high-speed processing of small blocks of data
that are often only a few bits in size. In addition, the temporal locality
of these blocks of data tends to be low. For example, in a network-switching
application each packet stream is mixed randomly with packets from other
simultaneous transfers. This results in a requirement for temporary
storage of small packets having no locality of reference.
Current high-bandwidth DRAM architectures
will be unable to meet the future demands of these applications, given
the high column and row granularity inherent in the memory interface.
This high granularity results in inefficient utilisation of the memory
bandwidth, since the majority of data retrieved will be discarded by
the application.
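As a rough illustration of this inefficiency, the sketch below estimates what fraction of transferred bytes an application actually uses at a given access granularity. The function name and the simplifying assumptions (requests are aligned to the access boundary, and no other byte in the fetched block is useful) are illustrative and do not come from any Rambus specification.

```python
def bandwidth_efficiency(useful_bytes: int, granularity_bytes: int) -> float:
    """Fraction of transferred bytes that the application actually uses.

    Assumes the request is aligned and that every access moves a whole
    block of `granularity_bytes`, the rest of which is discarded.
    """
    if useful_bytes <= 0 or granularity_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Number of whole accesses needed to cover the request (ceiling division).
    accesses = -(-useful_bytes // granularity_bytes)
    transferred = accesses * granularity_bytes
    return useful_bytes / transferred


if __name__ == "__main__":
    # An 8-byte item fetched at progressively finer granularities.
    for granularity in (64, 32, 16, 8):
        print(granularity, bandwidth_efficiency(8, granularity))
```

Under these assumptions, an 8-byte item fetched at 64-byte granularity uses only one-eighth of the bandwidth it consumes, while the same item fetched at 8-byte granularity uses all of it.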
High-Speed, Fine-Granularity
Development of memory interface and core
technologies must now focus on regaining this lost efficiency, to better
serve emerging applications that require fine memory-access granularity.
The Extreme Data Rate (Rambus XDR™)
memory architecture, which is based on differential and point-to-point
signalling, has been developed to deliver a further increase in memory
bandwidth as processing speeds continue to rise. The XDR™ DRAM interface,
while increasing signalling rates, also eliminates the effects of manufacturing
variations in PCB trace lengths, and supports scalability to large module
capacities without suffering the performance losses usually associated
with multi-drop bus topologies.
Memory Interface Innovations
XDR introduces Octal Data Rate (ODR)
signalling, which allows data exchange on rising and falling edges of
a clock that is multiplied to four times the 400MHz system clock.
Eight bits of data are transferred per clock cycle, which enables 3.2GHz
data rates with a 400MHz clock and provides a scalable path to over
6.4GHz as bandwidth needs increase. In combination with the increased
signalling rate, improvements to signal integrity and speed are achieved
through the use of Differential Rambus Signalling Level (DRSL) technology.
DRSL has a signal excursion from 1.0V to 1.2V, resulting in higher speed
and lower power consumption without compromising data integrity.
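The ODR arithmetic described above can be sketched as follows. The function name and parameters are illustrative, and the 800MHz case is a hypothetical clock chosen only to show how a 6.4GHz data rate would follow from the same relationship.

```python
def odr_data_rate_gbps(system_clock_mhz: float,
                       clock_multiplier: int = 4,
                       edges_per_cycle: int = 2) -> float:
    """Per-pin data rate in Gbit/s for a multiplied, double-edged clock.

    With the defaults (clock multiplied by four, data on both edges),
    eight bits are transferred per system clock cycle.
    """
    bits_per_system_clock = clock_multiplier * edges_per_cycle
    return system_clock_mhz * bits_per_system_clock / 1000.0


if __name__ == "__main__":
    print(odr_data_rate_gbps(400))  # 400MHz system clock from the text
    print(odr_data_rate_gbps(800))  # hypothetical faster system clock
```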
At the XDR interface, DRSL is applied
in combination with Rambus FlexPhase™ Timing Adjustment technology.
FlexPhase compensates for incremental effects such as small variations
in PCB trace lengths due to manufacturing tolerances, producing controllable
and deterministic signal timing that allows systems to operate close
to ideal timing parameters rather than worst-case. In addition, Rambus
Dynamic Point-to-Point (DPP) technology allows XDR modules to combine
the easy scalability of a multipoint topology with the high speed of
point-to-point signalling.
By combining these technologies, the
XDR interface allows DRAMs featuring the standard core architecture
to support signalling from 3.2GHz to 6.4GHz for data bandwidth from
6.4GB/s to 12.8GB/s from a single x16 XDR DRAM component. Further
optimisation of the core enables access granularity to be reduced,
thereby maximising the benefit of XDR's higher interface speeds in future
generations of the XDR product family.
Focus on Core Issues
Let us now discuss the changes that are
required in the DRAM core to reduce access granularity. Consider a standard
DRAM core organised as eight banks that are logically interleaved, as
shown in figure 1. Two sets of data pins divide the banks into
two halves that operate in parallel in response to row and column commands.
A row command selects a single row within each bank half, and two column
commands select two column locations within each row half. Four bank
halves make up a quadrant having its own set of column and row decoder
circuits.
Figure 1. Standard DRAM core.
After a row command is received, the
selected row is sensed and latched. The row-access time, tRR,
must elapse before another bank can perform a row access. The bank's
row circuitry is occupied throughout this interval. After a column command
("col x") is received, the selected column is accessed. The column-access
time, tCC, must elapse before the bank can perform another
column access ("col y").
The physical limitations on signal propagation
times restrict the bit-transport interval to 0.25ns and constrain the
minimum tCC to typically 4ns. Hence the maximum column access rate is
250MHz, and 16 bits (4ns / 0.25ns) are transported on each link during
a column access. With 16 data links, the column granularity is 256 bits,
or 32 bytes. Because tRR is twice tCC, the row granularity is 64 bytes.
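The granularity figures above follow from simple arithmetic, sketched below. The function and parameter names are illustrative, and the 8ns value for tRR is inferred from the statement that tRR is twice tCC rather than quoted directly.

```python
def core_granularity(t_cc_ns: float, t_rr_ns: float,
                     bit_time_ns: float, data_links: int) -> tuple[int, int]:
    """Return (column_bytes, row_bytes) for a conventional DRAM core."""
    # Bits each link can carry during one column-access interval.
    bits_per_link = round(t_cc_ns / bit_time_ns)
    # All links carry data for the whole interval.
    column_bytes = bits_per_link * data_links // 8
    # One row access spans several column accesses (tRR / tCC of them).
    row_bytes = column_bytes * round(t_rr_ns / t_cc_ns)
    return column_bytes, row_bytes


if __name__ == "__main__":
    # tCC = 4ns, tRR = 8ns (inferred), 0.25ns bit time, 16 data links.
    print(core_granularity(4.0, 8.0, 0.25, 16))
    # Maximum column access rate in MHz for tCC = 4ns.
    print(1000.0 / 4.0)
```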
Micro-Threading for Bandwidth Efficiency
Reorganising the core into a larger number
of banks, each with independent row and column circuitry, provides the
opportunity to overcome the restrictions on tRR and tCC.
This architecture can be implemented in most modern DRAM cores with
minimal area overhead, and allows several small accesses to occur during
these time intervals. The enhanced core is said to be micro-threaded.
Figure 2. Micro-threaded DRAM core.
Figure 2 shows the internal details for
a micro-threaded DRAM core. There are 16 independent banks, each equivalent
to a half-bank of the typical DRAM core shown in figure 1. Even-numbered
banks connect to the "A" data pins and the odd-numbered banks connect
to "B" data pins. The banks are organised as groups of four, forming
quadrants that have dedicated row and column circuitry and are therefore
able to operate independently in response to row and column commands.
A column access of an upper quadrant is interleaved with the corresponding
column access of the lower quadrant.
Figure 3 shows the timing of a transaction
for this micro-threaded DRAM component. After a row command ("r0")
is received, the selected row (in bank 0) is accessed. A time tRR
must elapse before another bank in the same bank quadrant can
perform a row access. However, banks in the other three quadrants may
be accessed during the interval – row commands r1, r2, and r3 are
directed to banks 1, 2, and 3, respectively.
Figure 3. Data transaction timing in
micro-threaded DRAM core.
After a column command ("c0x") is
received, the selected column is accessed (column 0x of row 0 of bank
0). A time tCC must elapse before this bank can receive another
column access command ("c0y"). However, banks in the other three
quadrants may be column-accessed during the interval – column commands
c1x, c2x, and c3x are directed to banks 1, 2, and 3, respectively.
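The quadrant rule described above for row and column commands can be sketched as a small scheduling model. This is a simplification under stated assumptions, not the actual XDR command protocol: the mapping of bank number to quadrant (bank modulo 4) is assumed only because banks 0 to 3 are described as lying in different quadrants, and the 2ns command slot is a hypothetical value chosen so that four commands fit into an 8ns tRR interval.

```python
def issue_times(banks, interval_ns=8.0, slot_ns=2.0, quadrants=4):
    """Earliest issue time (ns) for each command in an ordered list of banks.

    A command must wait `interval_ns` (tRR for rows, tCC for columns)
    after the previous command to the same quadrant, and `slot_ns`
    after the previous command of any kind.
    """
    last_in_quadrant = {}   # quadrant -> time of its most recent command
    times = []
    next_slot = 0.0         # earliest time the command bus is free
    for bank in banks:
        quadrant = bank % quadrants  # assumed bank-to-quadrant mapping
        earliest = next_slot
        if quadrant in last_in_quadrant:
            earliest = max(earliest, last_in_quadrant[quadrant] + interval_ns)
        times.append(earliest)
        last_in_quadrant[quadrant] = earliest
        next_slot = earliest + slot_ns
    return times


if __name__ == "__main__":
    # r0..r3 go to four different quadrants and overlap within one tRR.
    print(issue_times([0, 1, 2, 3]))
    # Two commands to the same quadrant must be a full interval apart.
    print(issue_times([0, 4]))
```

In this model, commands spread across the four quadrants are issued back to back, whereas two commands that land in the same quadrant are separated by the full interval, which is the behaviour a non-micro-threaded core imposes on every access.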
As with the typical DRAM core example,
the tCC interval is 4ns, and the bit transport interval is
0.25ns. However, each column access only transports data for half the
tCC interval, and each column access only uses 8 of the 16
data links, resulting in a column granularity of 8 bytes, one-quarter
of the previous value. The row granularity is 16 bytes, again one-quarter
of the previous value.
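The reduction to one-quarter can be checked with the sketch below, which scales the time window and the number of links used by each column access. The names and the two fraction parameters are illustrative, and the 8ns tRR is again inferred from tRR being twice tCC.

```python
def micro_threaded_granularity(t_cc_ns=4.0, t_rr_ns=8.0, bit_time_ns=0.25,
                               data_links=16, time_fraction=0.5,
                               link_fraction=0.5):
    """Return (column_bytes, row_bytes) when each column access uses only
    part of the tCC interval and part of the data links."""
    # Bits per link during the share of tCC given to one column access.
    bits_per_link = round(t_cc_ns * time_fraction / bit_time_ns)
    # Links (e.g. the "A" or the "B" pins) used by one column access.
    links_used = round(data_links * link_fraction)
    column_bytes = bits_per_link * links_used // 8
    row_bytes = column_bytes * round(t_rr_ns / t_cc_ns)
    return column_bytes, row_bytes


if __name__ == "__main__":
    # Micro-threaded core: half the interval, half the links.
    print(micro_threaded_granularity())
    # Setting both fractions to 1.0 reproduces the standard core.
    print(micro_threaded_granularity(time_fraction=1.0, link_fraction=1.0))
```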
Reducing granularity in this way delivers
performance advantages for applications in the groups mentioned previously,
even though interface transfer bandwidth and core access intervals are
unchanged compared to a standard non-micro-threaded component. Figure
4 highlights the performance benefit of micro-threading, comparing two
DRAMs featuring identical core and interface speeds operating in a graphics
application accessing a range of triangle sizes. The micro-threaded
core has two to four times the effective triangle access rate.
Figure 4. Comparison of micro-threaded
and non-micro-threaded DRAM performance.
By adding this and other innovative features,
future generations of the XDR memory architecture are capable of supporting
data rates from 6.4GHz to 12.8GHz, thereby dramatically increasing the
bandwidth to between 25.6GB/s and 51.2GB/s from a single future-generation
x32 XDR DRAM component.
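The bandwidth figures quoted in this article follow from the per-pin data rate and the component width. A minimal sketch, with illustrative function and parameter names:

```python
def component_bandwidth_gbytes(data_rate_gbps_per_pin: float,
                               data_pins: int) -> float:
    """Peak component bandwidth in GB/s: per-pin rate times width, in bytes."""
    return data_rate_gbps_per_pin * data_pins / 8


if __name__ == "__main__":
    # Current x16 XDR DRAM at 3.2 and 6.4 Gbit/s per pin.
    print(component_bandwidth_gbytes(3.2, 16), component_bandwidth_gbytes(6.4, 16))
    # Future x32 component at 6.4 and 12.8 Gbit/s per pin.
    print(component_bandwidth_gbytes(6.4, 32), component_bandwidth_gbytes(12.8, 32))
```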
The XDR memory architecture continues
to provide unprecedented levels of memory performance to keep up with
processor performance requirements in next-generation gaming, compute,
and consumer platforms. Rambus innovations such as micro-threading effectively
regain the memory bandwidth efficiency lost through successive generations
of high-speed interfaces that have traded access granularity to gain
improvements in maximum data rate.
Continuing this trend, future demand
for increased memory bandwidth will require further architectural innovations.
Rambus is well placed to meet these requirements.
About the Author
Michael Ching has over 14 years of experience in high-speed design. He joined Rambus Inc. in 1996, and currently manages marketing of Rambus' high-speed interface products and intellectual property portfolio. At Rambus, he has held various positions in industry-infrastructure enabling and design engineering. Prior to joining Rambus, Michael designed high-speed I/Os for microprocessors for Intel Corporation.
Michael holds an M.S. in electrical engineering from the University of California at Berkeley.