Lisa Minwell — eSilicon
The ongoing convergence of the three C's (computers, communication and content, or multimedia) continues to increase the memory content of SoCs. SoCs aggregate functionality across multiple CPUs, which must run to stringent power and performance specifications. Advanced process technologies enable this high level of functionality, which in turn drives a constant increase in memory content in every chip design. As the industry moves to more advanced process nodes and multi-die solutions, a comprehensive approach to system-level design and performance analysis is vital.
Disaggregation is also occurring within the industry. Because commercial IP providers sell silicon-qualified IP, many companies choose to outsource their IP development for cost savings and potential risk reduction. Commercially available memory compilers offer a wide range of options that enable chip designers to deploy multiple power management schemes or adjust threshold voltage implants to offset leakage. But what happens as the chip design progresses and the target specifications cannot be met? The design may cost more to build: it may require a larger die, it may take many more resources to close on the power, performance, or area (PPA) specifications, or it may require a different package because the power budget was not met.
It is highly likely that one or more memories in a chip will require some modification to meet PPA targets. In most new tapeouts, memory will encompass over 60 percent of an SoC's area. The most efficient way to improve PPA is to analyze memory IP to determine optimization strategies that close the gap between the design specification and design implementation.
For example, a 28nm chip with dimensions of 13.5mm x 13.5mm that contains 400Mb of memory has 55.38 percent of its area devoted to memory. Memory is therefore a very significant factor in controlling the footprint and packaging for this product.
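The arithmetic behind this example can be checked directly. The sketch below uses only the figures quoted above (die dimensions, memory fraction, bit count); the derived effective area per bit is an implied quantity, not a number from the article.

```python
# Check of the 28nm example above: die area, memory area, and the
# implied effective area per bit (periphery included).
die_mm2 = 13.5 * 13.5           # 182.25 mm^2 total die area
mem_fraction = 0.5538           # 55.38% of the die is memory
mem_mm2 = die_mm2 * mem_fraction

bits = 400e6                    # 400 Mb of memory content
um2_per_bit = mem_mm2 * 1e6 / bits  # mm^2 -> um^2, divided across all bits

print(f"Die area: {die_mm2:.2f} mm^2, memory area: {mem_mm2:.2f} mm^2")
print(f"Implied area per bit: {um2_per_bit:.3f} um^2")
```

Roughly 0.25 um^2 per bit at 28nm is about twice the raw bit cell area, consistent with the array-efficiency factors discussed below.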
Memory Optimization Strategies
Custom IC design still offers the best solution for meeting power, cost and performance requirements. It isn't just about getting the highest performance any more. It's about getting the desired performance at optimized power and area, on time and on budget. However, the permutations of technologies, libraries, and IP are increasing exponentially for complex custom ICs. It is imperative that the design team has deep silicon knowledge and up-to-date experience to make the right choices.
Memory Area Optimization at 40nm
For example, "Chip X," a 40nm SoC, includes 3310 memory macros with 536 unique configurations. The memory subsystem comprises three memory architectures: single-port SRAM, one-port register files, and two-port register files. The total die area is 64mm², which also includes dual processors running at 1.2GHz under worst-case conditions on a low-power technology node. When looking for potential die size reduction, the memory content of the chip is a very likely candidate.
The most straightforward approach is to analyze the memory array efficiency, measured as (total number of bit cells x bit cell area) / total memory instance area. Small memory instances (<64Kb) will have an efficiency factor closer to 50 percent, while larger instances (~1Mb) should ideally be closer to 80 percent. Analyze the memory content by identifying the instances that make the largest area contribution and assessing their array inefficiency, i.e., how far they deviate from the ideal array efficiency factor.
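The efficiency metric above can be sketched as a small helper. The instance names, bit counts, and areas below are illustrative placeholders, not data from Chip X; only the formula and the 50/80 percent targets come from the text.

```python
# Sketch of the array-efficiency metric described above.
def array_efficiency(num_bitcells, bitcell_area_um2, instance_area_um2):
    """(total # of bit cells * bit cell area) / total memory instance area."""
    return (num_bitcells * bitcell_area_um2) / instance_area_um2

# (name, bits, bit cell area in um^2, instance area in um^2) -- hypothetical
instances = [
    ("sp_sram_1Mb", 1_048_576, 0.30, 410_000),  # large macro, target ~0.80
    ("rf_2p_32Kb",     32_768, 0.35,  23_000),  # small macro, target ~0.50
]
for name, bits, cell_um2, inst_um2 in instances:
    eff = array_efficiency(bits, cell_um2, inst_um2)
    print(f"{name}: array efficiency {eff:.2f}")
```

Ranking instances by (instance area x inefficiency) then surfaces the macros where customization buys the most die area.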
Figure 1 - Area of Memory Subsystem Analysis for 40nm Example
Memory area may also be improved by removing unwanted compiler options. Commercial memory compilers include many options, and some features cannot be removed or are "always enabled." If the design does not use these features, the affected instances can be customized and the unused circuitry removed. Area can also be reduced by trimming some of the additional margin in the drive strengths of the tiled compiler circuitry.
Another key contributor to memory subsystem area is the increasing size of processor caches. Many SoC designs tile a smaller memory macro to build their larger L2 cache. Compared with one aggregate multi-megabit cache macro, this approach carries additional peripheral logic overhead. In the 40nm Chip X design, the 8Mb L2 cache comprises 250 instances of a popular 1K x 32 macro. By implementing an array-efficient, high-density memory architecture, a single 8Mb macro implementation saves this design 2mm².
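The source of the saving can be sketched as follows. The per-instance array and periphery areas here are assumptions chosen to illustrate the mechanism (each small macro pays its own decoder/IO/self-timing periphery, while one large macro amortizes it); only the instance count and the 2mm² result come from the Chip X analysis.

```python
# Why aggregating many small cache macros into one dense macro saves area.
# Per-instance area figures are illustrative assumptions, not Chip X data.
def tiled_cache_area(n_instances, array_mm2_each, periphery_mm2_each):
    # Each small macro carries its own peripheral logic overhead.
    return n_instances * (array_mm2_each + periphery_mm2_each)

def aggregated_cache_area(total_array_mm2, shared_periphery_mm2):
    # One large macro amortizes periphery across the whole array.
    return total_array_mm2 + shared_periphery_mm2

tiled = tiled_cache_area(250, 0.025, 0.010)        # 250 x (1K x 32) macros
single = aggregated_cache_area(250 * 0.025, 0.50)  # one dense 8Mb macro

print(f"tiled: {tiled:.2f} mm^2, aggregated: {single:.2f} mm^2, "
      f"saving: {tiled - single:.2f} mm^2")
```

The array area is identical in both cases; the entire saving comes from replacing 250 copies of the peripheral circuitry with one shared set.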
Table 1 - Area Reduction Analysis for 40nm Chip Example
Combining these area-saving approaches yields a 9.97 percent area reduction in this 40nm design while still achieving 1.2GHz performance and slightly reducing dynamic power.
The Need for Speed
Published performance data for mobile phones reveals a continued trend in the requirement for increased performance. Chips fabricated on 28nm low-power processes will have embedded processors running at ≥1.5GHz. This performance criterion is expected to increase to ≥2.5GHz at the 22/20nm technology node.
There is a small subset of memory configurations that comprise the major embedded microprocessor memory subsystems. These memories must be very fast and area-efficient as the number of embedded processors per system is increasing rapidly. To achieve the required speed, the memories must support a very aggressive address setup time to clock while being self-timed and edge-triggered from a single clock. This requires aggressive design margining and duty-cycle variation management. These design challenges are magnified at 22nm, where a networking chip is expected to have more than 80 processing units, while SoC chips targeting portable consumer markets are expected to contain eight main processing units in 2014.
eSilicon has produced a high-speed SRAM architecture used to build a MIPS multi-core test chip fabricated in the GLOBALFOUNDRIES 28SLP low-power process. The memories have been tested in silicon at 1.5GHz (typical).
Because commercial IP providers sell silicon-qualified IP, many companies outsource their IP development for cost savings and potential risk reduction. As memory content continues to grow, utilizing high-quality, silicon-proven IP will be critical. So how can we advocate customization while mitigating risk? The key is to start from a silicon-proven base memory architecture that is highly array-efficient. As shown in Figure 1, the 40nm example has 536 unique memory configurations, but 21 of them make up over 50 percent of the memory subsystem area. The design can use commercial memory compilers for the other 515 configurations and accept a small risk by customizing the 21 macros that dominate the subsystem area. That risk is limited to minimal layout modifications and transistor size adjustments.
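This screening step, ranking configurations by area and finding how few dominate the subsystem, can be sketched as a simple Pareto cut. The configuration areas below are randomly generated with a heavy-tailed distribution for illustration; they are not the real 536-configuration netlist, so the dominant count will differ from the article's 21.

```python
# Pareto screen: how few memory configurations cover >50% of subsystem area?
# Configuration areas are synthetic (heavy-tailed), not real chip data.
import random

random.seed(7)
configs = [(f"cfg_{i}", random.paretovariate(1.2)) for i in range(536)]

configs.sort(key=lambda c: c[1], reverse=True)   # largest contributors first
total = sum(area for _, area in configs)

running, dominant = 0.0, 0
for _, area in configs:
    running += area
    dominant += 1
    if running > total / 2:
        break

print(f"{dominant} of {len(configs)} configurations cover >50% of memory area")
```

The output list is exactly the set of macros worth the customization risk; everything below the cut stays on the stock commercial compiler.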
An integrated design and manufacturing house or a value chain producer has the capability of selecting commercial IP or developing IP to exactly meet the requirements of the design. Value chain producers have all of the necessary operations to fully harden a design by deploying in-house expertise in modeling, test engineering and yield management.
Figure 2 - Value Chain Producer Engagement Level Benefits with Optimized IP
Customizing SRAM content is an effective, efficient way to optimize an SoC to deliver the right power, performance and area for an individual design. There are a variety of strategies, including optimizing circuits and layout for the specific array size and timing target, de-featuring by removing unwanted circuitry, using smaller bit cells, and aggregating L2 caches into one large/dense RAM.