The physical implementation of processors in the 1980s and 1990s centered on the custom circuit design and layout of the heart of the processor: the speed-critical circuits whose performance depends on the physical layout of the chip's main datapath. Datapath layout structures, popularized by the 1980 Mead and Conway methodology, were used to minimize these layout-dependent effects. A second generation of tools and methodologies, developed by Silicon Compilers, Seattle Silicon Technology and Compass Design Automation, addressed the creation of libraries for high-performance cores through semi-custom datapath structures. The goal was to manage these effects with datapath cells and standard cells, providing a flexible solution for multiple types of processors. Synthesis changed all of this, and by the end of the 1990s fully synthesizable processor cores became available.
With some of today's synthesizable cores targeting clock speeds over 2 GHz, it is critical to understand both the science and the art of using high-performance logic libraries to obtain the best performance, power, and area (PPA).
This article discusses the best technology and techniques for hardening CPU cores. These fundamental principles apply to any CPU core that targets the optimal PPA from the silicon process. You will learn proven best practices and solutions that you can apply immediately to your core optimization project.
A Word about Critical Paths
To achieve optimal processor performance you must reduce the delay in the critical paths of your design. These critical paths can be in your register-to-register paths (logic) or the memory access paths to/from the L1/L2 caches. All paths must meet their constraints in order to achieve timing closure.
Figure 1. The left image shows the most critical register-to-register path and the right image shows the memory paths to and from the L1 cache of a GHz CPU.
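The timing closure condition described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: a single clock, setup checks only, and hypothetical path names and delay values that are not taken from any real core. The names `TimingPath`, `slack_ps` and `critical_path` are invented for this sketch and are not part of any EDA tool.

```python
from dataclasses import dataclass


@dataclass
class TimingPath:
    """One register-to-register or memory access path (all values hypothetical)."""
    name: str
    launch_clk_to_q_ps: float   # delay of the launching flop or memory output
    data_delay_ps: float        # logic plus wire delay along the path
    capture_setup_ps: float     # setup time of the capturing flop or memory input


def slack_ps(path, clock_period_ps, clock_uncertainty_ps=0.0):
    """Setup slack: required arrival time minus actual arrival time."""
    required = clock_period_ps - clock_uncertainty_ps - path.capture_setup_ps
    arrival = path.launch_clk_to_q_ps + path.data_delay_ps
    return required - arrival


def critical_path(paths, clock_period_ps, clock_uncertainty_ps=0.0):
    """The critical path is the one with the smallest (worst) slack."""
    return min(paths, key=lambda p: slack_ps(p, clock_period_ps, clock_uncertainty_ps))


# Assumed example: a 2 GHz target (500 ps period) with 30 ps of clock uncertainty.
paths = [
    TimingPath("reg_to_reg_alu", 60.0, 380.0, 40.0),
    TimingPath("l1_cache_read", 60.0, 400.0, 40.0),
]
worst = critical_path(paths, 500.0, 30.0)
timing_met = all(slack_ps(p, 500.0, 30.0) >= 0.0 for p in paths)
print(worst.name, slack_ps(worst, 500.0, 30.0), timing_met)
```

With these assumed numbers the memory path has the worst slack, so it sets the achievable clock frequency until it is fixed; timing closure requires every path, not just the worst one, to reach non-negative slack.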
To keep memory timing out of the critical path, you can:
- Use high-performance memory compilers to generate the optimal configurations of memory instances required for your design over the set of processor memory configurations.
- Start with a good initial floorplan to minimize the physical distance between the memory I/O pins and the critical registers within the processor logic. The ability to change this floorplan is critical as your design progresses and you start applying engineering tradeoffs to achieve your goals.
A good floorplan based on the number of cores and the rest of your system-on-chip (SoC) interconnectivity requirements can minimize the physical distance in the top level of the design and reduce timing bottlenecks.
You will need four things to harden your high-performance core:
- A high-performance EDA tool flow
- High-performance logic libraries with power optimization kits
- High-performance memory compilers in the required configurations
- High-density logic libraries and memory compilers for the rest of the SoC that may operate at lower frequencies than the high-performance core
High-Performance EDA Tool Flow
Although there are multiple EDA tool sets and flows that can be used to harden processor cores, in this article we will use the Synopsys Galaxy tools with the High Performance Core (HPC) scripting methodology as the basis for the discussion. The HPC scripts are a reference methodology that is neither vendor-specific nor core-specific. They use the latest tools and methodology, including Design Compiler Graphical and Synthesis Placement Guidance, along with a single set of user-configurable scripts for each step. These scripts include a configurable handoff between Design Compiler Graphical and IC Compiler, with common scripts for setup tasks and tool-specific scripts. They enable you to set up your tools for your specific processor configuration, floorplan and performance targets, and are kept up to date with the latest tool versions. HPC scripts are also available for the Lynx Design System for visualizing your design flow, managing your project and managing runtime execution of EDA tools.
Logic Libraries for Core Hardening
The custom-designed datapath blocks of the processors of the 1980s and 1990s used custom cells for each function of the design to maximize performance through minimum wire lengths and optimized spacing while minimizing overall area. This was a good solution with one large drawback: the need to assemble and manage a large team of custom cell designers for each core design, over several months, to complete the processor library. Only the largest CPU providers could maintain such methods. With synthesizable cores, today's high-performance standard cell libraries and EDA tools can achieve an optimal solution without having to design a new library for every processor implementation. You will need five things from your logic library to optimally harden your high-performance core:
- High-performance combinational cells
- High-performance clock cells
- High-performance sequential cells
- Power optimization kit
- Availability of the above libraries characterized in the Process, Voltage and Temperature (PVT) conditions you need
Figure 2. Cells for resolving high-performance core design challenges
High-Performance Combinational Cells
Optimizing register-to-register paths requires a rich standard cell library that includes the appropriate functions, drive strengths, and implementation variants. Even though NAND is functionally complete, so that any logic function can in principle be built from NAND gates alone, a rich set of functions (NAND, NOR, AND, OR, inverters, buffers, XOR, XNOR, MUX, adders, compressors, etc.) is necessary for synthesis to create high-performance core implementations. Synthesis and optimizing routers can take advantage of a rich set of drive strengths to optimally handle the different fanouts and loads created by the design topology and physical distances between cells.
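To make the point about a single NAND gate concrete, the following sketch builds NOT, AND, OR and XOR from a two-input NAND and records how many NAND gates each construction uses. The function names and the `NAND_COUNT` table are purely illustrative, and gate count is only a rough proxy here: real cell delay also depends on drive strength, load and layout.

```python
from itertools import product


def nand(a, b):
    """Two-input NAND on 0/1 values."""
    return 1 - (a & b)


def not_(a):
    return nand(a, a)                      # 1 NAND


def and_(a, b):
    return not_(nand(a, b))                # 2 NANDs


def or_(a, b):
    return nand(not_(a), not_(b))          # 3 NANDs


def xor_(a, b):
    n = nand(a, b)                         # shared first-level gate
    return nand(nand(a, n), nand(b, n))    # 4 NANDs, 3 logic levels


# NAND gates needed per function when only NAND is available.
NAND_COUNT = {"NOT": 1, "AND": 2, "OR": 3, "XOR": 4}

# Check every construction against Python's own bitwise operators.
for a, b in product((0, 1), repeat=2):
    assert and_(a, b) == (a & b)
    assert or_(a, b) == (a | b)
    assert xor_(a, b) == (a ^ b)
```

An XOR built this way passes through three levels of gates, whereas a library with a dedicated XOR cell lets synthesis map the same function to one cell. That is the kind of choice a rich function set gives the tools on a critical path.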
Multiple threshold voltage (VT) options and channel lengths give the tools additional choices, as do variants of these cell functions, such as tapered cells and drive strengths that are optimized for minimal delay in typical processor critical paths. Having these critical path-efficient cells, along with computationally efficient cells such as AOIs and OAIs, is the first step. These performance-enhancing options must also be supported by an EDA flow that can take advantage of them. High-drive variants of these cells must be designed with special layout considerations to effectively manage electromigration when operating at GHz speeds. To encourage the tools to make the correct choices in selecting cells and to minimize cycle time, it is often necessary to use don't_use lists to temporarily hide cells from, or expose them to, the tools. Grouping signals that have similar constraints and loads because of their physical placement, along with other techniques, can also make a major difference in synthesis efficiency. Squeezing the last picosecond of performance out of a design requires the tools and flows to be pushed at different steps in the design flow (initial synthesis, clock tree synthesis, placement, routing, physical optimization) to provide the best results. Applied after a baseline design, these optimization techniques can typically provide 15-20% additional performance.
High-Performance Sequential Cells
The setup time plus the delay time of a flip-flop is referred to as the dead time or the "black hole" time. Like clock uncertainty, this time eats into every clock cycle that could otherwise be doing useful computational work. Multiple sets of high-performance flip-flops are required to optimally manage this dead time. Delay-optimized flops are required to rapidly launch signals into the critical path logic clusters, and setup-optimized flops are required in the capture registers to extend the available clock cycle. Synthesis and routing optimization tools can be effectively constrained to use these multiple flop sets for maximum performance, resulting in another 15-20% performance improvement.
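The effect of dead time on the usable part of a clock cycle can be illustrated with simple arithmetic. The sketch below is a minimal model, and every number in it is an assumed example rather than data from a real library; `logic_time_ps` is a hypothetical helper name.

```python
def logic_time_ps(clock_period_ps, launch_delay_ps, capture_setup_ps,
                  clock_uncertainty_ps=0.0):
    """Time left in one cycle for useful logic after dead time and uncertainty."""
    dead_time_ps = launch_delay_ps + capture_setup_ps
    remaining = clock_period_ps - dead_time_ps - clock_uncertainty_ps
    if remaining <= 0:
        raise ValueError("dead time and uncertainty consume the whole cycle")
    return remaining


PERIOD_PS = 500.0  # one cycle at 2 GHz

# Assumed general-purpose flops: 70 ps launch delay, 40 ps setup.
baseline = logic_time_ps(PERIOD_PS, 70.0, 40.0, clock_uncertainty_ps=40.0)

# Assumed delay-optimized launch flop (50 ps) and setup-optimized capture flop (20 ps).
optimized = logic_time_ps(PERIOD_PS, 50.0, 20.0, clock_uncertainty_ps=40.0)

print(baseline, optimized)  # 350.0 390.0
```

In this assumed case, pairing a faster launch flop with a capture flop that has a shorter setup time returns 40 ps of every 500 ps cycle to the logic, which is time the critical path can use without any change to the datapath itself.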
High-Performance Clock Cells
High-performance clock driver variants are tuned to provide minimum delay, which reduces clock latency and minimizes the clock uncertainty caused by skew and process variability. Clock uncertainty eats into every clock cycle that could otherwise be doing useful computational work. Clock tree synthesis tools must understand the PPA tradeoffs of these variants to be able to use them effectively. Techniques such as "useful skew" can be applied to some designs. Useful skew refers to a command or flow that serves as an additional optimization trick in the timing closure arsenal: it modifies the clock network, rather than the datapath, to close timing, creating other opportunities to achieve the design goals. Wise use of integrated clock gating cells (ICGs) in multiple functional and drive strength variants is critical to minimizing clock tree power, which can easily account for 50% of the dynamic power consumed in an SoC.
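One common way to picture useful skew is as time borrowing between adjacent pipeline stages: delaying the clock at an intermediate register gives the stage before it more time and the stage after it less. The sketch below assumes a simple linear pipeline, considers setup checks only (hold checks are ignored), and uses made-up delays; `stage_slacks` is a hypothetical helper, not a tool command.

```python
def stage_slacks(period_ps, stage_delays_ps, clock_arrivals_ps):
    """Setup slack per stage; stage i runs from flop i to flop i + 1.

    stage_delays_ps[i] is assumed to already include the launch flop's delay
    and the capture flop's setup time. clock_arrivals_ps[i] is the clock
    arrival time at flop i relative to an ideal clock.
    """
    if len(clock_arrivals_ps) != len(stage_delays_ps) + 1:
        raise ValueError("need one clock arrival per flop")
    return [
        period_ps + clock_arrivals_ps[i + 1] - clock_arrivals_ps[i] - delay
        for i, delay in enumerate(stage_delays_ps)
    ]


PERIOD_PS = 500
DELAYS_PS = [530, 450]          # assumed: first stage is 30 ps too slow

balanced_clock = stage_slacks(PERIOD_PS, DELAYS_PS, [0, 0, 0])
skewed_clock = stage_slacks(PERIOD_PS, DELAYS_PS, [0, 40, 0])

print(balanced_clock)  # [-30, 50]
print(skewed_clock)    # [10, 10]
```

Delaying the middle register's clock by 40 ps moves slack from the second stage to the first, so both stages meet a 500 ps period without touching the datapath. A real flow would also have to confirm that hold times and the clock tree's own PPA are still acceptable.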
Trading Area for Performance
Achieving a minimum core area through the efficient interaction of standard cell architecture and layout with advanced router capabilities not only saves you money on silicon; the smaller area can also run faster and burn less power because of reduced wire length and capacitance. Low routing congestion also enables shortest-path wire routing because wires do not have to take detours to avoid congestion. When you really need high performance and routing utilization is high, it can be beneficial to intentionally increase the metal routing pitch and add another metal layer to maintain the routing resources. The resulting routing will have the same wire length but reduced wire load due to smaller sidewall capacitance, which also reduces noise. The use of various combinations of lower VT implants and long channel transistors with their unique performance vs. leakage tradeoffs can provide additional performance without breaking the power budget.
Figure 3. This graph uses channel length and selected VTs to compare area vs. performance tradeoffs on synthesized processors. The vertical axis is relative core area and the horizontal axis is relative core performance. You can see that using the SVT_long channel delivers better performance for a given area than using HVT_min. Likewise the LVT_long channel has better performance for a given area than does SVT_min. Both have better area for a given performance, and with less leakage. The SVT_long library provides another benefit of skipping the HVT mask entirely to save mask and wafer processing costs.
Trading Power for Performance
Silicon processes have come a long way since the 1980s and 1990s with the availability of low VT and ultra-low VT implant options that can increase performance by up to 60% at the expense of leakage that is orders of magnitude higher. Running processor cores at overdrive voltages (typically 10% over nominal voltage for a 20% speed increase, or even higher voltages for short bursts of critical performance such as processing a hyperlink on a smartphone browser) is an option if these voltages are available on-chip and the power optimization kit has the right set of high-performance level shifters for these overdrive voltages. The resulting increase in dynamic power and leakage can be mitigated by the selective use of longer channel length cells and by the application of more advanced and more complex power optimization techniques such as shutdown, power down, low voltage operation, and dynamic voltage and frequency scaling (DVFS) when there is little or no processing load on the CPU. These techniques are supported by the multi-voltage cells available in a power optimization kit.
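As a rough illustration of what overdrive costs, the sketch below applies the textbook first-order CMOS switching power relation, in which dynamic power is proportional to capacitance times voltage squared times frequency. That relation is an assumption brought in for this example and is not stated in the article; leakage is ignored, and `relative_dynamic_power` is a hypothetical helper name.

```python
def relative_dynamic_power(voltage_scale, frequency_scale):
    """Dynamic power relative to nominal, assuming P is proportional to C * V^2 * f.

    Switched capacitance and activity are assumed unchanged; leakage is ignored.
    """
    if voltage_scale <= 0 or frequency_scale <= 0:
        raise ValueError("scale factors must be positive")
    return voltage_scale ** 2 * frequency_scale


# The overdrive case from the text: 10% over nominal voltage, 20% more speed.
overdrive_power = relative_dynamic_power(1.10, 1.20)

# Energy per operation scales with V^2 only, since the work takes less time.
overdrive_energy_per_op = relative_dynamic_power(1.10, 1.0)

print(round(overdrive_power, 3), round(overdrive_energy_per_op, 3))  # 1.452 1.21
```

Under this simple model, a 20% speed increase from a 10% overdrive costs roughly 45% more dynamic power. This is one way to see why overdrive is usually reserved for short bursts and paired with techniques such as DVFS when the load drops.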
To achieve optimal PPA when hardening your core, you will need a high-performance logic library and memory compilers with a rich set of cells in multiple architectures that support long channel, DVFS, shutdown and overdrive operations. To build the rest of your SoC (the other 80%), you need the same logic libraries and memory compilers designed with an architecture optimized for high density. To meet your time-to-market goals, you need both of these libraries integrated with the best EDA tools and flexible flows. To minimize your risk, you will need availability of multiple manufacturing options and local access to worldwide service and support teams.
Synopsys High Performance Physical IP - See for yourself
- High Speed Standard Cell Logic Library, TSMC 28HPM LVT
- High Speed Power Optimization Kit, TSMC 28HPM LVT
- High Speed Standard Cell Logic Library, Overdrive PVTs, TSMC 28HPM ULVT
- Single Port, High Speed Register File, TSMC 28nm HPM P-Optional Vt/Cell Std Vt
- Single Port, High Speed SRAM, TSMC 28nm HPM P-Optional Vt/Cell Std Vt