Semiconductor IP News and Trends Blog
Cadence Implementation Flow for an ARM Cortex-A73 at 10nm
Implementing an appropriate ARM® processor has increasingly become the standard way to pipe-clean a digital flow for a new process. ARM processors are widely used and are available at various levels of complexity. For 10nm (what TSMC calls N10), Cadence and ARM worked together to implement a Cortex®-A73 core, which is ARM’s highest-performance mobile processor. At the recent TSMC OIP Symposium, a joint presentation by Paddy Mamtora of Cadence and Shawn Hung of ARM went into the details.
There were several goals:
Ensure ARM’s next-generation CPU is optimized for the advanced process
Ensure EDA tools and ecosystem are ready for lead partners
Provide feedback on PPA to ARM design teams
Identify additional physical IP requirements for optimal PPA
Accelerate adoption of next-generation process and associated flows
The Cortex-A73 is the highest-performance processor that fits within the mobile power envelope. Compared to the Cortex-A72, it offers 30% better power efficiency, which lets it reach new levels of sustained usage.
The implementation had several challenges. Efficiency is the priority because the core targets battery-powered mobile devices. Power grid integrity is critical: the grid has to meet EM and IR targets while still keeping placement density up. At 10nm, wire resistance continues to dominate, making the use of optimized physical libraries more important. The flow needs to be fully colored, and triple coloring may be required. At 10nm, M1 and M2 are forced to be horizontal and vertical, respectively, which affects both cell architecture and routability. As has now become routine, there is a significant increase in design rules that, in turn, impact both the placer and the routers. There are electrical changes, too. Resistance increases, and different tracks may have different resistance and capacitance depending on their color. Generally, variation is larger, so more accuracy is needed in analysis.
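To make the coloring requirement concrete, here is a minimal sketch (not from the presentation) of how a double-patterning check can be framed. Shapes that sit too close together to share a mask form a conflict graph, and two masks are enough only if that graph contains no odd cycle. The shape names and conflict edges below are hypothetical, and a production tool works on real geometry and spacing rules rather than a hand-written graph.

```python
from collections import deque

def two_color(conflicts):
    """Try to assign each shape to one of two masks (0 or 1).

    conflicts maps a shape name to the set of shapes that are too close
    to share its mask. Every neighbor must also appear as a key.
    Returns a dict of mask assignments, or None when an odd cycle makes
    two masks insufficient (a third color or a layout change is needed).
    """
    color = {}
    for start in conflicts:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            shape = queue.popleft()
            for neighbor in conflicts[shape]:
                if neighbor not in color:
                    # Put the neighbor on the opposite mask.
                    color[neighbor] = 1 - color[shape]
                    queue.append(neighbor)
                elif color[neighbor] == color[shape]:
                    # Two conflicting shapes landed on the same mask.
                    return None
    return color

# Hypothetical conflict graphs: a three-shape chain and a triangle.
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}

chain_masks = two_color(chain)        # alternating masks along the chain
triangle_masks = two_color(triangle)  # None: odd cycle, two masks fail
```

The triangle case is the kind of conflict that would push a layer toward triple coloring, which is why the placer and router need to be color aware from the start rather than fixing conflicts afterwards.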
The implementation proceeded in several steps. The first stage, “process scaling”, took a small piece of the design (20K flops and 200K instances), big enough to present authentic challenges but small enough for rapid turnaround. It contained the microarchitecture's critical paths, which allowed the flow to be proven and produced feedback on the IP. Area was compared with a fully routed block (area reduction was 50%). Different libraries with different gate lengths and Vts were tried (leakage reduction was 30-70% depending on the voltage, and dynamic power was reduced by about 37%).
The next stage, “putting it all together”, took it up to the SoC level. This was a quad-core Cortex-A73 with a simplified single-shader Mali™ GPU. Each core was its own power domain. It taped out in December 2015 and silicon was proven in the middle of this year.
The flow started with Cadence’s Genus Synthesis Solution. The physically aware mapping improved correlation, since it uses the same placement and timing engines as physical implementation. The overall QoR was better, and run time was halved, from 12 hours with the previous tool to 6 hours.
The next step was floorplanning. For a new processor and a new process, floorplan trials are critical. The restrictions of the process add complications: placement regions are based on the microarchitecture, macro placement has to follow the FinFET grid, and double-patterned M2 memory pins need to be correctly aligned.
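As an illustration of what following the FinFET grid means for macro placement, here is a small sketch that snaps a macro origin to the nearest legal grid line. The pitch and offset values are placeholders chosen for easy arithmetic, not real N10 numbers, and the function names are hypothetical; coordinates are kept as integer nanometers to avoid floating-point rounding.

```python
def snap_to_grid(coord_nm, pitch_nm, offset_nm=0):
    """Return the grid line nearest to coord_nm.

    Legal positions are offset_nm + k * pitch_nm for integer k.
    Exact ties are resolved toward the higher grid line.
    """
    steps = (coord_nm - offset_nm + pitch_nm // 2) // pitch_nm
    return offset_nm + steps * pitch_nm

def is_on_grid(coord_nm, pitch_nm, offset_nm=0):
    """True when coord_nm already sits on a legal grid line."""
    return (coord_nm - offset_nm) % pitch_nm == 0

# Placeholder values: a 40 nm pitch and a macro origin at y = 130 nm.
PITCH_NM = 40
macro_y = 130
legal_y = snap_to_grid(macro_y, PITCH_NM)  # nearest multiple of 40
```

In a floorplan trial, every macro origin (and, for the memories, the pin tracks as well) would need to pass a check like `is_on_grid` before the placement is considered legal.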
The power grid needs to be refined carefully, since a poorly designed power grid has a significant impact on routing density and performance. 10nm uses horizontal M1 (which is different from 16nm) with no vertical M1 allowed at all. There is no horizontal M2 standard-cell connection. All of M1, M2, and M3 are double patterned. One caution is that straightforward use of P&R power grid commands will not give good results. The philosophy was to use M3 as a horizontal backbone to connect to M2 fish bones through single V2s, which are then stacked to upper metal straps.
For place and route, end cap, top/bottom cap, and corner cap cells are mandatory. Timing is optimized against dominant corners. Placement is DRC and color aware. With such high resistance, control of long paths is essential. Clocks were routed in M7/M8 with clock leaves in M4-M6, using minimum wire width on M2 but 3X width on M3 to avoid EM issues. For 10nm timing, waveform propagation models and AOCV are essential to reduce the pessimism (excess margin).
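The pessimism that AOCV removes can be shown with a rough sketch. A flat OCV derate applies the same worst-case factor to every path, while an AOCV table lowers the derate as logic depth grows, because random variation tends to average out over more stages. The table values, the flat derate, and the stage delays below are all hypothetical, and the sketch ignores the distance dimension that real AOCV tables also carry.

```python
from bisect import bisect_right

# Hypothetical late-derate table indexed by logic depth (not foundry data).
AOCV_DEPTHS = [1, 2, 4, 8, 16]
AOCV_LATE_DERATES = [1.12, 1.10, 1.08, 1.06, 1.05]

# Hypothetical flat OCV derate: the worst case applied to every path.
FLAT_LATE_DERATE = 1.12

def aocv_derate(depth):
    """Look up the derate for the largest table depth not exceeding depth."""
    index = bisect_right(AOCV_DEPTHS, depth) - 1
    return AOCV_LATE_DERATES[max(index, 0)]

def derated_delay(stage_delays_ps, derate):
    """Sum the stage delays and apply a single late derate to the path."""
    return sum(stage_delays_ps) * derate

# Eight hypothetical stages of 20 ps each.
stages = [20.0] * 8
flat_delay = derated_delay(stages, FLAT_LATE_DERATE)
aocv_delay = derated_delay(stages, aocv_derate(len(stages)))
recovered_margin = flat_delay - aocv_delay
```

With these made-up numbers, the deeper path is charged a smaller derate under AOCV than under the flat scheme, and the difference is margin that no longer has to be bought back with larger cells or lower-Vt devices.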
Finally, signoff. This was done across multiple PVT and extraction corners using Cadence’s Quantus QRC Extraction Solution for extraction and the Tempus Timing Signoff Solution for timing signoff and leakage recovery. Once again, on-chip variation needs to be analyzed accurately, using path-based analysis and stage-based OCV, to avoid over-fixing. The Cadence Voltus IC Power Integrity Solution was used for static/dynamic IR, EM, in-rush current, and power analysis. Of course, there is test, too, where a full production-quality DFT solution (compression, memory BIST, at-speed) was used.
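For readers unfamiliar with static IR analysis, the underlying arithmetic is Ohm's law applied along the power grid. The sketch below models a single strap fed from one end, with cells drawing a constant current at each tap. This is a deliberate simplification with hypothetical resistances and currents: a real grid is a two-dimensional mesh fed from many points, and this is not how any signoff tool is implemented.

```python
def static_ir_drop(segment_res_ohm, tap_current_a):
    """Voltage drop at each tap of a strap fed from one end.

    segment_res_ohm[i] is the resistance of the segment leading to tap i.
    tap_current_a[i] is the DC current drawn at tap i.
    Each segment carries the current of its own tap plus every tap
    farther from the supply, so the drop accumulates along the strap.
    """
    if len(segment_res_ohm) != len(tap_current_a):
        raise ValueError("need one segment resistance per tap")
    drops = []
    drop = 0.0
    downstream = sum(tap_current_a)
    for res, current in zip(segment_res_ohm, tap_current_a):
        drop += res * downstream
        drops.append(drop)
        downstream -= current
    return drops

# Hypothetical strap: three 1-ohm segments, 1 mA drawn at each tap.
drops_v = static_ir_drop([1.0, 1.0, 1.0], [0.001, 0.001, 0.001])
worst_drop_v = max(drops_v)
```

The tap farthest from the supply sees the largest drop, which is why higher wire resistance at 10nm puts so much weight on power grid design and on the choice of strap widths.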
In summary, the design uses Cadence’s full-flow digital solution. 10nm is an evolution of 16nm with some significant changes, but not a step-change like going from planar to FinFET. Power and area improvements on a real design were in line with what was expected based on the raw process details. Further tweaks and tool improvements might make the results a little bit better still.