Introduction
Artificial intelligence (AI) has come a long way. While our parents grew up dreaming of one day living among robots, today we are interviewing Sophia, a citizen of Saudi Arabia and the first humanoid robot to be granted citizenship by any country. Deep learning, a brain-inspired discipline of AI, has been around for a long time but has only recently taken off thanks to abundant data and advances in the Internet, machine learning algorithms and hardware accelerators. Critical decisions that used to take months or years of trial-and-error experiments to confirm now take minutes to hours with machine learning algorithms. Humans are relying more and more on machine learning to make better and faster business and personal decisions.
In the short history of deep neural networks in semiconductors, engineers have tried different hardware architectures to provide better performance for the various machine learning algorithms. For example, GPUs are optimized for parallel floating-point computation, which makes them most suitable for training neural networks. CPUs are typically optimized for running trained models and are most suitable for inference at the edge, where parallel computation is not needed and power efficiency becomes more important. Since it is generally too expensive to create specialized, purpose-built chips for every new deep learning algorithm, FPGAs have also become a popular solution, providing reconfigurable functionality that suits constantly evolving neural network architectures; customized algorithms in FPGAs offer better power efficiency but lower performance. Application-specific integrated circuits, or ASICs, represent the highest-performance approach to deep learning, and for the massive markets anticipated, the cost of designing and revising these ASICs is likely to be in line with the business imperatives of those markets.
Each architecture has found success in its specific domains.
Memory in Deep Learning
One of the biggest challenges in deep learning hardware is memory. Memories are used to store the input data, temporary data, weights and activation parameters of deep neural network algorithms. Popular types of deep neural networks include feed-forward networks, the simplest form of neural network; convolutional neural networks (CNNs), which are feed-forward, sparsely connected networks with weight sharing; and recurrent neural networks (RNNs), in which hidden units have recurrent connections that carry state from one time step to the next.
In deep learning applications, the input is usually an array (matrix) of data, and the core of the algorithm is a matrix of parameters describing the interaction between each input unit and each output unit. The input, the parameters, the intermediate data and the output are each typically stored separately as the data propagates through the network.
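To make this layout concrete, the sketch below (hypothetical layer sizes, using NumPy purely for illustration) expresses a single fully-connected layer as a weight matrix applied to an input vector, with the input, parameters and output each held in its own buffer:

```python
import numpy as np

# Hypothetical sizes, for illustration only.
n_in, n_out = 1024, 512

x = np.random.rand(n_in).astype(np.float32)          # input data buffer
W = np.random.rand(n_out, n_in).astype(np.float32)   # parameter (weight) matrix
b = np.zeros(n_out, dtype=np.float32)                # bias parameters

y = W @ x + b                                        # output buffer

# Each array occupies its own storage as data propagates through the layer.
for name, arr in [("input", x), ("weights", W), ("bias", b), ("output", y)]:
    print(f"{name:8s}: {arr.nbytes / 1024:8.1f} KiB")
```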
In a feed-forward network, for example, memory is used to store the input values, the weights, and the sum of products of the weights and inputs calculated at each node in the network. CNNs require less storage for parameters because they are sparsely connected and share weights across the input. In RNNs, additional memory is needed to store the states computed in the forward pass, which must be retained until they are reused in the backward pass.
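A rough back-of-the-envelope comparison, using hypothetical layer shapes chosen only for illustration, shows why weight sharing makes a convolutional layer far lighter on parameter storage than a fully-connected layer over the same input:

```python
# Hypothetical 32x32 RGB input, illustration only.
H, W_, C_in, C_out = 32, 32, 3, 64
bytes_per_param = 4  # float32

# Fully-connected: every input unit connects to every output unit.
fc_params = (H * W_ * C_in) * (H * W_ * C_out)

# Convolutional: one shared 3x3 kernel per (input channel, output channel) pair.
k = 3
conv_params = k * k * C_in * C_out

print(f"fully-connected weights: {fc_params * bytes_per_param / 2**20:10.1f} MiB")
print(f"convolutional weights:   {conv_params * bytes_per_param / 2**10:10.1f} KiB")
```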
Challenges and Opportunities for New Memory Architectures
One of the key bottlenecks in deep learning hardware is memory bandwidth, especially for off-chip memory, which is limited by a fixed number of channels and off-chip connections. New deep learning architectures are pushing toward memory bandwidths of at least 900GB/s for training and 400GB/s for inference. To address this problem, most deep learning architectures are targeting high-bandwidth memory (HBM) for the next generation of chips. HBM achieves higher bandwidth while using less power in a substantially smaller form factor than DDR4 or GDDR5. For example, HBM2 addresses the bandwidth gap with up to 307GB/s per 8-channel memory stack at a 2.4Gbps pin speed; four stacks of HBM memory can therefore deliver up to 1.2TB/s of memory bandwidth.
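These stack-level figures follow directly from the HBM2 interface width; a quick sanity check using the standard HBM2 parameters (8 channels of 128 bits per stack):

```python
# HBM2 stack: 8 channels x 128 bits = 1024-bit interface.
channels = 8
bits_per_channel = 128
pin_speed_gbps = 2.4          # per-pin data rate

stack_gb_s = channels * bits_per_channel * pin_speed_gbps / 8   # bits -> bytes
print(f"per-stack bandwidth: {stack_gb_s:.1f} GB/s")            # ~307.2 GB/s

stacks = 4
print(f"4-stack bandwidth:   {stacks * stack_gb_s / 1000:.2f} TB/s")  # ~1.23 TB/s
```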
Many deep learning architectures combine multiple processing clusters on the chip with HBM memory. Figure 1 below shows an example architecture of a neural network processor.

Figure 1: Example architecture of a neural network processor
One limiting factor in these accelerators is the number of processing units and the amount of memory that can fit on the die. In addition to increasing DRAM capacity, new architectures are packing more on-die SRAM, typically up to 1-2Gb of embedded memory per chip. Although high-density embedded memories are available in the market, they are not sufficient to reach the capacity required for deep learning. Pseudo SRAMs are particularly well suited to address this limitation: unlike standard high-density SRAMs, they are multi-port memories constructed from single-port bitcells, providing a large density increase and power reduction, which is critical because data movement, not computation, dominates energy consumption.
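One common way to present a multi-port interface from single-port bitcells is to time-multiplex ("double-pump") the array within each system clock cycle. The behavioral sketch below illustrates that idea in the abstract; it is an assumption made for illustration, not a description of any particular eSilicon compiler:

```python
class PseudoTwoPortSRAM:
    """Behavioral sketch: a single-port array serviced twice per system cycle
    (double-pumped) so it appears as one read port plus one write port.
    Illustration only; real pseudo two-port implementations vary."""

    def __init__(self, depth):
        self.mem = [0] * depth

    def cycle(self, read_addr=None, write_addr=None, write_data=None):
        # First phase of the faster internal clock: service the read port.
        read_data = self.mem[read_addr] if read_addr is not None else None
        # Second phase of the faster internal clock: service the write port.
        if write_addr is not None:
            self.mem[write_addr] = write_data
        return read_data

ram = PseudoTwoPortSRAM(depth=16)
ram.cycle(write_addr=3, write_data=42)                       # write only
print(ram.cycle(read_addr=3, write_addr=5, write_data=7))    # read + write in one cycle -> 42
```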
The main performance metric for deep learning hardware is peak operations per second, measured in TFLOPS for floating-point operations or TOPS for integer operations. With a traditional single-port memory, intermediate values must be read one at a time before they can be used in a computation and stored back, which limits throughput. Multi-port SRAM architectures that support multiple accesses per clock cycle can significantly increase peak TOPS/TFLOPS and overall system throughput. Several processing units can also access the same multi-port SRAM, enabling memory sharing that reduces the overall memory storage requirement on the chip.
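As a rough illustration (all figures hypothetical), peak throughput scales with the number of multiply-accumulate units and the clock rate, but it can only be sustained if the memory system delivers operands every cycle; adding a second port to each SRAM bank doubles the operand bandwidth without adding more memory arrays:

```python
# Hypothetical accelerator parameters, for illustration only.
macs = 4096                # multiply-accumulate units
clock_ghz = 1.0            # core clock

# Each MAC performs 2 operations (multiply + add) per cycle.
peak_tops = 2 * macs * clock_ghz * 1e9 / 1e12
print(f"peak throughput: {peak_tops:.1f} TOPS")

# On-chip operand bandwidth from single-port vs. dual-port SRAM banks.
word_bytes, banks = 16, 32
for ports in (1, 2):
    gb_s = ports * banks * word_bytes * clock_ghz   # GB/s of operand supply
    print(f"{ports}-port banks: {gb_s:.0f} GB/s of operand bandwidth")
```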
Other common memory reduction techniques include storing weights and activation parameters at lower precision, and trading reduced memory for an increase in computation; for example, values that are cheap to compute can be discarded and re-computed when needed rather than stored. For inference applications where latency is key, embedded SRAM may be distributed across the chip so that it sits closer to each processing unit.
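A minimal sketch of the reduced-precision idea, using symmetric linear quantization of float32 weights to int8 (one common scheme among many):

```python
import numpy as np

# Hypothetical float32 weight tensor, illustration only.
w_fp32 = np.random.randn(512, 512).astype(np.float32)

# Symmetric linear quantization to int8: store 8-bit codes plus one scale factor.
scale = np.abs(w_fp32).max() / 127.0
w_int8 = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)

print(f"float32 storage: {w_fp32.nbytes / 2**10:.0f} KiB")
print(f"int8 storage:    {w_int8.nbytes / 2**10:.0f} KiB")

# The extra compute: dequantize (or fold the scale into an integer MAC) at inference time.
w_restored = w_int8.astype(np.float32) * scale
print(f"max abs error:   {np.abs(w_fp32 - w_restored).max():.4f}")
```

The 8-bit codes plus a single scale factor cut weight storage by roughly 4x, at the cost of a dequantization or scaled integer multiply-accumulate step at inference time.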
Moving Forward
As data scientists continue to evaluate new DNN structures, convolution functions and data formats, new deep learning architectures are implementing a variety of memory strategies. Without detailed knowledge of the memory technologies and capabilities available in the market, it is not easy to settle on the optimal memory strategy for a given deep learning architecture.
Today's standard embedded memories offer very limited flexibility to support evolving deep learning structures, and there aren't many IP providers with the requisite skills to reliably design and optimize memory systems for deep learning applications. eSilicon has extensive experience in embedded custom memory and has been developing HBM PHY and 2.5D solutions since 2011. Experienced custom memory experts are valuable partners for AI ASIC programs, allowing core design teams to focus on their central business.
About eSilicon
eSilicon is an independent provider of complex FinFET-class ASICs, custom IP and advanced 2.5D packaging solutions. Our ASIC+IP synergies include complete 2.5D/HBM2 and TCAM platforms for FinFET technology at 14/16/7nm, as well as SerDes, specialized memory compilers and I/O libraries. Supported by a patented knowledge base and optimization technology, eSilicon delivers a transparent, collaborative, flexible customer experience to serve the high-bandwidth networking, high-performance computing, artificial intelligence (AI) and 5G infrastructure markets.
Contact
Please contact eSilicon at sales@esilicon.com for more information, silicon quality results, white papers or complete data sheets, or visit www.esilicon.com. eSilicon IP is available in Navigator at https://star.esilicon.com.
- 7FF HBM2 PHY: HBM Gen2 PHY, TSMC 7FF
- 7FF TCAM Compiler: Ternary Content Addressable Memory (TCAM) Compiler, TSMC 7FF
- 7FF Ultra-High-Density Pseudo Two-Port (P2P) SRAM Compiler: Pseudo Two-Port (P2P) SRAM Compiler, Ultra High Density, TSMC 7FF