Monitoring system-on-chip performance under process, voltage, and temperature (PVT) variations is very challenging, especially when parasitic effects dominate overall chip performance in advanced process nodes. Most previous works present performance monitoring methodologies based on known or predicted candidate critical paths under different operating conditions. However, those methodologies may fail when the critical path is misidentified or mispredicted. This paper proposes a novel machine-learning-based chip performance monitoring methodology that accurately matches chip performance without requiring critical path information under various PVT conditions. Experimental results based on measured chip performance show that the proposed methodology achieves 98.5% accuracy in the worst case under wide-range PVT variations.
I. INTRODUCTION
System-on-chip (SoC) performance monitoring is essential for energy-efficient high-performance computing (HPC) when adaptive voltage scaling (AVS) or dynamic voltage and frequency scaling (DVFS) is applied. However, monitoring SoC performance under process, voltage, and temperature (PVT) variations is very challenging, especially when parasitic effects dominate overall chip performance in advanced process nodes. Fig. 1 shows an SoC architecture with chip performance monitors (CPMs). On-chip ring oscillators (ROs) are commonly adopted to monitor the critical path delay. Under the wide-range PVT variations of advanced process technologies, it becomes extremely difficult to evaluate circuit performance accurately because the critical path varies from one PVT condition to another. Most previous works propose extracting representative critical paths under different PVT conditions, which is not cost-effective when there are many representative critical paths under wide-range PVT variations. In addition, some of the extracted representative critical paths might be incorrect, or might differ from the real chip behavior after fabrication. Consequently, the monitored critical path delay may be inaccurate for some PVT conditions.
In this paper, we propose a novel machine-learning-based SoC performance monitoring methodology that incorporates physical parasitic characteristics and PVT variations with unknown critical paths. Our contributions can be summarized as follows:
- We design a parasitic-aware CPM with different types of ROs in order to consider the parasitic effects holistically.
- Unlike previous works, we propose a novel chip performance monitoring methodology that accurately matches chip performance without requiring the extraction of critical paths under various PVT conditions.
- Our methodology is based on machine learning: the model is trained on real chip measurement data collected from the CPMs at the performance boundaries found by Shmoo testing under different PVT conditions.
- Our experimental results show that the proposed machine learning method can accurately monitor chip performance even with a smaller number of ROs in the CPMs.
Fig. 1. An SoC architecture with chip performance monitors (CPMs).
II. THE PROPOSED CPM DESIGN
Our CPM design is based on a delay model that accounts for parasitic resistance and capacitance.
Delay Model: To a first-order approximation, the path delay of a circuit, Tpath, can be modeled by Equation (1), where Rd and Cd denote the intrinsic resistance and capacitance of the logic devices, respectively, and Rm and Cm denote the parasitic resistance and capacitance of the interconnects, respectively.
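Equation (1) is not reproduced in this text. A first-order form consistent with the symbols defined above, assuming the sum runs over the stages of the path (the indexing is an assumption), would be:

```latex
\begin{equation}
T_{path} = \sum (R_d + R_m)(C_d + C_m) \tag{1}
\end{equation}
```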
In advanced process nodes, especially 7 nm and below, Rm and Cm dominate circuit performance, and hence the interconnect delay resulting from parasitic effects must be carefully considered. To incorporate parasitic effects in the CPM design, we make the following observation.
Observation: Expanding Equation (1), Tpath can be expressed as the sum of four terms, ΣRdCd, ΣRmCd, ΣRdCm, and ΣRmCm, as seen in Equation (2), where the last three terms are all related to interconnect parasitics. However, we observed that the delay of generic ROs is dominated by the first term alone, ΣRdCd, which is the intrinsic delay of the logic devices, because generic ROs are usually quite small, with extremely short interconnects compared with the critical paths in an SoC. Consequently, generic ROs cannot accurately evaluate chip performance when significant parasitic effects exist in advanced nodes.
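Equation (2) is likewise not reproduced here. Assuming the first-order product form for Equation (1) (shown unnumbered on the first line), the expansion into the four named terms would read:

```latex
\begin{align}
T_{path} &= \sum (R_d + R_m)(C_d + C_m) \nonumber \\
         &= \sum R_d C_d + \sum R_m C_d + \sum R_d C_m + \sum R_m C_m \tag{2}
\end{align}
```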
CPM Design: Based on the above observation, an effective CPM for accurate performance monitoring and evaluation must cover all four terms. Therefore, in addition to the generic ROs used in most previous works, our CPMs contain three further types of ROs, so that the four types shown in Fig. 2 correspond to the four terms, respectively. Each type of RO may further include variants with different device sizes and layout configurations.
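To illustrate the observation numerically, the sketch below evaluates the four terms for two sets of per-stage values: a compact RO with almost no wiring and a long, wire-heavy path. It assumes the first-order product form (Rd + Rm)(Cd + Cm) per stage. All numbers and the helper names (`delay_terms`, `shares`) are hypothetical and chosen only to show the trend; they are not measurements from the chips described here.

```python
def delay_terms(stages):
    """Split the first-order delay sum over stages into its four RC terms.

    Each stage is a tuple (Rd, Cd, Rm, Cm) in arbitrary units.
    """
    terms = {"RdCd": 0.0, "RmCd": 0.0, "RdCm": 0.0, "RmCm": 0.0}
    for rd, cd, rm, cm in stages:
        terms["RdCd"] += rd * cd
        terms["RmCd"] += rm * cd
        terms["RdCm"] += rd * cm
        terms["RmCm"] += rm * cm
    return terms


def shares(terms):
    """Fraction of the total delay contributed by each term."""
    total = sum(terms.values())
    return {name: value / total for name, value in terms.items()}


# Hypothetical values: a compact generic RO has very short wires,
# so Rm and Cm are tiny compared with Rd and Cd.
generic_ro = [(1.0, 1.0, 0.01, 0.01)] * 5
# Hypothetical values: a long SoC path whose wire R and C exceed the device's.
long_path = [(1.0, 1.0, 1.5, 2.0)] * 5

for name, stages in (("generic RO", generic_ro), ("long path", long_path)):
    fractions = shares(delay_terms(stages))
    print(name, {k: round(v, 3) for k, v in fractions.items()})
```

With these made-up values the RdCd term accounts for almost all of the generic RO delay but only a small fraction of the long-path delay, which is the gap the three additional RO types are meant to close.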
Fig. 2. The proposed CPM consisting of four types of ROs. (a) Type RdCd: RO without parasitic resistance and capacitance; (b) Type RmCd: RO with significant parasitic resistance; (c) Type RdCm: RO with significant parasitic capacitance; (d) Type RmCm: RO with significant parasitic resistance and capacitance.
Based on our CPM design, the critical path delay of an SoC under any PVT condition can be calculated as the weighted sum of the delays of all types of ROs, as shown in Equation (3), where D1, D2, D3, and D4 are the monitored delay vectors of the four RO types, and K1, K2, K3, and K4 are the corresponding weight coefficient vectors.
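Equation (3) is not reproduced in this text. Reading "weighted sum" as an inner product between each coefficient vector and the matching delay vector (an assumption about the notation), it would take the form:

```latex
\begin{equation}
T_{path} = K_1 \cdot D_1 + K_2 \cdot D_2 + K_3 \cdot D_3 + K_4 \cdot D_4
         = \sum_{i=1}^{4} K_i^{\top} D_i \tag{3}
\end{equation}
```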
III. CHIP PERFORMANCE MATCHING METHODOLOGY
Based on Equation (3), we can match the critical path delays of an SoC with the CPM outputs under various PVT conditions by training a machine-learning model. It should be noted that our methodology does not need to know where the exact critical path is for each PVT condition.
A. Training Data Preparation
First, we obtain the training data that serve as ground truth for model training, namely Tpath and the Di's in Equation (3) under all PVT conditions. The trained model outputs the coefficients Ki that best match Tpath for different PVT conditions. The proposed machine-learning-based methodology can then achieve accurate performance monitoring at runtime, when Tpath is unknown. To obtain the training data, we fabricated chips based on the architecture in Fig. 1 and performed Shmoo testing on all manufactured chips. During Shmoo testing, we identified all performance boundary conditions under different PVT variations. For each performance boundary condition, we captured the performance bound, i.e., the critical path delay, as well as all the RO delays in each CPM. Consequently, a complete data set for model training is obtained. It should be noted that we only obtain the performance bound for each PVT condition, without knowing the corresponding critical path.
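The data preparation step can be sketched as follows. This is only an illustration under stated assumptions: a Shmoo result is represented as a pass/fail map over clock periods for each (voltage, temperature) condition, the shortest passing period is taken as the performance bound, and the function names, data layout, and all numbers are hypothetical rather than taken from the actual test flow.

```python
def performance_boundary(shmoo_row):
    """Return the shortest passing clock period in one Shmoo row.

    shmoo_row maps clock period (ns) -> True (pass) / False (fail) at a fixed
    voltage and temperature. The shortest passing period is treated as the
    critical path delay for that condition (a simplification that ignores
    holes in the passing region).
    """
    passing = [period for period, ok in shmoo_row.items() if ok]
    if not passing:
        return None  # the chip never passes here, so there is no boundary
    return min(passing)


def build_training_set(shmoo, ro_readings):
    """Pair each boundary delay with the RO delays read at the same condition."""
    samples = []
    for condition, row in shmoo.items():
        t_path = performance_boundary(row)
        if t_path is None or condition not in ro_readings:
            continue
        samples.append((ro_readings[condition], t_path))
    return samples


# Hypothetical Shmoo data keyed by (supply voltage in V, temperature in C).
shmoo = {
    (0.75, 25): {1.0: True, 0.9: True, 0.8: False, 0.7: False},
    (0.90, 25): {1.0: True, 0.9: True, 0.8: True, 0.7: False},
    (0.60, 125): {1.0: False, 0.9: False, 0.8: False, 0.7: False},
}
# Hypothetical RO delays (ns), one value per RO type, at each condition.
ro_readings = {
    (0.75, 25): [0.31, 0.42, 0.47, 0.66],
    (0.90, 25): [0.27, 0.37, 0.41, 0.58],
    (0.60, 125): [0.40, 0.55, 0.61, 0.86],
}

training_set = build_training_set(shmoo, ro_readings)
for ro_delays, t_path in training_set:
    print(ro_delays, "->", t_path)
```

Note that nothing in this step identifies which path set the boundary; only the boundary delay and the RO readings are recorded.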
B. Optimizing RO Weights for Chip Performance Matching
Once the complete training data set is obtained from chip measurement, the next step is to find the fitting function and RO weights that give the best performance matching under different PVT conditions. We apply a feedforward neural network (FNN) model to train the coefficients Ki in Equation (3). This formulation keeps the model interpretable and is also very efficient. The designed FNN model is shown in Fig. 3, where the input layer is followed by a fully connected layer of the weight coefficients, and the predicted critical path delay is the sum of the weighted RO outputs. The Adam optimizer is applied for model training, minimizing the error between the predicted delays and the ground truths. Because the model is interpretable, we can further reduce design cost by eliminating the ROs that prove less significant during model training.
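A minimal sketch of this fitting step is given below, written in plain Python so that the Adam update is visible. It assumes a single linear layer without bias, a mean-squared-error loss, one shared weight set for all conditions, and the magnitude of a weight as the measure of RO significance; the text does not specify any of these details. The training data are synthetic, and `train_adam` and `least_significant` are hypothetical names.

```python
import math
import random


def predict(weights, ro_delays):
    """Weighted sum of RO delays: the single linear layer of the model."""
    return sum(w * d for w, d in zip(weights, ro_delays))


def train_adam(samples, n_ros, steps=5000, lr=0.01,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """Fit RO weights by minimizing mean squared error with full-batch Adam."""
    w = [0.0] * n_ros
    m = [0.0] * n_ros  # first-moment (mean) estimate of the gradient
    v = [0.0] * n_ros  # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        grad = [0.0] * n_ros
        for x, y in samples:
            err = predict(w, x) - y
            for j in range(n_ros):
                grad[j] += 2.0 * err * x[j] / len(samples)
        for j in range(n_ros):
            m[j] = beta1 * m[j] + (1 - beta1) * grad[j]
            v[j] = beta2 * v[j] + (1 - beta2) * grad[j] ** 2
            m_hat = m[j] / (1 - beta1 ** t)  # bias-corrected moments
            v_hat = v[j] / (1 - beta2 ** t)
            w[j] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


def least_significant(weights):
    """Index of the RO with the smallest weight magnitude (assumed criterion)."""
    return min(range(len(weights)), key=lambda j: abs(weights[j]))


# Synthetic data: normalized RO delays and a made-up ground-truth weighting
# in which the last RO does not contribute at all.
random.seed(0)
true_w = [0.5, 1.0, 2.0, 0.0]
samples = []
for _ in range(40):
    x = [random.uniform(0.5, 2.5) for _ in true_w]
    samples.append((x, predict(true_w, x)))

weights = train_adam(samples, n_ros=len(true_w))
print([round(w, 3) for w in weights])
print("candidate RO to drop:", least_significant(weights))
```

Ranking by weight magnitude is only meaningful here because the synthetic RO delays share one scale; with real readings the inputs would need to be normalized first, or significance judged by the change in error when an RO is removed.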
Fig. 3. The proposed performance matching method based on the FNN model.
IV. EXPERIMENTAL RESULTS
We conducted our experiments with the trained model by monitoring the performance, i.e., the critical path delay, of several designs based on the architecture in Fig. 1 and TSMC 7 nm technology, covering different path delays and PVT conditions. The results in Fig. 4 show that our methodology achieves 98.5% accuracy in the worst case, even with the reduced number of ROs resulting from our optimizer.
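The accuracy metric is not defined in the text. One plausible reading, assumed here purely for illustration, is one minus the relative error between the predicted and the measured critical path delay, with the worst case taken as the minimum over all tested conditions. The delay pairs below are hypothetical and are not results from the paper.

```python
def accuracy(predicted, measured):
    """Assumed metric: 1 minus the relative error of the predicted delay."""
    return 1.0 - abs(predicted - measured) / measured


def worst_case_accuracy(pairs):
    """Lowest accuracy over (predicted, measured) pairs, one per PVT condition."""
    return min(accuracy(p, m) for p, m in pairs)


# Hypothetical (predicted, measured) critical path delays in ns.
pairs = [(1.02, 1.00), (0.79, 0.80), (1.51, 1.50)]
print(f"worst-case accuracy: {worst_case_accuracy(pairs):.3f}")
```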
Fig. 4. Performance monitoring accuracy for the CPMs with only generic ROs and for ours under different PVT conditions.
V. CONCLUSION
In this paper, we have presented our results for chip performance monitoring obtained with a novel machine-learning-based methodology. Our methodology enables cost-effective chip design with high performance monitoring accuracy under wide-range PVT variations.
REFERENCES
X. Wang, M. Tehranipoor, and R. Datta, “Path-RO: A novel on-chip critical path delay measurement under process variations,” in Proc. ICCAD, 2008.
K. J. Kuhn, “Reducing variation in advanced logic technologies: Approaches to process and design for manufacturability of nanoscale CMOS,” in Proc. IEDM, 2007.
S. Seo, R. G. Dreslinski, M. Woh, Y. Park, C. Chakrabarti, S. Mahlke, D. Blaauw, and T. Mudge, “Process variation in near-threshold wide SIMD architectures,” in Proc. DAC, 2012.
J. Kim, G. Lee, K. Choi, Y. Kim, W. Kim, K. Do, and J. Choi, “Adaptive delay monitoring for wide voltage-range operation,” in Proc. DATE, 2016.
J. Kim, K. Choi, Y. Kim, W. Kim, K. Do, and J. Choi, “Delay monitoring system with multiple generic monitors for wide voltage range operation,” IEEE TVLSI, vol. 26, no. 1, pp. 37–49, 2017.
T.-B. Chan, P. Gupta, A. B. Kahng, and L. Lai, “DDRO: A novel performance monitoring methodology based on design-dependent ring oscillators,” in Proc. ISQED, 2012.
T.-B. Chan, P. Gupta, A. B. Kahng, and L. Lai, “Synthesis and analysis of design-dependent ring oscillator (DDRO) performance monitors,” IEEE TVLSI, vol. 22, no. 10, pp. 2117–2130, 2013.
F. Firouzi, F. Ye, K. Chakrabarty, and M. B. Tahoori, “Representative critical-path selection for aging-induced delay monitoring,” in Proc. ITC, 2013.
F. Firouzi, F. Ye, K. Chakrabarty, and M. B. Tahoori, “Aging- and variation-aware delay monitoring using representative critical path selection,” ACM TODAES, vol. 20, no. 3, pp. 1–23, 2015.
D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
GUC Analog IP. https://www.guc-asic.com/en-global/ip/index/Analog_IP