proteanTecs On-Chip Monitoring & Deep Data Analytics System

White Paper

proteanTecs On-Chip Monitoring and Deep Data Analytics System
The ultimate solution for reliability, yield, performance and power co-optimization

Authors: Georgios Konstadinidis, Evelyn Landman

Table of Contents
1. Introduction
2. Design Profiling and Material Classification during NPI
2.1 Process Classification Agent (PCA)
2.2 Design Profiling Agent (DPA)
3. System Tuning during Productization
3.1 Timing Margin Agents (MA)
3.2 Noise Modulation Agent (NMA)
3.3 Clock Integrity Agent (CIA)
4. Margin Agent (MA) Usage during ATE Testing
5. Margin Agent (MA) Usage at System Level
6. Power and Reliability Management
6.1 Voltage Droop Sensor (VDS)
6.2 Local Voltage and Thermal Sensor (LVTS)
6.3 Workload Agents (WLA)
7. IO Channel Health Monitors
7.1 HBM Agent (HBMA)
7.2 Tile Connectivity Agent (TCA)
8. In-Field Monitoring during Operational Lifetime
9. HW IP System
10. proteanTecs SW Analytics Platform
11. Conclusions

1 | Introduction

State-of-the-art silicon processes offer mainly logic density improvements at limited speedup. The power wall was hit several years ago, and these constraints set a limit on chip frequency and performance. Worst-case design analysis is no longer cost-effective. Reducing design margins while maintaining high reliability becomes imperative to maximize performance and reduce power and cost. High reliability is even more important in service-critical applications, such as Automotive and Datacenter, given the trends in cloud, full automation and autonomous driving. Since devices degrade over time, and sometimes abnormally, it is imperative that design margin analysis be performed intelligently and be based on actual in-situ margin measurements.
The proteanTecs On-Chip Monitoring and Analytics Platform (Figure 1) offers the ultimate tool to accomplish this reliability, yield, performance, and power co-optimization, as already proven on multiple customer examples summarized in Figure 2. proteanTecs offers a HW IP system, the monitoring IP, the CAD tools to facilitate the IP integration in the chip, as well as the Analytics Platform, ATE edge SW, on-board embedded SW, and FW reference code, to analyze the measured data at all phases of the product cycle: from wafer testing, packaged device testing, New Product Introduction (NPI), system test and in-field monitoring, until the product retirement (Figure 3). This offers invaluable insights into the system margin and reliability, performance, power, and cost optimization opportunities. This integrated solution and unique approach to deep data analytics is the key differentiator of the proteanTecs offering.

Figure 1: proteanTecs on chip monitoring and analytics platform for reliability, yield, performance & power co-optimization at chip, system, and in-mission

Figure 2: Customer success stories using proteanTecs on chip monitoring and analytics platform
[Figure content. Customers: Fortune 50 networking company; leading fabless semiconductor company; leading ASIC vendor. Panel titles: Power reduction at SLT; Inferred process parameters per die detected sensitivity of yield to process; Post-to-pre parametric correlation. Reported results: 18% power reduction; 2% yield improvement due to feedback to the fab of design sensitivity to process; detection of silicon-to-simulation miscorrelation in a 5nm testchip. Customer quotes: “The average voltage improvement is around 6~7 steps and this is significant since it’s such a big chip”; “The proteanTecs analysis helps us to prove the correctness of external circuit modification and the chip design quality”; “proteanTecs speeds chip and system bring-up, significantly reducing time-to-market”.]

Figure 3: proteanTecs targeted applications for the full product cycle

The following table provides an overview of the available proteanTecs IP offering.

proteanTecs Agents | Purpose
Process Classification Agent (PCA) | Process key-parameter characterization
Design Profiling Agent (DPA) | Design style - process interaction characterization
Margin Agent (MA) | In-situ timing margin measurement of the actual design paths
Noise Modulation Agent (NMA) | In-situ effect of local IR drop on a reference path delay
Clock Integrity Agent (CIA) | In-situ maximum cycle-to-cycle clock jitter measurement and maximum and minimum clock cycle time
Local Thermal Agent (LTA) | Local on-die chip temperature measurement (used for analytics)
Workload Agent (WLA) | Voltage - temperature chip stress conditions proxy
Voltage Droop Sensor (VDS) | Ldi/dt voltage droop detection
Local Voltage and Thermal Sensor (LVTS) | Accurate thermal and DC voltage measurements, with embedded over-temperature alerts; works on VDD core only
Tile Connectivity Agent (TCA) | Monitors signal integrity at the receiving end of die-to-die (D2D) interfaces
HBM Agent (HBMA) | Monitors signal integrity at the SOC side of SOC-HBM interfaces
Full Chip Controller | On-chip monitor and system interface controller

2 | Design Profiling and Material Classification during NPI

2.1 Process Classification Agent (PCA)

Knowing the exact process characteristics during the New Product Introduction (NPI) phase and up to the final qualification is extremely important, as it helps with process tuning to maximize performance and yield at minimum power. Foundries offer limited wafer data (Wafer Acceptance Test - WAT) from a small number of sites across the wafer. This is not adequate to capture the process variation across the wafer and across the chip.
Multiple instances of the proteanTecs Process Classification Agents (PCAs) are distributed within each chip to cover cross-chip process variations and provide key process parameter measurements. These measurements can be correlated to simulations and, using the built-in algorithms of the proteanTecs SW platform, translated into inferred process parameters such as Vtsat, Idlin and Ioff. This data is analyzed by the analytics platform to facilitate yield and frequency binning predictions. This establishes a better product binning strategy based on data and intelligent analysis.

The agent information is also used by built-in algorithms in the SW platform to find defects that are not detectable with state-of-the-art methods, achieving substantial DPPM reduction and even preventing Silent Data Corruption.

Figure 4: PCA-based outlier detection using proteanTecs analytics for significant DPPM reduction

2.2 Design Profiling Agent (DPA)

Ring oscillators are typically used to quickly identify process corners (fast, typical, or slow) and provide a first estimate of the reliability (NBTI and HCI degradation). The delay response to voltage and process variation depends on the gate type, size, and layout. General-purpose ring oscillator monitors typically do not provide the gate type variety and might not even be constructed from the design library cells. The proteanTecs DPA is built using the standard cell library, specifically the cells used in the particular design, and reflects the mix and statistics of gates observed on the critical paths. It uses the same composition and Place and Route (P&R) tools as the design. This results in a much better correlation between the DPAs and the actual chip response compared to the generic ring oscillator case.
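As a rough illustration of the kind of silicon-to-simulation comparison that DPA data makes possible, the sketch below computes, per cell group, the ratio of the mean measured delay to a simulated target and flags groups that deviate beyond a tolerance. The cell-group names, delay values, and the 5% tolerance are hypothetical; this is not the proteanTecs platform's method, only a minimal example under those assumptions.

```python
from statistics import mean


def silicon_to_sim_ratio(measured_ps, simulated_ps):
    """Return, per cell group, mean measured delay divided by simulated delay."""
    return {
        group: mean(samples) / simulated_ps[group]
        for group, samples in measured_ps.items()
    }


def flag_miscorrelated(ratios, tolerance=0.05):
    """Return the sorted cell groups whose ratio deviates from 1.0 by more than tolerance."""
    return sorted(g for g, r in ratios.items() if abs(r - 1.0) > tolerance)


# Hypothetical DPA readings (picoseconds) from three dies, per cell group.
measured = {
    "inv_lvt": [10.1, 9.9, 10.0],
    "nand2_svt": [22.0, 23.0, 24.0],
    "nor2_hvt": [30.0, 30.6, 30.3],
}
# Hypothetical simulated target delay (picoseconds) for each group.
simulated = {"inv_lvt": 10.0, "nand2_svt": 20.0, "nor2_hvt": 30.0}

ratios = silicon_to_sim_ratio(measured, simulated)
flagged = flag_miscorrelated(ratios)
```

With these made-up numbers only the NAND2 group is slower than its target by more than the tolerance, which is the sort of cell-specific deviation a generic ring oscillator could not isolate.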
The proteanTecs analytics platform performs the statistics and compares against targets, to provide the fundamental gate response for this specific design using a powerful and user-friendly graphics interface, as shown in Figure 5 below.

Figure 5: DPA data offer invaluable insights to the in-silicon response of the specific gates and design style compared to simulations on built-in analytics dashboards

Figure 6: Wafer maps of PCA and DPA agent data for wafer/chip level location dependencies on the proteanTecs analytics platform

3 | System Tuning during Productization

3.1 Timing Margin Agents (MA)

While DPAs offer a much better chip speed proxy than generic ring oscillators and are very useful for a first evaluation, they do not accurately reflect the critical paths’ response under normal workload operation. The critical path delay response depends on the local voltage variation, the gate type and size combination in the timing path, the signal activity that affects NBTI degradation, the process variation, and the layout context. This makes it nearly impossible to capture actual timing margins with simple ring oscillators, DPAs, or even critical path replicas that are not exercised under the exact same workload conditions. Consequently, measuring the actual margin accurately in critical paths, in-situ and during normal operation, becomes a must.

The choice of the critical paths to monitor is key, however, as not all top critical paths can be monitored (for practical reasons), and the path order under real applications will not be the same as seen in the models. proteanTecs offers an intelligent algorithm to select the right timing paths to monitor, to achieve high coverage in terms of number of nodes and critical paths covered, as well as representative groups of paths.
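The path selection problem described above resembles a coverage problem: with a limited number of monitors, choose the paths that together touch the most design nodes. The sketch below uses a generic greedy set-cover heuristic to make that idea concrete. It is not the proteanTecs selection algorithm, and the path names, node sets, and monitor budget are invented for illustration.

```python
def select_paths(path_nodes, budget):
    """Greedily pick up to `budget` paths, each time taking the path that
    adds the most not-yet-covered nodes. Returns (chosen paths, covered nodes)."""
    covered = set()
    chosen = []
    remaining = dict(path_nodes)
    while remaining and len(chosen) < budget:
        # Iterate in sorted order so ties resolve deterministically by name.
        best = max(sorted(remaining), key=lambda p: len(remaining[p] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # no remaining path adds any new node
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered


# Hypothetical timing paths, each described by the set of nodes it traverses.
paths = {
    "p0": {"a", "b", "c"},
    "p1": {"b", "c"},
    "p2": {"d", "e"},
    "p3": {"c", "d"},
}
chosen, covered = select_paths(paths, budget=2)
```

A production flow would also weigh timing slack, workload activity, and physical placement; the greedy heuristic only shows why monitoring a well-chosen subset can cover most nodes.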
The timing monitors are connected to the inputs of the monitored flip-flops (FFs) during synthesis and P&R. Their placement and connection need very careful consideration so that they do not affect the quality of results. This flow is fully automated by proteanTecs. Furthermore, since the timing paths change over time as the design matures, it is imperative to have a very good ECO mode for injecting the right timing monitors at the right places without upsetting the design closure. Again, the proteanTecs ECO flow is fully automated and proven on numerous designs. The design overhead is typically less than 1% in gate count on the embedding blocks, and power is less than 0.05% of the core power.

An additional benefit of using the MAs is the ability to check the coverage of the critical paths per workload. The proteanTecs platform reports not only the margin per MA, but also whether it toggled during the workload, giving insight for improvements in the V-F shmoo patterns for better coverage.

Figure 7: The margin agents (MA) provide accurate in-situ measurements, with high coverage of the design paths, including the timing-critical ones

3.2 Noise Modulation Agent (NMA)

While the MAs are inserted directly in each of the critical paths, the Noise Modulation Agents (NMAs) are inserted at block level and provide a proxy for the “effective cycle time measurement” as impacted by cycle-to-cycle jitter, voltage fluctuation and local temperature. Although less accurate than the MA, as it provides the actual margin at block level instead of per critical timing path, the NMA shows reference path delay fluctuations created by local IR drop, cycle-to-cycle jitter, and temperature.
Figure 8: The Noise Modulation Agent (NMA) measures the maximum increase/decrease of the “effective cycle time” as impacted by the cycle-to-cycle jitter, local IR drop, and temperature

3.3 Clock Integrity Agent (CIA)

The clock jitter depends on the cycle-to-cycle jitter of the clock source (typically a PLL) and on the cycle-to-cycle variation of the clock arrival time at the destination flip-flop, which in turn depends on the overall clock network latency, process variation, the supply voltage noise per cycle, and the temperature difference at the various points on the chip. The proteanTecs Clock Integrity Agents (CIAs) provide an accurate, in-situ measurement of the maximum cycle-to-cycle clock jitter under various workloads, considering all the actual voltage variation sources during normal chip operation. Knowing the exact clock noise level and correlating it back to the design estimates helps close the loop with the clock design methodology. In case of design issues, it provides key insight into the actual clock behavior, accelerating silicon debug.

Figure 9: The Clock Integrity Agents (CIAs) are strategically placed across the die to provide accurate clock jitter measurements at the end of the clock network distribution and close to the receiving flip-flops

4 | Margin Agent (MA) Usage during ATE Testing

Leveraging the MAs and edge on-tester models, advanced analytics lead to new learnings. Knowing the actual margins of the timing-critical paths on the ATE helps establish the first search point in the Vmin-Frequency search, reducing test time and cost. The MA data can also be used during HTOL checkpoints to show the actual degradation of the timing paths, thus reducing or even eliminating additional V-F searches and providing much more accurate results than simple ring oscillators. This helps establish the exact end-of-life guard-band required.
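To see why a good first search point shortens the Vmin search, consider a plain binary search over supply voltage. The sketch below assumes a monotonic pass/fail response and a hypothetical predicted Vmin (for example, from a margin-based model) that narrows the search bracket. The `passes` function, the voltages, and the 20 mV window are illustrative assumptions, not tester or proteanTecs APIs.

```python
def find_vmin(passes, lo_mv, hi_mv, step_mv=5):
    """Binary-search the lowest passing voltage (in mV) between lo_mv and hi_mv.

    Assumes the device passes at every voltage above its true Vmin.
    Returns (voltage found, number of test executions)."""
    tests = 0

    def run(v_mv):
        nonlocal tests
        tests += 1
        return passes(v_mv)

    if not run(hi_mv):
        raise ValueError("device fails at the upper bound")
    if run(lo_mv):
        return lo_mv, tests  # bracket sits too high; the caller should widen it
    while hi_mv - lo_mv > step_mv:
        mid = (lo_mv + hi_mv) // 2
        if run(mid):
            hi_mv = mid
        else:
            lo_mv = mid
    return hi_mv, tests


# Hypothetical device whose true Vmin is 693 mV.
def device_passes(v_mv):
    return v_mv >= 693


# Search the full 500-1000 mV range without prior knowledge.
full = find_vmin(device_passes, 500, 1000)

# Search a +/-20 mV window around a hypothetical predicted Vmin of 700 mV.
predicted_mv, window_mv = 700, 20
seeded = find_vmin(device_passes, predicted_mv - window_mv, predicted_mv + window_mv)
```

Both searches end within one step above the true Vmin, but the seeded search executes fewer test patterns, and on a tester each execution costs time.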
The proteanTecs platform analyzes all the V-F characterization data and process monitor data, from both the DPAs and PCAs, and provides a model for a more efficient chip classification and V-F binning. The sensitivity of V-F to process parameters provides insights on which process parameters should be tuned by the foundry to maximize performance and performance yield, while minimizing power. It can also help identify potential process issues for outlier chips or provide useful insights on the process status during the silicon debug phase.

5 | Margin Agent (MA) Usage at System Level

In addition to the optimization performed at the ATE level as described above, the proteanTecs MAs help reduce the voltage margins at system level, achieving lower power with no compromise to reliability.

During normal system operation, the workload variation causes current fluctuation that excites the LC network formed by the package inductance and the chip capacitance, which is part of the system Power Delivery Network (PDN). This leads to voltage droops that may result in timing failures, as the voltage becomes temporarily lower than the minimum required for correct functionality. The V-F binning established on the tester (ATE) does not consider the Ldi/dt guard-band required. This can only be established accurately with actual system measurements and with the appropriate system PDN excitation, the guard-bands required by the voltage regulator, and the final EOL guard-bands. Exact knowledge of the actual in-situ timing margins safely reduces the voltage guard-bands required, leading to power reduction at maximum reliability.

6 | Power and Reliability Management

6.1 Voltage Droop Sensor (VDS)

A voltage droop due to the PDN excitation, as described in the previous section, can lead to timing failures and requires upfront high-voltage guard-bands, resulting in 10-20% or more extra power.
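A first-order estimate shows how a voltage guard-band translates into extra power of this size. The derivation below assumes that dynamic switching power dominates and that activity, capacitance, and frequency stay fixed; the symbol g for the fractional guard-band is introduced here for illustration only.

```latex
% Dynamic power with activity factor \alpha, switched capacitance C,
% supply voltage V, and clock frequency f:
P_{\mathrm{dyn}} = \alpha C V^{2} f

% Running at V_{\min}(1+g) instead of V_{\min}, with \alpha, C, f unchanged:
\frac{P_{\mathrm{dyn}}\bigl(V_{\min}(1+g)\bigr)}{P_{\mathrm{dyn}}(V_{\min})} = (1+g)^{2}

% Examples:
%   g = 0.05  =>  (1.05)^2 = 1.1025  (about 10% extra power)
%   g = 0.10  =>  (1.10)^2 = 1.21    (21% extra power)
```

Under these assumptions, guard-bands of roughly 5-10% of the supply voltage already account for 10-20% extra dynamic power. Leakage also rises with voltage, so the real penalty can be higher.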
Using oscilloscope probes to measure this droop is cumbersome and is restricted to the lab environment. Generic on-chip supply monitors based on Analog-to-Digital Converters (ADCs) are not able to capture the high-speed cycle-to-cycle voltage variation that can lead to timing failures. The proteanTecs Voltage Droop Sensor (VDS) provides an accurate cycle-to-cycle noise measurement and high-low watermarks during the workload run. As shown in Figure 10, this can be used to throttle the system temporarily and avoid further Ldi/dt droops, reclaiming the Ldi/dt voltage margin required and reducing overall chip power by 10-20%.

Figure 10: The Voltage Droop Sensor (VDS) measures the voltage fluctuation per cycle during the Ldi/dt event and can be used to set thresholds for throttling and reduction of the Ldi/dt droop, leading to power reduction at max performance

Further power improvements (typically on the order of 4%) can be achieved by not using the full EOL guard-band up front, but rather adjusting it over time using Adaptive Voltage Scaling (AVS), based on the remaining in-situ measured timing margin.

6.2 Local Voltage and Thermal Sensor (LVTS)

Reliability degradation has an exponential dependence on temperature. Therefore, thermal sensors need to be close to the hot spots and provide very accurate temperature measurements. This information can be used to throttle the system to prevent excessive reliability degradation, functional failure, or even complete chip damage. Traditional thermal diodes are large, require extra circuitry on the board to operate, and require extra bumps and low-resistance package traces that affect the power grid integrity. This can lead to an additional IR drop at the hot spots that consume the maximum power, leading to performance loss.
Similar problems arise with the use of bandgap-based thermal sensors, which require a separate analog supply that interferes with the normal digital domain power grid. The proteanTecs Local Voltage and Thermal Sensors (LVTS) are self-contained and use the same digital voltage supply. Special circuitry removes the impact of the digital supply noise and provides +/-1°C temperature measurement across the full operation range. The small size and APB-compatible interface allow the placement of these thermal sensors exactly where they are needed (see Figure 11), close to the hot spots and with negligible physical design impact.

Figure 11: The Local Voltage and Thermal Sensors (LVTS) are strategically placed across the chip to provide accurate on-chip temperature measurements for power management and reliability assessment

6.3 Workload Agents (WLA)

The Workload Agents (WLAs) serve as a proxy for how much voltage and temperature stress the chip has been exposed to during normal operation. This is important to determine the remaining useful life of a product and for efficient power and performance management using Dynamic Voltage Frequency Scaling (DVFS) (e.g., to judge the allowed time to enter “turbo” mode to improve performance on critical workloads).

Figure 12: The Workload Agent (WLA) measurement is a proxy for the chip voltage, temperature and activity-related stress under a specific workload, and can be used to estimate the remaining useful life of a product

7 | IO Channel Health Monitors

Process scaling is getting more difficult and expensive, and the need to integrate heterogeneous processes has led to the development of 2.5D and 3D integration as an effective way of extending the life of Moore’s Law.
While introducing major benefits, these advancements present new challenges, including thermomechanical and integration issues. In 2.5D and 3D applications, in addition to the circuit degradation, there is extra degradation due to thermomechanical fatigue (e.g., partially open bump or TSV connections), which can lead to communication failures. The proteanTecs Interconnect Monitoring Agents monitor this interface continuously, per lane and in functional mode, to provide an accurate measure of the degradation over time. This is especially useful in silicon and system debugging as well as in-field reliability monitoring.

7.1 HBM Agent (HBMA)

The HBM Agent (Figure 13a) resides on the SOC PHY side and monitors both the near-end and far-end high-speed interface signal integrity.

7.2 Tile Connectivity Agent (TCA)

2.5D heterogeneous integration requires high-speed communication between the various tiles (chiplets) on a silicon interposer or organic substrate. The Tile Connectivity Agent (TCA) monitors the signal integrity at the receiving end of each Die-to-Die (D2D) interface and provides accurate measurement of the eye width, the available margin, and the margin degradation over time (Figure 13b).

Figure 13: The Interconnect Monitoring Agents can alert of any abnormal channel degradation due to, for example, thermomechanical issues in HBM or D2D interfaces

8 | In-Field Monitoring during Operational Lifetime

The predominant semiconductor degradation mechanisms, such as Negative Bias Temperature Instability (NBTI), Hot Carrier Injection (HCI), Electromigration (EM), and others, depend exponentially on voltage, nodal activity, and silicon temperature. State-of-the-art power management calls for continuous changes in voltage and frequency (DVFS), per workload, to maximize performance/energy.
The nodal activity and temperature depend on the workload, the ambient temperature, the cooling solution, the throttling mechanisms, etc. Accounting for worst-case degradation upfront is very pessimistic and prohibitive at scale, leading to performance loss, high power, and cost. Therefore, in-situ monitoring of all these key parameters during normal operation and over the lifetime of the product is necessary to understand the remaining lifetime of the device.

Furthermore, many latent process defects may get activated over time and cause single-node transient or stuck-at failures that escaped Burn-In, Dynamic Voltage Stress (DVS), and other stress test techniques at time zero. These defects cause abnormal degradation that can be captured before the actual failure only if the affected timing paths are monitored continuously over time. A ring oscillator or critical path replica will not exhibit the same degradation as the actual timing path with the defect, as it will not have the same rare defect, nodal activity, and DVFS limits. The proteanTecs Margin Agents cover most of the critical paths and many more, and when the SW platform detects an abnormal timing margin degradation (Figure 14), the system administrator is alerted to remove the machine from service prior to the actual failure. Such a failure can be “hard” and easily detectable, or “soft and intermittent”, depending on activity and environmental conditions. The latter can cause undetected Silent Data Corruption errors that can create havoc in a datacenter, making the proteanTecs solution invaluable.
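The idea of alerting on abnormal margin degradation can be sketched with a simple trend fit. The example below fits a straight line to periodic margin readings, extrapolates to a failure threshold to estimate the remaining time, and flags a device whose margin shrinks much faster than an expected aging rate. Linear extrapolation, the expected slope, the factor of 3, and all readings are assumptions made for illustration; real aging is typically non-linear in time, and this is not the proteanTecs algorithm.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def time_to_failure(hours, margin_ps, fail_margin_ps=0.0):
    """Hours remaining after the last reading until the fitted margin reaches
    fail_margin_ps, or None if the margin is not shrinking."""
    slope, intercept = fit_line(hours, margin_ps)
    if slope >= 0:
        return None
    t_fail = (fail_margin_ps - intercept) / slope
    return max(t_fail - hours[-1], 0.0)


def is_abnormal(hours, margin_ps, expected_slope_ps_per_h, factor=3.0):
    """True if the margin shrinks more than `factor` times faster than expected.
    Both slopes are negative, so 'faster' means 'more negative'."""
    slope, _ = fit_line(hours, margin_ps)
    return slope < factor * expected_slope_ps_per_h


# Hypothetical in-field readings: operating hours and timing margin in ps.
hours = [0, 1000, 2000, 3000]
degrading = [50.0, 48.0, 46.0, 44.0]   # loses 2 ps per 1000 h
healthy = [50.0, 49.5, 49.0, 48.5]     # loses 0.5 ps per 1000 h
expected_slope = -0.0005               # assumed normal aging, ps per hour

ttf_hours = time_to_failure(hours, degrading)
alert_degrading = is_abnormal(hours, degrading, expected_slope)
alert_healthy = is_abnormal(hours, healthy, expected_slope)
```

In this made-up case the fast-degrading device is flagged while the healthy one is not, and the extrapolated time-to-failure gives the operator a window for planned maintenance.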
Figure 14: In-situ timing margin measurements can be used to estimate the time-to-failure (TTF), to alert for abnormal margin degradation and early safe system maintenance

9 | HW IP System

Given the large number of monitors provided by proteanTecs and the need for seamless integration and data communication, proteanTecs offers a Full Chip Controller (FCC) that manages all interfaces to the monitors, collects the monitor data, and provides it to the FW, edge SW and SW cloud platform via standard interfaces (JTAG, APB, I2C). As shown in Figure 15, all the monitors are connected in a chain managed by the FCC. The monitors are grouped in small blocks called Protean Units for easy integration within the design blocks. The Protean Units contain local control and HW processing of the monitor data. Usually, one FCC is required per chip, and it can control up to 1024 Protean Units.

Some of the proteanTecs IPs are hard macros that contain all required clock manipulations internally, minimizing the design effort on the customer side. The rest of the IPs, including the high path coverage Margin Agents (MAs), are synthesizable using the customer flows and a fully automated procedure.

Figure 15: All proteanTecs agents are connected in a chain and the interfaces, data collection and control are managed through the Full Chip Controller (embedded in Unit A in this example)

10 | proteanTecs SW Analytics Platform

The proteanTecs SW analytics platform, as depicted in Figure 16, operates in the cloud and provides the interfaces to collect data from the wafer probe, ATE vendor, and the system during productization and normal operation.
The platform provides Agent fusion and analyzes the Agent measurements to provide a holistic view of the chip and system:
- Identifying design sensitivity to process parameters
- Chip classification for optimal voltage-frequency binning at early stages of production
- Identifying degradations during normal system operation
- Providing workload analysis
- Alerting the system administrator on upcoming failures

While traditional COT (Customer Owned Tooling) chip design companies typically design some basic process control sensors themselves, they lack the full spectrum of monitors provided by proteanTecs. They also lack the full integration flow and the pre-built algorithms that enable the collection and analysis of a vast amount of data, as well as the use of sophisticated machine learning (ML) that helps with optimal chip and system characterization, production, and operation. This integrated solution and unique approach to data analytics saves design effort and provides powerful tools to help with process, performance, and power tuning as well as silicon and system debugging.
Figure 16: The proteanTecs software stack serves the early data gathering, the data analytics, yield and performance optimization as well as in field monitoring and reliability enhancement
[Figure content, three tiers. proteanTecs Cloud platform (cloud-based platform, system runtime environment): big data analytics; many data points, high compute and memory resources; advanced analytics and visualization capabilities; insights across fleets of systems, on top of specific system/chip; open architecture. proteanTecs Edge analytics (chip runtime environment): on-device / “near-device”; during test, running at the “test floor”, e.g., ATE. Embedded / Firmware Agent (embedded chip): local operation and data collection; low latency alerts and interrupts; real-time applications for chip performance/health.]

11 | Conclusions

Today’s applications demand high performance, low power, and high reliability. This is especially true for Automotive and autonomous driving applications, as well as for data centers, where deployment at scale requires thousands of machines to work in tandem, error-free. The proteanTecs on-chip monitoring and analytics platform offers the ultimate tool to accomplish this reliability, yield, performance, and power co-optimization, as already proven on multiple customer examples. proteanTecs offers the Monitor IP, the CAD flows, and tools to facilitate the IP integration in the chip and make sure the implementation will provide the expected value. In addition, it provides the machine learning algorithms and analytics SW stack to analyze the measured data at all phases of the product cycle: from wafer testing, packaged device testing, NPI, system ramp, system test and in-field monitoring, until the product retirement. This offers invaluable insights into the system performance margins and reliability, performance, power, and cost optimization opportunities.
This integrated solution and unique approach to data analytics is the key differentiator of the proteanTecs offering. This deep data, end-to-end health monitoring at scale leads to optimal reliability, performance, power and cost across a wide range of applications from automotive to datacenters and beyond.

proteanTecs Ltd
www.proteanTecs.com
© proteanTecs 2023. All rights reserved.
