White paper proteanTecs On-Chip Monitoring and Deep Data Analytics System

White Paper
proteanTecs On-Chip Monitoring
and Deep Data Analytics System
The ultimate solution for reliability, yield, performance
and power co-optimization
Authors: Georgios Konstadinidis, Evelyn Landman
Table of Contents
1. Introduction
2. Design Profiling and Material Classification during NPI
2.1 Process Classification Agent (PCA)
2.2 Design Profiling Agent (DPA)
3. System Tuning during Productization
3.1 Timing Margin Agents (MA)
3.2 Noise Modulation Agent (NMA)
3.3 Clock Integrity Agent (CIA)
4. Margin Agent (MA) Usage during ATE Testing
5. Margin Agent (MA) Usage at System Level
6. Power and Reliability Management
6.1 Voltage Droop Sensor (VDS)
6.2 Local Voltage and Thermal Sensor (LVTS)
6.3 Workload Agents (WLA)
7. IO Channel Health Monitors
7.1 HBM Agent (HBMA)
7.2 Tile Connectivity Agent (TCA)
8. In-Field Monitoring during Operational Lifetime
9. HW IP System
10. proteanTecs SW Analytics Platform
11. Conclusions
1 | Introduction
State-of-the-art silicon processes offer mainly logic density improvements at limited speedup.
The power wall was hit several years ago, and these constraints set a limit on the chip frequency
and performance. Worst-case design analysis is no longer cost-effective. Reducing
design margins while maintaining high reliability becomes imperative to maximize performance
and reduce power and cost. High reliability is even more important in service-critical
applications, such as Automotive and Datacenter, given the trends in cloud, full automation
and autonomous driving. Since devices degrade over time, and sometimes abnormally, it is
imperative that the design margin analysis be performed intelligently and be based on actual
in-situ margin measurements. The proteanTecs On-Chip Monitoring and Analytics Platform
(Figure 1) offers the ultimate tool to accomplish this reliability, yield, performance, and power
co-optimization as already proven on multiple customer examples as summarized in Figure 2.
proteanTecs offers a HW IP system, the monitoring IP, the CAD tools to facilitate the IP integration
in the chip, as well as the Analytics Platform, ATE edge SW, on-board embedded SW, and
FW reference code, to analyze the measured data at all phases of the product cycle: from
wafer testing, packaged device testing, New Product Introduction (NPI), system test and in-field
monitoring, until the product retirement (Figure 3). This offers invaluable insights into the system
margin and reliability, performance, power, and cost optimization opportunities. This integrated
solution and unique approach to deep data analytics is the key differentiator of the proteanTecs
offering.
Figure 1: proteanTecs on chip monitoring and analytics platform for reliability, yield, performance
& power co-optimization at chip, system, and in-mission
[Figure 2 content]
Customers: Fortune 50 networking company; leading fabless semiconductor company; leading ASIC vendor.
Results: 18% power reduction; 2% yield improvement due to feedback to the fab of design sensitivity to process; detection of silicon-to-simulation miscorrelation in a 5nm test chip.
Panel titles: Power reduction at SLT; Inferred process parameters per die detected sensitivity of yield to process; Post-to-pre parametric correlation.
Customer quotes:
“The average voltage improvement is around 6~7 steps and this is significant since it’s such a big chip”
“The proteanTecs analysis helps us to prove the correctness of external circuit modification and the chip design quality”
“proteanTecs speeds chip and system bring-up, significantly reducing time-to-market”
Figure 2: Customer success stories using proteanTecs on chip monitoring and analytics platform
Figure 3: proteanTecs targeted applications for the full product cycle
The following table provides an overview of the available proteanTecs IP offering.
proteanTecs Agents: Purpose
Process Classification Agent (PCA): Process key-parameter characterization
Design Profiling Agent (DPA): Design style - process interaction characterization
Margin Agent (MA): In-situ timing margin measurement of the actual design paths
Noise Modulation Agent (NMA): In-situ effect of local IR drop on a reference path delay
Clock Integrity Agent (CIA): In-situ maximum cycle-to-cycle clock jitter measurement and maximum and minimum clock cycle time
Local Thermal Agent (LTA): Local on-die chip temperature measurement (used for analytics)
Workload Agent (WLA): Voltage - temperature chip stress conditions proxy
Voltage Droop Sensor (VDS): Ldi/dt voltage droop detection
Local Voltage and Thermal Sensor (LVTS): Accurate thermal and DC voltage measurements, with embedded over-temperature alerts, works on VDD core only
Tile Connectivity Agent (TCA): Monitors signal integrity at the receiving end of die-to-die (D2D) interfaces
HBM Agent (HBMA): Monitors signal integrity at the SOC side of SOC-HBM interfaces
Full Chip Controller: On-chip Monitor & System Interface Controller
2 | Design Profiling and Material Classification during NPI
2.1 Process Classification Agent (PCA)
Knowing the exact process characteristics during the New Product Introduction (NPI)
phase and up to the final qualification is extremely important to help with process tuning
to maximize performance and yield at the minimum power. Foundries offer limited wafer
data (Wafer Acceptance Test - WAT) from a small number of sites across the wafer. This
is not adequate to capture the process variation, across-the-wafer, and across-the-chip,
sufficiently. Multiple instances of the proteanTecs Process Classification Agents (PCAs)
are distributed, within each chip, to cover for cross-chip process variations and provide
key process parameter measurements. These can be correlated to simulations and using
proteanTecs’ SW platform’s built-in algorithms, they can be translated into inferred process
parameters such as Vtsat, Idlin and Ioff. This data is analyzed by the analytics platform
to facilitate the yield and frequency binning predictions. This establishes a better product
binning strategy based on data and intelligent analysis.
The agent information is also used by built-in algorithms in the SW platform to find defects
that are not detectable with the state-of-the-art methods, achieving substantial DPPM
reduction and even preventing Silent Data Corruption.
Figure 4: PCA-based outlier detection using proteanTecs analytics for significant DPPM reduction
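As an illustration of the kind of outlier screening that per-die agent readings make possible, the following minimal sketch flags dies whose reading deviates strongly from the rest of the population, using a median/MAD modified z-score. This is a generic statistical technique and not the proteanTecs algorithm, which this paper does not describe. The function names, die identifiers, example readings, and the 3.5 threshold are all assumptions made for illustration.

```python
from statistics import median


def robust_z_scores(values):
    """Return modified z-scores based on the median and the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half of the readings are identical, so there is no
        # spread to scale by; this simple sketch then flags nothing.
        return [0.0 for _ in values]
    # 0.6745 rescales the MAD so that scores are comparable to ordinary
    # z-scores when the data is normally distributed.
    return [0.6745 * (v - med) / mad for v in values]


def flag_outliers(readings, threshold=3.5):
    """readings maps a die id to one agent reading; return flagged die ids."""
    die_ids = list(readings)
    scores = robust_z_scores([readings[d] for d in die_ids])
    return [d for d, s in zip(die_ids, scores) if abs(s) > threshold]


# Hypothetical readings (arbitrary units) from the same agent on six dies.
example_readings = {
    "die_01": 100.2,
    "die_02": 99.8,
    "die_03": 100.5,
    "die_04": 100.1,
    "die_05": 92.0,
    "die_06": 99.9,
}
flagged = flag_outliers(example_readings)
```

A median-based score is used here because a single extreme die barely moves the median or the MAD, whereas it would inflate a mean and standard deviation and partly hide itself.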
2.2 Design Profiling Agent (DPA)
Ring oscillators are typically used to quickly identify process corners (fast, typical, or slow)
and provide a first estimate of the reliability (NBTI and HCI degradation). The delay response
to voltage and process variation depends on the gate type, size, and layout. General-purpose
ring oscillator monitors do not typically provide the gate type variety or might not even be
constructed from the design library cells. The proteanTecs DPA is built using the standard
cell library and specifically the cells used in the particular design and reflects the mix and
statistics of gates observed at the critical paths. It uses the same composition and Place
and Route (P&R) tools as in the design. This results in a much better correlation between
the DPAs, and the actual chip response compared to the generic ring oscillator case. The
proteanTecs analytics platform performs the statistics and compares against targets, to
provide the fundamental gate response for this specific design using a powerful and user-friendly graphics interface, as shown in Figure 5 below.
Figure 5: DPA data offer invaluable insights to the in-silicon response of the specific gates and
design style compared to simulations on built-in analytics dashboards
Figure 6: Wafer maps of PCA and DPA agent data for wafer/chip level location dependencies on
the proteanTecs analytics platform
3 | System Tuning during Productization
3.1 Timing Margin Agents (MA)
While DPAs offer a much better chip speed proxy than generic ring oscillators and are very
useful for first evaluation, they do not accurately reflect the critical paths’ response under
normal workload operation. The critical path delay response depends on the local voltage
variation, the gate type and size combination in the timing path, the signal activity that
affects NBTI degradation, the process variation, and layout context. This makes it nearly
impossible to capture actual timing margins with simple ring oscillators, DPAs, or even
critical path replicas that are not exercised under the exact same workload conditions.
Consequently, measuring the actual margin accurately in critical paths in-situ during normal
operation becomes a must.
The choice of the critical path to monitor though is key, as not all top critical paths can be
monitored (for practical reasons), and the path order under real applications will not be the
same as seen in the models. proteanTecs offers an intelligent algorithm to select the right
timing path to monitor, to achieve high coverage in terms of number of nodes and critical
paths covered, as well as representative groups of paths.
The timing monitors are connected to the inputs of the monitored Flip-Flops (FFs) during
synthesis and P&R. Their placement and connection need very careful consideration so
that they do not affect the quality of results. This flow is fully automated by proteanTecs.
Furthermore, since the timing paths change over time as the design matures, it is imperative
to have a very good ECO mode of injecting the right timing monitors at the right places
without upsetting the design closure. Again, the proteanTecs ECO flow is fully automated
and proven on numerous designs. The design overhead is typically less than 1% in gate
count on the embedding blocks, and power is less than 0.05% of the core power.
An additional benefit of using the MAs is to check the coverage of the critical paths per
workload. The proteanTecs platform will report not only the margin per MA, but also if
it toggled during the workload or not, giving insight for improvements in the V-F shmoo
patterns for better coverage.
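The per-workload report described above can be pictured with a small sketch. It assumes a hypothetical record per Margin Agent holding a toggle flag and a margin value; the record layout, the picosecond unit, and the function name are illustrative assumptions, not the proteanTecs data format.

```python
from dataclasses import dataclass


@dataclass
class MarginReading:
    agent_id: str      # hypothetical identifier of one Margin Agent
    toggled: bool      # True if the monitored path was exercised by the workload
    margin_ps: float   # measured margin (unit assumed); ignored when not toggled


def summarize_workload(readings):
    """Summarize MA coverage and the worst measured margin for one workload."""
    exercised = [r for r in readings if r.toggled]
    worst = min(exercised, key=lambda r: r.margin_ps, default=None)
    return {
        "coverage": len(exercised) / len(readings) if readings else 0.0,
        "worst_agent": worst.agent_id if worst is not None else None,
        "worst_margin_ps": worst.margin_ps if worst is not None else None,
        "untoggled": [r.agent_id for r in readings if not r.toggled],
    }


# Hypothetical readings for a single workload run.
example_run = [
    MarginReading("ma_0", True, 42.0),
    MarginReading("ma_1", True, 17.5),
    MarginReading("ma_2", False, 0.0),
    MarginReading("ma_3", True, 30.0),
]
report = summarize_workload(example_run)
```

The list of untoggled agents is the part that feeds back into pattern development: a path that never toggled reports no usable margin, so the workload or shmoo pattern does not cover it.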
Figure 7: The margin agents (MA) provide accurate in-situ measurements, with high coverage of
the design paths, including the timing-critical ones
3.2 Noise Modulation Agent (NMA)
While the MAs are inserted directly in each
of the critical paths, the Noise Modulation
Agents (NMAs) are inserted at block level
and provide a proxy for the “effective cycle
time measurement” as impacted by cycle-to-cycle jitter, voltage fluctuation and local
temperature. Although less accurate than the
MA as it provides the actual margin at block
level instead of per critical timing path, the
NMAs show reference path delay fluctuations
created by local IR drop, cycle-to-cycle jitter,
and temperature.
Figure 8: The Noise Modulation Agent (NMA)
measures the maximum increase/decrease of the
“effective cycle time” as impacted by the cycle-to-cycle jitter, local IR drop, and temperature
3.3 Clock Integrity Agent (CIA)
The clock jitter depends on the cycle-to-cycle
jitter of the clock source (typically a PLL) and
on the cycle-to-cycle variation in clock arrival
time at the destination flip-flop, which depends
on the overall clock network latency, process
variation, the supply voltage noise per cycle, and
the temperature difference at the various points
on the chip. The proteanTecs Clock Integrity
Agents (CIAs) provide an accurate maximum
cycle-to-cycle clock jitter measurement under
various workloads and in-situ, considering all
the actual voltage variation sources during
normal chip operation. Knowing the exact clock
noise level and correlating back to the design
estimates helps close the loop with the clock
design methodology. In case of design issues,
it provides key insight into the actual clock
behavior, accelerating the silicon debug.
Figure 9: The Clock Integrity Agents (CIAs) are
strategically placed across the die to provide
accurate clock jitter measurements at the end
of the clock network distribution and close to
the receiving flip-flops
4 | Margin Agent (MA) Usage during ATE Testing
Leveraging the MAs and edge on-tester models, advanced analytics lead to new learnings.
Knowing the actual margins of the timing-critical paths on the ATE helps establish the first search point
in the Vmin-Frequency search, reducing test time and cost.
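To show why a good first search point saves test time, the sketch below runs a binary search for Vmin on a voltage grid, optionally starting from a narrow window around a seed voltage. How a seed is derived from MA data is not described in this paper, so `seed_mv` is simply a hypothetical input here. The tester callback, voltage range, step size, and window are also assumptions, and the search assumes pass/fail is monotonic in voltage.

```python
def find_vmin(passes_at, lo_mv, hi_mv, step_mv=5, seed_mv=None, window_mv=25):
    """Return (vmin_mv, tests_run); vmin_mv is None if even hi_mv fails.

    All voltages are assumed to lie on the same step_mv grid.
    """
    tests_run = 0

    def run(v_mv):
        nonlocal tests_run
        tests_run += 1
        return passes_at(v_mv)

    if seed_mv is not None:
        seed_hi = min(hi_mv, seed_mv + window_mv)
        seed_lo = max(lo_mv, seed_mv - window_mv)
        if run(seed_hi):
            hi_mv = seed_hi                # Vmin is at or below seed_hi
            if seed_lo > lo_mv and not run(seed_lo):
                lo_mv = seed_lo + step_mv  # Vmin is above seed_lo
        else:
            lo_mv = seed_hi + step_mv      # seed was too optimistic
            if not run(hi_mv):
                return None, tests_run
    elif not run(hi_mv):
        return None, tests_run

    # Invariant: hi_mv passes. Narrow down to the lowest passing voltage.
    while lo_mv < hi_mv:
        mid_mv = lo_mv + ((hi_mv - lo_mv) // (2 * step_mv)) * step_mv
        if run(mid_mv):
            hi_mv = mid_mv
        else:
            lo_mv = mid_mv + step_mv
    return hi_mv, tests_run


def simulated_tester(v_mv):
    """Stand-in for a real test pattern run; the true Vmin is 715 mV."""
    return v_mv >= 715


unseeded = find_vmin(simulated_tester, 600, 900)
seeded = find_vmin(simulated_tester, 600, 900, seed_mv=720)
```

Both calls find the same Vmin, but the seeded search needs fewer pattern executions because two probes around the seed replace several halving steps over the full range. A poor seed costs at most a couple of extra probes before the search falls back to the wider range.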
The MA data can also be used during HTOL checkpoints to show the actual degradation of
the timing paths, thus reducing or even eliminating the number of additional V-F searches and
providing much more accurate results than simple ring oscillators. This helps establish the exact
end of life guard-band required.
The proteanTecs platform analyzes all the V-F characterization data and process monitor data,
from both the DPAs and PCAs, and provides a model for a more efficient chip classification and
V-F binning. The sensitivity of V-F to process parameters provides insights on what process
parameters should be tuned by the foundry to maximize performance and performance yield,
while minimizing power. It can also help identify potential process issues for outlier chips or
provide useful insights on the process status during the silicon debug phase.
5 | Margin Agent (MA) Usage at System Level
In addition to optimization performed at the ATE level as described above, the proteanTecs
MAs help reduce the voltage margins at system level, achieving lower power with no
compromise to reliability. During normal system operation, the workload variation causes
current fluctuation that excites the package inductance-chip cap LC network that is part
of the system Power Delivery Network (PDN). This leads to voltage droops that may result
in timing failures, as the voltage becomes temporarily lower than the minimum required for
correct functionality.
The V-F binning established on the Tester (ATE) does not consider the Ldi/dt guard-band
required. This can only be established accurately with actual system measurements and with
the appropriate system PDN excitation, voltage regulator required guard-bands, and the final
EOL guard-bands. Exact knowledge of the actual in-situ timing margins safely reduces the
voltage guard-bands required, leading therefore to power reduction at maximum reliability.
6 | Power and Reliability Management
6.1 Voltage Droop Sensor (VDS)
A voltage droop due to the PDN excitation, as described in the previous section, can lead
to timing failures and requires upfront high-voltage guard-bands, resulting in more than
10-20% extra power. Using oscilloscope probes to measure this droop is cumbersome
and is restricted to the lab environment. Generic on-chip supply monitors based on
Analog-to-Digital Converters (ADCs) cannot capture the high-speed cycle-to-cycle voltage
variation that can lead to timing failures. The proteanTecs Voltage Droop Sensor (VDS)
provides an accurate cycle-to-cycle noise measurement and high-low watermarks
during the workload run. As shown in Figure 10, this can be used to throttle the system
temporarily and avoid further Ldi/dt droops, reclaiming the Ldi/dt voltage margin required
and reducing overall chip power by 10-20%.
Figure 10: The Voltage Droop Sensor (VDS) measures the voltage fluctuation per cycle during the Ldi/dt event
and can be used to set thresholds for throttling and reduction of the Ldi/dt droop, leading to power reduction at
max performance
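A threshold-based throttle of the kind described for Figure 10 could look like the following sketch. It assumes the sensor reports a low watermark (the lowest supply voltage seen in the last observation window) in millivolts. The class name, the two thresholds, and the use of hysteresis to avoid rapid toggling are illustrative assumptions rather than the proteanTecs implementation.

```python
class DroopThrottle:
    """Two-threshold (hysteresis) throttle driven by a droop low watermark."""

    def __init__(self, assert_mv, release_mv):
        if release_mv <= assert_mv:
            raise ValueError("release_mv must be above assert_mv")
        self.assert_mv = assert_mv    # throttle when the watermark drops below this
        self.release_mv = release_mv  # release only once it recovers above this
        self.throttled = False

    def update(self, low_watermark_mv):
        """Feed one low-watermark sample and return the new throttle state."""
        if not self.throttled and low_watermark_mv < self.assert_mv:
            self.throttled = True
        elif self.throttled and low_watermark_mv > self.release_mv:
            self.throttled = False
        return self.throttled


# Hypothetical thresholds and watermark samples, in millivolts.
throttle = DroopThrottle(assert_mv=700, release_mv=730)
states = [throttle.update(v) for v in [750, 695, 715, 735, 720]]
```

The gap between the two thresholds keeps the throttle engaged while the supply is still recovering, so a watermark hovering near a single threshold does not switch throttling on and off every window.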
Further power improvements (typically in the order of 4%) can be achieved by not using
the full EOL guard-band up front, but rather adjusting it over time using Adaptive Voltage
Scaling (AVS), based on the remaining in-situ measured timing margin.
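One way to picture margin-driven AVS is as a small control step that nudges the supply voltage according to the measured remaining margin. The sketch below is a simplified illustration under stated assumptions: the function name, the picosecond margin unit, the deadband, the step size, and the voltage limits are hypothetical and are not taken from a proteanTecs product.

```python
def avs_step(vdd_mv, margin_ps, target_ps, deadband_ps=5.0,
             step_mv=5, v_min_mv=650, v_max_mv=900):
    """Return the next supply voltage given the measured timing margin."""
    if margin_ps < target_ps:
        vdd_mv += step_mv   # margin below target: raise the voltage
    elif margin_ps > target_ps + deadband_ps:
        vdd_mv -= step_mv   # comfortable surplus: reclaim some power
    # Within the deadband the voltage is left unchanged.
    return max(v_min_mv, min(v_max_mv, vdd_mv))


# Hypothetical operating points.
raised = avs_step(750, margin_ps=12.0, target_ps=20.0)
lowered = avs_step(750, margin_ps=40.0, target_ps=20.0)
held = avs_step(750, margin_ps=22.0, target_ps=20.0)
```

As the device ages and the measured margin shrinks, repeated calls raise the voltage only when needed, which is how the end-of-life guard-band can be applied gradually instead of in full from day one.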
6.2 Local Voltage and Thermal Sensor (LVTS)
Reliability degradation has an exponential dependence on temperature. Therefore,
thermal sensors need to be close to the hot spots and provide very accurate temperature
measurements. This information can be used to throttle the system to prevent excessive
reliability degradation, functionality failure or even complete chip damage. The traditional
thermal diodes are large, require extra circuitry on the board to operate, and require extra
bumps and low resistance package traces that affect the power grid integrity. This can
lead to an additional IR drop at the hotspots that consume the maximum power, leading to
performance loss. Similar problems arise with the use of bandgap-based thermal sensors
that require separate analog voltage which interferes with the normal digital domain
power grid.
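The exponential temperature dependence noted at the start of this section is commonly modeled with the Arrhenius equation. The sketch below computes the resulting acceleration factor between two temperatures. The Arrhenius form and the 0.7 eV activation energy are assumptions chosen for illustration; the paper does not name a specific model, and real activation energies depend on the degradation mechanism.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(temp_use_c, temp_stress_c, activation_energy_ev=0.7):
    """Return how much faster degradation proceeds at temp_stress_c than at temp_use_c."""
    t_use_k = temp_use_c + 273.15
    t_stress_k = temp_stress_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / t_use_k - 1.0 / t_stress_k
    )
    return math.exp(exponent)


# Effect of a hot spot running 20 degrees C above an 85 degrees C reference.
factor = arrhenius_acceleration(85.0, 105.0)
```

With these assumed values the factor comes out at roughly 3.3, which illustrates why a sensor that sits away from the hot spot, or that is off by several degrees, can misjudge the degradation rate considerably.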
The proteanTecs Local Voltage and Thermal Sensors (LVTS) are self-contained and use
the same digital voltage supply.
Special circuitry removes the impact of the digital supply noise and provides +/-1°C
temperature measurement across the full operation range. The small size and APB-compatible
interface allow the placement of these thermal sensors exactly where
they are needed (see Figure 11), close to the hot spots and with negligible physical
design impact.
Figure 11: The Local Voltage and Thermal Sensors (LVTS) are strategically placed across the chip to provide
accurate on-chip temperature measurements for power management and reliability assessment
6.3 Workload Agents (WLA)
The Workload Agents (WLAs) serve
as a proxy for how much voltage and
temperature stress the chip has been
exposed to, during normal operation. This is
important to determine the remaining useful
life of a product and for efficient power and
performance management, using Dynamic
Voltage Frequency Scaling (DVFS) power
management (e.g., to judge the allowed
time to enter “turbo” mode to improve
performance on critical workloads).
Figure 12: The Workload Agent (WLA) measurement is
a proxy for the chip voltage, temperature and activity-related stress under a specific workload, and can be
used to estimate the remaining useful life of a product
7 | IO Channel Health Monitors
Process scaling is getting more difficult and expensive, and the need to integrate heterogeneous
processes has led to the development of 2.5D and 3D integration as an effective way of extending
the life of Moore’s Law. While introducing major benefits, these advancements present new
challenges, including thermomechanical and integration issues.
In 2.5D and 3D applications, in addition to the circuit degradation, there is extra degradation
due to thermomechanical fatigue (e.g., partially open bump or TSV connections), which can
lead to communication failures. The proteanTecs Interconnect Monitoring Agents monitor this
interface continuously, per lane and in functional mode, to provide an accurate measure of the
degradation over time. This is especially useful in silicon and system debugging as well as in-field
reliability monitoring.
7.1 HBM Agent (HBMA)
The HBM Agent (Figure 13a) resides on the SOC PHY side and monitors both the near-end and far-end high-speed interface signal integrity.
7.2 Tile Connectivity Agent (TCA)
2.5D heterogeneous integration requires high-speed communication between the various
tiles (chiplets) on a silicon interposer or organic substrate. The Tile Connectivity Agent
(TCA) monitors the signal integrity quality at the receiving end of each Die-to-Die (D2D)
interface and provides accurate measurement of the eye width, the available margin, and
margin degradation over time (Figure 13b).
Figure 13: The Interconnect Monitoring Agents can alert of any abnormal channel degradation due to, for
example, thermomechanical issues in HBM or D2D interfaces
8 | In-Field Monitoring during Operational Lifetime
The predominant semiconductor process degradation mechanisms like Negative Bias
Temperature Instability (NBTI), Hot Carrier Injection (HCI), Electromigration (EM), and others depend
exponentially on voltage, nodal activity, and silicon temperature. State-of-the-art power
management calls for continuous changes in voltage and frequency (DVFS), per workload,
to maximize performance/energy. The nodal activity and temperature will depend on the
workload, the ambient temperature, the cooling solution, the throttling mechanisms, etc.
Accounting for worst-case degradation upfront is very pessimistic and prohibitive at scale,
leading to performance loss, high power, and cost.
Therefore, in-situ monitoring of all these key parameters during normal operation and over the
lifetime of the product is necessary to understand the remaining lifetime of the device.
Furthermore, many latent process defects may get activated over time and cause single-node
transient or stuck-at failures that had escaped the Burn-In, Dynamic Voltage Stress (DVS),
and other stress test techniques at time zero. These defects will cause abnormal degradation
that can only be captured before the actual failure, if the affected timing paths are monitored
continuously over time. Obviously a ring oscillator or critical path replica will not exhibit the
same degradation as the actual timing path with defect, as it will not have the same rare
defect, nodal activity and DVFS limits. The proteanTecs Margin Agents cover most of the
critical paths and many more, and when the SW platform detects an abnormal timing margin
degradation (Figure 14) the system administrator is alerted to remove this machine from
service prior to the actual failure. Such a failure can be a “hard” and easily detectable failure,
or “soft and intermittent”, depending on activity and environmental conditions. The latter can
cause undetected Silent Data Corruption errors that can create havoc in a datacenter, making
the proteanTecs solution invaluable.
Figure 14: In-situ timing margin measurements can be used to estimate the time-to-failure (TTF), to alert for
abnormal margin degradation and early safe system maintenance
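The time-to-failure estimate mentioned for Figure 14 can be illustrated by fitting a trend to successive margin measurements and extrapolating to a failure threshold. The sketch below uses an ordinary least-squares straight line, which is a deliberate simplification: real aging is often not linear in time, and the paper does not describe the model proteanTecs uses. The function name, units, and sample data are assumptions.

```python
def estimate_ttf(times_h, margins_ps, fail_margin_ps=0.0):
    """Return the estimated hours left until the margin reaches fail_margin_ps.

    Fits a least-squares line to (time, margin) samples and extrapolates it.
    Returns None when the fitted margin is not decreasing.
    """
    n = len(times_h)
    if n < 2 or n != len(margins_ps):
        raise ValueError("need at least two matching samples")
    mean_t = sum(times_h) / n
    mean_m = sum(margins_ps) / n
    sxx = sum((t - mean_t) ** 2 for t in times_h)
    if sxx == 0:
        raise ValueError("all samples share the same timestamp")
    sxy = sum((t - mean_t) * (m - mean_m) for t, m in zip(times_h, margins_ps))
    slope = sxy / sxx
    intercept = mean_m - slope * mean_t
    if slope >= 0:
        return None  # no degradation trend to extrapolate
    crossing_h = (fail_margin_ps - intercept) / slope
    return max(0.0, crossing_h - max(times_h))


# Hypothetical margin samples taken every 1000 operating hours.
hours = [0.0, 1000.0, 2000.0, 3000.0]
margins = [50.0, 48.0, 46.0, 44.0]
remaining_h = estimate_ttf(hours, margins)
```

An alert policy could then compare the estimate against a maintenance window, or watch for a sudden steepening of the slope, which is the signature of the abnormal degradation discussed above.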
9 | HW IP System
Given the large number of monitors provided by proteanTecs and the need for seamless
integration and data communication, proteanTecs offers a Full Chip Controller (FCC) that
manages all interfaces to the monitors, collects the monitor data, and provides them to the
FW, edge SW and SW cloud platform via standard interfaces (JTAG, APB, I2C). As shown in
Figure 15, all the monitors are connected in the chain managed by the FCC. The monitors are
grouped in small blocks called Protean Units for easy integration within the design blocks.
The Protean Units contain local control and HW processing of the monitor data. Usually, one
FCC is required per chip and can control up to 1024 Protean Units.
Some of the proteanTecs IPs are hard macros, containing all required clock manipulations
internally, minimizing the design effort on the customer side. The rest of the IPs, including
the high path coverage Margin Agents (MAs), are synthesizable using the customer flows and
using a fully automated procedure.
Figure 15: All proteanTecs agents are connected in a chain and the interfaces, data collection and control are
managed through the Full Chip Controller (embedded in Unit A in this example)
10 | proteanTecs SW Analytics Platform
The proteanTecs SW analytics platform, as depicted in Figure 16, operates in the cloud, and
provides the interfaces to collect data from the wafer probe, ATE vendor, and the system
during productization and normal operation. The platform provides Agent fusion and analyzes
the Agent measurements to provide a holistic view of the chip and system:
Identifying design sensitivity to process parameters
Chip classification for optimal voltage-frequency binning at early stages of production
Identifying degradations during normal system operation
Providing workload analysis
Alerting the system administrator to upcoming failures
While traditional COT (Customer Owned Tooling) chip design companies typically design some
basic process control sensors themselves, they lack the full spectrum of monitors provided by
proteanTecs. They also lack the full integration flow and the pre-built algorithms that enable
the collection and analysis of a vast amount of data, as well as the use of sophisticated
machine learning (ML) that helps with optimal chip and system characterization, production,
and operation. This integrated solution and unique approach to data analytics saves design
effort and provides powerful tools to help with process, performance, and power tuning as
well as silicon and system debugging.
[Figure 16 content]
Labels in the figure: Cloud based platform; System Runtime environment; Chip Runtime environment; Agent Embedded Chip.
proteanTecs Cloud platform: big data analytics; many data points, high compute and memory resources; advanced analytics and visualization capabilities; insights across fleets of systems, on top of specific system/chip; open architecture.
proteanTecs Edge analytics: on-device / “near-device”; during test, running at the “test floor”, e.g., ATE.
Embedded / Firmware: local operation and data collection; low latency alerts and interrupts; real-time applications for chip performance/health.
Figure 16: The proteanTecs software stack serves the early data gathering, the data analytics, yield and
performance optimization as well as in field monitoring and reliability enhancement
11 | Conclusions
Today’s applications demand high performance, low power, and high reliability. This is
especially true for Automotive and autonomous driving applications, as well as in data centers,
where deployment at scale requires thousands of machines to work in
tandem, error-free.
The proteanTecs on-chip monitoring and analytics platform offers the ultimate tool to
accomplish this reliability, yield, performance, and power co-optimization as already proven
on multiple customer examples.
proteanTecs offers the Monitor IP, the CAD flows, and tools to facilitate the IP integration in
the chip and make sure the implementation will provide the expected value. In addition, it
provides the machine learning algorithms and analytics SW stack to analyze the measured
data at all phases of the product cycle: from wafer testing, packaged device testing, NPI,
system ramp, system test and in-field monitoring, until the product retirement. This offers
invaluable insights into the system performance margins and reliability, performance, power,
and cost optimization opportunities. This integrated solution and unique approach to data
analytics is the key differentiator of the proteanTecs offering. This deep data, end-to-end
health monitoring at scale leads to optimal reliability, performance, power and cost across a
wide range of applications from automotive to datacenters and beyond.
proteanTecs Ltd
www.proteanTecs.com
© proteanTecs 2023. All rights reserved.