SEMATECH and the SEMATECH logo are registered trademarks of SEMATECH, Inc. WorkStream is a registered trademark of Consilium, Inc.

FAILURE REPORTING, ANALYSIS, AND CORRECTIVE ACTION SYSTEM

CONTENTS

Acknowledgments
EXECUTIVE SUMMARY
FAILURE REPORTING
  Collecting the Data
  Reporting Equipment Failures
  Reporting Software Problems
  Logging Data
  Using FRACAS Reports
ANALYSIS
  Failure Analysis Process
  Failure Review Board
  Root Cause Analysis
  Failed Parts Procurement
CORRECTIVE ACTION
FRACAS DATABASE MANAGEMENT SYSTEM
  Configuration Table
  Events Table
  Problems Table
  Report Types
GLOSSARY
FRACAS Functions and Responsibilities (inside back cover)

ACKNOWLEDGMENTS

Many obstacles kept getting in the way of completing this document. Without the support of my family, Sonia, Jason, and Christopher, it would never have been accomplished.

Mario Villacourt
External Total Quality and Reliability
SEMATECH

I would like to thank my wife, Monique I. Govil, for her support and encouragement in completing this work.
Also, I want to thank SVGL for supporting my participation in this effort.

Pradeep Govil
Reliability Engineering
Silicon Valley Group Lithography Systems

EXECUTIVE SUMMARY

Failure Reporting, Analysis, and Corrective Action System (FRACAS) is a closed-loop feedback path in which the user and the supplier work together to collect, record, and analyze both hardware and software failure data. The user captures predetermined types of data about all problems with a particular tool or software and submits the data to that supplier. A Failure Review Board (FRB) at the supplier site analyzes the failures, considering such factors as time, money, and engineering personnel. The resulting analysis identifies corrective actions that should be implemented and verified to prevent failures from recurring. A simplified version of this process is depicted in Figure 1.

[Figure 1: FRACAS Process. A closed loop in which Design & Production, In-house Test & Inspections, and Field Operations feed Failure Reporting, which feeds Analysis and Corrective Action; Test & Engineering Change Control closes the loop, yielding Reliable Equipment.]

Since factors vary among installation sites, equipment users must work closely with each of their suppliers to ensure that proper data is being collected, that the data is being provided to the correct supplier, and that the resulting solutions are feasible.

Unlike other reliability activities, FRACAS promotes reliability improvement throughout the life cycle of the equipment. The method can be used during in-house (laboratory) tests, field (alpha or beta site) tests, and production/operations to determine where problems are concentrated within the design of the equipment. According to the military,

  Corrective action options and flexibility are greatest during design evolution, when even major design changes can be considered to eliminate or significantly reduce susceptibility to known failure causes. These options and flexibility become more limited and expensive to implement as a design becomes firm. The earlier the failure cause is identified and positive corrective action implemented, the sooner both the producer and user realize the benefits of reduced failure occurrences in the factory and in the field. Early implementation of corrective action also has the advantage of providing visibility of the adequacy of the corrective action in the event more effort is required.1

In addition to its military use, FRACAS has been applied extensively in the aerospace, automotive, and telecommunications industries. Elements of FRACAS can be found throughout all industries, but this limited use usually means that the only data received by the manufacturer are complaints logged with the field service or customer service organizations.

FRACAS can provide control and management visibility for improving the reliability and maintainability of semiconductor manufacturing equipment hardware and software. Timely, disciplined use of failure and maintenance data can help generate and implement effective corrective actions to prevent failures from recurring and to simplify or reduce maintenance tasks.

1. Quoted from MIL-STD-2155(AS), Failure Reporting, Analysis and Corrective Action System.
FRACAS objectives include the following:

- Providing engineering data for corrective action
- Assessing historical reliability performance, such as mean time between failures (MTBF), mean time to repair (MTTR), availability, preventive maintenance, etc.
- Identifying patterns of deficiencies
- Providing data for statistical analysis

Also, FRACAS information can help measure contractual performance to better determine warranty information. FRACAS provides a complete reporting, analysis, and corrective action process that fulfills ISO 9000 requirements. More and more companies are requiring their suppliers to meet ISO 9000 and related assessments, for example, the SEMATECH Standardized Supplier Quality Assessment (SSQA) and the Motorola Quality Systems Review (QSR).

FAILURE REPORTING

All events (failures) that occur during inspections and tests should be reported through an established procedure that includes collecting and recording corrective maintenance information and times. The data included in these reports should be verified and then submitted on simple, easy-to-use forms that are tailored to the respective equipment or software.

Collecting the Data

Many problems go unnoticed because insufficient information is provided. The FRB must know if, for example, someone was able to duplicate the problem being reported. There are three common causes for missing essential data:

- Inspection or testing began before a procedure was in place to report problems.
- The reporting form was difficult to use.
- The person who filled out the form had not been trained.

Operators and maintenance personnel are usually the first to identify problems and, therefore, they should be trained to properly capture all of the information needed for an event report. The contents of this report are more fully described in the next section. However data is collected, one constant should be emphasized: consolidate all the data into a central data logging system.

When field failures occur, two types of data collection are usually present: (1) the user's system, which includes equipment utilization and process-related information, and (2) field service reports (FSRs), which include parts replacement information. The supplier has great difficulty addressing problems when FSRs are the only source of data. It is helpful if the user provides the supplier with detailed logs for problem verification. The supplier should correlate both types to help determine priorities for problems to be resolved.

The supplier should be included in the formation of the user's data collection system while the product is being installed. Ideally, the equipment should log all information automatically and then download the data to the supplier's data collection system. This would eliminate the need for paper forms and also the confusion caused by duplicate data sets.

Reporting Equipment Failures

Collecting and sharing appropriate data through event reports are essential components of an effective FRACAS process, both for the supplier and for the user. There are common elements in every report (when the event occurred, what item failed, etc.) that the user and the supplier both use to analyze failures.
Other crucial information includes the duration of the failure, the time it took to repair, and the type of metric used (time or wafer cycles). To simplify the effort of collecting data and to minimize any duplication of effort, use the following guidelines:

- Capture data at the time of the failure.
- Capture utilization data through the user's factory tracking system (for example, WorkStream).
- Avoid duplicating data collection between the supplier's and the user's systems.
- Make failure and utilization data available to both the supplier and the user in a standardized format.
- Exchange failure and utilization data between the supplier and user frequently.
- Link event reports to the corresponding utilization data in the factory tracking system through an identifying event number that is captured on the equipment.
- Automate both the user's and the supplier's systems to maximize efficiency, minimize paper tracking, and avoid dual reporting.

In addition, make full use of reporting tools that already exist (to which little or no modification may be necessary), such as

- In-house test reports
- Field service reports
- Customer activity logs
- Customer equipment history

Any other pertinent information can be identified by asking the design engineers what they need for root cause analysis. Data log files (history kept by the software) should accompany the failure report to provide software engineers with information about events that preceded the malfunction. From the data collected, minimum reliability performance metrics (such as MTBF, MTTR, and uptime percentage) can be determined for each system. Every piece of data helps.

In addition to these common elements, there are specific types of data needed only by the user or only by the supplier. The supplier should acquire detailed failure information from the event report; this information should become an integral part of the equipment log file. The details help determine the state of operation before and after the failure, which is necessary to address the root cause of equipment failures efficiently and effectively.

Table 1 shows the types of detail that the supplier needs to see in the event report (in-house or field). Figure 2 is an example of an event report form. The format of the form is important only insofar as it simplifies the task of the data recorder. You may want to computerize your data entry forms to expedite the process and to minimize failure description errors.

TABLE 1. EVENT REPORT FIELDS

Field | Example | Description
Serial Number | 001 | System-unique identifier that indicates who the customer is and where the system is located
Date and Time of Event | March 12, 1994 at 4:00 PM |
Duration of the Event (Downtime) | 01:21 | Usually in hours and minutes
Failed Item Description | Handler-Stocker-Robot-Motor | Describes function and location of failed item from top to bottom; that is, field to lowest replaceable unit
Reliability/Fault Code of Known Problems | AH-STK-ROT-MOT-021 | Known problem categorized by reliability code and failure mode number
Life Cycle Phase of the System | Field Operations | See equipment life cycle in the Glossary
Downtime Category (Maintenance Action) | Unscheduled | Usually shown as either unscheduled or scheduled (preventive maintenance)
Part Number | 244-PC | Part number of replaced item (if any)
Operator's Name | M. Lolita | Person reporting problem
Service Engineer's Name | S. Spade | Person performing repair action
Event Report Identification Number | FSR-002 | Supplier's report number
Event (Problem) Description | What happened? | Description of all conditions prior to failure and how the observer thinks it failed
Repair Action Description | What did you do to repair the item? | Description of any repair or maintenance attempts
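If you do computerize data entry, the fields of Table 1 map naturally onto a simple record type. The following is a minimal sketch of such a record; the class and field names are illustrative assumptions, not part of the SEMATECH FRACAS product.

from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class EventReport:
    """One downtime event, mirroring the fields of Table 1."""
    serial_number: str        # system-unique identifier, e.g., "001"
    event_time: datetime      # date and time the event occurred
    duration: timedelta       # downtime duration
    failed_item: str          # e.g., "Handler-Stocker-Robot-Motor"
    reliability_code: str     # e.g., "AH-STK-ROT-MOT-021"
    life_cycle_phase: str     # e.g., "Field Operations"
    downtime_category: str    # "unscheduled" or "scheduled"
    part_number: str          # replaced part, if any
    operator: str             # person reporting the problem
    service_engineer: str     # person performing the repair
    report_id: str            # supplier's report number, e.g., "FSR-002"
    problem_description: str  # conditions prior to failure
    repair_description: str   # repair or maintenance attempts

# Example record built from Table 1's sample values:
event = EventReport(
    serial_number="001",
    event_time=datetime(1994, 3, 12, 16, 0),
    duration=timedelta(hours=1, minutes=21),
    failed_item="Handler-Stocker-Robot-Motor",
    reliability_code="AH-STK-ROT-MOT-021",
    life_cycle_phase="Field Operations",
    downtime_category="unscheduled",
    part_number="244-PC",
    operator="M. Lolita",
    service_engineer="S. Spade",
    report_id="FSR-002",
    problem_description="Conditions prior to failure and suspected cause",
    repair_description="Repair or maintenance attempts",
)

A structure like this keeps every report complete by construction, which addresses the missing-data causes listed earlier.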
EVENT RECORD

Reliability Code: DMD-XS-HRN-M3S-002
Serial Number: 9003
Date: 09/01/93
Time: 09:00
Duration: 04:30
Down Time Category: Unscheduled
Life Cycle Phase: OPERATION
Relevant: Y
PM: N
System: DMD
Subsystem: XS
Assembly: HRN
Subassembly: M3S
Sub-subassembly:
Part Number: 650-000029
Operator: FRANK BELL
FSE Name:
FSE Number:
Problem: PREVIOUS ATTEMPT TO RESOLVE TRANSPORT LOAD FAILURE DID NOT WORK. REQUIRES SWITCH REPLACEMENT.
Repair Action: REPLACED SWITCH, CURED LENS 2, ALIGNED COLUMN, CHECKED JET ALIGN, MADE TEST MILLS AND DEPOS.

Figure 2. Example Event Report

The user analyzes this data to evaluate management of equipment and resources and to determine its effectiveness in meeting internal commitments. Internally, user data may be handled differently than described in SEMI E10-92, but as a minimum for sharing data with the supplier, the major states in Figure 3 should be followed.

[Figure 3: Equipment States Stack Chart. Total time (168 hrs/wk) divides into non-scheduled time and operations time; operations time divides into equipment uptime (productive, standby, engineering) and equipment downtime (scheduled and unscheduled).]

Reporting Software Problems

When reporting software problems, too much detail can be counterproductive. An error in software does not necessarily mean that there will be a problem at the system (equipment) level. To quote Sam Keene, 1992 IEEE Reliability Society president,

  If I'm driving on a two-lane road (same direction) and switch lanes without using the signal indicator and continue to drive, I've committed an error. But if I do the same when another car is on the other side and I happen to crash into it, then I now have a problem.

Your existing software error reporting process should become an integral part of your overall FRACAS. (Rather than reiterating the software reliability data collection process, a practical approach can be found in A.M. Neufelder's work.2) FRACAS helps you focus on those errors that the customer may experience as problems.

One focusing mechanism is to track the reason for a corrective action. This data can be collected by reviewing the repair action notes for each problem in the problem report. Engineering and management can categorize these reasons to determine the types of errors that occur most often and address them by improving the procedures that most directly cause a particular type of error. The process of analyzing this data is continuous. Among the reasons for corrective action are the following:

- Unclear requirements
- Misinterpreted requirements
- Changing requirements
- Documented design not coded
- Bad nesting
- Missing code

2. Refer to Chapter 7, Software Reliability Data Collection, of Ensuring Software Reliability, by A. M. Neufelder (Marcel Dekker, Inc., New York, 1993).
- Excess code
- Previous maintenance
- Bad error handling
- Misuse of variables
- Conflicting parameters
- Software or hardware environment change
- Specification error
- Problem with third-party software
- Documented requirements not designed
- Software environment error (for example, an error in the compiler)

Reasons should also be ranked in terms of the criticality or severity of the error. This information helps management predict those errors that will cause downtime (as opposed to all errors, including those that may not cause downtime). Severity rankings are:

1. Catastrophic
2. Severe
3. Moderate (has work-around)
4. Negligible
5. All others (for example, caused by a misunderstood new feature or unread documentation)

Also, tracking the modules or procedures modified for each corrective action helps schedule pre-release regression testing on these changes, which results in more efficient test procedures and more effective test results.

Logging Data

Whether software or equipment errors are being reported, once the supplier receives the reports, all data should be consolidated into one file. The supplier's reliability/quality organization should oversee the data logging. A FRACAS database is recommended for this part of the process. This is not a product-specific database but rather a database for tracking all products offered by a company. Figure 4 shows how the status of problems is kept updated in the database.

PROBLEMS RECORD

Title/Description: LIMIT FLA SENSORS FAIL
Reliability Code: ATF-STK-PRI-RT-XAX-001
Initial Status: I
System: ATF
Subsystem: STK
Assembly: PRI
Subassembly: RT
Sub-subassembly: XAX
Fault Category: Facilities Elec
Error Codes: errcode 2
Assignee: Eric Christie

DATES (ROOT CAUSE SOLUTION, RCS)
Insufficient Info: 11/04/93
Sufficient Info: / /
Defined: / /
Contained: / /
Retired: / /
Original Plan: / /
Current Plan: / /
Complete: / /
RCS Documentation:
Comments: Not enough information at this time. However, supplier is investigating vendor part #po09347-90.

STATISTICS
Current Status: I
First Reported: 11/04/93
Last Reported: 11/11/93
Number of Fails: 2
Number of Assists: 0
Number of Others: 0
Total Events: 0
Total Downtime Hours: 2.72

Figure 4. Problems Record

Using FRACAS Reports

The FRACAS database management system (DBMS) can display data in different ways. This section describes some of the report types you may find useful. A complete failure summary and other associated reports that this DBMS can provide are described in more detail in the FRACAS DATABASE MANAGEMENT SYSTEM section.

The graphical reports provide a quick snapshot of how the equipment or software is performing at any given point. Figure 5 shows the percentage rejection rate on a monthly basis as actual-versus-target rates. Figure 6 depicts the number of failure modes or problems on a weekly basis. Figure 7 points out the number of events being reported weekly by life cycle phase. Figure 8 provides a one-page status outline for tracking by the FRB. Whichever report you choose, it should be tailored to provide summaries and special reports for both management and engineering personnel.
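As one illustration of how the data behind a weekly trend chart such as Figure 6 can be produced from logged events, here is a minimal sketch that counts events per week for each reliability code. The record layout and function name are assumptions for illustration, not part of any particular FRACAS product.

from collections import Counter
from datetime import date

def weekly_event_counts(events):
    """Count events per (ISO year, ISO week, reliability code).

    `events` is an iterable of (event_date, reliability_code) pairs.
    """
    counts = Counter()
    for event_date, code in events:
        year, week, _ = event_date.isocalendar()
        counts[(year, week, code)] += 1
    return counts

# Example: three events across two weeks and two reliability codes.
events = [
    (date(1993, 8, 30), "DMD-OP-IB-IGA-SRC-002"),
    (date(1993, 8, 31), "DMD-VS-HC-001"),
    (date(1993, 9, 7), "DMD-OP-IB-IGA-SRC-002"),
]
for (year, week, code), n in sorted(weekly_event_counts(events).items()):
    print(f"{year}-W{week:02d}  {code}: {n}")

The same aggregation, grouped by life cycle phase instead of reliability code, yields the data for a chart like Figure 7.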
[Figure 5: Example, Percentage Rejection Rate, Actual Versus Target. A factory test tracking chart plotting monthly and three-month rolling initial-test reject percentages against a target rate of 8%, with monthly test volumes, for production HP3060 test.]

[Figure 6: Example, Number of Events Weekly. Events per week by reliability code (DMD-OP-IB-IGA-SRC-002, DMD-OP-IB-IGA-SRC-003, DMD-VS-HC-001) from 08/30/93 to 10/03/93.]

[Figure 7: Example, Number of Events Weekly by Life Cycle Phase. Events per week for Phase II and Phase III from 09/27/93 to 10/17/93.]

[Figure 8: Example, Failure/Repair Report. A four-part form (I Originator, II Verification, III Repair, IV Corrective Action) capturing project, failure date and time, subsystem and hardware/software tiers, detection area, failure description, environment, verification and analysis, item data, failure cause (design, workmanship, damage/mishandling, manufacturing process, part failure, adjustment, software, test error, equipment failure, documentation, etc.), accumulated maintenance, repair labor, corrective action taken, disposition, effectivity, safety, and engineering and reliability sign-offs, with copy distribution to the PFA center, project office, action copy, and failed item.]

ANALYSIS

Failure analysis is the process that determines the root cause of the failure. Each failure should be verified and then analyzed to the extent that you can identify the cause of the failure and any contributing factors. The methods used can range from a simple investigation of the circumstances surrounding the failure to a sophisticated laboratory analysis of all failed parts.

Failure Analysis Process

Failure analysis begins once an event report is written and sent to the FRB. It ends when you sufficiently understand the root cause so you can develop logically derived corrective actions. It is important to clearly communicate the intent and structure of the failure analysis process to all appropriate organizations. These organizations should review and approve the process to confirm their cooperation.
During and after the analysis, the problem owner, associated FRB member, and reliability coordinator must ensure that the database is maintained with the current applicable information. The analysis process is:

1. Review, in detail, the field service reports.
2. Capture historical data from the database of any related or similar failures.
3. Assign owners for action items.
4. Do a root cause analysis (RCA).
5. Develop corrective actions.
6. Obtain the failed items for RCA (as needed).
7. Write a problem analysis report (PAR) and, if needed, a material disposition report (MDR).

Developing proper forms for tracking the failed parts and reporting the problem analysis results is essential. Figures 9 and 10 are examples of these types of forms.

[Figure 9: MDR Form. A material disposition report capturing serial number, failure date, rework order number, returned part description and part number, quantity, FSR number, reason for return (defective, engineering change, overissue, overdrawn, other), the equipment and site the part was used on, material disposition instructions (return to stock, send to FRB for failure analysis, send for vendor failure analysis, scrap, return to vendor, send for repair), reliability information (task number, FRB member, originator, dates), a defect description filled in by the originator, and distribution (reliability, MDC, field service, quality).]

[Figure 10: PAR Form. A problem analysis report capturing problem number, error code, problem title, FRB member, assignee, type of problem (common or unique), category (workmanship, procedure, design), module code, current fix date/software revision, status milestones with dates (insufficient information, sufficient information, defined, contained, retired/closed), definition of problem, containment, root cause, corrective action, verification, and documentation (ECN, FRI revision), plus the date entered in the database.]

Failure Review Board

The FRB reviews failure trends, facilitates and manages the failure analysis, and participates in developing and implementing the resulting corrective actions. To do these jobs properly, the FRB must be empowered with the authority to require investigations, analyses, and corrective actions by other organizations. The FRB has much in common with the techniques of quality circles; they are self-managed teams directed to improve methods under their control (see Figure 11).

[Figure 11: FRB Process. The FRB meets regularly; the reliability coordinator brings in-house and field reports for review; the FRB establishes preliminary assignments and, if no additional data is required, confirms problem assignments within 24 hours. If sufficient information exists, problems are entered into the DBMS and PARs are written; otherwise, after a suitable period without recurrence, the problem is retired. The FRB assigns a problem owner (design problems, operator/field services, manufacturing issues, software problems, others), establishes a containment and RCA plan and schedule, updates the DBMS and distributes updates to the FRB, and reviews and ensures a reliable solution; when RCA is complete, the ECN process begins.]
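The dated problem statuses that this process drives each record through (seen earlier in Figure 4 and later in the Problems Table) can be modeled as a simple ordered progression. A minimal sketch follows; the class and function names are illustrative assumptions, not part of any FRACAS product.

from datetime import date

# Ordered problem statuses, matching the Problems Record date fields.
STATUSES = ["Insufficient Info", "Sufficient Info", "Defined", "Contained", "Retired"]

class Problem:
    def __init__(self, reliability_code):
        self.reliability_code = reliability_code
        self.history = {}  # status -> date the milestone was reached

    def advance(self, status, when):
        """Record a status milestone; statuses must be reached in order."""
        expected = STATUSES[len(self.history)]
        if status != expected:
            raise ValueError(f"expected {expected!r}, got {status!r}")
        self.history[status] = when

    @property
    def current_status(self):
        return STATUSES[len(self.history) - 1] if self.history else None

p = Problem("ATF-STK-PRI-RT-XAX-001")
p.advance("Insufficient Info", date(1993, 11, 4))
p.advance("Sufficient Info", date(1993, 11, 18))
print(p.current_status)  # Sufficient Info

Enforcing the order in code mirrors the FRB discipline: a problem cannot be contained before it is defined, nor retired before it is contained.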
The makeup of the FRB and the scope of authority for each member should be identified in the FRACAS procedures. The FRB is typically most effective when it is staffed corporate-wide, with all functional and operational disciplines within the supplier organization participating. The user may be represented also. Members should be chosen by function or activity to let the composition remain dynamic enough to accommodate personnel changes within specific activities.

The FRB composition may be product specific, if desired. For example, a supplier's different products may be complex and therefore require expert knowledge. In this scenario, though, overall priorities must be managed carefully, since the FRB may require the same expert resources for root cause analysis. Generally, the following organizations are represented on the FRB:

- Reliability
- Quality Assurance
- Field Service
- Manufacturing
- Statistical Methods
- Marketing
- Systems Engineering
- Test
- Design Engineering*

* Hardware, software, process, and/or materials design, depending on the type of system being analyzed.

The FRB should be headed by the reliability manager. One of the main functions of this reliability manager is to establish an effective FRACAS organization. The reliability manager must establish procedures, facilitate periodic reviews, and coach the FRB members. Other responsibilities include:

- Assign action items with ownership of problem/solution (who/what/when)
- Allocate problems to the appropriate functional department for corrective action
- Conduct trend analysis and inform management on the types and frequency of observed failures
- Keep track of failure modes and criticality
- Issue periodic reports regarding product performance (for example, MTBF, MTTR, productivity analysis) and areas needing improvement
- Ensure problem resolution
- Continually review the FRB process and customize it to fit specific product applications

The FRB operates most efficiently with at least two recommended support persons. A reliability coordinator can procure failed items for root cause analysis, ensure that the event reports submitted to the FRB contain adequate information, stay in contact with internal and external service organizations to ensure clarity on service reports, and track and facilitate RCA for corrective action implementation. A database support person should be responsible for providing periodic reports and for maintaining the database (importing/exporting data, backing up files, archiving old records). This will ensure that data available from RCA and corrective action is kept current in the database.

Responsibilities of the FRB members include the following:

- Prepare a plan of action with schedule
- Review and analyze failure reports from in-house tests and field service reports
- Identify failure modes to the root cause

Furthermore, each FRB member should

- Have the authority to speak for and accept corrective action responsibility for an organization
- Have thorough knowledge of front-end systems
- Have the authority to ensure proper problem resolution
- Have the authority to implement RCA
- Actively participate

The FRB should meet on a regular basis to review the reported problems.
The frequency of these meetings will depend on the number of problems being addressed or on the volume of daily event reports. Problem-solving skills and effective meeting techniques can ensure that meetings are productive. If you need assistance in these areas, SEMATECH offers short courses3 that are affordable and can be customized for your application. The top problems, based on Pareto analyses (statistical methods) derived from the database, should be used to assign priorities and to allocate resources (see the sketch at the end of this section). The reporting relationship of the FRB with other FRACAS functions is shown in Figure 12.

3. Contact the SEMATECH Organization Learning and Performance Technology Department, 2706 Montopolis Drive, Austin, TX 78741, (512) 356–7500.

[Figure 12: FRACAS Functions Responsibilities. The reliability manager oversees the Failure Review Board, which is supported by the reliability coordinator and database support and directs problem owners for in-house test, inspections, and field operations.]

Root Cause Analysis

In failure analysis, reported failures are evaluated and analyzed to determine the root cause. In RCA, the causes themselves are analyzed and the results and conclusions are documented. Any investigative or analytical method can be used, including the following:

- Brainstorming
- Histogram
- Flow chart
- Force field analysis
- Pareto analysis
- Nominal group technique
- FMEA4
- Trend analysis
- Fault tree analysis
- Cause and effect diagram (fishbone)

These tools are directly associated with the analysis and problem-solving process.5 Advanced methods, such as statistical design of experiments, can also be used to assist the FRB. The Canada and Webster divisions of Xerox Corporation extensively use design of experiments as their roadmap for problem solving;6 they find this method helps their FRBs make unbiased decisions. Any resulting corrective action should be monitored to ensure that causes are eliminated without introducing new problems.

Failed Parts Procurement

A failed parts procurement process should be established to assist the FRB in failure analysis. Failed parts may originate from field, factory, or company stock, or they may become defective during shipment. Each of these scenarios requires different action. The procedure should clearly define the process, including the timeline and responsibilities for at least the following tasks:

- Shipment of failed parts from field to factory
- Procurement of failed parts from various factory locations
- Failed part reports from each location
- Communications between material disposition control and reliability engineering
- Sub-tier supplier failure analysis
- Procurement of sub-tier supplier PAR and implementation of corrective action
- Disposition of failed part after RCA is completed

4. Failure Mode and Effects Analysis. More information is available through Failure Mode and Effects Analysis (FMEA): A Guide for Continuous Improvement for the Semiconductor Industry, which is available from SEMATECH as technology transfer #92020963A-ENG.
5. Analysis and problem-solving tools are described in Partnering for Total Quality: A Total Quality Toolkit, Vol. 6, which is available from SEMATECH as technology transfer #90060279A-GEN.
6. A fuller description of the Xerox process can be found in Proceedings of Workshop on Accelerated Testing of Semiconductor Manufacturing Equipment, which is available from SEMATECH as technology transfer #91050549A–WS.
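As a sketch of the Pareto-based prioritization mentioned under Failure Review Board above, the following ranks problems by total downtime so the FRB can focus resources on the largest contributors. The data values and field names are illustrative assumptions.

from collections import defaultdict

def pareto_rank(events):
    """Rank reliability codes by total downtime hours, descending.

    `events` is an iterable of (reliability_code, downtime_hours) pairs.
    """
    totals = defaultdict(float)
    for code, hours in events:
        totals[code] += hours
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hypothetical event log: (reliability code, downtime hours).
events = [
    ("ATF-STK-PRI-RT-XAX-001", 2.72),
    ("DMD-XS-HRN-M3S-002", 4.5),
    ("DMD-VS-HC-001", 1.0),
    ("DMD-XS-HRN-M3S-002", 3.0),
]
grand_total = sum(hours for _, hours in events)
cumulative = 0.0
for code, hours in pareto_rank(events):
    cumulative += hours
    print(f"{code}: {hours:.2f} h ({100 * cumulative / grand_total:.0f}% cumulative)")

The cumulative percentage column is what gives a Pareto chart its shape: typically a small number of codes account for most of the downtime, and those are the problems to assign first.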
CORRECTIVE ACTION

When the cause of a failure has been determined, a corrective action plan should be developed, documented, and implemented to eliminate or reduce recurrences of the failure. The implementation plan should be approved by a user representative, and appropriate change control procedures should be followed. The quality assurance organization should create and perform incoming test procedures and inspection of all redesigned hardware and software. SEMATECH has developed an Equipment Change Control System7 that you may find useful (see Figure 13).

To minimize the possibility of an unmanageable backlog of open failures, all open reports, analyses, and corrective action suspension dates should be reviewed to ensure closure. A failure report is closed out when the corrective action is implemented and verified, or when the rationale is documented (and approved by the FRB) for any instances that are being closed without corrective action.

7. Refer to Equipment Change Control: A Guide for Customer Satisfaction, which is available from SEMATECH as technology transfer #93011448A–GEN.

An effective change control system incorporates the following characteristics.

- An established equipment baseline exists with current and accurate
  - Design specs and drawings
  - Manufacturing process instructions
  - Equipment purchase specifications
- All changes are identified by using
  - The equipment baseline
  - Calibration systems
  - Raw materials supplier change control
  - Traceability systems
  - Procurement methods and procedures
  - Materials review process
  - Incoming quality control
- All changes are recorded, forecasted, and tracked in an equipment change database incorporating
  - Name of change
  - Change description
  - Reason for change
  - All equipment affected
  - Retrofit required/optional
  - Concerns/risks
  - Qualification data
  - Implementation plan
  - Start date
  - Completion date
  - Status of change
- Proper investigation of all changes is performed by a Change Evaluation Committee made of experts from
  - Equipment manufacturing
  - Equipment design
  - Software design
  - Quality and reliability
  - Service and maintenance
  - Environment, health, and safety
  - Procurement
  - Process engineering
  - Marketing
- A Change Evaluation Committee is chartered to
  - Validate change benefits using statistical rigor as required
  - Specify change qualification and implementation plans
  - Manage and drive change implementation schedules
  - Update equipment specifications and drawings
  - Maintain the equipment change database
  - Communicate with the customer

In summary, a change control system is functioning well when it systematically provides

- Identification of all changes relative to an equipment baseline
- Recording, forecasting, and tracking of all changes
- Proper investigation of all changes
- Communication to the customer

Figure 13. Equipment Change Control Characteristics
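A minimal sketch of the kind of equipment change record Figure 13 calls for, with fields taken from the checklist above; the class itself and its sample values are illustrative assumptions.

from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class EquipmentChange:
    """One entry in an equipment change database (per Figure 13)."""
    name: str
    description: str
    reason: str
    equipment_affected: list        # serial numbers of affected tools
    retrofit_required: bool
    concerns_risks: str
    qualification_data: str
    implementation_plan: str
    start_date: date
    completion_date: Optional[date] = None
    status: str = "forecast"        # e.g., forecast, in progress, complete

change = EquipmentChange(
    name="Transport load switch upgrade",
    description="Replace transport load switch with vendor revision B",
    reason="Recurring transport load failures",
    equipment_affected=["9003"],
    retrofit_required=True,
    concerns_risks="Requires column realignment after installation",
    qualification_data="Test mills and depositions after replacement",
    implementation_plan="Install at next scheduled maintenance window",
    start_date=date(1993, 11, 15),
)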
FRACAS DATABASE MANAGEMENT SYSTEM

The FRACAS Database Management System (DBMS) facilitates failure reporting to establish a historical database of events, causes, failure analyses, and corrective actions. This pool of knowledge should minimize the recurrence of particular failure causes.

This DBMS is not product specific. It is available from SEMATECH8, or you can implement the same concept in whatever software you choose. However, the database should have the same software requirements and characteristics at both the supplier's and user's sites. This section describes characteristics of the SEMATECH product. If you are creating your own implementation, you should develop the following aspects; the sections that follow describe each more fully:

- Configuration Table
- Events Table
- Problems Table
- Reports

8. Available from the SEMATECH Total Quality division. Software transfer and accompanying documentation in press.

Configuration Table

The configuration table accepts information about each machine or software instance. There are six standard fields, as shown in Table 2. In the SEMATECH FRACAS database, there are seven optional fields that can be customized for your purposes, such as customer identification number, software revision, process type, engineering change number, site location, etc.

TABLE 2. STANDARD FIELDS OF THE CONFIGURATION TABLE

Field Name | Description
Serial Number | User input.
Site | Enter the location of the machine.
System | User input.
Phase | User input. Enter the current Life Cycle Phase of the machine. Required field.
Phase Date | User input. Enter the date the machine went into the current Life Cycle Phase.
Weekly Operational Time | User input. Enter the number of hours the machine is scheduled to be in operation.

Events Table

The primary purpose of the FRACAS application is to record downtime events for a particular system. The fields of the Events Table are described in Table 3. To understand these fields fully, you need to understand the relationship between the problem fields (System, Subsystem, Assembly, Subassembly, and/or Sub-subassembly) and the reliability code. The reliability code field contains one or more of the problem field names. Data (if any) from the problem fields automatically helps you select or create the reliability code. Once you select a code, associated fields are updated automatically.
TABLE 3. EVENTS TABLE FIELD DESCRIPTIONS

Field Name | Description
Serial Number | User input. The serial number of the machine having the event.
Date | User input. The date the event occurred. Enter in MM/DD/YY format.
Time | User input. The starting time of the event. Entered in 24-hour time (HH:MM:SS). Seconds are not required.
Duration | User input. The duration of the event. Entered in HH:MM:SS format. Seconds are not required.
System | Automatic input based upon the serial number.
Subsystem | User input. The optional subsystem code (if the part is at a subsystem level) involved in the event. Related to the System codes.
Assembly | User input. The optional assembly code (if the part is at an assembly level) involved in the event. Related to the selected System and Subsystem codes.
Operator | User input. Name of the operator when the event occurred.
FSE Name | User input. Name of the field service engineer who responded to the event.
FSR# | User input. The field service report number.
Subassembly | User input. The optional subassembly code (if the part is at a subassembly level) involved in the event. Related to the selected System, Subsystem, and Assembly codes.
Sub-subassembly | User input. The optional sub-subassembly code (if the part is at a sub-subassembly level) involved in the event. Related to the selected System, Subsystem, Assembly, and Subassembly codes.
Reliability Code | User input. The problem reliability code. This is validated against current reliability codes. A new one may be added by selecting "New" from the pop-up list box.
Life Cycle Phase | The current phase of the system. Defaults to the current phase from the Configuration file.
Down Time Category | User input. Downtime code, either scheduled or unscheduled. Scheduled downtime should be used for holidays, weekends, or any other time when the machine is not scheduled to be in operation.
Part Number | User input. The machine part number.
Relevant Failure | User input. Whether the failure was relevant or not. Y for yes; N for no.
PM | User input. Whether the downtime was due to preventive maintenance. Y for yes; N for no.
Problem | User input. The description of the problem. This can be any number of characters.
Repair Action | User input. The description of the repair action. This can be any number of characters.

Problems Table

In the SEMATECH implementation, the Problem Record allows you to view, modify, or add records for problems. The fields of the Problems Table are described in Table 4.

TABLE 4. PROBLEMS FIELD DESCRIPTIONS

Field Name | Description
System | User input from Events screen. If adding from the Events screen, automatic input based upon the serial number.
Subsystem | User input. Optional subsystem code (if the part is at subsystem level) involved in the event. Related to the System code. See also Reliability Code.
Assembly | User input. Optional assembly code (if the part is at assembly level) involved in the event. Related to the selected System and Subsystem codes. See also Reliability Code.
Subassembly | User input. Optional subassembly code (if the part is at subassembly level) involved in the event. Related to the selected System, Subsystem, and Assembly codes. See also Reliability Code.
Sub-subassembly | User input. Optional sub-subassembly code (if the part is at sub-subassembly level) involved in the event. Related to the selected System, Subsystem, Assembly, and Subassembly codes. See also Reliability Code.
Reliability Code | Automatically generated from the above fields. A sequence number will be added to the code.
Title/Description | User input. Simple title or description of the event.
Initial Status | User input. Required. Status of the problem when first reported. If the problem is being added from the Event screen, the date of the event is entered automatically as the initial status date.
Fault Category | User input. Event category.
Error Codes | User input. Error code for the selected tool.
Assignee | User input. Staff member assigned to this problem.
Insufficient Information | Date when the insufficient information status was declared.
Sufficient Information | Date when the sufficient information status was declared.
Defined | Date when the defined status was declared.
Contained | Date when the contained status was declared.
Retired | Date when the retired status was declared.
Root Cause Solution – Original | Original (planned) date for fixing the problem.
Root Cause Solution – Current | Current date for fixing the problem.
Complete | Completion date for fixing the problem.
RCS Documentation | User input. Points to documentation of the fix; for example, Engineering Change Numbers (ECNs).
Comments | User input. Memo area for problem comments. Can be any number of characters.
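For readers building their own implementation, the following is a minimal sketch of the three tables in SQL (via Python's sqlite3), along with a query that derives MTBF and MTTR from logged events. The table and column names are simplified, illustrative assumptions drawn from Tables 2 through 4; the SEMATECH product's actual schema may differ, and the MTBF formula shown is one common approximation (SEMI E10-92 defines the precise time buckets).

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE configuration (
    serial_number TEXT PRIMARY KEY,
    site TEXT,
    system TEXT,
    phase TEXT NOT NULL,          -- current life cycle phase (required)
    phase_date TEXT,
    weekly_operational_time REAL  -- scheduled hours per week
);
CREATE TABLE events (
    event_id INTEGER PRIMARY KEY,
    serial_number TEXT REFERENCES configuration(serial_number),
    event_date TEXT,
    duration_hours REAL,
    reliability_code TEXT,
    downtime_category TEXT,       -- 'scheduled' or 'unscheduled'
    relevant TEXT,                -- 'Y' or 'N'
    problem TEXT,
    repair_action TEXT
);
CREATE TABLE problems (
    reliability_code TEXT PRIMARY KEY,
    title TEXT,
    current_status TEXT,
    assignee TEXT
);
""")

# Example: one machine, two relevant unscheduled failures in a 4-week window.
conn.execute("INSERT INTO configuration VALUES ('9003', 'Austin', 'DMD', "
             "'Production and operation', '1993-09-01', 120)")
conn.executemany(
    "INSERT INTO events (serial_number, event_date, duration_hours, reliability_code, "
    "downtime_category, relevant, problem, repair_action) VALUES (?,?,?,?,?,?,?,?)",
    [("9003", "1993-09-01", 4.5, "DMD-XS-HRN-M3S-002", "unscheduled", "Y",
      "Transport load failure", "Replaced switch"),
     ("9003", "1993-09-20", 1.0, "DMD-VS-HC-001", "unscheduled", "Y",
      "Sensor fault", "Reseated connector")],
)

# Approximate MTBF over 4 weeks = (scheduled hours - downtime) / relevant failures;
# MTTR = mean downtime per relevant unscheduled failure.
row = conn.execute("""
    SELECT (4 * MAX(c.weekly_operational_time) - SUM(e.duration_hours)) / COUNT(*) AS mtbf,
           AVG(e.duration_hours) AS mttr
    FROM events e JOIN configuration c USING (serial_number)
    WHERE e.relevant = 'Y' AND e.downtime_category = 'unscheduled'
""").fetchone()
print(f"MTBF = {row[0]:.1f} h, MTTR = {row[1]:.1f} h")

Keeping events and problems in separate tables, joined by the reliability code, is what lets one chronic problem accumulate many events while remaining a single record for the FRB to track.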
Report Types

Your reporting mechanism for this database should be flexible enough to accommodate various customer types and their needs. For example, you need to be able to pull reports for an FRB analysis as well as for the users, sub-tier suppliers, and those who are addressing corrective actions. At a minimum, the following reports should be available:

- Trend charts to show the number of events during a time period.
- Event history to show a list of events during a time period.
- Problem history to show a list of problems during a time period.
- Reliability statistics to demonstrate the system performance in terms of MTTR, MTBF, mean time to failure (MTTF), availability, etc.

GLOSSARY

Corrective Action: A documented change to design, process, procedure, or material that is implemented and proven to correct the root cause of a failure or design deficiency. Simple part replacement with a like item is not considered corrective action.

Equipment Life Cycle: The sequence of phases or events constituting total product existence. The equipment life cycle is divided into the following six phases: concept and feasibility; design; prototype (alpha-site); pilot production (beta-site); production and operation; and phase-out.

Error: Human action that results in a fault (the term is usually reserved for software).

Error Code: Numbers and/or letters reported by the equipment's software that represent an error type; the code helps determine where the fault may have originated.

Failure: An event in which an item does not perform its required function within the specified limits under specified conditions.

Failure Analysis: A determination of failure cause made by use of logical reasoning from examination of data, symptoms, available physical evidence, and laboratory results.

Failure Cause: The circumstance that induces or activates a failure. Examples of a failure cause are defective soldering, design weakness, assembly techniques, and software error.

Failure Mode: The consequence of the failure mechanism (the physical, chemical, electrical, thermal, or process event that results in failure) through which the failure occurs; for example, short, open, fracture, excessive wear.

Failure Review Board (FRB): A group consisting of representatives from appropriate organizations with the level of responsibility and authority to assure that failure causes are identified and corrective actions are effected.

Fault: The immediate cause of failure. A manifestation of an error in software (a bug), if encountered, may cause a failure.

Fault Code: The type of failure mechanism, categorized to assist engineering in determining where the fault may have originated. For example, a fault may be categorized as electrical, software, mechanical, facilities, human, etc.

FMEA: Failure Modes and Effects Analysis. An analytically derived identification of the conceivable equipment failure modes and the potential adverse effects of those modes on the system and mission.

FRACAS: Failure Reporting, Analysis, and Corrective Action System. A closed-loop feedback path by which both hardware and software failures are collected, recorded, analyzed, and corrected.
Laboratory Analysis: The determination of a failure mechanism using destructive and nondestructive laboratory techniques such as X-ray, dissection, spectrographic analysis, or microphotography.

Mean Time to Repair (MTTR): A measure of maintainability. The sum of corrective maintenance times at any specific level of repair, divided by the total number of failures within an item repaired at that level, during a particular interval under stated conditions.

Mean Time Between Failures (MTBF): A measure of reliability for repairable items. The mean number of life units during which all parts of the item perform within their specified limits, during a particular measurement interval under stated conditions.

Mean Time to Failure (MTTF): A measure of reliability for nonrepairable items. The total number of life units of an item divided by the total number of failures within that population, during a particular measurement interval under stated conditions.

Related Event: A failure that did not cause downtime. A related event or failure is discovered while investigating a failure that caused downtime. A related event could also be a repair or replacement during downtime caused by the main event.

Relevant Failure: An equipment failure that can be expected to occur in field service.

Reliability: The duration or probability of failure-free performance under stated conditions.

Reliability Code: A functional, traceable description of the relationships that exist in the equipment. Codes are derived in a tree fashion from top to bottom as System, Subsystem, Assembly, Subassembly, and Sub-subassembly (typically the lowest replaceable component). Codes vary in size; however, a maximum of three alphanumeric characters per level is recommended. For example, AH-STK-ROT-XAX-MOT for Automated Handler-Stocker-Robot-X_Axis-Motor.

Repair or Corrective Maintenance: All actions performed, as a result of failure, to restore an item to a specified condition. These can include localization, isolation, disassembly, interchange, reassembly, alignment, and checkout.

Root Cause: The reason for the primary and most fundamental failures, faults, or errors that have induced other failures and for which effective permanent corrective action can be implemented.

Uptime: Time when the equipment is in a condition to perform its intended function. It does not include any portion of scheduled downtime or nonscheduled time.

Glossary References:

MIL-STD-721C, Definitions of Terms for Reliability and Maintainability (1981).
Omdahl, T. P., Reliability, Availability, and Maintainability (RAM) Dictionary, ASQC Quality Press (1988).
SEMI E10-92, Guideline for Definition and Measurement of Equipment Reliability, Availability, and Maintainability (RAM), SEMI (1993).
FRACAS Functions and Responsibilities

Description | Function | Responsibilities
Failure | Operators | Identify problems. Call for maintenance. Annotate incident.
Failure | Maintenance | Correct problem. Log failure.
Failure Report | Maintenance | Generate report with supporting data (time, place, equipment, etc.).
Data Logging | Reliability | Log all failure reports. Validate failures and forms. Classify failures (inherent, induced, false alarm, etc.).
Failure Review | Failure Review Board | Determine failure trends. Prepare plans for action. Identify failures to the root cause.
Failure Analysis | Reliability and/or Problem Owners | Review operating procedures for error. Procure failed parts. Decide which parts will be destructively analyzed. Perform failure analysis to determine root cause.
Corrective Action | Quality | Inspect incoming test data for item.
Corrective Action | Design | Redesign hardware/software, if necessary.
Corrective Action | Sub-tier Supplier | Prepare and provide new part or test procedure, if necessary.
Corrective Action | Quality | Evaluate incoming test procedures. Inspect redesigned hardware/software.
Post-data Review | Reliability | Close loop by collecting and evaluating post-test data for recurrence of failure.

Technology Transfer #94042332A-GEN
SEMATECH, 2706 Montopolis Drive, Austin, Texas 78741, (512) 356–3500
Copyright 1993–1994, SEMATECH, Inc.