Enhancement of tape recorded voices to facilitate transcription

Enhancement of tape recorded voices to facilitate transcription & aural identification: Selected Topics in Forensic Voice Identification
Bruce E. Koenig, October 1993
Federal Bureau of Investigation

Ongoing law enforcement operations throughout the world are continually capturing the voices of suspects with miniature transmitter/receiver systems, analog and digital on-the-body recorders, telephone intercept devices, and concealed room microphones. Since these recordings are normally used for investigative leads and/or legal proceedings, specific speakers must be accurately identified. Voice identifications that occur through self-recognition of one's voice, eyewitness information, surveillance logs, and the use of a person's name in the conversation are usually readily accepted. However, voice identifications that involve listening only and/or laboratory tests are often more difficult to evaluate accurately. To provide a better understanding of these voice comparison topics, two types of aural-only comparisons will be discussed, and an update on the spectrographic technique is included.

Aural Identification of Familiar Voices

Recognition of familiar voices is a daily occurrence for most people, as they identify spouses, children, coworkers, friends, and business associates after only a few words spoken over the telephone or by hearing them from an adjacent room. This process involves long-term memory, where recognition occurs through a prior knowledge of speech characteristics, including such attributes as accent, speech rate, pronunciation, pitch, vocabulary, and vocal variance (intraspeaker variability). Some of the relevant scientific research and opinions that address the accuracy of identifying familiar voices include the following:

1. Researchers used 7 listeners who were familiar with the 16 chosen speakers through daily contact. The speakers had no pronounced speech defects or accents.
Groups of two to eight speech samples of varying lengths were played back to the listeners, which resulted in an identification accuracy of better than 95% for samples lasting from about 1 to 2 seconds. Voice samples were also frequency restricted, but the results reflected only a limited loss of accuracy under conditions normally encountered in law enforcement investigations. In tests involving whispered speech, the duration had to be more than three times that of normal speech samples to obtain equivalent levels of identification (Pollack et al. 1954).

2. Sixteen listeners with no hearing losses, who had known the 10 recorded male coworkers for at least 2 years, were chosen. None of the 10 recorded individuals had either pronounced regional accents or speech abnormalities. When the listeners heard sentences of less than 3 seconds duration from the 10 coworkers, their median accuracy rate of identification was 98% (range of 92% to 100%). When only a disyllable (e.g., mama) was spoken, the median accuracy rate dropped to 88% (range of 73% to 98%) (Bricker and Pruzansky 1966).

3. In a study of coworkers, recordings were made on different telephone lines of four women and seven men, each talking for 30 seconds to 1 minute on a neutral topic such as the weather. An additional recording was prepared of another male, who was relatively unfamiliar to most of the listeners. The recordings were arranged in a random order and played to 10 of the other coworkers, who were asked to identify the speakers. "All the listeners except one correctly identified all the 11 [coworkers]... The one listener who made an error... confused two speakers who were not well known to him. Three of the 10 listeners knew [the eighth male, who was not a coworker], and correctly identified him. Of the remaining seven listeners, only two said that they could not recognize this speaker. Five listeners wrongly identified this speaker as..." another one of their coworkers. "It is worth noting that four of the five listeners who made the wrong identification were highly skilled, experienced phoneticians..." with doctoral degrees in the field (Ladefoged 1978). This experiment reflects a 100% identification rate for the coworkers' voices that were well-known to them and an overall average accuracy rate of 96% when the relatively unfamiliar voice was added.

4. Twenty-four individuals were asked to listen to speech samples of 24 coworkers (15 males and 9 females) whom they had known for several years and 4 speakers unknown to the listeners. The speech samples averaged about 30 seconds in length and contained at least 12 utterances of 2 to 4 words each. Listeners rated each coworker on a scale of very familiar to totally unfamiliar prior to the testing. They listened to the samples for as long as they wished and then rated their decisions as follows: (1) guessing, (2) fairly sure, or (3) very sure. Deleting the results of any voice rated totally unfamiliar to the listener, the results showed a 90.4% correct identification rate and 4.3% incorrect identification rate, with 5.3% who said they did not know the speaker. If the 5.3% are deleted, the correct identification rate is 95.4%. "This rate is probably fairly representative of situations where a limited vocabulary is required and can be expected to be even higher in informal conversations where more of the individual speaker's speech habits are present as cues for identification" (Schmidt-Nielsen and Stern 1985).

This research reflects that the identification accuracy rate for familiar voice samples lasting 1 second or longer ranged from 92% to 100% and averaged 95% to 100%. Samples recorded through the telephone or other limited bandwidth systems had little effect on accuracy. The effects of noise and loss of high frequency information were studied in another experiment (Clarke et al.
1966) which found that aural speaker identification was only slightly degraded when progressing from high-quality voice samples to typical investigative recordings. It is obvious from everyday experience and the cited research that identifying familiar voices can be an accurate method for identifying voices recorded in forensic applications, even with the limiting factors of noise and attenuated high frequencies.

Aural comparisons of unfamiliar voice samples rely on short-term memory. For example, a woman receives a number of different telephone inquiries regarding a classified advertisement. She then receives an obscene telephone call, and she tries to remember if any of the voices match. In a judicial proceeding, a judge and/or a jury may have to decide if a particular crucial comment on an investigative recording was spoken by the defendant, who readily admits to saying the other statements attributed to him on the transcript, or by someone else involved in the conversation. Examiners using the spectrographic technique, described later, compare the separate voice samples either by playing them back concurrently on separate devices or computer files, with an electronic patching arrangement that allows rapid aural switching between them, or by recording short phrases or sentences from each sample on the same recording (Voice Comparison Standards 1991). The de facto study of unfamiliar voice comparisons (Clarke et al. 1966) determined the following:

1. Sentence length over the range of 5 to 11 syllables is not an important variable in identification accuracy.
2. Correct identifications decreased from approximately 90% to 80% when the signal-to-noise ratio (SNR) was reduced from 30 decibels (dB) to 0 dB.
3. Correct identifications decreased from approximately 88% to 78% when the frequency response was reduced from 4,500 hertz (Hz) to 1,000 Hz.
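For readers unfamiliar with the decibel figures in the findings above, the following is a minimal sketch of how an SNR in dB is computed from signal and noise power, using the standard definition 10 * log10(signal power / noise power). The function names and the constant-amplitude test data are hypothetical and chosen only for illustration; they are not taken from the cited study.

```python
import math

def mean_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)

def snr_db(signal, noise):
    """SNR in decibels: 10 * log10(signal power / noise power)."""
    return 10.0 * math.log10(mean_power(signal) / mean_power(noise))

# Hypothetical stand-ins for speech and noise (not real recordings).
speech = [1.0, -1.0] * 100
equal_noise = [1.0, -1.0] * 100                              # same power as the speech
quiet_noise = [s / math.sqrt(1000.0) for s in equal_noise]   # 1/1000 of the power

print(round(snr_db(speech, equal_noise), 1))  # 0.0 dB: noise as strong as the speech
print(round(snr_db(speech, quiet_noise), 1))  # 30.0 dB: speech power 1000x the noise
```

An SNR of 0 dB therefore means the noise carries as much power as the speech, while 30 dB means the speech power is 1,000 times the noise power.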
Since most investigative recordings have an SNR of 10 dB to 40 dB and a frequency response of 2,500 Hz to 5,000 Hz, the range of expected correct identifications of unfamiliar voices would be 78% to 90%, with most identifications in the 78% to 83% range. The use of expert testimony for aural identifications of unfamiliar voices provides no assistance to the court and/or to the jury. The notes of the advisory committee on Rule 901 of the Federal Rules of Evidence appropriately reflect this fact as follows: "Since aural voice identification is not a subject of expert testimony, the requisite familiarity may be acquired either before or after the particular speaking which is the subject of the identification..." (Federal Criminal Code and Rules 1991). Additionally, the voice comparison standards of the International Association for Identification (IAI) specifically state that it "... does not support or approve the use of... aural only expert decisions..." for voice comparisons (1991).

Spectrographic Comparisons

The spectrographic laboratory technique is the most well-known and possibly the most accurate of the laboratory testing procedures presently available for comparing verbatim voice samples under forensic conditions. However, some scientists believe that aural identifications of very familiar voices are more accurate (Hecker 1971). The spectrographic technique has been described in numerous forensic and scientific publications, including an overview article published in the Crime Laboratory Digest (Koenig 1986). Therefore, a detailed explanation will not be rendered here; the following paragraphs provide a brief summary of the examination, a review of the new comprehensive standards passed by the IAI, and its status in government and private laboratories.
When properly conducted, spectrographic voice identification is a relatively accurate but not conclusive examination for comparing a recorded unknown voice sample with a suspect repeating the identical contextual information over the same type of transmission system (e.g., a local telephone line). The examiner uses both the short-term memory process previously detailed and a spectral pattern comparison between identically spoken sounds on spectrograms. Figures 1A and 1B are sound spectrograms of different male speakers saying "salt and pepper." The horizontal axis represents time, divided into 0.1-second intervals by the short vertical bars near the top, and the vertical axis is frequency, ranging linearly from 80 Hz to 4,000 Hz, with horizontal lines every 1,000 Hz. The speech energy is reflected in the gray scale from black (highest level) to white (lowest level). The frequency range of the voice is analogous to the range of a musical instrument, where the lowest notes are at the lowest frequency and the highest notes at the highest frequency. The mostly horizontal bands of darkness reflect the vocal resonances and are called formants. The closely spaced vertical striations represent fundamental frequency (voice pitch) or the actual vibrations of the vocal cords.

The spectrographic technique requires comparison of identical phrases between the voice samples, with a decision made at one of a number of confidence levels. The scientific support of this examination is limited, and the actual error rate under most investigative conditions is unknown. The research to date indicates that the technique has a certain error rate that is independent of examiner-induced errors, with errors of false elimination (the voice samples were actually from the same person, but the examination found that they did not match) appreciably higher than false identification (the voice samples were actually from different persons, but the examination found that the samples matched).
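The time/frequency/energy layout of a spectrogram described above can be illustrated with a short sketch. This is not the procedure used by any forensic laboratory or by the analog sound spectrograph; it is a minimal short-time Fourier transform written with the Python standard library, assuming an 8,000 Hz sample rate, 64-sample frames with no window function, and a synthetic 1,000 Hz tone in place of speech. All names and parameter values are hypothetical.

```python
import cmath
import math

def frame_spectrum(frame):
    """Magnitude of the discrete Fourier transform of one frame, bins 0..N/2."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags

def spectrogram(samples, frame_len, hop):
    """Per-frame magnitude spectra: rows are time steps, columns are frequency bins."""
    return [frame_spectrum(samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, hop)]

SAMPLE_RATE = 8000   # assumed sample rate in Hz
FRAME_LEN = 64       # assumed frame length in samples (8 ms per frame)

# Synthetic 1,000 Hz tone standing in for a recorded voice.
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(512)]
spec = spectrogram(tone, FRAME_LEN, hop=32)

# The strongest bin in the first frame corresponds to the darkest point
# a gray-scale display would show for that time step.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
print(len(spec), peak_hz)  # 15 frames, peak at 1000.0 Hz
```

In a printed spectrogram each row of `spec` would become one vertical slice along the time axis, with larger magnitudes drawn darker; formants would appear as bands of consistently large values across neighboring frames.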
In July 1991, the Voice Identification and Acoustic Analysis Subcommittee of the IAI passed and published its first set of comprehensive spectrographic voice identification standards. These requirements, which became effective January 1, 1992, for all certified IAI members, include examiner qualifications, evidence handling, preparation of exemplars, preparation of copies, preliminary examination, preparation of spectrograms, spectrographic/aural analysis, work notes, testimony, certification, and miscellaneous subjects. Table 1 lists the minimum qualifications for spectrographic examiners of the IAI and the FBI and updates a similar table published in an earlier issue of the Crime Laboratory Digest (Koenig 1986). Table 2 is another updated and expanded table from the same article concerning minimum criteria for spectrographic comparisons. Tables 1 and 2 and the previously published tables reflect that the upgraded IAI standards are now appreciably closer to the FBI's criteria. The FBI's standards require higher educational levels, more words for lower confidence decisions, enhancement procedures when needed, and a higher frequency voice range. The most important legal difference is the FBI's policy not to provide testimony on spectrographic comparisons due to the inconclusive nature of the examination and the unknown error rate under specific investigative conditions.

Table 1. Minimum Qualifications for Spectrographic Examiners of the IAI and FBI

Qualification                             IAI                   FBI
Education                                 High School Diploma   BS Degree
Periodic Hearing Test                     Yes                   Yes
Length of Apprenticeship                  Usually 2 Years       2 Years
Number of Comparisons Conducted           100                   100
Attendance at a Spectrographic School     Yes                   Yes
Formal Certification                      Yes                   Yes

Table 2. Minimum Criteria for Spectrographic Comparison for the IAI and the FBI

Criteria                                     IAI           FBI
Words Needed for Highest Confidence Level    20            20
Words Needed for Lowest Confidence Level     10            20
Affirming Independent Second Decision        Yes           Yes
Original Recording Required                  Yes           Yes
Allows Testimony                             Yes           No
Completely Verbatim Known Samples            Usually       Usually
Speech Frequency Range                       Above 2 kHz   Above 2.5 kHz
Accuracy Statement on Report                 Yes           Yes
Enhancement Procedures When Needed           Optional      Yes
Speed Correction of All Recordings           Yes           Yes
Track Determination of All Recordings        Yes           Yes
Azimuth Alignment Correction                 Yes           Yes

The use of the spectrographic technique since the mid-1980s continues to show a steady decline by both government laboratories and private examiners. As of mid-1993, the New York City Police Department and the FBI were the only government laboratories in this country regularly conducting these examinations. The private sector efforts were limited to fewer than a dozen part-time examiners. Professional meetings in the field have been sparsely attended, and no major spectrographic research is known to be under way.

Problems still persist in the spectrographic voice identification field. Examples of these problems include the following: (1) separate sets of certified examiners making high-confidence decisions for both identification and elimination in the same case (see Note 1); (2) individuals with no experience, training, or education in the voice identification discipline making conclusive decisions under oath in court; and (3) examiners testifying that an unknown voice is not the defendant's, although admitting their decisions are really inconclusive based upon accepted standards.

Note 1. Los Angeles Board of Civil Service Commissioners. Threat case decided March 25, 1992, in which three IAI examiners made an identification at a high-confidence level, while two IAI examiners eliminated the suspect.
Summary and Conclusion

Under investigative conditions, individuals can reliably identify voices that are well known to them, but the accuracy rate drops to approximately 78% to 83% when unfamiliar voices are compared to known voice samples. The use of expert witnesses does not improve the accuracy rate of aural-only voice comparisons. The use of the spectrographic technique continues to decline, even with the establishment of new standards in 1992.

References

Bricker, P. D. and Pruzansky, S. Effects of stimulus content and duration on talker identification, Journal of the Acoustical Society of America (1966) 40:6:1441-1449.
Clarke, F. R., Becker, R. W., and Nixon, J. C. Characteristics that Determine Speaker Recognition. Technical Report ESD-TR-66-636, Electronic Systems Division, US Air Force, 1966.
Compton, A. J. Effects of filtering and vocal duration upon the identification of speakers, aurally, Journal of the Acoustical Society of America (1963) 35:11:1748-1752.
Federal Criminal Code and Rules. West, St. Paul, MN, 1991, p. 289.
Hecker, M. H. L. Speaker Recognition: An Interpretive Survey of the Literature. American Speech and Hearing Association, Washington, DC, 1971.
Koenig, B. E. Spectrographic voice identification, Crime Laboratory Digest (1986) 13:4:105-118.
Ladefoged, P. Expectation affects identification by listening, Language and Speech (1978) 21:4:373-374.
Pollack, I., Pickett, J. M., and Sumby, W. H. On the identification of speakers by voice, Journal of the Acoustical Society of America (1954) 26:3:403-406.
Schmidt-Nielsen, A. and Stern, K. R. Identification of known voices as a function of familiarity and narrowband coding, Journal of the Acoustical Society of America (1985) 77:2:658-663.
Voice comparison standards, Journal of Forensic Identification (1991) 41:5:373-392.
Voiceprint Identification
Money Laundering and Narcotics Update, Department of Justice, 1988, and The Legal Investigator 1990
by Steve Cain, Lonnie Smrkovski and Mindy Wilson

Voiceprint identification can be defined as a combination of both aural (listening) and spectrographic (instrumental) comparison of one or more known voices with an unknown voice for the purpose of identification or elimination. The technique was developed by Bell Laboratories in the late 1940s for military intelligence purposes, but its modern-day forensic use did not start until the late 1960s, following its adoption by the Michigan State Police. From 1967 until the present, more than 5,000 law enforcement related voice identification cases have been processed by certified voiceprint examiners.

Voice identification has been used in a variety of criminal cases, including murder, rape, extortion, drug smuggling, wagering-gambling investigations, political corruption, money-laundering, tax evasion, burglary, bomb threats, terrorist activities and organized crime activities. It is part of a larger forensic role known as acoustic analysis, which involves tape filtering and enhancement, tape authentication, gunshot acoustics, reconstruction of conversations and the analysis of any other questioned acoustic event.

Theory

The fundamental theory for voice identification rests on the premise that every voice is individually characteristic enough to distinguish it from others through voiceprint analysis. There are two general factors involved in the process of human speech. The first factor in determining voice uniqueness lies in the sizes of the vocal cavities, such as the throat, nasal and oral cavities, and the shape, length and tension of the individual's vocal cords located in the larynx.
The vocal cavities are resonators, much like organ pipes, which reinforce some of the overtones produced by the vocal cords and so produce formants or voiceprint bars. The likelihood that two people would have all their vocal cavities the same size and configuration and coupled identically appears very remote.

The second factor in determining voice uniqueness lies in the manner in which the articulators or muscles of speech are manipulated during speech. The articulators include the lips, teeth, tongue, soft palate and jaw muscles whose controlled interplay produces intelligible speech. Intelligible speech is developed by the random learning process of imitating others who are communicating. The likelihood that two people could develop identical use patterns of their articulators also appears very remote.

Therefore, the chance that two speakers would have identical vocal cavity dimensions and configurations coupled with identical articulator use patterns appears extremely remote. While there have been claims that several voices have been found to be indistinguishable, no evidence to support such allegations has been published, offered for examination or demonstrated to the authors.

Several studies have been published evidencing the ability to reliably identify voices under certain conditions, and a Federal Bureau of Investigation survey of its own performance in the examination of 2,000 forensic cases revealed an error rate of 0.31 percent for false identifications and 0.53 percent for false eliminations. (See Koenig, B.E., 1986, Spectrographic Voice Identification: a forensic survey, Journal of the Acoustical Society of America, 79:2088-2090.) While there is disagreement in the so-called "scientific community" on the degree of accuracy with which examiners can identify speakers under all conditions, there is agreement that voices can, in fact, be identified.
To facilitate the visual comparison of voices, a sound spectrograph is used to analyze the complex speech waveform into a pictorial display on what is referred to as a spectrogram. The spectrogram displays the speech signal with the time along the horizontal axis, frequency on the vertical axis, and relative amplitude indicated by the degree of gray shading on the display. The resonance of the speaker's voice is displayed in the form of vertical signal impressions or markings for consonant sounds, and horizontal bars or formants for vowel sounds. The visible configurations displayed are characteristic of the articulation involved for the speaker producing the words and phrases. The spectrograms serve as a permanent record of the words spoken and facilitate the visual comparison of similar words spoken between an unknown and a known speaker's voice.

Procedural Guidelines

The acoustic environment in many cases can be controlled at the receiving end of the speech signal. Shutting off the radio, television or other noise-generating devices will reduce or eliminate unwanted background speech or noise. While not always possible, the investigator should attempt to select a reasonably quiet environment for controlled activities such as drug buys or other illegal operations being investigated. Many times these types of activities are carried out in bars, restaurants, car washes, billiard rooms and the like, and the investigator cannot always dictate the location. It may require the recording of telephone conversations or face-to-face encounters under a variety of acoustic conditions in which someone is wearing a body recorder or transmitting the conversation via radio frequency to a remote location. Unfortunately, in many cases the investigators cannot control the acoustic environment. In situations involving an adverse environment, investigators should use high technology stereo equipment to optimize recording capability.
The attempt to produce samples as parallel to the unknown as possible actually assists the examiner in his task because speaker variables are reduced to a minimum. Numerous studies have been conducted that indicate very reliable decisions can be made by trained professional examiners when samples are obtained in the manner described. The notion proposed by some opponents that duplicating the unknown as closely as possible may cause error is not supported by any available evidence. Research studies have produced strong evidence that even very good mimics cannot duplicate another's speech patterns.

In an attempt to obtain proper speech samples, investigators should not hesitate to ask suspects for the samples they need. Surprisingly, many suspects will voluntarily give a sample of their voice for comparison purposes. In the event you are dealing with some type of vocal disguise, attempt to obtain a similarly produced known exemplar in addition to the suspect's normal voice. It should be noted that vocal disguises can be very difficult for the examiner to deal with and the probability of determination is less than with normal voice samples.

If a suspect refuses to cooperate with the investigator, a court order may be acquired compelling the suspect to produce voice recordings for the purpose of comparison. Courts have repeatedly held that requiring the accused to submit voice exemplars for the purpose of comparison for identification or elimination does not violate the suspect's Fifth Amendment rights. In United States v. Wade, 388 U.S. 218 (1967), the Court held that the privilege against self-incrimination offers no protection from compulsion to submit to speaking for the purpose of voice identification, or to writing, photographing, fingerprinting and measurements. Several problems have been encountered in obtaining known voice exemplars even with the use of a court order.
If the court order is vague, the suspect may utter a few words of the text involved, speak too softly, too fast, or too slowly, or otherwise disguise the sample and claim compliance with the order. To prevent such problems, the investigator is wise to request that the court order specify in detail that the suspect give a sample of his or her voice, repeating the phrases of the questioned call in a natural conversational voice (or in a similar disguise, if that is the case), and that such sample shall be given at least three times and to the reasonable satisfaction of the investigator. Voice exemplars obtained with such specific instructions are usually very satisfactory for comparison purposes. Before terminating the recording session, check the recording to determine whether or not a satisfactory exemplar was obtained. Remember that once a suspect is released, a second known sample may be very difficult to obtain.

Whatever the recording circumstance, background noise and the distance between the talker and the receiving device should be minimized for optimal recording. Good quality tape recording equipment and magnetic recording tape should be used. As a rule of thumb, recording tape with standard 120 µs equalization, normal bias and no more than a 5 dB drop at 6 kHz should be used.

After the development of a suspect, the next task is to properly obtain known voice samples for comparison purposes. Do not hesitate to ask a suspect for a speech sample. If the suspect refuses, a court order may be obtained requiring compliance with the request. See Schmerber v. California, 384 U.S. 757 (1966), and Gilbert v. California, 388 U.S. 263 (1967). Both are landmark cases. There are also many additional decisions at both state and federal court levels that may be cited to support such a request. Court orders should clearly spell out the minimum number of samples to be obtained, the manner of speech, and the method to be employed.
The next task for the investigator is to obtain proper speech samples for comparison purposes. Probably the best guide here is attempting to duplicate the recording of the questioned call. Known samples should be obtained via the telephone and recorded in the same manner as the questioned call. If possible, the same recorder and telephone pickup should be used. In some cases, even the same telephone has been employed. If there is room on the questioned tape, the known sample may be placed on it. If there is not, another tape of the same type and brand should be used if at all possible.

Speech samples obtained should contain exactly the same words and phrases as those in the questioned sample because only like speech sounds are used for comparison. Because the voice, like handwriting, is dynamic and variant, several samples of each spoken phrase are desired for analysis. Unless the questioned call sounds like a read statement, the suspect should not be allowed to read the phrases from a transcript but should repeat each phrase after it is spoken by someone else. To avoid an unnatural verbal response, the suspect should repeat the first phrase and proceed in the same manner with each successive phrase. When all phrases have been recorded, the same procedure should be repeated at least two more times beginning with the first word or phrase. The suspect may be asked to read the phrases if a very poor job of repeating is done. Some people do a better job of reading than repeating the phrases. It is important that the known sample be spoken in the same manner as the questioned sample; therefore, the investigator should be familiar with the voice, manner of speech and the text. If the caller's voice was disguised, the suspect should give a normal sample and a disguised one as in the questioned call.

Recorded evidence should be wrapped in tinfoil to protect it from possible contact with a magnetic field if it is submitted by mail.
The evidence should be shipped in a secure container that will prevent the evidence from tearing through the packaging material. Do not submit a copy of your investigative report with the evidence. The examiner does not want to know the details of the case. It is important, however, to provide the examiner with information regarding the recording method, the number of calls and suspects involved, and any other information that may assist the examiner in the examination of the evidence.

Upon receipt of the evidence by the laboratory, it is properly marked and a case number is assigned. The analysis and comparison of known and questioned voice samples may take several hours or days to complete, depending on the number of samples involved and the complexity of the examination. Both an aural (listening) and a visual (spectrographic) examination and comparison are conducted. Aural and spectrographic cues examined should complement one another in the event the voices are in fact the same.

As with the identification of fingerprints, there is presently no universal standard for the number of words required for identification. The minimum does, however, vary from 10 words for some agencies to 20 for others. The Internal Revenue Service has chosen to use 20 or more like speech sounds between an unknown and known sample, with the degree of certainty based on the quality and excellence of the evidence examined. Obtaining a second, independent decision is standard practice in this field as in other forensic sciences.

Visual comparison of spectrograms involves, in general, the examination of spectrographic features of like sounds as portrayed in spectrograms in terms of time, frequency and amplitude.
Specific features, the result of producing consonants, vowels and semi-vowels in isolation or in combination (coarticulation), include the following cues, which are certainly not all-inclusive: pitch, bandwidth, mean frequency, trajectory of vowel formants, distribution of formant energy, nasal resonance, stops, plosives, fricatives, pauses, interformant features and other idiosyncratic and pathological features.

Special aural comparison tapes are prepared to facilitate comparison of psycholinguistic features via short-term memory. Aural cues compared include resonance quality, pitch, temporal factors, inflection, dialect, articulation, syllable grouping, breath pattern, disguise, pathologies and other peculiar speech characteristics.

Some agencies offer court testimony; others do not. The IRS laboratory is the only federal agency that presently offers testimony. All other certified examiners, whether in state agencies or in private practice, also offer court testimony.

Court Admissibility

Court testimony involving aural-spectrographic voice comparison essentially started having an impact on the courts after the Tosi Study in December 1970. Since then there have been between 150 and 200 trials in local, state or federal courts. Because of differences in evidentiary philosophy, some courts have admitted aural-spectrographic voice evidence and others have not.

There are two general "rules" or "standards" by which scientific evidence is accepted in courts of law in the United States. The first, commonly referred to as the Frye "rule" or "test," is based on a 1923 District of Columbia case and basically requires "general acceptance in the particular field in which it belongs." See Frye v. United States, 54 App. D.C. 46, 293 F. 1013 (1923). The second is based on the argument of McCormick (See "McCormick on Evidence," 3rd Ed., 203 at 608.)
McCormick states: "General scientific acceptance is a proper condition for taking judicial notice of scientific facts, but it is not a suitable criterion for the admissibility of scientific evidence. Any relevant conclusion supported by a qualified expert witness should be received unless there are distinct reasons for exclusion." See Rule 702 of the Federal Rules of Evidence. Many state and federal courts have abandoned Frye and adopted the argument of McCormick. The supreme courts of Minnesota, Maine, Ohio and Rhode Island have admitted aural-spectrographic voice evidence following McCormick. Intermediate appellate courts in California, Maryland and Michigan admitted such evidence following Frye but were reversed by their respective supreme courts, which held that the Frye test had not been met. The Massachusetts Supreme Court held aural-spectrographic voice evidence admissible applying the Frye test, while those of Arizona, Indiana and Pennsylvania did not. In the federal court system, we are aware of 30 trials in which the question of aural-spectrographic voice evidence was addressed. All but three admitted the evidence based on Frye or McCormick. On appeal, the Second, Fourth and Sixth Circuits held the evidence admissible, applying McCormick, while the District of Columbia Circuit did not, applying Frye. See United States v. Williams, 583 F.2d 1194 (2d Cir.), cert. denied 439 U.S. 1117 (1978); United States v. Baller, 519 F.2d 463 (4th Cir.), cert. denied 423 U.S. 1019 (1975); United States v. Franks, 511 F.2d 25 (6th Cir.), cert. denied 422 U.S. 1042 (1975); and United States v. McDaniel, 538 F.2d 408 (D.C. Cir. 1976). In United States v. Williams, supra at 1198, the court said: "The 'Frye' test is usually construed as necessitating a survey and categorization of the subjective views of a number of scientists, assuring thereby a reserve of experts available to testify. Difficulty in applying the 'Frye' test has led a number of courts to its implicit modification." 
Also see United States v. Baller, supra at n.6. Since 1970, the forensic application of aural-spectrographic voice identification has been reliably applied in the investigation of several thousand cases. While there is disagreement on the reliability of the method under all conditions, there is agreement that voices can be identified and eliminated when the proper conditions exist and the analysis is carefully conducted by qualified examiners. Several state appellate and supreme courts have admitted the evidence, as have three of four federal appellate courts. The United States Supreme Court has refused to review and decide the three cases brought before it. While the admission of aural-spectrographic voice evidence continues to be decided in various courts, the method continues to be a very important tool in the arsenal against crime. Other areas of acoustic analysis include, in part, gunshot analysis, tape enhancement and tape authentication. While not discussed in this article, it should be noted that laboratory analysis related to these problems is available in some laboratories.

NDAA Bulletin December 1993

VOICE IDENTIFICATION: The Aural/Spectrographic Method

by: Michael C. McDermott (mike@mcdltd.com), Frank M. McDermott, Ltd.; Tom Owen (owlmax@aol.com), Owl Investigations, Inc.

Table of Contents: I. INTRODUCTION II. The Sound Spectrograph III. The Method of Voice Identification IV. History V. Standards of Admissibility VI. Research Studies VII. Conclusion VIII. Table of Cases IX. Appendix 1

© 1996 Owl Investigations, Inc.

INTRODUCTION

The forensic science of voice identification has come a long way from when it was first introduced in the American courts back in the mid 1960's. In the early days of this identification technique there was little research to support the theory that human voices are unique and could be used as a means for identification. 
There was also no standardization of how an identification was reached, or even of the training or qualifications necessary to perform the analysis. Voice comparisons were made solely on the pattern analysis of a few commonly used words. Due to the newness of the technique there were only a few people in the world who performed voice identification analysis and were capable of explaining it to a court. Gradually the process became known to other scientists who voiced concerns, not as to the validity of the analysis, but as to the lack of substantial research demonstrating the reliability of the technique. They felt that the technique should not be used in the courtroom without more documentation. Thus the battle lines were drawn over the admissibility of voice identification evidence, with proponents claiming a valid, reliable identification process and opponents claiming more research must be completed before the process should be used in courtrooms. Today voice identification analysis has matured into a sophisticated identification technique, using the latest technology science has to offer. The research, which is still continuing today, demonstrates the validity and reliability of the process when performed by a trained and certified examiner using established, standardized procedures. Voice identification experts are found all over the world. No longer limited to the visual comparison of a few words, the comparison of human voices now focuses on every aspect of the words spoken: the words themselves, the way the words flow together, and the pauses between them. Aural and spectrographic analyses are combined to form the conclusion about the identity of the voices in question. The road to admissibility of voice identification evidence in the courts of the United States has not been without its potholes. Many courts have had to rule on this issue without having access to all the facts. 
Trial strategies and budgets have resulted in incomplete pictures for the courts. To compound the problem, courts have utilized different standards of admission, resulting in different opinions as to the admissibility of voice identification evidence. Even those courts which have claimed to use the same standard of admissibility have interpreted it in a variety of ways, resulting in a lack of consistency. Although many courts have denied admission to voice identification evidence, none of the courts excluding the spectrographic evidence have found the technique unreliable. Exclusion has always been based on the fact that the evidence presented did not present a clear picture of the technique's acceptance in the scientific community and as such, the court was reluctant to rely on that evidence. The majority of courts hearing the issue have admitted spectrographic voice identification evidence.

THE SOUND SPECTROGRAPH

The sound spectrograph, an automatic sound wave analyzer, is a basic research instrument used in many laboratories for research studies of sound, music and speech. It has been widely used for the analysis and classification of human speech sounds and in the analysis and treatment of speech and hearing disorders. The instrument produces a visual representation of a given set of sounds in the parameters of time, frequency and amplitude. The analog spectrograph is composed of four basic parts: (1) a magnetic tape recorder/playback unit, (2) a tape scanning device with a drum which carries the paper to be marked, (3) an electronic variable filter, and (4) an electronic stylus which transfers the analyzed information to the paper. The analog sound spectrograph samples energy levels in a small frequency range from a magnetic tape recording and marks those energy levels on electrically sensitive paper. The instrument then analyzes the next small frequency range and samples and marks the energy levels at that point. 
This process is repeated until the entire desired frequency range is analyzed for that portion of the recording. The finished product is called a spectrogram and is a graphic depiction of the patterns, in the form of bars or formants, of the acoustical events during the time frame analyzed. The machine will produce a spectrogram in approximately eighty seconds. The spectrogram is in the form of an X,Y graph with the X axis the time dimension, approximately 2.4 seconds in length, and the Y axis the frequency range, usually 0 to 4000 or 8000 Hz. The degree of darkness of the markings indicates the approximate relative amplitude of the energy present for a given frequency and time. Recent developments in sound spectrography have produced computerized digital sound spectrographs ranging from dedicated digital signal analysis workstations to PC-based systems for acquisition, analysis, editing, and playback. These sophisticated computer-based systems provide high fidelity signal acquisition, high-speed digital processing circuitry for quick and flexible analysis, and CD-quality playback. The computer-based systems accomplish all the same tasks as the analog systems, but with them the examiner gains a host of comparison and measurement tools not available with the analog equipment. The computer-based systems are capable of displaying multiple sound spectrograms, adjusting the time alignment and frequency ranges, and taking detailed numeric measurements of the displayed sounds. With these advances in technology, the examiner widens the scope of the analysis to create a more detailed picture of the voice or sound being analyzed. The accuracy and reliability of the sound spectrograph, either analog or digital, have never been in question in any of the courts and have never been considered an issue in the admissibility of voice identification evidence. 
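The time, frequency and darkness layout described above can be illustrated with a short digital sketch. It is not the procedure of any particular spectrograph: it assumes a plain short-time Fourier transform with a Hann window, and the names `spectrogram` and `peak_frequency` as well as the frame and hop sizes are hypothetical choices made for illustration. Each row corresponds to one time slice (the X axis), each column to one frequency band (the Y axis), and the stored magnitude plays the role of the darkness of the marking.

```python
import cmath
import math


def spectrogram(samples, sample_rate, frame_len=64, hop=32):
    """Return (times, freqs, rows) for a list of audio samples.

    rows[i][k] is the magnitude at time times[i] and frequency freqs[k].
    frame_len and hop are illustrative values, not taken from the article.
    """
    # Hann window to reduce leakage between neighboring frequency bands.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # bands from 0 Hz up to sample_rate / 2
    freqs = [k * sample_rate / frame_len for k in range(n_bins)]
    times, rows = [], []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        row = []
        for k in range(n_bins):
            # Direct discrete Fourier transform of one band: slow but explicit,
            # loosely mirroring the band-by-band scan of the analog machine.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            row.append(abs(acc))
        times.append(start / sample_rate)
        rows.append(row)
    return times, freqs, rows


def peak_frequency(row, freqs):
    """Frequency of the strongest (darkest) band in one time slice."""
    return freqs[max(range(len(row)), key=row.__getitem__)]
```

With an 8000 Hz sampling rate the bands run from 0 to 4000 Hz, matching the narrower of the two frequency ranges mentioned in the text. A production system would use a fast Fourier transform rather than this direct sum.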
The lack of challenge to the instrument may be due in part to its wide use in the field of speech and hearing for non-voice identification analysis of the human voice and in part to the fact that, given the same recording of speech sounds, the sound spectrograph will consistently produce the same spectrogram of that speech. The contest comes in the interpretation of the spectrograms. Proponents of the aural and spectrographic technique of voice identification base their decisions on the theory that all human voices are different due to the physical uniqueness of the vocal tract, the distinctive environmental influences in the learning process of speech development, and the unique development of neurological faculties which are responsible for the production of speech. Opponents claim that not enough research has been completed to validate the theory that intraspeaker variability is less than interspeaker variability.

THE METHOD OF VOICE IDENTIFICATION

The method by which a voice is identified is a multifaceted process requiring the use of both aural and visual senses. In the typical voice identification case the examiner is given several recordings: one or more recordings of the voice to be identified and one or more recorded voice samples of one or more suspects. It is from these recordings that the examiner must make the determination about the identity of the unknown voice. The first step is to evaluate the recording of the unknown voice, checking to make sure the recording has a sufficient amount of speech with which to work and that the quality of the recording is of sufficient clarity in the frequency range required for analysis.1 The volume of the recorded voice signal must be significantly higher than that of the environmental noise. The greater the number of obscuring events, such as noise, music, and other speakers, the longer the sample of speech must be. 
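The requirement that the voice be significantly louder than the environmental noise can be expressed as a signal-to-noise ratio. The sketch below is only an illustration under stated assumptions: it compares the root-mean-square level of a stretch containing speech with that of a stretch containing background noise alone, and the function names `rms` and `snr_db` are hypothetical. The article gives no numeric threshold, so none is built in.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of audio samples."""
    if not samples:
        raise ValueError("segment must contain at least one sample")
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech_segment, noise_segment):
    """Approximate signal-to-noise ratio in decibels.

    speech_segment: samples from a stretch where the voice is present.
    noise_segment:  samples from a stretch with background noise only.
    Both segmentations are assumed to be chosen by the examiner.
    """
    signal = rms(speech_segment)
    noise = rms(noise_segment)
    if noise == 0:
        return math.inf   # no measurable background noise
    if signal == 0:
        return -math.inf  # silent "speech" segment
    return 20 * math.log10(signal / noise)
```

A voice ten times the amplitude of the background yields 20 dB. Whether a given figure is adequate remains the examiner's judgment.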
Some examiners report that they reject as many as sixty percent of the cases submitted to them with one of the main reasons for rejection being the poor quality of the recording of the unknown voice. Once the unknown voice sample has been determined to be suitable for analysis, the examiner then turns his attention to the voice samples of the suspects. Here also, the recordings must be of sufficient clarity to allow comparison, although at this stage, the recording process is usually so closely controlled that the quality of recording is not a problem. The examiner can only work with speech samples which are the same as the text of the unknown recording. Under the best of circumstances the suspects will repeat, several times, the text of the recording of the unknown speaker and these words will be recorded in a similar manner to the recording of the unknown speaker. For example, if the recording of the unknown speaker was a bomb threat made to a recorded telephone line then each of the suspects would repeat the threat, word for word, to a recorded telephone line. This will provide the examiner with not only the same speech sounds for comparison but also with valuable information about the way each speech sound completes the transition to the next sound. There are those times when a voice sample must be obtained without the knowledge of the suspect. It is possible to make an identification from a surreptitious recording but the amount of speech necessary to do the comparison is usually much greater. If the suspect is being engaged in conversation for the purpose of obtaining a voice sample, the conversation must be manipulated in such a way so as to have the suspect repeat as many of the words and phrases found in the text of the unknown recording as possible. The worst exemplar recordings with which an examiner must work are those of random speech. 
It is necessary to obtain a large sample of speech to improve the chances of obtaining a sufficient amount of comparable speech. As in any other form of identification analysis, as the quality of the evidence with which the examiner has to work declines, the amount of evidence and time necessary to complete the analysis increases, and the chance of a positive conclusion falls. Once the evidence has been determined to be sufficient to perform the analysis, the examiner then begins the two-step process of voice sample comparison: one step aural (listening) and the other spectrographic (visual). These are two different but interwoven and equally important analytical methods which the examiner combines to reach the final conclusion. The first step is an aural comparison of the voice samples.2 Here the examiner compares both single speech sounds and series of speech sounds of the known and unknown samples. At this stage the examiner is conducting a number of tasks: comparing for similarities and differences, screening out less useful portions of the samples, and indexing the samples for further analysis. An example of the initial aural comparison is the screening of the samples for pronunciation similarities or discrepancies; the word "the," for instance, may be said with a short "a" sound or a long "e" sound. If the word is not pronounced in the same manner it loses comparison value. Once the examiner has located those portions to be used for the analysis, a more detailed aural comparison is undertaken. This comparison can be accomplished in many different ways. One of the most commonly used methods of aural comparison is re-recording a speech sound sample of the unknown followed immediately by a re-recording of the same speech sounds of the suspect. This is repeated several times so that the final product is a recording of specific speech sounds, in alternating order, by the unknown speaker followed by the suspect. 
Such comparisons have been greatly facilitated by the use of audio digital recording equipment which allows for the digital recording, storage, and repeated playback of only the desired speech sounds to be examined. During the aural comparison the examiner studies the psycholinguistic features of the speaker's voice. A large number of qualities and traits are examined, from such general traits as accent and dialect to inflection, syllable grouping and breath patterns. The examiner also scrutinizes the samples for signs of speech pathologies and peculiar speech habits. The second step in the voice identification process is the spectrographic analysis of the recorded samples. The sound spectrograph is an automatic sound wave analyzer with a high quality, fully functional tape recorder. The speech samples to be analyzed are recorded on the sound spectrograph. The recording is then analyzed in two-and-one-half-second segments. The product is a spectrogram, a graphic display of the recorded signal on the basis of time and frequency with a general indication of amplitude. The spectrograms of the unknown speaker are then visually compared to the spectrograms of the suspects. Only those speech sounds which are the same are compared.3 The comparisons of the spectrograms are based on the displayed patterns representing the psychoacoustical features of the captured speech. The examiner studies the bandwidths, mean frequencies, and trajectory of vowel formants; vertical striations, distribution of formant energy and nasal resonances; stops, plosives and fricatives; interformant features, the relation of all features present as affected during articulatory changes and any peculiar acoustic patterning.4 The examiner looks not only for similarities but also for differences. The differences are closely examined to determine if they are due to pronunciation differences or if they are indicative of different speakers. 
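The comparison ends in one of five standard conclusions, which the following paragraph sets out in words. As a rough sketch of how the twenty-sound thresholds fit together, the function below maps counts of matching speech sounds and unexplained differences to a conclusion. The names `Conclusion` and `classify` and the `usable` flag are hypothetical, and the handling of mixed cases (some similarities together with a few unexplained differences) is an assumption, since the text does not address them. The real decision rests on examiner judgment, not on a count alone.

```python
from enum import Enum


class Conclusion(Enum):
    POSITIVE_IDENTIFICATION = "positive identification"
    PROBABLE_IDENTIFICATION = "probable identification"
    POSITIVE_ELIMINATION = "positive elimination"
    PROBABLE_ELIMINATION = "probable elimination"
    NO_DECISION = "no decision"


THRESHOLD = 20  # speech sounds required for a "positive" result, per the text


def classify(similar_sounds, unexplained_differences, usable=True):
    """Map comparison counts to one of the five standard conclusions.

    usable=False stands for a recording too poor, or with too few common
    speech sounds, to support any comparison (an illustrative flag).
    """
    if similar_sounds < 0 or unexplained_differences < 0:
        raise ValueError("counts must be non-negative")
    if not usable or (similar_sounds == 0 and unexplained_differences == 0):
        return Conclusion.NO_DECISION
    if unexplained_differences >= THRESHOLD:
        return Conclusion.POSITIVE_ELIMINATION
    if unexplained_differences > 0:
        # Assumption: any unexplained difference rules out an identification.
        return Conclusion.PROBABLE_ELIMINATION
    if similar_sounds >= THRESHOLD:
        return Conclusion.POSITIVE_IDENTIFICATION
    return Conclusion.PROBABLE_IDENTIFICATION
```

For example, `classify(25, 0)` yields a positive identification, while `classify(12, 0)` yields only a probable one.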
When the analysis is complete the examiner integrates his findings from both the aural and spectrographic analyses into one of five standard conclusions: a positive identification, a probable identification, a positive elimination, a probable elimination, or no decision. In order to arrive at a positive identification the examiner must find a minimum of twenty speech sounds which possess sufficient aural and spectrographic similarities. There can be no differences, either aural or spectrographic, that cannot be accounted for. The probable identification conclusion is reached when there are fewer than twenty similarities and no unexplained differences. This conclusion is usually reached when working with small samples, random speech samples or recordings of lower quality. The result of positive elimination is rendered when twenty differences between the samples are found that cannot be attributed to anything other than different voices having produced the samples. A probable elimination decision is usually reached when working with limited text or a recording of lower quality. The no decision conclusion is used when the quality of the recording is so poor that there is insufficient information with which to work or when there are too few common speech sounds suitable for comparison.

HISTORY

A good place to start examining the history of speech sound analysis is a little more than one hundred years ago, with Alexander Melville Bell, who developed a visual representation of the spoken word. This visual display of the spoken word conveyed much more information about the pronunciation of that word than the dictionary spelling could ever suggest. His depiction of speech sounds demonstrated the subtle differences with which different people pronounced the same words. 
This system of speech sound analysis developed by Bell is the phonetic alphabet which he called "visible speech".5 His method of encoding the great variety of speech sounds was by handwritten symbols and was language independent. This code produced a visual representation of speech which could convey to the eye the subtle differences in which words were spoken. This system was used by both Bell and his son, Alexander Graham Bell, in helping deaf people learn to speak.6 It was in the early 1940's that a new method of speech sound analysis was developed. Potter, Kopp and Green, working for Bell Laboratories in Murray Hill, New Jersey, began work on a project to develop a visual representation of speech using a sound spectrograph. This machine, an automatic sound wave analyzer, produced a visual record of speech portraying three parameters: frequency, intensity and time. This research was intensified during World War II when acoustic scientists suggested that enemy radio voices could be identified by the spectrograms produced by the sound spectrograph. The war ended before the technique could be perfected. In 1947, Potter, Kopp and Green published their work in a book, the title of which was borrowed from Alexander Melville Bell, Visible Speech. Their work is a comprehensive study of speech spectrograms designed to linguistically interpret visible speech sound patterns. This work was similar to Bell's in that speech sounds were encoded into a visual form. The difference is that, instead of a pen, Potter, Kopp and Green used a sound spectrograph to produce the visual patterns. Research in the area of speaker identification slowed dramatically with the end of World War II. It was not until the late 1950's and early 1960's that the research began again. 
It was at this time that the New York City Police Department was receiving a large number of telephone bomb threats to the airlines.7 At that time Bell Laboratories was asked by law enforcement officers to provide assistance in the apprehension of the individuals making the telephone calls. The task of developing a reliable method of identification of a speaker's voice was given to Lawrence G. Kersta, a physicist at Bell Laboratories who had worked on the early experiments using the sound spectrograph. In two years Kersta had developed a method of identification for which he reported a correct identification in 99.65% of all attempts.8 It was in 1966 that the Michigan State Police began the practical application of the voice identification method in attempting to solve criminal cases. A Voice Identification unit was established and the unit personnel received training from Kersta and other speech scientists. During the first few years the voice identification method was used only as an investigative aid. The first published opinion to rule on the admissibility of voice identification analysis came in the case of United States v. Wright, 17 USCMA 183, 37 CMR 447 (1967). This was a court martial proceeding in which the appellate court affirmed the admission of spectrographic voice identification evidence by the board of review. The lengthy dissent by Judge Ferguson, based on the requirements for acceptance of scientific evidence spelled out in Frye v. United States, 293 Fed. 1013 (CA DC Cir) (1923), was the beginning of a controversy which continues today. The first non-military case to review the admissibility of voice identification evidence was decided by the New Jersey Supreme Court in State v. 
Cary.9 In this case the court stated that "the physical properties of a person's voice are identifying characteristics".10 The court also noted that trial courts in the states of New York and California have admitted voice identification evidence but that these admissions have not been the subject of appellate review.11 The court declined to rule on the admissibility issue and remanded the case to determine if the equipment and technique were sufficiently accurate to provide results admissible as evidence. The Superior Court of New Jersey, on appeal from a denial of admission after remand, held that the majority of evidence "indicates, not that the technique is not accurate and reliable, but rather that it is just too early to tell and at this time lacks the required scientific acceptance".12 The New Jersey Supreme Court reviewed this decision and once again remanded for additional fact finding "in light of the far-reaching implications of admission of voiceprint evidence".13 The State of New Jersey was unable "to furnish any new and significant evidence" by the third time the New Jersey Supreme Court reviewed this issue, and as such the court affirmed the trial court's opinion excluding voice identification evidence.14 California came to a similar holding when the issue first reached the appellate level in People v. King.15 The State brought in Lawrence Kersta as the voice identification expert to testify as to the reliability of the technique. The defense brought in seven speech scientists and engineers to rebut Kersta's claims. 
The court held that "Kersta's claims for the accuracy of the 'voiceprint' process are founded on theories and conclusions which are not yet substantiated by accepted methods of scientific verification".16 The court cited the Frye test as the proper standard for admissibility.17 The court also left the door open for future admission by saying that when voice identification evidence has achieved the necessary degree of acceptance they will welcome its use.18 In State ex rel. Trimble v. Heldman 19, the Supreme Court of Minnesota held that "spectrograms ought to be admissible at least for the purpose of corroborating opinions as to identification by means of ear alone".20 The court was impressed by the testimony of Dr. Oscar Tosi, who had previously testified against the use of spectrographic voice identification evidence in courtrooms, but after extensive research and experimentation now described the technique as "extremely reliable".21 The court made reference to the Frye test and to the scientific community's acceptance of Dr. Tosi's study, but did not specifically apply the Frye test as the standard for the admissibility of the voice identification evidence.22 In discussing the issue of admissibility the court held that it was the job of the factfinder to weigh the credibility of the evidence. "The opinion of an expert is admissible, if at all, for the purpose of aiding the jury or the factfinder in a field where he has no particular knowledge or training. The weight and credibility to be given to the opinion of an expert lies with the factfinder. It is no different in this field than in any other".23 In 1972 the third and fourth District Courts of Florida, in separate opinions, held admissible the use of spectrographic voice identification evidence.24 The court in Worley held that the voice identification evidence was admissible to corroborate the defendant's identification by other means. 
The court stated that the technique had attained the necessary level of scientific reliability required for admission, but since it was only offered as corroborative evidence, the court refused to comment as to whether such evidence alone would be sufficient to sustain the identification and conviction.25 The third District Court of Appeals of Florida did not limit the admission of spectrograph evidence to corroborative status. In the Alea opinion the court does not mention the Frye test as the standard to be used for admission, but rather states that "such testimony is admissible to establish the identity of a suspect as direct and positive proof, although its probative value is a question for the jury".26 In the case of State v. Andretta 27, the New Jersey Supreme Court stated that there was much more support for the admission of spectrographic voice identification evidence than at the time they decided Cary, but refused to address the issue further since the only issue before them was whether the defendant should be compelled to speak for a spectrographic voice analysis.28 In California the Court of Appeal affirmed the trial court's admission of voice identification evidence in the case of Hodo v. Superior Court.29 Here the court found the requirements of Frye had been met in that there was now general acceptance of spectrographic voice identification by recognized experts in the field. The court cited Dr. Tosi's testimony that "those who really are familiar with spectrography, they are accepting the technique".30 Tosi also pointed out that the general population of speech scientists are not familiar with this technique and thus cannot form an opinion on it.31 The court in United States v. 
Samples 32 held that the Frye test of general acceptance precludes too much relevant evidence for purposes of the fact determining process at a revocation of probation hearing, and the court allowed the use of spectrographic voice identification evidence to corroborate other identification evidence.33 In 1974 the case of United States v. Addison 34 rejected the admission of voice identification evidence, saying that such evidence "is not now sufficiently accepted" and as such the requirements of the Frye test were not met.35 At the trial the court heard from two experts endorsing the technique, Dr. Tosi and a recent convert to the reliability of the technique, Dr. Ladefoged. Only one expert, Dr. Stuart, testified that he was still skeptical of the technique and thought that most of the scientific community was also.36 Although the trial court's admission of spectrographic voice identification evidence was held to be error, the appellate court refused to overturn the conviction due to the overwhelming amount of other evidence supporting the conviction.37 Attempted disguise or mimic were the grounds the California Court of Appeal used to reverse a conviction based in part on spectrographic voice identification in the case of People v. Law.38 The court found that "with respect to disguised and mimicked voices in particular, the prosecution did not carry out its burden of proof to demonstrate that the scientific principles pertaining to spectrographic identification were beyond the experimental and into the demonstrable stage or that the procedure was sufficiently established to have gained general acceptance in the particular field in which it belongs".39 The main concern of the court was that no experimentation had been completed studying the effects of attempts to disguise or mimic on the accuracy of the identification process. 
Without mentioning the Frye test, this court used the standards set in Frye as the test of admissibility, although the court seemed to be limiting the scope of the opinion to cases involving disguise or mimic. In United States v. Franks 40, the Sixth Circuit Court of Appeals held spectrographic voice identification evidence to be admissible. The court said it was "mindful of a considerable area of discretion on the part of the trial judge in admitting or refusing to admit evidence based on scientific processes".41 Quoting from United States v. Stifel 42, the court pointed out that "neither newness nor lack of absolute certainty in a test suffices to render it inadmissible in court. Every useful new development must have its first day in court. And court records are full of the conflicting opinions of doctors, engineers and accountants...".43 The court in Franks found that extensive review was given to the qualifications of the experts and opportunity to cross-examine the experts to determine the proper weight to be given such evidence. The Massachusetts Supreme Court, in Commonwealth v. Lykus 44, allowed the admission of spectrographic voice identification evidence, saying that the opinions of a qualified expert should be received and the considerations similar to those expressed in Frye should be for the fact finder as to the weight and value of the opinions. The court gave greater weight to those experts who had had direct and empirical experience in the field as opposed to those who had only performed a theoretical review of that work.45 The court also stated that "neither infallibility nor unanimous acceptance of the principle need be proved to justify its admission into evidence".46 The Massachusetts Supreme Court again, that same year, found no error in the use of spectrographic voice identification evidence in the case of Commonwealth v. Vitello.47 The Fourth Circuit Court of Appeals, in the case of United States v. 
Baller 48, allowed the admission of spectrographic voice identification evidence saying unless it is prejudicial or misleading to the jury, it is better to admit relevant scientific evidence in the same manner as other expert testimony and allow its weight to be attacked by cross-examination and refutation.49 The court listed six reasons supporting admission; the expert was a qualified practitioner, evidence in voir dire demonstrated probative value, competent witnesses were available to expose limitations, the defense demonstrated competent cross-examination, the tape recordings were played for the jury, and the jury was told they could disregard the opinion of the voice identification expert.50 Voice identification evidence was admitted by the Sixth Circuit Court of Appeals in United States v. Jenkins 51 using the same logic as in Baller. Here the court said that the issue of admissibility was within the discretion of the trial judge and that once a proper foundation had been laid the trier of fact was able to assign proper weight to the evidence.52 In 1976 the New York Supreme Court pointed out, in the case of People v. Rogers 53, that fifty different trial courts had admitted spectrographic voice identification evidence, as had fourteen out of fifteen U. S. 
District Court judges, and only two out of thirty-seven states considering the issue had rejected admission.54 The Rogers court stated that this technique, when accompanied by aural examination and conducted by a qualified examiner, had now reached the level of general scientific acceptance by those who would be expected to be familiar with its use, and as such, has reached the level of scientific acceptance and reliability necessary for admission.55 The court also pointed out that other scientific evidence processes are regularly admitted which are as reliable as, or less reliable than, spectrographic voice identification: hair and fiber analysis, ballistics, forensic chemistry and serology, and blood alcohol tests.56 The Supreme Court of California finally put an end to the see-saw ride of admissibility in that state in People v. Kelly 57 by rejecting admission because of insufficient showing of support. "Although voiceprint analysis may indeed constitute a reliable and valuable tool in either identifying or eliminating suspects in criminal cases, that fact was not satisfactorily demonstrated in this case".58 In this case the court seemed to have the most trouble with the fact that the only expert provided to lay the foundation for admission was the technician who performed the analysis, saying that a single witness cannot attest to the views of the scientific community on this new technique and that this witness, who may not be capable of a fair and impartial evaluation of the technique since he has built a career on it, lacked the academic credentials to express an opinion as to the acceptance of the technique by the scientific community.59 In United States v.
McDaniel 60, it appears that the District of Columbia Circuit Court of Appeals would have liked to admit the spectrographic voice identification evidence but had to reject it because the shadow of the Addison decision of two years past "looms over our consideration of this issue".61 The court held the admission of the voice identification evidence to be harmless error in that the rest of the evidence was overwhelming. The court did recognize the trend toward admissibility and contemplated that it may be time to reexamine the holding of Addison "in light of the apparently increased reliability and general acceptance in the scientific community".62 The Supreme Court of Pennsylvania rejected admission in Commonwealth v. Topa 63 holding that the technician's opinion alone will not suffice to permit the introduction of scientific evidence into a court of law.64 This was the same situation, in fact the same single expert, which confronted the Kelly court. In People v. Tobey 65 the Michigan Supreme Court found, by applying the Frye test, that the trial court erred in admitting spectrographic voice identification evidence. The court found that neither of the two experts testifying in favor of the technique could be called disinterested and impartial experts in that both had built their reputations and careers on this type of work.66 The court pointed out that not all courts require independent and impartial proof of general scientific acceptability and was quick to add that this decision was not intended in any way to foreclose the introduction of such evidence in future cases where there is demonstrated solid scientific approval and support of this new method of identification.67 In admitting voice identification evidence, the United States District Court for the Southern District of New York, in United States v.
Williams 68, found that the requirements of the Frye test were met when the technique was performed "by aural comparison and spectrographic analysis".69 The court stated that the concerns of the defendant that this technique had a mystique of scientific precision which may mask the ultimate subjectivity of spectrographic analysis, although they were valid concerns, could be alleviated by action other than suppression of the evidence, such as opposing expert opinion and jury instructions allowing the jury to determine the weight, if any, of the evidence.70 In People v. Collins 71, the Supreme Court of New York rejected admission of spectrographic voice identification evidence saying that the Frye test alone was insufficient to determine admissibility and must be used in conjunction with a test of reliability.72 The court found that the proponents of the technique were in the minority and that the remainder of the relevant scientific community either expressed opposition or expressed no opinion.73 In Brown v. United States 74, the District of Columbia Court of Appeals rejected the use of voice identification evidence, but held the error to be harmless and affirmed the conviction in light of overwhelming non-spectrographic identification of the defendant as perpetrator of the crime. One of the main problems in this case was the fact that the exemplar of the defendant's voice was recorded in a defective manner but used anyway after the tape speed malfunction had been corrected in a laboratory. Dr. Tosi, testifying as a proponent of the technique, stated that the technician should not have used the defective recording as a basis of comparison.75 The court held the technique was not shown to be sufficiently reliable and accepted within the scientific community to permit its use in this criminal case, but that this decision did not foreclose a future decision as to admissibility of the technique.76 In the civil case of D'Arc v.
D'Arc 77, the court found that the requirements of the Frye test had not been met and thus the evidence could not be admitted. The court believed that even with proper instructions to the contrary, this type of evidence "has the potentiality to be assumed by many jurors as being conclusive and dispositive" and thus should be subject to strict standards of admission.78 The court in State v. Williams 79 refused to apply the Frye standard citing instead the Maine Rules of Evidence, Rule 401, which states "all relevant evidence is admissible", with relevant being described as evidence having any tendency to make the existence of any fact that is of consequence to the determination of the action more probable or less probable than it would be without the evidence.80 In Reed v. State 81 the court applied the Frye standard to determine admissibility with a rather wide definition of the scientific community which included "those whose scientific background and training are sufficient to allow them to comprehend and understand the process and form a judgment about it".82 The court said the trial court erred in using the more restricted definition of scientific community, "those who are knowledgeable, directly knowledgeable through work, utilization of the techniques, experimentation and so forth" and did not mean the broad general scientific community of speech and hearing science.83 In a fifty-one page dissent to the Reed decision 84, Judge Smith points out that the Frye standard is much criticized and has never been adopted in the state of Maryland, that this decision is out of step with other courts on related issues of fingerprints, ballistics, x-rays and the like, that this decision is out of step with prior Maryland holdings on expert testimony, that the majority of reported opinions have accepted such evidence, and that even if Frye were applicable it is satisfied. In United States v.
Williams 85 the court did not apply the Frye standard but did note that acceptance of the technique appeared strong among scientists who had worked with spectrograms and weak among those who had not.86 The court then focused on the reliability of the technique and the tendency to mislead. As to the reliability of the technique, the court noted the small error rate, 2.4% false identification, the existence and maintenance of standards of analysis, and the conservative manner in which the technique was applied.87 As to the tendency to mislead, the court felt that adequate precautions were taken in that the jury could view the spectrograms and listen to the recording, the expert's qualifications, the reliability of the equipment and the technique were subject to scrutiny by the defense, and the jury was instructed that they were free to disregard the testimony of the experts.88 In the case of People v. Bein 89 the court based admissibility on a two-pronged test: general acceptance by the relevant scientific community, and competent expert testimony establishing reliability of the process. The court found that both tests had been met and allowed the admission of the evidence.90 The court described the relevant scientific community "to be that group of scientists who are concerned with the problems of voice identification for forensic and other purposes".91 The court also suggested that "it is no different in this field of expertise than in other fields, that where experts disagree, it is for the finder of fact to determine which testimony is the more credible and therefore more acceptable".92 The Ohio Supreme Court, in State v. Williams 93, relied on their own state rules of evidence, as did the Maine court in Williams, and rejected the use of the Frye standard.
The court refused "to engage in scientific nose counting for the purpose of whether evidence based on newly ascertained or applied scientific principles is admissible".94 The court noted, with approval, the playing of the recordings to the jury and that the jury was free to reject the testimony of the expert.95 In that same year, right across the border in Indiana, the court in Cornett v. State 96 rejected admission of voice identification evidence saying the conditions set out in Frye had not been met. Here the court used a wide definition of the scientific community which included linguists, psychologists and engineers who use voice spectrography for identification purposes.97 Although the court held that the trial court erred in admitting the evidence, the error was found to be harmless and the conviction affirmed.98 Likewise the court in State v. Gortarez 99 rejected the admission of voice identification evidence but affirmed the conviction holding such admission to be harmless error. The court also used a wide definition of the scientific community in applying the Frye standard including experts in the fields of acoustical engineering, acoustics, communication electronics, linguistics, phonetics, physics and speech communications and found that there was not general acceptance among these scientists.100 In the case of United States v. Love 101, the admissibility of spectrographic voice identification was not at issue. The Fourth Circuit Court of Appeals was reviewing whether the trial judge's comments about a voice identification expert were considered error.
The trial judge told the jury that they, the jury, were to assign whatever weight they wanted to the testimony of the expert and even disregard his testimony if they "should conclude that his opinion was not based on adequate education, training or experience, or that his professed science of voice print identification was not sufficiently reliable, accurate, and dependable."102 The Court of Appeals found no error in the judge's instruction to the jury. In admitting spectrographic voice identification evidence, the Supreme Court of Rhode Island, in State v. Wheeler 103, declined to apply the Frye standard holding instead "the law and practice of this state on the use of expert testimony has historically been based on the principle that helpfulness to the trier of fact is the most critical consideration".104 The court reviewed the cases around the country, both state and federal, and noted that the majority of circuit courts that have considered admission of spectrographic evidence have decided in favor of its admission.105 The court pointed out that the defendant had all the proper safeguards such as cross-examination, rebuttal experts, and the jury had the right to reject the evidence for any one of a number of reasons.106 In State v. Free 107 the Court of Appeals of the State of Louisiana did not rely on the Frye test for guidance in determining the admissibility of spectrographic voice identification evidence but instead applied a balancing test set forth in State v. Catanese.108 One individual, accepted as an expert in voice identification, testified as to the theoretical and technical aspects of the spectrographic voice analysis method. No other witnesses were called to either support or show fault with the admission of the voice identification testimony.
The Court of Appeals found that voice identification evidence, when offered by a competent expert and obtained through proper procedures, "is as reliable as other kinds of scientific evidence accepted routinely by courts" and "can be highly probative".109 Using the Catanese balancing test the Court of Appeals found that the trier of fact was likely to give almost conclusive weight to the voice identification expert's opinion, consequently misleading the jurors. The Court of Appeals was also concerned that there were not enough experts available who could critically examine the validity of a voice identification determination in a particular case. Nine rules were suggested as a basis for which voice identification evidence could be accepted.110 The Court of Appeals held that Catanese prohibits admission of the voice identification evidence at this time 111 and found the admission of that evidence to be harmless error. In 1987 the Supreme Court of New Jersey again addressed the issue of admissibility of spectrographic evidence in the civil case of Windmere v. International Insurance Company.112 In affirming the judgment of the Appellate Division, the Supreme Court of New Jersey ruled that the Appellate court's affirmation of the admission of the spectrographic evidence by the trial court was improper. The court stated the admissibility of the spectrographic voice analysis is based on the scientific technique having sufficient scientific basis to produce uniform and reasonably reliable results and contribute materially to the ascertainment of the truth 113, a standard the court admits bears "a close resemblance to the familiar Frye test".114 The court relies upon the "general acceptance within the professional community" to establish the scientific reliability of the voice identification process.
In reaching a determination of general acceptance, the court relied on a three-prong test which includes: (1) the testimony of knowledgeable experts, (2) authoritative scientific literature, and (3) persuasive judicial decisions which acknowledge such general acceptance of expert testimony.115 The court found that none of the three prongs indicated that there was a general acceptance of spectrographic voice identification in the professional community. The court criticized the proponent experts as being too closely tied to the development of this identification analysis to represent the opinions of the community.116 The court found that the trial court did not undertake to resolve the issue of conflicting scientific literature and they would make no effort to resolve the conflict.117 The court also reviewed the judicial decisions regarding admissibility and found a split among the jurisdictions as to the reliability of the identification process.118 The New Jersey Supreme Court specifically limited its decision in Windmere excluding spectrographic voice identification evidence to the present case. The court stated that the future use of voice identification evidence "as a reasonably reliable scientific method may not be precluded forever if more thorough proofs as to reliability are introduced" 119 and they will "continue to await the more conclusive evidence of scientific reliability".120 The Court of Appeals of Texas in the case of Pope v. Texas 121 refused to address the issue of admissibility of voice identification evidence stating that "the overwhelming evidence against appellant renders this error, if any, harmless".122 Justice McClung in his dissenting opinion states that the trial court did err in admitting the voice identification evidence and that the error was not harmless.123 He suggests that the Frye test is the proper standard for assessing the admissibility issue and that the "relevant scientific community" should be defined broadly.124
When this aspect of the test is so defined the "general acceptability" criterion is not met. In February of 1989, the United States Court of Appeals for the Seventh Circuit affirmed the decision of the United States District Court for the Northern District of Illinois admitting spectrographic voice identification evidence in the criminal case of United States of America v. Tamara Jo Smith.125 The Seventh Circuit now joins the Second, Fourth and Sixth Circuits in affirming the use of spectrographic voice identification evidence.126 The Appellate court used the Frye standard to hold expert testimony concerning spectrographic voice analysis admissible in cases where the proponent of the testimony has established a proper foundation.127 The court noted that this technique was not one-hundred percent infallible and that the entire scientific community does not support it; however, neither infallibility nor unanimity is a precondition for general acceptance of scientific evidence.128 The Seventh Circuit found that a proper foundation had been established in that the expert testified to the theory and the technique, the accuracy of the analysis and the limitations of the process.129 The court noted that variations from the norm result in an increase of false eliminations.130 The jury was not likely to be misled in that they had the opportunity to hear the recordings, see the spectrograms, hear the limitations of the process, witness a rigorous cross-examination of the expert and could reject the testimony of the expert.131 In United States v. Maivia,132 the United States District Court admitted spectrographic evidence after a four-day hearing on the issue. The court examined the various sub-tests of the Frye test and found that spectrographic voice identification evidence met these tests.
The court also noted that "inasmuch as the admissibility of spectrographic evidence to identify voices has received judicial recognition, it is no longer considered novel within the Frye test and consequently the test is inapplicable".133 The court also looked to the Federal Rules of Evidence, specifically Rule 403, in deciding the admissibility of spectrographic voice identification evidence. In affirming the order of the Appellate Division, the New York Supreme Court, in the case of People v. Jeter 134, concluded that the trial court was not able to properly determine that voice identification evidence is generally accepted as reliable based on case law and existing literature. The Court stated that the trial court should have held a preliminary inquiry into the reliability of voice spectrographic evidence. In the light of the other evidence, the admission of the voice identification evidence was held to be harmless error in this case.

STANDARDS OF ADMISSIBILITY

Prior to 1993 there were two main standards of admissibility which had been applied to voice identification evidence: the Frye test and the Federal Rules of Evidence (and the rules of evidence of the various states). The Frye test originated from the Court of Appeals of the District of Columbia 135 in a decision rejecting admissibility of a systolic blood pressure deception test (a forerunner of the polygraph test). The court stated that admission of this novel technique was dependent on its acceptance by the scientific community. "Just when a scientific principle or discovery crosses the line between the experimental and demonstrable stages is difficult to define.
Somewhere in this twilight zone the evidential force of the principle must be recognized, and while courts will go a long way in admitting expert testimony deduced from a well-recognized scientific principle or discovery, the thing from which the deduction is made must be sufficiently established to have gained general acceptance in the particular field in which it belongs".136 Out of forty published opinions prior to 1993 deciding the admissibility of voice identification evidence, twenty-three courts applied the Frye standard or a standard very similar to Frye. Sixteen of the twenty-three courts rejected the admission of such evidence. Six of these courts held the admission of voice identification evidence by the trial court was harmless error and affirmed the conviction or judgment. Eight of the sixteen stated that although voice identification evidence had not yet met the required standard of scientific acceptability, their decision was not intended to foreclose future admission when such standards were met. Two of these courts denied admission because they felt a single witness could not speak for the entire scientific community regarding the acceptance issue. Seven courts applied the test and found the requirements of Frye had been met. Of the thirteen courts applying a standard of admissibility different from Frye, only one, the Free court 137, rejected voice identification evidence. There are three problems with the Frye standard: at what point is the principle of "sufficiently established" determined, at what point is "general acceptance" reached, and what is the proper definition of "the particular field in which it belongs". These three areas have been major stumbling blocks for the courts in deciding the issue of the admissibility of voice identification evidence due to the small number of voice scientists who have performed research in this field. The trial court in People v.
Siervonti 138 noted the lack of research in this area saying "one only wishes that the last twelve years had been spent in research and not in attempting to get the method into the courts".139 The Frye test has been criticized as not being the appropriate test to use for the admission of voice identification evidence. This standard was established and applied to the admission of a type of evidence which is very different from voice identification. In Frye the court was concerned with the admission of a test designed to determine if a person was telling the truth or not. This type of evidence invades the province of the finder of fact. Voice identification evidence belongs in the general classification of identification evidence which does not impinge on the role of the finder of fact. As such it shares common traits with the other identification sciences of fingerprinting, ballistics, handwriting, and fiber, serum and substance identification. Another criticism of the application of the Frye test as the standard for admission of voice identification evidence is that general acceptance by the scientific community is the proper condition for taking of judicial notice of scientific facts. McCormick states that general scientific acceptance is a proper condition for taking judicial notice of scientific facts, but not a criterion for the admissibility of scientific evidence.140 The court in Reed v. State 141 seemed to note this difference between the standard for the taking of judicial notice and that for admission of evidence such as voice identification. The court said that validity and reliability may be so broadly accepted in the scientific community that the court may take judicial notice of it. If it cannot be judicially noticed then the reliability must be demonstrated before it can be admitted.142 The court then applied the Frye test, general acceptance by the scientific community, to determine reliability and thus, admissibility.
Scientific evidence has long been admitted before it was judicially noticed, as with the case of fingerprints. The admission of fingerprint identification evidence was first challenged in the case of People v. Jennings143 in 1911. The court in Jennings allowed the admission of fingerprint evidence saying "whatever tends to prove any material fact is relevant and competent".144 It was not until thirty-three years later that fingerprint evidence was first judicially noticed.145 The majority of courts which have decided the issue of admissibility in favor of allowing voice identification into the courtroom have used similar standards which permit the finder of fact to hear the evidence and determine the proper weight to be assigned to it. Their logic runs parallel to the Federal Rules of Evidence which state that all relevant evidence is admissible with the word "relevant" being defined as evidence tending to make the existence of any fact that is of consequence to the determination of the action more probable or less probable than it would be without the evidence.146 A qualified expert may testify to his opinion if such opinion will assist the trier of fact in better understanding the evidence.147 Many of the courts which have upheld the admission of voice identification evidence have done so because the trial court had set up a number of precautions to insure the evidence was viewed in its proper light. These precautions include allowing the jury to see the spectrograms of the voices in question, allowing the jury to hear the recordings from which the spectrograms were produced, the expert's qualifications and opinions as well as the reliability of the equipment and technique are subject to scrutiny by the other side, the availability of competent witnesses to expose limitations in the process, and instructions to the jury that they were free to assign whatever weight, if any, to the evidence they felt it deserved. 
The United States Supreme Court in 1993 changed the long-standing law of admissibility of scientific expert evidence by rejecting the Frye test as inconsistent with the Federal Rules of Evidence in the case of Daubert v. Merrell Dow Pharmaceuticals.148 The Court held that the Federal Rules of Evidence and not Frye were the standard for determining admissibility of expert scientific testimony. Frye's "general acceptance" test was superseded by the Federal Rules' adoption. Rule 702 is the appropriate standard to assess the admissibility of scientific evidence. The Court derived a reliability test from Rule 702. In order to qualify as scientific knowledge, an inference or assertion must be derived by the scientific method. Proposed testimony must be supported by appropriate validation - i.e., good grounds, based on what is known. In short, the requirement that an expert's testimony pertain to scientific knowledge establishes a standard of evidentiary reliability.149 The Daubert decision concerns statutory law and not constitutional law. The Court held that the Federal Rules, not Frye, govern admissibility. The only Federal Circuit to reject spectrographic voice analysis has been the District of Columbia. Daubert may cause the District of Columbia to change its stance the next time such evidence is introduced. Since Daubert is not binding on the states, it will be difficult to determine just how much impact Daubert will have on the admissibility standards of the states. Many states have adopted evidence rules based on the Federal Rules of Evidence and may not be affected by this holding. Other states which have adopted the Frye test will have to decide to either continue following Frye or change their standard to Daubert.
The Arizona Supreme Court declined to follow Daubert saying that it was "not bound by the United States Supreme Court's non-constitutional construction of the Federal Rules of Evidence when we construe the Arizona Rules of Evidence."150

RESEARCH STUDIES

The studies that have been produced over the years have run the gamut in type, parameter, and result. A quick review of the available published data would leave one with the impression that the spectrographic method of voice identification was only somewhat more accurate than flipping a coin. The diversity of the relatively low number of studies and the range of results has only added to the confusion as to the reliability and validity of this method of identification. When one takes the time and expends the effort to analyze the studies in this field, a very different conclusion becomes evident. When the individual parameters of the studies are taken into account, who was being evaluated, what information was given to the examiner to assess, and what limitations were placed on the examiner's conclusions, a much clearer picture of the accuracy of the spectrographic voice identification method develops. The picture is not one of a marginally accurate technique but rather a picture that clearly shows that a properly trained and experienced examiner, adhering to internationally accepted standards, will produce a highly accurate result. The studies also show that as the level of training diminishes and/or the conclusions an examiner may reach are artificially limited, the error rate goes up dramatically.
The training for accurately performing the spectrographic voice identification method has been established as requiring completion of (1) a formal course of study, usually 2 to 4 weeks duration, in the basics of spectrographic analysis, (2) two years of study completing 100 voice comparison cases, usually in a one-to-one relationship with a recognized expert, (3) examination by a board of experts in the field of spectrographic voice identification analysis. For the most accurate results from the spectrographic voice identification method, a professional examiner (1) will require the original recordings or the best quality re-recordings if the original is not available; (2) will perform a critical aural review of the suspect and known recordings; (3) will produce sound spectrograms of the comparable words and phrases; (4) will produce a comparison recording juxtaposing the known and unknown speech samples; (5) will evaluate the evidence and classify the results into one of five standard categories [1 - positive identification, 2 - probable identification, 3 - positive elimination, 4 - probable elimination, and 5 - no decision]. The final decision is reached through a combined process of aural and visual examination. It is important to remember that the spectrographic method of voice identification is a process that interweaves the visual analysis of the sound spectrograms with the critical aural examination of the sounds being viewed. Taking the results from all of the studies produced shows that if the examiner's ability to analyze both the graphic representations of the voice and the aural cues found in the recordings is limited or restricted, accuracy suffers. Likewise, the amount of training has a direct bearing on the level of accuracy of the results.
In a survey of 18 studies 151 of the accuracy of the spectrographic voice identification method, the results fall into two categories: those with proper training, using standard procedures, produce very accurate results, whereas those with inadequate training, using limited analysis methods, produce inaccurate results. In a study 152 in 1975 authored by Lt. L. Smrkovski of the Voice Identification Unit of the Michigan State Police, error rates in voice identification analysis comparisons, based on three levels of training and experience, were evaluated. The following table summarizes the results of that study.

Error type      Novice    Trainee    Professional
False Ident.    5.0%      0.0%       0.0%
False Elim.     25.0%     0.0%       0.0%
No Decision     2.5%      2.5%       7.5%

Lt. Smrkovski's results show that proper training is essential. The fact that his results show a higher no decision rate among the professional examiners than the trainee examiners may indicate that the professional is a bit more cautious in his analysis than the trainee. Mark Greenwald, in his 1979 thesis 153 for his M.A. degree at Michigan State University, studied the performance of three professional examiners (each with eight years experience) and five trainees (each with less than two years experience) using standard spectrographic voice identification methods (visual and aural) and result classifications. Greenwald found that the professional examiners produced no errors when using full frequency bandwidth recordings. When the frequency bandwidth was restricted, the professional examiners still produced no errors, but did increase their percentage of no decision classifications. Greenwald also found that the training level was an important factor and that the trainees in this study had an error rate of 6.1% for false identifications in the restricted frequency bandwidth trials. In 1986, the Federal Bureau of Investigation published a survey of two thousand voice identification comparisons made by FBI examiners.154
This survey was based on 2000 forensic comparisons completed over a period of fifteen years, under actual law enforcement conditions, by FBI examiners [155]. The examiners had a minimum of two years' experience, had completed over 100 actual cases, had completed a basic two-week training course, and had received formal approval by other trained examiners [156]. The results of the survey are depicted in the chart [157] below.

DECISIONS                NUMBER    PERCENT (%)
No or low confidence      1304      65.2
Eliminations               378      18.9
Identifications            318      15.9

ERRORS                   NUMBER    PERCENT (%)
False eliminations           2       0.53
False identification         1       0.31

The FBI results are consistent with the Smrkovski study in that properly trained examiners, utilizing the full range of procedures, produce quite accurate results.

By way of contrast, the 1976 study [158] by Alan Reich used four speech science graduate students with previous experience with speech spectrograms (but untrained in spectrographic voice identification analysis) to examine nine excerpted words using visual comparison only. This study produced an accuracy rate of 56.67% in the undisguised trials. When disguise was introduced into this study paradigm, the accuracy rate decreased significantly.

Taken as a whole, the 18 studies support the conclusion that accurate results will be obtained only through the combined use of the aural and visual components of the spectrographic voice identification method as performed by a properly trained examiner adhering to the established standards. Those studies with poor accuracy results are important in that they demonstrate the weaknesses of improperly performed examinations that do not adhere to the internationally accepted professional standards.
A large part of the debate over the admissibility of spectrographic voice identification analysis in the courts appears to be due to the fact that the parameters of these studies have not been presented to the courts in enough detail to allow the courts to examine the overall meaning of these studies. Many of these studies look at only one or two aspects of the spectrographic voice identification method. Frequently the results of these restricted-scope studies have been misapplied to the entire spectrographic voice identification method, resulting in inaccurate information being used as the basis for deciding the admissibility of spectrographic voice identification analysis. It is important to provide an accurate picture of all the studies so the courts will have the foundational information necessary to make an informed decision regarding the admissibility of spectrographic voice identification analysis.

CONCLUSION

The technique of voice identification by means of aural and spectrographic comparison is still an unsettled topic in law. Although the spectrographic voice identification method has progressed greatly since it was first introduced to a court of law in the mid-1960s, it still faces stiff resistance on the issue of admissibility in the courts today. One of the reasons for such opposition is that the method has evolved greatly since its initial application. Court decisions based on early methods of voice identification analysis are not applicable to the methods used today. No longer are voices compared on the basis of a limited group of key words. Today's aural/spectrographic voice identification method takes advantage of the latest technological advancements and interweaves several analyses into one procedure to produce an accurate opinion as to the identity of a voice.
This modern technique combines the experience of a trained examiner performing the visual analysis of the spectrograms and aural analysis of the recordings with the use of the latest instruments modern technology has to offer, all in a standardized methodology to assure reliability. Court decisions reviewing the early voice identification cases may not be relevant to present day cases because the older decisions were based on less sophisticated procedures. Most of the courts which have rejected admission have been aware of continuing work in this field and have specifically left the door open as to future admissibility. Proper presentation and explanation of the research pertaining to spectrographic voice identification analysis will allow the courts to better understand the accuracy and reliability of the spectrographic voice identification method. When the research is properly presented, the studies show that properly trained individuals, using standard methodology, produce accurate results. The current trends in the admissibility issue of voice identification evidence indicate that courts are more willing to allow the evidence into the courtroom when a proper foundation has been established which then allows the trier of fact to determine the weight to be assigned to the evidence.

Spectrographic voice identification: A forensic survey

Bruce E. Koenig
Federal Bureau of Investigation, Engineering Section, Technical Services Division, 8199 Backlick Road, Lorton, Virginia 22079
(Received 25 October 1985; accepted for publication 18 February 1986)
J. Acoust. Soc. Am. 79(6), June 1986

A survey of 2000 voice identification comparisons made by Federal Bureau of Investigation (FBI) examiners was used to determine the observed error rate of the spectrographic voice identification technique under actual forensic conditions. The qualifications of the examiners and the comparison procedures are set forth.
The survey revealed that decisions were made in 34.8% of the comparisons with a 0.31% false identification error rate and a 0.53% false elimination error rate. These error rates are expected to represent the minimum error rates under actual forensic conditions.

PACS numbers: 43.70.Jt

INTRODUCTION

The sound spectrograph is a device which produces a visual graph (spectrogram) of speech as a function of time (horizontal axis), frequency (vertical axis), and voice energy (gray scale or color differences) [1,2]. It is a well-accepted research tool that is used to study individual vowel characteristics, physiological speech anomalies, etc. However, in the field of forensic voice identification, it has yet to find approval among most scientists in phonetics, linguistics, engineering, and related disciplines as a positive test in comparing voice samples [3-6].

Historically, forensic applications were not seriously considered until 1962, when Lawrence Kersta published the results of experiments which reflected error rates of 0% to 3% for one-word spectral comparisons in closed sets (examiner always knows a match exists) of 12 or fewer speakers [7]. In 1972, the findings of a large-scale study at Michigan State University were published in which attempts were made to more closely imitate law enforcement conditions, but only spectral comparisons were made (no aural). The “forensic model” included open set trials (examiner did not know if a match existed), noncontemporary samples (1 month apart), trained examiners, and high-confidence decisions. This resulted in an approximate error rate of 2% for false identification (no match existed but the examiner selected one, or a match existed but the examiner chose the wrong one) and 5% for false elimination (a match existed but the examiner failed to recognize it). The authors of the study attempted to extend the experimental results to actual law enforcement conditions, which they thought would lower the error rates.
They theorized that examiners could aurally compare the voice samples, the number of known suspects would be limited by police investigation, there would be no time limits placed on the examiner, only very high confidence decisions would be used, and additional known voice samples could be obtained [8]. Other scientists disagreed on the study’s extensions, and stated that in actual forensic conditions the error rate would increase, not decrease [9].

In 1979, a committee of the National Research Council released its findings and recommendations in a Federal Bureau of Investigation (FBI)-funded study on the reliability of spectrographic voice identification under forensic conditions, which found, in part, that:

(1) Error rates vary from case to case due to the properties of the voices compared, the recording conditions used to obtain voice samples, the skill of the examiner, and the examiner’s knowledge about the case. Estimates of error rates are available only for a few situations, and they “do not constitute a generally adequate basis for a judicial or legislative body to use in making judgments concerning the reliability and acceptability of aural-visual voice identification in forensic applications.” [10]

(2) Examiners should fully use all available knowledge and techniques that could improve the voice identification method [10].

(3) Spectrographic voice identification assumes that intraspeaker variability (differences in the same utterance repeated by the same speaker) is discernible from interspeaker variability (differences in the same utterance by different speakers); however, that “assumption is not adequately supported by scientific theory and data.” Viewpoints on actual error rates are presently based only on “various professional judgments and fragmentary experimental results rather than from objective data representative of results in forensic applications.” [11]

FBI examiners have used the spectrographic technique since the 1950s for investigative support, but have
not provided expert court testimony on comparison results [12]. This paper presents the results of 2000 forensic comparisons, under actual law enforcement conditions, by FBI examiners.

I. SURVEY PROCEDURES

The FBI conducts forensic voice identification examinations using the spectrographic or voiceprint technique for the FBI, other Federal agencies, state and local law enforcement authorities, and many foreign governments. After each examination is conducted, a written report of findings is mailed to the contributor with the name of the examiner and the disposition of the submitted voice samples. If an identification or elimination is made, the contributor is contacted by telephone and asked if the results are consistent with interviews and other evidence in the investigation. If other information strongly supports the voice comparison result, then the contributor is told to contact the FBI if later developed evidence contradicts the finding. If the voice comparison results contradict other evidence, the matter is closely followed until legally adjudicated or investigatively closed. In the few occurrences where no final determination was possible, the voice comparison result was considered a “no decision” in the survey. The results of the last 2000 requested comparisons, spanning 15 years, were compiled and organized into total identification and elimination decisions, known errors, and no or low confidence decisions.

II. QUALIFICATIONS OF EXAMINERS

All of the individuals conducting the voice comparison examinations were FBI employees with the following qualifications: (1) at least two years of full-time experience in voice identification and analysis of tape recorded voice signals using sophisticated digital and analog analysis and filtering equipment; (2) completion of over 100 voice comparisons in actual cases; (3) completion of a basic two-week course in spectrographic analysis, or equivalent; (4) passing a yearly hearing test; (5) formal approval by other trained examiners; and (6) a minimum of a Bachelor of Science degree in a basic scientific field.

III. COMPARISON PROCEDURES

The following procedures were used, if at all possible, on every attempted voice comparison in the survey.

(1) Only original recordings of voice samples were accepted for examination, unless the original recording had been erased and a high-quality copy was still available.

(2) The recordings were played back on appropriate professional tape recorders and recorded on a professional full-track tape recorder at 7 1/2 ips. When possible, playback speed was adjusted to correct for original recording speed errors by analyzing the recorded telephone and AC line tones on spectrum analysis equipment. When necessary, special recorders were used to allow proper playback of original recordings that had incorrect track placement or azimuth misalignment.

(3) Spectrograms were produced on Voice Identification, Inc., Sound Spectrographs, model 700, in the linear expand frequency range (0-4000 Hz), wideband filter (300 Hz) and bar display mode. All spectrograms for each separate comparison were prepared on the same spectrograph. The spectrograms were phonetically marked below each voice sound.
(4) When necessary, enhanced tape copies were also prepared from the original recordings using equalizers, notch filters, and digital adaptive predictive deconvolution programs [13,14] to reduce extraneous noise and correct telephone and recording channel effects. A second set of spectrograms was then prepared from the enhanced copies and was used together with the unprocessed spectrograms for comparison.

(5) Similarly pronounced words were compared between two voice samples, with most known voice samples being verbatim with the unknown voice recording. Normally, 20 or more different words were needed for a meaningful comparison. Fewer than 20 words usually resulted in a less conclusive opinion, such as “possibly” instead of “probably.”

(6) The examiners made a spectral pattern comparison between the two voice samples by comparing beginning, mean and end formant frequency, formant shaping, pitch, timing, etc., of each individual word. When available, similarly pronounced words within each sample were compared to ensure voice sample consistency. Words with spectral patterns that were distorted, masked by extraneous sounds, too faint, or lacking adequate identifying characteristics were not used.

(7) An aural examination was made of each voice sample to determine if pattern similarities or dissimilarities noted were the product of pronunciation differences, voice disguise, obvious drug or alcohol use, altered psychological state, electronic manipulation, etc.

(8) An aural comparison was then made by repeatedly playing two voice samples simultaneously on separate tape recorders, and electronically switching back and forth between the samples while listening on high-quality headphones. When one sample had a wider frequency response than the other, bandpass filters were used to compensate during at least some of the aural listening tests.

(9) The examiner then had to resolve any differences found between the aural and spectral results, usually by repeating all or some of the comparison steps.

(10) If the examiner found the samples to be very similar (identification) or very dissimilar (elimination), an independent evaluation was always conducted by at least one, but usually two, other examiners to confirm the results. If differences of opinion occurred between the examiners, they were then resolved through additional comparisons and discussions by all the examiners involved. No or low confidence decisions were usually not reviewed by another examiner.

IV. SURVEY RESULTS

The survey found that in 2000 voice comparisons, the following decisions and errors were observed:

Decisions                Number    Percent (%)
No or low confidence      1304      65.2
Eliminations               378      18.9
Identifications            318      15.9

Errors                   Number    Percent (%)
False eliminations           2       0.53
False identification         1       0.31

Most of the no or low confidence decisions were due to poor recording quality and/or an insufficient number of comparable words. Decisions were also affected by high-pitched voices (usually female) and some forms of voice disguise.

V. CONCLUSIONS

(1) The observed identification and elimination errors probably represent the minimum error rates expected under actual forensic conditions, since investigators are not always correct in their evaluation of a suspect’s involvement, due to limited physical evidence, faulty eyewitness statements, etc.

(2) The stated results should only be considered valid when compared with examiners having the same qualifications and using the same comparison procedures.

(3) The FBI has emphasized signal analysis and pattern recognition skills for conducting voice identification examinations, more than formal training in speech physiology, linguistics, phonetics, etc., though a basic knowledge of these fields is considered important.
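The percentages in the Section IV table can be checked directly from the reported counts. The short Python sketch below is illustrative only and is not part of the original survey; the function and variable names are hypothetical. It assumes, consistent with the published figures, that each error rate is expressed relative to the number of decisions of the same type (false eliminations over eliminations, false identifications over identifications) rather than over all 2000 comparisons.

```python
# Counts as reported in the Section IV survey results table.
TOTAL_COMPARISONS = 2000
NO_OR_LOW_CONFIDENCE = 1304
ELIMINATIONS = 378
IDENTIFICATIONS = 318
FALSE_ELIMINATIONS = 2
FALSE_IDENTIFICATIONS = 1


def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


def survey_rates():
    """Recompute the survey percentages from the raw counts.

    Assumption (not stated explicitly in the paper): error rates use the
    number of decisions of the same type as the denominator.
    """
    return {
        "no_or_low_confidence": pct(NO_OR_LOW_CONFIDENCE, TOTAL_COMPARISONS),
        "eliminations": pct(ELIMINATIONS, TOTAL_COMPARISONS),
        "identifications": pct(IDENTIFICATIONS, TOTAL_COMPARISONS),
        # Share of comparisons in which a decision was reached at all.
        "decisions_made": pct(ELIMINATIONS + IDENTIFICATIONS, TOTAL_COMPARISONS),
        "false_elimination": pct(FALSE_ELIMINATIONS, ELIMINATIONS),
        "false_identification": pct(FALSE_IDENTIFICATIONS, IDENTIFICATIONS),
    }


if __name__ == "__main__":
    for name, value in survey_rates().items():
        print(f"{name}: {value}%")
```

Under that assumption the arithmetic reproduces the 0.53% and 0.31% error rates in the table and the 34.8% decision rate quoted in the abstract.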
ACKNOWLEDGMENTS

Thanks are due to the following colleagues who were involved in conducting the comparisons used in this survey: Steven A. Killion, Barbara Ann Kohus, Dale Gene Linden, Gregory J. Major, Artese Savoy Kelly, Keith W. Sponholtz, Ernest Terrazas, Richard L. Todd, and Charles Wilmore, Jr.

[1] W. Koenig, H. K. Dunn, and L. Y. Lacey, J. Acoust. Soc. Am. 18, 244 (1946).
[2] G. M. Kuhn, J. Acoust. Soc. Am. 76, 682-685 (1984).
[3] R. H. Bolt, F. S. Cooper, E. E. David, Jr., P. B. Denes, J. M. Pickett, and K. N. Stevens, J. Acoust. Soc. Am. 47, 591-612 (1970).
[4] R. H. Bolt, F. S. Cooper, E. E. David, Jr., P. B. Denes, J. M. Pickett, and K. N. Stevens, J. Acoust. Soc. Am. 54, 531-534 (1974).
[5] K. N. Stevens, C. E. Williams, J. R. Carbonell, and B. Woods, J. Acoust. Soc. Am. 44, 1596-1607 (1968).
[6] R. H. Bolt, F. S. Cooper, D. M. Green, S. L. Hamlet, J. G. McKnight, J. M. Pickett, O. I. Tosi, and B. D. Underwood, “On the Theory and Practice of Voice Identification,” N.A.S.N.R.C. Publ. (1979).
[7] L. G. Kersta, Nature 196, 1253-1257 (1962).
[8] O. Tosi, H. Oyer, W. Lashbrook, C. Pedrey, J. Nicol, and E. Nash, J. Acoust. Soc. Am. 51, 2030-2043 (1972).
[9] R. H. Bolt, F. S. Cooper, E. E. David, Jr., P. B. Denes, J. M. Pickett, and K. N. Stevens, J. Acoust. Soc. Am. 54, 531-534 (1974).
[10] R. H. Bolt, F. S. Cooper, D. M. Green, S. L. Hamlet, J. G. McKnight, J. M. Pickett, O. I. Tosi, and B. D. Underwood, N.A.S.N.R.C. Publ., 60 (1979).
[11] R. H. Bolt, F. S. Cooper, D. M. Green, S. L. Hamlet, J. G. McKnight, J. M. Pickett, O. I. Tosi, and B. D. Underwood, N.A.S.N.R.C. Publ., 2 (1979).
[12] B. E. Koenig, FBI Law Enforcement Bulletin (January and February, 1980).
[13] J. E. Paul, IEEE Circuits and Systems Magazine 1, 2-7 (1979).
[14] J. E. Paul, paper presented at Voice Interactive Systems Subtag, Orlando, FL (Oct. 1984); hosted by U.S. Army Avionics Research and Development Activity, Ft. Monmouth, NJ.
Voice Comparison
Approved by ABRE Voice ID Board - April 1999

AMERICAN BOARD of RECORDED EVIDENCE -- VOICE COMPARISON STANDARDS

Abstract

This document specifies the requirements of the American Board of Recorded Evidence for the comparison of recorded voice samples. These standards have been established for all practitioners of the aural/spectrographic method of voice identification and are intended to guide the examiner toward the highest degree of accuracy in the conduct of voice comparisons. These criteria supersede any previous written, oral, or implied standards, and became effective in 1998.

Foreword

This document was developed by members of the American Board of Recorded Evidence, a board of the American College of Forensic Examiners, following their meeting in San Diego, CA in December, 1996. The document draws upon previously published material from the International Association for Identification, the International Association for Voice Identification, The Journal of the Acoustical Society of America, The Audio Engineering Society and The Federal Bureau of Investigation for much of its content. The contents of this document are for non-commercial, educational use. It is the intent of the Board to publish this document in the official journal of the American College of Forensic Examiners.

VOICE COMPARISON STANDARDS

Table of Contents

1. Scope
2. Evidence Handling
3. Preparation of Exemplars
4. Preparation of Copies
5. Preliminary Examination
6. Preparation of Spectrograms
7. Spectrographic/Aural Analysis
8. Work Notes
9. Reporting
10. Testimony

1. SCOPE

This standard specifies recommended practices for the handling, preparation and analysis of recorded evidence to be followed by practitioners of the aural/spectrographic method of speaker identification. The document covers specific instructions for the preparation of exemplar recordings, voice spectrograms and aural comparison samples.
It defines criteria to be applied when arriving at conclusions that are based upon the oral evidence. It also includes requirements for reports and testimony that are offered by the expert witness regarding his findings in voice analyses.

This standard is intended as a guide based upon good laboratory practices for handling recordings that may be used in evidence. Persons handling evidence recordings should first obtain and follow the rules of the legal jurisdiction or jurisdictions involved. When a jurisdiction provides instructions, those should be followed. Only in the absence of such instructions should the recommendations of this standard be followed with the approval of the jurisdiction.

2. EVIDENCE HANDLING.

Since evidence involved in criminal or civil proceedings must meet the appropriate jurisdiction's Rules of Evidence, it is important to properly identify and safeguard it from the time of receipt until returned to the contributor or court. The ABRE has adopted as its standard for handling evidence the AES Standard "AES27-1996 - AES recommended practice for forensic purposes - Managing recorded audio materials intended for examination". The complete document is available at:

Audio Engineering Society, Inc.
60 East 42nd Street
New York, NY 10165

3. PREPARATION OF EXEMPLARS.

The quality of the exemplars is critical in allowing an accurate comparison with unknown voice samples.

3.1 Production. The exemplars can be prepared by the investigator, attorney, examiner, or other appropriate person. Whenever possible, an impartial individual knowledgeable of the known speaker's voice should be present to minimize attempts at disguise, changes in speech rate, adding or deleting accents, and other alterations. The known speaker should state his or her name at the beginning of the recording and repeat the unknown speaker's statement(s) from three (3) to six (6) times, depending upon the length of the unknown samples.
Normally, the person preparing the exemplar should record his or her name and that of any other witnesses present.

3.2 Duplication of Recording Conditions.

3.2.1 Microphone. Whenever possible, the same type of microphone system should be utilized when recording exemplars as was used for the original unknown recording. Therefore, if the unknown caller used a telephone, the exemplar should be prepared by having the suspect talk into one telephone instrument and be recorded at a second telephone set, located an appropriate distance away.

3.2.2 Acoustic environment. The exemplar recordings should be prepared in a quiet environment with relatively short reverberation times. Do not imitate noises present at the location of the unknown call or obvious reverberant effects.

3.2.3 Transmission line. Whenever possible, the same general type of transmission line, such as a telephone call, should be utilized when recording exemplars as was used for the original unknown recording.

3.2.4 Recording system. A good quality recording system should always be used in preparing exemplars; it is usually not necessary to imitate the system utilized in recording the unknown sample, but if the system is available and functional, it may be used. A standard cassette recorder set at 1 7/8 inches per second, an open reel tape recorder at 3 3/4 or 7 1/2 inches per second, or a digital recorder should otherwise be used. Micro cassette and other miniature formats, speeds below 1 7/8 inches per second, and poor quality/inexpensive units are not recommended. Before the known speaker is allowed to leave the exemplar-taking session, the recordings should be played back to ensure that the samples are of high quality and properly prepared.

3.2.5 Recording media. Good quality tape or other appropriate recording media should always be used in preparing exemplars; it is not necessary to duplicate the type of tape utilized in recording the unknown sample.
The tape should either be new (preferred) or properly bulk erased.

3.3 Duplication of Speech Delivery.

3.3.1 Reading v. recitation. The suspect should be allowed to review the written text or transcription before actually making the recorded exemplars. This familiarity will usually improve the reading of the text and response to oral prompts and increase the likelihood of obtaining a normal speech sample. When a suspect cannot or will not read normally, it is advisable to have someone recite the phrases in the same manner as the unknown speaker and have the suspect repeat them in a similar fashion. Ideally, the exemplar should be spoken in a manner that replicates the unknown speaker, to include speech rate, accent (whether real or feigned), hoarseness, or any abnormal vocal effect. The individual taking the sample should feel free to try both reading and recitation, until a satisfactory exemplar is obtained.

3.3.2 Repetition. Multiple repetitions of the text are necessary to provide information about the suspect's intraspeaker variability. All material to be used for comparison should normally be read or recited from three (3) to six (6) times, unless very lengthy.

3.3.3 Speech rate. Exemplars should be produced at a speech rate similar to the unknown voice sample. In general, the suspect is instructed not to talk at his or her natural speaking rate if this is markedly different from the unknown sample. An effort should be made through repetition to appropriately adjust the speech rate and cadence in the exemplar to that in the questioned recording.

3.3.4 Stress/Accents. Stress includes the emphasis and melody pattern in syllables, words, phrases, and sentences. If prominent or peculiar stress is present in the questioned recording, exemplars should be obtained in a similar manner, if possible. Spoken accents or dialects, both real and feigned, should be emulated by the known speaker. The recitation mode is the better technique for accomplishing this.
3.3.5 Effects of alcohol or other drugs. Since the degree and type of effects from alcohol and other drugs vary from person to person, an attempt to duplicate these vocal changes is not recommended when obtaining the exemplar. If the suspect appears to be under the effects of alcohol or other drugs at the time of the exemplar recording, the session should be rescheduled.

3.3.6 Other. If any other unique aural or spectrally displayable speech characteristics are present in the questioned voice, attempts should be made to include them in the exemplars.

3.4 Marking. Same as Section 2.

4. PREPARATION OF COPIES.

4.1 Playback of Evidential Recordings. The proper playback of the unknown and known voice samples is critical, since it provides the optimum output for the aural and spectral analyses.

4.1.1 Track determination. In situations where the questioned recording was made on equipment of unknown origin or configuration, it may be necessary to analyze the oxide on the recording before playing it back. The recorded track position and configuration may be determined by applying an appropriate ferrofluid to the oxide side of analog tapes in a high amplitude portion of the recording. The treated area is then viewed under low magnification to determine the track configuration and offsets.

4.1.2 Azimuth alignment. Where there is evidence of an audio level or clarity problem during playback, azimuth alignment should be checked and adjusted if necessary by either an inspection of the developed magnetic striations (see track determination above), frequency analysis of the recorded material, or adjustment of the reproducer head azimuth for maximum high frequency output. All audio miniature cassettes, standard cassettes, and open reels (other than loggers) recorded at 15/16 inches per second (2.4 centimeters per second), or less, should be carefully examined for loss of higher frequency information, which often occurs in these formats.

4.1.3 Speed accuracy.
Errors in playback speed will cause corresponding variations in the voice frequency, both aurally and spectrally. The playback speed error should be determined for all recordings containing known discrete tones, and then corrected on a reproducer with speed-adjustment circuitry. A Real-Time (RT) analyzer or Fast Fourier Transform (FFT) analyzer system should be used that allows a resolution of 1% (±0.60 hertz) or better at 60 hertz. Where a known signal is present on the recording, a frequency counter may be employed to correct tape speed. Ideally, there should be less than a 3% error between questioned and known samples that are being compared.

4.1.4 Reproducer. Using the information gleaned from the examinations of the track, azimuth alignment, and speed, a high-quality playback device is configured to allow optimum output.

4.2 Direct Copies. The following information is provided for the analog reel copies that are needed for processing on the Voice Identification, Inc., Series 700 sound spectrograph. If the spectrograph being utilized has a digital memory, the requirements for cabling and retention are still applicable. Even with digital memory systems, a high quality digital or analog tape copy should still be prepared and maintained.

4.2.1 Format. All copies are prepared in a full track, 7 1/2 inches per second format on 1.0 mil or thicker audio tape from a reputable manufacturer. Normally, new, unused reels of tape should be utilized; however, previously recorded tape can be used if either bulk erased or over-recorded on a full track recorder with no input.

4.2.2 Cabling. All copies must be prepared with good quality cables from the playback device to the line input of the recording unit. No loudspeaker-to-microphone copying procedures are permitted.

4.2.3 Recording unit. A separate professional reel recorder, or the one incorporated in the Series 700 spectrograph, is required.
At least once a year, the recorder must be checked by a technically competent individual to determine the unit's playback speed accuracy, distortion level, flutter, record/playback frequency response, and record level. The recorder must meet the following criteria: playback speed within 0.15%, distortion of less than 3% at 200 nWb/m, wow and flutter below 0.15% (NAB unweighted), record/playback frequency response of 100 to 10,000 hertz ±3 decibels at 200 nWb/m, and a 0 VU level no greater than 250 nWb/m. If the recorder does not meet all of these standards, it must be repaired and/or adjusted. If a digital system is utilized by the examiner, the system should be checked at least once a year by a technically competent individual according to the manufacturer's written instructions. Digital systems should have almost unmeasurable speed errors, wow and flutter, distortion, and frequency deviations.

4.2.4 Retention. The direct copies must be retained at normal room temperatures and humidity for at least three (3) years, unless the case has been completely adjudicated or the contributor requires the return of all materials used by the examiner.

4.3 Enhanced Copies. When the original recording contains interfering noise and/or limited frequency response, enhanced copies may provide improved audibility and more usable spectrograms. At times, separate enhanced copies will have to be prepared for the aural and spectral examinations to provide optimum results for each. The following information is specifically provided for the analog reel copies that are needed for processing on the Voice Identification, Inc., Series 700 sound spectrograph. If the spectrograph being utilized has a digital memory, the requirements for cabling and retention are still applicable. Even with digital memory systems, a high quality digital or analog tape copy should still be prepared and maintained. A written record of the settings on the devices used should be maintained.

4.3.1 Equalizers.
Parametric or graphic equalizers can boost and attenuate selected frequency bands to normalize the recorded speech spectrum. Though an FFT or RT analyzer is of considerable assistance in adjusting the spectrum, a final decision on the equalizer settings should be made by listening and/or preparing spectrograms, depending upon the enhanced copy's use. 4.3.2 Notch filters. These devices allow the selected attenuation of discrete tones present in the recordings. An FFT or RT analyzer is of considerable assistance in identifying the frequency of the tones and optimally centering the filter's notch. 4.3.3 Deconvolutional filters. These digital devices both automatically attenuate sounds correlated longer than a specified time and flatten the sound spectrum. The filter can, at times, provide improved spectrographic and aural samples for examination. Care should be taken to ensure that the adaptation rate is not set at a value that starts to delete speech information. 4.3.4 Other filters. Band pass, shelving, comb, user-characterized digital, and other filters are helpful in a small number of voice identification cases. 4.3.5 Format. Same as Section 4.2.1. 4.3.6 Cabling. Same as Section 4.2.2. 4.3.7 Recording unit. Same as Section 4.2.3. 4.3.8 Retention. Same as Section 4.2.4. 5 PRELIMINARY EXAMINATION. A preliminary examination is conducted to determine whether the unknown and known voice samples meet specific guidelines to allow continuation of the examination. 5.1 Original/Duplicate Recordings. The unknown and known voice samples must be original recordings unless listed as a specific exception below. Copies not meeting these guidelines cannot be used for examination. Short time restraints imposed by the contributor are not considered an exception. When access to the original recording is denied due to legal restraints, copies may be used under the allowed exceptions. The exceptions for not examining the original recordings are: a.
If the original recording has been erased or destroyed, the examiner should then use the best first-generation copy available; b. The copies were prepared by a qualified voice identification examiner or other technically competent individual following Section 4 guidelines; c. If the original recording is in a relatively unique format or part of a digital storage system, the examiner or other technically competent individual should prepare the copies from the original material following Section 4 guidelines. If that is not possible, then detailed telephonic and/or written instructions should be given to the individual preparing the copies. Copies produced by non-technical individuals should be closely analyzed in the laboratory to ensure that the duplication process was properly done. 5.2 Verbatim/Non-verbatim. The known, or another unknown voice sample, must be either wholly verbatim (preferred), or partially verbatim to allow meaningful comparisons with unknown voice samples. A partially verbatim sample should include phrases and sentences containing at least three (3) similar, consecutive matching words. An example of the use of partial verbatim samples would be two (2) unknown recorded false fire alarms containing, at times, nearly identical phraseology. If no verbatim recordings are submitted by the contributor, the examiner may analyze the unknown samples to determine whether they would meet the guidelines if appropriate known voice samples are submitted at a later time. 5.3 Number of Comparable words. There must be at least ten (10) comparable words between two (2) voice samples to reach the minimal decision criteria. Similarly spoken words within each sample can only be counted once. It is noted that in most voice samples at least some of the words identified at this point will not be useful in the final examinations. 5.4 Quality of Voice Samples.
This preliminary aural and spectral review is to determine if the voice samples are of sufficient quality to allow meaningful comparisons between them. 5.4.1 Disguise. Samples, or portions of samples, that contain falsetto, true whispering (in contrast to low amplitude speech), or other disguises that obviously change or obscure the vocal formants or other speech characteristics, may need to be eliminated from comparison consideration. Other types of disguise may or may not be usable, depending upon the nature of the disguise. Sometimes a known voice sample with the same type of disguise can be compared, but the examiner should exercise caution in such examinations. 5.4.2 Distortion. Samples, or portions of samples, that include high-level linear and/or nonlinear distortion should be eliminated from comparison consideration. Such distortion can result from saturation of magnetic tape or overdriven electronic circuits, and can produce artifacts, including formants that did not exist in the original speech information. 5.4.3 Frequency range. Samples, or portions of samples, that are restricted in upper frequency range and produce fewer than two complete speech formants are of limited value to the examiner. Samples producing three or more speech formants provide the examiner better information with which to make a comparison. Sometimes the use of enhanced copies can allow the frequency range to be extended, but note the limitations in Section 7.1.3. 5.4.4 Interfering speech and other sounds. Samples, or portions of samples, that contain any extraneous speech information or sounds which interfere with aural identification or spectral clarity should be eliminated from comparison consideration unless the sounds can be sufficiently attenuated through enhancement procedures. 5.4.5 Signal-to-noise ratio.
Samples, or portions of samples, containing recording system or environmental noise that impedes aural identification or spectral clarity should be eliminated from comparison consideration unless the noise can be sufficiently attenuated through enhancement procedures. 5.4.6 Variations between samples. Though the following variations can quickly end a voice comparison, the problem can often be remedied by obtaining additional known samples: a. Transmission systems. Normally, samples being compared should be produced through the same type of transmission system, for example, the telephone, a microphone in a room, or an RF transmitter/receiver. If aurally or spectrally the samples are noticeably different due to the dissimilarities in the transmission systems and filtering does not rectify these differences, no further comparisons should be made. b. Recording systems. Normally, samples being compared should be produced on either good quality, or compatible, recording systems. However, if the recordings contain uncorrectable system differences that affect aural and spectral characteristics, no further comparisons should be made. Examples of recording differences that can affect the results include high-level flutter, gross speed fluctuations, and voice-activated stop/starts. c. Speech delivery. Normally, samples being compared should have the speakers talking in the same general manner, including speech rate, accent, similar pronunciation, and so on. However, in cases where this has not been done, as in poorly produced known exemplars, no further comparisons should be made. d. Other. Any other differences between the voice samples that noticeably affect aural and spectral characteristics should be closely reviewed before proceeding with the examination. 6 PREPARATION OF SPECTROGRAMS. 6.1 Sound Spectrograph.
The examiner must use a sound spectrograph, or a digital system, that allows the identification and marking of each speech sound on the spectrogram by either manual manipulation of the drum while listening to the recorded material or the separate identification of the individual sounds on a computer monitor. Spectrographs used must be of professional manufacture, such as the Voice Identification 700 Series or professional computerized systems, such as the Kay Elemetrics Model 5500. The spectrograph should be calibrated at least every six (6) months according to the manufacturer's instructions. 6.1.2 Print Quality. Spectrographic prints must be produced either in an analogue format or, if from a computerized system, must be printed with a minimum of 600 dots per inch resolution. 6.2 Format. 6.2.1 Filter bandwidth. A 250 to 300 hertz bandwidth filter is recommended for the production of most spectrograms. A 450 to 600 hertz bandwidth filter may sometimes improve the formant appearance for high-pitched voices. Narrower filters should only be used for non-voiced sounds and calibration purposes. 6.2.2 Mode. The bar display mode must be used for all spectrograms with the high-shaping equalizer engaged (except when an enhanced copy is being used that has already properly shaped the spectrum). 6.2.3 Frequency range. An appropriate frequency range should be chosen that fully displays all speech sounds in the unknown voice sample. The known voice spectrograms are then prepared using the same frequency range. 6.2.4 Direct v. enhanced. When enhanced copies are used for the examination, at least some spectrograms must be prepared from the direct copies. 6.3 Marking. Each spectrogram must be marked below each speech sound, either phonetically, orthographically, or a combination of both. Great care should be taken to ensure that the speech sounds are accurately designated as to how they were spoken, which may not be their correct pronunciation.
The spectrograms should be appropriately labeled with identifying information such as specimen, case, and laboratory identifiers. The spectrograms may be marked consecutively for each unknown and known sample. Known and unknown sounds may be marked in different colored ink to facilitate comparisons. 6.4 Retention. All spectrograms should be retained for at least three (3) years after completion of the examination, unless the case has been completely adjudicated or the contributor requires the return of all materials used by the examiner. 7 SPECTROGRAPHIC/AURAL ANALYSIS. 7.1 Pattern Comparison. 7.1.1 Intraspeaker consistency. The examiner must visually compare similarly spoken words within each voice sample to determine the range of intraspeaker variability. If there is considerable variability, the word must not be used for comparison. If there is considerable variability in a number of words in a sample, the sample should not be used for comparison. This is often encountered with disguised voices and known exemplars from uncooperative individuals. 7.1.2 Similar speech sounds. Only speech sounds of similarly spoken words should be compared between voice samples. Comparison of the same speech sound in different words should be avoided. 7.1.3 Direct v. enhanced. When using spectrograms from direct and enhanced copies, both should be visually compared to words from the known or questioned voice sample. The examiner should be cognizant that the enhancement process may distort the spectral energy distribution, thus increasing the likelihood of a false elimination. 7.1.4 Number of comparable words. This is determined by the total number of different words present in both samples that meet the standards set forth in Sections 5.4.1 through 5.4.6. A similar or nearly similar word appearing more than once in one or both samples should be counted only as one comparable word. 7.1.5 Speech characteristics. a. General formant shaping and positioning.
A formant is a band of acoustic energy produced by spoken vowels and resonant consonants. Formants and other vocal patterns produced on the spectrograms are visually compared by the examiner. Generally, the spoken word will produce a set or sets of three (3) or more observable formants. A good pattern match exists when the majority, if not all, of the formant shaping and positioning exhibit strong similarities. A precise photographic match rarely occurs even between two (2) consecutive utterances of the same word spoken by the same individual. Conversely, even very different voices can exhibit similarities in general formant shaping and positioning for some words. Examination of these patterns must be conducted between each comparable word of the voice samples. b. Pitch striations. Pitch, or fundamental frequency, can be a useful characteristic for distinguishing between speakers. Pitch information is displayed on a spectrogram in the form of closely-spaced vertical striations, with the spacing and shaping being useful parameters of the individual talker. Differences in the pitch rate and the smoothness or coarseness of the pitch quality should be examined both spectrally and aurally; but most talkers are characterized by fairly wide pitch ranges. c. Energy distribution. Energy distribution of certain vocal sounds can assist the examiner in analyzing similarities and differences between voice samples. Certain phonemes are displayed primarily by their energy distribution diffused across a certain frequency range. Plosive and fricative consonants are displayed along the frequency axis as concentrated dark energy distribution patterns. Although the characteristics of energy distributions, especially bursts, are more dependent upon the type of sounds produced than the speakers, some talker-dependent characteristics can be observed. d. Word length. The time length of a particular spoken word can be readily compared between voice samples.
When a person speaks more slowly or faster than normal, the time between words is usually more affected than the length of the individual words. It is noted that a word appearing at the end of a sentence or phrase is usually longer than the same word appearing in the middle. e. Coupling. The effects of inappropriate coupling can often be observed in spectrograms as either diminished or enhanced energy in the frequency range between the first and second formants. Coupling is related to the open/close condition of the oral and nasal cavities. In normal speaking the nasal cavity is coupled to the oral cavity for nasal sounds, such as "n", "m", and "ng". However, some talkers are hypernasal, producing nasal-like characteristics in inappropriate vocal sounds; other speakers are hyponasal, producing limited nasal qualities even when appropriate. f. Other. Plosives, fricatives, and inter-formant features should be spectrally compared between samples by the examiner. Other sounds such as inhalation noise, repetitious throat clearing, or utterances like "um" and "uh" can sometimes be compared to the known exemplar if they have been successfully replicated. 7.2 Aural Comparison. 7.2.1 Short-term memory. An aural short-term memory comparison must be conducted either by playing the two (2) samples on separate playback systems with a patching arrangement to allow rapid switching between them or by recording short phrases or sentences from each sample on the same recording. The short-term memory playback tape should contain all words used in the spectrographic comparison. The two (2) samples should be reviewed at approximately the same speech amplitude and with the same general frequency range. The frequency range may be normalized between the samples by using band pass filtering on the sample with the widest frequency range to duplicate the range found on the other sample. 7.2.2 Direct v. enhanced.
When direct and enhanced copies have been produced, both should be aurally compared to the known or questioned sample. The examiner should recognize that though enhancement procedures often improve intelligibility, they can also produce changes, at times, that can make samples of the same talker sound somewhat different. 7.2.3 Pronunciation. Only similarly pronounced words should be compared between samples. 7.2.4 Intraspeaker consistency. The examiner must aurally compare similar words within each sample to determine if they are spoken in a generally consistent manner. If intraspeaker variability is present for a particular word, that word should not be compared to the other voice sample. If considerable intraspeaker variability is present in the entire sample, that sample should not be used for comparison. This is often the problem with disguised speech and known exemplars from uncooperative individuals. 7.2.5 Speech characteristics. a. Pitch. See sect. 7.1.5.b. b. Intonation. Intonation is the perception of the variation of pitch, commonly known as a melody pattern. Spontaneous conversation will normally exhibit this characteristic to a greater extent than a passage that is read by the speaker. c. Stress/Emphasis. The stress or emphasis within the words of the sample should be similar for different recordings of the same talker when no disguise is present. d. Rate. The rate of speaking under the same conditions is relatively constant for a particular talker. However, rates of reading, recitation, and conversation will normally vary for the same talker. e. Disguise. Obvious vocal disguises can disqualify a sample for comparison purposes. The examiner should carefully analyze the characteristics of the disguise in a sample and then determine if it is possible to make a meaningful comparison with another sample, whether it also contains a disguised voice or not. f. Mode.
Certain speaker-dependent characteristics can be discerned from the mode in which a speaker initiates sounds. Speakers range from gradually to abruptly initiating voicing, which can reveal useful similarities and differences between two samples. g. Psychological state. Listening usually reveals many of the effects of an altered psychological state upon the voice. Alterations may be characterized as nervousness, over-excitement, excessive monotone, crying, and so on. The examiner should be cautious in comparing samples with major changes due to an altered psychological state. h. Speech defects. Speech defects are abnormalities in the voicing of sounds, and can include lisps, pitch and loudness problems, and poor temporal sequencing. Except for extreme cases, there are no criteria to assess whether a voice is considered normal or defective. Obvious, or even subtle, defects in the questioned or known voice samples can often provide vital information in the comparison decision. i. Vocal quality. Vocal quality is the perception of the complex, dynamic interplay of the laryngeal voicing (pitch, intonation, and stress), articulator movement, and oral cavity resonances. Since each individual’s voice is relatively unique in its vocal quality, comparisons can provide important information regarding similarities and differences between the voice samples. j. Other. Examples of other useful speech characteristics that are occasionally heard include long-term fluctuations of pitch (vibrato), vocal fry (extremely low pitching), pitch breaks, and stuttering. 7.3 Conclusions. Every aural/spectrographic examination conducted can only produce one of seven (7) decisions: Identification, Probable Identification, Possible Identification, Inconclusive, Possible Elimination, Probable Elimination, or Elimination.
The following descriptions for each decision are the minimal decision criteria, and must be adhered to by the examiner, except that a lower confidence level can always be chosen, even though the criteria would allow a higher degree of confidence. Within the range of probable decisions, the examiner may wish to clarify his findings, e.g. low probability or high probability, depending upon the quantity and quality of the comparable material available to the examiner. Comparable words must meet the previously listed criteria. The following are the seven (7) possible decisions. 7.3.1 Identification. At least 90% of all the comparable words must be very similar aurally and spectrally, producing not less than twenty (20) matching words. Each word must have three (3) or more usable formants. This confidence level is not allowed when there is obvious voice or electronic disguise in either sample, or the samples are more than six (6) years apart. 7.3.2 Probable Identification. At least 80% of the comparable words must be very similar aurally and spectrally, producing not less than fifteen (15) matching words. Each word must have two (2) or more usable formants. 7.3.3 Possible Identification. At least 80% of the comparable words must be very similar aurally and spectrally, producing not less than ten (10) matching words. Each word must have two (2) or more usable formants. 7.3.4 Inconclusive. Falls below either the Possible Identification or Possible Elimination confidence levels and/or the examiner does not believe a meaningful decision is obtainable due to various limiting factors. Comparisons that reveal aural similarities and spectral differences, or vice versa, must produce an Inconclusive decision. 7.3.5 Possible Elimination. At least 80% of the comparable words must be very dissimilar aurally and spectrally, producing not less than ten (10) words that do not match. Each word must have two (2) or more usable formants. 7.3.6 Probable Elimination.
At least 80% of the comparable words must be very dissimilar aurally and spectrally, producing not less than fifteen (15) words that do not match. Each word must have two (2) or more usable formants. 7.3.7 Elimination. At least 90% of all the comparable words must be very dissimilar aurally and spectrally, producing not less than twenty (20) words that do not match. Each word must have three (3) or more usable formants. This confidence level is not allowed when there is obvious voice or electronic disguise in either sample, or the samples are more than six (6) years apart. 7.4 Second Opinion. A second opinion is not required, but may be obtained from another certified examiner when desired by either the examiner or the party submitting the evidence. 7.4.1 Independence. A second opinion must be completely independent of the first examiner's decision, and no oral or written information shall be provided regarding that first opinion. 7.4.2 Material provided. The second examiner should only be provided the originals, or direct and enhanced copies, any work notes under Sections 2, 3, and 4, and the spectrograms. The second examiner must not be provided any materials that reflect, even partially, the first examiner's opinions regarding the examination. 7.4.3 Examination. A thorough analysis should be conducted by the second certified examiner, using the guidelines in Sections 5, 6 and 7 (except for 7.4). It is left to the discretion of the second examiner whether to prepare additional spectrograms or copies. 7.4.4 Resolving differences. If different decisions are reached by the two (2) examiners, a detailed discussion between them of the analysis will often lead to a resolution. If not, the lower confidence level must be reported and testified to when both decisions are an identification or elimination. If split between an identification and elimination, no matter what the confidence level, the decision must be inconclusive.
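The numeric thresholds of Section 7.3 and the two-examiner rule just stated can be sketched as a small lookup. This is only an illustration under stated assumptions: all names are hypothetical, `min_formants` is assumed to be the smallest number of usable formants among the counted words, the examiner's judgment of "very similar" or "very dissimilar" is taken as an input rather than modeled, and treating a pairing with Inconclusive as Inconclusive is an assumption the standard does not spell out.

```python
ORDER = [
    "Elimination", "Probable Elimination", "Possible Elimination",
    "Inconclusive",
    "Possible Identification", "Probable Identification", "Identification",
]


def level(count, comparable, min_formants, disguise=False, years_apart=0.0):
    """Confidence level 0..3 for one direction (matching or non-matching words)."""
    if comparable <= 0:
        return 0
    share = count / comparable
    if (share >= 0.90 and count >= 20 and min_formants >= 3
            and not disguise and years_apart <= 6):
        return 3  # Identification or Elimination
    if share >= 0.80 and count >= 15 and min_formants >= 2:
        return 2  # Probable
    if share >= 0.80 and count >= 10 and min_formants >= 2:
        return 1  # Possible
    return 0


def minimal_decision(similar, dissimilar, comparable, min_formants,
                     disguise=False, years_apart=0.0):
    """Highest decision the Section 7.3 minimums would allow."""
    ident = level(similar, comparable, min_formants, disguise, years_apart)
    elim = level(dissimilar, comparable, min_formants, disguise, years_apart)
    if ident and not elim:
        return ORDER[3 + ident]
    if elim and not ident:
        return ORDER[3 - elim]
    return "Inconclusive"


def resolve(first, second):
    """Reported decision when two examiners still differ after discussion."""
    a = ORDER.index(first) - 3   # negative: elimination side, positive: identification side
    b = ORDER.index(second) - 3
    if a * b <= 0:               # split sides, or an Inconclusive (assumed case)
        return "Inconclusive"
    lower = a if abs(a) <= abs(b) else b
    return ORDER[3 + lower]
```

For example, 19 very similar words out of 20 comparable words with three usable formants each would allow at most Probable Identification, because fewer than twenty words match. The examiner remains free to report a lower confidence level than the function returns.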
A third independent decision can be obtained, but the result will be the lowest confidence level, or an inconclusive, of all the examiners involved. 7.4.5 Reporting. Whenever possible, the second examiner should prepare a short report listing the results of the second opinion. This is not necessary if both examiners are in the same organization. The name and results of the second opinion can then be included in the first examiner's work notes. 8 WORK NOTES. 8.1 Required Information. The examiner's work notes should be in accordance with Rule 26 of the Federal Rules of Evidence - Expert Witness Statement categories, and should contain, as a minimum, the following information: a. Laboratory, case, and specimen identifiers; b. Description of submitted evidence; c. Chain-of-custody documentation; d. Track determination, azimuth alignment, and speed accuracy information, where required, for each submitted sample; e. Information on the duplication processes, including the type of equipment and format copies; f. Information on the enhancement processes, if any, including the type of equipment, filter settings, and format copies; g. List of the exact words used for comparison and whether they matched or not; h. Name of any second opinion examiner and the results of that examination; i. Final decision. 8.2 Retention. The work notes should be retained for at least three (3) years after completion of the examination unless the contributor has requested that all material relating to the case be returned. 9 REPORTING. 9.1 Format. The report should be typed, dated, and in a standard laboratory or business letter style. The content of the report should be in conformity with Rule 26 of the Federal Rules of Evidence. The following information must be included: a short description of the evidence being examined, a summary of the examination performed, the final decision, and a statement of accuracy. Exhibits, handouts and supporting documentation should be separate from the report.
Business matters, such as payment of fees, should be set forth in separate communications and not included within the report. 9.2 Decision Statement. The report must clearly state which of the seven (7) decision options listed in Section 7.3 was the final result of the examination. 10 TESTIMONY. The American Board of Recorded Evidence does not take a position as to whether or not a certified examiner should provide testimony regarding examination results. However, an examiner must follow the standards set forth in this document, including the appropriate criteria set forth in this section, whether or not they provide testimony. 10.1 Testimony v. Investigative Guidance. Each specific organization or individual examiner must decide before conducting spectrographic voice identification examinations whether testimony will be provided. If not, the contributor must be advised of the investigative guidance policy and all oral and written reports should set forth this information. 10.2 Qualification List. The presentation of the qualifications of the examiner should be in conformity with Rule 26 of the Federal Rules of Evidence - Expert Witness Statement categories, regarding expert witnesses. 10.3 Pretestimony Conference. Discussion of the examination with the attorney before judicial proceedings is an important aspect of providing meaningful testimony and educating the attorney on the strengths and limitations of the technique. The conference should include a candid discussion of the inherent problems, identification of scientific literature that is either critical or supportive, and other information important to the testimony. 10.4 Appearance and Demeanor. Whenever possible, examiners must dress in proper business attire or appropriate law enforcement or military uniform for all judicial proceedings, maintain a professional demeanor even under adversarial conditions, and direct explanations to the jury, when present. 10.5 Presentation.
The examiner should provide to the judge and/or jury, as a minimum, his/her qualifications, an overview of the spectrographic technique, its scientific basis, the details of the analysis procedures followed in the specific case, and the results of the analysis. The information should be presented in a form understandable to non-experts, but with no loss of accuracy. ANOMALIES ASSOCIATED WITH COMPUTER EDITING OF RECORDED TELEPHONE CONVERSATIONS Second International Chemical Congress, Forensic Symposium, Fall 1995, San Juan, Puerto Rico, by Steve Cain During a two to three year period, a Midwestern entrepreneur had been interested in filing a patent on an innovative new product. As a home-based businessman, much of his product development and marketing strategies were accomplished through contact with several dozen product development attorneys and other business advisors over his home and office telephones. When requested to provide the original telephone tape recordings, he claimed they had been inadvertently misplaced but that he had made copies of the relevant conversations which he later surrendered for forensic analysis. Although unsuccessful in ever examining the original tapes, I did have two copies of each of the original tapes. The original recorders were described as Radio Shack type portable machines together with a telephone interface device and two consumer brand high speed dubbing cassette recorders which purportedly were used in the selective dubbing of individual telephone conversations from the original tapes. During review of ten composite copy tape conversations, it became apparent through both aural and spectrographic/waveform analysis that there existed a number of suspicious record events (i.e. “anomalies”) which deserved further instrumental attention. A KAY Digital Spectrograph Model 5500 was used for the bulk of the analysis.
As the original tapes were not available, magnetic development was not deemed appropriate and therefore traditional digital waveform/spectrographic techniques were utilized in the examination process. Before displaying examples of the computer-related edited phenomena, it may prove beneficial to review the traditional analog anomalies often associated with falsification of recordings. These include: 1. Deletion: the elimination of words or sounds by stopping the tape and over-recording unwanted areas. 2. Obscuration: the mixing in of sounds of amplitude sufficient to mask waveform patterns which originally would show stop/starts in inappropriate places. 3. Transformation: the rearranging of words to change content or context. 4. Synthesis: the adding of words or sounds by artificial means or impersonation. Anomalies oftentimes include the following phenomena: 1. Gaps: segments in a recording which represent unexplained changes in content or context. 2. Transients: short, abrupt sounds exemplified by clicks, pops, etc. 3. Fades: gradual loss of volume. 4. Equipment Sounds: context inconsistencies caused by the recording equipment (such as hum, static, and varying pitches). 5. Extraneous voices: background voices which at times appear to be as near as the primary voice or can even mask the primary voice. (1) Modern day technology and the development of the DSP chip have greatly complicated the issue of tape tampering detection and further increase the likelihood that altered tapes can escape detection. The Federal Bureau of Investigation Signal Analysis Branch has already acknowledged, “it is difficult to detect some alterations when a recording is digitized onto a computer system, physically or electronically edited and recopied onto another tape.” (2) Recently there have been at least 20 different manufacturers of desktop computer editing workstations or digital recorders which can be used as “turn key” editing systems.
Software related computer cards can transform a personal computer into a sophisticated digital audio editing machine. Some of the systems do require that the initial conversion of the analog format be accomplished by a digital audio recorder before accessing the computer hardware.(3) Digitization of speech can sometimes leave discernible artifacts, especially “aliasing” effects. Digitizing the speech signal involves two distinct processes known as Sampling and Quantizing, which are the true core of the digital recording process. Speech digitization requires filtering by an appropriate low pass filter which should remove any frequencies above one-half the sampling rate of the equipment. The sampling process refers to the transforming of the low-pass-filtered electronic waveform into many thousands of small units of time. Each of these time units is later quantized with respect to its amplitude. The Nyquist Theorem, however, requires that the sampling frequency be at least twice as high as the highest frequency converted into digital format. If this theorem is not followed, an undesirable effect known as Aliasing occurs.(4) High frequency changes in amplitude are not properly encoded, leaving some information lost, and occasionally new erroneous signals are generated. “If the throughput frequency is greater than one-half the sampling frequency, aliasing inevitably occurs.”(5) For example, if S is the sampling rate, F is a frequency higher than one-half the sampling rate, and N is an integer, a new frequency, Fa, is also created at Fa = ±NS ± F. Therefore, if S equals 44 kHz and the input signal is at 36 kHz, an alias frequency would occur at 8 kHz. If the input signal is at 40 kHz, a 4 kHz aliasing signal would occur.(6) Other aliasing effects involve Image Aliasing, which occurs in the multiple images produced by the sampling process.
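The aliasing relation Fa = ±NS ± F given above can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, frequencies are in hertz, and the sampler is assumed to be ideal with no low pass filter ahead of it.

```python
def alias_frequency(f_hz, fs_hz):
    """Frequency at which an input tone appears after sampling at fs_hz.

    The tone is folded into the band from 0 to fs_hz / 2.
    """
    folded = f_hz % fs_hz
    return min(folded, fs_hz - folded)


def image_frequencies(f_hz, fs_hz, max_n=2):
    """Image components |N*S - F| and N*S + F for N = 1..max_n, sorted."""
    images = set()
    for n in range(1, max_n + 1):
        images.add(abs(n * fs_hz - f_hz))
        images.add(n * fs_hz + f_hz)
    return sorted(images)


# With a 44 kHz sampling rate, a 36 kHz input folds down to 8 kHz,
# and a 40 kHz input folds down to 4 kHz.
print(alias_frequency(36_000, 44_000))    # 8000
print(alias_frequency(40_000, 44_000))    # 4000
print(image_frequencies(36_000, 44_000))  # [8000, 52000, 80000, 124000]
```

A tone below one-half the sampling rate is returned unchanged by `alias_frequency`, which is why the low pass filter ahead of the converter is set at or below that limit.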
If a 44 kHz sampler is utilized and a 36 kHz input signal is analyzed, some of the resulting output frequencies would be 8 kHz, 52 kHz, 80 kHz, etc. In addition, harmonic aliasing can exacerbate the problem. Complex tones, for example, could result in aliasing frequencies generated separately for each harmonic. The practical result would be that additional harmonics would be added to the digitized signal, which normally would be multiples of the harmonic of the fundamental frequency.(7)
As DSP technology and its respective chips become more sophisticated and available to the consumer, the ability to edit, alter or fabricate audio recordings will be enhanced. Computer-based digital editing now permits the generation of lengthy, fabricated audio segments, sometimes devoid of the traditional transients and other editing artifacts associated with analog tape tampering. The results of an aural/waveform/spectrographic analysis on the evidence tape copies disclosed a number of computer-related editing anomalies associated with significant portions of the recorded telephone conversations, namely:
1. Uncharacteristic tones in the recordings sometimes occurring at even numbered multiples of each other (e.g. 4, 8, 16, 20 kHz).
2. Omission or deletion of material.
3. Abrupt beginning and ending of ongoing speech.
4. Aliasing effects.
The more subtle effects of the digital editing process involving “aliasing” artifacts can sometimes be heard but are more readily apparent in the spectrographic/waveform analysis of the altered speech signals. Examples of the digital editing process associated with this case are displayed in the accompanying sets of overhead transparencies. A short term aural composite tape was produced and should further corroborate the nature and extent of the digital editing anomalies associated with the computer edits found in this examination process.
BIBLIOGRAPHY
1.
Steve Cain, “Sound Recordings as Evidence in Court Proceedings,” article accepted for publication by National District Attorneys Association, The Prosecutor, to be published late 1995.
2. Bruce E. Koenig, “Authentication of Forensic Audio Recordings,” Journal of Audio Engineering Society, 38, 1/2, 1990, January/February, page 4.
3. Steve Cain, “Verifying the Integrity of Audio and Video Tapes,” paper published in The Champion Magazine, July, 1993.
4. Jordan S. Gruber, Fausto Poza and Anthony Pellicano, Audio Tape Recordings: Evidence, Experts and Technology, Volume 48, American Jurisprudence Series, Lawyers Cooperative Publishing, Rochester, New York, 1993, pp. 108-109.
5. Ken C. Pohlmann, Principles of Digital Audio, Howard W. Sams and Company, 1992, pp. 46-48.
6. Ibid 5, p. 45.
7. Ibid 5, p. 48.
VERIFYING THE INTEGRITY OF AUDIO AND VIDEOTAPES
By Steve Cain
Champion - July 1993
An ever increasing reliance on tape evidence in criminal prosecutions, especially in organized crime and drug cases, underscores the importance of tape integrity and the methods used to qualify or disqualify tape evidence. This article will discuss some of the procedures utilized in analog and digital editing of tapes and assess their potential threat vis-a-vis tape tampering issues; the "legal admissibility" issue surrounding tape recorded evidence, to include defining strategies for the defense to require the government to release the "best evidence" for analysis purposes; and an overview of the accepted techniques for the scientific analysis of recorded tape evidence.
Tape Editing Technology
The forensic examination of "tampered tapes" should include an inspection of the original tape(s) and the recorder(s) used to produce the tape(s). In the simple case, the existence of an electronic edit and/or evidence of physical splicing will produce acoustic irregularities which can be viewed with instruments and documented.
Modern day technology was apparently used in the electronic editing performed on the disputed Gennifer Flowers/Gov. Bill Clinton tape recordings. The Cable News Network (CNN) asked that I provide an expert opinion on Mr. Clinton's voice and also asked that I examine the tape submitted by the Star News Magazine for any evidence of possible tampering. The latter examination disclosed a number of suspicious acoustic events (anomalies) including: a total loss of signal (dropouts); a change in the speakers' frequency response during different telephone conversations; and "spikes" (audible sounds of short duration which are often attributable to normal stop/start and pause functions of the recorder). In order to provide any definitive conclusion, I requested the original recorder and tape to determine if these electronic edits were intentional edits or possible malfunctions/anomalies of the recorder/microphone equipment. CNN has never received the requested tape or recorder from the Star News Magazine. Digital editing of both audio and video tapes, however, greatly complicates the issue and increases the likelihood that altered tapes can escape detection. The Federal Bureau of Investigation (FBI) Signal Analysis Branch has already acknowledged, "It is difficult to detect some alterations when a recording is digitized into a computer system, physically or electronically edited and recopied on to another tape."*1* The days of utilizing a razor blade and splicing tape to effectively alter or "doctor" a recorded conversation are all but gone. Right now there are at least twenty manufacturers of desktop computer editing work stations or digital recorders which can be used as "turn key" editing systems. Software and add-on computer cards can transform an IBM personal computer or a Macintosh computer into a sophisticated digital audio editing machine.
Some of the systems require a digital audio recorder for initial conversion of the analog format before accessing the computer hardware. These editing work stations were developed to save the motion picture and recording industries money by precluding the necessity of recording sessions, or to correct subtle errors in multi-track releases. Some computer boards and software cost less than $1,000 and provide both recording and editing of sound in an IBM compatible or Mac personal computer format. Editing options are practically inexhaustible, giving the operator the ability to alter the tape in a word processor type of mode (i.e. cut and paste, copy, delete, etc.) while selected playback files utilize subtle cross-fading effects that can "shape" the sound. The typical telltale signs of traditional analog recorder editing, including "clicks, pops" and other short duration sounds, can now all be effectively removed without any detectable, audible clue.
Traditional Editing Techniques
Present tape editing practices include either physical splices or electronic editing on one or more analog tape recordings whenever portions of selected conversations are over-recorded (i.e. erased) or the original recorder was stopped and restarted inappropriately. While listening to the tape, the attorney may first suspect an alteration by noting either unexplained transients, equipment sounds, extraneous voices, or inconsistencies with provided written information. The major categories of tape alterations include: (1) Deletion; (2) Obscuration; (3) Transformation; and (4) Synthesis.*2* Deletion of unwanted material can readily be done through splicing or by using one or more recorders to erase, rerecord, or stop/pause the recorder at strategic points within the conversation. Obscuration involves the distortion of a recorded signal with the purpose of rendering selective portions unintelligible.
This method, for example, was used during the editing of the infamous 18 minute gap in the Watergate tapes. This technique is also used to mask splices, clicks, or suspicious transients and is more difficult to detect than deletion methods. By judicious use of two tape recorders, one may add "noise" to the copy and thereby mask the original recording and render it less intelligible. One can also reduce the volume of the slave recorder and thus weaken the amplitude of target conversations on the original tape. Transformation involves the alteration of portions of a recording so as to change the meaning of what is said. The technique is similar to deletion practices, but greater skill and care must be applied, as a knowledge of acoustic phonetics is required to avert a suspicious edit. Lastly, synthesis is the generation of artificial text by adding background sounds or conversation to the tape copy which were not present on the original recording. The addition of selective phrases can be accomplished if a sufficient data base library of recorded conversations is available. It must be emphasized that all of the traditional analog methods of altering audiotapes can be more efficiently and surreptitiously accomplished through the use of digital editing work stations.
Tape Authentication And Detection Of Edits
With the threat of digital editing looming larger, it is more imperative than ever that both the original tapes and recorders be made available for inspection. The FBI's Signal Analysis Branch has developed a set of well defined procedures for the acceptance of authentication requests which provides an excellent overview of what the government considers to be essential for a scientifically valid tape analysis:
1. Sworn testimony or written allegations by defense, plaintiff, or government witnesses of tampering or other illegal acts.
The description of the problem should be as complete as possible, including exact location in recording, type of alleged alteration, scientific test performed, and so on;
2. The original tape must be provided. Copies of a duplicate tape cannot be authenticated and are normally not accepted for examination by the FBI;
3. The tape recorders and related components used to produce the recording must be provided; and,
4. Written records of any damage or maintenance done to the recorders, accessories, and other submitted equipment must be provided.
In addition, there must be a detailed statement from the person or persons who made the recording describing exactly how it was produced and the conditions that existed at the time, including:
A. Power source, such as alternating current, dry cell batteries, automobile electrical system, portable generator.
B. Input, such as telephone, radio frequency (RF) transmitter/receiver, miniature microphone, etc.
C. Environment, such as telephone transmission line, small apartment, etc.
D. Background noise, such as television, radio, unrelated conversations, computer games, etc.
E. Foreground information, such as number of individuals involved in the conversation, general topics of discussion, closeness to microphone, etc.
F. Magnetic tape, such as brand, format, when purchased and whether previously used.
G. Recorder operation, such as number of times turned on and off in the record mode, type of keyboard or remote operation for all known record events, use of voice activated features, etc.
Also recommended is a typed transcript of the recording, to include both English and foreign language versions.*3* It is essential in all tape authentication exams to obtain the original recorder and tape, as copies cannot normally be authenticated.
If the defense is encountering difficulties in obtaining the necessary "originals," they may wish to cite Koenig's article*4* as an authoritative resource which specifies the reasons why the original evidence is essential in any tape tampering request. If the original tape and recorder are not available for inspection, the forensic expert can still conduct a preliminary examination of the submitted "copy" for any evidence of discontinuous recorder operation, although all conclusions must necessarily be qualified regarding possible editing effects. The examination process normally includes aural, physical, and instrumental analyses of the evidential tape. Phase continuity, speed determination, azimuth determination, waveform analysis, spectrographic and narrow band spectrographic analysis are among the techniques employed to evaluate the tape. The techniques and tests are usually adequate in the detection of altered analog recordings. Fortunately, the vast majority of altered tapes today are still analog tapes. Defense counsel should have a working knowledge of how tapes are analyzed. First, there is a physical inspection of the submitted tape, the tape housing, the tape recorder and all ancillary equipment used to make the original recording: microphones, telephone couplers, transceivers, etc. A magnetic development test involves the application of a special fluid which, under proper magnification, will make visible the head track configuration, off-azimuth recordings, start/stop functions, damage to recording heads, etc. The forensic expert can subsequently determine whether the submitted tape is a copy, has been over-recorded, or was made on a different recorder than the one submitted. The original recorder can be identified by slight speed fluctuations and deformities in the rotating parts, which provide a unique "wow and flutter" signature which can be measured.
Also, spectrum analysis can be used to measure slightly different signals transmitted through the microphone or telephone equipment. All of the signal analysis equipment can be useful in answering questions related to bandwidth, distortion effects, or unique tones generated during the original recording process.
Forensic Video Examinations
The forensic video examiner is concerned with the authenticity and integrity of the signal. Questions relating to whether the tape is a copy, a compilation of other tapes, or an edited version are important considerations. Forensic examinations of videotapes usually consist of both a visual and aural examination. One of the more important pieces of equipment used in forensic video examinations is a waveform monitor, which is a specialized oscilloscope. It displays voltage versus time and has specialized circuits to process the signal. If any editing occurs, then it is possible to display the signal aberration on the display screen of the instrument.*5* Additional tests include measurements of the chrominance, hue and burst of the color videotape by using a vector scope. The vector scope measures the chrominance information and allows for the examination of matching bursts of multiple signals. It also permits the investigation of edit points. Vertical interval and horizontal information, known as video synchronizing information, can be observed on a cross pulse monitor, and with proper application one can often determine if the videotape is a copy or an original. In cases where the helical heads are out of alignment, a set of marks could exist for each succeeding generation or copy.*6* Lastly, if one suspects videotape editing, the examination will require a frame-by-frame inspection, with the use of waveform monitors, vector scopes, and a cross pulse monitor together with other forensic equipment as deemed appropriate.
It must be noted that there are sophisticated production studios that can edit videotapes in such fashion that traditional methods of detection are no longer adequate. Studios capable of producing such tapes are, for now, generally limited to larger metropolitan areas.
Legal Issues/Admissibility
In their article, "Attacking The Weight Of The Prosecution's Scientific Evidence,"*7* authors Edward J. Imwinkelried and Robert Scofield explore the thesis that the accused has a constitutional right to introduce expert testimony which can generate a reasonable doubt. The authors warn, however, that this right to relevant criminal evidence is in fact very limited in scope, namely: (1) it covers only important or "crucial evidence"; and (2) the defense must show that the evidence is "trustworthy." Likewise, authors Nancy Hollander and Lauren Baldwin point out that the admissibility of an expert's testimony is often dependent on whether the expert is testifying for the defense or for the prosecution.*8* In the field of forensic tape analysis, there exist few competently trained and certified experts available to the defense to challenge the accuracy of government tapes and/or the conclusions of the government experts. Even though I have over twenty years experience in federal law enforcement and as a Treasury Department crime laboratory supervisor, I am routinely subjected to concerted efforts by the prosecution to attack my credibility and the accuracy of my conclusions. As you would expect, as a government expert, I never received any criticisms from the prosecutor concerning my credentials or the accuracy of my findings.
Access To Evidence
More and more courts are being forced to address the question of whether the government has the privilege to withhold technical data from a defendant challenging the integrity of electronic surveillance evidence.
A few courts have recognized a "qualified privilege" for the government to such data (by drawing an analogy to an "informer's privilege"), but have not been very sensitive to the unique nature of electronic surveillance evidence nor defined the showing required to overcome the government's "qualified privilege." Under the due process clause, criminal defendants should be afforded a meaningful opportunity to present a complete defense.*9* To safeguard this right the court has recognized the principle of "constitutionally guaranteed access to evidence ...."*10* This access to evidence, however, is not absolute, as indicated in Roviaro v. United States,*11* wherein the court recognized the government's limited privilege to withhold the identity of informers. Two circuit courts of appeal have extended the limited privilege recognized in Roviaro to the nature and location of electronic surveillance equipment.*12* In Angiulo and Cintolo, the appellants asserted that the district court had mistakenly barred questions concerning the precise location of microphones hidden in an apartment. Trial motions for the information had not been made, nor had the defendants offered any technical basis for the value of the information. The government successfully objected to the questions concerning the microphones' location on the grounds that it would reveal sensitive surveillance techniques and jeopardize future criminal investigations. In upholding the district court, the First Circuit, citing Van Horn*13* and United States v. Harley,*14* and making an analogy to the informer's privilege in Roviaro, held that a qualified privilege against compelled government disclosure of sensitive investigative techniques exists.*15* The privilege can be overcome, however, by a sufficient showing of need.
The defendant must show that "he needs the evidence to conduct his defense and that there are no adequate independent means of getting at the same point."*16* The Cintolo court stressed that the extent to which adequate alternative means could have substituted for the proper testimony is "a key to evaluating this claim of necessity."*17* As technological advances have occurred in digital editing, there likewise has been a tremendous increase in the number of body-worn FM transmitters and other recording devices used by law enforcement to collect evidence against defendants. It should be emphasized, however, that some of this evidence may not be admissible in court if the agencies do not comply with several Federal Communications Commission (FCC) regulations. First, all nonfederal agencies must use only transmitters that are approved by the FCC; without this approval the transmitter is not considered a legal transmitting device and therefore cannot be legally used to gather evidence. Secondly, state and local agencies must be licensed in the FCC's Police Radio Service, and thus far most departments reportedly have not met this requirement. These observations are part of the information contained in "Equipment Performance Report: Body Worn FM Transmitters," a report of the Technology Assessment Program (TAP). This program tested nine body-worn FM transmitters in accordance with National Institute of Justice (NIJ) Standard 0214.01. This standard requires transmitters passing the test to provide intelligible audio signals that result in acceptable quality voice recordings.*18* As noted in the Cintolo and Angiulo decisions, the defense failed to provide a sufficient showing of necessity; thus, it is imperative that defense experts vouch for the necessity of access to the government evidence as soon as possible.
The Need For Original Recording Equipment And How To Get It
There are a number of valid scientific reasons for accessing original tapes, recorders, and related equipment to conduct a proper analysis. In practically every credible forensic publication dealing with forensic tape analysis procedures, the authors emphasize the necessity of examining the original evidence or a direct patch cord copy. In many cases, however, experience has shown an unwillingness of the government prosecutor and agents to provide such materials to the defense for examination purposes. The government may object that the defense never requested the original or direct copy recordings and that, therefore, their motions for access at the eleventh hour are basically "delay strategies." This argument can be effectively countered if the defense obtains an appropriate court order requesting that the defense expert be provided access to the required "best evidence recordings." Secondly, the government may contend that it has a qualified (if not absolute) privilege of withholding technical data from the defense counsel, citing "National Security" or indicating that such release may jeopardize future criminal investigations. The Angiulo and Cintolo decisions provide the defense counsel relief from such government actions. Counsel must show the need for the evidence to conduct the defense and that there "are no adequate independent means of getting at the same point." The importance of the defense obtaining the original or at least a direct patch cord copy of all evidential recordings cannot be overemphasized. In practically every case I have seen, the copy initially provided by the government was not adequate for the best voice identification, tape enhancement or tape authentication examination. Subsequent motions filed by the defense citing the aforementioned requisite need for the original evidence often result in its release by the court.
As reflected in the newly approved International Association for Identification standards for analysis of questioned voice recordings, the "unknown and known voice samples must be original recordings, unless listed as a specific exception ...."*19*
Notes:
1. Bruce E. Koenig, Authentication of Forensic Audio Recordings, JOURNAL OF AUDIO ENGINEERING SOCIETY, 38 No. 1/2, 1990, Jan/Feb, page 4.
2. National Commission For The Review of Federal and State Wiretapping Laws, pp. 223-225, 1972.
3. Steve Cain, Voiceprint Identification, NARCOTICS, FORFEITURE, AND MONEY LAUNDERING UPDATE NEWSLETTER, U.S. Department of Justice, Criminal Division, (Winter 1988).
4. Bruce E. Koenig, Authentication of Forensic Audio Recordings, JOURNAL OF AUDIO ENGINEERING SOCIETY, 38 No. 1/2, 1990, Jan/Feb, page 4.
5. Tom Owen, Forensic Audio and Video Theory And Applications, JOURNAL OF AUDIO ENGINEERING SOCIETY, Vol. 36, No. 1/2, 1988, Jan/Feb, page 39.
6. Ibid page 40.
7. Edward J. Imwinkelried and Robert G. Scofield, Attacking The Weight Of Prosecution Scientific Evidence, THE CHAMPION, PDN, April 1992.
8. Nancy Hollander and Lauren M. Baldwin, Testimony In Criminal Trials: Creative Uses, Creative Attacks, THE CHAMPION, December 1991.
9. California v. Trombetta, 467 U.S. 479, 485 (1984).
10. United States v. Valenzuela-Bernal, 458 U.S. 858, 867 (1982).
11. 353 U.S. 53 ().
12. See United States v. Angiulo, 847 F.2d. 956, 981-82 (1st Cir. 1988); and United States v. Cintolo, 818 F.2d. 980, 1001-03 (1st Cir. 1987); United States v. Van Horn, 789 F.2d. 1492, 1507-08 (11th Cir. 1986).
13. 789 F.2d. 1492 ( ).
14. 682 F.2d. 1018, 1020 (D.C. Cir. 1982).
15. Cintolo, 818 F.2d. 1002.
16. See Harley, supra.
17. Cintolo, 818 F.2d. 1003.
18. Copies are available at no charge from the Technology Assessment Program Information Center (TAPIC), toll-free number 800-248-2742 or (301) 251-5060.
19.
IAI Voice Comparison Standards, JOURNAL OF FORENSIC IDENTIFICATION, January/February, 1992
AUTHENTICATION OF SOUND RECORDINGS FOR EVIDENTIARY PURPOSES
By: STEVE CAIN, MFS, MFSQD, PRESIDENT, FORENSIC TAPE ANALYSIS, INC, LAKE GENEVA, WISCONSIN
MICHAEL R. CHIAL, Ph.D, PROFESSOR AND CHAIRMAN OF COMMUNICATIONS PROGRAMS AND PROFESSOR OF COMMUNICATIVE DISORDERS, UNIVERSITY OF WISCONSIN-MADISON
PRESENTED AT 1994 ANNUAL MEETING OF THE AMERICAN ACADEMY OF FORENSIC SCIENCES (JURISPRUDENCE SECTION), FEBRUARY 18, 1994
An ever-increasing reliance on tape evidence in both criminal and civil hearings underscores the importance of tape integrity and the methods used to qualify or disqualify audiotape evidence. Tape recordings are subject to increasing falsification and misinterpretation, especially with the advent of computer-based digital editing equipment. The purpose of this paper is fourfold: 1) to identify the predominant methods by which audiotapes are normally intentionally altered or falsified; 2) to identify the physical and instrumental techniques for detecting signs of tape falsification; 3) to briefly discuss the increasing threat caused by modern-day digital editing techniques; and 4) to provide examples of both analog and digitally falsified tapes. There are two generally accepted approaches for establishing the authenticity of a questioned tape recording. Current legal practices normally require that the burden of proof be placed on the attorney seeking to introduce the tape into evidence. This will require that the attorney demonstrate that certain accepted methods designed to protect from any form of tape tampering have been adhered to and, if that is not successful, submit the tape to a qualified expert for a forensic examination. On a more practical level, an original recording is considered authentic if it starts at the beginning of the tape and does not stop until the end.
Any stops or restarts should be announced by the operator. Original recordings should contain all of the audio information recorded at the moment in time that the event occurred. The recording should further not contain any break in its continuity or content, nor should it contain any suspicious signs suggestive of falsification. It is important for both attorney and investigator to understand that falsification or tampering with tapes involves an intentional attempt to alter the tape's original content. Often, however, the evidential recorders and their respective tapes have been unintentionally interrupted during the recording process. This innocuous or accidental interruption of the tape does not constitute a falsification effort and may include the following operator errors and equipment faults: 1) accidental stop/restart of tape recorder; 2) mechanical malfunction of the tape recorder; 3) damage to the tape oxide or the use of a previously recorded tape; 4) "off-speed" recording due to low batteries or improper AC line connections; 5) microphone abnormalities; etc. The major categories of intentional tape editing or falsification include: 1) Deletion; 2) Obscuration; 3) Transformation; and 4) Synthesis. Deletion of unwanted material can be rapidly accomplished through either splicing or by using one or more recorders to erase, rerecord, or stop/pause the recorder at strategic points within the conversation. Obscuration involves the distortion of a recorded signal with the purpose of rendering selective portions unintelligible (e.g. the eighteen minute gap in the infamous Watergate tapes). This technique can also be used to mask splices, clicks, or suspicious transients. Transformation involves the alteration of portions of a recording so as to alter its original content. The technique is similar to deletion practices but requires greater knowledge of acoustic phonetics and is more difficult to accomplish.
Lastly, synthesis is the generation of artificial text by adding background sounds or conversation to the taped copy which were not present on the original recording. It should be emphasized that all of the aforementioned traditional analog techniques for altering audiotapes could be more effectively and surreptitiously accomplished through the use of digital editing workstations. The principles of falsification are also similar to the general principles of disguise. Namely, the individual actually effecting the tape falsification is attempting to obscure or disrupt important features of the originally recorded event or subject of interest. This is accomplished through various masking techniques. Secondly, falsification efforts are often designed to misdirect the attention of the listener to an irrelevant aspect or feature of the signal or an event of interest. The electromechanical indications of falsified tapes should include one or more of the following phenomena:
1) Gaps - segments in a recording which represent unexplained changes in content or context. A gap can contain buzzing, hum, or silence.
2) Transients - short, abrupt sounds exemplified by clicks, pops, etc. Transients may indicate tape splicing or some other interruption of the recording process.
3) Fades - gradual loss of volume. Fades can cause inaudibility and are considered gaps when the recording becomes fully inaudible.
4) Equipment sounds - inconsistencies of context caused by the recording equipment itself. Common equipment sounds include hum, static, whistles, and varying pitches.
5) Extraneous voices - background voices which at times appear to be as near as the primary voices, and at times can even block the primary voices.
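Two of the indications listed above, gaps and transients, lend themselves to a simple illustration on digitized samples. The following is a minimal sketch under stated assumptions and is not the procedure used by the authors: the function names, the amplitude threshold, the minimum gap duration, and the jump size are all hypothetical values chosen for a synthetic test signal. It only flags silent gaps and click-like jumps; gaps that contain hum or buzzing would need a different test, and any flagged event would still require examination by a qualified expert.

```python
import math


def find_gaps(samples, rate, threshold=0.01, min_duration=0.05):
    """Return (start_s, end_s) spans where |sample| stays below threshold.

    Spans shorter than min_duration seconds are ignored so that ordinary
    zero crossings of the waveform are not reported as gaps.
    """
    gaps = []
    start = None
    for i, s in enumerate(samples):
        if abs(s) < threshold:
            if start is None:
                start = i
        else:
            if start is not None and (i - start) / rate >= min_duration:
                gaps.append((start / rate, i / rate))
            start = None
    if start is not None and (len(samples) - start) / rate >= min_duration:
        gaps.append((start / rate, len(samples) / rate))
    return gaps


def find_transients(samples, rate, jump=0.5):
    """Return times (s) where consecutive samples differ by more than jump."""
    return [
        i / rate
        for i in range(1, len(samples))
        if abs(samples[i] - samples[i - 1]) > jump
    ]


# Synthetic example: a 200 Hz tone with a 0.1 s silent gap and one click.
RATE = 8000
tone = lambda k: 0.3 * math.sin(2 * math.pi * 200 * k / RATE)
signal = [tone(k) for k in range(4000)]            # 0.0 s to 0.5 s of tone
signal += [0.0] * 800                              # 0.5 s to 0.6 s of silence
signal += [tone(k) for k in range(4800, 8800)]     # tone resumes
signal[1000] = 1.0                                 # a single-sample click

print(find_gaps(signal, RATE))        # one span near 0.5 s to 0.6 s
print(find_transients(signal, RATE))  # two jumps around 0.125 s
```

The click is reported twice because the waveform jumps once into the spike and once back out of it.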
The methods for detecting falsified (non-authentic) recordings include:
Critical Listening - The forensic tape specialist will normally listen with high quality headphones and professional recording equipment to the original tapes prior to conducting any instrumental examination. He notes any unusual aural and/or acoustic events such as starts, stops, speed fluctuations, and other variations requiring investigation. He examines all recorded events, to include both foreground and background sounds, and listens for abnormal changes, absences, or presences of differing environmental sounds. He concentrates on voices, conversation and other audible sounds.
Aural Anomalies - Would include sudden changes in a person's voice, abrupt unexplained topic changes, or a sudden change in foreground/background information.
Physical Inspection
Magnetic Development
Spectrum Analysis - Employs the use of specialized computer equipment which measures the frequency spectrum of the recorded tape and provides a visual interpretation of the frequency vs. amplitude and frequency vs. amplitude vs. time displays. This allows the expert to view the entire spectrum or to zoom in on one particular area of interest to help characterize the acoustic nature of a particular anomaly and to possibly identify its source.
Waveform Analysis - A computer generated display representing time vs. amplitude of recorded signals in graphic form. Such analysis normally allows the expert to measure and identify record-mode events, including the measurement of record-to-erase head distances, determination of the spacing between gaps and multiple gap erase heads, and inspection of the signature shape and spacing of various record event signals.
Test Recordings on Evidential Recorders and Accessory Equipment - Various electrical, magnetic and mechanical measurements of both standard and modified recorders can be used in determining the possible origins of questionable tones or sounds occurring on the evidential recording.
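The frequency vs. amplitude display described under Spectrum Analysis can be approximated in a few lines. This is a minimal sketch, not the specialized equipment the text refers to: the function names and the synthetic signal (a weak 60 Hz hum under a stronger 1 kHz tone) are assumptions made purely for illustration, and the direct summation used here is far slower than the FFT a real analyzer would use.

```python
import cmath
import math


def magnitude_at(samples, rate, freq):
    """Amplitude of the sinusoidal component at `freq` Hz (direct DFT sum)."""
    acc = sum(
        s * cmath.exp(-2j * math.pi * freq * n / rate)
        for n, s in enumerate(samples)
    )
    # A sinusoid of amplitude A contributes A * N / 2 to the sum.
    return 2 * abs(acc) / len(samples)


def spectrum(samples, rate, step_hz, max_hz):
    """List of (frequency_hz, amplitude) pairs from step_hz up to max_hz."""
    return [
        (f, magnitude_at(samples, rate, f))
        for f in range(step_hz, max_hz + 1, step_hz)
    ]


# Synthetic example: 0.5 s sampled at 4 kHz, hum at 60 Hz plus a 1 kHz tone.
RATE = 4000
signal = [
    0.2 * math.sin(2 * math.pi * 60 * k / RATE)
    + 0.5 * math.sin(2 * math.pi * 1000 * k / RATE)
    for k in range(2000)
]

points = spectrum(signal, RATE, step_hz=10, max_hz=1500)
peak_freq, peak_amp = max(points, key=lambda p: p[1])
print(peak_freq, round(peak_amp, 3))                # 1000 0.5
print(round(magnitude_at(signal, RATE, 60), 3))     # 0.2
```

Repeating the same measurement over successive short windows of a recording would give the frequency vs. amplitude vs. time view, in which a tone that starts or stops abruptly stands out.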
Many methods exist for both analog and digital editing of tape recordings; the examples below highlight some of the more common ones.

TRADITIONAL METHODS OF TAPE EDITING AND METHODS OF DETECTION

1. Whispered Speech. Detection: talker identification (voice print analysis) involving the combined aural/spectrographic method.
2. Vocal Disguise or Mimicking. Detection: talker identification (voice print analysis).
3. Typical Analog Edits - splicing (electronic or physical), stop/restart, overrecording, pausing of recorder, erasures, dubbing, etc. Detection: critical listening, instrumental analysis, magnetic development, and spectrum analysis.
4. Re-recording to obscure physical edits, etc. Detection: critical listening, instrumental analysis, magnetic development, and spectrum analysis.

CONTEMPORARY/FUTURE CHALLENGES

Digital editing of both audio and video tapes has greatly complicated the authentication process and increased the likelihood that altered tapes can escape detection. There are at least 30 different desktop computer editing workstations or digital recorders which can be used as “turnkey” editing systems. Software and add-on computer cards can transform an IBM or Macintosh computer into a sophisticated digital audio editing machine. Some of the systems require a digital audio recorder for initial conversion of the analog format before accessing the computer hardware. These editing workstations were originally designed by the motion picture and recording industries to correct subtle errors in multi-track releases and can now be purchased at prices as low as $300 for the software. The editing options are practically inexhaustible and give the operator the ability to alter the tape in a word-processing format (i.e., cut and paste, copy, delete, etc.) while selecting playback files which can help “shape” the sound.
The typical telltale signs of traditional analog recorder editing, including clicks, pops, and other short duration sounds, can now be effectively removed with little if any detectable audible clue.

Examples of varying editing processes, including related hardware and/or equipment:

1) Pitch Shift Telephones
2) Vocal Disguise through synthesized speech (Votrax or Dectalk)
3) Computer Manipulation of speech formant data (Kay Elemetrics Model 4300 and ASL programs - Re-synthesis of Human Speech)
4) Additive mixing of noise or other background and foreground signals into ongoing speech
5) Signal Processing Filters (analog and digital)
a. Phasing Anomalies
b. Chorusing
c. Harmonic Distortion
d. Reverberation
e. Filtering of Selective Frequencies
f. Channel Switching

The threat of future digital editing is of increasing concern to the courts. It is therefore more imperative than ever that both the original tapes and the recorders be made available for inspection. Both the FBI Signal Analysis Branch and other certified acoustic tape experts recognize that it is essential for the contributing attorney to provide all of the original tapes and related recording equipment before a complete authentication can be accomplished.

Professor Chial and I have attempted to explain some of the more traditional and more recent methods of detecting falsified or edited audiotape recordings, to identify the various physical and instrumental techniques for detecting signs of tape falsification, to discuss various examples of falsified tapes, and lastly to discuss briefly the increasing threat caused by digital computer-based editing systems. It is relatively easy to change the content of a recording by deleting words or obscuring meaning with over-recorded sounds or by transforming the context through rearrangement of selected phrases or added words.
Nevertheless, falsifications normally leave detectable magnetic and waveform acoustic signatures which can lead to forensic individualization of the evidential recorders and tapes.

Note: For additional information see the following published articles:

“Authentication of Forensic Audio Recordings,” Journal of Audio Engineering Society, 38, 1990, Bruce E. Koenig.
The National Commission for the Review of Federal and State Wire Tapping Laws, 1976, Mark Weiss, et al.
“Verifying the Integrity of Audio and Video Tapes,” The Champion Magazine, Summer 1993, Steve Cain.
“Sound Recordings as Evidence in Court Proceedings,” The Prosecutor Magazine, Sept/Oct. 1995.

AES standard for forensic purposes - Criteria for the authentication of analog audio tape recordings

Users of this standard are encouraged to access http://www.aes.org/standards to determine if they are using the latest printing incorporating all current amendments and editorial corrections.

This document has been reproduced by Global Engineering Documents with the permission of AES under a royalty agreement.

AUDIO ENGINEERING SOCIETY, INC. 60 East 42nd Street, New York, New York 10165, USA

Published by Audio Engineering Society, Inc. Copyright © 2000 by the Audio Engineering Society

Abstract

The purpose of this standard is to formulate a standard scientific procedure for the authentication of audio tape recordings intended to be offered as evidence or otherwise utilized in civil, criminal, or other fact finding proceedings.

An AES standard implies a consensus of those directly and materially affected by its scope and provisions and is intended as a guide to aid the manufacturer, the consumer, and the general public.
The existence of an AES standard does not in any respect preclude anyone, whether or not he or she has approved the document, from manufacturing, marketing, purchasing, or using products, processes, or procedures not in agreement with the standard. Prior to approval, all parties were provided opportunities to comment or object to any provision. Approval does not assume any liability to any patent owner, nor does it assume any obligation whatever to parties adopting the standards document. This document is subject to periodic review and users are cautioned to obtain the latest edition.

AES43-2000

Contents

Foreword
1 Scope
2 Normative references
3 Definitions
4 Verification of authenticity
4.1 Criteria
4.2 Equipment
4.3 Reporting
5 Examination and analysis
5.1 Evidence management
5.2 Critical listening and waveform examination
5.3 Photo-microscopic analysis
5.4 The formulation of an opinion and conclusion
6 Testimony
6.1 Preparation
6.2 Problems
Annex A Informative references

Foreword

[This foreword is not a part of AES standard for forensic purposes - criteria for the authentication of analog audio tape recordings, AES43-2000.]

This document was developed by a writing group, headed by A. Pellicano, of the SC-03-12 Working Group on Forensic Audio of the SC-03 Subcommittee on the Preservation and Restoration of Audio Recordings. The writing group was formed to execute project AES-X48. It results from an international consensus and is not intended to reflect the practice of any single nation. As an AES standard, it is an international professional society’s statement of technical good practice, but its use is entirely voluntary and it does not have the status of a governmental regulation. Nevertheless, any claim to voluntary compliance with the standard implies acceptance of its mandatory clauses.

In 1991, SC-03-12 was organized as AESSC WG-12 at the request of a community of engineers from the AES,
the Acoustical Society of America, various law enforcement agencies, and groups concerned with testimony. The group concerns itself with the handling, authentication, and enhancement of audio recorded materials, basing itself on methodologies developed from those described in Bolt, Cooper, Flanagan, McKnight, Stockham, and Weiss, Report on a Technical Investigation Conducted for the U.S. District Court for the District of Columbia by the Advisory Panel on the White House Tapes, May 31, 1974. This document results from one of the projects set out at the early meetings of the working group.

Tom Owen, Chair of SC-03-12
Michael McDermott, Vice-Chair of SC-03-12
1999-09-03

1 Scope

This standard specifies the minimum procedure for the authentication of analog audio tape recordings intended to be offered as evidence or otherwise utilized in civil, criminal, or other fact finding proceedings. It does not specify or restrict additional testing procedures that can be used. These methodologies are suggested to any and all individuals and groups who hold themselves out to be or are recognized as forensic tape analysts or experts. This standard is a set of procedures set forth to inform attorneys, courts, and other interested parties. It also serves to aid interested parties who are attempting to determine whether or not the procedures and methodologies of potential, chosen, or opposing experts are of a scientific nature and would withstand objective scrutiny.

2 Normative references

The following standard contains provisions that, through reference in this text, constitute provisions of this document. At the time of publication, the edition indicated was valid. All standards are subject to revision, and parties to agreements based on this document are encouraged to investigate the possibility of applying the most recent editions of the indicated standards.
AES27-1996, AES recommended practice for forensic purposes - Managing recorded audio materials intended for examination.

3 Definitions

3.1 authentication: authentic recording and authenticity analysis as defined in AES27

3.2 forensic tape analyst (FTA): entity performing authentication according to this standard

3.3 designated original recording (DOR): original recording as defined in AES27

3.4 designated originating recording device (DORD): original recorder as defined in AES27

3.5 employer, engaging party: entity engaging the services of an FTA

3.6 cassette: device composed of a case containing two coplanar or superimposed hubs or reels on which a magnetic tape is wound, so that the tape can move from hub (reel) to hub (reel) during recording, reproduction, a fast forward movement, or rewinding, and can be easily and instantaneously inserted in a recording-reproducing equipment or in a reproducer designed for this purpose, without handling the magnetic tape

3.7 memorialization: legally acceptable documentation of evidence

3.8 test recording: recording made by the FTA, using the designated originating recording device and a non-evidence blank tape, for the purpose of determining certain performance characteristics of the recording device

3.9 signature: waveform or microscopic visualization (or demonstration) of record events either located on the DOR or created on a test recording, or both, utilizing the DORD or any tape recording device examined by the FTA for the purpose of identification or comparison during an examination

4 Verification of authenticity

4.1 Criteria

Verification is predicated upon two sets of criteria:

a) that a person, whether a law enforcement official or any individual stated, if called upon, could or would testify under penalty of perjury, that the tape recorded evidence presented as the DOR is, in fact, the tape material utilized to create the recording at the exact time that the occurrence, interview, interrogation, or recorded content actually took
place;

b) that by a comprehensive examination procedure and scientific means the FTA was able to determine that it is the original.

4.2 Equipment

The FTA shall examine the DOR along with and utilizing the DORD. The FTA shall render findings that would scientifically evince that the DORD recorded the designated original recording, and that the FTA found no conclusive evidence of tampering, unauthorized editing, or any form of intentional deletions, material or otherwise, within the recorded content.

4.3 Reporting

The FTA may then render an opinion that the recording has passed the procedure or standard for authentication and that the questioned tape recording is authentic in physical state and in content.

5 Examination and analysis

5.1 Evidence management

Except where otherwise specified in this standard, evidence management practices shall comply with AES27.

5.1.1 Physical examination

5.1.1.1 Record-prevention punch-out tabs

If the audio evidence is contained in a tape cassette that features record-prevention punch-out tabs, the FTA should try to obtain permission to remove them or the FTA may remove the tabs at its discretion. If the tabs are removed, the FTA shall attach the removed record-prevention tabs to a suitable carrier such as a file card by means of a nondestructive and removable adhesive such as transparent adhesive tape. The carrier shall be placed inside a sealed envelope, with the date and time that the envelope was sealed and the signature of the FTA written across the seal. The cassettes shall be comprehensively photographed or videotaped before and after the removal of the punch-out tabs.

5.1.1.2 Operating condition

When the tape recorded evidence is contained in a cassette, the cassette shall be carefully examined to determine that it is operable. The FTA shall inspect the cassette, making sure that there is no obstruction to the tape.
The FTA shall also look for apparent tears or splices on the tape material itself that could possibly obstruct or deter playback. The FTA shall carefully rotate the tape hubs in both directions to detect any hidden obstruction that could hinder playback. When examining a reel of tape, the same care and caution shall be exercised.

NOTE Playback of a damaged tape can produce further damage to the tape.

5.1.1.2.1 Notification of damage

If during the physical examination, the FTA finds evidence of physical tampering or damage to the cassette or the tape material, the FTA shall immediately inform its employer that the submitting party shall be notified. If the cassette or tape material can be repaired, then the FTA shall obtain written permission from the submitting party prior to proceeding with any repairs or modifications. Whether or not the FTA receives permission to repair the damage or remove the tape material and place it in a new cassette or otherwise prepare the DOR to be available for playback, the FTA shall photograph or videotape the evidence for reference to memorialize the discovery. If the tape or cassette is repaired, the videotape or photographs shall comprehensively depict the repair.

5.1.1.2.2 Splices

If a physical splice is located, the splice shall be noted and photographed or videotaped at the time of the observation.

5.1.1.3 DORD condition

The DORD and any accompanying apparatus such as separate microphones, switching devices, and similar accessories shall be inspected and examined to determine that they are operational. After the FTA concludes the visual inspection, a compatible tape shall be placed in the DORD and the functions of the DORD shall be tested to ensure that it can play back the DOR without damage to the DOR or the DORD.

5.1.1.3.1 Notification

If the DORD is not functional, the FTA shall inform its employer that the submitting party shall be notified.
If the DORD can be repaired, then the FTA shall obtain written permission to do so. If the repair necessitates the replacement of the record-playback head, the erase head or both, the FTA shall indicate to the employer that the replacement of the head or heads negates an authentication procedure and that the FTA report of findings relates only to the examination of the DOR. All repairs shall be comprehensively memorialized including who repaired the recorder and at what facility. All replaced parts shall be maintained as evidence by the FTA.

5.1.2 Verification

Compliance with 5.1 shall be verified and attested to by the FTA before proceeding further with the evidence.

5.2 Critical listening and waveform examination

The critical listening and waveform examination procedures can assist an FTA in attempting to determine whether or not any anomalies are present on the questioned recording.

5.2.1 The FTA shall produce a first test recording containing known exemplars of the functions of the DORD. It should include a minimum of ten start recording signatures, ten stop recording signatures, ten stop-start recording signatures, ten pause signatures (assuming that the recording device has this feature), and if the DORD is so equipped, ten voice activation signatures. Other test recordings may be produced which should include over-recordings and other variations of the record functions of the recording device if necessary or appropriate.

5.2.2 The designated recording device shall be utilized to play back the test recording. The playback should be rehearsed to ensure that the level of playback is appropriate. That setting should be fixed by either carefully applying tape across the volume control of the recorder or taking some form of measurement that would ensure that the playback output level can be reasonably reproduced.
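The exemplar counts recommended in 5.2.1 lend themselves to a simple bookkeeping check. The sketch below is an informal illustration and not part of AES43-2000. The event labels, the function names, and the idea of keeping a plain list of logged events are assumptions made for the example.

```python
from collections import Counter

MIN_EXEMPLARS = 10  # minimum count per signature type suggested in 5.2.1


def required_signature_types(has_pause, has_voice_activation):
    """List the signature types the first test recording should contain."""
    types = ["start", "stop", "stop-start"]
    if has_pause:
        types.append("pause")
    if has_voice_activation:
        types.append("voice-activation")
    return types


def missing_exemplars(logged_events, has_pause, has_voice_activation):
    """Return {signature type: shortfall} for types below the minimum."""
    counts = Counter(logged_events)
    return {
        sig: MIN_EXEMPLARS - counts[sig]
        for sig in required_signature_types(has_pause, has_voice_activation)
        if counts[sig] < MIN_EXEMPLARS
    }


# Hypothetical log for a recorder with a pause control but no voice activation.
log = ["start"] * 10 + ["stop"] * 10 + ["stop-start"] * 7 + ["pause"] * 10
shortfall = missing_exemplars(log, has_pause=True, has_voice_activation=False)
print(shortfall)
```

An empty result would indicate that the logged test recording meets the suggested minimum for every function the recorder offers.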
5.2.3 The first test recording should be played back into either a computerized system that stores the playback on a hard disk or some other form of memory device that allows repetitive playback. Many programs are now available to digitize the playback and store that information on hard drives. They further allow an array of playback functions, and most have features that enable the FTA to view the waveform.

5.2.4 Once the signal or audio from the test recording has been stored, playback of the digitized recording can take place to enable the FTA to listen to the recording while viewing the waveform. The FTA can then be informed as to how the record functions of the designated recording device sound (assuming that the functions generate a discernible audible sound when played back) and are visually demonstrated or appear in the waveform domain. The FTA should then study and scrutinize the signatures so that it can be reasonably acquainted with how the function signatures of the designated recording device sound and are seen or demonstrated in the waveform domain.

5.2.5 The DOR shall be played back with as close to the exact output level and through the exact configuration as the test recording. The output volume control of the DORD may be adhesive-taped to a fixed position until all of the test recordings are created and subsequently stored. Once the signal or audio from the DOR is stored, then the FTA shall critically listen to the content while viewing the waveform.

5.2.6 The FTA should then produce, by the safest and best means possible, at least two first-generation copies of the DOR for reference and to evince the state of the recorded content at or about the time of receipt. If the FTA is asked for copies, then copies should be provided appropriately labeled and marked.

5.2.7 The critical listening and waveform examination should occur as often as the FTA deems it necessary in order to answer the following questions.
a) Was the content consistent and uninterrupted throughout the entirety of the questioned tape recording? If not, then the location of the gaps, dropouts, overrecordings, or any other form of disruption should be delineated for further examination and analysis. If there are other apparent unrelated recordings, they should be cataloged for reference and/or possible further examination and analysis.

b) Were there any identifiable record function signatures detected and located in the content? If so, are they consistent with the test recording exemplars? If not, they should be designated as possible anomalies. In either case, they should be labeled or otherwise delineated for further analysis.

c) Were there any anomalous or otherwise perceptible aural or visible indications in the playback or waveform display? If so, their presence should be labeled or otherwise delineated for further analysis. This question would include level changes, apparent or obvious differences in background content, or any other form of aurally perceptible variances.

d) Were there background conversations or content? For example, were there radio communications or other perceptible speech, or repetitive noise that would aid in determining authentication? If so, they should be labeled or otherwise delineated for reference and further analysis.

e) If (a) through (d) render any form of anomaly or evident difference, then further test recordings utilizing the DORD should be produced in an attempt to recreate or mimic the differences or anomalies detected and located. Whether or not the further tests can do so, the result should be reported.

5.2.8 These and other findings should be reported upon, verified in the waveform, and their precise location noted for future reference. Once these procedures have been accomplished, then the next step shall be to perform the photo-microscopic examination and analysis.
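Questions a) and c) above ask about dropouts and level changes. As a rough illustration of how such locations might be shortlisted in a digitized waveform before critical listening, the following sketch flags frames whose RMS level jumps sharply relative to the preceding frame. It is an assumption-laden example and not a procedure from the standard. The frame length, the 12 dB threshold, the noise floor, and the synthetic signal are all hypothetical, and a flagged location is only a candidate for further examination, never a finding in itself.

```python
import math


def frame_rms(samples, frame_len):
    """RMS level of each consecutive, non-overlapping frame."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def flag_level_changes(samples, sample_rate, frame_len=100,
                       jump_db=12.0, floor=1e-6):
    """Return (time_seconds, change_db) where the level jumps by >= jump_db."""
    levels = frame_rms(samples, frame_len)
    flagged = []
    for i in range(1, len(levels)):
        prev_db = 20 * math.log10(max(levels[i - 1], floor))
        cur_db = 20 * math.log10(max(levels[i], floor))
        if abs(cur_db - prev_db) >= jump_db:
            flagged.append((i * frame_len / sample_rate, cur_db - prev_db))
    return flagged


# Hypothetical signal: tone, a 0.2 s dropout, then tone again (1 kHz sampling).
sample_rate = 1000
tone = [0.5 * math.sin(2 * math.pi * 50 * t / sample_rate) for t in range(300)]
signal = tone + [0.0] * 200 + tone
flagged = flag_level_changes(signal, sample_rate)
for time_s, change_db in flagged:
    print(f"level change of {change_db:+.1f} dB at {time_s:.1f} s")
```

Each flagged time would then be compared, by ear and in the waveform display, with the record function signatures obtained from the test recordings.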
5.3 Photo-microscopic analysis

5.3.1 Test recordings for the specific purpose of photo-microscopic analysis should be produced. These test recordings should include all of the record function signatures of the DORD.

5.3.2 The test recordings should be examined under the microscope, in a scientific manner, which would allow the FTA to view the magnetic domain (Bitter patterns) of the record function signatures of the test recording examined. See annex A for informative references.

5.3.3 The known exemplars produced, viewed, and examined can familiarize the FTA with how the function signatures of the DORD appear. The FTA can now be enabled to make measurements, take photographs, videotape, and otherwise memorialize the procedure and the resulting findings.

5.3.4 The FTA can now perform the same examination and analysis upon the designated original recording. This procedure, when performed in a scientific manner, can enable the FTA to attempt to identify the signatures located on the questioned recording. The FTA can now make comparisons and other forms of tests resolving the issue of authenticity as it pertains to the recording that the FTA is examining. It further allows the opportunity to demonstrate the findings by means of measurements, photographs, videotapes, or any other form of demonstrative means that could be reviewed by the employers, the courts, or juries and other experts, opposing or consulting.

5.3.5 An FTA can now draw conclusions from these findings, including whether or not the DORD actually recorded the designated original recording. An FTA’s finding could either validate this fact or disprove it. In some cases no definitive solution can be made.
5.4 The formulation of an opinion and conclusion

5.4.1 Once an FTA has performed all of the testing procedures and rendered scientific findings, it should be sure to have:

a) performed all of the tests and examinations in a scientific manner, that if recreated or duplicated by another expert would render the exact same findings; for example, if the FTA has found and identified a stop/start recording signature at a specific location on the questioned recording, another expert or analyst could or would find and identify that same signature at the same location;

b) produced comprehensible and repeatable graphic waveform displays, printouts, or any other form of graphic rendering that would demonstrate the FTA’s findings in the waveform domain, so that another expert or any other individual could view them in an effort to determine whether the FTA’s findings exist and are valid;

c) produced sufficient photographs, videotapes, or any other form of definitive renderings that would demonstrate the FTA’s findings in the magnetic domain, so that another expert or any other individual could view them in an effort to determine whether the FTA’s findings exist and are valid.

5.4.2 If asked, an FTA should render a comprehensive report that would effectively demonstrate all of the procedures and findings, in a scientific manner, that would survive objective scrutiny and lend credence to its opinion and conclusion.

5.4.3 As to what an FTA hears or perceives in the playback of the DOR that is not demonstrable, that information would be categorized as subjective and left to the courts, juries, or other parties to determine its relevance, validity, or both. It may, however, be reported thereon.
5.4.4 After an FTA has completed all of the tests and examinations, has analyzed and memorialized all of the findings, and has either rendered a comprehensive written report or rendered an oral report to the employers regarding this opinion and conclusion, based on a high degree of scientific certainty, an FTA may be permitted to testify as to its opinion and conclusion.

6 Testimony

Once an FTA has finalized its examination and analysis and reached a definitive conclusion and opinion, the FTA may be available for testimony if called upon to do so.

6.1 Preparation

6.1.1 To adequately prepare for testimony, an FTA shall attend to its files so that notes, correspondence, data, and other written or otherwise demonstrable information are in a comprehensive form. This requirement includes the cataloging of all the evidence submitted, the test recordings produced, and any and all demonstrative renderings that may be requested to be viewed by the opposing parties, their experts, and the engaging party.

6.1.2 Once the files are in order, then an FTA should review all findings in a comprehensive fashion to determine that all of the calculations, demonstrative renderings, reports, and supporting information are complete and, most importantly, accurate. The FTA should thoroughly review its deposition, if one has occurred, and any and all forms of reports it may have previously rendered.
6.1.3 When an FTA is reasonably assured that it is prepared, then the FTA shall proceed to prepare its employer and, at the very least, demonstrate the following:

a) that the FTA followed the criteria for authentication as strictly as possible;

b) that the FTA had attained a high degree of scientific certainty as to its findings, opinion, and conclusion;

c) that all of the FTA’s demonstrative waveform or spectral renderings are accurate and truthfully demonstrate all of the findings which it claims are located on the questioned recording; further, that if any other competent expert or party examined the FTA’s waveforms, that party could locate the signatures, events, edits, or anomalies that are graphically demonstrated in the FTA’s depictions at or about the same location as did the FTA presenting the findings;

d) that all of the FTA’s photographs or other forms of visual magnetic domain renderings are accurate and truthfully demonstrate its findings located on the questioned recording; further that if any other competent expert or party examined the magnetic domain, that party could and would locate the signatures, events, edits, or anomalies that are demonstrated in the FTA’s depictions at or about the same location as did the FTA when presenting the findings;

e) that the FTA has performed its examination in the utmost unbiased ethical manner and that it believes the findings, opinion, and conclusion would withstand the scrutiny of peers and the legal process;

f) that the FTA should submit or make available to its employer all reference materials, instrumentation manuals, literature or any other form of documentation or data that the FTA has relied upon during its examination and analysis, in rendering its opinion, or both.
Further, the FTA should attempt to familiarize its employer with the syntax, nomenclature, or terminology utilized in its field;

g) that the FTA should assert that its employer can rely upon the FTA to professionally and truthfully testify as to the findings with the utmost assurance within its capabilities and competence.

6.1.4 At this point the engaging party may further interview or mock cross-examine an FTA in an attempt to ascertain any issues relating to the findings, opinions and conclusions rendered or any issues relating to prior testimony given by an FTA.

6.2 Problems

6.2.1 From time to time there are problems educating or relating findings to the engaging party. The FTA should make itself available in an effort to clearly address the issues caused by its findings or the engaging party’s apprehensions if any exist.

6.2.2 If an FTA senses or is otherwise led to believe that the engaging party has difficulty in comprehending the issues, or its findings, opinions, and conclusions, the FTA may suggest further preparation or offer the services of another expert to further clarify the issues or perform an independent examination and analysis of the questioned recording, in an attempt to satisfy the doubt of the engaging party or otherwise assure it of the testimony to be presented.

Forensic speaker identification based on spectral moments

R. Rodman,* D. McAllister,* D. Bitzer,* L. Cepeda* and P. Abbitt†
*Voice I/O Group: Multimedia Laboratory, Department of Computer Science, North Carolina State University, rodman@csc.ncsu.edu
†Department of Statistics, North Carolina State University

ABSTRACT

A new method for doing text-independent speaker identification geared to forensic situations is presented. By analysing ‘isolexemic’ sequences, the method addresses the issues of very short criminal exemplars and the need for open-set identification. An algorithm is given that computes an average spectral shape of the speech to be analysed for each glottal pulse period.
Each such spectrum is converted to a probability density function and the first moment (i.e. the mean) and the second moment about the mean (i.e. the variance) are computed. Sequences of moment values are used as the basis for extracting variables that discriminate among speakers. Ten variables are presented all of which have sufficiently high inter- to intraspeaker variation to be effective discriminators. A case study comprising a ten-speaker database, and ten unknown speakers, is presented. A discriminant analysis is performed and the statistical measurements that result suggest that the method is potentially effective. The report represents work in progress.

KEYWORDS

speaker identification, spectral moments, isolexemic sequences, glottal pulse period

PREFACE

Although it is unusual for a scholarly work to contain a preface, the controversial nature of our research requires two caveats, which are herein presented.

First, the case study described in our article to support our methodology was performed on sanitized data, that is, data not subjected to the degrading effect of telephone transmission or a recording medium such as a tape recorder. We acknowledge, in agreement with Künzel (1997), that studies based strictly on formant frequency values are undermined by telephone transmission. Our answer to this is that our methodology is based on averages of entire spectral shapes of the vocal tract. These spectra are derived by a pitch synchronous Fourier analysis that treats the vocal tract as a filter that is driven by the glottal pulse treated as an impulse function. We believe that the averaging of such spectral shapes will mitigate the degrading effect of the transmittal medium. The purpose of this study, however, is to show that the method, being novel, is promising when used on ‘clean’ data.
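The moment computation named in the abstract can be illustrated with a minimal sketch: a spectrum is normalized so that its values sum to one, which lets it be treated as a discrete probability density over frequency, and the mean and the variance about the mean are then taken. The function name and the three-bin example spectrum are assumptions made for illustration. The text does not say here whether magnitude or power values are normalized, so using the raw values passed in is also an assumption, and the pitch synchronous averaging step is not reproduced.

```python
def spectral_moments(freqs_hz, magnitudes):
    """Return (mean, variance) of a spectrum treated as a discrete pdf."""
    total = sum(magnitudes)
    if total <= 0:
        raise ValueError("spectrum must contain positive energy")
    # Normalize so the values sum to 1 and can be read as probabilities.
    pdf = [m / total for m in magnitudes]
    mean = sum(f * p for f, p in zip(freqs_hz, pdf))
    variance = sum((f - mean) ** 2 * p for f, p in zip(freqs_hz, pdf))
    return mean, variance


# Hypothetical three-bin spectrum, symmetric about 200 Hz.
mean, variance = spectral_moments([100.0, 200.0, 300.0], [1.0, 2.0, 1.0])
print(mean, variance)
```

Applying such a function to the spectrum of each glottal pulse period in turn would yield the sequences of moment values from which the discriminating variables are extracted.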
We also acknowledge, and discuss below in the ‘Background’ section, the fact that historically spectral parameters have not proved successful as a basis for accurate speaker identification. Our method, though certainly based on spectral parameters, considers averages of entire, pitch independent spectra as represented by spectral moments, which are then plotted in curves that appear to reflect individual speaking characteristics.

The other novel part of our approach is comparing ‘like-with-like’. We base speaker identification on the comparison of manually extracted ‘isolexemic’ sequences. This, we believe, permits accurate speaker identification to be made on very short exemplars.

Our methods are novel and so far unproven on standardized testing databases (though we are in the process of remedying this lacuna). The purpose of this article is to publicize our new methodology to the forensic speech community both in the hopes of stimulating research in this area, and of engendering useful exchanges between ourselves and other researchers from which both parties may benefit.

INTRODUCTION

Speaker identification is the process of determining who spoke a recorded utterance. This process may be accomplished by humans alone, who compare a spoken exemplar with the voices of individuals. It may be accomplished by computers alone, which are programmed to identify similarities in speech patterns. It may alternatively be accomplished through a combination of humans and computers working in concert, the situation described in this article.

Whatever the case, the focus of the process is on a speech exemplar – a recorded threat, an intercepted message, a conspiracy recorded surreptitiously – together with the speech of a set of suspects, among whom may or may not be the speaker of the exemplar. The speech characteristics of the exemplar are compared with the speech characteristics of the suspects in an attempt to make the identification.
More technically and precisely, given a set of speakers S = {S1 … SN}, a set of collected utterances U = {U1 … UN} made by those speakers, and a single utterance uX made by an unknown speaker: closed-set speaker identification determines a value for X in [1 … N]; open-set speaker identification determines a value for X in [0, 1 … N], where X = 0 means ‘the unknown speaker SX ∉ S’. ‘Text independent’ means that uX is not necessarily contained in any of the Ui. During the process, acoustic feature sets {F1 … FN} are extracted from the utterances {U1 … UN}. In the same manner, a feature set FX is extracted from uX. A matching algorithm determines which, if any, of {F1 … FN} sufficiently resembles FX. The identification is based on the resemblance and may be given with a probability-of-error coefficient. Forensic speaker identification is aimed specifically at an application area in which criminal intent occurs. This may involve espionage, blackmail, threats and warnings, suspected terrorist communications, etc. Civil matters, too, may hinge on identifying an unknown speaker, as in cases of harassing phone calls that are recorded. Often a law enforcement agency has a recording of an utterance associated with a crime such as a bomb threat or a leaked company secret. This is uX. If there are suspects (the set S), utterances are elicited from them (the set U), and an analysis is carried out to determine the likelihood that one of the suspects was the speaker of uX, or that none of them was. Another common scenario is for agents to have a wiretap of an unknown person who is a suspect in a crime, and a set of suspects to test the recording against. Forensic speaker identification distinguishes itself in five ways. First, and of primary importance, it must be open-set identification. That is, the possibility that none of the suspects is the speaker of the criminal exemplar must be entertained.
Second, it must be capable of dealing with very short utterances, possibly under five seconds in length. Third, it must be able to function when the exemplar has a poor signal-to-noise ratio. This may be the result of wireless communication, of communication over low-quality phone lines, or of data from a ‘wire’ worn by an agent or informant, among others. Fourth, it must be text independent. That is, identification must be made without requiring suspects to repeat the criminal exemplar. This is because the criminal exemplar may be too short for statistically significant comparisons. As well, it is generally true that suspects will find ways of repeating the words so as to be acoustically dissimilar from the original. Moreover, it may be of questionable legality as to whether a suspect can be forced to utter particular words. Fifth, the time constraints are more relaxed. An immediate response is generally not required so there is time for extensive analysis, and most important in our case, time for human intervention. The research described below represents work in progress.

BACKGROUND

The history of electronically assisted speaker identification began with Kersta (1962), and can be traced through these references: Baldwin and French (1990), Bolt (1969), Falcone and de Sario (1994), French (1994), Hollien (1990), Klevans and Rodman (1997), Koenig (1986), Künzel (1994), Markel and Davis (1978), O’Shaughnessy (1986), Reynolds and Rose (1995), Stevens et al. (1968) and Tosi (1979). Speaker identification can be categorized into three major approaches. The first is to use long-term averages of acoustic features. Some features that have been used are inverse filter spectral coefficients, pitch, and cepstral coefficients (Doddington 1985). The purpose is to smooth across factors influencing acoustic features, such as choice of words, leaving behind speaker-specific information.
The disadvantage of this class of methods is that the process discards useful speaker-discriminating data, and can require lengthy speech utterances for stable statistics. The second approach is the use of neural networks to discriminate speakers. Various types of neural nets have been applied (Rudasi and Zahorian 1991, Bennani and Gallinari 1991, Oglesby and Mason 1990). A major drawback to the neural net methods is the excessive amount of data needed to ‘train’ the speaker models, and the fact that when a new speaker enters the database the entire neural net must be retrained. The third approach – the segmentation method – compares speakers based on similar utterances or at least using similar phonetic sequences. Then the comparison measures differences that originate with the speakers rather than the utterances. To date, attempts to do a ‘like phonetic’ comparison have been carried out using speech recognition front-ends. As noted in Reynolds and Rose (1995), ‘It was found in both studies [Matsui and Furui 1991, Kao et al. 1992] that the front-end speech recognizer provided little or no improvement in speaker recognition performance compared to no front-end segmentation.’ The Gaussian mixture model (GMM) of speakers described in Reynolds and Rose (1995) is an implicit segmentation approach in which like sounds are (probabilistically) compared with like. The acoustic features are of the mel-cepstral variety (with some other preprocessing of the speech signal). Their best result in a closed-set test using five-second exemplars was correct identification in 94.5% ± 1.8% of cases using a population of 16 speakers (Reynolds and Rose 1995: 80). Open-set testing was not attempted. Probabilistic models such as Hidden Markov Models (HMMs) have also been used for text-independent speaker recognition. These methods suffer in two ways. One is that they require long exemplars for effective modelling.
Second, the HMMs model temporal sequencing of sounds, which ‘for text-independent tasks … contains little speaker-dependent information’ (Reynolds and Rose 1995: 73). A different kind of implicit segmentation was pursued in Klevans and Rodman (1997) using a two-level cascading segregating method. Accuracies in the high 90s were achieved in closed-set tests over populations (taken from the TIMIT database) ranging in size from 25 to 65 from similar dialect regions. However, no open-set results were attempted. In fact, we believe the third approach – comparing like utterance fragments with like – has much merit, and that the difficulties lie in the speech recognition process of explicit segmentation, and the various clustering and probabilistic techniques that underlie implicit segmentation. In forensic applications, it is entirely feasible to do a manual segmentation that guarantees that lexically similar partial utterances are compared. This is discussed in the following section.

SEMI-AUTOMATIC SPEAKER IDENTIFICATION

Semi-automatic speaker identification permits human intervention at one or more stages of computer processing. For example, the computer may be used to produce spectrograms (or any of a large number of similar displays) that are interpreted by human analysts who make final decisions (Hollien 1990). One of the lessons that has emerged from nearly half a century of computer science is that the best results are often achieved by a collaboration of humans and computers. Machine translation is an example. Humans translate better, but slower; machines translate poorly, but faster. Together they translate both better and faster, as witnessed by the rise in popularity of so-called CAT (Computer-aided Translation) software packages. (The EAMT – European Association for Machine Translation – is a source of copious material on this subject, for example, the Fifth EAMT Workshop held in Ljubljana, Slovenia in May of 2000.)
The history of computer science also teaches us that while computers can achieve many of the same intellectual goals as humans, they do not always do so by imitating human behaviour. Rather, they have their own distinctly computational style. For example, computers play excellent chess but they choose moves in a decidedly nonhuman way. Our speaker identification method uses computers and humans to extract isolexemic sound sequences, which are then heavily analysed by computers alone to extract personal voice traits. The method is appropriate for forensic applications, where analysts may have days or even weeks to collect and process data for speaker identification. Isolexemic sequences may consist of a single phone (sound); several phones such as the rime (vowel plus closing consonant(s)) of a syllable (e.g. the ill of pill or mill); a whole syllable; a word; sounds that span syllables or words; etc. What is vital is that the sequence be ‘iso’ in the sense that it comes from the same word or words of the language as pronounced by the speakers being compared. A concrete example illustrates the concept. The two pronunciations of the vowel in the word pie, as uttered by a northern American and a southern American, are isolexemic because they are drawn from the same English word. That vowel, however, will be pronounced in a distinctly different manner by the two individuals, assuming they speak a typical dialect of the area. By comparing isolexemic sequences, the bulk of the acoustic differences will be ascribable to the speakers. Speech recognizers are not effective at identifying isolexemic sequences that are phonetically wide apart, nor are any of the implicit segmentation techniques.
Only humans, with deep knowledge of the language, know that pie is the same word regardless of the fact that the vowels are phonetically different, and despite the fact that the same phonetic difference, in other circumstances, may function phonemically to distinguish between different words. The same word need not be involved. We can compare the ‘enny’ of penny with the same sound in Jenny knowing that differences – some people pronounce it ‘inny’ – will be individual, not linguistic. Moreover, the human analyst, using a speech editor such as Sound Forge™, is able to isolate the ‘enny’ at a point in the vowel where coarticulatory effects from the j and the p are minimal. In determining what sound sequences are ‘iso’, the analyst need not be concerned with prosodics (pitch or intonation in particular) because, as we shall see, the analysis of the spectra is glottal pulse or pitch synchronous, the effect of which is to minimize the influence of the absolute pitch of the exemplars under analysis. In fact, one of the breakthroughs in the research reported here is an accurate means of determining glottal pulse length so that the pitch synchronicity can hold throughout the analysis of hundreds of spectra (Rodman et al. 2000). Isolexemic comparisons cut much more rapidly to the quick than any other way of comparing the speech of multiple speakers. Even three seconds of speech may contain a dozen syllables, and two dozen phonetic units, all of which could hypothetically be used to discriminate among speakers. The manual intervention converts a text-independent analysis to the more effective text-dependent analysis without the artifice of making suspects repeat incriminating messages, which does not work if the talker is uncooperative in any case, for he may disguise his voice (Hollien 1990: 233).
(The disguise may take many forms: an alteration of the rhythm by altering vowel lengths and stress patterns, switching dialects for multidialectal persons, or faking an ‘accent’.) For example, suppose the criminal exemplar is ‘There’s a bomb in Olympic Park and it’s set to go off in ten minutes.’ Suspects are interviewed and recorded (text independent), possibly at great length over several sessions, until they have eventually uttered sufficient isolexemic parts from the original exemplar. For example, the suspect may say ‘we met to go to the ball game’ in the course of the interview, permitting the isolexemic ‘[s]et to go’ and ‘[m]et to go’ to be compared (text dependent). A clever interrogator may be able to elicit key phrases more quickly by asking pointed questions such as ‘What took place in Sydney, Australia last summer?’, which might elicit the word Olympics among others. Or indeed, the interrogator could ask for words directly, one or two at a time, by asking the suspect to say things like ‘Let’s take a break in ten minutes.’ The criminal exemplar and all of the recorded interviews are digitized (see below) and loaded into a computer. The extraction of the isolexemic sequences is accomplished by a human operator using a sound editor such as Sound Forge™. This activity is what makes the procedure semi-automatic.

FEATURE EXTRACTION

All the speech to be processed is digitized at 22.050 kHz, 16 bit quantization, and stored in .wav files. This format is suitable for input to any sound editor, which is used to extract the isolexemes to be analysed. Once data are collected and the isolexemes are isolated, both from the criminal exemplar and the utterances of suspects (in effect, the training speech), the process of feature extraction can begin. Feature extraction takes place in two stages. The first is the creation of ‘tracks’, essentially an abbreviated trace of successive spectra.
The second is the measurement of various properties of the tracks, which serve as the features for the identification of speakers. Creating ‘tracks’ We discuss the processing of voiced sounds, that is, those in which the vocal cords are vibrating throughout. The processing of voiceless sounds is grossly similar but differs in details not pertinent to this article. (The interested reader may consult Fu et al. 1999.) Our method requires the computation of an average spectrum for each glottal pulse (GP) – opening and closing of the vocal cords – in the speech signal of the current isolexeme. We developed an algorithm for the accurate computation of the glottal pulse period (GPP) of a succession of GPs. The method, and the mathematical proofs that underlie it, and a comparison with other methods, are published as Rodman et al. (2000). By using a precise, pitch synchronous Fourier analysis, we produce spectral shapes that reflect the shape of the vocal tract, and are essentially unaffected by pitch. In effect, we treat the vocal tract as a filter that is driven by the glottal pulse, which is treated as an impulse function. The resulting spectra are highly determined by vocal tract shapes and glottal pulse shapes (not spacing). These shapes are speaker dependent and this provides the basis for speaker identification. We use spectral moments as representative values of these spectral shapes. We use them as opposed, say, to formant frequencies, because they contain information across the entire range of frequencies up to 4 kHz for voiced sounds, and 11 kHz for voiceless sounds (not discussed in this article). The higher formants, and the distribution of higher frequencies in general, have given us better results than in our experiments with pure formant frequencies and even with moments of higher orders (Koster 1995). Knowing the GPP permits us to apply the following steps to compute a sequence of spectral moments. (Assume the current GPP contains N samples.) 1. 
Compute the discrete Fourier transform (DFT) using a window width of N, thus transforming the signal from the time domain to the frequency domain.
2. Take the absolute value of the result (so the result is a real number).
3. Shift over 1 sample.
4. Repeat steps 1–3 N times.
5. Average the N transforms and scale by taking the cube root to reduce the influence of the first formant, drop the DC term, and interpolate it with a cubic spline to produce a continuous spectrum.
6. Convert the spectrum to a probability density function by dividing it by its mass, then calculate the first moment m1 (mean) and the second central moment m2 (variance) of that function in the range of 0 to 4000 Hz and put them in two lists L1 and L2. Let S(f) be the spectrum. The following formulae are used, appropriately modified for the discrete signal:

$$m_1 = \frac{\int_0^{4000} f\,S(f)\,df}{\int_0^{4000} S(f)\,df}, \qquad m_2 = \frac{\int_0^{4000} (f - m_1)^2\,S(f)\,df}{\int_0^{4000} S(f)\,df}$$

7. Repeat Steps 1 through 6 until fewer than 3N samples remain.
8. Scale each moment: m1 by 10^-3 and m2 by 10^-6.

Several comments about the algorithm are in order. The shifting and averaging in Steps 1–5 are effective in removing noise resulting from small fluctuations in the spectra, but preserving idiosyncratic features of the vocal tract shape. Although the window spans the length of two glottal pulses as it slides across, there is one spectral shape computed per glottal pulse. The overlapping windows improve the sensitivity of the method. The process is computationally intense but it yields track points that are reliable and consistent in distinguishing talkers. The procedure also removes the pitch as a parameter affecting the shape of the transform, as noted above. In Step 5 the cube root is taken – at one time we took the logarithm – because the first formant of voiced speech contains a disproportionate amount of the spectral energy. The effect of taking the cube root ‘levels’ the peaks in the spectrum and renders the spectrum more sensitive to speaker differences.
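As an illustration only, the eight steps above might be sketched in Python as follows. This is not the authors' implementation. It assumes a constant glottal pulse period of N samples (the paper estimates the period separately, per Rodman et al. 2000), it uses a direct DFT from the standard library, and it replaces the cubic-spline integral of Step 6 with a plain sum over the DFT bins. The function names and the `f_max` parameter are hypothetical.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N//2 of a real frame (direct O(N^2) DFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def pulse_moments(signal, start, n, fs=22050.0, f_max=4000.0):
    """Scaled (m1, m2) for the glottal pulse of n samples beginning at `start`."""
    avg = [0.0] * (n // 2 + 1)
    for shift in range(n):  # Steps 1-4: N windows of width N, one sample apart
        mags = magnitude_spectrum(signal[start + shift:start + shift + n])
        avg = [a + m / n for a, m in zip(avg, mags)]
    # Step 5: cube-root scaling; bin 0 (DC) is skipped; no spline here (assumption)
    bins = [(k * fs / n, avg[k] ** (1.0 / 3.0))
            for k in range(1, n // 2 + 1) if k * fs / n <= f_max]
    # Step 6: divide by the spectral mass, then take the mean and the variance
    mass = sum(s for _, s in bins)
    m1 = sum(f * s for f, s in bins) / mass
    m2 = sum((f - m1) ** 2 * s for f, s in bins) / mass
    return m1 * 1e-3, m2 * 1e-6  # Step 8: scale for readability


def moment_track(signal, n, fs=22050.0):
    """Step 7: one (m1, m2) point per pulse until fewer than 3N samples remain."""
    points, start = [], 0
    while len(signal) - start >= 3 * n:
        points.append(pulse_moments(signal, start, n, fs))
        start += n
    return points
```

Because each spectrum is divided by its own mass, multiplying the whole signal by a constant leaves the track unchanged, which is the loudness invariance the method relies on.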
The means and variances of Step 6 are chosen as ‘figures of merit’ for the individual spectra. Although representing a single spectrum over a 4 kHz bandwidth with two numbers appears to give up information, it has the advantage of allowing us to track every spectrum in the isolexeme and to measure the changes that occur. This dynamism leads to features that we believe to be highly individuating because they capture the shape, position and movement of the speaker’s articulators, which are unique to each speaker. (This is argued in more detail in Klevans and Rodman 1997.) Also in Step 6, the division by the spectral mass removes the effect of loudness, so that two exemplars, identical except for intensity, will produce identical measurements. Finally, the scaling in Step 8 is performed so that we are looking at numbers in [0, 3] for both means and variances. This is done as a matter of convenience. It makes the resulting data more readable and presentable. The result of applying the algorithm is a sequence of points in two-dimensional m1-m2 space that can be interpolated to give a track. These are the values from the lists L1 and L2. The tracks are smoothed by a three-stage cascading filter: median-5, average-3, median-3. That is, the first pass replaces each value (except endpoints) with the median of itself and the four surrounding values. The second pass takes that median-5 output and replaces each point by the average of itself with the two surrounding values. That output is subjected to the median-3 filter to give the final, smoothed track. The smoothing takes place because the means and variances of the spectra make small jumps when the speech under analysis is in a (more or less) steady state as in the pronunciation of vowels. This is true especially for monophthongal vowels such as the ‘e’ in bed, but even in diphthongs such as the ‘ow’ in cow, there are steady states that span several glottal pulse periods.
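A minimal sketch of the median-5, average-3, median-3 cascade, applied to one list of moment values at a time (L1 or L2), could look like this. How points near the ends are treated is an assumption: any point without a full window on both sides is left unchanged. The function names are hypothetical.

```python
from statistics import mean, median


def _sliding(values, width, reducer):
    """Replace each point that has a full window by reducer(window)."""
    half = width // 2
    out = list(values)
    for i in range(half, len(values) - half):
        out[i] = reducer(values[i - half:i + half + 1])
    return out


def smooth_track(values):
    """Three-stage cascade: median-5, then average-3, then median-3."""
    stage1 = _sliding(values, 5, median)
    stage2 = _sliding(stage1, 3, mean)
    return _sliding(stage2, 3, median)
```

An isolated jump in an otherwise steady run is removed by the first median pass, while a steadily rising or falling stretch passes through all three stages unchanged.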
The smoothing removes much of the irrelevant effect of the jumps. (See also Fu et al. 1999, Rodman et al. 1999.) A visual impression of intra- and interspeaker variation may be seen in Figure 1. The first two tracks in the figure are a single speaker saying owie on two different occasions. The third and fourth tracks in the figure are two different speakers saying owie. Figures 2 and 3 are similar data for the utterances ayo and eya. Our research shows that the interspeaker variation of tracks of isolexemic sequences will be measurably larger than the intraspeaker variation, and therefore that an unknown speaker can be identified through these tracks.

Extracting features from tracks

To compare tracks, several factors must be considered: the region of moment space occupied by the track; the shape of the track; the centre of gravity of the track; and the orientation of the track. Each of these characteristics displays larger interspeaker than intraspeaker variation when reduced to statistical variables. One way to extract variables is to surround the track by a minimal enclosing rectangle (MER), which is the rectangle of minimal area containing the entire track. The MER is computed by rotating the track about an endpoint one degree at a time and computing the area of a bounding rectangle whose sides are parallel to the axes each time, and then taking the minimum. The minimum is necessarily found within 90 degrees of rotation. This is illustrated in Figure 4. From the MER of the curve in its original orientation, we extract four of the ten variables to be used to characterize the tracks, viz. the x-value of the midpoint, the y-value of the midpoint, the length of the long side (L), and the angle of orientation. (The length of the short side was not an effective discriminator for this study.) Four more variables are the minimal x-coordinate, the minimal y-coordinate, the maximal x-coordinate, and the maximal y-coordinate of the track.
They are derived by surrounding the track in its original orientation with a minimal rectangle parallel to the axes and taking the four corner points. These eight parameters measure the track’s location and orientation in moment space. The final two variables attempt to reflect the shape of the track. Note that the spacing and number of track points in an utterance depend on the fundamental pitch. The higher the frequency the fewer the number of samples in the period and hence the greater the number of track points that will be computed over a given time period. To remove this remaining manifestation of pitch, and hence the number of track points, as a factor affecting the measurement of the shape of a curve, we reparameterize the curve based on the distance between track points. We normalize the process so that the curve always lies in the same interval thus removing track length as a factor. (Other variables take it into account.) More particularly, we parameterize the tracks in m1-m2 space into two integrable curves by plotting the m1-value of a point p (the ordinate) versus the distance in m1-m2 space to point p+1 (abscissa), providing the distance exceeds a certain threshold. If it does not, the point p+1 is thrown out and the next point taken, and so on until the threshold is exceeded. The abscissa is then normalized to [0, 1] and the points interpolated into a smooth curve by a cubic spline. This is known as a normalized arc length parameterization. A second curve is produced via the same process using the m2-value of the point p. The two quadrature-based variables are calculated by integrating each curve over the interval [0, 1]. The ten variables are most likely not completely independent. With a data set of this size, it is nearly impossible to estimate the correlations meaningfully. The first eight variables were selected through exploratory analysis to characterize the MER.
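The two geometric constructions in this subsection can be sketched as follows, under stated assumptions. The MER search rotates the track about its first point in one-degree steps over 90 degrees, as described; it returns the area, rotation angle and side lengths only, and leaves out the midpoint. The quadrature function reads the abscissa as accumulated distance along the track and integrates by the trapezoidal rule instead of a cubic spline. The function names and the `min_step` threshold value are hypothetical.

```python
import math


def minimal_enclosing_rectangle(track):
    """Return (area, angle_deg, long_side, short_side) of the best rectangle."""
    x0, y0 = track[0]
    best = None
    for deg in range(90):  # the minimum lies within 90 degrees of rotation
        c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
        xs = [c * (x - x0) - s * (y - y0) for x, y in track]
        ys = [s * (x - x0) + c * (y - y0) for x, y in track]
        w, h = max(xs) - min(xs), max(ys) - min(ys)
        if best is None or w * h < best[0]:
            best = (w * h, deg, max(w, h), min(w, h))
    return best


def arc_length_quadratures(track, min_step=1e-3):
    """Integrate m1 and m2 over normalized arc length; returns (q1, q2)."""
    kept = [track[0]]
    for p in track[1:]:  # drop points closer than min_step to the last kept point
        if math.dist(kept[-1], p) > min_step:
            kept.append(p)
    if len(kept) < 2:
        raise ValueError("track has no usable extent")
    s = [0.0]
    for a, b in zip(kept, kept[1:]):
        s.append(s[-1] + math.dist(a, b))
    s = [v / s[-1] for v in s]  # normalize the abscissa to [0, 1]

    def trapezoid(ys):
        return sum((s[i + 1] - s[i]) * (ys[i] + ys[i + 1]) / 2.0
                   for i in range(len(s) - 1))

    return trapezoid([p[0] for p in kept]), trapezoid([p[1] for p in kept])
```

In this sketch a straight track sampled with three points and the same track sampled with five points give the same pair of quadrature values, which is the independence from the number of track points that the reparameterization is meant to provide.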
The last two variables are related to the shape of the track as opposed to its location and orientation in m1-m2 space and are therefore likely to have a high degree of independence from the other eight. Figures 5A–C illustrate the discriminatory power of these variables. Figures 5A and 5B represent two different utterances of ayo by speaker JT. The first plot in each figure is the track in moment space. The second and third plots are the normalized arc length parameterizations for m1 and m2. (The actual variable used will be the quadrature of these curves.) The similarity in shape of corresponding plots for the same-speaker utterances is evident. Figure 5C is the set of plots for the utterance of ayo by speaker BB. The different curve shapes in Figure 5C indicate that a different person spoke.

A CASE STUDY

The experiment

From an imagined extortion threat containing ‘Now we see about the payola’, we identified three potential isolexemes: owie, eya, and ayo, as might be isolated from the underlined parts of the exemplar. Single utterances of owie, eya, and ayo were extracted from the speech of ten unknown speakers. The set consisted of eight males of whom five were native speakers of American English, and three were near accent-less fluent English speakers whose native language was Venezuelan Spanish. The two females were both native speakers of American English. This is the testing database. We then asked the ten speakers – BB, BS, DM, DS, JT, KB, LC, NM, RR, VN – to utter owie, eya, and ayo four times to simulate the results of interrogations in which those sounds were extracted from the elicited dialogue. This is the training database. All the speech samples were processed to create tracks as described in the ‘Creating “tracks”’ subsection above. The objective of the experiment is to see if the 10 features described in the previous subsection are useful in discriminating among individuals.
The approach of using several variables to distinguish between groups or classes is referred to as discriminant analysis. (See, for example, Mardia et al. 1979.) As mentioned in the ‘Background’ section above, many authors have employed methods such as neural networks and hidden Markov models to discriminate between individuals. (See Klevans and Rodman (1997) for a general discussion.) A disadvantage of these methods is that they require a large amount of training data. We present a fairly simple discriminant analysis, which is easily implemented and can be used with a small amount of training data.

Determining effective discriminators

The set of variables described in the ‘Extracting features from tracks’ subsection above seemed to capture important features of the ayo, eya and owie tracks. We therefore used an analysis of variance (ANOVA) to confirm that these variables are effective in discriminating between individuals. ANOVA is a method of comparing means between groups (see, for example, Snedecor and Cochran 1989). In this case, a group is a set of replicates from an individual. If the mean of a feature varies across individuals, then this variable may be useful for discriminating between at least some of the individuals. In an ANOVA, the F-statistic is the ratio of the interspeaker variation to the intraspeaker variation. If this ratio is large (much larger than one), then we conclude that there is a significant difference in feature means between individuals. Table 1 contains the F-statistics for each of the ten variables described in the ‘Extracting features from tracks’ subsection. In this analysis, each variable is considered separately, so the F-statistic is a measure of a variable’s effectiveness in distinguishing individuals when used alone. For these data an F-statistic larger than 2.2 can be considered large, meaning the variable will be a good discriminator.
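For a single variable, the F-statistic of a one-way ANOVA can be computed as in the sketch below, where each inner list holds one speaker's replicate values of that variable. The function name is hypothetical and the code is a textbook formulation, not the authors' software. With ten speakers and four replicates each, the degrees of freedom are 9 and 30, for which the tabulated 5% critical value is roughly 2.2; we assume, without confirmation from the text, that this is where the 2.2 threshold comes from.

```python
def f_statistic(groups):
    """One-way ANOVA F: between-speaker mean square over within-speaker mean square."""
    k = len(groups)                      # number of speakers
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

For example, three speakers with replicates [1, 2, 3], [2, 3, 4] and [5, 6, 7] have a between-speaker sum of squares of 26 on 2 degrees of freedom and a within-speaker sum of squares of 6 on 6 degrees of freedom, giving F = 13.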
Note that a large F-value does not imply that we can separate all individuals well using the single feature; however, it will be useful in separating the individuals into at least two groups. All of the variables discussed in the ‘Extracting features from tracks’ subsection have a large F-statistic. (Indeed, we used the F-statistic to eliminate as ineffective such potential variables as the length of the short side of the MER.)

Measures of similarity

Having determined that all ten features are useful for all three sounds, the discriminant analysis will be based on these 30 variables. Let yi be the 30-dimensional vector of sample means for speaker i. In our training database, this mean is based on four repetitions for each speaker. It will be easy to discriminate between individuals if the yi’s are ‘far apart’ in 30-dimensional space. One way to measure the distance between means is to use Euclidean distance. However, this metric is not appropriate in this situation because it does not account for differing variances and covariances. For example, a change in one unit of the angle of orientation variable is not equivalent to a change of one unit of a quadrature-based variable. Also, with a one-unit change in maximum-y, we might expect a change in minimum-y or the long side variables. Mahalanobis distance is a metric that accounts for variances and covariances between variables (see, for example, Mardia et al. 1979). Let Σ be the 30x30 dimensional covariance matrix. We will partition Σ into nine 10x10 matrices, six of which are distinct. The matrix has the form

$$\Sigma = \begin{pmatrix} \Sigma_{AA} & \Sigma_{AE} & \Sigma_{AO} \\ \Sigma_{AE}^{T} & \Sigma_{EE} & \Sigma_{EO} \\ \Sigma_{AO}^{T} & \Sigma_{EO}^{T} & \Sigma_{OO} \end{pmatrix}$$

For example, the submatrix ΣAA represents the covariance matrix of the ten variables associated with the ayo sound. The submatrix ΣAE represents the covariance matrix of the ten ayo variables and the ten eya variables. We make two assumptions about the structure of this matrix.
First, we assume that the diagonal submatrices are constant for all individuals, so that ΣAA, ΣEE and ΣOO can be estimated by pooling the corresponding sample covariance matrices across individuals. This is a fairly strong assumption, but with the size of the training data set, we cannot reliably estimate these matrices separately for each individual without making even more stringent distributional assumptions. Secondly, we assume that Σ has a block diagonal structure. That is, the matrices ΣAE, ΣAO and ΣEO are assumed to be matrices of zeros. This is also a strong assumption, but again, the size of the training data set does not allow for reliable estimation of these submatrices. Let Σ̂ be the estimate of Σ using the zero matrices and estimated matrices described above. The squared Mahalanobis distance between individuals i and j is

$$M_{ij} = (y_i - y_j)^{T}\,\hat{\Sigma}^{-1}\,(y_i - y_j)$$

Table 2 contains squared Mahalanobis distances for the ten individual means in the training database. The lower triangle of the table is blank because these cells are redundant. Relatively small distances indicate that the individuals are similar with respect to the variables used in the analysis. For example, the most similar individuals are KB and NM (the two female speakers) while the most dissimilar are DM and RR (both native speakers of American English).

Classifying exemplars

For features extracted from a set of three utterances (ayo, eya, owie) from a speaker, with yx denoting the resulting 30-dimensional feature vector, we can calculate the squared Mahalanobis distance from the exemplar to each individual mean by

$$M_{xi} = (y_x - y_i)^{T}\,\hat{\Sigma}^{-1}\,(y_x - y_i)$$

For the closed-set problem, we identify SX by choosing the individual mean to which yx is closest. We first tested this identification rule on each exemplar in the training set. The rule correctly identified the speaker for all training exemplars. We would expect to have a low error rate in this case, since each exemplar was also used in estimating the speaker means and Σ̂. The rule was also applied to unknowns 1–7 in the testing database. These exemplars came from speakers in the training set.
(Unknowns 8–10 were ‘ringers’ introduced for the open-set test. They consisted of one male and two female native speakers of American English, replacing one female and one male native speaker of American English, and one male speaker whose native language was Venezuelan Spanish.) Table 3 contains the squared Mahalanobis distances from each yx to each individual mean. Each speaker was identified correctly. For each unknown (1–7), the minimum distance is less than 100, except for Unknown 6. The asterisk marks the minimum distance for unknowns 8–10. The minimum distances are lower, in general, than the interspeaker distances given in Table 2. This confirms that this set of variables is useful for discriminating between individuals. Also, the distances from each speaker in the test set seem to follow approximately the same trends as in Table 2. For example, in the training data, DM was the most dissimilar to BB. For Unknown 1 (BB), the largest distance is to DM. In many cases, it will be desirable to report not only the individual selected by the rule, but also to provide an estimate of the reliability of the procedure. The reliability may be determined empirically. We may use the observed error rate for the closed-set classification rule when applied to the training data and test speakers 1–7. Cross-validation can also be used to estimate error rates. However, due to the size of the study, neither method will provide a reasonable estimate of reliability of the procedure. Another method of estimating reliability would be to make distributional assumptions, e.g. multivariate normality. Any such assumptions would be difficult to verify with such a small data set. Developing a framework for estimating the reliability of such a procedure with a small data set is planned for future work.
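The closed-set rule can be illustrated with the following sketch, which is one possible reading of the procedure and not the authors' code. Under the block-diagonal assumption the squared distance is the sum of three per-sound terms, each using that sound's pooled covariance matrix. Feature vectors are kept as one list per sound (ayo, eya, owie), a linear system is solved in place of forming an explicit inverse, and all function names are hypothetical.

```python
def pooled_covariance(samples_by_speaker):
    """Pool within-speaker scatter for one sound: {speaker: [feature vectors]}."""
    dim = len(next(iter(samples_by_speaker.values()))[0])
    scatter = [[0.0] * dim for _ in range(dim)]
    dof = 0
    for reps in samples_by_speaker.values():
        mu = [sum(col) / len(reps) for col in zip(*reps)]
        for r in reps:
            d = [a - b for a, b in zip(r, mu)]
            for i in range(dim):
                for j in range(dim):
                    scatter[i][j] += d[i] * d[j]
        dof += len(reps) - 1
    return [[v / dof for v in row] for row in scatter]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def squared_distance(x_blocks, y_blocks, cov_blocks):
    """Sum over sounds of (x - y)^T C^-1 (x - y), one covariance block per sound."""
    total = 0.0
    for x, y, cov in zip(x_blocks, y_blocks, cov_blocks):
        d = [a - b for a, b in zip(x, y)]
        total += sum(di * zi for di, zi in zip(d, solve(cov, d)))
    return total


def classify_closed_set(x_blocks, means, cov_blocks):
    """Return the speaker whose mean is nearest to the exemplar."""
    return min(means, key=lambda s: squared_distance(x_blocks, means[s], cov_blocks))
```

Here `means` maps each speaker label to that speaker's per-sound mean vectors, and `cov_blocks` holds one pooled matrix per sound in the same order.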
For the open-set problem, the rule must be modified to allow us to conclude that the unknown speaker is not in the training set ($X = 0$). One way of modifying this rule would be to establish a distance threshold. If none of the distances $M_{xi}$ fall below this threshold, then we conclude $X = 0$. As in the closed-set problem, estimates of reliability are desirable. In general, error rates will depend on the choice of the threshold. We investigated empirical choices of thresholds for this experiment. For the test data set, whatever distance threshold we choose, we will misclassify at least one of the ten unknowns. For example, if we choose a distance threshold of 100, Unknown 6 will be incorrectly assigned to $S_0$ and Unknown 9 will be incorrectly classified as KB. In this testing situation, we can pick a distance threshold that minimizes the number of misclassification errors. However, this will not be possible in a practical situation. A framework for choosing thresholds for the open-set problem is planned for future work.

SUMMARY AND FUTURE DIRECTIONS

The results we obtained are encouraging given the sparseness of the data. The known speakers had about 8–12 seconds of speech data per speaker. The unknowns had one-quarter of that amount, 2–3 seconds. In an actual forensic situation it is highly likely that many times this amount of data will be available for the criminal exemplar (unknown speaker), and as much data as needed for suspects (known speakers). The identification process is cumulative in nature. As additional data become available, there is more information for individuating speakers, and the error probabilities diminish. In practice the only limitation is the amount of data in the criminal exemplar (the testing data). Often, authorities are able to collect as large an amount of training data as needed. Each new sound sequence that undergoes analysis makes its small contribution to the overall discrimination.
Even in an utterance as short as ‘There’s a bomb in Olympic Park and it’s set to go off in ten minutes’ there are easily a dozen or more sequences that may be extracted for analysis. Thus we are sanguine about the ability of this method to work in practice.

When the case study is regarded as closed-set speaker identification, the system performed without error. While it is unrealistic to expect zero error rates in general, the results forecast a relatively low error rate in cases of this kind. Many practical scenarios require only closed-set identification. For example, in a corporate espionage case, where a particular phone line is tapped, there are a limited number of persons who have access to that phone line. Similar cases are described in Klevans and Rodman (1997).

The more difficult and more general open-set identification yielded error rates between 10 and 20 per cent depending on how thresholds are set. Our current research is strongly concerned with reducing this error rate.

Future research: short term

Our research in this area is expanding in three directions. The first is to use a larger quantity of data for identification. As a simple example, the training database might contain ten repetitions of ten vowel transitional segments similar to owie. It is expected that the F-values of the variables would rise, meaning that the ratio of interspeaker variation to intraspeaker variation will climb. At one time we used only three utterances per sound per speaker in the training database, and when we went to four utterances the F-values increased significantly, which supports our expectation. (Naturally this implies lengthier interrogation sessions in a forensic application, but when a serious crime is involved, the extra effort may be justified.)

The second direction is to use more phonetically varied data. The vowel transitions of this study were chosen primarily to determine if the methodology was promising.
They do not span the entire moment space encompassed by the totality of speech sounds. There are speech sounds such as voiced fricatives that produce tracks that extend beyond the union of the MERs for the above utterances. Moreover, we are also able to process voiceless sounds to produce moment tracks, but using a different processing method that analyses the speech signal at frequencies up to 11 kHz (Fu et al. 1999). We are also able to process liquids [l], [r] and nasals [m], [n], [nj], [N]. We hypothesize that the use of other transitions, for example, vowel-fricative-vowel as in lesson, will increase the discriminatory power of the method because it ‘views’ a different aspect of the speaker’s vocal tract. An interesting, open, minor question is whether particular types of sequences (e.g. vowel-nasal-vowel, diphthong alone, etc.) will be more effective discriminators than others. We are currently moving from producing our own data to using standardized databases such as those available from the Linguistic Data Consortium. While this makes the data extraction process more difficult and time-consuming, it has the advantage of providing test data of the kind encountered in actual scenarios, particularly if one of the many telephone-based databases is used.

The third direction is to find more and better discriminating variables. Eight of the ten variables are basically ‘range statistics’, a class of statistics well known for their lack of robustness and extreme sensitivity to outliers, and, as noted above, they are not entirely mutually independent. Both more and varied data would mitigate these shortcomings, but what is truly needed is a more precise measurement of curve shape, since the shape appears to be highly correlated to the speaker. We are experimenting with methods to characterize the shape of a curve.
The visual appearance of the shape of tracks for a given speaker for a given utterance, and the differences between the shapes of the tracks among speakers for the same utterance, suggest that curve shape should be used for speaker identification. Curvature scale space (Mokhtarian and Mackworth 1986, Mokhtarian 1995, Sonka et al. 1999) is a method that has been proposed to measure the similarity of 2D curves for the purpose of retrieving curves of similar shape from a database of planar curves. The method tries to quantify shape by smoothing the curve (the scaling process) and watching where the curvature changes sign. When the scaling process produces no more curvature changes, the resulting history of the changes throughout the smoothing process is used to do curve matching (Mokhtarian 1995). We are currently exploiting this methodology to extract variables that are linked to the shape of the moment tracks in m1-m2 space. These variables should provide discriminating power highly independent of the variables currently in use, and hence would improve the effectiveness of the identification process. Other methods for exploiting shape differences are also being considered. Matching shapes, while visually somewhat straightforward, is a difficult problem to quantify algorithmically, and methods for its solution have only recently begun to appear in the literature.

Future research: long term

Our long-term future research is also pointed in three slightly different directions. They are (1) noisy data, (2) channel-impacted data, and (3) disguised voice data. All three of these data-distorting situations may compromise the integrity of a speaker identification system based on ‘clean’ data. A system for practical use in a forensic setting would need methods for accommodating messy data.
This is a vast and complex topic, and most of the work needed would necessarily follow the development of the speaker identification system as used under less unfavourable circumstances.

ACKNOWLEDGEMENT

The authors wish to acknowledge the editors for helpful assistance in improving the presentation of the foregoing work.

REFERENCES

Baldwin, J. R. and French, P. (1990) Forensic Phonetics, London: Pinter Publishers.
Bennani, Y. and Gallinari, P. (1991) ‘On the Use of TDNN-Extracted Features Information in Talker Identification’, ICASSP (International Conference on Acoustics, Speech and Signal Processing), 385–8.
Bolt, R. H., Cooper, F. S., David, E. E., Denes, P. B., Pickett, J. M., and Stevens, K. N. (1969) ‘Identification of a speaker by speech spectrograms’, Science, 166: 338–43.
Doddington, G. (1985) ‘Speaker Recognition – Identifying People by Their Voices’, in Proceedings of the IEEE (Institute of Electrical and Electronics Engineers), 73(11): 1651–63.
Falcone, M. and de Sario, N. (1994) ‘A PC speaker identification system for forensic use: IDEM’, in Proceedings of the ESCA (European Speech Communication Association) Workshop on Automatic Speaker Recognition, Identification and Verification, Martigny, Switzerland, 169–72.
French, P. (1994) ‘An overview of forensic phonetics with particular reference to speaker identification’, Forensic Linguistics, 1(2): 169–81.
Fu, H., Rodman, R., McAllister, D., Bitzer, D. and Xu, B. (1999) ‘Classification of Voiceless Fricatives through Spectral Moments’, in Proceedings of the 5th International Conference on Information Systems Analysis and Synthesis (ISAS’99), Skokie, Ill.: International Institute of Informatics and Systemics, 307–11.
Hollien, H. (1990) The Acoustics of Crime: The New Science of Forensic Phonetics, New York: Plenum Press.
Kao, Y., Rajasekaran, P. and Baras, J.
(1992) ‘Free-text speaker identification over long distance telephone channel using hypothesized phonetic segmentation’, ICASSP (International Conference on Acoustics, Speech and Signal Processing), II.177–II.180.
Kersta, L. G. (1962) ‘Voiceprint identification’, Nature, 5(196): 1253–7.
Klevans, R. L. and Rodman, R. D. (1997) Voice Recognition, Norwood, Mass.: Artech House Publishers.
Koenig, B. E. (1986) ‘Spectrographic voice identification: a forensic survey’, Journal of the Acoustical Society of America, 79: 2088–90.
Koster, B. E. (1995) Automatic Lip-Sync: Direct Translation of Speech-Sound to Mouth-Animation, PhD dissertation, Department of Computer Science, North Carolina State University.
Künzel, H. J. (1994) ‘Current Approaches to Forensic Speaker Recognition’, in Proceedings of the ESCA (European Speech Communication Association) Workshop on Automatic Speaker Recognition, Identification and Verification, Martigny, Switzerland, 135–41.
Künzel, H. J. (1997) ‘Some general phonetic and forensic aspects of speaking tempo’, Forensic Linguistics, 4(1): 48–83.
Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979) Multivariate Analysis, London: Academic Press.
Markel, J. D. and Davis, S. B. (1978) ‘Text-independent speaker identification from a large linguistically unconstrained time-spaced data base’, ICASSP (International Conference on Acoustics, Speech and Signal Processing), 287–9.
Matsui, T. and Furui, S. (1991) ‘A text-independent speaker recognition method robust against utterance variations’, ICASSP (International Conference on Acoustics, Speech and Signal Processing), 377–80.
Mokhtarian, F. (1995) ‘Silhouette-based isolated object recognition through curvature scale space’, IEEE (Institute of Electrical and Electronics Engineers) Transactions on Pattern Analysis and Machine Intelligence, 17(5): 539–44.
Mokhtarian, F. and Mackworth, A.
(1986) ‘Scale-based description and recognition of planar curves and two-dimensional shapes’, IEEE (Institute of Electrical and Electronics Engineers) Transactions on Pattern Analysis and Machine Intelligence, PAMI-8(1): 34–43.
Oglesby, J. and Mason, J. S. (1990) ‘Optimization of Neural Models for Speaker Identification’, ICASSP (International Conference on Acoustics, Speech and Signal Processing), 393–6.
O’Shaughnessy, D. (1986) ‘Speaker Recognition’, IEEE (Institute of Electrical and Electronics Engineers) ASSP (Acoustics, Speech and Signal Processing) Magazine, October, 4–17.
Reynolds, D. A. and Rose, R. C. (1995) ‘Robust text-independent speaker identification using Gaussian mixture speaker models’, IEEE (Institute of Electrical and Electronics Engineers) Transactions on Speech and Audio Processing, 3(1): 72–83.
Rodman, R. D. (1998) ‘Speaker recognition of disguised voices’, in Proceedings of the COST 250 Workshop on Speaker Recognition by Man and Machine: Directions for Forensic Applications, Ankara, Turkey, 9–22.
Rodman, R. D. (1999) Computer Speech Technology, Boston, Mass.: Artech House Publishers.
Rodman, R., McAllister, D., Bitzer, D., Fu, H. and Xu, B. (1999) ‘A pitch tracker for identifying voiced consonants’, in Proceedings of the 10th International Conference on Signal Processing Applications and Technology (ICSPAT’99).
Rodman, R., McAllister, D., Bitzer, D. and Chappell, D. (2000) ‘A High-Resolution Glottal Pulse Tracker’, in International Conference on Spoken Language Processing (ICSLP), October 16–20, Beijing, China (CD-ROM).
Rudasi, L. and Zahorian, S. A. (1991) ‘Text-independent talker identification with neural networks’, ICASSP (International Conference on Acoustics, Speech and Signal Processing), 389–92.
Snedecor, G. W. and Cochran, W. G. (1989) Statistical Methods (8th edn), Ames, IA: Iowa State University Press.
Sonka, M., Hlavac, V. and Boyle, R. (1999) Image Processing, Analysis, and Machine Vision (2nd edn), Boston, MA: PWS Publishing, ch. 6.
Stevens, K. N., Williams, C. E., Carbonelli, J. R.
and Woods, B. (1968) ‘Speaker authentication and identification: a comparison of spectrographic and auditory presentations of speech material’, Journal of the Acoustical Society of America, 43: 1596–1607.
Tosi, O. (1979) Voice Identification: Theory and Legal Applications, Baltimore, Md.: University Park Press.
