Enhancement of tape recorded voices to facilitate
transcription & aural identification:
Selected Topics in Forensic Voice Identification
Bruce E. Koenig, October 1993
Federal Bureau of Investigation
Ongoing law enforcement operations throughout the world are continually capturing the
voices of suspects with miniature transmitter/receiver systems, analog and digital on-the-body recorders, telephone intercept devices, and concealed room microphones. Since
these recordings are normally utilized for investigative leads and/or legal proceedings,
specific speakers must be accurately identified. Voice identifications that occur through
self-recognition of one's voice, eye-witness information, surveillance logs, and the use of
a person's name in the conversation are usually readily accepted. However, voice
identifications that involve listening only and/or laboratory tests are often more difficult
to evaluate accurately. To provide a better understanding of these voice comparison
topics, two types of aural-only comparisons will be discussed, and an update on the
spectrographic technique is included.
Aural Identification of Familiar Voices
Recognition of familiar voices is a daily occurrence for most people, as they identify
spouses, children, coworkers, friends, and business associates after only a few words
spoken over the telephone or by hearing them from an adjacent room. This process
involves long-term memory, where recognition occurs through a prior knowledge of
speech characteristics, including such attributes as accent, speech rate, pronunciation,
pitch, vocabulary, and vocal variance (intraspeaker variability).
Some of the relevant scientific research and opinions that address the accuracy of
identifying familiar voices include the following:
1. Researchers used 7 listeners who were familiar with the 16 chosen
speakers through daily contact. The speakers had no pronounced speech
defects or accents. Groups of two to eight speech samples of varying
lengths were played back to the listeners, which resulted in an
identification accuracy of better than 95% for samples lasting from about
1 to 2 seconds. Voice samples were also frequency restricted, but the
results reflected only a limited loss of accuracy under conditions normally
encountered in law enforcement investigations. In tests involving
whispered speech, the duration had to be roughly three times that of normal
speech samples to obtain equivalent levels of identification (Pollack et al. 1954).
2. Sixteen listeners with no hearing losses, who had known the recorded 10
male coworkers for at least 2 years, were chosen. None of the 10 recorded
individuals had either pronounced regional accents or speech
abnormalities. When the listeners heard sentences of less than 3 seconds
duration from the 10 coworkers, their median accuracy rate of
identification was 98% (range of 92% to 100%). When only a disyllable
(e.g., mama) was spoken, the median accuracy rate dropped to 88% (range
of 73% to 98%) (Bricker and Pruzansky 1966).
3. In a study of coworkers, recordings were made on different telephone lines
of four women and seven men, each talking for 30 seconds to 1 minute on
a neutral topic such as the weather. An additional recording was prepared
of another male, who was relatively unfamiliar to most of the listeners.
The recordings were arranged in a random order and played to 10 of the
other coworkers, who were asked to identify the speakers. "All the
listeners except one correctly identified all the 11 [coworkers]... The one
listener who made an error... confused two speakers who were not well
known to him. Three of the 10 listeners knew [the eighth male, who was
not a coworker], and correctly identified him. Of the remaining seven
listeners, only two said that they could not recognize this speaker. Five
listeners wrongly identified this speaker as..." another one of their
coworkers. "It is worth noting that four of the five listeners who made the
wrong identification were highly skilled, experienced phoneticians..." with
doctoral degrees in the field (Ladefoged 1978). This experiment reflects a
100% identification rate for the coworkers' voices that were well-known to
them and an overall average accuracy rate of 96% when the relatively
unfamiliar voice was added.
4. Twenty-four individuals were asked to listen to speech samples of 24
coworkers (15 males and 9 females) whom they had known for several
years and 4 speakers unknown to the listeners. The speech samples
averaged about 30 seconds in length and contained at least 12 utterances
of 2 to 4 words each. Listeners rated each coworker on a scale of very
familiar to totally unfamiliar prior to the testing. They listened to the
samples for as long as they wished and then rated their decisions as
follows: (1) guessing, (2) fairly sure, or (3) very sure. Deleting the results
of any voice rated totally unfamiliar to the listener, the results showed a
90.4% correct identification rate and 4.3% incorrect identification rate,
with 5.3% who said they did not know the speaker. If the 5.3% are deleted,
the correct identification rate is 95.4%. "This rate is probably fairly
representative of situations where a limited vocabulary is required and can
be expected to be even higher in informal conversations where more of the
individual speaker's speech habits are present as cues for identification"
(Schmidt-Nielson and Stern 1985).
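The renormalization in item 4 ("if the 5.3% are deleted") is simple arithmetic; a short Python sketch, not part of the original study, shows the step explicitly:

```python
# Percentages reported by Schmidt-Nielson and Stern (1985), after voices
# rated "totally unfamiliar" were excluded.
correct, incorrect, unknown = 90.4, 4.3, 5.3

# Renormalize over only the trials that produced a decision.
decided = correct + incorrect        # 94.7% of responses were decisions
rate = 100 * correct / decided       # about 95%, in line with the reported rate
print(round(rate, 1))
```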
This research reflects that the identification accuracy rate for familiar voice samples
lasting 1 second or longer ranged from 92% to 100% and averaged 95% to 100%.
Samples recorded through the telephone or other limited bandwidth systems had little
effect on accuracy. The effects of noise and loss of high frequency information were
studied in another experiment (Clarke et al. 1966) which found that aural speaker
identification was only slightly degraded when progressing from high-quality voice
samples to typical investigative recordings. It is obvious from everyday experience and
the cited research that identifying familiar voices can be an accurate method for
identifying voices recorded in forensic applications, even with the limiting factors of
noise and attenuated high frequencies.
Aural Comparisons of Unfamiliar Voices
Aural comparisons of unfamiliar voice samples rely on short-term memory. For example,
a woman receives a number of different telephone inquiries regarding a classified
advertisement. She then receives an obscene telephone call, and she tries to remember if
any of the voices match. In a judicial proceeding, a judge and/or a jury may have to
decide if a particular crucial comment on an investigative recording was spoken by the
defendant, who readily admits to saying the other statements attributed to him on the
transcript, or to someone else involved in the conversation. Examiners using the
spectrographic technique, described later, play back the separate voice samples
concurrently on separate devices or computer files with an electronic patching
arrangement to allow rapid aural switching between them or by recording short phrases
or sentences from each sample on the same recording (Voice Comparison Standards
1991). The de facto study of unfamiliar voice comparisons (Clarke et al. 1966)
determined the following:
1. Sentence length over the range of 5 to 11 syllables is not an important
variable in identification accuracy.
2. Correct identifications decreased from approximately 90% to 80% when
the signal-to-noise ratio (SNR) was reduced from 30 decibels (dB) to 0 dB.
3. Correct identifications decreased from approximately 88% to 78% when
the frequency response was reduced from 4,500 hertz (Hz) to 1,000 Hz.
Since most investigative recordings have an SNR of 10 dB to 40 dB and a frequency
response of 2,500 Hz to 5,000 Hz, the range of expected correct identifications of
unfamiliar voices would be 78% to 90%, with most identifications in the 78% to 83%
range.
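The decibel figures above relate signal and noise power logarithmically. A minimal Python helper (illustrative only; the values used are the study's endpoints, not new data) makes the relationship concrete:

```python
import math

def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels: SNR = 10 * log10(Ps / Pn)."""
    return 10 * math.log10(signal_power / noise_power)

# The 30 dB condition in the Clarke study has 1,000 times more signal
# power than noise; at 0 dB, signal and noise power are equal.
print(snr_db(1000, 1))   # 30 dB
print(snr_db(1, 1))      # 0 dB
```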
The use of expert testimony for aural identifications of unfamiliar voices provides no
assistance to the court and/or to the jury. The notes of the advisory committee on Rule
901 of the Federal Rules of Evidence appropriately reflect this fact as follows: "Since
aural voice identification is not a subject of expert testimony, the requisite familiarity
may be acquired either before or after the particular speaking which is the subject of the
identification..." (Federal Criminal Code and Rules 1991). Additionally, the voice
comparison standards of the International Association for Identification (IAI) specifically
state that it "... does not support or approve the use of... aural only expert decisions..." for
voice comparisons (1991).
Spectrographic Comparisons
The spectrographic laboratory technique is the most well-known and possibly the most
accurate of the laboratory testing procedures presently available for comparing verbatim
voice samples under forensic conditions. However, some scientists believe that aural
identifications of very familiar voices are more accurate (Hecker 1971). The
spectrographic technique has been described in numerous forensic and scientific
publications, including an overview article published in the Crime Laboratory Digest
(Koenig 1986). Therefore, a detailed explanation will not be rendered here; the following
paragraphs provide a brief summary of the examination, a review of the new
comprehensive standards passed by the IAI, and its status in government and private
laboratories.
When properly conducted, spectrographic voice identification is a relatively accurate but
not conclusive examination for comparing a recorded unknown voice sample with a
suspect repeating the identical contextual information over the same type of transmission
system (e.g., a local telephone line). The examiner uses both the short-term memory
process previously detailed and a spectral pattern comparison between identically spoken
sounds on spectrograms.
Figures 1A and 1B are sound spectrograms of different male speakers saying "salt and
pepper."
The horizontal axis represents time, divided into 0.1-second intervals by the short vertical
bars near the top, and the vertical axis is frequency, ranging linearly from 80 Hz to 4000
Hz, with horizontal lines every 1000 Hz. The speech energy is reflected in the gray scale
from black (highest level) to white (lowest level). The frequency range of the voice is
analogous to the range of a musical instrument, where the lowest notes are at the lowest
frequency and the highest notes at the highest frequency. The mostly horizontal bands of
darkness reflect the vocal resonances and are called formants. The closely spaced vertical
striations represent fundamental frequency (voice pitch) or the actual vibrations of the
vocal cords. The spectrographic technique requires comparison of identical phrases
between the voice samples, with a decision made at one of a number of confidence levels.
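The time, frequency, and gray-scale amplitude display described above can be reproduced with standard signal-processing tools. The following Python sketch (NumPy/SciPy; the synthetic 120 Hz signal and analysis parameters are illustrative assumptions, not part of the original examination protocol) computes such a time-frequency array:

```python
import numpy as np
from scipy.signal import spectrogram

# Synthetic stand-in for a voiced sample: a 120 Hz "pitch" component plus
# one harmonic, sampled at 8 kHz (assumed values, not forensic data).
fs = 8000
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 240 * t)

# f: frequency axis (Hz), times: time axis (s), Sxx: power in each (f, t)
# cell -- the quantity rendered as the gray scale on a printed spectrogram.
f, times, Sxx = spectrogram(x, fs=fs, nperseg=256, noverlap=192)

print(f.max())                         # top of the frequency axis: fs/2 = 4000 Hz
print(f[np.argmax(Sxx.mean(axis=1))])  # strongest band, near the 120 Hz fundamental
```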
The scientific support of this examination is limited, and the actual error rate under most
investigative conditions is unknown. The research to date indicates that the technique has
a certain error rate that is independent of examiner-induced errors, with errors of false
elimination (the voice samples were actually from the same person, but the examination
found that they did not match) appreciably higher than false identification (the voice
samples were actually from different persons, but the examination found that the samples
matched).
In July 1991, the Voice Identification and Acoustic Analysis Subcommittee of the IAI
passed and published its first set of comprehensive spectrographic voice identification
standards. These requirements, which became effective January 1, 1992, for all certified
IAI members, include examiner qualifications, evidence handling, preparation of
exemplars, preparation of copies, preliminary examination, preparation of spectrograms,
spectrographic/aural analysis, work notes, testimony, certification, and miscellaneous
subjects. Table 1 lists the minimum qualifications for spectrographic examiners of the
IAI and the FBI and updates a similar table published in an earlier issue of the Crime
Laboratory Digest (Koenig 1986). Table 2 is another updated and expanded table from
the same article concerning minimum criteria for spectrographic comparisons. Tables 1
and 2 and the previously published tables reflect that the upgraded IAI standards are now
appreciably closer to the FBI's criteria. The FBI's standards require higher educational
levels, more words for lower confidence decisions, enhancement procedures when
needed, and a higher frequency voice range. The most important legal difference is the
FBI's policy not to provide testimony on spectrographic comparisons due to the
inconclusive nature of the examination and the unknown error rate under specific
investigative conditions.
Table 1. Minimum Qualifications for Spectrographic Examiners of the IAI and FBI

Qualification                             IAI                   FBI
Education                                 High School Diploma   BS Degree
Periodic Hearing Test                     Yes                   Yes
Length of Apprenticeship                  Usually 2 Years       2 Years
Number of Comparisons Conducted           100                   100
Attendance at a Spectrographic School     Yes                   Yes
Formal Certification                      Yes                   Yes
Table 2. Minimum Criteria for Spectrographic Comparison for the IAI and the FBI

Criteria                                    IAI            FBI
Words Needed for Highest Confidence Level   20             20
Words Needed for Lowest Confidence Level    10             20
Affirming Independent Second Decision       Yes            Yes
Original Recording Required                 Yes            Yes
Allows Testimony                            Yes            No
Completely Verbatim Known Samples           Usually        Usually
Speech Frequency Range                      Above 2 kHz    Above 2.5 kHz
Accuracy Statement on Report                Yes            Yes
Enhancement Procedures When Needed          Optional       Yes
Speed Correction of All Recordings          Yes            Yes
Track Determination of All Recordings       Yes            Yes
Azimuth Alignment Correction                Yes            Yes
The use of the spectrographic technique since the mid-1980s continues to show a steady
decline by both government laboratories and private examiners. As of mid-1993, the New
York City Police Department and the FBI were the only government laboratories in this
country regularly conducting these examinations. The private sector efforts were limited
to less than a dozen part-time examiners. Professional meetings in the field have been
sparsely attended, and no major spectrographic research is known to be under way.
Problems still persist in the spectrographic voice identification field. Examples of these
problems include the following: (1) separate sets of certified examiners making high-confidence decisions for both identification and elimination in the same case (see Note 1); (2)
individuals with no experience, training, or education in the voice identification
discipline making conclusive decisions under oath in court; and (3) examiners testifying
that an unknown voice is not the defendant's, although admitting their decisions are really
inconclusive based upon accepted standards.
Note 1. Los Angeles Board of Civil Service Commissioners. Threat case decided March 25, 1992, in which three IAI examiners made an
identification at a high-confidence level, while two IAI examiners eliminated the suspect.
Summary and Conclusion
Under investigative conditions, individuals can reliably identify voices that are well
known to them, but the accuracy rate drops to approximately 78% to 83% when
unfamiliar voices are compared to known voice samples. The use of expert witnesses
does not improve the accuracy rate of aural only voice comparisons. The use of the
spectrographic technique continues to decline, even with the establishment of new
standards in 1992.
References
Bricker, P. D. and Pruzansky, S. Effects of stimulus content and duration on talker identification, Journal
of the Acoustical Society of America (1966) 40:6:1441-1449.
Clarke, F. R., Becker, R. W., and Nixon, J. C. Characteristics that Determine Speaker Recognition.
Technical Report ESD-TR-66-636, Electronic Systems Division, US Air Force, 1966.
Compton, A. J. Effects of filtering and vocal duration upon the identification of speakers, aurally, Journal
of the Acoustical Society of America (1963) 35:11:1748-1752.
Federal Criminal Code and Rules. West, St. Paul, MN, 1991, p. 289.
Hecker, M. H. L. Speaker Recognition: An Interpretive Survey of the Literature. American Speech and
Hearing Association, Washington, DC, 1971.
Koenig, B. E. Spectrographic voice identification, Crime Laboratory Digest (1986)13:4:105-118.
Ladefoged, P. Expectation affects identification by listening, Language and Speech (1978) 21:4:373-374.
Pollack, I., Pickett, J. M., and Sumby, W. H. On the identification of speakers by voice, Journal of the
Acoustical Society of America (1954) 26:3:403-406.
Schmidt-Nielson, A. and Stern, K. R. Identification of known voices as a function of familiarity and
narrowband coding, Journal of the Acoustical Society of America (1985) 77:2:658-663.
Voice comparison standards, Journal of Forensic Identification (1991) 41:5:373-392.
Voiceprint Identification
Money Laundering and Narcotics Update, Department of Justice, 1988, and
The Legal Investigator, 1990, by Steve Cain, Lonnie Smrkovski, and Mindy
Wilson
Voiceprint identification can be defined as a combination of both aural
(listening) and spectrographic (instrumental) comparison of one or more known
voices with an unknown voice for the purpose of identification or elimination.
Developed by Bell Laboratories in the late 1940s for military intelligence
purposes, the modern-day forensic utilization of the technique did not start until
the late 1960s following its adoption by the Michigan State Police. From 1967
until the present, more than 5,000 law enforcement related voice identification
cases have been processed by certified voiceprint examiners.
Voice identification has been used in a variety of criminal cases, including murder,
rape, extortion, drug smuggling, wagering-gambling investigations, political
corruption, money-laundering, tax evasion, burglary, bomb threats, terrorist
activities and organized crime activities. It is part of a larger forensic role known
as acoustic analyses, which involves tape filtering and enhancement, tape
authentication, gunshot acoustics, reconstruction of conversations and the
analysis of any other questioned acoustic event.
Theory
The fundamental theory for voice identification rests on the premise that every
voice is individually characteristic enough to distinguish it from others through
voiceprint analysis. There are two general factors involved in the process of
human speech. The first factor in determining voice uniqueness lies in the sizes
of the vocal cavities, such as the throat, nasal and oral cavities, and the shape,
length and tension of the individual's vocal cords located in the larynx. The vocal
cavities are resonators, much like organ pipes, which reinforce some of the
overtones produced by the vocal cords, which produce formants, or voiceprint bars.
The likelihood that two people would have all their vocal cavities the same size
and configuration and coupled identically appears very remote.
The second factor in determining voice uniqueness lies in the manner in which
the articulators or muscles of speech are manipulated during speech. The
articulators include the lips, teeth, tongue, soft palate and jaw muscles whose
controlled interplay produces intelligible speech. Intelligible speech is developed
by the random learning process of imitating others who are communicating. The
likelihood that two people could develop identical use patterns of their articulators
also appears very remote.
Therefore, the chance that two speakers would have identical vocal cavity
dimensions and configurations coupled with identical articulator use patterns
appears extremely remote. While there have been claims that several voices
have been found to be indistinguishable, no evidence to support such allegations
has been published, offered for examination or demonstrated to the authors.
Several studies have been published evidencing the ability to reliably identify
voices under certain conditions, and a Federal Bureau of Investigation survey of
its own performance in the examination of 2,000 forensic cases revealed an error
rate of 0.31 percent for false identifications and 0.53 percent for false
eliminations. (See Koenig, B.E., 1986, Spectrographic Voice Identification: a
forensic survey, Journal of the Acoustical Society of America, 79:2088-2090.)
While there is disagreement in the so-called "scientific community" on the degree
of accuracy with which examiners can identify speakers under all conditions,
there is agreement that voices can, in fact, be identified.
To facilitate the visual comparisons of voices, a sound spectrograph is used to
analyze the complex speech wave form into a pictorial display on what is referred
to as a spectrogram. The spectrogram displays the speech signal with the time
along the horizontal axis, frequency on the vertical axis, and relative amplitude
indicated by the degree of gray shading on the display. The resonance of the
speaker's voice is displayed in the form of vertical signal impressions or markings
for consonant sounds, and horizontal bars or formants for vowel sounds. The
visible configurations displayed are characteristic of the articulation involved for
the speaker producing the words and phrases. The spectrograms serve as a
permanent record of the words spoken and facilitate the visual comparison of
similar words spoken between an unknown and a known speaker's voice.
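The formant bars compared on spectrograms can also be estimated numerically. The sketch below (Python/NumPy) uses the autocorrelation LPC method, a common textbook approach assumed here for illustration rather than drawn from the examiners' procedure; it recovers a single synthetic resonance standing in for a formant:

```python
import numpy as np

def lpc_formants(x, order, fs):
    """Estimate resonance (formant) frequencies via autocorrelation LPC."""
    n = len(x)
    # Autocorrelation sequence r[0..order]
    r = np.array([np.dot(x[: n - i], x[i:]) for i in range(order + 1)])
    # Solve the Yule-Walker normal equations for the predictor polynomial
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.concatenate(([1.0], np.linalg.solve(R, -r[1 : order + 1])))
    # Polynomial roots above the real axis map to resonance frequencies
    roots = [z for z in np.roots(a) if z.imag > 0]
    return sorted(np.angle(z) * fs / (2 * np.pi) for z in roots)

# Synthetic "vowel": white noise through a single resonator at 700 Hz
# (pole radius 0.98) -- an assumed stand-in for a real formant.
rng = np.random.default_rng(0)
fs, f0, rad = 8000, 700.0, 0.98
theta = 2 * np.pi * f0 / fs
e = rng.standard_normal(8000)
y = np.zeros(8000)
for i in range(2, 8000):
    y[i] = e[i] + 2 * rad * np.cos(theta) * y[i - 1] - rad**2 * y[i - 2]

print(lpc_formants(y, 4, fs))   # includes an estimate near 700 Hz
```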
Procedural Guidelines
The acoustic environment in many cases can be controlled at the receiving end
of the speech signal. Shutting off the radio, television, or other noise-generating
devices will reduce or eliminate unwanted background speech or
noise. While not always possible, the investigator should attempt to select a
reasonably quiet environment for controlled activities such as drug buys or other
illegal operations being investigated. Many times these types of activities are
carried out in bars, restaurants, car washes, billiard rooms and the like, and the
investigator cannot always dictate the location.
It may require the recording of telephone conversations or face-to-face
encounters under a variety of acoustic conditions in which someone is wearing a
body recorder or transmitting the conversation via radio frequency to a remote
location. Unfortunately, in many cases the investigators cannot control the
acoustic environment. In situations involving an adverse environment,
investigators should use high technology stereo equipment to optimize recording
capability.
The attempt to produce samples as parallel to the unknown as possible actually
assists the examiner in his task because speaker variables are reduced to a
minimum. Numerous studies have been conducted that indicate very reliable
decisions can be made by trained professional examiners when samples are
obtained in the manner described.
The notion proposed by some opponents that duplicating the unknown as closely
as possible may cause error is not supported by any available evidence.
Research studies have produced strong evidence that even very good mimics
cannot duplicate another's speech patterns.
In an attempt to obtain proper speech samples, investigators should not hesitate
to ask suspects for the samples they need. Surprisingly, many suspects will
voluntarily give a sample of their voice for comparison purposes.
In the event you are dealing with some type of vocal disguise, attempt to obtain a
similarly produced known exemplar in addition to the suspect's normal voice. It
should be noted that vocal disguises can be very difficult for the examiner to deal
with and the probability of determination is less than with normal voice samples.
If a suspect refuses to cooperate with the investigator, a court order may be
acquired compelling the suspect to produce voice recordings for the purpose of
comparison. Courts have repeatedly held that requiring the accused to submit
voice exemplars for the purpose of comparison for identification or elimination
does not violate the suspect's Fifth Amendment rights. In Wade, 388 U.S. 218
(1967), the Court held that the privilege against self-incrimination offers no
protection from compulsion to submit to speaking for purpose of voice
identification, or to writing, photographing, fingerprinting and measurements.
Several problems have been encountered in obtaining known voice exemplars
even with the use of a court order. If the court order is vague, the suspect may
utter a few words of the text involved, speak too softly, too fast, or too slowly, or
otherwise disguise the sample and claim compliance with the order.
To prevent such problems, the investigator is wise to request that the court order
specify in detail, that the suspect give a sample of his or her voice, repeating the
phrases of the questioned call in a natural conversational voice (or in a similar
disguise, if that is the case) and that such sample shall be given at least three
times and to the reasonable satisfaction of the investigator. Voice exemplars
obtained with such specific instructions are usually very satisfactory for
comparison purposes.
Before terminating the recording session, check the recording to determine
whether or not a satisfactory exemplar was obtained. Remember that once a
suspect is released, a second known sample may be very difficult to obtain.
Whatever the recording circumstance, background noise and the distance
between the talker and the receiving device should be minimized for optimal
recording. Good quality tape recording equipment should be used, as well as
magnetic recording tape. As a rule of thumb, recording tape with standard 120 µs
equalization, normal bias, and no more than a 5 dB drop at 6 kHz should be used.
After the development of a suspect, the next task is to properly obtain known
voice samples for comparison purposes. Do not hesitate to ask a suspect for a
speech sample. If the suspect refuses, a court order may be obtained requiring
compliance with the request. See Schmerber v. California, 384 U.S. 757 (1966),
and Gilbert v. California, 388 U.S. 263 (1967). Both are landmark cases. There
are also many additional decisions at both state and federal court levels that may
be cited to support such a request. Court orders should clearly spell out the
minimum number of samples to be obtained, the manner of speech, and the
method to be employed.
The next task for the investigator is to obtain proper speech samples for
comparison purposes. Probably the best guide here is attempting to duplicate the
recording of the questioned call. Known samples should be obtained via the
telephone and recorded in the same manner as the questioned call. If possible,
the same recorder and telephone pickup should be used. In some cases, even
the same telephone has been employed. If there is room on the questioned tape,
the known sample may be placed on it. If there is not, another tape of the same
type and brand should be used if at all possible.
Speech samples obtained should contain exactly the same words and phrases
as those in the questioned sample because only like speech sounds are used for
comparison. Because the voice, like handwriting, is dynamic and variant, several
samples of each spoken phrase are desired for analysis. Unless the questioned
call sounds like a read statement, the suspect should not be allowed to read the
phrases from a transcript but should repeat each phrase after it is spoken by
someone else. To avoid an unnatural verbal response, the suspect should repeat
the first phrase and proceed in the same manner with each successive phrase.
When all phrases have been recorded, the same procedure should be repeated
at least two more times beginning with the first word or phrase. The suspect may
be asked to read the phrases if a very poor job of repeating is done. Some
people do a better job of reading than repeating the phrases.
It is important that the known sample be spoken in the same manner as the
questioned sample; therefore, the investigator should be familiar with the voice,
manner of speech and the text. If the caller's voice was disguised, the suspect
should give a normal sample and a disguised one as in the questioned call.
Recorded evidence should be wrapped in tinfoil to protect it from possible contact
with a magnetic field if it is submitted by mail. The evidence should be shipped in
a secure container that will prevent the evidence from tearing through the
packaging material. Do not submit a copy of your investigative report with the
evidence. The examiner does not want to know the details of the case. It is
important, however, to provide the examiner with information regarding the
recording method, the number of calls and suspects involved, and any other
information that may assist the examiner in the examination of the evidence.
Upon receipt of the evidence by the laboratory, it is properly marked and a case
number is assigned. The analysis and comparison of known and questioned
voice samples may take several hours or days to complete, depending on the
number of samples involved and the complexity of the examination. Both an
aural (listening) and visual (spectrographic) examination and comparison is
conducted. Aural and spectrographic cues examined should complement one
another in the event the voices are in fact the same.
As with the identification of fingerprints, there is presently no universal standard
for the number of words required for identification. It does, however, vary from a
minimum of 10 for some agencies to 20 for others. The Internal Revenue
Service has chosen to use 20 or more like speech sounds between an unknown
and known sample with the degree of certainty based on quality and excellence
of the evidence examined. Obtaining a second, independent decision is standard
practice in this field as in other forensic sciences.
Visual comparison of spectrograms involves, in general, the examination of
spectrographic features of like sounds as portrayed in spectrograms in terms of
time, frequency and amplitude. Specific features, the result of producing
consonants, vowels and semi-vowels in isolation or in combination
(coarticulation), include the following (certainly not all-inclusive) clues: pitch,
bandwidth, mean frequency, trajectory of vowel formants, distribution of formant
energy, nasal resonance, stops, plosives, fricatives, pauses, interformant
features and other idiosyncratic and pathological features.
Special aural comparison tapes are prepared facilitating comparison of
psycholinguistic features via short-term memory. Aural cues compared include
resonance quality, pitch, temporal factors, inflection, dialect, articulation, syllable
grouping, breath pattern, disguise, pathologies and other peculiar speech
characteristics.
Some agencies offer court testimony; others do not. The IRS laboratory is the
only federal agency that presently offers testimony; certified examiners in state
agencies and in private practice also offer court testimony.
Court Admissibility
Court testimony involving aural-spectrographic voice comparison essentially
started having an impact on the courts after the Tosi study in December 1970.
Since then there have been between 150 and 200 trials in local, state, or federal
courts. Because of differing evidentiary philosophies, some courts have admitted
aural-spectrographic voice evidence and others have not.
There are two general "rules" or "standards" by which scientific evidence is
accepted in courts of law in the United States. The first, commonly referred to as
the Frye "rule" or "test," is based on a 1923 District of Columbia case and
basically requires "general acceptance in the particular field in which it belongs."
See Frye v. United States, 54 App. D.C. 46, 293 F. 1013 (1923). The second is
based on the argument of McCormick (See "McCormick on Evidence," 3rd Ed.,
203 at 608.) McCormick states: "General scientific acceptance is a proper
condition for taking judicial notice of scientific facts, but it is not a suitable
criterion for the admissibility of scientific evidence. Any relevant conclusion
supported by a qualified expert witness should be received unless there are
distinct reasons for exclusion." See Rule 702 of the Federal Rules of Evidence.
Many state and federal courts have abandoned Frye and adopted the argument
of McCormick. The supreme courts of Minnesota, Maine, Ohio and Rhode Island
have admitted aural-spectrographic voice evidence following McCormick.
Intermediate appellate courts in California, Maryland and Michigan admitted
such evidence following Frye but were reversed by their respective supreme
courts, which held that the Frye test had not been met. The Massachusetts
Supreme Court held aural-spectrographic voice evidence admissible applying the
Frye test, while those of Arizona, Indiana and Pennsylvania did not.
In the federal court system, we are aware of 30 trials in which the question of
aural-spectrographic voice evidence was addressed. All but three admitted the
evidence based on Frye or McCormick. On appeal, the Second, Fourth and Sixth
Circuits held the evidence admissible, applying McCormick, while the District of
Columbia did not, applying Frye. See United States v. Williams, 583 F.2d 1194
(2d Cir.), cert. denied, 439 U.S. 1117 (1978); United States v. Baller, 519 F.2d
463 (4th Cir.), cert. denied, 423 U.S. 1019 (1975); United States v. Franks, 511
F.2d 25 (6th Cir.), cert. denied, 422 U.S. 1042 (1975); and United States v.
McDaniel, 538 F.2d 408 (D.C. Cir. 1976).
In United States v. Williams, supra at 1198, the court said: "The 'Frye' test is
usually construed as necessitating a survey and categorization of the subjective
views of a number of scientists, assuring thereby a reserve of experts available
to testify. Difficulty in applying the 'Frye' test has led a number of courts to its
implicit modification." Also see United States v. Baller, supra at n.6.
Since 1970, the forensic application of aural-spectrographic voice identification
has been reliably applied in the investigation of several thousand cases. While
there is disagreement on the reliability of the method under all conditions, there is
agreement that voices can be identified and eliminated when the proper
conditions exist and the analysis is carefully conducted by qualified examiners.
Several state appellate and supreme courts have admitted the evidence, as have
three of four federal appellate courts. The United States Supreme Court has
refused to review and decide the three cases brought before it. While the
admission of aural-spectrographic voice evidence continues to be decided in
various courts, the method continues to be a very important tool in the arsenal
against crime.
Other areas of acoustic analysis include, in part, gun shot analysis, tape
enhancement and tape authentication. While not discussed in this article, it
should be noted that laboratory analysis related to these problems is available in
some laboratories.
NDAA Bulletin December 1993
VOICE IDENTIFICATION: The Aural/Spectrographic Method
by: Michael C. McDermott (mike@mcdltd.com), Tom Owen (owlmax@aol.com),
Frank M. McDermott, Ltd.
Owl Investigations, Inc.
Table of Contents:
I. Introduction
II. The Sound Spectrograph
III. The Method of Voice Identification
IV. History
V. Standards of Admissibility
VI. Research Studies
VII. Conclusion
VIII. Table of Cases
IX. Appendix 1
© 1996 Owl Investigations, Inc.
INTRODUCTION
The forensic science of voice identification has come a long way from when it
was first introduced in the American courts back in the mid-1960's. In the early
days of this identification technique there was little research to support the theory
that human voices are unique and could be used as a means for identification.
There was also no standardization of how an identification was reached, or even
training or qualifications necessary to perform the analysis. Voice comparisons
were made solely on the pattern analysis of a few commonly used words. Due to
the newness of the technique there were only a few people in the world who
performed voice identification analysis and were capable of explaining it to a
court. Gradually the process became known to other scientists who voiced
concerns, not as to the validity of the analysis, but as to the lack of substantial
research demonstrating the reliability of the technique. They felt that the
technique should not be used in the courtroom without more documentation.
Thus the battle lines were drawn over the admissibility of voice identification
evidence with proponents claiming a valid, reliable identification process and
opponents claiming more research must be completed before the process should
be used in courtrooms.
Today voice identification analysis has matured into a sophisticated identification
technique, using the latest technology science has to offer. The research, which
is still continuing today, demonstrates the validity and reliability of the process
when performed by a trained and certified examiner using established,
standardized procedures. Voice identification experts are found all over the world.
No longer limited to the visual comparison of a few words, the comparison of
human voices now focuses on every aspect of the words spoken: the words
themselves, the way the words flow together, and the pauses between them.
Both aural and spectrographic analysis are combined to form the conclusion
about the identity of the voices in question.
The road to admissibility of voice identification evidence in the courts of the
United States has not been without its potholes. Many courts have had to rule on
this issue without having access to all the facts. Trial strategies and budgets
have resulted in incomplete pictures for the courts. To compound the problem,
courts have utilized different standards of admission resulting in different
opinions as to the admissibility of voice identification evidence. Even those courts
which have claimed to use the same standard of admissibility have interpreted it
in a variety of ways resulting in a lack of consistency. Although many courts have
denied admission to voice identification evidence, none of the courts excluding
the spectrographic evidence have found the technique unreliable. Exclusion has
always been based on the fact that the evidence offered did not present a
clear picture of the technique's acceptance in the scientific community and, as
such, the court was reluctant to rely on that evidence. The majority of courts
hearing the issue have admitted spectrographic voice identification evidence.
THE SOUND SPECTROGRAPH
The sound spectrograph, an automatic sound wave analyzer, is a basic research
instrument used in many laboratories for research studies of sound, music and
speech. It has been widely used for the analysis and classification of human
speech sounds and in the analysis and treatment of speech and hearing
disorders.
The instrument produces a visual representation of a given set of sounds in the
parameters of time, frequency and amplitude. The analog spectrograph is
composed of four basic parts: (1) a magnetic tape recorder/playback unit, (2) a
tape scanning device with a drum which carries the paper to be marked, (3) an
electronic variable filter, and (4) an electronic stylus which transfers the analyzed
information to the paper. The analog sound spectrograph samples energy levels
in a small frequency range from a magnetic tape recording and marks those
energy levels on electrically sensitive paper. This instrument then analyzes the
next small frequency range and samples and marks the energy levels at that
point. This process is repeated until the entire desired frequency range is
analyzed for that portion of the recording. The finished product is called a
spectrogram and is a graphic depiction of the patterns, in the form of bars or
formants, of the acoustical events during the time frame analyzed. The machine
will produce a spectrogram in approximately eighty seconds. The spectrogram is
in the form of an X,Y graph with the X axis the time dimension, approximately 2.4
seconds in length, and the Y axis the frequency range, usually 0 to 4000 or 8000
Hz. The degree of darkness of the markings indicates the approximate relative
amplitude of the energy present for a given frequency and time.
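The time-frequency-amplitude analysis just described can be illustrated with a short numerical sketch. The following example is an assumption-laden illustration, not any actual spectrograph product: it scans the frequency bins of a single analysis frame with a plain discrete Fourier transform, mimicking the analog machine's repeated narrow-band sampling, where bin magnitude plays the role of the marking's darkness.

```python
import math

def dft_magnitude(frame, k):
    """Magnitude of the k-th DFT bin of one analysis frame -- one
    'filter position' of the spectrograph's frequency scan."""
    n = len(frame)
    re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
    im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
    return math.hypot(re, im)

def spectrogram_column(frame, fs):
    """Scan every frequency bin for one time slice, mimicking the analog
    machine's repeated narrow-band sampling. Returns (frequency_hz,
    relative_amplitude) pairs covering 0 to fs/2."""
    n = len(frame)
    return [(k * fs / n, dft_magnitude(frame, k)) for k in range(n // 2 + 1)]

# A 256-sample (32 ms) frame of a 500 Hz tone sampled at 8 kHz, so the
# analysis band matches a typical 0-4000 Hz spectrogram. Bin spacing is
# 8000/256 = 31.25 Hz, and 500 Hz falls exactly on bin 16.
fs, n = 8000, 256
frame = [math.sin(2 * math.pi * 500 * i / fs) for i in range(n)]
column = spectrogram_column(frame, fs)
peak_freq = max(column, key=lambda p: p[1])[0]
print(peak_freq)  # 500.0 -- the energy concentrates at the tone's frequency
```

Repeating this scan for successive frames of the recording yields the full time-frequency grid of a spectrogram; a real instrument simply performs the analysis continuously rather than frame by frame.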
Recent developments in sound spectrography have produced computerized
digital sound spectrographs ranging from dedicated digital signal analysis
workstations to PC-based systems for acquisition, analysis, editing, and playback.
These sophisticated computer-based systems provide high fidelity signal
acquisition, high-speed digital processing circuitry for quick and flexible analysis,
and CD-quality playback. The computer-based systems accomplish all the
same tasks as the analog systems, but with the computer-based systems the
examiner gains a host of comparison and measurement tools not available with
the analog equipment. The computer-based systems are capable of displaying
multiple sound spectrograms, adjusting the time alignment and frequency ranges
and taking detailed numeric measurements of the displayed sounds. With these
advances in technology, the examiner widens the scope of the analysis to create
a more detailed picture of the voice or sound being analyzed.
The accuracy and reliability of the sound spectrograph, either analog or digital,
have never been in question in any of the courts and never been considered an
issue in the admissibility of voice identification evidence. This may be due in part to the
wide use of the instrument in the field of speech and hearing for non-voice
identification analysis of the human voice and, in part to the fact that given the
same recording of speech sounds the sound spectrograph will consistently
produce the same spectrogram of that speech.
The contest comes in the interpretation of the spectrograms. Proponents of the
aural and spectrographic technique of voice identification base their decisions on
the theory that all human voices are different due to the physical uniqueness of
the vocal tract, the distinctive environmental influences in the learning process of
speech development, and the unique development of neurological faculties which
are responsible for the production of speech. Opponents claim that not enough
research has been completed to validate the theory that intraspeaker variability is
less than interspeaker variability.
THE METHOD OF VOICE IDENTIFICATION
The method by which a voice is identified is a multifaceted process requiring the
use of both aural and visual senses. In the typical voice identification case the
examiner is given several recordings: one or more recordings of the voice to be
identified and one or more recorded voice samples of one or more suspects. It is
from these recordings the examiner must make the determination about the
identity of the unknown voice.
The first step is to evaluate the recording of the unknown voice, checking to
make sure the recording has a sufficient amount of speech with which to work
and that the quality of the recording is of sufficient clarity in the frequency range
required for analysis.1 The volume of the recorded voice signal must be
significantly higher than that of the environmental noise. The greater the number
of obscuring events, such as noise, music, and other speakers, the longer the
sample of speech must be. Some examiners report that they reject as many as
sixty percent of the cases submitted to them with one of the main reasons for
rejection being the poor quality of the recording of the unknown voice.
Once the unknown voice sample has been determined to be suitable for analysis,
the examiner then turns his attention to the voice samples of the suspects. Here
also, the recordings must be of sufficient clarity to allow comparison, although at
this stage, the recording process is usually so closely controlled that the quality
of recording is not a problem.
The examiner can only work with speech samples which are the same as the text
of the unknown recording. Under the best of circumstances the suspects will
repeat, several times, the text of the recording of the unknown speaker and these
words will be recorded in a similar manner to the recording of the unknown
speaker. For example, if the recording of the unknown speaker was a bomb
threat made to a recorded telephone line then each of the suspects would repeat
the threat, word for word, to a recorded telephone line. This will provide the
examiner with not only the same speech sounds for comparison but also with
valuable information about the way each speech sound completes the transition
to the next sound.
There are those times when a voice sample must be obtained without the
knowledge of the suspect. It is possible to make an identification from a
surreptitious recording but the amount of speech necessary to do the comparison
is usually much greater. If the suspect is being engaged in conversation for the
purpose of obtaining a voice sample, the conversation must be manipulated in
such a way so as to have the suspect repeat as many of the words and phrases
found in the text of the unknown recording as possible.
The worst exemplar recordings with which an examiner must work are those of
random speech. It is necessary to obtain a large sample of speech to improve
the chances of obtaining a sufficient amount of comparable speech.
As in any other form of identification analysis, the poorer the quality of the
evidence with which the examiner has to work, the greater the amount of evidence
and time necessary to complete the analysis, and the less likely the chance of a
positive conclusion.
Once the evidence has been determined to be sufficient to perform the analysis,
the examiner then begins the two-step process of voice sample comparison: one
aural (listening) and the other spectrographic (visual). These are two different but
interwoven and equally important analytical methods which the examiner
combines to reach the final conclusion. The first step is an aural comparison of
the voice samples.2 Here the examiner compares both single speech sounds
and series of speech sounds of the known and unknown samples. At this stage
the examiner is conducting a number of tasks: comparing for similarities and
differences, screening out less useful portions of the samples, and indexing the
samples for further analysis. An example of the initial aural comparison is the
screening of the samples for pronunciation similarities or discrepancies: for
example, the word "the" may be said with a short "a" sound or a long "e" sound.
If the word is not pronounced in the same manner it loses comparison value.
Once the examiner has located those portions to be used for the analysis, a
more detailed aural comparison is undertaken. This comparison can be
accomplished in many different ways. One of the most commonly used methods
of aural comparison is re-recording a speech sound sample of the unknown
followed immediately by a re-recording of the same speech sounds of the
suspect. This is repeated several times so that the final product is a recording of
specific speech sounds, in alternating order, by the unknown speaker followed by
the suspect. Such comparisons have been greatly facilitated by the use of audio
digital recording equipment which allows for the digital recording, storage, and
repeated playback of only the desired speech sounds to be examined.
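The alternating structure of such a comparison recording can be pictured as an interleaving of segment lists. This is a conceptual sketch only; the function name, labels, and parameters are hypothetical, not an actual laboratory tool:

```python
def build_comparison_sequence(unknown_segments, suspect_segments, repeats=3):
    """Interleave re-recorded speech sounds -- each sound of the unknown
    speaker followed immediately by the suspect's matching sound -- and
    repeat the whole pattern several times, as on the examiner's aural
    comparison recording."""
    sequence = []
    for _ in range(repeats):
        for unk, sus in zip(unknown_segments, suspect_segments):
            sequence.append(("unknown", unk))
            sequence.append(("suspect", sus))
    return sequence

# Two matched speech sounds, repeated twice, give an eight-entry
# alternating playback order.
order = build_comparison_sequence(["threat", "bomb"], ["threat", "bomb"],
                                  repeats=2)
print(len(order))  # 8 entries, alternating unknown/suspect
```

Digital audio workstations make this trivial in practice, since the stored segments can be re-ordered and replayed without the generational loss of repeated analog re-recording.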
During the aural comparison the examiner studies the psycholinguistic features
of the speaker's voice. A large number of qualities are examined, ranging from
such general traits as accent and dialect to inflection, syllable grouping and
breath patterns. The examiner also scrutinizes the samples for
signs of speech pathologies and peculiar speech habits.
The second step in the voice identification process is the spectrographic analysis
of the recorded samples. The sound spectrograph is an automatic sound wave
analyzer with a high quality, fully functional tape recorder. The speech samples
to be analyzed are recorded on the sound spectrograph. The recording is then
analyzed in two and one half second segments. The product is a spectrogram, a
graphic display of the recorded signal on the basis of time and frequency with a
general indication of amplitude.
The spectrograms of the unknown speaker are then visually compared to the
spectrograms of the suspects. Only those speech sounds which are the same
are compared.3 The comparisons of the spectrograms are based on the
displayed patterns representing the psychoacoustical features of the captured
speech. The examiner studies the bandwidths, mean frequencies, and trajectory
of vowel formants; vertical striations, distribution of formant energy and nasal
resonances; stops, plosives and fricatives; interformant features, the relation of
all features present as affected during articulatory changes and any peculiar
acoustic patterning.4 The examiner looks not only for similarities but also for
differences. The differences are closely examined to determine if they are due to
pronunciation differences or if they are indicative of different speakers.
When the analysis is complete the examiner integrates his findings from both the
aural and spectrographic analyses into one of five standard conclusions: a
positive identification, a probable identification, a positive elimination, a probable
elimination, or no decision. In order to arrive at a positive identification the
examiner must find a minimum of twenty speech sounds which possess sufficient
aural and spectrographic similarities. There can be no differences, either aural
or spectrographic, that cannot be accounted for.
The probable identification conclusion is reached when there are fewer than
twenty similarities and no unexplained differences. This conclusion is usually
reached when working with small samples, random speech samples or
recordings of lower quality. The result of positive elimination is rendered when
twenty differences between the samples are found that cannot be attributed to
anything other than different voices having produced the samples. A probable
elimination decision is usually reached when working with limited text or a
recording of lower quality. The no decision conclusion is used when the quality of
the recording is so poor that there is insufficient information with which to work or
when there are too few common speech sounds suitable for comparison.
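As a simplified summary, the decision rule described above might be sketched as follows. The function, parameter names, and exact branching are assumptions for illustration; real examiners weigh many qualitative factors that simple counts cannot capture:

```python
def standard_conclusion(similar_sounds, unexplained_differences):
    """Map examiner tallies to one of the five standard conclusions.
    `similar_sounds` counts speech sounds with sufficient aural and
    spectrographic similarity; `unexplained_differences` counts
    differences not attributable to pronunciation or recording
    conditions (a simplification of the examiner's judgment)."""
    if unexplained_differences >= 20:
        return "positive elimination"
    if similar_sounds >= 20 and unexplained_differences == 0:
        return "positive identification"
    if similar_sounds > 0 and unexplained_differences == 0:
        return "probable identification"
    if unexplained_differences > 0:
        return "probable elimination"
    return "no decision"

print(standard_conclusion(25, 0))  # positive identification
print(standard_conclusion(10, 0))  # probable identification
```

The thresholds of twenty matching or twenty unexplained differing speech sounds follow the text above; the intermediate "probable" conclusions absorb the small-sample and lower-quality cases.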
HISTORY
The history of speech sound analysis goes back a little more than one
hundred years to Alexander Melville Bell, who developed a
visual representation of the spoken word. This visual display of the spoken word
conveyed much more information about the pronunciation of that word than the
dictionary spelling could ever suggest. His depiction of speech sounds
demonstrated the subtle differences with which different people pronounced the
same words. This system of speech sound analysis developed by Bell is the
phonetic alphabet which he called "visible speech".5 His method of encoding the
great variety of speech sounds was by handwritten symbols and was language
independent. This code produced a visual representation of speech which could
convey to the eye the subtle differences in which words were spoken. This
system was used by both Bell and his son, Alexander Graham Bell, in helping
deaf people learn to speak.6
It was in the early 1940's that a new method of speech sound analysis was
developed. Potter, Kopp & Green, working for Bell Laboratories in Murray Hill,
New Jersey, began work on a project to develop a visual representation of
speech using a sound spectrograph. This machine, an automatic sound wave
analyzer, produced a visual record of speech portraying three parameters;
frequency, intensity and time. This research was intensified during World War II
when acoustic scientists suggested that enemy radio voices could be identified
by the spectrograms produced by the sound spectrograph. The war ended before
the technique could be perfected.
In 1947, Potter, Kopp and Green published their work in a book, the title of which
was borrowed from Alexander Melville Bell, Visible Speech. Their work is a
comprehensive study of speech spectrograms designed to linguistically interpret
visible speech sound patterns. This work was similar to Bell's in that
speech sounds were encoded into a visual form. The difference is, instead of a
pen, Potter, Kopp and Green used a sound spectrograph to produce the visual
patterns.
Research in the area of speaker identification slowed dramatically with the end of
World War II. It was not until the late 1950's and early 1960's that the research
began again. It was at this time the New York City Police Department was
receiving a large number of telephone bomb threats to the airlines.7 At that time
Bell Laboratories was asked by law enforcement officers to provide assistance in
the apprehension of the individuals making the telephone calls. The task of
developing a reliable method of identification of a speaker's voice was given to
Lawrence G. Kersta, a physicist at Bell Laboratories who had worked on the
early experiments using the sound spectrograph. In two years Kersta had
developed a method of identification which he reported yielded a correct
identification in 99.65% of all attempts.8
It was in 1966 that the Michigan State Police began the practical application of
the voice identification method in attempting to solve criminal cases. A Voice
Identification unit was established and the unit personnel received training from
Kersta and other speech scientists. During the first few years the voice
identification method was used only as an investigative aid.
The first published opinion to rule on the admissibility of voice
identification analysis came in United States v. Wright, 17 USCMA 183,
37 CMR 447 (1967). This was a court martial proceeding in which the appellate
court affirmed the admission of spectrographic voice identification evidence by
the board of review. The lengthy dissent by Judge Ferguson based on the
requirements for acceptance of scientific evidence spelled out in Frye v. United
States, 293 Fed. 1013 (CA DC Cir) (1923), was the beginning of a controversy
which continues today.
The first non-military review of the admissibility of voice identification
evidence came from the New Jersey Supreme Court in State v. Cary.9 In this case the
court stated that "the physical properties of a person's voice are identifying
characteristics".10 The court also noted that trial courts in the states of New York
and California had admitted voice identification evidence but that these
admissions had not been the subject of appellate review.11 The court declined to
rule on the admissibility issue and remanded the case to determine if the
equipment and technique were sufficiently accurate to provide results admissible
as evidence. The Superior Court of New Jersey, on appeal from a denial of
admission after remand, held that the majority of evidence "indicates, not that the
technique is not accurate and reliable, but rather that it is just too early to tell and
at this time lacks the required scientific acceptance".12 The New Jersey
Supreme Court reviewed this decision and once again remanded for additional
fact finding "in light of the far-reaching implications of admission of voiceprint
evidence".13 The State of New Jersey was unable "to furnish any new and
significant evidence" by the third time the New Jersey Supreme Court reviewed
this issue and as such affirmed the trial court's opinion excluding voice
identification evidence.14
California came to a similar holding when the issue first reached the appellate
level in People v. King.15 The State brought in Lawrence Kersta as the voice
identification expert to testify as to the reliability of the technique. The defense
brought in seven speech scientists and engineers to rebut Kersta's claims. The
court held that "Kersta's claims for the accuracy of the `voiceprint' process are
founded on theories and conclusions which are not yet substantiated by
accepted methods of scientific verification".16 The court cited the Frye test as the
proper standard for admissibility.17 The court also left the door open for future
admission by saying when voice identification evidence has achieved the
necessary degree of acceptance they will welcome its use.18
In State ex rel. Trimble v. Heldman 19, the Supreme Court of Minnesota held that
"spectrograms ought to be admissible at least for the purpose of corroborating
opinions as to identification by means of ear alone".20 The court was impressed
by the testimony of Dr. Oscar Tosi who had previously testified against the use of
spectrographic voice identification evidence in courtrooms, but after extensive
research and experimentation now described the technique as "extremely
reliable".21 The court made reference to the Frye test and to the scientific
community's acceptance of Dr. Tosi's study, but did not specifically apply the
Frye test as the standard for the admissibility of the voice identification
evidence.22 In discussing the issue of admissibility the court held that it was the
job of the factfinder to weigh the credibility of the evidence.
"The opinion of an expert is admissible, if at all, for the purpose of aiding the jury
or the factfinder in a field where he has no particular knowledge or training. The
weight and credibility to be given to the opinion of an expert lies with the
factfinder. It is no different in this field than in any other".23
In 1972 the third and fourth District Courts of Florida, in separate opinions, held
admissible the use of spectrographic voice identification evidence.24 The court in
Worley held that the voice identification evidence was admissible to corroborate
the defendant's identification by other means. The court stated that the technique
had attained the necessary level of scientific reliability required for admission, but
since it was only offered as corroborative evidence, the court refused to comment
as to whether such evidence alone would be sufficient to sustain the identification
and conviction.25
The third District Court of Appeals of Florida did not limit the admission of
spectrograph evidence to corroborative status. In the Alea opinion the court does
not mention the Frye test as the standard to be used for admission, but rather
states that "such testimony is admissible to establish the identity of a suspect as
direct and positive proof, although its probative value is a question for the jury".26
In the case of State v. Andretta 27, the New Jersey Supreme Court stated that
there was much more support for the admission of spectrographic voice
identification evidence than at the time they decided Cary, but refused to address
the issue further since the only issue before them was whether the defendant
should be compelled to speak for a spectrographic voice analysis.28
In California the Court of Appeal affirmed the trial court's admission of voice
identification evidence in the case of Hodo v. Superior Court.29 Here the court
found the requirements of Frye had been met in that there was now general
acceptance of spectrographic voice identification by recognized experts in the
field. The court cited Dr. Tosi's testimony that "those who really are familiar with
spectrography, they are accepting the technique".30 Tosi also pointed out that
the general population of speech scientists are not familiar with this technique
and thus cannot form an opinion on it.31
The court in United States v. Samples 32 held that the Frye test of general
acceptance precludes too much relevant evidence for purposes of the fact
determining process at a revocation of probation hearing and the court allowed
the use of spectrographic voice identification evidence to corroborate other
identification evidence.33
In 1974 the case of United States v. Addison 34 rejected the admission of voice
identification evidence saying that such evidence "is not now sufficiently
accepted" and as such the requirements of the Frye test were not met.35 At the
trial the court heard from two experts endorsing the technique, Dr. Tosi and a
recent convert to the reliability of the technique, Dr. Ladefoged. Only one expert,
Dr. Stuart, testified that he was still skeptical of the technique and thought that
most of the scientific community was also.36 Although the trial court's
admission of spectrographic voice identification evidence was held to be error,
the appellate court refused to overturn the conviction due to the overwhelming
amount of other evidence supporting the conviction.37
Attempted disguise or mimicry provided the grounds on which the California
Court of Appeal reversed a conviction based in part on spectrographic voice
identification in the case of People v. Law.38 The court found that "with respect to disguised
and mimicked voices in particular, the prosecution did not carry out its burden of
proof to demonstrate that the scientific principles pertaining to spectrographic
identification were beyond the experimental and into the demonstrable stage or
that the procedure was sufficiently established to have gained general
acceptance in the particular field in which it belongs".39 The main concern of the
court was that no experimentation had been completed studying the effects of
attempts to disguise or mimic on the accuracy of the identification process.
Without mentioning the Frye test this court used the standards set in Frye as the
test of admissibility although the court seemed to be limiting the scope of the
opinion to cases involving disguise or mimic.
In United States v. Franks 40, the Sixth Circuit Court of Appeals held
spectrographic voice identification evidence to be admissible. The court said it
was "mindful of a considerable area of discretion on the part of the trial judge in
admitting or refusing to admit evidence based on scientific processes".41
Quoting from United States v. Stifel 42, the court pointed out that "neither
newness nor lack of absolute certainty in a test suffices to render it inadmissible
in court. Every useful new development must have its first day in court. And court
records are full of the conflicting opinions of doctors, engineers and
accountants...".43 The court in Franks found that extensive review was given to
the qualifications of the experts and opportunity to cross-examine the experts to
determine the proper weight to be given such evidence.
The Massachusetts Supreme Court, in Commonwealth v. Lykus 44, allowed the
admission of spectrographic voice identification evidence saying that the opinions
of a qualified expert should be received and the considerations similar to those
expressed in Frye should be for the fact finder as to the weight and value of the
opinions. The court gave greater weight to those experts who had had direct and
empirical experience in the field as opposed to those who had only performed a
theoretical review of that work.45 The court also stated that "neither infallibility
nor unanimous acceptance of the principle need be proved to justify its
admission into evidence".46 The Massachusetts Supreme Court again, that
same year, found no error in the use of spectrographic voice identification
evidence in the case of Commonwealth v. Vitello.47
The Fourth Circuit Court of Appeals, in the case of United States v. Baller 48,
allowed the admission of spectrographic voice identification evidence saying
unless it is prejudicial or misleading to the jury, it is better to admit relevant
scientific evidence in the same manner as other expert testimony and allow its
weight to be attacked by cross-examination and refutation.49 The court listed six
reasons supporting admission: the expert was a qualified practitioner; evidence
in voir dire demonstrated probative value; competent witnesses were available to
expose limitations; the defense demonstrated competent cross-examination; the
tape recordings were played for the jury; and the jury was told it could
disregard the opinion of the voice identification expert.50
Voice identification evidence was admitted by the Sixth Circuit Court of Appeals
in United States v. Jenkins 51 using the same logic as in Baller. Here the court
said that the issue of admissibility was within the discretion of the trial judge and
that once a proper foundation had been laid the trier of fact was able to assign
proper weight to the evidence.52
In 1976 the New York Supreme Court pointed out, in the case of People v.
Rogers 53, that fifty different trial courts had admitted spectrographic voice
identification evidence, as had fourteen out of fifteen U.S. District Court judges,
and only two out of thirty-seven states considering the issue had rejected
admission.54 The Rogers court stated that this technique, when accompanied by
aural examination and conducted by a qualified examiner, had now reached the
level of general scientific acceptance by those who would be expected to be
familiar with its use, and as such, has reached the level of scientific acceptance
and reliability necessary for admission.55 The court also pointed out that other
scientific evidence processes are regularly admitted which are as, or less, reliable
than spectrographic voice identification: hair and fiber analysis, ballistics, forensic
chemistry and serology, and blood alcohol tests.56
The Supreme Court of California finally put an end to the see-saw ride of
admissibility in that state in People v. Kelly 57 by rejecting admission because of
insufficient showing of support. "Although voiceprint analysis may indeed
constitute a reliable and valuable tool in either identifying or eliminating suspects
in criminal cases, that fact was not satisfactorily demonstrated in this case".58 In
this case the court seemed to have the most trouble with the fact the only expert
provided to lay the foundation for admission was the technician who performed
the analysis, saying that a single witness can not attest to the views of the
scientific community on this new technique and that this witness, who may not be
capable of a fair and impartial evaluation of the technique since he has built a
career on it, lacked the academic credentials to express an opinion as to the
acceptance of the technique by the scientific community.59
In United States v. McDaniel 60, it appears that the District of Columbia Circuit
Court of Appeals would have liked to admit the spectrographic voice identification
evidence but had to reject it because the shadow of the Addison decision of two
years past "looms over our consideration of this issue".61 The court held the
admission of the voice identification evidence to be harmless error in that the rest
of the evidence was overwhelming. The court did recognize the trend toward
admissibility and contemplated that it may be time to reexamine the holding of
Addison "in light of the apparently increased reliability and general acceptance in
the scientific community".62
The Supreme Court of Pennsylvania rejected admission in Commonwealth v.
Topa 63 holding that the technician's opinion alone will not suffice to permit the
introduction of scientific evidence into a court of law.64 This was the same
situation, in fact the same single expert, which confronted the Kelly court.
In People v. Tobey 65 the Michigan Supreme Court found, by applying the Frye
test, that the trial court erred in admitting spectrographic voice identification
evidence. The court found that neither of the two experts testifying in favor of the
technique could be called disinterested and impartial experts in that both had
built their reputations and careers on this type of work.66 The court pointed out
that not all courts require independent and impartial proof of general scientific
acceptability and was quick to add that this decision was not intended in any way
to foreclose the introduction of such evidence in future cases where there is
demonstrated solid scientific approval and support of this new method of
identification.67
In admitting voice identification evidence, the United States District Court for the
Southern District of New York, in United States v. Williams 68, found that the
requirements of the Frye test were met when the technique was performed "by
aural comparison and spectrographic analysis".69 The court stated that the
concerns of the defendant that this technique had a mystique of scientific
precision which may mask the ultimate subjectivity of spectrographic analysis,
although they were valid concerns, could be alleviated by action other than
suppression of the evidence, such as opposing expert opinion and jury
instructions allowing the jury to determine the weight, if any, of the evidence.70
In People v. Collins 71, the Supreme Court of New York rejected admission of
spectrographic voice identification evidence saying that the Frye test alone was
insufficient to determine admissibility and must be used in conjunction with a test
of reliability.72 The court found that the proponents of the technique were in the
minority and that the remainder of the relevant scientific community either
expressed opposition or expressed no opinion.73
In Brown v. United States 74, the District of Columbia Court of Appeals rejected
the use of voice identification evidence, but held the error to be harmless and
affirmed the conviction in light of overwhelming non-spectrographic identification
of the defendant as perpetrator of the crime. One of the main problems in this
case was the fact that the exemplar of the defendant's voice was recorded in a
defective manner but used anyway after the tape speed malfunction had been
corrected in a laboratory. Dr. Tosi, testifying as a proponent of the technique,
stated that the technician should not have used the defective recording as a
basis of comparison.75 The court held the technique was not shown to be
sufficiently reliable and accepted within the scientific community to permit its use
in this criminal case, but that this decision did not foreclose a future decision as
to admissibility of the technique.76
In the civil case of D'Arc v. D'Arc 77, the court found that the requirements of the
Frye test had not been met and thus the evidence could not be admitted. The
court believed that even with proper instructions to the contrary, this type of
evidence "has the potentiality to be assumed by many jurors as being conclusive
and dispositive" and thus should be subject to strict standards of admission.78
The court in State v. Williams 79 refused to apply the Frye standard citing instead
the Maine Rules of Evidence, Rule 401, which states "all relevant evidence is
admissible", with relevant being described as evidence having any tendency to
make the existence of any fact that is of consequence to the determination of the
action more probable or less probable than it would be without the evidence.80
In Reed v. State 81 the court applied the Frye standard to determine admissibility
with a rather wide definition of the scientific community which included "those
whose scientific background and training are sufficient to allow them to
comprehend and understand the process and form a judgment about it".82 The
court said the trial court erred in using the more restricted definition of scientific
community, "those who are knowledgeable, directly knowledgeable through work,
utilization of the techniques, experimentation and so forth" and did not mean the
broad general scientific community of speech and hearing science.83
In a fifty-one page dissent to the Reed decision 84, Judge Smith points out that
the Frye standard is much criticized and has never been adopted in the state of
Maryland, that this decision is out of step with other courts on related issues of
fingerprints, ballistics, x-rays and the like, that this decision is out of step with
prior Maryland holdings on expert testimony, that the majority of reported
opinions have accepted such evidence, and that even if Frye were applicable it is
satisfied.
In United States v. Williams 85 the court did not apply the Frye standard but did
note that acceptance of the technique appeared strong among scientists who
had worked with spectrograms and weak among those who had not.86 The court
then focused on the reliability of the technique and the tendency to mislead. As to
the reliability of the technique, the court noted the small error rate, 2.4% false
identification, the existence and maintenance of standards of analysis, and the
conservative manner in which the technique was applied.87 As to the tendency
to mislead, the court felt that adequate precautions were taken in that the jury
could view the spectrograms and listen to the recording and the expert's
qualifications, the reliability of the equipment and the technique were subject to
scrutiny by the defense, and the jury was instructed that they were free to
disregard the testimony of the experts.88
In the case of People v. Bein 89 the court based admissibility on a two-pronged
test: general acceptance by the relevant scientific community, and competent
expert testimony establishing reliability of the process. The court found that both
tests had been met and allowed the admission of the evidence.90 The court
described the relevant scientific community "to be that group of scientists who
are concerned with the problems of voice identification for forensic and other
purposes".91 The court also suggested that "it is no different in this field of
expertise than in other fields, that where experts disagree, it is for the finder of
fact to determine which testimony is the more credible and therefore more
acceptable".92
The Ohio Supreme Court, in State v. Williams 93, relied on their own state rules
of evidence, as did the Maine court in Williams, and rejected the use of the Frye
standard. The court refused "to engage in scientific nose counting for the
purpose of whether evidence based on newly ascertained or applied scientific
principles is admissible".94 The court noted, with approval, the playing of the
recordings to the jury and, that the jury was free to reject the testimony of the
expert.95
In that same year, right across the border in Indiana, the court in Cornett v.
State96 rejected admission of voice identification evidence saying the conditions
set out in Frye had not been met. Here the court used a wide definition of the
scientific community which included linguists, psychologists and engineers who
use voice spectrography for identification purposes.97 Although the court held
that the trial court erred in admitting the evidence, the error was found to be
harmless and the conviction affirmed.98
Likewise the court in State v. Gortarez 99 rejected the admission of voice
identification evidence but affirmed the conviction holding such admission to be
harmless error. The court also used a wide definition of the scientific community
in applying the Frye standard, including experts in the fields of acoustical
engineering, acoustics, communication electronics, linguistics, phonetics, physics
and speech communications, and found that there was not general acceptance
among these scientists.100
In the case of United States v. Love101, the admissibility of spectrographic voice
identification was not at issue. The Fourth Circuit Court of Appeals was reviewing
whether the trial judge's comments about a voice identification expert were
considered error. The trial judge told the jury that they, the jury, were to assign
whatever weight they wanted to the testimony of the expert and even disregard
his testimony if they "should conclude that his opinion was not based on
adequate education, training or experience, or that his professed science of voice
print identification was not sufficiently reliable, accurate, and dependable."102
The Court of Appeals found no error in the judge's instruction to the jury.
In admitting spectrographic voice identification evidence, the Supreme Court of
Rhode Island, in State v. Wheeler 103, declined to apply the Frye standard
holding instead "the law and practice of this state on the use of expert testimony
has historically been based on the principle that helpfulness to the trier of fact is
the most critical consideration".104 The court reviewed the cases around the
country, both state and federal, and noted that the majority of circuit courts that
have considered admission of spectrographic evidence have decided in favor of
its admission.105 The court pointed out that the defendant had all the proper
safeguards such as cross-examination, rebuttal experts, and the jury had the
right to reject the evidence for any one of a number of reasons.106
In State v. Free107 the Court of Appeals of the State of Louisiana did not rely on
the Frye test for guidance in determining the admissibility of spectrographic voice
identification evidence but instead applied a balancing test set forth in State v.
Catanese108. One individual, accepted as an expert in voice identification,
testified as to the theoretical and technical aspects of the spectrographic voice
analysis method. No other witnesses were called to either support or show fault
with the admission of the voice identification testimony. The Court of Appeals
found that voice identification evidence, when offered by a competent expert and
obtained through proper procedures, "is as reliable as other kinds of scientific
evidence accepted routinely by courts" and "can be highly probative"109. Using
the Catanese balancing test, the Court of Appeals found that the trier of fact was
likely to give almost conclusive weight to the voice identification expert's opinion,
consequently, misleading the jurors. The Court of Appeals was also concerned
that there were not enough experts available who could critically examine the
validity of a voice identification determination in a particular case. Nine rules were
suggested as a basis for which voice identification evidence could be
accepted110. The Court of Appeals held that Catanese prohibits admission of
the voice identification evidence at this time111 and found the admission of that
evidence to be harmless error.
In 1987 the Supreme Court of New Jersey again addressed the issue of
admissibility of spectrographic evidence in the civil case of Windmere v.
International Insurance Company.112 In affirming the judgment of the Appellate
Division, the Supreme Court of New Jersey ruled that the trial court's admission
of the spectrographic evidence was improper. The court stated the admissibility
of the spectrographic voice analysis
is based on the scientific technique having sufficient scientific basis to produce
uniform and reasonably reliable results and contribute materially to the
ascertainment of the truth 113, a standard the court admits bears "a close
resemblance to the familiar Frye test".114 The court relies upon the "general
acceptance within the professional community" to establish the scientific
reliability of the voice identification process. In reaching a determination of
general acceptance, the court relied on a three-pronged test which includes: (1) the
testimony of knowledgeable experts, (2) authoritative scientific literature, and (3)
persuasive judicial decisions which acknowledge such general acceptance of
expert testimony.115 The court found that none of the three prongs indicated that
there was a general acceptance of spectrographic voice identification in the
professional community. The court criticized the proponent experts as being too
closely tied to the development of this identification analysis to represent the
opinions of the community.116 The court found that the trial court did not
undertake to resolve the issue of conflicting scientific literature and they would
make no effort to resolve the conflict.117 The court also reviewed the judicial
decisions regarding admissibility and found a split among the jurisdictions as to
the reliability of the identification process.118
The New Jersey Supreme Court specifically limited its decision in Windmere,
excluding spectrographic voice identification evidence, to the present case. The
court stated that the future use of voice identification evidence "as a reasonably
reliable scientific method may not be precluded forever if more thorough proofs
as to reliability are introduced" 119 and they will "continue to await the more
conclusive evidence of scientific reliability".120
The Court of Appeals of Texas in the case of Pope v. Texas121 refused to
address the issue of admissibility of voice identification evidence stating that "the
overwhelming evidence against appellant renders this error, if any,
harmless".122 Justice McClung, in his dissenting opinion, states that the trial
court did err in admitting the voice identification evidence and that the error was
not harmless123. He suggests that the Frye test is the proper standard for
assessing the admissibility issue and that the "relevant scientific community"
should be defined broadly124. When this aspect of the test is so defined the
"general acceptability" criterion is not met.
In February of 1989, the United States Court of Appeals for the Seventh Circuit
affirmed the decision of the United States District Court for the Northern District
of Illinois admitting spectrographic voice identification evidence in the criminal
case of United States of America v. Tamara Jo Smith.125 The Seventh Circuit
now joins the Second, Fourth and Sixth Circuits in affirming the use of
spectrographic voice identification evidence.126 The Appellate court used the
Frye standard to hold expert testimony concerning spectrographic voice analysis
admissible in cases where the proponent of the testimony has established a
proper foundation.127 The court noted that this technique was not one-hundred
percent infallible and that the entire scientific community does not support it,
however, neither infallibility nor unanimity is a precondition for general
acceptance of scientific evidence.128 The Seventh Circuit found that a proper
foundation had been established in that the expert testified to the theory and the
technique, the accuracy of the analysis and the limitations of the process.129
The court noted that variations from the norm result in an increase of false
eliminations.130 The jury was not likely to be misled in that they had the
opportunity to hear the recordings, see the spectrograms, hear the limitations of
the process, and witness a rigorous cross-examination of the expert, and they
could reject the testimony of the expert.131
In United States v. Maivia,132 the United States District Court admitted
spectrographic evidence after a four day hearing on the issue. The court
examined the various sub-tests of the Frye test and found that spectrographic
voice identification evidence met these tests. The court also noted that
"inasmuch as the admissibility of spectrographic evidence to identify voices has
received judicial recognition, it is no longer considered novel within the Frye test
and consequently the test is inapplicable" 133. The court also looked to the
Federal Rules of Evidence, specifically rule 403, in deciding the admissibility of
spectrographic voice identification evidence.
In affirming the order of the Appellate Division, the New York Supreme Court, in
the case of People v. Jeter134, concluded that the trial court was not able to
properly determine that voice identification evidence is generally accepted as
reliable based on case law and existing literature. The Court stated that the trial
court should have held a preliminary inquiry into the reliability of voice
spectrographic evidence. In the light of the other evidence, the admission of the
voice identification evidence was held to be harmless error in this case.
STANDARDS OF ADMISSIBILITY
Prior to 1993 there were two main standards of admissibility which had been
applied to voice identification evidence: the Frye test and the Federal Rules of
Evidence (and the rules of evidence of the various states). The Frye test
originated from the Court of Appeals of the District of Columbia135 in a decision
rejecting admissibility of a systolic blood pressure deception test (a forerunner of
the polygraph test). The court stated that admission of this novel technique was
dependent on its acceptance by the scientific community.
"Just when a scientific principle or discovery crosses the line between the
experimental and demonstrable stages is difficult to define. Somewhere in this
twilight zone the evidential force of the principle must be recognized, and while
courts will go a long way in admitting expert testimony deduced from a
well-recognized scientific principle or discovery, the thing from which the deduction is
made must be sufficiently established to have gained general acceptance in the
particular field in which it belongs".136
Out of forty published opinions prior to 1993 deciding the admissibility of voice
identification evidence, twenty-three courts applied the Frye standard or a
standard very similar to Frye. Sixteen of the twenty-three courts rejected the
admission of such evidence. Six of these courts held the admission of voice
identification evidence by the trial court was harmless error and affirmed the
conviction or judgment. Eight of the sixteen stated that although voice
identification evidence had not yet met the required standard of scientific
acceptability, their decision was not intended to foreclose future admission when
such standards were met. Two of these courts denied admission because they
felt a single witness could not speak for the entire scientific community regarding
the acceptance issue.
Seven courts applied the test and found the requirements of Frye had been met.
Of the thirteen courts applying a standard of admissibility different from Frye, only
one, the Free court137, rejected voice identification evidence.
There are three problems with the Frye standard: at what point is a principle
deemed "sufficiently established", at what point is "general acceptance"
reached, and what is the proper definition of "the particular field in which it
belongs".
These three areas have been major stumbling blocks for the courts in deciding
the issue of the admissibility of voice identification evidence due to the small
number of voice scientists who have performed research in this field. The trial
court in People v. Siervonti 138 noted the lack of research in this area saying
"one only wishes that the last twelve years had been spent in research and not in
attempting to get the method into the courts".139
The Frye test has been criticized as not being the appropriate test to use for the
admission of voice identification evidence. This standard was established and
applied to the admission of a type of evidence which is very different from voice
identification. In Frye the court was concerned with the admission of a test
designed to determine if a person was telling the truth or not. This type of
evidence invades the province of the finder of fact. Voice identification evidence
belongs in the general classification of identification evidence which does not
impinge on the role of the finder of fact. As such it shares common traits with the
other identification sciences of fingerprinting, ballistics, handwriting, and fiber,
serum and substance identification.
Another criticism of the application of the Frye test as the standard for admission
of voice identification evidence is that general acceptance by the scientific
community is the proper condition for the taking of judicial notice of scientific facts.
McCormick states that general scientific acceptance is a proper condition for
taking judicial notice of scientific facts, but not a criterion for the admissibility of
scientific evidence.140
The court in Reed v. State 141 seemed to note this difference between the
standard for the taking of judicial notice and that for admission of evidence such
as voice identification. The court said that validity and reliability may be so
broadly accepted in the scientific community that the court may take judicial
notice of it. If it can not be judicially noticed then the reliability must be
demonstrated before it can be admitted.142 The court then applied the Frye test,
general acceptance by the scientific community, to determine reliability and thus,
admissibility.
Scientific evidence has long been admitted before it was judicially noticed, as
with the case of fingerprints. The admission of fingerprint identification evidence
was first challenged in the case of People v. Jennings143 in 1911. The court in
Jennings allowed the admission of fingerprint evidence saying "whatever tends to
prove any material fact is relevant and competent".144 It was not until thirty-three
years later that fingerprint evidence was first judicially noticed.145
The majority of courts which have decided the issue of admissibility in favor of
allowing voice identification into the courtroom have used similar standards which
permit the finder of fact to hear the evidence and determine the proper weight to
be assigned to it. Their logic runs parallel to the Federal Rules of Evidence which
state that all relevant evidence is admissible with the word "relevant" being
defined as evidence tending to make the existence of any fact that is of
consequence to the determination of the action more probable or less probable
than it would be without the evidence.146 A qualified expert may testify to his
opinion if such opinion will assist the trier of fact in better understanding the
evidence.147
Many of the courts which have upheld the admission of voice identification
evidence have done so because the trial court had set up a number of
precautions to insure the evidence was viewed in its proper light. These
precautions include allowing the jury to see the spectrograms of the voices in
question, allowing the jury to hear the recordings from which the spectrograms
were produced, the expert's qualifications and opinions as well as the reliability of
the equipment and technique are subject to scrutiny by the other side, the
availability of competent witnesses to expose limitations in the process, and
instructions to the jury that they were free to assign whatever weight, if any, to
the evidence they felt it deserved.
The United States Supreme Court in 1993 changed the long-standing law of
admissibility of scientific expert evidence by rejecting the Frye test as
inconsistent with the Federal Rules of Evidence in the case of Daubert v. Merrell
Dow Pharmaceuticals148. The Court held that the Federal Rules of Evidence
and not Frye were the standard for determining admissibility of expert scientific
testimony. Frye's "general acceptance" test was superseded by the Federal
Rules' adoption. Rule 702 is the appropriate standard to assess the admissibility
of scientific evidence. The Court derived a reliability test from Rule 702.
In order to qualify as scientific knowledge, an inference or assertion must be
derived by the scientific method. Proposed testimony must be supported by
appropriate validation - i.e., good grounds, based on what is known. In short, the
requirement that an expert's testimony pertain to scientific knowledge establishes
a standard of evidentiary reliability.149
The Daubert decision concerns statutory law and not constitutional law. The
Court held that the Federal Rules, not Frye, govern admissibility. The only
Federal Circuit to reject spectrographic voice analysis has been the District of
Columbia. Daubert may cause the District of Columbia to change its stance the
next time such evidence is introduced.
Since Daubert is not binding on the states, it will be difficult to determine just how
much impact Daubert will have on the admissibility standards of the states. Many
states have adopted evidence rules based on the Federal Rules of Evidence and
may not be affected by this holding. Other states which have adopted the Frye
test will have to decide to either continue following Frye or change their standard
to Daubert. The Arizona Supreme Court declined to follow Daubert saying that it
was "not bound by the United States Supreme Court's non-constitutional
construction of the Federal Rules of Evidence when we construe the Arizona
Rules of Evidence."150
RESEARCH STUDIES
The studies that have been produced over the years have run the gamut in type,
parameter, and result. A quick review of the available published data would leave
one with the impression that the spectrographic method of voice identification
was only somewhat more accurate than flipping a coin. The diversity of the
relatively low number of studies and the range of results has only added to the
confusion as to the reliability and validity of this method of identification. When
one takes the time and expends the effort to analyze the studies in this field, a
very different conclusion becomes evident. When the individual parameters of
the studies are taken into account, who was being evaluated, what information
was given to the examiner to assess, and what limitations were placed on the
examiner's conclusions, a much clearer picture of the accuracy of the
spectrographic voice identification method develops. The picture is not one of a
marginally accurate technique but rather a picture that clearly shows that a
properly trained and experienced examiner, adhering to internationally accepted
standards, will produce a highly accurate result. The studies also show that as the
level of training diminishes and/or the conclusions an examiner may reach are
artificially limited, the error rate goes up dramatically.
The training for accurately performing the spectrographic voice identification
method has been established as requiring completion of (1) a formal course of
study, usually 2 to 4 weeks in duration, in the basics of spectrographic analysis, (2)
two years of study completing 100 voice comparison cases, usually in a one-to-one
relationship with a recognized expert, and (3) examination by a board of experts
in the field of spectrographic voice identification analysis.
For the most accurate results from the spectrographic voice identification method,
a professional examiner (1) will require the original recordings or the best quality
re-recordings if the original is not available; (2) will perform a critical aural review
of the suspect and known recordings; (3) will produce sound spectrograms of the
comparable words and phrases; (4) will produce a comparison recording
juxtaposing the known and unknown speech samples; and (5) will evaluate the
evidence and classify the results into one of five standard categories [1 - positive
identification, 2 - probable identification, 3 - positive elimination, 4 - probable
elimination, and 5 - no decision]. The final decision is reached through a
combined process of aural and visual examination.
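The five-step examination sequence and the five-category conclusion scale described above can be sketched as a small data model. This is an illustrative sketch only: the names, and the toy decision rule for combining aural and visual cues, are hypothetical and not part of any published laboratory protocol.

```python
from enum import Enum

class Conclusion(Enum):
    """The five standard result categories described in the text."""
    POSITIVE_IDENTIFICATION = 1
    PROBABLE_IDENTIFICATION = 2
    POSITIVE_ELIMINATION = 3
    PROBABLE_ELIMINATION = 4
    NO_DECISION = 5

# The five examination steps, in the order the text gives them.
EXAMINATION_STEPS = [
    "obtain original recordings (or best-quality re-recordings)",
    "perform critical aural review of suspect and known recordings",
    "produce sound spectrograms of comparable words and phrases",
    "produce a comparison recording juxtaposing known and unknown samples",
    "evaluate the evidence and classify the result into one of five categories",
]

def classify(aural_match: bool, visual_match: bool) -> Conclusion:
    """Hypothetical decision rule: the text stresses that the final
    decision combines aural and visual examination, so conflicting
    cues map to a no-decision result here."""
    if aural_match and visual_match:
        return Conclusion.POSITIVE_IDENTIFICATION
    if not aural_match and not visual_match:
        return Conclusion.POSITIVE_ELIMINATION
    return Conclusion.NO_DECISION
```

The point of the sketch is that neither channel alone decides the outcome; only agreement between aural and visual evidence supports a categorical conclusion.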
It is important to remember that the spectrographic method of voice identification
is a process that interweaves the visual analysis of the sound spectrograms with
the critical aural examination of the sounds being viewed. Taking the results from
all of the studies produced shows that if the examiner's ability to analyze both the
graphic representations of the voice and the aural cues found in the recordings is
limited or restricted, accuracy suffers. Likewise, the amount of training has a
direct bearing on the level of accuracy of the results.
In a survey of 18 studies151 of the accuracy of the spectrographic voice
identification method, the results fall into two categories: those with proper
training, using standard procedures, produce very accurate results, whereas
those with inadequate training, using limited analysis methods, produce
inaccurate results.
In a 1975 study152 authored by Lt. L. Smrkovski of the Voice Identification Unit of the Michigan State Police, error rates in voice identification comparisons, based on three levels of training and experience, were evaluated.
The following table summarizes the results of that study.
Error type    Novice   Trainee   Professional
False Ident.    5.0%     0.0%       0.0%
False Elim.    25.0%     0.0%       0.0%
No Decision     2.5%     2.5%       7.5%
Lt. Smrkovski's results show that proper training is essential. The fact that his
results show a higher no decision rate among the professional examiners than
the trainee examiners may indicate that the professional is a bit more cautious in
his analysis than the trainee.
Mark Greenwald, in his 1979 thesis153 for his M.A. degree at Michigan State University, studied the performance of three professional examiners (each with eight years' experience) and five trainees (each with less than two years' experience) using standard spectrographic voice identification methods (visual and aural) and result classifications. Greenwald found that the professional examiners produced no errors when using full frequency bandwidth recordings. When the frequency bandwidth was restricted, the professional examiners still produced no errors, but did increase their percentage of no decision classifications. Greenwald also found that training level was an important factor: the trainees in this study had a false identification error rate of 6.1% in the restricted frequency bandwidth trials.
In 1986, the Federal Bureau of Investigation published a survey of 2000 voice identification comparisons made by FBI examiners.154 The comparisons were completed over a period of fifteen years, under actual law enforcement conditions.155 The examiners had a minimum of two years' experience, had completed over 100 actual cases, had completed a basic two-week training course, and had received formal approval from other trained examiners.156 The results of the survey are shown in the chart157 below.
DECISIONS                NUMBER   PERCENT (%)
No or low confidence      1304      65.2
Eliminations               378      18.9
Identifications            318      15.9

ERRORS
False eliminations           2       0.53
False identifications        1       0.31
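The percentages in the chart follow directly from the raw counts; a quick arithmetic check shows that the decision percentages are taken against all 2000 comparisons, while each error rate is computed against the decisions of its own type (2 of 378 eliminations, 1 of 318 identifications):

```python
total = 2000
decisions = {"No or low confidence": 1304, "Eliminations": 378, "Identifications": 318}

# Decision percentages are taken against all 2000 comparisons
for name, count in decisions.items():
    print(name, round(100 * count / total, 1))

# Error percentages are taken against the decisions of the same type
false_elim_pct = round(100 * 2 / 378, 2)    # 2 false eliminations of 378
false_ident_pct = round(100 * 1 / 318, 2)   # 1 false identification of 318
print(false_elim_pct, false_ident_pct)  # → 0.53 0.31
```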
The FBI results are consistent with the Smrkovski study in that properly trained
examiners, utilizing the full range of procedures, produce quite accurate results.
By way of contrast, the 1976 study158 by Alan Reich used four speech science
graduate students with previous experience with speech spectrograms (but
untrained in spectrographic voice identification analysis) to examine, using visual
comparison only, nine excerpted words. The study produced an accuracy rate of 56.67% in the undisguised trials; when disguise was introduced into the study paradigm, the accuracy rate decreased significantly.
Taken as a whole, the 18 studies support the conclusion that accurate results will be obtained only through the combined use of the aural and visual components of the spectrographic voice identification method, as performed by a properly trained examiner adhering to the established standards. The studies with poor accuracy results are important in that they demonstrate the weaknesses of improperly performed examinations that do not adhere to the internationally accepted professional standards.
A large part of the debate over the admissibility of spectrographic voice identification analysis in the courts appears to stem from the fact that the parameters of these studies have not been demonstrated to the courts in the detail necessary to assess their overall meaning. Many of these studies examine only one or two aspects of the spectrographic voice identification method, yet their results have frequently been misapplied to the entire method, so that inaccurate information has been used as the basis for admissibility decisions. It is important to provide an accurate picture of all the studies so the courts will have the foundational information necessary to make an informed decision regarding the admissibility of spectrographic voice identification analysis.
CONCLUSION
The technique of voice identification by means of aural and spectrographic comparison is still an unsettled topic in law. Although the spectrographic voice identification method has progressed greatly since it was first introduced in a court of law in the mid-1960s, it still faces stiff resistance on the issue of admissibility today. One reason for this opposition is that the method has evolved greatly since its initial application; court decisions based on early methods of voice identification analysis are not applicable to the methods used today. No longer are voices compared on the basis of a limited group of key words. Today's aural/spectrographic voice identification method takes advantage of the latest technological advancements and interweaves several analyses into one procedure to produce an accurate opinion as to the identity of a voice. This modern technique combines the experience of a trained examiner, performing the visual analysis of the spectrograms and the aural analysis of the recordings with the latest instruments modern technology has to offer, in a standardized methodology that assures reliability. Most of the courts that have rejected admission have been aware of continuing work in this field and have specifically left the door open to future admissibility.
Proper presentation and explanation of the research pertaining to spectrographic
voice identification analysis will allow the courts to better understand the
accuracy and reliability of the spectrographic voice identification method. When
the research is properly presented, the studies show that properly trained
individuals, using standard methodology, produce accurate results.
The current trend in the admissibility of voice identification evidence indicates that courts are more willing to allow the evidence into the courtroom when a proper foundation has been established, allowing the trier of fact to determine the weight to be assigned to the evidence.
Spectrographic voice identification: A forensic
survey
Bruce E. Koenig
Federal Bureau of Investigation, Engineering Section, Technical Services
Division, 8199 Backlick Road, Lorton, Virginia 22079
(Received 25 October 1985; accepted for publication 18 February 1986) - J. Acoust. Soc. Am. 79(6), June 1986
A survey of 2000 voice identification comparisons made by Federal Bureau of
Investigation (FBI) examiners was used to determine the observed error rate of
the spectrographic voice identification technique under actual forensic conditions.
The qualifications of the examiners and the comparison procedures are set forth.
The survey revealed that decisions were made in 34.8% of the comparisons with
a 0.31% false identification error rate and a 0.53% false elimination error rate.
These error rates are expected to represent the minimum error rates under
actual forensic conditions.
PACS numbers: 43.70.Jt
INTRODUCTION
The sound spectrograph is a device which produces a visual graph (spectrogram)
of speech as a function of time (horizontal axis), frequency (vertical axis), and
voice energy (gray scale or color differences).1,2 It is a well-accepted research
tool that is used to study individual vowel characteristics, physiological speech
anomalies, etc. However, in the field of forensic voice identification, it has yet to
find approval among most scientists in phonetics, linguistics, engineering, and
related disciplines as a positive test in comparing voice samples.3-6
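The same time-frequency-energy display is easy to compute digitally today. Below is a minimal sketch using SciPy; the synthetic test signal and the roughly 300 Hz analysis bandwidth are illustrative assumptions, not the instrument or settings used in the studies cited:

```python
import numpy as np
from scipy.signal import spectrogram

fs = 8000  # sampling rate in Hz; the analysis range then spans 0-4000 Hz
t = np.arange(0, 1.0, 1 / fs)
# Synthetic voice-like test signal: a 120 Hz "pitch" plus two formant-like tones
x = (np.sin(2 * np.pi * 120 * t)
     + 0.5 * np.sin(2 * np.pi * 700 * t)
     + 0.25 * np.sin(2 * np.pi * 1200 * t))

# Analysis bandwidth is roughly fs / nperseg, so a ~300 Hz "wideband" filter
# corresponds to a short analysis window of about fs / 300 samples
nperseg = int(fs / 300)
f, times, Sxx = spectrogram(x, fs=fs, nperseg=nperseg, noverlap=nperseg // 2)

# Rows of Sxx index frequency (vertical axis), columns index time (horizontal
# axis); magnitude plays the role of the gray scale on a paper spectrogram
print(Sxx.shape, f.min(), f.max())
```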
Historically, forensic applications were not seriously considered until 1962 when
Lawrence Kersta published the results of experiments which reflected error rates
of 0% to 3% for one-word spectral comparisons in closed sets (examiner always
knows a match exists) of 12 or fewer speakers.7 In 1972, the findings of a large-scale study at Michigan State University were published, in which attempts were made to more closely imitate law enforcement conditions, but only spectral comparisons were made (no aural). The “forensic model” included open set trials
(examiner did not know if a match existed), noncontemporary samples (1 month
apart), trained examiners, and high-confidence decisions. This resulted in an
approximate error rate of 2% for false identification (no match existed but the
examiner selected one, or a match existed but the examiner chose the wrong
one) and 5% for false elimination (a match existed but the examiner failed to
recognize it). The authors of the study attempted to extend the experimental
results to actual law enforcement conditions, which they thought would lower the
error rates. They theorized that examiners could aurally compare the voice
samples, the number of known suspects would be limited by police investigation,
there would be no time limits placed on the examiner, only very high confidence
decisions would be used, and additional known voice samples could be
obtained.8 Other scientists disagreed on the study’s extensions, and stated that
in actual forensic conditions the error rate would increase, not decrease.9 In 1979,
a committee of the National Research Council released its findings and
recommendations in a Federal Bureau of Investigation (FBI) -funded study on the
reliability of spectrographic voice identification under forensic conditions, which
found, in part, that:
(1) Error rates vary from case to case due to the properties of the voices
compared, the recording conditions used to obtain voice samples, the skill of the
examiner, and the examiner’s knowledge about the case. Estimates of error rates
are available only for a few situations, and they “do not constitute a generally
adequate basis for a judicial or legislative body to use in making judgments
concerning the reliability and acceptability of aural-visual voice identification in forensic applications.”10
(2) Examiners should fully use all available knowledge and techniques that could
improve the voice identification method.10
(3) Spectrographic voice identification assumes that intraspeaker variability (differences in the same utterance repeated by the same speaker) is discernible from interspeaker variability (differences in the same utterance by different speakers); however, that “assumption is not adequately supported by scientific theory and data.” Viewpoints on actual error rates are presently based only on “various professional judgments and fragmentary experimental results rather than from objective data representative of results in forensic applications.”11
FBI examiners have used the spectrographic technique since the 1950s for
investigative support, but have not provided expert court testimony on
comparison results.12
This paper presents the results of 2000 forensic comparisons, under actual law
enforcement conditions, by FBI examiners.
I. SURVEY PROCEDURES
The FBI conducts forensic voice identification examinations using the
spectrographic or voiceprint technique for the FBI, other Federal agencies, state
and local law enforcement authorities, and many foreign governments. After each
examination is conducted, a written report of findings is mailed to the contributor
with the name of the examiner and the disposition of the submitted voice
samples. If an identification or elimination is made, the contributor is contacted by
telephone and asked if the results are consistent with interviews and other
evidence in the investigation. If other information strongly supports the voice
comparison result, then the contributor is told to contact the FBI if later developed
evidence contradicts the finding. If the voice comparison results contradict other
evidence, the matter is closely followed until legally adjudicated or investigatively
closed. In the few occurrences where no final determination was possible, the
voice comparison result was considered a “no decision” in the survey.
The results of the last 2000 requested comparisons, spanning 15 years, were
compiled and organized into total identification and elimination decisions, known
errors, and no or low confidence decisions.
II. QUALIFICATIONS OF EXAMINERS
All of the individuals conducting the voice comparison examinations were FBI
employees with the following qualifications: (1) at least two years of full-time
experience in voice identification and analysis of tape recorded voice signals
using sophisticated digital and analog analysis and filtering equipment; (2)
completion of over 100 voice comparisons in actual cases; (3) completion of a
basic two week course in spectrographic analysis, or equivalent; (4) passing a
yearly hearing test; (5) formal approval by other trained examiners; and (6) a
minimum of a Bachelor of Science Degree in a basic scientific field.
III. COMPARISON PROCEDURES
The following procedures were used, if at all possible, on every attempted voice
comparison in the survey.
(1) Only original recordings of voice samples were accepted for examination,
unless the original recording had been erased and a high-quality copy was still
available.
(2) The recordings were played back on appropriate professional tape recorders
and recorded on a professional full-track tape recorder at 7 1/2 ips. When
possible, playback speed was adjusted to correct for original recording speed
errors by analyzing the recorded telephone and AC line tones on spectrum
analysis equipment. When necessary, special recorders were used to allow
proper playback of original recordings that had incorrect track
placement or azimuth misalignment.
(3) Spectrograms were produced on Voice Identification, Inc., Sound Spectrographs, model 700, in the linear expand frequency range (0-4000 Hz), wideband filter (300 Hz), and bar display mode. All spectrograms for each separate comparison were prepared on the same spectrograph. The spectrograms were phonetically marked below each voice sound.
(4) When necessary, enhanced tape copies were also prepared from the original
recordings using equalizers, notch filters, and digital adaptive predictive
deconvolution programs13,14 to reduce extraneous noise and correct telephone
and recording channel effects. A second set of spectrograms was then prepared
from the enhanced copies and was used together with the unprocessed
spectrograms for comparison.
(5) Similarly pronounced words were compared between two voice samples, with
most known voice samples being verbatim with the unknown voice recording.
Normally, 20 or more different words were needed for a meaningful comparison.
Fewer than 20 words usually resulted in a less conclusive opinion, such as possibly instead of probably.
(6) The examiners made a spectral pattern comparison between the two voice samples by comparing the beginning, mean, and end formant frequencies, formant shaping, pitch, timing, etc., of each individual word. When available, similarly pronounced words within each sample were compared to ensure voice sample consistency. Words with spectral patterns that were distorted, masked by extraneous sounds, too faint, or lacking adequate identifying characteristics were not used.
(7) An aural examination was made of each voice sample to determine if pattern
similarities or dissimilarities noted were the product of pronunciation differences,
voice disguise, obvious drug or alcohol use, altered psychological state,
electronic manipulation, etc.
(8) An aural comparison was then made by repeatedly playing two voice samples
simultaneously on separate tape recorders, and electronically switching back and
forth between the samples while listening on high-quality headphones. When one
sample had a wider frequency response than the other, bandpass filters were
used to compensate during at least some of the aural listening tests.
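The bandwidth matching in step (8) amounts to bandpass-filtering the wider-band sample down to the narrower channel before listening. A sketch, assuming a telephone-like 300-3400 Hz passband and a Butterworth design (both illustrative choices, not specified in the paper):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 8000  # sampling rate in Hz

# 4th-order Butterworth bandpass approximating a telephone channel
sos = butter(4, [300, 3400], btype="bandpass", fs=fs, output="sos")

t = np.arange(0, 1.0, 1 / fs)
wideband = (np.sin(2 * np.pi * 100 * t)      # energy below the telephone band
            + np.sin(2 * np.pi * 1000 * t))  # energy inside the telephone band
matched = sosfiltfilt(sos, wideband)  # zero-phase filtering

spec = np.abs(np.fft.rfft(matched)) / len(t)
freqs = np.fft.rfftfreq(len(t), 1 / fs)
in_band = spec[np.argmin(np.abs(freqs - 1000))]
out_of_band = spec[np.argmin(np.abs(freqs - 100))]
# The out-of-band component is strongly attenuated; the in-band one survives
print(out_of_band, in_band)
```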
(9) The examiner then had to resolve any differences found between the aural
and spectral results, usually by repeating all or some of the comparison steps.
(10) If the examiner found the samples to be very similar (identification) or very
dissimilar (elimination), an independent evaluation was always conducted by at
least one, but usually two, other examiners to confirm the results. If differences of opinion occurred between the examiners, they were resolved through additional comparisons and discussions by all the examiners involved. No or low confidence decisions were usually not reviewed by another examiner.
IV. SURVEY RESULTS
The survey found that in 2000 voice comparisons, the following decisions and errors were observed:

Decisions                Number   Percent (%)
No or low confidence      1304      65.2
Eliminations               378      18.9
Identifications            318      15.9

Errors
False eliminations           2       0.53
False identifications        1       0.31
Most of the no or low confidence decisions were due to poor recording quality
and/or an insufficient number of comparable words. Decisions were also affected
by high-pitched voices (usually female) and some forms of voice disguise.
V. CONCLUSIONS
(1) The observed identification and elimination errors probably represent the
minimum error rates expected under actual forensic conditions, since
investigators are not always correct in their evaluation of a suspect’s involvement,
due to limited physical evidence, faulty eyewitness statements, etc.
(2) The stated results should only be considered valid when compared with
examiners having the same qualifications and using the same comparison
procedures.
(3) The FBI has emphasized signal analysis and pattern recognition skills for
conducting voice identification examinations, more than formal training in speech
physiology, linguistics, phonetics, etc., though a basic knowledge of these fields is considered important.

ACKNOWLEDGMENTS
Thanks are due to the following colleagues who were involved in conducting the
comparisons used in this survey:
Steven A. Killion, Barbara Ann Kohus, Dale Gene Linden, Gregory J. Major,
Artese Savoy Kelly, Keith W. Sponholtz, Ernest Terrazas, Richard L. Todd, and
Charles Wilmore, Jr.
1W. Koenig, H. K. Dunn, and L. Y. Lacey, J. Acoust. Soc. Am. 18, 244 (1946).
2G. M. Kuhn, J. Acoust. Soc. Am. 76, 682–685 (1984).
3R. H. Bolt, F. S. Cooper, E. E. David, Jr., P. B. Denes, J. M. Pickett, and K. N. Stevens, J. Acoust. Soc. Am. 47, 591–612 (1970).
4R. H. Bolt, F. S. Cooper, E. E. David, Jr., P. B. Denes, J. M. Pickett, and K. N. Stevens, J. Acoust. Soc. Am. 54, 531–534 (1974).
5K. N. Stevens, C. E. Williams, J. R. Carbonell, and B. Woods, J. Acoust. Soc. Am. 44, 1596–1607 (1968).
6R. H. Bolt, F. S. Cooper, D. M. Green, S. L. Hamlet, J. G. McKnight, J. M. Pickett, O. I. Tosi, and B. D. Underwood, “On the Theory and Practice of Voice Identification,” N.A.S.N.R.C. Publ. (1979).
7L. G. Kersta, Nature 196, 1253–1257 (1962).
8O. Tosi, H. Oyer, W. Lashbrook, C. Pedrey, J. Nicol, and E. Nash, J. Acoust. Soc. Am. 51, 2030–2043 (1972).
9R. H. Bolt, F. S. Cooper, E. E. David, Jr., P. B. Denes, J. M. Pickett, and K. N. Stevens, J. Acoust. Soc. Am. 54, 531–534 (1974).
10R. H. Bolt, F. S. Cooper, D. M. Green, S. L. Hamlet, J. G. McKnight, J. M. Pickett, O. I. Tosi, and B. D. Underwood, N.A.S.N.R.C. Publ., 60 (1979).
11R. H. Bolt, F. S. Cooper, D. M. Green, S. L. Hamlet, J. G. McKnight, J. M. Pickett, O. I. Tosi, and B. D. Underwood, N.A.S.N.R.C. Publ., 2 (1979).
12B. E. Koenig, FBI Law Enforcement Bulletin (January and February 1980).
13J. E. Paul, IEEE Circuits and Systems Magazine 1, 2–7 (1979).
14J. E. Paul, paper presented at Voice Interactive Systems Subtag, Orlando, FL (Oct. 1984); hosted by U.S. Army Avionics Research and Development Activity, Ft. Monmouth, NJ.
Voice Comparison
Approved by ABRE Voice ID Board - April 1999
AMERICAN BOARD of RECORDED EVIDENCE -- VOICE
COMPARISON STANDARDS
Abstract
This document specifies the requirements of the American Board of Recorded
Evidence for the comparison of recorded voice samples. These standards have
been established for all practitioners of the aural/spectrographic method of voice
identification and are intended to guide the examiner toward the highest degree
of accuracy in the conduct of voice comparisons. These criteria supersede any
previous written, oral, or implied standards, and became effective in 1998.
Foreword
This document was developed by members of the American Board of Recorded
Evidence, a board of the American College of Forensic Examiners, following their
meeting in San Diego, CA in December, 1996. The document draws upon
previously published material from the International Association for Identification,
the International Association for Voice Identification, The Journal of the
Acoustical Society of America, The Audio Engineering Society and The Federal
Bureau of Investigation for much of its content. The contents of this document
are for non-commercial, educational use. It is the intent of the Board to publish
this document in the official journal of the American College of Forensic
Examiners.
VOICE COMPARISON STANDARDS
Table of Contents
1. Scope
2. Evidence Handling
3. Preparation of Exemplars
4. Preparation of Copies
5. Preliminary Examination
6. Preparation of Spectrograms
7. Spectrographic/Aural Analysis
8. Work Notes
9. Reporting
10. Testimony
1. SCOPE
This standard specifies recommended practices for the handling, preparation and
analysis of recorded evidence to be followed by practitioners of the
aural/spectrographic method of speaker identification. The document covers
specific instructions for the preparation of exemplar recordings, voice
spectrograms and aural comparison samples. It defines criteria to be applied when arriving at conclusions that are based upon the aural evidence. It also includes requirements for reports and testimony that are offered by the expert witness regarding his or her findings in voice analyses.
This standard is intended as a guide based upon good laboratory practices for
handling recordings that may be used in evidence. Persons handling evidence
recordings should first obtain and follow the rules of the legal jurisdiction or
jurisdictions involved. When a jurisdiction provides instructions, those should be
followed. Only in the absence of such instructions should the recommendations
of this standard be followed with the approval of the jurisdiction.
2. EVIDENCE HANDLING.
Since evidence involved in criminal or civil proceedings must meet the
appropriate jurisdiction's Rules of Evidence, it is important to properly identify and
safeguard it from the time of receipt until returned to the contributor or court. The
ABRE has adopted as its standard for handling evidence the AES Standard
"AES27-1996 - AES recommended practice for forensic purposes-Managing
recorded audio materials intended for examination". The complete document is
available at:
Audio Engineering Society, Inc.
60 East 42nd Street
New York, NY 10165
3. PREPARATION OF EXEMPLARS.
The quality of the exemplars is critical in allowing an accurate comparison with unknown voice samples.
3.1 Production. The exemplars can be prepared by the investigator, attorney, examiner, or other appropriate person. Whenever possible, an impartial individual knowledgeable of the known speaker's voice should be present to
minimize attempts at disguise, changes in speech rate, adding or deleting
accents, and other alterations. The known speaker should state his or her name
at the beginning of the recording and repeat the unknown speaker's statement(s)
from three (3) to six (6) times, depending upon the length of the unknown
samples. Normally, the person preparing the exemplar should record his or her
name and that of any other witnesses present.
3.2 Duplication of Recording Conditions.
3.2.1 Microphone. Whenever possible, the same type of microphone system
should be utilized when recording exemplars as was used for the original
unknown recording. Therefore, if the unknown caller used a telephone, the
exemplar should be prepared by having the suspect talk into one telephone
instrument and be recorded at a second telephone set, located an appropriate
distance away.
3.2.2 Acoustic environment. The exemplar recordings should be prepared in a
quiet environment with relatively short reverberation times. Do not imitate noises
present at the location of the unknown call or obvious reverberant effects.
3.2.3 Transmission line. Whenever possible, the same general type of
transmission line, such as a telephone call, should be utilized when recording
exemplars as was used for the original unknown recording.
3.2.4 Recording system. A good quality recording system should always be used
in preparing exemplars; it is usually not necessary to imitate the system utilized
in recording the unknown sample, but if the system is available and functional, it
may be used. Otherwise, a standard cassette recorder at 1 7/8 inches per second, an open reel tape recorder at 3 3/4 or 7 1/2 inches per second, or a digital recorder should be used. Microcassette and other miniature formats, speeds below 1 7/8 inches per second, and poor quality or inexpensive units are not recommended. Before the known speaker is allowed to leave the exemplar-taking session, the recordings should be played back to ensure that the samples are of high quality and properly prepared.
3.2.5 Recording media. Good quality tape or other appropriate recording media
should always be used in preparing exemplars; it is not necessary to duplicate
the type of tape utilized in recording the unknown sample. The tape should either
be new (preferred) or properly bulk erased.
3.3 Duplication of Speech Delivery.
3.3.1 Reading v. recitation. The suspect should be allowed to review the written
text or transcription before actually making the recorded exemplars. This
familiarity will usually improve the reading of the text and response to oral
prompts and increase the likelihood of obtaining a normal speech sample. When
a suspect cannot or will not read normally, it is advisable to have someone recite
the phrases in the same manner as the unknown speaker and have the suspect
repeat them in a similar fashion. Ideally, the exemplar should be spoken in a
manner that replicates the unknown speaker, to include speech rate, accent
(whether real or feigned), hoarseness, or any abnormal vocal effect. The
individual taking the sample should feel free to try both reading and recitation,
until a satisfactory exemplar is obtained.
3.3.2 Repetition. Multiple repetitions of the text are necessary to provide
information about the suspect's intraspeaker variability. All material to be used for
comparison should normally be read or recited from three (3) to six (6) times,
unless very lengthy.
3.3.3 Speech rate. Exemplars should be produced at a speech rate similar to the
unknown voice sample. In general, the suspect is instructed not to talk at his or
her natural speaking rate if this is markedly different from the unknown sample.
An effort should be made through repetition to appropriately adjust the speech
rate and cadence in the exemplar to that in the questioned recording.
3.3.4 Stress/Accents. Stress includes the emphasis and melody pattern in
syllables, words, phrases, and sentences. If prominent or peculiar stress is
present in the questioned recording, exemplars should be obtained in a similar
manner, if possible. Spoken accents or dialects, both real and feigned, should be
emulated by the known speaker. The recitation mode is the better technique for
accomplishing this.
3.3.5 Effects of alcohol or other drugs. Since the degree and type of effects from
alcohol and other drugs varies from person to person, an attempt to duplicate
these vocal changes is not recommended when obtaining the exemplar. If the suspect appears to be under the effects of alcohol or other drugs at the time of the exemplar recording, the session should be rescheduled.
3.3.6 Other. If any other unique aural or spectrally displayable speech
characteristics are present in the questioned voice, attempts should be made to
include them in the exemplars.
3.4 Marking. Same as Sect. 2
4. PREPARATION OF COPIES.
4.1 Playback of Evidential Recordings. The proper playback of the unknown and
known voice sample is critical, since it provides the optimum output for the aural
and spectral analyses.
4.1.1 Track determination. In situations where the questioned recording was
made on equipment of unknown origin or configuration, it may be necessary to
analyze oxide on the recording before playing it back. The recorded track
position and configuration may be determined by applying an appropriate
ferrofluid to the oxide side of analog tapes in a high amplitude portion of the
recording. The treated area is then viewed under low magnification to determine
the track configuration and offsets.
4.1.2 Azimuth alignment. Where there is evidence of an audio level or clarity
problem during playback, azimuth alignment should be checked and adjusted if
necessary by either an inspection of the developed magnetic striations (see track
determination above), frequency analysis of the recorded material, or adjustment
of the reproducer head azimuth for maximum high frequency output. All audio
miniature cassettes, standard cassettes, and open reels (other than loggers)
recorded at 15/16 inches per second (2.4 centimeters per second), or less,
should be carefully examined for loss of higher frequency information, which
often occurs in these formats.
4.1.3 Speed accuracy. Errors in playback speed will cause corresponding
variations in the voice frequency, both aurally and spectrally. The playback speed
error should be determined for all recordings containing known discrete tones,
and then corrected on a reproducer with speed-adjustment circuitry. A Real-Time
(RT) Analyzer or Fast Fourier Transform (FFT) analyzer system should be used
that allows a resolution of 1% (+0.60 hertz) or better at 60 hertz. Where a known
signal is present on the recording, a frequency counter may be employed to
correct tape speed. Ideally, there should be less that a 3% error between
questioned and known samples that are being compared.
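The tone-based speed check in 4.1.3 can be sketched numerically: measure the apparent frequency of a known reference tone and derive the speed error and correction factor. A 60 Hz AC line tone is assumed as the reference here, and the 2% speed error is a made-up test value:

```python
import numpy as np

fs = 8000           # playback sampling rate in Hz
nominal_hum = 60.0  # known AC line frequency in Hz

# Simulate a tape played back 2% fast: the 60 Hz hum appears at 61.2 Hz
t = np.arange(0, 4.0, 1 / fs)
playback = np.sin(2 * np.pi * nominal_hum * 1.02 * t)

# A 4 s FFT gives 0.25 Hz bins, better than the 1% (±0.60 Hz) requirement at 60 Hz
spec = np.abs(np.fft.rfft(playback))
freqs = np.fft.rfftfreq(len(playback), 1 / fs)
measured_hum = freqs[np.argmax(spec)]

speed_error = (measured_hum - nominal_hum) / nominal_hum  # positive = played fast
correction = nominal_hum / measured_hum  # factor to apply to the playback speed
print(measured_hum, speed_error, correction)
```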
4.1.4 Reproducer. Using the information gleaned from the examinations of the
track, azimuth alignment, and speed, a high-quality playback device is configured
to allow optimum output.
4.2 Direct Copies. The following information is provided for the analog reel copies
that are needed for processing on the Voice Identification, Inc., Series 700 sound
spectrograph. If the spectrograph being utilized has a digital memory, the
requirements for cabling and retention are still applicable. Even with digital
memory systems, a high quality digital or analog tape copy should still be
prepared and maintained.
4.2.1 Format. All copies are prepared in a full track, 7 1/2 inches per second
format on 1.0 mil or thicker audio tape from a reputable manufacturer. Normally,
new, unused reels of tape should be utilized; however, previously recorded tape
can be used if either bulk erased or over-recorded on a full track recorder with no
input.
4.2.2 Cabling. All copies must be prepared with good quality cables from the playback device to the line input of the recording unit. No loudspeaker-to-microphone copying procedures are permitted.
4.2.3 Recording unit. A separate professional reel recorder, or the one incorporated in the Series 700 sound spectrograph, is required. At least once a
year, the recorder must be checked by a technically competent individual to
determine the unit's playback speed accuracy, distortion level, flutter,
record/playback frequency response, and record level. The recorder must meet the following criteria: playback speed within 0.15%; distortion of less than 3% at 200 nWb/m; wow and flutter below 0.15% (NAB unweighted); record/playback frequency response of 100 to 10,000 hertz ±3 decibels at 200 nWb/m; and a 0 VU level no greater than 250 nWb/m. If the recorder does not meet all of these standards, it must be repaired and/or adjusted. If a digital system is utilized by
the examiner, the system should be checked at least once a year by a technically
competent individual according to the manufacturer's written instructions. Digital
systems should have almost unmeasurable speed errors, wow and flutter,
distortion, and frequency deviations.
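The annual-check criteria of Section 4.2.3 can be collected into a simple checklist. The sketch below is illustrative only; the measured values are hypothetical, and real verification requires test tapes and instrumentation:

```python
# A minimal sketch of the Section 4.2.3 recorder criteria as a
# checklist. All measured values below are hypothetical examples.
CRITERIA = {
    "speed_error_pct":       lambda v: abs(v) <= 0.15,  # playback speed within 0.15%
    "distortion_pct":        lambda v: v < 3.0,         # <3% at 200 nWb/m
    "wow_flutter_pct":       lambda v: v < 0.15,        # NAB unweighted
    "response_deviation_db": lambda v: abs(v) <= 3.0,   # 100-10,000 Hz, +/-3 dB
    "zero_vu_nwb_per_m":     lambda v: v <= 250,        # 0 VU level
}

def recorder_failures(measurements):
    """Return the list of criteria the recorder fails (empty list == passes)."""
    return [name for name, ok in CRITERIA.items()
            if not ok(measurements[name])]

measured = {  # hypothetical annual-check results
    "speed_error_pct": 0.10,
    "distortion_pct": 1.8,
    "wow_flutter_pct": 0.09,
    "response_deviation_db": 2.5,
    "zero_vu_nwb_per_m": 250,
}
failures = recorder_failures(measured)
```

Any non-empty result would mean the recorder must be repaired and/or adjusted before use.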
4.2.4 Retention. The direct copies must be retained at normal room temperatures
and humidity for at least three (3) years, unless the case has been completely
adjudicated or the contributor requires the return of all materials used by the
examiner.
4.3 Enhanced Copies. When the original recording contains interfering noise
and/or limited frequency response, enhanced copies may provide improved
audibility and more usable spectrograms. At times, separate enhanced copies
will have to be prepared for the aural and spectral examinations to provide
optimum results for each. The following information is specifically provided for the
analog reel copies that are needed for processing on the Voice Identification, Inc.,
Series 700 sound spectrograph. If the spectrograph being utilized has a digital
memory, the requirements for cabling and retention are still applicable. Even with
digital memory systems, a high quality digital or analog tape copy should still be
prepared and maintained. A written record of the settings on the devices used
should be maintained.
4.3.1 Equalizers. Parametric or graphic equalizers can boost and attenuate
selected frequency bands to normalize the recorded speech spectrum. Though
an FFT or RT analyzer is of considerable assistance in adjusting the spectrum, a
final decision on the equalizer settings should be made by either listening and/or
preparing spectrograms, depending upon the enhanced copy's use.
4.3.2 Notch filters. These devices allow the selected attenuation of discrete tones
present in the recordings. An FFT or RT analyzer is of considerable assistance in
identifying the frequency of the tones and optimally centering the filter's notch.
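As an illustration of the notch-filtering step, the sketch below attenuates a 60 hertz tone with a digital notch filter (using SciPy's `iirnotch`); the signal, tone frequency, and Q value are hypothetical examples, not a prescribed configuration:

```python
import numpy as np
from scipy.signal import iirnotch, filtfilt

def notch_tone(samples, sample_rate, tone_hz, q=30.0):
    """Attenuate a discrete tone (e.g., 60 Hz hum) with a notch filter.

    q controls the notch width: a higher Q gives a narrower notch,
    removing less of the adjacent speech energy.
    """
    b, a = iirnotch(tone_hz, q, fs=sample_rate)
    # filtfilt applies the filter forward and backward (zero phase).
    return filtfilt(b, a, samples)

# Hypothetical example: a speech-band component at 300 Hz plus 60 Hz hum.
rate = 8000
t = np.arange(rate) / rate
signal = np.sin(2 * np.pi * 300 * t) + np.sin(2 * np.pi * 60 * t)
cleaned = notch_tone(signal, rate, tone_hz=60.0)
```

An FFT of the result would show the 60 hertz component largely removed while the 300 hertz component is essentially untouched, which is the behavior an RT or FFT analyzer is used to confirm when centering the notch.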
4.3.3 Deconvolutional filters. These digital devices both automatically attenuate
sounds correlated longer than a specified time and flatten the sound spectrum.
The filter can, at times, provide improved spectrographic and aural samples for
examination. Care should be taken to ensure that the adaptation rate is not set at
a value that starts to delete speech information.
4.3.4 Other filters. Band pass, shelving, comb, user-characterized digital, and
other filters are helpful in a small number of voice identification cases.
4.3.5 Format. Same as 4.2.1.
4.3.6 Cabling. Same as 4.2.2.
4.3.7 Recording unit. Same as Section 4.2.3.
4.3.8 Retention. Same as Section 4.2.4.
5 PRELIMINARY EXAMINATION.
A preliminary examination is conducted to determine whether the unknown and
known voice samples meet specific guidelines to allow continuation of the
examination.
5.1 Original/Duplicate Recordings. The unknown and known voice samples must
be original recordings unless listed as a specific exception below. Copies not
meeting these guidelines cannot be used for examination. Short time restraints
imposed by the contributor are not considered an exception. When access to the
original recording is denied due to legal restraints, copies may be used under the
allowed exceptions. The exceptions for not examining the original recordings are:
a. If the original recording has been erased or destroyed, the examiner should
then use the best first-generation copy available;
b. The copies were prepared by a qualified voice identification examiner or other
technically competent individual following Section 4 guidelines;
c. If the original recording is in a relatively unique format or part of a digital
storage system, the examiner or other technically competent individual should
prepare the copies from the original material following Section 4 guidelines. If
that is not possible, then detailed telephonic and/or written instructions should be
given to the individual preparing the copies. Copies produced by non-technical
individuals should be closely analyzed in the laboratory to ensure that the
duplication process was properly done.
5.2 Verbatim/Non-verbatim. The known, or another unknown voice sample, must
be either wholly verbatim (preferred), or partially verbatim to allow meaningful
comparisons with unknown voice samples. A partially verbatim sample should
include phrases and sentences containing at least three (3) similar, consecutive
matching words. An example of the use of partial verbatim samples would be two
(2) unknown recorded false fire alarms containing, at times, nearly identical
phraseology. If no verbatim recordings are submitted by the contributor, the
examiner may analyze the unknown samples to determine whether they would
meet the guidelines if appropriate known voice samples are submitted at a later
time.
5.3 Number of Comparable Words. There must be at least ten (10) comparable
words between the two (2) voice samples to meet the minimal decision criteria. Similarly
spoken words within each sample can only be counted once. It is noted that in
most voice samples at least some of the words identified at this point will not be
useful in the final examinations.
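The counting rule in Section 5.3, with each similarly spoken word counted only once per pair of samples, can be sketched as follows; the two transcripts are invented examples:

```python
def comparable_words(unknown_words, known_words):
    """Words present in both samples, per Section 5.3; repeated
    words within a sample are counted only once."""
    shared = {w.lower() for w in unknown_words} & {w.lower() for w in known_words}
    return sorted(shared)

# Hypothetical transcripts of an unknown call and a known exemplar.
unknown = "there is a fire alarm at the main street warehouse there is a fire".split()
known = "there is a fire alarm at the warehouse on main street".split()

shared = comparable_words(unknown, known)
meets_minimum = len(shared) >= 10   # Section 5.3 threshold
```

Note that, as the section states, this preliminary count is an upper bound: some of these words will usually drop out during the final examinations.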
5.4 Quality of Voice Samples. This preliminary aural and spectral review is to
determine if the voice samples are of sufficient quality to allow meaningful
comparisons between them.
5.4.1 Disguise. Samples, or portions of samples, that contain falsetto, true
whispering (in contrast to low amplitude speech), or other disguises that
obviously change or obscure the vocal formants or other speech characteristics,
may need to be eliminated from comparison consideration. Other types of
disguise may or may not be usable, depending upon the nature of the disguise.
Sometimes a known voice sample with the same type of disguise can be
compared, but the examiner should exercise caution in such examinations.
5.4.2 Distortion. Samples, or portions of samples, that include high-level linear
and/or nonlinear distortion should be eliminated from comparison consideration.
Such distortion can result from saturation of magnetic tape or overdriven
electronic circuits, and can produce artifacts, including formants that did not exist
in the original speech information.
5.4.3 Frequency range. Samples, or portions of samples, that are restricted in
upper frequency range and produce less than two complete speech formants are
of limited value to the examiner. Samples producing three or more speech
formants provide the examiner better information with which to make a
comparison. Sometimes the use of enhanced copies can allow the frequency
range to be extended but note the limitations in Section 7.1.3.
5.4.4 Interfering speech and other sounds. Samples, or portions of samples, that
contain any extraneous speech information or sounds which interfere with aural
identification or spectral clarity should be eliminated from comparison
consideration unless the sounds can be sufficiently attenuated through
enhancement procedures.
5.4.5 Signal-to-noise ratio. Samples, or portions of samples, containing recording
system or environmental noise that impedes aural identification or spectral clarity
should be eliminated from comparison consideration unless the noise can be
sufficiently attenuated through enhancement procedures.
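As a rough numerical illustration of the signal-to-noise consideration, the sketch below compares RMS levels in segments marked as speech against noise-only segments. Actual practice relies on aural and spectral review; the synthetic signal and the hand-built mask here are assumptions for the example:

```python
import numpy as np

def estimate_snr_db(samples, speech_mask):
    """Rough signal-to-noise estimate: RMS power in segments marked
    as speech versus segments containing noise only.

    speech_mask is a boolean array the same length as samples; in
    practice it would come from listening or an energy detector.
    """
    speech_rms = np.sqrt(np.mean(samples[speech_mask] ** 2))
    noise_rms = np.sqrt(np.mean(samples[~speech_mask] ** 2))
    return 20 * np.log10(speech_rms / noise_rms)

# Synthetic example: a tone standing in for speech, over low-level noise.
rng = np.random.default_rng(0)
samples = 0.1 * rng.standard_normal(16000)
samples[:8000] += np.sin(2 * np.pi * 200 * np.arange(8000) / 8000.0)
mask = np.zeros(16000, dtype=bool)
mask[:8000] = True
snr_db = estimate_snr_db(samples, mask)
```

A sample whose estimate falls too low for aural identification or spectral clarity would be set aside unless enhancement can sufficiently raise it.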
5.4.6 Variations between samples. Though the following variations can quickly
end a voice comparison, the problem can often be remedied by obtaining
additional known samples:
a. Transmission systems. Normally, samples being compared should be
produced through the same type of transmission system, for example, the
telephone, a microphone in a room, or an RF transmitter/receiver. If aurally or
spectrally the samples are noticeably different due to the dissimilarities in the
transmission systems and filtering does not rectify these differences, no further
comparisons should be made.
b. Recording systems. Normally, samples being compared should be produced
on either good quality, or compatible, recording systems. However, if the
recordings contain uncorrectable system differences that affect aural and
spectral characteristics, no further comparisons should be made. Examples of
recording differences that can affect the results include high-level flutter, gross
speed fluctuations, and voice-activated stop/starts.
c. Speech delivery. Normally, samples being compared should have the
speakers talking in the same general manner, including speech rate, accent,
similar pronunciation, and so on. However, in cases where this has not been
done, as in poorly produced known exemplars, no further comparisons should be
made.
d. Other. Any other differences between the voice samples that noticeably affect
aural and spectral characteristics should be closely reviewed before proceeding
with the examination.
6 PREPARATION OF SPECTROGRAMS.
6.1 Sound Spectrograph. The examiner must use a sound spectrograph, or a
digital system, that allows the identification and marking of each speech sound
on the spectrogram by either manual manipulation of the drum while listening to
the recorded material or the separate identification of the individual sounds on a
computer monitor. Spectrographs used must be of professional manufacture,
such as the Voice Identification 700 Series or professional computerized systems,
such as the Kay Elemetrics Model 5500. The spectrograph should be calibrated
at least every six (6) months according to the manufacturer's instructions.
6.1.2 Print Quality. Spectrographic prints must be produced either in an analog
format or, if from a computerized system, must be printed with a minimum of 600
dots per inch resolution.
6.2 Format.
6.2.1 Filter bandwidth. A 250 to 300 hertz bandwidth filter is recommended for
the production of most spectrograms. A 450 to 600 hertz bandwidth filter may
sometimes improve the formant appearance for high-pitched voices. Narrower
filters should only be used for non-voiced sounds and calibration purposes.
6.2.2 Mode. The bar display mode must be used for all spectrograms with the
high-shaping equalizer engaged (except when an enhanced copy is being used
that has already properly shaped the spectrum).
6.2.3 Frequency range. An appropriate frequency range should be chosen that
fully displays all speech sounds in the unknown voice sample. The known voice
spectrograms are then prepared using the same frequency range.
6.2.4 Direct v. enhanced. When enhanced copies are used for the examination,
at least some spectrograms must be prepared from the direct copies.
6.3 Marking. Each spectrogram must be marked below each speech sound,
either phonetically, orthographically, or a combination of both. Great care should
be taken to ensure that the speech sounds are accurately designated as to how
they were spoken, which may not be their correct pronunciation. The
spectrograms should be appropriately labeled with identifying information such
as specimen, case, and laboratory identifiers. The spectrograms may be marked
consecutively for each unknown and known sample. Known and unknown
sounds may be marked in different colored ink to facilitate comparisons.
6.4 Retention. All spectrograms should be retained for at least three (3) years
after completion of the examination, unless the case has been completely
adjudicated or the contributor requires the return of all materials used by the
examiner.
7 SPECTROGRAPHIC/ AURAL ANALYSIS.
7.1 Pattern Comparison.
7.1.1 Intraspeaker consistency. The examiner must visually compare similarly
spoken words within each voice sample to determine the range of intraspeaker
variability. If a word exhibits considerable variability, that word must not be used
for comparison. If there is considerable variability in a number of words in a sample,
the sample should not be used for comparison. This is often encountered with
disguised voices and known exemplars from uncooperative individuals.
7.1.2 Similar speech sounds. Only speech sounds of similarly spoken words
should be compared between voice samples. Comparison of the same speech
sound but in different words, should be avoided.
7.1.3 Direct v. enhanced. When using spectrograms from direct and enhanced
copies, both should be visually compared to words from the known or questioned
voice sample. The examiner should be cognizant that the enhancement process
may distort the spectral energy distribution, thus increasing the likelihood of a
false elimination.
7.1.4 Number of comparable words. This is determined by the total number of
different words present in both samples that meet the standards set forth in
Sections 5.4.1 through 5.4.6. A similar or nearly similar word appearing more than once in
one or both samples should be counted only as one comparable word.
7.1.5 Speech characteristics.
a. General formant shaping and positioning. A formant is a band of acoustic
energy produced by spoken vowels and resonant consonants. Formants and
other vocal patterns produced on the spectrograms are visually compared by the
examiner. Generally, the spoken word will produce a set or sets of three (3) or
more observable formants. A good pattern match exists when the majority, if not
all, of the formant shaping and positioning exhibit strong similarities. A precise
photographic match rarely occurs even between two (2) consecutive utterances
of the same word spoken by the same individual. Conversely, even very different
voices can exhibit similarities in general formant shaping and positioning for
some words. Examination of these patterns must be conducted between each
comparable word of the voice samples.
b. Pitch striations. Pitch, or fundamental frequency, can be a useful characteristic
for distinguishing between speakers. Pitch information is displayed on a
spectrogram in the form of closely-spaced vertical striations, with the spacing and
shaping being useful parameters of the individual talker. Differences in the pitch
rate and the smoothness or coarseness of the pitch quality should be examined
both spectrally and aurally; but most talkers are characterized by fairly wide pitch
ranges.
c. Energy distribution. Energy distribution of certain vocal sounds can assist the
examiner in analyzing similarities and differences between voice samples.
Certain phonemes are displayed primarily by their energy distribution diffused
across a certain frequency range. Plosive and fricative consonants are displayed
along the frequency axis as concentrated dark energy distribution patterns.
Although the characteristics of energy distributions, especially bursts, are more
dependent upon the type of sounds produced than on the speakers, some talker-dependent characteristics can be observed.
d. Word length. The time length of a particular spoken word can be readily
compared between voice samples. When a person speaks more slowly or faster
than normal, the time between words is usually more affected than the length of
the individual words. It is noted that a word appearing at the end of a sentence or
phrase is usually longer than the same word appearing in the middle.
e. Coupling. The effects of inappropriate coupling can often be observed in
spectrograms as either diminished or enhanced energy in the frequency range
between the first and second formants. Coupling is related to the open/close
condition of the oral and nasal cavities. In normal speaking the nasal cavity is
coupled to the oral cavity for nasal sounds, such as "n", "m", and "ng". However,
some talkers are hyper nasal, producing nasal-like characteristics in
inappropriate vocal sounds; other speakers are hypo nasal producing limited
nasal qualities even when appropriate.
f. Other. Plosives, fricatives, and inter-formant features should be spectrally
compared between samples by the examiner. Other sounds such as inhalation
noise, repetitious throat clearing, or utterances like "um" and "uh" can sometimes
be compared to the known exemplar if they have been successfully replicated.
7.2 Aural Comparison.
7.2.1 Short-term memory. An aural short-term memory comparison must be
conducted either by playing the two (2) samples on separate playback systems
with a patching arrangement to allow rapid switching between them or by
recording short phrases or sentences from each sample on the same recording.
The short-term memory playback tape should contain all words used in the
spectrographic comparison. The two (2) samples should be reviewed at
approximately the same speech amplitude and with the same general frequency
range. The frequency range may be normalized between the samples by using
band pass filtering on the sample with the widest frequency range to duplicate
the range found on the other sample.
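The band-pass normalization described in 7.2.1 can be sketched digitally. The example below limits a wider-band sample to a telephone-like 300 to 3,400 hertz band with a Butterworth filter; the band edges and the test signal are hypothetical choices for illustration:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def normalize_band(samples, sample_rate, low_hz, high_hz, order=4):
    """Band-pass the wider-band sample so its frequency range matches
    the narrower sample before aural comparison (Section 7.2.1)."""
    sos = butter(order, [low_hz, high_hz], btype="bandpass",
                 fs=sample_rate, output="sos")
    # Forward-backward filtering avoids adding phase distortion.
    return sosfiltfilt(sos, samples)

# Hypothetical wideband sample with components below, inside, and
# above a telephone-like band.
rate = 16000
t = np.arange(rate) / rate
wideband = (np.sin(2 * np.pi * 100 * t)      # below the band
            + np.sin(2 * np.pi * 1000 * t)   # inside the band
            + np.sin(2 * np.pi * 6000 * t))  # above the band
matched = normalize_band(wideband, rate, 300.0, 3400.0)
```

After filtering, only the in-band energy remains, so the two samples present a comparable frequency range to the ear.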
7.2.2 Direct v. enhanced. When direct and enhanced copies have been produced,
both should be aurally compared to the known or questioned sample. The
examiner should recognize that though enhancement procedures often improve
intelligibility, they can also produce changes, at times, that can make samples of
the same talker sound somewhat different.
7.2.3 Pronunciation. Only similarly pronounced words should be compared
between samples.
7.2.4 Intraspeaker consistency. The examiner must aurally compare similar
words within each sample to determine if they are spoken in a generally
consistent manner. If intraspeaker variability is present for a particular word, that
word should not be compared to the other voice sample. If considerable
intraspeaker variability is present in the entire sample, that sample should not be
used for comparison. This is often the problem with disguised speech and known
exemplars from uncooperative individuals.
7.2.5 Speech characteristics.
a. Pitch. See Section 7.1.5.b.
b. Intonation. Intonation is the perception of the variation of pitch, commonly
known as a melody pattern. Spontaneous conversation will normally exhibit this
characteristic to a greater extent than a passage that is read by the speaker.
c. Stress/Emphasis. The stress or emphasis within the words of the sample
should be similar for different recordings of the same talker when no disguise is
present.
d. Rate. The rate of speaking under the same conditions is relatively constant for
a particular talker. However, rates of reading, recitation, and conversation will
normally vary for the same talker.
e. Disguise. Obvious vocal disguises can disqualify a sample for comparison
purposes. The examiner should carefully analyze the characteristics of the
disguise in a sample and then determine if it is possible to make a meaningful
comparison with another sample, whether it also contains a disguised voice or
not.
f. Mode. Certain speaker-dependent characteristics can be discerned from the
mode in which a speaker initiates sounds. Speakers range from gradually to
abruptly initiating voicing, which can reveal useful similarities and differences
between two samples.
g. Psychological state. Listening usually reveals many of the effects of an altered
psychological state upon the voice. Alterations may be characterized as
nervousness, over-excitement, excessive monotone, crying, and so on. The
examiner should be cautious in comparing samples with major changes due to
an altered psychological state.
h. Speech defects. Speech defects are abnormalities in the voicing of sounds,
and can include lisps, pitch and loudness problems, and poor temporal
sequencing. Except for extreme cases, there are no criteria to assess whether a
voice is considered normal or defective. Obvious, or even subtle, defects in the
questioned or known voice samples can often provide vital information in the
comparison decision.
i. Vocal quality. Vocal quality is the perception of the complex, dynamic interplay
of the laryngeal voicing (pitch, intonation, and stress), articulator movement, and
oral cavity resonances. Since each individual’s voice is relatively unique in its
vocal quality, comparisons can provide important information regarding
similarities and differences between the voice samples.
j. Other. Examples of other useful speech characteristics that are occasionally
heard include long-term fluctuations of pitch (vibrato), vocal fry (extremely low
pitching), pitch breaks, and stuttering.
7.3 Conclusions. Every aural/spectrographic examination conducted can only
produce one of seven (7) decisions: Identification, Probable Identification,
Possible Identification, Inconclusive, Possible Elimination, Probable Elimination,
or Elimination. The following descriptions for each decision are the minimal
decision criteria and must be adhered to by the examiner, except that a lower
confidence level can always be chosen, even though the criteria would allow a
higher degree of confidence. Within the range of probable decisions, the
examiner may wish to qualify the findings (e.g., low probability or high probability),
depending upon the quantity and quality of the comparable material available to
the examiner. Comparable words must meet the previously listed criteria. The
following are the seven (7) possible decisions.
7.3.1 Identification. At least 90% of all the comparable words must be very similar
aurally and spectrally, producing not less than twenty (20) matching words. Each
word must have three (3) or more usable formants. This confidence level is not
allowed when there is obvious voice or electronic disguise in either sample, or
the samples are more than six (6) years apart.
7.3.2 Probable Identification. At least 80% of the comparable words must be very
similar aurally and spectrally, producing not less than fifteen (15) matching words.
Each word must have two (2) or more usable formants.
7.3.3 Possible Identification. At least 80% of the comparable words must be very
similar aurally and spectrally, producing not less than ten (10) matching words. Each
word must have two (2) or more usable formants.
7.3.4 Inconclusive. Falls below either the Possible Identification or Possible
Elimination confidence levels and/or the examiner does not believe a meaningful
decision is obtainable due to various limiting factors. Comparisons that reveal
aural similarities and spectral differences, or vice versa, must produce an
Inconclusive decision.
7.3.5 Possible Elimination. At least 80% of the comparable words must be very
dissimilar aurally and spectrally, producing not less than ten (10) words that do not
match. Each word must have two (2) or more usable formants.
7.3.6 Probable Elimination. At least 80% of the comparable words must be very
dissimilar aurally and spectrally, producing not less than fifteen (15) words that
do not match. Each word must have two (2) or more usable formants.
7.3.7 Elimination. At least 90% of all the comparable words must be very
dissimilar aurally and spectrally, producing not less than twenty (20) words that
do not match. Each word must have three (3) or more usable formants. This
confidence level is not allowed when there is obvious voice or electronic disguise
in either sample, or the samples are more than six (6) years apart.
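The identification side of the minimal decision criteria above can be expressed as a short decision function. This is a simplified sketch: `usable_formants` here stands for the minimum usable formant count across the matching words, all counts in the example are hypothetical, and the elimination decisions mirror the same thresholds with dissimilar words:

```python
def identification_decision(total_comparable, very_similar, usable_formants,
                            disguised=False, years_apart=0):
    """Highest confidence level the Section 7.3 identification criteria
    allow; the examiner may always report a lower level instead.
    """
    pct = very_similar / total_comparable if total_comparable else 0.0
    # 7.3.1 is barred when disguise is present or the samples are
    # more than six years apart.
    full_confidence_ok = not disguised and years_apart <= 6

    if (pct >= 0.90 and very_similar >= 20 and usable_formants >= 3
            and full_confidence_ok):
        return "Identification"
    if pct >= 0.80 and very_similar >= 15 and usable_formants >= 2:
        return "Probable Identification"
    if pct >= 0.80 and very_similar >= 10 and usable_formants >= 2:
        return "Possible Identification"
    return "Inconclusive"

# Hypothetical case: 22 comparable words, 20 very similar, 3+ formants.
decision = identification_decision(22, 20, usable_formants=3)
```

With the same counts but obvious disguise, the function drops the result to a probable identification, matching the restriction stated in 7.3.1.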
7.4 Second Opinion. A second opinion is not required, but may be obtained from
another certified examiner when desired by either the examiner or the party
submitting the evidence.
7.4.1 Independence. A second opinion must be completely independent of the
first examiner's decision, and no oral or written information shall be provided
regarding that first opinion.
7.4.2 Material provided. The second examiner should only be provided the
originals, or direct and enhanced copies, any work notes under Sections 2, 3,
and 4 and the spectrograms. The second examiner must not be provided any
materials that reflect even partially, the first examiner's opinions regarding the
examination.
7.4.3 Examination. A thorough analysis should be conducted by the second
certified examiner, using the guidelines in Sections 5, 6 and 7 (except for 7.4). It
is left to the discretion of the second examiner whether to prepare additional
spectrograms or copies.
7.4.4 Resolving differences. If different decisions are reached by the two (2)
examiners, a detailed discussion between them of the analysis will often lead to a
resolution. If not, the lower confidence level must be reported and testified to
when both decisions are an identification or an elimination. If the decisions are
split between an identification and an elimination, no matter what the confidence
levels, the decision must be inconclusive. A third independent opinion can be
obtained, but the reported result will be the lowest confidence level, or an
inconclusive decision, among all the examiners involved.
7.4.5 Reporting. Whenever possible, the second examiner should prepare a
short report listing the results of the second opinion. This is not necessary if both
examiners are in the same organization. The name and results of the second
opinion can then be included in the first examiner's work notes.
8 WORK NOTES.
8.1 Required Information. The examiner's work notes should be in accordance
with Rule 26 of the Federal Rules of Evidence - Expert Witness Statement
categories, and should contain, as a minimum, the following information:
a. Laboratory, case, and specimen identifiers;
b. Description of submitted evidence;
c. Chain-of-custody documentation;
d. Track determination, azimuth alignment, and speed accuracy information,
where required, for each submitted sample;
e. Information on the duplication processes, including the type of equipment and
format copies;
f. Information on the enhancement processes, if any, including the type of
equipment, filter settings, and format copies;
g. List of the exact words used for comparison and whether they matched or not;
h. Name of any second opinion examiner and the results of that examination;
i. Final decision.
8.2 Retention. The work notes should be retained for at least three (3) years after
completion of the examination unless the contributor has requested that all
material relating to the case be returned.
9 REPORTING.
9.1 Format. The report should be typed, dated, and in a standard laboratory or
business letter style. The content of the report should be in conformity with Rule
26 of the Federal Rules of Evidence. The following information must be included:
a short description of the evidence being examined, a summary of the
examination performed, the final decision, and a statement of accuracy. Exhibits,
handouts and supporting documentation should be separate from the report.
Business matters, such as payment of fees, should be set forth in separate
communications and not included within the report.
9.2 Decision Statement. The report must clearly state which of the seven (7)
decision options listed in Section 7.3 was the final result of the examination.
10 TESTIMONY.
The American Board of Recorded Evidence does not take a position as to
whether or not a certified examiner should provide testimony regarding
examination results. However, an examiner must follow the standards set forth in
this document, including the appropriate criteria set forth in this section, whether
they provide testimony, or not.
10.1 Testimony v. Investigative Guidance. Each specific organization or
individual examiner must decide before conducting spectrographic voice
identification examinations whether testimony will be provided. If not, the
contributor must be advised of the investigative guidance policy and all oral and
written reports should set forth this information.
10.2 Qualification List. The presentation of the qualifications of the examiner
should be in conformity with Rule 26 of the Federal Rules of Evidence - Expert
Witness Statement categories, regarding expert witnesses.
10.3 Pre-testimony Conference. Discussion of the examination with the attorney
before judicial proceedings is an important aspect of providing meaningful
testimony and educating the attorney on the strengths and limitations of the
technique. The conference should include a candid discussion of the inherent
problems, identification of scientific literature that is either critical or supportive,
and other information important to the testimony.
10.4 Appearance and Demeanor. Whenever possible, examiners must dress in
proper business attire or appropriate law enforcement or military uniform for all
judicial proceedings, maintain a professional demeanor even under adversarial
conditions, and direct explanations to the jury, when present.
10.5 Presentation. The examiner should provide to the judge and/or jury, as a
minimum, his/her qualifications, an overview of the spectrographic technique, its
scientific basis, the details of the analysis procedures followed in the specific
case, and the results of the analysis. The information should be presented in a
form understandable to non-experts, but with no loss of accuracy.
ANOMALIES ASSOCIATED WITH COMPUTER EDITING OF
RECORDED TELEPHONE CONVERSATIONS
Second International Chemical Congress, Forensic Symposium, Fall 1995, San
Juan, Puerto Rico, by Steve Cain
During a two to three year period, a Midwestern entrepreneur had been
interested in filing a patent on an innovative new product. As a home-based
business man, much of his product development and marketing strategies were
accomplished through contact with several dozen product development attorneys
and other business advisors over his home and office telephones. When
requested to provide the original telephone tape recordings, he claimed they had
been inadvertently misplaced but that he had made copies of the relevant
conversations which he later surrendered for forensic analysis. Although
unsuccessful in ever examining the original tapes, I did have two copies of each
of the original tapes. The original recorders were described as Radio Shack type
portable machines together with a telephone interface device and two consumer
brand high speed dubbing cassette recorders which purportedly were used in the
selective dubbing of individual telephone conversations from the original tapes.
During review of ten composite copy tape conversations, it became apparent
through both aural and spectrographic/waveform analysis, that there existed a
number of suspicious record events (i.e. “anomalies”) which deserved further
instrumental attention. A KAY Digital Spectrograph Model 5500 was used for the
bulk of the analysis. As the original tapes were not available, magnetic
development was not deemed appropriate and therefore traditional digital
waveform/spectrographic techniques were utilized in the examination process.
Before displaying examples of the computer-based editing phenomena, it may
prove beneficial to review the traditional analog anomalies often associated with
falsification of recordings. These include:
1. Deletion: the elimination of words or sounds by stopping the tape and over-recording unwanted areas.
2. Obscuration: the mixing in of sounds of amplitude sufficient to mask waveform
patterns which originally would show stop/starts in inappropriate places.
3. Transformation: the rearranging of words to change content or context.
4. Synthesis: the adding of words or sounds by artificial means or impersonation.
Anomalies often include the following phenomena:
1. Gaps: segments in a recording which represent unexplained changes in
content or context.
2. Transients: short, abrupt sounds exemplified by clicks, pops, etc.
3. Fades: gradual loss of volume.
4. Equipment Sounds: context inconsistencies caused by the recording
equipment (such as hum, static, and varying pitches).
5. Extraneous voices: background voices which at times appear to be as near as
the primary voice or can even mask the primary voice. (1)
Modern-day technology and the development of the DSP chip have greatly
complicated the detection of tape tampering and further increased the likelihood
that altered tapes can escape detection. The Federal Bureau of Investigation
Signal Analysis Branch has already acknowledged, “it is difficult to detect some
alterations when a recording is digitized onto a computer system, physically or
electronically edited and recopied onto another tape.”(2) Recently there have
been at least 20 different manufacturers of desktop computer editing
been at least 20 different manufacturers of desktop computer editing
workstations or digital recorders which can be used as “turn key” editing systems.
Software related computer cards can transform a personal computer into a
sophisticated digital audio editing machine. Some of the systems do require that
the initial conversion of the analog format be accomplished by a digital audio
recorder before accessing the computer hardware.(3)
Digitization of speech can sometimes leave discernible artifacts, especially
"aliasing" effects. Digitizing the speech signal involves two distinct
processes, sampling and quantizing, which are the true core of the digital
recording process. Speech digitization requires filtering by an appropriate
low-pass filter, which should remove any frequencies above one-half the
sampling rate of the equipment. The sampling process refers to the transforming
of the low-pass-filtered electronic waveform into many thousands of small
units of time. Each of these time units is later quantized with respect to its
amplitude.
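The two-step process just described can be sketched in a few lines of Python. The function name, the clamping detail, and the 8-bit/8 kHz figures are illustrative choices for this sketch, not anything prescribed by the text:

```python
import math

def sample_and_quantize(signal_fn, duration_s, sample_rate_hz, bits):
    """Sample a continuous-time signal at discrete instants, then quantize
    each sample to one of 2**bits amplitude levels (signal assumed in [-1, 1])."""
    n_samples = int(duration_s * sample_rate_hz)
    levels = 2 ** bits
    step = 2.0 / (levels - 1)              # spacing between amplitude levels
    out = []
    for n in range(n_samples):
        t = n / sample_rate_hz             # sampling: slice time into small units
        x = signal_fn(t)
        q = round(x / step) * step         # quantizing: snap to the nearest level
        out.append(max(-1.0, min(1.0, q))) # clamp rounding overshoot at full scale
    return out

# Illustrative figures: a 1 kHz sine sampled at 8 kHz with 8-bit quantization
tone = lambda t: math.sin(2 * math.pi * 1000 * t)
samples = sample_and_quantize(tone, 0.001, 8000, 8)
```

Each returned value differs from the true waveform by at most half a quantization step, which is the origin of quantization noise in a digital recording.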
The Nyquist Theorem, however, requires that the sampling frequency be at least
twice as high as the highest frequency converted into digital format. If this
theorem is not followed, an undesirable effect known as aliasing occurs.(4)
High-frequency changes in amplitude are not properly encoded, some information
is lost, and occasionally new, erroneous signals are generated. "If the
throughput frequency is greater than one-half the sampling frequency, aliasing
inevitably occurs."(5) For example, if S is the sampling rate, F is a frequency
higher than one-half the sampling rate, and N is an integer, a new alias
frequency Fa is created at Fa = ±NS ± F. Therefore, if S equals 44 kHz and the
input frequency is 36 kHz, an alias frequency occurs at 8 kHz. If the input
frequency is 40 kHz, a 4 kHz aliasing signal occurs.(6)
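The fold-back behavior can be illustrated with a small Python sketch (the helper name is hypothetical). It reproduces the 36 kHz and 40 kHz examples against a 44 kHz sampler:

```python
def alias_frequency(f_in_hz, sample_rate_hz):
    """Return the baseband alias of an input frequency that exceeds the
    Nyquist limit (one-half the sampling rate); frequencies at or below
    the limit pass through unchanged."""
    nyquist = sample_rate_hz / 2.0
    f = f_in_hz % sample_rate_hz           # fold into one sampling period
    return f if f <= nyquist else sample_rate_hz - f

# The examples from the text, using a 44 kHz sampler
print(alias_frequency(36_000, 44_000))    # 36 kHz input -> 8 kHz alias
print(alias_frequency(40_000, 44_000))    # 40 kHz input -> 4 kHz alias
```

A 10 kHz input, being below the 22 kHz Nyquist limit, would pass through unaltered.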
Other aliasing effects involve image aliasing, which occurs in the multiple
images produced by the sampling process. If a 44 kHz sampler is utilized and a
36 kHz input signal is analyzed, some of the resulting output frequencies would
be 8 kHz, 52 kHz, 80 kHz, etc. In addition, harmonic aliasing can exaggerate the
problem. Complex tones, for example, could result in aliasing frequencies generated
separately for each harmonic. The practical result is that additional
harmonics, normally multiples of the harmonics of the fundamental frequency,
would be added to the digitized signal.(7)
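Harmonic aliasing is easy to see if each harmonic of a complex tone is folded independently. This is a minimal sketch under a simple fold-back model, with hypothetical function names:

```python
def harmonic_aliases(fundamental_hz, n_harmonics, sample_rate_hz):
    """For each harmonic of a complex tone, compute where it lands after
    sampling; harmonics above the Nyquist limit fold back as new tones."""
    nyquist = sample_rate_hz / 2.0
    result = []
    for k in range(1, n_harmonics + 1):
        f = (k * fundamental_hz) % sample_rate_hz
        alias = f if f <= nyquist else sample_rate_hz - f
        result.append((k * fundamental_hz, alias))
    return result

# A 5 kHz complex tone through a 44 kHz sampler: harmonics at 5..25 kHz
for original_hz, landed_hz in harmonic_aliases(5_000, 5, 44_000):
    print(original_hz, "->", landed_hz)
```

The first four harmonics sit below the 22 kHz Nyquist limit and pass through, while the 25 kHz fifth harmonic folds back to 19 kHz, a tone that was never present in the input.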
As DSP technology and its chips become more sophisticated and available to the
consumer, the ability to edit, alter, or fabricate audio recordings will be
enhanced. Computer-based digital editing now permits the generation of lengthy,
fabricated audio segments, sometimes devoid of the traditional transients and
other editing artifacts associated with analog tape tampering.
The results of an aural/waveform/spectrographic analysis on the evidence tape
copies disclosed a number of computer-related editing anomalies associated with
significant portions of the recorded telephone conversations, namely:
1. Uncharacteristic tones in the recordings sometimes occurring at even-numbered
multiples of each other (i.e. 4, 8, 16, 20 kHz).
2. Omission or deletion of material.
3. Abrupt beginning and ending of ongoing speech.
4. Aliasing effects.
The more subtle effects of the digital editing process involving “aliasing” artifacts
can sometimes be heard but are more readily apparent in the
spectrographic/waveform analysis of the altered speech signals.
Examples of the digital editing process associated with this case are displayed in
the accompanying sets of overhead transparencies. A short term aural composite
tape was produced and should further corroborate the nature and extent of the
digital editing anomalies associated with the computer edits found in this
examination process.
BIBLIOGRAPHY
1. Steve Cain, “Sound Recordings as Evidence in Court Proceedings,” article accepted for publication by
National District Attorneys Association, The Prosecutor, to be published late 1995.
2. Bruce E. Koenig, "Authentication of Forensic Audio Recordings," Journal of the Audio Engineering Society, 38, No.
1/2, January/February 1990, page 4.
3. Steve Cain, “Verifying the Integrity of Audio and Video Tapes,” paper published in The Champion
Magazine, July, 1993.
4. Jordan S. Gruber, Fausto Poza and Anthony Pellicano, Audio Tape Recordings: Evidence. Experts and
Technology, Volume 48, American Jurisprudence Series, Lawyers Cooperative Publishing, Rochester, New
York, 1993, pp. 108-109.
5. Ken C. Pohlmann, Principles of Digital Audio, Howard W. Sams and Company, 1992, pp. 46-48.
6. Ibid 5, p. 45.
7. Ibid 5, p. 48.
VERIFYING THE INTEGRITY OF AUDIO AND VIDEOTAPES
By Steve Cain
The Champion, July 1993
An ever increasing reliance on tape evidence in criminal prosecutions, especially
in organized crime and drug cases, underscores the importance of tape integrity
and the methods used to qualify or disqualify tape evidence.
This article will discuss some of the procedures utilized in analog and digital
editing of tapes and assess their potential threat vis-a-vis tape tampering issues;
the "legal admissibility" issue surrounding tape recorded evidence to include
defining strategies for the defense to require the government to release the 'best
evidence' for analysis purposes; and an overview of the accepted techniques for
the scientific analysis of recorded tape evidence.
Tape Editing Technology
The forensic examination of "tampered tapes" should include an inspection of the
original tape(s) and the recorder(s) used to produce the tape(s). In the simple
case, the existence of an electronic edit and/or evidence of physical splicing will
produce acoustic irregularities which can be viewed with instruments and
documented.
Modern day technology was apparently used in the electronic editing performed
on the disputed Gennifer Flowers/Gov. Bill Clinton tape recordings. The Cable
News Network (CNN) asked that I provide an expert opinion on Mr. Clinton's
voice and also asked that I examine the tape submitted by the STAR News
Magazine for any evidence of possible tampering. The latter examination
disclosed a number of suspicious acoustic events (anomalies) including: a total
loss of signal (dropouts); a change in the speakers' frequency response during
different telephone conversations; and "spikes" (audible sounds of short duration
which are often attributable to normal stop/start and pause functions of the
recorder).
In order to provide any definitive conclusion, I requested the original recorder and
tape to determine if these electronic edits were intentional edits or possible
malfunction/anomalies of the recorder/microphone equipment. CNN has never
received the requested tape or recorder from the Star News Magazine.
Digital editing of both audio and video tapes, however, greatly complicates the
issue and increases the likelihood that altered tapes can escape detection.
The Federal Bureau of Investigation (FBI) Signal Analysis Branch has already
acknowledged, "It is difficult to detect some alterations when a recording is
digitized into a computer system, physically or electronically edited and recopied
on to another tape." *1*
The days of utilizing a razor blade and splicing tape to effectively alter or "doctor"
a recorded conversation are all but gone. Right now there are at least twenty
manufacturers of desktop computer editing work stations or digital recorders
which can be used as "turn key" editing systems. Software and add on computer
cards can transform an IBM personal computer or a Macintosh computer into a
sophisticated digital audio editing machine. Some of the systems require a digital
audio recorder for initial conversion of the analog format before accessing the
computer hardware. These editing work stations were developed to save the
motion picture and recording industries money by precluding the necessity of
recording sessions or to correct subtle errors in multi track releases.
Some computer boards and software cost less than $1,000 and provide both
recording and editing of sound in an IBM-compatible or Mac personal computer
format. Editing options are practically inexhaustible, giving the operator the
ability to alter the tape in a word-processor type of mode (i.e. cut and paste,
copy, delete, etc.) while selected playback files utilize subtle cross-fading
effects that can "shape" the sound. The typical telltale signs of traditional
analog recorder editing, including "clicks," "pops," and other short-duration
sounds, can now all be effectively removed without any detectable, audible clue.
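Why the click disappears is easy to sketch: instead of butting two segments together (a hard cut, which leaves an abrupt amplitude jump), a digital editor overlaps them under a short cross-fade. This is an illustrative model with hypothetical names, not the algorithm of any particular workstation:

```python
def crossfade_splice(seg_a, seg_b, fade_len):
    """Join two sample lists by overlapping the end of seg_a with the start
    of seg_b under a linear cross-fade, so the splice ramps smoothly instead
    of leaving the abrupt amplitude jump (audible click) of a hard cut."""
    head = seg_a[:-fade_len]                 # seg_a minus its fade-out region
    tail = seg_b[fade_len:]                  # seg_b minus its fade-in region
    mixed = [seg_a[-fade_len + i] * (1 - i / fade_len) + seg_b[i] * (i / fade_len)
             for i in range(fade_len)]       # gains ramp 1 -> 0 and 0 -> 1
    return head + mixed + tail

# Worst case for a hard cut: +1.0 butted directly against -1.0
a = [1.0] * 100
b = [-1.0] * 100
spliced = crossfade_splice(a, b, 20)         # ramps smoothly across the join
```

With the cross-fade, the largest sample-to-sample step at the join is a small fraction of full scale, so no transient survives for the examiner to find.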
Traditional Editing Techniques
Present tape editing practices include either physical splices or electronic editing
on one or more analog tape recordings whenever portions of selected
conversations are over recorded (i.e. erased) or the original recorder was
stopped and restarted inappropriately. While listening to the tape, the attorney
may first suspect an alteration by noting either unexplained transients, equipment
sounds, extraneous voices, or inconsistencies with provided written information.
The major categories of tape alterations include: (1) Deletion; (2) Obscuration;
(3) Transformation; and (4) Synthesis.*2* Deletion of unwanted material can readily
be done through splicing or by using one or more recorders to erase, rerecord, or
stop/pause the recorder at strategic points within the conversation. Obscuration
involves the distortion of a recorded signal with the purpose of rendering
selective portions unintelligible. This method, for example, was used during the
editing of the infamous 18 minute gap in the Watergate tapes. This technique is
also used to mask splices, clicks, or suspicious transients and is more difficult to
detect than deletion methods. By judicious use of two tape recorders, one may
add "noise" to the copy and thereby mask the original recording and render it
less intelligible. One can also reduce the volume of the slave recorder and thus
weaken the amplitude of target conversations on the original tape.
Transformation involves the alteration of portions of a recording so as to change
the meaning of what is said. The technique is similar to deletion practices but
greater skill and care must be applied as a knowledge of acoustic phonetics is
required to avert a suspicious edit.
Lastly, synthesis is the generation of artificial text by adding background sounds
or conversation to the tape copy which were not present on the original recording.
The addition of selective phrases can be accomplished if a sufficient data base
library of recorded conversations is available. It must be emphasized that all of
the traditional analog methods of altering audiotapes can be more efficiently and
surreptitiously accomplished through the use of digital editing work stations.
Tape Authentication And Detection Of Edits
With the threat of digital editing looming larger, it is more imperative than ever
that both the official tapes and recorders be made available for inspection.
The FBI's Signal Analysis Branch has developed a set of well defined procedures
for the acceptance of authentication requests which provides an excellent
overview of what the government considers to be essential for a scientifically
valid tape analysis:
1. Sworn testimony or written allegations by defense, plaintiff, or government
witnesses of tampering or other illegal acts. The description of the problem
should be as complete as possible, including exact location in recording, type of
alleged alteration, scientific test performed, and so on;
2. The original tape must be provided. Copies of a duplicate tape cannot be
authenticated and are normally not accepted for examination by the FBI;
3. The tape recorders and related components used to produce the recording
must be provided; and,
4. Written records of any damage or maintenance done to the recorders,
accessories, and other submitted equipment must be provided.
In addition, there must be a detailed statement from the person or persons who
made the recording describing exactly how it was produced and the conditions
that existed at the time, including:
A. Power source, such as alternating current, dry cell batteries, automobile
electrical system, portable generator.
B. Input, such as telephone, radio frequency (RF) transmitter/receiver, miniature
microphone, etc.
C. Environment, such as telephone transmission line, small apartment, etc.
D. Background noise, such as television, radio, unrelated conversations,
computer games, etc.
E. Foreground information, such as number of individuals involved in the
conversation, general topics of discussion, closeness to microphone, etc.
F. Magnetic tape, such as brand, format, when purchased and whether
previously used.
G. Recorder operation, such as number of times turned on and off in the record
mode, type of keyboard or remote operation for all known record events, use of
voice activated features, etc.
Also recommended is a typed transcript of the recording, to include both English
and foreign language versions.*3*
It is essential in all tape authentication exams to obtain the original recorder and
tape, as copies cannot normally be authenticated. If the defense is encountering
difficulties in obtaining the necessary "originals," they may wish to cite Koenig's
article*4* as an authoritative resource which specifies the reasons why the
original evidence is essential in any tape tampering request.
If the original tape and recorder are not available for inspection, the forensic
expert can still conduct a preliminary examination of the submitted "copy" for any
evidence of discontinuous recorder operation, although all conclusions must
necessarily be qualified regarding possible editing effects. The examination
process normally includes an aural, physical, and instrumental analysis of
the evidential tape. Phase continuity, speed determination, azimuth
determination, waveform analysis, spectrographic and narrow band
spectrographic analysis are among the techniques employed to evaluate the tape.
The techniques and tests are usually adequate in the detection of altered analog
recordings. Fortunately, the vast majority of altered tapes today are still analog
tapes.
Defense counsel should have a working knowledge of how tapes are analyzed.
First, there is a physical inspection of the submitted tape, the tape housing, the
tape recorder and all ancillary equipment used to make the original recording:
microphones, telephone couplers, transceivers, etc. A magnetic development
test involves the application of a special fluid which under proper magnification
will make visible the head track configuration, off-azimuth recordings, start/stop
functions, damage to recording heads, etc. The forensic expert can subsequently
determine whether the submitted tape is a copy, has been over-recorded, or was
made on a different recorder than the one submitted. The original recorder can
be detected by slight speed fluctuations and deformities in the rotating parts
which provide a unique "wow and flutter" signature which can be measured. Also,
spectrum analysis can be used to measure slightly different signals transmitted
through the microphone or telephone equipment. All of the signal analysis
equipment can be useful in answering questions related to bandwidth, distortion
effects, or unique tones generated during the original recording process.
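One of the measurements described above, identifying a recorder by its "wow and flutter" signature, can be sketched as tracking a reference tone's frequency frame by frame. The zero-crossing estimator and the percentage figure below are simplifications of how real flutter meters work, and all names are hypothetical:

```python
import math

def zero_crossing_freq(samples, sample_rate_hz):
    """Estimate a tone's frequency from its positive-going zero crossings."""
    crossings = [i for i in range(1, len(samples))
                 if samples[i - 1] < 0 <= samples[i]]
    if len(crossings) < 2:
        return 0.0
    period = (crossings[-1] - crossings[0]) / (len(crossings) - 1)
    return sample_rate_hz / period

def wow_and_flutter_percent(samples, sample_rate_hz, frame_len):
    """Track a reference tone frame by frame; tape speed fluctuation shows
    up as frequency deviation around the mean, expressed as a percentage."""
    freqs = [zero_crossing_freq(samples[i:i + frame_len], sample_rate_hz)
             for i in range(0, len(samples) - frame_len + 1, frame_len)]
    mean = sum(freqs) / len(freqs)
    peak_dev = max(abs(f - mean) for f in freqs)
    return 100.0 * peak_dev / mean

# A perfectly steady 1 kHz tone at 8 kHz sampling: essentially zero flutter
rate = 8_000
tone = [math.sin(2 * math.pi * 1_000 * t / rate) for t in range(4_000)]
```

A tone played back on a worn transport would show periodic frequency deviation, and the shape of that deviation is what makes the signature specific to one machine.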
Forensic Video Examinations
The forensic video examiner is concerned with the authenticity and integrity of
the signal. Questions relating to whether the tape is a copy, a compilation of
other tapes, or an edited version are important considerations. Forensic
examinations of videotapes usually consist of both a visual and aural
examination. One of the more important pieces of equipment used in forensic
video examinations is a waveform monitor which is a specialized oscilloscope. It
displays the voltage versus time modes and has specialized circuits to process
the signal. If any editing occurs, then it is possible to display the signal aberration
on the display screen of the instrument.*5*
Additional tests include measurements of the chrominance, hue and burst of the
color videotape by using a vector scope. The vector scope measures the
chrominance information and allows for the examination of matching bursts of
multiple signals. It also permits the investigation of edit points.
Vertical interval and horizontal information, known as video synchronizing
information, can be observed on a cross pulse monitor; with proper application,
one can often determine if the videotape is a copy or an original. In cases where
the helical heads are out of alignment, a set of marks could exist for each
succeeding generation or copy.*6* Lastly, if one suspects videotape editing, the
examination will require a frame-by-frame inspection, with the use of waveform
monitors, vector scopes, and a cross pulse monitor together with other forensic
equipment as deemed appropriate. It must be noted that there are sophisticated
production studios that can edit videotapes in such fashion that traditional
methods of detection are no longer adequate. Studios capable of producing such
tapes are, for now, generally limited to larger metropolitan areas.
Legal Issues/Admissibility
In their article, "Attacking The Weight Of The Prosecution's Science
Evidence,"*7* authors Edward J. Imwinkelried and Robert Scofield explore the
thesis that the accused has a constitutional right to introduce expert testimony
which can generate a reasonable doubt. The authors warn, however, that this
right to relevant criminal evidence is in fact very limited in scope, namely; (1)
important or "crucial evidence" and; (2) the defense must show that the evidence
is "trustworthy."
Likewise, authors Nancy Hollander and Lauren Baldwin point out that the
admissibility of an expert's testimony is often dependent on whether the expert is
testifying for the defense or for the prosecution.*8*
In the field of forensic tape analysis, there exist few competently trained and
certified experts available to the defense to challenge the accuracy of
government tapes and/or the conclusions of the government experts. Even
though I have over twenty years' experience in federal law enforcement and as a
Treasury Department crime laboratory supervisor, I am routinely subjected to
concerted efforts by the prosecution to attack my credibility and the accuracy of
my conclusions. As you would expect, as a government expert, I never received
any criticisms from the prosecutor concerning my credentials or accuracy of my
findings.
Access To Evidence
More and more courts are being forced to address the question of whether the
government has the privilege to withhold technical data from a defendant
challenging the integrity of electronic surveillance evidence. A few courts have
recognized a "qualified privilege" for the government to withhold such data (by
drawing an analogy to an "informer's privilege"), but have not been very sensitive to the
unique nature of electronic surveillance evidence nor defined the showing
required to overcome the government's "qualified privilege." Under the due
process clause, criminal defendants should be afforded a meaningful opportunity
to present a complete defense.*9* To safeguard this right the court has
recognized the principle of "constitutionally guaranteed access to
evidence."*10* This access to evidence, however, is not absolute, as indicated in
Roviaro v. United States,*11* wherein the court recognized the government's
limited privilege to withhold the identity of informers. Two circuit courts of appeal
have extended the limited privilege recognized in Roviaro to the nature and
location of electronic surveillance equipment."*12*
In Angiulo and Cintolo, the appellants asserted that the district court had
mistakenly barred questions that would have provided them the precise location
of microphones hidden in an apartment. Trial motions for the information had not
been made nor had the defendants offered any technical basis for the value of
the information. The government successfully objected to the questions
concerning the microphones' location on the grounds that it would reveal
sensitive surveillance techniques and jeopardize future criminal investigations.
In upholding the district court, the First Circuit, citing Van Horn*13* and
United States v. Harley,*14* and making an analogy to the informer's privilege
in Roviaro, held that "a qualified privilege against compelled government
disclosure of sensitive investigative techniques exists."*15* The privilege can
be overcome,
however, by a sufficient showing of need. The defendant must show that, "he
needs the evidence to conduct his defense and that there are no adequate
independent means of getting at the same point."*16* The Cintolo court stressed
that the extent to which adequate alternative means could have substituted for
the proper testimony is "a key to evaluating this claim of necessity."*17*
As technological advances have occurred in digital editing, there likewise has
been a tremendous increase in the number of body worn FM transmitters and
other recording devices used by law enforcement to collect evidence against
defendants. It should be emphasized, however, that some of this evidence may
not be admissible in court if the agencies do not comply with several Federal
Communication Commission (FCC) regulations. First, all nonfederal agencies
must use only transmitters that are approved by the FCC and without this
approval the transmitter is not considered a legal transmitting device and
therefore cannot be legally used to gather evidence. Secondly, state and local
agencies must be licensed in the FCC's Police Radio Service and thus far most
departments reportedly have not met this requirement. These observations are
part of the information contained in "Equipment Performance Report: Body Worn
FM Transmitters," a report of the Technology Assessment Program (TAP). This
program tested nine body-worn FM transmitters in accordance with National
Institute of Justice (NIJ) Standard 0214.01. These standards require
transmitters passing the test to provide intelligible audio signals that result in
acceptable quality voice recordings.*18* As noted in the Cintolo and Angiulo
decisions, the defense failed to provide a sufficient showing of necessity, thus, it
is imperative that defense experts vouch for the necessity of access to the
government evidence as soon as possible.
The Need For Original Recording Equipment And How To Get It
There are a number of valid scientific reasons for accessing original tapes,
recorders, and related equipment to conduct a proper analysis.
In practically every credible forensic publication dealing with forensic tape
analysis procedures, the authors emphasize the necessity of examining the
original evidence or a direct patch cord copy. In many cases, however,
experience has shown an unwillingness of the government prosecutor and
agents to provide such materials to the defense for examination purposes. The
government may object that the defense never requested the original or direct
copy recordings and therefore, their motions for access at the eleventh hour are
basically "delay strategies." This argument can be effectively countered if the
defense obtains an appropriate court order requesting the defense expert be
provided access to the required "best evidence recordings."
Secondly, the government may contend that it has a qualified (if not absolute)
privilege of withholding technical data from the defense counsel citing "National
Security" or indicating that such release may jeopardize future criminal
investigations. The Angiulo and Cintolo decisions provide the defense counsel
relief from such government actions. Counsel must show the need for the
evidence to conduct the defense and that there "is no adequate independent
means of getting at the same points."
The importance of the defense obtaining the original or at least a direct patch
cord copy of all evidential recordings cannot be over emphasized. In practically
every case I have seen, the copy initially provided by the government was not
adequate for the best voice identification, tape enhancement or tape
authentication examination. Subsequent motions filed by the defense citing the
aforementioned requisite need for the original evidence often result in its
release by the court. As reflected in the newly approved International Association
for Identification standards for analysis of questioned voice recordings, the
"unknown and known voice samples must be original recordings, unless listed as
a specific exception ...."*19*
Notes:
1. Bruce E. Koenig, Authentication of Forensic Audio Recordings, JOURNAL OF
THE AUDIO ENGINEERING SOCIETY, 38, No. 1/2, Jan/Feb 1990, page 4.
2. National Commission For The Review of Federal and State Wiretapping Laws,
pp. 223-225, 1972.
3. Steve Cain, Voiceprint Identification, NARCOTICS, FORFEITURE, AND
MONEY LAUNDERING UPDATE NEWSLETTER, U.S. Department of Justice,
Criminal Division, (Winter 1988).
4. Bruce E. Koenig, Authentication of Forensic Audio Recordings, JOURNAL OF
AUDIO ENGINEERING SOCIETY, 38 No. 1/2, 1990, Jan/Feb. page 4.
5. Tom Owen, Forensic Audio and Video Theory And Applications, JOURNAL
OF AUDIO ENGINEERING SOCIETY, Vol. 36, No. 1/2. 1988, Jan/Feb, page 39.
6. Ibid page 40.
7. Edward J. Imwinkelried and Robert G. Scofield, Attacking The Weight Of The
Prosecution's Scientific Evidence, THE CHAMPION, April 1992.
8. Nancy Hollander and Lauren M. Baldwin, Testimony In Criminal Trials:
Creative Uses, Creative Attacks, THE CHAMPION, December 1991.
9. California v. Trombetta, 467 U.S. 479, 485 (1984).
10. United States v. Valenzuela-Bernal, 458 U.S. 858, 867 (1982).
11. 353 U.S. 53 ().
12. See United States v. Angiulo, 847 F.2d 956, 981-82 (1st Cir. 1988); United
States v. Cintolo, 818 F.2d 980, 1001-03 (1st Cir. 1987); United States v. Van
Horn, 789 F.2d 1492, 1507-08 (11th Cir. 1986).
13. 789 F.2d 1492 ( ).
14. 682 F.2d 1018, 1020 (D.C. Cir. 1982).
15. Cintolo, 818 F.2d. 1002.
16. See Harley, supra.
17. Cintolo, 818 F.2d. 1003.
18. Copies are available at no charge from the Technology Assessment Program
Information Center (TAPIC), toll-free number 800-248-2742 or (301) 251-5060.
19. IAI Voice Comparison Standards, JOURNAL OF FORENSIC
IDENTIFICATION, January/February 1992.
AUTHENTICATION OF SOUND RECORDINGS FOR EVIDENTIARY
PURPOSES
By: STEVE CAIN, MFS, MFSQD
PRESIDENT
FORENSIC TAPE ANALYSIS, INC
LAKE GENEVA, WISCONSIN
MICHAEL R. CHIAL Ph.D
PROFESSOR AND CHAIRMAN OF
COMMUNICATIONS PROGRAMS AND
PROFESSOR OF COMMUNICATIVE DISORDERS
UNIVERSITY OF WISCONSIN-MADISON
PRESENTED AT 1994 ANNUAL MEETING OF THE
AMERICAN ACADEMY OF FORENSIC SCIENCES
(JURISPRUDENCE SECTION)
FEBRUARY 18, 1994
An ever-increasing reliance on tape evidence in both criminal and civil hearings
underscores the importance of tape integrity and the methods used to qualify or
disqualify audiotape evidence. Tape recordings are subject to increasing
falsification and misinterpretation, especially with the advent of computer-based
digital editing equipment. The purpose of this paper is four-fold: 1) to
identify the predominant methods by which audiotapes are normally intentionally
altered or falsified; 2) to identify the physical and instrumental techniques
for detecting signs of tape falsification; 3) to briefly discuss the increasing
threat caused by modern-day digital editing techniques; and 4) to provide
examples of both analog and digitally falsified tapes.
There are two generally accepted approaches for establishing the authenticity of
a questioned tape recording. Current legal practices normally require that the
burden of proof be placed on the attorney seeking to introduce the tape into
evidence. This will require that the attorney demonstrate that certain accepted
methods designed to protect against any form of tape tampering have been
adhered to and, if that is not successful, to submit the tape to a qualified
expert for a forensic examination. On a more practical level, an original recording is
considered authentic if it starts at the beginning of the tape and does not stop
until the end. Any stops or restarts should be announced by the operator.
Original recordings should contain all of the audio information recorded at the
moment in time that the event occurred. The recording should further not contain
any break in its continuity or content nor should it contain any suspicious signs
suggestive of falsification.
It is important for both attorney and investigator to understand that falsification or
tampering with tapes involves an intentional attempt to alter the tape’s original
content. Often, however, the evidential recorders and their respective tapes have
been unintentionally interrupted during the recording process. This innocuous or
accidental interruption of the tape does not constitute a falsification effort
and may include the following operator errors: 1) accidental stop/restart of
the tape recorder; 2) mechanical malfunction of the tape recorder; 3) damage to
the tape oxide or the use of a previously recorded tape; 4) "off-speed" recording
due to low batteries or improper AC line connections; 5) microphone abnormalities; etc.
The major categories of intentional tape editing or falsification include:
1) Deletion; 2) Obscuration; 3) Transformation; and 4) Synthesis. Deletion of
unwanted material can be rapidly accomplished through either splicing or by
using one or more recorders to erase, rerecord, or stop/pause the recorder at
strategic points within the conversation. Obscuration involves the distortion of a
recorded signal with the purpose of rendering selective portions unintelligible (i.e.
the eighteen minute gap in the infamous Watergate tapes). This technique can
also be used to mask splices, clicks, or suspicious transients. Transformation
involves the alteration of portions of a recording so as to alter its original content.
The technique is similar to deletion practices but requires greater knowledge of
acoustic phonetics and is more difficult to accomplish. Lastly, synthesis is the
generation of artificial text by adding background sounds or conversation to the
taped copy which were not present on the original recording. It should be
emphasized that all of the aforementioned traditional analog techniques for
altering audiotapes could be more effectively and surreptitiously accomplished
through the use of digital editing workstations.
The principles of falsification are also similar to the general principles of disguise.
Namely, the individual actually effecting the tape falsification is attempting to
obscure or disrupt important features of the originally recorded event or subject
of interest. This is accomplished through various masking techniques. Secondly,
falsification efforts are often designed to misdirect the attention of the listener to
an irrelevant aspect or feature of the signal or an event of interest.
The electromechanical indications of falsified tapes include one or more
of the following phenomena:
1) Gaps - segments in a recording which represent unexplained changes in
content or context. A gap can contain buzzing, hum, or silence.
2) Transients - short, abrupt sounds exemplified by clicks, pops, etc. Transients
may indicate tape splicing or some other interruption of the recording process.
3) Fades - gradual loss of volume. Fades can cause inaudibility and are
considered gaps when the recording becomes fully inaudible.
4) Equipment sounds - inconsistencies of context caused by the recording
equipment itself. Common equipment sounds include hum, static, whistles, and
varying pitches.
5) Extraneous voices - background voices which at times appear to be as near
as the primary voices, and at times can even block the primary voices.
The methods for detecting falsified (non-authentic) recordings include:
Critical Listening - The forensic tape specialist will normally listen to the
original tapes with high-quality headphones and professional recording
equipment prior to conducting any instrumental examination. He notes any
unusual aural and/or acoustic events such as starts, stops, speed fluctuations,
and other variations requiring investigation. He examines all recorded events,
including both foreground and background sounds, and listens for abnormal
changes, absences, or presences of differing environmental sounds. He
concentrates on voices, conversation, and other audible sounds.
Aural Anomalies - These include sudden changes in a person’s voice, abrupt
unexplained topic changes, or a sudden change in foreground/background
information.
Physical Inspection
Magnetic Development
Spectrum Analysis - Employs specialized computer equipment which measures
the frequency spectrum of the recorded tape and provides a visual
interpretation via frequency vs. amplitude and frequency vs. amplitude vs. time
displays. This allows the expert to view the entire spectrum or to zoom in on
one particular area of interest, helping to characterize the acoustic nature of a
particular anomaly and to possibly identify its source.
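The frequency vs. amplitude vs. time display described above is essentially a spectrogram. As a rough illustration of the underlying computation (not the examiner's actual equipment or software; signal values are synthetic), the following Python sketch frames a digitized signal and takes a magnitude spectrum per frame; a broadband transient such as a splice click shows up as energy smeared across all frequency bins:

```python
import numpy as np

def spectrogram(signal, sample_rate, frame_len=512, hop=256):
    """Frame the signal and take a magnitude spectrum per frame, producing
    the frequency vs. amplitude vs. time matrix behind a spectrum display."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append(np.abs(np.fft.rfft(signal[start:start + frame_len] * window)))
    times = np.arange(len(frames)) * hop / sample_rate
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    return np.array(frames), times, freqs

# A steady 1 kHz tone with a brief broadband click standing in for a
# hypothetical splice transient
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)
tone[4000:4005] += 5.0
spec, times, freqs = spectrogram(tone, sr)
# Normal frames concentrate energy near 1 kHz; the frame holding the
# transient shows energy spread across the whole frequency axis.
```

In a real examination the display is inspected interactively, but the same matrix underlies both the zoomed and full-spectrum views.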
Waveform Analysis - A computer generated display representing time vs.
amplitude of recorded signals in graphic form. Such analysis normally allows the
expert to measure and identify record-mode events including the measurement
of record-to-erase head distances, determination of the spacing between gaps
and multiple gap erase heads, and inspection of the signature shape and spacing
of various record event signals.
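The record-to-erase head distance measurement mentioned above reduces to simple arithmetic: the time observed on the waveform between the erase-head and record-head stop events, multiplied by the tape speed, gives the physical head separation of the recorder. A minimal sketch with hypothetical event times (4.76 cm/s is the standard compact-cassette tape speed):

```python
# Tape moves at a known speed, so the interval between the erase-head event
# and the record-head stop event on the waveform display converts directly
# into the recorder's physical record-to-erase head distance.
# (Event times below are hypothetical examples.)
tape_speed_cm_per_s = 4.76
erase_event_t = 12.340    # s, erase-head signature on the waveform
record_event_t = 12.865   # s, record-head stop transient

head_distance_cm = tape_speed_cm_per_s * (record_event_t - erase_event_t)
print(round(head_distance_cm, 2))  # prints 2.5
```

The resulting distance can then be compared against the head geometry of the suspect recorder.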
Test Recordings on Evidential Recorders and Accessory Equipment -Various
electrical, magnetic and mechanical measurements of both standard and
modified recorders can be used in determining the possible origins of
questionable tones or sounds occurring on the evidential recording.
There exist many different methods of both analog and digital editing of tape
recordings and the below examples highlight one of the more common methods
utilized.
TRADITIONAL METHODS OF TAPE EDITING AND METHODS OF DETECTION
1. Whispered speech. Detection: talker identification (voice print analysis)
involving the combined aural/spectrographic method.
2. Vocal disguise or mimicking. Detection: talker identification (voice print
analysis).
3. Typical analog edits: splicing (electronic or physical), stop/restart,
over-recording, pausing of recorder, erasures, dubbing, etc. Detection: critical
listening, instrumental analysis, magnetic development, and spectrum analysis.
4. Re-recording to obscure physical edits, etc. Detection: critical listening,
instrumental analysis, magnetic development, and spectrum analysis.
CONTEMPORARY/FUTURE CHALLENGES
Digital editing of both audio and video tapes has greatly complicated the
authentication process and increases the likelihood that altered tapes can
escape detection. There are at least 30 different desktop computer editing
workstations or digital recorders which can be used as “turnkey” editing systems.
Software and add-on computer cards can transform an IBM or Macintosh
computer into a sophisticated digital audio editing machine. Some of the systems
require a digital audio recorder for initial conversion of the analog format before
accessing the computer hardware. These editing workstations were originally
designed by the motion picture and recording industries to correct subtle errors in
multi-track releases and can now be purchased at prices as low as $300 for the
software. The editing options are practically inexhaustible and provide the
operator the ability to alter the tape in a word-processing format (i.e., cut and
paste, copy, delete, etc.) while selecting playback files which can help “shape”
the sound. The typical telltale signs of traditional analog recorder editing,
including clicks, pops, and other short-duration sounds, can now be effectively
removed with little if any detectable audible clue.
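As a rough sketch of how trivially such word-processing-style edits are made (synthetic sample values, not any particular workstation's interface), the following deletes a span of digitized audio and crossfades across the joint so that no click, pop, or step discontinuity marks the cut:

```python
import numpy as np

def delete_segment(samples, start, end, fade=32):
    """Cut samples[start:end] and crossfade across the joint so the edit
    leaves no abrupt transient at the cut point."""
    head, tail = samples[:start], samples[end:]
    ramp = np.linspace(1.0, 0.0, fade)
    joint = head[-fade:] * ramp + tail[:fade] * (1.0 - ramp)
    return np.concatenate([head[:-fade], joint, tail[fade:]])

# Two seconds of a hypothetical tone; excise a half-second span
sr = 8000
t = np.arange(2 * sr) / sr
audio = np.sin(2 * np.pi * 220 * t)
edited = delete_segment(audio, 6000, 10000)
# The result is shorter by the deleted span plus the crossfade overlap,
# with no click or pop audible at the edit point.
```

This is exactly why the traditional analog telltales are unreliable against digital edits: the discontinuity that critical listening would catch is smoothed away at the sample level.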
Examples of various editing processes, including related hardware and/or
equipment:
1) Pitch Shift Telephones
2) Vocal Disguise through synthesized speech (Votrax or Dectalk).
3) Computer Manipulation of speech formant data (Kay Elemetrics Model 4300
and ASL programs - Re-synthesis of Human Speech)
4) Additive mixing of noise or other background and foreground signals into ongoing speech.
5) Signal Processing Filters (analog and digital)
a. Phasing Anomalies
b. Chorusing
c. Harmonic Distortion
d. Reverberation
e. Filtering of Selective Frequencies
f. Channel Switching
The threat posed by digital editing is of increasing concern to the courts.
It is therefore more imperative than ever that both the original tapes and the
recorders be made available for inspection. Both the FBI Signal Analysis Branch
and other certified acoustic tape experts recognize that it is essential for the
contributing attorney to provide all of the original tapes and related recording
equipment before a complete authentication can be accomplished.
Professor Chial and I have attempted to explain some of the more traditional and
more recent methods of detecting falsified or edited audiotape recordings;
identify the various physical and instrumental techniques for detecting signs of
tape falsification; discuss various examples of falsified tapes, and lastly to briefly
discuss the increasing threat caused by digital computer-based editing systems.
It is relatively easy to change the content of a recording by deleting words or
obscuring meaning with over-recorded sounds or by transforming the context
through rearrangement of selected phrases or added words. Nevertheless,
falsifications normally leave detectable magnetic and waveform acoustic
signatures which can lead to forensic individualization of the evidential recorders
and tapes.
Note: For additional information see the following published articles:
Koenig, Bruce E., “Authentication of Forensic Audio Recordings,” Journal of the
Audio Engineering Society, vol. 38, 1990.
Weiss, Mark, et al., The National Commission for the Review of Federal and
State Wire Tapping Laws, 1976.
Cain, Steve, “Verifying the Integrity of Audio and Video Tapes,” The Champion
Magazine, Summer 1993.
“Sound Recordings as Evidence in Court Proceedings,” The Prosecutor
Magazine, Sept./Oct. 1995.
AES standard for forensic purposes — Criteria for the authentication of
analog audio tape recordings
Users of this standard are encouraged to access http://www.aes.org/standards to
determine if they are using the latest printing incorporating all current
amendments and editorial corrections.
This document has been reproduced by Global Engineering Documents with the
permission of AES under a royalty agreement.
AUDIO ENGINEERING SOCIETY, INC.
60 East 42nd Street, New York, New York 10165, USA
AES standard for forensic purposes
Criteria for the authentication of analog audio tape recordings
Published by
Audio Engineering Society, Inc.
Copyright © 2000 by the Audio Engineering Society
Abstract
The purpose of this standard is to formulate a standard scientific procedure for
the authentication of audio tape recordings intended to be offered as evidence or
otherwise utilized in civil, criminal, or other fact finding proceedings.
An AES standard implies a consensus of those directly and materially affected by
its scope and provisions and is intended as a guide to aid the manufacturer, the
consumer, and the general public. The existence of an AES standard does not in
any respect preclude anyone, whether or not he or she has approved the
document, from manufacturing, marketing, purchasing, or using products,
processes, or procedures not in agreement with the standard. Prior to approval,
all parties were provided opportunities to comment or object to any provision.
Approval does not assume any liability to any patent owner, nor does it assume
any obligation whatever to parties adopting the standards document. This
document is subject to periodic review and users are cautioned to obtain the
latest edition.
AES43-2000
Contents
Foreword 3
1 Scope 4
2 Normative references 4
3 Definitions 4
4 Verification of authenticity 5
4.1 Criteria 5
4.2 Equipment 5
4.3 Reporting 5
5 Examination and analysis 6
5.1 Evidence management 6
5.2 Critical listening and waveform examination 7
5.3 Photo-microscopic analysis 8
5.4 The formulation of an opinion and conclusion 8
6 Testimony 9
6.1 Preparation 9
6.2 Problems 10
Annex A Informative references 11
Foreword
[This foreword is not a part of AES standard for forensic purposes — criteria for
the authentication of analog audio tape recordings, AES43-2000.]
This document was developed by a writing group, headed by A. Pellicano, of the
SC-03-12 Working Group on Forensic Audio of the SC-03 Subcommittee on the
Preservation and Restoration of Audio Recordings. The writing group was formed
to execute project AES-X48.
It results from an international consensus and is not intended to reflect the
practice of any single nation. As an AES standard, it is an international
professional society’s statement of technical good practice, but its use is entirely
voluntary and it does not have the status of a governmental regulation.
Nevertheless, any claim to voluntary compliance with the standard implies
acceptance of its mandatory clauses.
In 1991, SC-03-12 was organized as AESSC WG-12 at the request of a
community of engineers from the AES, the Acoustical Society of America,
various law enforcement agencies, and groups concerned with testimony. The
group concerns itself with the handling, authentication, and enhancement of
audio recorded materials, basing itself on methodologies developed from those
described in Bolt, Cooper, Flanagan, McKnight, Stockham, and Weiss,
Report on a Technical Investigation Conducted for the U.S. District Court for the
District of Columbia by the Advisory Panel on the White House Tapes. May 31,
1974.
This document results from one of the projects set out at the early meetings of
the working group.
Tom Owen, Chair of SC-03-12
Michael McDermott, Vice-Chair of SC-03-12
1999-09-03
AES standard for forensic purposes —
Criteria for the authentication of analog audio tape
recordings
1 Scope
This standard specifies the minimum procedure for the authentication of analog
audio tape recordings intended to be offered as evidence or otherwise utilized in
civil, criminal, or other fact finding proceedings. It does not specify or restrict
additional testing procedures that can be used.
These methodologies are suggested to any and all individuals and groups who
hold themselves out to be or are recognized as forensic tape analysts or experts.
This standard is a set of procedures set forth to inform attorneys, courts, and
other interested parties. It also serves to aid interested parties who are
attempting to determine whether or not the procedures and methodologies of
potential, chosen, or opposing experts are of a scientific nature and would
withstand objective scrutiny.
2 Normative references
The following standard contains provisions that, through reference in this text,
constitute provisions of this document. At the time of publication, the edition
indicated was valid. All standards are subject to revision, and parties to
agreements based on this document are encouraged to investigate the possibility
of applying the most recent editions of the indicated standards.
AES27- 1996, AES recommended practice for forensic purposes — Managing
recorded audio materials intended for examination.
3 Definitions
3.1
authentication
authentic recording and authenticity analysis as defined in AES27
3.2
forensic tape analyst
FTA
entity performing authentication according to this standard
3.3
designated original recording
DOR
original recording as defined in AES27
3.4
designated originating recording device
DORD
original recorder as defined in AES27
3.5
employer
engaging party
entity engaging the services of an FTA
3.6
cassette
device composed of a case containing two coplanar or superimposed hubs or
reels on which a magnetic tape is wound, so that the tape can move from hub
(reel) to hub (reel) during recording, reproduction, a fast forward movement, or
rewinding, and can be easily and instantaneously inserted in a recording-reproducing equipment or in a reproducer designed for this purpose, without
handling the magnetic tape
3.7
memorialization
legally acceptable documentation of evidence
3.8
test recording
recording made by the FTA, using the designated originating recording device
and a non-evidence blank tape, for the purpose of determining certain
performance characteristics of the recording device
3.9
signature
waveform or microscopic visualization (or demonstration) of record events either
located on the DOR or created on a test recording, or both, utilizing the DORD or
any tape recording device examined by the FTA for the purpose of identification
or comparison during an examination
4 Verification of authenticity
4.1 Criteria
Verification is predicated upon two sets of criteria:
a) that a person, whether a law enforcement official or any individual stated, if
called upon, could or would testify under penalty of perjury, that the tape
recorded evidence presented as the DOR is, in fact, the tape material utilized to
create the recording at the exact time that the occurrence, interview, interrogation,
or recorded content actually took place;
b) that by a comprehensive examination procedure and scientific means the FTA
was able to determine that it is the original.
4.2 Equipment
The FTA shall examine the DOR along with and utilizing the DORD. The FTA
shall render findings that would scientifically evince that the DORD recorded the
designated original recording, and found no conclusive evidence of tampering,
unauthorized editing, or any form of intentional deletions, material or otherwise,
within the recorded content.
4.3 Reporting
The FTA may then render an opinion that the recording has passed the
procedure or standard for authentication and that the questioned tape recording
is authentic in physical state and in content.
5 Examination and analysis
5.1 Evidence management
Except where otherwise specified in this standard, evidence management
practices shall comply with AES27.
5.1.1 Physical examination
5.1.1.1 Record-prevention punch-out tabs
If the audio evidence is contained in a tape cassette that features record-prevention punch-out tabs, the FTA should try to obtain permission to remove
them or the FTA may remove the tabs at its discretion. If the tabs are removed,
the FTA shall attach the removed record-prevention tabs to a suitable carrier
such as a file card by means of a nondestructive and removable adhesive such
as transparent adhesive tape. The carrier shall be placed inside a sealed
envelope, with the date and time that the envelope was sealed and the signature
of the FTA written across the seal. The cassettes shall be comprehensively
photographed or videotaped before and after the removal of the punch-out tabs.
5.1.1.2 Operating condition
When the tape recorded evidence is contained in a cassette, the cassette shall
be carefully examined to determine that it is operable. The FTA shall inspect the
cassette, making sure that there is no obstruction to the tape. The FTA shall also
look for apparent tears or splices on the tape material itself that could possibly
obstruct or deter playback. The FTA shall carefully rotate the tape hubs in both
directions to detect any hidden obstruction that could hinder playback. When
examining a reel of tape, the same care and caution shall be executed.
NOTE Playback of a damaged tape can produce further damage to the tape.
5.1.1.2.1 Notification of damage
If during the physical examination, the FTA finds evidence of physical tampering
or damage to the cassette or the tape material, the FTA shall immediately inform
its employer that the submitting party shall be notified. If the cassette or tape
material can be repaired, then the FTA shall obtain written permission from the
submitting party prior to proceeding with any repairs or modifications. Whether or
not the FTA receives permission to repair the damage or remove the tape
material and place it in a new cassette or otherwise prepare the DOR to be
available for playback, the FTA shall photograph or videotape the evidence for
reference to memorialize the discovery. If the tape or cassette is repaired, the
videotape or photographs shall comprehensively depict the repair.
5.1.1.2.2 Splices
If a physical splice is located, the splice shall be noted and photographed or
videotaped at the time of the observation.
5.1.1.3 DORD condition
The DORD and any accompanying apparatus such as separate microphones,
switching devices, and similar accessories shall be inspected and examined to
determine that they are operational. After the FTA concludes the visual
inspection, a compatible tape shall be placed in the DORD and the functions of
the DORD shall be tested to ensure that it can play back the DOR without
damage to the DOR or the DORD.
5.1.1.3.1 Notification
If the DORD is not functional, the FTA shall inform its employer that the
submitting party shall be notified. If the DORD can be repaired, then the FTA
shall obtain written permission to do so. If the repair necessitates the
replacement of the record-playback head, the erase head or both, the FTA shall
indicate to the employer that the replacement of the head or heads negates an
authentication procedure and that the FTA report of findings relates only to the
examination of the DOR. All repairs shall be comprehensively memorialized
including who repaired the recorder and at what facility. All replaced parts shall
be maintained as evidence by the FTA.
5.1.2 Verification
Compliance with 5.1 shall be verified and attested to by the FTA before
proceeding further with the evidence.
5.2 Critical listening and waveform examination
The critical listening and waveform examination procedures can assist an FTA in
attempting to determine whether or not any anomalies are present on the
questioned recording.
5.2.1 The FTA shall produce a first test recording containing known exemplars of
the functions of the DORD. It should include a minimum of ten start recording
signatures, ten stop recording signatures, ten stop-start recording signatures, ten
pause signatures (assuming that the recording device has this feature), and if the
DORD is so equipped, ten voice activation signatures. Other test recordings may
be produced which should include over-recordings and other variations of the
record functions of the recording device if necessary or appropriate.
5.2.2 The designated recording device shall be utilized to play back the test
recording. The playback should be rehearsed to ensure that the level of playback
is appropriate. That setting should be fixed by either carefully applying tape
across the volume control of the recorder or exacting some form of measurement
that would ensure that the playback output level can be reasonably reproduced.
5.2.3 The first test recording should be played back into a configuration of either
a computerized method of storing the playback on a hard disk, or some form of
memory device that would allow repetitive playback. Many programs are now
available to digitize the playback and store that information on hard drives. They
further allow an array of playback functions, and most have features that would
enable the FTA to view the waveform.
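A minimal sketch of the capture-and-store step described in 5.2.3, using Python's standard `wave` module (an illustration under assumed parameters, not the specific hardware or software the standard contemplates): the digitized playback is written to disk as 16-bit PCM and read back losslessly for repeated listening and waveform viewing.

```python
import wave
import numpy as np

def save_wav(path, samples, sr):
    """Write 16-bit mono PCM samples to disk for repeatable playback."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sr)
        w.writeframes(samples.astype(np.int16).tobytes())

def load_wav(path):
    """Read the stored capture back into an array for waveform viewing."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        data = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    return data, rate

# A short synthetic tone stands in for the digitized playback of the test recording
sr = 8000
tone = (1000 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)).astype(np.int16)
save_wav("capture.wav", tone, sr)
data, rate = load_wav("capture.wav")
# The samples round-trip bit for bit, so every repeated playback pass
# presents the identical signal to the examiner.
```

Because the stored copy is digital, repetitive playback introduces none of the wear that repeated passes over the evidential tape would.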
5.2.4 Once the signal or audio from the test recording has been stored, playback
of the digitized recording can take place to enable the FTA to listen to the
recording while viewing the waveform. The FTA can then be informed as to how
the record functions of the designated recording device sound (assuming that the
functions generate a discernible audible sound when played back) and are
visually demonstrated or appear in the waveform domain. The FTA should then
study and scrutinize the signatures so that it can be reasonably acquainted with
how the function signatures of the designated recording device sound and are
seen or demonstrated in the waveform domain.
5.2.5 The DOR shall be played back with as close to the exact output level and
through the exact configuration as the test recording. The output volume control
of the DORD may be adhesive-taped to a fixed position until all of the test
recordings are created and subsequently stored. Once the signal or audio from
the DOR is stored, then the FTA shall critically listen to the content while viewing
the waveform.
5.2.6 The FTA should then produce, by the safest and best means possible, at
least two first-generation copies of the DOR for reference and to evince the state
of the recorded content at or about the time of receipt. If the FTA is asked for
copies, then copies should be provided appropriately labeled and marked.
5.2.7 The critical listening and waveform examination should occur as often as
the FTA deems it necessary in order to answer the following questions.
a) Was the content consistent and uninterrupted throughout the entirety of the
questioned tape recording? If not, then the location of the gaps, dropouts, over-recordings, or any other form of disruption should be delineated for further
examination and analysis. If there are other apparent unrelated recordings, they
should be cataloged for reference and/or possible further examination and
analysis.
b) Were there any identifiable record function signatures detected and located in
the content? If so, are they consistent with the test recording exemplars? If not,
they should be designated as possible anomalies. In either case, they should be
labeled or otherwise delineated for further analysis.
c) Was there any form of anomalous or otherwise perceptible aural or visible
indications in the playback or waveform display? If so, their presence should be
labeled or otherwise delineated for further analysis. This question would include
level changes, apparent or obvious differences in background content, or any
other form of aurally perceptible variances.
d) Were there background conversations or content? For example, were there
radio communications or other perceptible speech, or repetitive noise that would
aid in determining authentication? If so, they should be labeled or otherwise
delineated for reference and further analysis.
e) If (a) through (d) render any form of anomaly or evident difference, then further
test recordings utilizing the DORD should be produced in an attempt to recreate
or mimic the differences or anomalies detected and located. Whether or not the
further tests can do so, that finding should be reported.
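The search for gaps and dropouts in question (a) can be assisted by simple short-time energy screening. The following Python sketch is a simplified illustration (thresholds and frame sizes are assumptions, and real examinations rely on the examiner's critical listening as well): it flags frames whose RMS falls below a silence threshold and reports contiguous silent spans in seconds.

```python
import numpy as np

def find_gaps(samples, sr, frame=400, threshold=0.01):
    """Flag frames whose RMS falls below a silence threshold; contiguous
    runs of flagged frames are candidate gaps for closer examination."""
    n = len(samples) // frame
    rms = np.sqrt(np.mean(samples[:n * frame].reshape(n, frame) ** 2, axis=1))
    gaps, start = [], None
    for i, quiet in enumerate(rms < threshold):
        if quiet and start is None:
            start = i
        elif not quiet and start is not None:
            gaps.append((start * frame / sr, i * frame / sr))
            start = None
    if start is not None:
        gaps.append((start * frame / sr, n * frame / sr))
    return gaps

# Three seconds of synthetic signal with a half-second silent gap inserted
sr = 8000
sig = np.sin(2 * np.pi * 300 * np.arange(3 * sr) / sr)
sig[8000:12000] = 0.0
print(find_gaps(sig, sr))  # prints [(1.0, 1.5)]
```

Each reported span is then a location to examine by critical listening, waveform display, and, if warranted, magnetic development.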
5.2.8 These and other findings should be reported upon, verified in the waveform,
and their precise location noted for future reference. Once these procedures
have been accomplished, then the next step shall be to perform the photo-microscopic examination and analysis.
5.3 Photo-microscopic analysis
5.3.1 Test recordings for the specific purpose of photo-microscopic analysis
should be produced. These test recordings should include all of the record
function signatures of the DORD.
5.3.2 The test recordings should be examined under the microscope, in a
scientific manner, which would allow the
FTA to view the magnetic domain (Bitter patterns) of the record function
signatures of the test recording examined.
See annex A for informative references.
5.3.3 The known exemplars produced, viewed, and examined can familiarize the
FTA with how the function signatures of the DORD appear. The FTA can now be
enabled to make measurements, take photographs, videotape, and otherwise
memorialize the procedure and the resulting findings.
5.3.4 The FTA can now perform the same examination and analysis upon the
designated original recording. This procedure, when performed in a scientific
manner, can enable the FTA to attempt to identify the signatures located on the
questioned recording. The FTA can now make comparisons and other forms of
tests resolving the issue of authenticity as it pertains to the recording that the
FTA is examining. It further allows the opportunity to demonstrate the findings
by means of measurements, photographs, videotapes, or any other form of
demonstrative means that could be reviewed by the employers, the courts, or
juries and other experts, opposing or consulting.
5.3.5 An FTA can now draw conclusions from these findings, including whether
or not the DORD actually recorded the designated original recording. An FTA’s
finding could either validate this fact or disprove it. In some cases no definitive
solution can be made.
5.4 The formulation of an opinion and conclusion
5.4.1 Once an FTA has performed all of the testing procedures and rendered
scientific findings, it should be sure to have:
a) performed all of the tests and examinations in a scientific manner, that if
recreated or duplicated by another expert would render the exact same findings;
for example, if the FTA has found and identified a stop/start recording signature
at a specific location on the questioned recording, another expert or analyst could
or would find and identify that same signature at the same location;
b) produced comprehensible and repeatable graphic waveform displays,
printouts, or any other form of graphic rendering that would demonstrate the
FTA’s findings in the waveform domain, so that another expert or any other
individual could view them in an effort to determine whether the FTA’s findings
exist and are valid;
c) produced sufficient photographs, videotapes, or any other form of definitive
renderings that would demonstrate the FTA’s findings in the magnetic domain, so
that another expert or any other individual could view them in an effort to
determine whether the FTA’s findings exist and are valid.
5.4.2 If asked, an FTA should render a comprehensive report that would
effectively demonstrate all of the procedures and findings, in a scientific manner,
that would survive objective scrutiny and lend credence to its opinion and
conclusion.
5.4.3 As to what an FTA hears or perceives in the playback of the DOR that is
not demonstrable, that information would be categorized as subjective and left to
the courts, juries, or other parties to determine its relevance, validity, or both. It
may, however, be reported thereon.
5.4.4 After an FTA has completed all of the tests and examinations, has analyzed
and memorialized all of the findings, and has either rendered a comprehensive
written report or rendered an oral report to the employers regarding this opinion
and conclusion, based on a high degree of scientific certainty, an FTA may be
permitted to testify as to its opinion and conclusion.
6 Testimony
Once an FTA has finalized its examination and analysis and reached a definitive
conclusion and opinion, the FTA may be available for testimony if called upon to
do so.
6.1 Preparation
6.1.1 To adequately prepare for testimony, an FTA shall attend to its files so that
notes, correspondence, data, and other written or otherwise demonstrable
information are in a comprehensive form. This requirement includes the
cataloging of all the evidence submitted, the test recordings produced, and any
and all demonstrative renderings that may be requested to be viewed by the
opposing parties, their experts, and the engaging party.
6.1.2 Once the files are in order, then an FTA should review all findings in a
comprehensive fashion to determine that all of the calculations, demonstrative
renderings, reports, and supporting information are complete and, most
importantly, accurate. The FTA should thoroughly review its deposition, if one
had occurred, and any and all forms of reports it may have previously rendered.
6.1.3 When an FTA is reasonably assured that it is prepared, the FTA shall
proceed to prepare its employer and, at the very least, demonstrate the
following:
a) that the FTA followed the criteria for authentication as strictly as possible;
b) that the FTA had attained a high degree of scientific certainty as to its findings,
opinion, and conclusion;
c) that all of the FTA’s demonstrative waveform or spectral renderings are
accurate and truthfully demonstrate all of the findings which it claims are located
on the questioned recording; further, that if any other competent expert or party
examined the FTA’s waveforms, it can locate the signatures, events, edits, or
anomalies that are graphically demonstrated in the FTA’s depictions at or about
the same location as did the FTA presenting the findings;
d) that all of the FTA’s photographs or other forms of visual magnetic domain
renderings are accurate and truthfully demonstrate its findings located on the
questioned recording; further that if any other competent expert or party
examined the magnetic domain, that party could and would locate the signatures,
events, edits, or anomalies that are demonstrated in the FTA’s depictions at or
about the same location as did the FTA when presenting the findings;
e) that the FTA has performed its examination in the utmost unbiased ethical
manner and that it believes the findings, opinion, and conclusion would withstand
the scrutiny of peers and the legal process;
f) that the FTA should submit or make available to its employer all reference
materials, instrumentation manuals, literature or any other form of documentation
or data that the FTA has relied upon during its examination and analysis, in
rendering its opinion, or both. Further, the FTA should attempt to familiarize its
employer with the syntax, nomenclature, or terminology utilized in the field;
g) that the FTA should assert that its employer can rely upon the FTA to
professionally and truthfully testify as to the findings with the utmost assurance
within its capabilities and competence.
6.1.4 At this point the engaging party may further interview or mock
cross-examine an FTA in an attempt to ascertain any issues relating to the findings,
opinions and conclusions rendered or any issues relating to prior testimony given
by an FTA.
6.2 Problems
6.2.1 From time to time there are problems educating or relating findings to the
engaging party. The FTA should make itself available in an effort to clearly
address the issues raised by its findings or the engaging party’s apprehensions,
if any exist.
6.2.2 If an FTA senses or is otherwise led to believe that the engaging party has
difficulty in comprehending the issues, or its findings, opinions, and conclusions,
the FTA may suggest further preparation or offer the services of another expert
to further clarify the issues or perform an independent examination and analysis
of the questioned recording, in an attempt to satisfy the doubt of the engaging
party or otherwise assure it of the testimony to be presented.
Forensic speaker identification based on spectral moments
R. Rodman,* D. McAllister,* D. Bitzer,* L. Cepeda* and P. Abbitt†
*Voice I/O Group: Multimedia Laboratory
Department of Computer Science
North Carolina State University
rodman@csc.ncsu.edu
†Department of Statistics
North Carolina State University
ABSTRACT A new method for doing text-independent speaker identification geared to
forensic situations is presented. By analysing ‘isolexemic’ sequences, the method
addresses the issues of very short criminal exemplars and the need for open-set
identification. An algorithm is given that computes an average spectral shape of the
speech to be analysed for each glottal pulse period. Each such spectrum is converted to a
probability density function and the first moment (i.e. the mean) and the second moment
about the mean (i.e. the variance) are computed. Sequences of moment values are used as
the basis for extracting variables that discriminate among speakers. Ten variables are
presented, all of which have sufficiently high inter- to intraspeaker variation to be
effective discriminators. A case study comprising a ten-speaker database, and ten
unknown speakers, is presented. A discriminant analysis is performed and the statistical
measurements that result suggest that the method is potentially effective. The report
represents work in progress.
KEYWORDS speaker identification, spectral moments, isolexemic sequences, glottal
pulse period
PREFACE
Although it is unusual for a scholarly work to contain a preface, the controversial nature
of our research requires two caveats, which are herein presented.
First, the case study described in our article to support our methodology was performed
on sanitized data, that is, data not subjected to the degrading effect of telephone
transmission or a recording medium such as a tape recorder. We acknowledge, in
agreement with Künzel (1997), that studies based strictly on formant frequency values
are undermined by telephone transmission. Our answer to this is that our methodology is
based on averages of entire spectral shapes of the vocal tract. These spectra are derived
by a pitch synchronous Fourier analysis that treats the vocal tract as a filter that is driven
by the glottal pulse treated as an impulse function. We believe that the averaging of such
spectral shapes will mitigate the degrading effect of the transmittal medium. The purpose
of this study, however, is to show that the method, being novel, is promising when used
on ‘clean’ data.
We also acknowledge, and discuss below in the ‘Background’ section, the fact that
historically spectral parameters have not proved successful as a basis for accurate speaker
identification. Our method, though certainly based on spectral parameters, considers
averages of entire, pitch independent spectra as represented by spectral moments, which
are then plotted in curves that appear to reflect individual speaking characteristics. The
other novel part of our approach is comparing ‘like-with-like’. We base speaker
identification on the comparison of manually extracted ‘isolexemic’ sequences. This, we
believe, permits accurate speaker identification to be made on very short exemplars. Our
methods are novel and so far unproven on standardized testing databases (though we are
in the process of remedying this lacuna). The purpose of this article is to publicize our
new methodology to the forensic speech community both in the hopes of stimulating
research in this area, and of engendering useful exchanges between ourselves and other
researchers from which both parties may benefit.
INTRODUCTION
Speaker identification is the process of determining who spoke a recorded
utterance. This process may be accomplished by humans alone, who compare a spoken
exemplar with the voices of individuals. It may be accomplished by computers alone,
which are programmed to identify similarities in speech patterns. It may alternatively be
accomplished through a combination of humans and computers working in concert, the
situation described in this article.
Whatever the case, the focus of the process is on a speech exemplar – a recorded
threat, an intercepted message, a conspiracy recorded surreptitiously – together with the
speech of a set of suspects, among whom may or may not be the speaker of the exemplar.
The speech characteristics of the exemplar are compared with the speech characteristics
of the suspects in an attempt to make the identification.
More technically and precisely, given a set of speakers S = {S1 … SN}, a set of collected
utterances U = {U1 … UN} made by those speakers, and a single utterance uX made by
an unknown speaker: closed-set speaker identification determines a value for X in [1 …
N]; open-set speaker identification determines a value for X in [0, 1 … N], where X = 0
means ‘the unknown speaker SX ∉ S’. ‘Text independent’ means that uX is not necessarily
contained in any of the Ui.
During the process, acoustic feature sets {F1 … FN} are extracted from the
utterances {U1 … UN}. In the same manner, a feature set FX is extracted from uX. A
matching algorithm determines which, if any, of {F1 … FN} sufficiently resembles FX.
The identification is based on the resemblance and may be given with a probability-of-error coefficient.
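The closed- versus open-set distinction above can be illustrated with a minimal sketch. This is not the matching algorithm of the article: the feature sets are plain vectors, the resemblance measure is assumed to be Euclidean distance, and the rejection threshold is an assumed parameter.

```python
import math

def identify_speaker(feature_sets, f_x, threshold):
    """Toy open-set identification: return the 1-based index of the known
    feature set closest to f_x, or 0 (i.e. X = 0) if even the closest one
    is farther away than the rejection threshold."""
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    distances = [dist(f, f_x) for f in feature_sets]
    best = min(range(len(distances)), key=lambda i: distances[i])
    if distances[best] > threshold:
        return 0  # open set: the unknown speaker is none of S1 ... SN
    return best + 1

# The exemplar's features resemble those of speaker 1, so X = 1:
print(identify_speaker([[1.0, 2.0], [5.0, 5.0]], [1.1, 2.1], threshold=1.0))  # prints 1
```

Omitting the threshold test (always returning `best + 1`) turns the same fragment into closed-set identification.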
Forensic speaker identification is aimed specifically at an application area in
which criminal intent occurs. This may involve espionage, blackmail, threats and
warnings, suspected terrorist communications, etc. Civil matters, too, may hinge on
identifying an unknown speaker, as in cases of harassing phone calls that are recorded.
Often a law enforcement agency has a recording of an utterance associated with a crime
such as a bomb threat or a leaked company secret. This is uX. If there are suspects (the
set S), utterances are elicited from them (the set U), and an analysis is carried out to
determine the likelihood that one of the suspects was the speaker of uX, or that none of
them was. Another common scenario is for agents to have a wiretap of an unknown
person who is a suspect in a crime, and a set of suspects to test the recording against.
Forensic speaker identification distinguishes itself in five ways. First, and of
primary importance, it must be open-set identification. That is, the possibility that none of
the suspects is the speaker of the criminal exemplar must be entertained. Second, it must
be capable of dealing with very short utterances, possibly under five seconds in length.
Third, it must be able to function when the exemplar has a poor signal-to-noise ratio. This
may be the result of wireless communication, of communication over low-quality phone
lines, or of data from a ‘wire’ worn by an agent or informant, among others. Fourth, it
must be text independent. That is, identification must be made without requiring suspects
to repeat the criminal exemplar. This is because the criminal exemplar may be too short
for statistically significant comparisons. As well, it is generally true that suspects will
find ways of repeating the words so as to be acoustically dissimilar from the original.
Moreover, it may be of questionable legality as to whether a suspect can be forced to
utter particular words. Fifth, the time constraints are more relaxed. An immediate
response is generally not required so there is time for extensive analysis, and most
important in our case, time for human intervention. The research described below
represents work in progress.
BACKGROUND
The history of electronically assisted speaker identification began with Kersta (1962), and
can be traced through these references: Baldwin and French (1990), Bolt (1969), Falcone
and de Sario (1994), French (1994), Hollien (1990), Klevans and Rodman (1997), Koenig
(1986), Künzel (1994), Markel and Davis (1978), O’Shaughnessy (1986), Reynolds and
Rose (1995), Stevens et al. (1968) and Tosi (1979).
Speaker identification can be categorized into three major approaches. The first is
to use long-term averages of acoustic features. Some features that have been used are
inverse filter spectral coefficients, pitch, and cepstral coefficients (Doddington 1985).
The purpose is to smooth across factors influencing acoustic features, such as choice of
words, leaving behind speaker-specific information. The disadvantage of this class of
methods is that the process discards useful speaker-discriminating data, and can require
lengthy speech utterances for stable statistics.
The second approach is the use of neural networks to discriminate speakers.
Various types of neural nets have been applied (Rudasi and Zahorian 1991, Bennani and
Gallinari 1991, Oglesby and Mason 1990). A major drawback to the neural net methods
is the excessive amount of data needed to ‘train’ the speaker models, and the fact that
when a new speaker enters the database the entire neural net must be retrained.
The third approach – the segmentation method – compares speakers based on
similar utterances or at least using similar phonetic sequences. Then the comparison
measures differences that originate with the speakers rather than the utterances. To date,
attempts to do a ‘like phonetic’ comparison have been carried out using speech
recognition front-ends. As noted in Reynolds and Rose (1995), ‘It was found in both
studies [Matsui and Furui 1991, Kao et al. 1992] that the front-end speech recognizer
provided little or no improvement in speaker recognition performance compared to no
front-end segmentation.’
The Gaussian mixture model (GMM) of speakers described in Reynolds and Rose
(1995) is an implicit segmentation approach in which like sounds are (probabilistically)
compared with like. The acoustic features are of the mel-cepstral variety (with some other
preprocessing of the speech signal). Their best result in a closed-set test using five-second
exemplars was correct identification in 94.5% ± 1.8 of cases using a population of
16 speakers (Reynolds and Rose 1995: 80). Open-set testing was not attempted.
Probabilistic models such as Hidden Markov Models (HMMs) have also been
used for text-independent speaker recognition. These methods suffer in two ways. One is
that they require long exemplars for effective modelling. Second, the HMMs model
temporal sequencing of sounds, which ‘for text-independent tasks … contains little
speaker-dependent information’ (Reynolds and Rose 1995: 73).
A different kind of implicit segmentation was pursued in Klevans and Rodman
(1997) using a two-level cascading segregating method. Accuracies in the high 90s were
achieved in closed-set tests over populations (taken from the TIMIT database) ranging in
size from 25 to 65 from similar dialect regions. However, no open-set results were
attempted.
In fact, we believe the third approach – comparing like utterance fragments with
like – has much merit, and that the difficulties lie in the speech recognition process of
explicit segmentation, and the various clustering and probabilistic techniques that
underlie implicit segmentation. In forensic applications, it is entirely feasible to do a
manual segmentation that guarantees that lexically similar partial utterances are compared.
This is discussed in the following section.
SEMI-AUTOMATIC SPEAKER IDENTIFICATION
Semi-automatic speaker identification permits human intervention at one or more stages
of computer processing. For example, the computer may be used to produce spectrograms
(or any of a large number of similar displays) that are interpreted by human analysts who
make final decisions (Hollien 1990).
One of the lessons that has emerged from nearly half a century of computer
science is that the best results are often achieved by a collaboration of humans and
computers. Machine translation is an example. Humans translate better, but slower;
machines translate poorly, but faster. Together they translate both better and faster, as
witnessed by the rise in popularity of so-called CAT (Computer-aided Translation)
software packages. (The EAMT – European Association for Machine Translation – is a
source of copious material on this subject, for example, the Fifth EAMT Workshop held in
Ljubljana, Slovenia in May of 2000.)
The history of computer science also teaches us that while computers can achieve
many of the same intellectual goals as humans, they do not always do so by imitating
human behaviour. Rather, they have their own distinctly computational style. For
example, computers play excellent chess but they choose moves in a decidedly nonhuman way.
Our speaker identification method uses computers and humans to extract
isolexemic sound sequences, which are then heavily analysed by computers alone to
extract personal voice traits. The method is appropriate for forensic applications, where
analysts may have days or even weeks to collect and process data for speaker
identification.
Isolexemic sequences may consist of a single phone (sound); several phones such
as the rime (vowel plus closing consonant(s)) of a syllable (e.g. the ill of pill or mill); a
whole syllable; a word; sounds that span syllables or words; etc. What is vital is that the
sequence be ‘iso’ in the sense that it comes from the same word or words of the language
as pronounced by the speakers being compared. A concrete example illustrates the
concept. The two pronunciations of the vowel in the word pie, as uttered by a northern
American and a southern American, are isolexemic because they are drawn from the
same English word. That vowel, however, will be pronounced in a distinctly different
manner by the two individuals, assuming they speak a typical dialect of the area. By
comparing isolexemic sequences, the bulk of the acoustic differences will be ascribable
to the speakers. Speech recognizers are not effective at identifying isolexemic sequences
that are phonetically far apart, nor are any of the implicit segmentation techniques.
Only humans, with deep knowledge of the language, know that pie is the same word
regardless of the fact that the vowels are phonetically different, and despite the fact that
the same phonetic difference, in other circumstances, may function phonemically to
distinguish between different words. The same word need not be involved. We can
compare the ‘enny’ of penny with the same sound in Jenny knowing that differences –
some people pronounce it ‘inny’ – will be individual, not linguistic. Moreover, the human
analyst, using a speech editor such as Sound Forge™, is able to isolate the ‘enny’ at a
point in the vowel where coarticulatory effects from the j and the p are minimal.
In determining what sound sequences are ‘iso’, the analyst need not be concerned
with prosodics (pitch or intonation in particular) because, as we shall see, the analysis of
the spectra is glottal pulse or pitch synchronous, the effect of which is to minimize the
influence of the absolute pitch of the exemplars under analysis. In fact, one of the
breakthroughs in the research reported here is an accurate means of determining glottal
pulse length so that the pitch synchronicity can hold throughout the analysis of hundreds
of spectra (Rodman et al. 2000). Isolexemic comparisons cut much more rapidly to the
quick than any other way of comparing the speech of multiple speakers. Even three
seconds of speech may contain a dozen syllables, and two dozen phonetic units, all of
which could hypothetically be used to discriminate among speakers.
The manual intervention converts a text-independent analysis to the more
effective text-dependent analysis without the artifice of making suspects repeat
incriminating messages, which in any case does not work if the talker is uncooperative,
for he may disguise his voice (Hollien 1990: 233). (The disguise may take many forms:
an alteration of the rhythm by altering vowel lengths and stress patterns, switching
dialects for multidialectal persons, or faking an ‘accent’.)
For example, suppose the criminal exemplar is ‘There’s a bomb in Olympic Park
and it’s set to go off in ten minutes.’ Suspects are interviewed and recorded (text
independent), possibly at great length over several sessions, until they have eventually
uttered sufficient isolexemic parts from the original exemplar. For example, the suspect
may say ‘we met to go to the ball game’ in the course of the interview, permitting the
isolexemic ‘[s]et to go’ and ‘[m]et to go’ to be compared (text dependent). A clever
interrogator may be able to elicit key phrases more quickly by asking pointed questions
such as ‘What took place in Sydney, Australia last summer?’, which might elicit the word
Olympics among others. Or indeed, the interrogator could ask for words directly, one or
two at a time, by asking the suspect to say things like ‘Let’s take a break in ten minutes.’
The criminal exemplar and all of the recorded interviews are digitized (see below)
and loaded into a computer. The extraction of the isolexemic sequences is accomplished
by a human operator using a sound editor such as Sound Forge™. This activity is what
makes the procedure semi-automatic.
FEATURE EXTRACTION
All the speech to be processed is digitized at 22.050 kHz with 16-bit quantization, and stored
in .wav files. This format is suitable for input to any sound editor, which is used to extract
the isolexemes to be analysed. Once data are collected and the isolexemes are isolated,
both from the criminal exemplar and the utterances of suspects (in effect, the training
speech), the process of feature extraction can begin.
Feature extraction takes place in two stages. The first is the creation of ‘tracks’,
essentially an abbreviated trace of successive spectra. The second is the measurement of
various properties of the tracks, which serve as the features for the identification of
speakers.
Creating ‘tracks’
We discuss the processing of voiced sounds, that is, those in which the vocal cords are
vibrating throughout. The processing of voiceless sounds is grossly similar but differs in
details not pertinent to this article. (The interested reader may consult Fu et al. 1999.) Our
method requires the computation of an average spectrum for each glottal pulse (GP) –
opening and closing of the vocal cords – in the speech signal of the current isolexeme.
We developed an algorithm for the accurate computation of the glottal pulse period (GPP)
of a succession of GPs. The method, and the mathematical proofs that underlie it, and a
comparison with other methods, are published as Rodman et al. (2000).
By using a precise, pitch synchronous Fourier analysis, we produce spectral
shapes that reflect the shape of the vocal tract, and are essentially unaffected by pitch. In
effect, we treat the vocal tract as a filter that is driven by the glottal pulse, which is
treated as an impulse function. The resulting spectra are highly determined by vocal tract
shapes and glottal pulse shapes (not spacing). These shapes are speaker dependent and
this provides the basis for speaker identification.
We use spectral moments as representative values of these spectral shapes. We
use them as opposed, say, to formant frequencies, because they contain information
across the entire range of frequencies up to 4 kHz for voiced sounds, and 11 kHz for
voiceless sounds (not discussed in this article). The higher formants, and the distribution
of higher frequencies in general, have given us better results in our experiments than
pure formant frequencies and even than moments of higher orders (Koster 1995).
Knowing the GPP permits us to apply the following steps to compute a sequence
of spectral moments.
(Assume the current GPP contains N samples.)
1. Compute the discrete Fourier transform (DFT) using a window width of N, thus
transforming the signal from the time domain to the frequency domain.
2. Take the absolute value of the result (so the result is a real number).
3. Shift over 1 sample.
4. Repeat steps 1–3 N times.
5. Average the N transforms and scale by taking the cube root to reduce the influence of
the first formant, drop the DC term, and interpolate it with a cubic spline to produce a
continuous spectrum.
6. Convert the spectrum to a probability density function by dividing it by its mass, then
calculate the first moment m1 (mean) and the second central moment about the mean m2
(variance) of that function in the range of 0 to 4000 Hz and put them in two lists L1 and
L2. Let S(f) be the spectrum. The following formulae are used, appropriately modified for
the discrete signal:

\[
m_1 = \frac{\int_0^{4000} f\,S(f)\,df}{\int_0^{4000} S(f)\,df},
\qquad
m_2 = \frac{\int_0^{4000} (f - m_1)^2\,S(f)\,df}{\int_0^{4000} S(f)\,df}
\]
7. Repeat Steps 1 through 6 until less than 3N samples remain.
8. Scale each moment: m1 by 10⁻³ and m2 by 10⁻⁶.
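Steps 1–6 and 8 can be sketched in code for a single glottal pulse period as follows. This is a minimal reading of the algorithm, not the authors' implementation: the cubic-spline interpolation of Step 5 is omitted (moments are computed on the discrete averaged spectrum directly), and a direct DFT is used for clarity rather than speed, so the returned values are only approximations.

```python
import cmath
import math

def gpp_moments(signal, start, N, fs=22050.0):
    """Sketch of Steps 1-6 and 8 for one glottal pulse period (GPP) of N
    samples beginning at index `start` of `signal` (sample rate fs)."""
    K = N // 2 + 1                        # non-negative frequency bins
    avg = [0.0] * K
    # Steps 1-4: N DFTs of width N, each shifted one sample to the right;
    # the magnitude spectra (Step 2) are accumulated for averaging.
    for shift in range(N):
        frame = signal[start + shift : start + shift + N]
        for k in range(K):
            X = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N)
                    for n in range(N))
            avg[k] += abs(X)
    # Step 5: average the N transforms, take the cube root to reduce the
    # influence of the first formant, and drop the DC term.
    spec = [(a / N) ** (1.0 / 3.0) for a in avg[1:]]
    freqs = [k * fs / N for k in range(1, K)]
    pairs = [(f, s) for f, s in zip(freqs, spec) if f <= 4000.0]
    # Step 6: divide by the spectral mass (giving a probability density),
    # then take the mean m1 and variance m2 over 0-4000 Hz.
    mass = sum(s for _, s in pairs)
    m1 = sum(f * s for f, s in pairs) / mass
    m2 = sum((f - m1) ** 2 * s for f, s in pairs) / mass
    # Step 8: scale for readability.
    return m1 * 1e-3, m2 * 1e-6
```

Note that the sliding window requires `signal` to extend 2N − 1 samples past `start`, which matches the remark below that the window spans the length of two glottal pulses as it slides.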
Several comments about the algorithm are in order. The shifting and averaging in Steps
1–3 are effective in removing noise resulting from small fluctuations in the spectra, but
preserving idiosyncratic features of the vocal tract shape. Although the window spans the
length of two glottal pulses as it slides across, there is one spectral shape computed per
glottal pulse. The overlapping windows improve the sensitivity of the method. The
process is computationally intense but it yields track points that are reliable and
consistent in distinguishing talkers. The procedure also removes the pitch as a parameter
affecting the shape of the transform, as noted above.
In Step 5 the cube root is taken – at one time we took the logarithm – because the
first formant of voiced speech contains a disproportionate amount of the spectral energy.
The effect of taking the cube root ‘levels’ the peaks in the spectrum and renders the
spectrum more sensitive to speaker differences. The means and variances of Step 6 are
chosen as ‘figures of merit’ for the individual spectra. Although representing a single
spectrum over a 4 kHz bandwidth with two numbers appears to give up information, it
has the advantage of allowing us to track every spectrum in the isolexeme and to measure
the changes that occur. This dynamism leads to features that we believe to be highly
individuating because they capture the shape, position and movement of the speaker’s
articulators, which are unique to each speaker. (This is argued in more detail in Klevans
and Rodman 1997.) Also in Step 6, the division by the spectral mass removes the effect
of loudness, so that two exemplars, identical except for intensity, will produce identical
measurements. Finally, the scaling in Step 8 is performed so that we are looking at
numbers in [0, 3] for both means and variances. This is done as a matter of convenience.
It makes the resulting data more readable and presentable.
The result of applying the algorithm is a sequence of points in two-dimensional
m1-m2 space that can be interpolated to give a track. These are the values from the lists
L1 and L2. The tracks are smoothed by a three-stage cascading filter: median-5, average-3,
median-3. That is, the first pass replaces each value (except endpoints) with the median
of itself and the four surrounding values. The second pass takes that median-5 output and
replaces each point by the average of itself with the two surrounding values. That output
is subjected to the median-3 filter to give the final, smoothed track. The smoothing takes
place because the means and variances of the spectra make small jumps when the speech
under analysis is in a (more or less) steady state as in the pronunciation of vowels. This is
true especially for monophthongal vowels such as the ‘e’ in bed, but even in diphthongs
such as the ‘ow’ in cow, there are steady states that span several glottal pulse periods. The
smoothing removes much of the irrelevant effect of the jumps. (See also Fu et al. 1999,
Rodman et al. 1999.)
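The three-stage cascading filter is straightforward to state in code. The sketch below treats values too close to an end for a full window as passed through unchanged, which is one reasonable reading of "except endpoints".

```python
from statistics import median

def smooth_track(values):
    """Three-stage cascading filter from the text: median-5, then average-3,
    then median-3, applied to one list of track values (e.g. L1 or L2)."""
    def windowed(vals, width, combine):
        half = width // 2
        out = list(vals)
        for i in range(half, len(vals) - half):
            out[i] = combine(vals[i - half : i + half + 1])
        return out
    step1 = windowed(values, 5, median)                 # median-5
    step2 = windowed(step1, 3, lambda w: sum(w) / 3.0)  # average-3
    return windowed(step2, 3, median)                   # median-3
```

Applied separately to the L1 (mean) and L2 (variance) lists, a lone jump in an otherwise steady-state vowel segment is removed by the median-5 stage while the surrounding steady-state values pass through essentially unchanged.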
A visual impression of intra- and interspeaker variation may be seen in Figure 1.
The first two tracks in the figure are a single speaker saying owie on two different
occasions. The third and fourth tracks in the figure are two different speakers saying owie.
Figures 2 and 3 are similar data for the utterances ayo and eya.
Our research shows that the interspeaker variation of tracks of isolexemic
sequences will be measurably larger than the intraspeaker variation, and therefore that an
unknown speaker can be identified through these tracks.
Extracting features from tracks
To compare tracks, several factors must be considered: the region of moment space
occupied by the track; the shape of the track; the centre of gravity of the track; and the
orientation of the track. Each of these characteristics displays larger interspeaker than
intraspeaker variation when reduced to statistical variables. One way to extract variables
is to surround the track by a minimal enclosing rectangle (MER), which is the rectangle
of minimal area containing the entire track. The MER is computed by rotating the track
about an endpoint one degree at a time and computing the area of a bounding rectangle
whose sides are parallel to the axes each time, and then taking the minimum. The
minimum is necessarily found within 90 degrees of rotation. This is illustrated in Figure 4.
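A direct transcription of the MER search might look like the following sketch, which rotates the track about its first point in one-degree steps and keeps the smallest axis-parallel bounding box. The article does not give code, so details such as the choice of endpoint and the return of only (area, angle) are simplifications for illustration.

```python
import math

def minimal_enclosing_rectangle(track):
    """Rotate the track (a list of (x, y) points) about its first point in
    one-degree steps through 90 degrees, bound it with an axis-parallel
    rectangle at each step, and keep the smallest area and its angle."""
    x0, y0 = track[0]
    best_area, best_deg = None, None
    for deg in range(90):
        t = math.radians(deg)
        c, s = math.cos(t), math.sin(t)
        xs, ys = [], []
        for x, y in track:
            dx, dy = x - x0, y - y0
            xs.append(dx * c - dy * s)   # rotated coordinates
            ys.append(dx * s + dy * c)
        area = (max(xs) - min(xs)) * (max(ys) - min(ys))
        if best_area is None or area < best_area:
            best_area, best_deg = area, deg
    return best_area, best_deg
```

A track that is a straight diagonal segment collapses to near-zero area at a 45-degree rotation, which is a handy sanity check.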
From the MER of the curve in its original orientation, we extract four of the ten
variables to be used to characterize the tracks, viz. the x-value of the midpoint, the
y-value of the midpoint, the length of the long side (L), and the angle of orientation (θ).
(The length of the short side was not an effective discriminator for this study.) Four more
variables are the minimal x-coordinate, the minimal y-coordinate, the maximal
x-coordinate, and the maximal y-coordinate of the track. They are derived by surrounding
the track in its original orientation with a minimal rectangle parallel to the axes and
taking the four corner points. These eight parameters measure the track’s location and
orientation in moment space.
The final two variables attempt to reflect the shape of the track. Note that the
spacing and number of track points in an utterance depend on the fundamental pitch. The
higher the frequency, the fewer the samples in each period and hence the greater the
number of track points computed over a given time period. To obviate
this remaining manifestation of pitch and hence, the number of track points, as a factor
affecting the measurement of the shape of a curve, we reparameterize the curve based on
the distance between track points. We normalize the process so that the curve always lies
in the same interval thus removing track length as a factor. (Other variables take it into
account.)
More particularly, we parameterize the tracks in m1-m2 space into two integrable curves
by plotting the m1-value of a point p (the ordinate) versus the distance in m1-m2 space to
point p+1 (abscissa), providing the distance exceeds a certain threshold. If it does not, the
point p+1 is thrown out and the next point taken, and so on until the threshold is
exceeded. The abscissa is then normalized to [0, 1] and the points interpolated into a
smooth curve by a cubic spline. This is known as a normalized arc length
parameterization. A second curve is produced via the same process using the m2-value of
the point p. The two quadrature-based variables are calculated by integrating each curve
over the interval [0, 1].
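The construction of one of the two quadrature-based variables can be sketched as follows. For brevity, piecewise-linear interpolation with trapezoidal quadrature stands in for the article's cubic spline, so the numbers are only approximations of what the authors' procedure would produce.

```python
import math

def quadrature_variable(track, coord=0, threshold=0.0):
    """Normalized arc-length parameterization sketch: plot the m1-value
    (coord=0) or m2-value (coord=1) of each kept point against cumulative
    distance in m1-m2 space, normalize the abscissa to [0, 1], and
    integrate the resulting curve."""
    xs, ys = [0.0], [track[0][coord]]
    prev, total = track[0], 0.0
    for p in track[1:]:
        d = math.hypot(p[0] - prev[0], p[1] - prev[1])
        if d <= threshold:
            continue                  # discard points below the threshold
        total += d
        xs.append(total)
        ys.append(p[coord])
        prev = p
    xs = [x / total for x in xs]      # normalize abscissa to [0, 1]
    # trapezoidal quadrature over [0, 1]
    return sum((xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2.0
               for i in range(len(xs) - 1))
```

The second quadrature variable is obtained from the same routine with `coord=1`, i.e. using the m2-value of each point.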
The ten variables are most likely not completely independent. With a data set of
this size, it is nearly impossible to estimate the correlations meaningfully. The first eight
variables were selected through exploratory analysis to characterize the MER. The last
two variables are related to the shape of the track as opposed to its location and
orientation in m1-m2 space and are therefore likely to have a high degree of
independence from the other eight.
Figures 5A–C illustrate the discriminatory power of these variables. Figures 5A
and 5B represent two different utterances of ayo by speaker JT. The first plot in each
figure is the track in moment space. The second and third plots are the normalized arc
length parameterizations for m1 and m2. (The actual variable used will be the quadrature
of these curves.) The similarity in shape of corresponding plots for the same-speaker
utterances is evident. Figure 5C is the set of plots for the utterance of ayo by speaker BB.
The different curve shapes in Figure 5C indicate that a different person spoke.
A CASE STUDY
The experiment
From an imagined extortion threat containing ‘Now we see about the payola’, we
identified three potential isolexemes: owie, eya, and ayo, as might be isolated from the
underlined parts of the exemplar. Single utterances of owie, eya, and ayo were extracted
from the speech of ten unknown speakers. The set consisted of eight males of whom five
were native speakers of American English, and three were near accent-less fluent English
speakers whose native language was Venezuelan Spanish. The two females were both
native speakers of American English. This is the testing database. We then asked the ten
speakers – BB, BS, DM, DS, JT, KB, LC, NM, RR, VN – to utter owie, eya, and ayo four
times to simulate the results of interrogations in which those sounds were extracted from
the elicited dialogue. This is the training database. All the speech samples were processed
to create tracks as described in the ‘Creating “tracks”’ subsection above.
The objective of the experiment is to see if the 10 features described in the
previous subsection are useful in discriminating among individuals. The approach of
using several variables to distinguish between groups or classes is referred to as
discriminant analysis. (See, for example, Mardia et al. 1997.) As mentioned in the
‘Background’ section above, many authors have employed methods such as neural
networks and hidden Markov models to discriminate between individuals. (See
Klevans and Rodman (1997) for a general discussion.) A disadvantage of these methods
is that they require a large amount of training data. We present a fairly simple
discriminant analysis, which is easily implemented and can be used with a small amount
of training data.
Determining effective discriminators
The set of variables described in the ‘Extracting features from tracks’ subsection above
seemed to capture important features of the ayo, eya and owie tracks. We therefore used
an analysis of variance (ANOVA) to confirm that these variables are effective in
discriminating between individuals. ANOVA is a method of comparing means between
groups (see, for example, Snedecor and Cochran 1989). In this case, a group is a set of
replicates from an individual. If the mean of a feature varies across individuals, then this
variable may be useful for discriminating between at least some of the individuals. In an
ANOVA, the F-statistic is the ratio of the interspeaker variation to the intraspeaker
variation. If this ratio is large (much larger than one), then we conclude that there is a
significant difference in feature means between individuals.
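The F-statistic described here is the standard one-way ANOVA ratio and can be computed directly. The sketch below assumes one feature with several replicate measurements per speaker.

```python
def f_statistic(groups):
    """One-way ANOVA F-statistic for a single feature: the ratio of the
    between-group (interspeaker) mean square to the within-group
    (intraspeaker) mean square, where each group is the list of replicate
    measurements from one speaker."""
    k = len(groups)                                  # number of speakers
    n = sum(len(g) for g in groups)                  # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g)
                    for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

Two speakers whose replicates cluster tightly around well-separated means yield an F in the thousands; an F near 1 would mean the feature carries essentially no speaker information.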
Table 1 contains the F-statistics for each of the ten variables described in the
‘Extracting features from tracks’ subsection. In this analysis, each variable is considered
separately, so the F-statistic is a measure of a variable’s effectiveness in distinguishing
individuals when used alone. For these data an F-statistic larger than 2.2 can be
considered large, meaning the variable will be a good discriminator. Note that a large
F-value does not imply that we can separate all individuals well using the single feature;
however, it will be useful in separating the individuals into at least two groups. All of the
variables discussed in the ‘Extracting features from tracks’ subsection have a large
F-statistic. (Indeed, we used the F-statistic to eliminate as ineffective such potential
variables as the length of the short side of the MER.)
Measures of similarity
Having determined that all ten features are useful for all three sounds, the discriminant
analysis will be based on these 30 variables. Let yi be the 30-dimensional vector of
sample means for speaker i. In our training database, this mean is based on four
repetitions for each speaker. It will be easy to discriminate between individuals if the yi’s
are ‘far apart’ in 30-dimensional space. One way to measure the distance between means
is to use Euclidean distance. However, this metric is not appropriate in this situation
because it does not account for differing variances and covariances. For example, a
change in one unit of the angle of orientation variable is not equivalent to a change of
one unit of a quadrature-based variable. Also, with a one-unit change in maximum-y, we
might expect a change in minimum-y or the long side variables. Mahalanobis distance is
a metric that accounts for variances and covariances between variables (see, for example,
Mardia et al. 1979).
Let ∑be the 30x30 dimensional covariance matrix. We will partition ∑ into nine 10x10
matrices, six of which are distinct. The matrix has the form
For example, the submatrix ∑AErepresents the covariance matrix of the ten variables
associated with the ayo sound. The submatrix ∑AA represents the covariance matrix of
the ten ayo variables and the ten eya variables. We make two assumptions about the
structure of this matrix. First, we assume that the diagonal submatrices are constant for all
individuals, so that ∑AA ∑EE and ∑OOcan be estimated by pooling the corresponding
sample covariance matrices across individuals. This is a fairly strong assumption, but
with the size of the training data set, we cannot reliably estimate these matrices separately
for each individual without making even more stringent distributional assumptions.
Secondly, we assume that Σ has a block-diagonal structure; that is, the matrices ΣAE,
ΣAO, and ΣEO are assumed to be matrices of zeros. This is also a strong assumption, but
again, the size of the training data set does not allow for reliable estimation of these
submatrices. Let Σ̂ be the estimate of Σ using the zero matrices and the pooled estimates
described above. The squared Mahalanobis distance between individuals i and j is

$$M_{ij} = (\mathbf{y}_i - \mathbf{y}_j)' \hat{\Sigma}^{-1} (\mathbf{y}_i - \mathbf{y}_j)$$
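Assembling the block-diagonal estimate Σ̂ under these two assumptions, and computing the squared Mahalanobis distance (yi − yj)′ Σ̂⁻¹ (yi − yj), can be sketched as follows. This is illustrative code with hypothetical array shapes, not the authors' implementation:

```python
import numpy as np

def pooled_block_cov(data):
    """Estimate the block-diagonal covariance matrix described above.

    data: array of shape (speakers, repetitions, 3, 10) -- ten features
    for each of the three sounds (ayo, eya, owie) per repetition.
    Returns a 30x30 estimate: the three 10x10 diagonal blocks are
    pooled within-speaker covariances; off-diagonal blocks are zero.
    """
    n_spk, n_rep, n_snd, n_feat = data.shape
    sigma_hat = np.zeros((n_snd * n_feat, n_snd * n_feat))
    for s in range(n_snd):
        block = np.zeros((n_feat, n_feat))
        for i in range(n_spk):
            x = data[i, :, s, :]                  # repetitions x features
            block += (n_rep - 1) * np.cov(x, rowvar=False)
        block /= n_spk * (n_rep - 1)              # pool across speakers
        sl = slice(s * n_feat, (s + 1) * n_feat)
        sigma_hat[sl, sl] = block
    return sigma_hat

def sq_mahalanobis(yi, yj, sigma_hat):
    """Squared Mahalanobis distance between two mean vectors."""
    d = yi - yj
    return float(d @ np.linalg.solve(sigma_hat, d))
```

Because the off-diagonal blocks are zero, Σ̂ is invertible whenever each pooled diagonal block is, which is the practical payoff of the block-diagonal assumption for a small training set.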
Table 2 contains squared Mahalanobis distances for the ten individual means in the
training database. The lower triangle of the table is blank because these cells are
redundant. Relatively small distances indicate that the individuals are similar with
respect to the variables used in the analysis. For example, the most similar individuals are
KB and NM (the two female speakers) while the most dissimilar are DM and RR (both
native speakers of American English).
Classifying exemplars
For features extracted from a set of three utterances (ayo, eya, owie) from an unknown
speaker Sx, we can calculate the squared Mahalanobis distance from the exemplar mean
vector yx to each individual mean by

$$M_{xi} = (\mathbf{y}_x - \mathbf{y}_i)' \hat{\Sigma}^{-1} (\mathbf{y}_x - \mathbf{y}_i)$$
For the closed-set problem, we identify Sx by choosing the individual mean to which yx
is closest. We first tested this identification rule on each exemplar in the training set. The
rule correctly identified the speaker for all training exemplars. We would expect to have a
low error rate in this case, since each exemplar was also used in estimating the individual
means and Σ̂.
The rule was also applied to unknowns 1–7 in the testing database. These
exemplars came from speakers in the training set. (Unknowns 8–10 were ‘ringers’
introduced for the open-set test. They consisted of one male and two female native
speakers of American English, replacing one female and one male native speaker of
American English, and one male speaker whose native language was Venezuelan
Spanish.) Table 3 contains the squared Mahalanobis distances from each yx to each
individual mean. Each speaker was identified correctly. For each unknown (1–7), the
minimum distance is less than 100, except for Unknown 6. The asterisk marks the
minimum distance for unknowns 8–10. The minimum distances are lower, in general,
than the interspeaker distances given in Table 2. This confirms that this set of variables is
useful for discriminating between individuals. Also, the distances from each speaker in
the test set seem to follow approximately the same trends as in Table 2. For example, in
the training data, DM was the most dissimilar to BB. For Unknown 1 (BB), the largest
distance is to DM.
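The closed-set rule is simply a nearest-mean search under the Mahalanobis metric. A minimal sketch (illustrative code with hypothetical names, not the authors' implementation):

```python
import numpy as np

def classify_closed_set(y_x, means, sigma_hat):
    """Closed-set rule: assign the exemplar to the nearest speaker mean.

    y_x: exemplar feature vector (30-dimensional in the study)
    means: dict mapping speaker label -> mean feature vector
    sigma_hat: covariance estimate of matching dimension
    Returns (best label, dict of squared Mahalanobis distances).
    """
    inv = np.linalg.inv(sigma_hat)
    dists = {}
    for label, y_i in means.items():
        d = y_x - y_i
        dists[label] = float(d @ inv @ d)
    best = min(dists, key=dists.get)
    return best, dists
```

Reporting the full distance dictionary, not just the winning label, is what allows tables such as Table 3 to be inspected for near-misses.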
In many cases, it will be desirable to report not only the individual selected by the rule,
but also to provide an estimate of the reliability of the procedure. The reliability may be
determined empirically. We may use the observed error rate for the closed-set
classification rule when applied to the training data and test speakers 1–7.
Cross-validation can also be used to estimate error rates. However, due to the size of the study,
neither method will provide a reasonable estimate of reliability of the procedure. Another
method of estimating reliability would be to make distributional assumptions, e.g.
multivariate normality. Any such assumptions would be difficult to verify with such a
small data set. Developing a framework for estimating the reliability of such a procedure
with a small data set is planned for future work.
For the open-set problem, the rule must be modified to allow us to conclude that
the unknown speaker is not in the training set (X = 0). One way of modifying this rule
would be to establish a distance threshold. If none of the distances Mxi fall below this
threshold, then we conclude X=0. As in the closed-set problem, estimates of reliability
are desirable. In general, error rates will depend on the choice of the threshold.
We investigated empirical choices of thresholds for this experiment. For the test data set,
any choice of distance threshold will misclassify at least one of the ten unknowns.
For example, if we choose a distance threshold of 100, Unknown 6 will be incorrectly
assigned to So and Unknown 9 will be incorrectly classified as KB. In this testing
situation, we can pick a distance threshold that minimizes the number of misclassification
errors. However, this will not be possible in a practical situation. A framework for
choosing thresholds for the open-set problem is planned for future work.
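The thresholded open-set rule described above can be sketched as follows (illustrative code; the threshold value is whatever the analyst chooses, not a recommendation from the study):

```python
import numpy as np

def classify_open_set(y_x, means, sigma_hat, threshold):
    """Open-set rule: return a speaker label, or None (X = 0) when no
    speaker mean lies within the squared-distance threshold."""
    inv = np.linalg.inv(sigma_hat)
    dists = {label: float((y_x - m) @ inv @ (y_x - m))
             for label, m in means.items()}
    best = min(dists, key=dists.get)
    return best if dists[best] < threshold else None
```

With a threshold of 100, an exemplar whose nearest mean lies beyond 100 is declared outside the training set, which is exactly where the two misclassifications noted above arise.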
SUMMARY AND FUTURE DIRECTIONS
The results we obtained are encouraging given the sparseness of the data. The known
speakers had about 8–12 seconds of speech data per speaker. The unknowns had
one-quarter of that amount, 2–3 seconds. In an actual forensic situation there is an excellent
likelihood of having many times the amount of data for the criminal exemplar (unknown
speaker), and as much data as needed for suspects (known speakers).
The identification process is cumulative in nature. As additional data become
available, there is more information for individuating speakers, and the error probabilities
diminish. In practice the only limitation is the amount of data in the criminal exemplar
(the testing data). Often, authorities are able to collect as large an amount of training data
as needed. Each new sound sequence that undergoes analysis makes its small
contribution to the overall discrimination. In even as short an utterance as ‘There’s a
bomb in Olympic Park and it’s set to go off in ten minutes’ there are easily a dozen or
more sequences that may be extracted for analysis. Thus we are sanguine about the
ability of this method to work in practice.
When the case study is regarded as closed-set speaker identification, the system
performed without error. While it is unrealistic to expect zero error rates in general, the
results forecast a relatively low error rate in cases of this kind. Many practical scenarios
require only closed-set identification. For example, in a corporate espionage case, where
a particular phone line is tapped, there are a limited number of persons who have access
to that phone line. Similar cases are described in Klevans and Rodman (1997).
The more difficult and more general open-set identification yielded error rates
between 10 and 20 per cent depending on how thresholds are set. Our current research is
strongly concerned with reducing this error rate.
Future research: short term
Our research in this area is expanding in three directions. The first is to use a larger
quantity of data for identification. Simplistically, this might involve ten repetitions of ten
vowel transitional segments similar to owie for the training database. It is expected that
the F-values of the variables would rise, meaning that the ratio of interspeaker variation
to intraspeaker variation will climb. At one time we used only three utterances per sound
per speaker in the training database; when we went to four utterances the F-values
increased significantly, which supports this expectation. (Naturally this implies lengthier
interrogating sessions in a forensic application, but when a serious crime is involved, the
extra effort may be justified.)
The second direction is to use more phonetically varied data. The vowel
transitions of this study were chosen primarily to determine if the methodology was
promising. They do not span the entire moment space encompassed by the totality of
speech sounds. There are speech sounds such as voiced fricatives that produce tracks that
extend beyond the union of the MERs for the above utterances. Moreover, we are also
able to process voiceless sounds to produce moment tracks, but using a different
processing method that analyses the speech signal at frequencies up to 11 kHz (Fu et al.
1999). We are also able to process liquid [l], [r] and nasal sounds [m], [n], [nj], [N]. We
hypothesize that the use of other transitions, for example, vowel-fricative-vowel as in
lesson, will increase the discriminatory power of the method because it ‘views’ a
different aspect of the speaker’s vocal tract. An interesting, open, minor question is
whether particular types of sequences (e.g. vowel-nasal-vowel, diphthong alone, etc.) will
be more effective discriminators than others.
We are currently moving from producing our own data to using standardized
databases such as those available from the Linguistic Data Consortium. While this makes
the data extraction process more difficult and time-consuming, it has the advantage of
providing test data of the kind encountered in actual scenarios, particularly if one of the
many telephone-based databases is used.
The third direction is to find more and better discriminating variables. Eight of the
ten variables are basically ‘range statistics’, a class of statistics well known for their lack
of robustness and extreme sensitivity to outliers, and, as noted above, are not entirely
mutually independent. More and more varied data would mitigate these shortcomings, but
what is truly needed is a more precise measurement of curve shape, since the shape
appears to be highly correlated to the speaker.
We are experimenting with methods to characterize the shape of a curve. The
visual appearance of the shape of tracks for a given speaker for a given utterance, and the
differences between the shapes of the tracks among speakers for the same utterance,
suggest that curve shape should be used for speaker identification.
Curvature scale space (Mokhtarian and Mackworth 1986, Mokhtarian 1995,
Sonka et al. 1999) is a method that has been proposed to measure the similarity of 2D
curves for the purpose of retrieving curves of similar shape from a database of planar
curves.
The method tries to quantify shape by smoothing the curve (the scaling process)
and watching where the curvature changes sign. When the scaling process produces no
more curvature changes, the resulting behaviour history of the changes throughout the
smoothing process is used to do curve matching (Mokhtarian 1995). We are currently
exploiting this methodology to extract variables that are linked to the shape of the
moment tracks in m1-m2 space. These variables should provide discriminating power
highly independent of the variables currently in use, and hence would improve the
effectiveness of the identification process.
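To convey the flavour of the approach, the sketch below (our illustrative code, not from the cited papers) smooths a sampled planar curve with a Gaussian and locates the points where the signed curvature changes sign; tracking these zero crossings as the smoothing width grows is what builds the curvature scale space representation used for matching.

```python
import numpy as np

def gaussian_smooth(v, sigma):
    """1-D Gaussian smoothing by direct convolution, reflect padding."""
    r = int(3 * sigma) + 1
    t = np.arange(-r, r + 1)
    k = np.exp(-t**2 / (2 * sigma**2))
    k /= k.sum()
    padded = np.pad(v, r, mode='reflect')
    return np.convolve(padded, k, mode='valid')

def curvature_sign_changes(x, y, sigma):
    """Smooth a sampled planar curve (x(t), y(t)) at scale sigma and
    return the parameter indices where signed curvature changes sign
    (the inflection points tracked by curvature scale space)."""
    xs, ys = gaussian_smooth(x, sigma), gaussian_smooth(y, sigma)
    dx, dy = np.gradient(xs), np.gradient(ys)
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    # signed curvature of a parametric curve; guard the denominator
    kappa = (dx * ddy - dy * ddx) / np.maximum((dx**2 + dy**2) ** 1.5, 1e-12)
    return np.flatnonzero(np.diff(np.sign(kappa)) != 0)
```

As sigma increases, inflection points merge and vanish; the record of where and at what scale they disappear is the shape signature that could be extracted from the moment tracks in m1-m2 space.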
Other methods for exploiting shape differences are also being considered.
Matching shapes, while visually somewhat straightforward, is a difficult problem to
quantify algorithmically and methods for its solution have only recently begun to appear
in the literature.
Future research: long term
Our long-term future research is also pointed in three slightly different directions. They
are (1) noisy data, (2) channel impacted data, and (3) disguised voice data. All three of
these data-distorting situations may compromise the integrity of a speaker identification
system based on ‘clean’ data. A system for practical use in a forensic setting would need
methods for accommodating messy data. This is a vast and complex topic, and most of
the work needed would necessarily follow the development of the speaker identification
system as used under more favourable circumstances.
ACKNOWLEDGEMENT
The authors wish to acknowledge the editors for helpful assistance in improving the
presentation of the foregoing work.
REFERENCES
Baldwin, J. R. and French, P. (1990) Forensic Phonetics, London: Pinter Publishers.
Bennani, Y. and Gallinari, P. (1991) ‘On the Use of TDNN-Extracted Features
Information in Talker Identification’, ICASSP (International Conference on Acoustics,
Speech and Signal Processing), 385–8.
Bolt, R. H., Cooper, F. S., David, E. E., Denes, P. B., Pickett, J. M., and Stevens, K. N.
(1969) ‘Identification of a speaker by speech spectrograms’, Science, 166: 338–43.
Doddington, G. (1985) ‘Speaker Recognition – Identifying People by Their Voices’,
in Proceedings of the IEEE (Institute of Electrical and Electronics Engineers),
73(11): 1651–63.
Falcone, M. and de Sario, N. (1994) ‘A PC speaker identification system for forensic use:
IDEM’, in Proceedings of the ESCA (European Speech Communication Association)
Workshop on Automatic Speaker Recognition, Identification and Verification, Martigny,
Switzerland, 169–72.
French, P. (1994) ‘An overview of forensic phonetics with particular reference to speaker
identification’, Forensic Linguistics, 1(2):169–81.
Fu, H., Rodman, R., McAllister, D., Bitzer, D. and Xu, B. (1999) ‘Classification of
Voiceless Fricatives through Spectral Moments’, in Proceedings of the 5th International
Conference on Information Systems Analysis and Synthesis (ISAS’99), Skokie, Ill.:
International Institute of Informatics and Systemics, 307–11.
Hollien, H. (1990) The Acoustics of Crime: The New Science of Forensic Phonetics,
New York: Plenum Press.
Kao, Y., Rajasekaran, P. and Baras, J. (1992) ‘Free-text speaker identification over long
distance telephone channel using hypothesized phonetic segmentation’, ICASSP
(International Conference on Acoustics, Speech and Signal Processing), II.177–II.180.
Kersta, L. G. (1962) ‘Voiceprint identification’, Nature, 196: 1253–7.
Klevans, R. L. and Rodman, R. D. (1997) Voice Recognition, Norwood, Mass.: Artech
House Publishers.
Koenig, B. E. (1986) ‘Spectrographic voice identification: a forensic survey’, Journal of
the Acoustical Society of America, 79: 2088–90.
Koster, B. E. (1995) Automatic Lip-Sync: Direct Translation of Speech-Sound to
Mouth-Animation, PhD dissertation, Department of Computer Science, North Carolina State
University.
Künzel, H. J. (1994) ‘Current Approaches to Forensic Speaker Recognition’, in
Proceedings of the ESCA (European Speech Communication Association) Workshop on
Automatic Speaker Recognition, Identification and Verification, Martigny, Switzerland,
135–41.
Künzel, H. J. (1997) ‘Some general phonetic and forensic aspects of speaking
tempo’, Forensic Linguistics, 4(1): 48–83.
Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979) Multivariate Analysis, London:
Academic Press.
Markel, J. D. and Davis, S. B. (1978) ‘Text-independent speaker identification from a
large linguistically unconstrained time-spaced data base’, ICASSP (International
Conference on Acoustics, Speech and Signal Processing), 287–9.
Matsui, T. and Furui, S. (1991) ‘A text-independent speaker recognition method robust
against utterance variations’, ICASSP (International Conference on Acoustics, Speech
and Signal Processing), 377–80.
Mokhtarian, F. (1995) ‘Silhouette-based isolated object recognition through curvature
scale space’, IEEE (Institute of Electrical and Electronics Engineers)
Transactions on Pattern Analysis and Machine Intelligence, 17 (5): 539–44.
Mokhtarian, F. and Mackworth, A. (1986) ‘Scale-based description and recognition of
planar curves and two-dimensional shapes’, IEEE (Institute of Electrical and Electronics
Engineers) Transactions on Pattern Analysis and Machine Intelligence, PAMI-8(1):
34–43.
Oglesby, J. and Mason, J. S. (1990) ‘Optimization of Neural Models for Speaker
Identification’, ICASSP, 393–6.
O’Shaughnessy, D. (1986) ‘Speaker Recognition’, IEEE (Institute of Electrical and
Electronics Engineers) ASSP (Acoustics, Speech and Signal Processing) Magazine,
October, 4–17.
Reynolds, D. A. and Rose, R. C. (1995) ‘Robust text-independent speaker identification
using Gaussian mixture speaker models’, IEEE (Institute of Electrical and Electronics
Engineers) Transactions on Speech and Audio Processing, 3(1): 72–83.
Rodman, R. D. (1998) ‘Speaker recognition of disguised voices’, in Proceedings of the
COST 250 Workshop on Speaker Recognition by Man and Machine: Directions for
Forensic Applications, Ankara, Turkey, 9–22.
Rodman, R. D. (1999) Computer Speech Technology, Boston, Mass.: Artech House
Publishers.
Rodman, R., McAllister, D., Bitzer, D., Fu, H. and Xu, B. (1999) ‘A pitch tracker for
identifying voiced consonants’, in Proceedings of the 10th International Conference on
Signal Processing Applications and Technology (ICSPAT’99).
Rodman, R., McAllister, D., Bitzer, D. and Chappell, D. (2000) ‘A High-Resolution
Glottal Pulse Tracker’, in International Conference on Spoken Language
Processing (ICSLP), October 16–20, Beijing, China (CD-ROM).
Rudasi, L. and Zahorian, S. A. (1991) ‘Text-independent talker identification with neural
networks’, ICASSP (International Conference on Acoustics, Speech and Signal
Processing), 389–92.
Snedecor, G. W. and Cochran, W. G. (1989) Statistical Methods (8th edn), Ames, IA:
Iowa State University Press.
Sonka, M., Hlavac, V. and Boyle, R. (1999) Image Processing, Analysis, and Machine
Vision (2nd edn), Boston, MA, PWS Publishing, ch. 6.
Stevens, K. N., Williams, C. E., Carbonelli, J. R. and Woods, B. (1968) ‘Speaker
authentication and identification: a comparison of spectrographic and auditory
presentations of speech material’, Journal of the Acoustical Society of America,(43):
1596–1607.
Tosi, O. (1979) Voice Identification: Theory and Legal Applications, Baltimore, Md.:
University Park Press.