Overview of Text to Speech Notes

Overview of Text to Speech
“Getting the computer to read your printed document out loud”
Text to Speech
• “Text-to-Speech software is used to
convert words from a computer document
(e.g. word processor document, web
page) into audible speech spoken through
the computer speaker”
Benefits
• The benefits of speech synthesis have
been many, including computers that can
read books to people, better hearing aids,
more simultaneous telephone
conversations on the same cable, talking
machines for vocally impaired or deaf
people and better aids for speech therapy.
The history of speech synthesis
• What you maybe don't know is that the first synthetic speech was produced as early as the late 18th century.
• The machine was built from wood and leather and was very complicated to use when generating audible speech. It was constructed by Wolfgang von Kempelen and had great importance in the early studies of phonetics.
• The following picture shows the original construction as it can be seen at the Deutsches Museum (von Meisterwerken der Naturwissenschaft und Technik) in Munich, Germany.
Von Kempelen's Machine
Voder
• In the early 20th century, when it became possible to use electricity to create synthetic speech, the first known electric speech synthesizer was the "Voder". Its creator, Homer Dudley, demonstrated it to a broader audience at the 1939 World's Fair in New York.
OVE
• One of the pioneers of the development of
speech synthesis in Sweden was Gunnar Fant.
• During the 1950s he was responsible for the development of the first Swedish speech synthesizer, OVE (Orator Verbis Electris).
• At that time, only Walter Lawrence's Parametric Artificial Talker (PAT) could compete with OVE in speech quality.
• OVE and PAT were text-to-speech systems using formant (parametric) synthesis.
Speech synthesis becomes more
human-like
• The greatest improvements in natural-sounding speech came during the last 10 years.
• The first voices we used for ReadSpeaker back in 2001 were produced using diphone synthesis.
• The voices are sampled from real recorded speech and split into phonemes, small units of human speech. This was the first example of concatenation synthesis. However, these voices still have an artificial, synthetic sound. We still use diphone voices for some smaller languages, and they are widely used to speech-enable handheld computers and mobile phones because of their limited resource consumption, in both memory and CPU.
Unit Selection
• It wasn't until the introduction of a technique called unit selection that voices became very natural sounding. This is still concatenation synthesis, but the units used are larger than phonemes, sometimes a complete sentence.
Why use Speech Synthesis
• Visual issue (difficulty seeing text)
• Cognitive issue (low reading level/comprehension)
• Motor issue (difficulty handling a book or paper)
Forms of Text
• E text
Most of the text you see on your computer
Examples: Internet, Email, Word Document, E Books
• Paper text
Any text that is printed on paper
Examples: Newspaper, Book, Magazine
Characteristics of Speech
synthesis systems
• Many speech synthesis systems take text as input and produce speech as output.
• Hence they are often known as text-to-speech (TTS) systems.
• The naturalness of a speech synthesizer usually
refers to how much the output sounds like the
speech of a real person.
• The intelligibility of a speech synthesizer refers
to how easily the output can be understood.
Parts of Speech Synthesizers
• Speech synthesizers usually consist of two parts.
First Part
• The first part has two major tasks. First it takes
the raw text and converts things like numbers
and abbreviations into their written-out word
equivalents. This process is often called text
normalization, pre-processing, or tokenization.
Then it assigns phonetic transcriptions to each
word, and divides and marks the text into
various linguistic units like phrases, clauses, and
sentences. The combination of phonetic transcriptions and prosody information makes up the symbolic linguistic representation that the first part of the system passes to the second part.
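To make the first part concrete, here is a minimal Python sketch of text normalization and sentence segmentation. The abbreviation table and the digit-by-digit number fallback are illustrative assumptions, not the resources of any real system:

import re

# Toy abbreviation table; real front ends use much larger resources.
ABBREVIATIONS = {"Dr.": "Doctor", "etc.": "et cetera", "No.": "number"}

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def spell_digits(match):
    # Naive fallback: read any number digit by digit.
    return " ".join(DIGITS[int(d)] for d in match.group())

def normalize(text):
    # Expand abbreviations into their written-out word equivalents.
    for abbrev, expansion in ABBREVIATIONS.items():
        text = text.replace(abbrev, expansion)
    # Expand numbers (context-sensitive expansion is discussed below).
    return re.sub(r"\d+", spell_digits, text)

def split_sentences(text):
    # Crude segmentation into sentence units on terminal punctuation.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

print(split_sentences(normalize("Dr. Smith lives at 221 Baker Road.")))
# ['Doctor Smith lives at two two one Baker Road.']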
Second Part
• The other part, the back end, takes the
symbolic linguistic representation and
converts it into actual sound output.
• The back end is often referred to as the
synthesizer.
Text normalization challenges
• The process of normalizing text is rarely straightforward.
Texts are full of homographs (i.e. words that are spelt the
same but are pronounced differently, e.g. Read the book,
The book was read), numbers and abbreviations that all
ultimately require expansion into a phonetic
representation.
• There are many words in English which are pronounced differently based on context (i.e. homographs). Some examples:
• project: My latest project is to learn how to better project
my voice.
• bow: The girl with the bow in her hair was told to bow
deeply when greeting her superiors.
• Most TTS systems do not generate
semantic representations of their input
texts, as processes for doing so are not
reliable, well-understood, or
computationally effective.
Numbers
• Deciding how to convert numbers is another problem
TTS systems have to address.
• It is a fairly simple programming challenge to convert a
number into words, like 1325 becoming "one thousand
three hundred twenty-five".
• However, numbers occur in many different contexts in
texts, and 1325 should probably be read as "thirteen
twenty-five" when part of an address (1325 Main St.) and
as "one three two five" if it is the last four digits of a
social security number.
• Often a TTS system can infer how to expand a number
based on surrounding words, numbers, and punctuation,
and sometimes the systems provide a way to specify the
type of context if it is ambiguous.
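A small Python sketch of this kind of context-dependent expansion, using the "1325" example above; the trigger patterns and the tiny pair table are simplified assumptions:

import re

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
PAIRS = {"13": "thirteen", "25": "twenty-five"}  # just enough for the demo

def read_number(token, context):
    if re.search(r"\b(St|Ave|Rd)\b", context):
        # Address: read as two-digit pairs, e.g. "thirteen twenty-five".
        return " ".join(PAIRS[token[i:i + 2]] for i in (0, 2))
    if "social security" in context.lower():
        # Identifier: read digit by digit, e.g. "one three two five".
        return " ".join(DIGITS[int(d)] for d in token)
    return token  # otherwise: full cardinal expansion (not shown)

print(read_number("1325", "1325 Main St."))
print(read_number("1325", "last four digits of a social security number"))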
Abbreviations
• Similarly, abbreviations like "etc." are easily rendered as
"et cetera", but often abbreviations can be ambiguous.
• For example, the abbreviation "in." in the following
example: "Yesterday it rained 3 in. Take 1 out, then put 3
in."
• "St." can also be ambiguous: "St. John St."
• TTS systems with intelligent front ends can make
educated guesses about how to deal with ambiguous
abbreviations, while others do the same thing in all
cases, resulting in nonsensical but sometimes comical
outputs: "Yesterday it rained three in." or "Take one out,
then put three inches."
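A Python sketch of one such educated guess for "in.", applied to the sentences above. The naive digit-before-"in." heuristic gets the first sentence right and the second one wrong, which is exactly the ambiguity the example illustrates:

import re

def expand_in(sentence):
    # Assumed heuristic: a digit right before "in." marks a measurement.
    return re.sub(r"(\d+) in\.", r"\1 inches.", sentence)

print(expand_in("Yesterday it rained 3 in."))   # correct: "3 inches."
print(expand_in("Take 1 out, then put 3 in."))  # wrong: "3 inches."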
Text-to-phoneme challenges
• Speech synthesis systems use two basic
approaches to determine the
pronunciation of a word based on its
spelling, a process which is often called
text-to-phoneme or grapheme-to-phoneme
conversion, as phoneme is the term used
by linguists to describe distinctive sounds
in a language.
Dictionary Based approach
• The simplest approach to text-to-phoneme
conversion is the dictionary-based
approach, where a large dictionary
containing all the words of a language and
their correct pronunciation is stored by the
program. Determining the correct
pronunciation of each word is a matter of
looking up each word in the dictionary and
replacing the spelling with the
pronunciation specified in the dictionary.
Rule based approach
• The other approach used for text-to-phoneme conversion is the rule-based approach, where rules for the pronunciation of words are applied to words to work out their pronunciations based on their spellings. This is similar to the "sounding out" approach to learning reading.
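The two approaches are often combined: look the word up in the dictionary first, and fall back to letter-to-sound rules for unknown words. A minimal Python sketch, with a toy lexicon and rule set standing in for real resources:

# Toy lexicon (dictionary-based approach).
LEXICON = {"read": "R IY D", "book": "B UH K", "the": "DH AH"}

# Toy letter-to-sound rules (rule-based approach); real rule sets
# are far larger and context-sensitive.
RULES = {"ph": "F", "th": "TH", "a": "AE", "b": "B", "c": "K", "d": "D",
         "e": "EH", "i": "IH", "k": "K", "l": "L", "n": "N", "o": "AO",
         "r": "R", "s": "S", "t": "T", "u": "AH"}

def to_phonemes(word):
    word = word.lower()
    if word in LEXICON:              # dictionary lookup first
        return LEXICON[word]
    phones, i = [], 0
    while i < len(word):             # then "sound out" by rule
        if word[i:i + 2] in RULES:   # prefer two-letter rules
            phones.append(RULES[word[i:i + 2]])
            i += 2
        elif word[i] in RULES:
            phones.append(RULES[word[i]])
            i += 1
        else:
            i += 1                   # no rule: skip the letter
    return " ".join(phones)

print(to_phonemes("book"))      # dictionary hit: B UH K
print(to_phonemes("phonetic"))  # rule-based: F AO N EH T IH K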
Synthesizer technologies
• There are two main technologies used for generating synthetic speech waveforms: concatenative synthesis and formant synthesis, the latter sometimes called parametric speech synthesis.
• There are others, such as:
• Recorded prompts
• Intonation modeling
Formant Synthesis
• Formant synthesis does not use any human
speech samples at runtime. Instead, the output
synthesized speech is created using an acoustic
model.
• Parameters such as frequency and amplitude are varied over time to create a waveform of artificial speech.
• This method is sometimes called Rule-based
synthesis but some argue that because many
concatenative systems use rule-based
components for some parts of the system, like
the front end, the term is not specific enough.
• Many systems based on formant synthesis technology generate artificial, robotic-sounding speech, and the output would never be mistaken for the speech of a real human. However, maximum naturalness is not always the goal of a speech synthesis system, and formant synthesis systems have some advantages over concatenative systems.
• Formant synthesized speech can be very reliably
intelligible, even at very high speeds, avoiding the
acoustic glitches that can often plague concatenative
systems.
• High speed synthesized speech is often used by the
visually impaired for quickly navigating computers using
a screen reader.
• Second, formant synthesizers are often smaller
programs than concatenative systems because they do
not have a database of speech samples.
• Last, because formant-based systems have total control
over all aspects of the output speech, a wide variety of
prosody can be output, conveying not just questions and
statements, but a variety of emotions and tones of voice.
Formant
• This synthesis is a sort of source-filter method based on mathematical models of the human speech organs.
The vocal tract is modelled as a number of resonances resembling the formants (frequency bands with high energy) in natural speech.
The first electronic voices, the Voder and later OVE and PAT, spoke with totally synthetic, electronically produced sounds using formant synthesis. As with articulatory synthesis, memory consumption is small but CPU usage is large.
The Source-Filter Model of
Formant Synthesis
• Excitation or Voicing Source(s) to model
sound source
– standard wave of glottal pulses for voiced
sounds
– randomly varying noise for unvoiced sounds
– modification of airflow due to lips, etc.
Formant Synthesis continued
– high frequency (F0 rate), quasi-periodic,
choppy
– modeled with vector of glottal waveform
patterns in voiced regions
• Acoustic Filter(s)
– shapes the frequency character of vocal tract
and radiation character at the lips
– relatively slow (samples around 5ms suffice)
and stationary
– modeled with LPC (linear predictive coding)
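Putting the model together: a minimal numpy/scipy sketch that drives a glottal pulse train (the source) through a cascade of second-order resonators (the filter), one per formant. The formant frequencies and bandwidths are rough, illustrative values, not a calibrated acoustic model:

import numpy as np
from scipy.signal import lfilter

fs, f0, dur = 16000, 120, 0.5          # sample rate, pitch, duration

# Source: quasi-periodic glottal pulse train at the fundamental F0.
signal = np.zeros(int(fs * dur))
signal[::fs // f0] = 1.0

# Filter: one resonator per formant (centre frequency, bandwidth in Hz);
# these values only roughly suggest an /a/-like vowel.
for freq, bw in [(700, 130), (1220, 70), (2600, 160)]:
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    a = [1.0, -2 * r * np.cos(theta), r * r]   # resonator poles
    b = [sum(a)]                               # unity gain at DC
    signal = lfilter(b, a, signal)

signal /= np.abs(signal).max()         # normalise before playback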
Concatenative synthesis
• Concatenative synthesis is based on the
concatenation (or stringing together) of
segments of recorded speech.
• Generally, concatenative synthesis gives the
most natural sounding synthesized speech.
• However, natural variation in speech and
automated techniques for segmenting the
waveforms sometimes result in audible glitches
in the output, detracting from the naturalness.
There are three main subtypes of concatenative
synthesis:
Subtypes
• Unit selection synthesis uses large speech databases (more than one hour of recorded speech). During database creation, each recorded utterance is segmented into some or all of the following linguistic constructs: phonemes, words, phrases and sentences.
• Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-sound transitions) occurring in a given language. In diphone synthesis, only one example of each diphone is contained in the speech database.
• Domain-specific synthesis concatenates pre-recorded
words and phrases to create complete utterances.
Concatenative Synthesis
– Record basic inventory of sounds
– Retrieve appropriate sequence of units at
run time
– Concatenate and adjust durations and
pitch
– Synthesize waveform
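A numpy sketch of the retrieve-and-concatenate steps above, with random noise standing in for a real unit inventory; a short linear crossfade at each join softens the boundary glitches mentioned earlier:

import numpy as np

fs = 16000
# Fake inventory: noise standing in for recorded diphone units.
inventory = {u: np.random.randn(fs // 10) for u in ["h-e", "e-l", "l-ou"]}

def concatenate(units, xfade_ms=5):
    fade = int(fs * xfade_ms / 1000)
    ramp = np.linspace(0.0, 1.0, fade)
    out = inventory[units[0]].copy()
    for u in units[1:]:
        nxt = inventory[u]
        # Overlap-add the join: fade the old unit out, the new one in.
        out[-fade:] = out[-fade:] * (1 - ramp) + nxt[:fade] * ramp
        out = np.concatenate([out, nxt[fade:]])
    return out

waveform = concatenate(["h-e", "e-l", "l-ou"])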
Concatenating synthesis
• A concatenating synthesis is made from recorded pieces of speech (sound clips) that are then segmented into units and joined to form speech.
• Depending on the length of the sound clips used, it becomes a diphone or a polyphone synthesis.
• The latter, in a more developed version, is also called unit selection synthesis, where the synthesizer has access to both long and short segments of speech and the best segments for the actual context are chosen.
Diphone
• In phonetics, a diphone is an adjacent pair of phones. The term usually refers to a recording of the transition between two phones.
• A phone is the actual pronunciation of a phoneme.
Diphone and Polyphone
Synthesis
• Phone sequences capture co-articulation
• That is, how combinations of phones sound
Diphone and Polyphone Synthesis
• Data Collection Methods
– Collect data from a single (professional) speaker
– Select text with maximal coverage (typically with a greedy algorithm; see the sketch after this list), or
– Record minimal pairs in desired contexts (real
words or nonsense)
• Reduce the number collected by
– phonotactic constraints
– collapsing in cases of no co-articulation
• Cut speech in positions that minimize context
contamination
• Need single phones, diphones and sometimes
triphones
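The greedy coverage step can be sketched in a few lines of Python: repeatedly pick the sentence that covers the most not-yet-covered units. Adjacent letter pairs stand in for real diphones here:

def units(sentence):
    # Stand-in for diphone extraction: adjacent letter pairs.
    s = sentence.lower().replace(" ", "")
    return {s[i:i + 2] for i in range(len(s) - 1)}

def greedy_select(corpus):
    needed = set().union(*(units(s) for s in corpus))
    chosen = []
    while needed:
        best = max(corpus, key=lambda s: len(units(s) & needed))
        if not units(best) & needed:
            break
        chosen.append(best)
        needed -= units(best)
    return chosen

print(greedy_select(["she sells sea shells", "by the sea shore", "she sells"]))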
Diphone
• For a diphone synthesis, the elements taken from the recorded speech are very small.
• The speech may sound a bit monotonic.
• Diphone synthesis doesn't work that well
• in languages where there is a lot of inconsistency in the pronunciation rules (English, Swedish, etc.)
• in special cases where letters are pronounced differently than in general.
• Diphone synthesis works better for languages with highly consistent pronunciation (Spanish, Finnish, etc.)
• Another advantage is that the prosody, the intonation, can be described in much detail.
Signal Processing for
Concatenative Synthesis
• Diphones recorded in one context must be
generated in other contexts
• Features are extracted from recorded units
• Signal processing manipulates features to
smooth boundaries where units are
concatenated
• Signal processing modifies signal via
‘interpolation’
– intonation
– duration
Unit selection
• The greatest difference between a unit selection voice and a diphone voice is the length of the speech segments used.
• Entire words and phrases are stored in the unit database. This implies that the databases for unit selection voices are many times bigger than for diphone voices.
• Thus, the memory consumption is huge while the CPU consumption is low.
Unit Selection
• The most important issue is still to get a natural and smooth prosody.
• This is hard because the units contain both intonation and pronunciation, since entire phrases are used almost directly from the recorded data.
• Since the first unit selection voice was released over eight years ago, there has been much improvement with each new voice release.
HMM synthesis
• A quite new technology is speech synthesis based on HMMs, a mathematical concept called hidden Markov models.
• It is a statistical method where the text-to-speech system is based on a model that is not known beforehand but is refined by continuous training.
• The technique consumes large CPU resources but very little memory.
• This approach seems to give better prosody, without glitches, and still produces very natural-sounding, human-like speech.
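A loose Python sketch of the statistical idea, using the hmmlearn library: fit a Gaussian HMM to acoustic feature frames, then sample a parameter trajectory from it. Real HMM synthesis (e.g. HTS) uses context-dependent phone models with duration and dynamic features; the random data here is only a stand-in:

import numpy as np
from hmmlearn import hmm

frames = np.random.randn(500, 13)   # stand-in for e.g. MFCC frames

# "Continuous training": refine the model's parameters from data.
model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=20)
model.fit(frames)

# Generate a parameter trajectory that a vocoder would turn into audio.
generated, states = model.sample(100)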
Recorded Prompts
• The simplest (and most common) solution is to record prompts spoken by a (trained) human
• Produces a human-quality voice
• Limited by the number of prompts that can be recorded
• Can be extended by limited cut-and-paste or template filling
Articulatory synthesis
• In articulatory synthesis, models of the human articulators (tongue, lips, teeth, jaw) and vocal folds are used to simulate how an airflow passes through them and to calculate what the resulting sound will be like. It is a great challenge to find good mathematical models, and therefore articulatory synthesis is still at the research stage. The technique is very computation-intensive, but its memory requirements are almost nothing.
Sable
• SABLE is an emerging standard extending
SGML
• http://www.cstr.ed.ac.uk/projects/sable.html
– marks: emphasis(#), break(#),
pitch(base/mid/range,#), rate(#), volume(#),
semanticMode(date/time/email/URL/...),
speaker(age,sex)
– Implemented in Festival Synthesizer (free for research,
etc.):
http://www.cstr.ed.ac.uk/projects/festival.html
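A hedged example of what SABLE markup looks like in practice, built from the marks listed above; the exact tag and attribute names vary between drafts of the standard (the semantic-mode mark is shown here as a SAYAS tag):

<SABLE>
  <SPEAKER GENDER="female" AGE="30">
    Your meeting is on <SAYAS MODE="date">12/03/2003</SAYAS>.
    <BREAK LEVEL="medium"/>
    <EMPH>Please</EMPH> do not be
    <RATE SPEED="-20%">late</RATE>.
  </SPEAKER>
</SABLE>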
Assistive Applications of speech
synthesis
• Systems that provide voice synthesis output for blind users are generally referred to as screen readers. Brown (1989) [Cook and Hussey 95] identified the capabilities an ideal voice output system should have.
Key Features
• 1: A good audio environment: no background noise, good speakers, earphones, etc.
• 2: Good intelligibility: the output should be intelligible. Studies have shown this to be paramount. Studies have also shown that naturalness of the voice is desirable, particularly for female users of speech synthesis; synthetic-sounding voices are not as acceptable.
• 3: The screen reader should work with all commercially
available software, i.e. the blind user should have access
to the same software the sighted user has.
• This includes access to both text and graphics.
• 4: The adapted output system should work with a variety
of speech synthesizer systems.
User Interface
• The user interface should have the following characteristics.
• 1: Spoken letters often sound the same, e.g. b and v. To reduce ambiguity, the synthesizer should have access to the aviator's alphabet (Alpha, Bravo, Charlie, etc.)
• 2: To match the capabilities of normal vision, the screen reader should be able to read forwards or backwards, and read punctuation, highlights and other syntactical conventions.
• 3: A sighted reader often scans whole passages to get context or a
sense of the text. The screen reader should be able to read
complete sentences and passages.
• 4: Computer programs often generate prompts and output
messages. The screen reader should be able to read these.
Operational Characteristics
• The following are desirable operational characteristics of the screen
reader.
• 1: It should be easy to use and maintain. It shouldn't require huge amounts of training. Screen readers are often complex and difficult to master.
• 2: Screen readers have two modes, application and review. Review is where the reader is basically reading. Application is where the functionality of the application can be accessed. For example, a document in a word processor could be read in review mode but edited in application mode. Ideally the two modes should be merged. If a mistake were noticed while a document was being read, then it would be beneficial to change it there and then and not have to switch out of review into application mode.
More
• 3: Central to the success of using screen readers is the notion of cursor routing. Here the navigation path of the cursor through the document is recorded so that the user can return to various points if they have to switch between modes. For example, if we have to switch to application mode we can retrace our steps through the document.
• This is similar to the macro capability provided by many
commercial systems.
• 4: Windows-based systems can generate a series of windows. Screen readers need to be able to locate and move between these windows and report any output and changes that might occur.
Graphical User Interfaces (GUIs) and Screen Readers
• GUIs present unique and difficult problems to the blind user.
• GUIs use visual metaphors to represent information. For example, there are files, folders, briefcases, trashcans and more, each represented by graphical icons. These icons are not easily represented by speech synthesizers designed to convert text to phonemes, so it is absolutely essential for these icons to have an associated text label which can then be spoken when the icon is selected.
More
• Visual information is spatial. Location is a component of
this space. Relative positions of objects are organized
within a 2 dimensional space.
• Auditory information is temporal (time based). It is difficult to convey the position of a pointer by speech, yet this is fundamental to the process of selecting icons in GUIs.
• An alternative approach to pointer movement is to tab through the screen. These tab stops can then be highlighted by the screen reader with auditory prompts. Navigation thus consists of a series of Tab prompts followed by the announcement of the icon labels as each is highlighted in turn.
Applications of Screen readers
• Screen readers are ideally suited to applications that consist of text. The following are some of the major application areas to which screen readers are applied.
• Auditory Reading Substitute
• The oldest and most prevalent use of auditory
substitution is talking books. Traditionally the books
have been read by actors or others and this has been
recorded on tape or disk.
• If the text of a book is available electronically, then
auditory output may be provided by synthetic speech
devices. A key issue here is the intelligibility of the
output.
Using the Web with Screen
readers
• The web is a hugely important gateway to inclusion. It provides access to information, to commerce, to entertainment, to news, to communication, to e-learning and to many other applications and functions. Screen readers are a hugely important technology for people with poor vision and reading difficulties for accessing the web.
• The following is a list of Screen Reader features
and the Web Issues associated with them.
Sequential Reading
• Screen Readers read sequentially from top to bottom,
left to right (they are easily confused by columns).
• As speech synthesis technology matures, browsers designed specifically to read HTML will make greater use of HTML tags to format output and provide options for the user. Tags that are not used according to the HTML 3.2 specification will create problems for such browsers.
• Does your site have tables that do not "read" from top to bottom, left to right? This includes the use of tables to achieve the effect of columnar text layout. If so, is the information alternatively provided in some other non-columnar format?
More web
• Immediate Start
• Screen Readers begin reading as soon as the page is
loaded.
• Do you have excessive standard text or navigational tools that appear at the top of every page? It is difficult to make a speech synthesizer "ignore" such items.
• Navigation
• Navigation is by link, word, line or character, but not by
sentence or paragraph.
• If a user hops from link to link on your web page, will she
or he hear "click here" repeated over and over again or
is the link text brief but meaningful?
More web
• Background Wallpaper
• Screen Readers cannot interpret background graphical
wallpaper (because there is no ALT attribute).
• Do you use background images that contain important
content? If so, do you provide an alternate (text-based)
method for viewing that content?
• Blinking Text
• Screen Readers cannot reliably read blinking text
(sometimes they skip it and sometimes they fixate on it).
• Many non-blind people object to blinking text on
aesthetic grounds, and it can affect speech synthesizer
software negatively as well. Do you really need to use it?
More web
• Search and Punctuation
• Screen Readers can search for text strings or
attributes (jumping from link to link is
accomplished by searching for underlined text).
They use punctuation (periods, commas, etc.) to
structure speech output.
• In addition to its negative consequences for
speech synthesis software, incorrect use of
punctuation and spelling irritates many members
of the general public. Do you use punctuation
correctly and consistently throughout your site?
Have you checked your spelling?
Images
• Images
• Screen Readers read content of ALT attributes
with images but cannot interpret images
themselves.
• Do you use brief but meaningful ALT text with all
images? Does your ALT text describe the
function of certain visual images rather than just
a description of the image? (Example: "change
of topic" rather than "blue line")
More Images
• If an image is important to the understanding or
appreciation of your web site, do you have an
adequate text description of it available?
• Blind people, people using text-only browsers
and those who have turned off automatic
downloading of images see no information when
a web page contains no text. Does the home
page of your web site contain text that could
guide such a user to a non-graphical
alternative?
• Do you provide text descriptions for video clips
and video feeds?
Maths with screen readers
• Reading and writing mathematics is
fundamentally different than reading and
writing text.
• Just being able to read mathematics is
very difficult for visually disabled people.
Non-visual representations (such as
braille) are not as powerful and flexible as
visual ones. Take for instance the following
equation:
$y = \pm\frac{\sqrt{b^{2} - 4ac}}{c}$
• This uses the two dimensions of the paper to represent the fraction
as a clear single component. It also uses the relative positions of
different components in a semantically meaningful way.
• For example the bar on the top of the square root symbol extends
over all the symbols governed by it.
• There is also no redundancy in the representation (unlike written
text); if any one symbol were deleted the meaning of the equation
would be changed completely.
• A problem with reading mathematics in non-visual formats is being able to control the reading.
• For instance, if one listens to a whole equation at once, one only
remembers a small part of it. When reading visually people can
focus on the parts of the equation which are of interest and ignore
the rest.
• A step further on from being able to read mathematical
material is to be able to manipulate it, to create new
equations and modify existing ones. This involves the
concepts of selection and re-writing.
• The issue of following mathematical representations in speech output comes down to the nature of text versus mathematics.
• Text is linear in nature while mathematical equations are
two dimensional. What you have been reading in this
text is a good example of this problem. In contrast,
examine the following relatively simple equation
Figure 1: A relatively simple equation: $a = \sqrt{\frac{x^{2} - y}{z}}$
• One will immediately notice that the equation
contains a superscript and a fraction - both
being two dimensional in nature. The equation
could have been written in a linear form, for
example:
• a = sqrt(((x super 2) - y) / z)
• For this relatively simple equation, a linear
representation is adequate for reading to a blind
user. But, with any increase in complexity it
becomes apparent that linear representations
are no longer useful.
• Making mathematics accessible to the blind is a challenging and difficult process. The computer and its range of output devices have become the foundation of numerous projects that have brought this goal closer to reality.
• With I/O devices such as high-quality speech,
musical tones, refreshable Braille, haptic
feedback and high reliability speech input, new
and effective tools will soon be on the market.
• Other research, into direct neural connectivity, will in the future make the picture even brighter.
References
• Special Access Technology by Paul Nisbet & Patrick Poon: http://callcentre.education.ed.ac.uk/About_CALL/Publications_CAA/Books_CAB/SAT_CAC/sat_cac.html
• http://www.callcentrescotland.org or http://callcentre.education.ed.ac.uk/
• [Brown 89] Brown C. Computer Access in Higher Education for Students with Disabilities. 2nd ed. Monterey, California: US Dept. of Education; 1989.
• [Cook and Hussey 1995] Cook AM, Hussey SM. Assistive Technologies: Principles and Practice. Baltimore: Mosby; 1995.
• [Dutoit 96] Dutoit T. An Introduction to Text-To-Speech Synthesis. Kluwer Academic Publishers; 1996. 326 pp.
• [Redish and Theofanos 2003] Redish J, Theofanos MF. "Observing Users Who Listen to Web Sites". STC Usability SIG Newsletter, Usability Interface, April 2003 (Vol 9, No. 4).
Text-To-Speech (TTS) "Voice" Resources
• http://www.microsoft.com/msagent/downloads/user.asp
• http://www.bytecool.com/voices.htm
• http://www.digitalfuturesoft.com/texttospeechproducts.php
• http://www.neospeech.com/product/technologies/tts.php
• http://nextup.com/TextAloud/SpeechEngine/voices.html#morefreevoices
Free Text Readers
• NaturalReader (100 character limit)
• ReadPlease
• WordTalk
• Adobe Reader
• Microsoft Reader
• Bookshelf
Commercial Text Readers
• Natural Reader Professional/Enterprise
• WYNN
• Premier Assistive Technology
• TextAloud
• Kurzweil 3000
E books
• http://www.ebooks.com/
• http://library.netlibrary.com/Home.aspx
• http://www.amazon.com/exec/obidos/tg/browse/-/551440/ref=b_tn_bh_eb/002-1204779-5767200
• http://www.diesel-ebooks.com/cgi-bin/category.cgi
• http://www.ereader.com/
• Go to Google and type in "E books" to find more titles to download.