International Journal of Advancements in Research & Technology, Volume 2, Issue 5, May-2013, ISSN 2278-7763

ANALYSIS AND SYNTHESIS OF SPEECH USING MATLAB

Vishv Mohan (State Topper Himachal Pradesh 2008, 2009, 2010)
B.E. (Hons.) Electrical and Electronics Engineering; President-NCSTU
University: Birla Institute of Technology & Science, Pilani-333031 (Rajasthan), India
E-mail: vishv.mohan.1@gmail.com

ABSTRACT

Sound recording is an electrical or mechanical inscription of sound waves, such as spoken voice, singing, instrumental music, or sound effects. The two main classes of sound recording technology are analog recording and digital recording. Acoustic analog recording is achieved by a small microphone diaphragm that detects changes in atmospheric pressure (acoustic sound waves) and records them as a graphic representation of the sound waves on a medium such as a phonograph. The interval of each sound wave has a different frequency in its sub-sections. This paper analyzes two MATLAB functions, GenerateSpectrogram.m and MatrixToSound.m, used to analyze and synthesize speech signals. The first, GenerateSpectrogram.m, records sound from the user (more precisely, from the source) for a user-defined duration, asks for the parameters required to compute the spectrogram, and returns a matrix with frequency along the rows and time along the columns, in which each element is the amplitude of that frequency at that time. MatrixToSound.m uses additive synthesis to generate sound from a user-defined matrix with frequencies as its rows and time as its columns. Digital recording converts the analog sound signal picked up by the microphone to digital form by a process of digitization, allowing it to be stored and transmitted by a wider variety of media.
Digital recording stores audio as a series of binary numbers representing samples of the amplitude of the audio signal at equal time intervals, at a sample rate high enough to convey all sounds capable of being heard. Analysis and synthesis of sound are applied here to create speech from a matrix whose elements are frequency-domain or time-domain analysis parameters with specific amplitudes.

IJOART Copyright © 2013 SciResPub.

Keywords: spectrum, synthesis, simulation, frequency, sound-waves, amplitude, wave sequence.

INTRODUCTION

Speech is an acoustic signal: a mechanical wave, that is, an oscillation of pressure transmitted through a solid, liquid, or gas, composed of frequencies within the hearing range. Sound is a sequence of pressure waves that propagates through compressible media such as air or water. The audible range of sound is 20 Hz to 20 kHz at standard temperature and pressure. During propagation, waves can be reflected, refracted, or attenuated by the medium.

Recording of Sound

Sound recording is an electrical or mechanical inscription of sound waves, such as spoken voice, singing, instrumental music, or sound effects. The two main classes of sound recording technology are analog recording and digital recording. Acoustic analog recording is achieved by a small microphone diaphragm that detects changes in atmospheric pressure (acoustic sound waves) and records them as a graphic representation of the sound waves on a medium such as a phonograph (in which a stylus senses grooves on a record). In magnetic tape recording, the sound waves vibrate the microphone diaphragm and are converted into a varying electric current, which an electromagnet then converts to a varying magnetic field; this field records a representation of the sound as magnetized areas on a plastic tape with a magnetic coating.
Digital recording converts the analog sound signal picked up by the microphone to digital form by a process of digitization, allowing it to be stored and transmitted by a wider variety of media. Digital recording stores audio as a series of binary numbers representing samples of the amplitude of the audio signal at equal time intervals, at a sample rate high enough to convey all sounds capable of being heard. Digital recordings are considered higher quality than analog recordings not necessarily because they have higher fidelity (wider frequency response or dynamic range), but because the digital format can prevent much of the quality loss found in analog recording due to noise and electromagnetic interference in playback, and to mechanical deterioration or damage to the storage medium. A digital audio signal must be reconverted to analog form during playback before it is applied to a loudspeaker or earphones.

Analysis of Sound Signal

The long-term frequency analysis of speech signals yields good information about the overall frequency spectrum of the signal, but no information about the temporal location of those frequencies. Since speech is a very dynamic signal with a time-varying spectrum, it is often insightful to look at frequency spectra of short sections of the speech signal.

Long-term frequency analysis

The frequency response of a system is defined as the discrete-time Fourier transform (DTFT) of the system's impulse response h[n]:

$$H(\omega) = \sum_{n=-\infty}^{\infty} h[n]\, e^{-j\omega n}$$

Similarly, for a sequence x[n], its long-term frequency spectrum is defined as the DTFT of the sequence:

$$X(\omega) = \sum_{n=-\infty}^{\infty} x[n]\, e^{-j\omega n}$$

Theoretically, we must know the sequence x[n] for all values of n (from n = -∞ to n = ∞) in order to compute its frequency spectrum.
Fortunately, all terms where x[n] = 0 do not matter in the sum, and therefore an equivalent expression for the sequence's spectrum is

$$X(\omega) = \sum_{n=0}^{N-1} x[n]\, e^{-j\omega n}$$

Here we've assumed that the sequence starts at 0 and is N samples long. This tells us that we can apply the DTFT only to the non-zero samples of x[n] and still obtain the sequence's true spectrum X(ω). But what is the correct mathematical expression to compute the spectrum over a short section of the sequence, that is, over only part of the non-zero samples of the sequence?

Window sequence

It turns out that the mathematically correct way to do that is to multiply the sequence x[n] by a 'window sequence' w[n] that is non-zero only for n = 0, ..., L-1, where L, the length of the window, is smaller than the length N of the sequence x[n]:

$$x_w[n] = x[n]\, w[n]$$

Then we compute the spectrum of the windowed sequence $x_w[n]$ as usual:

$$X_w(\omega) = \sum_{n=-\infty}^{\infty} x_w[n]\, e^{-j\omega n}$$

[Figure: a window sequence w[n] applied to the sequence x[n].]

As the figure shows, the windowed sequence is shorter than the original sequence, so we can further truncate the DTFT of the windowed sequence:

$$X_w(\omega) = \sum_{n=0}^{L-1} x_w[n]\, e^{-j\omega n}$$

Using this windowing technique, we can select any section of arbitrary length of the input sequence x[n] by choosing the length and location of the window accordingly. The remaining question is how the window sequence w[n] affects the short-term frequency spectrum.

Effect of the window

To answer that question, we need to introduce an important property of the Fourier transform. The diagram below illustrates the property graphically:

I. Implementation of an LTI system in the time domain.
II. Equivalent implementation of an LTI system in the frequency domain.
The two implementations of an LTI system are equivalent: they give the same output for the same input. Hence, convolution in the time domain = multiplication in the frequency domain:

$$y[n] = x[n] * h[n] \;\Longleftrightarrow\; Y(\omega) = X(\omega)\, H(\omega)$$

And since the time domain and the frequency domain are each other's dual in the Fourier transform, it is also true that multiplication in the time domain = convolution in the frequency domain:

$$x_w[n] = x[n]\, w[n] \;\Longleftrightarrow\; X_w(\omega) = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(\theta)\, W(\omega - \theta)\, d\theta$$

This shows that multiplying the sequence x[n] by the window sequence w[n] in the time domain is equivalent to convolving the spectrum of the sequence, X(ω), with the spectrum of the window, W(ω). The result of this convolution in the frequency domain is that the spectrum of the sequence is 'smeared' by the spectrum of the window.

[Figure: example of a sequence spectrum smeared by the window spectrum.]

a) Choice of window

Because the window determines the spectrum of the windowed sequence to a great extent, the choice of the window is important. MATLAB supports a number of common windows, each with its own strengths and weaknesses.

[Figure: some common window choices.]

All windows share the same general characteristics. Their spectrum has a peak, called the main lobe, and ripples to the left and right of the main lobe, called the side lobes. The width of the main lobe and the relative height of the side lobes differ for each window. The main-lobe width determines how accurately a window can resolve different frequencies: wider is less accurate. The side-lobe height determines how much spectral leakage the window has. An important thing to realize is that we can't have short-term frequency analysis without a window. Even if we don't explicitly use a window, we are implicitly using a rectangular window.

b) Parameters of the short-term frequency spectrum

Besides the type of window (rectangular, Hamming, etc.), there are two other factors in MATLAB that control the short-term frequency spectrum: window length and the number of frequency sample points. The window length controls the fundamental trade-off between time resolution and frequency resolution of the short-term spectrum, irrespective of the window's shape. A long window gives poor time resolution, but good frequency resolution. Conversely, a short window gives good time resolution, but poor frequency resolution. For example, a 250 millisecond window can, roughly speaking, resolve frequency components when they are 4 Hz or more apart (1/0.250 = 4), but it can't tell where in those 250 milliseconds those frequency components occurred. On the other hand, a 10 millisecond window can only resolve frequency components when they are 100 Hz or more apart (1/0.010 = 100), but the uncertainty in time about the location of those frequencies is only 10 milliseconds.

The result of short-term spectral analysis using a long window is referred to as a narrowband spectrum (because a long window has a narrow main lobe), and the result of short-term spectral analysis using a short window is called a wideband spectrum. In short-term spectral analysis of speech, the window length is often chosen with respect to the fundamental period of the speech signal, i.e., the duration of one period of the fundamental frequency. A common choice for the window length is either less than one fundamental period, or greater than 2-3 fundamental periods.

[Figure: examples of narrowband and wideband short-term spectral analysis of speech.]

The other factor controlling the short-term spectrum in MATLAB is the number of points at which the frequency spectrum H(ω) is evaluated. The number of points is usually equal to the length of the window.
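As a rough numerical check of the trade-off described above, the following MATLAB sketch converts the two example window durations into sample counts and approximate frequency resolutions, and then computes one short-term spectrum with as many frequency points as window samples. The 8000 Hz sampling rate, the 440 Hz test tone, and all variable names are assumptions chosen for illustration; the Hamming window is written out explicitly so that no toolbox is needed.

```matlab
% Sketch: window length vs. approximate frequency resolution.
fs = 8000;                            % assumed sampling rate in Hz
windowDurations = [0.250 0.010];      % window lengths in seconds (from the text)

for T = windowDurations
    L  = round(T * fs);               % window length in samples
    df = 1 / T;                       % approximate resolvable spacing in Hz
    fprintf('T = %3.0f ms: L = %4d samples, resolution ~ %3.0f Hz\n', ...
            1000 * T, L, df);
end

% Short-term spectrum of one frame, evaluated at as many frequency
% points as there are window samples.
L  = round(0.010 * fs);               % 10 ms window in samples
n  = (0:L-1)';
x  = sin(2*pi*440*n/fs);              % hypothetical test tone at 440 Hz
w  = 0.54 - 0.46*cos(2*pi*n/(L-1));   % Hamming window written out explicitly
Xw = fft(x .* w, L);                  % L-point DFT of the windowed frame
binSpacing = fs / L;                  % spacing of the frequency samples in Hz
```

Passing a second argument to fft that is larger than the frame length zero-pads the frame, which samples the same underlying spectrum more densely without improving the resolution set by the window length.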
Sometimes a greater number of frequency points is chosen to obtain a smoother-looking spectrum. Evaluating H(ω) at fewer points than the window length is possible, but very rare.

c) Time-frequency domain: Spectrogram

An important use of short-term spectral analysis is the short-time Fourier transform, or spectrogram, of a signal. The spectrogram of a sequence is constructed by computing the short-term spectrum of a windowed version of the sequence, then shifting the window to a new location and repeating this process until the entire sequence has been analyzed.

[Figure: construction of the spectrogram from successive short-term spectra.]

Together, these short-term spectra (bottom row) make up the spectrogram, and are typically shown in a two-dimensional plot, where the horizontal axis is time, the vertical axis is frequency, and magnitude is the color or intensity of the plot.

[Figure: example spectrogram.]

The appearance of the spectrogram is controlled by a third parameter: window overlap. Window overlap determines how much the window is shifted between repeated computations of the short-term spectrum. Common choices for window overlap are 50% or 75% of the window length. For example, if the window length is 200 samples and the window overlap is 50%, the window is shifted by 100 samples between successive short-term spectra. With 75% overlap, the window is shifted by 50 samples. The choice of window overlap depends on the application. When a temporally smooth spectrogram is desirable, window overlap should be 75% or more. When computation should be kept to a minimum, no overlap or 50% overlap are good choices. If computation is not an issue, you could even compute a new short-term spectrum for every sample of the sequence. In that case, window overlap = window length - 1, and the window shifts only 1 sample between spectra.
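The shift arithmetic above can be sketched in MATLAB as follows. The signal length N and the variable names are hypothetical, and the frame count assumes that only complete frames are kept, which is how the documentation of MATLAB's spectrogram function describes its column count.

```matlab
% Sketch: window shift (hop size) implied by a given window overlap.
windowLength    = 200;                          % samples, as in the example above
overlapFraction = [0 0.50 0.75];                % no overlap, 50 %, 75 %
overlapSamples  = round(overlapFraction * windowLength);
hopSamples      = windowLength - overlapSamples; % shift between successive spectra

% Number of complete frames for a signal of N samples (hypothetical N).
N = 8000;
numFrames = floor((N - overlapSamples) ./ hopSamples);

% Extreme case: a new short-term spectrum for every sample.
maxOverlap = windowLength - 1;
minHop     = windowLength - maxOverlap;         % shift of a single sample
```

Raising the overlap from 50% to 75% roughly doubles the number of spectra to compute, which is the computational cost referred to in the text.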
Computing a new spectrum for every sample is wasteful, however, when analyzing speech signals, because the spectrum of speech does not change at such a high rate. It is more practical to compute a new spectrum every 20-50 milliseconds, since that is the rate at which the speech spectrum changes.

In a wideband spectrogram (i.e., using a window shorter than the fundamental period), the fundamental frequency of the speech signal resolves in time. That means that you can't really tell what the fundamental frequency is by looking at the frequency axis, but you can see energy fluctuations at the rate of the fundamental frequency along the time axis. In a narrowband spectrogram (i.e., using a window 2-3 times the fundamental period), the fundamental frequency resolves in frequency, i.e., you can see it as an energy peak along the frequency axis.

GenerateTimeVsFreq.m

    % Record speech and compute its time-vs-frequency matrix (spectrogram).
    Duration = input('Enter the time in seconds for which you want to record: ');
    samplingRate = input('Enter what sampling rate is required of audio 8000 or 22050: ');
    timeResolution = input('Enter the time resolution desired in millisecond: ');
    frequencyResolution = input('Enter the frequency resolution required: ');
    % Window length (samples) needed for the requested frequency resolution.
    usedWindowLength = ceil(samplingRate/frequencyResolution);
    recObj = audiorecorder(samplingRate, 8, 1);   % 8-bit, single channel
    disp('Start speaking.')
    recordblocking(recObj, Duration);
    disp('End of Recording.');
    % Play back the recording.
    play(recObj);
    % Store data in double-precision array.
    myRecordingData = getaudiodata(recObj);
    figure(1)
    plot(myRecordingData);
    % No of data points = samplingRate*Duration;
    % No of columns in spectrogram = (Duration*1000)/timeResolution;
    %                              = Duration*frequencyResolution;
    % Window shift (samples) corresponding to the requested time resolution.
    actualWindowLength = ceil((samplingRate*timeResolution)/1000);
    overlapLength = usedWindowLength - actualWindowLength + 4;
    % Keep the overlap non-negative and smaller than the window length.
    overlapLength = min(max(overlapLength, 0), usedWindowLength - 1);
    % Compute the spectrogram; nfft = samplingRate-1 gives roughly 1 Hz per row.
    S = spectrogram(myRecordingData, usedWindowLength, overlapLength, samplingRate-1, samplingRate);
    [ar, ac] = size(S);
    % Resize the magnitude to one column per time-resolution step
    % (imresize requires the Image Processing Toolbox).
    numColumns = round((Duration*1000)/timeResolution);
    AbsoluteMagnitude = imresize(abs(S), [ar numColumns]);
    % Plot the spectrogram.
    figure(2)
    spectrogram(myRecordingData, 256, 200, 256, samplingRate, 'yaxis');
    timeInterval = input('Enter the time interval in terms of multiple of time resolution to see the frequencies present at that moment: ');
    figure(3)
    plot(AbsoluteMagnitude(:, timeInterval));

Synthesis of Sound

There are many methods of sound synthesis. Jeff Pressing, in "Synthesizer Performance and Real-Time Techniques", lists the following approaches to sound synthesis: additive synthesis, subtractive synthesis, frequency modulation synthesis, sampling, composite synthesis, phase distortion, wave shaping, re-synthesis, granular synthesis, linear predictive coding, direct digital synthesis, wave sequencing, vector synthesis, and physical modeling.

Additive synthesis generates sound by adding the output of multiple sine wave generators. Harmonic additive synthesis is closely related to the concept of a Fourier series, which is a way of expressing a periodic function as the sum of sinusoidal functions with frequencies equal to integer multiples of a common fundamental frequency. These sinusoids are called harmonics, overtones, or, generally, partials. In general, a Fourier series contains an infinite number of sinusoidal components, with no upper limit to the frequency of the sinusoidal functions, and includes a DC component (one with a frequency of 0 Hz). Frequencies outside the human audible range can be omitted in additive synthesis. As a result, only a finite number of sinusoidal terms with frequencies within the audible range are modeled in additive synthesis.
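To make the "finite number of terms" concrete, the short MATLAB sketch below counts how many harmonics of a given fundamental can be kept. The 220 Hz fundamental and the variable names are hypothetical, and the additional cap at half the sampling rate (the Nyquist limit) is an assumption that goes beyond the text, which mentions only the audible range.

```matlab
% Sketch: how many harmonics of a fundamental can be retained.
f0   = 220;                       % hypothetical fundamental frequency in Hz
fs   = 22050;                     % one of the sampling rates offered in the paper
fMax = min(20000, fs/2);          % audible limit, further capped at fs/2
K    = floor(fMax / f0);          % index of the highest retained harmonic
harmonicFreqs = (1:K) * f0;       % frequencies of the retained partials in Hz
```

At this sampling rate the limit is set by fs/2 rather than by hearing, so a higher sampling rate would admit more partials for the same fundamental.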
We use additive synthesis to synthesize sound from a matrix whose rows are frequencies and whose columns are time intervals.

a) Additive Synthesis

Additive synthesis is a sound synthesis technique that creates timbre by adding sine waves together. In music, timbre, also known in psychoacoustics (the scientific study of sound perception) as tone color or tone quality, is the quality of a musical note, sound, or tone that distinguishes different types of sound production, such as voices and musical instruments (string instruments, wind instruments, and percussion instruments).

b) Harmonic form

The simplest harmonic additive synthesis can be mathematically expressed as:

$$y(t) = \sum_{k=1}^{K} r_k \cos\left(2\pi k f_0 t + \phi_k\right)$$

where $y(t)$ is the synthesis output; $r_k$, $k f_0$, and $\phi_k$ are the amplitude, frequency, and phase offset of the $k$-th harmonic partial of a total of $K$ harmonic partials; and $f_0$ is the fundamental frequency of the waveform and the frequency of the musical note.
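The harmonic form can be evaluated directly. The following is a minimal MATLAB sketch, not the paper's MatrixToSound.m: the sampling rate, fundamental frequency, amplitudes, and phase offsets are hypothetical values chosen only to illustrate the sum.

```matlab
% Sketch of y(t) = sum_k r_k*cos(2*pi*k*f0*t + phi_k) for K harmonic partials.
fs  = 8000;                       % assumed sampling rate in Hz
f0  = 200;                        % hypothetical fundamental frequency in Hz
K   = 5;                          % number of harmonic partials
r   = 1 ./ (1:K);                 % hypothetical amplitudes r_k
phi = zeros(1, K);                % hypothetical phase offsets phi_k
t   = (0:fs-1) / fs;              % one second of sample times

y = zeros(size(t));
for k = 1:K
    % Add the k-th partial at frequency k*f0.
    y = y + r(k) * cos(2*pi*k*f0*t + phi(k));
end
y = y / sum(r);                   % scale so that |y| never exceeds 1
% sound(y, fs);                   % uncomment to listen to the result
```

Changing the amplitude vector r while keeping f0 fixed changes the timbre but not the pitch, which is the sense in which additive synthesis "creates timbre".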
c) Time-dependent amplitudes

More generally, the amplitude of each harmonic can be prescribed as a function of time, $r_k(t)$, in which case the synthesis output is

$$y(t) = \sum_{k=1}^{K} r_k(t) \cos\left(2\pi k f_0 t + \phi_k\right)$$

d) Matlab Code

MatrixToSound.m

    % FUNCTION TO PLAY SOUND FROM THE MATRIX
    samplingRate = input('Please enter the sampling rate used: ');
    timeResolution = input('Please enter the time resolution in milliseconds: ');
    matrix = input('Please enter the matrix for conversion to sound: ');
    lowerThreshold = input('Please enter the lower threshold value below which the matrix element should be neglected (a number between 0 and 255): ');
    % Sample times covering one column (one time-resolution step).
    time = 0:1/samplingRate:(timeResolution/1000);
    [mrows, mcolumn] = size(matrix);
    count = 0;
    NoOfComponents = numel(time);
    InitialSoundMatrix = zeros(NoOfComponents, mcolumn);
    for j = 1:mcolumn
        % Start each column from silence so that earlier columns do not leak in.
        SineVector = zeros(1, NoOfComponents);
        for i = 1:mrows
            if matrix(i,j) > lowerThreshold
                % Row index i is used directly as the frequency in Hz.
                t = matrix(i,j)*sin(2*pi*time*i);
                count = count + 1;
                SineVector = SineVector + t;
            end
        end
        InitialSoundMatrix(:,j) = SineVector';
    end
    % Normalize; max(count,1) avoids division by zero when no element exceeds the threshold.
    SoundMatrix = InitialSoundMatrix./(255*max(count,1));
    % Concatenate the columns into one long signal and play it.
    SoundColumn = reshape(SoundMatrix, [], 1);
    soundsc(SoundColumn, samplingRate);

Conclusion

The spectra of the sound as a function of time can be computed using the GenerateTimeVsFrequency.m MATLAB file, and the result approximately matches that of MATLAB's specgramdemo function. Additive synthesis of sound can be simulated with the MATLAB file MatrixToSound.m, which approximates the actual sound.

Acknowledgement

My research paper is dedicated to my parents, Sh. Vasu Dev Sharma, Lecturer in Biology at Government Senior Secondary School, Bilaspur, Himachal Pradesh (India), and Smt. Bandna Sharma, T.G.T. Mathematics at Sarswati Vidya Mandir, Bilaspur, Himachal Pradesh (India), whose blessings and wishes enabled me to complete this paper effectively and efficiently.
References

(a) Textbooks
[1] Oppenheim, A.V., and R.W. Schafer, Discrete-Time Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, 1989, pp. 713-718.
[2] Rabiner, L.R., and R.W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, Englewood Cliffs, NJ, 1978.

(b) Web sources
1) http://www.mathworks.in/matlabcentral/fileexchange/index?utf8=%E2%9C%93&term=spectrogram
2) http://en.wikipedia.org/wiki/Additive_synthesis#Time-dependent_amplitudes
3) http://isdl.ee.washington.edu/people/stevenschimmel/sphsc503/
4) http://hyperphysics.phy-astr.gsu.edu/hbase/audio/synth.html

By: Er. Vishv Mohan, s/o Sh. Vasu Dev Sharma
State Topper Himachal Pradesh 2008, 2009, 2010.
B.E. (Hons.) Electrical & Electronics Engineering, BITS-Pilani (Rajasthan)-333031, India.
vasuvishv@gmail.com