A PARALLEL PROCESSOR FOR REAL-TIME SPEECH SIGNAL PROCESSING A•V.Ashajayanthi, S.Rajaram, IN. Viswanadham School of Automation, Indian Institute of Science Bangalore 560012, INDIA the various elements of the LP analysis and synthesis procedure, such as determinationof autocorrelation functionand LP coefficients, pitch extraction, and inverse filtering, which are directly applicableto the proposed structure. The paper concludes v4th a discussion of the suitability of this structure for other signal processing operations such as on-line identification. ABSTRACT This paper describes the structure of a parallel processing system, primarily suitable for real-time Linear Prediction (LP) analysis and synthesis of speech signals. This system employs specially developed parallel forms of the well known algorithms (Levinson's Rursion, SIFT as well as LP synthesis filter) for this purpose. The feasibility of implementing the proposed structure using microprogrammable microprocessors, is then discussed. The proposed scheme can be readily adapted for on-line identification of autoregressive processes. SYSTEM ARCHITECTURE INTRODUCTION In recent years, the mathematical technipe of Linear Prediction (LP) has evolved as an efficient method for representing speech signals in terms of a small number of slowly varying parameters. This form of representation plays a key role in speech transmission,perceptionand automatic speech production. However, the LP analysis and synthesis procedures involve exterEive cornpitations, which make it very difficult, if not impossible, to perform them on-line, on a conventional minicompiterfmicrocomputer. The use of multiprocessor computer system, which offer significant speed advantages over single processor system, provideus a viable alternative for such real-time processing. The advent of low cost microprocessors, particularly the TTL, bitsliced, microprogrammable microprocessors provideall the needed speed and design flexibility while rendering such special purpose systems practical and cost effective. The efficiency of a multiprocessing system however, depends on matching the architecture to the computational structure of the applicationconcerned. The structural characteristicsof a typical speech processing algorithms can be exploited to arrive at a relatively simple multiprocessor architecture, for meeting many functional requirements. In this paper, we propose a multiple microprocessor architecture primarily suitable for online LP analysis and synthesis of speech. Further, we present parallel versions of the algorithms, for 75 1979 IF 868 The proposed system employs a modified Single Instruction Multiple Data (SIMD) structure [l],consisting of m+l processors humbered p(O) through p(m), and interconnected as shown in fig. 1. The processorsp(l) through p(m) are simple arithmetic processors workingin strict synchronism, as in the conventional SIMD scheme. The processor p(O), however, combines both the arithmetic function common to the other processors as well as the total system control function. The latter is achieved by including in p(O)the capabilityto access the instructions, decode them, and issue the necessary control signals to all the arithmetic processors for performing the indicated functiore. Moreover, p(O) also serves as the controller for interfacing with external I/O devices, such as ADC and DACTs connected via its I/O Bus. To provide maximum parallelism in operations, each of the processors is provided with its own private memory (PM), containing the data to be operated upon. Additionally, bidirectional data transfer lines, known as Interprocessor Communication Bus (IPC) connecting adjacent processors are also provided to shift data between them.Thus data can be shifted from p(i) to either p(i-l) or p(i±l) for all i E(O,m), simultaneously, with the exceptionthat p(O) and p(m) are not considered as adjacent. The system also incorporates a bidirectional BROADCAST Bus, so that a common data element may be transmitted from any processor to all the other processors simultaneously.Even though the same function could be achieved by successive transmissions using the basic IPC Bus, the latter approachwould require m-cycles as against the single cycle needed when SCAST Bus is used. simultaneouslyby all processors p(j),j E (0,rn); comment collect the first m+l speech samples in processorsp(O) through p(rn); PARALLEL ALGORITHMS FOR LPC ANALYSIS AND SYNTHESIS A primary requirement in the use of a multi- for processor scheme, such as the one described earher in any application, is the development of suitable algorithms which can effectivelyutilize the multiprocessing capabilities andthe available data transfer links. In the present context of LP analysis and synthesis, we require parallel formsof algorithms for (i) determinationof autocorrela tion functions,(ii) determinationof the LP coefficients from a set of linear equations involving the autocorr elation functions, (iii) pitch extraction and (iv) inverse filtering. In this connection, we present parallel forms of algorithms suitable for performingthese computations on the proposed system. AutocorrelationLP analysis The LP analysis of speech signals is based on the concept that the present speech sample can be predicted as a linear combination of its m predecessors. That is, if (n) is the predicted nth sample then i:=0l until rn-i do begin comment Get the new sample into processor p(0); P(0) = S(i); comment Down shift the samples; end of i loop; samples stored, to all the processors; broadcast (P(m)); comment Compute the partial sums of autocorrelation function; R(j):= all Osi< are the speech samples and N is the number of samples in a frame. The error in the i=0 (2) to find a(i), 0eim, the total mean square error in predicting all the samples in a frame is minimized, which results in a set of m linear equationsgiven by [2], where r(j), defined as -r(n+l), 0snm .. . . (3) 0ejm are the autocorrelation funcifuin = s(k)s(k+j), j.O (4) r(j) = k=0 The LP analysis is now performed in two stages; P(j):= P(j—l), P(0):=S(i) end for ly. The algorithmfor performingthis is given below employing an ALGOL like languagenotation, following Stone [4 .Appropriate comments have been included to make the description as self- a(m, i) denotes the value LP coefficient a(i) at the end of m (o) n: = := := o 1 nntil rn-i do beg4n k(n) := -J3(n)/c((n); :=oc(n)+k(n)xJ3(n); a(n±l, o) := a(n, o); for i =1 until n+l do begin a(n+i,i) := a(n,i)+k(n)a(n,n_i+l); oc.(n+i) pl explanatory function Initialise autocorrelation functions initialization. iterations; m+i autocorrelation functions simultaneais- comment (ljm) a(o,o) := 1; Using the systemproposed we can compute evaluation; (0jm); loop; of computing the autocorrelation functions and (ii) solving the system of equations (3), by using Levinson recursion[3]. as possible. Algorithm comment Autocorrelation of j comment (i) all the R(j)+P(m)xP(j), comment Shift down the speech samples and get the next sampleinto processor p(0); In the end, any processorp(i) will containh(m-i), for all i E(O, m). Next all the autocorrelation functions are broadcast to all the processors, to enable further analysis. This will requirem+i broadcast operations. Next, consider the Levinson's recursion procedure for computing the LP coefficients, of eqn. (3). The normal sequentialmethod of computing a(i), i (0, m) can be described as follows using ALGOL notation for simplifying the description of the algorithm. Algorithm comment Sequential procedure for Levinsonrecursion; Thus, a(i)r(n+i-i) P(i) in the processor P(i) will contain speech sample S(m-i) for Oim; comment Evaluate autocorrelation functions; for j = m 1 until N+m do begin comment Broadcast P(m), the oldest of the (rn+l) (1) (n) -1:, are lim a(i)s(n—i) the LP coefficients, s(i), fr = (lj m); P(0): = S(m); comment Now the variable where a(i), prediction of nth sample is e(n) = s(n)-'(n) a(i)s(n—i) = P(j—l), h to zero; R(j) := 0, (0jm); comment The notation (0jm) following statement indicates that this operation is performel 869 end of i loop; for j =o st begin f 1 until n+1 do (n+l) := j3(n+i)±a(n+1,j)XR(n±i—j); end of j loop; form realization of the all-pole synthesis filter can be converted into the parallelform of the algorithm as givenbelow. Here e(i) indicates the excitation functionwhich is either an impulse sequence (for voiced frames)or noise sequence (for unvoicedframes). Algorithm commentsynthesis; commentinitialization; end of n loop; are the required LP coefficients. a(m. i), We can convert this sequential algorithm into a parallel form to take full advantage of our proposedstructure, by repeating some of the ournputations.In fact, each processor p(i) will compute both a(n., i) and a(n, n-i), so that the basic data shifts permitted betweenadjacent processors, are sufficient to align the a(i) pairs appropriately for further iteration. Algorithm commentparallel Levinson recursion; oim S(0) := e(0); Sl(j) := 0, (0jxn); for i:=1 1 until N-l do begin comment initialisation; a(o,o) broadcast (S(i-l)); 1; b(o, 1) := comment Compute partial sums; Sl(j) := Sl(j)+S(i—l)xá(m,j), comment shift up the partial sums; Sl(j) := S1(j+l), (0jm-l); comment Compute speech output in processor S(i) := e(i)—Sl(0); (ljm); 1; commentin generalb(i,j) will contain a(i, i-fl for j E (o, m); := comment evaluation for n:=o of a' s; end until rn-i do -/3/°i; '(+k; broadcast c(:= (k); c(n+l,j) : = a(n,j)+b(n,j)k, (o.jm); b(n+l,j) : = b(n,j)+a(n,j)Xk, (oejm); a(n+l,j) c(n+i,j), (os-jrn); overall speed up that results may not be m+l. By comparing individual operations such as loading, additions, multiplications,broadcasts and data shifts in both the normal sequentialform as well as the suggested parallel form, we can obtain a measure of the effective speed up possible with this system. These results are summarized in Table 1 below. In arriving at the numerical values for the effective speed up, we have assumed that the add time and load time are same, the broadcast and shift operation require twice the load time, while the multiplication operation is equivalentto 16 load/add operations (as is possible with AM2903 microprocessors). Table I. Effective speed up for the parallel algorithms (Computed for the case with 1\=256,m=l3) Time for Time for Effective Procedure sequential parallel speed up ratio system system 55263 5667 9.75 Proc.1 26 Proc.2 2963 1446 2.04 Proc.3 53138 6121 8.68 Proc.4 1(j) := a(n+i,j).'cR(n+2—j), (ojm); 2(j) := b(n+1,j)xR(j+l), (ojm); comment if nxnod2=l else J31 := fx] denotes integral part of x; thenJj):= l(j)+J32(j), (j In/21+l) (j):= 3l(j)+ p2(j), ( ojm); for j:=l until Ff12] +1 do begin comment shift up partial sums of J31(j+1), (ojm—i); comment and compute; j31(j) end J3 := := j÷ )l(o); of j loop; b(n+i,j) := end of n loop; b(n+l,j-l), (lj-m); Pitch extraction The well known SIFT algorithm Es] for extraction can be readily adapted tothe pitch suggested structure, since it essentially involves Levinson's recursion of fourth order, evaluation of residual error signals using these coefficients by inverse filtering, and evaluationof autocorrelation functions of these error signals. The computations of Levinsonrecursion and autocorrelation can be carried out using the parallel algorithm similar to those alreadygiven, while inverse filtering can be paralle]ized as described in the next sub-section. The voiced/unvoiced frame detection, however, needs to be performed sequentially. Total 111364 13260 8.40 (Proc. 1: autocorrelation evaluation, Proc.2: Broadcast autofunctions, Proc. 3: Levinson recursion, Proc.4: Synthesis. All execution times shown are in terms of basic load operations) Synthesis: For of i loop; EFFECTIVE SPEEDUP While the parallel form of an algorithm exploits the availability of m+l independent processors for performingarithmetic, operations simultaneously, they involve additional overheads such as broadcasts and data shifts, for aligning the data as required. In view of these overheads, the begin k:= p(O) the synthesis of speech signals, direct 870 line identification of autoregressive process.The parallel forms of basic algorithmsneeded for use HARDWARE REALIZATION The availability of microprogrammable,bit sliced microprocessors has simplified the design of array processors, such as the one described with this structure are also described alongwith a estimate of the speed benefits that quantitative would result. A specific hardware r.ealisation of th.s unit, employing the AM2900 family of microprocessors is in progress. References [1] M.J. Flynn, 'Very high speed Computing systems, "Proc.IEEE Vol.54, pp. l9Ol-.09:JJec,1966. J. [2] D. Markel and A. H. Gray,Jr. Linear predic tion of speech,New York:Springer-verlag 1976. [3] N. Levinson,"The Weiner RMS error criterion in filter design and prediction, 'J. Math. Phys. Vol.25,pp. 261 -278; 1974. [4] H. S.Stone,Introductionto ComputerArchi- While we have built a processor with very similar architecture, on an experimeiin this paper. al basis earlier, using INTEL-3000 family of devices [6), the availability of more powerful bitslices as the AM2900 family of devices, is found to be more attractive for building a complete processorfor such real-time speech applicatiors. Specifically, each of the arithmetic proces sors p(l) through p(m), is to be a 16 bit unit, realised using four numbers of AM2903 microprocessor slices, along with a AM2902 carry look ahead generator. The processor p(O), will in addition, containthree AM2909microsequencers cas - tecture,Chicago:SRA, 1975. [5] J.D.Markel,"TheSIFT algorithm for funda- caded together, which together with a ROM containing microinstructions will function as the central microprogrammed control element. It is estimated that about 512 words of 60 bits each will be required as the control memory. The usual commercially available semiconductor RAM's with an access time 300 nsecs are proposed to be used as the private memories. SOME POSSIBLE EXTENSIONS The algorithms presented above can be extended to include more versatile speech transmission schemes such as the adaptive frame rate and improved modelling of speech using overlapped frame analysis. In transmission schemes using adaptive frame rates, the LP coefficientsare transmitted only when the distance function determined,based on the autocorrelation functions of the speech samples of the current frame and those of the LP coefficients of the previous frame, exceeds certain threshold valueL7j,f8J.The distance function can be computedin paralle, in a similar manner as des cribed earlier, in the case of the overlappedfrarre analysis, the computation requirement is one of calculating two sets of autocorrelationfunctions one for the normal succession of speech frame and the other for the overlappedframe, and also the respective LP coefficients. The algorithm already described could be modified to suit the above rec'uirement. Also,the Levinson's recursion algorithm described above (wherethe order of recursion is kmn a priori), can be modified to suit the requirement for on-line identificationof AR process j91 (which requires that the recursion be continued till tin difference in residual error betweensuccessive iterations becomes constant). mental frequency estimation'IEEE Acoustics. Vol.AU—20, pp. 367—377: Dec. 1972. K. Prasaima Kumar, H. K. SampathKumar, [s] V. "Vector instruction processor VIP-3000' ME project report, School of Automation, Indian Institute of Science, Bangalore, 1978. [7] D.J.Magill,"Adaptive speech compressionfor Packet Coiruiinimtion Systems,'Tecom. conf. Record, IEEE Pub. 73CH0805-2, 29D1-5,1973. [8] F. Itakura,"Minimum prediction Residual Principle Appliedto Speech RecognitionIEEE ASSP, Vol.ASSP-23pp. 67-72:Feb. 1975. [9] M.D. Srinath,M.M.Viswanathan, "Sequential algorithm for identificationof parameters of autoregressive process, "IEEE Aut. Control, Vol.AC-20,pp. 542 -546: Aug.1975. CONCLUSIOr' structure Of a parallel processing system, employing microprogrammable microprocessors, which can be adapted for a number of real-time signal processing tasks such as LP analysis and synthesis of speech, and onWe have described the Fl6.1 871 PROPOSED STRUCTURE