ARRAY CANALIZED CODING TECHNIQUE FOR FREQUENCY BAND COMPRESSION IN SPEECH TELECOMMUNICATION SYSTEMS

Shahrokh Sanati
Department of Communications Technology, University of Ulm, Germany

ABSTRACT

This paper presents the author's design of a real-time system whose main application is to analyze human speech at the calling party, compress/code it into a bandwidth of about 500 Hz instead of the traditional 4 kHz of fixed telephony systems, and then reconstruct the speech at the called party. This increases the efficiency of frequency band usage, in particular extracting more gain from installed infrastructure in rural areas, where extending the network is difficult owing to geographical obstacles, or where financial constraints apply, as in developing countries. Given the growing military need for reliable wireless communications in a limited and crowded spectrum, the technique could also be applied to military wireless communications.

KEYWORDS

Vowels, Consonants, Modeling the Human Speaking Organ, Array Coding/Decoding

1. INTRODUCTION

Since almost all fixed telephony systems around the world treat a band of 4 kHz as the frequency spectrum containing the main part of human voice information, all interfaces and downstream parts of fixed telephony systems are designed on this assumption. These parts include filters, sampling and analog-to-digital conversion (ADC), digital-to-analog conversion (DAC), framing and synchronization in Time Division Multiple Access (TDMA), and the subsequent interfaces used in low- or high-density transmission over radio links or Optical Line Terminals (OLT), which normally connect the nodes, in this case switching centers, of a telephony network. This paper concentrates on the characteristics of human voice information and investigates how to reduce the 4 kHz frequency band mentioned above.
To achieve this goal, the fundamental principles of human voice are first described and simply modeled, and then, by introducing the idea of Array Canalized Coding/Decoding, a complete model is given that could be implemented as the solution interface.

2. CHARACTERISTICS OF HUMAN VOICE INFORMATION

2.1 Vowels

Vowels (a, e, i, o, u being the major vowels in the English language) come directly from the vocal cords. Their frequency spectra are approximately regular and not much influenced by the teeth, lips, or vocal cavities (mouth and nose). Vowels create the basic part of speech because they help:
- to make unlimited, easy-to-pronounce combinations of letters (words) in each language
IADIS International Conference on Applied Computing 2005

- to contrast between similar words
- to convey feelings while talking.

As mentioned, every vowel has an almost regular frequency spectrum, but observation of these spectra also shows that within them there are some very narrow bands, which could even be regarded as certain single frequencies. While talking, whenever a vowel appears, these very narrow bands or frequencies stand out compared with the other parts of the vowel's spectrum. They are the so-called formants. Figure 1 shows the spectra of eight American English vowels.

Figure 1. Formants of eight American English vowels (speaker is male)

Formant frequencies depend on many factors, but the most important ones are the following (see also Figure 2):
- A_min: minimum cross-sectional area along the path between the vocal cords and the lips,
- L: distance from A_min to the glottis,
- A_lip: lip opening (the open area between the lips).

Figure 2. Relation between the values of A_min, L and A_lip (Right: pronouncing ε; Left: pronouncing α)
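Formants of the kind shown in Figure 1 appear as narrow prominent bands in an otherwise fairly regular spectrum. As a rough illustration of formant detection (not the paper's method), the following Python sketch picks the strongest smoothed spectral peaks of a voiced frame as formant candidates; the function name, frame length and smoothing width are our own assumptions.

```python
import numpy as np

def formant_peaks(frame, fs, n_peaks=3, fmax=3500.0):
    """Return the n_peaks strongest smoothed spectral peaks below fmax (Hz)."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    # Smooth the magnitude spectrum so narrow ripples do not register as peaks.
    smooth = np.convolve(spectrum, np.ones(5) / 5.0, mode="same")
    keep = freqs < fmax
    smooth, freqs = smooth[keep], freqs[keep]
    # Indices of local maxima (strictly larger than both neighbours).
    idx = np.where((smooth[1:-1] > smooth[:-2]) & (smooth[1:-1] > smooth[2:]))[0] + 1
    strongest = idx[np.argsort(smooth[idx])[::-1][:n_peaks]]
    return np.sort(freqs[strongest])

# Synthetic voiced frame with spectral peaks near 700 Hz and 1200 Hz.
fs = 8000
t = np.arange(2048) / fs
frame = np.sin(2 * np.pi * 700 * t) + 0.8 * np.sin(2 * np.pi * 1200 * t)
```

For this synthetic frame, the two strongest peaks land near 700 Hz and 1200 Hz, mimicking the first two formants of a vowel.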
Figure 3 shows the first five calculated formants of vowels as a function of L, with A_min = 0.65 cm². As observed, variation of A_lip at fixed A_min does not seriously affect the formant frequencies, and this leads to a valuable result: regardless of the physical shape and size of the speaking organs, L in any language is approximately constant, because people of the same local nationality have learnt to speak and pronounce vowels in a similar way. Therefore, it is possible to designate several main frequencies, or very narrow frequency bands, for detecting formants by tracking them against tables of experienced defaults.

Figure 3. First five calculated formants of vowels. With A_min = 0.65 cm², the three sets of curves represent lip openings (A_lip) of 4 cm² (unrounded), 2 cm² and 0.65 cm² (rounded)

It is interesting to point out that the second and third formants of female speakers are very similar to the corresponding formants of male speakers, but the fourth formants of females are higher than those of males. Perhaps this is the natural key that helps the speech processors of the human brain determine a speaker's sex without seeing them, even for a woman with a bass voice or a man with a high-pitched voice.

2.2 Consonants

Unlike vowels, consonants are not produced directly by the vocal cords. They are created by the vocal cavities with the help of the tongue, uvula, teeth, and lips. Consonants do not have regular frequency spectra, which is why it is practically impossible to detect them in the same way. This is the exact reason it is hard for strangers to pronounce the letters, and consequently the words, of local accents, at least in the short period before they learn how to make these sounds!

2.3 Modeling the Human Speaking Organ

Considering the principles mentioned above, it is now possible to give a very simple model of the human speaking organ, as in Figure 4. A periodic excitation source (pitch source) and a noise source produce the primary energy needed.
The pitch source block produces regular frequencies (or several narrow bands), which could be the formants of vowels. Similarly, the noise source block plays the role of consonant production. The effects of the vocal cavities and their dependents (nose, mouth, throat, uvula and tongue) are gathered into a block called the Vocal Tract Filters, and a gain block adjusts the volume of the output sound.
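The source-filter model just described can be caricatured in a few lines of Python. This is a hedged toy sketch, not the author's implementation: a pulse train stands in for the pitch source, white noise for the noise source, a single one-pole low-pass filter for the vocal tract filters, and a scalar for the gain block; all names and constants are our own assumptions.

```python
import numpy as np

def synthesize(voiced, pitch_hz, fs=8000, n=4000, gain=0.5):
    """Toy source-filter synthesizer: excitation -> vocal tract -> gain."""
    if voiced:
        # Periodic excitation (pitch source): one pulse per pitch period.
        period = int(fs / pitch_hz)
        excitation = np.zeros(n)
        excitation[::period] = 1.0
    else:
        # Noise source stands in for consonant production.
        excitation = np.random.default_rng(0).standard_normal(n)
    # Crude stand-in for the vocal tract filters: a one-pole low-pass filter.
    out = np.zeros(n)
    out[0] = excitation[0]
    for i in range(1, n):
        out[i] = 0.9 * out[i - 1] + excitation[i]
    return gain * out  # gain block sets the output volume

voiced_sound = synthesize(True, pitch_hz=120)
unvoiced_sound = synthesize(False, pitch_hz=0)  # pitch unused when unvoiced
```

Switching between the two excitation branches is exactly the voiced/unvoiced decision that the coder described in the next section must transmit.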
3. ARRAY CANALIZED CODING/DECODING

Figure 5 shows a complete system with both coder and decoder sides (analyzer/synthesizer) as a communication system. From left to right: speech is first fed to an array of band-pass filters (BPF). It will be seen later that the number of band-pass filters, i.e. analyzer channels, directly affects the quality of the synthetic signal at the decoder side.

Figure 4. Simple model of the human speaking organ

The pass bands of this array should be made successive so that they completely cover the frequency band of the input signal, which in this case is speech. Each row is followed by a rectifier and a low-pass filter (LPF). This simply means that the primary 4 kHz of the speech spectrum is divided into several sub-bands, and the canalized signals are first rectified and then converted to DC levels as they pass through these branches.

Figure 5. Array Coding/Decoding System Diagram
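One analyzer row (band-pass filter, rectifier, low-pass filter) can be sketched as follows. This is an illustrative Python fragment, not the author's analog circuit: the windowed-sinc FIR design, band edges, and smoothing length are all our own assumptions.

```python
import numpy as np

def bandpass_fir(lo, hi, fs, taps=201):
    """Windowed-sinc FIR band-pass filter (difference of two low-passes)."""
    k = np.arange(taps) - (taps - 1) / 2
    def lowpass(fc):
        return np.sinc(2.0 * fc / fs * k) * (2.0 * fc / fs) * np.hamming(taps)
    return lowpass(hi) - lowpass(lo)

def spectral_coefficients(speech, fs, edges):
    """One analyzer row per sub-band: band-pass, full-wave rectify,
    low-pass smooth, then average to a single DC 'spectral coefficient'."""
    coeffs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = np.convolve(speech, bandpass_fir(lo, hi, fs), mode="same")
        rectified = np.abs(band)                                          # rectifier
        smoothed = np.convolve(rectified, np.ones(200) / 200.0, mode="same")  # LPF
        coeffs.append(float(smoothed.mean()))
    return coeffs

fs = 8000
t = np.arange(4000) / fs
tone = np.sin(2 * np.pi * 1100 * t)     # input with energy only near 1.1 kHz
edges = np.linspace(200, 3400, 17)      # 16 contiguous 200 Hz sub-bands (assumed)
coeffs = spectral_coefficients(tone, fs, edges)
```

For the test tone above, only the coefficient of the sub-band containing 1.1 kHz is large, which is exactly the per-band power reading the coder transmits.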
At the end of each row of this array there are DC values that can be interpreted as the power of the input speech in the corresponding sub-band; in this paper they are called spectral coefficients. Now, instead of transmitting the whole bandwidth, it is only necessary to transfer the spectral coefficients to the decoder side at the called party and use them to reconstruct the primary speech according to the principles and the simple model of the human speaking organ given in section 2 of this paper.

There are two other blocks on the coder/analyzer side that separately refine two very important parameters of the input speech. The voicing detector acts as a flag marking the moments when vowels appear in the input speech. While the voicing detector only signals when vowels occur, the other block, the pitch detector, analyzes the vowels that appear and detects exactly which vowels have occurred.

On the right of Figure 5, the decoder side, it is easy to find the extracted model of the human speaking organ shown in Figure 4 and described in section 2.3. Here the 'Spectral Envelope Model' has been replaced by another array of band-pass filters (BPF). The characteristics of these filters match their counterparts on the coder side, with the same bandwidths and the same central frequencies, which together should cover the desired frequency band. The pulse source is a block able to produce all vowels; it takes the information coming from the pitch detector on the coder side to know which vowels should be produced. The voice information block is a key controller that connects the decoder array either to the pulse source, when vowels appear in the speech, or to the noise source, when the speech consists of consonants. The voice information block is itself controlled by the voicing detector on the coder side. In each column of the decoder array there is a gain controller block before the band-pass filter. These voltage gain controllers (VGC) tune the amplitude of the signal in each sub-band; the DC levels obtained from the coder side are actually the control voltages of these voltage gain controllers.
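The decoder side can be sketched in the same spirit: a shared excitation (pulse source for vowels, noise source for consonants) is scaled in each column by its spectral coefficient (the VGC), band-pass filtered, and the branch outputs are summed. Again this is a hedged Python illustration with assumed band edges and filter design, not the author's circuit.

```python
import numpy as np

def bandpass_fir(lo, hi, fs, taps=201):
    """Windowed-sinc FIR band-pass filter (difference of two low-passes)."""
    k = np.arange(taps) - (taps - 1) / 2
    def lowpass(fc):
        return np.sinc(2.0 * fc / fs * k) * (2.0 * fc / fs) * np.hamming(taps)
    return lowpass(hi) - lowpass(lo)

def decode(coeffs, voiced, pitch_hz, fs, edges, n=4000):
    """Excitation -> per-column gain (VGC) -> band-pass -> sum."""
    if voiced:
        excitation = np.zeros(n)
        excitation[::int(fs / pitch_hz)] = 1.0                    # pulse source
    else:
        excitation = np.random.default_rng(0).standard_normal(n)  # noise source
    out = np.zeros(n)
    for (lo, hi), c in zip(zip(edges[:-1], edges[1:]), coeffs):
        out += np.convolve(c * excitation, bandpass_fir(lo, hi, fs), mode="same")
    return out

fs = 8000
edges = np.linspace(200, 3400, 17)   # 16 assumed sub-bands, mirroring the coder
coeffs = [0.0] * 16
coeffs[4] = 1.0                      # pretend all speech power sat in 1.0-1.2 kHz
speech_hat = decode(coeffs, voiced=True, pitch_hz=120, fs=fs, edges=edges)
```

With a single nonzero coefficient, the reconstructed signal's energy is concentrated in that one sub-band, showing how the DC control voltages shape the spectral envelope of the excitation.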
Finally, the signals coming out of the sub-bands are summed to build the reconstructed speech.

4. CONCLUSION

For telephony applications, this technique requires less than 500 Hz of baseband bandwidth to transfer the spectral coefficients and the additional information provided by the pitch and voicing detectors. The decoder maps this 500 Hz onto the desired 3 to 4 kHz of telephone quality and reconstructs the speech. Therefore, it brings at most 87.5% bandwidth savings, which means up to 8 times more connections. The quality of the reconstructed signal depends strongly on the number of sub-bands, their bandwidths, and the sharpness of the filters. However, to achieve the goal of compressing the frequency bandwidth needed to transmit speech, it is important to keep the number of sub-bands limited, which forces a sharp trade-off between quality and compression. Experiments have shown that 15-20 sub-bands are sufficient for reasonable quality. While the model diagram here includes analog-to-digital and digital-to-analog converters, the sample system designed by the author was completely analog, built with discrete components and a 20-row/column array. No details are given here about the scheme used for the communication channel, since in this work the author has been investigating the applicability of the idea introduced in this paper; more work is certainly needed to specify the best-suited interface and scheme for the communication channel.

ACKNOWLEDGEMENT

The author would like to thank Dr. Reinhold Luecker and the International Office of the University of Ulm for their valuable help regarding the registration of this paper with the IADIS Conference on Applied Computing 2005.