ARRAY CANALIZED CODING TECHNIQUE FOR FREQUENCY BAND COMPRESSION IN SPEECH TELECOMMUNICATION SYSTEMS

Shahrokh Sanati
Department of Communications Technology, University of Ulm, Germany

ABSTRACT

This paper is the result of the author's design of a real-time system whose main application is to analyze human speech at the calling party, compress and code it into a bandwidth of about 500 Hz instead of the traditional 4 kHz of fixed telephony systems, and then reconstruct the speech at the called party. This expands the efficiency of frequency-band usage, and in particular extracts more gain from installed infrastructure in rural areas where extending the network is hard because of geographical obstacles, or because of financial constraints in developing or third-world countries. Given the increasing military need for reliable wireless communications in a limited and crowded spectrum, the technique could also be implemented in military wireless communications.

KEYWORDS

Vowels, Consonants, Modeling Human Speaking Organ, Array Coding/Decoding

1. INTRODUCTION

Since in almost all fixed telephony systems around the world a frequency band of 4 kHz is taken to contain the main part of the information in the human voice, all interfaces and subsequent parts of those systems are designed on this assumption. These parts include filters, sampling with analog-to-digital conversion (ADC) and digital-to-analog conversion (DAC), framing and synchronization in implementing Time Division Multiple Access (TDMA), and even the subsequent interfaces used in low- or high-density transmission over radio links or Optical Line Terminals (OLT), which normally connect the nodes, in this case the switching centers, of a telephony network. This paper concentrates on the characteristics of human voice information and investigates how the 4 kHz frequency band mentioned above can be reduced.
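The bandwidth figures above translate into a simple budget. A quick sanity check of the ratio, using the 4 kHz and 500 Hz figures from the abstract (the variable names are ours):

```python
# Bandwidth budget implied by compressing a 4 kHz telephone channel
# into roughly 500 Hz of coded baseband.
traditional_bw_hz = 4000.0  # conventional fixed-telephony channel
coded_bw_hz = 500.0         # spectral coefficients + pitch/voicing flags

saving = 1.0 - coded_bw_hz / traditional_bw_hz   # fraction of band freed
connections_per_channel = traditional_bw_hz / coded_bw_hz

print(f"saving = {saving:.1%}, connections = {connections_per_channel:.0f}x")
```

This reproduces the 87.5% saving and the factor of 8 quoted in the conclusion.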
To reach this goal, the fundamental principles of the human voice are first described and simply modeled; then, by introducing an idea called Array Canalized Coding/Decoding, a complete model is given that could be implemented as the solution interface.

2. CHARACTERISTICS OF HUMAN VOICE INFORMATION

2.1 Vowels

Vowels (a, e, i, o, u as the major vowels in the English language) come directly from the vocal cords. Their frequency spectra are approximately regular and are not much influenced by the teeth, lips or vocal cavities (mouth and nose). Vowels create the basic part of speech because they help:
- to make unlimited, easy-to-pronounce combinations of letters (words) in each language,

- to contrast between similar words,
- to apply feelings while talking.

As mentioned, any vowel has an almost regular frequency spectrum, but observation of these spectra has also shown that within them lie some very narrow bands, which could even be described as certain single frequencies. While talking, whenever a vowel appears, these very narrow bands or frequencies stand out against the rest of the vowel's spectrum. They are the so-called formants. Figure 1 shows the spectra of eight American English vowels.

Figure 1. Formants of eight American English vowels (speaker is male)

Formant frequencies depend on many factors, but the most important ones are the following (see also Figure 2):
- A_min: the minimum cross-sectional area along the path between the vocal cords and the lips,
- L: the distance from A_min to the glottis,
- A_lip: the lip opening (the open area between the lips).

Figure 2. Relation between the values of A_min, L and A_lip (right: pronouncing ε; left: pronouncing α)
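The idea that vowels can be told apart by where their formants sit can be sketched as a nearest-match lookup against a default table. The (F1, F2) values below are rough illustrative figures for a male speaker, not measurements from Figure 1, and the function name is our own:

```python
# Hypothetical vowel lookup: (F1, F2) formant pairs in Hz.
# Values are ballpark illustrations for a male speaker, not paper data.
VOWEL_FORMANTS = {
    "i": (270, 2290),
    "e": (530, 1840),
    "a": (730, 1090),
    "o": (570, 840),
    "u": (300, 870),
}

def classify_vowel(f1, f2):
    """Return the table vowel whose (F1, F2) pair is closest in Hz."""
    def distance(vowel):
        t1, t2 = VOWEL_FORMANTS[vowel]
        return (f1 - t1) ** 2 + (f2 - t2) ** 2
    return min(VOWEL_FORMANTS, key=distance)
```

A real detector would first have to estimate the formant frequencies from the signal; this sketch only shows the table-matching step the text describes.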

Figure 3 shows the first five calculated formants of vowels as L varies, with A_min = 0.65 cm2. As observed, variation of A_lip at fixed A_min does not seriously affect the formant frequencies, and this leads to a valuable result. Regardless of the physical shape and size of the speaking organs, L is approximately constant within any language, because people of the same local nationality have learnt to speak and to pronounce vowels in a similar way. Therefore, it is possible to dedicate several main frequencies, or very narrow frequency bands, to detecting formants by tracking their arrangement against experienced default tables.

Figure 3. First five calculated formants of vowels, with A_min = 0.65 cm2. The three sets of curves represent lip openings (A_lip) of 4 cm2 (un-rounded), 2 cm2 and 0.65 cm2 (rounded)

It is interesting to point out that the second and third formants of female speakers are very similar to the corresponding formants of male speakers, but the fourth formants of females are higher than those of males. Perhaps this is the natural key that helps the speech processors of the human brain to determine the sex of a speaker without seeing them, even for a woman with a bass voice or a man with a sharp voice.

2.2 Consonants

Unlike vowels, consonants are not produced directly by the vocal cords. They are creations of the vocal cavities with the help of the tongue, uvula, teeth and lips. Consonants do not have regular frequency spectra, which is why it is practically impossible to detect them the same way. This is the exact reason why it is hard for strangers to pronounce the letters, and consequently the words, of a local accent, at least for a short period before they learn how to make these sounds!

2.3 Modeling Human Speaking Organ

Considering the principles above, it is now possible to give a very simple model of the human speaking organ, shown in Figure 4. A Periodic Excitation Source (Pitch Source) and a Noise Source produce the primary energy needed.
The Pitch Source block also produces the regular frequencies (or several narrow bands) that could be the formants of vowels. In a similar way, the Noise Source block can play the role of consonant production. The effects of the vocal cavities and their dependents (nose, mouth, throat, uvula and tongue) are gathered into a block called Vocal Tract Filters, and a Gain block adjusts the volume of the output sound.
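A minimal numerical sketch of this source-filter model: an impulse train as the Pitch Source, white noise as the Noise Source, a crude one-pole recursive filter standing in for the Vocal Tract Filters block, and a gain stage. All names and parameter values here are illustrative assumptions, not the paper's design:

```python
import random

def excitation(voiced, n, pitch_period=80):
    """Pitch Source: impulse train for vowels; Noise Source otherwise."""
    if voiced:
        return [1.0 if i % pitch_period == 0 else 0.0 for i in range(n)]
    return [random.uniform(-1.0, 1.0) for _ in range(n)]

def vocal_tract(signal, a=0.9):
    """One-pole recursive filter as a crude stand-in for the tract."""
    out, prev = [], 0.0
    for x in signal:
        prev = x + a * prev
        out.append(prev)
    return out

def speak(voiced, n=400, gain=0.5):
    """Excitation -> Vocal Tract Filters -> Gain, as in Figure 4."""
    return [gain * s for s in vocal_tract(excitation(voiced, n))]
```

Switching the `voiced` flag between the two excitation sources is exactly the role the Voice Information block plays in the decoder described in section 3.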

3. ARRAY CANALIZED CODING/DECODING

Figure 5 shows the complete system, with both coder and decoder sides (analyzer/synthesizer), as a communication system. From left to right: speech is first fed to an array of Band Pass Filters (BPF). It will be seen later that the number of band pass filters, that is, the number of analyzer channels, directly affects the quality of the synthetic signal at the decoder side.

Figure 4. Simple model of the human speaking organ

The pass bands of this array should be set to be successive, so that together they cover the whole frequency bandwidth of the input signal, in this case speech. Each row is followed by a Rectifier and a Low Pass Filter (LPF). This simply means that the primary 4 kHz speech spectrum is divided into several sub-bands, and the canalized signals are first rectified and then converted to dc levels while passing through these branches.

Figure 5. Array Coding/Decoding system diagram

After that, at the end of each row of the array, there are dc values that can be interpreted as the power of the input speech in the corresponding sub-band; in this paper they are called Spectral Coefficients. Now, instead of transmitting the whole bandwidth, it is only necessary to transfer the spectral coefficients to the decoder side at the called party and use them to reconstruct the primary speech according to the principles and the simple model of the human speaking organ given in section 2.

There are two other blocks on the coder/analyzer side which separately refine two very important parameters of the input speech. The Voicing Detector acts as a flag that marks the moments when vowels appear in the input speech. While the voicing detector only signals when a vowel occurs, the other block, the Pitch Detector, analyzes the detected vowels and determines exactly which vowels have occurred.

On the right of Figure 5, the decoder side, it is easy to find the extracted model of the human speaking organ shown in Figure 4 and described in section 2.3. Here the Spectral Envelope Model has been replaced by another array of band pass filters. The characteristics of these filters match their counterparts on the coder side, with the same bandwidths and the same central frequencies, so that together they cover the desired frequency band. The Pulse Source is a block able to produce all vowels; it simply uses the information coming from the pitch detector on the coder side to know which vowels should be produced. Voice Information is a key controller that connects the decoder array either to the Pulse Source, when vowels appear in the speech, or to the Noise Source, when the speech consists of consonants. Voice Information is itself controlled by the Voicing Detector on the coder side. In each column of the decoder array there is a gain-controller block before the band pass filter. These Voltage Gain Controllers (VGC) tune the amplitude of the signal in each sub-band; the dc levels obtained from the coder side are actually the control voltages of these voltage gain controllers.
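The decoder column can be sketched the same way: one shared excitation selected by the voicing flag, a per-band gain (the VGC, driven by the received dc coefficient), and a sum over bands. A cosine at each band centre is used here as a loose stand-in for the decoder's band pass filter, and the frame length, sampling rate and band width are again our illustrative assumptions:

```python
import math, random

def decode_frame(coeffs, voiced, pitch_hz=100, fs=8000, band_hz=250, n=160):
    """One synthesis frame: excitation -> per-band VGC gain -> sum."""
    if voiced:
        period = max(1, int(fs / pitch_hz))   # Pulse Source
        excite = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    else:
        excite = [random.uniform(-1.0, 1.0) for _ in range(n)]  # Noise Source
    out = [0.0] * n
    for b, g in enumerate(coeffs):            # one decoder column per band
        centre = (b + 0.5) * band_hz          # band-centre tone as BPF stand-in
        for i in range(n):
            # VGC: received coefficient g scales this band's contribution
            out[i] += g * excite[i] * math.cos(2 * math.pi * centre * i / fs)
    return out
```

In the real system each column would apply a proper band pass filter after the gain stage rather than a single tone; the sketch only shows how the transmitted coefficients shape the reconstructed spectrum.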
Finally, the signals coming out of the sub-bands are summed up to build the reconstructed speech.

4. CONCLUSION

Using this technique for telephony applications requires less than 500 Hz of baseband bandwidth to transfer the spectral coefficients and the additional information provided by the pitch and voicing detectors. The decoder maps this 500 Hz back to the desired 3 to 4 kHz for telephone quality and reconstructs the speech. Therefore, at most it yields 87.5% bandwidth saving, which means up to 8 times more connections. The quality of the reconstructed signal depends tightly on the number of sub-bands, their bandwidths and the sharpness of the filters. However, to reach the goal of compressing the bandwidth needed to transmit speech, it is important to keep the number of sub-bands limited; this forces a sharp trade-off between quality and compression. Experiments have shown that for reasonable quality, 15-20 sub-bands are sufficient. While the model diagram here includes analog-to-digital and digital-to-analog converters, the sample system designed by the author was completely analog, using discrete components and a 20 row/column array. No details are given here of the scheme used for the communication channel, since in this work the author has been investigating the applicability of the idea introduced in this paper; more work is definitely needed to define the best-suited interface and scheme for the communication channel.

ACKNOWLEDGEMENT

The author would like to acknowledge Dr. Reinhold Luecker and the International Office of the University of Ulm for their valuable help regarding the registration of this paper with the IADIS Conference in Applied Computing 2005.