
Speaker recognition with glottal pulse-shapes

by Kametani, Jun



BACKGROUND OF THE INVENTION

The present invention relates to speaker recognition techniques using glottal pulse shapes.

It is known to use glottal pulse-shapes extracted from a talker's voice as a unique key to identify the voice pattern of individuals. In a conventional speaker recognition system for multiple designated speakers, the beginning and end points of an utterance are detected by an endpoint detection unit. This is usually done by detecting the power level of the voice input and identifying it as a voice signal if it exceeds a prescribed threshold. Once the beginning and end of the utterance have been found, a series of measurements is made to provide feature parameters such as cepstrum coefficients, linear prediction coefficients and/or auto-correlation coefficients. While these parameters contain articulation information which is relevant to speaker identity, they also contain speaker-independent information such as phonemes. The conventional speaker recognition system enhances its reliability by additionally employing speaker-identifying keywords, or digitized spoken words or phrases, stored in a memory. In response to a voice input, the memory is searched to detect a keyword which is combined with a stored voice pattern, and a match with the extracted features is detected using a dynamic programming technique by which the time scale of a reference utterance is dynamically warped so that significant events of the input utterance line up with the corresponding significant events in the reference utterance.
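The time-warping match used by the prior-art systems can be illustrated with a minimal dynamic time warping sketch. It is only an illustration under stated assumptions: the function name `dtw_distance`, the use of scalar features and the absolute-difference local cost are hypothetical choices, not details taken from the prior art described above.

```python
def dtw_distance(reference, utterance):
    """Accumulated distance after warping `reference` onto `utterance`.

    Both arguments are sequences of scalar feature values (an assumption;
    real systems compare feature vectors such as cepstrum coefficients).
    """
    inf = float("inf")
    rows, cols = len(reference), len(utterance)
    # cost[r][c] holds the cheapest alignment of reference[:r] with utterance[:c].
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for r in range(1, rows + 1):
        for c in range(1, cols + 1):
            local = abs(reference[r - 1] - utterance[c - 1])
            # Allow a repeat of either sequence element, or a diagonal step.
            cost[r][c] = local + min(
                cost[r - 1][c], cost[r][c - 1], cost[r - 1][c - 1]
            )
    return cost[rows][cols]
```

A slower rendition of the same pattern, such as `[1, 2, 2, 3]` against the reference `[1, 2, 3]`, aligns with zero accumulated distance, which is the effect the warping is meant to achieve.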

SUMMARY OF THE INVENTION

It is therefore an object of the present invention to provide a speaker recognition system having a high reliability of speaker recognition without using speaker-specific keywords.

According to the present invention, there is provided a speaker recognition system which comprises a voiced sound detector for detecting a voiced sound sample from an input utterance of a speaker. A prefiltering circuit is provided for deriving from the voiced sound sample a compensation parameter indicating the decaying characteristic of a high frequency component of the voiced sound sample and compensating for the voiced sound sample in accordance with the compensation parameter. An estimation circuit is provided for estimating a glottal excitation source pulse of the vocal tract system of the speaker from the compensated voiced sound sample. A glottal pulse-shape is simulated from the estimated glottal excitation source pulse using the compensation parameter and supplied to a decision circuit for analyzing the simulated glottal pulse-shape to determine the vocal features of the speaker and making a decision whether the determined features coincide with reference features stored in a pattern memory.

Since glottal pulse-shapes contain little or no phonemic information (a speaker-independent parameter), speakers can be recognized with a higher degree of discrimination than with the prior art keyword system.

More specifically, the voiced sound detector detects a voiced sound sample $X_i$ from the input utterance, where $i$ is an integer identifying the sample, and the prefiltering circuit derives from the voiced sound sample a compensation coefficient $K$ indicating the decaying characteristic of the high frequency component and compensates for the voiced sound sample in accordance with the coefficient $K$ to produce an output sample $Y_i$ represented by the relation $Y_i = X_i - K \cdot X_{i-1}$. The estimation circuit estimates a glottal excitation source pulse $U_i$ of the vocal tract system of the speaker from the compensated voiced sound sample $Y_i$, the estimated glottal excitation source pulse $U_i$ being represented by the relation $U_i = Y_i - \sum_{j=1}^{n} \alpha_j Y_{i-j}$, where $\alpha_j$ indicates a linear prediction coefficient of the j-th order and $n$ is an integer. The glottal pulse-shape $H_i$ is simulated from the estimated glottal excitation source pulse $U_i$ in accordance with the compensation coefficient $K$ so that the simulated glottal pulse-shape $H_i$ is represented by the relation $H_i = (U_i + K \cdot U_{i-1}) + \beta \cdot (U_{i-1} + K \cdot U_{i-2})$, where $\beta$ is a constant smaller than unity.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be described in further detail with reference to the accompanying drawings, in which:

FIG. 1 is a block diagram of a speaker recognition system of the present invention;

FIG. 2 shows details of the voiced sound detection circuit of FIG. 1;

FIGS. 3a to 3d are waveforms generated in various parts of the system of FIG. 1; and

FIG. 4 is a schematic diagram of a model of a voice generation mechanism.

DETAILED DESCRIPTION

Referring now to FIG. 1, the speaker recognition system of the present invention generally comprises an analog-to-digital converter 2, a voiced sound detection circuit 3, a prefiltering circuit 4, a glottal source estimation circuit 5, a glottal pulse-shape simulation circuit 6, a speaker identification circuit 7, a utilization circuit 8, and a sequence controller 9 which provides overall timing control for the system. Analog-to-digital converter 2 digitizes the analog waveform of a sound signal supplied directly from a microphone 1 or by way of a telephone line, not shown, at a sampling rate of 10 kHz, for example. Digital samples from A/D converter 2 are supplied to voiced sound detection circuit 3.

As shown in FIG. 2, voiced sound detection circuit 3 comprises a framing circuit 50 which segments the time scale of input by organizing the input samples in response to timing pulses supplied from sequence controller 9 into a series of 100-ms frames each containing 1000 digital samples $X_i$, for example. A squaring circuit 51 is connected to the output of the framing circuit 50 to produce a signal representative of the average power of the samples in each frame, which is applied to a comparator 52 for comparison with a reference value. The output of framing circuit 50 is further applied to a zero-crossing detector 53 and a buffer 56. Zero-crossing detector 53 counts zero-crossings of the samples in each frame and supplies its output to a comparator 54 for making a comparison with a reference count. If the average power of a given frame is higher than the reference value, comparator 52 produces a logic 1 and if the number of detected zero-crossings is smaller than the reference count, comparator 54 produces a logic 1. The outputs of comparators 52 and 54 are applied to an AND gate 55 to supply an enable pulse to buffer 56 indicating that the input frame is a voiced sound. A typical waveform of such voiced sound is shown in FIG. 3a.
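The decision logic of circuit 3 can be sketched in software as follows. This is a minimal illustration, not the disclosed circuit: the function names and the default values of `power_threshold` and `zero_crossing_limit` are hypothetical, since the text gives no numeric reference value or reference count.

```python
def is_voiced_frame(frame, power_threshold, zero_crossing_limit):
    """Return True when a frame passes both comparators (the AND gate)."""
    # Squaring circuit 51: average power of the samples in the frame.
    average_power = sum(x * x for x in frame) / len(frame)
    # Zero-crossing detector 53: count sign changes between neighbours.
    zero_crossings = sum(
        1 for prev, cur in zip(frame, frame[1:]) if (prev < 0) != (cur < 0)
    )
    # Comparators 52 and 54 combined by AND gate 55.
    return average_power > power_threshold and zero_crossings < zero_crossing_limit


def voiced_frames(samples, frame_length=1000,
                  power_threshold=0.01, zero_crossing_limit=300):
    """Split samples into 100-ms frames (at 10 kHz) and keep the voiced ones.

    The two threshold defaults are placeholders chosen for illustration.
    """
    frames = [
        samples[start:start + frame_length]
        for start in range(0, len(samples) - frame_length + 1, frame_length)
    ]
    return [
        frame for frame in frames
        if is_voiced_frame(frame, power_threshold, zero_crossing_limit)
    ]
```

A strong low-frequency waveform passes both tests, whereas silence fails the power test and a rapidly alternating, noise-like signal fails the zero-crossing test.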

As opposed to the voice signal detected by the endpoint detection system of the prior art, the voiced sound so detected by the circuit 3 contains glottal pulse waves generated by the oscillatory movement of the vocal cords of a speaker as well as noise components other than pitch harmonics. The pitch frequency of glottal pulse waves, the slope and bends of the spectral contour and the noise components contained in the output of circuit 3 are the unique characteristics that identify a particular person among an ensemble of persons.

When buffer 56 is enabled, ten frames of voiced sound are supplied in sequence from buffer 56 to prefiltering circuit 4 in response to a timing pulse from sequence controller 9. Due to the spectral characteristic of glottal pulse waves and the radiation effects of the lips, the amplitude of the high-frequency components of the voiced sound decreases with frequency at a rate of approximately -10 dB/oct. This high-frequency decaying characteristic results in a concentration of acoustic energy in the lower frequency range of the spectrum and lowers the accuracy of the vocal tract transfer function which will be obtained later. To ensure a satisfactory level of computation accuracy, the prefiltering circuit 4 provides high-frequency compensation to equalize the spectrum characteristic of the voiced sound input from circuit 3.



The present invention is based on a well known voice generation model which is approximated by a series circuit of a source model, a vocal tract model and a radiation model as shown in FIG. 4. It is recognized that double differentiation of a voiced glottal pulse wave results in a periodic pulse. The source model is based on this fact and performs double integration on periodic glottal excitation pulses $U_i$ to approximate glottal pulse waves $H_i$ which are to be supplied to the vocal tract model. The vocal tract model is approximated by a vocal tract filter, and the radiation model approximates the radiation pattern of the lips and is equivalent to a differential circuit. These models are considered to be "linearly" coupled, and therefore the source model and the radiation model can together be approximated by a "single" integration circuit.
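The reduction from a double integration to a single integration can be sketched in the z-domain. This is an idealized illustration rather than part of the disclosure: the transfer functions $S(z)$ and $R(z)$, and the choice of ideal integrators and an ideal differentiator, are assumptions made only for this sketch.

```latex
S(z) = \frac{1}{\left(1 - z^{-1}\right)^{2}}, \qquad
R(z) = 1 - z^{-1}, \qquad
S(z)\,R(z) = \frac{1}{1 - z^{-1}}
```

Because the stages are linearly coupled, their order can be exchanged, so the differentiation of the radiation model cancels one of the two integrations of the source model. Under these assumptions the inverse of the combined source-radiation model is a first-order difference, which is the general form the prefiltering circuit applies with an adaptive coefficient in place of unity.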

Prefiltering circuit 4 comprises a buffer 10, a correlator 11 and an adaptive inverse filter 12 which corresponds to the inverse filter of the source-radiation model. Ten frames of the voiced sound from detection circuit 3 are stored into buffer 10 for temporary storage. The stored frames are successively supplied to correlator 11 to detect a coefficient $K_m$ of first-order auto-correlation between successive digital samples $X_i$ in the current frame $F_m$. The auto-correlation coefficient $K_m$ is supplied to adaptive inverse filter 12, to which the digital samples of each frame are successively supplied from buffer 10. Adaptive inverse filter 12 multiplies each sample by the coefficient $K_m$ to produce a weighted sample $K_m \cdot X_{i-1}$ and derives a difference between the weighted sample and a subsequent sample $X_i$ to generate a weighted differential sample $Y_i$ of the form:

$$Y_i = X_i - K_m \cdot X_{i-1} \tag{1}$$

The weighted differential sample $Y_i$ is stored into buffer 10 to recover a frame $F'_m$ containing weighted differential samples $Y_i$. A typical waveform of the sample $Y_i$ is shown in FIG. 3b.
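The prefiltering step of Equation (1) can be sketched per frame as below. This is a minimal illustration under two assumptions that the text does not state: the first-order auto-correlation is normalized by the frame energy, and the sample preceding the first sample of a frame is taken as zero. The function names are hypothetical.

```python
def first_order_autocorrelation(frame):
    """Normalized lag-1 auto-correlation of a frame (normalization assumed)."""
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return 0.0
    lagged = sum(cur * prev for prev, cur in zip(frame, frame[1:]))
    return lagged / energy


def prefilter(frame):
    """Apply Y_i = X_i - K_m * X_{i-1} to one frame; returns (K_m, Y)."""
    k = first_order_autocorrelation(frame)
    # Assumption: the sample before the first one in the frame is zero.
    previous = [0.0] + list(frame[:-1])
    return k, [cur - k * prev for prev, cur in zip(previous, frame)]
```

The coefficient is returned along with the filtered frame because the same value is needed again when the glottal pulse-shape is simulated.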

In response to a timing pulse from sequence controller 9, the output signal $Y_i$ of prefiltering circuit 4 is supplied to glottal source estimation circuit 5, which comprises a buffer 20 for temporarily storing samples $Y_i$, a linear prediction circuit 21 and a residual error detector 22. Linear prediction circuit 21 and residual error detector 22 combine to form an inverse filter of the vocal tract function. From the stored samples $Y_i$, the linear prediction circuit 21 derives the following polynomial of linear prediction coefficients:

$$\sum_{j=1}^{n} \alpha_j Y_{i-j} \tag{2}$$

where $\alpha_j$ represents the j-th order linear prediction coefficient for each frame, and $n$ is an integer typically in the range between 8 and 14. Specifically, linear prediction circuit 21 derives the linear prediction coefficients $\alpha_j$ from all the samples of each frame stored in buffer 20, multiplies samples $Y_{i-j}$ by the coefficients $\alpha_j$ and provides a total of the products $(\alpha_j \cdot Y_{i-j})$ for $j=1$ to $j=n$. The output of linear prediction circuit 21 is supplied to residual error detector 22, which detects a residual error between the output of linear prediction circuit 21 and a stored sample $Y_i$ to produce an output sample $U_i$ (see FIG. 3c), the difference being stored back into buffer 20. Thus, the output sample $U_i$ which approximates the periodic glottal excitation pulse is given by:

$$U_i = Y_i - \sum_{j=1}^{n} \alpha_j Y_{i-j} \tag{3}$$

Samples $U_i$ are supplied from buffer 20 to glottal pulse-shape simulation circuit 6 in response to a timing signal from sequence controller 9.
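The linear prediction and residual steps can be sketched as follows. The text does not say how the prediction coefficients are computed, so this sketch assumes the autocorrelation method solved with the Levinson-Durbin recursion, and it treats samples before the start of a frame as zero. The function names are hypothetical.

```python
def lpc_coefficients(frame, order):
    """Linear prediction coefficients alpha_1..alpha_n of one frame.

    Assumption: autocorrelation method with Levinson-Durbin recursion.
    """
    n = len(frame)
    r = [
        sum(frame[i] * frame[i - lag] for i in range(lag, n))
        for lag in range(order + 1)
    ]
    alpha = [0.0] * (order + 1)  # alpha[0] is unused
    error = r[0]
    for m in range(1, order + 1):
        if error == 0.0:
            break
        acc = r[m] - sum(alpha[j] * r[m - j] for j in range(1, m))
        reflection = acc / error
        updated = alpha[:]
        updated[m] = reflection
        for j in range(1, m):
            updated[j] = alpha[j] - reflection * alpha[m - j]
        alpha = updated
        error *= 1.0 - reflection * reflection
    return alpha[1:]


def glottal_excitation(frame, order=10):
    """Residual U_i = Y_i - sum_j alpha_j * Y_{i-j}; returns (alpha, U)."""
    alpha = lpc_coefficients(frame, order)
    residual = []
    for i, sample in enumerate(frame):
        # Samples before the start of the frame are taken as zero (assumption).
        predicted = sum(
            alpha[j - 1] * frame[i - j]
            for j in range(1, order + 1) if i - j >= 0
        )
        residual.append(sample - predicted)
    return alpha, residual
```

For a frame that is exactly predictable from its past, such as a decaying geometric sequence with `order=1`, the residual collapses to a single pulse at the first sample, which is the behaviour expected of an inverse vocal tract filter.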

Glottal pulse-shape simulation circuit 6 comprises a buffer 30 for temporarily storing samples $U_i$, an adaptive filter 31, a buffer 32 for storing first-order auto-correlation coefficients $K_m$ from correlator 11, and a multiply-and-add circuit 33. Adaptive filter 31, which represents the filter of the source-radiation model, is connected to buffers 30 and 32 to provide the function of multiplying a sample $U_{i-1}$ by a coefficient $K_m$ to produce a weighted sample $(K_m \cdot U_{i-1})$ and adding a sample $U_i$ to the weighted sample $K_m \cdot U_{i-1}$, producing an output sample $G_i$ of the form:

$$G_i = U_i + K_m \cdot U_{i-1} \tag{4}$$

The output samples $G_i$ are stored back into buffer 30 and retrieved by multiply-and-add circuit 33, by which sample $G_{i-1}$ is multiplied by a coefficient $\beta$ in the range between 0.9 and 0.99, producing a weighted sample $(\beta \cdot G_{i-1})$. This weighted sample is summed with the current sample $G_i$ to produce a glottal pulse-shape $H_i$ of the form:

$$H_i = G_i + \beta \cdot G_{i-1} \tag{5}$$

Since the glottal pulse wave $H_i$ can be derived from double integration of glottal excitation pulses $U_i$ as mentioned earlier, the adaptive filter 31 is equivalent to the function of first integration and the multiply-and-add circuit 33 is equivalent to the function of second integration. Glottal pulse-shape simulation circuit 6 further includes an auto-correlator 34 which detects auto-correlation between input digital samples $U_i$ to produce a digital sample $P_i$ representative of the pitch period $T$ of glottal pulse-shape samples $U_i$. Samples $H_i$ and $P_i$ are stored back into buffer 30 and delivered to speaker identification circuit 7 in response to a timing pulse from sequence controller 9. A typical waveform of glottal pulse-shape $H_i$ is shown in FIG. 3d.
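The two filtering stages and the pitch measurement of circuit 6 can be sketched as follows. This is only an illustration: the function names, the default `beta=0.95` (a value inside the stated 0.9 to 0.99 range), the zero initial conditions and the lag search bounds are assumptions, and picking the lag with the largest auto-correlation is one common way to estimate a pitch period rather than the method confirmed by the text.

```python
def simulate_glottal_pulse_shape(excitation, k, beta=0.95):
    """Apply G_i = U_i + K*U_{i-1}, then H_i = G_i + beta*G_{i-1}."""
    # Samples before the start of the sequence are taken as zero (assumption).
    g = [
        u + k * (excitation[i - 1] if i > 0 else 0.0)
        for i, u in enumerate(excitation)
    ]
    return [
        gi + beta * (g[i - 1] if i > 0 else 0.0)
        for i, gi in enumerate(g)
    ]


def pitch_period(excitation, min_lag, max_lag):
    """Lag (in samples) with the largest auto-correlation of the excitation."""
    def autocorrelation(lag):
        return sum(
            excitation[i] * excitation[i - lag]
            for i in range(lag, len(excitation))
        )
    return max(range(min_lag, max_lag + 1), key=autocorrelation)
```

At the 10 kHz sampling rate given as an example, a search between lags 20 and 200 covers pitch frequencies from 500 Hz down to 50 Hz, and a detected lag of 80 samples corresponds to a pitch period of 8 ms.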

Speaker identification circuit 7 includes a buffer 40 for storing samples $H_i$ and $P_i$, a features extraction circuit 41, a features comparator 42 and a pattern memory 43. Features extraction circuit 41 analyzes the samples $H_i$ stored in buffer 40 to detect the glottal closure interval $\tau$, opening time $t_o$ and closing time $t_c$ (see FIG. 3d) and derives ratios $R_1 = \tau / T$ and $R_2 = t_o / t_c$. The average and deviation values of the extracted features are determined and supplied to comparator 42, in which they are compared with a set of reference parameters stored in pattern memory 43 during a recording mode of the system. These reference parameters represent the average and deviation values of features $T$, $R_1$ and $R_2$ of a designated person. Comparator 42 determines distances between the corresponding items. If the detected distances are within a specified range, a decision is made that the speaker's voice matches a corresponding voice pattern. This fact is communicated to a utilization circuit 8.
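The comparison performed by comparator 42 can be sketched as below. The text does not define the distance measure or the permitted range, so this sketch assumes an absolute difference per item and a per-feature tolerance; the function names, the dictionary layout and the use of the population standard deviation as the "deviation value" are all hypothetical.

```python
from statistics import mean, pstdev


def summarize(values):
    """Average and deviation of one feature measured over several periods."""
    return mean(values), pstdev(values)


def matches_reference(observed, reference, tolerance):
    """Decide whether every feature lies within range of the reference.

    Each argument maps a feature name ("T", "R1", "R2") to a pair:
    (average, deviation) for observed and reference, and
    (max average distance, max deviation distance) for tolerance.
    """
    for name, (ref_avg, ref_dev) in reference.items():
        avg, dev = observed[name]
        max_avg_distance, max_dev_distance = tolerance[name]
        # Assumption: distance is the absolute difference of each item.
        if abs(avg - ref_avg) > max_avg_distance:
            return False
        if abs(dev - ref_dev) > max_dev_distance:
            return False
    return True
```

In a recording mode, `summarize` would be applied to the features of the designated person to build the reference entries; in recognition mode the same summary of the input utterance is passed as `observed`.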

Since the speaker recognition system of this invention detects glottal pulse-shapes containing little or no phonemic information (a speaker-independent parameter) and extracts those parameters uniquely characterizing a speaker's voice pattern, speakers can be recognized with a higher level of discrimination than with the prior art keyword system.

The foregoing description shows only one preferred embodiment of the present invention. Various modifications are apparent to those skilled in the art, however, without departing from the scope of the present invention which is only limited by the appended claims. Therefore, the embodiment shown and described is only illustrative, not restrictive.