
Time and amplitude scaling/normalizing of sound signals


Maldun

Hello!

First, I know there are some speech recognition topics here and I went through them (like https://www.edaboard.com/threads/15683/), but I haven't found useful info there, and I have a concrete problem (not speech recognition itself).

The problem is how to scale/normalize audio signals (e.g., recordings of one specific word) from different people (so they differ in volume and length), so that I can achieve relatively homogeneous results from the subsequent parametrization. Amplitude normalization/scaling is pretty simple, I believe, but I am not sure about time normalization.
Some solutions come to mind, such as resampling the signals so they have an equal number of samples, but I'm not sure whether they are adequate, and, in any case, reinventing things that are already known is not very smart.
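For illustration, here is a rough Python sketch of what I have in mind (assuming mono float signals; scipy's resample does the time scaling, and the target length of 8000 samples is an arbitrary example):

Code:
import numpy as np
from scipy.signal import resample

def normalize_amplitude(x):
    # Scale so the peak magnitude is 1 (amplitude normalization).
    peak = np.max(np.abs(x))
    return x / peak if peak > 0 else x

def normalize_length(x, target_len=8000):
    # Resample to a fixed number of samples (time normalization).
    return resample(x, target_len)

# Two recordings of the same word, differing in volume and duration
# (random stand-ins here instead of real recordings):
word_a = 0.3 * np.random.randn(12000)
word_b = 0.9 * np.random.randn(9000)
a = normalize_length(normalize_amplitude(word_a))
b = normalize_length(normalize_amplitude(word_b))
assert a.shape == b.shape  # now directly comparable sample by sample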

In case you want more specifics:
Part of my task is to recognize a person's identity (out of a given number of people) using voice recordings (the person speaks into a microphone, resulting in a .wav file). It's supposed to be content-independent authentication. I have a database of some words (i.e., their averaged parametrizations, or, let's say, an already trained neural network) that will occur with high probability (that's task-specific).
The way I see it: I have some audio input (a .wav file of someone saying a random sentence), then I look for specific words (from the DB) in the sentence and extract them. Extraction is a problem of its own, but not the one that concerns me most right now.
To compare the words I've extracted with the ones from the DB, I believe they should be normalized/scaled (amplitude- and time-wise) to achieve reasonable results (comparing non-scaled words has proved unproductive, as one would expect).
I decided to try cepstral coefficients (derived from LPC) as the parametrization and a neural network for classification.
The next part, identity authentication, is quite easy given good normalization/scaling of the signals.
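For reference, a rough sketch of the LPC-to-cepstrum step I have in mind, using the standard recursion c_m = p_m + sum_{k=1}^{m-1} (k/m) c_k p_{m-k} (librosa is assumed for the LPC fit; the file name, 30 ms frame, and order 12 are just example values):

Code:
import numpy as np
import librosa

def lpc_to_cepstrum(a, n_ceps):
    # a = [1, a_1, ..., a_p] as returned by librosa.lpc;
    # the predictor coefficients are p_k = -a_k, and p_m is
    # taken as 0 for m beyond the model order.
    p = -a[1:]
    order = len(p)
    c = np.zeros(n_ceps)
    for m in range(1, n_ceps + 1):
        acc = p[m - 1] if m <= order else 0.0
        for k in range(max(1, m - order), m):
            acc += (k / m) * c[k - 1] * p[m - k - 1]
        c[m - 1] = acc
    return c

y, sr = librosa.load("word.wav", sr=None)  # hypothetical extracted word
frame = y[: int(0.03 * sr)]                # one 30 ms analysis frame
a = librosa.lpc(frame, order=12)           # 12th-order LPC fit
ceps = lpc_to_cepstrum(a, n_ceps=12)       # feature vector for the NN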

P.S.: One more question: are cepstral coefficients derived from LPC speaker-independent (more or less) or not?
P.P.S.: Maybe you can think of better approaches (other than the LPC-to-cepstrum parametrization) to identify a person given this input data, as well as a proper way to extract words from a sentence.
P.P.P.S.: Sorry for my English.
 
