The first question has already been answered conceptually, since it concerned the hardware required for a speech recognizer.
As for the firmware/software requirements, this is a fairly complex area. It also depends on how the input data behaves and on the set of terms to be recognized (numbers and letters? whole words?).
The classical approach is to use a neural network.
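
To give a rough idea of what the neural network approach means in software, here is a minimal sketch in Python. It is not a real recognizer: it assumes the audio has already been reduced to a short feature vector per utterance, and the three-word vocabulary, the two-dimensional feature values, and the function names (`train`, `predict`) are all made up for illustration. The network is a single softmax layer, the simplest form of neural network; a practical system would need a proper feature-extraction front end and a much larger model.

```python
import math
import random

# Hypothetical vocabulary and made-up 2-D feature vectors (illustration only).
VOCAB = ["yes", "no", "stop"]
samples = [
    [0.9, 0.1], [0.8, 0.2],   # "yes"
    [0.1, 0.9], [0.2, 0.8],   # "no"
    [0.9, 0.9], [0.8, 0.8],   # "stop"
]
labels = [0, 0, 1, 1, 2, 2]


def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(weights, biases, x):
    """Linear score of each class for feature vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def train(samples, labels, n_classes, epochs=500, lr=0.5, seed=0):
    """Train a single softmax layer with plain stochastic gradient descent."""
    rng = random.Random(seed)
    n_features = len(samples[0])
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_features)]
               for _ in range(n_classes)]
    biases = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            probs = softmax(class_scores(weights, biases, x))
            for k in range(n_classes):
                # Gradient of the cross-entropy loss w.r.t. the score of class k.
                grad = probs[k] - (1.0 if k == y else 0.0)
                for j in range(n_features):
                    weights[k][j] -= lr * grad * x[j]
                biases[k] -= lr * grad
    return weights, biases


def predict(weights, biases, x):
    """Return the index of the highest-scoring class."""
    scores = class_scores(weights, biases, x)
    return max(range(len(scores)), key=scores.__getitem__)


weights, biases = train(samples, labels, n_classes=len(VOCAB))
print([VOCAB[predict(weights, biases, x)] for x in samples])
```

The point of the sketch is only that the size of the vocabulary sets the number of outputs and the chosen features set the number of inputs, which is why the set of terms to be recognized matters so much for the firmware/software effort.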