Not quite the same scenario, but I have a single chip (PIC18F46J11) being used to produce different greetings as RFID keys are placed near a lock mechanism.
I used that IC because it has lots of ROM. The messages are stored as 8-bit WAV samples and are played back using the PIC's PWM output. The quality isn't great, but it's adequate at 4,000 samples per second: good enough to recognize whose voice is recorded, but certainly not Hi-Fi!
The principle might be adaptable to your needs. Sorry but although I can explain how it works, I can't let you have the software for contractual reasons. It's written in C.
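To give a feel for the principle, here is a rough generic sketch. It is not the actual software. It assumes the samples are unsigned 8-bit PCM (as in 8-bit WAV data) and that a timer calls the tick routine once per sample period. The names `player_t`, `player_start`, `player_tick`, `seconds_of_audio` and the `pwm_duty` variable are made up for illustration; on a real PIC, `pwm_duty` would be replaced by a write to the PWM duty-cycle register.

```c
#include <stdint.h>
#include <stddef.h>

#define SAMPLE_RATE_HZ 4000u   /* from the post: 4,000 samples per second */
#define SILENCE_LEVEL  128u    /* mid-scale = silence for unsigned 8-bit audio */

/* Hypothetical stand-in for the hardware PWM duty-cycle register. */
static volatile uint8_t pwm_duty = SILENCE_LEVEL;

/* Playback state for one stored message (hypothetical structure). */
typedef struct {
    const uint8_t *samples;  /* unsigned 8-bit samples held in ROM */
    size_t length;           /* number of samples in the message */
    size_t pos;              /* index of the next sample to output */
} player_t;

/* Select a message and rewind to its first sample. */
static void player_start(player_t *p, const uint8_t *samples, size_t length)
{
    p->samples = samples;
    p->length = length;
    p->pos = 0;
}

/* Call once per sample period, e.g. from a 4 kHz timer interrupt.
 * Copies the next sample into the PWM duty value.
 * Returns 1 while a sample was output, 0 once the message has ended. */
static int player_tick(player_t *p)
{
    if (p->pos >= p->length) {
        pwm_duty = SILENCE_LEVEL;  /* park the output at mid-scale */
        return 0;
    }
    pwm_duty = p->samples[p->pos++];
    return 1;
}

/* Whole seconds of audio that fit in a given number of ROM bytes,
 * at one byte per sample. */
static uint32_t seconds_of_audio(uint32_t rom_bytes)
{
    return rom_bytes / SAMPLE_RATE_HZ;
}
```

Each RFID key would simply select which sample array gets passed to `player_start`. Storage works out to 4,000 bytes per second of speech, so 32,000 bytes of ROM holds about 8 seconds. The PWM output would normally go through a low-pass filter before the amplifier.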
Brian.