To extract instruments, about the best you can do is to filter out one note at a time. This should work with 'pure' sounds, such as those that come from a flute or piano.
You examine one second of music in a spectrum display. See what other pitches are sounding, besides the voice.
You're looking particularly for the fundamental frequencies, meaning the lowest frequency of each note that the instrument is playing. The fundamental is usually the strongest component of the sound coming from almost any instrument.
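As a rough sketch of doing this lookup in code instead of by eye, the snippet below measures the power at a list of candidate frequencies and picks the strongest one. It uses the Goertzel algorithm, which is my choice here and not something a spectrum display necessarily uses. The function names, the 8000 Hz sample rate, and the synthetic test note are all made up for illustration, and it assumes the fundamental really is the strongest component in the segment.

```python
import math

def goertzel_power(samples, sample_rate, freq):
    """Return the signal power at a single frequency (Goertzel algorithm)."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s1 = s2 = 0.0
    for x in samples:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

def strongest_frequency(samples, sample_rate, candidates):
    """Pick the candidate frequency (in Hz) that carries the most power."""
    return max(candidates, key=lambda f: goertzel_power(samples, sample_rate, f))

# Made-up test note: a 440 Hz fundamental plus a weaker overtone at 880 Hz.
sample_rate = 8000
note = [math.sin(2 * math.pi * 440 * n / sample_rate)
        + 0.5 * math.sin(2 * math.pi * 880 * n / sample_rate)
        for n in range(2000)]

print(strongest_frequency(note, sample_rate, range(100, 1001, 10)))  # 440
```

With real music you would run this on a short segment at a time, and the candidate list would need to be finer than 10 Hz steps.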
The fundamental is a sine wave. (All the overtones are sine waves as well.) It lasts as long as the note lasts: maybe several seconds, maybe a tenth of a second.
If you can identify a fundamental frequency, then you can filter it out (in that segment of music).
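One way to filter out a single frequency is a notch filter. The sketch below builds a second-order notch using the widely used "Audio EQ Cookbook" biquad formulas. The function names and the default Q of 30 (how narrow the notch is) are my assumptions, not part of any particular audio editor.

```python
import math

def apply_biquad(samples, b, a):
    """Run samples through one second-order filter section.
    b = (b0, b1, b2) and a = (a1, a2), already normalized so that a0 = 1."""
    b0, b1, b2 = b
    a1, a2 = a
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def notch_filter(samples, sample_rate, freq, q=30.0):
    """Remove a narrow band around freq (Hz). A higher q gives a narrower notch."""
    w0 = 2.0 * math.pi * freq / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0)
    a = (-2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return apply_biquad(samples, b, a)
```

You would apply `notch_filter` only to the segment where that note is sounding, with `freq` set to the fundamental you identified. A real instrument's pitch wobbles a little, so a lower Q (a wider notch) may be needed in practice.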
Filtering one frequency at a time will be tedious. If you are willing to go through this process, little by little, then you can remove just about all of that instrument.
And even if the instrument is not a 'pure' tone, just removing the fundamental will reduce its volume.
You might get by with a high-pass filter, to remove an instrument in one operation. The instrument must be playing notes at a lower frequency than the voice is, so that the filter can cut everything below the voice and let the voice through. Your filter must have a sharp cutoff slope.
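Since the instrument sits below the voice, the filter that removes it is one that blocks the low frequencies and passes the high ones, that is, a high-pass filter. Here is a minimal sketch using the "Audio EQ Cookbook" high-pass biquad, run several times in a row to steepen the slope. The function names and the choice of four stages are assumptions for illustration. Note that repeating the same stage also softens the response right at the cutoff, so set the cutoff a bit below the lowest voice note.

```python
import math

def apply_biquad(samples, b, a):
    """Run samples through one second-order filter section.
    b = (b0, b1, b2) and a = (a1, a2), already normalized so that a0 = 1."""
    b0, b1, b2 = b
    a1, a2 = a
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def high_pass(samples, sample_rate, cutoff, stages=4):
    """Attenuate everything below cutoff (Hz). Each stage adds about
    12 dB per octave of slope, so more stages give a sharper cutoff."""
    w0 = 2.0 * math.pi * cutoff / sample_rate
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / (2.0 * (1.0 / math.sqrt(2.0)))  # Q = 0.707
    a0 = 1.0 + alpha
    b = ((1.0 + cos_w0) / (2.0 * a0), -(1.0 + cos_w0) / a0, (1.0 + cos_w0) / (2.0 * a0))
    a = (-2.0 * cos_w0 / a0, (1.0 - alpha) / a0)
    out = list(samples)
    for _ in range(stages):
        out = apply_biquad(out, b, a)
    return out
```

For example, `high_pass(samples, 44100, 300)` would cut an instrument playing below roughly 300 Hz, on the assumption that the voice stays above that.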
-----------
What you'd like to do is automate the above process. If you can write the code to do it, then you may have created a new technique that will get you some attention, because I've never heard of it being done.