If you are looking for a cookbook approach, that may be difficult to find for your case.
One of the seminal books in computer architecture is "Computer Architecture: A Quantitative Approach" by Hennessy and Patterson. It should have most of what you need, and you should read and really understand the applicable portions of it.
I don't fully understand the reasoning behind your hot spare. As long as the operating conditions stay in the proper range (mostly temperature and voltages), CPUs don't usually fail. But let's assume one does fail: how do you detect the failure with only one spare? You can probably detect many catastrophic failures, but detecting subtle bugs (like the floating-point bug on the early Pentiums) will be next to impossible.
Where utmost reliability is required, one can use 3 CPUs, all doing the same computation in parallel, and use majority voting when the outputs disagree. If a CPU generates incorrect results several times over a given period, one can conclude that it's defective and deactivate it. With only 2 CPUs, one can conclude that one of them has failed (or produced an incorrect result), but figuring out which one is bad is a bit tricky, especially if the failure is intermittent.
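
To make the voting idea concrete, here is a rough software sketch of a majority voter with a fault counter. It is only an illustration: a real system would do the voting in hardware or in a hardened supervisor, and the class name `TmrVoter`, the window length, and the fault threshold are all made-up assumptions, not anything from a standard.

```python
from collections import Counter, deque


class TmrVoter:
    """Majority voter over redundant units (hypothetical sketch)."""

    def __init__(self, n_units=3, window=100, max_faults=3):
        self.active = set(range(n_units))
        self.max_faults = max_faults
        # Per-unit record of recent rounds: True means the unit was outvoted.
        self.history = {u: deque(maxlen=window) for u in range(n_units)}

    def vote(self, outputs):
        """outputs maps unit id -> result; returns (value, outvoted_units)."""
        live = {u: v for u, v in outputs.items() if u in self.active}
        if not live:
            raise RuntimeError("no active units left")
        value, count = Counter(live.values()).most_common(1)[0]
        if count * 2 <= len(live):
            # E.g. two units that disagree: an error is detected,
            # but there is no way to tell which unit produced it.
            raise RuntimeError("no majority: cannot tell which unit is wrong")
        outvoted = sorted(u for u, v in live.items() if v != value)
        for u in live:
            self.history[u].append(u in outvoted)
            # Too many recent disagreements: treat the unit as defective.
            if sum(self.history[u]) >= self.max_faults:
                self.active.discard(u)
        return value, outvoted


voter = TmrVoter(window=10, max_faults=2)
print(voter.vote({0: 42, 1: 41, 2: 42}))  # unit 1 is outvoted
```

Once one unit has been deactivated, the remaining two can still detect a mismatch, but `vote` can only raise an error, which is the 2-CPU problem described above.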