I'm looking for a paper online, with little luck so far.
I've found it before, so it's just a matter of time.
In most cases, you'll want a "reciprocal square-root" algorithm. The algorithm is iterative, and it converges to x^-0.5 rather than x^+0.5.
That's ok - just multiply by x when you're done, and you have x^+0.5.
That form is used (to my knowledge) in most DSPs and FPGAs, since it keeps the number of divides to a minimum.
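To make the idea concrete, here is a minimal sketch in Python. It assumes the iteration is Newton-Raphson on f(y) = 1/y^2 - x, which gives the update y = y * (1.5 - 0.5 * x * y * y). That is a common choice, not necessarily what the paper uses. The names `rsqrt` and `sqrt_via_rsqrt`, the starting guess of 1.0, and the default of 8 iterations are all just picked for illustration. The exponent is split off first so the iteration only ever sees a value in [0.25, 1), where the fixed starting guess is known to converge.

```python
import math

def rsqrt(x, iterations=8):
    """Approximate x**-0.5 using only multiplies and subtracts in the loop.

    Assumption: Newton-Raphson update y <- y * (1.5 - 0.5 * m * y * y).
    """
    if x <= 0.0:
        raise ValueError("x must be positive")

    # Split x = m * 2**e with 0.5 <= m < 1 (exponent handling, no divide).
    m, e = math.frexp(x)
    if e % 2:
        # Force an even exponent so 2**(-e/2) is an exact power of two.
        m *= 0.5
        e += 1
    # Now 0.25 <= m < 1, so 1/sqrt(m) lies in (1, 2] and y = 1.0
    # is a safe (if crude) starting guess.
    y = 1.0
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * m * y * y)

    # Undo the range reduction: 1/sqrt(x) = (1/sqrt(m)) * 2**(-e/2).
    return math.ldexp(y, -(e // 2))

def sqrt_via_rsqrt(x, iterations=8):
    """Recover x**+0.5 by multiplying the reciprocal root by x."""
    if x == 0.0:
        return 0.0
    return x * rsqrt(x, iterations)
```

Calling `sqrt_via_rsqrt(2.0)` should agree closely with `math.sqrt(2.0)`. In hardware you would normally replace the crude starting guess with a small lookup table, which cuts the iteration count to two or three.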
Added after 16 minutes:
Still looking, but I need to take care of a few things. I don't think this is the one I found before, but it'll get you started:
**broken link removed**
(I'll look some more later)
Added after 6 minutes:
Also - are you looking for a floating-point implementation? Or maybe fixed-point (integer with extra fractional bits)?