Ok
So there isn't anything really wrong with the code. It should synthesize and work OK on an FPGA, but I can see potential bugs and design flaws.
1. The counters i, j, k and l. You have declared them as 0 to 100, and you never reset them. In simulation, when they get to 100 and you try to add one, you will get an integer out of range error (integers do not roll over). But on real hardware they will be synthesised into 7-bit numbers, so they will actually run 0 to 127 and then roll over back to 0.
2. How big could dout be? You'll probably get away with it being 8 (like the default), but it's going to get bigger and bigger. Instead of having a big parallel output, why can't you put the results in a RAM? Yes, you will have to access them serially, but you will use far fewer resources. Looking at the way you have it, it wouldn't be too big a deal to have this as a RAM anyway, as only one output is written in any one clock cycle.
3. The function sum of square does too much in a single clock cycle. I would recommend breaking it up and pipelining it. All you would really need to do is put almost identical code inside a clocked process, and make sure you delay any addressing by the same number of clocks. This would allow you to increase the Fmax of the final circuit (assuming that's what you meant by "faster").
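To make points 1 to 3 a bit more concrete, here is a rough sketch of the idea, not your actual design. I haven't seen how your sum of square function is wired up, so the entity name, the ports (a, b, rd_addr, rd_data), the widths and the two-stage split are all assumptions for illustration. It shows a counter that wraps explicitly at 100, the arithmetic split across two register stages, and the write address delayed by the same number of clocks before the result goes into an inferred RAM.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: names and widths are assumptions, not the original code.
entity sum_sq_pipe is
  port (
    clk     : in  std_logic;
    rst     : in  std_logic;                 -- synchronous reset
    a, b    : in  unsigned(7 downto 0);      -- assumed 8-bit operands
    rd_addr : in  integer range 0 to 127;    -- serial read-back address
    rd_data : out unsigned(16 downto 0)
  );
end entity;

architecture rtl of sum_sq_pipe is
  -- Results live in a RAM instead of one wide parallel output.
  type ram_t is array (0 to 127) of unsigned(16 downto 0);
  signal ram : ram_t;

  signal i          : integer range 0 to 100 := 0;  -- write index
  signal i_d1, i_d2 : integer range 0 to 100 := 0;  -- index delayed 1 and 2 clocks
  signal a_sq, b_sq : unsigned(15 downto 0);
  signal sum        : unsigned(16 downto 0);
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Counter with explicit reset and wrap, so simulation and
      -- hardware agree (no out of range error, no silent 0 to 127).
      if rst = '1' or i = 100 then
        i <= 0;
      else
        i <= i + 1;
      end if;

      -- Stage 1: the two squares, with the index delayed alongside.
      a_sq <= a * a;
      b_sq <= b * b;
      i_d1 <= i;

      -- Stage 2: the addition, index delayed again.
      sum  <= resize(a_sq, 17) + resize(b_sq, 17);
      i_d2 <= i_d1;

      -- Write: only one result per clock, at the matching delayed address.
      ram(i_d2) <= sum;

      -- Registered read port, so the tools can infer a block RAM.
      rd_data <= ram(rd_addr);
    end if;
  end process;
end architecture;
```

Each register stage now has only one multiply or one add between flip-flops, which is what lets the tools reach a higher clock rate. The cost is a few clocks of latency before the first result lands in the RAM.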
It's not the VHDL code you speed up, it's the underlying design.
I would have a look at the RTL Viewer when you compile it for an FPGA. It will show you the circuit it created - see if it matches what you drew before you wrote the code. (You did draw the circuit before you wrote the code, right? HDL is a hardware description language, so if you don't know what the circuit is, how do you expect to describe it?)
And finally - stop using std_logic_unsigned and std_logic_arith. They are non-standard VHDL. I suggest switching to numeric_std (which is part of the VHDL standard). With your current code I don't think it will break anything.
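As a rough idea of what the switch looks like, here is a minimal sketch. The entity add_one and its ports are made up for illustration and are not from your code. The main change is that arithmetic is done on unsigned (or signed) rather than directly on std_logic_vector, with explicit type conversions at the boundaries.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;  -- instead of std_logic_arith / std_logic_unsigned

-- Hypothetical example entity, for illustration only.
entity add_one is
  port (
    din  : in  std_logic_vector(7 downto 0);
    dout : out std_logic_vector(7 downto 0)
  );
end entity;

architecture rtl of add_one is
begin
  -- std_logic_unsigned lets you write "din + 1" on a std_logic_vector.
  -- With numeric_std you state the interpretation explicitly:
  -- convert to unsigned, do the arithmetic, convert back.
  dout <= std_logic_vector(unsigned(din) + 1);
end architecture;
```

Likewise, conv_integer(x) becomes to_integer(unsigned(x)), and conv_std_logic_vector(n, w) becomes std_logic_vector(to_unsigned(n, w)).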