If you are clear about the hardware you want to build, your RTL will already be an optimized design. The only thing left for the tool is the logic optimization of that already-optimized design.
For gate count, what to use and what not to use all depends on the requirement. If you are working at a high frequency, you might have to put in more flops to break up long combinational paths. But if the technology is 28 nm or 20 nm compared to 65 nm, the gates are faster, so you might be using fewer flops. So it all depends on where your design is going to be synthesized; knowing that, you can write RTL efficiently.
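To make the frequency vs. technology trade-off concrete, here is a rough back-of-the-envelope sketch in Python. The function name, the flop overhead parameter, and all delay numbers are made up for illustration; they are not taken from any real library or process. It only estimates how many register stages a combinational path would need so that each stage fits in one clock period.

```python
def pipeline_stages(logic_delay_ps, clock_period_ps, flop_overhead_ps=0):
    """Estimate register stages needed for one combinational path.

    All values are integer picoseconds. flop_overhead_ps stands for the
    part of each cycle lost to the flop itself (clock-to-Q plus setup);
    it is an assumed, illustrative figure.
    """
    usable_ps = clock_period_ps - flop_overhead_ps
    if usable_ps <= 0:
        raise ValueError("clock period is shorter than the flop overhead")
    # Ceiling division: every stage may hold at most usable_ps of logic.
    return max(1, -(-logic_delay_ps // usable_ps))


# Hypothetical path: 6000 ps of logic on a slower node, 500 MHz clock (2000 ps).
slow_node = pipeline_stages(6000, 2000, flop_overhead_ps=200)

# Same logic assumed to take 3000 ps on a faster node, same clock.
fast_node = pipeline_stages(3000, 2000, flop_overhead_ps=200)

# Same slower node, but a relaxed 200 MHz clock (5000 ps).
slow_clock = pipeline_stages(6000, 5000, flop_overhead_ps=200)

print(slow_node, fast_node, slow_clock)  # prints: 4 2 2
```

With these assumed numbers, the faster node or the slower clock both cut the stage count from 4 to 2, which is the point above: how many flops you need depends on the target frequency and the technology you synthesize to. Real numbers have to come from your library and timing reports.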