Having a read_count and a write_count won't help speed up the empty flag, because they are in different clock domains. If you enable them, you will see that the two counts typically differ by 3-4 counts, depending on the relationship between the read and write clock frequencies. This is inherent in the way an asynchronous FIFO operates, as the address pointers need to be transferred across clock domains.
As I wrote in my previous post, you can't improve the empty signal latency unless you change the design to use a synchronous FIFO.
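To illustrate why the pointer transfer costs read clock cycles, here is a minimal behavioral sketch in Python. It is not the FIFO core's actual implementation. It assumes the write pointer is Gray-coded and passed through a chain of flip-flops clocked by the read clock (a common approach), ignores metastability, and leaves out any extra registering of the empty flag that a real core may add. The names `bin_to_gray`, `empty_latency`, and `sync_stages` are hypothetical and used only for this example.

```python
def bin_to_gray(n: int) -> int:
    """Convert a binary count to Gray code (one bit changes per increment)."""
    return n ^ (n >> 1)


def empty_latency(write_time: int, rd_period: int, sync_stages: int = 2):
    """Return (read_edges, elapsed_time) from one write until `empty` drops.

    Assumed model: read clock edges occur at rd_period, 2*rd_period, ...
    and a single write into an empty FIFO happens at write_time
    (both in the same arbitrary time unit).
    """
    rd_gray = bin_to_gray(0)               # read pointer stays at 0 in this model
    sync = [bin_to_gray(0)] * sync_stages  # synchronizer flops in the read domain
    edges_after_write = 0
    k = 0
    while True:
        k += 1
        t = k * rd_period                  # time of this read clock edge
        written = t > write_time           # has the write pointer advanced yet?
        wr_gray = bin_to_gray(1 if written else 0)
        # On each read edge, stage 0 samples the write-domain pointer and
        # every later stage takes the previous value of the stage before it.
        sync = [wr_gray] + sync[:-1]
        if written:
            edges_after_write += 1
        # empty is the comparison of the read pointer with the synchronized
        # write pointer; it drops once the new value reaches the last stage.
        if sync[-1] != rd_gray:
            return edges_after_write, t - write_time
```

With a read period of 10, `empty_latency(3, 10)` returns `(2, 17)` and `empty_latency(9, 10)` returns `(2, 11)`. In this model the number of read edges is fixed by the number of synchronizer stages, while the elapsed time depends on where the write falls relative to the read clock. A real core may add further register stages, so the figure you observe can be higher.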