I found this in a Xilinx white paper:
"A synchronous RAM cannot perform read-modify-write operations in a single clock cycle, but the dual-port, synchronous block RAM in all Xilinx® FPGAs can pipeline the write operation and achieve a throughput of one read-modify-write operation per clock cycle. To do so, the designer uses Port A as the read port, uses Port B as the write port, and uses one common clock for both ports. The read address is routed to Port A. A copy of the read address is delayed by one clock and routed to Port B. The data from Port A is modified and used as the data input to Port B."
...
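One way to see why this scheme sustains one operation per clock is a cycle-level software model. The sketch below is Python rather than HDL, and the names `DualPortRAM`, `clock`, and `pipelined_rmw` are hypothetical. It assumes one clock of read latency on port A and read-before-write behavior when both ports touch the same address on the same edge; a real device's cross-port collision behavior may differ, so check the memory user guide for your family.

```python
from typing import Callable, Iterable


class DualPortRAM:
    """Cycle-level model of a dual-port synchronous RAM on one shared clock.

    Port A is used only for reads and port B only for writes. Both ports
    sample their inputs at the clock edge; port A's read data appears on
    dout_a after the edge, i.e. one clock of read latency.
    """

    def __init__(self, depth: int) -> None:
        self.mem = [0] * depth
        self.dout_a = 0

    def clock(self, addr_a: int, we_b: bool, addr_b: int, din_b: int) -> None:
        # Modeling assumption: if both ports hit the same address on the same
        # edge, port A returns the old contents (read-before-write).
        self.dout_a = self.mem[addr_a]
        if we_b:
            self.mem[addr_b] = din_b


def pipelined_rmw(ram: DualPortRAM, addresses: Iterable[int],
                  modify: Callable[[int], int]) -> int:
    """Issue one read-modify-write per clock; return the number of clocks used."""
    addr_d = 0        # read address delayed by one clock, routed to port B
    valid_d = False   # True when addr_d/dout_a belong to a real operation
    cycles = 0
    # One extra idle cycle at the end flushes the final pending write.
    for addr in [*addresses, None]:
        # Combinational logic: modify the data port A returned last clock.
        din_b = modify(ram.dout_a)
        ram.clock(addr_a=addr if addr is not None else 0,
                  we_b=valid_d, addr_b=addr_d, din_b=din_b)
        # Delay register for the address, updated on the same edge.
        addr_d = addr if addr is not None else 0
        valid_d = addr is not None
        cycles += 1
    return cycles


if __name__ == "__main__":
    ram = DualPortRAM(depth=4)
    # Histogram-style increment: each operation does mem[a] += 1.
    n = pipelined_rmw(ram, [3, 1, 3, 2, 0, 3], lambda x: x + 1)
    print(n, ram.mem)  # 7 [1, 1, 1, 3]
```

Six operations finish in seven clocks (one extra to drain the pipeline), so throughput approaches one read-modify-write per clock. The model also exposes a caveat the quoted passage does not discuss: if the same address is issued on two consecutive clocks, the second read happens on the same edge as the first write and sees the stale value. With this model, `pipelined_rmw(ram, [2, 2], lambda x: x + 1)` leaves `ram.mem[2] == 1`, not 2. One common fix is a forwarding (bypass) path that feeds the just-modified value straight back into the modify logic when the current read address matches the delayed one.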