@dpaul, perhaps.
RGMII is a basic but interesting standard. The clock and data are edge-aligned at the source. To clock the data in correctly, the clock edge needs to be delayed by an appropriate amount, roughly a quarter cycle. This was originally intended to be done by adding a buffer or a long trace on the PCB. Because this is annoying, PHYs eventually added the ability to delay the RX clock as well as the TX clock internally. This also leads to problems, because two quarter-cycle delays (for example, one on the PCB and one in the PHY) will place the edges at the data transitions once again!
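To put numbers on that, here is a rough sketch assuming the 125 MHz gigabit RGMII clock with data on both edges, so the data changes every 4 ns. The function name is made up for illustration, and it ignores setup/hold, jitter and board skew.

```python
RGMII_CLOCK_HZ = 125e6  # assumed: gigabit RGMII clock, DDR data

def distance_to_transition_ns(clock_delay_ns, clock_hz=RGMII_CLOCK_HZ):
    """How far the delayed clock edge sits from the nearest data transition."""
    bit_time_ns = 1e9 / clock_hz / 2          # DDR: new data every half period
    position = clock_delay_ns % bit_time_ns   # where the edge lands in the data window
    return min(position, bit_time_ns - position)

# No delay, one quarter-cycle delay, two quarter-cycle delays
for delay_ns in (0.0, 2.0, 4.0):
    print(delay_ns, distance_to_transition_ns(delay_ns))
```

One 2 ns delay puts the edge in the middle of the 4 ns data window. Stack a second 2 ns delay on top and the margin is back to zero.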
For FPGA applications, where data is consumed by the FPGA, there are additional concerns. The BUFG buffers actually have a very significant delay, which for some devices exceeds half a cycle. (BUFIO/BUFR don't have this issue, nor does a BUFG+PLL/MMCM/DCM set up as a zero-delay buffer.) As a result, some FPGA applications will want to delay the _data_ rather than the clock, so that the _clock_ ends up correctly delayed relative to the data.
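As a back-of-the-envelope sketch of that idea: if the clock path inside the FPGA adds some known delay, you can work out how much data delay leaves the clock a quarter cycle behind the data. The 4.5 ns figure below is a hypothetical clock-path delay, not a number from any datasheet, and the function name is made up. Real input delay elements have a limited range and step size, so take the actual numbers from the timing report.

```python
def required_data_delay_ns(clock_path_delay_ns, clock_hz=125e6):
    """Data delay that leaves the clock a quarter period behind the data.

    Wraps over a full clock period so each clock edge still captures
    the data that was launched on the same edge.
    """
    period_ns = 1e9 / clock_hz         # 8 ns at 125 MHz
    target_skew_ns = period_ns / 4     # clock should lag data by a quarter period
    return (clock_path_delay_ns - target_skew_ns) % period_ns

# Hypothetical BUFG path of 4.5 ns, which is more than half a cycle
print(required_data_delay_ns(4.5))
```

With a 4.5 ns clock path, delaying the data by 2.5 ns leaves the clock 2 ns behind the data, which is the quarter-cycle offset you wanted in the first place.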
Finally, Ethernet does not assume exact clock frequencies. This means the data rates of the FPGA and of each port will be slightly different (unless both PHYs share the same clock source).
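To get a feel for how small (but non-zero) the mismatch is, here is a quick estimate. It assumes the commonly quoted +/-100 ppm Ethernet clock tolerance and a 1518-byte maximum frame. Both are my assumptions for illustration, as is the function name.

```python
def slip_bytes_per_frame(frame_bytes, ppm_a, ppm_b):
    """Bytes by which two clock domains drift apart while one frame passes."""
    return frame_bytes * abs(ppm_a - ppm_b) * 1e-6

# Worst case: one clock at +100 ppm, the other at -100 ppm
print(slip_bytes_per_frame(1518, +100, -100))   # standard maximum frame
print(slip_bytes_per_frame(9000, +100, -100))   # jumbo frame
```

So the drift within one frame is a fraction of a byte for standard frames and a couple of bytes for jumbo frames. A small elastic buffer is enough to absorb it, as long as it can re-center between frames.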
If you want to go this route, you should try to think about the constraints required.
The reliable solution is to use an RGMII-GMII core (code is in some of the coregens) and correctly constrain the inputs/outputs by adding input/output timing constraints to the UCF file. The output constraints didn't make much sense to me, in that I don't think I could get automatic checking out of them. I recall just setting them up and using them to set the output delays. This code also has the elastic buffer that allows operation with different clock rates.
These cores may use excessive clocking resources, but it is easy to correct that as long as the constraints are appropriately modified.
edit -- Also, there may be two RGMII options for your PHY. One will have TX delay and one won't. However, the internals of the FPGA play a role in all of this. As a result, I prefer to set the PHY into basic RGMII mode and do all of the data-clock manipulation in the FPGA.