You got it.
Look at the expression
A + B
where A and B are 32-bit numbers. In RTL, this gets evaluated directly by the CPU the simulation is running on, at roughly the speed of a single CPU instruction. But at the gate level, that one adder becomes a netlist of many individual gates, and each gate gets evaluated separately, costing at least one CPU instruction apiece. In reality, the scheduling and propagation of the new values resulting from each evaluation takes far more cycles than the evaluation itself. Many simulators optimize gate-level netlists back toward their RTL equivalent to get performance, but there's only so much you can do and still preserve the timing accuracy needed in gate-level simulation.
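
To put a rough number on it, here is a small Python sketch that computes the same 32-bit sum two ways: once as a single host operation (the RTL view) and once bit by bit through a ripple-carry adder made of 2-input gates (the gate-level view). The function names, the ripple-carry structure, and the five-gate full adder are my own assumptions for illustration; a real synthesized netlist will use different cells and a different adder architecture, and this counts only gate evaluations, not the event scheduling that dominates in a real simulator.

```python
MASK32 = 0xFFFFFFFF


def rtl_add(a, b):
    """RTL-style add: one host addition, wrapped to 32 bits."""
    return (a + b) & MASK32


def gate_add(a, b, width=32):
    """Gate-style add: ripple-carry adder evaluated one gate at a time.

    Returns (sum, gate_evals). The full adder here is assumed to be
    built from five 2-input gates (2 XOR, 2 AND, 1 OR).
    """
    gate_evals = 0
    carry = 0
    result = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        p = x ^ y          # XOR gate: propagate
        s = p ^ carry      # XOR gate: sum bit
        g = x & y          # AND gate: generate
        t = p & carry      # AND gate: propagated carry
        carry = g | t      # OR gate: carry out
        gate_evals += 5
        result |= s << i
    return result, gate_evals


if __name__ == "__main__":
    a, b = 123456789, 987654321
    total, evals = gate_add(a, b)
    print("rtl :", rtl_add(a, b))
    print("gate:", total, "after", evals, "gate evaluations")
```

Under those assumptions, the gate-level version needs 5 x 32 = 160 gate evaluations to produce the result the RTL version gets from one addition, and that is before any of the per-event scheduling overhead is counted.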