As far as I know, you won't find a faster way to test the status of an I/O line on a commercial CPU. Usually the status of a processor's I/O line is mapped to a specific bit in one CPU register (or several, depending on the I/O pin count). So I suppose that 2 clock cycles is the minimum overhead you can expect to acquire this information. I am not an expert on CPU architecture design, but I think this is because the I/O status registers cannot be used directly as operand registers for ALU instructions: the CPU first has to perform a read to move the I/O register value into an ALU operand register, and then perform the bitwise operation (AND with the bit mask value) to get the status of the I/O line.
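To make the two steps concrete, here is a minimal sketch in C of the read followed by the mask. The names fake_io_status_reg, PIN3_MASK and io_line_is_high are my own, made up for illustration, and an ordinary variable stands in for the register; on real hardware the register would be at a fixed address given in the datasheet, and how many cycles the compiled code takes depends on the CPU and compiler.

```c
#include <stdint.h>

/* Hypothetical stand-in for a memory-mapped I/O status register.
   On real hardware this would be a fixed address from the datasheet. */
static volatile uint8_t fake_io_status_reg;

/* Hypothetical assumption: the I/O line of interest is mapped to bit 3. */
#define PIN3_MASK ((uint8_t)(1u << 3))

/* Returns 1 if the masked I/O line is high, 0 otherwise. */
static int io_line_is_high(volatile const uint8_t *reg, uint8_t mask)
{
    uint8_t value = *reg;        /* step 1: read the I/O register */
    return (value & mask) != 0;  /* step 2: AND with the bit mask */
}
```

With fake_io_status_reg set to 0x08, io_line_is_high(&fake_io_status_reg, PIN3_MASK) returns 1; with 0xF7 (every bit set except bit 3) it returns 0.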
Just out of curiosity: why is it so critical to do this operation in a single cycle?
Regards
Mowgli