Welcome to EDAboard.com

Efficient SerDes-Like Module

Status
Not open for further replies.

SharpWeapon

Hi folks,

In my design I have N processing elements (N could range from 64 to 1024), each PE taking x samples at a time. I would like to feed all N PEs from a single on-chip memory and trigger their processing at the same time. I tried designing a SerDes-like module that reads the memory serially, deserializes each group of x consecutive samples, and feeds it to one PE. But I am not happy with this design choice, since it takes N*x clock cycles to feed all PEs, which is the same delay as streaming each sample to the PEs one at a time. Is there a more efficient way of doing this?

Cheers,
 

Instead of storing parallel data in your memory, store serial data, with each bit across the width of the memory acting as a separate serial channel. This entails loading each bit of each PE sample into the memory at a separate address.
Code:
e.g. 4 PE samples 0x8, 0x7, 0x9, 0x4
starting from MSB of samples read first:
RAM contents
1010
0101
0100
0110
^^^^
8794
Doing the above allows you to send data to all PEs simultaneously, with a latency that depends only on the bit width of the parallel data being sent to the PEs (the serial-to-parallel conversion). You should be able to get the tools to produce a RAM that is 1024 bits wide.

The memory might be made up of multiple RAMs, but it will be treated as a single memory. Unless you ONLY want to use a single block RAM in your FPGA, using more parallel resources is the only way to improve your latency.
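
To make the layout concrete, here is a minimal software sketch of the packing step in Python. It is only a model of the idea, not HDL, and the function name pack_bit_planes is made up for illustration. It assumes one sample per PE, MSB stored first, and the first PE in the leftmost bit of each RAM row, as in the example above.

```python
def pack_bit_planes(samples, width):
    """Transpose one sample per PE into `width` RAM rows, MSB row first.

    Row k holds bit (width - 1 - k) of every sample. The leftmost bit of
    each row belongs to the first PE (an assumption matching the example).
    """
    rows = []
    for bit in range(width - 1, -1, -1):       # MSB first
        row = 0
        for s in samples:                       # first PE ends up leftmost
            row = (row << 1) | ((s >> bit) & 1)
        rows.append(row)
    return rows


# The four 4-bit samples from the example: 0x8, 0x7, 0x9, 0x4
for r in pack_bit_planes([0x8, 0x7, 0x9, 0x4], width=4):
    print(format(r, "04b"))
```

This prints 1010, 0101, 0100, 0110, the same four rows as the RAM contents shown above.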
 
Your approach is ideal if you have fewer samples. In my case, the RAM is 62500x16 bits containing image pixels, and I have a minimum of 64 PEs. If I changed the dimensions to what you suggested, the RAM would be 64x15625. How can you have such a RAM? I don't think it is feasible.

Another approach: each memory location contains one pixel, which is 16 bits wide. The first x pixels from x memory locations, once streamed, should be fed to the first PE, the second x pixels to the second PE, and so on. I thought of running the SerDes module on a faster clock than the PEs to reduce the delay, but that still doesn't satisfy me. Any better way?
 

I don't think you followed my explanation...

The PEs are across the width of the RAM, so for your example you would have a 15625 deep x 64 wide RAM. The data in the RAM would be loaded serially and offloaded serially. All other methods require time-multiplexing a bus, and as you've said, you don't like the results of that.
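
A rough back-of-the-envelope comparison of the two schemes, written as a small Python sketch. The assumptions are mine, not from the thread: one RAM read per clock, every PE needs x samples of 16 bits, and x = 8 is a purely hypothetical value. The function names are also made up for illustration.

```python
def serial_feed_cycles(n_pes, x):
    """One sample-wide memory, one sample read per clock: N * x cycles."""
    return n_pes * x


def bit_serial_feed_cycles(x, width):
    """N-bit-wide memory, one bit per PE per clock: x * width cycles."""
    return x * width


def transposed_ram_shape(depth, width, n_pes):
    """Return (rows, columns) of the same storage laid out one column per PE."""
    total_bits = depth * width
    if total_bits % n_pes:
        raise ValueError("total bits must divide evenly across the PEs")
    return total_bits // n_pes, n_pes


x = 8  # hypothetical samples per PE
for n in (64, 1024):
    print(n, serial_feed_cycles(n, x), bit_serial_feed_cycles(x, 16))
print(transposed_ram_shape(62500, 16, 64))
```

Under those assumptions the serial feed takes 512 cycles for 64 PEs and 8192 cycles for 1024 PEs, while the bit-serial layout takes 128 cycles in both cases, because its cost no longer depends on N. The last line gives (15625, 64), i.e. the 15625 deep x 64 wide RAM mentioned above.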

- - - Updated - - -

Expanding on my original example:
Code:
e.g. With 4 PEs being used:
1st 4 PE samples 0x8, 0x7, 0x9, 0x4
2nd 4 PE samples 0x3, 0xb, 0xa, 0x1

starting from MSB of samples read first:
N x 4 RAM contents
A[0]: 1010  |
A[1]: 0101  | Read serially in this direction
A[2]: 0100  |
A[3]: 0110  V
      ^^^^
      8794  <= PE1-PE4 first samples

A[4]: 0110
A[5]: 0000
A[6]: 1110
A[7]: 1101
      ^^^^
      3ba1
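
The read side can be modelled the same way. This is a minimal Python sketch, not HDL: each PE is given a shift register that takes one bit from its own column on every RAM read, and a full sample is available after every 4 reads. The name unpack_bit_planes and the leftmost-bit-is-PE1 convention are assumptions chosen to match the example above.

```python
def unpack_bit_planes(rows, n_pes, width):
    """Rebuild per-PE samples from bit-plane rows read in address order."""
    groups = []
    shift_regs = [0] * n_pes
    for i, row in enumerate(rows):
        for pe in range(n_pes):
            bit = (row >> (n_pes - 1 - pe)) & 1      # leftmost bit -> PE1
            shift_regs[pe] = (shift_regs[pe] << 1) | bit
        if (i + 1) % width == 0:                     # one sample per PE is complete
            groups.append(shift_regs)
            shift_regs = [0] * n_pes
    return groups


# A[0]..A[7] from the example above
ram = [0b1010, 0b0101, 0b0100, 0b0110,
       0b0110, 0b0000, 0b1110, 0b1101]
for group in unpack_bit_planes(ram, n_pes=4, width=4):
    print([hex(s) for s in group])
```

The first group comes out as 0x8, 0x7, 0x9, 0x4 and the second as 0x3, 0xb, 0xa, 0x1, so all four PEs receive a new sample every 4 reads regardless of how many PEs there are.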
 
I like your method. But in my case, switching to that kind of memory configuration would add unnecessary addressing complexity and additional modules. Thanks!
 
