[moved] clock domains crossing

dora · Apr 30, 2015

Hi ads-ee,

What you describe looks about the same complexity as what I had in mind.
Apparently you are strongly against using both clock edges, even in the very simple form I explained.

A few additional questions.
1. What tool did you use to draw the waveforms?
2. I plan to validate this using an FPGA development board we have in-house.
Is there a free tool that will assess how many resources (FFs, etc.) will be required and possibly suggest the smallest Xilinx/Altera chip that will fit the design?
Obviously I can test in ISE, but is this my only option?

Thanks
Dora

TrickyDicky · Apr 30, 2015

ISE and Quartus are your only options for resource usage estimates. As for appropriate chips, the choice usually comes from I/O, memory and DSP requirements.

Unless you need high-speed I/Os, all you usually need is the Cyclone or Spartan families.

ads-ee · Apr 30, 2015

dora said:
Hi ads-ee,

What you describe looks about the same complexity as what I had in mind.
Apparently you are strongly against using both clock edges, even in the very simple form I explained.

The complexity I'm referring to has nothing to do with the amount of logic required to implement something. It is the amount of effort required to understand every facet of the operation of the circuit. Subtle problems are difficult to detect or find in clever/tricky circuits and may appear out of nowhere when large production runs are done or different lot codes are used.

I've had first-hand experience fixing production problems where a lot code change produced subtle timing differences that resulted in circuit failure. On many occasions this was due to some convoluted circuit with feedback, asynchronous set/reset logic and pos/neg edge clocks. As the circuits were asynchronous they tended to have many timing issues that the original designer "fixed" with false path constraints :shock:.

Unless there is a large benefit to using a clever circuit (e.g. huge reduction in logic used, resolves a latency issue with turn-around time, etc.), I avoid using them altogether. Throwing a little pop-psychology in, I think the reason some people insist on using clever circuits in their designs is to show off how "smart" they are. Me, I'd rather have everyone think I'm an idiot and have reliable designs that never fail in production or in the field. In the end I'm always the guy they call in to fix all the problems.

A few additional questions.
1. What tool did you use to draw the waveforms?

Visio. I created a set of waveform shapes that I use for drawing timing diagrams.

2. I plan to validate this using an FPGA development board we have in-house.
Is there a free tool that will assess how many resources (FFs, etc.) will be required and possibly suggest the smallest Xilinx/Altera chip that will fit the design?
Obviously I can test in ISE, but is this my only option?

You can count up FFs and then estimate the number of logic resources (e.g. 4-input LUTs, 6-input LUTs, # of product terms, etc). Or just pick a big part and synthesize and implement it and see how many resources it takes and then rebuild with the smallest part that can hold the logic.

Counting is what I usually do to get a gross estimate (for picking 1-2 parts to start with), which later gets refined when I do actual implementation runs.

As an experiment I wrote the code and a rudimentary testbench to check the functionality of my proposed approach. It took about 35 min to write both the code and the testbench, and debugging the typos and one compare mistake took another 10 min this morning. I checked the first 20 or so transfers and they all seemed to work correctly, but the testbench really needs two additions: run-time control of the clock frequency relationship, and input/output queues for a self-checking compare with data drop/repeat indication.
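A minimal sketch of what run-time control of the clock frequency relationship could look like in a testbench. The module name clk_ctrl_sketch, the variables half1 and half2, and all of the period values are hypothetical and only illustrate the idea; this is not the testbench used for the waveforms.

```verilog
`timescale 1ps/1ps

module clk_ctrl_sketch;

  reg   [2:1]   clk;

  // half periods in ps, held in variables so they can change at run time
  integer       half1;
  integer       half2;

  initial begin
    half1  = 50000000;            // clk[1]: 100 us period
    clk[1] = 0;
    // the delay expression is re-evaluated on every half cycle
    forever #(half1) clk[1] = ~clk[1];
  end

  initial begin
    half2  = 55000000;            // clk[2]: 110 us period
    clk[2] = 0;
    forever #(half2) clk[2] = ~clk[2];
  end

  initial begin
    #2000000000;                  // 2 ms with clk[2] slower than clk[1]
    half2 = 45000000;             // clk[2] now faster (90 us period)
    #2000000000;
    half2 = half1;                // identical frequencies
    #2000000000;
    $finish;
  end

endmodule
```

A new half period takes effect on the next toggle of that clock, so the relationship between the two domains can be swept within a single simulation run.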

As you can see in the simulation output there is a large amount of setup time on the light blue signal sr[15:0] prior to sampling the transferred data into the tx_buf (yellow). You can see that the 0xd2aa is skipped in this waveform as there was no fsync[2] received during that tx_buf sample (which is all synchronous to clk[2]).

- - - Updated - - -

Update:
A Xilinx XC9572 fits the uni-directional design I created, so an XC95108 might fit the bi-directional design if the counters are modified so that they can be shared between both directions. This would involve modifying the shift_en generation to use a count range instead of a terminal value.
Otherwise the design would fit in the XC95144 with a duplicate instantiated.

ads-ee · May 1, 2015

Update 2:
I was checking some pricing on CPLDs (just being curious) and I would suggest not using a CPLD, but going with a small FPGA from Microsemi or Lattice. The XC95144 is ~$9 on Digikey whereas a Lattice iCE part is ~$4 and a Microsemi ProASIC3 nano <$5. Both the Lattice and Microsemi parts are much larger and could be used for other glue logic if required.

dora · May 1, 2015

Hi ads-ee,

Thanks. I will check it. Actually I was disappointed that Xilinx has no 144-macrocell CPLD with a smaller number of pins.
Dora

ads-ee · May 1, 2015

Well your typical CPLD architecture is a macrocell that is connected to a pin, so with 144 macrocells you end up with a part that needs 144 pins, unless some of them are left as un-bonded pins.

BTW would you like me to post the code I wrote or would you rather continue developing a rising/falling edge triggered design?

dora · May 2, 2015

Hi ads-ee,

Well, the XC95144XL comes with 81 or 117 I/O pins.
So it seems to me there is still some pin flexibility for a given number of macrocells.

Yes please share your code. It will help me.

Thanks
Dora

ads-ee · May 4, 2015

dora said:
Yes please share your code. It will help me.

Sorry didn't have the file available where I was this weekend.

Serial 16-bit Synchronizer

Code (Verilog):
module a (
  output              sdo,
  input               sdi,
  input       [2:1]   fsync,
  input       [2:1]   clk
);
 
  reg         [3:0]   icnt;
  reg         [15:0]  sr;
  wire                shift_en
                      = fsync[1] | |icnt;
 
  reg         [1:0]   shf_pipe;
  reg                 shf_fedge;
  reg                 toggle = 0;
 
  reg         [1:0]   sync_pipe;
  reg                 sample_strb;
  reg         [15:0]  tx_buf;
  reg         [4:0]   ocnt;
  reg         [15:0]  out_sr;
 
 
  always @ (posedge clk[1]) begin
 
    // when fsync arrives in clk[1] domain initialize the counter
    // the counter halts the shifting of sdi data when the s2p
    // conversion is complete. This holds the 16-bit parallel data
    // constant for 16 more clock cycles.
    if (fsync[1]) begin
      icnt <= 15;
    end else if (icnt != 0) begin
      icnt <= icnt -1;
    end
 
    // serial to 16-bit parallel conversion for transfer to
    // clk[2] domain
    if (shift_en) begin
      sr <= {sr[14:0], sdi};
    end
 
    // falling edge detector logic to drive a toggle status
    shf_pipe <= {shf_pipe[0], shift_en};
    shf_fedge <= (shf_pipe == 2'b10);
 
    // toggle status lets us cross clock domains with a single
    // signal that only has one transition per transfer.
    // note: shf_pipe == 2'b10 could be used instead of shf_fedge
    // reducing the FF count by 1. Another FF could be removed by
    // using the !shift_en & shf_pipe[0] only, though that increases
    // the combo logic to toggle generation.
    if (shf_fedge) begin
      toggle <= ~toggle;
    end
 
  end
 
  always @ (posedge clk[2]) begin
 
    // synchronizer and both edge detector to sample the stable
    // parallel sr data.
    sync_pipe <= {sync_pipe[0], toggle};
    sample_strb <= ^sync_pipe;
 
    // parallel sample transfer buffer from clk[1] to clk[2] domain
    // data is transferred ~3-4 clock cycles after it's stopped shifting.
    // note: ^sync_pipe could be used instead to reduce latency by 1
    // clock cycle or to reduce resource usage by 1 FF.
    if (sample_strb) begin
      tx_buf <= sr;
    end
 
    // synchronized counter to fsync[2] so we can determine if the
    // output shift register can be loaded.
    if (fsync[2]) begin
      ocnt <= 0;
    end else begin
      ocnt <= ocnt +1;
    end
 
    // load the output shift reg at the rising edge of fsync[2] and shift out
    // the 16 bits of data to the receiver. ocnt == 30 lines up with that
    // edge when fsync[2] repeats every 32 clocks, as in the testbench.
    if (ocnt == 30) begin
      out_sr <= tx_buf;
    end else begin
      out_sr <= {out_sr[14:0], 1'b0};
    end
 
  end
 
  assign sdo = out_sr[15];
 
endmodule

The testbench I used to generate the waveforms.

Code (Verilog):
`timescale 1ps/1ps
 
module a_tb;
 
  reg   [2:1]   clk;
  reg   [2:1]   fsync;
  reg   [15:0]  sdi;
  wire          sdo;
 
  initial begin
    clk[1] = 0;
    forever begin
      clk[1] = #50000000 ~clk[1]; // 100 us period
    end
  end
 
  initial begin
    clk[2] = 0;
    forever begin
      clk[2] = #55000000 ~clk[2]; // 110 us period
    end
  end
 
  reg   [4:0]   cnt1 = 0;
  reg   [4:0]   cnt2 = 0;
 
  always @ (posedge clk[1]) begin
 
    // protocol 1 timer
    cnt1 <= cnt1 +1;
 
    // generate an fsync
    fsync[1] <= (cnt1 == 0) ? 1'b1 : 1'b0;
 
    // generate data
    if (cnt1 == 0) begin
      sdi <= $random;
    end else begin
      sdi <= {sdi[14:0], 1'b0};
    end
 
  end
 
  always @ (posedge clk[2]) begin
 
    // protocol 2 timer
    cnt2 <= cnt2 +1;
 
    // fsync 2 generator
    fsync[2] <= (cnt2 == 0) ? 1'b1 : 1'b0;
 
  end
 
 
  a  uut (
    .sdo    (sdo),
    .sdi    (sdi[15]),
    .fsync  (fsync),
    .clk    (clk)
  );
 
endmodule

Hopefully this will help you out.

dora · May 5, 2015

Hi ads-ee,

Thanks for the code.
Meanwhile I have implemented my version, which seems to work fine.
Typically the PCM interface drives outgoing data on one clock edge and samples incoming data on the opposite edge.
So I had to use a dual clock edge design anyway.
For this test I used a Spartan 6 based dev board which we have.

The final implementation, however, will be on an XC95144XL or some other similarly priced CPLD/FPGA.
I tried to fit my design in the XC95144XL and managed, but only with fitting set to "Optimize Density"; otherwise it doesn't fit.
So right now I am thinking about how to optimize my design.

In your code I see
sample_strb <= ^sync_pipe;
Is this an XOR between the two bits of sync_pipe, with the result going to sample_strb?
I didn't know about this unary way of using the bitwise operators.

ads-ee said:
Update:
A Xilinx XC9572 fits the uni-directional design I created, so an XC95108 might fit the bi-directional design if the counters are modified so that they can be shared between both directions. This would involve modifying the shift_en generation to use a count range instead of a terminal value.
Otherwise the design would fit in the XC95144 with a duplicate instantiated.

In my implementation I don't share any resources.
I have two modules (with an interface similar to your module "a"), one for each direction.
Obviously if I implement a module which handles both directions at once I will be able to share the counters.
However this will make my design more complex.
I really like the fact that I split the problem into two simpler uni-directional modules.

Any comments about this?

Thanks
Dora

ads-ee · May 5, 2015

Dora said:
Thanks for the code.
Meanwhile I have implemented my version, which seems to work fine.
Typically the PCM interface drives outgoing data on one clock edge and samples incoming data on the opposite edge.
So I had to use a dual clock edge design anyway.

Then all I can say is be thorough in your simulation testing. Don't skimp on checking anything (corner cases, clock phasing for both fast and slow transfers, phase skew for identical clock frequencies, etc.), and then run a full SDF back-annotated gate-level simulation using the same test cases from the functional simulation.

My design takes a lot less simulation effort, and you don't have to run back annotated gate level simulations.

Dora said:
The final implementation, however, will be on an XC95144XL or some other similarly priced CPLD/FPGA.
I tried to fit my design in the XC95144XL and managed, but only with fitting set to "Optimize Density"; otherwise it doesn't fit.
So right now I am thinking about how to optimize my design.

Then that means your design is more complex/bigger. A single uni-directional version of my design fits with room to spare, at ~45% utilization.
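As a rough cross-check using the counting approach mentioned earlier in the thread, the register widths declared in module a can simply be added up. The sketch below is only an estimate under stated assumptions: one flip-flop per macrocell, no macrocells spent purely on logic, and 72 and 144 macrocells for the XC9572 and XC95144. The module name ff_count_sketch is hypothetical, and the fitter report remains the real answer.

```verilog
module ff_count_sketch;

  // register widths as declared in module a
  localparam ICNT = 4, SR = 16, SHF_PIPE = 2, SHF_FEDGE = 1, TOGGLE = 1;
  localparam SYNC_PIPE = 2, SAMPLE_STRB = 1, TX_BUF = 16, OCNT = 5, OUT_SR = 16;

  localparam FF_TOTAL = ICNT + SR + SHF_PIPE + SHF_FEDGE + TOGGLE
                      + SYNC_PIPE + SAMPLE_STRB + TX_BUF + OCNT + OUT_SR;

  initial begin
    // assumption: one flip-flop per macrocell, integer percentages
    $display("flip-flops        : %0d", FF_TOTAL);
    $display("of 72 macrocells  : %0d%%", (FF_TOTAL * 100) / 72);
    $display("of 144 macrocells : %0d%%", (FF_TOTAL * 100) / 144);
  end

endmodule
```

The sum is 64 flip-flops, which is about 44% of 144 macrocells and so is at least consistent with the utilization quoted above.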

Dora said:
In your code I see
sample_strb <= ^sync_pipe;
Is this an XOR between the two bits of sync_pipe, with the result going to sample_strb?
I didn't know about this unary way of using the bitwise operators.

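Yes. It's a unary reduction operator, very handy. ^sync_pipe XORs all the bits of sync_pipe together, so for the 2-bit sync_pipe it is the XOR of its two bits. Similarly, &data is 1 when data is all ones, and ~|data is 1 when data is all zeros.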
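A small sketch of the three reduction operators mentioned here. The module name reduction_demo and its ports are hypothetical and exist only to show the operators side by side; they are not part of the posted design.

```verilog
// Illustrative only: module and port names are made up for this example.
module reduction_demo (
  input   [1:0]   sync_pipe,
  input   [15:0]  data,
  output          strobe,     // 1 when the two bits of sync_pipe differ
  output          all_ones,   // 1 when every bit of data is 1
  output          all_zeros   // 1 when every bit of data is 0
);

  assign strobe    = ^sync_pipe;  // same as sync_pipe[1] ^ sync_pipe[0]
  assign all_ones  = &data;       // AND of all 16 bits
  assign all_zeros = ~|data;      // NOR of all 16 bits

endmodule
```

On a wider vector, ^ gives the parity of the whole vector, which is why it works as a both-edge detector only when applied to exactly two bits.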

Dora said:
In my implementation I don't share any resources.
I have two modules (with an interface similar to your module "a"), one for each direction.
Obviously if I implement a module which handles both directions at once I will be able to share the counters.
However this will make my design more complex.
I really like the fact that I split the problem into two simpler uni-directional modules.

Any comments about this?

The tools when told to "Optimize Density" probably shared the counters to make it fit.

I still think it's a mistake to use both edges of the clock. If you need opposite edge tx/rx then you use a single output register on the opposite edge of the clock to tx. This isolates the opposite edge timing constraints to a single location that is very easy to verify. (e.g. SPI is a perfect example of this type of interface)
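A minimal sketch of that single opposite-edge output register. The module name negedge_tx_sketch and its port names are hypothetical, and it assumes that everything else in the design is clocked on the rising edge and that the receiver samples sdo on the rising edge.

```verilog
// Illustrative only: names are made up for this example.
module negedge_tx_sketch (
  input       clk,
  input       shift_bit,  // from rising-edge logic, e.g. out_sr[15]
  output reg  sdo         // changes on the falling edge of clk
);

  // the only falling-edge register in the design
  always @ (negedge clk) begin
    sdo <= shift_bit;
  end

endmodule
```

In module a this would take the place of the assign to sdo, with clk[2] as the clock. The only rising-to-falling path left to check is then out_sr[15] to sdo, which has half a clock period available.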
