The normal viterbi decoder is basically an array of add-compare-select units and then a traceback unit.
For the entire design, a key question is if the design runs at full-rate, slightly less than full rate, half-rate, or lower. This affects how much logic sharing you can have, as well as allows solutions to some other problems.
The add-compare-select units should be small. This is because they will be in a potentially large array. Any inefficiency can be multiplied 128x or more. Further, the array routes to a single unit which means larger units also lead to more congestion issues. A main decision for this stage that is not present in the explaination of the algorithm is how to keep the accumulators stable. This comes up when you look at sizing the accumulator with the knowledge that an arbitrary number of non-zero inputs can go to each. There are a variety of methods that can be employed here.
The traceback is practical. Register exchange isn't really that practical. From looking at PNPH, it also looks suspicious. Perhaps they are fine for some design specs or on asics, but they do grow in size quickly.
That said, you may have a short register exchange stage feeding the traceback unit. This is based on the required read bandwidth of the traceback ram. Instead of storing best single-step to the ram, you can store best two-step path. This allows you to read two steps per cycle while writing only one on average. Thus you can read out a number of steps equal to the pre-read amount followed by the number of data points equal to that amount. This can be finished in the time it takes to write another pre-read amount. (I forget the term here. It is some scaling of constraint length normally.) However, if the design runs at half rate this extra stage is not needed.