Show me the actual language that functionally runs this "logic in memory" or those "hardware accelerators".
You don't understand what a hardware accelerator is. There is no language. There is only a thin layer of programmable registers, typically. That is what we do when we really want performance that a softcore/processor cannot deliver. Furthermore, we can use logic-in-memory to make these accelerators even better, with distributed and smart memory access that is orders of magnitude better than multi-level cache.
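To make the "thin layer of programmable registers" concrete, here is a minimal sketch that models a hypothetical DMA-style copy accelerator. The register names, offsets, and bit meanings are invented for illustration; a real device documents its own register map, but the driving pattern (poke a few registers, set a START bit, poll a STATUS bit) is the same.

```python
# Sketch of a hypothetical accelerator exposed as a register file.
# Register names and bit assignments are invented for illustration.
class CopyAccelerator:
    def __init__(self):
        self.regs = {"SRC": 0, "DST": 0, "LEN": 0, "CTRL": 0, "STATUS": 0}
        self.mem = bytearray(256)  # stand-in for system memory

    def write_reg(self, name, value):
        self.regs[name] = value
        if name == "CTRL" and value & 0x1:  # START bit kicks off the job
            self._run()

    def read_reg(self, name):
        return self.regs[name]

    def _run(self):
        s, d, n = self.regs["SRC"], self.regs["DST"], self.regs["LEN"]
        self.mem[d:d+n] = self.mem[s:s+n]   # the fixed-function "logic"
        self.regs["STATUS"] = 0x1           # DONE bit

# Driving it is just register pokes -- no instruction stream, no language:
acc = CopyAccelerator()
acc.mem[0:4] = b"data"
acc.write_reg("SRC", 0)
acc.write_reg("DST", 16)
acc.write_reg("LEN", 4)
acc.write_reg("CTRL", 1)  # START
```

The point of the sketch: once the job is kicked off, the data path is pure hardware; software's only role is configuration.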
This is the NORM for any and all modern SoCs: a bunch of cores and some memory, with the heavy load pushed to specialised accelerators. Show me how your tremendously naive architecture can perform better than a god damn specialized accelerator, I dare you. You can't, because that is impossible. Show me how your extremely naive architecture can replace the cores we have on today's SoCs, I dare you. You can't, because no one is going to write software differently such that it fits your architecture.
Congratulations, you achieved a grand total of zero innovation.
Dead topic, please close.
Let me know when someone funds this... yawn.
"Someone is wrong on the internet, I cannot rest!"
A visionary? Dear lord, you surely think highly of yourself.
Parallel processors are implemented in today's computers. To the degree they speed things up, they are an assist, of course. However, we don't see ads for a computer 'containing not 2, not 4, but 23 CPUs!' Not yet, anyway. It turns out that computer design has changed along a path that suits the global market overall; a revolutionary design needs to adjust to real-world situations. It would be nice if we could split all the theoretical chessboard scenarios (eight moves deep) among 23 processors, each one playing out a different scenario. But they need to be connected electronically, and wiring them together sounds like a formidable job, don't you think? Whereas they did manage to build the Deep Blue computer years ago in real life, which beat a chess grandmaster.
And it did so by an algorithm that examined all possible scenarios several moves deep.
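The "one scenario per processor" idea described above is essentially root splitting in game-tree search: hand each worker one candidate first move and let it score that line independently. A toy sketch, where `score_line` is a made-up stand-in for a deep search, not a chess engine:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy "root splitting": each worker scores one candidate first move.
# score_line() is a made-up stand-in for a deep game-tree search.
def score_line(first_move):
    return (first_move * 37) % 17  # pretend this explored eight plies

root_moves = list(range(23))  # one scenario per "processor"

with ThreadPoolExecutor(max_workers=23) as pool:
    scores = list(pool.map(score_line, root_moves))

best = max(root_moves, key=lambda m: scores[m])
```

Note that the coordination step (the pool and the final `max`) is still serial, which is exactly the arbitration bottleneck raised below.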
I find myself torn between two choices here. 4P seems to be a mix of different workable technologies but not a complete answer to any of them. The concept of massive parallel processing is not new, Inmos tried it with the 'Transputer' processors which in theory were expandable but, fast as they were, all they really did was pass the bottleneck to the arbitration hardware. Something still had to decide which Transputer did which task.
I also see similarity to bit-slice processing (for those aged under 100, it was a single-bit programmable arithmetic module), which again was expandable but, mostly due to the technology of the day, was power hungry and limited in speed by its size.
If the 4P is an expandable array of 'normal' CPUs, it may have massive processing power and the ability to handle simultaneous instructions, but how do those instructions get passed to the cores? If it is done in the fastest possible way, as on the instruction bus of a Harvard-architecture processor, the instruction width equals the bus width, so expanding the array would require 'wider' instructions or some other mechanism to steer each instruction to its destination processor.
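One of those 'other mechanisms' can be sketched as a wider instruction word with a core-select field prepended. This encoding is entirely hypothetical, purely to show the cost: every extra core eats into either the fetch width or the decode stage.

```python
# Hypothetical 40-bit instruction word: top 8 bits select the core,
# low 32 bits carry the ordinary instruction. Field sizes are invented.
CORE_BITS, INSN_BITS = 8, 32

def encode(core_id, insn):
    assert core_id < (1 << CORE_BITS) and insn < (1 << INSN_BITS)
    return (core_id << INSN_BITS) | insn

def steer(word):
    # Split the wide word back into (destination core, instruction).
    return word >> INSN_BITS, word & ((1 << INSN_BITS) - 1)

word = encode(5, 0xDEADBEEF)
core, insn = steer(word)
```

The steering logic itself is trivial; the expensive part is the 40-bit-wide bus it implies.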
More processing power is obviously appealing but how do the smaller component cores cooperate with each other without some supervisory mechanism still slowing things down?
Brian.
Requesting even more clarification: I'm still mixed about this. We both see the pitfall of simply adding more land to the farm: it doesn't become more productive if you haven't got the ability to bring a bigger harvest in.
Are you proposing 'parallel programs running simultaneously' rather than 'parallel processors running one program'? In other words duplicating the building block of a bigger system into the architecture of a single silicon device. That seems to be VLSI with more than one core embedded into it. If that is the explanation, passing the program to each individual core would be a huge task and re-programming on the fly would hold up other processes as it took place. Besides, unless the sheer scale is different, that technology has been around for years and is used in almost all Intel/AMD/ARM and other devices.
Brian.
Sorry about the disabled messaging, I started getting as many as 300 messages a day from users and it simply became impossible to read them all while busy working on other things. I had no option but to turn it off. All the moderators here are volunteers so we have to weave our duties into available free time.
Also sorry about the censoring; we (moderators) have the difficult task of keeping the forum dedicated to its intended purpose, and we delete many messages every day. Most are pure spam, but we also have a significant number of 'repeat offenders' who deliberately post technical but poorly thought out questions with no apparent purpose but to waste our time. We also see some very good and topical questions that contain links to examples that sell certain 'health supplements', so we have to be very careful to check and filter what gets shown to readers. Sometimes we get it wrong.
Brian.
I understand and empathize.
This is a good forum for a good purpose.
I gladly forgo a minor inconvenience from time to time for the greater good of keeping the discussions technically relevant, which leads to better discourse, as I find is the case here.
Could you please provide the transistor count for a DFF with scan?
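Counts vary quite a bit by circuit style and cell library, but here is a back-of-the-envelope tally for one common style: a transmission-gate master-slave DFF with a 2:1 scan mux on the D input. The per-gate counts are the usual static CMOS ones; treat the totals as a rough estimate, not a datasheet figure.

```python
# Rough tally, transmission-gate master-slave DFF + 2:1 scan-input mux.
# Counts vary by cell library; this is one common style only.
INV, TGATE = 2, 2  # transistors per static CMOS inverter / transmission gate

latch = 2 * TGATE + 2 * INV      # one feedback latch: 8 transistors
clock_buf = 2 * INV              # local CLK and CLKB buffers: 4
scan_mux = 2 * TGATE + 1 * INV   # 2:1 mux plus SE inverter: 6

dff = 2 * latch + clock_buf      # plain DFF: 20 transistors
scan_dff = dff + scan_mux        # with scan: 26 transistors
```

So roughly mid-twenties for this style; library cells with set/reset or stronger output drive will be higher.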