
A Revolutionary Massively Parallel Processing Architecture

Status
Not open for further replies.

paulpawlenko

I would like to start a technical discourse about a massively parallel architecture called Practical Plentiful Parallel Processing, or 4P. These processors are designed by effectively inserting tiny processors between memory cells, thereby creating a very large parallel hardware canvas. The main idea is to keep processing localized unless and until the data needs to be streamed off chip for IO such as graphics, network or disk. All data is referenced by a data stream consisting of a single address and element count. This drastically simplifies the hardware by eliminating the need for per-word addressing and the three cache levels. The hardware simplicity allows at least an order of magnitude performance increase per unit area of silicon.
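To make the streaming reference concrete, here is a minimal C++ sketch of what such a stream descriptor might look like (field names and types are my own illustration, not the actual hardware format):

Code:
#include <cstddef>
#include <cstdint>

// A hypothetical stream reference: one start location plus an element count,
// instead of a separate address per word.
struct StreamRef {
    const std::uint32_t* start;  // single starting location of the stream
    std::size_t          count;  // number of elements in the stream
};

// A consumer simply walks the stream; no per-element addressing logic is needed.
std::uint64_t sum_stream(StreamRef ref) {
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < ref.count; ++i)
        total += ref.start[i];
    return total;
}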

This hardware canvas operates much like a physical IC layout, except that it is entirely programmable. Modules are similar to procedures from conventional languages, except that they are instantiated by placing them onto a physical area of the hardware canvas, where they operate on data as it streams through them, much like commands in a UNIX shell pipeline. Modules can have multiple inputs and outputs and can operate asynchronously by sleeping until a true value on a specified input line signals them to wake. Simple, serial modules occupy only a small amount of canvas, while parallel SIMD modules, such as graphics, can have identical modules in rows and columns running in parallel over large portions of the canvas.
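As a rough software analogy (my own illustration of the module idea, assuming a toy two-input adder; real canvas modules are hardware, not C++ objects), a module sleeps until its wake line is true and then processes one value per tick:

Code:
#include <cstddef>
#include <cstdint>
#include <vector>

struct Module {
    std::vector<std::uint32_t> inputs;   // current values on the input lines (assume >= 3)
    std::vector<std::uint32_t> outputs;  // values driven onto the output lines (assume >= 1)
    std::size_t wake_line = 0;           // index of the input that wakes the module

    bool awake() const { return inputs[wake_line] != 0; }

    void tick() {
        if (!awake()) return;                // stay asleep until signalled
        outputs[0] = inputs[1] + inputs[2];  // toy behavior: a two-input adder
    }
};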

The streaming IO makes communicating between physical chips logically identical to communicating on chip, except for the additional latency incurred when starting a new stream. With some exceptions, physical pins are functionally programmable and can be allocated as a resource by the operating system, as can area on the canvas. One exception is the main control pin on each physical chip, which retains the highest level of authority to control the operation of any area of the chip at all times. The security areas of the operating system are typically loaded through this pin over a secure physical or network connection.

Contrast this with conventional processor architectures, where address pins are mandated by the hardware specification. By allowing general use of hardware resources such as pins and processing canvas space, the architecture gives tremendous power to the program developer that is simply not possible through conventional, instruction/operand-based designs. Those designs are also greatly limited by the serial instruction streams they process. The 4P hardware canvas runs every module in parallel, with every memory cell being updated every clock tick.

I could keep writing for hours on why this architecture is so superior to conventional architectures, but I am just trying to get the conversation started. Please feel free to post any questions or comments. I look forward to reading them.

Paul
 

I was a professional software developer for over 10 years and have a BS & MS in Mechanical Engineering from the University of Illinois.

Since grad school I have been working on a streaming language design that would simplify code development while creating a level of "logical security" to replace the system of "security by obscurity" that I found to be the de facto standard back in the '90s.

While the UNIX security model was largely effective, I found the limiting factor in security was always the hardware, specifically the von Neumann architecture. Having limited hardware skill, I did indeed tinker with several simple hardware designs before stumbling upon the discovery of the PSP, a 10-transistor processor that can compute anything. By applying basic logic to very simple IC layouts in SPICE, I found that hardware could be greatly simplified while simultaneously increasing performance dramatically. After verifying several key features with EE professors, I developed a hardware layout that was logically consistent and functional.

Understanding the gravity of this discovery/development, I began to work on a hardware simulator to prove the design and show it running. Once I had reached a point where I knew the design was valid, I built a website to showcase my work.

Since I cannot, by rule, promote any products here, this writing is intended to describe this monumental advance in processor technology so that people interested in such things can understand how and why it is so important.

Since you asked, however, as a factual matter: yes, I am trying to sell the IP, but that is not my purpose for posting here.
 

It is not the CPU or the RAM that is the bottleneck; it is the network. The super-generic OS is the elephant in the room. If you have a specific task to execute, the network, and hence the performance, can be enhanced. You need to strip the kernel down to the bare minimum.

But anyway I would love to know more about the "gravity of this discovery/development"- it may be suitable for some simple engineering problems (say modelling the atmosphere and predicting the weather). They are not complex but computationally "huge". I guess you will understand what I mean...
 
OK, when you say the "network" and then reference the kernel, I assume you are talking about transmitting data between RAM and the processor, as opposed to the Internet type of "network", right? If I am wrong, please explain. Otherwise, here is my response:

Keep in mind that the processor canvas is resident inside its own RAM. So kernel calls occur on chip. This is discussed in detail here:

http://sourcecodecreations.com/how-4p-works

Scroll down to "4P operating system: user app load".


If I am missing your point, please specify and I will do my best to explain.

As you have surmised, largely SIMD applications are well suited to this architecture, but the gravity of this discovery/development is significantly larger. I honestly believe that 99% of all text-based programming languages such as C, C++, C#, Java, VB, Perl, etc. will become obsolete in favor of the programming style offered by 4P. The link above also offers a video of a program running on a 4P canvas. The video shows how simple examination of the state of memory can offer a wealth of information about the intent and structure of the program itself. Contrast this with conventional CPU architectures, where any hardware state will, at best, yield assembly code.

The totality of advantages over conventional processors leads me to categorize this technology as a discovery because, once you really understand the significance of the design, it is clearly an evolutionary step toward the way that nature "wants" us to compute, as evidenced throughout the physical world, including DNA, complex atomic interactions, and our own creations such as nations, cities, factories and assembly lines. The video shows the beginning of this principle, which will be fully realized when computational hardware takes the form of a physical canvas of massively parallel processors as described. It is difficult for me to explain the enormity of this technology because so few people can truly understand even the basic concepts. It is this frustration that has led me here, to people who can hopefully appreciate the scope of this discovery.

Please do not hesitate to correct any misinterpretation of your original points or to expand on any topic that is unclear.

Paul
 

No formal training; I taught myself, and that is really why the hardware had to end up so simple.

It started with a 10-transistor CPU (the original had 14), shown here:

http://sourcecodecreations.com/the-psp

It can power anything by constructing gates on the fly. After discovering it, I ran it by several people with ASIC experience and none had ever seen anything like it.

Also, after receiving doubts about my claims, I stripped the programmable portion away from the 4P hardware canvas, yielding an ASIC design. I built a functioning SPICE model that is publicly displayed here: http://sourcecodecreations.com/what-is-np4p

This model is not programmable and is for 6 cities. The above link describes, in detail, how to extend this design to 23 cities. Such hardware would reach 200 petaflops in a device small enough to fit inside a standard PC box. I have a vendor working on a cost estimate for such a device, with the intention of doing a head-to-head competition with the new SUMMIT supercomputer.
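For scale (back-of-envelope figures of my own, not taken from the site): a symmetric TSP over n cities has (n-1)!/2 distinct tours, so a 6-city instance has only 60 of them, while 23 cities means roughly 5.6 × 10^20.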

The necessary simplicity of the hardware design is the key driver to the technology, but the performance numbers speak for themselves.


Are you Sam?
 

I wish you luck, but your idea is insanely out of touch with reality.
 

This model is not programmable and is for 6 cities. The above link describes, in detail, how to extend this design to 23 cities.

Your articles mention cities and the traveling salesman. Is this by any chance the famous problem of the salesman who wants to find the shortest path that visits all the cities on his list, once and only once? A computer is ideal for this task, of course. Can you tell us more about the program your parallel processors use to examine all scenarios, as compared to a conventional lone CPU running one program?
 
Everyone is entitled to their opinion as you are certainly entitled to yours.

I notice that while my initial post goes into significant technical detail, as does the website, your replies have zero technical content at all.

You seem to be preoccupied with me personally rather than with objective, technical criteria about the architecture.

Do you have any technical background? If so, perhaps you could display your knowledge level through discourse. I would certainly be interested in hearing what part of reality my idea is out of touch with. Since I know the validity of the design, I seriously doubt that you will present any substantive, objective, fact-based technical critique that I have not already identified and addressed, but you are welcome to try.

If you do not understand the detailed technical workings of the design, then thank you for your vague, non-technical opinion. The world needs all kinds.

- - - Updated - - -

Yes, I refer to the Traveling Salesman Problem (TSP). The link states this more clearly, but I apologize for the lack of clarity in the article.

It is important to understand that there are two separate but related technologies that implement solutions to the TSP.

The first runs on the fully programmable 4P canvas. The video showing how the TSP is solved on 4P is here: http://sourcecodecreations.com/4p-in-action-1

The second runs on a non-programmable ASIC chip (NP4P) that is etched into silicon for that problem only. Not very useful in general, but it packs a lot of computational power into a very small space. This design is shown in detail here: http://sourcecodecreations.com/what-is-np4p



First, consider a lone conventional CPU running one program.
Typically, a program written to solve the problem is written in a high-level language like C++. The compiler reduces the C++ code down to machine instructions, each with an associated set of operands. These instructions/operands are transferred from RAM to the CPU through a series of caches labeled L3, L2, L1. The CPU fetches, decodes and executes each instruction "one after the other". In reality there is a lot of parallelism occurring within the CPU to gain throughput, so instructions do, technically, get executed in parallel. But that parallelism is localized to sets of instructions that occur "nearby" each other, as defined by the C++ code. The CPU is limited, by hardware design, in the level of parallelism it can achieve because of potential branching in the code. For the TSP this is not an issue because of the structure of the problem, but the CPU has to be designed for all cases, including code with lots of branches. Out-of-order execution, branch prediction and speculative execution are all hardware mechanisms used to effectively parallelize the serial instruction stream generated by the C++ compiler, thereby increasing the throughput that defines the performance of the CPU.
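For concreteness, the kind of program being described might look like this serial brute-force TSP in C++ (an illustrative sketch with made-up distances, not code from the 4P site):

Code:
#include <algorithm>
#include <climits>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    const int n = 6;                          // six cities, as in the NP4P example
    const int dist[6][6] = {                  // made-up symmetric distance matrix
        {0, 2, 9, 10, 7, 3}, {2, 0, 6, 4, 8, 5}, {9, 6, 0, 8, 6, 12},
        {10, 4, 8, 0, 5, 9}, {7, 8, 6, 5, 0, 4}, {3, 5, 12, 9, 4, 0}};

    std::vector<int> order(n - 1);
    std::iota(order.begin(), order.end(), 1); // permute cities 1..5, city 0 stays fixed
    int best = INT_MAX;
    do {                                      // the CPU walks the tours one at a time
        int len = dist[0][order[0]];
        for (int i = 0; i + 2 < n; ++i) len += dist[order[i]][order[i + 1]];
        len += dist[order[n - 2]][0];         // close the loop back to city 0
        best = std::min(best, len);
    } while (std::next_permutation(order.begin(), order.end()));

    std::printf("shortest tour length: %d\n", best);
}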


The 4P programmable canvas was designed as a response to the inefficiencies in the architecture described above. Using transistors to try to parallelize an inherently serial programming model is less efficient than designing code in parallel to begin with. The cache model alone introduces serious issues with data consistency that must be accounted for, again requiring additional circuitry. The addressing model that determines data availability in the caches is also inefficient, since streams of data can and should be referenced by a single address and size rather than, again, utilizing hardware to reference every 32/64-bit word of data.

So how does 4P work?
For the sake of clarity, I will present the simple foundation of the architecture. Understand that this simplistic view is then modified to increase efficiency, but that part is beyond the scope of this post.

First, think of RAM as rows and columns of memory cells containing bits. Surrounding each memory cell is a simple processor that can be programmed to accomplish a simple task. Cells communicate with their neighbors, forming a programmable hardware canvas. A module is a group of cells programmed to perform a specific function and can contain sub-modules, very similar to a procedure in C. A module has zero or more inputs and one or more outputs that can connect to other modules, like the argument list in C. Technically, each cell is itself a module, so the definition is fully recursive, up to an entire chip, a group of chips, or even groups of networked computers.

Unlike a procedure in C, a module is loaded onto the physical canvas, where it is locked into place so that every cell within the module knows what its job is. Once the module has been loaded, the data is streamed through it. The video shows the TSP module as the data streams through. The processors are not shown, only the states of the memory cells as they change to process the data. As the states of memory change across the canvas, you can see binary integers "moving" across it, being added together and accumulated at the result through a Less Than module that compares each newly generated value with the previously stored minimum distance.
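In software terms, the Less Than stage behaves roughly like the following sketch (my simplified illustration of the compare-and-keep-minimum step, not the actual cell logic):

Code:
#include <cstdint>
#include <limits>

// Accumulates the running minimum of the tour lengths that stream past it.
struct LessThanModule {
    std::uint32_t stored_min = std::numeric_limits<std::uint32_t>::max();

    void tick(std::uint32_t candidate) {
        if (candidate < stored_min)   // compare the newly generated value
            stored_min = candidate;   // keep the smaller of the two
    }
};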

The increase in performance of this design is only one advantage. Next-generation 4P development environments will allow dragging and dropping modules from a library, similar to how ASIC SPICE layouts are designed today. And the performance stays very close to that of the underlying hardware because the coding occurs very close to the metal. Note also that module hierarchies are simply ways of viewing the hardware. Contrast this with a conventional CPU, where viewing any hardware state will yield, at most, assembly code that contains virtually no high-level information about the structure and intent of the program.

I could go on for hours listing all the advantages and still only scratch the surface: the scalability gained by streaming chip to chip to enlarge the hardware canvas, the logical security model, variable-bit integer sizes, and on and on.

I hope I answered your question. If not, let me know and I will try to zero in on what you were specifically looking for.

Paul
 

From an overview perspective, the salesman example you implemented in LTspice sounds like a hardware accelerator specific to that problem, of the kind typically implemented in software, but this model does not seem to use the basic concept behind your approach (namely, interleaving 'processors between memory cells'). Perhaps if you explained things more concisely, focused on the concept, it would be easier to understand, instead of rambling on about the advantages without presenting any performance comparison with other ASIC approaches.

BTW, I did not understand the meaning of this statement on the page "what is 4p":

Why does every single integer on the system need its own address? It doesn't. Most of the time large arrays need a start location and a length.

This is not true at all, and it does not attempt to explain the announced innovation.
Another phrase that caught my attention is this:

Trying to get an inherently serial design to run in parallel instead of just designing in parallel to begin with.

This is exactly what hardware accelerators do, some of them in a single clock, others in a few more. There are specific state-of-the-art ideas for many mathematical or computational problems, many of them available in well-known scientific publications, but all of them have inherent drawbacks, for example the larger amount of hardware resources required.

So either I did not understand your invention at all, or you were not clear in your explanation.
 

Your first quoted point:
Maybe you have more current information than I do. To my knowledge, cache hits and misses are determined based upon the memory address of the data operand, hence each cache has to hold the address of each operand. Is this not true? If not, then how does the hardware determine cache hits/misses? In any case, there are certainly many physical pins allocated for addresses. If those do not tie data to addresses, then all those pins must be underutilized, correct?
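To make that point concrete, here is a textbook direct-mapped cache lookup in C++ (a generic illustration with arbitrary parameters, not any particular CPU and not part of the 4P material); the hit/miss decision comes entirely from splitting the operand's address into index and tag bits:

Code:
#include <array>
#include <cstdint>

constexpr unsigned kOffsetBits = 6;   // 64-byte lines (arbitrary example)
constexpr unsigned kIndexBits  = 10;  // 1024 sets (arbitrary example)
constexpr unsigned kNumSets    = 1u << kIndexBits;

struct CacheLine {
    bool          valid = false;
    std::uint64_t tag   = 0;          // high-order address bits of the cached line
};

std::array<CacheLine, kNumSets> cache;

// The index bits select a set; the remaining bits are compared as a tag.
bool lookup(std::uint64_t addr) {
    const std::uint64_t index = (addr >> kOffsetBits) & (kNumSets - 1);
    const std::uint64_t tag   = addr >> (kOffsetBits + kIndexBits);
    return cache[index].valid && cache[index].tag == tag;
}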


Your second quoted point:
Exactly my point. Assembly language/machine language is inherently serial. Programs are written instruction after instruction. Serial. C++ is written serially. CUDA is written serially, just like C. Sure, many identical copies of the program are run in SIMD fashion, but the program itself is executed one instruction after the other, in serial. As you agreed, hardware acceleration does as much in parallel as it can to increase throughput, but the underlying execution model is one instruction processed after the other. Serial. Much of the hardware acceleration tries to take this serial design and do computation in parallel. I argue that this is much less efficient than just writing code in parallel to begin with.


The 4P canvas runs in parallel everywhere. Look at the video of the hardware simulator: http://sourcecodecreations.com/4p-in-action-1
Each processor surrounding each memory cell is given an instruction. All the instructions are updated every clock. The program itself is designed in parallel and runs in parallel. There is no serial, line-by-line execution; programs are built on the canvas in parallel from the first bit. The larger the canvas, the more computation can be done, all in parallel, including streaming data between chips, thereby effectively creating an even larger canvas. There are no instructions and operands in the conventional sense. The program gets locked into the canvas and then processes the streaming data, just as shown in the video.
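If it helps, here is a toy software sketch of what "every cell updates every clock" means (my own illustration; the real cell logic is part of the IP and is not what is shown here): all next states are computed from the current states and committed at once, with no instruction pointer anywhere.

Code:
#include <cstddef>
#include <cstdint>
#include <vector>

// One cell: a memory state plus the tiny instruction it was given at load time.
struct Cell { std::uint8_t state = 0; std::uint8_t opcode = 0; };
using Canvas = std::vector<std::vector<Cell>>;

// One global clock tick: every cell computes its next state from its neighbours,
// then all cells commit together.
void tick(Canvas& c) {
    Canvas next = c;
    for (std::size_t r = 0; r < c.size(); ++r)
        for (std::size_t k = 0; k < c[r].size(); ++k) {
            const std::uint8_t left = (k > 0) ? c[r][k - 1].state : 0;
            // Toy rule: opcode 1 shifts data one cell to the right, anything else holds.
            next[r][k].state = (c[r][k].opcode == 1) ? left : c[r][k].state;
        }
    c = next;
}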
-------------

You are exactly correct in noting that the SPICE version is an ASIC hardware accelerator. It is non-programmable, hence NP4P. Also correct is that it does not exemplify the processors-in-memory concept. The SPICE model you are referring to is what remains when the "programmable part" is stripped away, leaving the NP4P ASIC behind. It was posted on the website to: 1) show the effectiveness of the streaming architecture in an open design, and
2) show an ASIC model that can beat the world's newest fastest supercomputer in a device that fits inside a PC box.

You suggest that this is simply "hardware acceleration", so I ask: have you ever seen an ASIC hardware accelerator for any real-world problem that claims it can compete with a supercomputer using so little hardware?

I am unaware of any other ASIC design that makes such a performance claim. If someone shows me such a design, or even a credible claim, then I will stand corrected. Yes, this version is non-programmable and is etched for a single problem only. Yet if such designs existed, I have to wonder why these devices are not already replacing supercomputers. Since supercomputers typically run a standard set of 30 or so "problem types", I would think building 30 problem-specific ASIC boxes would be much more cost-effective, in addition to saving electricity. That question aside, I think that if Google, Intel or any major company could, even with a single one-problem ASIC, beat a supercomputer with such a small device, not only would they have already done so, but it would be world news. For this reason it is likely that the NP4P design stands alone, at least for the present; the design is publicly available for any skeptics to analyze at their leisure. Furthermore, as I stated, if Google had an ASIC that could beat SUMMIT on a real problem using so little hardware, I think most people would agree that such a design would be widely publicized, and rightfully so.

So that leaves either 1) acknowledging the significance of my claim or 2) technically refuting the design.
The NP4P design is open, so people are free to look for any errors in it.


The design of true importance is the programmable, 4P hardware canvas.

Indeed, you hit the nail on the head, and quicker than most, I might add. I do not share the inner workings of the programmable hardware, as that is the IP that is protected by secrecy. Instead I show:
1) the PSP, another significant IC in its own right,
2) a working simulator showing a running program,
3) the non-programmable version, publicly, and
4) numerous technical layouts showing how the canvas is used to compete directly with everything programmable.

I can only show so much without losing the IP. The fact that the NP4P ASIC-accelerated TSP has such monumental performance, however, should give insight into the value and validity of the programmable version, which is the purpose of publicly displaying the IC layout.



Paul
 

Keep in mind that the processor canvas is resident inside its own RAM. So kernel calls occur on chip. This is discussed in detail here:

I shall certainly go through the reference given...
 
Dude, you are reinventing the wheel. Search IEEE Xplore for logic in memory and for hardware accelerators. There is nothing novel about what you are doing.

FYI, I have designed what is probably the biggest chip ever designed by an academic. 1B transistors in the damn thing.
 

Much of the hardware acceleration tries to take this serial design and do computation in parallel. I argue that this is much less efficient than just writing code in parallel to begin with.

There are few processing applications that could be accomplished totally in parallel; most of them are pipelined, so the overall concept of that approach would be restricted to quite a narrow scope, which I have not yet been able to identify. I'm still not able to link the proposed revolutionary solution ("interleaving processors between memory cells") to the scenarios you have presented; as said before, you could try to explain things more concisely, focused on the proposed idea, instead of just giving general clues.
 


This link describes the general workings of an operating system by simple example:
http://sourcecodecreations.com/how-4p-works (scroll down to "Architecture Details")
Scroll further to see how the canvas sorts in O(N) time, in parallel (see the sketch below); these concepts can be used to build an OS or any application.
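For reference, one standard way a row of simple compare-exchange cells can sort N values in O(N) parallel time is odd-even transposition sorting; the sketch below is my own C++ illustration, written serially for clarity, not the canvas implementation:

Code:
#include <algorithm>
#include <cstddef>
#include <vector>

// N phases; within a phase every compare-exchange acts on a disjoint pair,
// so a row of hardware cells could perform each whole phase in one clock tick.
void odd_even_sort(std::vector<int>& a) {
    const std::size_t n = a.size();
    for (std::size_t phase = 0; phase < n; ++phase)
        for (std::size_t i = phase % 2; i + 1 < n; i += 2)
            if (a[i] > a[i + 1]) std::swap(a[i], a[i + 1]);
}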

It's not easy to explain when I do not know your knowledge level or whether we are past certain facts.
Previously you took issue with my assertion that conventional processors use significant hardware resources to refer to every 32/64-bit word by its unique memory address. I responded by examining cache operation. Do we agree that operands in registers are referenced by unique addresses? Do we agree that conventional instruction/operand computation is inherently serial?
 

You have certainly spent a lot of time on this, which is remarkable, but it is still not clear how the applications cited in the text (streaming voice, video, etc.), which are essentially serial data, would be treated with a parallel approach.

Also, the text contains many images and much praise for the cited innovation, but little explanation of how such results would be achieved, in a comparative fashion. There is, yes, immense verbiage dealing with things in an almost superficial way.

I recommend that from now on you use, instead of layman's language, slightly more conventional terminology, so that things can be treated in a more understandable way and become amenable to more serious analysis. There are members here with reputable professional histories as well as academic degrees, so it will be no problem to go deeper.

I'm still curious to know what this is about, but I see it converging less and less toward a concrete understanding, and more toward a sequence of 'PowerPoint'-like webpages.

By the way, instead of redirecting your answers to external links, please explain here what is asked, attaching whatever pictures are necessary.
 

I am definitely willing to work with anyone interested in understanding 4P technology.
However, it is difficult to explain things concretely when you do not answer specific questions that I pose directly for the purpose of isolating the *specific* parts of the design that are eluding you.

So here are 2 specific questions.
1) do you agree that conventional processors, including GPUs, process programs serially as an ordered sequence of instructions/operands?
2) do you agree that conventional processors use addresses in the 3 cache levels and in registers to reference operands and dedicate circuitry to do so?

Also, do you understand that the 4P program is locked into the hardware canvas prior to running any data through it?
It is not important to understand the mechanism that is used to accomplish this to understand how the canvas is utilized to do useful things.
It is important to understand that the program becomes stationary on the canvas and that the data moves through the program, being processed as it goes.

Once I am confident that we are on the same page on these specific details, I can productively address your very valid point about how the canvas deals with inherently serial computation.

Also, this video shows a program running and is essential to understanding the operation of the canvas: http://sourcecodecreations.com/4p-in-action-1
 

Dude, you are reinventing the wheel. Search IEEE Xplore for logic in memory and for hardware accelerators. There is nothing novel about what you are doing.

FYI, I have designed what is probably the biggest chip ever designed by an academic. 1B transistors in the damn thing.

Show me the actual language that functionally runs this "logic in memory" or those "hardware accelerators".
Parallel SIMD designs are great. I would like to see an OS design that runs efficiently on a GPU.

Languages that run the von Neumann model are inherently serial.
I would very much like to see the non-von Neumann language that offers the same degree of parallel performance and generality/flexibility as 4P.
 

it is difficult to explain things concretely when you do not answer specific questions that I pose directly for the purpose of isolating the *specific* parts of the design that are eluding you

It is not we who must answer anything; YOU must support your thesis with palpable arguments. So far all you have done is bump this thread, either redirecting to your website or touching on generic subjects, rather than sustaining your proposal. Unless you properly reply to the specific inquiries posed above, all this will be treated as a bare advertisement and this thread will be deleted.
 
