Imagine a very, very large corn field. Think about a field the size of a large city, or perhaps a factor of ten bigger than that. Now consider how you would harvest the ripe corn in that field.
Plan A: Get a single harvesting machine and start methodically chewing through the corn, strip by strip.
Plan B: Get many, many harvesting machines and get them methodically chewing through strips of corn in parallel with each other.
Plan A, I think you will agree, demonstrates about as much higher order intelligence as a single ear of corn in that field. Plan B is obviously the way to go. Why? Because from a harvesting point of view, one strip of corn in one part of the field is utterly unrelated to any other. Thus the harvesting can happen in parallel. Blindingly obvious, right?
Parallelizing the work has other benefits apart from throughput. We can use cheap and cheerful machines to do the work rather than a very expensive uber-machine. Who cares if the individual cheap machines break down? We just replace the ones that fail and forge on. The chances of corn production stopping completely because of hardware failure are infinitesimally small. What are the chances of hundreds, perhaps thousands of independent machines all failing at the same time? Effectively zero.
In IT terminology, the harvesting system exhibits high availability, fault tolerance and a linear scaling relationship between throughput and processing power. Nice.
Now let's switch from corn to, say, customers in our reveries. Imagine a very large set of customers. We need to look at what they have bought from us in the last month and generate invoices, one for each customer.
Plan A: Get a single invoice processing "machine". Start at one corner of the customer list and work methodically through to the end.
Plan B: Get many, many invoice processing machines and get them methodically chewing through strips of customers in parallel.
From a common sense perspective, Plan B is just as compelling as it is in the corn harvesting example. However, by and large, enterprise computing does not work that way. We go for Plan A most of the time: bigger, faster, more expensive individual machines to process tasks that could be handled faster and cheaper by multiple machines working in parallel.
In fact, Plan B is even more compelling in computing than it is in harvesting corn. The unit cost of processing machines continues to fall through the floor. So much so that a large Web search engine provider - who makes extensive use of parallelism - does not even replace individual processing machines when they break down. Why? Because the cost of dispatching an engineer to replace it is higher than the cost of just tacking a new one on the end of the processing rack. This is a pretty radical shift in the economics of computing.
What an embarrassment of computing power riches surrounds us! Yet, by and large, in enterprise computing, we do the equivalent of heading off into a corner of the corn field with a single machine, working through our data strip by strip. If we were engineers in the agricultural sector we would be laughed at. Seriously.
Two questions arise, I guess. Firstly, why do we gravitate towards Plan A in enterprise computing and secondly, what should be done about it?
I think the first question is best answered by saying that we do it, the way we do it, because we have *always* done it that way. Ever since John von Neumann's brilliant insights into general purpose processing machines, we have been building mental models of processing around the idea of a single, all-powerful CPU (the 'C' stands for 'Central' after all). The so-called Von Neumann architecture is endemic in the way we think about systems.
It is a curse of sorts. The easiest way to see it in action is to ask a developer how long it will take to process a million invoices if each invoice takes 10 seconds of processing time. After some calculation, the answer will come back as ten million seconds: roughly 116 days, or nearly four months. Without even thinking about it, most developers will serialize the invoice processing and assume that everything will be routed through a single processor, which will chew through invoices at the rate of one every 10 seconds.
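The trap is easy to make concrete. A few lines of back-of-envelope arithmetic (Python here, purely illustrative) show what the serial assumption costs, and what happens when the same work is spread over strips, corn-field style, with coordination costs ignored for the moment:

```python
SECONDS_PER_INVOICE = 10
INVOICES = 1_000_000

# The serial, Von Neumann-style answer: one machine, one invoice at a time.
serial_seconds = INVOICES * SECONDS_PER_INVOICE
print(serial_seconds / 86_400)  # roughly 116 days on a single machine

# The same work cut into strips across N machines working in parallel:
for machines in (100, 10_000, 1_000_000):
    parallel_seconds = serial_seconds / machines
    print(machines, "machines:", parallel_seconds, "seconds")
```

With a million machines the wall-clock time collapses to the cost of a single invoice: 10 seconds.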
The answer to the second question doubles as a rallying cry. There is a ton of lore about parallel computing out there. There are real systems such as Beowulf Clusters in the Linux world that show the power of massive parallelism. There is the Grid Forum, which is emerging as a focal point for activity in parallel computing technologies.
Now here is the rallying cry. A lot of the interest in massive parallelism comes from scientists interested in fluid dynamics, N-Body problems, quantum chemistry and the like. All really important stuff but where are the commercial IT people? The benefits for enterprise computing of parallel computing are enormous, not only in terms of availability of computational power but also in terms of cutting costs, availing of computational power on demand and so on.
I think part of the problem is that there are many commercial IT problems which, in parallel computing terminology, are 'trivially parallelizable' and thus not a source of scientific interest. It is true that finding a way to distribute the calculation of Nth degree polynomials on a grid is a lot harder than distributing invoice generation on a grid. The latter is an example of a problem that is amenable to what is known as 'domain decomposition'. Simply put, the problem is like a large corn field: it is trivial to perform the work in parallel. The more machines the better.
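Domain decomposition for the invoice problem can be sketched in a few lines. This is only an illustration, not a production design: `generate_invoice` is a hypothetical stand-in for the real per-customer work, and a pool of local workers stands in for the rack of cheap machines.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_invoice(customer_id):
    # Hypothetical per-customer work: a real system would read this
    # customer's purchases for the month and render an invoice document.
    return f"invoice-{customer_id}"

customers = range(100)

# Each customer is independent of every other, so the list can be cut
# into strips and farmed out to workers: domain decomposition.
with ThreadPoolExecutor(max_workers=4) as pool:
    invoices = list(pool.map(generate_invoice, customers))

print(len(invoices))  # 100 invoices, produced by the pool in parallel
```

The key property is the same one the corn field has: no strip of the customer list depends on any other, so adding workers divides the wall-clock time rather than complicating the logic.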
It seems to me that enterprise IT people need to start wrapping their heads around this stuff. In the million invoice processing example, given a grid with a million nodes that you can tap into, you could process all your invoices in 10 seconds. Let's double - no, treble - that figure to take account of bandwidth and data transmission. 30 seconds plays nearly four months. Nice.
I think it's time to take a long hard look at Plan B.