A new simulation by Sandia National Laboratories for the U.S. Department of Energy’s National Nuclear Security Administration has found that adding too many cores to the multicore processors used in supercomputing leads to slower, not faster, computation.
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin company.
Using algorithms for deriving knowledge from large data sets, Sandia's simulations found a significant increase in speed going from two cores to four, but an insignificant increase from four cores to eight.
Exceeding eight cores caused a decrease in speed: sixteen cores performed barely as well as two, and beyond that, the simulations registered a steep decline as more cores were added, Sandia reported.
Crowded Bus
The problem is a lack of memory bandwidth, combined with contention among the processor cores over the memory bus available to each processor.
“The difficulty is contention among modules,” noted James Peery, director of Sandia’s Computations, Computers, Information and Mathematics Center.
“The cores are all asking for memory through the same pipe. It’s like having one, two, four, or eight people all talking to you at the same time, saying, ‘I want this information.’ Then they have to wait until the answer to their request comes back. This causes delays,” he explained.
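Peery's "same pipe" analogy is easy to see on an ordinary multicore machine. The sketch below is a hypothetical illustration, not Sandia's benchmark: each OpenMP thread streams through its own large array, so the only shared resource is the path to main memory. On most systems the aggregate bandwidth stops growing well before the core count runs out, which is exactly the saturation Peery describes.

```c
/* bandwidth_scaling.c -- illustrative sketch, not Sandia's benchmark.
 * Each thread streams through its own 64 MB buffer, so the threads share
 * nothing except the memory bus. If bandwidth scaled with cores, the
 * aggregate GB/s would keep doubling with the thread count; in practice
 * it flattens once the bus is saturated.
 *
 * Build: gcc -O2 -fopenmp bandwidth_scaling.c -o bandwidth_scaling
 */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (8L * 1024 * 1024)   /* 8M doubles (64 MB) per thread: far larger than cache */

int main(void) {
    for (int nthreads = 1; nthreads <= 16; nthreads *= 2) {
        double total = 0.0, start = 0.0, elapsed;

        omp_set_num_threads(nthreads);
        #pragma omp parallel reduction(+:total)
        {
            /* Private buffer per thread: no sharing except the memory bus. */
            double *buf = malloc(N * sizeof *buf);
            for (long i = 0; i < N; i++) buf[i] = (double)i;

            #pragma omp barrier            /* everyone finishes initializing */
            #pragma omp single
            start = omp_get_wtime();       /* all threads start the sweep together */

            double sum = 0.0;
            for (long i = 0; i < N; i++)   /* streaming read from DRAM */
                sum += buf[i];
            total += sum;
            free(buf);
        }
        elapsed = omp_get_wtime() - start;

        double gbytes = (double)nthreads * N * sizeof(double) / 1e9;
        printf("%2d threads: %.3f s, ~%.1f GB/s aggregate (checksum %.0f)\n",
               nthreads, elapsed, gbytes / elapsed, total);
    }
    return 0;
}
```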
‘Not Rocket Science’
“Processors pull data in from external memory, manipulate the data, and write it out. That’s what they do. As CPUs get faster, they need more memory bandwidth, and if they don’t get it, that lack of bandwidth constrains performance. This is not rocket science,” Nathan Brookwood, principal analyst for Insight 64, told TechNewsWorld.
“If your state department of transportation issued a press release warning ‘average freeway speed will decrease as traffic density increases,’ would anybody pay attention?” he said, noting that this is essentially what Sandia has done.
“It’s easier to add cores to CPUs than to add memory bandwidth to systems. On-chip caches mitigate this a bit; programs that spend a lot of time iterating on small data sets that fit in on-chip caches can avoid these problems, but programs that use big data sets that don’t fit into on-chip caches run smack into them, and scale poorly. Multicore approaches just make this worse, since they provide a straightforward approach for increasing demands on memory bandwidth, but little help on adding that bandwidth,” Brookwood explained.
“Again, it’s not a new problem. System architects and chip designers have been wrestling with it for years, and there are no easy solutions. On the other hand, it’s not a show stopper, just another problem those software guys will have to tackle if they want to increase performance using multicore chips,” he added.
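Brookwood's distinction between data sets that fit in on-chip cache and those that spill into main memory can also be made concrete. The following sketch is a toy illustration using OpenMP, assuming only a generic multicore machine with a few megabytes of cache; it performs the same number of array accesses per thread in every run and varies only the working-set size. Once the set no longer fits in cache, the threads start competing for the memory bus and the same arithmetic takes far longer.

```c
/* working_set.c -- toy illustration of the cache effect described above;
 * not the benchmark from the article. Every run does the same number of
 * array accesses per thread, but the working set grows from cache-sized
 * to DRAM-sized, so later runs are limited by memory bandwidth.
 *
 * Build: gcc -O2 -fopenmp working_set.c -o working_set
 */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define ACCESSES_PER_THREAD (512L * 1024 * 1024)  /* constant work per thread */

int main(void) {
    /* Working sets from 32 KB (cache-resident) up to 128 MB (DRAM only). */
    for (long kbytes = 32; kbytes <= 128 * 1024; kbytes *= 8) {
        long n = kbytes * 1024 / sizeof(double);
        double checksum = 0.0;

        double start = omp_get_wtime();
        #pragma omp parallel reduction(+:checksum)
        {
            double *buf = malloc(n * sizeof *buf);
            for (long i = 0; i < n; i++) buf[i] = 1.0;

            long passes = ACCESSES_PER_THREAD / n;  /* smaller set => more reuse */
            double sum = 0.0;
            for (long p = 0; p < passes; p++)
                for (long i = 0; i < n; i++)
                    sum += buf[i];
            checksum += sum;
            free(buf);
        }
        double elapsed = omp_get_wtime() - start;

        printf("working set %7ld KB per thread: %6.2f s (checksum %.0f)\n",
               kbytes, elapsed, checksum);
    }
    return 0;
}
```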
‘The Problem Is Often Ignored’
Sandia acknowledged Brookwood's point in its report on the simulations.
“To some extent, it is pointing out the obvious — many of our applications have been memory-bandwidth-limited even on a single core,” noted Arun Rodrigues, who was on the Sandia team that ran the simulations.
“However, it is not an issue to which industry has a known solution, and the problem is often ignored,” Rodrigues added.
Certainly, the memory wall is the biggest hardware threat to multicore. The idea that eight cores is somehow a hard limit, or that no progress is being made on memory bandwidth, is misleading, however. Memory bandwidth is increasing, just not at the rate of Moore's Law, and we can expect DDR3 and other technologies to help.
The biggest threat to multicore overall, however, is the difficulty of parallel programming. Fortunately, there are new multicore programming technologies that have started to address that need. See, for example, http://www.cilk.com.
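Cilk-style languages attack the programming side of the problem by letting the programmer expose parallelism with a couple of keywords and leaving the scheduling to a work-stealing runtime. The fragment below is the classic Fibonacci example from the Cilk literature, shown here only as a sketch; building it requires a Cilk-enabled compiler (Cilk++, Cilk Plus, or OpenCilk), which is not something the article itself discusses.

```c
/* fib.c -- the classic Cilk example, shown as a sketch; a Cilk-enabled
 * compiler is needed to build it. */
#include <cilk/cilk.h>
#include <stdio.h>
#include <stdlib.h>

long fib(long n) {
    if (n < 2)
        return n;
    /* The spawned call may run in parallel with the rest of the function;
     * the runtime's work-stealing scheduler decides whether it actually does. */
    long x = cilk_spawn fib(n - 1);
    long y = fib(n - 2);
    cilk_sync;            /* wait for the spawned child before using x */
    return x + y;
}

int main(int argc, char **argv) {
    long n = (argc > 1) ? atol(argv[1]) : 30;
    printf("fib(%ld) = %ld\n", n, fib(n));
    return 0;
}
```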