March 11, 2021
Domain-specific computing may be all the rage, but it is avoiding the real problem.
The bigger concern is the memories that throttle processor performance, consume more power, and take up the most chip area. Memories need to break free from the rigid structures preferred by existing software. When algorithms and memory are designed together, improvements in performance are significant and processing can be optimized.
Domain-specific processing was popularized by the 2018 Turing lecture, “A New Golden Age for Computer Architecture,” by John Hennessy and David Patterson. But processors have been constrained by memory for decades. Changing processing without a rethink of memory and memory hierarchies ignores Amdahl’s Law, which gives the overall speed-up a system can achieve when only certain pieces of that system are improved. It basically says you get diminishing returns if you only concentrate on one piece of the system rather than looking at the system as a whole.
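To put a number on those diminishing returns, here is a minimal sketch of Amdahl’s Law. The 50/50 split between compute time and memory time is an assumed, purely illustrative figure, not a measurement from any system discussed here.

```python
def amdahl_speedup(improved_fraction: float, local_speedup: float) -> float:
    """Overall speed-up when `improved_fraction` of the run time
    is made `local_speedup` times faster and the rest is untouched."""
    if not 0.0 <= improved_fraction <= 1.0:
        raise ValueError("improved_fraction must be between 0 and 1")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    untouched = 1.0 - improved_fraction
    return 1.0 / (untouched + improved_fraction / local_speedup)


# Hypothetical workload: half the time is compute, half is waiting on memory.
# Only the compute half is accelerated.
for factor in (2, 10, 1000):
    overall = amdahl_speedup(0.5, factor)
    print(f"{factor:>5}x faster compute -> {overall:.2f}x overall")
```

Under that assumption, even a 1,000x faster compute engine delivers just under 2x overall, because the untouched memory half of the run time sets the ceiling.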
So why not concentrate on the bottleneck? “Domain-specific memory is just a new term, but architects have been doing these kinds of optimizations for a long time,” says Prasad Saggurti, director of product marketing at Synopsys. “And if they haven’t, they’re missing a trick because most people have been doing it.”
Others agree. “Remember video memories — DRAM with built-in shift registers?” asks Michael Frank, fellow and system architect at Arteris IP. “Perhaps GDDR [1-5], special cache tag memories, or associative memories back in the days of TTL? A lot of these have not really survived because their functionality was too specific. They targeted a unique device. You need a large enough domain, and you are fighting against the low cost of today’s DRAM, which has the benefit of high volume and large-scale manufacturing.”
Sometimes it goes deeper than that. “You might hardwire something into a ROM,” says Synopsys’ Saggurti. “What we are seeing is more people fine-tuning memory today. For instance, with a Fourier transform, or a Z transform, people would write the code in such a way that you could store the coefficients in a certain order. When you’re doing a matrix multiplication, you can store the coefficients in a certain order so that reading them out would be faster. Instead of storing data in one memory, you may put it in three or four different memories so that you can read it through multiple data paths. These kinds of things have been happening more recently.”
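The matrix-multiplication point can be sketched in a few lines. This is an illustration only, assuming a flat, row-major memory and a pre-transposed second operand; the function names are hypothetical and do not come from any tool mentioned above.

```python
def column_addresses(n_cols: int, col: int, n_rows: int) -> list[int]:
    """Flat addresses touched when reading one column of a row-major matrix."""
    return [row * n_cols + col for row in range(n_rows)]


def transpose_flat(m: list, rows: int, cols: int) -> list:
    """Return the transpose of a rows x cols row-major matrix, also row-major."""
    return [m[r * cols + c] for c in range(cols) for r in range(rows)]


def matmul_pretransposed(a: list, bt: list, n: int, k: int, m: int) -> list:
    """C = A @ B, where `a` is n x k and `bt` holds B transposed (m x k).

    Because B is stored transposed, the inner loop walks both operands
    with unit stride instead of jumping `m` words between reads.
    """
    c = [0] * (n * m)
    for i in range(n):
        for j in range(m):
            acc = 0
            for p in range(k):
                acc += a[i * k + p] * bt[j * k + p]
            c[i * m + j] = acc
    return c


# Reading a column of a 3 x 4 row-major matrix strides by 4 words.
print(column_addresses(n_cols=4, col=1, n_rows=3))

a = [1, 2, 3, 4]          # 2 x 2
b = [5, 6, 7, 8]          # 2 x 2
print(matmul_pretransposed(a, transpose_flat(b, 2, 2), 2, 2, 2))
```

The arithmetic is unchanged. Only the order in which the coefficients sit in memory differs, which is the kind of software-visible layout choice Saggurti describes.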
Change is hard. “The challenge is that in the past, people had a nice, abstract model for thinking about computing systems,” says Steven Woo, fellow and distinguished inventor at Rambus. “They never really had to think about memory. It came along for free and the programming model made it such that when you did references to memory, it just happened. You never had to be explicit about what you were doing.”
Progress is being made in general memory performance. “Today’s memory controllers and advanced interface standards have dramatically improved what you can extract from advanced silicon technology,” says Arteris’ Frank. “This has enabled deep queues and advanced schedulers. Advanced memory technologies, such as high-bandwidth memory (HBM), and stacked die support bandwidths that we thought impossible to achieve just a decade ago. Yet it doesn’t come cheap. Sub-10 nm technologies also enable large caches, so maybe we can call this poor man’s domain-specific memory.”
But these are all examples of small incremental changes. “Architecting memory subsystems in which compute primarily follows data, rather than the other way around, requires a significant rethink of many precepts that architects are accustomed to,” says Matt Horsnell, senior principal research engineer for Arm’s Research and Development group. “There is an opportunity to enhance the programming abstraction, from today’s typical list of operations on data, to an expanded form that encapsulates concurrency and some notion of the relative distances between compute units and data items. Such abstractions could enable the necessary transformations to more optimally target domain specific memories when algorithms are evolving rapidly.”
Data centers in the driver’s seat
Data centers are the drivers for many technology trends today. “One of the fastest growing applications for compute is in data centers, where the software applications crave more memory capacity and bandwidth at lower latency,” says Ravi Thummarukudy, CEO for Mobiveil. “With the advent of the latest industry standard, Compute Express Link (CXL), system architects can tier the memory needed between main memory in DDRn DIMMs, and CXL-based DDRn or newer persistent memories. The latency and economic characteristics of these tiers of memories are different, and that gives architects options to mix and match the memories to suit their requirements.”
That is a continuation of the legacy memory architectures. “Many OEMs and system houses are designing their own SoCs to customize silicon to their specific workloads,” says Tim Kogel, principal applications engineer at Synopsys. “The biggest opportunity for performance and power gains is the specialization of the memory hierarchy together with the supporting interconnect architecture.”
Consider power. “In current architectures, 90% of the energy for AI workloads is consumed by data movement, transferring the weights and activations between external memory, on-chip caches, and finally to the computing element itself (see figure 1),” says Arun Iyengar, CEO of Untether AI. “Only by focusing on the needs for inference acceleration and maximizing power efficiency are we able to deliver unprecedented computational performance.”
Memory optimization is a system-level problem that touches all aspects of the design — hardware, software, and tools. “Strategies to optimize memory are diverse and depend on the application domain,” adds Kogel. “The best strategy is to avoid off-chip memory access altogether. For domain-specific architectures, this can typically be achieved by increasing available on-chip memory, either in the form of caches or application managed memory. Especially in the area of deep learning accelerators, the available on-chip memory is a decisive design parameter that also impacts how the neural network application is compiled onto the target hardware — for example, the tiling of the convolution operator.”
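As a rough illustration of how on-chip capacity shapes tiling, the sketch below splits a 1-D convolution into tiles sized to a buffer of `buffer_words` samples and counts how many samples are fetched from external memory. The function names, the buffer sizes, and the one-dimensional simplification are all assumptions made for this example. Real compilers tile multi-dimensional tensors and also account for weights and outputs.

```python
def conv1d(signal: list, kernel: list) -> list:
    """Direct 'valid' 1-D convolution in the correlation form used by CNNs."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def conv1d_tiled(signal: list, kernel: list, buffer_words: int):
    """Same result, but at most `buffer_words` input samples are held
    'on chip' at once. Returns (outputs, off_chip_reads)."""
    k = len(kernel)
    tile_out = buffer_words - (k - 1)      # outputs per tile after the halo
    if tile_out < 1:
        raise ValueError("buffer must hold at least one kernel window")
    n_out = len(signal) - k + 1
    outputs, off_chip_reads = [], 0
    for start in range(0, n_out, tile_out):
        stop = min(start + tile_out, n_out)
        local = signal[start:stop + k - 1]  # one fetch from external memory
        off_chip_reads += len(local)
        outputs.extend(conv1d(local, kernel))
    return outputs, off_chip_reads


signal = list(range(10))
kernel = [1, 0, -1]
for words in (4, 10):
    _, reads = conv1d_tiled(signal, kernel, words)
    print(f"buffer of {words} words -> {reads} off-chip reads")
```

Neighbouring tiles overlap by `len(kernel) - 1` samples, so a smaller buffer re-fetches more data. That overlap is one reason the available on-chip memory feeds directly into how the operator is compiled.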
Many designs are looking to go further than this. “Domain-specific memory concepts are being explored in the spatial compute domain,” says Arm’s Horsnell. “As an example, DSPs tend to provide a pool of distributed memories, often directly managed in software, which can be a better fit for the bandwidth requirements and access patterns of specialized applications than traditional shared-memory systems. In order to bridge the efficiency gap with fixed-function ASICs, these processors often offer some form of memory specialization by providing direct support for specific access patterns (such as N-buffering, FIFOs, line buffers, compression, etc.). A crucial aspect of the orchestration within these systems, and a challenge in designing them, is determining the right granularity for data accesses, which can minimize communication and synchronization overheads whilst maximizing concurrency at the same time. Other challenges persist, including programming, coherence, synchronization, and translation, which add software complexity. However, a possible route forward is to rely on domain-specific languages (DSLs), which by making the data flow of the apps more explicit, can enable compilers to identify specialized memory access patterns and map them onto the hardware more effectively.”
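One of the access patterns Horsnell lists, the line buffer, can be mimicked in software. The sketch below is only an analogy, assuming image rows arrive one at a time and that a vertical window of `window_rows` rows is needed. In hardware the buffer would be a small dedicated SRAM rather than a Python `deque`, and the function name is hypothetical.

```python
from collections import deque


def column_sums_over_window(rows, window_rows: int):
    """Yield column-wise sums over the last `window_rows` rows as rows stream in.

    Only `window_rows` rows are ever held, regardless of image height.
    """
    line_buffer = deque(maxlen=window_rows)   # oldest row is evicted automatically
    for row in rows:
        line_buffer.append(row)
        if len(line_buffer) == window_rows:
            yield [sum(column) for column in zip(*line_buffer)]


image = [[1, 2], [3, 4], [5, 6], [7, 8]]
for sums in column_sums_over_window(image, window_rows=2):
    print(sums)
```

Each input row is read from the source exactly once. The reuse across overlapping windows happens inside the buffer, which is what makes the pattern attractive for bandwidth-limited designs.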
It also pays to take a closer look at the memories themselves. “Hyper-customization is the trend that we see when it comes to memories,” says Anand Thiruvengadam, senior staff product marketing manager within Synopsys. “This means purpose-built memories for different end applications. Even within a particular end application like AI there are different needs for memories, such as for training or inferencing, inferencing in the servers, or inferencing in the far edge. Each of these applications has different requirements, and that means you have to customize the memories. This customization means you no longer can view memories as commodities or off-the-shelf products. You have to build it for a particular application. That is where the secret sauce kicks in.”
In many cases memory and interconnect are tightly coupled. “Anything goes when it comes to combining memory and interconnect technologies to satisfy the data access requirements of application workloads — for example, multiple levels of clustering combining processing with local memory to take advantage of the locality in data-flow applications, or huge multi-banked/multi-ported on-chip SRAMs for buffering feature maps of CNN accelerators, and deep cache hierarchies with sophisticated coherency protocols to mitigate the lukewarm working set of data center workloads.”
Small changes can yield big results. “Just look at the little miracle that Apple has performed with the M1,” says Frank. “They figured out how to architect a memory subsystem that serves multiple heterogeneous masters well, using an intelligent caching strategy and a huge, multi-level cache hierarchy.”
As is often the case, software is the inertial anchor. “What usually happens is there is an algorithm in place, and we see a way to optimize it, optimize the memory, so that the algorithm is much better implemented,” says Saggurti. “On the flip side, we have these different types of memory. Can you change your algorithm to make use of these new kinds of memories? In the past, using TCAMs was mostly a networking domain construct to look up IP addresses. More recently, training engines are starting to use TCAMs, and that is such a different approach. This needs software or firmware to change based on the types of memories available. But most of the time, software stays fixed and memory changes to make the resultant implementation better.”
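For readers unfamiliar with the networking use of TCAMs, the following is a software model of a ternary lookup, assuming IPv4 only. Each entry stores a value and a mask, and bits outside the mask are "don't care." The class and method names are hypothetical. A real TCAM compares every row in parallel in a single cycle, whereas this sketch scans them in priority order.

```python
import ipaddress


class TernaryTable:
    """Software model of a TCAM used for longest-prefix IPv4 matching."""

    def __init__(self):
        self._entries = []   # (value, mask, result), highest priority first

    def add_prefix(self, cidr: str, result: str) -> None:
        network = ipaddress.ip_network(cidr)
        value = int(network.network_address)
        mask = int(network.netmask)          # 1 bits are compared, 0 bits ignored
        self._entries.append((value, mask, result))
        # Longer prefixes get higher priority, as in a routing TCAM.
        self._entries.sort(key=lambda e: bin(e[1]).count("1"), reverse=True)

    def lookup(self, address: str):
        key = int(ipaddress.ip_address(address))
        for value, mask, result in self._entries:
            if (key & mask) == value:
                return result
        return None


table = TernaryTable()
table.add_prefix("10.0.0.0/8", "port A")
table.add_prefix("10.1.0.0/16", "port B")
print(table.lookup("10.1.2.3"))
print(table.lookup("10.2.0.1"))
```

The same value-and-mask matching is what makes the structure reusable outside networking, provided the software is rewritten to express its searches in that form.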
A lot of time and money is being invested in artificial intelligence these days. Custom chips are constrained by throughput, and that is putting the spotlight on the memory and interconnect.
“Historically, memory and interconnect architectures have been designed based on static spreadsheets or simple analytical models like the roofline performance model,” says Kogel. “For state-of-the-art applications, this becomes pretty complex. For example, predicting the memory requirements of every layer in a CNN requires the consideration of compiler optimizations like tiling and layer fusion. These static methods become unreasonably complex and inaccurate for the prediction and optimization of SoC-level workloads with diverse IP sub-systems and dynamic application scenarios. On the other hand, running the application on top of hardware emulation or a prototyping system is too late in the development process to make any drastic changes or major optimization of the memory design.”
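For reference, the roofline model Kogel mentions fits in one line: attainable throughput is the lesser of peak compute and memory bandwidth multiplied by arithmetic intensity. The sketch below uses made-up machine figures (1,000 GFLOP/s peak and 100 GB/s of bandwidth) that are assumptions for illustration only.

```python
def roofline_gflops(peak_gflops: float, bandwidth_gbs: float,
                    intensity_flops_per_byte: float) -> float:
    """Attainable GFLOP/s under the basic roofline model."""
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)


def limiting_resource(peak_gflops: float, bandwidth_gbs: float,
                      intensity_flops_per_byte: float) -> str:
    """Name the roof a kernel hits: memory below the ridge point, compute above."""
    ridge_point = peak_gflops / bandwidth_gbs
    return "memory" if intensity_flops_per_byte < ridge_point else "compute"


PEAK_GFLOPS = 1000.0      # hypothetical accelerator
BANDWIDTH_GBS = 100.0     # hypothetical external memory

for intensity in (0.25, 50.0):
    attainable = roofline_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, intensity)
    bound = limiting_resource(PEAK_GFLOPS, BANDWIDTH_GBS, intensity)
    print(f"{intensity} FLOP/byte -> {attainable} GFLOP/s ({bound}-bound)")
```

The model is static by design. It says nothing about tiling, layer fusion, or contention between IP blocks, which is the gap the quote goes on to describe.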
That puts the focus on the intended workloads. “The key to efficient memory subsystems is the knowledge of your workload,” says Frank. “Understanding how it behaves, maybe even shaping it in a way that makes it more compatible with the limitations of your memory hierarchy: this is where architecture is challenged. Domain-specific accelerators require tuned memory systems, and the art of building the transformation engine that ‘impedance’ matches the mass-produced, page-organized, bursty-access DRAM and the engine’s access pattern requires insight into the system behavior, modeling tools, and a lot of workloads to play with. Sometimes it takes changing the way the workload processes the data to be able to improve the overall system. A good example was the transition from ‘direct’ rendering to tile-based processing in GPUs.”
It all comes down to modeling and simulation. “We propose the use of virtual prototyping tools to model the application workload, together with accurate transaction-level models of the interconnect and memory architecture,” says Kogel. “This quantitative ‘architecture first’ approach allows early tradeoff analysis, resulting in a reliable implementation specification. At the expense of additional modeling and simulation effort, the benefit is reduced risk of missing performance and power targets, or reduced cost of overdesigning the hardware just to be on the safe side. In the era of diminishing returns from Moore’s Law, the opportunity is to come out with a more optimized and differentiated product.”
That allows the impact of algorithmic changes to be seen, as well. “There is a need to go back and redesign the algorithms,” says Thiruvengadam. “They can be redesigned for the traditional legacy memory architectures, or they can be redesigned for new architectures, new memory styles, new memory flavors. There is this constant push for performance scaling, cost scaling, and also being able to balance the tradeoffs for the different applications. This is essentially the reason why you’re seeing continued development of MRAMs and FeRAMs. They’re trying to find a sweet spot for at least a couple of variables, if not all the variables. The need for redesigning algorithms along with the memory architectures is certainly becoming important.”
Balance is necessary. “You need to think about the concept of computational intensity and the type of operations involved,” says Frank. “Certain algorithms have insatiable bandwidth requirements, while others move only relatively small amounts of data but perform thousands of operations on it. In-memory operation may work well for SIMD-type processing, where the instruction bandwidth is small relative to the data bandwidth and many elements are processed using the same recipe. But as soon as there are sequential dependencies in the data stream or irregular dataflow, the benefit of domain specific memory shrinks.”
While architectural changes may produce large results, optimizing the memories may also provide gains. “A large proportion of the power and area of today’s accelerators is used on memory,” says Horsnell. “So any latency/density/energy improvements achieved by new memory technologies could have a dramatic impact.”
Custom memories are becoming big business. “You start to see things like in-memory compute, near-memory compute, specific memories that might be write-all-zero memory: memories that are optimized for certain types of operations,” says Saggurti. “We are seeing a lot of customers ask us about MRAM, even more customization of SRAMs, TCAMs, and certain tweaks to the TCAMs.”
Difficulties remain, though. “I have had a lot of discussions regarding custom memory designs, where processing on the memory die would have been an ‘ideal’ architecture,” says Frank. “It would have provided high bandwidth, low latency, etc. Everything was right, except for the fact that the memory process was limiting what logic could be integrated: three or four metal layers and low-power but slow transistors. That meant inefficiency for the compute engine. Sacrificing clock speed and circuit complexity suddenly made the integration of the compute engine no longer such a good choice.”
But some of these changes will become necessary. “People want to bring flash on chip and make it an embedded flash,” says Saggurti. “Then the question becomes, ‘Is it even possible?’ At 28nm you might be able to do embedded flash, but people start to think about things like MRAM at 22nm.”
Still, there are other ways to look at the problem. “Process variability across a wafer and across the die, and even over time, limit memory design,” adds Saggurti. “When you design a memory, a simple SRAM, you tend to design for the case when the bit cell goes one way (slow) and the periphery goes the other way (fast). If you design for that, and if the majority of your silicon is typical, you’re leaving a lot of performance and power on the table. If you understand where you are in the process range and enable the chip designer to act upon that information, then you can adjust timing accordingly. Your design could be more optimal, and you don’t have to design for the worst case.”
While memory has always been a design tradeoff, it has never received the same level of attention as processing, even though it is the performance limiter in terms of bandwidth, power, and area. AI is causing people to rethink memory architectures out of necessity, but with that extra attention, design teams may also rethink some of the software and algorithms that were optimized for legacy memory systems. In a world where performance gains do not come for free every 18 months, more extreme measures are becoming the only way to stop products from becoming commodities.
(From Brian Bailey)