March 11, 2021
A well-known DRAM vulnerability called “rowhammer,” which allows an assailant to disrupt or take control of a system, continues to haunt the chip industry. Solutions have been tried, and new ones are being proposed, but the potential for a major attack persists.
Although the vulnerability was first discovered some five years ago, most of the efforts to eliminate the “rowhammer” threat have done little more than mitigate the problem.
“Rowhammer is a big issue,” said Barbara Aichinger, vice president at FuturePlus. “The vendors claim that it was ‘fixed,’ but it was not. If you look just at the many papers that have been published in 2020, you will see plenty of evidence of that.”
There are numerous ways to block rowhammer, although so far none has been broadly accepted as definitive and decisive. Mitigations can be found at the software level, the browser level, and in hardware in DRAMs and memory controllers. But these only attempt to thwart the attacks. They don’t solve the problem at the root cause. One company now claims to have a solution.
Rowhammer basics
Rowhammer occurs as an unintended consequence of the way DRAM is made. That process is a carefully crafted way of getting as many bits as possible down onto silicon for as little money as possible. Simply changing the process is no mean feat. The fact that we can’t upend the way we build enormous amounts of memory, along with the constant promise that mitigations will be sufficient, has prevented root-cause solutions.
The problem occurs at the die level along walls that have been etched as part of the manufacturing process. That etching process leaves imperfections, or traps, that can capture electrons and hold onto them. If those electrons stayed in the traps, this might not be such a big problem. But later in the memory access cycle, those electrons can be released. From there, they can drift around, potentially ending up in a neighboring cell.
“Every time you turn the row from off to on to off, you get a puff of electrons into the substrate,” said Andy Walker, vice president of product at Spin Memory. “Some of these electrons will migrate and be picked up by nearby nodes.”
Fig. 1: Traps along the sidewall (left) capture electrons that remain there temporarily (center). Later, they can be released and migrate to other cells (right). Source: IEDM/Micron
A DRAM bit cell is nothing more than a capacitor that stores charge, along with the means of getting charge in and out when writing, and determining how much charge is on there when reading. Capacitors can leak, and the reading process is itself destructive. So a capacitor must have its value refreshed right after its read or, if it’s not accessed for a long time, then at some pre-determined frequency.
The fundamental point here is that the state of the cell is determined by the charge on the capacitor, and that charge is vulnerable between refresh cycles. The drifting electrons can migrate into a cell, changing the charge in the cell. If done too many times, enough charge can accumulate to change the perceived state of the cell.
This is where the “hammer” part of rowhammer comes in. The idea is that, if a given row is read enough times before a refresh occurs, the repeated mini-bursts of these errant electrons can change a neighboring cell. In fact, at the recent IEDM conference, Naga Chandrasekaran, senior vice president, technology development at Micron, noted that, with shrinking dimensions, it may not be only neighboring rows that are vulnerable. As rows get closer together, even near-neighboring rows – two or even more rows away – could be affected as well.
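The arithmetic behind hammering can be sketched with a toy model. Every number below is an illustrative assumption rather than a figure from a vendor or from the research cited here: a 64 ms refresh window, a row cycle time of roughly 50 ns, and a hypothetical disturbance threshold of 50,000 activations.

```python
# Toy model of rowhammer disturbance. All constants are illustrative
# assumptions, not measurements of any real DRAM part.
REFRESH_WINDOW_NS = 64_000_000   # assumed 64 ms between refreshes of a row
ROW_CYCLE_NS = 50                # assumed time for one activate/precharge cycle
FLIP_THRESHOLD = 50_000          # hypothetical activations needed to flip a neighbor


def max_activations_per_window(window_ns: int = REFRESH_WINDOW_NS,
                               cycle_ns: int = ROW_CYCLE_NS) -> int:
    """Upper bound on how often one row can be activated between refreshes."""
    return window_ns // cycle_ns


def victim_flips(activations: int, threshold: int = FLIP_THRESHOLD) -> bool:
    """A victim cell flips once accumulated disturbance reaches the threshold.

    A refresh would reset the accumulated disturbance; that reset is not
    modeled here, so this only describes a single refresh window.
    """
    return activations >= threshold


budget = max_activations_per_window()
print(budget)                 # 1280000 activations fit in one window
print(victim_flips(budget))   # True: the budget exceeds the assumed threshold
```

Under these assumptions an attacker has room for more than a million activations per window, which is why the threshold at which cells flip matters so much as dimensions shrink.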
From phenomenon to attack
It takes some clever thinking to take this phenomenon and figure out how it could be used to attack a system. While there don’t appear to have been any serious black-hat attacks yet, there have been numerous academic papers illustrating rowhammer as a means of taking control of a system.
“Some notable demonstrations of the attack are elevating to higher system level rights (like to administrator), rooting an Android phone, or taking control of what should be a protected virtual machine,” said John Hallman, product manager for trust and security at OneSpin Solutions.
Looking from the top of a system down and the bottom of the chip up, there are two big challenges. One lies in knowing where critical system data is located in memory. The other requires knowledge of which rows are physically adjacent. The specific layout of the chip is important, and this is usually kept confidential by chipmakers. You can’t assume that the physical arrangement of a memory made by one vendor will be the same as that of another vendor.
All of this has made rowhammer difficult, but by no means impossible, to turn into a viable attack. While the specifics of the various attacks are laid out in the many research reports detailing the results, a couple of examples show how it’s not so much a challenge of getting complete control of some random part of memory, but rather getting control of strategic locations — and with that, taking control of the overall system.
One attractive target is the set of tables used for memory management. They lay out the intended boundaries for the various processes running, including the permissions required to access the different allocations. Rather than attacking the main memory, attacking these page tables means that, with one edit, a restricted process may change in a way that makes more (or all) of the memory – including secure blocks – accessible to the attacker. With that one change, the system has now been opened up to further exploitation.
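To see why a single flipped bit in a page table is so valuable, consider a minimal sketch. The entry layout here is a simplified, hypothetical one chosen for illustration (three flag bits and a frame number starting at bit 12), and the names `make_pte`, `frame_of`, and `flip_bit` are invented for this example.

```python
# Simplified, hypothetical page-table entry (PTE) layout for illustration:
# bit 0 = present, bit 1 = writable, bit 2 = user-accessible,
# bits 12 and up = physical frame number.
PRESENT, WRITABLE, USER = 1 << 0, 1 << 1, 1 << 2
FRAME_SHIFT = 12


def make_pte(frame: int, flags: int) -> int:
    """Pack a frame number and flag bits into one entry."""
    return (frame << FRAME_SHIFT) | flags


def frame_of(pte: int) -> int:
    """Extract the physical frame number from an entry."""
    return pte >> FRAME_SHIFT


def flip_bit(value: int, bit: int) -> int:
    """Model a single rowhammer-induced bit flip."""
    return value ^ (1 << bit)


# The attacker's page legitimately maps frame 0x1234, read-only, user mode.
pte = make_pte(0x1234, PRESENT | USER)

# One flipped bit in the frame-number field points the mapping elsewhere.
corrupted = flip_bit(pte, FRAME_SHIFT + 2)
print(hex(frame_of(pte)), hex(frame_of(corrupted)))   # 0x1234 0x1230

# One flipped bit in the flags turns a read-only page into a writable one.
print(bool(flip_bit(pte, 1) & WRITABLE))               # True
```

In both cases the process never asked for anything it was not entitled to. The hardware simply now reads a different entry than the operating system wrote.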
As to determining which row to hammer – and then hammering it – the prevalent use of cache makes this harder. If you write a program that simply accesses some memory location repeatedly, you won’t be leveraging the rowhammer phenomenon. That’s because the first memory access will cause the contents to be loaded into cache, and all of the subsequent ones will pull from cache rather than re-reading the memory.
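The effect of the cache can be shown with a toy model. `ToyMemory` below is a hypothetical stand-in, not a model of any real processor: it has a single cache level with no capacity limit, and its `flush` method stands in for whatever eviction technique an attacker has available. The DRAM row buffer is not modeled.

```python
class ToyMemory:
    """Counts how many reads actually reach DRAM behind a simple cache."""

    def __init__(self) -> None:
        self.cache = set()            # addresses currently cached
        self.dram_reads = 0           # reads that had to go to DRAM

    def read(self, addr: int) -> None:
        if addr not in self.cache:
            self.dram_reads += 1      # cache miss: DRAM is accessed
            self.cache.add(addr)

    def flush(self, addr: int) -> None:
        self.cache.discard(addr)      # stand-in for evicting the cache line


def naive_loop(mem: ToyMemory, addr: int, n: int) -> None:
    """Read the same address n times with no eviction."""
    for _ in range(n):
        mem.read(addr)


def flushing_loop(mem: ToyMemory, addr: int, n: int) -> None:
    """Read the same address n times, evicting it after every read."""
    for _ in range(n):
        mem.read(addr)
        mem.flush(addr)


naive, flushing = ToyMemory(), ToyMemory()
naive_loop(naive, 0x40, 1000)
flushing_loop(flushing, 0x40, 1000)
print(naive.dram_reads, flushing.dram_reads)   # 1 1000
```

Only the second loop generates the repeated DRAM accesses that hammering depends on.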
That makes getting around the cache an important part of any exploit. It can be made easier or harder, depending on the processor used, because different architectures have different cache eviction policies (and those with purely deterministic policies are more at risk). That said, determining adjacencies can involve making subtle timing calculations to determine whether data is or is not already in the cache or the row buffer within the DRAM.
Making an attack even tougher is the fact that some memory bits are more vulnerable to attack than others. There may be a deterministic cause, which would make a particular area a likely target in multiple chips, or there may be some random element to it. So not every memory cell will respond to rowhammer in the same way.
The impact of these projects is a recognition that this is a real threat, not a theoretical one, and it’s just a matter of time before someone creates havoc – especially with so much computing moving to the cloud, where countless servers and their memory can be accessed from anywhere in the world.
Mitigations and bypasses
To date, most visible efforts to counter rowhammer don’t solve the fundamental physics of the problem; they provide ways to work around the issue. And they’ve been implemented at multiple levels.
For instance, using a browser to access a remote server has made the browser industry a stakeholder. Because an attack can involve subtle timing measurements, browsers reduced the granularity of timing available. It’s no longer possible to get nanosecond-level resolution. Instead, it may be microseconds – still useful, but a thousand times coarser, and enough to restrict one way of attacking.
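A small sketch shows what coarser timing takes away. The latencies are assumed round numbers (5 ns for a cache hit, 100 ns for a DRAM access), and `coarsen` is a hypothetical helper that rounds a timestamp down to the timer's resolution. It looks at a single measurement only and does not model techniques that combine many measurements.

```python
def coarsen(timestamp_ns: int, resolution_ns: int) -> int:
    """Round a timestamp down to the timer's resolution."""
    return (timestamp_ns // resolution_ns) * resolution_ns


def measured(start_ns: int, end_ns: int, resolution_ns: int) -> int:
    """Duration as seen through a timer with the given resolution."""
    return coarsen(end_ns, resolution_ns) - coarsen(start_ns, resolution_ns)


START_NS = 1_000_000     # arbitrary start time
CACHE_HIT_NS = 5         # assumed latency of a cache hit
DRAM_ACCESS_NS = 100     # assumed latency of a DRAM access

for resolution in (1, 1_000):
    hit = measured(START_NS, START_NS + CACHE_HIT_NS, resolution)
    miss = measured(START_NS, START_NS + DRAM_ACCESS_NS, resolution)
    print(resolution, hit, miss)
# 1 5 100      nanosecond timer: hit and miss are clearly different
# 1000 0 0     microsecond timer: hit and miss look identical
```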
“Major browsers have mitigated this issue, or at least tried to,” said Alric Althoff, senior hardware security engineer at Tortuga Logic. “Many of the actual fixes are software-based and very targeted (e.g. Google Chrome mitigated GLitch by removing extensions from a webGL implementation in 2018). But the big takeaway is that hardware vulnerabilities that ‘can’t be exploited remotely’ are only waiting on an experiment that shows that they can, and that ‘can’t be exploited’ really means we just can’t think of a way to do the remote exploit right now.”
In a retrospective paper, six idealized solutions were proposed. “The first six solutions are: 1) manufacturing better DRAM chips that are not vulnerable, 2) using (strong) error correcting codes (ECC) to correct rowhammer-induced errors, 3) increasing the refresh rate for all of memory, 4) statically remapping/retiring rowhammer-prone cells via a one-time post-manufacturing analysis, 5) dynamically remapping/retiring rowhammer-prone cells during system operation, and 6) accurately identifying hammered rows during runtime and refreshing their neighbors.”
Most mitigations focus on number 6. Number 1 would be the desired root-cause fix. Number 2 – ECC – can be used, but has limitations that we’ll discuss shortly. Number 3 may be attractive, but it’s a constant chase with no end. And numbers 4 and 5 create significant system-level complexity.
Much of the mitigation focus has been at the lower memory level – divided between the DRAM chip itself and the controllers that stand between the DRAM and the system. “Within the refresh cycle, there is a window when such attacks exceed a given value,” said Vadhiraj Sankaranarayanan, senior technical marketing manager at Synopsys. “Then the solutions can be built anywhere – at the controller or the DRAMs. It requires expensive hardware, and it is power-hungry. But we want the memory to be safe because the data is the king here.”
One way to prevent attacks is to count the number of accesses on a given row between refreshes. If a threshold is exceeded, then you prevent further access. While that may sound simple in concept, it’s difficult to put into practice. There aren’t good models for memories refusing an access that otherwise appears to be legitimate. So decisions would be needed all the way back into the system for what to do if a read request is denied. Does that mean that the controller stops, waits, and tries again? Does the operating system get involved? Does an application ultimately fail?
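The counting idea itself is simple to sketch. `ActivationLimiter` below is a hypothetical guard with an arbitrary threshold, not a description of any shipping controller. What the code cannot answer is the harder question raised above: what the rest of the system should do when `activate` returns `False`.

```python
from collections import Counter


class ActivationLimiter:
    """Hypothetical guard that denies activations past a per-row threshold."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.counts = Counter()       # activations per row since last refresh

    def activate(self, row: int) -> bool:
        """Return True if the access is allowed, False if it is denied."""
        if self.counts[row] >= self.threshold:
            return False
        self.counts[row] += 1
        return True

    def refresh(self) -> None:
        """A refresh restores every cell, so the counters start over."""
        self.counts.clear()


limiter = ActivationLimiter(threshold=3)
print([limiter.activate(7) for _ in range(5)])   # [True, True, True, False, False]
limiter.refresh()
print(limiter.activate(7))                        # True
```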
Two new capabilities added to the JEDEC memory standards have provided another response. One new feature is called target-row refresh, or TRR. The idea there is that, while DRAMs are set to refresh after reads and according to a schedule, a finer-granularity mechanism is needed for executing single-row refreshes on-demand. If someone or something – in the memory or in the controller – detects that an attack might be underway, it can issue a refresh to the affected row and reverse any hammering that might have occurred up to that point.
“The controller keeps monitoring, and, if it suspects a particular row or rows are getting attacked, the controller immediately finds out what the possible victims are,” said Sankaranarayanan. “Then it puts the DRAMs in TRR mode, and it can send proactive refreshes to those victim rows to prevent them from losing their original state.”
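In outline, the monitoring Sankaranarayanan describes might look like the sketch below. This is an assumption-laden illustration, not the JEDEC mechanism or any vendor's implementation: `VictimRefresher` is a hypothetical name, the threshold is arbitrary, and the code assumes that logical row numbers match physical adjacency, which real DRAM layouts do not guarantee.

```python
from collections import Counter


class VictimRefresher:
    """Hypothetical TRR-style monitor that refreshes neighbors of hot rows."""

    def __init__(self, threshold: int, num_rows: int) -> None:
        self.threshold = threshold
        self.num_rows = num_rows
        self.counts = Counter()       # activations per suspected aggressor
        self.refreshed = []           # log of victim rows refreshed early

    def neighbors(self, row: int) -> list:
        """Rows assumed to be physically adjacent to the given row."""
        return [r for r in (row - 1, row + 1) if 0 <= r < self.num_rows]

    def activate(self, row: int) -> None:
        self.counts[row] += 1
        if self.counts[row] >= self.threshold:
            # Proactively refresh the likely victims, then start counting again.
            self.refreshed.extend(self.neighbors(row))
            self.counts[row] = 0


monitor = VictimRefresher(threshold=4, num_rows=8)
for _ in range(4):
    monitor.activate(3)
print(monitor.refreshed)   # [2, 4]
```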
The monitoring can alternatively be implemented in the DRAMs themselves, at the cost of die size and power. “DRAMs can also have counters,” added Sankaranarayanan. “It’s power-hungry, but some have counters that can monitor persistent accesses.”
Zentel is offering a solution in what it refers to as “rowhammer-free” DRAM. “For the 2Gb and 4Gb DDR3 (25nm node) DRAM, Zentel applies a proprietary rowhammer protection scheme with an integrated hardware combination of multiple counters and SRAM to monitor the number of row activations, and to refresh the victim row as soon as a certain maximum count is reached,” said Hans Diesing, director of sales at Zentel. This provides a response that shouldn’t measurably affect performance or be visible outside the DRAM.
This solution comes, of course, with a cost. “The additional hardware structure adds to chip real estate and, due to less wafer yield, it is not so competitive on cost and price compared to the rest of this industry,” he added. “But this rowhammer-free version was designed on demand of customers from the HDD industry.”
TRR has not satisfied all players, however. “In general, the DRAM vendors and the controller vendors are secretive about TRR,” said Sankaranarayanan. In fact, rather than being simply one mitigation, TRR appears to be an umbrella for a number of mitigations, many of which can be bypassed. “Unfortunately, TRR describes a collection of methods, many of which don’t work,” said Althoff. “It is therefore not a mitigation per se, only a family of countermeasures.”
While TRR may be able to protect against one-sided (one neighboring attack row) or two-sided (both neighboring rows as attackers) attacks, it can’t help against “many-sided” attacks – multiple rows being worked at the same time. A tool has even been developed to help figure out how to modify attacks in the presence of TRR so that they will still be effective.
Error-correction codes (ECC) also are seen as a possible solution. The idea there is that a row may be corrupted, but that corruption will be corrected during the read-out process. That may be the case for rows where a single bit or so has been corrupted, but – given that one hammers an entire row, not just pieces of it – there may be more errors than the ECC can correct. “One of the primary protections for this attack has been error code correction (ECC), though even now attackers are beginning to identify ways around these protections,” noted Hallman.
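One plausible reason a many-sided pattern gets through is that a tracker has limited capacity. The sketch below assumes exactly that: `LimitedTracker` is a hypothetical design that can follow only a fixed number of rows and ignores the rest. Actual TRR implementations are undisclosed and may fail in different ways.

```python
class LimitedTracker:
    """Hypothetical tracker that can follow at most `capacity` rows at once."""

    def __init__(self, capacity: int, threshold: int) -> None:
        self.capacity = capacity
        self.threshold = threshold
        self.counts = {}              # tracked row -> activation count
        self.protected = set()        # aggressors whose victims were refreshed

    def activate(self, row: int) -> None:
        if row not in self.counts:
            if len(self.counts) >= self.capacity:
                return                # table full: this row goes unobserved
            self.counts[row] = 0
        self.counts[row] += 1
        if self.counts[row] >= self.threshold:
            self.protected.add(row)
            self.counts[row] = 0


def hammer(tracker: LimitedTracker, rows: list, rounds: int) -> None:
    """Activate each row in turn, `rounds` times over."""
    for _ in range(rounds):
        for row in rows:
            tracker.activate(row)


two_sided = LimitedTracker(capacity=2, threshold=100)
hammer(two_sided, [10, 12], 1000)
print(sorted(two_sided.protected))    # [10, 12]: both aggressors are caught

many_sided = LimitedTracker(capacity=2, threshold=100)
hammer(many_sided, [10, 12, 14, 16], 1000)
print(sorted(many_sided.protected))   # [10, 12]: rows 14 and 16 are never seen
```

In the second run the row between 14 and 16 is hammered from both sides without ever triggering a protective refresh.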
In addition, some ECC implementations correct only the data being read, not the original data in the row. Leaving the incorrect bit in place means that future refreshes will reinforce the error, since refresh restores what’s already there, rather than restoring it to some known golden reference state. Avoiding this would mean using ECC to determine the incorrect bits and correcting them in memory.
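Both limits of ECC can be illustrated with a small code. The sketch below uses a textbook Hamming(7,4) code as a stand-in. Memory ECC typically uses wider codes, so treat this as an illustration of the principle rather than of any product. The function names are invented for this example.

```python
def encode(data: list) -> list:
    """Encode 4 data bits into a 7-bit Hamming(7,4) codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4     # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4     # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4     # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(word: list) -> tuple:
    """Return (data bits, corrected codeword). Fixes at most one flipped bit."""
    w = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if w[pos - 1]:
            syndrome ^= pos           # XOR of the positions holding a 1
    if syndrome:
        w[syndrome - 1] ^= 1          # flip the bit the syndrome points to
    return [w[2], w[4], w[5], w[6]], w


original = [1, 0, 1, 1]
stored = encode(original)

# One flipped bit: the read returns correct data.
stored[4] ^= 1
data, fixed = decode(stored)
print(data == original)               # True

# Without write-back, `stored` still holds the flipped bit and a refresh
# would preserve it. Writing the corrected word back repairs the row.
stored = fixed

# Two flipped bits: the decoder "corrects" the wrong bit and returns bad data.
stored[2] ^= 1
stored[5] ^= 1
data, _ = decode(stored)
print(data == original)               # False
```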
There’s also a new controller command called refresh management (RFM). “RFM is in the JEDEC standard for DDR5, but that hasn’t been evaluated by the broader security audience yet,” said Althoff. “So while it seems conceptually good, it’s untested, and so is not a known mitigation, just a presumptive one.”
The pattern has been that this and other mitigations are published, and the academic world goes to work to prove that they can still get around the mitigations. And, for the most part, they’ve been correct.
An additional concern is circulating now, given that most mitigations have focused on CPU-based systems. GPUs may provide an alternative way to attack a system, so attention is needed there, as well.
“The industry has been working to mitigate this threat since 2012, with techniques such as Target Row Refresh (TRR) part of DDR3/4 and LPDDR4 standards and Refresh Management (RFM) in DDR5 and LPDDR5 specifications,” said Wendy Elsasser, distinguished engineer at Arm. “However, even with these and other mitigation techniques, as DRAM internal layouts are proprietary, rowhammer attacks are particularly difficult to mitigate against.”
Can the fundamental issue be solved?
The Holy Grail with this issue has been a way to stop the migrating electrons from disturbing cells. Doing that in a manner that doesn’t upend the whole DRAM process or make DRAMs unaffordable has been the big challenge. That’s why there has been so much focus on solving the problem indirectly, through mitigations, rather than solving it directly. But with mitigations under constant attack, a root-cause solution would be welcome.
“This is an argument for a hardware solution to a hardware problem,” said Althoff. “If the hardware is vulnerable, pushing the mitigation responsibility to software – or any higher abstraction level – is equivalent to [a popular meme showing a water leak being plugged with duct tape.]”
One company claims to have found such a fix – possibly by accident. Spin Memory (erstwhile STT, an MRAM producer) created a novel selector that would help to reduce the area required for a memory bit cell. Many bit cells consist of a single component (like a resistor, capacitor, or transistor), but they need a way to be shut off so they’re not accidentally disturbed when another related cell is accessed. For this reason, an additional “selector” transistor is added to every bit cell, making the bit cell larger.
Spin Memory found it could take a page from the 3D NAND book, making a transistor operate vertically with a surrounding gate and placing it under the memory cell rather than next to it. This stacked arrangement would therefore shrink the memory array.
“It can then be used for any resistive switch like ReRAM, CBRAM, CERAM and PCRAM – any two-terminal resistor that requires current or voltage to switch,” said Walker. “It’s a vertical gate all around transistor based on selective epitaxy. It’s a high-voltage device in 3D NAND that we adapt to our very low-voltage application. It requires high drive and low leakage, and what that translates to in materials science is that the channel of the device has to be monocrystalline.” Hence, epitaxy rather than deposition.
This gives the transistor two critical characteristics that make it a contender for a full rowhammer solution. One is that the silicon used is epitaxially grown above the wafer rather than being etched into the wafer. As the etching is the primary source of the traps that capture the electrons in the first place, eliminating those trap sites greatly reduces, or even eliminates, the source of the issue.
The second characteristic is the buried n-type layer that effectively blocks stray electrons, from whatever source, from interfering with the bit cell. If borne out, this would effectively shut down the rowhammer mechanism.
Fig. 2: On the left, electrons trapped on the aggressor cell can drift to the neighboring cell and change the charge on the capacitor. On the right, the newly proposed structure uses epitaxy, creating fewer trap sites, and an n-doped region blocks any errant electrons from accessing the bit cells. Source: Spin Memory.
Spin, in conjunction with NASA and Imec, is publishing a paper (behind a paywall at present) that will detail the solution. As with any such proposal, it must circulate amongst the security community, facing challenges and tests before it can be accepted as definitive.
Proving the effectiveness of a mitigation isn’t easy, requiring careful modeling of attacks – at least, known ones. “By utilizing our fault injection and detection tools, we can work with customers to model the attacks and demonstrate the effects on the memory,” said Hallman. “This could identify areas where information could still be leaked.”
Proving the effectiveness of a silicon-level fix from first principles is also a challenge. “DRAM is hard IP, and the attack exploits physics, so you’d need something with precision on the order of SPICE, or a targeted alternative, to verify with confidence pre-silicon,” said Althoff.
But proof of both mitigations and fixes is necessary in a wary industry. “Spin is not the first to try to produce rowhammer-immune DRAM,” noted FuturePlus’ Aichinger. “Several new mitigation strategies are under discussion, and you should hear more about this in 2021.”