This is a short anecdote about something that happened to me the other day.

Internal compiler error

I was building LLVM trunk with GCC 6.2.0 and the compilation failed with a weird GCC internal compiler error. I restarted the compilation and this time it succeeded. Then other builds of LLVM failed, but in different ways. Randomly. With silly errors that vanished after a few retries.

Anyone with experience will first assume that there is a bug in GCC, such as the compiler corrupting its own memory. That assumption would be reasonable, but then building LLVM would have to be exceptionally good at triggering this particular bug. A quick look at GCC's Bugzilla did not reveal anything specific to 6.2.0, so it had to be something else.

Then it occurred to me that it might be a hardware problem. Building LLVM with ninja uses all the cores of the CPU by default, and some LLVM files are huge. So this is a scenario in which I can easily use a large share of my system's memory (16 GiB). What if some physical memory location is faulty, but Linux only ends up using it under heavy memory load?

So I ran Memtest86, and voilà! Memtest detected a stuck bit at the physical address 0x2e4bd5d28: no matter what was written to that address, bit 22 of the 32-bit word stored there always read back as 1. Compilers are particularly sensitive to this kind of problem because, to save memory, they tend to pack lots of data into bitfields.
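To make the failure mode concrete, here is a minimal sketch that models a 32-bit word whose bit 22 is stuck at 1. The function names are made up for illustration, and the model assumes the fault behaves as a plain OR with the stuck bit, which is what the Memtest report suggests.

```python
STUCK_BIT = 22  # bit position reported by the memory test


def read_back(written: int) -> int:
    """Model a 32-bit word whose bit 22 always reads as 1."""
    return (written | (1 << STUCK_BIT)) & 0xFFFFFFFF


def corrupted_bits(written: int) -> int:
    """Mask of the bits that differ between what was written and what is read."""
    return written ^ read_back(written)


if __name__ == "__main__":
    for value in (0x00000000, 0xFFFFFFFF, 0x12345678):
        print(f"wrote {value:#010x}, read {read_back(value):#010x}")
```

A value that already has bit 22 set survives intact, which is one reason this kind of fault shows up as random failures rather than as a consistent one.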

Memtest did not detect any other problem. So out of 16 GiB, a single bit is bad! Bummer.

Mitigation

The obvious solution would be to replace the affected DIMM. But that seems wasteful for a single bit. It would probably make sense if the errors were spread all over the memory, but for 1 bit it is hard to justify.

So we need a way to tell the operating system: hey, don't use that address. Luckily, Linux provides one. It does not strictly acknowledge that the memory is faulty; it just marks it as "reserved", so the operating system will not use it.

To do this on a modern Debian system, you first need to identify the address (in my case 0x2e4bd5d28) and the number of bytes you want to reserve. In my case only 1 bit of a 32-bit access is wrong, so we want to ignore at least 4 bytes.

Now edit the file /etc/default/grub and change the variable GRUB_CMDLINE_LINUX_DEFAULT to include a memmap parameter of the form memmap=bytes$address. Because this file is processed as a script, the $ has to be escaped twice, so the syntax becomes memmap=bytes\\\$address. In my case the Linux command line looks like this:

GRUB_CMDLINE_LINUX_DEFAULT="quiet memmap=0x4\\\$0x2e4bd5d28"
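Getting the number of backslashes right by hand is error-prone, so here is a small sketch that builds the line from an address and a size. The helper names are hypothetical; the escaping simply reproduces the three-backslash form used above and has not been checked against other GRUB setups.

```python
def memmap_param(address: int, size: int) -> str:
    """The parameter as the kernel should finally see it."""
    return f"memmap={size:#x}${address:#x}"


def grub_default_line(address: int, size: int, other: str = "quiet") -> str:
    """The line to put in /etc/default/grub, with '$' written as '\\\\\\$'."""
    # "\\\\\\$" in Python source is three literal backslashes followed by '$'.
    escaped = memmap_param(address, size).replace("$", "\\\\\\$")
    return f'GRUB_CMDLINE_LINUX_DEFAULT="{other} {escaped}"'


if __name__ == "__main__":
    print(grub_default_line(0x2e4bd5d28, 4))
```

Run with the values from this post, it prints the same line shown above.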

Now run update-grub to regenerate the GRUB configuration, and reboot.

To check that it works, look at the output of dmesg. At the beginning the Linux kernel prints the memory map; mine includes lines like these:

...
[    0.000000] user: [mem 0x0000000100000000-0x00000002e4bd5d27] usable
[    0.000000] user: [mem 0x00000002e4bd5d28-0x00000002e4bd5d2c] reserved
[    0.000000] user: [mem 0x00000002e4bd5d2d-0x000000043f5fffff] usable
...

That reserved block covers precisely the address I don't want Linux to use as memory.
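If you prefer not to eyeball the hexadecimal ranges, a check along these lines can be scripted. This is only a sketch: the function names are invented, the regular expression assumes the "[mem start-end] type" layout shown above, and the sample text is the excerpt from my dmesg rather than a live capture.

```python
import re

# Matches entries such as "[mem 0x0000000100000000-0x00000002e4bd5d27] usable".
LINE_RE = re.compile(r"\[mem (0x[0-9a-f]+)-(0x[0-9a-f]+)\] (\w+)")


def parse_memory_map(text):
    """Return a list of (start, end, kind) tuples; end is inclusive."""
    return [
        (int(m.group(1), 16), int(m.group(2), 16), m.group(3))
        for m in LINE_RE.finditer(text)
    ]


def region_type(regions, address):
    """Return the kind of the region containing address, or None."""
    for start, end, kind in regions:
        if start <= address <= end:
            return kind
    return None


SAMPLE = """\
[    0.000000] user: [mem 0x0000000100000000-0x00000002e4bd5d27] usable
[    0.000000] user: [mem 0x00000002e4bd5d28-0x00000002e4bd5d2c] reserved
[    0.000000] user: [mem 0x00000002e4bd5d2d-0x000000043f5fffff] usable
"""

if __name__ == "__main__":
    print(region_type(parse_memory_map(SAMPLE), 0x2e4bd5d28))
```

On a real system the text would come from dmesg instead of SAMPLE, for example by piping its output into the script.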

After this change, I can build LLVM without weird internal compiler errors. Yay!