[Image: ASRock AM4 Pro4 Motherboard]
A while back I built ECC platforms based on consumer hardware and found that the information on the web is not really accurate.
Specifically, ASUS AM3 boards support ECC mode and I have been using it for some years now. I have never witnessed any error being reported during this time, but I am located near sea level, where cosmic ray exposure is lower than at altitude.
There is also Machine Check Exception (MCE) support on these platforms, but motherboard support is hit and miss. MCEs are useful for tracking errors in CPU caches and other parts, which helps prevent data corruption and makes you aware of damaged hardware (mostly the PSU or board VRMs).
Some TLDR for CPUs:
- AMD Phenom I/II and Athlon 64 X2 chips support error reporting through the "edac_mce_amd" module. This module works without ECC and reports cache or other errors related to the CPU.
- Athlon II also works.
- For AM4 Ryzen, APUs only support ECC if they are from the "Pro" line.
TLDR for motherboard support:
- ASUS AM3 M4A8xx motherboards officially support ECC but you should not rely on it.
- ASUS AM4 boards with Ryzen support ECC through RASDaemon, but only on an up-to-date BIOS.
- Older ASUS AM4 BIOS versions report through kernel methods, but only uncorrectable errors (UE) are logged in the '/sys' nodes.
- All ASRock AM4 boards seem to support ECC mode.
- Gigabyte AM4 B550 boards mention ECC mode support.
- Only ECC Unbuffered RAM is supported. (PC3/4-xxxxxE JEDEC specs)
On the kernel side:
- RAM ECC is supported through "amd64_edac" module.
- CPU error reporting is handled by "edac_mce_amd".
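As a quick sanity check, the two modules named above can be looked up in /proc/modules. The following is only a minimal sketch: the function names are made up for illustration, and it assumes the drivers are built as loadable modules (drivers compiled into the kernel do not appear in /proc/modules, and some kernel versions may list the module under a slightly different name).

```python
from pathlib import Path

# Module names taken from the list above.
EDAC_MODULES = ("amd64_edac", "edac_mce_amd")


def loaded_modules(proc_modules_text):
    """Return the set of module names listed in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])  # the first column is the module name
    return names


def edac_module_status(proc_modules_text, wanted=EDAC_MODULES):
    """Map each wanted module name to True if it shows up as loaded."""
    loaded = loaded_modules(proc_modules_text)
    return {name: name in loaded for name in wanted}


if __name__ == "__main__":
    path = Path("/proc/modules")
    text = path.read_text() if path.exists() else ""
    for name, present in edac_module_status(text).items():
        print(f"{name}: {'loaded' if present else 'not loaded'}")
```

A "not loaded" result does not prove ECC is inactive; it only means the module is not listed, so it is a starting point rather than a verdict.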
Tested hardware:
- ASUS M4A87TD/USB3
- ASUS M4A88TD-V EVO/USB3
- ASUS A320M-K
- ASUS EX320M Gaming
- ASUS ROG Strix B450-F
Testing ECC Support
At this point I've had the ASUS M4A board running with ECC RAM for 3+ years and have seen no errors reported. This is not really expected if studies on servers are comparable to consumer hardware.
The only way I see to quickly test that ECC is working is to set unstable memory timings with ECC disabled and confirm the errors with Memtest86. Then enable ECC and verify that errors are being reported and/or corrected.
Most hardware has its own way of reporting errors, and testing takes a very long time. I don't think most consumer boards support ECC error injection, which would be the best way to test.
ASUS AM3 Motherboards
[Image: ASUS M4A88T]
The kernel itself only shows messages with no detail, no matter what values are passed to the 'mce' boot parameter:
[Hardware error] Machine Check Exception
From testing, these are corrected errors, but I don't know how the board handles uncorrectable errors, as those are harder to reproduce. There is some level of functionality here, but it seems the kernel will not be aware of memory corruption caused by uncorrectable errors.
The first problem is that there is no additional information on what exactly the error is, so the OS cannot know whether it needs to kill a process to prevent data corruption. There should be additional lines after the [Hardware error] entry, but the motherboard is not handling the error further.
Also, the '/sys' nodes for the 'mc*' entries of the EDAC module are not populated with error counts, so you can't really track them over time without custom scripts that monitor the kernel log.
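A custom script of the kind mentioned above could look roughly like this. It is a sketch, not a tested monitoring tool: it only counts lines carrying the "[Hardware error]" tag in a saved kernel log, the match is case-insensitive because the exact capitalisation may differ between kernel versions (an assumption), and the file name in the usage comment is hypothetical.

```python
import re
import sys

# Matches the "[Hardware error]" tag quoted above, ignoring case (assumption:
# capitalisation may vary between kernel versions).
HW_ERROR = re.compile(r"\[hardware error\]", re.IGNORECASE)


def hardware_error_lines(log_text):
    """Return the kernel log lines that carry a hardware error tag."""
    return [line for line in log_text.splitlines() if HW_ERROR.search(line)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: dmesg > kernel.log ; python3 count_hw_errors.py kernel.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        matches = hardware_error_lines(handle.read())
    print(f"{len(matches)} hardware error line(s) found")
    for line in matches:
        print(line)
```

Run periodically against a fresh log dump, this gives a crude error count over time on boards that never fill in the EDAC counters.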
I don't consider ECC to be fully functional on these boards because of this, though some posts seemed to imply ECC was correctly supported.
These boards also don't report CPU-related errors of any kind. I first became aware of this functionality when a damaged ASRock board started locking up and the errors were reported to the OS. On compatible boards, these show up in the kernel messages in the following format:
[Hardware error] Machine Check Exception logged
[Hardware error] ERROR DETAILS
These do not get recorded in the MCE log but are handled specifically by the kernel ('edac_mce_amd' module). This is useful because, on an uncorrected error, the kernel can then discard buffers or kill the process holding the corrupted data.
Because ASUS does not enable this functionality, you may get some data corruption if the PSU or the motherboard VRMs are damaged. I would not rely on this hardware without regularly testing CPU stability with something like Prime95.
ASUS AM4
[Image: ASUS EX320M Gaming]
On ASUS AM4 boards, older BIOS versions behaved like the AM3 boards, except that memory error reporting worked correctly: you could read /sys and get error counts, or look at the kernel log.
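Reading the counts from /sys can be scripted. The sketch below assumes the usual EDAC layout, with one 'mc*' directory per memory controller under /sys/devices/system/edac/mc, each exposing 'ce_count' and 'ue_count' files; the function name is illustrative only.

```python
from pathlib import Path

# Usual location of the EDAC memory-controller nodes on Linux.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"ce": int, "ue": int}} for every mc* node found."""
    counts = {}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for key, filename in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc_dir / filename).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None  # node missing, unreadable or not a number
        counts[mc_dir.name] = entry
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("No EDAC memory controllers found (driver not loaded?)")
    for name, entry in result.items():
        print(f"{name}: corrected={entry['ce']} uncorrected={entry['ue']}")
```

An empty result usually just means the EDAC driver did not register a memory controller, which is itself a useful hint when checking whether ECC reporting is active.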
On current BIOS versions, AMD has updated the error reporting to the modern RAS functionality of the Linux kernel. You have to install RAS Daemon and use the 'ras-mc-ctl' command to read error counts.
Corrected errors (CE) are reported through RAS but are not counted in the EDAC 'mc*' nodes; only uncorrected errors (UE) are. The kernel handles UEs and won't reboot the system if the affected process is non-critical or if the error hits the page cache, which is simply discarded.
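Polling 'ras-mc-ctl' from a script might look like the sketch below. It assumes the rasdaemon service is installed and recording events, and it uses the '--summary' option to print what has been logged ('--errors' lists the individual records). The wrapper function is hypothetical, the output is returned as raw text rather than parsed, and the command may need root privileges depending on the setup.

```python
import shutil
import subprocess


def ras_summary(tool="ras-mc-ctl", option="--summary"):
    """Return the raw output of `ras-mc-ctl --summary`, or None if unavailable."""
    if shutil.which(tool) is None:
        return None  # rasdaemon tools are not installed or not on PATH
    try:
        result = subprocess.run(
            [tool, option],
            capture_output=True, text=True, timeout=30, check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # A non-zero exit status is treated as "no usable output".
    return result.stdout if result.returncode == 0 else None


if __name__ == "__main__":
    output = ras_summary()
    print(output if output is not None else "ras-mc-ctl not available")
```

Keeping the output unparsed avoids depending on a particular rasdaemon version's text layout.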
CPU related MCE/RAS may require some register tweaking, according to AMD's documentation for these CPUs. I have managed to reproduce CPU crashes by undervolting, with no errors reported in the kernel or RAS Daemon.
Other AM4 Brands
According to ASRock's specifications, their boards support ECC. This has been the case from A320 up to X570 across all types of motherboards, even the cheapest HDV models with very reduced size and price.
Forum users also report it working [Reddit], but it is unknown to what extent and whether other CPU errors are reported.
Gigabyte AM4 boards based on the B550 and X570 chipsets are explicitly listed as supporting ECC mode with unbuffered DIMMs. Boards with A520 chipsets mention ECC but not explicitly "ECC Mode", which may mean they will work normally with ECC DIMMs but not offer the same advanced functionality.
All series of these boards seem to work as long as you select one of those two chipsets.
I also managed to boot a Gigabyte A520I AC with ECC RAM, and it has a lot of configuration options related to ECC. Further testing is needed on this brand.