were done at i7core_edac driver. This chapter will cover those differences
1) On Nehalem, there are one Memory Controller per Quick Patch Interconnect
- (QPI). At the driver, the term "socket" means one QPI. It should also be
- associated with the CPU physical socket.
+ (QPI). At the driver, the term "socket" means one QPI. This is
+ associated with a physical CPU socket.
Each MC have 3 physical read channels, 3 physical write channels and
3 logic channels. The driver currenty sees it as just 3 channels.
Each channel can have up to 3 DIMMs.
The minimum known unity is DIMMs. There are no information about csrows.
- As EDAC API maps the minimum unity is csrows, the driver exports one
+ As EDAC API maps the minimum unity is csrows, the driver sequencially
+ maps channel/dimm into different csrows.
+
+ For example, suposing the following layout:
+ Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs
+ dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
+ dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400
+ Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs
+ dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
+ Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs
+ dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
+ The driver will map it as:
+ csrow0: channel 0, dimm0
+ csrow1: channel 0, dimm1
+ csrow2: channel 1, dimm0
+ csrow3: channel 2, dimm0
+
+exports one
DIMM per csrow.
- Currently, it also exports the several memory controllers as just one. This
- limit will be removed on future versions of the driver.
+ Each QPI is exported as a different memory controller.
2) Nehalem MC has the hability to generate errors. The driver implements this
functionality via some error injection nodes:
For injecting a memory error, there are some sysfs nodes, under
- /sys/devices/system/edac/mc/mc0/:
+ /sys/devices/system/edac/mc/mc?/:
- inject_addrmatch:
+ inject_addrmatch/*:
Controls the error injection mask register. It is possible to specify
several characteristics of the address to match an error code:
dimm = the affected dimm. Numbers are relative to a channel;
For example, to generate an error at rank 1 of dimm 2, for any channel,
any bank, any page, any column:
- echo "dimm:2 rank:1" >/sys/devices/system/edac/mc/mc0/inject_addrmatch
+ echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
+ echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
To return to the default behaviour of matching any, you can do:
- echo "dimm:any rank:any" >/sys/devices/system/edac/mc/mc0/inject_addrmatch
+ echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
+ echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
inject_eccmask:
specifies what bits will have troubles,
2 for the highest
1 for the lowest
- inject_socket:
- specifies what QPI (or processor socket) will generate the error.
- on Xeon 35xx, it should be 0.
- on Xeon 55xx, it should be 0 or 1.
-
inject_type:
specifies the type of error, being a combination of the following bits:
bit 0 - repeat
For example, the following code will generate an error for any write access
at socket 0, on any DIMM/address on channel 2:
- echo "channel:2" > /sys/devices/system/edac/mc/mc0/inject_addrmatch
+ echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel
echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
- echo 0 >/sys/devices/system/edac/mc/mc0/inject_socket
echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null
+ For socket 1, it is needed to replace "mc0" by "mc1" at the above
+ commands.
+
The generated error message will look like:
EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))
3) Nehalem specific Corrected Error memory counters
- Nehalem have some registers to count memory errors, reporting it on a
- way that it is different from what EDAC API allows. Due to that, a
- separate sysfs note were created to handle such counters.
-
- They can be read by looking at the contents of "corrected_error_counts"
- counter:
-
- $ cat /sys/devices/system/edac/mc/mc0/corrected_error_counts
- dimm0: 15866
- dimm1: 0
- dimm2: 27285
+ Nehalem have some registers to count memory errors. The driver uses those
+ registers to report Corrected Errors on devices with Registered Dimms.
+
+ However, those counters don't work with Unregistered Dimms. As the chipset
+ offers some counters that also work with UDIMMS (but with a worse level of
+ granularity than the default ones), the driver exposes those registers for
+ UDIMM memories.
+
+ They can be read by looking at the contents of all_channel_counts/
+
+ $ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo $i; cat $i; done
+ /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0
+ 0
+ /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1
+ 0
+ /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2
+ 0
+
+ What happens here is that errors on different csrows, but at the same
+ dimm number will increment the same counter.
+ So, in this memory mapping:
+ csrow0: channel 0, dimm0
+ csrow1: channel 0, dimm1
+ csrow2: channel 1, dimm0
+ csrow3: channel 2, dimm0
+ The hardware will increment udimm0 for an error at the first dimm at either
+ csrow0, csrow2 or csrow3;
+ The hardware will increment udimm1 for an error at the second dimm at either
+ csrow0, csrow2 or csrow3;
+ The hardware will increment udimm2 for an error at the third dimm at either
+ csrow0, csrow2 or csrow3;
+
+4) Standard error counters
+
+ The standard error counters are generated when an mcelog error is received
+ by the driver. Since, with udimm, this is counted by software, it is
+ possible that some errors could be lost. With rdimm's, they displays the
+ contents of the registers