
I set my 3960x to NPS4 (Nodes Per Socket: 4) mode to experiment with NUMA on Linux. My system has four 32 GiB DIMMs across 4 channels, so I expected each of the 4 nodes to get one. Instead, nodes 1 & 2 get 64 GiB each, and nodes 0 & 3 get 0:

tavianator@tachyon $ numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 24 25 26 27 28 29
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 6 7 8 9 10 11 30 31 32 33 34 35
node 1 size: 64342 MB
node 1 free: 4580 MB
node 2 cpus: 12 13 14 15 16 17 36 37 38 39 40 41
node 2 size: 64438 MB
node 2 free: 4276 MB
node 3 cpus: 18 19 20 21 22 23 42 43 44 45 46 47
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node   0   1   2   3 
  0:  10  12  12  12 
  1:  12  10  12  12 
  2:  12  12  10  12 
  3:  12  12  12  10 

Is this expected? Are the node 0/3 cores further away from memory than the node 1/2 cores?

2 Answers


The Ryzen Threadripper 3960X is a desktop (HEDT) part, so there aren't balanced-memory-population guides for it of the same quality as there are for EPYC server CPUs. On EPYC, memory is organized into quadrants of memory-channel pairs. Not being able to find an equivalent guide for Threadripper 3000, my guess is that half the channels means half the interleave sets, so two.
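If you want to double-check how the DIMMs are physically populated across channels before reasoning about interleave sets, dmidecode (run as root) lists every slot; populated slots show a size and empty ones show "No Module Installed". Slot naming varies by board, so treat this as a sketch:

# dmidecode -t memory | grep -E 'Size|Locator'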

Even though it can be creative with its topology, this is still one socket, one hop away from all of its memory. More serious NUMA effects don't come into play until multiple sockets need to talk to each other.

To see actual NUMA, get a 2-socket server. However, it's possible your workloads don't need that; AMD makes some big single-socket boxes these days.

Two nodes per socket (NPS2) will probably give you a more reasonable topology, but treat it as a development exercise, to see what the layout looks like; I am skeptical it will result in noticeable performance improvements.

The default in production should still be NPS1, unless you have data to suggest otherwise.
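If you want that data, one low-effort experiment (a sketch only; ./my_benchmark is a stand-in for whatever workload you actually care about) is to run the workload pinned to a single node and compare against an unpinned run:

$ numactl --cpunodebind=1 --membind=1 ./my_benchmark   # CPUs and allocations restricted to node 1
$ ./my_benchmark                                       # default policy, for comparison

If the pinned run isn't measurably faster, the NPS setting probably doesn't matter for that workload.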


I learned a lot digging into this, which I'll summarize below. TLDR: yes, it seems like the NPS4 NUMA topology is accurate. Nodes 1/2 do have lower-latency access to memory than nodes 0/3. This is surprising to me because I'd always seen the 3960x/3970x diagrammed like this:

simplified topology diagram for 3960x/3970x

The package has 4 CCDs arranged into quadrants, each with two 3-core (3960x) or 4-core (3970x) CCXs. From this diagram it seems like the two CCDs on the left should have equal access to the memory channels on that side. So, not one channel per CCD like I was thinking, but two channels shared between two CCDs, making NPS2 seem most reasonable.
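As an aside, you can see how cores group into CCXs and nodes on a live system with lscpu (needs a reasonably recent util-linux; the CACHE column shows L1d:L1i:L2:L3 IDs, and cores sharing an L3 ID are in the same CCX):

$ lscpu -e=CPU,CORE,SOCKET,NODE,CACHE

If you have hwloc installed, lstopo-no-graphics prints the same hierarchy as a tree.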

However, a more detailed diagram from WikiChip shows some asymmetry:

more detailed diagram of Threadripper 3 topology

CCD0 (top right) is connected to the I/O die by a GMI2 link (red lines). Right next to it are two memory controllers (UMC0/1), but these are not connected to any memory channels. In contrast, CCD2 underneath it is right next to UMC2/3, which are connected to memory channels. It's conceivable that CCD2 has lower memory latency than CCD0.

Can we measure it? One tool for this is the Intel Memory Latency Checker. Let's try it!

# tar xf mlc_v3.10.tgz
# sysctl vm.nr_hugepages=4000
vm.nr_hugepages = 4000
# ./Linux/mlc --latency_matrix
Intel(R) Memory Latency Checker - v3.10
malloc(): corrupted top size
[1]    18377 IOT instruction (core dumped)  ./Linux/mlc --latency_matrix

Uh, okay then, let's try the previous version:

# tar xf ~/Downloads/mlc_v3.9a.tgz
# sysctl vm.nr_hugepages=4000
vm.nr_hugepages = 4000
# ./Linux/mlc --latency_matrix
Intel(R) Memory Latency Checker - v3.9a
Command line parameters: --latency_matrix 

Using buffer size of 600.000MiB
Measuring idle latencies (in ns)...
                Numa node
Numa node            0       1       2       3
       0        -         98.8   108.1  -
       1        -         93.3   111.9  -
       2        -        112.3    93.2  -
       3        -        107.6    97.9  -

This confirms it! Same-node latency is ~93ns, and node 1↔2 latency is ~112ns, but node 0→1 and 3→2 latency is in between at ~98ns. Interestingly, the worst-case latency from nodes 0/3 (~108ns) is slightly better than from nodes 1/2 (~112ns). This makes sense looking at the diagram, as CCD0 is slightly closer to UMC4/5 than CCD2 is. Bandwidth has a similar story:

# ./Linux/mlc --bandwidth_matrix
Intel(R) Memory Latency Checker - v3.9a
Command line parameters: --bandwidth_matrix 

Using buffer size of 100.000MiB/thread for reads and an additional 100.000MiB/thread for writes
Measuring Memory Bandwidths between nodes within system 
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0       1       2       3
       0        -       33629.4 31791.2 -
       1        -       34332.5 31419.5 -
       2        -       31193.1 34266.8 -
       3        -       32077.3 33799.3 -

What this seems to mean is that some cores on a 3960x (and presumably 3970x) are slightly privileged with regard to memory latency and bandwidth. I'd be curious to see the results for a 3990x -- does e.g. CCD1 perform similarly to CCD0?
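One last note: none of this asymmetry shows up in the ACPI SLIT distances the firmware reports; the numactl -H output above has a flat 12 between every pair of nodes. You can read that same table straight out of sysfs:

$ cat /sys/devices/system/node/node*/distance   # one line per node, same numbers numactl -H prints

so the only way to see the CCD-level differences is to measure them, as above.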
