Many segfaults, RAM failing?

Question

I have a Proxmox installation that is unstable: both the host and the LXC containers crash. To find the fault I have tried swapping the RAM, CPU and motherboard. I also did a fresh reinstall. But I'm still getting segfaults, and now I suspect the second set of RAM sticks I use is faulty as well. The reason is that the segfaults are caused by different software libraries, yet their IP/SP values point to approximately the same area of RAM.

But before I go out and buy another pair of RAM sticks I would like a second opinion: is it the RAM, or can the cause be something else?

See output below:

-- Boot a7c68ceb88ef4d39af834b30afc8b7a7 --
-- Boot e546636248024201b128732ac8773aa6 --
Sep 11 18:55:30 pve kernel: .NET ThreadPool[3862528]: segfault at 42bfb9a8 ip 00007fe5309a7341 sp 00007fe542bfb8f0 error 4 in memfd:doublemapper (deleted)[7fe530990000+1e7000] likely on CPU 9 (core 16, socket 0)
-- Boot 0d84be2364364002be86b166bab2fb8b --
Sep 17 09:48:27 pve kernel: .NET ThreadPool[516579]: segfault at 7f28aee62de8 ip 00007f28aee62de8 sp 00007ee7357f7728 error 15 in memfd:doublemapper (deleted)[7f28aee60000+10000] likely on CPU 4 (core 8, socket 0)
Sep 17 09:55:45 pve kernel: .NET Server GC[7759]: segfault at 8 ip 00007fd5f7e010bc sp 00007fd5f7f2edf0 error 6 in libcoreclr.so[7fd5f79c2000+4e2000] likely on CPU 4 (core 8, socket 0)
Sep 17 13:50:14 pve kernel: .NET Server GC[1511201]: segfault at 8 ip 00007fb8fe3dfa5b sp 00007fb8fc203970 error 6 in libcoreclr.so[7fb8fdfc4000+4a9000] likely on CPU 4 (core 8, socket 0)
Sep 17 14:36:19 pve kernel: .NET Server GC[548740]: segfault at 8 ip 00007f91c3c010bc sp 00007f91c402adf0 error 6 in libcoreclr.so[7f91c37c2000+4e2000] likely on CPU 8 (core 16, socket 0)
Sep 17 15:09:17 pve kernel: .NET Server GC[1793028]: segfault at 8 ip 00007f885c2010bc sp 00007f885c63adf0 error 6 in libcoreclr.so[7f885bdc2000+4e2000] likely on CPU 4 (core 8, socket 0)
Sep 17 15:11:02 pve kernel: .NET Server GC[1951658]: segfault at 8 ip 00007fc496c010bc sp 00007fc497039df0 error 6 in libcoreclr.so[7fc4967c2000+4e2000] likely on CPU 4 (core 8, socket 0)
Sep 17 15:47:53 pve kernel: PLUGIN[proc][2989]: segfault at b5 ip 00005585edbfb3c2 sp 00007f28d47af4c0 error 4 in netdata[5585edb4e000+270000] likely on CPU 4 (core 8, socket 0)
Sep 17 15:48:16 pve kernel: .NET Server GC[1960959]: segfault at 8 ip 00007f3d5c0010bc sp 00007f3d5c436df0 error 6 in libcoreclr.so[7f3d5bbc2000+4e2000] likely on CPU 4 (core 8, socket 0)
Sep 17 15:50:01 pve kernel: .NET Server GC[2113796]: segfault at 8 ip 00007fd0514010bc sp 00007fd051832df0 error 6 in libcoreclr.so[7fd050fc2000+4e2000] likely on CPU 4 (core 8, socket 0)
Sep 17 16:55:47 pve kernel: .NET Server GC[2121413]: segfault at 8 ip 00007f8b55c010bc sp 00007f8ad404bdf0 error 6 in libcoreclr.so[7f8b557c2000+4e2000] likely on CPU 16 (core 32, socket 0)
Sep 17 18:37:22 pve kernel: .NET Server GC[2396493]: segfault at 8 ip 00007f609c0010bc sp 00007f609c13edf0 error 6 in libcoreclr.so[7f609bbc2000+4e2000] likely on CPU 3 (core 4, socket 0)
Sep 17 18:37:22 pve kernel: .NET Server GC[2396494]: segfault at 4 ip 00007f609be5741b sp 00007f6014e78938 error 4
Sep 17 18:37:22 pve kernel: .NET Server GC[2396495]: segfault at 0 ip 00007f609be5e014 sp 00007f6014df7970 error 4 in libcoreclr.so[7f609bbc2000+4e2000] likely on CPU 8 (core 16, socket 0)
Sep 17 20:15:08 pve kernel: .NET ThreadPool[3291437]: segfault at 7ff4146c5e41 ip 00007ff4146c5e41 sp 00007f75f8edc8e0 error 14 likely on CPU 9 (core 16, socket 0)
Sep 17 21:37:34 pve kernel: .NET Server GC[3577402]: segfault at 8 ip 00007fe8eaa010bc sp 00007fe8eae39df0 error 6 in libcoreclr.so[7fe8ea5c2000+4e2000] likely on CPU 4 (core 8, socket 0)
Sep 18 08:03:31 pve kernel: .NET Server GC[1929138]: segfault at 8 ip 00007f495dfdfa5b sp 00007f495e0efdf0 error 6 in libcoreclr.so[7f495dbc4000+4a9000] likely on CPU 4 (core 8, socket 0)
-- Boot cc5a739fcdde4774baea9476f4e35954 --
-- Boot 37eeab715f514a53a70be19bf02979fb --
Sep 20 10:29:02 pve kernel: traps: .NET BGC[10478] general protection fault ip:7f9f9487a3ad sp:7f5dccff8708 error:0 in libcoreclr.so[7f9f947c2000+4e2000]
Sep 20 16:39:44 pve kernel: .NET Server GC[346393]: segfault at 8 ip 00007fb3534010bc sp 00007fb2d0204df0 error 6 in libcoreclr.so[7fb352fc2000+4e2000] likely on CPU 4 (core 8, socket 0)
Sep 20 16:41:10 pve kernel: .NET Server GC[1951791]: segfault at 8 ip 00007f774ee010bc sp 00007f774f23adf0 error 6 in libcoreclr.so[7f774e9c2000+4e2000] likely on CPU 4 (core 8, socket 0)
-- Boot ee6412c44c104e9eb7464bb4633ffedc --
-- Boot d3ba03c45a0d4a08926f1f6a037c0185 --
Sep 22 06:03:35 pve kernel: postgres[9914]: segfault at 55b789ca6d7b ip 000055b7c4899e43 sp 00007ffd96be1ad0 error 6 in postgres[55b7c45dc000+5d9000] likely on CPU 4 (core 8, socket 0)
Sep 22 08:59:49 pve kernel: unattended-upgr[994014]: segfault at ffffffff00000101 ip 00000000005716be sp 00007ffc3e607990 error 7 in python3.8[423000+296000] likely on CPU 4 (core 8, socket 0)
-- Boot 38a800e306bb414591afb9293bbffa7a --
Sep 25 04:00:15 pve kernel: .NET ThreadPool[2832713]: segfault at 7f04190a5d78 ip 00007f04190a5d78 sp 00007f02a57f7928 error 15 likely on CPU 9 (core 16, socket 0)
Sep 25 16:40:34 pve kernel: postgres[2212459]: segfault at 0 ip 0000000000000000 sp 00007ffc1ad44a78 error 14 in postgres[55cc302e2000+c6000] likely on CPU 4 (core 8, socket 0)
-- Boot 0e3cda8f03d44454a147268b14fd1679 --
Sep 26 12:16:38 pve kernel: .NET ThreadPool[371639]: segfault at 7f8bc9ded288 ip 00007f8bc9ded288 sp 00007f8a915fae40 error 15 likely on CPU 9 (core 16, socket 0)
Sep 26 12:17:03 pve kernel: .NET Tiered Com[409330]: segfault at 7f754c5d42c8 ip 00007f754c5d42c8 sp 00007fb6aabfdcb8 error 15 likely on CPU 9 (core 16, socket 0)
Sep 26 12:29:58 pve kernel: .NET ThreadPool[469166]: segfault at 2f003c ip 00007fcdbd3a098d sp 00007f8b807f4d78 error 4 in libc.so.6[7fcdbd228000+195000] likely on CPU 9 (core 16, socket 0)
Sep 27 04:17:06 pve kernel: .NET ThreadPool[560187]: segfault at 22 ip 0000000000000022 sp 00007f2b00ff6908 error 14 in FinnCore[5567d53cb000+c000] likely on CPU 9 (core 16, socket 0)
Sep 27 05:17:40 pve kernel: traps: .NET ThreadPool[855136] general protection fault ip:7f1c52cb97b8 sp:7f1cb9bfb180 error:0 in libcrypto.so.3[7f1c52cb2000+25d000]

@GeraldSchneider good question, I forgot to mention. I did run memtest86. What is strange is that it found no errors. — Northern Brewer, Sep 27 at 7:56
Well, what kind of hardware is this? I've seen a problematic machine where the CPU was faulty (and the system kept panicking). So, perform a complete hardware diagnostic: try replacing all the RAM (and if that helps, put the modules back one by one to find the faulty one), replace the CPU and see if it helps, replace the motherboard and see if it helps, and so on. — Nikita Kipriyanov, Sep 27 at 8:16
@JaromandaX good catch! But the area of the RAM is also in the same address range. So I guess I cannot tell whether it is RAM or CPU by swapping in new components. I guess it could even be the MB. — Northern Brewer, Sep 27 at 10:26

John Mahowald · Accepted Answer · 2023-10-05 00:34:24Z

Virtual memory does not work like that. Each process has its own address space for all its data, including what the instruction pointer (ip) and stack pointer (sp) refer to, so similar-looking addresses in different processes do not point to the same physical RAM. Much more likely this is a memory management issue in low-level code, although hardware memory faults are not impossible either.
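To compare crashes across processes, it can help to convert each ip into an offset inside the mapped module rather than looking at the raw virtual address. The following is a minimal sketch, assuming the log format shown in the question and the x86 page-fault error code layout (the kernel prints the code in hex); the function names are made up for illustration.

```python
import re

# Matches "segfault at <addr> ip <ip> sp <sp> error <code>" and, when present,
# the trailing " in <module>[<base>+<size>]" part of the kernel message.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<module>.+?)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def decode_error(code):
    """Decode the x86 page-fault error bits (hypothetical helper)."""
    return {
        "protection_violation": bool(code & 0x1),  # False: page not present
        "write": bool(code & 0x2),                 # False: read access
        "user_mode": bool(code & 0x4),
        "reserved_bit": bool(code & 0x8),
        "instruction_fetch": bool(code & 0x10),
    }


def parse_segfault(line):
    """Return a dict describing one segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    info = {
        "addr": int(m["addr"], 16),
        "ip": ip,
        "error": decode_error(int(m["err"], 16)),
        "module": m["module"],
        "offset": None,
    }
    if m["base"] is not None:
        base, size = int(m["base"], 16), int(m["size"], 16)
        # Only report an offset when ip really lies inside the mapping.
        if base <= ip < base + size:
            info["offset"] = ip - base
    return info


# Two lines taken from the question's log, from two different processes.
lines = [
    "Sep 17 09:55:45 pve kernel: .NET Server GC[7759]: segfault at 8 ip 00007fd5f7e010bc sp 00007fd5f7f2edf0 error 6 in libcoreclr.so[7fd5f79c2000+4e2000] likely on CPU 4 (core 8, socket 0)",
    "Sep 17 14:36:19 pve kernel: .NET Server GC[548740]: segfault at 8 ip 00007f91c3c010bc sp 00007f91c402adf0 error 6 in libcoreclr.so[7f91c37c2000+4e2000] likely on CPU 8 (core 16, socket 0)",
]
for line in lines:
    info = parse_segfault(line)
    print(info["module"], hex(info["offset"]), info["error"])
```

Both sample lines resolve to the same offset, 0x43f0bc, inside libcoreclr.so even though the absolute addresses differ per process. That is a pattern worth feeding into a debugger; on its own it does not prove either a software or a hardware cause.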

Get crash dumps when this occurs and look at them with a C debugger. Judging by the libcoreclr.so library and the .NET thread names, .NET applications are involved. Microsoft has specific advice and tooling for this; see their docs on Analyze dumps on Linux.

Configure saving of dump files, for example with environment variables: DOTNET_DbgEnableMiniDump=1 and DOTNET_EnableCrashReport=1 seem useful. Also be aware that your OS distro might have its own crash dump handling; I'm unclear how the two interact.
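As a rough illustration of setting those two variables for a single application, here is a minimal sketch that launches a process with them added to its environment. The helper names and the MyApp.dll path in the usage note are hypothetical; only the two variable names come from the answer.

```python
import os
import subprocess


def crash_dump_env(base_env=None):
    """Return a copy of the environment with .NET crash dumps switched on."""
    env = dict(os.environ if base_env is None else base_env)
    env["DOTNET_DbgEnableMiniDump"] = "1"   # write a dump when the runtime crashes
    env["DOTNET_EnableCrashReport"] = "1"   # also request a crash report
    return env


def run_with_dumps(argv):
    """Run a command (hypothetical wrapper) with the dump variables set."""
    return subprocess.run(argv, env=crash_dump_env(), check=False)
```

Usage would look like run_with_dumps(["dotnet", "MyApp.dll"]). For a service, the same two variables would have to be set wherever that service's environment is defined.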

Load the crash dump into LLDB as in the docs: lldb --core <dump-file> <host-program>. Then try their sos debugger extension. As usual when you aren't familiar with a program, a stack trace helps narrow the search: sos CLRStack for the managed code and sos DumpStack for all code.
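If there are many dumps to go through, the same commands could be run non-interactively. The sketch below only builds the command line; it assumes lldb's --batch and -o (one-line command) options and that the sos extension is already installed so that it loads with lldb. The dump and program paths are placeholders.

```python
import shlex


def lldb_batch_command(dump_file, host_program,
                       commands=("sos CLRStack", "sos DumpStack")):
    """Build an lldb invocation that runs the given commands and exits."""
    argv = ["lldb", "--batch", "--core", dump_file]
    for command in commands:
        argv += ["-o", command]       # each -o runs one debugger command
    argv.append(host_program)         # the program that produced the dump
    return argv


# Placeholder paths, purely illustrative.
argv = lldb_batch_command("/tmp/coredump.1234", "/usr/bin/dotnet")
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run to capture the stack traces of each dump into a file for side-by-side comparison.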

Collect version information for every dotnet runtime installed or bundled with applications. Install different versions and confirm whether they are affected, for example by upgrading to the latest release of the major version in use, or by downgrading to a version that worked before.
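For the shared runtimes, the dotnet --list-runtimes command prints one line per installed runtime in the form "name version [path]". A minimal sketch of collecting that into a list follows; the sample text and its version numbers are made up for illustration, and runtimes bundled inside self-contained applications will not appear in this output.

```python
import subprocess


def parse_runtimes(text):
    """Parse 'name version [path]' lines into (name, version, path) tuples."""
    runtimes = []
    for line in text.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) == 3 and parts[2].startswith("[") and parts[2].endswith("]"):
            runtimes.append((parts[0], parts[1], parts[2][1:-1]))
    return runtimes


def installed_runtimes():
    """Ask the dotnet host which shared runtimes it can see."""
    result = subprocess.run(["dotnet", "--list-runtimes"],
                            capture_output=True, text=True, check=True)
    return parse_runtimes(result.stdout)


# Illustrative sample only; real output depends on the machine.
sample = (
    "Microsoft.AspNetCore.App 7.0.0 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]\n"
    "Microsoft.NETCore.App 7.0.0 [/usr/share/dotnet/shared/Microsoft.NETCore.App]\n"
)
for name, version, path in parse_runtimes(sample):
    print(name, version, path)
```

Running installed_runtimes() inside each container as well as on the host gives a list that can be compared against the versions of libcoreclr.so seen in the crash log.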

Microsoft claims you can get .NET help via support channels, although debugging what exactly is going on might take someone who hacks on the runtime. Once you have the affected functions, consider running them by Stack Overflow or the dotnet issue tracker.

For crashes that are not dotnet, you can still load a dump into a debugger like lldb. At least a backtrace would be useful, to see if any patterns emerge from the code.

Note that a software fault could be in the source code, such as a bug in some memory management, or the installed copy could be corrupted: some bits got flipped and it's doing bad things. Verify the integrity of your installed software. Consider building another host like this with exactly the same software packages, verifying the signatures of the repo, and see if you can reproduce the problem.
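One simple way to check for a flipped bit in an installed file is to hash it and compare the result with the hash of the same file on a known-good host. This is a minimal sketch with hypothetical helper names; on Debian-based systems, package tools such as dpkg --verify or debsums cover similar ground for packaged files.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large libraries do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_files(local_hashes, reference_hashes):
    """Return paths whose local hash is missing or differs from the reference."""
    return sorted(
        path for path, expected in reference_hashes.items()
        if local_hashes.get(path) != expected
    )
```

The idea is to build a path-to-hash dictionary on each host with sha256_of, for example for libcoreclr.so and libc.so.6, and pass both dictionaries to differing_files. Note that a file read through faulty RAM can hash differently on successive runs, which is itself a useful signal.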

In parallel to the software fault investigation, you may wish to continue to investigate hardware faults.

If you still get faults after replacing the memory, the CPU and the main board, then either you haven't found the faulting component, you are the most unlucky person and were shipped new faulty hardware, your physical environment is super hostile, or it is something else. This is still a very broad investigation; all we can tell from what you shared is that some programs crashed.

RAM modules transfer huge amounts of data constantly, with a vanishingly small error rate. A problem where the OS runs mostly normally but programs crash sometimes will be very tricky to root-cause. Diagnosing memory errors with certainty requires ECC RAM, so get hardware with it. Intel and others segment their products, so unfortunately it's usually only available on server boxes. Maybe start with a micro server, a tiny tower that could serve a variety of test purposes, starting with reproducing this fault on a different box with reliability features.

If you do get ECC RAM and other RAS hardware, of course there will be software to collect and report faults. On Linux, rasdaemon is the current trendy tool.
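Independently of rasdaemon, the kernel's EDAC subsystem exposes per-memory-controller error counters in sysfs, which can be read directly. The sketch below assumes an EDAC driver is loaded and that the counters live under /sys/devices/system/edac/mc; the function name is made up for illustration. On a machine without ECC or without an EDAC driver it simply returns an empty result.

```python
from pathlib import Path


def edac_error_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return corrected/uncorrected error counts per memory controller."""
    counts = {}
    for controller in sorted(Path(edac_root).glob("mc[0-9]*")):
        try:
            corrected = int((controller / "ce_count").read_text())
            uncorrected = int((controller / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # counter missing or unreadable: skip this controller
        counts[controller.name] = {
            "corrected": corrected,
            "uncorrected": uncorrected,
        }
    return counts


if __name__ == "__main__":
    for name, values in edac_error_counts().items():
        print(name, values)
```

A corrected count that keeps rising points at a specific memory controller, which is the kind of evidence that memtest passes and a list of crashing programs cannot provide.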

Eventually replace all the hardware if you still have doubts that it could be hardware. In professional scenarios, the hardware is inexpensive relative to the applications going down. With such service contracts you will not get much argument with replacing parts.

Power supply in particular, replace that. Check the quality of the utility power, such as with a good UPS.

All of this is a lot of words to say that very few possible faults can be excluded based on what you have provided. Be prepared for a deep investigation to find the root cause.

Thank you for a very detailed explanation. There are a couple of segfaults that are not from dotnet but from unattended-upgr and postgres. Since the segfaults were not all from dotnet, I figured the fault had to lie somewhere else. Do you agree? Since dotnet consumes by far the most CPU time on my server, it would make sense to me that most segfaults are in dotnet if the chance of a fault depends on consumed CPU time rather than on the application. — Northern Brewer, Sep 27 at 19:10
See my edit for some thoughts on hardware faults. I do not have opinions on what the cause is; very few things can be concluded from just a list of crashing programs. — John Mahowald, Oct 5 at 0:37
