A Bitcoin node will detect bugs in the matrix

Since I'm not seeing any serious memory or disk issues, and assuming there are no bugs in Bitcoin Core, I wonder if my bitcoin node is getting hit by cosmic rays and bits are getting flipped every now and then, leading to leveldb checksum errors.

I looked into ECC RAM but I may need to build a home server, since it's not common in desktop motherboards.

Is this why you’ve been going on about ECC memory nostr:npub12262qa4uhw7u8gdwlgmntqtv7aye8vdcmvszkqwgs0zchel6mz7s6cgrkj ? I found a post about it from Greg Maxwell 10 years ago as well, saying all his non-laptop machines use ECC memory:

https://www.reddit.com/r/Bitcoin/comments/2jpk54/risks_of_running_bitcoin_client_on_a_computer/cle3qyb/ nostr:note1ktns7scrqr00h9a400eajn8k23hcxzzp35syfr7j4tvzjkdpjjdsj4z0sf

Discussion

I run ECC RAM on my hard nodes

gonna have to set up a server rack in my house nostr:npub1q3sle0kvfsehgsuexttt3ugjd8xdklxfwwkh559wxckmzddywnws6cd26p

i haven't had a rack in my house in a decade... i could be persuaded though...

another water heater… :)

ryzen supports ecc

Need motherboard support though which is rare for consumer motherboards

I read that you need a Ryzen PRO APU

I have bought 2 random ASUS boards, one meant for “professional use” and one meant to be lower end (B)

Both work

Can you link?

Will do in a bit

It is detected as ECC too.

G series CPUs don’t support it. Don’t ask how I know (I tried to add ECC RAM to a system with one and it wasn't detected as ECC).

It is supported but not “validated” (don’t blame us if it doesn’t work) for consumer platforms

/r/homelab sends their regards

Doesn't need to be a power hungry monster homelab though, IIRC some of the newer Lenovo ThinkCentre Tiny nodes support ECC and have two NVMe slots

In many years I've never had a Bitcoin Core leveldb corruption that wasn't caused by a non-RAID disk failing (bad sectors)

Yeah time to get serious about this. Gonna do a proper zfs raid and ecc setup

Yes.

I have never had a corrupt DB with ECC memory, even with a few hard power cuts. I use RAID 1, though with mdadm, which does no checksumming, so if a disk were returning bad data it should show up on about 50% of reads.

You’re more likely hitting hardware issues that only crop up when running the machine hot. Try running y-cruncher and memtest. Similarly try testing your disk (don’t know any applications that test if it corrupts at high rate, I know they exist tho)

Maybe, but I’m never really running this machine hot

and it's water cooled

The Udoo Bolt Gear mini PC supports ECC RAM, that’s why I got one 3 years ago and it’s been running non stop ever since. Silent, powerful. Not super cheap but a reasonable price for what you get

You are probably experiencing memory corruption or something similar without noticing it.

The only times I've ever had a leveldb database get corrupted were due to bad memory. I spent a fair bit of ₿ upgrading my desktop to ECC memory a few years ago due to a run-in with bad memory that corrupted files.

I’ve had many bad memory issues in the past and it usually leads to system instability. If it is memory it must be a very minor issue that somehow doesn’t cause anything else to crash.

It's very easy for minor memory issues to result in disk corruption rather than overall system instability. You just need something like a single bad bit that only shows up sometimes, e.g. while hot, as nostr:nprofile1qqsr6tj32zrfn7v0pu4aheaytdnnc6rluepq73ndc2tdjzus34gat9qpz4mhxue69uhhyetvv9ujuerpd46hxtnfduhswulwwv pointed out.

Hmm will run more tests to confirm

Write a memory test program that stores 4092 bytes and a CRC
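
A minimal sketch of that idea, assuming the CRC is a 4-byte CRC32 so that 4092 payload bytes plus the checksum fill exactly one 4096-byte page (the post does not say which CRC is meant). The names `make_block`, `block_ok` and `find_bad_blocks` are made up for illustration, and a user-space Python program has no control over which physical pages it gets, so this only shows the store-and-recheck pattern:

```python
import os
import zlib

PAYLOAD_SIZE = 4092            # payload bytes, as suggested in the post
BLOCK_SIZE = PAYLOAD_SIZE + 4  # assumption: 4-byte CRC32 -> one 4096-byte page


def make_block() -> bytearray:
    """Fill a block with random bytes and append the CRC32 of those bytes."""
    payload = os.urandom(PAYLOAD_SIZE)
    crc = zlib.crc32(payload)
    return bytearray(payload + crc.to_bytes(4, "little"))


def block_ok(block: bytearray) -> bool:
    """Recompute the CRC32 of the payload and compare it to the stored one."""
    stored = int.from_bytes(block[PAYLOAD_SIZE:], "little")
    return zlib.crc32(bytes(block[:PAYLOAD_SIZE])) == stored


def find_bad_blocks(blocks) -> list:
    """Return the indexes of all blocks whose contents no longer match their CRC."""
    return [i for i, block in enumerate(blocks) if not block_ok(block)]


if __name__ == "__main__":
    blocks = [make_block() for _ in range(1024)]  # roughly 4 MiB of test data
    print("bad blocks:", find_bad_blocks(blocks))
```

In practice you would hold far more blocks than this and recheck them in a loop for hours while the machine is under load.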

memtest86 does this more thoroughly… I've always used that in the past

You want real workloads

Yeah but it's harder to test every physical address, isn’t that the point

Due to interleaving and similar you already mostly achieve that

Memtest doesn’t tend to get your CPU hot, though. Different things can fail at different utilization levels…

There was that prime-something program i remember using a long time ago for that, not sure if there are more modern solutions

ycruncher?

I remember prime95 but maybe that was like 20 years ago

damn im old

whatever type of error you are having must be a more widespread issue than a few pages to be triggered so often

the ideal program would have a small pool it nonstop allocates to and deallocates from and a pool it very slowly checks and rotates allocations in/out

write random data and CRC it as I said

you then want to stop the process and get a debugger if there’s a mismatch and see the physical location along with identifying the RAM module
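
A rough sketch of the two-pool design described above, under several assumptions: the class `PoolTester`, its pool sizes and the `slow_every` ratio are invented for illustration, the CRC is taken to be CRC32, and a mismatch simply raises an exception. Python cannot report the physical address or RAM module, so that step would still need a debugger or a lower-level tool:

```python
import os
import random
import zlib
from collections import deque

PAYLOAD = 4092  # assumed: 4092 payload bytes + 4-byte CRC32 is about one page


def new_block():
    """Allocate a block of random data and remember its CRC32."""
    buf = bytearray(os.urandom(PAYLOAD))
    return buf, zlib.crc32(buf)


def verify(block) -> bool:
    buf, crc = block
    return zlib.crc32(buf) == crc


class PoolTester:
    def __init__(self, churn_size=64, slow_size=256):
        # Small pool that is constantly freed and reallocated.
        self.churn = [new_block() for _ in range(churn_size)]
        # Larger pool of long-lived blocks, checked slowly in FIFO order.
        self.slow = deque(new_block() for _ in range(slow_size))

    def churn_step(self) -> bool:
        """Check one random churn block; if it is intact, replace it."""
        i = random.randrange(len(self.churn))
        if not verify(self.churn[i]):
            return False  # leave the bad block in place for inspection
        self.churn[i] = new_block()
        return True

    def slow_step(self) -> bool:
        """Check the oldest slow block; if intact, rotate a churn block in."""
        if not verify(self.slow[0]):
            return False  # leave the bad block at the head of the queue
        self.slow.popleft()
        i = random.randrange(len(self.churn))
        self.slow.append(self.churn[i])
        self.churn[i] = new_block()
        return True

    def run(self, steps, slow_every=100):
        for n in range(steps):
            ok = self.churn_step()
            if n % slow_every == 0:
                ok = self.slow_step() and ok
            if not ok:
                # A real tool would halt here so a debugger can be attached.
                raise RuntimeError(f"CRC mismatch at step {n}")


if __name__ == "__main__":
    PoolTester().run(steps=100_000)
```

Because a failed check leaves the bad block where it was, the process can be paused at the exception and the block inspected before anything overwrites it.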

Prime95 will definitely get a CPU toasty, but you would need to ensure it's running with a mix of large and small FFT and for a while (e.g. hours/days). Small FFT maximizes CPU heat, but isn't the best for finding instabilities working with RAM. Large FFT helps there. In general, I find the overclocking community sometimes works on "vibes" rather than completely proven test methodologies, so take this with a grain of salt.

Not that you're overclocking, but that group tends to accumulate tribal knowledge of CPU/RAM stability tests. Hardware can be "fun". A ton of variables. Even things like BIOS versions can cause instabilities.