I've been tinkering with an LND node for a few months now and the #1 issue is getting out of sync with the chain data. Even when it has the blocks, it sometimes hasn't processed them.

This same problem happened to me with the lightning node in Zeus.

And it's a troublesome problem to detect too, because uptime monitors are going to report the service is up. Systemd is going to say the service is running. The block height might even be up to date!

It'd take some custom code to dig deep and detect this as soon as it happens. Sure, something could be written, but I haven't come across it yet.
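For what it's worth, here is a minimal sketch of what such a check might look like. It assumes `lncli` and `bitcoin-cli` are on the PATH and already configured, and it relies on the `synced_to_chain`, `block_height` and `best_header_timestamp` fields that `lncli getinfo` prints. The function names and thresholds are made up for illustration, and it won't catch every failure mode described here (for example, lnd holding the blocks but not having processed them).

```python
import json
import subprocess
import time


def assess_sync(lnd_info, chain_height, now=None,
                max_lag_blocks=2, max_header_age_s=2 * 3600):
    """Return a list of problems; an empty list means nothing looked wrong.

    lnd_info is the parsed JSON from `lncli getinfo`; chain_height is the
    result of `bitcoin-cli getblockcount`. Thresholds are arbitrary guesses.
    """
    now = time.time() if now is None else now
    problems = []

    # lnd's own opinion of whether it has caught up with the chain backend.
    if not lnd_info.get("synced_to_chain", False):
        problems.append("lnd reports synced_to_chain=false")

    # Compare lnd's height against what bitcoind says the tip is.
    lag = chain_height - int(lnd_info.get("block_height", 0))
    if lag > max_lag_blocks:
        problems.append(f"lnd is {lag} blocks behind bitcoind")

    # A very old best header suggests lnd stopped hearing about new blocks.
    header_age = now - int(lnd_info.get("best_header_timestamp", 0))
    if header_age > max_header_age_s:
        problems.append(f"lnd best header is {int(header_age)}s old")

    return problems


def check_node():
    """Query both daemons through their CLIs and assess the result."""
    info = json.loads(subprocess.check_output(["lncli", "getinfo"], text=True))
    height = int(subprocess.check_output(["bitcoin-cli", "getblockcount"], text=True))
    return assess_sync(info, height)
```

Running `check_node()` from a cron job or systemd timer and alerting on a non-empty result would at least flag the case where the service is "up" but has fallen behind.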

As I understand it, a watchtower is looking for channel closures with old channel state. This is not at all the same as detecting that your node is falling behind and payments are going to fail (sometimes silently) if someone tries to move any money.

The only explanation I can muster for this being the state of lightning after 5+ years is that laws must have had a severe chilling effect and kept developers away. It's hard to understand why the quality would be so low unless there just aren't enough devs to go around. I know the people who are working on these projects are incredibly passionate, so I don't believe for a moment that it's a lack of caring on their part.

Discussion

What hardware are you on? I think a lot of people's LN node problems come from believing the lie that an RPi is fast enough.

I forget what I have allocated to those VMs, but it's a 64-core AMD with 256GB RAM.

Drive type that the chain is on?

On production it's a Ceph cluster, which is a mix of SSDs and WD Red spinny disks.

In the test environment, I think it's a WD Red, but it might be an SSD. Can't remember off the top of my head.

Interesting. I run mine on an NVMe and it makes a huge difference. I think a full chain sync on a spinny disk is a month+ these days. NVMe is 48 hours or less.

I haven't dug into the details of why, but Bitcoin chain ops seem to be very disk-intensive.

To be clear the bitcoind server has never had any problems. It processes and stores blocks like a champ.

It's only been lnd that is troublesome.

Right, but remember they are stacked. Bitcoind is the first pass at the blocks, then electrs, then LND, each adding more IO both in terms of bandwidth and thrash.

So if bitcoind can't keep up, LND will never work. You may have hit the spot where bitcoind can work but you don't have enough left over for electrs and LND to do their thing.

It seems to have been keeping up ever since I upgraded to 0.19.0-beta, but we will see how long that lasts...

It used to happen a lot to me as well. I was unable to figure out what was going on.

And in the end it was a failing/slow ssd.

Were you getting caught in application startup loops where it would never make any progress in processing any blocks? Trying to figure out if we're running into similar issues.

That one was fixed by upgrading to 0.19.0-beta

Another one was some random remote BTC node going offline and my network connection to it timed out, then my server would gracefully shut down. It looked like an unexpected exception that was being caught, and shutdown is how it was handled.

When I switched the .service file to always restart instead of only on failure, the wallet would be locked. After manually unlocking it, it'd run into an issue, reboot, and start over from the beginning... only to repeat this process.

I hacked around this by turning up the log level to debug and it never happened again. It's been a lot of things like this, where no cause is ever identified and nobody understands why the weird workarounds evade the core problems.

I don't remember exactly, but lnd was often out of sync and could not connect to the btc node, and there were timeouts all over the place.

Actually, the btc node gave better hints. When it started up it took forever to load the mempool from disk, and i2pd showed tons of errors around networking issues.

It was all so random that I couldn't wrap my head around it.

I figured it out by accident when trying to copy a larger amount of data off the disk. The speed was good at the beginning, but after some time it degraded to kbit/s. SMART didn't show any problems either, and the short tests didn't show any signs.
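If anyone wants to reproduce that kind of test on purpose, something like this rough sketch could do it: read a big file sequentially and see whether throughput collapses partway through. The function names, window size and the 10% threshold are arbitrary choices for illustration. The file should be larger than RAM (or the page cache dropped first), otherwise you are mostly measuring the cache rather than the disk.

```python
import time


def read_throughput(path, chunk_size=4 * 1024 * 1024,
                    window_bytes=256 * 1024 * 1024, clock=time.monotonic):
    """Read `path` sequentially and return MB/s for each completed window."""
    rates = []
    with open(path, "rb", buffering=0) as f:
        window_read = 0
        window_start = clock()
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # end of file; a trailing partial window is dropped
            window_read += len(chunk)
            if window_read >= window_bytes:
                elapsed = max(clock() - window_start, 1e-9)
                rates.append(window_read / elapsed / 1e6)
                window_read = 0
                window_start = clock()
    return rates


def degraded(rates, baseline_windows=3, ratio=0.1):
    """True if any later window drops below `ratio` of the early average."""
    if len(rates) <= baseline_windows:
        return False  # not enough data to compare against a baseline
    baseline = sum(rates[:baseline_windows]) / baseline_windows
    return any(r < baseline * ratio for r in rates[baseline_windows:])
```

For example, `degraded(read_throughput("/path/to/some/large/file"))` returning True would match the "fast at first, then crawling" pattern, which a short SMART test apparently does not surface.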

Do the channels go down when you fall out of sync?

They become inactive. But as soon as I fix whatever the problem du jour was, they become active again.
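Since inactive channels are the visible symptom, a small sketch like the following could be bolted onto a monitor. It assumes the JSON shape printed by `lncli listchannels` (a `channels` list whose entries carry `active` and `chan_id`); the helper names are hypothetical.

```python
import json
import subprocess


def inactive_channels(listchannels_output):
    """Return the chan_id of every channel lnd reports as not active."""
    channels = listchannels_output.get("channels", [])
    return [c.get("chan_id") for c in channels if not c.get("active", False)]


def fetch_inactive_channels():
    """Ask lnd for its channel list via lncli and filter it."""
    raw = subprocess.check_output(["lncli", "listchannels"], text=True)
    return inactive_channels(json.loads(raw))
```

A channel can also be inactive simply because the peer is offline, so this is a hint to look closer rather than proof the node is out of sync.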