It is important to empathize with frustrated users. It's sometimes an unattainable ideal, but who hasn't hit software that Just Doesn't Work? We don't really care whether it's just something about our setup, or fundamentally broken, or just giving a completely unhelpful error message: it's an incredibly frustrating feeling of impotence.

Sure, you shouldn't take it out on the devs you aren't paying, but we're all human.

I can't speak for all developers, but I became a FOSS coder in the Linux Kernel. That gave me a pretty thick skin: Linus could be an ass, and even when he was wrong there was no appeal. So I generally find it easier to sift through the users' frustrations and try to get to the problem they are having.

https://github.com/ElementsProject/lightning/issues/7180

And often, it turns out, I agree! This shit should just Work Better!

CLN payments are the example here, and they were never my priority. That might seem weird, but the first production CLN node was the Blockstream store, so we're good at *receiving* payments! The method of routing and actually making payments, though, is neither spec-defined nor a way to lose money (a failed attempt just fails; no funds are lost). It's also hard to measure success properly, since it depends on the vagaries of the network at the time.

But it's important, it turns out :). And now we see it first-hand, since we host nodes at Greenlight. So this release, unlike most, was "get a new pay system in place" (hence we will miss our release date for the first time since we switched to date-based releases). Here's a list of what we did:

1. I was Release Captain. I was next in the rotation anyway, but since this was going to be a weird release I wanted to take responsibility.

2. I wrote a compressor for the current topology snapshot. This lets us check a "known" realistic data set into the repo for CI.

3. I wrote a fake channel daemon, which uses the decompressed topology to simulate the entire network.

4. I pulled the min-cost-flow solver out of renepay into its own general plugin, "askrene". This lets anyone access it, lets @lagrange enhance it further, and makes it easier for custom pay plugins to exist: Michael of Boltz showed how important this is with mpay. (There's a toy sketch of the min-cost-flow idea just after this list.)

5. A new interface for sending HTLCs, which mirrors the path of payments coming from other nodes. In particular, this handles self-pay (including payments where part is self-pay and part remote!) and blinded path entry natively, just like any other payment.

6. Enhancements and cleanups to our "libplugin" library for built-in plugins, to avoid nasty hacks pay has to do.

7. Finally, a new "xpay" command and plugin. After all the other work, this was fairly simple. In particular, I chose not to be bound by the current pay API, which is a bit painful in the short term.

8. nostr:nprofile1qqsx533y9axh8s2wz9xetcfnvsultwg339t3mkwz6nayrrdsrr9caagppemhxue69uhkummn9ekx7mp0fqcrlc changed our gossip code to be more aggressive: you can't route if you can't see the network well!
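To give a flavour of what the min-cost-flow approach in item 4 does, here's a toy sketch: it splits one payment across candidate channels by solving a min-cost flow, with edge cost standing in for routing fees. This is not askrene's actual code (that lives in the CLN tree, in C), and the little channel graph here is entirely made up.

```python
# Toy illustration of splitting a payment with min-cost flow.
# NOT askrene: just the general technique, on an invented graph.
import networkx as nx

AMOUNT = 300_000  # sats to deliver from "us" to "dest"

g = nx.DiGraph()
# Edges are channels: capacity = spendable sats, weight = integer cost
# per sat flowed (a stand-in for the proportional routing fee).
g.add_edge("us", "a", capacity=200_000, weight=1)
g.add_edge("us", "b", capacity=250_000, weight=5)
g.add_edge("a", "dest", capacity=180_000, weight=10)
g.add_edge("b", "dest", capacity=400_000, weight=2)

# Negative demand = supply at the source, positive = demand at the sink.
g.nodes["us"]["demand"] = -AMOUNT
g.nodes["dest"]["demand"] = AMOUNT

# The solver finds the cheapest way to push AMOUNT through, splitting
# across paths when no single channel can carry it all.
flow = nx.min_cost_flow(g)
for src, outs in flow.items():
    for dst, sats in outs.items():
        if sats:
            print(f"{src} -> {dst}: {sats} sats")
```

The real solver has far more to juggle (success probabilities, fee and HTLC limits, the size of the actual graph), which is exactly why having it in one plugin that any pay implementation can call is a win.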

Importantly, I haven't closed this issue: we need to see how this works in the Real World! Engineers always love rewriting, but it can actually make things worse as lessons are lost, and workarounds people were using before stop being effective.

But after this fairly Herculean effort, I'm going to need to switch to other things for a while. There are always other things to work on!


Discussion

As a CLN user, I appreciate the work you and many others have put into this issue.

Appreciate the work and the rundown, Rusty 👊

love to see it as a CLN node runner 🤙

Snarky asshole devs are the best devs.

Keep Calm and Carry on.

😁 🧡🤗

Wow, that's a high quality bug report, especially for such a frustrating, long term, hard to fix issue. Kind of inspiring.

Dunno if you saw https://bluematt.bitcoin.ninja/2024/11/22/ln-routing-replay/ but I recently started being more rigorous about our pathfinding scorer. Might be something to play with: it seems like the simple “just keep upper and lower bounds on each channel’s liquidity” approach performs *worse* than always assigning each hop a 50% success probability. Keeping a histogram of those bounds, though, does reasonably well.

We discard what we've "learned" after an hour. We could degrade faster, or we could try to measure channels' recovery speed. This requires more analysis, though!

Your data would be a useful starting point!

At least my initial analysis of the data seems to say that degrading instantly is better than any other time constant 😭. (May just be some nasty bug in the way we’re calculating probabilities?)

Err, no, sorry, that’s wrong. Degrading instantly is bad (the learning does help!), but the model itself is worse than the naive “I dunno, 50/50 always” “model”, even when you learn.

(The LDK “historical model”, however, seems to do okay, keeping histograms of the liquidity bounds)
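For anyone following along: the “keep upper and lower bounds” scorer being compared above is usually modelled by treating a channel's unknown spendable liquidity as uniform between the learned bounds, while the histogram variant tracks where those bounds have historically sat rather than only the latest pair. A minimal sketch of the bounds model (the general idea only, not LDK's or CLN's actual code):

```python
# Rough illustration of the liquidity-bounds model: assume a channel's
# spendable liquidity is uniformly distributed between a learned lower
# bound `lo` and upper bound `hi` (both in sats; before anything is
# learned, lo = 0 and hi = channel capacity).
def success_probability(amount: int, lo: int, hi: int) -> float:
    """P(liquidity >= amount) under a uniform prior on [lo, hi]."""
    if amount <= lo:
        return 1.0   # known to fit
    if amount >= hi:
        return 0.0   # known not to fit
    return (hi - amount) / (hi - lo)

# Nothing learned yet: probability falls linearly with the amount.
print(success_probability(250_000, 0, 1_000_000))        # 0.75
# A failed 600k attempt taught us hi < 600k; 250k now looks riskier.
print(success_probability(250_000, 0, 600_000))          # ~0.58
# A successful 300k payment taught us lo >= 300k; 250k is now certain.
print(success_probability(250_000, 300_000, 1_000_000))  # 1.0
```

The open question in this thread is how fast those learned bounds should be relaxed back toward (0, capacity) as the network's liquidity shifts underneath you.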

That's a great post, thanks for the nuance

Congrats on the release, it's a great feeling to finally get the big ones over the finish line.