Replying to Dr. Hax

OK, I'm looking for some help from a fellow lnd node runner.

Problem 1: My connection to my peer keeps dropping about every 3.5 minutes.

Logs say "pong response failure" and "timeout while waiting for pong response -- disconnecting".

The peer's log shows the same error: it also timed out while waiting for a pong response.

I can reconnect with no problem at all.

There are no network issues such as packet loss (see the notes on the pcap below for the evidence to back this up). Tor is not in the mix for this test.

I am able to send sats if I do so quickly after connecting to the peer. So it seems like things can work properly if the connection issue can be sorted out.

I took a pcap of a connection, transaction and disconnection. Near the end, I see the client (node which initiated the connection) just absolutely slamming PSH,ACKs to the tune of 37 of them in just under 0.001 seconds. Then it sends a TCP Retransmission 0.006 seconds later and gets an ACK 0.036 seconds later, which is a perfectly reasonable response time.

The next batch is some TCP keepalives and keepalive ACKs, then some PSH,ACKs and ACKs with sub-ms response times, followed by a retransmit and an ACK from the other side.

Finally 2 more keepalives and Keepalive ACKs in 0.012 seconds and then we get the FIN,ACK from the client followed by the RST,ACK from the server (remote peer to which we connected).

The FIN,ACK did come 5 seconds after the last ACK, so I feel like the server should have responded sooner. At the same time, I don't feel like a 5 second lag should cause a connection to be dropped with no attempt ever made to reconnect. Also, these blitzkriegs of packets within 1 ms are absurd.
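To put a number on those bursts, here is a minimal sketch that flags runs of packets squeezed into a 1 ms window. It assumes the packet timestamps have already been exported from the pcap as floats in seconds (for example with tshark's `-T fields -e frame.time_epoch`). The function name `find_bursts`, the thresholds, and the sample data are all made up for illustration; none of them come from the actual capture.

```python
from typing import List, Tuple


def find_bursts(
    timestamps: List[float],
    window: float = 0.001,   # burst window in seconds (assumed threshold)
    min_packets: int = 10,   # how many packets count as a "burst" (assumed)
) -> List[Tuple[float, int]]:
    """Return (start_time, packet_count) for each run of packets that
    all fall within `window` seconds of the first packet in the run."""
    ts = sorted(timestamps)
    bursts = []
    i = 0
    while i < len(ts):
        j = i
        # Extend the run while the next packet is still inside the window
        # that starts at ts[i].
        while j + 1 < len(ts) and ts[j + 1] - ts[i] < window:
            j += 1
        count = j - i + 1
        if count >= min_packets:
            bursts.append((ts[i], count))
            i = j + 1  # continue after the burst
        else:
            i += 1
    return bursts


if __name__ == "__main__":
    # Synthetic data shaped like the description above: 37 packets in
    # under 1 ms, then a retransmission and an ACK some milliseconds later.
    sample = [100.0 + k * 0.00002 for k in range(37)] + [100.0067, 100.0427]
    print(find_bursts(sample))
```

Running this over the timestamps from each side separately would show whether the bursts exist on the wire at both ends or only in one capture.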

Any ideas on where I should look next? I guess take pcaps on both sides and compare them?
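Comparing the two captures could look roughly like this: reduce each packet to its sequence number, acknowledgment number and payload length, then check which segments the sender's capture has that the receiver's capture lacks. This is only a sketch. The `Segment` type and `missing_on_far_side` function are hypothetical names, and it assumes both captures have already been parsed into these tuples using the same sequence numbering. One caveat: segmentation offload on either host can make the two captures split the same bytes differently, so a mismatch here is a hint, not proof of loss.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Segment(NamedTuple):
    """Minimal fingerprint of one TCP segment (hypothetical helper type)."""
    seq: int     # TCP sequence number
    ack: int     # TCP acknowledgment number
    length: int  # payload length in bytes


def missing_on_far_side(
    sent: Iterable[Segment], received: Iterable[Segment]
) -> List[Segment]:
    """Segments present in the sender-side capture that do not appear
    (or appear fewer times) in the receiver-side capture.

    Uses a multiset difference so a retransmission that was captured
    twice on one side but only once on the other is still reported.
    """
    diff = Counter(sent) - Counter(received)  # keeps positive counts only
    return sorted(diff.elements())


if __name__ == "__main__":
    # Made-up example: the sender's capture shows one segment twice
    # (original plus retransmission); the receiver only saw it once.
    near = [
        Segment(1, 1, 100),
        Segment(101, 1, 100),
        Segment(101, 1, 100),
        Segment(201, 1, 50),
    ]
    far = [Segment(1, 1, 100), Segment(101, 1, 100), Segment(201, 1, 50)]
    print(missing_on_far_side(near, far))
```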

This is absolutely brutal. I wouldn't expect most sysadmins to go to this much trouble to track down this issue, let alone any normal human.

You're not going to believe this. When I change the log level from info to debug, the problem disappears entirely. FML

nostr:nevent1qqsdyp0fynta04lnxusw2qkvjcn7ea73haelds7pr5s4ezm9spzykyqpzpmhxue69uhkummnw3ezumt0d5hsygxnp65cafj7j5ler2un76esafg7kv79qmu86j0kqzsnnthsp254zypsgqqqqqqs9ar6vj

Discussion

I've seen bugs that presented like that.

For me it was extra MCU instructions preventing a race condition or hiding atomicity issues.

Good luck.

The first thing a friend said was "That sucks. Race condition." I wish they were wrong.
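For anyone following along, this is the general shape of that kind of bug, as a toy sketch and not anything taken from lnd (which is written in Go). Several threads do an unprotected read-modify-write on a shared counter. Anything that changes the timing or adds synchronization, such as extra logging, can shrink the window in which updates get lost. The function `run_counter` and its parameters are invented for this illustration.

```python
import threading
import time


def run_counter(workers: int, increments: int, use_lock: bool) -> int:
    """Increment a shared counter from several threads.

    With use_lock=False the read and the write are separate steps, so
    two threads can read the same value and one update may be lost.
    With use_lock=True the read-modify-write is atomic.
    """
    counter = 0
    lock = threading.Lock()

    def bump() -> None:
        nonlocal counter
        for _ in range(increments):
            if use_lock:
                with lock:
                    current = counter
                    time.sleep(0)  # yield; harmless while holding the lock
                    counter = current + 1
            else:
                current = counter
                time.sleep(0)  # yield; widens the race window
                counter = current + 1

    threads = [threading.Thread(target=bump) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    expected = 4 * 500
    print("expected:", expected)
    print("with lock:", run_counter(4, 500, use_lock=True))
    # The unlocked result depends on thread scheduling and may differ
    # from run to run, which is exactly what makes these bugs hard to chase.
    print("without lock:", run_counter(4, 500, use_lock=False))
```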

I'm just going to keep it at debug logging for now. I don't want to take on development of yet another project. I don't have time to get to know another code base. I struggle to keep up with what I contribute to now as it is.

Sounds like the right call: it's working, and it's not your codebase. Time-critical stuff can be a nightmare.