For most connections I use a 30 second timeout. For advertising relay lists I use a 5 second timeout.
The Gossip client tracks jobs. Unfinished jobs with benign errors are retried after a timeout, but there is a long list of error conditions that change the behavior.
For a few random examples, if the relay returned status 4000, we exclude connections to that relay for 600 seconds. If the connection timed out, we exclude it for 60 seconds. If we got a NOT_FOUND, 600 seconds. If we got a websocket error, 15 seconds. Some of them are probably poorly chosen numbers and could use review.
We then randomize the exclusion value a bit so that retries to multiple relays don't all line up together. For jobs that are not marked "persistent", we don't bother trying again. From my reading of the code right now, it looks like if a relay is excluded and a new job wants to talk to it, an error is thrown rather than the job being queued to run after the exclusion ends, which might not be ideal.
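To make the exclusion logic concrete, here is a rough Python sketch rather than the actual Gossip code. The base durations are the ones quoted above; the error labels, the function names, and the 25% jitter range are assumptions for illustration only.

```python
import random

# Base exclusion periods in seconds, taken from the examples above.
# The keys are hypothetical labels, not Gossip's real error types.
BASE_EXCLUSION_SECS = {
    "status_4000": 600,
    "timeout": 60,
    "not_found": 600,
    "websocket_error": 15,
}


def exclusion_secs(error, jitter=0.25, rng=random):
    """Return the base exclusion for `error`, randomized by +/- `jitter`.

    The 25% default is an assumed value; the point is only that relays
    which failed at the same moment do not all come due at the same moment.
    """
    base = BASE_EXCLUSION_SECS[error]
    factor = 1.0 + rng.uniform(-jitter, jitter)
    return base * factor


def should_retry(persistent):
    """Only jobs marked persistent are tried again after the exclusion."""
    return persistent
```

With these assumed settings, a timed-out relay would be excluded for somewhere between 45 and 75 seconds instead of exactly 60.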
Relays marked rank=0 are never connected to, so users can disable relays that don't work. But this relies on the user, who usually isn't going to notice these details.
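As a sketch of how these two rules might combine at connection time, the gate below skips rank=0 relays and raises an error when a relay is still inside its exclusion window, which is the behavior I described from my reading of the code. `RelayGate`, `RelayExcluded`, and the method names are hypothetical, not Gossip's API.

```python
import time


class RelayExcluded(Exception):
    """Raised when a job targets a relay that is currently excluded."""


class RelayGate:
    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable so the gate can be tested
        self._excluded_until = {}    # relay URL -> time the exclusion ends

    def exclude(self, url, seconds):
        """Exclude `url` for `seconds` from now."""
        self._excluded_until[url] = self._clock() + seconds

    def check(self, url, rank):
        """Return True if we may connect, False if the relay is disabled.

        Raises RelayExcluded if the relay is inside an exclusion window,
        rather than queueing the job for later.
        """
        if rank == 0:
            return False  # user disabled this relay; never connect
        now = self._clock()
        until = self._excluded_until.get(url)
        if until is not None and now < until:
            raise RelayExcluded(f"{url} excluded for another {until - now:.0f}s")
        return True
```

A queueing variant would catch `RelayExcluded` and reschedule the job for when the exclusion ends instead of propagating the error.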