The network outage was intermittent, and 2 nodes still had quorum (yes, I'm aware that's too low a vote count). Taking the ENTIRE cluster down over 5 seconds of network loss is absolutely nuts to me. The machines in the picture I shared had consumer UPSs, crappier network cards, configurations, and switches by comparison, and they were still more stable.
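
For anyone following along, here is a rough sketch of the vote math behind "still had quorum". It assumes one vote per node and the usual strict-majority rule; the function names are made up for illustration and are not taken from any cluster tool.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest number of votes that is a strict majority of total_votes."""
    # Assumption: simple majority, one vote per node.
    return total_votes // 2 + 1


def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """True if the reachable partition holds a strict majority."""
    return reachable_votes >= votes_needed(total_votes)


if __name__ == "__main__":
    # Show how many votes each small cluster size needs.
    for total in (2, 3, 4, 5):
        print(total, "nodes need", votes_needed(total), "votes")
```

Under that rule, 2 reachable nodes keep quorum in a 3-node cluster but not in a 4-node one, and a 2-node cluster loses quorum as soon as either node drops.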

Discussion

Why can't Proxmox just kill all services to accomplish fencing? It takes like 10 minutes for a single server to boot into the OS. I think even the kernel watchdog can do a reset without a full system reboot.

And no, I'm not going to adjust my hardware to boot faster; it's old and needs memory checking and firmware updates. A hardware reboot should not be a normal condition.

I'll have to explore the behavior more. Most settings can be tweaked, so there should be enough wiggle room.

I'm playing with Pacemaker and SBD, which appears to accomplish STONITH fencing via the kernel watchdog, but I'm not certain yet. I haven't gotten my cluster established yet.