We haven't had too many issues in the day job yet, but we're also not clustering our Proxmox instances... yet.

Standalone, they seem to work just fine. But we do have separate cores. It's going to get interesting when we start clustering them, I'm sure.

Discussion

I guess we'll see how many network engineers it takes to keep them stable. 🤣

I think it's one of those things you have to get right the first time, otherwise you're stuck in the hell of "I can't touch the network because the entire cluster will hard crash and take 25 minutes to come back online, and God, I hope no disks were corrupted."

My UPSs are getting kind of old, and sometimes they don't trip fast enough on brown-outs. I had an issue where a quick power loss tripped up my main switch. Looking at the logs, 3 of 5 nodes lost quorum, and the whole cluster hard crashed and hardware reset. I lost 3 VMs in the process that I had to restore from backup. It took almost 2 hours to recover at 3 am XD

Fr

That anecdote is a power issue though. This is where a holistic approach really matters: one bad power system fucking up the higher layers is so damn frequent, and it sucks when that's what you're stuck using.

No power, no revenue.

The network outage was intermittent, and 2 nodes still had quorum (yes, I'm aware that's too low a vote). Taking the ENTIRE cluster down over 5 seconds of network loss is absolutely nuts to me. The machines in the picture I shared had consumer UPSs and crappier network cards, configurations, and switches in comparison, and they still did better in terms of stability.
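
For anyone following along, the vote math in a situation like this can be sketched in a few lines of Python. This is only a rough illustration, assuming one vote per node and a plain majority rule (more than half of the expected votes) with no QDevice or other tie-breaker. The function names are made up for the example and are not part of any Proxmox or Corosync API.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Smallest vote count that is a strict majority of expected_votes.

    Assumption: every node carries exactly one vote.
    """
    return expected_votes // 2 + 1


def has_quorum(votes_present: int, expected_votes: int) -> bool:
    """True if the nodes that can still see each other form a majority."""
    return votes_present >= quorum_threshold(expected_votes)


if __name__ == "__main__":
    expected = 5  # five-node cluster, one vote each (assumed)
    for present in range(expected, 0, -1):
        state = "quorate" if has_quorum(present, expected) else "NOT quorate"
        print(f"{present}/{expected} nodes reachable: {state}")
```

Under those assumptions a 5-node cluster needs 3 votes, so a partition of 2 nodes would not be quorate on its own.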

Why can't Proxmox just kill all services to accomplish fencing? It takes like 10 minutes for a single server to boot into the OS. I think even the kernel watchdog can do a reset without a full system reboot.

And no, I'm not going to adjust my hardware to boot faster. It's old and needs memory checking and firmware updates. A hardware reboot should not be a normal condition.

I’ll have to explore the behavior more. Most settings can be tweaked, so there should be enough wiggle room.

I'm playing with Pacemaker and SBD, which appears to accomplish STONITH fencing via the kernel watchdog, but I'm not certain yet. I haven't gotten my cluster established yet.

I also cannot fix crappy power; that's a condition I'd expect to survive... Line-interactive UPSs are prohibitively expensive even for many businesses.

Also reminds me, I want to consider getting switches with redundant hot-swap PSUs.

Edge-core. 👨‍🍳🤌💋