When power cycling your (x86) server isn't enough to recover it

113 points
6 days ago
by zdw

Comments


GlenTheMachine

In grad school we built a lot of logic boards from scratch. They were used for submersible robots, and we had a 350,000 gallon water tank that we kept heated to 88 degrees. This was three stories above the ground in a metal building. You can't really air condition that, so in the summer it got quite hot.

It was not uncommon to return from lunch to find that an embedded computer board that had been working when you left wasn't any more. One way to debug them was to put them in the refrigerator for a while. If they then worked, you knew you had a bad solder joint or an IC that was on the verge of failing.

3 days ago

nielsbot

Wow. That's 3M lbs of water. (1.34M kg)
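
For anyone who wants to check the figure, here is a quick sketch of the conversion. The density value is an assumption (water near 88 °F is slightly lighter than 1 kg/L), so treat the result as approximate.

```python
LITERS_PER_US_GALLON = 3.785411784  # exact by definition of the US gallon
KG_PER_LITER_WARM = 0.995           # assumption: rough density of water near 88 °F
LB_PER_KG = 2.20462262


def water_mass(gallons: float) -> tuple:
    """Return (kilograms, pounds) for a volume of warm water in US gallons."""
    liters = gallons * LITERS_PER_US_GALLON
    kg = liters * KG_PER_LITER_WARM
    return kg, kg * LB_PER_KG


kg, lb = water_mass(350_000)
print(f"{kg / 1e6:.2f}M kg, {lb / 1e6:.2f}M lb")
```

Under those assumptions the tank holds a bit under 3M lb, or roughly 1.32M kg.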

3 days ago

Szpadel

I have at least 2 regular cases where full power off was required to resolve the issue.

The first is a Dell Latitude laptop with a fingerprint reader. Randomly, after a few days of operation, the fingerprint reader stops responding and login screens freeze for a minute until it times out a few times. A reboot does not solve it, nor does suspending the machine. It needs to be powered off and on again (hibernation to disk also works).

The second case is my PC with an ASRock X570 Creator. After being kept suspended for a long time, the WiFi card stopped functioning and just threw some errors in dmesg on driver initialization. Here even powering off and on did not help, but flipping the switch on the power supply for a few seconds resolved the issue.

3 days ago

apfsx

I’ve actually had some strange anomalies like this happen on a couple of laptops I have. Rebooting, or even holding the power button long enough to do what the manufacturer describes as some kind of CMOS or hard reset, didn’t work either. I had to open up the bottom cover, unplug the battery completely, then plug it back in, and everything went back to operational condition.

3 days ago

Latty

The WiFi/Bluetooth one was common on AM4, I think. I also had that issue.

3 days ago

duffyjp

The integrated wifi/bt on my AM5 board was so bad I had to disable it and use a PCIe card.

For obvious reasons AMD boards don’t tend to ship with Intel wifi, but in my experience anything else sucks. The Intel 6E cards are amazing and dirt cheap.

3 days ago

jmb99

> For obvious reasons AMD boards don’t tend to ship with Intel wifi

Funnily enough, the Threadripper boards (at least WRX90, and at least ASRock) come with an Intel dual 10Gb LAN card. Probably because none of the alternatives are good enough for a pro board.

3 days ago

toast0

> For obvious reasons AMD boards don’t tend to ship with Intel wifi, but in my experience anything else sucks.

Because Realtek checks the "has wifi" box and probably costs $3 less? If you care, you can swap it, and if you don't, you don't.

3 days ago

doubled112

I’ve had some weirdness with Intel WiFi cards over the years too, especially when dual booting.

3 days ago

duffyjp

Was it the 9560 by chance? (The original AC / wifi5 one) Those were terrible. Our house isn’t practical to wire, so I had a lot of them. All swapped to AX210 cards (6E) and those work phenomenally.

I also dual boot, in addition to being an incurable distro hopper, and these AX210 cards worked out of the box in basically everything.

3 days ago

tonyarkles

Yeah I’ve got a Lenovo Legion laptop that I dual-boot Windows and Linux. I haven’t tried in a while but for at least a year it was impossible to soft-reboot to switch OSes if you wanted wifi to work. My best theory was that Windows and Linux had different firmware that they loaded into it at boot and they weren’t reloading that after a soft reboot (just using whatever was already running on the card).

3 days ago

speckx

My friend had an issue with a laptop that did not resolve until the battery was fully drained.

3 days ago

Szpadel

This reminded me of an issue that I once had with one of the Dell laptops at work. It froze somewhere deep enough that the power button was not even responsive. I remember that I had a full battery, the display was off and the fans were spinning at their lowest speed. I figured out that I needed to pull out the battery because waiting until it drained would take ages.

I had to do this secretly because the company warranty/service deal would have required sending it to Dell or requesting a technician.

2 days ago

Aachen

Wouldn't the quicker solution be disconnecting the battery for 2 seconds?

3 days ago

pests

Not everyone has the skills or knowledge to disassemble their laptop. I haven’t had a removable, easily replaceable battery since, I feel, 2006ish. My current one requires removing 8 security screws on the bottom and a bracket, and even I had some issues when I did a swap earlier this year.

3 days ago

trebligdivad

A BIOS can forget to reset some devices. A physical device might have a design flaw where it forgets to reset some registers on reset. A BIOS (including device firmware) can forget to zero some RAM/initialise a structure and get lucky.

3 days ago

garganzol

Yep, this is a typical flaw and it can cause annoying situations. I have run into it in practice.

3 days ago

snakeyjake

Are these Dells?

Some Dells have a "feature" where something, somewhere, in their mess of a UEFI/iDRAC stack will get corrupted and will stay wrong through power cycles until you physically unplug the servers from power and hold down the power button to discharge a capacitor and clear out the NVRAM where the corrupted value is.

Most recently this impacted a PowerEdge R7525 server we have where the iDRAC was enforcing a power cap of ~300 watts leaving the system to be less than 1/10th as performant as it should have been. Manually setting a new power cap did nothing except update the values displayed in the UI. Multiple six minute (because of their mess of a UEFI/iDRAC stack) reboots of both the server and the iDRAC did nothing.

Dell was less than useful except for the fact that they hosted the answer. After raging against their CSA script/LLM auto-reply bullshit for days an aggrieved user with the same issue looking for help in their forums finally posted that he did the cap drain trick and it worked.

Saved me tons of wasted time. Thanks, anonymous fellow frustrated dell customer!

3 days ago

chasil

No, I actually uploaded the firmware images here:

https://github.com/corna/me_cleaner/issues/233

2 days ago

kyrofa

The Linux kernel supports rebooting using a number of different strategies[1]. Some PCs need a different one than the default in order to make sure everything is properly reset.

[1]: https://github.com/torvalds/linux/blob/9b2ffa6148b1e4468d08f...
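
For readers who want to see whether a machine has been told to use a non-default strategy, here is a minimal sketch that looks for the `reboot=` kernel command-line parameter. The letter-to-name table covers only the x86 method letters as I understand them from the kernel parameter documentation, and the parsing is deliberately simplified compared with what the kernel itself does, so treat it as an illustration.

```python
from typing import List

# First letter of a reboot= token -> method name (x86 subset, simplified).
METHODS = {"b": "bios", "a": "acpi", "k": "kbd", "t": "triple", "e": "efi", "p": "pci"}


def reboot_methods(cmdline: str) -> List[str]:
    """Return the reboot method names requested via reboot= in a command line."""
    found = []
    for arg in cmdline.split():
        if not arg.startswith("reboot="):
            continue
        for token in arg[len("reboot="):].split(","):
            # Skip panic_* mode tokens so they are not mistaken for "pci".
            if not token or token.startswith("panic_"):
                continue
            if token[0] in METHODS:
                found.append(METHODS[token[0]])
    return found


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            print(reboot_methods(f.read()) or "no reboot= method set")
    except OSError:
        print("could not read /proc/cmdline")
```

An empty result just means the kernel is using its default selection logic.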

3 days ago

mjg59

Linux now uses exactly the same reboot strategy as Windows does, so no PC should "need" a different one - it may be the case that driver code leaves the hardware in a state the system vendor didn't test, and using a different reboot approach may work around that, but it's not fundamentally the reboot method that's causing the problem there (https://mjg59.dreamwidth.org/3561.html goes into some more detail on how all this actually works)

3 days ago

kyrofa

Yes, I didn't mean to imply that Linux was doing anything wrong, just that some hardware seems to work better with other approaches, for the reasons you state.

3 days ago

geocrasher

Whenever I power cycle something that doesn't go right the first time, I leave it off for at least 30 seconds so all the caps can discharge and any saved state can reset. Especially true of routers etc.

3 days ago

ijustlovemath

You can further be sure of this by pressing the On button while the power supply is disconnected. Ofc make sure it's always off when you connect or disconnect the power supply.

3 days ago

klysm

Depends on how the on button is implemented, and the power management of the system. On older devices I would expect this to be more reliable.

3 days ago

geocrasher

Indeed, this used to be my "secret trick" for laptops that wouldn't power on: Disconnect the battery and power supply, hold the power button for 30 seconds, then power it back up. Worked every time.

3 days ago

chasil

I ran me_cleaner once, and removed power from a desktop, waited ten seconds, plugged it back in, and the test for the presence of the ME was still positive.

I unplugged it and left it overnight, and the next day, the ME was gone.

This was the ARC version, but it can remain operational for some time after power is removed.

3 days ago

trilbyglens

Probably a capacitor in there somewhere that slowly discharges when unplugged for a longer time.
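
A rough way to see why ten seconds might not be enough while overnight is: model the leftover charge as a simple RC discharge, V(t) = V0 * exp(-t / (R*C)). The sketch below solves that for the time needed to fall below some threshold. The model and every component value are assumptions for illustration, not measurements from any real board.

```python
import math


def discharge_time(r_ohms: float, c_farads: float, v0: float, v_threshold: float) -> float:
    """Seconds for an ideal RC circuit to decay from v0 down to v_threshold."""
    if not 0 < v_threshold < v0:
        raise ValueError("v_threshold must be between 0 and v0")
    # Solve v_threshold = v0 * exp(-t / (R*C)) for t.
    return r_ohms * c_farads * math.log(v0 / v_threshold)


# Hypothetical values: 1 MOhm leakage path, 1000 uF of capacitance,
# a 3.3 V rail, and logic that stops holding state below 0.5 V.
print(discharge_time(1e6, 1000e-6, 3.3, 0.5))
```

With those made-up numbers the decay takes on the order of half an hour, which is the general shape of the "ten seconds no, overnight yes" observation.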

3 days ago

Joel_Mckay

IIRC, on most modern intel cpus removing/blanking the ME will reboot the machine every 20 minutes or so. It is unfortunately an irremovable OEM hardware RAT on most modern systems.

That being said, there are some versions of BIOS that do allow turning the ME off, but most motherboard and laptop manufacturers will not allow general consumers to install that version of the firmware. There are some groups that have figured out how to sign a patched fully feature-unlocked BIOS on a per machine basis (disabling ME is a simple Y/N flag), but YMMV given these tools are nearly impossible to get working.

AMD should end the clown show of RATs, and eat the remaining Intel market. =3

3 days ago

doublepg23

I was under the impression some boutique Linux laptop manufacturers like System76 and StarLabs flashed Coreboot.

3 days ago

Joel_Mckay

Indeed, they used the coreboot nvramtool to set the disable IME flag.

It's still there, but unlike most consumer BIOS can apparently be turned off (whatever that means to Intel.)

Personally, I don't hold out much hope that the outdated on-chip MINIX OS can't be exploited/activated anyway. =3

3 days ago

chasil

This was on a Core 2 duo, the last generation where it could be totally removed.

3 days ago

DaSHacka

> IIRC, on most modern intel cpus removing/blanking the ME will reboot the machine every 20 minutes or so. It is unfortunately an irremovable OEM hardware RAT on most modern systems.

Yes, if ME detects a problem when initializing it grants you a 20 minute window as a grace period, presumably to allow users to attempt to fix it.

> There are some groups that have figured out how to sign a patched fully feature-unlocked BIOS on a per machine basis (disabling ME is a simple Y/N flag), but YMMV given these tools are nearly impossible to get working.

You can also just flip the HAP bit[0], I'd assume that's what those advanced (usually leaked dev build) BIOS firmwares do anyway.

> AMD should end the clown show of RATs, and eat the remaining Intel market. =3

AMD has PSP[1], which is functionally equivalent (though with a significantly smaller attack surface, when left enabled)

I personally am of the belief that both technologies are likely backdoored. There's so much pointing against them[2], that the simplest explanation is they're more likely than not a mandated backdoor that chipmakers eventually expanded for other purposes (such as recent versions of ME handling suspend-related power management)

[0] https://github.com/corna/me_cleaner/wiki/HAP-AltMeDisable-bi...

[1] https://en.m.wikipedia.org/wiki/AMD_Platform_Security_Proces...

[2] https://en.m.wikipedia.org/wiki/Intel_Management_Engine#Asse...

3 days ago

Joel_Mckay

Computrace was replaced by the Absolute BIOS module, so yes... 100% RAT features have been active for some time. Whatever legitimate asset recovery and remote drive deletion features it offers are outweighed by potential backdoors on the refurbished PC market.

This is why we can't have nice things. =3

3 days ago

guerrilla

The AMD equivalent is the PSL, right? Can that be disabled on any CPUs?

3 days ago

DaSHacka

I am unaware of the PSL, but I know AMD PSP is the equivalent to ME for most AMD chips [0].

Some motherboards allow you to disable it, and it doesn't do as much as ME in the first place (no network modules or built-in remote access purpose like ME)

[0] https://en.m.wikipedia.org/wiki/AMD_Platform_Security_Proces...

3 days ago

guerrilla

Typo, I meant PSP.

3 days ago

markhahn

I usually prefer the 'reset' option (such as in IPMI). After all, this is the as-designed way to politely ask all devices to re-initialize.

Yes, power-cycling is more unambiguous, but afaict, the example here is purely that power cycling really needs a noticeable off-period so that all devices can fully come down. Otherwise, there's no real standard on what should happen - this or that component might stay up or retain state.

The other reason I like 'reset' is that lots of devices (fans, disks, probably all power systems - definitely including PSUs) have lifetime limits in power cycles. Mostly this is minor, unless you do something like reboot cluster nodes after a job (conceivably a paranoid security requirement), or some automation gets in a loop and continually zaps a server.
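
As a sketch of the difference, the snippet below builds `ipmitool chassis power` commands and implements a cycle with an explicit off-period instead of relying on the BMC's own timing. The host name, user name and 30-second default are hypothetical; `-E` tells ipmitool to read the password from the `IPMI_PASSWORD` environment variable. Note that a BMC-commanded "off" leaves standby power on, so this is still not the same as pulling the cord.

```python
import subprocess
import time

ACTIONS = {"status", "on", "off", "cycle", "reset", "soft"}


def power_cmd(host: str, user: str, action: str) -> list:
    """Build the ipmitool argv for one chassis power action."""
    if action not in ACTIONS:
        raise ValueError(f"unknown power action: {action}")
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E",
            "chassis", "power", action]


def cycle_with_off_period(host, user, off_seconds=30.0,
                          run=subprocess.run, sleep=time.sleep):
    """Power off, wait a noticeable interval, then power back on."""
    run(power_cmd(host, user, "off"), check=True)
    sleep(off_seconds)
    run(power_cmd(host, user, "on"), check=True)


# A plain reset, by contrast, is a single command:
print(" ".join(power_cmd("bmc.example", "admin", "reset")))
```

The `run` and `sleep` parameters exist only so the sequence can be exercised without touching real hardware.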

3 days ago

jcalvinowens

OP, what Linux is this? I'm really curious, I don't recognize that trace format and I can't find the code to print exception traces with the eight bangs on the first line like that anywhere in the upstream git history. I think they're actually from the BIOS?

   !!!! X64 Exception Type - 12(#MC - Machine-Check)  CPU Apic ID - 00000000 !!!!

My story: I had an Intel NUC running Linux back in the day, which would get stuck in standby such that I had to remove and replace the CMOS battery to get it to boot again! I never figured that one out...

3 days ago

pzmarzly

This is a trace from the BIOS, it is not uncommon to have them printed over the serial console. Potentially the BIOS is based on EDK2 source code, in which case you can take a look here for the implementation of the trace printing logic: https://github.com/tianocore/edk2/blob/9e6537469d4700d9d793e...
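
If you are collecting these from serial console logs, a small parser is enough to pull out the fields. This is a sketch based only on the single line quoted in this thread; the regular expression and the assumption that the exception number is hexadecimal (0x12 = vector 18, which is what #MC is on x86) are mine, not taken from the EDK2 source.

```python
import re
from typing import Optional

TRACE_RE = re.compile(
    r"!!!!\s+(?P<arch>\w+)\s+Exception Type - (?P<vector>[0-9A-Fa-f]+)"
    r"\(#(?P<mnemonic>\w+) - (?P<name>[^)]+)\)\s+"
    r"CPU Apic ID - (?P<apic>[0-9A-Fa-f]+)\s+!!!!"
)


def parse_trace_header(line: str) -> Optional[dict]:
    """Extract fields from a firmware exception header line, or return None."""
    m = TRACE_RE.search(line)
    if m is None:
        return None
    return {
        "arch": m.group("arch"),
        "vector": int(m.group("vector"), 16),  # assumed to be printed in hex
        "mnemonic": m.group("mnemonic"),
        "name": m.group("name"),
        "apic_id": int(m.group("apic"), 16),
    }


line = "!!!! X64 Exception Type - 12(#MC - Machine-Check)  CPU Apic ID - 00000000 !!!!"
print(parse_trace_header(line))
```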

3 days ago

neuroelectron

I've seen similar behavior when trying out a fork bomb in the terminal on both Linux and Windows. My guess is that on Windows the fork bomb made it into the virtual memory and was recorded to disk and wasn't cleaned out completely during boot.

It took 3 reboots to clear up the errors. Generally on the Linux system one extra reboot was necessary about half of the time.

3 days ago

chiph

This sounds like something the younger generation is having to relearn. When you power down a machine, leave it off for at least 30 seconds. One to let the various capacitors drain & discharge, and Two to let any network devices see that the machine is off so they update their internal tables.

You want your cold boot to truly be a cold boot.

2 days ago

vachina

I’ve come across Acer laptops that’d always bluescreen on restart after a PROCHOT shutdown. The fix is to pull out the battery for a few seconds and then plug it back in, which magically fixes the bluescreen.

3 days ago

bell-cot

I think it was the 1970s when I first heard of the "remove power, wait a good while, try again" strategy.

The subject was a cheap little black & white TV set that my folks had. Dad was an amateur radio operator, who mostly built his own equipment. He could have disassembled it, traced circuits, and calculated the wait time if he'd cared to.

3 days ago

Tsiklon

This sort of issue is a relatively regular occurrence in the server fleet I and my team deal with. A handful of servers a month (out of thousands) end up misbehaving - a system won’t respond to a power up command from the BMC, a PCI-E device doesn’t appear properly after boot up etc etc

Standard troubleshooting before getting the vendor involved or replacing parts is to action a power drain.

BMCs with high uptime can be especially prone to this, often forgetting how to talk to the system they’re attached to.

a day ago

wibbily

Something like this happened to me once. Lost power in a lightning storm and when it came back my computer could no longer shut off.

Like, at all. Would just hang when you tried. Couldn’t exit from BIOS after changing settings, couldn’t suspend to RAM. Had to yoink the cord whenever I needed to restart. Wild stuff.

Perhaps like Frankenstein the lightning was a breath of life, and with its new sentience my PC was trying to preserve its existence. At any rate I reflashed the BIOS after a few months and it never happened again.

3 days ago

wanderr

I had a desktop that would do something similar occasionally ~15 years ago. I am impatient, so rather than leaving it off for a while I would unplug, hit the power button, plug it back in and turn it on. Usually the fans would even spin for a fraction of a second, there was so much residual power in the caps.

3 days ago

petemc_

When managing large numbers of Dell rack mounted servers, a flea power drain is something you become very familiar with.

3 days ago

kccqzy

I've experienced a similar problem with a Thunderbolt port on a machine. Nothing that plugs into the machine would be recognized. Not even a simple USB device. Power cycling multiple times didn't fix it. But powering off and leaving the machine off for a few minutes fixed it.

Given the problem occurred only once, I didn't do any more investigation on why.

3 days ago

fourfour3

I have an Intel NUC 8i5BEH that does this repeatedly. Pain in the backside. The best way to “fix” it that I found quickly is to unplug the NUC and short the power connector for a few seconds. :/

2 days ago

kccqzy

My problem occurred on an Intel NUC11TNK so it might be exactly the same issue as yours.

21 hours ago

zoky

Bad electrons. Turning off the power lets them drain out.

3 days ago

NBJack

I have had several laptops over the years like this. Full shutdown and power on does not reset some problems, like missing audio, missing wifi, etc. For Lenovo devices, I have to go as far as using the 'recovery' button. This goes for DP Alt Mode as well. Kinda annoying, but at least there's a solution.

3 days ago

dxdxdt

I don't get it. That post was a whole bag of nothing. Why are you guys upvoting it?

3 days ago

magicalhippo

I've had it happen to me, so not a whole bag of nothing, and might be surprising to some.

Also, a topic which can spur some interesting comments.

3 days ago

Reventlov

I had the problem on APU4C4, iirc. You install openwrt on it, everything is working fine, then, you reboot and you get nothing on the serial port.

You unplug/plug it, cold boot it, and then it works again.

3 days ago