[6.11.1] Server goes to "sleep" after migrating mobo (and other issues)


Solved by Squid


Hi guys, looking for theories/advice/commiseration.

Some context:

Two weeks ago, I replaced my very old boot USB with a highly-reviewed Samsung USB (per spaceinvaderone). This was in preparation for moving into a new case with more storage capacity and migrating motherboards from one used PC to another. After the swap, my server was humming along perfectly with the new USB in the old hardware configuration. My dream is for the old Unraid server to be a machine for pre-clearing drives when I get new ones, and for the one I am building to be the main storage behemoth.

This past weekend, I shut down to rebuild in the new case. The new case has a brand new Corsair RM850 PSU and a low-profile Noctua CPU cooler, since it is a rackmount case. The CPU is the i7-6700 from the previous build, and it is going into an Asus TUF Z270 Mark 1 motherboard that was recently used in an old gaming PC. For storage I am adding 3 more 16TB drives, expanding my array from 2x16TB to 5x16TB while keeping 1 parity drive. I am also adding a second M.2 SSD, which is mostly why I am using this motherboard: in the previous build I had 1 M.2 SSD strictly for the appdata share, and the new one will be for the cache pool. Additionally, I am adding a pre-flashed Dell H310 6Gbps SAS HBA (LSI 9211-8i) from ArtOfServer's eBay store, because it will support some of the drives in this new case. The build goes smoothly.

 

First boot shows the Asus splash screen, then the HBA's BIOS takes over and tells me it recognizes the one drive that is connected to it; that's good. We go back to the Asus splash and I open the BIOS. Enable CPU virtualization, enable VT-d (the passthrough-devices one), set boot order to USB (UEFI). Can't recall if I was using legacy or UEFI before, so I just set UEFI first, then legacy second. Save and exit.

Loop around and boot is blocked by an American Megatrends splash saying I need to press F1 to run setup since the RAID configuration has changed (what?). I open the BIOS and check the drives, and see all drives are marked AHCI. I notice here that one HDD seems to be missing. I save and exit. I inspect the SATA cables; they are plugged in well, so I'm not sure why that drive didn't appear. Find the boot setting to disable "wait for F1 if boot errors".

 

Reboot, smooth to Unraid. Go to the desktop PC to manage from the web UI. Nothing unusual inside the OS, except the drive is missing. I start the pre-clear process for the recognized drives (because I am stupid) and wonder whether using two M.2 slots now takes over one of my SATA ports, which would explain the missing HDD. The next day I check on the pre-clear and find I cannot connect to the local URL. I go to the box and plug in a monitor; the server appears asleep. I cannot wake it up with anything. OK, have to shut down physically. Turn off the PSU. Take this opportunity to see if the missing HDD is recognized on a different SATA port, and move the cable.

 

Reboot. The American Megatrends blocker appears again and does not go away. I have to hit F1 to go to the BIOS. I notice that all my BIOS settings are wiped, so I apply them again. CMOS battery problem? This mobo is like 3-5 years old, I think. Apply everything, save, reset, boot Unraid.

 

Unraid OS looks normal. Pre-clear had finished, but because of the unexpected shutdown Unraid wants to run a parity check when the array starts. OK. Start array, go to start the VM for HomeAssistant. ERROR: YOU MUST ENABLE VIRTUALIZATION ON YOUR MOBO. Wtf? I just set that. OK, pause parity check. Shut down from the UI. Boot to BIOS. American Megatrends appears again; yes, the mobo is misbehaving for sure. Set all the settings again, and this time also disable C-State management, hoping this is what is putting the server to sleep. Boot to Unraid. Parity check wants to run because of the unclean shutdown; I'm thinking I am fixed now. Run parity overnight alongside qBittorrent/Plex/Stash/Tautulli and the HomeAssistant VM. Parity runs into the next day (17-hour parity check) and eventually it's done. After work I check on the server: can't load the UI, go look at the machine, it's asleep. This time I turn it on, start the array, pause the parity check, and turn on syslog copy to USB. If it sleeps, at least I will know why.
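When a machine hangs or "sleeps" without logging a reason, the moment it went silent can still be useful. Here is a rough Python sketch that finds the longest stretch with no log entries in a classic `Mon DD HH:MM:SS` syslog. The function names, the sample lines, and the default year are all made up for illustration; they are not taken from the attached syslog.

```python
from datetime import datetime, timedelta

def parse_ts(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp; return None if absent."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None

def largest_gap(lines, year=2023):
    """Return (gap, line_before, line_after) for the longest silence, or None."""
    best = None
    prev_ts, prev_line = None, None
    for line in lines:
        ts = parse_ts(line, year)
        if ts is None:
            continue  # skip lines without a recognizable timestamp
        if prev_ts is not None:
            gap = ts - prev_ts
            if best is None or gap > best[0]:
                best = (gap, prev_line, line)
        prev_ts, prev_line = ts, line
    return best

# Hypothetical sample lines, only here to exercise the function.
sample = [
    "Apr 27 02:13:58 castle kernel: example message A",
    "Apr 27 02:14:09 castle kernel: example message B",
    "Apr 27 07:31:45 castle kernel: example message C",
    "Apr 27 07:31:46 castle kernel: example message D",
]
gap, before, after = largest_gap(sample)
```

On a real file, `largest_gap(open(path))` should work the same way, where `path` is wherever the mirrored syslog ended up. The last line before the gap is usually the first place to look, though a hard hang often leaves nothing there at all.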

 

Sometime overnight, it goes to sleep again. In the morning (today) I boot to Unraid and have a syslog that does not tell me anything useful that I can see. Unraid says that two of my 16TB drives (parity and another) now have a UDMA CRC error count of 1, very scary. I bought a new CMOS battery, which should arrive tomorrow. Thinking of just turning off the server until I have it installed; I'm scared that these perpetual sleeps and shutdowns will ruin my drives. Looking for advice, or other theories.
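For keeping an eye on that counter outside the Unraid UI, here is a small Python sketch that pulls the raw value of SMART attribute 199 (UDMA_CRC_Error_Count) out of text in the layout `smartctl -A` prints. The sample text is invented for illustration, and the function name is hypothetical. As far as I know, this counter usually reflects problems on the link (cable, connector) rather than the disk surface, and it does not reset, so what matters is whether it keeps climbing.

```python
def crc_error_count(smart_text):
    """Return the raw value of SMART attribute 199, or None if not present."""
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns; the raw value is the last one.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None

# Invented sample in the column layout of `smartctl -A`, not real drive data.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       5112
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
"""

count = crc_error_count(SAMPLE)
```

Recording the value after each reseat or cable swap and comparing later is a simple way to tell whether the change helped.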

 

Attached diagnostics and syslog.

 

TL;DR: New motherboard seems to make the server fall asleep at night. After reboot, the BIOS seems to wipe its settings. Old CMOS battery to blame for everything?

castle-diagnostics-20230427-1137.zip syslog


No, I haven't run a memtest yet. The memory, CPU, and two 16TB HDDs (1 parity, 1 storage) were brought over from the previous iteration of the server, which had been running stable for about 7 months, so I expected the memory to perform fine. My current ongoing test is to see if the server goes to sleep when I never start the array, since after seeing the UDMA CRC errors I thought maybe some of the new drives are at fault. Does that seem fruitless?

My thinking was: if the server never goes to sleep with the array stopped, there is some sort of unlogged issue involving the drives, containers, or VM. If the server still goes to sleep, whatever is causing the sleep has nothing to do with the drives.

24 minutes ago, strictlyparanoid said:

[Attached image: scrubbed.jpg]

Checked on memtest before going to bed, seems like something has been identified!

Man, am I reading that right, 26,000 MB is failing? So basically all the RAM is toast? (The config is 8GB in each of 4 slots.)

No, that shows that there are 2 failing addresses, each of which had a bit flipped, and quite likely both are on the same RAM stick.

However, even 1 error is too many, as all I/O goes via RAM, so that needs to be fixed.
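To make "a bit flipped" concrete: a memory tester writes a known pattern, reads it back, and compares. XOR-ing the expected and actual values leaves a 1 only in the positions that differ. The Python sketch below uses invented values, not the numbers from the screenshot, and the function name is hypothetical.

```python
def flipped_bits(expected, actual):
    """Return the bit positions (0 = least significant) where the values differ."""
    diff = expected ^ actual  # 1 only where the two values disagree
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]

# Invented example: pattern written vs. pattern read back.
expected = 0xFFFFFFFF
actual = 0xFFFFFFEF
bits = flipped_bits(expected, actual)  # a single differing position
```

A single differing position at one address is a one-bit error, which is a very different picture from tens of thousands of megabytes being bad.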


Since my RAM was 2 different sets of 2 sticks, I've pulled out the two sticks that were significantly older, and now I am 2 passes in with no errors.

After work, I will attempt booting with just the two passing RAM sticks.

 

Still, after some of the unexpected shutdowns it seemed the BIOS was wiped. Is that indicative of a dead/dying CMOS battery? Or can that also be blamed on my faulty RAM?

 

