May 15, 2017

Hi all,

Last night I went to upgrade my unRAID system from 6.2.2 to the latest and greatest, and noticed some escalating odd behavior. First, when I tried the "automated" (click here to upgrade) upgrade, it didn't work; it said something to the effect of "unable to write to flash". I thought that was odd, so being a normally Windows guy I decided to reboot, because a "reboot fixes everything". I stopped the array, and as soon as I did, one of the drives became unavailable (red X) and two more became "unknown" (with the expected drive name there and a dropdown to choose a disk). Very odd.

I rebooted the server, and unRAID as well as all the drives came back up fine, and I was able to upgrade the OS. All drives reported as available. Since it had been rebooted several times at this point, I elected to run a parity check. This ran all day (the server doesn't have the fastest processor in the world), and when I came back in the evening to check on it, the parity check was listed as "incomplete" and one of the drives had become unavailable. On the display attached to the unRAID server I noticed a ton of XFS and I/O errors.

I shut down the server and checked all of the drive mountings and the cabling, and all seemed well. I reseated the cables and the drives just to be on the safe side, and fired the server back up. When I did, the display was reporting similar I/O errors, and now the web GUI is unresponsive (the page doesn't even load). I've got it rebooted into "safe mode" now as a precautionary measure, as I am not home and would like to troubleshoot remotely.

Can anyone advise what my "next steps" are? Thanks in advance!

Edited May 15, 2017 by jfeeser
May 15, 2017, Community Expert
Run a memtest (from the boot menu). That is a good start when you have a number of different issues with changing symptoms. Flaky power supplies have been known to create similar problems.
EDIT: You could upload a diagnostics file: 'Tools' >>> 'Diagnostics', or type diagnostics on the command line. (The latter puts the file in the logs folder of the flash drive.)
Edited May 15, 2017 by Frank1940
May 15, 2017, Community Expert
There's probably filesystem corruption on one of the disks. Start the array and then grab the diags on the CLI like Frank posted.
May 15, 2017, Author
Sounds good. Can I do either of those things remotely from safe mode? I only ask because I don't have physical access to the server right now; it's booted into safe mode, and I'm telnetted in.
May 15, 2017, Community Expert
jfeeser said: "Can i do either of those things remotely from safe mode?"
Start the array and grab the diags: yes.
May 15, 2017, Author
Apologies, how do I accomplish that? It's sad: I know Windows and network gear backwards and forwards, but with anything beyond the basics in *nix I'm kind of out of my depth.
May 15, 2017, Author
I at least got that far. I mean the commands to start the array from within safe mode.
May 15, 2017, Community Expert
I thought you were already in safe mode. If not, there's no need to start in safe mode.
May 15, 2017, Author
Right, what I'm saying is that I'm currently in safe mode and would like to start the array to run the diagnostics you guys mentioned. Can that be done from safe mode, or do I need to reboot into "normal" mode? If I can start the array from safe mode, what are the commands to do so?
May 15, 2017, Author
Hah, silly me. I assumed that safe mode was CLI only and never actually bothered to check whether the web GUI worked. Guess the caffeine hasn't kicked in yet. I'll pull the diagnostics and report back.
May 15, 2017, Community Expert
We need the diags after starting the array.
Edited May 15, 2017 by johnnie.black
May 15, 2017, Author
Here you go. Of note: when I started the array this time, a _different_ disk showed up as unmountable, in addition to the one that had red-X'ed previously. SMART status for _all_ of my drives (even the X'ed-out one) is green.
feezfileserv-diagnostics-20170515-1009.zip
May 15, 2017, Community Expert
You didn't grab the logs after the errors and before rebooting, so I'm just guessing, but with problems on multiple disks and a SASLP in the system, the controller would be my prime suspect. Don't forget to grab the diags before rebooting if it happens again.

For now, run xfs_repair on disk4 (md4):
https://lime-technology.com/wiki/index.php/Check_Disk_Filesystems#Drives_formatted_with_XFS

And then rebuild disk9 using the old disk, since SMART looks good:
http://lime-technology.com/wiki/index.php/Troubleshooting#Re-enable_the_drive

Edited May 15, 2017 by johnnie.black
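The advice to capture diagnostics before any reboot could be sketched roughly as follows. This is only an illustration: `latest_diagnostics` is a hypothetical helper, and the `/boot/logs` location is an assumption based on the "logs folder of the flash drive" remark and the shell prompt seen in this thread. The file name pattern follows the archives attached here.

```shell
# Sketch only. Assumes diagnostics archives land in /boot/logs (the flash
# drive's logs folder) and are named like the ones attached in this thread.

# Print the most recently modified diagnostics zip in a directory
# (hypothetical helper, defaults to /boot/logs).
latest_diagnostics() {
    dir="${1:-/boot/logs}"
    ls -t "$dir"/*-diagnostics-*.zip 2>/dev/null | head -n 1
}

# On the server, while the errors are still showing and before any reboot
# (not run here):
#   diagnostics
#   latest_diagnostics /boot/logs
```

Because the flash drive survives a reboot, the archive can still be copied off and attached to a post afterwards.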
May 15, 2017, Author
Thanks. When running xfs_repair on md4, it spits this out:

    root@feezfileserv:/boot/logs# xfs_repair -v /dev/md4
    Phase 1 - find and verify superblock...
            - block cache size set to 663264 entries
    Phase 2 - using internal log
            - zero log...
    zero_log: head block 776022 tail block 775959
    ERROR: The filesystem has valuable metadata changes in a log which needs to
    be replayed.  Mount the filesystem to replay the log, and unmount it before
    re-running xfs_repair.  If you are unable to mount the filesystem, then use
    the -L option to destroy the log and attempt a repair.
    Note that destroying the log may cause corruption -- please attempt a mount
    of the filesystem before doing this.

(This is after stopping the array and restarting it in maintenance mode.)

Should I just go ahead and run "xfs_repair -Lv /dev/md4", or is there something else I should try first?
May 15, 2017, Community Expert
Use -L. It's normal in these cases, and usually there's no data loss.
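As a rough sketch of the decision described in the last two posts, the helper below looks at saved xfs_repair output and prints the command to run next. It only recognizes the dirty-log error quoted earlier in the thread and runs nothing itself. The function names and the temporary file path are hypothetical; `/dev/md4` is simply the device from this thread. Since `-L` discards the log, it is meant for disks that cannot be mounted to replay it, as was the case here.

```shell
# Sketch only: decide the next step from saved xfs_repair output.
# Nothing here modifies a disk; it only prints a suggestion.

# Succeeds (exit 0) when the text on stdin contains the dirty-log error
# that xfs_repair printed in the post above.
log_needs_replay() {
    grep -q "valuable metadata changes in a log"
}

# Print a suggested next command for a device, given a file holding the
# output of a previous xfs_repair run (hypothetical helper).
next_step() {
    dev="$1"
    output_file="$2"
    if log_needs_replay < "$output_file"; then
        echo "xfs_repair -L -v $dev"
    else
        echo "no log zeroing needed for $dev"
    fi
}

# On the server, with the array started in maintenance mode (not run here):
#   xfs_repair -v /dev/md4 2>&1 | tee /tmp/md4-repair.txt
#   next_step /dev/md4 /tmp/md4-repair.txt
```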
May 15, 2017, Author
Okay. After all of that, disk 4 was detected properly again, and disk 9 is rebuilding. Time to re-verify that all my backups are up to date. Thank you SO MUCH for all your help, and for putting up with my novice-ness.
May 15, 2017, Author
Looks like I may have spoken too soon... the parity rebuild for disk 9 seems to have just stopped itself, and the drive is back to a red X. Here's a new diagnostic dump. Any thoughts?
feezfileserv-diagnostics-20170515-1057.zip
May 15, 2017, Community Expert
Like I suspected, it's the SASLP. These sometimes help:
- disable VT-d if not needed
- look for a board BIOS upgrade
- use the controller in a different slot if available
If nothing helps, the best solution is to replace it with an LSI controller.
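Before changing BIOS settings, it may help to see whether the kernel log mentions the IOMMU at all. The filter below is a minimal sketch: the search terms are an assumption about how VT-d typically shows up in kernel messages, `iommu_lines` is a hypothetical helper, and an empty result is not proof that VT-d is off.

```shell
# Sketch only: keep kernel log lines that mention the IOMMU (VT-d).
# The search terms are an assumption about typical kernel messages.
iommu_lines() {
    grep -i -e 'DMAR' -e 'IOMMU'
}

# On the server (not run here):
#   dmesg | iommu_lines
```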
May 15, 2017, Author
Strange. Any idea what could've caused the sudden change? I've been using the server in this configuration for almost a year without incident.
May 15, 2017, Community Expert
It happens to a significant number of users with both the SASLP and the SAS2LP. Any change in hardware or software (like a kernel change from upgrading unRAID) can trigger the issue.
May 15, 2017, Author
After searching the forums for recommended models, do you think this would be a suitable replacement?
https://www.amazon.com/3P0R3-Controller-PCI-E-mini-SAS-PowerEdge/dp/B00ZSXK1YO/ref=sr_1_1?s=electronics&ie=UTF8&qid=1494866284&sr=1-1&keywords=dell+perc+h310
May 15, 2017, Community Expert
Yes, but it needs to be crossflashed to LSI IT firmware:
http://lime-technology.com/oldforum/index.php?topic=12767.msg259006#msg259006