[SOLVED] Odd system behavior, need help repairing

May 15, 2017

Hi all,

Last night I went to upgrade my unRAID system from 6.2.2 to the latest release and noticed some escalating odd behavior. First, the "automated" (click here to upgrade) upgrade didn't work; it said something to the effect of "unable to write to flash". I thought that was odd, so, being normally a Windows guy, I decided to reboot because a "reboot fixes everything". I stopped the array, and as soon as I did, one of the drives became unavailable (red X) and two more became "unknown" (with the expected drive name there and a dropdown to choose a disk). Very odd.

I rebooted the server, and unRAID and all the drives came back up fine, so I was able to upgrade the OS. All drives reported as available. Since it had been rebooted several times at this point, I elected to run a parity check. This ran all day (the server doesn't have the fastest processor in the world), and when I came back in the evening to check on it, the parity check was listed as "incomplete" and one of the drives had become unavailable. On the display attached to the unRAID server I noticed a ton of XFS and I/O errors.

I shut down the server and checked all of the drive mountings and the cabling, and all seemed well. I reseated the cables and the drives just to be on the safe side and fired the server back up. When I did, the display was reporting similar I/O errors, and now the web GUI is unresponsive (the page doesn't even load).

I've got it rebooted into "safe mode" now as a precautionary measure, as I am not home and would like to troubleshoot remotely. Can anyone advise what my next steps are? Thanks in advance!

Edited May 15, 2017 by jfeeser

May 15, 2017

Community Expert

Run a Memtest (from the boot menu). That is a good start when you have a number of different issues with changing symptoms. Flaky power supplies have also been known to create similar problems.

EDIT: You could also upload a diagnostics file: 'Tools' >>> 'Diagnostics', or type diagnostics on the command line. (The latter puts the file in the logs folder/directory of the flash drive.)
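For anyone doing this over telnet or SSH, here is a minimal sketch of grabbing the diagnostics and finding the resulting zip. It assumes the logs folder on the flash drive is /boot/logs (the path in the shell prompt later in this thread) and that file names follow the hostname-diagnostics-date-time.zip pattern of the attachments posted here. `latest_diag` is a hypothetical helper name, not an unRAID command.

```shell
# Hypothetical helper (not part of unRAID): print the newest diagnostics zip
# in a directory. Defaults to /boot/logs, assumed to be the flash logs folder.
latest_diag() {
    local dir="${1:-/boot/logs}"
    # Sort by modification time, newest first; print nothing if none exist.
    ls -t "$dir"/*-diagnostics-*.zip 2>/dev/null | head -n 1
}

# Typical use on the server, after starting the array:
#   diagnostics      # writes the zip to the flash drive
#   latest_diag      # prints the path of the file to attach
```

The helper only lists files; it never modifies anything on the flash drive.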

Edited May 15, 2017 by Frank1940

May 15, 2017

Community Expert

There's probably filesystem corruption on one of the disks. Start the array and then grab the diags on the CLI like Frank posted.

May 15, 2017

Author

Sounds good. Can I do either of those things remotely from safe mode? I only ask because I don't have physical access to the server right now; it's booted into safe mode, and I'm telnetted in.

May 15, 2017

Community Expert

8 minutes ago, jfeeser said:

Can I do either of those things remotely from safe mode?

Start the array and grab the diags, yes.

May 15, 2017

Author

Apologies, how do I accomplish that? It's sad: I know Windows and network gear backwards and forwards, but anything beyond the basics in *nix and I'm kind of out of my depth.

May 15, 2017

Community Expert

SSH into your server (google PuTTY).

May 15, 2017

Author

I at least got that far

I mean the commands to start the array from within safe mode.

May 15, 2017

Community Expert

I thought you were already in safe mode; if not, there's no need to start in safe mode.

May 15, 2017

Author

Right, what I'm saying is that I'm currently in safe mode and would like to start the array to do the diagnostics you guys mentioned. Can that be done from safe mode, or do I need to reboot into "normal" mode? If I can start the array from safe mode, what are the commands to do so?

May 15, 2017

Community Expert

You start the array normally using the GUI.

May 15, 2017

Author

Hah, silly me. I assumed that safe mode was CLI-only and never actually bothered to check if the webGUI worked. Guess the caffeine hasn't kicked in yet. I'll pull the diagnostics and report back.

May 15, 2017

Author

There's the diagnostics file.

feezfileserv-diagnostics-20170515-1003.zip

May 15, 2017

Community Expert

We need the diags after starting the array.

Edited May 15, 2017 by johnnie.black

May 15, 2017

Author

Here you go. Of note is that when I started the array this time, a _different_ disk showed up as unmountable in addition to the one that had red-X'ed previously. SMART status for _all_ of my drives (even the X'ed-out one) is green.

feezfileserv-diagnostics-20170515-1009.zip

May 15, 2017

Community Expert

You didn't grab the logs after the errors and before rebooting, so I'm just guessing, but with problems on multiple disks and a SASLP in the system, the controller would be my prime suspect. Don't forget to grab the diags before rebooting if it happens again.

For now, run xfs_repair on disk4 (md4):

https://lime-technology.com/wiki/index.php/Check_Disk_Filesystems#Drives_formatted_with_XFS

And then rebuild disk9 using the old disk since SMART looks good:

http://lime-technology.com/wiki/index.php/Troubleshooting#Re-enable_the_drive

Edited May 15, 2017 by johnnie.black

May 15, 2017

Author

Thanks. When doing the xfs_repair on md4, it spits this out:

root@feezfileserv:/boot/logs# xfs_repair -v /dev/md4
Phase 1 - find and verify superblock...
        - block cache size set to 663264 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 776022 tail block 775959
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

(this is after stopping the array and restarting it in maintenance mode)

Should I just go ahead and run "xfs_repair -Lv /dev/md4", or is there something else I should try first?

May 15, 2017

Community Expert

Use -L. It's normal in these cases, and usually there's no data loss.
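The exchange above boils down to a short decision: run xfs_repair, and reach for -L only when it refuses because of a dirty log. Below is a rough sketch, assuming the array is started in maintenance mode and the device is /dev/md4 as in this thread. `needs_log_zeroing` is a hypothetical helper that simply looks for the error text quoted above; it is not part of xfsprogs.

```shell
# Hypothetical helper: read xfs_repair output on stdin and succeed (exit 0)
# only if it contains the dirty-log error shown in this thread.
needs_log_zeroing() {
    grep -q "valuable metadata changes in a log"
}

# Sketch of the sequence (array in maintenance mode, disk4 = /dev/md4):
#   out=$(xfs_repair -v /dev/md4 2>&1)
#   if printf '%s\n' "$out" | needs_log_zeroing; then
#       xfs_repair -Lv /dev/md4   # zeroes the log; recent changes may be lost
#   fi
```

As the tool's own message says, mounting the filesystem to replay the log is the safer route whenever the disk can still be mounted; -L is the fallback when it cannot.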

May 15, 2017

Author

Okay. After all of that, disk 4 was detected properly again and disk 9 is rebuilding. Time to re-verify that all my backups are up to date.

Thank you SO MUCH for all your help, and putting up with my novice-ness.

May 15, 2017

Author

Looks like I may have spoken too soon... the rebuild of disk 9 seems to have just stopped itself, and the drive is back to a red X. Here's a new diagnostic dump. Any thoughts?

feezfileserv-diagnostics-20170515-1057.zip

May 15, 2017

Community Expert

Like I suspected, it's the SASLP. These sometimes help:

- disable VT-d if not needed

- look for a board BIOS upgrade

- use the controller in a different slot if available

If nothing helps, the best solution is to replace it with an LSI controller.

May 15, 2017

Author

Strange. Any idea what could've caused the sudden change? I've been using the server in this configuration for almost a year without incident.

May 15, 2017

Community Expert

It happens to a significant number of users with both the SASLP and the SAS2LP. Any change in hardware or software (like a kernel change from upgrading unRAID) can trigger the issue.

May 15, 2017

Author

After searching the forums for recommended models, do you think this would be a suitable replacement?

https://www.amazon.com/3P0R3-Controller-PCI-E-mini-SAS-PowerEdge/dp/B00ZSXK1YO/ref=sr_1_1?s=electronics&ie=UTF8&qid=1494866284&sr=1-1&keywords=dell+perc+h310

May 15, 2017

Community Expert

Yes, but it needs to be crossflashed to LSI IT mode:

http://lime-technology.com/oldforum/index.php?topic=12767.msg259006#msg259006
