blurb2m Posted September 21, 2018 (edited)

As soon as I brought the server back up after the reboot from the 6.6.0 upgrade, a drive died. SMART at the time showed it had never had a failure. I'm running a read-check right now, but some of these numbers are crazy. I'm not sure what my next step should be. Could my LSI 9211-8i be causing problems?

tower-diagnostics-20180921-1502.zip

Edited September 21, 2018 by blurb2m
blurb2m (Author) Posted September 21, 2018

I am really leaning towards an LSI error, since Parity, Disk 1, and Disk 2 are all on the HBA.
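(A quick sanity check for which disks actually sit behind the HBA; a minimal sketch, assuming the usual Linux /dev/disk layout is present on the Unraid console:)

# Entries whose path contains the HBA's PCI address and "-sas-" are on the
# LSI card; "-ata-" entries hang off the motherboard SATA ports.
ls -l /dev/disk/by-path/ | grep -v part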
blurb2m (Author) Posted September 21, 2018

I erased the BIOS off the LSI 9211 (it is set to IT mode). Disks 1, 2, 4 and Parity are on the HBA. I tried unmounting Disk 4 and then remounting it. It starts a rebuild and gets to about 2.9%.

Status:
Date: 2018-09-21, 16:28:43 / Duration: 10 min, 54 sec / Speed: Unavailable / Status: Canceled / Errors: 0

This is how it always starts, with Disk 4 failing:

Sep 21 16:28:43 Tower kernel: sd 11:0:1:0: [sdi] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Sep 21 16:28:43 Tower kernel: sd 11:0:1:0: [sdi] tag#0 CDB: opcode=0x8a 8a 00 00 00 00 00 0a 1f 09 30 00 00 04 00 00 00
Sep 21 16:28:43 Tower kernel: print_req_error: I/O error, dev sdi, sector 169806128
Sep 21 16:28:43 Tower kernel: sd 11:0:1:0: [sdi] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Sep 21 16:28:43 Tower kernel: sd 11:0:1:0: [sdi] tag#0 CDB: opcode=0x8a 8a 00 00 00 00 00 0a 1f 0d 30 00 00 04 00 00 00
Sep 21 16:28:43 Tower kernel: print_req_error: I/O error, dev sdi, sector 169807152
Sep 21 16:28:43 Tower kernel: sd 11:0:1:0: [sdi] Synchronizing SCSI cache
Sep 21 16:28:43 Tower kernel: sd 11:0:1:0: [sdi] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
JorgeB Posted September 21, 2018

2 hours ago, blurb2m said: "I am really leaning towards LSI"

Looks like it:

Sep 21 09:12:44 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!

Make sure it's well seated and/or try a different slot.
blurb2m (Author) Posted September 21, 2018

47 minutes ago, johnnie.black said: "Looks like it ... Make sure it's well seated and/or try a different slot."

I moved it over to another slot, an x16 one. Time to start looking for replacement drives, since I used my spare last week...
blurb2m (Author) Posted September 22, 2018

Disk 4 failed with its "red X". It sat this way for a few hours with no other errors. I just tried to stop the array and I'm getting these read errors:

Sep 21 22:22:31 Tower emhttpd: Unmounting disks...
Sep 21 22:22:31 Tower emhttpd: shcmd (1235): umount /mnt/disk1
Sep 21 22:22:31 Tower kernel: XFS (md1): Unmounting Filesystem
Sep 21 22:22:31 Tower emhttpd: shcmd (1236): rmdir /mnt/disk1
Sep 21 22:22:31 Tower emhttpd: shcmd (1237): umount /mnt/disk2
Sep 21 22:22:31 Tower kernel: XFS (md2): Unmounting Filesystem
Sep 21 22:22:31 Tower emhttpd: shcmd (1238): rmdir /mnt/disk2
Sep 21 22:22:31 Tower emhttpd: shcmd (1239): umount /mnt/disk3
Sep 21 22:22:31 Tower kernel: XFS (md3): Unmounting Filesystem
Sep 21 22:22:31 Tower kernel: md: disk0 read error, sector=8590695320
### [PREVIOUS LINE REPEATED 1908015 TIMES] ###

Should I try rolling back to 6.5.3 for giggles and see if that does anything?
itimpi Posted September 22, 2018

You can try rolling back, but I doubt it will achieve anything! Whether the error reports are a side effect of your Disk 4 problem I have no idea. However, Disk0 is the parity disk, and the number of errors suggests it may have dropped offline for some reason. Unraid was likely trying to access it as part of unmounting the "emulated" Disk 4.
blurb2m (Author) Posted September 22, 2018

2 hours ago, itimpi said: "You can try rolling back, but I doubt it will achieve anything! ... Unraid was likely trying to access it as part of unmounting the 'emulated' Disk 4."

I agree about Disk0. These are my first two weeks with an HBA card, but I would expect it to be more reliable than this if it is the source of my issue. There were a few hiccups during firmware updating when trying to use the X399 motherboard, and I had to resort to the Z97 board in my gaming PC to initially flash it over to the IT-mode BIOS and firmware, and again today to erase the BIOS. Just going to keep the server off (what is this word "off"?!?) until I can pick up a new 8 TB drive in the morning.

- Should I boot into safe mode and change the setting to auto-start the array?
- Is there a way to force the array to unmount? Tonight it felt like it was hanging for 30 minutes with no log activity.
JorgeB Posted September 22, 2018

Type reboot on the console; if it doesn't work after a few minutes you'll need to force it. Also, please post new diags showing SMART for all disks, since a few were missing from the previous ones.
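(If the unmount itself is what hangs, a rough sketch of console options, assuming standard Linux tools are present on Unraid and using /mnt/disk4 as an example mount point:)

# See which processes are keeping the mount busy
fuser -vm /mnt/disk4

# Lazy unmount: detach the mount now, clean up once nothing references it
umount -l /mnt/disk4

# Absolute last resort if even 'reboot' hangs: immediate reboot via magic
# SysRq (skips syncing, so only after everything else has failed)
echo b > /proc/sysrq-trigger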
blurb2m (Author) Posted September 22, 2018

Here are the diags from a fresh startup with Disk 4 removed (since it was already marked invalid):

tower-diagnostics-20180922-0332.zip

Not sure what to make of lines 1030-1031 of syslog.txt, or whether that is normal...

Sep 22 03:30:55 Tower kernel: AMD-Vi: Event logged [
Sep 22 03:30:55 Tower kernel: INVALID_DEVICE_REQUEST device=00:00.0 address=0xfffffffdf8000000 flags=0x0a00]
JorgeB Posted September 22, 2018

The disks look fine. If you keep getting errors, there's likely still a problem with the HBA. You also need to check the file system on the emulated disk4; better to do it before rebuilding.

12 minutes ago, blurb2m said: "Sep 22 03:30:55 Tower kernel: AMD-Vi: Event logged [ INVALID_DEVICE_REQUEST device=00:00.0 address=0xfffffffdf8000000 flags=0x0a00]"

This is related to virtualization, probably harmless and certainly unrelated to the current issues.
JorgeB Posted September 22, 2018

And don't forget to check or post SMART for the old disk4 to confirm it's actually failing.
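(From the console, the old disk's SMART report can be pulled with smartctl; a sketch, with /dev/sdd standing in for whatever device letter the old disk 4 gets:)

# Print all SMART attributes, the error log, and self-test history
smartctl -a /dev/sdd

# Optionally start an extended self-test and check the result later
smartctl -t long /dev/sdd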
blurb2m (Author) Posted September 22, 2018

Shouldn't this show as emulated? Do I just run this for a check? The button says: "Check will start Read-Check of all array disks."
JorgeB Posted September 22, 2018

3 minutes ago, blurb2m said: "Shouldn't this show as emulated?"

It is emulated; there's just currently no disk assigned there.

3 minutes ago, blurb2m said: "Do I just run this for a check?"

No, this: https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui
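(The wiki page covers the webGui route; the console equivalent is roughly the following sketch, assuming the array is started in maintenance mode so /dev/md4 exists but is not mounted:)

# Read-only check of the emulated disk 4; -n means no changes are written
xfs_repair -n /dev/md4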
blurb2m (Author) Posted September 22, 2018 (edited)

I have a SMART report for the old Disk 4 from 15SEP:

WDC_WD30EFRX-68EUZN0_WD-WMC4N0J179UN-20180915-1502 disk4 (sdd).txt

When I tried stopping the array, unmounting Disk 4, starting the array, stopping, then remounting, it would run a rebuild and only get 0.5%-2.9% of the way before erroring out and disabling the disk. Also attached a Safe Mode snapshot of Disk 4 at 2300 on 21SEP.

Starting Check Filesystem on Disk 4 now.

Check Filesystem on Disk 4 results:

Phase 1 - find and verify superblock...
        - block cache size set to 1512304 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 2356505 tail block 2356499
ALERT: The filesystem has valuable metadata changes in a log which is being ignored because the -n option was used. Expect spurious inconsistencies which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
sb_fdblocks 489295541, counted 490033385
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 1
        - agno = 3
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

XFS_REPAIR Summary    Sat Sep 22 04:10:12 2018

Phase       Start           End             Duration
Phase 1:    09/22 04:09:58  09/22 04:09:58
Phase 2:    09/22 04:09:58  09/22 04:09:58
Phase 3:    09/22 04:09:58  09/22 04:10:03  5 seconds
Phase 4:    09/22 04:10:03  09/22 04:10:04  1 second
Phase 5:    Skipped
Phase 6:    09/22 04:10:04  09/22 04:10:12  8 seconds
Phase 7:    09/22 04:10:12  09/22 04:10:12

Total run time: 14 seconds

Edited September 22, 2018 by blurb2m (added check filesystem status)
JorgeB Posted September 22, 2018

8 minutes ago, blurb2m said: "Also attached a Safe Mode snapshot of Disk 4 at 2300 on 21SEP."

Looks fine; I doubt it's the problem.

9 minutes ago, blurb2m said: "Check Filesystem on Disk 4 results:"

You need to remove the -n (no modify) flag or no repair will be done.
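(In other words, the same check re-run without the no-modify flag, something like:)

# Repair for real this time; -v only adds verbose output
xfs_repair -v /dev/md4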
blurb2m (Author) Posted September 22, 2018

Phase 1 - find and verify superblock...
        - block cache size set to 1512304 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 2356505 tail block 2356499
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_repair. If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair. Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.

Run it with -vL?
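(Per the message itself, the gentler path is to mount the filesystem once so the journal gets replayed, then unmount and re-run the repair; -L is only the fallback if the mount fails. A sketch, with an arbitrary temporary mount point; starting the array normally once so disk 4 mounts should achieve the same replay:)

# Replay the log by mounting, then unmount and repair again
mkdir -p /tmp/disk4
mount /dev/md4 /tmp/disk4 && umount /tmp/disk4
xfs_repair -v /dev/md4

# Only if the mount itself fails: destroy the log and accept possible loss
# xfs_repair -vL /dev/md4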
blurb2m (Author) Posted September 22, 2018

Phase 1 - find and verify superblock...
        - block cache size set to 1512304 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 2356505 tail block 2356499
ALERT: The filesystem has valuable metadata changes in a log which is being destroyed because the -L option was used.
xfs_repair: libxfs_device_zero write failed: Input/output error
blurb2m (Author) Posted September 22, 2018

Figured I would include new diags, since I got a nice popup stating:

unRAID array errors: 22-09-2018 04:43
Warning [TOWER] - array has errors
Array has 3 disks with read errors

tower-diagnostics-20180922-0451.zip
JorgeB Posted September 22, 2018

The HBA failed again:

Sep 22 04:24:47 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Sep 22 04:24:48 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!

You'll need to test with another HBA, or use this one in a different board, to find out where the problem is.
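(That message is easy to watch for while testing; a sketch, assuming the standard Unraid syslog location:)

# Show any controller fault messages logged so far
grep -i mpt2sas /var/log/syslog | tail -n 20

# Or watch live while exercising the card in the new slot/board
tail -f /var/log/syslog | grep --line-buffered mpt2sas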
blurb2m (Author) Posted September 22, 2018

Which is the better solution: run with two drives down until a new one can come in, or run with one array drive down and one cache drive down? I only have 8 motherboard ports until the new card arrives.
JorgeB Posted September 22, 2018

You only have one parity drive, so the array can only have one disabled drive.
blurb2m (Author) Posted September 23, 2018

@johnnie.black thanks for all the help. I'll update Tuesday when the LSI Logic SAS 9207-8i Storage Controller (LSI00301) comes in.
Maticks Posted September 23, 2018

I ran into the same issues on 6.6.0, and it started complaining about open files and inotify watcher issues. I even ran into some weird BTRFS issues, on my cache drive of all things. LSI controller and onboard motherboard controller. Four hours back on 6.5.3 and it's very stable again... I think I'll give 6.6.0 a miss till the bugs are ironed out.