WuzzyFuzzy Posted February 22, 2018
Version: 6.3.5
Plugins: see attached image
Dockers: see attached image (Krusader was not running)
Hardware: https://pcpartpicker.com/list/s4PBQV (not running through the NIC, and there are various other older HDDs in there alongside the WD Reds)
The system had been running stable for ~28 days; I woke up to find all Dockers stopped. The mover finished running at 2:17 AM, followed minutes later by a general protection fault, which led to a host of other issues I am unfamiliar with. tower-syslog-20180221-0742.zip
trurl Posted February 22, 2018
Could you elaborate? Are you having a problem? I assume 6.3.5 is the unRAID version and the title is an error message or something. Where are you seeing the message? If you can, go to Tools - Diagnostics in the webUI and post the complete diagnostics zip.
WuzzyFuzzy Posted February 22, 2018 (Author)
Sorry about that; I prematurely submitted the post. Additional info has been added.
JorgeB Posted February 22, 2018
Your NVMe device dropped offline, causing filesystem issues:
Feb 21 03:00:19 Tower kernel: nvme nvme0: I/O 198 QID 0 timeout, reset controller
Feb 21 03:01:13 Tower kernel: nvme nvme0: Device not ready; aborting reset
Feb 21 03:01:13 Tower kernel: nvme nvme0: completing aborted command with status: 0007 (line repeated six times)
Feb 21 03:01:13 Tower kernel: blk_update_request: I/O error, dev nvme0n1, sector 87467168
Feb 21 03:01:13 Tower kernel: blk_update_request: I/O error, dev nvme0n1, sector 125809520
Feb 21 03:01:13 Tower kernel: blk_update_request: I/O error, dev nvme0n1, sector 662282784
Feb 21 03:01:13 Tower kernel: blk_update_request: I/O error, dev nvme0n1, sector 87467216
Feb 21 03:01:13 Tower kernel: blk_update_request: I/O error, dev nvme0n1, sector 87467144
Feb 21 03:01:13 Tower kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Feb 21 03:01:13 Tower kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Feb 21 03:01:13 Tower kernel: blk_update_request: I/O error, dev nvme0n1, sector 87467184
There's also filesystem corruption on disk7:
Feb 21 07:29:16 Tower kernel: XFS (md7): xfs_iread: validation failed for inode 17179869283 failed
Feb 21 07:29:16 Tower kernel: ffff880103e1a600: 94 af c2 88 98 e9 4e eb 88 c3 07 7b e2 aa 78 7c  ......N....{..x|
Feb 21 07:29:16 Tower kernel: ffff880103e1a610: 1f 49 9a 1f 98 70 63 08 58 67 0d 9c 90 9f 2b c5  .I...pc.Xg....+.
Feb 21 07:29:16 Tower kernel: ffff880103e1a620: 46 d3 7f 9a 4e cb f2 c5 eb 28 45 ad f7 a5 c2 2e  F...N....(E.....
Feb 21 07:29:16 Tower kernel: ffff880103e1a630: 76 24 e6 87 46 9c bc d1 39 41 13 d6 d2 22 05 5d  v$..F...9A...".]
Feb 21 07:29:16 Tower kernel: XFS (md7): Internal error xfs_iread at line 514 of file fs/xfs/libxfs/xfs_inode_buf.c. Caller xfs_iget+0x3b8/0x5d4
Feb 21 07:29:16 Tower kernel: CPU: 3 PID: 9027 Comm: shfs Tainted: G D W 4.9.30-unRAID #1
Feb 21 07:29:16 Tower kernel: Hardware name: System manufacturer System Product Name/PRIME B350-PLUS, BIOS 0803 06/05/2017
Feb 21 07:29:16 Tower kernel: ffffc900132dfaa0 ffffffff813a4a1b ffff8803fc50d000 ffffffff81964ca9
Feb 21 07:29:16 Tower kernel: ffffc900132dfab8 ffffffff8129c98d ffffffff812a1c44 ffffc900132dfaf0
Feb 21 07:29:16 Tower kernel: ffffffff8129c9de 0000020200000020 ffff880104b2e180 00000000ffffff8b
...
Feb 21 07:29:17 Tower kernel: XFS (md7): Corruption detected. Unmount and run xfs_repair
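For anyone triaging a similar syslog later, the relevant lines can be pulled out with a quick grep. The sample file below is a hypothetical three-line stand-in for the real log; on a live unRAID box you would grep /var/log/syslog directly:

```shell
# Hypothetical stand-in for /var/log/syslog, for illustration only.
cat > /tmp/syslog.sample <<'EOF'
Feb 21 03:00:19 Tower kernel: nvme nvme0: I/O 198 QID 0 timeout, reset controller
Feb 21 03:01:13 Tower kernel: blk_update_request: I/O error, dev nvme0n1, sector 87467168
Feb 21 07:29:17 Tower kernel: XFS (md7): Corruption detected. Unmount and run xfs_repair
EOF
# Pull out NVMe resets, block-layer I/O errors, and filesystem complaints:
grep -E 'nvme|blk_update_request|BTRFS error|XFS' /tmp/syslog.sample
```

On a real system the same pattern applied to the full syslog makes the timeline obvious: the controller reset first, then the block-layer errors, then the filesystem complaints.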
WuzzyFuzzy Posted February 22, 2018 (Author)
Any idea what could cause the NVMe issue? The NVMe cache drive has dropped offline before on a reboot. The tower itself hasn't been moved in weeks, so I don't think it's a physical connection issue; it's an M.2 drive, so there are no loose cables to pinpoint. Not sure how to prevent this from happening again.
Was the disk7 corruption likely a consequence of the NVMe issue, or just an unfortunate coincidence in timing?
Should the xfs_repair be done first, or does the cache drive need to be sorted out prior?
Thank you for your feedback.
Sent from my SM-G950U using Tapatalk
trurl Posted February 22, 2018
For future reference, you did not actually follow my instructions:
9 hours ago, trurl said: go to Tools - Diagnostics in the webUI and post the complete diagnostics zip.
We would much rather have the complete diagnostics, which contain the syslog, SMART for all disks, and many other useful things. If everyone would just post the diagnostics, it would save us and them a lot of trouble. It might even be useful to post them now.
JorgeB Posted February 22, 2018
52 minutes ago, WuzzyFuzzy said: Any idea what could cause the NVMe issue? The NVMe cache drive has dropped offline before on a reboot.
Like trurl mentioned, post the diags; the syslog you posted isn't even complete, so I can't see which NVMe model you have. Some have known issues, e.g. the WD Black.
54 minutes ago, WuzzyFuzzy said: Was it likely that the drive 7 corruption was a consequence of the other issue? Or just an unfortunate coincidence in timing?
Not related, or at least not likely.
54 minutes ago, WuzzyFuzzy said: Should the xfs repair be done first or does the cache drive need sorted out prior?
It doesn't matter; it should be done now.
WuzzyFuzzy Posted February 22, 2018 (Author)
43 minutes ago, trurl said: For future reference, you did not actually follow my instructions. We would much rather have the complete diagnostics.
Understood, that was my mistake. I exported both yesterday but grabbed the wrong one and didn't catch the error. I moved the original diagnostics to a PC I don't have access to at this time, so I've exported new diagnostics just now and am attaching them to this post.
22 minutes ago, johnnie.black said: Like trurl mentioned, post the diags; some have known issues, e.g. the WD Black.
Diagnostics have been added. Surprise, surprise... it is indeed a WD Black.
I'd like to make sure I'm interpreting your last comment correctly: you are saying that it does not matter whether the cache drive issue is sorted out first, and that the XFS repair should be done ASAP, correct? I'll try to remote in and start that over lunch today, in ~4 hrs.
I'd add that there are known SMART errors on data drive 2, but I haven't seen any new errors recently. It is on the list of drives to replace with a WD-Red-10 soon.
I appreciate your time looking through this, and apologize for posting the syslog rather than the diagnostics originally. I hope it didn't waste too much of your time. tower-diagnostics-20180222-0739.zip
JorgeB Posted February 22, 2018
1 minute ago, WuzzyFuzzy said: Surprise, surprise... it is indeed a WD Black.
The problem here is the Marvell controller used by these devices. You might get away with it if you disable VT-d, but if you can trade it for a Samsung or Toshiba/OCZ, that would be best.
3 minutes ago, WuzzyFuzzy said: You are saying that it does not matter whether the cache drive issue is sorted out first, and that the XFS repair should be done ASAP, correct?
Correct.
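If you want to confirm what is actually in the M.2 slot before swapping hardware, lspci names the controller. The sample line below is a hypothetical example of what a WD Black of that era might report (the vendor string and PCI IDs are illustrative, not taken from this system's diagnostics); on a live box you would pipe `lspci -nn` straight into the grep instead:

```shell
# Hypothetical lspci output saved to a file for illustration only.
cat > /tmp/lspci.sample <<'EOF'
01:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD Black NVMe SSD [15b7:5001]
EOF
# On a live system: lspci -nn | grep -i 'non-volatile'
grep -i 'non-volatile' /tmp/lspci.sample
```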
trurl Posted February 22, 2018
12 minutes ago, WuzzyFuzzy said: I'd add that there are known SMART errors on data drive 2, but I haven't seen any new errors recently.
Device Model: WDC WD20EARS-60MVWB0
Serial Number: WD-WCAZA8350585
197 Current_Pending_Sector 0x0032 001 001 000 Old_age Always - 65534
I doubt that is even a real count; 65534 = 0xFFFE = -2. Maybe it somehow got decremented from zero.
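The arithmetic behind that reading: interpreted as a 16-bit two's-complement value, 65534 is -2, which fits a counter underflowing past zero far better than 65,534 genuinely pending sectors. A quick sketch in shell:

```shell
# 65534 in hex is 0xFFFE; as a signed 16-bit value that wraps to -2,
# consistent with a counter that was decremented twice past zero.
printf '0x%04X\n' 65534     # prints 0xFFFE
echo $(( 65534 - 65536 ))   # prints -2
```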
JorgeB Posted February 22, 2018
50 minutes ago, trurl said: Maybe it somehow got decremented from zero.
Agreed, very likely a firmware issue.
WuzzyFuzzy Posted February 22, 2018 (Author)
36 minutes ago, johnnie.black said: Agreed, very likely a firmware issue.
SMART stats are persistent, in the sense that if I brought these drives over from an old build with other hardware, the errors would still display, right? If my memory serves, this drive had errors on it from before I added it to this build, and I haven't gotten around to replacing it. I'm slightly hesitant to be confident that there is a current firmware issue (outside of what the NVMe controller is executing), unless you suggest differently.
trurl Posted February 22, 2018
5 minutes ago, WuzzyFuzzy said: if I brought these drives over from an old build with other hardware, the errors would still display, right?
Yes. The firmware issue would be in the drive itself, where the SMART attributes are determined and stored.
WuzzyFuzzy Posted February 22, 2018 (Author)
Going to place an order for the 850 EVO 500GB momentarily. Is this drive known to be more stable for unRAID, or should I look at alternatives? http://a.co/1wLGEXP
JorgeB Posted February 22, 2018
5 minutes ago, WuzzyFuzzy said: Going to place an order for the 850 EVO 500GB momentarily.
Yes, it's a widely used device in the community. If you prefer NVMe, the 960 EVO is also a good option.
WuzzyFuzzy Posted February 22, 2018 (Author)
5 hours ago, johnnie.black said: The problem here is the Marvell controller used by these devices. You might get away with it if you disable VT-d, but if you can trade it for a Samsung or Toshiba/OCZ, that would be best.
I attempted to stop the array via the UI; it became unresponsive and will not reload. Is there a way to gracefully stop the array from the command line, or should I hard reset? The array will attempt to start upon reboot, per my settings; I imagine it will likely fail with the cache drive being unreachable.
JorgeB Posted February 22, 2018
On the console:
reboot
or, if you want to shut down instead:
poweroff
Sometimes it can take a few minutes until it eventually responds, but if it doesn't after 15 to 30 minutes tops, you'll need to hard reset.
WuzzyFuzzy Posted February 23, 2018 (Author)
4 hours ago, johnnie.black said: Sometimes it can take a few minutes until it eventually responds, but if it doesn't after 15 to 30 minutes tops, you'll need to hard reset.
I let it go for an hour and it was still hung. After a hard reset it booted and started the array automatically without the cache drive; the Dockers did not start, thankfully. I stopped the array, restarted in maintenance mode, ran the check with -nv, and have attached the results.
On running the repair: the wiki says "If however issues were found, the display of results will indicate the recommended action to take." Based on that, I'd have expected some note saying there was an issue and a suggested course of action. I don't see one, so I'm unsure of the next steps. The line below from the output suggests to me that there may have been files moved to lost+found, but no repairs were suggested. Not sure what I'm missing here. =\
"moving disconnected inodes to lost+found"
XFS_Check.txt
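One way to see whether that lost+found line actually produced orphaned files is to look for the directory at the root of the disk once the array is back up in normal mode. The path below assumes disk7, as in this thread; it is unRAID's per-disk mount point:

```shell
# lost+found only exists if xfs_repair actually orphaned something;
# if the directory is absent, nothing was moved there.
ls -la /mnt/disk7/lost+found 2>/dev/null || echo "no lost+found - nothing was orphaned"
```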
WuzzyFuzzy Posted February 23, 2018 (Author)
If it is of additional use, I have attached a more recent diagnostics export; maybe there will be additional detail that was not present before. tower-diagnostics-20180222-2034.zip
JorgeB Posted February 23, 2018
You need to run xfs_repair without the -n (no modify) flag.
WuzzyFuzzy Posted February 23, 2018 (Author)
I just completed xfs_repair without the -n flag. I do not see anything in the log (attached) that suggests any repairs had to be made. Just to verify: md7, referenced in your original post, refers to the disk labeled disk 7, correct? Not sure what I'm doing wrong here. repair -v.txt
JorgeB Posted February 23, 2018
7 minutes ago, WuzzyFuzzy said: I do not see anything in the log (attached) that suggests any repairs had to be made.
It's not always visible; sometimes the only way to know whether corruption was detected is to check the exit status of xfs_repair -n. Either way, it should be fixed now. And yes, disk7 is md7.
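That exit-status check can be wrapped in a small helper. `check_fs` below is a hypothetical convenience function, not an unRAID tool; the xfs_repair man page documents exit status 0 as clean and 1 as corruption detected when running with -n:

```shell
# Hypothetical helper: run a dry-run xfs_repair and translate the exit
# status, since the text output alone doesn't always reveal corruption.
check_fs() {
  xfs_repair -n "$1" >/dev/null 2>&1
  case $? in
    0) echo "clean" ;;
    1) echo "corruption detected - rerun without -n to repair" ;;
    *) echo "could not check (is the array in maintenance mode?)" ;;
  esac
}
# Usage on this thread's disk: check_fs /dev/md7
```

The array must be in maintenance mode (filesystem unmounted) for the dry run to open the device at all.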
WuzzyFuzzy Posted February 24, 2018 (Author)
I put in the new cache drive and restored via CA Backup / Restore Appdata, and everything is currently running well. Thanks for all your help; it is greatly appreciated. If it acts up in a similar manner, which I'm not expecting, I'll reply to this thread.