WuzzyFuzzy Posted February 22, 2018
Version: 6.3.5
Plugins: see attached image
Dockers: see attached image (Krusader was not running)
Hardware: https://pcpartpicker.com/list/s4PBQV (not running through the NIC, and there are various other older HDDs in there alongside the WD Reds)
The system had been running stable for ~28 days; I woke up to find all Dockers stopped. The mover finished running at 2:17 AM, followed minutes later by a general protection fault, which led to a host of other issues I am unfamiliar with. tower-syslog-20180221-0742.zip
trurl Posted February 22, 2018
Could you elaborate? Are you having a problem? I assume 6.3.5 is the unRAID version and the title is an error message or something. Where are you seeing the message? If you can, go to Tools - Diagnostics in the webUI and post the complete diagnostics zip.
WuzzyFuzzy Posted February 22, 2018 (Author)
Sorry about that; I prematurely submitted the post. Additional info has been added.
JorgeB Posted February 22, 2018
Your NVMe device dropped offline, causing filesystem issues:
Feb 21 03:00:19 Tower kernel: nvme nvme0: I/O 198 QID 0 timeout, reset controller
Feb 21 03:01:13 Tower kernel: nvme nvme0: Device not ready; aborting reset
Feb 21 03:01:13 Tower kernel: nvme nvme0: completing aborted command with status: 0007 (line repeated six times)
Feb 21 03:01:13 Tower kernel: blk_update_request: I/O error, dev nvme0n1, sector 87467168
Feb 21 03:01:13 Tower kernel: blk_update_request: I/O error, dev nvme0n1, sector 125809520
Feb 21 03:01:13 Tower kernel: blk_update_request: I/O error, dev nvme0n1, sector 662282784
Feb 21 03:01:13 Tower kernel: blk_update_request: I/O error, dev nvme0n1, sector 87467216
Feb 21 03:01:13 Tower kernel: blk_update_request: I/O error, dev nvme0n1, sector 87467144
Feb 21 03:01:13 Tower kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Feb 21 03:01:13 Tower kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Feb 21 03:01:13 Tower kernel: blk_update_request: I/O error, dev nvme0n1, sector 87467184
There's also filesystem corruption on disk7:
Feb 21 07:29:16 Tower kernel: XFS (md7): xfs_iread: validation failed for inode 17179869283 failed
Feb 21 07:29:16 Tower kernel: ffff880103e1a600: 94 af c2 88 98 e9 4e eb 88 c3 07 7b e2 aa 78 7c  ......N....{..x|
Feb 21 07:29:16 Tower kernel: ffff880103e1a610: 1f 49 9a 1f 98 70 63 08 58 67 0d 9c 90 9f 2b c5  .I...pc.Xg....+.
Feb 21 07:29:16 Tower kernel: ffff880103e1a620: 46 d3 7f 9a 4e cb f2 c5 eb 28 45 ad f7 a5 c2 2e  F...N....(E.....
Feb 21 07:29:16 Tower kernel: ffff880103e1a630: 76 24 e6 87 46 9c bc d1 39 41 13 d6 d2 22 05 5d  v$..F...9A...".]
Feb 21 07:29:16 Tower kernel: XFS (md7): Internal error xfs_iread at line 514 of file fs/xfs/libxfs/xfs_inode_buf.c. Caller xfs_iget+0x3b8/0x5d4
Feb 21 07:29:16 Tower kernel: CPU: 3 PID: 9027 Comm: shfs Tainted: G D W 4.9.30-unRAID #1
Feb 21 07:29:16 Tower kernel: Hardware name: System manufacturer System Product Name/PRIME B350-PLUS, BIOS 0803 06/05/2017
Feb 21 07:29:16 Tower kernel: ffffc900132dfaa0 ffffffff813a4a1b ffff8803fc50d000 ffffffff81964ca9
Feb 21 07:29:16 Tower kernel: ffffc900132dfab8 ffffffff8129c98d ffffffff812a1c44 ffffc900132dfaf0
Feb 21 07:29:16 Tower kernel: ffffffff8129c9de 0000020200000020 ffff880104b2e180 00000000ffffff8b
...
Feb 21 07:29:17 Tower kernel: XFS (md7): Corruption detected. Unmount and run xfs_repair
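For anyone triaging a similar syslog later, the relevant lines can be pulled out with a quick grep. The sample file below is a hypothetical three-line stand-in for the real log; on a live unRAID box you would grep /var/log/syslog directly:

```shell
# Hypothetical stand-in for /var/log/syslog, for illustration only.
cat > /tmp/syslog.sample <<'EOF'
Feb 21 03:00:19 Tower kernel: nvme nvme0: I/O 198 QID 0 timeout, reset controller
Feb 21 03:01:13 Tower kernel: blk_update_request: I/O error, dev nvme0n1, sector 87467168
Feb 21 07:29:17 Tower kernel: XFS (md7): Corruption detected. Unmount and run xfs_repair
EOF
# Pull out NVMe resets, block-layer I/O errors, and filesystem complaints:
grep -E 'nvme|blk_update_request|BTRFS error|XFS' /tmp/syslog.sample
```

On a real system the same pattern applied to the full syslog makes the timeline obvious: the controller reset first, then the block-layer errors, then the filesystem complaints.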
WuzzyFuzzy Posted February 22, 2018 (Author)
Any idea what could cause the NVMe issue? The NVMe cache drive has dropped offline before on a reboot. The tower itself hasn't been moved in weeks, so I don't think it's a physical connection issue; it's an M.2 drive, so there are no loose cables to pinpoint. Not sure how to prevent this from happening again.
Was the disk7 corruption likely a consequence of the NVMe issue, or just an unfortunate coincidence in timing?
Should the xfs_repair be done first, or does the cache drive need to be sorted out prior?
Thank you for your feedback.
Sent from my SM-G950U using Tapatalk
trurl Posted February 22, 2018
For future reference, you did not actually follow my instructions:
9 hours ago, trurl said: go to Tools - Diagnostics in the webUI and post the complete diagnostics zip.
We would much rather have the complete diagnostics, which contain the syslog, SMART for all disks, and many other useful things. If everyone would just post the diagnostics, it would save us and them a lot of trouble. It might even be useful to post them now.
JorgeB Posted February 22, 2018
52 minutes ago, WuzzyFuzzy said: Any idea what could cause the NVMe issue? The NVMe cache drive has dropped offline before on a reboot.
Like trurl mentioned, post the diags; the syslog you posted isn't even complete, so I can't see which NVMe model you have. Some have known issues, e.g. the WD Black.
54 minutes ago, WuzzyFuzzy said: Was it likely that the drive 7 corruption was a consequence of the other issue? Or just an unfortunate coincidence in timing?
Not related, or at least not likely.
54 minutes ago, WuzzyFuzzy said: Should the xfs repair be done first or does the cache drive need sorted out prior?
It doesn't matter; it should be done now.
WuzzyFuzzy Posted February 22, 2018 (Author)
43 minutes ago, trurl said: For future reference, you did not actually follow my instructions. We would much rather have the complete diagnostics.
Understood, that was my mistake. I exported both yesterday but grabbed the wrong one and didn't catch the error. I moved the original diagnostics to a PC I don't have access to at this time, so I've exported new diagnostics just now and am attaching them to this post.
22 minutes ago, johnnie.black said: Like trurl mentioned, post the diags; some have known issues, e.g. the WD Black.
Diagnostics have been added. Surprise, surprise... it is indeed a WD Black.
I'd like to make sure I'm interpreting your last comment correctly: you are saying that it does not matter whether the cache drive issue is sorted out first, and that the XFS repair should be done ASAP, correct? I'll try to remote in and start that over lunch today, in ~4 hrs.
I'd add that there are known SMART errors on data drive 2, but I haven't seen any new errors recently. It is on the list of drives to replace with a WD-Red-10 soon.
I appreciate your time looking through this, and apologize for posting the syslog rather than the diagnostics originally. I hope it didn't waste too much of your time. tower-diagnostics-20180222-0739.zip
JorgeB Posted February 22, 2018
1 minute ago, WuzzyFuzzy said: Surprise, surprise... it is indeed a WD Black.
The problem here is the Marvell controller used by these devices. You might get away with it if you disable VT-d, but if you can trade it for a Samsung or Toshiba/OCZ, that would be best.
3 minutes ago, WuzzyFuzzy said: You are saying that it does not matter whether the cache drive issue is sorted out first, and that the XFS repair should be done ASAP, correct?
Correct.
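If you want to confirm what is actually in the M.2 slot before swapping hardware, lspci names the controller. The sample line below is a hypothetical example of what a WD Black of that era might report (the vendor string and PCI IDs are illustrative, not taken from this system's diagnostics); on a live box you would pipe `lspci -nn` straight into the grep instead:

```shell
# Hypothetical lspci output saved to a file for illustration only.
cat > /tmp/lspci.sample <<'EOF'
01:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD Black NVMe SSD [15b7:5001]
EOF
# On a live system: lspci -nn | grep -i 'non-volatile'
grep -i 'non-volatile' /tmp/lspci.sample
```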
trurl Posted February 22, 2018
12 minutes ago, WuzzyFuzzy said: I'd add that there are known SMART errors on data drive 2, but I haven't seen any new errors recently.
Device Model: WDC WD20EARS-60MVWB0
Serial Number: WD-WCAZA8350585
197 Current_Pending_Sector 0x0032 001 001 000 Old_age Always - 65534
I doubt that is even a real count; 65534 = 0xFFFE = -2. Maybe it somehow got decremented from zero.
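The arithmetic behind that reading: interpreted as a 16-bit two's-complement value, 65534 is -2, which fits a counter underflowing past zero far better than 65,534 genuinely pending sectors. A quick sketch in shell:

```shell
# 65534 in hex is 0xFFFE; as a signed 16-bit value that wraps to -2,
# consistent with a counter that was decremented twice past zero.
printf '0x%04X\n' 65534     # prints 0xFFFE
echo $(( 65534 - 65536 ))   # prints -2
```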
JorgeB Posted February 22, 2018
50 minutes ago, trurl said: Maybe it somehow got decremented from zero.
Agreed, very likely a firmware issue.
WuzzyFuzzy Posted February 22, 2018 (Author)
36 minutes ago, johnnie.black said: Agreed, very likely a firmware issue.
SMART stats are persistent, in the sense that if I brought these drives over from an old build with other hardware, the errors would still display, right? If my memory serves, this drive had errors on it from before I added it to this build, and I haven't gotten around to replacing it. I'm slightly hesitant to be confident that there is a current firmware issue (outside of what the NVMe controller is executing), unless you suggest differently.
trurl Posted February 22, 2018
5 minutes ago, WuzzyFuzzy said: if I brought these drives over from an old build with other hardware, the errors would still display, right?
Yes. The firmware issue would be in the drive itself, where the SMART attributes are determined and stored.
WuzzyFuzzy Posted February 22, 2018 (Author)
Going to place an order for the 850 EVO 500GB momentarily. Is this drive known to be more stable for unRAID, or should I look at alternatives? http://a.co/1wLGEXP
JorgeB Posted February 22, 2018
5 minutes ago, WuzzyFuzzy said: Going to place an order for the 850 EVO 500GB momentarily.
Yes, it's a widely used device in the community. If you prefer NVMe, the 960 EVO is also a good option.
WuzzyFuzzy Posted February 22, 2018 (Author)
5 hours ago, johnnie.black said: The problem here is the Marvell controller used by these devices. You might get away with it if you disable VT-d, but if you can trade it for a Samsung or Toshiba/OCZ, that would be best.
I attempted to stop the array via the UI; it became unresponsive and will not reload. Is there a way to gracefully stop the array from the command line, or should I hard reset? The array will attempt to start upon reboot, per my settings; I imagine it will likely fail with the cache drive being unreachable.
JorgeB Posted February 22, 2018
On the console:
reboot
or, if you want to shut down instead:
poweroff
Sometimes it can take a few minutes until it eventually responds, but if it doesn't after 15 to 30 minutes tops, you'll need to hard reset.
WuzzyFuzzy Posted February 23, 2018 (Author)
4 hours ago, johnnie.black said: Sometimes it can take a few minutes until it eventually responds, but if it doesn't after 15 to 30 minutes tops, you'll need to hard reset.
I let it go for an hour and it was still hung. After a hard reset it booted and started the array automatically without the cache drive; the Dockers did not start, thankfully. I stopped the array, restarted in maintenance mode, ran the check with -nv, and have attached the results.
On running the repair: the wiki says "If however issues were found, the display of results will indicate the recommended action to take." Based on that, I'd have expected some note saying there was an issue and a suggested course of action. I don't see one, so I'm unsure of the next steps. The line below from the output suggests to me that there may have been files moved to lost+found, but no repairs were suggested. Not sure what I'm missing here. =\
"moving disconnected inodes to lost+found"
XFS_Check.txt
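One way to see whether that lost+found line actually produced orphaned files is to look for the directory at the root of the disk once the array is back up in normal mode. The path below assumes disk7, as in this thread; it is unRAID's per-disk mount point:

```shell
# lost+found only exists if xfs_repair actually orphaned something;
# if the directory is absent, nothing was moved there.
ls -la /mnt/disk7/lost+found 2>/dev/null || echo "no lost+found - nothing was orphaned"
```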
WuzzyFuzzy Posted February 23, 2018 (Author)
If it is of additional use, I have attached a more recent diagnostics export; maybe there will be additional detail that was not present before. tower-diagnostics-20180222-2034.zip
JorgeB Posted February 23, 2018
You need to run xfs_repair without the -n (no modify) flag.
WuzzyFuzzy Posted February 23, 2018 (Author)
I just completed xfs_repair without the -n flag. I do not see anything in the log (attached) that suggests any repairs had to be made. Just to verify: md7, referenced in your original post, refers to the disk labeled disk 7, correct? Not sure what I'm doing wrong here. repair -v.txt
JorgeB Posted February 23, 2018
7 minutes ago, WuzzyFuzzy said: I do not see anything in the log (attached) that suggests any repairs had to be made.
It's not always visible; sometimes the only way to know whether corruption was detected is to check the exit status of xfs_repair -n. Either way, it should be fixed now. And yes, disk7 is md7.
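That exit-status check can be wrapped in a small helper. `check_fs` below is a hypothetical convenience function, not an unRAID tool; the xfs_repair man page documents exit status 0 as clean and 1 as corruption detected when running with -n:

```shell
# Hypothetical helper: run a dry-run xfs_repair and translate the exit
# status, since the text output alone doesn't always reveal corruption.
check_fs() {
  xfs_repair -n "$1" >/dev/null 2>&1
  case $? in
    0) echo "clean" ;;
    1) echo "corruption detected - rerun without -n to repair" ;;
    *) echo "could not check (is the array in maintenance mode?)" ;;
  esac
}
# Usage on this thread's disk: check_fs /dev/md7
```

The array must be in maintenance mode (filesystem unmounted) for the dry run to open the device at all.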
WuzzyFuzzy Posted February 24, 2018 (Author)
I put in the new cache drive and restored via CA Backup / Restore Appdata, and everything is currently running well. Thanks for all your help; it is greatly appreciated. If it acts up in a similar manner, which I'm not expecting, I'll reply to this thread.