daan_SVK Posted April 7, 2021

So I popped in a new pre-cleared parity drive, ran into a bump with pre-clearing the old parity drive as per this thread, and I thought I was golden. The system went unresponsive when the parity rebuild was around 18%. The last thing I noticed was that the CPU usage went to 100%, which never happens as I have an 8-core x5600 and just a bunch of lightweight dockers.

This is where I'm at:
- the GUI won't load, getting 500 Internal Server Error from the browser
- I hooked up a monitor to the server, logged in, ran htop, and that's where the console froze and stopped taking any input
- the server still seems to be writing to the array or parity, but I can only tell by the solid LED light on the chassis
- the old parity drive was hooked up for pre-clearing in an external USB dock and the last time I saw it, it was paused
- the server still pings fine

Getting a little spooked here, what are my options?
- should I just wait for 2 days in the hope the new parity will eventually recalculate and the array comes back alive?
- should I disconnect the old parity drive in the caddy that I was pre-clearing?

Thanks for reading, fingers crossed.
trurl Posted April 8, 2021

Since parity isn't valid anyway, might as well reboot and get us diagnostics.
daan_SVK Posted April 8, 2021 (edited)

14 hours ago, trurl said:
Since parity isn't valid anyway might as well reboot and get us diagnostics.

Thanks trurl, I waited overnight and the writing to the array stopped. I rebooted as you suggested and the server came back up claiming the parity rebuild finished, but I still get a "Parity Invalid" exclamation mark on the parity drive. I have attached the diagnostics as per your suggestion.

Should I just start the array and attempt to rebuild the parity again? The disk was properly precleared originally.

tower-diagnostics-20210408-0928.zip

Edited April 8, 2021 by daan_SVK
trurl Posted April 8, 2021

2 hours ago, daan_SVK said:
disk was properly precleared

Not relevant. Preclear is only a test/burn-in utility. When Unraid needs a clear disk (only when adding a disk to a new data slot in an array with valid parity, so parity will remain valid), Unraid will clear the disk itself.

Why do you have 100G allocated to docker.img? Have you had problems filling it? 20G is usually more than enough, and making it larger won't fix filling, it will only make it take longer to fill.

2 hours ago, daan_SVK said:
Should I just start the array attempting to rebuild the parity again?

Yes, and post new diagnostics after starting. There are lots of things we can't tell about your situation unless the array is started.
daan_SVK Posted April 8, 2021

1 hour ago, trurl said:
Why do you have 100G allocated to docker.img?

No particular reason to be honest, this is how it was originally set up.

I started the array again a few hours ago; it seems to be rebuilding without issues at the moment (at 8.5% now). The server is responsive and all dockers and services are up. I disconnected the USB caddy that was running the pre-clear on the old parity drive in case that might have been the issue.

The new diagnostics file is attached. Thanks again for reviewing it, much appreciated.

tower-diagnostics-20210408-1335.zip
trurl Posted April 8, 2021

1 hour ago, daan_SVK said:
this is how it was originally set up.

You mean somebody else set it up for you?

Why are some of your disks still ReiserFS?

Parity build looks good so far, but it looks like some cache corruption:

Apr 8 12:01:02 Tower shfs: copy_file: /mnt/cache/Movies/Cronos (1993)/Cronos 1993 Bluray-1080p.mkv /mnt/disk3/Movies/Cronos (1993)/Cronos 1993 Bluray-1080p.mkv (5) Input/output error
Apr 8 12:01:03 Tower crond[2026]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null
Apr 8 13:00:54 Tower shfs: copy_file: /mnt/cache/Movies/Cronos (1993)/Cronos 1993 Bluray-1080p.mkv /mnt/disk3/Movies/Cronos (1993)/Cronos 1993 Bluray-1080p.mkv (5) Input/output error
Apr 8 13:00:54 Tower kernel: btrfs_print_data_csum_error: 12 callbacks suppressed
Apr 8 13:00:54 Tower kernel: BTRFS warning (device sdb1): csum failed root 5 ino 5854828 off 5642928128 csum 0x75f244ca expected csum 0x98829387 mirror 1
Apr 8 13:00:54 Tower kernel: btrfs_dev_stat_print_on_error: 12 callbacks suppressed
Apr 8 13:00:54 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 35, gen 0
Apr 8 13:00:54 Tower kernel: BTRFS warning (device sdb1): csum failed root 5 ino 5854828 off 5642928128 csum 0x75f244ca expected csum 0x98829387 mirror 2
Apr 8 13:00:54 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 34, gen 0
Apr 8 13:00:54 Tower kernel: BTRFS warning (device sdb1): csum failed root 5 ino 5854828 off 5642928128 csum 0x75f244ca expected csum 0x98829387 mirror 1
Apr 8 13:00:54 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 36, gen 0
Apr 8 13:00:54 Tower kernel: BTRFS warning (device sdb1): csum failed root 5 ino 5854828 off 5642928128 csum 0x75f244ca expected csum 0x98829387 mirror 2
Apr 8 13:00:54 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 35, gen 0
Apr 8 13:00:54 Tower kernel: BTRFS warning (device sdb1): csum failed root 5 ino 5854828 off 5642928128 csum 0x75f244ca expected csum 0x98829387 mirror 1
Apr 8 13:00:54 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 37, gen 0
Apr 8 13:00:54 Tower kernel: BTRFS warning (device sdb1): csum failed root 5 ino 5854828 off 5642928128 csum 0x75f244ca expected csum 0x98829387 mirror 2
Apr 8 13:00:54 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 36, gen 0
Apr 8 13:00:55 Tower crond[2026]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null

Why was mover running?

Didn't look at SMART. Do any disks have SMART warnings on the Dashboard page?
daan_SVK Posted April 8, 2021 (edited)

39 minutes ago, trurl said:
2 hours ago, daan_SVK said:
this is how it was originally set up.
You mean somebody else set it up for you?

I set it up a long time ago. Outside of being a waste of space, this is not an issue, is it?

39 minutes ago, trurl said:
Why are some of your disks still ReiserFS?

Again, this is the original configuration. I think some of those disks are over 6 years old. I tend not to fix things if they aren't broken. I was about to replace the oldest drive with the old parity drive.

39 minutes ago, trurl said:
Why was mover running?

This most likely has to do with the cache corruption you pointed out. It was most likely trying to move the movie listed in the log:

Apr 8 12:01:02 Tower shfs: copy_file: /mnt/cache/Movies/Cronos (1993)/Cronos 1993 Bluray-1080p.mkv /mnt/disk3/Movies/Cronos (1993)/Cronos 1993 Bluray-1080p.mkv (5) Input/output error
Apr 8 12:01:03 Tower crond[2026]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null
Apr 8 13:00:54 Tower shfs: copy_file: /mnt/cache/Movies/Cronos (1993)/Cronos 1993 Bluray-1080p.mkv /mnt/disk3/Movies/Cronos (1993)/Cronos 1993 Bluray-1080p.mkv (5) Input/output error
Apr 8 13:00:54 Tower kernel: btrfs_print_data_csum_error: 12 callbacks suppressed

39 minutes ago, trurl said:
Didn't look at SMART. Do any disks have SMART warnings on the Dashboard page?

No SMART errors reported from any of the drives.

So the mover is the culprit here? Should the mover be disabled in similar cases? I was under the impression the array should be fully operational while parity is being calculated. I certainly never experienced any issues on my monthly parity checks.

Also, should I be concerned about the cache corruption? How would I go about addressing that? Should I scrub it in maintenance mode once the parity rebuild has been completed?

Edited April 8, 2021 by daan_SVK
JorgeB Posted April 9, 2021

8 hours ago, trurl said:
csum failed

Btrfs is detecting data corruption. This suggests a hardware issue, like bad RAM, and it's not that surprising since it's a known issue with Ryzen and overclocked RAM.
daan_SVK Posted April 9, 2021 (edited)

7 hours ago, JorgeB said:
known issue with Ryzen and overclocked RAM.

I will try to disable XMP and run the RAM at stock settings. The server was up for over 60 days prior, and it was also stress tested with memtest for two days before being put into production.

Is there something I can do to resolve the cache corruption? This being a pool of two devices, I should be able to run a corrective scrub, correct?

Edited April 9, 2021 by daan_SVK
JorgeB Posted April 9, 2021

You should run a scrub to identify the corrupt files. Most likely both copies will be corrupt, and in that case they can't be fixed, but you can delete them and restore from backups.
daan_SVK Posted April 9, 2021

1 hour ago, JorgeB said:
You should run a scrub to identify the corrupt files, most likely both copies will be corrupt, and in that case it can't be fixed, but you can delete them and restore from backups.

Thanks JorgeB, I deleted the file I thought caused the mover to hang and ran the read-only scrub, however I don't see any corrupted files listed in the output:

[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
Opening filesystem to check...
Checking filesystem on /dev/sdb1
UUID: d1294b70-d13c-4027-b0b5-12417226b0dc
cache and super generation don't match, space cache will be invalidated
found 128267644928 bytes used, no error found
total csum bytes: 20018700
total tree bytes: 368099328
total fs tree bytes: 303235072
total extent tree bytes: 33718272
btree space waste bytes: 64207774
file data blocks allocated: 1508281032704
 referenced 126905487360
JorgeB Posted April 9, 2021

1 hour ago, daan_SVK said:
however I dont see any corrupted files listed on the output

That's not a scrub, it's a file system check. There's also a GUI option for the scrub; corrupt files, if any, will be listed in the syslog.
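For reference, the same scrub can also be started from the command line with btrfs-progs; a minimal sketch, assuming the pool is mounted at /mnt/cache (the usual Unraid cache mountpoint):

```shell
# read-only scrub of the whole cache pool, run in the foreground
# (-B = don't background, -d = per-device stats at the end, -r = read-only)
btrfs scrub start -B -d -r /mnt/cache

# or start it in the background and poll progress
btrfs scrub start -r /mnt/cache
btrfs scrub status /mnt/cache

# any corrupt files it finds are reported in the syslog as checksum errors
grep -iE 'csum failed|checksum error' /var/log/syslog
```

Dropping the -r flag makes the scrub corrective, i.e. it will try to repair bad blocks from the other mirror, which only helps when at least one copy is still good.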
daan_SVK Posted April 9, 2021 (edited)

14 minutes ago, JorgeB said:
That's not a scrub, it's a file system check, there's also a GUI option for the scrub, corrupt files, if any, will be listed in the syslog.

Ah, yes, thank you for pointing that out. No errors in that one:

UUID: d1294b70-d13c-4027-b0b5-12417226b0dc
Scrub started: Fri Apr 9 09:41:13 2021
Status: finished
Duration: 0:04:15
Total to scrub: 238.90GiB
Rate: 959.28MiB/s
Error summary: no errors found

The cache log still shows 43 corrupted entries:

Apr 8 09:21:28 Tower kernel: sdc: sdc1
Apr 8 09:21:28 Tower kernel: sd 2:0:0:0: [sdc] Attached SCSI disk
Apr 8 09:21:28 Tower kernel: BTRFS: device fsid d1294b70-d13c-4027-b0b5-12417226b0dc devid 2 transid 9310751 /dev/sdc1 scanned by udevd (1642)
Apr 8 09:21:59 Tower emhttpd: MTFDDAK512TBN-1AR1ZABHA_UGXVL01J1BF7U0 (sdc) 512 1000215216
Apr 8 09:21:59 Tower emhttpd: import 31 cache device: (sdc) MTFDDAK512TBN-1AR1ZABHA_UGXVL01J1BF7U0
Apr 8 09:21:59 Tower emhttpd: read SMART /dev/sdc
Apr 8 11:21:28 Tower kernel: BTRFS info (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 23, gen 0
Apr 8 12:01:02 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 24, gen 0
Apr 8 12:01:02 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 25, gen 0
Apr 8 12:01:02 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 26, gen 0
Apr 8 12:01:02 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 27, gen 0
Apr 8 12:01:02 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 28, gen 0
Apr 8 12:01:02 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 29, gen 0
Apr 8 12:01:02 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 30, gen 0
Apr 8 12:01:02 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 31, gen 0
Apr 8 13:00:54 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 35, gen 0
Apr 8 13:00:54 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 36, gen 0
Apr 8 13:00:54 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 37, gen 0
Apr 8 14:01:09 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 38, gen 0
Apr 8 14:01:09 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 39, gen 0
Apr 8 14:01:09 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 40, gen 0
Apr 8 15:01:11 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 41, gen 0
Apr 8 15:01:11 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 42, gen 0
Apr 8 15:01:11 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 43, gen 0
Apr 9 09:22:25 Tower emhttpd: read SMART /dev/sdc
Apr 9 09:40:16 Tower emhttpd: read SMART /dev/sdc
Apr 9 09:40:30 Tower kernel: BTRFS info (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 43, gen 0

What are my options for addressing those?

Edited April 9, 2021 by daan_SVK
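For what it's worth, those "corrupt 43" counters are cumulative: btrfs stores the per-device error counts in the filesystem, so they keep their value across mounts even after the offending files are gone, and only new errors increment them. A minimal sketch for inspecting and, once everything checks out, zeroing them (assuming the pool is mounted at /mnt/cache):

```shell
# show the persistent per-device error counters for the pool
btrfs device stats /mnt/cache

# after the corruption is dealt with, reset the counters to zero
# so that any future errors stand out immediately
btrfs device stats -z /mnt/cache
```

After a reset, a clean subsequent scrub plus counters that stay at zero is a reasonable sign the remaining data is fine.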