Constant issues with unRAID

jvlarc · March 28, 2021

Hi everyone, I've been running unRAID since July last year, so I'm still pretty new to this, but you could say it's become a pretty addictive thing for me. With this new addiction, I've been messing around with unRAID and slowly tweaking my server from time to time, and that may have led to the following issues. Sorry if this post is really long; I'd like to get all these issues ironed out so that I can properly plan my next steps and not mess up my system.

TLDR: I've been having a ton of issues recently. Here are the ones I'm facing, and I really hope someone can help, as I'm kinda pulling my hair out over them.

Toshiba N300 drives kept reporting seek error rate failures at random intervals (RMA'd 2 drives 8 times)
10GbE card isn't working, network issues
Docker image kept growing past 20GB and logs kept filling up, corrupted
The whole system brought down the devices connected to my AP
Did a fresh flash drive install and am getting BTRFS errors again
BTRFS error on cache

1. Toshiba N300 drives (may not be related, but just checking whether any other Toshiba users are having the same issue)

So I have 2x Toshiba 6TB N300 drives, which I brought over from my Synology in good health prior to switching to unRAID. However, the drives kept reporting 'seek error rate' as 'in the past' and 'FAILING NOW' after a few days to weeks of usage, and they made really loud noises. I even had some replacements that were used for less than 24 hours before they started throwing the same SMART error again.

The HDDs are all properly mounted in my Fractal R6 case, with the rubber grommets on the trays.

I got so pissed after going down to RMA these 2 drives at least 8 times that I got in contact with Toshiba, and they said it's a 'phenomenon'. I'm just kinda worried because every time I replace a drive I have to do a parity sync, which puts unnecessary stress on all my drives. I'm finally done with these Toshiba drives: I redid my config a few days back to remove one of them, and I'm in the process of pre-clearing a new 12TB WD Elements, which I'm about to shuck to replace the other Toshiba drive.

2. I have a Mellanox MT27500 ConnectX-3 10GbE card in both my unRAID server and my main PC. They used to work fine until a few months ago, and I haven't got this solved up till now. I followed the SIO tutorial, which got it working in the beginning, but one day it somehow stopped working, together with access via the 1GbE Windows network share. I had to map a network drive to work around this. A few weeks ago I finally read somewhere to remove the network config file from the flash drive, and that made the network share accessible over 1GbE once again (I seem to be having this same issue again after the new flash install). But 10GbE still does not work: both cards are detected, I just can't seem to map or access anything over the 10GbE link.

3. About 1 week ago was when I started having serious issues. One of my 1-year-old 4TB WD Reds (Disk 4), which I had been monitoring since it showed a UDMA CRC error count of 4, gave a notification for 868 errors. The retailer told me to replace the SATA cable. I did that, but the error count still increased, so he told me that he would refund me, as I had bought a few pieces. I took it out to return it, but my scatterbrain did not remember that it had files on it. I used unBALANCE to scatter the files from the disk onto the other drives. This was done successfully with all files intact.

Docker image usage kept growing past 20GB and the logs kept filling up until the containers were inaccessible, and I basically had to restart my entire server every day. I decided to do a new config to remove Disk 4 and one of the 6TB N300s (Disk 7), which was empty as well, and parity has been building ever since the new config.

Every day the docker containers fail, and I have to: turn off Docker > delete the docker image (I had issues deleting the docker image, as it would stay in the system folder, so I used Krusader to delete it, which finally worked) > stop array > start array > start Docker. Then Docker would start.

4. *tower-diagnostics-20210327-2325* 2 days ago, with parity still building (probably for the 4th time), I accessed the server as usual, and all was sort of 'normal' (by normal I mean the usual crash I'd expect at the end of the day, but I wanted the parity sync to finish). Then, when I got to work, suddenly everything was down again. I tried to remotely access my main PC from the office, with no luck. I called my dad, as he was home, to check whether my unRAID server and main PC had shut down, but both were turned on, and the AP he is connected to outside was working fine. That led me to suspect my AP was the problem, so I walked him through resetting it, but no luck. I finally got home and tried diagnosing the issue again, and found that the main culprit was the server, which had somehow crashed only my room's AP and the other devices connected to it, as the rest of the household was OK. Which brings me to part 5.

5. I decided I'd had enough of the unSTABILITY after 1 week of constant crashes. Even after scouring the forums and the internet for help, I just could not take it any longer, and thought that starting with a fresh copy of unRAID 6.9.1 might undo whatever minor tweaks of mine may have screwed up the system.

6. *tower-diagnostics-20210329-0027* I did a fresh install onto my flash drive that very night and went on reinstalling docker containers till 3-4am. I have to mention that while installing, the system wasn't very responsive, and it felt a bit sluggish when installing new docker containers.

I saw that I had a reallocated sector notification on disk sdb, which is my cache 1 drive. It has had a reallocated sector count of 1 for a while now, and I thought that was OK. Maybe it would be a better idea to remove it completely from the system? (What's the best way to do so and make cache 2 the main cache drive? I've just manually run the Backup/Restore app.) I've been getting these errors in my log for a few weeks now:

Quote

Mar 27 12:56:23 Tower kernel: BTRFS warning (device sdb1): csum failed root 5 ino 7978112 off 620822528 csum 0x0e5a0d2f expected csum 0xb5ad876e mirror 2
Mar 27 12:56:23 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 1056, gen 0
Mar 27 12:56:23 Tower kernel: BTRFS warning (device sdb1): csum failed root 5 ino 7978112 off 620826624 csum 0x6eba6090 expected csum 0xb7cac426 mirror 2
Mar 27 12:56:23 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 1057, gen 0
Mar 27 12:56:23 Tower kernel: BTRFS warning (device sdb1): csum failed root 5 ino 7978112 off 620822528 csum 0xdc9ae9bb expected csum 0xb5ad876e mirror 1
Mar 27 12:56:23 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 1409, gen 0
Mar 27 12:56:23 Tower kernel: BTRFS warning (device sdb1): csum failed root 5 ino 7978112 off 620826624 csum 0x6eba6090 expected csum 0xb7cac426 mirror 1
Mar 27 12:56:23 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 1410, gen 0
Mar 27 12:56:23 Tower kernel: BTRFS warning (device sdb1): csum failed root 5 ino 7978112 off 620822528 csum 0xdc9ae9bb expected csum 0xb5ad876e mirror 2
Mar 27 12:56:23 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 1058, gen 0
Mar 27 12:56:23 Tower kernel: BTRFS warning (device sdb1): csum failed root 5 ino 7978112 off 620826624 csum 0x2c973b20 expected csum 0xb7cac426 mirror 2
Mar 27 12:56:23 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 1059, gen 0
Mar 27 12:56:24 Tower kernel: BTRFS warning (device sdb1): csum failed root 5 ino 7978112 off 620822528 csum 0x0e5a0d2f expected csum 0xb5ad876e mirror 2
Mar 27 12:56:24 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 1060, gen 0
Mar 27 12:56:24 Tower kernel: BTRFS warning (device sdb1): csum failed root 5 ino 7978112 off 620822528 csum 0xdc9ae9bb expected csum 0xb5ad876e mirror 1
Mar 27 12:56:24 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 1411, gen 0
Mar 27 12:56:24 Tower kernel: BTRFS warning (device sdb1): csum failed root 5 ino 7978112 off 620822528 csum 0x0e5a0d2f expected csum 0xb5ad876e mirror 2
Mar 27 12:56:24 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 1061, gen 0
Mar 27 12:56:24 Tower kernel: BTRFS warning (device sdb1): csum failed root 5 ino 7978112 off 620822528 csum 0xdc9ae9bb expected csum 0xb5ad876e mirror 2
Mar 27 12:56:24 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 1062, gen 0
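For anyone reading an excerpt like the one above, a rough way to summarize it is to keep only the last counters reported for each device. This is a minimal Python sketch, not an Unraid tool; the function name `latest_error_counters` is made up for illustration, and the sample lines are copied from the log above. In this excerpt both pool members (`/dev/sdb1` and `/dev/sdc1`) show rising `corrupt` counts while `wr`, `rd` and `flush` stay at 0.

```python
import re

# Matches the per-device counter lines, e.g.
# "... BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 1056, gen 0"
BDEV_RE = re.compile(
    r"BTRFS error \(device (?P<fs>\S+)\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

COUNTER_NAMES = ("wr", "rd", "flush", "corrupt", "gen")


def latest_error_counters(log_text):
    """Return {bdev: counters} with the last counters seen for each device."""
    latest = {}
    for line in log_text.splitlines():
        match = BDEV_RE.search(line)
        if match:
            latest[match["bdev"]] = {name: int(match[name]) for name in COUNTER_NAMES}
    return latest


# Three lines taken from the syslog excerpt quoted in this post.
sample = """\
Mar 27 12:56:23 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 1056, gen 0
Mar 27 12:56:24 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 1411, gen 0
Mar 27 12:56:24 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 1062, gen 0
"""

counters = latest_error_counters(sample)
for bdev, values in sorted(counters.items()):
    print(bdev, values)
```

The same counters can also be read directly with `btrfs device stats` on the mounted pool (for example `/mnt/cache`), which avoids parsing the syslog at all.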

Is this what's been causing all the issues? Also, I tried to do a repair scrub on device sdb1 and it showed 22 corrupted files (thank god these are just Linux ISO files which can be redownloaded, so I'm not too worried if I have to delete them). Could someone advise me on how I should delete these corrupted files? Is it as simple as just deleting them?

Furthermore, about 18 hours after it had been running 'stable', I lost access to Plex again. It seems like reinstalling the flash drive did not fix the issue, as the webUI was really sluggish and my Plex docker was unable to restart. I deleted the Docker image, and when I tried turning Docker back on, it didn't work again.

Now I'm just going to let my parity sync finish (currently at 92%) before I proceed with the next advisable step.

Sorry that this post is really long. This is getting really frustrating, and between work and trying to solve this I'm short on sleep, so my thoughts are all over the place. Thank you for taking the time to read and assist.

tower-diagnostics-20210327-2325.zip tower-diagnostics-20210329-0027.zip

Edited March 29, 2021 by jvlarc

jvlarc · March 29, 2021

So the parity was finally rebuilt. This morning I woke up hoping for a reply here, but there was nothing, and my drowsy ass wanted to get Plex back up and working, as I share it with quite a number of family and friends, and some have been asking me constantly over the past few days. I remembered that I had done the Backup/Restore just as I was typing the post above late last night.

So my drowsy ass stopped the array and then restarted it, but both cache drives were suddenly unmountable. I thought, heck, I got a backup done last night, so I shouldn't be too worried. So I stopped the array and tried assigning cache 2 as the only cache drive, but it wouldn't mount. Then I decided to do something stupid, which was to swap the cache drives' positions. This was where the whole thing went wrong.

As soon as I started the array, the drives started formatting. Once that was done, I went to Backup/Restore to restore the appdata, and this was what I got, but the notification I got was green, so I thought it had been restored successfully.

Now I've just checked my Plex and I have no content available. I went to the appdata folder and only a few containers were there; the important ones like bitwardenrs and nextcloud, along with mariadb, were all gone. On closer inspection I realized there was an 'invalid compressed data format' error, as shown above. I then checked the backups folder and realized the compressed file was only 40GB, whereas it was 130+GB previously, and now it's only 11GB.

Is there any way I can recover all these things? Really any help will be appreciated at this point.

Edited March 29, 2021 by jvlarc

JorgeB · March 29, 2021

15 hours ago, jvlarc said:

Toshiba N300 drives kept reporting seek error rate failures at random intervals (RMA'd 2 drives 8 times)

This is quite common with these drives, possibly a firmware issue.

15 hours ago, jvlarc said:

Did a fresh flash drive install and am getting BTRFS errors again

BTRFS error on cache

Start by running memtest; checksum errors usually mean a hardware problem.

jvlarc · March 29, 2021

Hi JorgeB, thanks for this, at least I'm not running around headless

Didn't know that Toshiba drives were incompatible with unRAID, at least that's good to know, and I will remove both drives from my system. Thank you for confirming this; I guess I'll update Toshiba's side.

Great, I'm running memtest86 now as I type this, and it seems to be throwing 17 errors. I suspect a cheap used RAM stick that I added 2 weeks back may be the cause; I've removed it and am running memtest again. Shall update.

As for the SSD, I'm currently running EaseUS data recovery and shall update on the progress. Crossing my fingers that I get the data back.

Edited March 29, 2021 by jvlarc

JorgeB · March 29, 2021

53 minutes ago, jvlarc said:

Didn't know that Toshiba drives were incompatible with unRAID,

They aren't; it's a problem with the disks, and it happens with any OS. There are many similar reports with FreeNAS, for example.

ken-ji · March 29, 2021

For data points:

I'm using 4x 8TB N300 (about 1y 10m poweron time old)

I haven't seen any weird SMART attributes ever, so it's probably just the firmware on them.

Vr2Io · March 29, 2021

FYR. I have used Toshiba X300/MG04/MD04 6TB disks (all of those should be identical to the N300) with Unraid for 3+ years, around 7 disks. They haven't given me trouble.

Edited March 29, 2021 by Vr2Io

JorgeB · March 29, 2021

They are usually fine, I have several myself, but there's an unusually high number of reported SMART "seek error rate" "failing now" errors. Like mentioned, it's not limited to Unraid; it's possibly a firmware issue, since most times the raw value returns to normal after a power cycle.

jvlarc · March 29, 2021

Noted on the drives. Maybe it's a specific drive model, as they had mentioned, and the issues are only with the newer drive models. I was running the older drives in my Synology and never had an issue with them, except for the noise from the disks, as they're 7200rpm.

Also, yes, I managed to find a bad stick of RAM and have removed it from the system. No more BTRFS errors in the logs, and I really pray it stays that way. I ran memtest86 again on the remaining 3 sticks for around 3 hours and none had issues.

*Update: I just got another BTRFS error, so I ended up taking out the 2 dodgy RAM sticks. The remaining 2 have been running well since I got them new.*

I realized that EaseUS does not support BTRFS data recovery, so after scouring the forum again I tried UFS Explorer. It seems to have found some of the metadata, but only 12.27GB of it, and the software itself costs a huge amount. Does anyone have an alternative?
Am I SoL for my appdata? 😢

Edit: I tried extracting the files using WinRAR on Windows, but it still says the tar.gz is corrupted. What gives?
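One way to get a rough idea of how damaged a `.tar.gz` like this is: try to decompress the gzip stream and see whether it reaches the end cleanly. The sketch below is only an illustration under assumptions; `readable_gzip_bytes` is a made-up helper name, and it uses a small in-memory archive rather than the real backup file. Because of read buffering, the byte count it returns on failure is a lower bound, not an exact position.

```python
import gzip
import io
import zlib


def readable_gzip_bytes(fileobj, chunk_size=1 << 20):
    """Decompress a gzip stream until EOF or the first error.

    Returns (bytes_decompressed, error). error is None when the whole
    stream decompressed cleanly; otherwise it is the exception raised.
    """
    total = 0
    try:
        with gzip.GzipFile(fileobj=fileobj, mode="rb") as gz:
            while True:
                chunk = gz.read(chunk_size)
                if not chunk:
                    return total, None
                total += len(chunk)
    except (OSError, EOFError, zlib.error) as exc:
        # OSError covers bad headers and CRC mismatches, EOFError a
        # truncated stream, zlib.error invalid compressed data.
        return total, exc


# Demo on in-memory data: one intact stream and one cut in half.
good = gzip.compress(b"appdata " * 50_000)
truncated = good[: len(good) // 2]

print(readable_gzip_bytes(io.BytesIO(good)))
print(readable_gzip_bytes(io.BytesIO(truncated)))
```

For a real archive, the file would be opened in binary mode (`open(path, "rb")`) and passed in the same way. An error here only says the archive is incomplete or damaged; it does not say why.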

Edited March 29, 2021 by jvlarc

jvlarc · March 30, 2021

So it seems like this BTRFS error is persistent, especially after I tried to 'Resume' my DelugeVPN torrents; however, nothing ever seems to seed or download. Is there something else going on?

Quote

Mar 30 17:28:31 Tower kernel: BTRFS warning (device md4): csum failed root 5 ino 380623 off 7575584768 csum 0x1632a1db expected csum 0x43d13607 mirror 1
Mar 30 17:28:31 Tower kernel: BTRFS error (device md4): bdev /dev/md4 errs: wr 0, rd 0, flush 0, corrupt 2081, gen 0
Mar 30 17:28:32 Tower kernel: BTRFS warning (device md4): csum failed root 5 ino 380623 off 7575584768 csum 0x1632a1db expected csum 0x43d13607 mirror 1
Mar 30 17:28:32 Tower kernel: BTRFS error (device md4): bdev /dev/md4 errs: wr 0, rd 0, flush 0, corrupt 2082, gen 0
Mar 30 17:29:50 Tower kernel: BTRFS warning (device md4): csum failed root 5 ino 380623 off 7575584768 csum 0x1632a1db expected csum 0x43d13607 mirror 1
Mar 30 17:29:50 Tower kernel: BTRFS error (device md4): bdev /dev/md4 errs: wr 0, rd 0, flush 0, corrupt 2083, gen 0
Mar 30 17:29:50 Tower kernel: BTRFS warning (device md4): csum failed root 5 ino 380623 off 7575584768 csum 0x1632a1db expected csum 0x43d13607 mirror 1
Mar 30 17:29:50 Tower kernel: BTRFS error (device md4): bdev /dev/md4 errs: wr 0, rd 0, flush 0, corrupt 2084, gen 0

I tried running memtest86+ on the 2 new RAM sticks, which I got in Jan, for 3+ hours and no errors were found. However, this error is now showing up again, though everything else seems to be running OK.

tower-diagnostics-20210330-1752.zip

JorgeB · March 30, 2021

The corrupt data will stay corrupt after you fix the RAM problem; you need to run a scrub and delete/restore the corrupt files identified in the syslog.

jvlarc · March 30, 2021

49 minutes ago, JorgeB said:

The corrupt data will stay corrupt after you fix the RAM problem; you need to run a scrub and delete/restore the corrupt files identified in the syslog.

Sorry, how should I go about running the scrub and properly deleting/restoring the corrupt files? Is it only available on the cache drive?

Thank you!

Edited March 30, 2021 by jvlarc

JorgeB · March 30, 2021

Click on the disk (in the case above, md4=disk4), then click the scrub button. After it's done, look at the syslog for the list of corrupt files, then just delete them or restore from backups.
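If the syslog is long, the file list can be pulled out with a short script instead of by eye. This is a rough Python sketch under assumptions: it relies on the scrub warning ending in `(path: ...)`, which is how the kernel reports a checksum error it can map to a file; the helper name `corrupt_paths` and every value in the sample lines (timestamps, addresses, inode numbers, file names) are hypothetical and not taken from this thread.

```python
import re

# Scrub warnings that the kernel can map to a file end with "(path: ...)".
PATH_RE = re.compile(
    r"BTRFS warning \(device (?P<dev>\S+)\): checksum error at logical \d+ "
    r".*\(path: (?P<path>.*)\)\s*$"
)


def corrupt_paths(log_text):
    """Return a sorted list of unique (device, path) pairs from scrub warnings."""
    found = set()
    for line in log_text.splitlines():
        match = PATH_RE.search(line)
        if match:
            found.add((match["dev"], match["path"]))
    return sorted(found)


# Hypothetical sample lines, for illustration only.
sample = """\
Apr  1 10:00:00 Tower kernel: BTRFS warning (device md4): checksum error at logical 1000 on dev /dev/md4, physical 2000, root 5, inode 300, offset 0, length 4096, links 1 (path: isos/example.iso)
Apr  1 10:00:01 Tower kernel: BTRFS warning (device md4): checksum error at logical 5000 on dev /dev/md4, physical 6000, root 5, inode 300, offset 4096, length 4096, links 1 (path: isos/example.iso)
Apr  1 10:00:02 Tower kernel: BTRFS warning (device md4): checksum error at logical 9000 on dev /dev/md4, physical 9500, root 5, inode 301, offset 0, length 4096, links 1 (path: isos/other.iso)
"""

paths = corrupt_paths(sample)
for dev, path in paths:
    print(dev, path)
```

The paths in these messages are relative to the root of the filesystem, so a file on disk4 would sit under `/mnt/disk4/`. One file usually produces many warnings (one per damaged block), which is why the sketch de-duplicates them.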

jvlarc · April 1, 2021

On 3/30/2021 at 7:12 PM, JorgeB said:

Click on the disk (in the case above, md4=disk4), then click the scrub button. After it's done, look at the syslog for the list of corrupt files, then just delete them or restore from backups.

This works great, and yes, it found about 600 corrupted blocks due to the bad RAM. I finally got it all cleared after 2 tedious days, and the server seems to be back to its usual responsiveness.😅

Anyway, it now seems that the DelugeVPN WebUI isn't accessible when the VPN is turned on and can only be accessed when the VPN is turned off. It does seem to be downloading when the VPN is on, but a ton of torrents are being requested at the same time. Any idea how I could fix this?

JorgeB · April 1, 2021

Best to use the existing docker support thread:

[Attached screenshot: 649391616_dockersupport.PNG]

jvlarc · April 6, 2021

Hi again, I managed to solve the previous issue with a full fresh install of Deluge, and the system had been running fine until last night.

I started getting another BTRFS Error and my docker port keeps entering blocking and forwarding state.

I'm facing another issue as I try to re-install the cache drive that I was trying to recover the appdata from. When I assign 2 drives to the cache, I only get a notification that I need to format both cache drives. After my last mistake, I did not format, for fear of having to set everything up again. So I removed the 2nd drive and formatted it via UD, and have not added it back into the array yet, as I replaced my last Toshiba N300 with a new shucked drive last night. Is this normal, or is something corrupted? I've also run CA Backup / Restore and verified that tar returned 0 (which I assume means 0 errors, from what I researched), and have changed the appdata share from 'prefer' to 'yes'.

Also, Toshiba has acknowledged that the N300 has some issues running with unRAID and has offered to exchange mine for their MG (enterprise) series drives with a brand new 5-year warranty; however, I would need to pay a small top-up. Is this a good idea?

Anyway here's the latest diagnostics, thanks so much for any help! Love this community.

tower-diagnostics-20210406-1254.zip

JorgeB · April 6, 2021

Btrfs errors are from disk4, data corruption detected, this is likely caused by a hardware problem.

As for the cache, formatting one of the devices is not a good idea, but post new diags after a reboot, with the array started.

jvlarc · April 6, 2021

1 hour ago, JorgeB said:

Btrfs errors are from disk4, data corruption detected, this is likely caused by a hardware problem.

As for the cache, formatting one of the devices is not a good idea, but post new diags after a reboot, with the array started.

Any idea how I could find out what hardware is causing the issue?

ok I’ll post it once the parity for the new shucked drive has been rebuilt, estimated another 12hrs 🤧

Also, I'm never able to connect through my Mellanox 10GbE card; is that a problem too?

JorgeB · April 6, 2021

Most common cause is bad RAM, start by running memtest.

jvlarc · April 6, 2021

1 hour ago, JorgeB said:

Most common cause is bad RAM, start by running memtest.

How long should I run memtest for, btw? I previously ran it for 3 hours on these 3 RAM sticks I have and no issues were found. I removed 1 bad RAM stick, which threw a good 25 errors in the first hour, and I've since removed the additional RAM stick so as to not 'mix and match' before I started the server again.

Will do another scrub on disk4 once the parity has been completed. Thanks for the help rendered thus far @JorgeB!

JorgeB · April 6, 2021

If they are new corruptions, there's still a problem; if it only happens on that disk, it could also be a disk issue. Memtest should ideally run for 24h, though in most cases when there's a problem it will detect it within a few hours. Also note that it can't detect all issues, so a negative result is not confirmation that there's no problem, while a positive result confirms there is one.

trurl · April 6, 2021

14 hours ago, jvlarc said:

changed the appdata share from 'prefer' to 'yes'

This is wrong unless your intention is to move appdata from cache to array temporarily so you can reformat it. Prefer means prefer to keep it on cache and that is the recommended setting for appdata, domains, system shares.

All of those shares have files on your array instead of all on cache as they should, and system is the only one correctly set to prefer.

jvlarc · April 7, 2021

12 hours ago, trurl said:

This is wrong unless your intention is to move appdata from cache to array temporarily so you can reformat it. Prefer means prefer to keep it on cache and that is the recommended setting for appdata, domains, system shares.

All of those shares have files on your array instead of all on cache as they should, and system is the only one correctly set to prefer.

Yes, I'm planning to move the appdata to the array temporarily while I reformat both my cache drives, and move it back again once I restart the array.

These errors only show up when I try scrubbing disk 4, but the scrub still can't seem to fix them. Should I use unBALANCE to move all files off this drive and reformat it? Will that help solve the issue, or will the corrupted files just move over to another disk? I can't seem to locate the corrupted data on disk 4.

Quote

Apr 7 13:13:43 Tower ool www[6597]: /usr/local/emhttp/plugins/dynamix/scripts/btrfs_scrub 'start' '/mnt/disk4' ''

Apr 7 13:13:43 Tower kernel: BTRFS info (device md4): scrub: started on devid 1

Apr 7 13:14:22 Tower move: error: move, 391: No such file or directory (2): lstat: /mnt/cache/appdata/binhex-plexpass/Plex Media Server/plexmediaserver.pid

Apr 7 13:15:44 Tower kernel: BTRFS error (device md4): bdev /dev/md4 errs: wr 0, rd 0, flush 0, corrupt 2637, gen 0

Apr 7 13:15:44 Tower kernel: BTRFS error (device md4): unable to fixup (regular) error at logical 201858580480 on dev /dev/md4

Apr 7 13:15:44 Tower kernel: BTRFS error (device md4): bdev /dev/md4 errs: wr 0, rd 0, flush 0, corrupt 2638, gen 0

Apr 7 13:15:44 Tower kernel: BTRFS error (device md4): unable to fixup (regular) error at logical 201859952640 on dev /dev/md4

Apr 7 13:25:53 Tower kernel: BTRFS error (device md4): bdev /dev/md4 errs: wr 0, rd 0, flush 0, corrupt 2639, gen 0

Apr 7 13:25:53 Tower kernel: BTRFS error (device md4): unable to fixup (regular) error at logical 254439288832 on dev /dev/md4

Apr 7 13:26:26 Tower kernel: BTRFS error (device md4): bdev /dev/md4 errs: wr 0, rd 0, flush 0, corrupt 2640, gen 0

Apr 7 13:26:26 Tower kernel: BTRFS error (device md4): unable to fixup (regular) error at logical 260863025152 on dev /dev/md4

Apr 7 13:26:26 Tower kernel: BTRFS error (device md4): bdev /dev/md4 errs: wr 0, rd 0, flush 0, corrupt 2641, gen 0

Apr 7 13:26:26 Tower kernel: BTRFS error (device md4): unable to fixup (regular) error at logical 260868759552 on dev /dev/md4

Edited April 7, 2021 by jvlarc

JorgeB · April 7, 2021

Corrupt files can't be moved by the mover, you should delete them or restore from backup.

jvlarc · April 7, 2021

3 hours ago, JorgeB said:

Corrupt files can't be moved by the mover, you should delete them or restore from backup.

Sorry, but do you know where I can locate the corrupted files? I'm having trouble finding these 5 corrupted files.
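One possible approach, sketched here rather than confirmed by anyone in the thread: the "unable to fixup" lines quoted earlier each carry a logical address, and `btrfs inspect-internal logical-resolve` can try to map such an address back to a file path. The Python below only extracts the addresses and prints the commands; it does not run them. The function names are made up for illustration, and `/mnt/disk4` as the mount point for md4 is an assumption based on the scrub line in that log.

```python
import re
import shlex

FIXUP_RE = re.compile(
    r"BTRFS error \(device (?P<dev>\S+)\): unable to fixup \(regular\) error "
    r"at logical (?P<logical>\d+) on dev (?P<bdev>\S+)"
)


def logical_addresses(log_text):
    """Return the unique logical addresses from 'unable to fixup' lines, in log order."""
    seen = []
    for line in log_text.splitlines():
        match = FIXUP_RE.search(line)
        if match:
            address = int(match["logical"])
            if address not in seen:
                seen.append(address)
    return seen


def resolve_commands(addresses, mount_point):
    """Build one logical-resolve command line per address (nothing is executed)."""
    return [
        shlex.join(["btrfs", "inspect-internal", "logical-resolve", str(a), mount_point])
        for a in addresses
    ]


# The five 'unable to fixup' lines from the scrub log quoted earlier in the thread.
sample = """\
Apr 7 13:15:44 Tower kernel: BTRFS error (device md4): unable to fixup (regular) error at logical 201858580480 on dev /dev/md4
Apr 7 13:15:44 Tower kernel: BTRFS error (device md4): unable to fixup (regular) error at logical 201859952640 on dev /dev/md4
Apr 7 13:25:53 Tower kernel: BTRFS error (device md4): unable to fixup (regular) error at logical 254439288832 on dev /dev/md4
Apr 7 13:26:26 Tower kernel: BTRFS error (device md4): unable to fixup (regular) error at logical 260863025152 on dev /dev/md4
Apr 7 13:26:26 Tower kernel: BTRFS error (device md4): unable to fixup (regular) error at logical 260868759552 on dev /dev/md4
"""

addresses = logical_addresses(sample)
commands = resolve_commands(addresses, "/mnt/disk4")
for command in commands:
    print(command)
```

Each printed command would be run from the server's terminal with the array started. It may report nothing for an address that belongs to metadata or to data no longer referenced by any file, so an empty result does not necessarily mean the filesystem is clean.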
