
[SOLVED] Lots of recent issues



I've been fairly happy with my Unraid setup, but recently I've started to see some issues that might force me to rethink using it as my daily workstation / enhanced NAS box all in one.

 

1. Recently noticed that disk2 (sdg) looks like it has some CRC errors on it.

- I have a new drive getting delivered today.

- The plan was to remove the current parity drive (4TB) and install the new 8TB drive, let the parity re-sync, then reuse the current parity drive as drive 2's (sdg) replacement.

2. For the past week or so, parity checks (scheduled monthly) are now running after almost every other reboot (I don't keep this box online 24/7). I have disabled the checks for now, considering #3 on this list.

3. Over the same time period as #2 (they seem to be related), parity checks now cause my VM to become unresponsive whenever the VM's screensaver turns on. If I invoke the parity check manually, the VM is fine... it's only when the system runs it. Since I was so smart when building this system and decided to get an AMD 5700.... the system requires a complete reboot of the box every time this happens (thanks AMD, you.... ). Since this box is my main machine, this becomes very annoying very quickly.

4. My unassigned SSD (sdc) is overflowing with btrfs errors and gets put into read-only mode. Lucky for me, my main VM is on nvme0n1, so I'm kind of still up and running, but a few of my other VMs' vdisks now look to be completely missing from this drive. Even if they are not recoverable, it's not exactly the end of the world, as I do have backups on the array (assuming that also is not completely messed up at this point).

5. Shares are no longer accessible. It says my user is not allowed, so I cannot even browse any of my files anymore.

6. The network device br0 keeps going in and out of promiscuous mode. No idea what is causing that.

 

Is my system doomed and should I jump from this sinking ship?

 

 

universe-diagnostics-20200909-1153.zip

Link to comment

Some observations:

- CRC errors are a connection problem, usually the SATA cable. No need to replace the disk; replace the cable.

- A parity check will run after an unclean shutdown, which is most likely what's causing them.

- The cache filesystem is corrupt; you need to re-format cache. Note that you're overclocking the RAM, and that's a known issue with AMD; it can in some cases even cause data corruption. See here for more info.
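For reference, the counter behind those CRC warnings is SMART attribute 199 (UDMA_CRC_Error_Count), which smartctl (part of smartmontools, shipped with Unraid) can read. A minimal sketch; the parsing runs against a captured sample line so it stays self-contained, and the commented command at the end is what you would actually run on the live system:

```shell
# Parse the raw CRC error count out of a SMART attribute line.
# The sample mimics one line of `smartctl -A /dev/sdg` output; the
# final field is the raw error count.
sample='199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 12'
crc_count=$(echo "$sample" | awk '{print $NF}')
echo "CRC error count: $crc_count"

# On the live system (root required):
#   smartctl -A /dev/sdg | grep UDMA_CRC_Error_Count
```

A non-zero raw value that stops increasing after reseating or replacing the cable is a good sign the connection, not the disk, was the culprit.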

 

 

Link to comment
20 minutes ago, thedudezor said:

3. Over the same time period as #2 (they seem to be related), parity checks now cause my VM

Your domains share (where the VMs' vdisks usually live) is currently located on disks 1, 2, 3 and the cache drive.  If your VM's vdisk is sitting on any of the data drives, then it is normal and expected for performance to go down the drain while a parity check is running.  You should ensure they are on the cache drive for the utmost in performance.

Link to comment

1. CRC errors are connection problems, not disk problems. Just click on the warning on the Dashboard to acknowledge it and it won't warn again until it increases. This is not a reason to replace the disk, so just put that aside and worry about the other things.

 

2. This sounds like the parity check you will always get after an unclean shutdown. Are you shutting down from the webUI? Or are you forcing power down? Here is a link in the FAQ that explains what causes the system to decide you have had an unclean shutdown that triggers parity check:

 

3. You have appdata, domains, system shares on the array, so your docker/VM performance is being impacted by parity and dockers/VM will keep array disks spinning since these files are always open. You want those shares on cache (or other fast pool) and configured to stay on pool.

 

Also, have you seen this FAQ:

 

4. Since you are on latest beta, why aren't you using multiple pools feature instead of UD for these? See this FAQ for ideas on recovering btrfs:

 

5. Looks like User Shares are mounted so I don't know what your problem might be. Perhaps I am misunderstanding. Can you elaborate?

 

6. ?

 

Link to comment

1. Got it. I'll have to check the cable / connection and monitor this. Should have known better, thank you.

 

 

2. I get that an unclean shutdown would cause a check. I'm shutting the server down the same way I always have (shut down the VM -> wait, then use the webUI -> shutdown). Not sure why this has *recently* become a problem. Like I said, it's not every time I shut down; it has just, for some reason, become more frequent. I guess I'll try adding more wait time after the VM is down?

 

 

3 & 4. Appdata, System and domains are all set to use cache. My "main" VM (the one that locks up when a system-invoked parity check runs) is on the UD nvme, so the vdisk should not be on any of the disks getting checked?

 

- As for these even sitting on the array in the first place: I stopped docker and moved appdata, so is that fixed? As for domains, I wanted to allow vdisks to be created on the array, considering the cache drive is small (250GB); for the most part, after I create a VM, I plan on moving the file to a UD SSD or nvme.

 

- The system folder / files I have no idea how to resolve. Can someone point me to a wiki guide on which folders / files can be safely moved to the user mount?

 

- As for creating a cache pool, I would need to read more about this as an option. I agree it makes sense to pool my two SSD drives, as they have similar speed characteristics (even though the smaller drive is on a SATA 3Gb link vs SATA 6Gb, as I recall); the nvme drive, however, is a substantially faster tier of device. Because of that, I think I will always just use UD and tie that disk to my main VM.

 

- Specifically on 4 (the reported drive errors): the cache drive is not reporting the btrfs errors, the UD sdc1 device is. I attempted to recover based on the guide posted, and it looks like I can get 1 of the 3 VM vdisks back. I'm not sure about the integrity of that vdisk, so I'm considering all of the data on this disk a loss. So I guess I need to format the sdc1 disk anyway; maybe giving the cache pool a try is a good option?

 

5. Not much else I can add here. The same username / pass has been used for a long time now. I can see the main domain folder, but it is not accessible. Even the public / exportable shares are not displayed. I would say I could test with another known machine... but my other VMs are missing now due to #3 & 4 listed above. lol. I have not tried a reboot, so I hope this auto-magically corrects itself. [EDIT: A reboot did fix this]

 

6. Not sure what else I can provide here, other than

Quote

Sep 9 14:58:09 Universe kernel: device br0 entered promiscuous mode
Sep 9 14:58:14 Universe kernel: device br0 left promiscuous mode
Sep 9 14:58:14 Universe kernel: veth972b03c: renamed from eth0
Sep 9 14:59:14 Universe kernel: eth0: renamed from vethc4d23a9

 

It just shows that almost endlessly in the log.

 

General - I apparently didn't realize 3600MHz was an overclock. Pretty sure the mobo I have defaulted to this speed using the baked-in memory profile. I'll dial that back to 3200 and see if things improve. Again, thank you.

 

 

Edited by thedudezor
Link to comment
1 hour ago, thedudezor said:

2. I get that an unclean shutdown would cause a check. I'm shutting the server down the same way I always have (shut down the VM -> wait, then use the webUI -> shutdown). Not sure why this has *recently* become a problem. Like I said, it's not every time I shut down; it has just, for some reason, become more frequent. I guess I'll try adding more wait time after the VM is down?

A clean shutdown that doesn't happen, or that isn't recorded, is really what it is about.

On 12/21/2019 at 1:28 PM, trurl said:

Another thing that can happen is the flash drive is not mounted or not otherwise writable for some reason. This often happens when people are having problems with flash and it gets dropped. This seems to be more common for USB3 ports.

Syslog server might reveal whether flash could be written on previous shutdown:

 

1 hour ago, thedudezor said:

3 & 4. Appdata, System and domains are all set to use cache. My "main" VM (the one that locks up when a system-invoked parity check runs) is on the UD nvme, so the vdisk should not be on any of the disks getting checked?

There are 4 different settings for cache:

  • Cache-yes means try to write to cache if there is room, and move any on cache to the array. This is how you had appdata and system set in those diagnostics.
  • Cache-prefer means try to write to cache if there is room, and move any on array to cache. This is how you want appdata and system.
  • Cache-only means write to cache even if there isn't room, and don't move.
  • Cache-no means write directly to the array and don't move. This is how you had domains set, though it did have some files on cache from a previous setting.

Note that mover ignores cache-no and cache-only shares, so any files that already exist will not be moved. Also, mover can't move open files, so if docker and VM services are enabled in Settings, then mover can't move them. And finally, mover won't overwrite existing files, so if there are duplicates they won't get moved either.

 

1 hour ago, thedudezor said:

As for these even sitting on the array in the first place: I stopped docker and moved appdata, so is that fixed?

Don't know. How did you move and what is appdata cache setting now? You can see how much of each disk is used by each of your user shares by going to the User Shares page and clicking Compute... for the share.

 

1 hour ago, thedudezor said:

As for domains, I wanted to allow vdisks to be created on the array, considering the cache drive is small (250GB); for the most part, after I create a VM, I plan on moving the file to a UD SSD or nvme.

That is how some people prefer to do it. But your VM only needs a vdisk for its OS. It can access other Unraid storage over the virtual network for other purposes. Do as you will with domains, except for perhaps using a pool instead of UD.

 

1 hour ago, thedudezor said:

The system folder / files I have no idea how to resolve. Can someone point me to a wiki guide on which folders / files can be safely moved to the user mount?

Set it to cache-prefer, stop docker and VM services in Settings, run mover. Might be some duplicates that don't get moved that will have to be cleaned up manually.

 

1 hour ago, thedudezor said:

As for creating a cache pool, I would need to read more about this as an option.

Multiple pools is really what I was thinking, one pool for normal "caching", and another for those things that need faster access, such as appdata, domains, system. That is what I do. Did you read the release thread for the beta you are running?

 

1 hour ago, thedudezor said:

5. Not much else I can add here.

Possibly the corruption on cache is breaking user shares even though they appear to mount. Fixing that will probably fix this.

Link to comment
1 hour ago, thedudezor said:

I'm shutting the server down the same way I always have (shut down the VM -> wait, then use the webUI -> shutdown). Not sure why this has *recently* become a problem. Like I said, it's not every time I shut down; it has just, for some reason, become more frequent. I guess I'll try adding more wait time after the VM is down?

Use the stop array button, and only use the shutdown button after the stop command returns properly.

Link to comment

Also, please post new diags if a parity check begins after booting; that will confirm whether it's because of an unclean shutdown.

 

12 hours ago, thedudezor said:

the cache drive is not reporting the btrfs errors, the UD sdc1 device is

You're right of course, my bad, only took a quick look at the syslog and assumed it was cache.

12 hours ago, thedudezor said:

Not sure about the integrity of that vdisk, so I'm considering all of the data on this disk a loss.

It depends on how the data was recovered. If the vdisk was copied normally from the device, even using one of the recovery mount options, it will be OK, since any corruption would have been detected by btrfs. If it was recovered with btrfs restore, it might be corrupt, since that option is a last resort and doesn't verify checksums.
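When an older backup of the vdisk exists on the array, a byte-for-byte comparison is one quick extra sanity check. A minimal sketch with stand-in files (real paths would be something like the recovered vdisk under /mnt/disks and its backup under /mnt/user):

```shell
# Compare a recovered vdisk against its backup copy byte for byte.
recovered=$(mktemp)   # stand-in for the recovered vdisk image
backup=$(mktemp)      # stand-in for the backup on the array
printf 'vdisk-bytes' > "$recovered"
printf 'vdisk-bytes' > "$backup"

# cmp -s exits 0 only if the files are identical.
if cmp -s "$recovered" "$backup"; then
    result="identical"
else
    result="differs"
fi
echo "$result"
```

This only proves the recovered copy matches the backup, of course; if the VM was written to after the backup was taken, the files will legitimately differ.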

Link to comment

Here is what I have done so far.

- I removed the overclocked RAM profile in the bios.

 

- A reboot seems to have auto-magically corrected the missing share issue, the user not allowed issue and I am no longer seeing the virtual nic bouncing in and out of promiscuous mode endlessly in the logs.

 

- I had already planned on upgrading the parity drive to a larger one so I could re-use the old parity drive in the array, as I need more space (old config: 4TB parity, 3x 2TB; planning to move to 8TB parity with 2x 2TB and 1x 4TB). Once the rebuild is done, I will swap out one of the 2TB drives with the old 4TB drive.

 

- I found that the disk2 drive (sdg) had possibly not been fully seated in the hotswap bay. (For some reason I used different screws on this drive caddy. Go figure....) I will clear the CRC errors and monitor, and / or maybe just replace this drive anyway due to the upgrade mentioned above.

 

- I wanted to try a cache pool as a possible option since my other SSD was fubar'ed with btrfs errors. I added that, and it's now showing as almost completely full (1.2TB) even though I only have appdata, system and a "temp" user share with hardly any data in it set to preferred... Not sure how to undo this now? I almost don't even want a cache drive. I'm OK with spinning-disk array speeds for anything other than "temp" VM-to-VM file sharing (small, sub-10GB files). Not sure how screwed up things will get if I pull that newly added 2nd drive back out and just use it as a UD drive again.

 

- I will start shutting down by first stopping the array, waiting for that to confirm, then shutting down. So far this has not been an issue, but like I said, an unclean shutdown wouldn't happen every time anyway.

 

 

Link to comment
17 minutes ago, thedudezor said:

I wanted to try a cache pool as a possible option since my other SSD was fubar'ed with btrfs errors. I added that, and it's now showing as almost completely full (1.2TB) even though I only have appdata, system and a "temp" user share with hardly any data in it set to preferred... Not sure how to undo this now? I almost don't even want a cache drive.

Diagnostics would give us a clearer and more complete idea how you have reconfigured.

Link to comment

I thought you meant you had added another pool, which is what I was suggesting.

18 hours ago, trurl said:

Did you read the release thread for the beta you are running?

 

It looks like instead you added another disk of a different size to the existing cache pool. The default is btrfs raid1, which gives a mirror with total capacity equal only to the smaller disk, so that cache pool only has 240G. And a pool is all one storage volume, so there is no way to separate different things, such as appdata, onto different drives.
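The capacity rule here can be sanity-checked by hand: btrfs raid1 stores two copies of every block, so a two-device pool can never hold more than its smaller member. A quick sketch with the sizes from this thread:

```shell
# btrfs raid1 usable capacity for a two-device pool is bounded by the
# smaller member, not by the sum the UI may display.
small=240    # GB, the original cache SSD
large=1000   # GB, the newly added SSD
usable=$(( small < large ? small : large ))
echo "raid1 usable: ${usable} GB (raw total: $(( small + large )) GB)"
```

On a live pool, `btrfs filesystem usage /mnt/cache` reports the real allocatable figure directly.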

 

Also, your appdata and system still have files on the array, and your parity is invalid.

 

 

Link to comment

Wow. Yes, I completely misunderstood: I added the drive to the existing cache pool instead of adding a new pool. Had I looked at the release thread, I would have realized that we can now add SSDs to any pool (not just the cache pool). It makes complete sense that since it's raid1, even though the 1TB SSD is in the cache pool, usable space is restricted to the smallest member of that pool (the 240GB drive). It is somewhat misleading, though, that the WebUI shows the total size of that cache pool as 1.2TB when it should show the actual usable 240G?

 

Yes, I know the parity is invalid; the rebuild onto the new, larger parity drive is still ongoing, and I am not going to make any further changes until that process is completed. I need access to the VMs that got nuked on the btrfs-errored drive (which I can restore from the array disks), but I refuse to do anything until that rebuild is done. I do not have another SATA port on my mobo, which is why I didn't just add the new parity drive while the other parity drive was still installed.

Edited by thedudezor
Link to comment
  • JorgeB changed the title to [SOLVED] Lots of recent issues
