
Mover Related Cache Issue


DaWord2011


I have recently been experiencing issues with the mover. When I invoke the mover, manually or on schedule, about half the time it locks up partway through and acts as if it has completed. When I check the cache array's capacity, I can see that the move did not finish, and restarting changes nothing. While this is happening, I can navigate through my array but cannot read from or write to any of it; I get a permissions error instead. None of my apps (all installed on the cache array) function properly. Plex, for instance, loads but none of the content is available.

If I reboot the server, all of the data is retained and the array comes back up and works properly. If I simply stop and start the array, it reports that the cache array cannot be added and that there is no filesystem. At that point, a separate SSD that I use as a VM drive (unassigned) also stops working. It does not fail all at once: the VM on it BSODs and reboots, and when it comes back up it drops into the default BIOS screen and shows no boot options.

I am a little stumped. I had recently been dealing with random power-downs and found that the PSU was the cause. Since I replaced the original unit with a new Seasonic one, the power issues have been resolved. Any ideas? Attached are some anonymized logs that I grabbed on two separate occasions after noticing the issue.

Diag-anon.zip

Diag-r&r-anon0817.zip

 

Server Specs are as follows: 

2x Xeon E5-2630 V2

ASRock EP2C602 Motherboard

4 x 8GB Samsung ECC RAM

Seasonic Focus Gold 850W PSU

6 x 4TB Seagate HDDs

1 x 2TB WD Red HDD

2 x 256GB SSDs (Cache)

1 x 128GB SSD (VMs)


That's a nasty looking syslog! It's chock full of error messages. You have an NTFS-formatted LiteOn disk (/dev/sdi) mounted under Unassigned Devices that is throwing I/O errors continuously. There's no SMART report for it. There are also a lot of messages showing the SATA interface being reset repeatedly, which suggests multiple bad SATA cables (or, maybe, a bad controller). Your cache SSDs have no SMART reports either. There are BTRFS errors reported on your cache pool. Whether that indicates a corrupt file system or faulty hardware I can't tell, because there is no SMART data. How did your server get into that state? Do you have notifications enabled?
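For anyone triaging a similar syslog, a rough tally of the three kinds of messages mentioned above (SATA link resets, block I/O errors, BTRFS errors) can show which one dominates. The following is a minimal sketch, not a diagnostic tool: the function name `tally_syslog` and the regular expressions are assumptions made for illustration, and the exact kernel message wording varies between kernel versions and drivers, so the patterns may need adjusting for a given log.

```python
import re
from collections import Counter

# Illustrative patterns only (assumptions): kernel message wording differs
# between kernel versions and drivers, so adjust these to match your syslog.
PATTERNS = {
    # libata link resets, e.g. lines containing "ataN: hard resetting link"
    "sata_link_reset": re.compile(r"\bata\d+(?:\.\d+)?: .*resetting link"),
    # block-layer I/O errors that name the affected device
    "io_error": re.compile(r"I/O error, dev \w+"),
    # messages logged by the btrfs driver at error-like severities
    "btrfs_error": re.compile(r"BTRFS (?:error|warning|critical)"),
}


def tally_syslog(lines):
    """Count how many lines match each category in PATTERNS."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

It accepts any iterable of lines, so an open file works directly, for example `tally_syslog(open("syslog.txt", errors="replace"))`, where `syslog.txt` stands for whichever syslog file you extracted from the diagnostics zip.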


My thoughts exactly, all 3 of the SSDs look like they are having issues. I have started to wonder whether this is an issue with my case or mobo. These 3 SSDs are in their own 4-drive caddy, but each of them is connected to the motherboard SATA ports instead of the SAS 9211-8i card that all of my HDDs are hooked into. I manually ran a quick SMART test on all 3 SSDs and didn't see any new errors. I am starting to suspect a SATA controller, or possibly an issue with that 4-drive bay itself.

Attached are logs from last night. The mover ran successfully, and the cache is back to its "empty" state (appdata always stays in there). These logs should include short SMART tests for the three SSDs as well. I noticed some errors late at night, but they do not seem to be as prominent as in the other logs.

I have notifications set up, however, it is not out of the realm of possibility that they are set up incorrectly. The notifications were what led me to the unclean shutdown investigation that led me to replace the power supply. I am wondering if the unclean shutdowns may have led to the BTRFS becoming corrupt.

I will first attempt the easy stuff of replacing SATA cables. Then I will move the SSDs out of the caddy and run them connected directly with a cable. If that fails, I will move them to the board's other SATA controller or to my SAS card and see whether the problem is related to the mobo controllers.
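When working through a plan like this, it helps to confirm which controller each drive actually hangs off before and after moving cables. On Linux, `/sys/block/<dev>` is a symlink whose resolved path includes the PCI address of the controller. The sketch below groups drives by that address; the function names `controller_of` and `group_by_controller` are hypothetical, and the device names passed in are whatever your system assigned.

```python
import os
import re

# A PCI address looks like 0000:00:1f.2 (domain:bus:device.function).
PCI_ADDRESS = re.compile(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]")


def controller_of(sysfs_path):
    """Return the last PCI address found in a resolved sysfs path, or None.

    The last address is the device closest to the disk, i.e. the
    SATA/SAS controller rather than an upstream PCI bridge.
    """
    matches = PCI_ADDRESS.findall(sysfs_path)
    return matches[-1] if matches else None


def group_by_controller(devices, resolve=os.path.realpath):
    """Map controller PCI address -> list of block device names."""
    groups = {}
    for dev in devices:
        path = resolve(f"/sys/block/{dev}")
        groups.setdefault(controller_of(path), []).append(dev)
    return groups
```

Calling `group_by_controller(["sdb", "sdc", "sdi"])` on the server (with your own device names) shows at a glance whether the three SSDs share one controller; the address can then be matched against the output of `lspci` to see which chip it is.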

tower-diagnostics-20180824-0746.zip

  • 6 months later...

Sorry, I forgot to update. The issue turned out to be the SATA controller on my motherboard (sad, since it is such a nice board). Moving the cache drives to the other SATA controller resolved the issues. After that I simply bought a second LSI card, flashed it, and dropped it in. After balancing and scrubbing the cache array I haven't had a single error!
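One way to double-check a result like this after a balance and scrub is to read the per-device error counters that `btrfs device stats <mountpoint>` prints, one `[device].counter value` pair per line. The sketch below parses that text and lists any device with a non-zero counter. The helper names are hypothetical, and the mount point `/mnt/cache` used in the note afterwards is an assumption about where the cache pool is mounted.

```python
import re

# Matches lines such as "[/dev/sdb1].corruption_errs  0"
STAT_LINE = re.compile(
    r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)\s*$"
)


def parse_device_stats(text):
    """Parse `btrfs device stats` output into {device: {counter: value}}."""
    stats = {}
    for line in text.splitlines():
        m = STAT_LINE.match(line.strip())
        if m:
            stats.setdefault(m["dev"], {})[m["counter"]] = int(m["value"])
    return stats


def devices_with_errors(stats):
    """Return the sorted names of devices that have any non-zero counter."""
    return sorted(
        dev for dev, counters in stats.items() if any(counters.values())
    )
```

The text can come from `subprocess.run(["btrfs", "device", "stats", "/mnt/cache"], capture_output=True, text=True).stdout`. An empty list from `devices_with_errors` means every counter is zero. Note that these counters are cumulative until they are reset, so old errors from the faulty controller may still show up.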

 

Thank you for the brainstorming.

On 8/24/2018 at 8:59 AM, DaWord2011 said:

I have notifications set up, however, it is not out of the realm of possibility that they are set up incorrectly.

You must set up notifications to alert you immediately, by email or another agent, whenever Unraid detects a problem. You don't want to let problems accumulate. Multiple problems are a lot more difficult, if not impossible, to recover from.


