
Intermittently cannot delete files from cache-only share


kingy444

Recommended Posts

I recently upgraded to 6.9.2 just before doing the below, so I'm not sure whether the issue would have been present on previous versions.

 

After upgrading to 6.9.2 I cleared my cache and array and started using BTRFS encryption for the cache pool and XFS encryption for the array. In addition, I started using Secure instead of Public shares.

 

Prior to this I had no issues and all files were owned by 'nobody'.

 

I have now noticed on three separate occasions an issue where, when I delete a file over SMB from Windows, rather than being deleted the file stays but loses its 'owner' as far as Windows is concerned (note I mean the owner is actually blank, not set to nobody). No error is received; the file just reappears with no owner.

Checking the file in the Unraid shell, you can see it still has its owner, but further attempts to delete the file from Windows are prevented because Windows sees no owner.
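For reference, this is roughly how I compare the two views from the shell (the share and file names below are just placeholders):

# owner/group as seen through the user share (fuse) path
ls -l /mnt/user/sharename/somefile

# owner/group on the underlying cache pool path
ls -l /mnt/cache/sharename/somefile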

 

 

On the first occasion I rebooted the server and everything came good.

On the second occasion I browsed to \\UNRAID\cache\share rather than \\UNRAID\share and could successfully delete files I could not delete from \\UNRAID\share. I then renamed the share in the UI (for about 10 seconds), named it back, and could again delete files from \\UNRAID\share.

 

I am now on the third occasion and have again confirmed that deleting from \\UNRAID\cache\share works, while deleting from \\UNRAID\share results in files losing ownership instead.

I have not rebooted or renamed the share in case there is something I should do while the system is still in its broken state.

 

This only appears to affect the share I have set to "Cache: Only"; I can delete from other shares at the moment.

 

Edited by kingy444
attach diagnostics
Link to comment

Thought I would add that the issue is not linked to a specific Windows host; all Windows machines on the network experience the same issue at the same time.

 

I do have the option to downgrade to 6.8.3, but I'm just seeking some advice before doing so, as everything was solid before and I'm unsure whether this is related to securing the shares or to the upgrade.

Same files in \\unraid\share vs \\unraid\cache\share:

[screenshots attached: the same files viewed via both paths, with the Owner column blank in \\unraid\share but populated in \\unraid\cache\share]

Edited by kingy444
add pictures
Link to comment

So I downgraded to 6.8.3 and, after just over 5 days of uptime, the issue has returned, so it is not related to the upgrade.

 

I can still write to the NAS while the issue is present, but I cannot delete files.

 

Really appreciate any help. My assumption is it may be related to the cache drive now being encrypted, as I didn't have the issue before, and it only happens on Windows hosts (but all Windows hosts on the network at the same time), so it must somehow be related to the samba conf.

 

EDIT:

After posting I thought I would try setting the share to not use the recycle bin. Not sure if that is the issue, but I didn't want to sit around doing nothing.

 

As soon as I excluded the share, Windows hosts could see the owner again. I'd appreciate any other advice people can provide, as I am not sure this will be a permanent resolution.

Edited by kingy444
Link to comment
1 hour ago, trurl said:

What point are you trying to make with these? Cache is part of user shares so these are the same files. I recommend not sharing disks on the network.

 

This is a screenshot of the same files in \\unraid\share and \\unraid\cache\share at the same time.

 

You can see in \\unraid\share that the Owner is missing

You can see in \\unraid\cache\share that the Owner is present

 

The fact the owner is missing prevents the file from being deleted. But what happens (when the issue is present) is the file has an Owner, you delete it, it disappears for about 10 seconds, then comes back with no owner.

 

I don't share \\unraid\cache normally, but did so as part of my troubleshooting here, as I could still see the Owner on the files in the Unraid shell.

 

TL;DR: your point is exactly mine. These are the same files.

So something is happening to make \\unraid\share forget the Owner of the files until samba restarts.
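Next time it happens I may try bouncing samba on its own rather than renaming the share or rebooting. I believe Unraid ships a Slackware-style init script for this, though the exact path below is an assumption on my part:

# restart only the samba service (assumed path on Unraid - verify before relying on it)
/etc/rc.d/rc.samba restart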

Link to comment
  • 1 month later...

@dlandon, tagging you as it looks like you are the owner of the recycle bin plugin from here.

 

I believe I found the issue above and have been able to replicate it, but I'm not sure if you need any logs to help resolve the issue. Details summarised below:

 

After upgrading to 6.9.2 I also redid my shares, setting them to Secure and encrypted.

I then started having strange issues where, every now and again, my cache-only share would start presenting an issue where Windows machines would not be able to recognise the owner of a file after a failed deletion attempt. Some screenshots on this are above.

 

I downgraded to 6.8.3 and the issue was still present. I then realised that I had not excluded the new share from the recycle bin; after excluding the share the issue did not present itself, so I thought I would try the upgrade again.

 

After two days running 6.9.2 again, I now have the issue presenting itself again (and the share is still excluded in the recycle bin).

 

I haven't been able to get much help on this one. I'm not sure why the recycle bin would be having issues with a cache-only share, or whether the issue is in the plugin or the core.

 

Really appreciate some input. I haven't rebooted or anything yet, in case you need specific information while the issue is present.

 

I will need to downgrade to 6.8.3 again at some time too.

 

I only appear to have the issue using user shares on a cache-only drive; I haven't experienced the issue on a drive that is part of the array. As noted above, the issue appears for all Windows machines on the network (making me think it's something in the core samba integration). I can still delete the files if I browse directly to the same files using Linux, or over samba to the cache drive directly (more on that above).

 

Note: currently running version 2021.10.27 of the plugin.

 

EDIT: I have just tested and selecting 'restart' in Recycle Bin Settings has also resolved the issue (temporarily, I assume), so a full reboot was not required. That would definitely make me lean more towards the issue being with the plugin.

Edited by kingy444
update
Link to comment

The recycle bin functionality is built into samba.  The recycle bin plugin enables that functionality in samba and manages the settings for the samba recycle bin.
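To illustrate (this is a generic example of samba's vfs_recycle settings, not necessarily the exact configuration the plugin writes), the recycle bin is just a vfs object added to the share definition in the samba configuration:

[sharename]
   path = /mnt/user/sharename
   vfs objects = recycle
   recycle:repository = .Recycle.Bin
   recycle:keeptree = Yes
   recycle:touch = Yes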

 

I suspect a samba issue.  Whenever you restart the recycle bin plugin or make a change to a setting, samba is restarted.  I think this is why your issue clears up when you restart or make a change to the plugin.

 

The best way to confirm this is to remove the recycle bin plugin and see if the issue clears up.

Link to comment
11 hours ago, dlandon said:

The recycle bin functionality is built into samba.  The recycle bin plugin enables that functionality in samba and manages the settings for the samba recycle bin.

 

I suspect a samba issue.  Whenever you restart the recycle bin plugin or make a change to a setting, samba is restarted.  I think this is why your issue clears up when you restart or make a change to the plugin.

 

The best way to confirm this is to remove the recycle bin plugin and see if the issue clears up.

 

If that's all the restart is doing, then I agree it is going to be samba related. Any idea who to tag here to get some assistance with the core, @dlandon?

Edited by kingy444
Link to comment
  • 3 months later...

I've been asked to look at your situation again since you haven't been able to resolve it.  Let's reset and come at it again.  Do the following:

  • Upgrade to 6.9 if you're not already on 6.9.
  • Update all your plugins.
  • Create the issue where the ownership of the files gets lost.
  • Post your diagnostics.

While you are doing that, I'll set up a pool device as a btrfs encrypted device and see if I can recreate your issue. The problem I may have is that I am currently on 6.10, so I may not be able to reproduce the issue. If I can't, I'll have to roll back to 6.9.
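If it's more convenient than the webGUI, I believe diagnostics can also be generated from a terminal session with the command below (the resulting zip should land in /boot/logs, though treat that location as an assumption):

# generate a diagnostics zip from the shell
diagnostics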

Link to comment

There are diagnostics in the first post too, from before the encryption. Here is another; hopefully the two combined help.

I don't currently have the issue, but this was a diagnostic I took when I was trying to work it out myself.

 

I am only using one user; the others all have read-only access (no one actually uses them, they all log on to my PC anyway). The main and only user with access on this share has read/write access, with SMB set to Secure and 'Case-sensitive names' set to Auto.

 

I am wondering if it is somehow related to a bad drive, but would love something that points there in a more concrete way than a stab in the dark before dropping $$ on a replacement simply to test.

I did have a good period for a while (which I hated, as I had no root cause). Then I recently had the log fill up, and found there was a likelihood of a bad SATA cable (can't remember the exact error, but I found errors on one of the cache drives via "btrfs dev stats -c /mnt/cache").
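For anyone else reading, this is the sort of check I mean; the reset flag is what I used after fixing the cable so that any new errors would stand out:

# print per-device error counters; -c makes the exit code non-zero if any counter is set
btrfs device stats -c /mnt/cache

# print and then reset the counters (run after fixing the suspected cable)
btrfs device stats -z /mnt/cache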

 

After fixing the above, scrubbing, etc., it was about a week later that the issue occurred again for the first time in about a month.

 

Edited by kingy444
Link to comment

Your log only goes to Feb 26th, so it is not up to today.

 

I see this in your log that I don't understand:

Feb 16 17:51:20 unRAID nginx: 2022/02/16 17:51:20 [alert] 10670#10670: worker process 32187 exited on signal 6
Feb 16 17:51:22 unRAID nginx: 2022/02/16 17:51:22 [alert] 10670#10670: worker process 32232 exited on signal 6
Feb 16 17:51:24 unRAID nginx: 2022/02/16 17:51:24 [alert] 10670#10670: worker process 32369 exited on signal 6
Feb 16 17:51:26 unRAID nginx: 2022/02/16 17:51:26 [alert] 10670#10670: worker process 32439 exited on signal 6
Feb 16 17:51:28 unRAID nginx: 2022/02/16 17:51:28 [alert] 10670#10670: worker process 32498 exited on signal 6
Feb 16 17:51:30 unRAID nginx: 2022/02/16 17:51:30 [alert] 10670#10670: worker process 32556 exited on signal 6
Feb 16 17:51:32 unRAID nginx: 2022/02/16 17:51:32 [alert] 10670#10670: worker process 32615 exited on signal 6
Feb 16 17:51:34 unRAID nginx: 2022/02/16 17:51:34 [alert] 10670#10670: worker process 32687 exited on signal 6
Feb 16 17:51:36 unRAID nginx: 2022/02/16 17:51:36 [alert] 10670#10670: worker process 32726 exited on signal 6
Feb 16 17:51:38 unRAID nginx: 2022/02/16 17:51:38 [alert] 10670#10670: worker process 324 exited on signal 6
Feb 16 17:51:40 unRAID nginx: 2022/02/16 17:51:40 [alert] 10670#10670: worker process 397 exited on signal 6
Feb 16 17:51:42 unRAID nginx: 2022/02/16 17:51:42 [alert] 10670#10670: worker process 453 exited on signal 6
Feb 16 17:51:44 unRAID nginx: 2022/02/16 17:51:44 [alert] 10670#10670: worker process 513 exited on signal 6
Feb 16 17:51:46 unRAID nginx: 2022/02/16 17:51:46 [alert] 10670#10670: worker process 585 exited on signal 6
Feb 16 17:51:46 unRAID nginx: 2022/02/16 17:51:46 [alert] 10670#10670: worker process 643 exited on signal 6
Feb 16 17:51:48 unRAID nginx: 2022/02/16 17:51:48 [alert] 10670#10670: worker process 644 exited on signal 6
Feb 16 17:51:50 unRAID nginx: 2022/02/16 17:51:50 [alert] 10670#10670: worker process 704 exited on signal 6
Feb 16 17:51:52 unRAID nginx: 2022/02/16 17:51:52 [alert] 10670#10670: worker process 748 exited on signal 6
Feb 16 17:51:54 unRAID nginx: 2022/02/16 17:51:54 [alert] 10670#10670: worker process 899 exited on signal 6
Feb 16 17:51:56 unRAID nginx: 2022/02/16 17:51:56 [alert] 10670#10670: worker process 958 exited on signal 6
Feb 16 17:51:58 unRAID nginx: 2022/02/16 17:51:58 [alert] 10670#10670: worker process 1023 exited on signal 6
Feb 16 17:52:00 unRAID nginx: 2022/02/16 17:52:00 [alert] 10670#10670: worker process 1094 exited on signal 6
Feb 16 17:52:02 unRAID nginx: 2022/02/16 17:52:02 [alert] 10670#10670: worker process 1160 exited on signal 6
Feb 16 17:52:04 unRAID nginx: 2022/02/16 17:52:04 [alert] 10670#10670: worker process 1336 exited on signal 6
Feb 16 17:52:06 unRAID nginx: 2022/02/16 17:52:06 [alert] 10670#10670: worker process 1509 exited on signal 6
Feb 16 17:52:08 unRAID nginx: 2022/02/16 17:52:08 [alert] 10670#10670: worker process 1584 exited on signal 6
Feb 16 17:52:10 unRAID nginx: 2022/02/16 17:52:10 [alert] 10670#10670: worker process 1665 exited on signal 6
Feb 16 17:52:12 unRAID nginx: 2022/02/16 17:52:12 [alert] 10670#10670: worker process 1721 exited on signal 6
Feb 16 17:52:14 unRAID nginx: 2022/02/16 17:52:14 [alert] 10670#10670: worker process 1781 exited on signal 6
Feb 16 17:52:16 unRAID nginx: 2022/02/16 17:52:16 [alert] 10670#10670: worker process 1849 exited on signal 6
Feb 16 17:52:18 unRAID nginx: 2022/02/16 17:52:18 [alert] 10670#10670: worker process 1908 exited on signal 6
Feb 16 17:52:19 unRAID root: error: /plugins/unassigned.devices/UnassignedDevices.php: wrong csrf_token
### [PREVIOUS LINE REPEATED 1 TIMES] ###

 

Your 'cache' disk has a lot of issues in the SMART report. This is an example:

Error 122 [1] occurred at disk power-on lifetime: 41127 hours (1713 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  84 -- 51 00 00 00 00 00 00 00 01 a0 00

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  ec 00 00 00 00 00 00 00 00 00 00 a0 08 42d+08:12:53.709  IDENTIFY DEVICE

 

You probably need to replace that disk. The 'cache2' disk also seems to be having issues.
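If you want to see the same thing from the command line, something like the following (substituting your actual device) shows the health summary plus the ATA error log where these entries appear:

# quick health summary - roughly what the webGUI "healthy"/PASSED indicator reflects
smartctl -H /dev/sdk

# extended output including attributes and the ATA error log
smartctl -x /dev/sdk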

Link to comment

Thanks. I haven't been able to force the issue; this was a diagnostic from the last occurrence.

 

I did check the SMART report in the webGUI, but it said the disk passed (and still does; see the cache disk below).

Why is the UI not showing an error? cache2 has similar values.

 

I'm a little surprised I haven't had any Docker issues, as Docker lives on those drives too.

[screenshot: the cache disk's SMART status in the webGUI, showing it as passed/healthy]

Link to comment
  • 1 month later...
On 3/4/2022 at 2:59 PM, dlandon said:

Scroll down the SMART report and you'll see the disk errors I'm talking about.

 

Here's the complete report:

WDC_WD2003FZEX-00SRLA0_WD-WMC6N0H9MZL6-20220221-0901 cache (sdk).txt 19.19 kB · 2 downloads

 

Thanks for your help up to this point, @dlandon. Unfortunately, the disk doesn't appear to have been the issue.

 

I have replaced cache1 and cache2 (WD Black) with a new Samsung SSD 870 EVO (just a single cache drive at the moment) and changed from btrfs to xfs as part of this process.

 

After roughly 22 days online the issue has resurfaced, and I can't see any SMART errors for the SSD in the logs.

 

Could you please take another look?

 

EDIT: I noted the "refused mount request" too; this is cleared now. It was related to a share I deleted while forgetting it had an active connection.

 

EDIT2: Want to highlight that the issue only occurs on a cache-only share, and only when accessing via \\unraid\sharename and not via \\unraid\cache\sharename. That makes me think there has to be something in the software here, and the fact that the issue disappears after a samba restart would indicate the same to me.

 

Edited by kingy444
Link to comment

The syslog in the posted diagnostics is basically nothing but the "refused mount request" messages you mention here:

On 4/19/2022 at 9:19 AM, kingy444 said:

EDIT: I noted the "refused mount request" too; this is cleared now. It was related to a share I deleted while forgetting it had an active connection.

 

 

Now that you've fixed that issue, can you reboot, replicate your issue again, and then post a clean set of diagnostics?

 

 

Link to comment

I had held off rebooting, as I wasn't sure whether the machine being in a 'broken' state was of use.

 

Originally the issue would reappear every 2-3 days, but since replacing the drives with the SSD the issue has taken roughly 21 days to reappear (and I didn't want to wait that long to provide something of use).

Unfortunately I am not sure what the catalyst is (and knowing it would go a long way towards fixing the issue, of course), so I have no way to 'force' the issue.

 

I will do a reboot shortly. However, as the HDD issue was obviously a red herring, could you take a look at the diagnostics in the first post in this thread? Perhaps there is something there that was missed due to the HDD diagnosis.

Link to comment

The problem with the original diagnostics is that there was also an issue in them with the filesystem (which you've now fixed)

 

I would, however, like you to uninstall the Recycle Bin plugin (and reboot after removing it). While I agree with @dlandon that this technically shouldn't be an issue, removing it would help isolate the problem, since it removes the recycle bin functionality from Samba.

Link to comment
On 4/26/2022 at 9:38 PM, AndrewZ said:

I would, however, like you to uninstall the Recycle Bin plugin (and reboot after removing it). While I agree with @dlandon that this technically shouldn't be an issue, removing it would help isolate the problem, since it removes the recycle bin functionality from Samba.

 

Just want to throw my two cents in here before disabling it (I work in IT, so I understand where you are going).

 

Given we are currently looking at an unknown length of time before recurrence, with no known way to 'force' a recurrence, is that the best course?

 

When we were looking at 2-3 days between occurrences I believe I already tried this (and had success), but that was much easier to gauge when we were talking 6-7 days to test. With the last occurrence happening after 21 days of uptime, when do we call the test a success or a failure?

 

I believe excluding the share did help, and my initial thought was that it was related to the recycle bin.

 

Please let me know how you want me to proceed. I am just concerned we could remove it and wait 4-6 weeks, then enable it and have to wait the same again. Thinking maybe we look at doing that after getting the next set of diagnostics. If of use, I have also attached some from my recent clean reboot.

 

One of the weird things I have with this (noting I am a Windows guy) is that:

\\unraid\share loses the Owner

\\unraid\cache\share has the Owner present (and files can be deleted)

 

So permissions appear fine; it's just the way the SMB share is presented at /mnt/user vs /mnt/cache that appears to be the issue.
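If it would help next time it happens, I assume I can also dump the samba configuration that is actually in effect with something like the following (testparm is standard samba; I'm not sure exactly which include files Unraid generates):

# dump the effective samba configuration, including the per-share stanzas
testparm -s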

 

Edited by kingy444
Link to comment

Bear in mind that this has never once been reported before.

 

That throws a monkey wrench into it, along with recycle bin being active, and the possibility that a script or a docker container or something is modifying the permissions on you (due to misconfiguration?)

 

Under ideal circumstances, I'd love for you to upgrade to 6.10-rc4+. The way any bugs get handled on any version of the OS that isn't the current one (6.9.2 is the current "stable", but not the "current" version) is that they are first duplicated against the current version (as of right now rc4, which is public, and rc4i, which is private). If the issue cannot be replicated under the current version, then it is technically already solved.

 

The one thing I'm actually surprised at is that no replies (that I can see) asked you to "prove" the ownership issue:

ls -ail /mnt
ls -ail /mnt/user
ls -ail /mnt/cache
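(The idea being to compare the owner/group columns for the same share directory under /mnt/user and /mnt/cache while the problem is active; if they match at the filesystem level, the discrepancy is in how samba presents ownership to Windows rather than in the files themselves.)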

 

Link to comment

When I get the issue again I can report that info. I have checked that personally (with the -l switch only) and confirmed the owner was still listed.

 

Not sure if this is related, but copying/pasting the commands from your post highlighted that the share in question (and .Recycle Bin) have my account listed as owner, where the others have root or nobody listed. At first I thought this dated from before moving to Secure shares, but a share created after going Secure has nobody as owner.
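If the ownership on that share root does turn out to matter, I assume resetting it to match the other shares would be something along these lines (the path is a placeholder, and I understand the 'New Permissions' tool in the webGUI does something similar):

# reset ownership on the share root to the Unraid default of nobody:users
chown -R nobody:users /mnt/cache/sharename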

Edited by kingy444
Link to comment
