Rebooted, lost cache disk, BTRFS operation running?



1 hour ago, JorgeB said:

The 2nd device wasn't added to the pool. Try again: stop the array, unassign sdi, start the array, stop the array, re-assign sdi, start the array, and post new diags.

I tried this previously, as suggested by some other post (that is what I meant by "When setting up the pool (again) everything works fine"), but it didn't keep the pool consistent across reboots or over time.

The last time I tried this, the problem recurred after a while.

I checked the SSD and it seems OK: data stays intact for a week even without power.

server-diagnostics-20210714-1857-newpool.zip

On 7/15/2021 at 8:30 AM, JorgeB said:

Now it's working correctly. If it stops working, I'd need to see diags showing the problem, taken before rebooting.

Alright, it just started happening again. Here is a log.

The thing is, it only occurs after restarts - not while running, so I can't really provide logs from before.

One thing I noticed is that the encryption symbol (green lock, to the left of the pool name) was there before the reboot and is now gone.

So maybe it's some issue with the drive header being corrupted?

server-diagnostics-20210716-1300_again.zip

Edited by Kosmos
19 minutes ago, Kosmos said:

The thing is, it only occurs after restarts - not while running, so I can't really provide logs from before.

That is strange. I do see some data corruption detected, so you should run memtest and then a scrub on the pool. Other than that, next time grab diags before rebooting (stop the array, grab diags, then reboot) and new diags after rebooting if the same thing happens.

3 minutes ago, JorgeB said:

That is strange. I do see some data corruption detected, so you should run memtest and then a scrub on the pool. Other than that, next time grab diags before rebooting (stop the array, grab diags, then reboot) and new diags after rebooting if the same thing happens.

Thanks for looking into this 😃

I will try to get the logs in this order (after my vacation).

From what you can see so far, do you think it's a hardware failure?

Best regards

  • 1 month later...

Hey everyone,

thanks for sticking with me.

In the meantime, I moved my hard drives to completely new hardware.

The problem persists: sometimes after a reboot, the encryption symbol of the second cache disk goes missing, and when the array starts, it throws the "missing disk" errors.

Furthermore, I have a feeling that it happens more often when data was written to the disk before rebooting, though that may be a coincidence rather than causation...

 

When I changed the disk order in the pool, the disk went missing as well, unfortunately losing all its data.

 

This leads me to the conclusion that the SSD (controller) is damaged.

I will continue to try to grab logs before and after a reboot in which the problem occurs.


I managed to get logs before and after rebooting when the problem occurred (attached).

This time, I created a new pool (different name) from the same SSDs just before it happened.
With the old name, it didn't happen in the last 10 ± 2 reboots.

So maybe it's a cache management issue after all?

Best regards

after-reboot.zip before_reboot.zip

Edited by Kosmos
files were missing

Problem is that this device wasn't decrypted after the reboot:

 

Aug 30 20:40:37 Server emhttpd: import 32 cache device: (sdc) SanDisk_SSD_PLUS_240GB_184302A005B3

 

It's strange since the device was there, but since it wasn't decrypted it can't be used by btrfs, so it was as if the device wasn't present and the pool balanced to single. Not sure why this is happening. If you have a spare, try replacing that SSD with a different one; if it still happens, it's likely a bug.
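
For comparing the before and after diagnostics, a minimal Python sketch like the following could list which cache devices emhttpd imported in each syslog. The regular expression is modelled only on the single line quoted above, and the function names are made up for illustration. It shows whether a device was detected, not whether it was decrypted.

```python
import re

# Pattern modelled on the one syslog line quoted in this thread (assumption:
# other Unraid versions may word this line differently).
IMPORT_RE = re.compile(
    r"emhttpd: import (?P<slot>\d+) cache device: \((?P<dev>\w+)\) (?P<ident>\S+)"
)


def imported_cache_devices(syslog_text):
    """Return {identifier: kernel device name} for every matching import line."""
    found = {}
    for line in syslog_text.splitlines():
        match = IMPORT_RE.search(line)
        if match:
            found[match.group("ident")] = match.group("dev")
    return found


def missing_after_reboot(before_text, after_text):
    """Identifiers imported before the reboot but not after it, sorted."""
    before = imported_cache_devices(before_text)
    after = imported_cache_devices(after_text)
    return sorted(set(before) - set(after))
```

Feeding it the syslog text from the "before" and "after" diagnostics would return an empty list when the device was detected both times, which would match the case here (detected, but not decrypted).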

  • 2 weeks later...

It should be decrypted only after entering the password and starting the array, right?
However, Unraid already fails to show the encrypted volume properly right after boot, before the array is started.

 

Also, when using the "failing" SSD in another pool (single, not raid), it works properly.

The combination of the "failing" ssd with a different HDD continued to fail,

but the combination of the other "working" ssd with the other HDD did not fail after many reboots.

(logs attached)

 

So it seems to me that this particular disk is not working properly in an (encrypted) btrfs raid1 pool.

Could it be due to the PCIe-to-SATA add-on card they are attached to (all 3)?

 

I may try to change ports or use the pool without encryption.

1a_before-reboot-diagnostics-20210830-2036.zip 1b_after-reboot-diagnostics-20210830-2100.zip 2a_before-reboot-diagnostics-20210901-1613.zip 2b_after-reboot-diagnostics-20210901-1735.zip 3a_before-reboot-diagnostics-20210901-2343.zip 3b_after-reboot-diagnostics-20210901-2351.zip

5 minutes ago, Kosmos said:

It should be decrypted only after entering the password and starting the array, right?

Correct.

 

Not sure why that device is not being decrypted. It's being detected, so it should also be decrypted, but I've never used encryption, so I'm not familiar with how it works. It could be an Unraid bug. If you have a different device, test with that: if it works, it was likely a device problem; if it behaves the same, it's likely a bug.

On 9/8/2021 at 6:26 PM, JorgeB said:

Correct.

 

Not sure why that device is not being decrypted. It's being detected, so it should also be decrypted, but I've never used encryption, so I'm not familiar with how it works. It could be an Unraid bug. If you have a different device, test with that: if it works, it was likely a device problem; if it behaves the same, it's likely a bug.

In my opinion, it could be both.

It appears that the partition information is lost at some point, so Unraid does not detect that there is a cache partition on the SSD.

As a consequence, nothing can be decrypted. However, the problem appears without encryption as well.

I tried changing the SATA controller and cable as well, but that didn't help either.

 

Anyway, I wonder why this particular disk is recognized correctly after some reboots but not after others.

So either Unraid is not reading the disk correctly,

or the SSD is sometimes resetting/deleting its partition (headers) for unknown reasons...
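
One way to narrow this down might be to look at the raw bytes right after the problem shows up. The sketch below (Python, with hypothetical helper names) checks for the 0x55AA boot signature at the end of sector 0 and for the LUKS magic bytes at the start of a partition. The device paths in the usage note are placeholders, and reading them needs root.

```python
LUKS_MAGIC = b"LUKS\xba\xbe"  # first 6 bytes of a LUKS1/LUKS2 primary header
MBR_SIGNATURE = b"\x55\xaa"   # bytes 510-511 of sector 0 (MBR or protective MBR)


def has_mbr_signature(sector0: bytes) -> bool:
    """True if the first 512-byte sector ends with the 0x55AA boot signature."""
    return len(sector0) >= 512 and sector0[510:512] == MBR_SIGNATURE


def has_luks_magic(partition_start: bytes) -> bool:
    """True if the given bytes begin with the LUKS header magic."""
    return partition_start[:6] == LUKS_MAGIC


def read_bytes(path: str, offset: int, length: int) -> bytes:
    """Read `length` bytes at `offset` from a device or image file (read-only)."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)
```

For example, `has_mbr_signature(read_bytes("/dev/sdX", 0, 512))` and `has_luks_magic(read_bytes("/dev/sdX1", 0, 6))`, where `/dev/sdX` stands for the affected SSD. If the partition node is missing entirely, that alone already points at the partition table rather than at the encryption layer.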

 

I attached a screenshot of the unassigned drives after the problem occurred (encryption lock symbol and partition gone)
and a SMART report of this ssd as well.

 

Screenshot 2021-09-15 133926.png

server-smart-20210915-1213.zip

Edited by Kosmos
2 hours ago, JorgeB said:

I suspect that it's a device problem.

Probably, yes,
I will ask SanDisk about this.

Thanks again for your continued help!
See you on the next one 😛

 

P.S.: you may flag this topic as solved. I can't do it myself, because I never created a new topic and instead took over this one from johnsanc (😇)

