BTRFS bdev /dev/ disks error (random disks)

September 29, 2024

Hey!

I am writing this today after trying to solve the problem myself for the past year to avoid bothering you, but I have a severe lack of skills soooo...

Here is the thing:

I had a problem with a storage disk whose data got corrupted and which eventually ended up in a "read only" state.

When it was in the "read only" state, any attempt to write something led to an error 5 (I/O error).

It happened while doing heavy read/write work (like downloading "50 ISOs of linux" simultaneously, or copying 3TB of data from one disk to another).

So I tried to remove the disk, but then the problem moved to another disk.

I bought about 4 other disks in total during this year, and precleared all of them to be sure: preclear read, zeroing and post-clear read all finished without any error (50h per disk).

It seems that the I/O problem "moves" from one disk to another, depending on which disks are in the array (if I remove the 2 12TB disks it moves to the SSD cache, if I add some it might move to the NVMe, or to the 12TB, I can't find any pattern there).

I thought the motherboard's SATA ports were saturated, so I bought an LSI 9300-16i card, and even designed a 3D-printed cooler adapter to keep it cool, but the problem persists.

In the meantime, Fix Common Problems told me today that an "invalid folder" with the name of an old share is still within /mnt.

I also get some "flash device corrupted" warnings while starting the array, but they eventually disappear.

I am a bit confused now about what to test next... Maybe the PSU, since it is a G650M, known to be a disaster? I bought another PSU to troubleshoot even this, but I assume that the problem is more software than hardware now.

If someone sees something that I missed, it could help me a lot!

Thanks ❤️

mc5-diagnostics-20240929-1442.zip

September 29, 2024

Community Expert

Looks like both pool devices dropped offline in the past. Run a correcting scrub on the pool and post the results.

September 29, 2024

Author

Seems to have worked!

All this time and money just for that, you saved me!

A bit of panic while doing the scrub, since I only saw "137 errors fixed" then the "main" tab alternating between empty array and Error 500, but after a forced reboot it seems that I don't have any red line for the moment, thanks Jorge ❤️

I also added a weekly scrub to the cache and the pools now, to avoid it happening again.

Will try to copy 3TB and stress the disks again, but it seems that it was that simple...

September 29, 2024

Author

Oopsi, talked too soon...

mc5-diagnostics-20240929-1734.zip

Edited September 29, 2024 by resolute-clearance8449
Added the diag

September 30, 2024

Community Expert

Run another correcting scrub and post the results.

September 30, 2024

Author

Hi, here are the results:

Cache:

UUID: dc947b32-2638-4059-927b-d0a51c5d878a

Scrub started: Mon Sep 30 09:33:50 2024

Status: finished

Duration: 0:02:54

Total to scrub: 178.41GiB

Rate: 1.02GiB/s

Error summary: no errors found

Secondary pool: I'll get back to you when it has finished!
image.png.1102712eec3842c1be2fe0c0e58834fa.png

What is odd is that the errors keep appearing with use: the pool had no errors before the copy started.

September 30, 2024

Community Expert

Scrub is still running, post the results when done.

October 2, 2024

Author

Scrub ended, here are the results:

UUID: 250fadc9-bf35-4060-aa6d-c030b89bca9a

Scrub started: Wed Oct 2 01:10:44 2024

Status: finished

Duration: 8:18:13

Total to scrub: 5.00TiB

Rate: 175.37MiB/s

Error summary:

read=294772105

csum=256

Corrected: 272

Uncorrectable: 294772089

Unverified: 0
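
For anyone comparing several scrub reports like the one above, here is a minimal Python sketch that pulls the integer counters out of the pasted text. It is only an illustration: the function name `parse_scrub_summary` is made up for this example, and it assumes the report is laid out as posted in this thread (one `name=value` or `Name: value` pair per line).

```python
import re


def parse_scrub_summary(text: str) -> dict:
    """Collect the integer counters from a pasted scrub report.

    Hypothetical helper: it only understands lines such as
    "read=294772105" or "Corrected: 272", as posted in this thread.
    Lines like "Duration: 8:18:13" or "Rate: 175.37MiB/s" are skipped
    because their value is not a plain integer.
    """
    pattern = re.compile(r"^\s*([A-Za-z_]+)\s*[=:]\s*(\d+)\s*$")
    counters = {}
    for line in text.splitlines():
        match = pattern.match(line)
        if match:
            counters[match.group(1).lower()] = int(match.group(2))
    return counters


# The report from the post above, copied as text.
report = """
UUID: 250fadc9-bf35-4060-aa6d-c030b89bca9a
Scrub started: Wed Oct 2 01:10:44 2024
Status: finished
Duration: 8:18:13
Total to scrub: 5.00TiB
Rate: 175.37MiB/s
Error summary:
read=294772105
csum=256
Corrected: 272
Uncorrectable: 294772089
Unverified: 0
"""

summary = parse_scrub_summary(report)
detected = summary["read"] + summary["csum"]
handled = summary["corrected"] + summary["uncorrectable"]
print(summary)
print("detected:", detected, "corrected + uncorrectable:", handled)
```

In this particular report the read and csum errors add up to exactly the corrected plus uncorrectable counts, and almost all of them are read errors, which fits a device that stopped answering rather than scattered bad sectors.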

In the meantime, if it can help:

Since the estimated duration went up to 3 to 4 days, I stopped the scrub and erased the disks (I can afford to lose all the data, it is saved on another disk).

I then relaunched a scrub, that found no errors.

But I had a lot of lines "kernel: sd 7:0:6:0: Power-on or device reset occurred", so I went to Tools / System Devices and found out that 7:0:6:0 was assigned to one of the disks of the pool.
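
To see at a glance which SCSI address produces these resets, a small Python sketch like the following can count the messages per address. The name `count_resets` is hypothetical, and the sample lines (timestamps, host name, the second address) are made up for illustration; only the message text itself comes from the log line quoted above.

```python
import re
from collections import Counter

# Matches the kernel message quoted above and captures the SCSI address
# (host:channel:target:lun), e.g. "7:0:6:0".
RESET_RE = re.compile(
    r"sd (\d+:\d+:\d+:\d+): Power-on or device reset occurred"
)


def count_resets(lines):
    """Count 'Power-on or device reset occurred' messages per SCSI address."""
    counts = Counter()
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines; a real run would read the syslog file instead.
sample = [
    "Oct  2 01:12:03 tower kernel: sd 7:0:6:0: Power-on or device reset occurred",
    "Oct  2 01:12:09 tower kernel: some unrelated message",
    "Oct  2 01:13:41 tower kernel: sd 7:0:6:0: Power-on or device reset occurred",
    "Oct  2 01:15:20 tower kernel: sd 7:0:3:0: Power-on or device reset occurred",
]

resets = count_resets(sample)
for address, count in resets.most_common():
    print(address, count)
```

On a live system the same function could be fed an open log file, for example `count_resets(open("/var/log/syslog"))`, assuming that is where the syslog lives. The address that comes out on top can then be matched to a disk under Tools / System Devices.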

So I changed the disk for another that I just bought: no more "Power-on or device reset occurred", except when really powering the disks on I assume.

I tried to copy one season at first, then scrub the pool with the new disk: no error.

Tried to copy an entire show then scrub: no error.

Tried to copy all of the "show" folder then scrub: 24 uncorrectable errors, but 0 corrected and 0 unverified.

Tried to copy some other folders then scrub: 294 772 089 uncorrectable errors, and 272 corrected (the result shown above), and logs that look like a Christmas tree!

I am starting to think that the disk that kept disconnecting basically corrupted the data, and that I "only" have to restore it to get rid of all the errors.

Diag attached as usual, if needed

Thanks for your help!

mc5-diagnostics-20241002-0938.zip

October 2, 2024

Community Expert

There are already a lot of device errors, and the syslog has already rotated, so I cannot see the start of the problem or tell whether the errors are new or old, but it looks like a device dropped offline.

If the data can be deleted, delete all the existing data, reset the pool stats, start copying again and post new diags after new errors.
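
The pool stats mentioned here are the per-device error counters that `btrfs device stats <mountpoint>` prints, and `btrfs device stats -z <mountpoint>` resets them to zero. As a rough sketch, the output can be filtered in Python to show only the devices with non-zero counters. The helper name `devices_with_errors`, the device paths and the counter values below are all made up for illustration.

```python
import re
from collections import defaultdict

# One line of `btrfs device stats` output looks like:
#   [/dev/sdb1].write_io_errs    0
STAT_RE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<name>\w+)\s+(?P<value>\d+)$")


def devices_with_errors(stats_output: str) -> dict:
    """Return {device: {counter: value}} for every non-zero counter."""
    errors = defaultdict(dict)
    for line in stats_output.splitlines():
        match = STAT_RE.match(line.strip())
        if match and int(match.group("value")) > 0:
            errors[match.group("dev")][match.group("name")] = int(
                match.group("value")
            )
    return dict(errors)


# Made-up sample output for a two-device pool.
sample_stats = """\
[/dev/sdb1].write_io_errs    0
[/dev/sdb1].read_io_errs     0
[/dev/sdb1].flush_io_errs    0
[/dev/sdb1].corruption_errs  0
[/dev/sdb1].generation_errs  0
[/dev/sdc1].write_io_errs    1520
[/dev/sdc1].read_io_errs     87
[/dev/sdc1].flush_io_errs    0
[/dev/sdc1].corruption_errs  12
[/dev/sdc1].generation_errs  0
"""

flagged = devices_with_errors(sample_stats)
print(flagged)
```

Because these counters persist until they are reset, zeroing them before a new test copy makes it clear that any non-zero value afterwards is a new error.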

October 2, 2024

Author

Well, it was faster than I thought 😅

Copy crashed, pool went into read-only, and I had no access to its settings.

After a reboot, read-only was gone and I had access to the scrub, so here is the result:

UUID: 250fadc9-bf35-4060-aa6d-c030b89bca9a

Scrub started: Wed Oct 2 16:47:04 2024

Status: finished

Duration: 0:23:22

Total to scrub: 192.16GiB

Rate: 140.35MiB/s

Error summary:

verify=13732

csum=247432

Corrected: 261164

Uncorrectable: 0

Unverified: 0

mc5-diagnostics-20241002-1754.zip

Edited October 2, 2024 by resolute-clearance8449
Added the diag (again)

October 2, 2024

Community Expert

Disk is dropping offline, replace both cables and try again.

October 2, 2024

Author

Already tried, and I swapped the PSU cables too, but nothing worked.

Would it be possible that, since they all come from the same batch, they might all be faulty?

If you think that it might be a possibility, I will buy some Western Digital Gold for example, and try with them!

Attached the sound of one of them:

October 2, 2024

Community Expert
Solution

That doesn't sound good. If the cables were already replaced, it could be a bad disk.

October 2, 2024

Community Expert

It might be worth checking that there is not a power related issue.

October 2, 2024

Author

Sure, especially with a G650M, even if the sound of the disks is a bit scary! I will try to go back to my faithful WD and keep you updated, thanks a lot for your time!!

October 7, 2024

Author

Hey! Quick update since I received the new WD Red Plus on Saturday: I copied 4TB of data onto it without a single issue.

From what I see now, I am stunned that all this mess might only be caused by all the Seagate drives being faulty (4 in total).

Next steps to be sure:

- relaunch all the apps that were using it, and stress test for a week or so

- if still no errors, add the "backup disk" to create a Raid1 and confirm (or not) if the issue came from the disks, or the raid architecture itself!

Keeping you updated obviously (and maybe 1 or 2 people to whom the problem could happen in the future, hello to you)

mc5-diagnostics-20241007-1331.zip

Edited October 7, 2024 by resolute-clearance8449

October 7, 2024

Community Expert

6 minutes ago, resolute-clearance8449 said:

From what I see now, I am stunned that all this mess might only be caused by all the Seagate drives being faulty (4 in total).

It could also be due to something that happened while they were in transit, as one assumes they all travelled together.

October 7, 2024

Author

Yes, 100% possible too. I will know in about a week now, but everything seems to be resolved thanks to your help. Fingers crossed!

October 24, 2024

Author

Hi! Last update (I hope), after 2 weeks of testing: everything works perfectly.

- in "btrfs single" configuration, not a single error no matter the disk (new WD Red Plus 10TB or "old" WD Black 8TB, each on their own pool) even while writing / reading intensively for a week

- so I copied the ~6TB of data from the 8TB to the 10TB

- I then deleted all the content of the 8TB and merged the 2 disks in one pool
- Unraid switched from "btrfs single" to "raid 1", and all went smoothly

Everything works like a charm since, thanks a lot for your help (and long live Western Digital!) ❤️

mc5-diagnostics-20241024-1041.zip


Solved by JorgeB
