Unraid

BTRFS bdev /dev/ disks error (random disks)

Hey!

I am writing this today after spending the past year trying to solve the problem myself so as not to bother you, but I lack the skills, so...

Here is the thing:

I had a problem with a storage disk whose data got corrupted, and which eventually ended up with a "read-only" error.

While it was in the "read-only" state, any attempt to write something led to error 5.

It happened during heavy read/write work (like downloading "50 Linux ISOs" simultaneously, or copying 3TB of data from one disk to another).

So I tried removing the disk, but then the problem moved to another disk.

I bought about 4 other disks in total during this year, and precleared all of them to be sure: pre-read, zeroing and post-read all finished without any error (50h per disk).

It seems that the I/O problem "moves" from one disk to another depending on which disks are in the array (if I remove the two 12TB disks it moves to the SSD cache, if I add some it might move to the NVMe or to the 12TB; I can't find any pattern there).

I thought the motherboard's SATA ports were too saturated, so I bought an LSI 9300-16i card, and even designed a 3D cooler adapter to keep it cool, but the problem persists.

In the meantime, Fix Common Problems told me today that an "invalid folder" with the name of an old share is still present in /mnt.

I also sometimes get "flash device corrupted" messages while starting the array, but they eventually disappear.

I am a bit confused now about what to test next... Maybe the PSU, since it is a G650M, known to be a disaster? I bought another PSU to troubleshoot this too, but I now assume the problem is more software than hardware.

If someone sees something that I missed, it would help me a lot!

Thanks ❤️

MC5 Main.jpg

MC5 cache errors.jpg

Preclear result.jpg

LSI Cooler.jpg

mc5-diagnostics-20240929-1442.zip

Solved by JorgeB

  • Community Expert

Looks like both pool devices dropped offline in the past. Run a correcting scrub on the pool and post the results.
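For anyone following along from a shell rather than the GUI, a btrfs scrub is started with `btrfs scrub start` from btrfs-progs. The sketch below is only an illustration: the helper names `scrub_command` and `run_scrub` are hypothetical, and the pool path is an assumption (Unraid mounts pools under `/mnt/<poolname>`, e.g. `/mnt/cache`). `-B` keeps the scrub in the foreground, and leaving out `-r` makes it a correcting scrub instead of a read-only check. It needs root and does not check the exit code of `scrub start`, since that may be non-zero when errors are found.

```python
import subprocess


def scrub_command(mountpoint: str, correcting: bool = True) -> list:
    """Build a btrfs-progs scrub command for the pool mounted at `mountpoint`.

    -B runs the scrub in the foreground so the call blocks until it ends;
    -r would make it read-only (report errors without repairing them).
    """
    cmd = ["btrfs", "scrub", "start", "-B"]
    if not correcting:
        cmd.append("-r")
    cmd.append(mountpoint)
    return cmd


def run_scrub(mountpoint: str) -> str:
    """Run a correcting scrub (requires root), then return the status summary.

    The scrub's own exit code is ignored on purpose: it can be non-zero when
    errors were found, and we still want to read the summary afterwards.
    """
    subprocess.run(scrub_command(mountpoint), check=False)
    status = subprocess.run(
        ["btrfs", "scrub", "status", mountpoint],
        check=False, capture_output=True, text=True,
    )
    return status.stdout
```

Usage, on a hypothetical pool named `cache`: `print(run_scrub("/mnt/cache"))`.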

  • Author

Seems to have worked!

All this time and money just for that, you saved me!

I had a bit of panic during the scrub, since I only saw "137 errors fixed" and then the Main tab kept alternating between an empty array and Error 500, but after a forced reboot I don't seem to have any red lines for the moment. Thanks Jorge ❤️

I also added a weekly scrub to the cache and pools now, to avoid this happening again.

I will try copying 3TB to stress the disks again, but it seems it was that simple...

Array 1.PNG

Array 2.jpg

  • Community Expert

Run another correcting scrub and post the results.

  • Author

Hi, here are the results:

Cache:

UUID: dc947b32-2638-4059-927b-d0a51c5d878a
Scrub started: Mon Sep 30 09:33:50 2024
Status: finished
Duration: 0:02:54
Total to scrub: 178.41GiB
Rate: 1.02GiB/s
Error summary: no errors found

Secondary pool: I'll get back to you when it has finished!

[screenshot]

What is odd is that errors keep appearing with use, since there were no errors before the copy started.

  • Community Expert

Scrub is still running, post the results when done.

  • Author

Scrub ended, here are the results:

UUID: 250fadc9-bf35-4060-aa6d-c030b89bca9a
Scrub started: Wed Oct 2 01:10:44 2024
Status: finished
Duration: 8:18:13
Total to scrub: 5.00TiB
Rate: 175.37MiB/s
Error summary: read=294772105 csum=256
Corrected: 272
Uncorrectable: 294772089
Unverified: 0
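When comparing many scrub runs like this one, it can help to turn the `btrfs scrub status` summary into numbers. A minimal sketch, assuming the "Key: value" layout shown in this thread (the real command indents some lines, which the parser tolerates by stripping whitespace); the function name `parse_scrub_status` is hypothetical:

```python
def parse_scrub_status(text: str) -> dict:
    """Parse a `btrfs scrub status` summary into a dict.

    Plain "Key: value" lines are kept as strings; the counters on the
    "Error summary" line and the Corrected/Uncorrectable/Unverified lines
    become ints. "no errors found" yields an empty error dict.
    """
    result = {"errors": {}}
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # blank or unrecognised line
        value = value.strip()
        if key == "Error summary":
            # e.g. "read=294772105 csum=256" or "no errors found"
            for token in value.split():
                name, eq, count = token.partition("=")
                if eq:
                    result["errors"][name] = int(count)
        elif key in ("Corrected", "Uncorrectable", "Unverified"):
            result[key.lower()] = int(value)
        else:
            # Only the first colon splits, so "Duration: 8:18:13" stays intact.
            result[key] = value
    return result
```

Fed the secondary-pool summary above, it gives `errors == {"read": 294772105, "csum": 256}` and `uncorrectable == 294772089`. A huge `read` count next to a small `csum` count usually suggests a device that could not be read at all (for example one that dropped offline) rather than silent data corruption.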

 

In the meantime, in case it helps:

Since the duration went up to 3 or 4 days, I stopped the scrub and erased the disks (I can afford to lose all the data; it is saved on another disk).

I then relaunched a scrub, which found no errors.

But I had a lot of "kernel: sd 7:0:6:0: Power-on or device reset occurred" lines, so I went to Tools / System Devices and found out that 7:0:6:0 was assigned to one of the pool disks.
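The same lookup can be scripted: count the reset messages per SCSI address (host:channel:target:lun) in the syslog, then resolve the address to a block device through sysfs. This is a sketch under assumptions: the regex matches the message format quoted above, `/sys/class/scsi_device/<addr>/device/block/` is the standard Linux sysfs layout as far as I know, and the helper names are hypothetical.

```python
import re
from collections import Counter
from pathlib import Path

# Matches kernel lines such as
# "kernel: sd 7:0:6:0: Power-on or device reset occurred"
RESET_RE = re.compile(r"sd (\d+:\d+:\d+:\d+): Power-on or device reset occurred")


def count_resets(syslog_lines) -> Counter:
    """Count reset messages per SCSI address (host:channel:target:lun)."""
    counts = Counter()
    for line in syslog_lines:
        match = RESET_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def block_device_for(scsi_addr: str):
    """Return the block device name (e.g. 'sdg') for a SCSI address, or None.

    Reads the standard Linux sysfs layout; returns None when the address
    does not exist or sysfs is unavailable.
    """
    block_dir = Path("/sys/class/scsi_device") / scsi_addr / "device" / "block"
    try:
        return next(entry.name for entry in block_dir.iterdir())
    except (OSError, StopIteration):
        return None
```

Usage: `count_resets(open("/var/log/syslog"))` (path assumed) and then `block_device_for("7:0:6:0")` for the noisiest address. An address that keeps resetting under load points at that disk, its cables or its power feed rather than at the filesystem.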

 

So I swapped that disk for another one I had just bought: no more "Power-on or device reset occurred" messages, except when actually powering the disks on, I assume.

I first tried copying one season, then scrubbed the pool with the new disk: no errors.

Tried copying an entire show, then scrubbing: no errors.

Tried copying the whole "show" folder, then scrubbing: 24 uncorrectable errors, but 0 corrected and 0 unverified.

Tried copying some other folders, then scrubbing: 294,772,089 uncorrectable errors and 272 corrected (the result above), and logs that look like a Christmas tree!

[screenshot]

I am starting to think that the disk that kept disconnecting basically corrupted the data, and that I "only" have to retrieve the data again to get rid of all the errors.

Diagnostics attached as usual, if needed :)

Thanks for your help!

mc5-diagnostics-20241002-0938.zip

  • Community Expert

There are already a lot of device errors, and the syslog has already rotated, so I cannot see the start of the problem or tell whether the errors are new or old, but it looks like a device dropped offline.

 

If the data can be deleted, delete all the existing data, reset the pool stats, start copying again and post new diags after new errors.
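Resetting the stats makes any later error unambiguously new. Besides the GUI, btrfs-progs can print and then zero the per-device counters with `btrfs device stats -z /mnt/<pool>`. A minimal sketch that reads `btrfs device stats` output and keeps only non-zero counters, assuming the usual `[/dev/sdX1].counter  value` line format; `nonzero_device_stats` is a hypothetical helper:

```python
import re

# One counter per line, e.g. "[/dev/sdd1].write_io_errs    0"
STAT_RE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<name>\w+)\s+(?P<value>\d+)$")


def nonzero_device_stats(text: str) -> dict:
    """Return {device: {counter: value}} for every counter that is not zero."""
    problems = {}
    for line in text.splitlines():
        match = STAT_RE.match(line.strip())
        if match and int(match["value"]) > 0:
            problems.setdefault(match["dev"], {})[match["name"]] = int(match["value"])
    return problems
```

Run after each copy attempt, an empty result means nothing new happened; write, read and flush errors piling up on a single device are typically what a disk dropping offline looks like.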

  • Author

Well, it was faster than I thought 😅

The copy crashed, the pool went read-only, and I had no access to its settings.

After a reboot, the read-only state was gone and I could access the scrub again, so here is the result:

UUID: 250fadc9-bf35-4060-aa6d-c030b89bca9a
Scrub started: Wed Oct 2 16:47:04 2024
Status: finished
Duration: 0:23:22
Total to scrub: 192.16GiB
Rate: 140.35MiB/s
Error summary: verify=13732 csum=247432
Corrected: 261164
Uncorrectable: 0
Unverified: 0

Pool infos.PNG

Read only.PNG

After reset cache.PNG

Too many profiles.PNG

mc5-diagnostics-20241002-1754.zip

Edited by resolute-clearance8449
Added the diag (again)

  • Community Expert

The disk is dropping offline; replace both cables (power and SATA) and try again.

  • Author

Already tried, and swapped the PSU cables too, but nothing worked.

Is it possible that, since they all come from the same batch, they are all faulty?

If you think that might be a possibility, I will buy some Western Digital Gold drives, for example, and try with those!

Attached is the sound of one of them:

  • Community Expert
  • Solution

That doesn't sound good; if the cables were replaced, it could be a bad disk.

  • Community Expert

It might be worth checking that there is not a power related issue.

  • Author

Sure, especially with a G650M, even if the sound of the disks is a bit scary! I will go back to my faithful WD drives and keep you updated. Thanks a lot for your time!!

  • Author

Hey! Quick update since I received the new WD Red Plus on Saturday: I copied 4TB of data to it without a single issue.

From what I see now, I am stunned that all this mess might only be caused by all the Seagate drives being faulty (4 in total).

 

Next steps to be sure:

- relaunch all the apps that were using it, and stress test for a week or so
- if there are still no errors, add the "backup disk" to create a RAID1 and confirm (or not) whether the issue came from the disks or from the RAID architecture itself!

I'll keep you updated, obviously (and maybe the 1 or 2 people this problem could happen to in the future, hello to you)

mc5-diagnostics-20241007-1331.zip

Edited by resolute-clearance8449

  • Community Expert
6 minutes ago, resolute-clearance8449 said:

From what I see now, I am stunned that all this mess might only be caused by all the Seagate drives being faulty (4 in total).

It could also be due to something that happened while they were in transit, since presumably they all travelled together.

  • Author

Yes, 100% possible too. I'll report back in about a week, but everything seems to be resolved thanks to your help :) Fingers crossed!

  • 3 weeks later...
  • Author

Hi! Last update (I hope), after 2 weeks of testing: everything works perfectly :)

- in "btrfs single" configuration, not a single error on either disk (the new WD Red Plus 10TB or the "old" WD Black 8TB, each in its own pool), even while writing/reading intensively for a week
- so I copied the ~6TB of data from the 8TB to the 10TB
- I then deleted all the content of the 8TB and merged the 2 disks into one pool
- Unraid switched from "btrfs single" to "raid1", and everything went smoothly
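A quick sanity check on that last step: btrfs raid1 stores every chunk on two different devices, so usable space is roughly half the total, unless the largest device is bigger than all the others combined, in which case its excess cannot be mirrored. A minimal sketch of that rule of thumb, assuming sizes in decimal TB and ignoring metadata overhead and allocation granularity; `btrfs_raid1_usable` is a hypothetical name:

```python
def btrfs_raid1_usable(sizes_tb):
    """Approximate usable capacity of a btrfs raid1 pool (two copies of each chunk).

    Half the total, capped by what the other devices can mirror for the
    largest one. Ignores metadata overhead and allocation granularity.
    """
    total = sum(sizes_tb)
    return min(total / 2, total - max(sizes_tb))
```

For the 10TB + 8TB pool, `btrfs_raid1_usable([10, 8])` gives 8, so the ~6TB copied earlier fits with roughly 2TB to spare before metadata is taken into account.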

 

Everything has worked like a charm since; thanks a lot for your help (and long live Western Digital!) ❤️

mc5-diagnostics-20241024-1041.zip
