btrfs drive problems etc.

JorgeB · November 25, 2019

1 minute ago, G Speed said:

Is it in IT mode?

No, it's using the megaraid driver:

01:00.0 RAID bus controller [0104]: Broadcom / LSI MegaRAID SAS 2008 [Falcon] [1000:0073] (rev 03)
    Subsystem: Dell PERC H310 [1028:1f78]
    Kernel driver in use: megaraid_sas
    Kernel modules: megaraid_sas

BTW, I've seen the same controller in RAID mode still use the HBA driver (mpt3sas); I'm not sure why one or the other gets used, but this one is using the megaraid driver. SMART might work if you set the correct options, but it would be best to flash the controller to IT mode.
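
For anyone wanting to repeat this check, here is a minimal sketch. The listing above is in the format that `lspci -nnk` prints, and the helper below only extracts the "Kernel driver in use" line from such text. The function name `driver_in_use` is made up for illustration, and the sample text is copied from the listing above rather than read from a live system. The hint about `smartctl -d megaraid,N` is one reading of "the correct options" mentioned above, not something the post spells out.

```shell
#!/bin/sh
# Hypothetical helper: read lspci -k style text on stdin and print the
# name of the kernel driver that is bound to the device.
driver_in_use() {
    sed -n 's/^[[:space:]]*Kernel driver in use:[[:space:]]*//p'
}

# Sample copied from the post above. On a live system you would pipe in
# the output of `lspci -nnk` instead (assumes pciutils is installed).
sample='01:00.0 RAID bus controller [0104]: Broadcom / LSI MegaRAID SAS 2008 [Falcon] [1000:0073] (rev 03)
    Subsystem: Dell PERC H310 [1028:1f78]
    Kernel driver in use: megaraid_sas
    Kernel modules: megaraid_sas'

driver=$(printf '%s\n' "$sample" | driver_in_use)

case "$driver" in
    megaraid_sas)
        # Behind RAID firmware, smartctl usually needs -d megaraid,N
        # (N = the disk's number on the controller).
        echo "RAID firmware ($driver): try smartctl -d megaraid,N" ;;
    mpt2sas|mpt3sas)
        echo "HBA driver ($driver): plain smartctl should normally work" ;;
    *)
        echo "driver: ${driver:-unknown}" ;;
esac
```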

G Speed · November 25, 2019

Thanks for catching that.
I will fix that, but I'm more focused on figuring out the disk issue.

One thought: when I "scrub", it only reads existing data, not free space.
So the whole disk is not being read.

Should I run an extended SMART test?
At least the whole disk will be read...

On that note, what command do I use for that, plus logging?

JorgeB · November 25, 2019

Scrub checks data integrity, not the disk. Something is corrupting the data; even a failing disk should never return corrupt data, though as mentioned it has happened before. Running a SMART extended test is a good idea. Alternatively, you can run a non-correcting parity check, which will also read the entire disk.

G Speed · November 25, 2019

Can I do a parity check on a single disk?

JorgeB · November 25, 2019

No, it will check all disks.

G Speed · November 25, 2019

Might as well just do an extended SMART test then.
Is this correct?
smartctl -t long /dev/sdc
followed by
smartctl -a -A /dev/sdc >/boot/smart.txt

Edited November 25, 2019 by G Speed

JorgeB · November 25, 2019

10 minutes ago, G Speed said:

Is this correct?
smartctl -t long /dev/sdc

Yes

10 minutes ago, G Speed said:

smartctl -a -A /dev/sdc >/boot/smart.txt

Use -x instead of -a; it gives more info.
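
Putting the two answers together, the sequence can be sketched as below. `/dev/sdc` and `/boot/smart.txt` are the values from the posts above. The `DRY_RUN` switch and the function names are assumptions added so the script can be read and tried without touching a disk: with `DRY_RUN=1` (the default here) it only prints what it would run. Note that `-a` already includes `-A`, so `-x` on its own replaces both.

```shell
#!/bin/sh
# Sketch only: device and output path are taken from the thread.
DEV=${DEV:-/dev/sdc}
OUT=${OUT:-/boot/smart.txt}
DRY_RUN=${DRY_RUN:-1}   # set DRY_RUN=0 to actually call smartctl

# Step 1: start the extended (long) self-test. It runs inside the drive,
# so this command returns right away.
start_long_test() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ smartctl -t long $DEV"
    else
        smartctl -t long "$DEV"
    fi
}

# Step 2: once the test has finished, save the full report (-x).
save_report() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ smartctl -x $DEV > $OUT"
    else
        smartctl -x "$DEV" > "$OUT"
    fi
}

start_long_test
save_report
```

The report is only worth saving after the test has completed, so on a real system the second step would be run hours later rather than immediately.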

G Speed · November 26, 2019

Is this correct?

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 972 minutes for test to complete.

Just to confirm: I can't see anything on the Unraid server, and that disk shows as spun down.

JorgeB · November 26, 2019

That's correct, but since you're not running the test from the GUI you need to disable spin down, or the SMART test will be interrupted.

G Speed · November 26, 2019

Hmmm disk is spun down.. but it seems to be working?

Offline data collection status:  (0x00) Offline data collection activity was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 243) Self-test routine in progress...
                                        30% of test remaining.

JorgeB · November 26, 2019

Looks like it is; otherwise it would say "aborted by host". See if it finishes.
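
As an aside on reading that status value: in the ATA SMART data, the self-test execution status is one byte whose high four bits give the state and whose low four bits give the remaining work in steps of 10%. The sketch below decodes the 243 reported above on that basis. The function name is made up for illustration, and only the states relevant to this thread are named.

```shell
#!/bin/sh
# Hypothetical helper: decode the "Self-test execution status" number
# that smartctl prints in parentheses.
decode_selftest_status() {
    value=$1
    state=$(( value / 16 ))               # high nibble: test state
    remaining=$(( (value % 16) * 10 ))    # low nibble: tenths remaining
    case "$state" in
        0)  echo "completed without error" ;;
        1)  echo "aborted by host" ;;
        15) echo "in progress, ${remaining}% remaining" ;;
        *)  echo "other state ${state}" ;;
    esac
}

decode_selftest_status 243   # prints: in progress, 30% remaining
```

So 243 (0xF3) matches the "30% of test remaining" line, and an interrupted test would show state 1 instead of 15.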

G Speed · November 26, 2019

Did I mess something up? Drive is FINE?

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error        00%             5588  -

smart.txt

JorgeB · November 27, 2019

That's expected; except for the recent reported_unc error, SMART looked fine.

G Speed · November 27, 2019

Correct, but that happened previously and hasn't increased.

So why can't I move files over? Back to the same problem... is it a bug with the rc?

JorgeB · November 27, 2019

Not a bug; your data is getting corrupted and you need to find out why. If you already scrubbed the other disks as recommended and the corruption is limited to disk2, it's likely a disk problem: not bad sectors, so not detectable by SMART, but something else. If corruption also affects other disks, then it's likely another problem, like bad RAM, controller, board/CPU, etc.
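
The comparison described above can be sketched as a loop over the disks' mount points. `btrfs scrub start -B` runs the scrub in the foreground and `btrfs device stats` prints the per-device error counters, including `corruption_errs`. The mount points `/mnt/disk1` and `/mnt/disk2` are assumptions based on the disk2 name used in this thread, and the `DRY_RUN` switch and function name are made up for illustration. With `DRY_RUN=1` (the default here) nothing is scrubbed and the commands are only printed.

```shell
#!/bin/sh
DRY_RUN=${DRY_RUN:-1}   # set DRY_RUN=0 to really run the scrubs

# Scrub each given btrfs mount point in turn, then show its error counters.
scrub_and_report() {
    for mnt in "$@"; do
        if [ "$DRY_RUN" = 1 ]; then
            echo "+ btrfs scrub start -B $mnt"
            echo "+ btrfs device stats $mnt"
        else
            btrfs scrub start -B "$mnt"   # -B: wait until the scrub ends
            btrfs device stats "$mnt"     # look at corruption_errs
        fi
    done
}

# Assumed mount points; adjust to the disks actually in the array.
scrub_and_report /mnt/disk1 /mnt/disk2
```

If only one disk shows checksum errors, that points at the disk; errors spread across several disks point at something shared, such as RAM or the controller.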

G Speed · November 27, 2019

Everything else is fine. Doing a non-correcting parity check now; 2TB left, no errors so far.

G Speed · November 29, 2019

So I scrubbed all my drives: 0 errors...

JorgeB · November 29, 2019

Then the most logical answer is a problem with that disk that's silently corrupting data. I would replace it.

dvanders · September 28, 2020

Google brought me to this old thread because I get the exact same failed csum errors on my btrfs: "csum 0x2ac15d26"

And it looks like we have the same model drive: Seagate Barracuda Compute ST8000DM004-2CX188

These cheap SMR drives clearly have some systematic problem, which occasionally manifests as poorly written and then unreadable bytes. I would guess that 0x2ac15d26 is the shash_digest of all 0x00s or 0xFFs. The drive itself never reports any problem; SMART tests always succeed.

As a workaround I run scrub weekly on this FS -- it finds and fixes a few hundred errors each time (always on newly written data).
