[SOLVED] One Drive Write/Read Errors, One Drive Read Errors, One Parity Drive

elecgnosis · February 9, 2021

Woke up this morning and found a red X next to Drive 11.

System log shows read and write errors on Drive 11 and enough read errors on Drive 7 to hit the 128MB log file limit.

I pulled the syslog, but I didn't get SMART logs before shutting down.

After shutting down, I reseated the drives.

I restarted and ran short SMART scans on both drives. They passed.

I have not restarted the array.

I have a hot spare precleared and ready to replace one of these drives.

I'm not sure what to do next. Disk 11 might be bad or it might have just been a loose connection. Disk 7 may be no better, though it didn't have any write errors. If I rebuild now it might be fine, but I'd like to know what options I have before that.

I'm grateful for any help or advice you folks can provide.

Full system log attached. Here's a partial system log from before the shutdown (... denotes lines omitted):

Jan 23 13:09:44 OMNI kernel: microcode: microcode updated early to revision 0x21, date = 2019-02-13
...
Feb  8 04:54:49 OMNI CA Backup/Restore: Backup / Restore Completed
Feb  8 19:45:10 OMNI login[17876]: ROOT LOGIN  on '/dev/pts/0'
Feb  9 03:16:40 OMNI kernel: sd 9:0:2:0: device_block, handle(0x001e)
Feb  9 03:16:41 OMNI kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
...
Feb  9 03:16:41 OMNI kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
Feb  9 03:16:42 OMNI kernel: sd 9:0:2:0: device_unblock and setting to running, handle(0x001e)
Feb  9 03:16:42 OMNI kernel: sd 9:0:2:0: [sdd] Synchronizing SCSI cache
Feb  9 03:16:42 OMNI kernel: print_req_error: I/O error, dev sdd, sector 704672
Feb  9 03:16:42 OMNI kernel: md: disk11 read error, sector=704608
...
Feb  9 03:16:42 OMNI kernel: print_req_error: I/O error, dev sdd, sector 707296
Feb  9 03:16:42 OMNI kernel: md: disk11 read error, sector=681936
...
Feb  9 03:16:42 OMNI kernel: md: disk11 read error, sector=779416
Feb  9 03:16:42 OMNI kernel: scsi 9:0:2:0: rejecting I/O to dead device
Feb  9 03:16:42 OMNI kernel: md: disk11 read error, sector=745568
Feb  9 03:16:42 OMNI kernel: mpt2sas_cm0: removing handle(0x001e), sas_addr(0x4433221101000000)
Feb  9 03:16:42 OMNI kernel: mpt2sas_cm0: enclosure logical id(0x500605b001521880), slot(2) 
Feb  9 03:16:42 OMNI kernel: md: disk11 write error, sector=5407464512
...
Feb  9 03:16:42 OMNI kernel: md: disk11 write error, sector=5407464504
Feb  9 03:16:42 OMNI rc.diskinfo[7600]: SIGHUP received, forcing refresh of disks info.
Feb  9 03:16:42 OMNI kernel: md: disk11 write error, sector=565600
...
Feb  9 03:16:42 OMNI kernel: scsi 9:0:2:0: rejecting I/O to dead device
Feb  9 03:16:42 OMNI kernel: md: disk11 read error, sector=745568
Feb  9 03:16:42 OMNI kernel: mpt2sas_cm0: removing handle(0x001e), sas_addr(0x4433221101000000)
Feb  9 03:16:42 OMNI kernel: mpt2sas_cm0: enclosure logical id(0x500605b001521880), slot(2) 
Feb  9 03:16:42 OMNI kernel: md: disk11 write error, sector=779416
Feb  9 03:16:53 OMNI kernel: scsi 9:0:25:0: Direct-Access     ATA      WDC WD100EFAX-68 0A83 PQ: 0 ANSI: 6
Feb  9 03:16:53 OMNI kernel: scsi 9:0:25:0: SATA: handle(0x001e), sas_addr(0x4433221101000000), phy(1), device_name(0x0000000000000000)
Feb  9 03:16:53 OMNI kernel: scsi 9:0:25:0: enclosure logical id (0x500605b001521880), slot(2) 
Feb  9 03:16:53 OMNI kernel: scsi 9:0:25:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
Feb  9 03:16:53 OMNI kernel: sd 9:0:25:0: Power-on or device reset occurred
Feb  9 03:16:53 OMNI kernel: sd 9:0:25:0: Attached scsi generic sg3 type 0
Feb  9 03:16:53 OMNI kernel: sd 9:0:25:0: [sdaa] 19532873728 512-byte logical blocks: (10.0 TB/9.10 TiB)
Feb  9 03:16:53 OMNI kernel: sd 9:0:25:0: [sdaa] 4096-byte physical blocks
Feb  9 03:16:54 OMNI kernel: sd 9:0:25:0: [sdaa] Write Protect is off
Feb  9 03:16:54 OMNI kernel: sd 9:0:25:0: [sdaa] Mode Sense: 7f 00 10 08
Feb  9 03:16:54 OMNI kernel: sd 9:0:25:0: [sdaa] Write cache: enabled, read cache: enabled, supports DPO and FUA
Feb  9 03:16:54 OMNI kernel: sdaa: sdaa1
Feb  9 03:16:54 OMNI kernel: sd 9:0:25:0: [sdaa] Attached SCSI disk
Feb  9 03:16:54 OMNI rc.diskinfo[7600]: SIGHUP received, forcing refresh of disks info.
Feb  9 03:16:54 OMNI kernel: BTRFS warning (device md11): duplicate device fsid:devid for aa876094-6b30-4aac-b367-aaf5a6d6fde8:1 old:/dev/md11 new:/dev/sdaa1
Feb  9 03:16:54 OMNI unassigned.devices: Disk with serial 'WDC_WD100EFAX-68LHPN0_XXXXXXXX', mountpoint 'WDC_WD100EFAX-68LHPN0_XXXXXXXX' is not set to auto mount.
Feb  9 03:21:45 OMNI kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
Feb  9 03:21:45 OMNI kernel: sd 9:0:1:0: [sdc] tag#2871 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Feb  9 03:21:45 OMNI kernel: sd 9:0:1:0: [sdc] tag#2871 Sense Key : 0x2 [current] 
Feb  9 03:21:45 OMNI kernel: sd 9:0:1:0: [sdc] tag#2871 ASC=0x4 ASCQ=0x0 
Feb  9 03:21:45 OMNI kernel: sd 9:0:1:0: [sdc] tag#2871 CDB: opcode=0x88 88 00 00 00 00 00 bd b5 9b c0 00 00 00 20 00 00
Feb  9 03:21:45 OMNI kernel: print_req_error: 120 callbacks suppressed
Feb  9 03:21:45 OMNI kernel: print_req_error: I/O error, dev sdc, sector 3182795712
Feb  9 03:21:45 OMNI kernel: md: disk7 read error, sector=3182795648
...
Feb  9 03:21:45 OMNI kernel: md: disk7 read error, sector=3182795672
Feb  9 03:21:45 OMNI kernel: BTRFS error (device md11): bdev /dev/md11 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
Feb  9 03:21:45 OMNI kernel: sd 9:0:1:0: [sdc] tag#2873 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Feb  9 03:21:45 OMNI kernel: sd 9:0:1:0: [sdc] tag#2873 Sense Key : 0x2 [current] 
Feb  9 03:21:45 OMNI kernel: sd 9:0:1:0: [sdc] tag#2873 ASC=0x4 ASCQ=0x0 
Feb  9 03:21:45 OMNI kernel: sd 9:0:1:0: [sdc] tag#2873 CDB: opcode=0x88 88 00 00 00 00 03 3c db 10 00 00 00 00 20 00 00
Feb  9 03:21:45 OMNI kernel: print_req_error: I/O error, dev sdc, sector 13905891328
Feb  9 03:21:45 OMNI kernel: md: disk7 read error, sector=13905891264
...
Feb  9 03:21:45 OMNI kernel: md: disk7 read error, sector=13905891288
Feb  9 03:21:45 OMNI kernel: BTRFS error (device md11): bdev /dev/md11 errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
Feb  9 03:21:45 OMNI kernel: BTRFS error (device md11): error loading props for ino 76593 (root 5): -5
Feb  9 03:21:45 OMNI kernel: BTRFS error (device md11): bdev /dev/md11 errs: wr 0, rd 3, flush 0, corrupt 0, gen 0
Feb  9 03:21:47 OMNI kernel: sd 9:0:1:0: [sdc] tag#2817 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Feb  9 03:21:47 OMNI kernel: sd 9:0:1:0: [sdc] tag#2817 Sense Key : 0x2 [current] 
Feb  9 03:21:47 OMNI kernel: sd 9:0:1:0: [sdc] tag#2817 ASC=0x4 ASCQ=0x0 
Feb  9 03:21:47 OMNI kernel: sd 9:0:1:0: [sdc] tag#2817 CDB: opcode=0x88 88 00 00 00 00 03 94 8f 18 a0 00 00 00 20 00 00
Feb  9 03:21:47 OMNI kernel: print_req_error: I/O error, dev sdc, sector 15377307808
Feb  9 03:21:47 OMNI kernel: md: disk7 read error, sector=15377307744
...
Feb  9 03:21:47 OMNI kernel: md: disk7 read error, sector=15377307768
Feb  9 03:21:47 OMNI kernel: BTRFS error (device md11): bdev /dev/md11 errs: wr 0, rd 4, flush 0, corrupt 0, gen 0
Feb  9 03:21:48 OMNI kernel: sd 9:0:1:0: [sdc] tag#2823 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Feb  9 03:21:48 OMNI kernel: sd 9:0:1:0: [sdc] tag#2823 Sense Key : 0x2 [current] 
Feb  9 03:21:48 OMNI kernel: sd 9:0:1:0: [sdc] tag#2823 ASC=0x4 ASCQ=0x0 
Feb  9 03:21:48 OMNI kernel: sd 9:0:1:0: [sdc] tag#2823 CDB: opcode=0x88 88 00 00 00 00 02 55 da b9 e0 00 00 00 20 00 00
Feb  9 03:21:48 OMNI kernel: print_req_error: I/O error, dev sdc, sector 10030332384
Feb  9 03:21:48 OMNI kernel: md: disk7 read error, sector=10030332320
...
Feb  9 03:21:48 OMNI kernel: md: disk7 read error, sector=10030332344
Feb  9 03:21:48 OMNI kernel: BTRFS error (device md11): bdev /dev/md11 errs: wr 0, rd 5, flush 0, corrupt 0, gen 0
Feb  9 03:21:51 OMNI kernel: sd 9:0:1:0: [sdc] tag#2846 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Feb  9 03:21:51 OMNI kernel: sd 9:0:1:0: [sdc] tag#2846 Sense Key : 0x2 [current] 
Feb  9 03:21:51 OMNI kernel: sd 9:0:1:0: [sdc] tag#2846 ASC=0x4 ASCQ=0x0 
Feb  9 03:21:51 OMNI kernel: sd 9:0:1:0: [sdc] tag#2846 CDB: opcode=0x88 88 00 00 00 00 00 3f c5 d2 40 00 00 00 20 00 00
Feb  9 03:21:51 OMNI kernel: print_req_error: I/O error, dev sdc, sector 1069929024
Feb  9 03:21:51 OMNI kernel: md: disk7 read error, sector=1069928960
...
Feb  9 03:21:51 OMNI kernel: md: disk7 read error, sector=1069928984
Feb  9 03:21:51 OMNI kernel: BTRFS error (device md11): bdev /dev/md11 errs: wr 0, rd 6, flush 0, corrupt 0, gen 0
...
Feb  9 03:21:51 OMNI kernel: BTRFS error (device md11): bdev /dev/md11 errs: wr 0, rd 15, flush 0, corrupt 0, gen 0
Feb  9 03:21:54 OMNI kernel: sd 9:0:1:0: Power-on or device reset occurred
Feb  9 03:21:55 OMNI rc.diskinfo[7600]: SIGHUP received, forcing refresh of disks info.
Feb  9 03:22:19 OMNI kernel: sd 9:0:1:0: device_block, handle(0x001d)
Feb  9 03:22:20 OMNI kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
...
Feb  9 03:22:20 OMNI kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
Feb  9 03:22:21 OMNI kernel: sd 9:0:1:0: device_unblock and setting to running, handle(0x001d)
Feb  9 03:22:21 OMNI kernel: sd 9:0:1:0: [sdc] Synchronizing SCSI cache
Feb  9 03:22:21 OMNI kernel: print_req_error: I/O error, dev sdc, sector 8956304744
Feb  9 03:22:21 OMNI kernel: md: disk7 read error, sector=8956304680
...
Feb  9 03:22:21 OMNI kernel: md: disk7 read error, sector=8956304704
Feb  9 03:22:21 OMNI kernel: print_req_error: I/O error, dev sdc, sector 3896548872
...
Feb  9 03:22:21 OMNI kernel: md: disk7 read error, sector=3896548808
Feb  9 03:22:21 OMNI kernel: md: disk7 read error, sector=3896548832
Feb  9 03:22:21 OMNI kernel: print_req_error: I/O error, dev sdc, sector 5544903688
Feb  9 03:22:21 OMNI kernel: md: disk7 read error, sector=5544903624
...
Feb  9 03:22:30 OMNI kernel: md: disk7 read error, sector=12006235056
Feb  9 03:22:30 OMNI kernel: sd 9:0:26:0: [sdab] Attached SCSI disk
Feb  9 03:22:30 OMNI kernel: md: disk7 read error, sector=11945181264
...
Feb  9 03:41:50 OMNI kernel: BTRFS error (device md11): bdev /dev/md11 errs: wr 0, rd 3202, flush 0, corrupt 0, gen 0
Feb  9 03:41:50 OMNI kernel: md: disk7 read error, sector=15848823584
...(disk7 read errors continue until EOF)
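To get a quick per-disk tally out of a syslog like the one above, something along these lines could help. This is only a minimal sketch: the regular expression is based solely on the `md: diskN read/write error, sector=...` lines shown in this excerpt, and the function name `count_md_errors` is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... kernel: md: disk11 read error, sector=704608"
# The pattern is an assumption based only on the excerpt in this post.
MD_ERROR = re.compile(r"md: (disk\d+) (read|write) error, sector=(\d+)")


def count_md_errors(lines):
    """Return a Counter keyed by (disk, kind), e.g. ("disk7", "read")."""
    counts = Counter()
    for line in lines:
        match = MD_ERROR.search(line)
        if match:
            disk, kind, _sector = match.groups()
            counts[(disk, kind)] += 1
    return counts


# A few lines copied from the excerpt above.
sample = [
    "Feb  9 03:16:42 OMNI kernel: md: disk11 read error, sector=704608",
    "Feb  9 03:16:42 OMNI kernel: md: disk11 write error, sector=5407464512",
    "Feb  9 03:21:45 OMNI kernel: md: disk7 read error, sector=3182795648",
    "Feb  9 03:21:45 OMNI kernel: md: disk7 read error, sector=3182795672",
    "Feb  9 03:21:54 OMNI kernel: sd 9:0:1:0: Power-on or device reset occurred",
]

if __name__ == "__main__":
    for (disk, kind), n in sorted(count_md_errors(sample).items()):
        print(f"{disk}: {n} {kind} error(s)")
```

Run against the full syslog (for example by passing an open file object instead of `sample`), this would show at a glance which disks logged write errors and which only logged read errors.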

SystemLogNoSerials.zip

Edited February 11, 2021 by elecgnosis

trurl · February 9, 2021

Go to Tools - Diagnostics and attach the complete Diagnostics ZIP file to your NEXT post in this thread.

elecgnosis · February 9, 2021

As requested

diagnostics-20210209-0853.zip

JorgeB · February 9, 2021

Both disks dropped offline and then reconnected with a different identifier. This is usually a connection/power problem, but it could also be controller/expander related.
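For anyone who wants to spot this pattern in their own syslog, here is a rough sketch. It assumes the mpt2sas message formats seen in the log earlier in this thread (`removing handle(...), sas_addr(...)` followed later by `SATA: handle(...), sas_addr(...)` on a new SCSI address); other controllers and drivers log differently, and the function name `find_reconnects` is hypothetical.

```python
import re

# Patterns are assumptions taken from the mpt2sas lines quoted in this thread.
SD_NAME = re.compile(r"sd (\d+:\d+:\d+:\d+): \[(sd[a-z]+)\]")
REMOVED = re.compile(r"removing handle\((0x[0-9a-f]+)\), sas_addr\((0x[0-9a-f]+)\)")
ATTACHED = re.compile(
    r"scsi (\d+:\d+:\d+:\d+): SATA: handle\((0x[0-9a-f]+)\), sas_addr\((0x[0-9a-f]+)\)"
)


def find_reconnects(lines):
    """Map each SAS address that dropped and came back to its new device name."""
    names = {}        # SCSI address -> device name, e.g. "9:0:25:0" -> "sdaa"
    removed = set()   # SAS addresses that were seen being removed
    reattached = {}   # SAS address -> SCSI address it reattached on
    for line in lines:
        m = SD_NAME.search(line)
        if m:
            names[m.group(1)] = m.group(2)
        m = REMOVED.search(line)
        if m:
            removed.add(m.group(2))
        m = ATTACHED.search(line)
        if m and m.group(3) in removed:
            reattached[m.group(3)] = m.group(1)
    # Resolve names last, because "[sdaa]" is logged after the attach line.
    return {sas: names.get(addr) for sas, addr in reattached.items()}


# Lines copied from the first post.
sample = [
    "Feb  9 03:16:42 OMNI kernel: sd 9:0:2:0: [sdd] Synchronizing SCSI cache",
    "Feb  9 03:16:42 OMNI kernel: mpt2sas_cm0: removing handle(0x001e), sas_addr(0x4433221101000000)",
    "Feb  9 03:16:53 OMNI kernel: scsi 9:0:25:0: SATA: handle(0x001e), sas_addr(0x4433221101000000), phy(1), device_name(0x0000000000000000)",
    "Feb  9 03:16:53 OMNI kernel: sd 9:0:25:0: [sdaa] 4096-byte physical blocks",
]

if __name__ == "__main__":
    for sas_addr, new_name in find_reconnects(sample).items():
        print(f"{sas_addr} dropped and came back as {new_name}")
```

In the log from the first post, the disk that was `sdd` at `9:0:2:0` comes back as `sdaa` at `9:0:25:0` with the same SAS address, which is the drop-and-reconnect pattern described here.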

elecgnosis · February 9, 2021

So I need to troubleshoot all of my connections and maybe even my power supply. That said, it sounds like it may also be okay as is.

How would I reset the red X without replacing the drive that UnRaid disabled?

JorgeB · February 9, 2021

You can rebuild on top if, and only if, the emulated disk mounts and its data looks correct. Alternatively, do a new config and re-sync parity.

elecgnosis · February 9, 2021

So I have two paths: Trust the disk's data (new config/re-sync parity) or trust the drive's condition (rebuild on top).

Regardless of which option I go with, if another drive goes bad during either operation, I will lose the contents of both drives.

If I go with rebuilding on top, would it be better to preclear the disk first? Is there any other way to validate the drive's condition?

JonathanM · February 9, 2021

39 minutes ago, elecgnosis said:

So I have two paths: Trust the disk's data (new config/re-sync parity) or trust the drive's condition (rebuild on top).

Third option, rebuild on a totally different drive and keep the dropped drive intact.

41 minutes ago, elecgnosis said:

Regardless of which option I go with, if another drive goes bad during either operation, I will lose the contents of both drives.

True, but if you use a new drive to rebuild on, at least you still have some hope of possible recovery from the excluded drive.

42 minutes ago, elecgnosis said:

If I go with rebuilding on top, would it be better to preclear the disk first?

No. A long SMART test would be a good indicator of condition. There's no need to erase everything currently on the drive; even if it's partially corrupt, it still might be somewhat salvageable.
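If you'd rather start the long test from a script than from the GUI, a sketch along these lines might work. It assumes `smartctl` (from smartmontools) is on the PATH; `/dev/sdX` is a placeholder, not a real device, and the helper names are made up for illustration. The test only reads the disk, so it does not erase anything.

```python
import subprocess


def long_test_command(device):
    """Command that starts an extended (long) SMART self-test on `device`."""
    return ["smartctl", "-t", "long", device]


def selftest_log_command(device):
    """Command that prints the SMART self-test log for `device`."""
    return ["smartctl", "-l", "selftest", device]


def run(command):
    """Run a command and return its stdout as text (assumes smartctl is installed)."""
    completed = subprocess.run(command, capture_output=True, text=True, check=False)
    return completed.stdout


if __name__ == "__main__":
    device = "/dev/sdX"  # placeholder: substitute the actual device node
    print(" ".join(long_test_command(device)))
    print(" ".join(selftest_log_command(device)))
    # Uncomment on a real system, with a real device node:
    # print(run(long_test_command(device)))
```

The long test runs inside the drive and can take many hours on a 10 TB disk; check the self-test log afterwards to see the result.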

elecgnosis · February 9, 2021

I wanted to know what possible causes could be and next steps. I think I have what I need to take action, so I'll mark this solved. @jonathanm and @JorgeB, thanks for your help.

If I have trouble after this that I can't understand on my own, should I reply to this thread, open a new one, or reach out over PM?

JorgeB · February 10, 2021

9 hours ago, elecgnosis said:

should I reply to this thread

If it's related to this issue use this thread, so we can remember the full story.

elecgnosis · February 11, 2021

I'm confused now. I chose to pull the drive that had the write error and replace it with my hot spare.

When I started the array, as the drives were mounting, the new drive came up as Unmountable, though the rebuild is still happening.
I haven't done anything with the original drive that had the read error.

I have a bad feeling that the rebuild will result in an empty drive. Can you help me find out what's going on? Do I still have an opportunity to save that data?

elecgnosis · February 11, 2021

I was able to mount the drive with the write error as an unassigned device. I am still able to access its files.

I'm looking at this similar topic, and I think I understand the problem better: While there may not have been any mechanical failure or damage in the drive, its BTRFS filesystem was somehow corrupted?

So, even after rebuilding, I will need to repair the file system on the new drive by using the Scrub command?

trurl · February 11, 2021

Post new diagnostics.

elecgnosis · February 11, 2021

As requested

diagnostics-20210210-2047.zip

JorgeB · February 11, 2021

The filesystem on the emulated disk is corrupt; your best bet is to re-sync parity with all the original drives.

elecgnosis · February 11, 2021

I put the original disk 11 back in the array. I set a new config with the same drive assignments. I started the array.

Disk 11 still comes up as unmountable/no file system even though when I had it mounted in unassigned devices, all of the data was there.

Not sure what to do now. Am I hosed?

diagnostics-20210211-0101.zip

elecgnosis · February 11, 2021

I had physically removed the original disk 11, but I didn't reassign it to the disk 11 slot. The hot spare was still in its slot when I did the new config.

I reassigned it to slot 11, did another new config, and started up the array. Parity is resyncing. Everything seems to be fine now.

Thanks, everyone.

Edited February 11, 2021 by elecgnosis

JorgeB · February 11, 2021

Strange, if it mounted with UD it should mount in the array.

JorgeB · February 11, 2021

Just now, elecgnosis said:

I reassigned it to slot 11, did another new config, and started up the array. Parity is resyncing. Everything seems to be fine now.

Ahh.
