elecgnosis Posted February 9, 2021 (edited)
Woke up this morning and found a red X next to Drive 11. System log shows read and write errors on Drive 11 and enough read errors on Drive 7 to hit the 128MB log file limit. I pulled the sys log, but I didn't get smart logs before shutting down. After shutting down, I reseated the drives. I restarted and ran short SMART scans on both drives. They passed. I have not restarted the array. I have a hot spare precleared and ready to replace one of these drives.
I'm not sure what to do next. Disk 11 might be bad or it might have just been a loose connection. Disk 7 may be no better, though it didn't have any write errors. If I rebuild now it might be fine, but I'd like to know what options I have before that. I'm grateful for any help or advice you folks can provide. Full system log attached.
Here's a partial system log from before the shutdown (... denotes lines omitted):
Jan 23 13:09:44 OMNI kernel: microcode: microcode updated early to revision 0x21, date = 2019-02-13
...
Feb 8 04:54:49 OMNI CA Backup/Restore: Backup / Restore Completed
Feb 8 19:45:10 OMNI login[17876]: ROOT LOGIN on '/dev/pts/0'
Feb 9 03:16:40 OMNI kernel: sd 9:0:2:0: device_block, handle(0x001e)
Feb 9 03:16:41 OMNI kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
...
Feb 9 03:16:41 OMNI kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
Feb 9 03:16:42 OMNI kernel: sd 9:0:2:0: device_unblock and setting to running, handle(0x001e)
Feb 9 03:16:42 OMNI kernel: sd 9:0:2:0: [sdd] Synchronizing SCSI cache
Feb 9 03:16:42 OMNI kernel: print_req_error: I/O error, dev sdd, sector 704672
Feb 9 03:16:42 OMNI kernel: md: disk11 read error, sector=704608
...
Feb 9 03:16:42 OMNI kernel: print_req_error: I/O error, dev sdd, sector 707296
Feb 9 03:16:42 OMNI kernel: md: disk11 read error, sector=681936
...
Feb 9 03:16:42 OMNI kernel: md: disk11 read error, sector=779416
Feb 9 03:16:42 OMNI kernel: scsi 9:0:2:0: rejecting I/O to dead device
Feb 9 03:16:42 OMNI kernel: md: disk11 read error, sector=745568
Feb 9 03:16:42 OMNI kernel: mpt2sas_cm0: removing handle(0x001e), sas_addr(0x4433221101000000)
Feb 9 03:16:42 OMNI kernel: mpt2sas_cm0: enclosure logical id(0x500605b001521880), slot(2)
Feb 9 03:16:42 OMNI kernel: md: disk11 write error, sector=5407464512
...
Feb 9 03:16:42 OMNI kernel: md: disk11 write error, sector=5407464504
Feb 9 03:16:42 OMNI rc.diskinfo[7600]: SIGHUP received, forcing refresh of disks info.
Feb 9 03:16:42 OMNI kernel: md: disk11 write error, sector=565600
...
Feb 9 03:16:42 OMNI kernel: scsi 9:0:2:0: rejecting I/O to dead device
Feb 9 03:16:42 OMNI kernel: md: disk11 read error, sector=745568
Feb 9 03:16:42 OMNI kernel: mpt2sas_cm0: removing handle(0x001e), sas_addr(0x4433221101000000)
Feb 9 03:16:42 OMNI kernel: mpt2sas_cm0: enclosure logical id(0x500605b001521880), slot(2)
Feb 9 03:16:42 OMNI kernel: md: disk11 write error, sector=779416
Feb 9 03:16:53 OMNI kernel: scsi 9:0:25:0: Direct-Access ATA WDC WD100EFAX-68 0A83 PQ: 0 ANSI: 6
Feb 9 03:16:53 OMNI kernel: scsi 9:0:25:0: SATA: handle(0x001e), sas_addr(0x4433221101000000), phy(1), device_name(0x0000000000000000)
Feb 9 03:16:53 OMNI kernel: scsi 9:0:25:0: enclosure logical id (0x500605b001521880), slot(2)
Feb 9 03:16:53 OMNI kernel: scsi 9:0:25:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
Feb 9 03:16:53 OMNI kernel: sd 9:0:25:0: Power-on or device reset occurred
Feb 9 03:16:53 OMNI kernel: sd 9:0:25:0: Attached scsi generic sg3 type 0
Feb 9 03:16:53 OMNI kernel: sd 9:0:25:0: [sdaa] 19532873728 512-byte logical blocks: (10.0 TB/9.10 TiB)
Feb 9 03:16:53 OMNI kernel: sd 9:0:25:0: [sdaa] 4096-byte physical blocks
Feb 9 03:16:54 OMNI kernel: sd 9:0:25:0: [sdaa] Write Protect is off
Feb 9 03:16:54 OMNI kernel: sd 9:0:25:0: [sdaa] Mode Sense: 7f 00 10 08
Feb 9 03:16:54 OMNI kernel: sd 9:0:25:0: [sdaa] Write cache: enabled, read cache: enabled, supports DPO and FUA
Feb 9 03:16:54 OMNI kernel: sdaa: sdaa1
Feb 9 03:16:54 OMNI kernel: sd 9:0:25:0: [sdaa] Attached SCSI disk
Feb 9 03:16:54 OMNI rc.diskinfo[7600]: SIGHUP received, forcing refresh of disks info.
Feb 9 03:16:54 OMNI kernel: BTRFS warning (device md11): duplicate device fsid:devid for aa876094-6b30-4aac-b367-aaf5a6d6fde8:1 old:/dev/md11 new:/dev/sdaa1
Feb 9 03:16:54 OMNI unassigned.devices: Disk with serial 'WDC_WD100EFAX-68LHPN0_XXXXXXXX', mountpoint 'WDC_WD100EFAX-68LHPN0_XXXXXXXX' is not set to auto mount.
Feb 9 03:21:45 OMNI kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
Feb 9 03:21:45 OMNI kernel: sd 9:0:1:0: [sdc] tag#2871 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Feb 9 03:21:45 OMNI kernel: sd 9:0:1:0: [sdc] tag#2871 Sense Key : 0x2 [current]
Feb 9 03:21:45 OMNI kernel: sd 9:0:1:0: [sdc] tag#2871 ASC=0x4 ASCQ=0x0
Feb 9 03:21:45 OMNI kernel: sd 9:0:1:0: [sdc] tag#2871 CDB: opcode=0x88 88 00 00 00 00 00 bd b5 9b c0 00 00 00 20 00 00
Feb 9 03:21:45 OMNI kernel: print_req_error: 120 callbacks suppressed
Feb 9 03:21:45 OMNI kernel: print_req_error: I/O error, dev sdc, sector 3182795712
Feb 9 03:21:45 OMNI kernel: md: disk7 read error, sector=3182795648
...
Feb 9 03:21:45 OMNI kernel: md: disk7 read error, sector=3182795672
Feb 9 03:21:45 OMNI kernel: BTRFS error (device md11): bdev /dev/md11 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
Feb 9 03:21:45 OMNI kernel: sd 9:0:1:0: [sdc] tag#2873 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Feb 9 03:21:45 OMNI kernel: sd 9:0:1:0: [sdc] tag#2873 Sense Key : 0x2 [current]
Feb 9 03:21:45 OMNI kernel: sd 9:0:1:0: [sdc] tag#2873 ASC=0x4 ASCQ=0x0
Feb 9 03:21:45 OMNI kernel: sd 9:0:1:0: [sdc] tag#2873 CDB: opcode=0x88 88 00 00 00 00 03 3c db 10 00 00 00 00 20 00 00
Feb 9 03:21:45 OMNI kernel: print_req_error: I/O error, dev sdc, sector 13905891328
Feb 9 03:21:45 OMNI kernel: md: disk7 read error, sector=13905891264
...
Feb 9 03:21:45 OMNI kernel: md: disk7 read error, sector=13905891288
Feb 9 03:21:45 OMNI kernel: BTRFS error (device md11): bdev /dev/md11 errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
Feb 9 03:21:45 OMNI kernel: BTRFS error (device md11): error loading props for ino 76593 (root 5): -5
Feb 9 03:21:45 OMNI kernel: BTRFS error (device md11): bdev /dev/md11 errs: wr 0, rd 3, flush 0, corrupt 0, gen 0
Feb 9 03:21:47 OMNI kernel: sd 9:0:1:0: [sdc] tag#2817 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Feb 9 03:21:47 OMNI kernel: sd 9:0:1:0: [sdc] tag#2817 Sense Key : 0x2 [current]
Feb 9 03:21:47 OMNI kernel: sd 9:0:1:0: [sdc] tag#2817 ASC=0x4 ASCQ=0x0
Feb 9 03:21:47 OMNI kernel: sd 9:0:1:0: [sdc] tag#2817 CDB: opcode=0x88 88 00 00 00 00 03 94 8f 18 a0 00 00 00 20 00 00
Feb 9 03:21:47 OMNI kernel: print_req_error: I/O error, dev sdc, sector 15377307808
Feb 9 03:21:47 OMNI kernel: md: disk7 read error, sector=15377307744
...
Feb 9 03:21:47 OMNI kernel: md: disk7 read error, sector=15377307768
Feb 9 03:21:47 OMNI kernel: BTRFS error (device md11): bdev /dev/md11 errs: wr 0, rd 4, flush 0, corrupt 0, gen 0
Feb 9 03:21:48 OMNI kernel: sd 9:0:1:0: [sdc] tag#2823 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Feb 9 03:21:48 OMNI kernel: sd 9:0:1:0: [sdc] tag#2823 Sense Key : 0x2 [current]
Feb 9 03:21:48 OMNI kernel: sd 9:0:1:0: [sdc] tag#2823 ASC=0x4 ASCQ=0x0
Feb 9 03:21:48 OMNI kernel: sd 9:0:1:0: [sdc] tag#2823 CDB: opcode=0x88 88 00 00 00 00 02 55 da b9 e0 00 00 00 20 00 00
Feb 9 03:21:48 OMNI kernel: print_req_error: I/O error, dev sdc, sector 10030332384
Feb 9 03:21:48 OMNI kernel: md: disk7 read error, sector=10030332320
...
Feb 9 03:21:48 OMNI kernel: md: disk7 read error, sector=10030332344
Feb 9 03:21:48 OMNI kernel: BTRFS error (device md11): bdev /dev/md11 errs: wr 0, rd 5, flush 0, corrupt 0, gen 0
Feb 9 03:21:51 OMNI kernel: sd 9:0:1:0: [sdc] tag#2846 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Feb 9 03:21:51 OMNI kernel: sd 9:0:1:0: [sdc] tag#2846 Sense Key : 0x2 [current]
Feb 9 03:21:51 OMNI kernel: sd 9:0:1:0: [sdc] tag#2846 ASC=0x4 ASCQ=0x0
Feb 9 03:21:51 OMNI kernel: sd 9:0:1:0: [sdc] tag#2846 CDB: opcode=0x88 88 00 00 00 00 00 3f c5 d2 40 00 00 00 20 00 00
Feb 9 03:21:51 OMNI kernel: print_req_error: I/O error, dev sdc, sector 1069929024
Feb 9 03:21:51 OMNI kernel: md: disk7 read error, sector=1069928960
...
Feb 9 03:21:51 OMNI kernel: md: disk7 read error, sector=1069928984
Feb 9 03:21:51 OMNI kernel: BTRFS error (device md11): bdev /dev/md11 errs: wr 0, rd 6, flush 0, corrupt 0, gen 0
...
Feb 9 03:21:51 OMNI kernel: BTRFS error (device md11): bdev /dev/md11 errs: wr 0, rd 15, flush 0, corrupt 0, gen 0
Feb 9 03:21:54 OMNI kernel: sd 9:0:1:0: Power-on or device reset occurred
Feb 9 03:21:55 OMNI rc.diskinfo[7600]: SIGHUP received, forcing refresh of disks info.
Feb 9 03:22:19 OMNI kernel: sd 9:0:1:0: device_block, handle(0x001d)
Feb 9 03:22:20 OMNI kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
...
Feb 9 03:22:20 OMNI kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
Feb 9 03:22:21 OMNI kernel: sd 9:0:1:0: device_unblock and setting to running, handle(0x001d)
Feb 9 03:22:21 OMNI kernel: sd 9:0:1:0: [sdc] Synchronizing SCSI cache
Feb 9 03:22:21 OMNI kernel: print_req_error: I/O error, dev sdc, sector 8956304744
Feb 9 03:22:21 OMNI kernel: md: disk7 read error, sector=8956304680
...
Feb 9 03:22:21 OMNI kernel: md: disk7 read error, sector=8956304704
Feb 9 03:22:21 OMNI kernel: print_req_error: I/O error, dev sdc, sector 3896548872
...
Feb 9 03:22:21 OMNI kernel: md: disk7 read error, sector=3896548808
Feb 9 03:22:21 OMNI kernel: md: disk7 read error, sector=3896548832
Feb 9 03:22:21 OMNI kernel: print_req_error: I/O error, dev sdc, sector 5544903688
Feb 9 03:22:21 OMNI kernel: md: disk7 read error, sector=5544903624
...
Feb 9 03:22:30 OMNI kernel: md: disk7 read error, sector=12006235056
Feb 9 03:22:30 OMNI kernel: sd 9:0:26:0: [sdab] Attached SCSI disk
Feb 9 03:22:30 OMNI kernel: md: disk7 read error, sector=11945181264
...
Feb 9 03:41:50 OMNI kernel: BTRFS error (device md11): bdev /dev/md11 errs: wr 0, rd 3202, flush 0, corrupt 0, gen 0
Feb 9 03:41:50 OMNI kernel: md: disk7 read error, sector=15848823584
...(disk7 read errors continue until EOF)
SystemLogNoSerials.zip
Edited February 11, 2021 by elecgnosis
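Editorial aside: the `CDB:` lines in the log above can be decoded by hand, which helps tie each failed command to the sector numbers reported right after it. Opcode 0x88 is the SCSI READ(16) command; bytes 2-9 hold the starting logical block address and bytes 10-13 the transfer length. Sense key 0x2 with ASC 0x04 / ASCQ 0x00 is the standard "not ready" response, which fits a drive that dropped off the bus rather than one reporting a media error, though that reading is an inference and not something the thread confirms. The following is a minimal Python sketch; the function name `decode_read16` is made up for illustration.

```python
import re

# Captures the hex bytes that follow "CDB: opcode=0x.." in a kernel log line.
CDB_RE = re.compile(r"CDB: opcode=0x[0-9a-f]{2}((?: [0-9a-f]{2})+)")


def decode_read16(log_line):
    """Return (lba, sector_count) for a READ(16) CDB in log_line, else None."""
    match = CDB_RE.search(log_line)
    if match is None:
        return None
    cdb = bytes.fromhex(match.group(1).replace(" ", ""))
    if len(cdb) != 16 or cdb[0] != 0x88:  # 0x88 is the READ(16) opcode
        return None
    lba = int.from_bytes(cdb[2:10], "big")     # bytes 2-9: logical block address
    count = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length
    return lba, count


# First CDB line from the log excerpt in this thread.
line = ("Feb 9 03:21:45 OMNI kernel: sd 9:0:1:0: [sdc] tag#2871 CDB: opcode=0x88 "
        "88 00 00 00 00 00 bd b5 9b c0 00 00 00 20 00 00")
lba, count = decode_read16(line)
print(lba, count)  # 3182795712 32
```

The decoded address matches the `I/O error, dev sdc, sector 3182795712` line that follows it in the log, and the 32-sector length matches the run of disk7 read errors from 3182795648 to 3182795672 in 8-sector steps. The md sector numbers are 64 lower than the device sector numbers throughout the excerpt, which would be consistent with the data partition starting at sector 64; that offset is read off the numbers here, not stated anywhere in the thread.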
trurl Posted February 9, 2021
Go to Tools - Diagnostics and attach the complete Diagnostics ZIP file to your NEXT post in this thread.
elecgnosis Posted February 9, 2021
As requested: diagnostics-20210209-0853.zip
JorgeB Posted February 9, 2021
Both disks dropped offline and reconnected with a different identifier. This is usually a connection/power problem, but it could also be controller/expander related.
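Editorial aside: the drop-and-reconnect pattern described here can be spotted in a syslog mechanically. The sketch below is a rough Python illustration, assuming the mpt2sas message formats seen in this thread's log (`removing handle(...), sas_addr(...)` followed by a `scsi H:C:T:L: SATA: handle(...), sas_addr(...)` line and then `sd H:C:T:L: [sdXX]` lines); other controllers and kernel versions log differently, and the function name `find_reattachments` is hypothetical.

```python
import re

REMOVED_RE = re.compile(r"removing handle\(0x[0-9a-f]+\), sas_addr\((0x[0-9a-f]+)\)")
ATTACHED_RE = re.compile(
    r"scsi (\d+:\d+:\d+:\d+): SATA: handle\(0x[0-9a-f]+\), sas_addr\((0x[0-9a-f]+)\)"
)
NAME_RE = re.compile(r"sd (\d+:\d+:\d+:\d+): \[(sd[a-z]+)\]")


def find_reattachments(lines):
    """Map sas_addr -> new device name for drives removed and then re-attached."""
    removed = set()       # SAS addresses the controller reported as removed
    target_to_addr = {}   # SCSI target (H:C:T:L) -> SAS address after re-attach
    new_names = {}        # SAS address -> device name after re-attach
    for line in lines:
        m = REMOVED_RE.search(line)
        if m:
            removed.add(m.group(1))
            continue
        m = ATTACHED_RE.search(line)
        if m and m.group(2) in removed:
            target_to_addr[m.group(1)] = m.group(2)
            continue
        m = NAME_RE.search(line)
        if m and m.group(1) in target_to_addr:
            new_names.setdefault(target_to_addr[m.group(1)], m.group(2))
    return new_names


# Lines taken from the log excerpt earlier in this thread.
sample = [
    "Feb 9 03:16:42 OMNI kernel: sd 9:0:2:0: [sdd] Synchronizing SCSI cache",
    "Feb 9 03:16:42 OMNI kernel: mpt2sas_cm0: removing handle(0x001e), sas_addr(0x4433221101000000)",
    "Feb 9 03:16:53 OMNI kernel: scsi 9:0:25:0: SATA: handle(0x001e), sas_addr(0x4433221101000000), phy(1), device_name(0x0000000000000000)",
    "Feb 9 03:16:53 OMNI kernel: sd 9:0:25:0: [sdaa] 4096-byte physical blocks",
]
print(find_reattachments(sample))  # {'0x4433221101000000': 'sdaa'}
```

On the sample, the drive that was `sdd` at target 9:0:2:0 comes back as `sdaa` at 9:0:25:0 with the same SAS address. Because the array still refers to the old device, the disk stays disabled even though the drive itself is reachable again.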
elecgnosis Posted February 9, 2021
So I need to troubleshoot all of my connections and maybe even my power supply. That said, it sounds like it may also be okay as is. How would I reset the red X without replacing the drive that UnRaid disabled?
JorgeB Posted February 9, 2021
You can rebuild on top if, and only if, the emulated disk is mounting and its data looks correct, or you can do a new config and re-sync parity.
elecgnosis Posted February 9, 2021
So I have two paths: trust the disk's data (new config/re-sync parity) or trust the drive's condition (rebuild on top). Regardless of which option I go with, if another drive goes bad during either operation, I will lose the contents of both drives. If I go with rebuilding on top, would it be better to preclear the disk first? Is there any other way to validate the drive's condition?
JonathanM Posted February 9, 2021
39 minutes ago, elecgnosis said: "So I have two paths: Trust the disk's data (new config/re-sync parity) or trust the drive's condition (rebuild on top)."
Third option: rebuild on a totally different drive and keep the dropped drive intact.
41 minutes ago, elecgnosis said: "Regardless of which option I go with, if another drive goes bad during either operation, I will lose the contents of both drives."
True, but if you use a new drive to rebuild on, at least you still have some hope of possible recovery from the excluded drive.
42 minutes ago, elecgnosis said: "If I go with rebuilding on top, would it be better to preclear the disk first?"
No. A long SMART test would be a good indicator of condition. There is no need to erase everything currently on the drive; even if it's partially corrupt, it might still be somewhat salvageable.
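Editorial aside: a long SMART self-test can also be started from a shell with `smartctl` from smartmontools, as an alternative to a web interface. The sketch below is a small Python wrapper that only builds and prints the two commands involved; `/dev/sdX` is a placeholder, the helper names are hypothetical, and nothing in the thread says which method the poster used.

```python
import subprocess


def smart_long_test_commands(device):
    """Build smartctl invocations to start a long self-test and read its log."""
    return {
        "start": ["smartctl", "-t", "long", device],       # begin extended self-test
        "review": ["smartctl", "-l", "selftest", device],  # show self-test results
    }


def run_step(command):
    """Run one smartctl command and return its text output (usually needs root)."""
    return subprocess.run(command, capture_output=True, text=True, check=False).stdout


# Print the commands instead of running them; /dev/sdX is a placeholder device.
for step, command in smart_long_test_commands("/dev/sdX").items():
    print(step + ":", " ".join(command))
```

The test runs inside the drive and can take many hours on a 10 TB disk, so the review command is only meaningful once the drive reports the test as finished. It reads the disk without erasing anything, which is the point being made above about not preclearing.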
elecgnosis Posted February 9, 2021
I wanted to know the possible causes and the next steps. I think I have what I need to take action, so I'll mark this solved. @jonathanm and @JorgeB, thanks for your help. If I have trouble after this that I can't understand on my own, should I reply to this thread, open a new one, or reach out over PM?
JorgeB Posted February 10, 2021
9 hours ago, elecgnosis said: "should I reply to this thread"
If it's related to this issue, use this thread so we can remember the full story.
elecgnosis Posted February 11, 2021
I'm confused now. I chose to pull the drive that had the write error and replace it with my hot spare. When I started the array, as the drives were mounting, the new drive came up as Unmountable, though the rebuild is still happening. I haven't done anything with the original drive that had the read error. I have a bad feeling that the rebuild will result in an empty drive. Can you help me find out what's going on? Do I still have an opportunity to save that data?
elecgnosis Posted February 11, 2021
I was able to mount the drive with the write error as an unassigned device. I am still able to access its files. I'm looking at this similar topic, and I think I understand the problem better: While there may not have been any mechanical failure or damage in the drive, its BTRFS filesystem was somehow corrupted? So, even after rebuilding, I will need to repair the file system on the new drive by using the Scrub command?
trurl Posted February 11, 2021
Post new diagnostics.
elecgnosis Posted February 11, 2021
As requested: diagnostics-20210210-2047.zip
JorgeB Posted February 11, 2021
The filesystem on the emulated disk is corrupt; the best bet is to resync parity with all the original drives.
elecgnosis Posted February 11, 2021
I put the original disk 11 back in the array. I set a new config with the same drive assignments. I started the array. Disk 11 still comes up as unmountable/no file system even though when I had it mounted in unassigned devices, all of the data was there. Not sure what to do now. Am I hosed?
diagnostics-20210211-0101.zip
elecgnosis Posted February 11, 2021 (edited)
I had physically removed the original disk 11, but I didn't reassign it to the disk 11 slot. The hot spare was still in its slot when I did the new config. I reassigned it to slot 11, did another new config, and started up the array. Parity is resyncing. Everything seems to be fine now. Thanks, everyone.
Edited February 11, 2021 by elecgnosis
JorgeB Posted February 11, 2021
Strange, if it mounted with UD it should mount in the array.
JorgeB Posted February 11, 2021
Just now, elecgnosis said: "I reassigned it to slot 11, did another new config, and started up the array. Parity is resyncing. Everything seems to be fine now."
Ahh.