EH5 Posted June 18
I had been transferring files from portable drives to the array and moving files off a drive that has issues, intending to shut the array down afterwards to work on the failed drive. After a break, I found that a parity check had started. The CPU was at nearly 100% utilization, which slowed everything down, so I paused the parity check to finish what I was doing and then restarted it. It now says it is running but is making no progress, and the CPU utilization graphs on the Dashboard no longer seem to be working. I tried to pause the parity check and even cancel it, but after I confirm, neither action has any effect on the parity check. I am including the current diagnostics file. I didn't want to try a reboot until I had checked here.
tower-diagnostics-20240618-1058.zip

JorgeB Posted June 18
Constant errors for sdl. Check/replace the cables and post new diags after array start.

EH5 (Author) Posted June 18
I shut down from the array operation menu on the Main tab and swapped the cables on disk 3. When I powered up, it indicated that the shutdown had not been clean. Interestingly, when I checked the notices, it reported a successful parity check from a couple of days ago. Now disks 3 and 4 show failures, and the array just says "starting" but has not started. Fix Common Problems just says disk 4 is disabled and refers me to the Main tab.
tower-diagnostics-20240618-1808.zip

JorgeB Posted June 19
Now there are constant errors with sdj, it's currently unassigned, was this an array disk?

Model Family: Western Digital Blue (SMR)
Device Model: WDC WD60EZAZ-00ZGHB0
Serial Number: WD-WX11D19A4C2V

Jun 18 16:10:06 Tower kernel: sd 10:0:32:0: Power-on or device reset occurred
Jun 18 16:10:06 Tower kernel: sd 10:0:32:0: [sdj] 11721045168 512-byte logical blocks: (6.00 TB/5.46 TiB)
Jun 18 16:10:06 Tower kernel: sd 10:0:32:0: [sdj] 4096-byte physical blocks
Jun 18 16:10:06 Tower kernel: sd 10:0:32:0: [sdj] Write Protect is off
Jun 18 16:10:06 Tower kernel: sd 10:0:32:0: [sdj] Mode Sense: 7f 00 10 08
Jun 18 16:10:06 Tower kernel: sd 10:0:32:0: [sdj] Write cache: enabled, read cache: enabled, supports DPO and FUA
Jun 18 16:10:06 Tower kernel: sd 10:0:32:0: [sdj] Attached SCSI disk
Jun 18 16:10:07 Tower kernel: sd 10:0:32:0: device_block, handle(0x000d)
Jun 18 16:10:08 Tower kernel: sd 10:0:32:0: device_unblock and setting to running, handle(0x000d)
Jun 18 16:10:10 Tower kernel: sd 10:0:32:0: [sdj] Synchronizing SCSI cache
Jun 18 16:10:10 Tower kernel: sd 10:0:32:0: [sdj] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=DRIVER_OK
Jun 18 16:10:10 Tower kernel: mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221106000000)
Jun 18 16:10:10 Tower kernel: mpt2sas_cm0: removing handle(0x000d), sas_addr(0x4433221106000000)
Jun 18 16:10:10 Tower kernel: mpt2sas_cm0: enclosure logical id(0x5b8ca3a0fb5ac600), slot(5)
Jun 18 16:10:12 Tower SysDrivers: SysDrivers Build Complete
Jun 18 16:10:20 Tower kernel: mpt2sas_cm0: handle(0xd) sas_address(0x4433221106000000) port_type(0x1)
Jun 18 16:10:20 Tower kernel: scsi 10:0:33:0: Direct-Access ATA WDC WD60EZAZ-00Z 0A80 PQ: 0 ANSI: 6
Jun 18 16:10:20 Tower kernel: scsi 10:0:33:0: SATA: handle(0x000d), sas_addr(0x4433221106000000), phy(6), device_name(0x0000000000000000)
Jun 18 16:10:20 Tower kernel: scsi 10:0:33:0: enclosure logical id (0x5b8ca3a0fb5ac600), slot(5)
Jun 18 16:10:20 Tower kernel: scsi 10:0:33:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
Jun 18 16:10:20 Tower kernel: scsi 10:0:33:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
Jun 18 16:10:20 Tower kernel: sd 10:0:33:0: Attached scsi generic sg9 type 0
Jun 18 16:10:20 Tower kernel: end_device-10:33: add: handle(0x000d), sas_addr(0x4433221106000000)
Jun 18 16:10:20 Tower kernel: sd 10:0:33:0: Power-on or device reset occurred
Jun 18 16:10:20 Tower kernel: sd 10:0:33:0: [sdj] 11721045168 512-byte logical blocks: (6.00 TB/5.46 TiB)
Jun 18 16:10:20 Tower kernel: sd 10:0:33:0: [sdj] 4096-byte physical blocks
Jun 18 16:10:20 Tower kernel: sd 10:0:33:0: [sdj] Write Protect is off
Jun 18 16:10:20 Tower kernel: sd 10:0:33:0: [sdj] Mode Sense: 7f 00 10 08
Jun 18 16:10:20 Tower kernel: sd 10:0:33:0: [sdj] Write cache: enabled, read cache: enabled, supports DPO and FUA
Jun 18 16:10:20 Tower kernel: sd 10:0:33:0: [sdj] Attached SCSI disk

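The excerpt above shows sdj being attached, removed by the HBA, and attached again within about 14 seconds. A minimal Python sketch for spotting that pattern in a saved syslog follows. It assumes the log lines look like the ones quoted in this thread; the function names are made up for illustration and are not part of Unraid or any tool. Note that a device is also attached once at every normal boot, so a count above 1 is only suspicious within a single boot.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   "... kernel: sd 10:0:33:0: [sdj] Attached SCSI disk"
ATTACH_RE = re.compile(r"\[(sd[a-z]+)\] Attached SCSI disk")


def count_attaches(lines):
    """Count 'Attached SCSI disk' events per device name (sdX)."""
    counts = Counter()
    for line in lines:
        match = ATTACH_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def reattached_devices(lines):
    """Return the sorted device names that were attached more than once."""
    return sorted(dev for dev, n in count_attaches(lines).items() if n > 1)


if __name__ == "__main__":
    # Two lines taken from the excerpt above.
    sample = [
        "Jun 18 16:10:06 Tower kernel: sd 10:0:32:0: [sdj] Attached SCSI disk",
        "Jun 18 16:10:20 Tower kernel: sd 10:0:33:0: [sdj] Attached SCSI disk",
    ]
    print(reattached_devices(sample))
```

To run it against a real log, pass the open file object instead of the sample list, since iterating a text file yields one line at a time.
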
EH5 (Author) Posted June 19
WD-WX11D19A4C2V was one of the non-array disks I was testing to see if it would pass preclear.

JorgeB Posted June 19
Disconnect that disk for now and post new diags after array start.

EH5 (Author) Posted June 19
I disconnected that drive (WD-WX11D19A4C2V) and the system started. It looks like drives 3 and 4 of the array may just be toast. Should I just remove them from the array? What is the proper procedure for this?
tower-diagnostics-20240619-1037.zip

JorgeB Posted June 19
Jun 19 10:34:54 Tower kernel: ata2: link is slow to respond, please be patient (ready=0)
Jun 19 10:34:58 Tower kernel: ata2: found unknown device (class 0)
Jun 19 10:34:58 Tower kernel: ata2: softreset failed (device not ready)
Jun 19 10:34:59 Tower kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jun 19 10:34:59 Tower kernel: ata2.00: configured for UDMA/133
Jun 19 10:36:00 Tower kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jun 19 10:36:00 Tower kernel: ata3.00: failed command: FLUSH CACHE EXT
Jun 19 10:36:00 Tower kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 17
Jun 19 10:36:00 Tower kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 19 10:36:00 Tower kernel: ata3.00: status: { DRDY }
Jun 19 10:36:00 Tower kernel: ata3: hard resetting link
Jun 19 10:36:01 Tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jun 19 10:36:01 Tower kernel: ata3.00: configured for UDMA/133
Jun 19 10:36:01 Tower kernel: ata3.00: retrying FLUSH 0xea Emask 0x4
Jun 19 10:36:01 Tower kernel: ata3.00: device reported invalid CHS sector 0
Jun 19 10:36:01 Tower kernel: ata3: EH complete

There are still issues with multiple disks, this time disks 1 and 5. You may have a power/connection problem. Are any power splitters in use?

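When errors move from one disk to another like this, it can help to tally the link-level trouble per ATA port rather than reading the log line by line. The following is a rough Python sketch under stated assumptions: the marker strings are simply the phrases visible in the excerpt above, not an exhaustive list of libata errors, and the function name is hypothetical. Mapping a port such as ata3 back to a specific array disk still has to be done from the rest of the diagnostics.

```python
import re
from collections import Counter

# Captures the port ("ata3") and the message from lines such as:
#   "... kernel: ata3.00: failed command: FLUSH CACHE EXT"
ATA_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")

# Phrases taken from the log excerpt in this thread (assumed, not exhaustive).
ERROR_MARKERS = (
    "link is slow to respond",
    "softreset failed",
    "exception Emask",
    "failed command",
    "hard resetting link",
)


def ata_error_counts(lines):
    """Count log lines per ATA port that contain one of ERROR_MARKERS."""
    counts = Counter()
    for line in lines:
        match = ATA_RE.search(line)
        if match and any(marker in match.group(2) for marker in ERROR_MARKERS):
            counts[match.group(1)] += 1
    return counts
```

Several ports showing errors at once, as here, points more toward something shared (power, a backplane, a controller) than toward individual drives failing independently.
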
EH5 (Author) Posted June 19
Yes, there are power splitters. It is a 600W power supply, but it does not have enough HDD plugs.

itimpi Posted June 19
2 minutes ago, EH5 said: "Yes there are power splitters. It is a 600W power supply but not enough HDD plugs"

You need to be careful not to split a connector on a PSU cable too many ways. A SATA->SATA splitter cannot reliably split more than 2 ways, and a Molex->SATA one can normally get away with 4 ways.

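The rule of thumb in the post above can be written down as a small check when auditing a build. This is only a sketch in Python: the limits are the ones quoted in the post (not a manufacturer specification), and the function name and input format are made up for illustration.

```python
# Rule-of-thumb fan-out limits quoted in the post above; not a specification.
MAX_WAYS = {"sata": 2, "molex": 4}


def overloaded_splitters(splitters):
    """Return the splitters whose fan-out exceeds the rule-of-thumb limit.

    `splitters` is a list of (source_connector, ways) tuples, for example
    ("sata", 3) for a SATA power plug split three ways.
    """
    return [
        (source, ways)
        for source, ways in splitters
        if ways > MAX_WAYS[source.lower()]
    ]


if __name__ == "__main__":
    # Hypothetical build: one 3-way SATA splitter and one 4-way Molex splitter.
    print(overloaded_splitters([("sata", 3), ("molex", 4)]))
```

The check only counts connectors. It says nothing about the actual current drawn by the drives or the gauge of the splitter wiring, which also matter.
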
EH5 (Author) Posted June 22
I moved everything off the emulated disks and unplugged them. When I reconfigured the array without them and tried to restart it and rebuild parity, it showed that another disk is disconnected, the Docker containers didn't start, and I have the same issue as before: the CPU usage graphs don't show and the parity rebuild is not advancing. Is it possible that my Unraid USB drive is corrupted somehow?
tower-diagnostics-20240621-1847.zip

itimpi Posted June 22
The syslog shows what look like connection issues (power or SATA) for disk4 and disk5 (in particular disk4) and eventually you started getting write errors on disk4. The SMART information for those drives looks OK. My suspicion would be insufficient power reaching the drives. I do not see how the flash drive could be causing this issue.

EH5 (Author) Posted June 29
I upgraded my power supply to get sufficient SATA power connections and switched to a case with easier access for swapping out the drives. Things appeared fine, but I got a memory error. I will start a new thread for that, as I think all the issues I was dealing with in this thread are fixed.