gnollo Posted May 19, 2020
The drive was fine, but it has now been disabled because of read errors after I moved it to another slot in my Norco caddy. I moved the drive after stopping the array, but without turning off the system. Is there a way to tell Unraid to enable it without having to rebuild it? Diagnostics attached. tower-diagnostics-20200519-0921.zip
JorgeB Posted May 19, 2020
You could do a new config, but that would require a parity sync/check, and since the emulated disk is mounting correctly you might as well rebuild it instead.
gnollo Posted May 21, 2020
Now I have a different problem. I removed the drive after stopping and starting the array, then rebooted, and the boot process goes into a loop. I can see the usual output flashing by (PCI Devices listing...) until the line "SYSLINUX 6.03 EDD ...." appears on the screen, then it flashes back to the boot screen and starts again a few times. Eventually the process stops at that line and I have to hard reboot (Ctrl+Alt+Del does not appear to work anymore). Why oh why did I move that drive. Why...
gnollo Posted May 21, 2020 (edited)
It seems to be a flash problem. I formatted another flash drive with Rufus and installed a fresh copy of Unraid Server, and it boots with no issues. I inserted the current flash drive in my laptop, which found errors; I scanned and fixed it, but it still doesn't work. I copied off all the files I could see after the fix, and now I am reformatting it with Rufus with fast format unticked, which is taking ages. This must be connected with the messages about the flash being unreadable in the past, but it was rebooting fine and the message only appeared from time to time. I guess I should have listened and replaced the flash straight away?
Edited May 21, 2020 by gnollo
gnollo Posted May 21, 2020
Mmmh. Turns out it's not strictly a flash problem. I copied the backed-up contents from the flash that is failing to boot over to the flash that booted, and that one is now exhibiting the same boot loop. So I created a fresh plain-config install of Unraid on the original flash that failed yesterday, copied across only the license key, and it boots fine. Now I will try to copy across the last version of the flash I saved in December and see if that works too.
gnollo Posted May 21, 2020
It booted, but it has the wrong configuration of drives, as it predates my update of parity to a bigger drive. I guess I can still do a new config and then run a parity sync/check?
gnollo Posted May 21, 2020
Booted fine with the new configuration, running a parity check now. Two drives (including 7) now show UDMA CRC error counts (4 and 55).
gnollo Posted May 21, 2020
Emby server is not running though. It looks like it is trying to run 3.5.3.0.
gnollo Posted May 21, 2020
1 hour ago, gnollo said: "Emby server is not running though. It looks like it is trying to run 3.5.3.0"
Fixed that too. Deleted library.db and now it's up and running.
gnollo Posted May 24, 2020
Parity check completed
Total size: 10 TB
Elapsed time: 2 days, 1 hour, 27 minutes
Current position: 10.0 TB (100.0 %)
Estimated speed: 45.0 MB/sec
Estimated finish: completed
Sync errors corrected: 488083827
Speed a bit on the low side, perhaps because of the number of sync errors?
JorgeB Posted May 24, 2020
24 minutes ago, gnollo said: "perhaps because of the number of sync errors?"
They won't help. Were so many errors expected?
gnollo Posted May 24, 2020
Last time I checked parity in 2019:
Event: Unraid Parity check
Subject: Notice [TOWER] - Parity check finished (47 errors)
Description: Duration: 21 hours, 50 minutes, 34 seconds. Average speed: 101.8 MB/s
Importance: warning
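As a rough sanity check on the two parity-check reports quoted above, the figures can be cross-checked with a little arithmetic. This is only a sketch: it assumes decimal units (1 TB = 10**12 bytes, 1 MB = 10**6 bytes), and the helper names are made up for illustration.

```python
# Cross-check the parity-check figures quoted in the thread.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 MB = 10**6 bytes).

def to_seconds(days=0, hours=0, minutes=0, seconds=0):
    """Convert a duration given in mixed units to seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

def average_mb_per_s(total_bytes, elapsed_s):
    """Average throughput in MB/s over the whole run."""
    return total_bytes / elapsed_s / 1e6

def tb_covered(avg_mb_per_s, elapsed_s):
    """Amount of data (in TB) covered at a given average speed."""
    return avg_mb_per_s * 1e6 * elapsed_s / 1e12

# May 2020 check: 10 TB in 2 days, 1 hour, 27 minutes.
elapsed_2020 = to_seconds(days=2, hours=1, minutes=27)
avg_2020 = average_mb_per_s(10e12, elapsed_2020)

# 2019 check: 21 h 50 min 34 s at an average of 101.8 MB/s.
elapsed_2019 = to_seconds(hours=21, minutes=50, seconds=34)
size_2019_tb = tb_covered(101.8, elapsed_2019)

print(f"2020: {elapsed_2020} s, average {avg_2020:.1f} MB/s")
print(f"2019: {elapsed_2019} s, about {size_2019_tb:.2f} TB checked")
```

Under these assumptions the 2020 run averages roughly 56 MB/s over the whole check, so the 45.0 MB/sec shown in the report may be a reading taken at one point in time rather than the overall average. The 2019 figures work out to about 8 TB, which would fit with the parity drive having been smaller back then.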
gnollo Posted May 24, 2020
4 minutes ago, johnnie.black said: "They won't help, were so many errors expected?"
I have no idea. Drive 7 was disabled at one point, and I will have made changes to the drives, so perhaps the corrections are to reflect that that drive did not change? I will do another check in a week and see how that works out in terms of the number of sync errors.
gnollo Posted May 24, 2020
Also, I was still running the emby docker, which would have impacted performance. I will turn that off the next time I run a parity check.
gnollo Posted May 28, 2020
Things have got a lot worse since the parity check. I could not connect to the network drives, so I logged on to the Unraid GUI:
- 1 drive disabled (drive2)
- 2 other drives with errors
- emby server stopped
- parity has over three and a half billion errors
- CPU almost at 100%
- tried to stop the array but it's stuck on "stopping"
- won't shut down, even with telnet connection and poweroff
The one thing that the affected drives have in common is that they all reside in one of the Norco SS500. I am only using one power connector for all three of the Norcos in use; should I change that? Also in that Norco connector, I have a drive (parity) which I had to tape on one of the power connectors so that Unraid would recognise it upon booting (I took it out of an external drive cage). Diagnostics attached.
I am thinking hard poweroff at the mains, swap the power connector on the affected drive, and reboot. I think rebuilding drive 2 is very dangerous, as I don't feel I can trust parity. I am more inclined to try to force the system to mark all the drives as good and recalculate parity from scratch (not sure how to do that though).
The problems continue, it seems, on my server. tower-diagnostics-20200528-1733.zip
gnollo Posted May 28, 2020
Rebooted, used two molex for each Norco cage this time. Disk2 unmountable, no file system.
gnollo Posted May 28, 2020
Array started with disk 2 disabled. Not sure what to do next, I don't want to lose any of the data on disk2.
gnollo Posted May 28, 2020
Fresh diagnostics: tower-diagnostics-20200528-2354.zip
JorgeB Posted May 29, 2020
You had multiple disks in different controllers going offline at the same time:
May 28 11:07:50 Tower kernel: sd 9:0:1:0: device_block, handle(0x000a)
May 28 11:07:53 Tower kernel: sd 9:0:1:0: device_unblock and setting to running, handle(0x000a)
May 28 11:07:53 Tower kernel: sd 9:0:1:0: [sdj] Synchronizing SCSI cache
May 28 11:07:53 Tower kernel: sd 9:0:1:0: [sdj] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
May 28 11:07:53 Tower kernel: mpt2sas_cm0: removing handle(0x000a), sas_addr(0x4433221106000000)
May 28 11:07:53 Tower kernel: mpt2sas_cm0: enclosure logical id(0x500605b004dce890), slot(5)
May 28 11:07:53 Tower rc.diskinfo[28375]: SIGHUP received, forcing refresh of disks info.
May 28 11:08:07 Tower kernel: ata5.00: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
May 28 11:08:07 Tower kernel: ata5: SError: { PHYRdyChg }
May 28 11:08:07 Tower kernel: ata5.00: failed command: WRITE DMA EXT
May 28 11:08:07 Tower kernel: ata5.00: cmd 35/00:08:c0:21:cf/00:00:b8:01:00/e0 tag 0 dma 4096 out
May 28 11:08:07 Tower kernel: res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x14 (ATA bus error)
May 28 11:08:07 Tower kernel: ata5.00: status: { DRDY }
May 28 11:08:07 Tower kernel: ata5: hard resetting link
May 28 11:08:08 Tower kernel: ata5: SATA link down (SStatus 0 SControl 300)
May 28 11:08:14 Tower kernel: ata5: hard resetting link
May 28 11:08:14 Tower kernel: ata5: SATA link down (SStatus 0 SControl 300)
May 28 11:08:19 Tower kernel: ata5: hard resetting link
May 28 11:08:20 Tower kernel: ata5: SATA link down (SStatus 0 SControl 300)
May 28 11:08:20 Tower kernel: ata5.00: disabled
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] tag#0 Sense Key : 0x5 [current]
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] tag#0 ASC=0x21 ASCQ=0x4
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] tag#0 CDB: opcode=0x8a 8a 00 00 00 00 01 b8 cf 21 c0 00 00 00 08 00 00
May 28 11:08:20 Tower kernel: print_req_error: I/O error, dev sdf, sector 7395549632
May 28 11:08:20 Tower kernel: md: disk2 write error, sector=7395549568
May 28 11:08:20 Tower kernel: sd 4:0:0:0: rejecting I/O to offline device
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] killing request
May 28 11:08:20 Tower kernel: sd 4:0:0:0: rejecting I/O to offline device
### [PREVIOUS LINE REPEATED 1 TIMES] ###
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
May 28 11:08:20 Tower kernel: sd 4:0:0:0: rejecting I/O to offline device
May 28 11:08:20 Tower kernel: ata5: EH complete
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] CDB: opcode=0x88 88 00 00 00 00 00 03 f5 30 a8 00 00 00 08 00 00
May 28 11:08:20 Tower kernel: print_req_error: I/O error, dev sdf, sector 66400424
May 28 11:08:20 Tower kernel: ata5.00: detaching (SCSI 4:0:0:0)
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] Synchronizing SCSI cache
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] Synchronize Cache(10) failed: Result: hostbyte=0x04 driverbyte=0x00
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] Stopping disk
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] Start/Stop Unit failed: Result: hostbyte=0x04 driverbyte=0x00
May 28 11:08:20 Tower rc.diskinfo[28375]: SIGHUP received, forcing refresh of disks info.
May 28 11:08:21 Tower kernel: ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
May 28 11:08:21 Tower kernel: ata6: SError: { PHYRdyChg }
May 28 11:08:21 Tower kernel: ata6.00: failed command: READ DMA EXT
May 28 11:08:21 Tower kernel: ata6.00: cmd 25/00:08:f0:58:df/00:00:2a:00:00/e0 tag 0 dma 4096 in
May 28 11:08:21 Tower kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x14 (ATA bus error)
May 28 11:08:21 Tower kernel: ata6.00: status: { DRDY }
May 28 11:08:21 Tower kernel: ata6: hard resetting link
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=7294697728
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=7294697736
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=7294697744
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=7294697752
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=7294697760
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=7294697768
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=7294697776
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=7294697784
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=11995447336
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=66400360
May 28 11:08:21 Tower kernel: ata6: SATA link down (SStatus 0 SControl 300)
May 28 11:08:27 Tower kernel: ata6: hard resetting link
May 28 11:08:27 Tower kernel: ata6: SATA link down (SStatus 0 SControl 300)
May 28 11:08:33 Tower kernel: ata6: hard resetting link
May 28 11:08:33 Tower kernel: ata6: SATA link down (SStatus 0 SControl 300)
May 28 11:08:33 Tower kernel: ata6.00: disabled
This suggests a power/connection problem.
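For anyone wanting to spot this kind of simultaneous dropout in their own syslog, a small script can group the "link down", "disabled" and "removing handle" events by port or controller. The following is only a sketch: the sample lines are copied from the excerpt above, while the choice of patterns and the function name `dropouts` are illustrative assumptions, not part of Unraid or its diagnostics tooling.

```python
import re
from collections import defaultdict

# A few lines copied from the syslog excerpt quoted in this thread.
SAMPLE = """\
May 28 11:07:53 Tower kernel: mpt2sas_cm0: removing handle(0x000a), sas_addr(0x4433221106000000)
May 28 11:08:08 Tower kernel: ata5: SATA link down (SStatus 0 SControl 300)
May 28 11:08:20 Tower kernel: ata5.00: disabled
May 28 11:08:20 Tower kernel: md: disk2 write error, sector=7395549568
May 28 11:08:21 Tower kernel: ata6: SATA link down (SStatus 0 SControl 300)
May 28 11:08:33 Tower kernel: ata6.00: disabled
"""

# Illustrative patterns (an assumption, not an exhaustive list) for
# messages that indicate a device dropping off its port or controller.
PATTERNS = [
    re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (SATA link down|disabled)"),
    re.compile(r"kernel: (mpt2sas_cm\d+): (removing handle)"),
]

def dropouts(log_text):
    """Map each port/controller name to a list of (time, event) tuples."""
    events = defaultdict(list)
    for line in log_text.splitlines():
        for pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                timestamp = line.split()[2]  # HH:MM:SS in classic syslog format
                events[match.group(1)].append((timestamp, match.group(2)))
    return dict(events)

summary = dropouts(SAMPLE)
for port, seen in sorted(summary.items()):
    print(port, seen)
```

Several unrelated ports or controllers all reporting dropouts within the same minute, as here, is the pattern that points towards something shared such as power or cabling rather than a single failing disk.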
gnollo Posted May 29, 2020
I agree, something affected that Norco controller, likely power. I have now doubled up on power connectors for each of the two Norco SS500s. What do I do with the disabled drive? I had to do an unclean shutdown as the tower would not stop. Did that cause filesystem corruption?
JorgeB Posted May 29, 2020
Unassign the disabled disk and start the array to check if the emulated disk is mounting correctly and the contents look OK.
gnollo Posted May 29, 2020
The array has already started, and the drive is already emulated. Do you want me to stop the array first, then unassign the disk and restart?
JorgeB Posted May 29, 2020
No need if it's already started. Check that the data looks OK, and if it does you can rebuild on top.
gnollo Posted May 29, 2020
I stopped the array anyway, unassigned the disk and restarted. Although the content is emulated, I cannot get to disk2 as I usually do via \\tower\disk2. And now I get read errors from disk7, which is not on the same Norco unit that was affected before.