mathomas3 Posted July 29, 2022 (edited) Hello all, I think I had a little power flick or something, but I noticed that my server's fans started spinning a little bit quicker for a second, which prompted me to check the server. When I pulled up the page, disk 8 was red balled and all the other disks had 300-400ish errors. Wanting to prevent further errors, I tried to stop the array, and that's where it's stuck.
lsof | grep /mnt
lsof: WARNING: can't stat() xfs file system /mnt/disk8
Output information may be incomplete.
Before any of this happened I suspected that disk 10 was on the way out due to 114 read errors that happened yesterday, though an extended SMART diag came back good... And now the parity drive is red balled... HELP! tower-diagnostics-20220729-1405.zip Edited July 29, 2022 by mathomas3
mathomas3 Posted July 29, 2022 Author Also there is this error...
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md11, logical block 1953506608, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522944, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522945, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522946, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522947, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522948, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522949, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522950, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522951, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md6, logical block 1953506608, async page read
Jul 29 12:10:01 Tower kernel: blk_update_request: I/O error, dev loop2, sector 41942912 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
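For anyone wading through a syslog like the one above, a tally of buffer I/O errors per md device can make it easier to see which disks are affected. This is only a rough sketch: the function name `count_buffer_io_errors` and the `SAMPLE` list are made up for illustration, and the regular expression assumes the kernel message format shown in the post.

```python
import re
from collections import Counter

# Matches kernel lines of the form quoted in the post, e.g.
#   "... kernel: Buffer I/O error on dev md8, logical block 1953522944, async page read"
BUFFER_IO_ERROR = re.compile(r"Buffer I/O error on dev (\w+), logical block (\d+)")


def count_buffer_io_errors(lines):
    """Return a Counter mapping device name to the number of buffer I/O errors seen."""
    counts = Counter()
    for line in lines:
        match = BUFFER_IO_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# A few lines copied from the post above (hypothetical sample, not the full syslog).
SAMPLE = [
    "Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md11, logical block 1953506608, async page read",
    "Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522944, async page read",
    "Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522945, async page read",
    "Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md6, logical block 1953506608, async page read",
    "Jul 29 12:10:01 Tower kernel: blk_update_request: I/O error, dev loop2, sector 41942912 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0",
]

if __name__ == "__main__":
    for device, errors in count_buffer_io_errors(SAMPLE).most_common():
        print(device, errors)
```

To run it against a real log, pass an open file instead of `SAMPLE`, since iterating a text file yields its lines.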
trurl Posted July 29, 2022 You are going to have to go for the power switch. Then post new diagnostics after rebooting and starting the array.
mathomas3 Posted July 29, 2022 Author 2 minutes ago, trurl said: You are going to have to go for the power switch. Then post new diagnostics after rebooting and starting the array. As a sysadmin... that's something you tell users all the time but not something you ever want to be told yourself 🤔
mathomas3 Posted July 29, 2022 Author tower-diagnostics-20220729-1423.zip Rebooted and posting diags. Those two disks are still red balled... funny thing is, though, that disk 8 is a 2 month old ssd...
trurl Posted July 29, 2022 3 minutes ago, mathomas3 said: disk 8 is a 2 month old ssd Why do you have an SSD in the array with HDD parity? SSDs in the array can't be trimmed, and can only be written at parity speed.
mathomas3 Posted July 29, 2022 Author Faster read times while also having the data protected is what I was aiming for... though I have been considering moving it out of the array for some time now.
trurl Posted July 29, 2022 Give me a little while to go through these diagnostics, fortunately we have those from before reboot also. Post a screenshot of Main - Array Devices
mathomas3 Posted July 29, 2022 Author 1 minute ago, trurl said: Give me a little while to go through these diagnostics, fortunately we have those from before reboot also. Post a screenshot of Main - Array Devices
mathomas3 Posted July 29, 2022 Author 2 minutes ago, trurl said: Give me a little while to go through these diagnostics, fortunately we have those from before reboot also. Post a screenshot of Main - Array Devices Before I forget to mention it: thank you for the quick assistance. It's been a while since I have had drive failures, and I have never had more than one die at a time. I did recently move Unraid into this hardware, an HP 1U server and a 24-bay DAS, both with dual PSUs. It hasn't given me any issues up till now.
mathomas3 Posted July 29, 2022 Author Oh, and after posting the most recent diags I disabled Docker to limit writes to the array. Don't think that should affect anything, but I figure it's for the best.
trurl Posted July 29, 2022 Never got around to converting filesystem on disk3? Your RAID controller is making it complicated identifying disks, one of many reasons RAID controllers are NOT recommended. And it looks like all the SAS disks have disconnected and reconnected as different devices, adding to the confusion. Are they all on that controller?
mathomas3 Posted July 29, 2022 Author 3 minutes ago, trurl said: Never got around to converting filesystem on disk3? Your RAID controller is making it complicated identifying disks, one of many reasons RAID controllers are NOT recommended. And it looks like all the SAS disks have disconnected and reconnected as different devices, adding to the confusion. Are they all on that controller? Correct. All of the drives are connected via the RAID controller. When I moved everything over I had to use a different RAID controller (thought I was using a recommended one from the Unraid wiki). If I recall, disks 9-11 are SAS and all the rest are SATA drives.
trurl Posted July 29, 2022 I don't know how to read SMART reports for SAS drives, or even if that RAID controller is giving useful SMART reports. SMART report for parity looks fine, hasn't had extended self-test recently. SMART report for disk8 looks fine, no self-test run. Not going to look at others since there are so many. Do any have SMART warnings on the Dashboard page? All 14 array data disks mounted including emulated disk8 so that's good.
trurl Posted July 29, 2022 1 minute ago, mathomas3 said: had to use a different RAID controller (thought I was using a recommended one from the Unraid wiki) You must be looking at a very old wiki. No RAID controllers are recommended these days. Some can be flashed to IT mode so they are no longer RAID controllers.
mathomas3 Posted July 29, 2022 Author 2 minutes ago, trurl said: I don't know how to read SMART reports for SAS drives, or even if that RAID controller is giving useful SMART reports. SMART report for parity looks fine, hasn't had extended self-test recently. SMART report for disk8 looks fine, no self-test run. Not going to look at others since there are so many. Do any have SMART warnings on the Dashboard page? All 14 array data disks mounted including emulated disk8 so that's good. Negative. SMART reports look good for the ones I have looked at.
trurl Posted July 29, 2022
Jul 14 00:00:01 Tower kernel: mdcmd (36): check NOCORRECT
Jul 14 00:00:01 Tower kernel:
Jul 14 00:00:01 Tower kernel: md: recovery thread: check P Q ...
Jul 15 05:55:20 Tower kernel: sd 2:0:6:0: [sdg] tag#5891 Sense Key : 0x3 [current] [descriptor]
Jul 15 05:55:20 Tower kernel: sd 2:0:6:0: [sdg] tag#5891 ASC=0x11 ASCQ=0x0
Jul 15 05:55:20 Tower kernel: sd 2:0:6:0: [sdg] tag#5891 CDB: opcode=0x28 28 00 73 7d 14 85 00 00 80 00
Jul 15 05:55:20 Tower kernel: blk_update_request: critical medium error, dev sdg, sector 15500616856 op 0x0:(READ) flags 0x0 phys_seg 114 prio class 0
Jul 15 05:55:20 Tower kernel: md: disk10 read error, sector=15500616792
Jul 15 05:55:20 Tower kernel: md: disk10 read error, sector=15500616800
That looks like a disk problem. And similar for the rest of the parity check, but it completes:
Jul 15 06:08:55 Tower kernel: md: sync done. time=108534sec
Jul 15 06:08:55 Tower kernel: md: recovery thread: exit status: 0
Some other syslog entries about the controller for disk7 after that while you were trying to shut things down.
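As a side note on reading the `CDB:` line in that excerpt: opcode 0x28 is the SCSI READ(10) command, whose bytes 2-5 hold the starting logical block address and bytes 7-8 the transfer length in blocks, both big-endian. The sketch below decodes the CDB quoted above. The helper name `decode_read10` is made up for illustration, and the 4096-byte logical block size is an assumption that is not stated anywhere in the thread; under that assumption the failing 512-byte sector the kernel reports falls inside the range the command was reading.

```python
def decode_read10(cdb_hex):
    """Decode a SCSI READ(10) CDB given as space-separated hex bytes.

    Returns (starting_lba, transfer_length_in_blocks).
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, blocks


# ASSUMPTION: the drive uses 4096-byte logical blocks. The kernel reports
# sector numbers in 512-byte units, so each block spans 8 sectors.
LOGICAL_BLOCK_BYTES = 4096
SECTORS_PER_BLOCK = LOGICAL_BLOCK_BYTES // 512

lba, blocks = decode_read10("28 00 73 7d 14 85 00 00 80 00")
first_sector = lba * SECTORS_PER_BLOCK
last_sector = first_sector + blocks * SECTORS_PER_BLOCK - 1

if __name__ == "__main__":
    print("LBA:", lba, "blocks:", blocks)
    print("512-byte sector range:", first_sector, "to", last_sector)
    print("reported sector in range:", first_sector <= 15500616856 <= last_sector)
```

If the drive actually used 512-byte logical blocks, the decoded LBA would not line up with the reported sector, so treat the block size as something to confirm from the drive's own report.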
trurl Posted July 29, 2022 Some things about your configuration are not ideal. Often when I see this in diagnostics I say it is unrelated to the problem, but in your case it does seem related. Your appdata, domains, and system shares have files on the array. Specifically disk8. Ideally, all files for these shares would be on a fast pool (cache) and set to stay there, so docker/VM performance isn't impacted by the slower array, and so array disks can spin down, since these files are always open.
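One way to check this from the command line is to look for each share's top-level folder on the individual array disks. The sketch below assumes array disks are mounted as `/mnt/disk1`, `/mnt/disk2`, and so on (the thread itself only shows `/mnt/disk8`) and that a share appears as a top-level folder on each disk that holds part of it. The function name `disks_holding_share` is made up for illustration.

```python
import glob
import os


def disks_holding_share(share, mnt_root="/mnt"):
    """Return the names of array disks (e.g. 'disk8') that hold files for a share."""
    hits = []
    for disk_path in glob.glob(os.path.join(mnt_root, "disk[0-9]*")):
        name = os.path.basename(disk_path)
        if not name[4:].isdigit():
            continue  # ignore anything that is not diskN
        share_path = os.path.join(disk_path, share)
        # Count the disk only if the share folder contains at least one file.
        if any(files for _, _, files in os.walk(share_path)):
            hits.append(name)
    return sorted(hits, key=lambda n: int(n[4:]))


if __name__ == "__main__":
    for share in ("appdata", "domains", "system"):
        print(share, disks_holding_share(share))
```

An empty list for all three shares would suggest their files live only on a pool, which is the layout described above.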
trurl Posted July 29, 2022 1 hour ago, mathomas3 said: had a little power flick Do you have a UPS?
trurl Posted July 29, 2022 36 minutes ago, trurl said: all the SAS disks have disconnected and reconnected as different devices That remark applies to both sets of diagnostics. I don't think you can rebuild anything until you resolve that.
mathomas3 Posted July 29, 2022 Author Just now, trurl said: Do you have UPS? I do have a UPS, though I need to get a larger one; I'm pushing the limit of what this current one can handle. For power I connected one PSU to the UPS and the second PSU directly to the wall. I'm thinking what happened is that the DAS lost power for a split second, causing the errors on the disks, and since parity and disk 8 were actively being accessed they got disabled.
mathomas3 Posted July 29, 2022 Author 2 minutes ago, trurl said: That remark applies to both sets of diagnostics. I don't think you can rebuild anything until you resolve that. When you say that the SAS disks have been disconnected and reconnected as different devices... what do you mean? I didn't reassign the disks when everything was moved into the new box.
trurl Posted July 29, 2022 14 minutes ago, trurl said: Some other syslog entries about the controller for disk7 after that while you were trying to shut things down. Actually I don't think you were shutting down yet; there was another syslog after that which I needed to go through in those first diagnostics. There are some entries in both of those syslogs about cache full, but it isn't now, and I don't know if it was earlier or not. That next syslog has similar controller problems with disk29 (parity2), more for disk8, and then it looks like all disks start acting up. Eventually you get a write error on disk0 (parity), disk8, and disk29 (parity2). Write errors always disable disks, but that is one too many, so parity2 doesn't get disabled.
mathomas3 Posted July 29, 2022 Author 3 minutes ago, trurl said: Actually I don't think you were shutting down yet, there was another syslog after that I needed to go through in those first diagnostics. There are some entries in both of those syslogs about cache full, but it isn't now, I don't know if it was earlier or not. That next syslog has similar controller problems with disk29 (parity2), more for disk8, and then looks like all disks start acting up. Eventually you get a write error on disk0 (parity), disk8, and disk29 (parity2). Write errors always disable disks, but that is one too many, so parity2 doesn't get disabled. So you think both parity drives are hosed?
trurl Posted July 29, 2022 5 minutes ago, mathomas3 said: When you say that the SAS disks have been disconnected and reconnected as different devices... what do you mean? I didn't reassign the disks when everything was moved into the new box. I mean while the server was running. Hardware issues made them disconnect, and then they reconnected, but Linux gave them new device letters so Unraid doesn't know them anymore. That happened in those earlier diagnostics where things went south, and again in these current diagnostics. Do you have any disks showing in Unassigned Devices? Take a look at your latest diagnostics, in the smart folder. None of the SAS disks are assigned and they have different sd designations than in your screenshot. You can also see what I mean about RAID controllers making it difficult to identify disks, since you have to actually open up each of those SMART reports to figure out which disk they are.
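Because the sd letters can change on a reconnect, a stable identifier is a better way to keep track of which physical disk is which. On most Linux systems udev maintains symlinks under `/dev/disk/by-id` whose names usually include model and serial number and which point at the current sdX node. The sketch below simply resolves those links; the function name `map_ids_to_devices` is made up for illustration, and behind a RAID controller the names may be less informative than with a plain HBA, so treat the output as a starting point.

```python
import os


def map_ids_to_devices(by_id_dir="/dev/disk/by-id"):
    """Return {stable id name: current kernel device name}, skipping partitions."""
    mapping = {}
    for name in sorted(os.listdir(by_id_dir)):
        if "-part" in name:
            continue  # partition links such as ...-part1 are not whole disks
        target = os.path.realpath(os.path.join(by_id_dir, name))
        mapping[name] = os.path.basename(target)
    return mapping


if __name__ == "__main__":
    if os.path.isdir("/dev/disk/by-id"):
        for disk_id, device in map_ids_to_devices().items():
            print(device, disk_id)
    else:
        print("/dev/disk/by-id not found on this system")
```

Running it before and after a disconnect would show the same id names pointing at different sd letters, which is the situation described above.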