dyker Posted July 2, 2023

Version: 6.11.5

I've had Unraid since 2017 and have run monthly parity checks without problems. I have a 3-drive array, and I've replaced drives to upgrade once or twice, so they aren't too old.

On June 28 I popped the case off because I wanted to replace a drive with a larger one. First I ran a parity check. I left the case off, since running with the case off hadn't caused problems before, but apparently I'd never run a parity check that way. The drives warmed up to 49-50C during the check. I didn't look at the results when it finished because I got busy with the weekend, but apparently there were 3 errors.

Then on July 1, a few days later, the system ran its monthly parity check, case still off (I just ran everything that way and forgot about it while I enjoyed the weekend), and got 2 more errors. These are literally the first parity errors I've had since going with Unraid. Dates below are completion times:

Parity-Check 2023-07-01, 15:06:21 (Saturday) - 6 TB - 15 hr, 6 min, 19 sec - 110.4 MB/s - OK - 2 ERRORS
Parity-Check 2023-06-29, 05:10:01 (Thursday) - 6 TB - 14 hr, 23 min, 24 sec - 115.8 MB/s - OK - 3 ERRORS
Parity-Check 2023-06-01, 14:00:56 (Thursday) - 6 TB - 14 hr, 55 sec - 118.9 MB/s - OK - 0 ERRORS

(All prior checks, all the way back to 2017, had zero errors.)

During the parity check I also got emails saying my drives were hot:

Event: Unraid Disk 2 temperature
Subject: Warning [VDUNRAID] - Disk 2 is hot (47 C)

Event: Unraid Parity disk temperature
Subject: Warning [VDUNRAID] - Parity disk is hot (46 C)

I later received emails that the drives returned to normal temperatures. I didn't see any of the emails until today because, like I said, I got busy this weekend and didn't look at the server or my email.

So, do I have a problem now? Was the heat to blame for the parity errors? What should I do next?

All drives show healthy SMART. I'm actually building a second Unraid server, and the plan was to put a new SATA controller in this box and move its current SATA controller to the new build. So I'm glad to see the problems now, before I started all the changes, but I really need advice. I haven't rebooted in 6 months. I just started a new parity check with the "fix errors" box unchecked, to see if I get a clean pass - and with the case back on, so airflow should be good.

If anyone can provide advice or insight, I've attached the log and would be grateful. Should I just say "OK, the parity errors are fixed" and watch things? Thank you in advance!

vdunraid-diagnostics-20230702-1704.zip
dyker Posted July 2, 2023 (Author)

I also see errors in my logs that I'm not sure how to interpret, or whether they're related, but there are tons of "failed" errors on ata7. I'm not sure which disk ata7 is, whether I should be concerned, and, if it is a problem, how I could get Unraid to tell me about it earlier. In my log (see first post) these ata7 errors go all the way back to June 1, when the previous parity check happened, so some of them predate the beginning of the parity errors.

Jul 1 09:09:33 VDUnraid kernel: ata7.00: cmd 60/40:f8:d0:a9:bc/05:00:c6:01:00/40 tag 31 ncq dma 688128 in
Jul 1 09:09:33 VDUnraid kernel: res 40/00:f8:d0:a9:bc/00:00:c6:01:00/40 Emask 0x10 (ATA bus error)
Jul 1 09:09:33 VDUnraid kernel: ata7.00: status: { DRDY }
Jul 1 09:09:33 VDUnraid kernel: ata7: hard resetting link
Jul 1 09:09:40 VDUnraid kernel: ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jul 1 09:09:40 VDUnraid kernel: ata7.00: configured for UDMA/33
Jul 1 09:09:40 VDUnraid kernel: ata7: EH complete
Jul 1 09:10:05 VDUnraid kernel: ata7.00: exception Emask 0x10 SAct 0x1 SErr 0x10002 action 0xe frozen
Jul 1 09:10:05 VDUnraid kernel: ata7.00: irq_stat 0x00400000, PHY RDY changed
Jul 1 09:10:05 VDUnraid kernel: ata7: SError: { RecovComm PHYRdyChg }
Jul 1 09:10:05 VDUnraid kernel: ata7.00: failed command: READ FPDMA QUEUED
Jul 1 09:10:05 VDUnraid kernel: ata7.00: cmd 60/58:00:68:b1:07/01:00:c7:01:00/40 tag 0 ncq dma 176128 in
Jul 1 09:10:05 VDUnraid kernel: res 40/00:00:68:b1:07/00:00:c7:01:00/40 Emask 0x10 (ATA bus error)
Jul 1 09:10:05 VDUnraid kernel: ata7.00: status: { DRDY }
Jul 1 09:10:05 VDUnraid kernel: ata7: hard resetting link
Jul 1 09:10:12 VDUnraid kernel: ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jul 1 09:10:12 VDUnraid kernel: ata7.00: configured for UDMA/33
Jul 1 09:10:12 VDUnraid kernel: ata7: EH complete
Jul 1 09:11:11 VDUnraid kernel: ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jul 1 09:11:11 VDUnraid kernel: ata7.00: configured for UDMA/33
Jul 1 09:11:42 VDUnraid kernel: ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jul 1 09:11:42 VDUnraid kernel: ata7.00: configured for UDMA/33
Jul 1 09:25:15 VDUnraid kernel: ata7.00: exception Emask 0x10 SAct 0x300 SErr 0x10002 action 0xe frozen
Jul 1 09:25:15 VDUnraid kernel: ata7.00: irq_stat 0x00400000, PHY RDY changed
Jul 1 09:25:15 VDUnraid kernel: ata7: SError: { RecovComm PHYRdyChg }
Jul 1 09:25:15 VDUnraid kernel: ata7.00: failed command: READ FPDMA QUEUED
Jul 1 09:25:15 VDUnraid kernel: ata7.00: cmd 60/b8:40:e8:d0:e6/03:00:d0:01:00/40 tag 8 ncq dma 487424 in
Jul 1 09:25:15 VDUnraid kernel: res 40/00:40:e8:d0:e6/00:00:d0:01:00/40 Emask 0x10 (ATA bus error)
Jul 1 09:25:15 VDUnraid kernel: ata7.00: status: { DRDY }
Jul 1 09:25:15 VDUnraid kernel: ata7.00: failed command: READ FPDMA QUEUED
Jul 1 09:25:15 VDUnraid kernel: ata7.00: cmd 60/88:48:a0:d4:e6/01:00:d0:01:00/40 tag 9 ncq dma 200704 in
Jul 1 09:25:15 VDUnraid kernel: res 40/00:40:e8:d0:e6/00:00:d0:01:00/40 Emask 0x10 (ATA bus error)
Jul 1 09:25:15 VDUnraid kernel: ata7.00: status: { DRDY }
Jul 1 09:25:15 VDUnraid kernel: ata7: hard resetting link
Jul 1 09:25:21 VDUnraid kernel: ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jul 1 09:25:21 VDUnraid kernel: ata7.00: configured for UDMA/33
Jul 1 09:25:21 VDUnraid kernel: ata7: EH complete

Here are a few more from June 28, before the parity check:

Jun 28 15:15:48 VDUnraid kernel: ata7.00: status: { DRDY }
Jun 28 15:15:48 VDUnraid kernel: ata7.00: failed command: READ FPDMA QUEUED
Jun 28 15:15:48 VDUnraid kernel: ata7.00: cmd 60/40:e8:90:ed:75/05:00:1a:00:00/40 tag 29 ncq dma 688128 in
Jun 28 15:15:48 VDUnraid kernel: res 40/00:00:18:f5:75/00:00:1a:00:00/40 Emask 0x10 (ATA bus error)
Jun 28 15:15:48 VDUnraid kernel: ata7.00: status: { DRDY }
Jun 28 15:15:48 VDUnraid kernel: ata7.00: failed command: READ FPDMA QUEUED
Jun 28 15:15:48 VDUnraid kernel: ata7.00: cmd 60/48:f0:d0:f2:75/02:00:1a:00:00/40 tag 30 ncq dma 299008 in
Jun 28 15:15:48 VDUnraid kernel: res 40/00:00:18:f5:75/00:00:1a:00:00/40 Emask 0x10 (ATA bus error)
Jun 28 15:15:48 VDUnraid kernel: ata7.00: status: { DRDY }
Jun 28 15:15:48 VDUnraid kernel: ata7: hard resetting link
Jun 28 15:15:53 VDUnraid kernel: ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jun 28 15:15:54 VDUnraid kernel: ata7.00: configured for UDMA/33
Jun 28 15:15:54 VDUnraid kernel: ata7: EH complete
Jun 28 15:16:49 VDUnraid kernel: ata7.00: exception Emask 0x10 SAct 0xff00 SErr 0x10002 action 0xe frozen
Jun 28 15:16:49 VDUnraid kernel: ata7.00: irq_stat 0x00400000, PHY RDY changed
Jun 28 15:16:49 VDUnraid kernel: ata7: SError: { RecovComm PHYRdyChg }
Jun 28 15:16:49 VDUnraid kernel: ata7.00: failed command: READ FPDMA QUEUED
Jun 28 15:16:49 VDUnraid kernel: ata7.00: cmd 60/40:40:f0:8a:5f/05:00:1b:00:00/40 tag 8 ncq dma 688128 in
Jun 28 15:16:49 VDUnraid kernel: res 40/00:48:30:90:5f/00:00:1b:00:00/40 Emask 0x10 (ATA bus error)
Jun 28 15:16:49 VDUnraid kernel: ata7.00: status: { DRDY }
Jun 28 15:16:49 VDUnraid kernel: ata7.00: failed command: READ FPDMA QUEUED
Jun 28 15:16:49 VDUnraid kernel: ata7.00: cmd 60/38:48:30:90:5f/02:00:1b:00:00/40 tag 9 ncq dma 290816 in
Jun 28 15:16:49 VDUnraid kernel: res 40/00:48:30:90:5f/00:00:1b:00:00/40 Emask 0x10 (ATA bus error)
Jun 28 15:16:49 VDUnraid kernel: ata7.00: status: { DRDY }
Jun 28 15:16:49 VDUnraid kernel: ata7.00: failed command: READ FPDMA QUEUED
Jun 28 15:16:49 VDUnraid kernel: ata7.00: cmd 60/b8:50:68:92:5f/01:00:1b:00:00/40 tag 10 ncq dma 225280 in
Jun 28 15:16:49 VDUnraid kernel: res 40/00:48:30:90:5f/00:00:1b:00:00/40 Emask 0x10 (ATA bus error)
Jun 28 15:16:49 VDUnraid kernel: ata7.00: status: { DRDY }
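If you want to see at a glance which port the errors cluster on, something like the sketch below works. The sample lines are taken from the excerpts above; on a live server you would grep /var/log/syslog instead of an inline sample, so treat this as illustrative, not a finished tool.

```shell
# Tally "failed command" errors per ATA port from syslog-style lines.
# Inline sample used here so the filtering logic is self-contained.
log='Jul 1 09:10:05 VDUnraid kernel: ata7.00: failed command: READ FPDMA QUEUED
Jun 28 15:15:48 VDUnraid kernel: ata7.00: failed command: READ FPDMA QUEUED
Jun 28 15:16:49 VDUnraid kernel: ata7.00: failed command: READ FPDMA QUEUED'

# Pull out the "ataN" tokens and count occurrences per port.
summary=$(printf '%s\n' "$log" | grep -oE 'ata[0-9]+' | sort | uniq -c | sort -rn)
echo "$summary"
```

If every hit lands on the same port, as it does here (all ata7), that points at one cable, one controller port, or one drive rather than a systemic problem.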
itimpi Posted July 2, 2023

Those types of errors tend to indicate connection issues. It is worth checking that all cables (power and SATA) are well seated, as they can work loose due to vibration and/or thermal effects.
dyker Posted July 2, 2023 (Author)

Thank you for the reply! I found that ata7 is my parity drive:

disk:0
description: ATA Disk
product: WDC WD60EFAX-68J
vendor: Western Digital
physical id: 0
bus info: scsi@7:0.0.0   <<< this is ata7 (SCSI @ 7)
logical name: /dev/sde
version: 0A82
serial: MATCHES PARITY DRIVE
size: 5589GiB (6001GB)

I replaced the cable. I also replaced the SATA card (it was on a daughter card, and part of my planned upgrade was swapping the 2-port SATA card for a 4-port one, so I just went ahead and did it).

Now what? Run the parity check again, with corrections enabled?

Also, is there a reason Unraid didn't tell me about all these errors? I guess it did, in the log, but it seems like that should have been raised at a higher level to make it obvious to me somehow. Is there a setting somewhere to make errors like these more visible, or should I manually scan the logs for a few weeks and hope not to see any?
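For anyone else trying to map an ataN port to a drive letter: besides lshw, the libata port number normally appears in the sysfs path for the block device. A sketch of the idea, parsing a sample path (the PCI address below is made up for illustration):

```shell
# On a live system you would run:  readlink -f /sys/block/sde
# and look for the "ataN" component in the resulting path.
# Sample path used here; the PCI address is illustrative only.
path='/sys/devices/pci0000:00/0000:00:1f.2/ata7/host6/target6:0:0/6:0:0:0/block/sde'

port=$(echo "$path" | grep -oE 'ata[0-9]+')   # -> ata7
dev=$(basename "$path")                       # -> sde
echo "$dev is on $port"
```

Looping `readlink -f` over /sys/block/sd* gives the full ata-to-device mapping in one pass.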
JorgeB Posted July 3, 2023

Run a correcting check.

9 hours ago, dyker said:
Also, is there a reason Unraid didn't tell me about all these errors?

It would warn you about the sync errors if notifications are enabled.
dyker Posted July 3, 2023 (Author)

I do have system notifications enabled, and I did get notifications about the parity errors at the end of June, but never any message about the SATA communication issues from early June shown in the log a few posts above. Is there a different setting to get those surfaced proactively?
JorgeB Posted July 3, 2023 (Solution)

17 minutes ago, dyker said:
about the failing drive

There's no failed drive for now; these look like connection issues. Replace both cables.
dyker Posted July 3, 2023 (Author)

Thanks for your help. Sorry, I edited the post while you were replying. Is there a way to surface those communication errors? I'm guessing "no," unless I scan the logs myself? I'll do that for a few weeks.
JorgeB Posted July 3, 2023

36 minutes ago, dyker said:
Is there a way to surface those communication errors?

Personally, I have a script that emails me any dmesg errors or warnings every day.
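A minimal sketch of that kind of daily scan follows. The filter patterns and the commented-out notify call are illustrative choices, not JorgeB's actual script; a small inline sample stands in for real dmesg output so the logic is self-contained.

```shell
#!/bin/bash
# Sketch of a daily log-scan script. On Unraid you might feed it
# `dmesg --level=err,warn` output instead of this inline sample.
log='ata7.00: failed command: READ FPDMA QUEUED
ata7: hard resetting link
ata7: EH complete'

# Count lines matching a few patterns typical of link/cable trouble.
count=$(printf '%s\n' "$log" | grep -cE 'failed command|hard resetting|bus error')

if [ "$count" -gt 0 ]; then
  # On a real server you could raise an Unraid notification or send
  # email here, e.g. via the webGui notify script (path may vary):
  # /usr/local/emhttp/webGui/scripts/notify -s "Disk errors" -i warning
  echo "found $count suspicious lines"
fi
```

Scheduled daily (via cron or the User Scripts plugin), this turns silent syslog noise into an alert you actually see.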