oh-tomo Posted March 25, 2021

Quote
Event: Unraid array errors
Subject: Warning [SERVER2018] - array has errors
Description: Array has 1 disk with read errors
Importance: warning
Disk 1 - ST12000VN0007-2GS116_********* (sdf) (errors 16)

unRAID disabled the disk (red x). I removed the disk. Contents are currently emulated.

I placed the disk in a HDD USB dock connected to a Windows 10 VM and ran some SeaTools tests:

Quote
--------------- SeaTools for Windows v1.4.0.7 ---------------
2021-03-23 11:45:23 PM
Model Number: ST12000V
Serial Number: ********
Firmware Revision: 0302
Short DST - Started 2021-03-23 11:45:23 PM
Short DST - Pass 2021-03-23 11:46:25 PM
Short Generic - Started 2021-03-23 11:46:40 PM
Short Generic - Pass 2021-03-23 11:47:34 PM
Long Generic - Started 2021-03-23 11:47:56 PM
Long Generic - Pass 2021-03-24 11:41:03 PM
Identify - Started 2021-03-25 11:29:05 AM

24 hours running a Long Generic SeaTools test and the HDD passed. This is troubling. The HDD is under warranty until August. But unRAID thinks the HDD is error prone while SeaTools tests aren't finding errors. I can't put the disk back in the array because unRAID has disabled it and I can't send the disk in for RMA to Seagate because their software doesn't show any problems.

How should I proceed?

Squid Posted March 25, 2021

A write to the drive failed. That's a given, since Unraid disabled the drive. Most of the time this is caused by a poor connection to the drive. A SMART report from the drive would be helpful.

trurl Posted March 25, 2021

34 minutes ago, oh-tomo said:
How should I proceed?

Now that you have removed the disk it is a little more complicated how you should proceed. You should have asked for advice before doing anything.

Ideally we would have gotten the diagnostics before you did anything and before rebooting your server. Then we would be better able to see why a write to the disk failed. As mentioned, bad connections are much more common than bad disks.

Is your server currently running without the disk?

oh-tomo Posted March 25, 2021 Author

10 minutes ago, trurl said:
Now that you have removed the disk it is a little more complicated how you should proceed. You should have asked for advice before doing anything. Ideally we would have gotten the diagnostics before you did anything and before rebooting your server. Then we would be better able to see why a write to the disk failed. As mentioned, bad connections are much more common than bad disks. Is your server currently running without the disk?

Yes, the server is running without the disk.

trurl Posted March 25, 2021

Go to Tools - Diagnostics and attach the complete Diagnostics ZIP file to your NEXT post in this thread.

oh-tomo Posted March 25, 2021 Author

9 minutes ago, trurl said:
Go to Tools - Diagnostics and attach the complete Diagnostics ZIP file to your NEXT post in this thread.

Attached!
server2018-diagnostics-20210325-1309.zip

trurl Posted March 25, 2021

Emulated disk1 is mounted so that's good. Is the disk currently plugged in to your server?

oh-tomo Posted March 25, 2021 Author

2 minutes ago, trurl said:
Emulated disk1 is mounted so that's good. Is the disk currently plugged in to your server?

It's plugged into a USB HDD dock and the USB is connected to a Win10 VM.

trurl Posted March 25, 2021

Stop the array and set disk1 to Not Assigned (or however it's worded), then start the array again. Then shut down, install the disk into your server again, and boot up, but don't reassign the disk. Just start up normally and post new diagnostics.

oh-tomo Posted March 26, 2021 Author

6 hours ago, trurl said:
Stop the array and set disk1 to Not Assigned (or however it's worded) then start the array again. Then shutdown, install the disk into your server again, boot up, but don't reassign the disk. Just start up normally and post new diagnostics.

Here are the new diagnostics.
server2018-diagnostics-20210325-2125.zip

trurl Posted March 26, 2021

I assume the 12TB disk (sdg) with serial ending 0TGDM was disk1. SMART for that disk looks OK, and since emulated disk1 is mounted, it should be good to rebuild to that disk.

Not sure what this stuff at the end of the syslog is; it seems related to nvidia and VMs:

Mar 25 21:21:59 Server2018 kernel: pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
Mar 25 21:21:59 Server2018 kernel: pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar 25 21:21:59 Server2018 kernel: pcieport 0000:00:1c.0: device [8086:a33c] error status/mask=00000001/00002000
Mar 25 21:21:59 Server2018 kernel: pcieport 0000:00:1c.0: [ 0] RxErr (First)
Mar 25 21:21:59 Server2018 avahi-daemon[4413]: Joining mDNS multicast group on interface vnet0.IPv6 with address fe80::fc54:ff:fe00:2981.
Mar 25 21:21:59 Server2018 avahi-daemon[4413]: New relevant interface vnet0.IPv6 for mDNS.
Mar 25 21:21:59 Server2018 avahi-daemon[4413]: Registering new address record for fe80::fc54:ff:fe00:2981 on vnet0.*.
Mar 25 21:23:08 Server2018 kernel: kvm [9357]: vcpu2, guest rIP: 0xfffff802693f6192 kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x1, nop
### [PREVIOUS LINE REPEATED 9 TIMES] ###
Mar 25 21:23:13 Server2018 kernel: kvm_set_msr_common: 12094 callbacks suppressed

Is it causing any apparent problems?

oh-tomo Posted March 26, 2021 Author

7 hours ago, trurl said:
Is it causing any apparent problems?

I haven't noticed any VM issues with Nvidia. Rebuild of disk1 has started.

oh-tomo Posted March 27, 2021 Author

16 hours ago, oh-tomo said:
I haven't noticed any VM issues with Nvidia. Rebuild of disk1 has started.

Rebuild is done!
server2018-diagnostics-20210326-2333.zip

oh-tomo Posted March 28, 2021 Author

SMART health warning for Disk 1.

Unraid Disk 1 SMART health [198]: 27-03-2021 23:31
Warning [SERVER2018] - offline uncorrectable is 24
ST12000VN0007-2GS116_*0TGDM (sdi)

server2018-diagnostics-20210328-0016.zip

trurl Posted March 28, 2021

There are also pending sectors, and the reallocated count has increased.

Why are you rebuilding parity2?

Replace disk1.

oh-tomo Posted March 28, 2021 Author

13 minutes ago, trurl said:
also pending sectors plus reallocated increased. Why are you rebuilding parity2? Replace disk1.

parity2 is rebuilding because that HDD was upgraded to a larger HDD. Rebuild is done.

What do I do with old disk1 after I replace it? Wipe and send to Seagate? "WARRANTY Valid till 09/Aug/2021"

server2018-diagnostics-20210328-0832.zip

oh-tomo Posted March 28, 2021 Author

Rebuild of new disk1 (old parity2) started. Should I wait until rebuild is complete before wiping and RMA-ing old disk1?

server2018-diagnostics-20210328-0925.zip

oh-tomo Posted March 29, 2021 Author

Rebuild of new disk1 finished.

server2018-diagnostics-20210329-0818.zip

trurl Posted March 29, 2021

Looks good. I notice that disk has a few reallocated sectors, but it should be OK as long as they don't start increasing.

Does it (or any other disk) show SMART warnings on the Dashboard page?

Do you have Notifications set up to alert you immediately by email or other agent as soon as a problem is detected?

You might want to run an Extended SMART test on your disks occasionally.

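For anyone who prefers the command line to the webUI for the extended test suggested above, here is a minimal sketch. It assumes `smartctl` from smartmontools is available on the server; the helper name `extended_test_commands` and the device path `/dev/sdX` are placeholders, not anything taken from the diagnostics in this thread. It only builds and prints the commands rather than running them.

```python
def extended_test_commands(device):
    """Build two smartctl invocations for one disk.

    The first starts an extended (long) self-test; the second reads the
    self-test log, which is where the result appears once the test ends.
    """
    start = ["smartctl", "-t", "long", device]
    log = ["smartctl", "-l", "selftest", device]
    return start, log


# "/dev/sdX" is a placeholder; substitute the real device for the disk.
start, log = extended_test_commands("/dev/sdX")
print(" ".join(start))
print(" ".join(log))

# To actually run a command (needs root and a real device), something like
# subprocess.run(start, check=True) would do it.
```

The test runs inside the drive itself, so the first command returns right away and the log has to be checked again later.
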
oh-tomo Posted March 29, 2021 Author

5 hours ago, trurl said:
Looks good. I notice that disk has a few reallocated but should be OK as long as they don't start increasing. Does it (or any other) show SMART warnings on the Dashboard page? Do you have Notifications setup to alert you immediately by email or other agent as soon as a problem is detected? You might want to run an Extended SMART test on your disks occasionally.

Dashboard shows SMART errors on:
disk1 - Reallocated_Sector_Ct - 8
disk3 - UDMA_CRC_Error_Count - 64
disk5 - UDMA_CRC_Error_Count - 25

I am set up with email notifications. Past SMART health warnings have been about the *GDM (removed as disk1, formerly parity1 -- on 11/15 and 12/15) and *LX1 (new disk1, former parity2 -- all on 03/23) 12TB drives.

trurl Posted March 29, 2021

36 minutes ago, oh-tomo said:
I am set up with email notifications.

But are you getting those SMART warnings in email? You can control how different Notifications are given to you. SMART warnings are important enough that you need to know about them even if you don't happen to open up the webUI. I have nothing notifying me in the Browser, and I get emails for Array Status, Notices, Warnings, and Alerts.

When you do get a SMART warning, you need to make the warning go away. How you do this depends. If it is serious enough, you replace the disk. If it is not serious enough, you Acknowledge it by clicking on it in the Dashboard. Once acknowledged, it will not warn you again unless it changes. There is no point in letting a Warning just sit there without doing anything about it. If you Acknowledge it and it comes back, then you know it has gotten worse.

CRC Errors are connection issues and not really a disk problem, but the disk counts these and keeps the count in its firmware. Basically it means the data it received was inconsistent. Not all connection problems will show up there, because if the disk isn't getting any data it can't check the consistency.

A small number of Reallocated sectors is usually fine, since disks are designed to have some spare sectors for that purpose.

Those warnings are OK to acknowledge, but of course we still want to see the results of the extended test on disk1.

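The triage trurl describes above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not how Unraid implements its notifications: the function name `triage`, the `acknowledged` dictionary, and the wording of the returned suggestions are all made up for the example. The attribute names and values are the ones reported from the Dashboard earlier in the thread.

```python
# Attributes that point at the cable or connection rather than the disk media.
CONNECTION_ATTRS = {"UDMA_CRC_Error_Count"}


def triage(attr, raw, acknowledged):
    """Suggest an action for one SMART attribute reading.

    `acknowledged` maps attribute name -> raw value at the time the
    warning was last acknowledged (hypothetical bookkeeping).
    """
    last_ack = acknowledged.get(attr)
    if last_ack is not None and raw == last_ack:
        # Acknowledged and unchanged: nothing new to report.
        return "acknowledged, unchanged"
    if attr in CONNECTION_ATTRS:
        return "check cables and connections, then acknowledge"
    if last_ack is not None and raw > last_ack:
        # It came back after being acknowledged, so it has gotten worse.
        return "increased since acknowledged, consider replacing the disk"
    return "review, then acknowledge or replace"


print(triage("UDMA_CRC_Error_Count", 64, {}))
print(triage("Reallocated_Sector_Ct", 8, {"Reallocated_Sector_Ct": 8}))
```

The point of tracking the acknowledged value is the one made in the post: a warning that reappears after being acknowledged means the number changed.
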
oh-tomo Posted March 29, 2021 Author

26 minutes ago, trurl said:
But are you getting those SMART warnings in email?

I got these SMART health emails on Nov 15:

Quote
Event: Unraid Disk 1 SMART health [5]
Subject: Warning [SERVER2018] - reallocated sector ct is 40
Description: ST12000VN0007-2GS116_*GDM (sdf)
Importance: warning

Quote
Event: Unraid Disk 1 SMART health [5]
Subject: Warning [SERVER2018] - reallocated sector ct is 96
Description: ST12000VN0007-2GS116_*GDM (sdf)
Importance: warning

Quote
Event: Unraid Disk 1 SMART health [5]
Subject: Warning [SERVER2018] - reallocated sector ct is 176
Description: ST12000VN0007-2GS116_*GDM (sdf)
Importance: warning

And on Dec 15 this email:

Quote
Event: Unraid Disk 1 SMART health [5]
Subject: Warning [SERVER2018] - reallocated sector ct is 184
Description: ST12000VN0007-2GS116_*GDM (sdf)
Importance: warning

Should I have acted on them? There weren't any read error alert emails until March 23.

I'll start an extended test on new disk1 tomorrow. Does it take as long as a rebuild?

trurl Posted March 30, 2021

32 minutes ago, oh-tomo said:
Should I have acted on them?

Yes, since they were increasing. Not sure where people like to draw the line, but 3 digits is too many for me.

35 minutes ago, oh-tomo said:
Does it take as long as a rebuild?

Something like that.

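The two tests trurl applies here (is the count increasing, and has it reached 3 digits) can be written out as a small Python check. The readings are the reallocated sector counts from the Nov 15 and Dec 15 emails quoted in the thread. The function name `assess_reallocated` and the `limit=100` default are assumptions for illustration; 100 is just "3 digits" taken literally, and trurl says himself that people draw the line in different places.

```python
def assess_reallocated(history, limit=100):
    """Check a series of Reallocated_Sector_Ct readings, oldest first.

    Returns whether the count rose between any two consecutive readings
    and whether the latest reading has reached `limit`.
    """
    increasing = any(later > earlier
                     for earlier, later in zip(history, history[1:]))
    over_limit = bool(history) and history[-1] >= limit
    return {"increasing": increasing, "over_limit": over_limit}


# Counts reported in the Nov 15 (40, 96, 176) and Dec 15 (184) emails.
readings = [40, 96, 176, 184]
print(assess_reallocated(readings))
# {'increasing': True, 'over_limit': True}
```

By either test the old disk1 already warranted action in November, which matches the answer given above.
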
oh-tomo Posted March 30, 2021 Author

21 hours ago, trurl said:
Yes since they were increasing. Not sure where people like to draw the line but 3 digits is too many for me.

What action should I have taken at 3 digits?

21 hours ago, trurl said:
Something like that.

SMART extended self-test of disk1 has reached 40%.

I'm doing a Windows long format of old disk1 before submitting it for Seagate RMA.
