disabled disk passes SeaTools tests

oh-tomo · March 25, 2021

Quote

Event: Unraid array errors

Subject: Warning [SERVER2018] - array has errors

Description: Array has 1 disk with read errors

Importance: warning

Disk 1 - ST12000VN0007-2GS116_********* (sdf) (errors 16)

unRAID disabled the disk (red x)

I removed the disk. Contents are currently emulated. I placed the disk in a HDD USB dock connected to a Windows 10 VM and ran some SeaTools tests:

Quote

--------------- SeaTools for Windows v1.4.0.7 ---------------

2021-03-23 11:45:23 PM

Model Number: ST12000V

Serial Number: ********

Firmware Revision: 0302

Short DST - Started 2021-03-23 11:45:23 PM

Short DST - Pass 2021-03-23 11:46:25 PM

Short Generic - Started 2021-03-23 11:46:40 PM

Short Generic - Pass 2021-03-23 11:47:34 PM

Long Generic - Started 2021-03-23 11:47:56 PM

Long Generic - Pass 2021-03-24 11:41:03 PM

Identify - Started 2021-03-25 11:29:05 AM

The HDD passed after 24 hours of running the Long Generic SeaTools test. This is troubling. The HDD is under warranty until August, but unRAID thinks the HDD is error prone while the SeaTools tests aren't finding errors. I can't put the disk back in the array because unRAID has disabled it, and I can't send the disk to Seagate for RMA because their software doesn't show any problems.

How should I proceed?

Squid · March 25, 2021

A write to the drive failed. That's a given, since Unraid disabled the drive. Most of the time this is caused by a poor connection to the drive. A SMART report from the drive would be helpful.
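
For anyone unsure what to look for in such a report, here is a minimal Python sketch that picks a few attributes out of text laid out like the attribute table printed by `smartctl -A`. The sample text is illustrative only: the raw values 184 and 24 echo numbers quoted later in this thread, the other columns are invented, and the function names are hypothetical.

```python
# Illustrative text in the layout of the `smartctl -A /dev/sdX` attribute table.
# Only the last column (RAW_VALUE) matters here; the other columns are made up.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       184
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       24
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

# Attribute IDs discussed in this thread (plus 197, pending sectors).
WATCHED = {5, 197, 198, 199}


def parse_attributes(text):
    """Return {attribute_id: (name, raw_value)} for rows with a numeric raw value."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        raw = fields[9]
        if raw.isdigit():
            attrs[int(fields[0])] = (fields[1], int(raw))
    return attrs


def nonzero_watched(attrs):
    """Keep only watched attributes whose raw value is above zero."""
    return {
        name: raw
        for attr_id, (name, raw) in attrs.items()
        if attr_id in WATCHED and raw > 0
    }


if __name__ == "__main__":
    for name, raw in nonzero_watched(parse_attributes(SAMPLE)).items():
        print(f"{name}: {raw}")
```

On Unraid the same attribute table is included in the Diagnostics ZIP, so running anything by hand is optional.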

trurl · March 25, 2021

34 minutes ago, oh-tomo said:

How should I proceed?

Now that you have removed the disk it is a little more complicated how you should proceed. You should have asked for advice before doing anything.

Ideally we would have gotten the diagnostics before you did anything and before rebooting your server. Then we would be better able to see why a write to the disk failed. As mentioned, bad connections are much more common than bad disks.

Is your server currently running without the disk?

oh-tomo · March 25, 2021

Yes, the server is running without the disk.

trurl · March 25, 2021

Go to Tools - Diagnostics and attach the complete Diagnostics ZIP file to your NEXT post in this thread.

oh-tomo · March 25, 2021

9 minutes ago, trurl said:

Go to Tools - Diagnostics and attach the complete Diagnostics ZIP file to your NEXT post in this thread.

Attached!

server2018-diagnostics-20210325-1309.zip

trurl · March 25, 2021

Emulated disk1 is mounted so that's good.

Is the disk currently plugged in to your server?
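
As background on what "emulated" means here, a minimal Python sketch follows. It assumes a single XOR parity disk and tiny four-byte "disks" with made-up contents; a second parity disk uses a different calculation that is not shown, and `xor_blocks` is a hypothetical helper, not Unraid code.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Made-up contents for three data disks.
disk1 = b"\x10\x20\x30\x40"
disk2 = b"\x0f\x0e\x0d\x0c"
disk3 = b"\xaa\xbb\xcc\xdd"

# Parity is the XOR of all data disks.
parity = xor_blocks([disk1, disk2, disk3])

# With disk1 missing, its contents are recomputed from parity and the remaining disks.
emulated_disk1 = xor_blocks([parity, disk2, disk3])

if __name__ == "__main__":
    print(emulated_disk1 == disk1)
```

A rebuild writes that recomputed content onto a physical disk, which is why it only makes sense while the emulated disk mounts correctly.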

oh-tomo · March 25, 2021

2 minutes ago, trurl said:

Emulated disk1 is mounted so that's good.

Is the disk currently plugged in to your server?

It's plugged into a USB HDD dock and the USB is connected to a Win10 VM.

trurl · March 25, 2021

Stop the array and set disk1 to Not Assigned (or however it's worded) then start the array again.

Then shut down, install the disk into your server again, and boot up, but don't reassign the disk. Just start up normally and post new diagnostics.

oh-tomo · March 26, 2021

6 hours ago, trurl said:

Stop the array and set disk1 to Not Assigned (or however it's worded) then start the array again.

Then shut down, install the disk into your server again, and boot up, but don't reassign the disk. Just start up normally and post new diagnostics.

Here are the new diagnostics.

server2018-diagnostics-20210325-2125.zip

trurl · March 26, 2021

I assume the 12TB disk (sdg) with the serial ending 0TGDM was disk1.

SMART for that disk looks OK, and since emulated disk1 is mounted, it should be good to rebuild to that disk.

Not sure what this stuff at the end of the syslog is. It seems related to Nvidia and VMs:

Mar 25 21:21:59 Server2018 kernel: pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
Mar 25 21:21:59 Server2018 kernel: pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar 25 21:21:59 Server2018 kernel: pcieport 0000:00:1c.0:   device [8086:a33c] error status/mask=00000001/00002000
Mar 25 21:21:59 Server2018 kernel: pcieport 0000:00:1c.0:    [ 0] RxErr                  (First)
Mar 25 21:21:59 Server2018 avahi-daemon[4413]: Joining mDNS multicast group on interface vnet0.IPv6 with address fe80::fc54:ff:fe00:2981.
Mar 25 21:21:59 Server2018 avahi-daemon[4413]: New relevant interface vnet0.IPv6 for mDNS.
Mar 25 21:21:59 Server2018 avahi-daemon[4413]: Registering new address record for fe80::fc54:ff:fe00:2981 on vnet0.*.
Mar 25 21:23:08 Server2018 kernel: kvm [9357]: vcpu2, guest rIP: 0xfffff802693f6192 kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x1, nop
### [PREVIOUS LINE REPEATED 9 TIMES] ###
Mar 25 21:23:13 Server2018 kernel: kvm_set_msr_common: 12094 callbacks suppressed

Is it causing any apparent problems?

oh-tomo · March 26, 2021

I haven’t noticed any VM issues with Nvidia. Rebuild of disk1 has started.

oh-tomo · March 27, 2021

16 hours ago, oh-tomo said:

I haven’t noticed any VM issues with Nvidia. Rebuild of disk1 has started.

Rebuild is done!

server2018-diagnostics-20210326-2333.zip

trurl · March 27, 2021

LOOKS GOOD!

oh-tomo · March 28, 2021

I got a SMART health warning for disk 1.

Unraid Disk 1 SMART health [198]: 27-03-2021 23:31

Warning [SERVER2018] - offline uncorrectable is 24
ST12000VN0007-2GS116_*0TGDM (sdi)

server2018-diagnostics-20210328-0016.zip

trurl · March 28, 2021

Also, pending sectors and reallocated sectors have increased.

Why are you rebuilding parity2?

Replace disk1.

oh-tomo · March 28, 2021

13 minutes ago, trurl said:

Also, pending sectors and reallocated sectors have increased.

Why are you rebuilding parity2?

Replace disk1.

parity2 is rebuilding because that HDD was upgraded to a larger HDD. Rebuild is done.

What do I do with old disk1 after I replace it? Wipe and send to Seagate? "WARRANTY Valid till 09/Aug/2021"

server2018-diagnostics-20210328-0832.zip

oh-tomo · March 28, 2021

Rebuild of new disk1 (old parity2) started.

Should I wait until rebuild is complete before wiping and RMA-ing old disk1?

server2018-diagnostics-20210328-0925.zip

oh-tomo · March 29, 2021

Rebuild of new disk1 finished.

server2018-diagnostics-20210329-0818.zip

trurl · March 29, 2021

Looks good.

I notice that disk has a few reallocated sectors, but it should be OK as long as they don't start increasing.

Does it (or any other disk) show SMART warnings on the Dashboard page? Do you have Notifications set up to alert you immediately by email or other agent as soon as a problem is detected?

You might want to run an Extended SMART test on your disks occasionally.

oh-tomo · March 29, 2021

Dashboard shows SMART errors on:

disk1 - Reallocated_Sector_Ct - 8

disk3 - UDMA_CRC_Error_Count - 64

disk5 - UDMA_CRC_Error_Count - 25

I am set up with email notifications. Past SMART health warnings have been about two 12TB drives: *GDM (removed as disk1, formerly parity1; warnings on 11/15 and 12/15) and *LX1 (new disk1, formerly parity2; all warnings on 03/23).

trurl · March 29, 2021

36 minutes ago, oh-tomo said:

I am set up with email notifications.

But are you getting those SMART warnings in email?

You can control how different Notifications are given to you. SMART warnings are important enough that you need to know about them even if you don't happen to open up the webUI. I have nothing notifying me in the Browser, and I get emails for Array Status, Notices, Warnings, and Alerts.

When you do get a SMART warning, you need to make the warning go away. How you do this depends. If it is serious enough you replace the disk. If it is not serious enough, you Acknowledge it by clicking on it in the Dashboard. Once acknowledged, it will not warn you again unless it changes.

No point in letting a Warning just sit there and not do anything at all about it. If you Acknowledge it, and it comes back, then you know it has gotten worse.
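
To make the acknowledge behavior concrete, here is a toy Python model of "warn, acknowledge, warn again only if the value changes". It is a sketch of the behavior as described in this post, not Unraid's actual code; the class and method names are hypothetical.

```python
class SmartMonitor:
    """Toy model: a warning stays quiet once acknowledged, until the raw value changes."""

    def __init__(self):
        # (disk, attribute) -> raw value at the time it was acknowledged
        self.acknowledged = {}

    def check(self, disk, attribute, raw):
        """Return a warning string, or None if there is nothing new to report."""
        if raw == 0:
            return None
        if self.acknowledged.get((disk, attribute)) == raw:
            return None  # already acknowledged at this exact value
        return f"Warning: {disk} {attribute} is {raw}"

    def acknowledge(self, disk, attribute, raw):
        """Record that the user has seen this value."""
        self.acknowledged[(disk, attribute)] = raw


if __name__ == "__main__":
    monitor = SmartMonitor()
    print(monitor.check("disk1", "Reallocated_Sector_Ct", 8))   # first sighting
    monitor.acknowledge("disk1", "Reallocated_Sector_Ct", 8)
    print(monitor.check("disk1", "Reallocated_Sector_Ct", 8))   # quiet after acknowledging
    print(monitor.check("disk1", "Reallocated_Sector_Ct", 16))  # value changed
```

The point of the design is that a warning which reappears after being acknowledged carries information: the count moved.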

CRC Errors are connection issues and not really a disk problem, but the disk counts these and keeps the count in its firmware. Basically it means the data it received was inconsistent. Not all connection problems will show up there because if it isn't getting any data it can't check the consistency.

A small number of reallocated sectors is usually fine, since disks are designed to have some spare sectors for that purpose.

Those warnings are OK to acknowledge but of course we still want to see the results of the extended test on disk1.
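
The distinction above between connection problems and disk problems can be sketched as a small lookup in Python. The dashboard entries are the ones reported earlier in this thread; the advice strings paraphrase this post, and the `triage` helper is hypothetical.

```python
# Advice per attribute, paraphrasing the explanation in this post.
ADVICE = {
    "UDMA_CRC_Error_Count": "connection issue, not the disk itself: check cabling, then acknowledge",
    "Reallocated_Sector_Ct": "spare sectors in use: acknowledge and watch for increases",
}

# Dashboard entries as reported in this thread.
DASHBOARD = [
    ("disk1", "Reallocated_Sector_Ct", 8),
    ("disk3", "UDMA_CRC_Error_Count", 64),
    ("disk5", "UDMA_CRC_Error_Count", 25),
]


def triage(entries):
    """Attach advice to each (disk, attribute, raw) entry."""
    return [
        (disk, attribute, raw, ADVICE.get(attribute, "not covered in this thread"))
        for disk, attribute, raw in entries
    ]


if __name__ == "__main__":
    for disk, attribute, raw, advice in triage(DASHBOARD):
        print(f"{disk}: {attribute}={raw} -> {advice}")
```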

oh-tomo · March 29, 2021

I got these SMART health emails on Nov 15:

Quote

Event: Unraid Disk 1 SMART health [5]
Subject: Warning [SERVER2018] - reallocated sector ct is 40
Description: ST12000VN0007-2GS116_*GDM (sdf)
Importance: warning

Quote

Event: Unraid Disk 1 SMART health [5]
Subject: Warning [SERVER2018] - reallocated sector ct is 96
Description: ST12000VN0007-2GS116_*GDM (sdf)
Importance: warning

Quote

Event: Unraid Disk 1 SMART health [5]
Subject: Warning [SERVER2018] - reallocated sector ct is 176
Description: ST12000VN0007-2GS116_*GDM (sdf)
Importance: warning

And on Dec 15 this email:

Quote

Event: Unraid Disk 1 SMART health [5]
Subject: Warning [SERVER2018] - reallocated sector ct is 184
Description: ST12000VN0007-2GS116_*GDM (sdf)
Importance: warning

Should I have acted on them?

There weren't any read error alert emails until March 23.

I'll start an extended test on new disk1 tomorrow. Does it take as long as a rebuild?

trurl · March 30, 2021

32 minutes ago, oh-tomo said:

Should I have acted on them?

Yes, since they were increasing. Not sure where people like to draw the line, but three digits is too many for me.
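
Applied to the counts from the emails quoted above, the two criteria ("increasing" and "three digits") can be sketched in Python as follows. The limit of 100 is only a stand-in for the personal three-digit threshold mentioned here, not a Seagate or Unraid figure, and `assess` is a hypothetical helper.

```python
# Reallocated sector counts from the warning emails quoted in this thread.
READINGS = [("Nov 15", 40), ("Nov 15", 96), ("Nov 15", 176), ("Dec 15", 184)]


def assess(readings, limit=100):
    """Return (increasing, over_limit) for a chronological list of (date, count)."""
    counts = [count for _, count in readings]
    # Any rise between consecutive readings counts as "increasing".
    increasing = any(later > earlier for earlier, later in zip(counts, counts[1:]))
    # "Three digits" is modeled as the latest count reaching the limit.
    over_limit = counts[-1] >= limit
    return increasing, over_limit


if __name__ == "__main__":
    increasing, over_limit = assess(READINGS)
    print(f"increasing={increasing}, over_limit={over_limit}")
```

By either criterion, the November emails were already a reason to act.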

35 minutes ago, oh-tomo said:

Does it take as long as a rebuild?

Something like that.

oh-tomo · March 30, 2021

21 hours ago, trurl said:

Yes, since they were increasing. Not sure where people like to draw the line, but three digits is too many for me.

What action should I have taken at 3 digits?

21 hours ago, trurl said:

Something like that.

SMART extended self-test of disk1 has reached 40%

I'm doing a Windows long format of old disk 1 before submitting it for Seagate RMA.
