disabled disk passes SeaTools tests



Quote

 

Event: Unraid array errors

Subject: Warning [SERVER2018] - array has errors

Description: Array has 1 disk with read errors

Importance: warning

 

 

Disk 1 - ST12000VN0007-2GS116_********* (sdf) (errors 16)

 

 

unRAID disabled the disk (red x)

 

I removed the disk. Its contents are currently emulated. I placed the disk in a USB HDD dock connected to a Windows 10 VM and ran some SeaTools tests:

 

Quote

 

--------------- SeaTools for Windows v1.4.0.7 ---------------

2021-03-23 11:45:23 PM

Model Number: ST12000V

Serial Number: ********

Firmware Revision: 0302

Short DST - Started 2021-03-23 11:45:23 PM

Short DST - Pass 2021-03-23 11:46:25 PM

Short Generic - Started 2021-03-23 11:46:40 PM

Short Generic - Pass 2021-03-23 11:47:34 PM

Long Generic - Started 2021-03-23 11:47:56 PM

Long Generic - Pass 2021-03-24 11:41:03 PM

Identify - Started 2021-03-25 11:29:05 AM

 

 

The Long Generic SeaTools test ran for about 24 hours and the HDD passed. This is troubling. The HDD is under warranty until August, but unRAID thinks the HDD is error prone while the SeaTools tests aren't finding errors. I can't put the disk back in the array because unRAID has disabled it, and I can't send the disk to Seagate for RMA because their software doesn't show any problems.

 

How should I proceed?   

 

Link to comment
34 minutes ago, oh-tomo said:

How should I proceed? 

Now that you have removed the disk, how you should proceed is a little more complicated. You should have asked for advice before doing anything.

 

Ideally we would have gotten the diagnostics before you did anything and before rebooting your server. Then we would be better able to see why a write to the disk failed. As mentioned, bad connections are much more common than bad disks.

 

Is your server currently running without the disk?

Link to comment
10 minutes ago, trurl said:


Is your server currently running without the disk?

 

Yes, the server is running without the disk.

Link to comment

I assume the 12TB disk (sdg) with serial ending 0TGDM was disk1.

 

SMART for that disk looks OK, and since the emulated disk1 is mounted, it should be fine to rebuild onto that disk.

 

Not sure what this stuff at the end of the syslog is; it seems related to nvidia and VMs:

Mar 25 21:21:59 Server2018 kernel: pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
Mar 25 21:21:59 Server2018 kernel: pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar 25 21:21:59 Server2018 kernel: pcieport 0000:00:1c.0:   device [8086:a33c] error status/mask=00000001/00002000
Mar 25 21:21:59 Server2018 kernel: pcieport 0000:00:1c.0:    [ 0] RxErr                  (First)
Mar 25 21:21:59 Server2018 avahi-daemon[4413]: Joining mDNS multicast group on interface vnet0.IPv6 with address fe80::fc54:ff:fe00:2981.
Mar 25 21:21:59 Server2018 avahi-daemon[4413]: New relevant interface vnet0.IPv6 for mDNS.
Mar 25 21:21:59 Server2018 avahi-daemon[4413]: Registering new address record for fe80::fc54:ff:fe00:2981 on vnet0.*.
Mar 25 21:23:08 Server2018 kernel: kvm [9357]: vcpu2, guest rIP: 0xfffff802693f6192 kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x1, nop
### [PREVIOUS LINE REPEATED 9 TIMES] ###
Mar 25 21:23:13 Server2018 kernel: kvm_set_msr_common: 12094 callbacks suppressed

 

Is it causing any apparent problems?

Link to comment
7 hours ago, trurl said:


 

Is it causing any apparent problems?

I haven’t noticed any VM issues with Nvidia. The rebuild of disk1 has started.

Link to comment

Looks good.

 

I notice that disk has a few reallocated sectors, but it should be OK as long as the count doesn't start increasing.

 

Does it (or any other) show SMART warnings on the Dashboard page? Do you have Notifications set up to alert you immediately by email or other agent as soon as a problem is detected?

 

You might want to run an Extended SMART test on your disks occasionally.

 

Link to comment

 

5 hours ago, trurl said:


 

Does it (or any other) show SMART warnings on the Dashboard page? Do you have Notifications set up to alert you immediately by email or other agent as soon as a problem is detected?

 


 

 

Dashboard shows SMART errors on:

 

disk1 - Reallocated_Sector_Ct - 8

disk3 - UDMA_CRC_Error_Count - 64

disk5 - UDMA_CRC_Error_Count - 25
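For anyone who would rather read these counters from a terminal than from the Dashboard, here is a minimal Python sketch that pulls them out of text in the layout `smartctl -A` prints. The sample table is made up for illustration (only the raw counts 8 and 64 come from this thread), and `parse_smart_attributes` is a hypothetical helper name, not part of Unraid or smartmontools.

```python
# Illustrative sample in the column layout of `smartctl -A /dev/sdX`.
# Only the raw counts (8 and 64) come from this thread; the other
# columns are placeholder values.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       64
"""


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for rows whose raw value is a plain integer."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped.
        if len(parts) >= 10 and parts[0].isdigit() and parts[9].isdigit():
            attrs[parts[1]] = int(parts[9])
    return attrs


if __name__ == "__main__":
    for name, raw in parse_smart_attributes(SAMPLE).items():
        print(f"{name}: {raw}")
```

Rows whose raw value is not a plain integer (some drives report composite values) are simply skipped in this sketch.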

 

I am set up with email notifications.  Past SMART health warnings have been about the *GDM (removed as disk1, formerly parity1 -- on 11/15 and 12/15) and *LX1 (new disk1, former parity2 -- all on 03/23) 12TB drives.

 

Link to comment
36 minutes ago, oh-tomo said:

I am set up with email notifications.

But are you getting those SMART warnings in email?

 

You can control how different Notifications are given to you. SMART warnings are important enough that you need to know about them even if you don't happen to open up the webUI. I have nothing notifying me in the Browser, and I get emails for Array Status, Notices, Warnings, and Alerts.

 

When you do get a SMART warning, you need to make the warning go away. How you do that depends on severity: if it is serious enough, you replace the disk; if not, you Acknowledge it by clicking on it in the Dashboard. Once acknowledged, it will not warn you again unless the value changes.

 

No point in letting a Warning just sit there and not do anything at all about it. If you Acknowledge it, and it comes back, then you know it has gotten worse.
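The acknowledge behaviour described above can be modelled in a few lines of Python. This is only a toy sketch of the idea (remember the acknowledged value, warn again when it differs), not Unraid's actual code; the class and method names are invented for illustration.

```python
class SmartWarnings:
    """Toy model of acknowledge-then-rewarn; not Unraid's implementation."""

    def __init__(self):
        # (disk, attribute) -> raw value at the time it was acknowledged
        self.acknowledged = {}

    def acknowledge(self, disk, attribute, value):
        """Record that the user has seen this value for this attribute."""
        self.acknowledged[(disk, attribute)] = value

    def should_warn(self, disk, attribute, value):
        """Warn on any nonzero value that has not been acknowledged as-is."""
        if value == 0:
            return False
        return self.acknowledged.get((disk, attribute)) != value


if __name__ == "__main__":
    w = SmartWarnings()
    print(w.should_warn("disk3", "UDMA_CRC_Error_Count", 64))  # not yet acknowledged
    w.acknowledge("disk3", "UDMA_CRC_Error_Count", 64)
    print(w.should_warn("disk3", "UDMA_CRC_Error_Count", 64))  # same value, stays quiet
    print(w.should_warn("disk3", "UDMA_CRC_Error_Count", 65))  # value changed
```

The point of the design is that a warning which reappears after being acknowledged carries information: the counter has moved.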

 

CRC Errors are connection issues and not really a disk problem, but the disk counts these and keeps the count in its firmware. Basically it means the data it received was inconsistent. Not all connection problems will show up there because if it isn't getting any data it can't check the consistency.
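To make "the data it received was inconsistent" concrete, here is a small Python illustration of a receiver checking a CRC and counting mismatches. It uses `zlib.crc32` purely as a stand-in checksum and does not reproduce the actual SATA link framing; the function names are made up for this sketch.

```python
import zlib


def send(payload):
    """Sender side: attach a CRC-32 of the payload."""
    return payload, zlib.crc32(payload)


def receive(payload, checksum, crc_error_count):
    """Receiver side: recompute the CRC and bump the counter on a mismatch."""
    ok = zlib.crc32(payload) == checksum
    return ok, crc_error_count + (0 if ok else 1)


if __name__ == "__main__":
    data, crc = send(b"sector contents")
    errors = 0

    # Clean transfer: the checksum matches and the counter is unchanged.
    ok, errors = receive(data, crc, errors)
    print(ok, errors)

    # Simulate a bad cable flipping one bit in transit.
    corrupted = bytes([data[0] ^ 0x01]) + data[1:]
    ok, errors = receive(corrupted, crc, errors)
    print(ok, errors)
```

As noted above, the counter only moves when something arrives to be checked, so a connection that delivers nothing at all would not show up this way.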

 

A small number of reallocated sectors is usually fine, since disks are designed to have some spare sectors for that purpose.

 

Those warnings are OK to acknowledge but of course we still want to see the results of the extended test on disk1.

 

Link to comment
26 minutes ago, trurl said:

But are you getting those SMART warnings in email?

 


 

 

I got these SMART health emails on Nov 15:

 

Quote

Event: Unraid Disk 1 SMART health [5]
Subject: Warning [SERVER2018] - reallocated sector ct is 40
Description: ST12000VN0007-2GS116_*GDM (sdf)
Importance: warning

 

Quote

Event: Unraid Disk 1 SMART health [5]
Subject: Warning [SERVER2018] - reallocated sector ct is 96
Description: ST12000VN0007-2GS116_*GDM (sdf)
Importance: warning

 

Quote

Event: Unraid Disk 1 SMART health [5]
Subject: Warning [SERVER2018] - reallocated sector ct is 176
Description: ST12000VN0007-2GS116_*GDM (sdf)
Importance: warning

 

And on Dec 15 this email:

 

Quote

Event: Unraid Disk 1 SMART health [5]
Subject: Warning [SERVER2018] - reallocated sector ct is 184
Description: ST12000VN0007-2GS116_*GDM (sdf)
Importance: warning

 

Should I have acted on them?

 

There weren't any read error alert emails until March 23.

 

I'll start an extended test on the new disk1 tomorrow. Does it take as long as a rebuild?

 

Link to comment
21 hours ago, trurl said:

Yes since they were increasing. Not sure where people like to draw the line but 3 digits is too many for me.
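As a rough sketch of that rule of thumb, the following Python pulls the counts out of the four email subjects quoted earlier in this thread and flags the disk when the count is rising or has reached three digits. The threshold of 100 is just one poster's personal line, not an official limit, and the function names are invented for illustration.

```python
import re

# Subjects of the four SMART health emails quoted in this thread.
SUBJECTS = [
    "Warning [SERVER2018] - reallocated sector ct is 40",
    "Warning [SERVER2018] - reallocated sector ct is 96",
    "Warning [SERVER2018] - reallocated sector ct is 176",
    "Warning [SERVER2018] - reallocated sector ct is 184",
]

PATTERN = re.compile(r"reallocated sector ct is (\d+)")


def reallocated_counts(subjects):
    """Extract the reallocated sector count from each matching subject line."""
    counts = []
    for subject in subjects:
        match = PATTERN.search(subject)
        if match:
            counts.append(int(match.group(1)))
    return counts


def needs_attention(counts, limit=100):
    """True if the count ever increased, or the latest value is at or above `limit`."""
    increasing = any(later > earlier for earlier, later in zip(counts, counts[1:]))
    return increasing or (bool(counts) and counts[-1] >= limit)


if __name__ == "__main__":
    counts = reallocated_counts(SUBJECTS)
    print(counts, needs_attention(counts))
```

A disk that sits at a small, stable count (for example a single reading of 8) would not be flagged by this check.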

 

What action should I have taken at 3 digits?

 

21 hours ago, trurl said:

Something like that.

 

The SMART extended self-test of disk1 has reached 40%.

 

I'm doing a Windows long format of old disk 1 before submitting it for Seagate RMA.

 

Link to comment
