Smart error HDD - extended self-test stuck at 10%.


DrBobke


Hi All,

 

For some months I have been getting SMART errors on two drives (one parity drive and one array drive). I ran the SMART short self-test on both and it found no issues. I then tried the extended self-test, at first on both drives simultaneously, but it stays stuck at 10% (it starts at 10% and never progresses). After a few days of running, I decided to 'kill' the extended test on the array HDD and let it run only on the parity drive. After a few more days it was still at 10%, so I killed that test too. I rebooted my PC and Unraid and started another extended test, but it has now been running for 13 days and 21 hours and is still stuck at 10%. Does anyone know why it won't progress further? It's driving me crazy. I even noticed a few weeks ago that, after rebooting my PC, the self-test read 'aborted by user'. That seems strange: the self-test should run inside Unraid itself, so there is no reason for it to abort just because I rebooted my PC, no?

 

I have read several threads on the forum (this one was most helpful), but it seems I cannot even get the test to run to completion. The parity drive is only a few months old and the HDD is just over 1 year old, but as said, it has been like this since the beginning, or at least for several months. Both drives (like all drives in the array) were pre-cleared beforehand and were ready to go. I bought them new from good suppliers that I trust (no earlier deployment in another machine). The parity drive is my 2nd parity drive, should that help.

 

Hope someone can help.

Thanks in advance!

Best regards,

 

Link to comment
1 hour ago, DrBobke said:

completed without error

The extended test is a self-test run by the disk itself; Unraid just polls the status from the disk periodically. If it appears stuck (not aborted), it may be a polling issue. Btw, since it has now completed successfully, there is no problem at all.

 

Unraid has a SMART polling timer setting; what is the current value?

 

The test should run at around 0.5 TB per hour, so if a 14 TB disk needs much more than 28 hours, then something is still abnormal.
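As a rough sketch of that arithmetic (the 0.5 TB per hour rate is only the rule of thumb above, not a drive specification, and the function name is made up for illustration):

```python
def expected_test_hours(capacity_tb: float, rate_tb_per_hour: float = 0.5) -> float:
    """Rough extended self-test duration from a TB-per-hour rule of thumb."""
    if rate_tb_per_hour <= 0:
        raise ValueError("rate_tb_per_hour must be positive")
    return capacity_tb / rate_tb_per_hour


print(expected_test_hours(14))  # 28.0 hours for a 14 TB disk
print(expected_test_hours(12))  # 24.0 hours for a 12 TB disk
```

A test that has been running for many days on a disk of this size is therefore far outside the expected range.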

 

Edited by Vr2Io
Link to comment

I'm pretty sure it has taken longer than 28 hours. Also, the SMART status on the disk still shows as error in orange with a thumbs down.

Disk 3 (12TB) is still clearing and also stuck at 10% (I don't see CPU or other parameters shoot up unexpectedly).

 

I don't know where I can find the SMART polling timer setting. Do you mean this?

SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode.
                             Supports SMART auto save timer.

I am sure I wouldn't have touched any such setting, so it should be at its default, if there is one...

Link to comment

The disk is not being used during the test.  Yesterday evening, my weekly Parity-Check started, which will end in a few hours.  But that was after the Parity II test ended.

 

Is there something I can/should change there? And can you tell me why the SMART status in the Dashboard remains on Error, or what I can do to fix that? As said, the disk is less than 1 year old, and the other one is also still in warranty at just over 1 year old.

Link to comment
2 hours ago, DrBobke said:

Also, the SMART status on the disk still shows as error

 

That is because attributes 5 and 196 have non-zero raw values, which is not a good sign. If the disk is under warranty, I would RMA it.

FYR, almost all of my disks are problem-free even at 5+ years old; only 2 have had problems, although in the past I also had a hard time with many disks getting into trouble.

 

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   050    -    0
  2 Throughput_Performance  P-S---   100   100   050    -    0
  3 Spin_Up_Time            POS--K   100   100   001    -    7721
  4 Start_Stop_Count        -O--CK   100   100   000    -    799
  5 Reallocated_Sector_Ct   PO--CK   100   100   050    -    8
  7 Seek_Error_Rate         PO-R--   100   100   050    -    0
  8 Seek_Time_Performance   P-S---   100   100   050    -    0
  9 Power_On_Hours          -O--CK   084   084   000    -    6427
 10 Spin_Retry_Count        PO--CK   100   100   030    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    6
 23 Helium_Condition_Lower  PO---K   100   100   075    -    0
 24 Helium_Condition_Upper  PO---K   100   100   075    -    0
191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
192 Power-Off_Retract_Count -O--CK   100   100   000    -    0
193 Load_Cycle_Count        -O--CK   100   100   000    -    908
194 Temperature_Celsius     -O---K   100   100   000    -    42 (Min/Max 25/54)
196 Reallocated_Event_Count -O--CK   100   100   000    -    1
197 Current_Pending_Sector  -O--CK   100   100   000    -    0
198 Offline_Uncorrectable   ----CK   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
220 Disk_Shift              -O----   100   100   000    -    100925444
222 Loaded_Hours            -O--CK   097   097   000    -    1407
223 Load_Retry_Count        -O--CK   100   100   000    -    0
224 Load_Friction           -O---K   100   100   000    -    0
226 Load-in_Time            -OS--K   100   100   000    -    584
240 Head_Flying_Hours       P-----   100   100   001    -    0
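A minimal sketch of how such a table could be scanned for non-zero raw values. It assumes rows in the same column layout as the table above; the watch-list is an assumption for illustration (5, 196 and 199 come up in this thread, while 197 and 198 are added here), and the function name is hypothetical.

```python
import re

# Attribute IDs to watch. This set is an illustrative assumption:
# 5, 196 and 199 are discussed in the thread; 197 and 198 are added here.
WATCHED_IDS = {5, 196, 197, 198, 199}

# Columns: ID, name, flags, value, worst, thresh, fail, raw value
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+\S+\s+\d+\s+\d+\s+\d+\s+\S+\s+(\d+)"
)


def nonzero_watched(table_text):
    """Return {id: (name, raw_value)} for watched attributes with raw > 0."""
    flagged = {}
    for line in table_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header or unrelated line
        attr_id = int(match.group(1))
        raw_value = int(match.group(3))
        if attr_id in WATCHED_IDS and raw_value > 0:
            flagged[attr_id] = (match.group(2), raw_value)
    return flagged


# A few rows copied from the table above
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK   100   100   050    -    8
194 Temperature_Celsius     -O---K   100   100   000    -    42 (Min/Max 25/54)
196 Reallocated_Event_Count -O--CK   100   100   000    -    1
197 Current_Pending_Sector  -O--CK   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
"""

print(nonzero_watched(SAMPLE))
# {5: ('Reallocated_Sector_Ct', 8), 196: ('Reallocated_Event_Count', 1)}
```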

 

2 hours ago, DrBobke said:

I don't know where I can find the SMART polling timer setting

 

Try checking the setting below; I set it to 600, which means polling every 10 minutes.

 

image.png.5781b29319c643a54f9a64ae3b411ea1.png

 

Please note that every extended test was "Aborted by host"; none of them ever completed.

 

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Aborted by host               90%      6231         -
# 2  Extended offline    Aborted by host               90%      6203         -
# 3  Extended offline    Aborted by host               90%      6089         -
# 4  Extended offline    Aborted by host               90%      6083         -
# 5  Short offline       Completed without error       00%      6081         -
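For illustration only, a small sketch that reads a self-test log in the layout shown above and reports whether any extended test ever completed. The function names are hypothetical, and the parser assumes exactly this column layout.

```python
import re

# One log row: "# N  <test type>  <status>  <remaining>%  <lifetime hours>  <LBA>"
LOG_ROW = re.compile(
    r"^#\s*(\d+)\s+(Short offline|Extended offline)\s+(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)$"
)


def parse_selftest_log(text):
    """Parse self-test log rows into a list of dicts; other lines are skipped."""
    entries = []
    for line in text.splitlines():
        match = LOG_ROW.match(line.strip())
        if match:
            entries.append({
                "num": int(match.group(1)),
                "test": match.group(2),
                "status": match.group(3),
                "remaining": int(match.group(4)),
                "hours": int(match.group(5)),
            })
    return entries


def extended_never_completed(entries):
    """True if extended tests were logged and none of them completed."""
    extended = [e for e in entries if e["test"] == "Extended offline"]
    return bool(extended) and not any(
        e["status"].startswith("Completed") for e in extended
    )


# Log copied from the post above
SAMPLE_LOG = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Aborted by host               90%      6231         -
# 2  Extended offline    Aborted by host               90%      6203         -
# 3  Extended offline    Aborted by host               90%      6089         -
# 4  Extended offline    Aborted by host               90%      6083         -
# 5  Short offline       Completed without error       00%      6081         -
"""

entries = parse_selftest_log(SAMPLE_LOG)
print(extended_never_completed(entries))  # True
```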

 

A normal test result looks like the one below:

 

image.png.549966416ead5b469a6d5b8dbabfa331.png

 

Edited by Vr2Io
Link to comment
6 hours ago, DrBobke said:

I could first return the Parity drive and get a new one and then disk 3

This is the normal procedure: replace faulty disks one at a time. For disk3, attribute 199 means the SATA interface recorded errors; in most cases this is a controller or cable issue. If your system has such an issue, the same error can happen again on the replacement disk. So whether to RMA disk3 is really up to you.

 

199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    6 

 

6 hours ago, DrBobke said:

I could copy that over to an external HDD too if needed, or doesn't it work that way?

That's good; if the disk3 rebuild fails or ends up corrupt, you still have a backup.

Link to comment

Okay, thanks a lot! I will check if I can see anything 'not right' with disk 3.  Thanks a lot for all your help, I really appreciate it!  

What is the best course of action in removing the Parity II? Shut down UnRaid, pull the drive out and power back up, or do I need to do that in another way?

Link to comment
1 hour ago, DrBobke said:

What is the best course of action in removing the Parity II? Shut down UnRaid, pull the drive out and power back up, or do I need to do that in another way?

Stop Array, unassign parity2, restart array to commit change.

Your steps would leave Unraid complaining about a missing parity2 drive :( 

Link to comment
3 hours ago, DrBobke said:

What is the best course of action in removing the Parity II? Shut down UnRaid, pull the drive out and power back up, or do I need to do that in another way?

 

1 hour ago, itimpi said:

Stop Array, unassign parity2, restart array to commit change.

 

Both are fine; either way you end up rebuilding parity2 on the replacement disk.

Link to comment

Hey all, I unassigned Parity II as described (stop array, go to the Parity section, select the dropdown and unassign). I checked that it was unassigned, shut down Unraid, removed Parity II and booted up again, but I was stuck with 'Stale configuration'; the only way to be able to start the array again is to insert Parity II again.  What is going on? I obviously want to start the array without Parity II being present...

 

Edit - SHIT! I just see I have lost all my dockers??? Nextcloud, duckdns, mariadb, Plex, Wireguard, OpenVPN, all of them gone!  Also my 2 VM's are gone! HEEEELLLPPPP

leonore-diagnostics-20220115-1450.zip

 

I have just checked, and my entire AppData folder is EMPTY! 😮 Luckily, I still have a backup of it from 9/01/2022 at 1.06AM. Can I restore without having to use the backup? I hope to God I didn't lose any data, but how on earth is this possible? I did the right things, no?

Edited by DrBobke
added lost dockers and VM's
Link to comment
2 hours ago, DrBobke said:

Edit - SHIT! I just see I have lost all my dockers??? Nextcloud, duckdns, mariadb, Plex, Wireguard, OpenVPN, all of them gone!  Also my 2 VM's are gone! HEEEELLLPPPP

Did not spot anything obvious in the diagnostics.

 

The symptoms suggest that the drive where the appdata and/or system shares are located might have problems.   Looking at the diagnostics appdata share seems to be on disk3 and system on disk1.    If that is what you expect I would suggest running a file system check on these drives.

Link to comment
