Accuracy of Failed Disk/Preclear Reporting



One of the spinning disks in my array has been reporting errors. Recently it went offline. As a result I replaced the disk. After the new disk was rebuilt, I ran a preclear on the failed disk. Unassigned Devices reports that the preclear finished successfully. I can't see anything notable in the log file, though I don't really know what would be considered a problem. I have attached the log. As a result I am wondering about the reliability of disk error reporting. Is the disk good or bad? Which should I have confidence in: the array reporting problems and taking the disk offline, or the preclear check?

 

Oct 31 00:50:30 preclear_disk_WD-WCC4N1AKP9J6_127711: Post-Read: dd if=/dev/sdc of=/tmp/.preclear/sdc/fifo count=2096640 skip=512 iflag=nocache,count_bytes,skip_bytes
Oct 31 00:50:31 preclear_disk_WD-WCC4N1AKP9J6_127711: Post-Read: verifying the rest of the disk.
Oct 31 00:50:31 preclear_disk_WD-WCC4N1AKP9J6_127711: Post-Read: cmp /tmp/.preclear/sdc/fifo /dev/zero
Oct 31 00:50:31 preclear_disk_WD-WCC4N1AKP9J6_127711: Post-Read: dd if=/dev/sdc of=/tmp/.preclear/sdc/fifo bs=2097152 skip=2097152 count=3000590884864 iflag=nocache,count_bytes,skip_bytes
Oct 31 01:23:48 preclear_disk_WD-WCC4N1AKP9J6_127711: Post-Read: progress - 10% verified @ 148 MB/s
Oct 31 01:58:04 preclear_disk_WD-WCC4N1AKP9J6_127711: Post-Read: progress - 20% verified @ 141 MB/s
Oct 31 02:33:47 preclear_disk_WD-WCC4N1AKP9J6_127711: Post-Read: progress - 30% verified @ 135 MB/s
Oct 31 03:11:30 preclear_disk_WD-WCC4N1AKP9J6_127711: Post-Read: progress - 40% verified @ 129 MB/s
Oct 31 03:51:17 preclear_disk_WD-WCC4N1AKP9J6_127711: Post-Read: progress - 50% verified @ 120 MB/s
Oct 31 04:33:51 preclear_disk_WD-WCC4N1AKP9J6_127711: Post-Read: progress - 60% verified @ 114 MB/s
Oct 31 05:19:12 preclear_disk_WD-WCC4N1AKP9J6_127711: Post-Read: progress - 70% verified @ 106 MB/s
Oct 31 06:08:59 preclear_disk_WD-WCC4N1AKP9J6_127711: Post-Read: progress - 80% verified @ 94 MB/s
Oct 31 07:04:27 preclear_disk_WD-WCC4N1AKP9J6_127711: Post-Read: progress - 90% verified @ 84 MB/s
Oct 31 08:08:01 preclear_disk_WD-WCC4N1AKP9J6_127711: Post-Read: dd - read 3000592982016 of 3000592982016 (0).
Oct 31 08:08:01 preclear_disk_WD-WCC4N1AKP9J6_127711: Post-Read: elapsed time - 7:17:28
Oct 31 08:08:01 preclear_disk_WD-WCC4N1AKP9J6_127711: Post-Read: dd exit code - 0
Oct 31 08:08:02 preclear_disk_WD-WCC4N1AKP9J6_127711: Post-Read: post-read verification completed!
Oct 31 08:08:06 preclear_disk_WD-WCC4N1AKP9J6_127711: S.M.A.R.T.: Cycle 1
Oct 31 08:08:06 preclear_disk_WD-WCC4N1AKP9J6_127711: S.M.A.R.T.:
Oct 31 08:08:06 preclear_disk_WD-WCC4N1AKP9J6_127711: S.M.A.R.T.: ATTRIBUTE                INITIAL  NOW    STATUS
Oct 31 08:08:06 preclear_disk_WD-WCC4N1AKP9J6_127711: S.M.A.R.T.: Reallocated_Sector_Ct    0        0      -
Oct 31 08:08:06 preclear_disk_WD-WCC4N1AKP9J6_127711: S.M.A.R.T.: Power_On_Hours           47106    47128  Up 22
Oct 31 08:08:06 preclear_disk_WD-WCC4N1AKP9J6_127711: S.M.A.R.T.: Temperature_Celsius      34       31     Down 3
Oct 31 08:08:06 preclear_disk_WD-WCC4N1AKP9J6_127711: S.M.A.R.T.: Reallocated_Event_Count  0        0      -
Oct 31 08:08:06 preclear_disk_WD-WCC4N1AKP9J6_127711: S.M.A.R.T.: Current_Pending_Sector   0        0      -
Oct 31 08:08:06 preclear_disk_WD-WCC4N1AKP9J6_127711: S.M.A.R.T.: Offline_Uncorrectable    0        0      -
Oct 31 08:08:06 preclear_disk_WD-WCC4N1AKP9J6_127711: S.M.A.R.T.: UDMA_CRC_Error_Count     0        0      -
Oct 31 08:08:06 preclear_disk_WD-WCC4N1AKP9J6_127711: S.M.A.R.T.: SMART overall-health self-assessment test result: PASSED
Oct 31 08:08:06 preclear_disk_WD-WCC4N1AKP9J6_127711: Cycle: elapsed time: 21:33:48
Oct 31 08:08:06 preclear_disk_WD-WCC4N1AKP9J6_127711: Preclear: total elapsed time: 21:33:53
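
For readers unsure what the Post-Read lines are checking: the log shows dd streaming the disk into a FIFO while cmp compares that stream against /dev/zero, i.e. the step confirms that the disk reads back as all zeros. The Python below is only a rough sketch of that idea, not the preclear script itself; the function name first_nonzero_offset and its parameters are made up for illustration, and it works on any seekable binary stream rather than a real device.

```python
import io


def first_nonzero_offset(stream, skip=0, chunk_size=2 * 1024 * 1024):
    """Return the offset of the first non-zero byte at or after `skip`.

    Returns None when everything from `skip` to the end reads back as zeros.
    Hypothetical helper for illustration; not part of the preclear plugin.
    """
    stream.seek(skip)
    offset = skip
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return None  # reached the end without finding a non-zero byte
        stripped = chunk.lstrip(b"\0")
        if stripped:
            # Bytes removed by lstrip are the leading zeros of this chunk.
            return offset + len(chunk) - len(stripped)
        offset += len(chunk)


if __name__ == "__main__":
    # A small in-memory "disk image" stands in for /dev/sdX here.
    image = io.BytesIO(bytes(4096))
    print(first_nonzero_offset(image, skip=512))
```

A clean result here only says the zeros that were written can be read back; it does not say why the disk was disabled in the first place.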

8 minutes ago, darrenyorston said:

One of the spinning disks in my array has been reporting errors. Recently it went offline. As a result I replaced the disk.

There is probably nothing wrong with the disk. Connection problems are much more common than disk problems. Since you have likely rebooted since that problem occurred, we won't be able to see what caused the disk to be disabled.

 

Attach diagnostics to your NEXT post in this thread and we can see if there is anything else of concern.


tower-diagnostics-20211106-1328.zip


SMART for that disk looks fine, and preclear agrees.

 

11 hours ago, darrenyorston said:

The array reporting problems and taking the disk offline

Any time a write to a disk fails, Unraid disables it because it is no longer in sync. After a disk is disabled, it isn't used again until rebuilt. Reads from the disk are emulated by reading all other disks and getting its data from the parity calculation. Writes to the disk are emulated by updating parity. That initial failed write, and any subsequent emulated writes, can be recovered by rebuilding the disk, but the physical disk itself didn't get any of those writes so it is out-of-sync.
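
To make the emulation described above concrete, here is a minimal sketch assuming a single parity disk computed as a plain XOR across the data disks. The function names (xor_blocks, compute_parity, emulated_read, parity_after_emulated_write) are illustrative only and are not Unraid internals; each "disk" is just a short bytes object.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def compute_parity(data_disks):
    """Parity block for a stripe: XOR of every data disk's block."""
    return xor_blocks(data_disks)


def emulated_read(surviving_disks, parity):
    """Reconstruct the disabled disk's block from the other disks plus parity."""
    return xor_blocks(surviving_disks + [parity])


def parity_after_emulated_write(surviving_disks, new_block):
    """New parity after a 'write' to the disabled disk.

    The physical disk is not touched; only parity changes, which is why
    the disabled disk ends up out of sync until it is rebuilt.
    """
    return xor_blocks(surviving_disks + [new_block])


if __name__ == "__main__":
    disk1, disk2, disk3 = b"\x01\x02", b"\x10\x20", b"\xaa\xbb"
    parity = compute_parity([disk1, disk2, disk3])
    # Pretend disk2 is disabled: its contents come from the others plus parity.
    print(emulated_read([disk1, disk3], parity) == disk2)
```

In this sketch a rebuild is just emulated_read applied to every block and written to the replacement (or the same) physical disk.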

 

As mentioned, bad connections are much more common than bad disks, and a bad connection can cause a failed write, which disables the disk.

11 hours ago, darrenyorston said:

One of the spinning disks in my array has been reporting errors.

 

I would assume that both of these failures have been the same disk. If that is the case and it happens again, change the SATA data cable and check the SATA power cable to that drive. If you are using a power splitter type cable, consider replacing it also. (Avoid the type where the wires are molded into the plastic; they are a potential fire hazard.)

 

Always double (and even triple) check every SATA connector to make sure that it is firmly seated before applying power. It is not uncommon, when changing out one drive, for the connector on another drive to be unseated ever so slightly!

 

For several years, I have always had spare 'cold' drive(s) for my systems. Whenever a drive goes off-line, my first action is to replace it with the spare. I then run two or three preclear cycles on that 'failed' drive. If it passes, it becomes the cold spare. If it has any SMART errors other than a 'UDMA CRC error rate' (#199), it is headed for the trash bin. (Fortunately, I have never had a second failure of the same disk within a short period. But if I were to experience that, I would be doing what I suggested in the first paragraph...)
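
The keep-or-discard rule in the paragraph above can be written down as a small check. This is a sketch under assumptions: the attribute names are the ones shown in the preclear log earlier in the thread, "SMART error" is read here as any non-zero raw value on one of those attributes, and the triage function and its return strings are invented for illustration. It does not query the drive; the values have to be supplied by hand.

```python
# Attributes taken from the preclear log's S.M.A.R.T. table (assumed watch list).
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

# Attribute 199 usually points at cabling rather than the disk, so it is excluded.
IGNORED = {"UDMA_CRC_Error_Count"}


def triage(smart_now):
    """Return (decision, offending_attributes) for a dict of raw SMART values."""
    bad = {
        name: value
        for name, value in smart_now.items()
        if name in WATCHED and name not in IGNORED and value > 0
    }
    if bad:
        return "discard", bad
    return "keep as cold spare", {}


if __name__ == "__main__":
    # The "NOW" column from the log posted at the top of the thread.
    from_log = {
        "Reallocated_Sector_Ct": 0,
        "Reallocated_Event_Count": 0,
        "Current_Pending_Sector": 0,
        "Offline_Uncorrectable": 0,
        "UDMA_CRC_Error_Count": 0,
    }
    print(triage(from_log))
```

A non-zero UDMA_CRC_Error_Count is deliberately not treated as a reason to discard the disk, matching the exception made above; it is a reason to look at the cables.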

That is a good idea, thanks.

I had checked the cabling for each drive. They didn't appear to be out of place; they didn't move when I pressed them. The system had also been restarted, as I have been trying to resolve an issue with VMs randomly freezing.

 
