Posts posted by Mihail

  1. Well, it seems to be "fixed" now. I'm not sure how much I trust the disk, but it does seem to operate, and the full SMART test went OK.

     

    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
      3 Spin_Up_Time            0x0027   248   228   021    Pre-fail  Always       -       6591
      4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       51
      5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
      9 Power_On_Hours          0x0032   048   048   000    Old_age   Always       -       38381
     10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
     11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       50
    192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       41
    193 Load_Cycle_Count        0x0032   170   170   000    Old_age   Always       -       91660
    194 Temperature_Celsius     0x0022   119   103   000    Old_age   Always       -       33
    196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
    200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
    
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Extended offline    Completed without error       00%     38370         -
    # 2  Short offline       Completed without error       00%     38358         -
    # 3  Short offline       Completed: read failure       70%     38209         3130108536
    # 4  Extended offline    Completed without error       00%     38138         -
    # 5  Short offline       Completed: read failure       60%     38042         3130108384
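
For comparing self-test logs between runs, the table above can be parsed instead of read by eye. This is a minimal sketch, assuming the column layout shown in this post; `parse_self_test_log` and `failed_tests` are hypothetical helper names, not part of smartmontools, and the sample rows are copied from the log above.

```python
import re
from typing import List, NamedTuple, Optional


class SelfTest(NamedTuple):
    num: int
    description: str
    status: str
    remaining_pct: int
    lifetime_hours: int
    first_error_lba: Optional[int]  # None when the log shows "-"


# Assumed row layout (taken from the pasted log), e.g.:
# "# 3  Short offline  Completed: read failure  70%  38209  3130108536"
ROW = re.compile(r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)\s*$")

SAMPLE = """\
# 1  Extended offline    Completed without error       00%     38370         -
# 2  Short offline       Completed without error       00%     38358         -
# 3  Short offline       Completed: read failure       70%     38209         3130108536
# 4  Extended offline    Completed without error       00%     38138         -
# 5  Short offline       Completed: read failure       60%     38042         3130108384
"""


def parse_self_test_log(text: str) -> List[SelfTest]:
    """Turn self-test log rows into records; non-matching lines are skipped."""
    entries = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue
        num, desc, status, remaining, hours, lba = match.groups()
        entries.append(
            SelfTest(
                int(num),
                desc,
                status,
                int(remaining),
                int(hours),
                None if lba == "-" else int(lba),
            )
        )
    return entries


def failed_tests(entries: List[SelfTest]) -> List[SelfTest]:
    """Keep only the entries whose status is not 'Completed without error'."""
    return [e for e in entries if "without error" not in e.status]


if __name__ == "__main__":
    for test in failed_tests(parse_self_test_log(SAMPLE)):
        print(test.num, test.description, test.lifetime_hours, test.first_error_lba)
```

Run on the log above, this keeps only the two short tests that ended in a read failure, along with the power-on hour and LBA recorded for each.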

     

  2. At least the short test seems to succeed now. I left the long test running, but it will take until morning. My confidence in SMART is not great at the moment :)

     

    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short offline       Completed without error       00%     38358         -
    # 2  Short offline       Completed: read failure       70%     38209         3130108536
    # 3  Extended offline    Completed without error       00%     38138         -
    # 4  Short offline       Completed: read failure       60%     38042         3130108384
    # 5  Extended offline    Completed without error       00%     37998         -

     

  3. Hi,

     

    This is possibly a silly question, but I was wondering how a SMART test can fail while the mechanical HDD can still be written in full and read back with no errors reported. What am I not understanding?

     

    I cleared the disk with three passes of shred and then read the whole disk back with dd. I got no errors and normal performance figures. How is this possible?

     

    Maybe some of you drive experts can figure this out.


     

    === START OF INFORMATION SECTION ===
    Model Family:     Western Digital Red
    Device Model:     WDC WD60EFRX-68MYMN1
    Serial Number:    WD-#################
    LU WWN Device Id: 5 0014ee 2b73476cd
    Firmware Version: 82.00A82
    User Capacity:    6,001,175,126,016 bytes [6.00 TB]
    Sector Sizes:     512 bytes logical, 4096 bytes physical
    Rotation Rate:    5700 rpm
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
    SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    
    
    
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
      3 Spin_Up_Time            0x0027   253   228   021    Pre-fail  Always       -       5800
      4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       50
      5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
      9 Power_On_Hours          0x0032   048   048   000    Old_age   Always       -       38350
     10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
     11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       49
    192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       41
    193 Load_Cycle_Count        0x0032   170   170   000    Old_age   Always       -       91648
    194 Temperature_Celsius     0x0022   115   103   000    Old_age   Always       -       37
    196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
    200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
    
    SMART Error Log Version: 1
    No Errors Logged
    
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short offline       Completed: read failure       70%     38209         3130108536
    # 2  Extended offline    Completed without error       00%     38138         -
    # 3  Short offline       Completed: read failure       60%     38042         3130108384
    # 4  Extended offline    Completed without error       00%     37998         -
    # 5  Short offline       Completed without error       00%     37874         -
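
To see where the two failing LBAs sit on the disk, they can be converted to byte offsets and physical sectors. This is a rough sketch, assuming the LBAs in the self-test log count 512-byte logical sectors (the sector sizes are taken from the information section above); `locate_lba` is a hypothetical helper name.

```python
LOGICAL_SECTOR_BYTES = 512    # from "Sector Sizes: 512 bytes logical"
PHYSICAL_SECTOR_BYTES = 4096  # from "Sector Sizes: ... 4096 bytes physical"

# LBA_of_first_error values from the two failed short tests above.
FAILED_LBAS = [3130108536, 3130108384]


def locate_lba(lba: int) -> dict:
    """Map a logical LBA to its byte offset and containing physical sector."""
    logical_per_physical = PHYSICAL_SECTOR_BYTES // LOGICAL_SECTOR_BYTES
    return {
        "lba": lba,
        "byte_offset": lba * LOGICAL_SECTOR_BYTES,
        "physical_sector": lba // logical_per_physical,
        # True when the LBA is the first logical sector of a physical sector.
        "aligned": lba % logical_per_physical == 0,
    }


if __name__ == "__main__":
    located = [locate_lba(lba) for lba in FAILED_LBAS]
    for entry in located:
        print(entry)
    gap = abs(located[0]["physical_sector"] - located[1]["physical_sector"])
    print("physical sectors apart:", gap)
```

The two LBAs are 152 logical sectors apart, so both failures were reported in the same small area of the disk rather than at unrelated locations.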

     

  4. Still broken. I got critical disk errors last night, and all the VMs went down because of it.

     

    Please help me figure it out, since having to reboot and rebuild the parity every 10 days or so is getting tiresome.

     

    I attached new diagnostic dumps from before and after rebooting. The dump from before the reboot does not include SMART data. I also attached the syslog covering the last bootup through to the system failure.
     

    vbarum-diagnostics-20210217-2315-beforereboot.zip vbarum-diagnostics-20210217-2341-after reboot.zip syslog

  5. 1 minute ago, JorgeB said:

     

    This is a rather common issue with the SATA controller on Ryzen boards, newer kernel on v6.9 helps in some cases, newer BIOS might also help, but you already did that.

    I ran into some posts about this error with NVMe drives, and some that talk about disabling IOMMU grouping, but none with SATA drives.
    Would you advise moving to the beta channel to get a fix through a newer kernel?

  6. The setup

    Ryzen 7 3700X

    MAG B550 TOMAHAWK - BIOS A53 (Latest Beta bios with ComboAM4PIV2 1.1.9.0) - Updated after second failure

    128GB DDR4 2666MHz

    550W PSU


    Version: 7C91vA53 (Beta version)
    Release Date: 2020-12-30
    File Size: 17.96 MB

    Drives

    Parity WDC_WD102KRYZ-01A5AB0_VCH9Z3KP - 10 TB

    Disk1 WDC_WD4002FYYZ-01B7CB0_K3G42TLB - 4 TB

    Disk2 WDC_WD102KRYZ-01A5AB0_VCH8VSTP - 10 TB

    Disk3 WDC_WD4002FYYZ-01B7CB0_K3G4VK1B - 4 TB

     

    Cache Samsung_SSD_860_EVO_2TB_S4X1NJ0N702274P - 2 TB

    Cache Samsung_SSD_860_EVO_2TB_S4X1NJ0N702273Y - 2 TB

     

    All drives are SATA and are plugged into the onboard SATA controller

     

    Running Unraid 6.8.3

     

     

    The problem

     

    After running for several days (2-14), the system hits an error that causes cascading read errors across multiple disks. Typically Unraid marks disk1 as disabled and either puts the whole array into read-only mode or locks up the virtual machines completely.

     

    The errors seemed to point to communication issues with the drives so the first step was to sacrifice the SATA cables to the IT gods and replace them with new ones. This did not fix the issue.

     

    After rebooting the server, everything returns to normal operation. I have tried re-adding disk1 to the array twice. The array rebuild completes without issue, as does the extended SMART test. The system, along with its virtual machines, then works fine for several days before encountering a similar issue.

     

    On the last round of failures I did not re-add the 4TB Disk1 that was marked disabled to try to rule it out of the equation. The same cascading failure happened two days later and the system was working normally after reboot.

     

    I suspect that this could be a kernel issue or a hardware issue with the SATA controller.

     

    Most of the time the errors start like this:

    Jan 12 08:13:59 vbarum kernel: ahci 0000:02:00.1: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000cafd0000 flags=0x0000]
    Jan 12 08:13:59 vbarum kernel: ata4.00: exception Emask 0x10 SAct 0x38000 SErr 0x0 action 0x6 frozen
    Jan 12 08:13:59 vbarum kernel: ata4.00: irq_stat 0x08000000, interface fatal error
    Jan 12 08:13:59 vbarum kernel: ata4.00: failed command: WRITE FPDMA QUEUED
    Jan 12 08:13:59 vbarum kernel: ata4.00: cmd 61/08:78:e8:79:4c/00:00:3c:00:00/40 tag 15 ncq dma 4096 out
    Jan 12 08:13:59 vbarum kernel:         res 40/00:88:90:7a:4c/00:00:3c:00:00/40 Emask 0x10 (ATA bus error)
    Jan 12 08:13:59 vbarum kernel: ata4.00: status: { DRDY }
    Jan 12 08:13:59 vbarum kernel: ata4.00: failed command: WRITE FPDMA QUEUED
    Jan 12 08:13:59 vbarum kernel: ata4.00: cmd 61/08:80:20:7a:4c/00:00:3c:00:00/40 tag 16 ncq dma 4096 out
    Jan 12 08:13:59 vbarum kernel:         res 40/00:88:90:7a:4c/00:00:3c:00:00/40 Emask 0x10 (ATA bus error)
    Jan 12 08:13:59 vbarum kernel: ata4.00: status: { DRDY }
    Jan 12 08:13:59 vbarum kernel: ata4.00: failed command: WRITE FPDMA QUEUED
    Jan 12 08:13:59 vbarum kernel: ata4.00: cmd 61/08:88:90:7a:4c/00:00:3c:00:00/40 tag 17 ncq dma 4096 out
    Jan 12 08:13:59 vbarum kernel:         res 40/00:88:90:7a:4c/00:00:3c:00:00/40 Emask 0x10 (ATA bus error)
    Jan 12 08:13:59 vbarum kernel: ata4.00: status: { DRDY }
    Jan 12 08:13:59 vbarum kernel: ata4: hard resetting link
    Jan 12 08:14:00 vbarum kernel: ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    Jan 12 08:14:05 vbarum kernel: ata4.00: qc timeout (cmd 0xec)
    Jan 12 08:14:05 vbarum kernel: ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
    Jan 12 08:14:05 vbarum kernel: ata4.00: revalidation failed (errno=-5)
    Jan 12 08:14:05 vbarum kernel: ata4: hard resetting link
    Jan 12 08:14:15 vbarum kernel: ata4: softreset failed (1st FIS failed)
    Jan 12 08:14:15 vbarum kernel: ata4: hard resetting link
    Jan 12 08:14:25 vbarum kernel: ata4: softreset failed (1st FIS failed)
    Jan 12 08:14:25 vbarum kernel: ata4: hard resetting link

    The AHCI error is not always the first one and the kernel reports read errors on multiple disks.
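
Since the errors spread across several disks, it may help to tally them per ATA port. That shows whether one link or the whole controller is affected. This is a minimal sketch, assuming the syslog line format shown above; `summarize_ata_events` is a hypothetical helper name, and the sample lines are copied from the excerpt.

```python
import re
from collections import Counter

# Matches "kernel: ata4.00: <message>" as well as "kernel: ata4: <message>".
PORT = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")

SAMPLE_LINES = [
    "Jan 12 08:13:59 vbarum kernel: ahci 0000:02:00.1: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000cafd0000 flags=0x0000]",
    "Jan 12 08:13:59 vbarum kernel: ata4.00: failed command: WRITE FPDMA QUEUED",
    "Jan 12 08:13:59 vbarum kernel: ata4.00: failed command: WRITE FPDMA QUEUED",
    "Jan 12 08:13:59 vbarum kernel: ata4.00: failed command: WRITE FPDMA QUEUED",
    "Jan 12 08:13:59 vbarum kernel: ata4: hard resetting link",
    "Jan 12 08:14:05 vbarum kernel: ata4: hard resetting link",
]


def summarize_ata_events(lines):
    """Count failed commands and hard link resets per port, plus IOMMU page faults."""
    failed = Counter()
    resets = Counter()
    page_faults = 0
    for line in lines:
        if "IO_PAGE_FAULT" in line:
            page_faults += 1
        match = PORT.search(line.rstrip("\n"))
        if not match:
            continue
        port, message = match.groups()
        if message.startswith("failed command:"):
            failed[port] += 1
        elif message.startswith("hard resetting link"):
            resets[port] += 1
    return failed, resets, page_faults


if __name__ == "__main__":
    failed, resets, page_faults = summarize_ata_events(SAMPLE_LINES)
    print("failed commands:", dict(failed))
    print("hard resets:", dict(resets))
    print("IO_PAGE_FAULT events:", page_faults)
```

The function accepts any iterable of lines, so an open syslog file can be passed in directly in place of `SAMPLE_LINES`.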

     

    Please help me with fixing the issue. I was unable to make sense of the logs, but I hope that there are more knowledgeable people here.

     

    I have attached full syslogs and hardware profile to this post. Hopefully they are useful in diagnosing the issue.

     

     

     

    syslog.zip hwprofile.txt
