• No SMART reports for some cache SSDs


    luzfcb
    • Minor

    Hello, on my Unraid 6.12.8 server I have two 1 TB NVMe SSDs from the same manufacturer, acting as a cache pool using BTRFS.

    I recently noticed in the logs that apparently one of the SSDs might be dying.

    Mar 13 18:35:13 f kernel: btrfs_end_super_write: 282 callbacks suppressed
    Mar 13 18:35:13 f kernel: BTRFS warning (device nvme0n1p1): lost page write due to IO error on /dev/nvme0n1p1 (-5)
    Mar 13 18:35:13 f kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 326893147, rd 40441477, flush 23342600, corrupt 3, gen 0
    Mar 13 18:35:13 f kernel: BTRFS error (device nvme0n1p1): error writing primary super block to device 1
    Mar 13 18:35:13 f kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 326893148, rd 40441477, flush 23342600, corrupt 3, gen 0
    Mar 13 18:35:13 f kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 326893149, rd 40441477, flush 23342600, corrupt 3, gen 0
    Mar 13 18:35:13 f kernel: BTRFS warning (device nvme0n1p1): lost page write due to IO error on /dev/nvme0n1p1 (-5)
    Mar 13 18:35:13 f kernel: BTRFS error (device nvme0n1p1): error writing primary super block to device 1
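    For reference, BTRFS keeps the cumulative per-device counters shown in those `errs: wr ... rd ...` lines, and they can be read with `btrfs device stats`. A minimal sketch of flagging the nonzero counters, using a captured sample of that output so the script is self-contained (on the live pool you would pipe in `btrfs device stats /mnt/cache` instead):

    ```shell
    #!/bin/sh
    # Sketch: flag nonzero BTRFS per-device error counters.
    # "sample" below mimics `btrfs device stats /mnt/cache` output; on a
    # real system, pipe the actual command output into the awk filter.
    sample='[/dev/nvme0n1p1].write_io_errs    326893149
    [/dev/nvme0n1p1].read_io_errs     40441477
    [/dev/nvme0n1p1].flush_io_errs    23342600
    [/dev/nvme0n1p1].corruption_errs  3
    [/dev/nvme0n1p1].generation_errs  0
    [/dev/nvme1n1p1].write_io_errs    0
    [/dev/nvme1n1p1].read_io_errs     0'

    # Print every counter that is greater than zero.
    printf '%s\n' "$sample" | awk '$2 > 0 { print "NONZERO:", $1, $2 }'
    ```

    Counters this high usually mean the device has been failing writes for a long time; `btrfs device stats -z` resets them after the underlying problem is fixed.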



    [screenshot attached]

    When I click the device name, go to the Self-Test tab, and click Download, a .zip file is downloaded containing a TXT file with the result of the SMART analysis.

    It turns out that only for one of the SSDs does the TXT file contain the SMART report; for the other SSD, the TXT file contains only:

     

    smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.1.74-Unraid] (local build)
    Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
    
    Smartctl open device: /dev/nvme0 failed: No such device
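    The "No such device" error suggests the GUI is still pointing smartctl at `/dev/nvme0` even though that controller node is gone. As a sketch, one could enumerate the NVMe controller nodes that actually exist before invoking smartctl; the `list_nvme` helper and its directory parameter are mine, added so the logic can also be exercised against a fake `/dev`:

    ```shell
    #!/bin/sh
    # Sketch: only run smartctl against NVMe controller nodes that exist.
    # list_nvme scans a directory for nvme controller nodes (nvme0, nvme1,
    # ...); the directory is a parameter purely for testability.
    list_nvme() {
        dir=$1
        for d in "$dir"/nvme[0-9]; do
            # Guard against the glob not matching anything (literal remains).
            [ -e "$d" ] && printf '%s\n' "$d"
        done
    }

    # On a real system:
    #   for dev in $(list_nvme /dev); do smartctl -H "$dev"; done
    list_nvme /dev
    ```

    On the affected server this would print only `/dev/nvme1`, confirming the controller itself (not just the partition) has disappeared.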

     

    Questions:

    1 - Is this a bug?
    2 - Why is apparently only one of the SSDs being used? (Cache 2 is used but the )
    3 - In the Self-Test tab of Cache 2, clicking SMART error log displays this error:

    Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
      0        106     0  0x0001  0x4004      -            0     0     -  Invalid Field in Command


    Why is the error not counted in "Pool Devices" -> "Errors" column?



    Some extra information:

     

    root@f:~# cat /etc/unraid-version 
    version="6.12.8"


     

    root@f:~# mount | grep nvm
    /dev/nvme0n1p1 on /mnt/cache type btrfs (rw,noatime,ssd,discard=async,space_cache=v2,subvolid=5,subvol=/)

     

    root@f:~# ls /dev/ | grep nvm
    nvme1
    nvme1n1
    nvme1n1p1
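    Since `/dev/nvme0*` has vanished entirely while `nvme1` remains, the controller has dropped off the bus. As a general Linux sketch (not Unraid-specific advice): the kernel exposes `/sys/bus/pci/rescan`, which can sometimes re-detect a dropped device without a reboot, though a full power cycle is often still required:

    ```shell
    #!/bin/sh
    # Sketch: request a PCIe bus rescan so a dropped NVMe controller may be
    # re-probed. /sys/bus/pci/rescan is a standard Linux sysfs interface;
    # whether the drive actually comes back depends on why it dropped.
    pci_rescan() {
        knob=/sys/bus/pci/rescan
        if [ -w "$knob" ]; then
            echo 1 > "$knob"        # triggers the rescan (needs root)
            echo "rescan requested"
        else
            echo "cannot write $knob (not root, or sysfs unavailable)"
        fi
    }

    pci_rescan
    # Afterwards, check whether the missing controller reappeared:
    #   ls /dev/ | grep nvm
    ```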



    Cache 1:

    [screenshot attached]

    Cache 2:

    [screenshot attached]


    User Feedback

    Recommended Comments

    Quote

    It turns out that only on 1 of the SSDs, the TXT file contains the SMART test report, on the other SSD the TXT file only contains

     

    That, and all the other errors above, suggest the device dropped offline; please post the diagnostics.

    Link to comment

    Same issue here. RAID-1 SSD cache, 2 drives. One seems to be completely gone; there is no device under /dev anymore.

    Unraid reports "Healthy" and did not send any notification. BTRFS seems to be shot; again, no error is visible anywhere.

    This is highly dangerous behavior, to be honest. How can Unraid report "Healthy" SMART when there isn't even any SMART data?

    [screenshot attached]

     

    [screenshot attached]

     

    The only indication is the missing temperature and the weirdly low number of reads/writes. The drive is not in the system anymore.

     

    Notice the lack of /dev/sdf:

    [screenshot attached]

     

    The discrepancy becomes visible in the UI as well, when looking at the SMART attributes for that drive:

    [screenshot attached]

     

    Again, how is this reported as "Healthy" with no notification whatsoever... I probably lost significant data, because the first drive of the cache seems to have a corrupted BTRFS as well.

     

    The file system check in the UI seems useless as well. The UI shows this (currently running):

    [screenshot attached]

     

    0 errors; nice, isn't it.

     

    This is the current log; /dev/sdd seems shot too:

    [screenshot attached]

     

    Tens of thousands of BTRFS errors...

     

    At this point I don't trust the UI for either SMART or BTRFS status anymore.

     

    A serious problem, from my point of view.

     

    Maik.

     

    Link to comment
    15 minutes ago, Maik75 said:

    Tens of thousands of BTRFS errors...

    Same issue: one of your devices dropped offline. Unraid does not currently monitor pool devices; it's an old feature request of mine. For now, see here for better pool monitoring.
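    Until such monitoring lands natively, a user script along these lines could watch the pool's cumulative error counters and raise an alert. This is only a sketch: the pool path and the `notify` helper location and flags are my assumptions about a stock Unraid install, so verify them on your own system before relying on this:

    ```shell
    #!/bin/sh
    # Sketch of a pool-monitoring user script (e.g. run on a cron schedule
    # via the User Scripts plugin). Counts nonzero counters in
    # `btrfs device stats` output and raises an Unraid notification.
    # ASSUMPTIONS: /mnt/cache is the btrfs pool, and
    # /usr/local/emhttp/webGui/scripts/notify is Unraid's notify helper.
    POOL=/mnt/cache
    NOTIFY=/usr/local/emhttp/webGui/scripts/notify

    # count_nonzero reads `btrfs device stats` style lines on stdin and
    # prints how many counters are above zero.
    count_nonzero() {
        awk '$2 > 0' | wc -l
    }

    errors=$(btrfs device stats "$POOL" 2>/dev/null | count_nonzero)
    if [ "${errors:-0}" -gt 0 ]; then
        msg="$errors nonzero BTRFS error counter(s) on $POOL"
        if [ -x "$NOTIFY" ]; then
            "$NOTIFY" -i alert -s "BTRFS pool errors" -d "$msg"
        else
            echo "ALERT: $msg"
        fi
    fi
    ```

    Note this only catches accumulating BTRFS errors; a device that drops cleanly off the bus would also need a check that its /dev node still exists.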

    Link to comment

    I could accept the BTRFS issue; however, what about the "SMART Healthy" information? If you display utterly wrong information, then why display SMART for pool devices at all? This is misleading at the least.

    Your own UI display for SMART details shows "failed", by the way; this I consider a bug. Why show "failed" in one part of the UI and "Healthy" in another?

     

    Edited by Maik75
    Link to comment
    7 minutes ago, Maik75 said:

    however what about the "SMART Healthy" information?

    Since the device dropped offline there is no SMART data; you need to reconnect the device to get that again.

    Link to comment

    Yes, I understand that. Displaying "Healthy" for the drive in question, however, seems like a serious UI bug, wouldn't you agree?

     

    Link to comment
    1 hour ago, Maik75 said:

    Displaying "Healthy" for the drive in question, however, seems like a serious UI bug, wouldn't you agree?

    I do; the only clue you get for now when a pool device drops is that it stops showing the temp.

    Link to comment



