• [6.11.1] NVMe Cache Controller (970 PRO)


    Kltc
    • Closed

    Hi.

     

    The NVMe controller cache seems to be malfunctioning, so I'm posting a report, as requested.

    I'm also going to post everything that has led up to this.

    image.png.7546f7de71e7e79e0d85ded2a2726b88.png

     

    So ... I recently upgraded from 6.10.3 to 6.11.0. It apparently had no issues, so I upgraded right away to 6.11.1, finished the upgrade, and then went to visit my mother (2-3 days). When I came back, the W11 VM was paused. Worried, the first thing I did was check the logs and the state of the cache drive and, on instinct, back up everything from the cache to the array. That was the right call ...

    Later I found out the cache was corrupted. Very odd, but it can happen. I formatted the drive and restored the files, but the errors kept appearing. However, I started to notice ... things. Like how the corrupted files "moved". Suddenly, the cache's unrecoverable sector count went up from 45 to 225.... So I decided on something drastic (and desperate).

     

    I had a W11 VM with an NVMe passthrough, and that NVMe has the same capacity as the cache.

    So now I'm trying to fix things with it, hence the bug report.

     

    Previous Cache: Samsung 860 EVO (500 GB)

    New Cache: Samsung 970 PRO (500 GB)

     

    As for the diagnostics, they're attached. However, keep in mind that VMs and Dockers are off (I'm still trying to restore the cache).

     

    Post Scriptum: I even booted the VM from the array (I know, I was desperate to get my data off the 970). I didn't notice this error then, but it was giving me problems as well, so I'm wondering if it's even affecting the passthrough. I had no issues on 6.10.3.

     

    cronos-diagnostics-20221017-1751.zip




    User Feedback

    Recommended Comments

    As the kernel message suggests, try adding these options to syslinux.cfg: on the main GUI page click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right), and add this to your default boot option, after "append initrd=/bzroot":

    nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

    e.g.:

    append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off


    Reboot and see if it makes a difference.
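
    After the reboot, it can be worth confirming that the two options actually reached the running kernel. Below is a minimal sketch, assuming Python 3 is available on the machine (it may not be on a stock install; in that case the string printed by `cat /proc/cmdline` can be checked on another computer). The names `parse_cmdline` and `missing_params` are hypothetical helpers made up for this illustration, and the parsing does not handle quoted values that contain spaces.

```python
from pathlib import Path

# The two options suggested in the comment above.
EXPECTED = {
    "nvme_core.default_ps_max_latency_us": "0",
    "pcie_aspm": "off",
}


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into a {name: value} mapping.

    Tokens without "=" get an empty string as their value.
    """
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def missing_params(cmdline: str, expected: dict = EXPECTED) -> list:
    """Return the expected option names that are absent or set differently."""
    params = parse_cmdline(cmdline)
    return [name for name, value in expected.items() if params.get(name) != value]


if __name__ == "__main__":
    # /proc/cmdline holds the options the running kernel was booted with.
    current = Path("/proc/cmdline").read_text()
    missing = missing_params(current)
    print("all set" if not missing else f"not applied: {missing}")
```

    If an option is reported as not applied, the likely cause is that the "append" line was edited on a boot entry other than the default one.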


    Hi. I did.... as stated in the previous comment.

    Even after that change, the problem repeats.

     

    When I try to access the cache:

    image.png.8b18ab122ae2ea8dc5e64a195d4df3ef.png

     

    I'll try for the 3rd time, and get back here.

    2 minutes ago, Kltc said:

    I did.... as stated in the previous comment.

    Sorry, missed that part. If that doesn't help, it could be a board/BIOS/kernel issue; I use the same device with v6.11.1 without any issues, and didn't need to disable the low-power options.

    6 minutes ago, JorgeB said:

    Sorry, missed that part. If that doesn't help, it could be a board/BIOS/kernel issue; I use the same device with v6.11.1 without any issues, and didn't need to disable the low-power options.

    No problem, I appreciate the help. I'm completely lost on what I could do now, as I saw no particular changes from 6.10.3 (but I might be mistaken). I didn't change the BIOS and it's the same board (ASUS Prime X470 PRO), so I'm even considering whether I could / should reinstall the flash (my backup is already of 6.11.1).


    Well, thanks but ... No luck so far (still with the pcie_aspm option off as well as the latency one).

     

    I just updated the BIOS from the 2018 version to the latest one, tried another restore to the cache, and got the same errors:

    image.thumb.png.a7de0dd7622faf3ff74b5e9300e1bddd.png

    Even though it took more time to throw an error.

     

    I find it even weirder that I can't even see what's on the drive when this happens.

    image.png.d38b222ed27d646a72a4042b90f051fa.png

     

    I have no idea if I can do a simple downgrade from 6.11.1 to 6.10.3, but that might be the best option here; I'm completely out of ideas.

    7 hours ago, Kltc said:

    I find it even weirder that I can't even see what's on the drive when this happens.

    The device is dropping offline, so that part is normal. The question is why it's dropping with just a kernel change; since we know it's not a device problem, I suspect some board compatibility/BIOS issue.


    Hi. So ... here's the status update, and a few more comments.

     

    TLDR: The NVMe controller on the motherboard is gone. Not dead, but worse, as it even corrupts the filesystem. The fix? A PCIe NVMe adapter.

     

    Starting from where we left off, the first thing I did was revert to 6.10.3. Good idea, but ... same errors. Very, very odd. I did a few searches, and most of the results referred to a bogus motherboard error, which looked the same. The difference here is that their motherboards were new and mine isn't (4 years old). Still, I decided to get a PCIe to NVMe adapter board, and ... it works. No errors, all fine. The system still needed some repairs, but I managed to repair it without major issues. And no more corrupted blocks.

    Now I'm trying to reach ASUS to find out why I can only see one NVMe and not the two on the adapter (already went to the BIOS to check). Might even get my W11 VM 100% back, but let's see. I do see a light at the end now, just hope it's not a car or a train.

     

    Again, thanks for the help.


    Hey,

     

    It's the ASUS PRIME X470-PRO. As for the BIOS, it's the latest, version 6042.

    Since I'm writing this, I'll give another status update. It still isn't over. At least it might help someone that might come across this.

     

    So I got the PCIe adapter, but it didn't work. The 970 Evo worked, but the 860 Evo wasn't recognized. I decided to get another NVMe to check (WD Black SN750). Turns out ... the problem was with the 860 Evo. OK, problem solved! (or so I thought).

     

    Later, thinking everything was fixed, I went and tried out a game (Insurgency). "Odd ... the game is slow." Turned on the Riva Statistics ... "The clock is all over the place... Is the GPU gone as well?". This GPU is a GTX 970, old, I know, but it should still get the job done. Still, I tested it in another PC. The GPU is fine. *sigh* Now I'm wondering if my "marvellous" energy grid messed up my PSU. It already fried a microwave and a monitor, so ... now I've bought a PSU, and I'm waiting for it to arrive.

     

    Guess it was a train after all, let's see how heavy it is 😅


    Hey,

     

    Final post. After a lot of tests and rebuilds, I've nailed down the messed-up part. It wasn't the motherboard, it was ... the CPU (Ryzen 2700X).

     

    Apparently (and I have no idea how), the energy grid managed to fry part of the CPU. Meanwhile, with all the parts I've replaced, I've got a new computer. Even the motherboard is working normally now, with another CPU.

     

    Again, thanks for the help, and hope all of this might help someone out.




