• [6.11.5] Silent file corruption when using the nvme cache (trimmed file, parts of file replaced with zeroes)


    vasiby
    • Urgent

    I was copying my movie library from a Synology NAS to the Unraid box. The Unraid share was mounted on the Synology over SMB, and a 1.9TB copy was started from the Synology web GUI. At the end of the 8-hour process, the Synology reported success and no errors.

     

    By chance I noticed that a movie file on the Unraid share was corrupted. I then spent several days deleting the entire MOVIES folder and repeating the transfer to reproduce the corruption.

    After everything was copied I used FreeFileSync to check if the contents match.

     

     

    My cache pool is two identical 1TB NVMe drives in encrypted btrfs RAID1, and the share's cache setting is 'Yes'.

     

     

    1. Silent corruption: the file sizes are identical, but part of the content is replaced with zeroes.

     

    In the Doctor Strange movie, bytes 282700000 - 2827FFFFF (hex offsets, exactly 1 MiB) are all zeroes, while the original file on the Synology has the correct data.

     

    image.thumb.png.7eaec09ba455ea89a551e0082a9d4a55.png

     

    In another movie (Justice League), the zeroes were in the range 264038000 - 264137FFF (again exactly 1 MiB).
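
    A minimal Python sketch of how zero-filled regions like the two above could be located without a hex editor. The names find_zero_runs and run_length, the 4 KiB block size and the 64 KiB reporting threshold are illustrative assumptions, not part of any tool used in this thread. Both reported ranges are 0x100000 bytes long and start on a 4 KiB boundary, so a block-granular scan is enough to catch them. A long run of zeroes can also be legitimate file content, so a hit only means the range is worth comparing against the original.

```python
def find_zero_runs(path, block_size=4096, min_blocks=16):
    """Return (start, end) byte offsets, end inclusive, of long all-zero runs.

    Only whole blocks are tested, so runs are reported at block granularity.
    """
    zero_block = bytes(block_size)
    min_len = block_size * min_blocks
    runs = []
    run_start = None  # offset where the current zero run began, if any
    offset = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            is_zero = block == zero_block[:len(block)]
            if is_zero and run_start is None:
                run_start = offset
            elif not is_zero and run_start is not None:
                if offset - run_start >= min_len:
                    runs.append((run_start, offset - 1))
                run_start = None
            offset += len(block)
    # A run that reaches the end of the file is closed here.
    if run_start is not None and offset - run_start >= min_len:
        runs.append((run_start, offset - 1))
    return runs


def run_length(start, end):
    """Length in bytes of an inclusive offset range."""
    return end - start + 1


if __name__ == "__main__":
    # The two ranges reported above are each exactly 1 MiB long.
    print(hex(run_length(0x282700000, 0x2827FFFFF)))  # 0x100000
    print(hex(run_length(0x264038000, 0x264137FFF)))  # 0x100000
```

    Running find_zero_runs on both the copy and the original and diffing the two lists would separate real corruption from zero runs that belong to the file.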

     

    The file verification was performed AFTER the mover had copied the files into the HDD.

     

     

    2. In another test transfer a movie file (the Police Academy movie) was truncated, probably because it couldn't fit into the cache: it was 2.6GB in the Unraid cache, but the original is 5.8GB. No error was reported to the Synology, so it continued with the copy as if nothing had happened and reported a successful transfer.

     

    image.thumb.png.49607b5ce3e9da3276f4b049e598bda9.png

     

     

    The diagnostics are from the truncated-file test. In that test there were no zeroes inside other files, just the truncation.

     

     

    The main problem is that no copy errors were reported, so the only way to catch the issue is to compare the files by content.
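
    As a rough illustration of what "compare by content" means here, the following Python sketch walks a source tree and checks each file against its copy by size and SHA-256 digest. It is not what FreeFileSync does internally; the function names file_digest and compare_trees and the result labels are assumptions made for this example.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large movies fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_trees(src_root, dst_root):
    """Return a sorted list of (relative_path, problem) for every bad copy."""
    problems = []
    for dirpath, _dirs, files in os.walk(src_root):
        for name in files:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            if not os.path.isfile(dst):
                problems.append((rel, "missing"))
            elif os.path.getsize(src) != os.path.getsize(dst):
                # Catches truncation such as 2.6GB instead of 5.8GB.
                problems.append((rel, "size mismatch"))
            elif file_digest(src) != file_digest(dst):
                # Catches same-size corruption such as a zeroed 1 MiB range.
                problems.append((rel, "content mismatch"))
    return sorted(problems)
```

    The size check alone would have caught the truncated file, but only the digest comparison catches the zeroed ranges, because those files keep their original size.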

     

     

    If the cache is set to 'No', the entire 1.9TB is transferred without errors, and every file is correct and verified.

     

    vs-tower-diagnostics-20230506-1354.zip





    Not sure if it is related to your error, but there are a lot of lines like this in the syslog:

    May  6 00:21:58 vs-tower  shfs: share cache full

    You do not have a Minimum Free Space value set for your ‘cache’ pool. It should be set to more than the largest file you expect to write, as file systems getting too full can have unpredictable effects.
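
    To show why the value should exceed the largest expected file, here is a deliberately simplified Python model. It assumes the destination is chosen once, when a new file is created and before its final size is known; that assumption, the function names pick_target and fits, and the 2.6 GB free-space figure are all hypothetical and are not taken from Unraid's actual code.

```python
GB = 10**9


def pick_target(cache_free_bytes, min_free_bytes):
    """Decide where a NEW file starts, before its final size is known."""
    return "cache" if cache_free_bytes > min_free_bytes else "array"


def fits(free_bytes, file_size_bytes):
    """True if the whole file can be stored in the remaining space."""
    return file_size_bytes <= free_bytes


if __name__ == "__main__":
    cache_free = int(2.6 * GB)   # hypothetical space left on the pool
    movie_size = int(5.8 * GB)   # size of the file about to be written

    # Minimum free space 0: the file starts on the cache and cannot fit.
    print(pick_target(cache_free, 0), fits(cache_free, movie_size))

    # Minimum free space 10 GB: the file is sent to the array instead.
    print(pick_target(cache_free, 10 * GB))
```

    Under this model a threshold of 0 never diverts a file, so the last file written to a nearly full pool is the one that runs out of space.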

     

    There are also a lot of lines of this form in the syslog:

    May  6 03:05:19 vs-tower kernel: CIFS: __readahead_batch() returned 870/1024

    I am not sure what they mean, but they are not normal.

     

    Neither of these, though, explains why there should be corruption :) Does a scrub detect the corruption? Since the pool is formatted as btrfs, it should automatically have checksums it can use to check integrity.

     

    About the CIFS: __readahead_batch() messages:

     

    I only see them when I mount an SMB share using Unassigned Devices and pass it into a Docker volume as read/write-slave. I only do that to verify the already transferred files using the FreeFileSync Docker container in Unraid. When I copy files from the Synology I don't see such messages and don't mount anything with the Unassigned Devices plugin.

    I will set the minimum free space and try the transfer again.

     

     

    About the 1 megabyte of zeroes inserted into a file:

    I remember that I was getting warnings that the NVMe was hot (50C or more), so while the Synology was transferring data I went to the cache settings and changed the Warning/Critical disk temperature threshold (°C). Maybe changing this setting did something to the NVMe and it just wrote a megabyte of zeroes.

     

    It should not be a SATA controller issue, because the mover placed the 2 movies on 2 different HDDs: one connected to the motherboard, the other to the PCIe LSI Broadcom controller. Both ended up with a megabyte of zeroes in the middle.

     

    When I didn't touch any cache settings, the only issue was the file size; I didn't find any files with an "empty" region.

     

    The SSDs are cheap Silicon Power drives.

    image.thumb.png.5449aebdeab77d45d4040b2ea4ff9981.png

     

     


    This looks more like a device problem, if you transfer directly to the array do you still find any corruption issues?


    If I set cache to "No" then there's no corruption. The main issue is that neither the file truncation nor the zero block insertion is detected in any way, even in RAID1, and the copy isn't aborted with an error. If cache is set to "prefer", then the Synology gets an error that there's no free space and aborts the copy.

     

    Yesterday I tested with the Minimum free space set to 10GB on the cache, and the 1.9TB were transferred correctly. In previous tests it was set to 0 (the default).

    Edited by vasiby
    57 minutes ago, vasiby said:

    If cache is set to "prefer" then synology gets an error that there's no free space and aborts the copy.

    Cache=yes and Cache=prefer behave the same in that regard: data is written to the cache. If a minimum free space is set, the transfer will overflow to the array; if not, the cache will get completely full and the transfer will abort with an ENOSPC error.
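
    For illustration, this is roughly what a copy routine that honours ENOSPC looks like on the client side. It is a Python sketch of a careful local copy, not a description of what the Synology or Windows actually do over SMB; the names copy_checked and CopyFailed are made up for this example.

```python
import errno
import os


class CopyFailed(Exception):
    """Raised when the destination does not end up as a full copy."""


def _remove_partial(path):
    """Best-effort removal of an incomplete destination file."""
    try:
        os.remove(path)
    except OSError:
        pass


def copy_checked(src, dst, chunk_size=1 << 20):
    """Copy src to dst; raise CopyFailed instead of leaving a partial file."""
    with open(src, "rb") as fin:
        try:
            with open(dst, "wb") as fout:
                while True:
                    chunk = fin.read(chunk_size)
                    if not chunk:
                        break
                    fout.write(chunk)
                fout.flush()
                # Ask the OS to commit the data so that delayed write
                # errors are reported here rather than lost.
                os.fsync(fout.fileno())
        except OSError as exc:
            _remove_partial(dst)
            if exc.errno == errno.ENOSPC:
                reason = "no space left on device"
            else:
                reason = str(exc)
            raise CopyFailed(f"{dst}: {reason}") from exc
    src_size = os.path.getsize(src)
    dst_size = os.path.getsize(dst)
    if src_size != dst_size:
        _remove_partial(dst)
        raise CopyFailed(f"{dst}: wrote {dst_size} of {src_size} bytes")
    return dst_size
```

    The final size check is the cheap safeguard that would flag a 2.6GB copy of a 5.8GB file even if no write error was ever reported to the client.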


    Looks like in my case the abort didn't happen: the Synology just kept copying the file, and the rest of the movie library did overflow to the array.

    Edited by vasiby

    I don't know how that could be an Unraid issue. Please retry the transfer from a PC; if that behaves as expected, retry with the Synology. What were you using to transfer the files?


    I've repeated the transfer several more times. There are no problems if the minimum free space is set to 10G or 100G, so that some free space is left on the pool.

    When it's set to 0, the movie can't fit and only half of it ends up in the cache, but the Synology continues copying, never receives the abort error, and reports success at the end.

    1364213393_Screenshot2023-05-09122844.png.20b392d12585598f1d5fa9b6ab3cb4c2.png

     

    I couldn't reproduce the 1 megabyte of zeroes again.

    Edited by vasiby

    I mounted the Synology and Unraid folders (1.9TB) on a Windows 11 PC, and it aborted the copy when the cache was full; there's also no partially copied file left in the cache. I don't know why it asks for 892GB of free space when the folder itself is 57.7GB, but at least I got the error, unlike with the Synology.

     

    image.png.143dec6f71d0f5fd59ff8bf0d923e006.png

     

    Also I get this in syslog when copying from windows
    image.thumb.png.5ae205f686b614583cbca441e59f3c24.png

     

     

    Edited by vasiby

    You can ignore those SMB warnings. The Synology should also abort the copy; if it doesn't, it's probably an issue with the Synology or with the application you are using to make the copy.
