Intermittent file corruption on write to unRAID

April 1, 2014

I have an intermittent issue where a file that I move to my server no longer has the correct CRC32 when I read it back from the server. After noticing it happen a few times (more often on larger files), paranoia set in and I started including the CRC32 in the filenames as well.

When a file compares bad, I can re-check its CRC32 multiple times from the server and it is always wrong. Last night, for example, 2 out of about 300 files that I moved had a bad CRC.
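For anyone wanting to automate the same check, here is a minimal sketch. It assumes the CRC32 is embedded in the filename as eight hex digits in square brackets (e.g. `clip [1A2B3C4D].mkv`); the post doesn't say what naming scheme is actually used, so the pattern and the function names are hypothetical.

```python
import re
import zlib
from pathlib import Path

# Assumed naming convention (not stated in the post): eight hex digits
# in square brackets somewhere in the filename, e.g. "clip [1A2B3C4D].mkv".
CRC_TAG = re.compile(r"\[([0-9A-Fa-f]{8})\]")


def file_crc32(path, chunk_size=1 << 20):
    """Return the CRC32 of a file as an unsigned 32-bit integer, reading in chunks."""
    crc = 0
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def expected_crc32(path):
    """Return the CRC32 tagged in the filename, or None if the name has no tag."""
    match = CRC_TAG.search(Path(path).name)
    return int(match.group(1), 16) if match else None


def check_file(path):
    """Return True or False for a tagged file, or None if the name carries no tag."""
    expected = expected_crc32(path)
    if expected is None:
        return None
    return file_crc32(path) == expected
```

Running `check_file` several times against the copy on the server should give the same answer every time if the corruption happened on write rather than on read.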

I copied both of the bad copies back to my machine and compared them to the local source files using HxD's file compare. In both cases, a single bit had been flipped at a seemingly random place within the file.
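The same comparison can be scripted instead of done by hand in a hex editor. This is only a sketch, and `bit_differences` is a made-up name: it walks two files in parallel and reports every bit that differs, so a single flipped bit shows up as exactly one result.

```python
def bit_differences(path_a, path_b, chunk_size=1 << 20):
    """Yield (byte_offset, bit_index, byte_a, byte_b) for every differing bit.

    bit_index 0 is the least significant bit. If the files differ in length,
    only the part they have in common is compared.
    """
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a and not b:
                break
            if a != b:
                for i, (x, y) in enumerate(zip(a, b)):
                    diff = x ^ y  # set bits mark the positions that differ
                    bit = 0
                    while diff:
                        if diff & 1:
                            yield offset + i, bit, x, y
                        diff >>= 1
                        bit += 1
            offset += max(len(a), len(b))
```

For example, `list(bit_differences("source.bin", "copy_from_server.bin"))` (both filenames are placeholders) returns an empty list for identical files.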

This happens when running normally with a parity drive. Out of curiosity, I moved a ton of files with no parity drive in place and the problem still occurred, although the files moved a lot quicker.

Interestingly, if I grab all the files that I have tagged with a CRC32 in the filename and read them back from unRAID, they always check out OK. If a file copies correctly, it seems to stay good on the server.

Now, before I include logs and specs: this has been happening since my first unRAID server. I have replaced my server board/case/cooling/cables/memory/CPU three times since I first bought a key some 4 years ago, and I have experienced it from version 4.x through the latest version 5.

I had always attributed it to maybe a dodgy network card and resigned myself to always performing writes to the array using a write/test approach. I now want to get to the bottom of it because this has persisted through different motherboards, PSUs, and hard drives (originally 1TB and 1.5TB, now all 2, 3 and 4TB).
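A write/test approach along these lines could look roughly like the sketch below. The helper names are hypothetical, and it deliberately leaves a bad copy in place rather than deleting or overwriting it, so the file can still be inspected afterwards.

```python
import shutil
import zlib


def crc32_of(path, chunk_size=1 << 20):
    """Return the CRC32 of a file as an unsigned 32-bit integer."""
    crc = 0
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def copy_and_verify(src, dst):
    """Copy src to dst, read dst back, and report whether the CRC32s match.

    The destination is left untouched on a mismatch so it can be examined.
    """
    expected = crc32_of(src)
    shutil.copyfile(src, dst)
    return crc32_of(dst) == expected
```

A `False` result means the copy on the server differs from the source and needs to be written again.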

I thought it might have something to do with drives spinning up/down, as it seemed that when I browsed to a new share, the current copy would pause for a number of seconds (I assume while a drive spun up) and sometimes that file would end up being corrupt, but I couldn't reproduce it often.

So - I can't afford to keep my drives spinning at all times to avoid the problem. Any suggestions? One thing that is common to my servers is the Adaptec 1430 4-port SATA cards.

As a test, for the last two nights I have been sequentially reading every file that has a CRC32 in its filename, and there has not been a single error reading these successfully copied files. I'm at my wits' end.

I thought Ethernet did a CRC32 on each packet. Is it possible unRAID is not checking the packets at all? Even the gigabit switch has been changed out for a different model.

April 1, 2014

Interesting issue. I had a Seagate 1TB USBv3 2.5" external drive that was driving me insane with the same basic issue -- MOST copies to it were good; but probably 1 out of 100 would be corrupted ... so of course the confidence for ANY file was always low. I finally just tossed the drive -- not worth the hassle. [I had tried various USB ports, both v3 and v2; reformatted it several times, using different block sizes; etc. ... finally just decided the aggravation could be fixed with $80 (for a new drive) ... so that's what I did.]

I suspect, although I didn't think of it until I'd already tossed the drive, that the issue wasn't actually the drive, but the USB bridge device. I SHOULD have removed the drive and tried it via other USB bridge devices and/or external caddies ... but since I didn't I can't say with certainty that was the problem (but I'm pretty sure it was).

The point of that story with regard to your issue is that I suspect you're encountering a very similar issue => and since the only common denominator among all the various systems you've been having this issue with is your 1430SA controllers, you likely have a controller with a bad SATA port -- or possibly just have a cable you need to replace (have you changed the SATA cables as you've evolved the system?).

This can be very tricky to diagnose, but it's easy if you pay VERY close attention to the specifics. The next time you encounter this issue, STOP !! Do NOT delete the bad file off your server until you examine the individual disk shares and determine which specific disk the file is on. Make a note of that. (then you can delete the bad file and copy it to the server again)
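If you'd rather not hunt through the disk shares by hand, something like this could do the lookup. It's a sketch that assumes the usual unRAID layout on the server, where the user share view lives under `/mnt/user` and each data disk is mounted at `/mnt/disk1`, `/mnt/disk2`, and so on; `disks_holding` is a made-up helper name.

```python
from pathlib import Path


def disks_holding(user_path, mnt_root="/mnt"):
    """Return the per-disk paths that hold the file seen at user_path.

    Assumes user shares appear under <mnt_root>/user and the individual
    data disks under <mnt_root>/disk1, <mnt_root>/disk2, and so on.
    """
    mnt = Path(mnt_root)
    # Path of the file relative to the user share root, e.g. "Movies/a.mkv".
    relative = Path(user_path).relative_to(mnt / "user")
    return sorted(
        disk / relative
        for disk in mnt.glob("disk[0-9]*")
        if (disk / relative).is_file()
    )
```

For example, `disks_holding("/mnt/user/Movies/bad_file.mkv")` (a placeholder path) would list the disk share or shares where that file physically sits, which is the detail to write down before deleting it.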

Repeat that process until you have that data for at least 3-4 failed files. If they're all on the same disk drive, then you can stop collecting these details. If not, keep going until you have a few more data points.

Now note the serial number of the drive(s) that had bad copies; shut down; and examine the server to determine exactly which drives those are; and, if there's more than one drive involved, see if they all happen to be connected to the same SATA card.

By now, I'm sure you see where I'm headed. If there's only one drive that has bad files on it, replace the SATA cable to that drive. If that doesn't resolve the issue, plug it into a different SATA port. If several drives have the problem, but they're all connected to the same SATA card, then replace the card.
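Put as a rough sketch, the decision process above looks like this. The function name, the record format (one `(disk_serial, controller)` pair per corrupted file) and the threshold of three failures are all illustrative choices rather than anything exact.

```python
def next_step(failures, minimum=3):
    """Suggest the next action from a list of (disk_serial, controller) pairs.

    Each pair describes one corrupted file: the disk it landed on and the
    SATA controller that disk is attached to.
    """
    if len(failures) < minimum:
        return "keep collecting failures"
    disks = {disk for disk, _ in failures}
    controllers = {controller for _, controller in failures}
    if len(disks) == 1:
        return "replace that disk's SATA cable; if it still fails, try another SATA port"
    if len(controllers) == 1:
        return "replace the SATA card shared by those disks"
    # Failures spread over several cards: nothing has been isolated yet.
    return "no single disk or card stands out; keep collecting failures"
```

For instance, three failures that all land on one disk point at that disk's cable first, and only then at the port.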

etc. It will take some time -- but you've tolerated this long enough. Start collecting the data that will let you isolate where the issue is, and replace the offending component(s) !!

April 1, 2014

Author

Thank you for the suggestion! So far, all corrupt files have been on one disk, but that's the emptiest disk and it has been taking the majority of writes. I will keep track of whenever it happens and see whether it is common to one drive or, if it's on different drives, to one controller.

In any case I bought a set of new SATA cables and am ready to replace willy nilly!

April 1, 2014

With all the corruption on the same disk, it's very likely that you either have a defective SATA port or cables. Clearly I'd replace the SATA cable for that drive -- hopefully that alone will fix your issue.

If not, you then need to change that drive to a different SATA port. Note that if no other drives on the 1430SA that this drive is connected to are having problems, it's likely NOT the controller card (although it could be the specific port the drive is connected to).

I'm sure with a bit of patience and keeping track of the specifics of the bad copies, you'll be able to isolate the cause and get rid of this problem once and for all.

April 1, 2014

With all the corruption on the same disk, it's very likely that you either have a defective SATA port or cables. Clearly I'd replace the SATA cable for that drive -- hopefully that alone will fix your issue.

If not, you then need to change that drive to a different SATA port. Note that if no other drives on the 1430SA that this drive is connected to are having problems, ...

I have 2 of those Adaptec 1430SA controllers in my test box. They won't handle a locking SATA cable, and the four cheap SATA connectors (8 for me) that easily come apart and don't hold the connectors tight are persnickety as all get out. I finally got all 8 ports connected reliably to drives, but I dread the day I go back in to upgrade drives. I highly recommend other controllers to those who have not purchased yet, but for those of us who have them: get 5-in-3s, take your time and hook up the controllers with new SATA cables that fit tight, test each 5-in-3 drive position well, and don't open the computer again for many years!

April 1, 2014

Interesting that you've had issues connecting your cables to 1430SAs. I've got 3 of them in my original unRAID server and have never had any problems with the connections being loose or "persnickety".

I do use locking cables -- even though they won't lock at the 1430SA end, they do on the drives.

April 1, 2014

I remember on my first server having frequent disk errors in the syslog and missing drives after reboots. I bought all new locking cables and painstakingly disassembled the box and hooked up all the drives, testing each one and regression testing the others before moving on. I had several problems requiring me to start over.

I remember there were plastic pieces that were part of the 1430SA ports that kept coming off and getting lost inside the case. And plugging SATA cables in on the end of the card and then turning them upward and sideways to get to the drives caused the cables to want to twist in their slots, causing most of my problems. (Even when I had fixed them in the past, a few days later they had pulled a fraction of a mm loose and lost connection.) I was frustrated the locking cables didn't attach to the 1430s, but I finally shoved the locking cables onto the slots anyway to create enough tension to stop the cables from working themselves loose. That was the key.

All in all - they were not my fav. But they work great once the drives are securely connected.
