
[Solved] Widespread file corruption on multiple disks (5.0.5)



Basics:

 

v5.0.5, Plus license

5 disks, 4TB ea, plus a cache drive (120GB SSD)

 

I switched from a 4.x server to a new 5.x server about two years ago. It's becoming clear to me that I have widespread file corruption on at least two of the four data disks. The corruption is typically only seen in large files (250MB and larger), but it could very well exist everywhere and I'm just not seeing it. I've had issues with video files that used to play fine (unable to copy the file, or artifacts that weren't there before), and now that I'm reinstalling a bunch of software on a new PC, many of my installation archives are bad - they simply fail CRC checks or other pre-install verifications of container integrity. It is a "current" issue, not a past one or a data transfer issue from the old server, as I have verified that recent files (placed in the last 3 months) are also corrupt.

 

Most of the data is commercial or re-rippable, but there are personal files on there as well, and - being 1-2 years into this server - almost everything from the distant past has been overwritten on backups, and the new backups are likely to be corrupted. (Edit: a parity check runs every month on the 1st, which suggests to me that the errors are introduced at write time.)

 

My question is: how do I go about testing to determine the cause of the corruption? Is there a simple way to isolate whether it's an OS issue, a disk issue, a memory issue, a network error, or some other error? I'm at a loss for efficiently assessing the likely causes and getting them corrected. Any suggestions are appreciated. Note that I rarely work with Linux - I'm perfectly comfortable with the command line, but I primarily work in Windows, so most of the internals of unRaid are fairly black-box to me. TIA

 

Edit: my only reason, currently, for not moving to v6 is to avoid making things worse than they already are. If upgrading will make the troubleshooting easier, then I'll do that first.

Link to comment

Syslog has been rotated, is useless, only repetitive garbage.  Need one shortly after booting, either fresh or from backup.

 

The last SMART report (labeled as the Cache drive) is actually from Disk 4 (serial ending in 9W8).

 

Cache/Disk4 has or has had a bad SATA cable, but probably not related to your corruption.

 

You haven't provided any info on your hardware.  Any chance this is an nForce motherboard (nForce4 or lower)?

Link to comment

Thanks for your patience...the corrected smart for the cache is attached.

 

System:

Gigabyte GA-78LMT-S2P [southbridge = AMD SB710 -> 6 x SATA 3Gb/s onboard]

AMD Bios 3.0.A

AMD FX8320 CPU

2 x G.Skill 4GB DDR3-1333 RAM

NetGear GA311 (rev A1) 1000bT NIC (onboard NIC is not recognized in unRaid)

ICY Dock MB155SP-B 5 disk cage (for data and parity)

XFX 550W power supply (20A 5V rail, 45A 12V rail)

 

The system hasn't been physically moved in at least a year, probably more.

 

I've re-booted. In the interim I ran a Memtest once through, using all CPU cores, and the results were zero errors.

I reseated all the drives and checked (physically) the cable on Disk 4. It's locking-style, and seemed to be seated securely (not that it would matter if the cable was dodgy).

 

Note that Disk 4 is currently not used for storage. It is set up as an overflow for Disk 2.

Disk 1 has known corruption (executable files and archives which are failing CRC or other checks),

Disk 2 I also suspect has data corruption, but it's all movies so it's harder to tell.

Disk 3 is either stuff I never use (archived/non-active DVD rips) or downloaded TV - and the quality of that is poor in the best of times ;-)

 

Attached is the correct cache SMART, a new Syslog created after bootup, then a second from late in the day (after all of the testing below).

 

 

---------------

Further testing done today:

Note: I know almost nothing about how an MD5 hash is created, so all of this may have just been pissing in the wind, but I'm going to post it just in case there is something useful...

 

I tried transferring two test files (a 500MB zip file full of drivers, and a 4kB text file) and computed an MD5 hash of each file after every transfer.

 

500MB file:

PC - PC -> Pass (using FVIC utility)

PC - Disk 1 -> fail (using md5sum in telnet session)

PC - Disk 3 -> fail (using md5sum in telnet session)

PC - Disk1 - PC -> fail

PC - Disk3 - PC -> fail

  *After this test I removed the cache drive from the array and restarted the tests, just to make sure I wasn't hitting the cache only

Disk1 - Disk3 -> Pass (using telnet session using cp, md5sum)

 

4kB file

 

PC - PC -> Pass (using FVIC utility)

PC - Disk 1 -> Pass (using telnet and md5sum)

PC - Disk 1 - PC -> Pass

PC - Disk 3 - PC -> Pass
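For anyone wanting to repeat this kind of check from a script rather than with FVIC and md5sum, the comparison can be sketched roughly as below. This is only an illustration, not what was actually run for the tests above: the function names `file_digest` and `same_content` are made up for the example, and it assumes Python is available on the machine doing the check. The file is read in chunks so that a multi-gigabyte file does not have to fit in memory.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in 1 MiB chunks.

    Hypothetical helper for illustration; any algorithm name that
    hashlib supports (e.g. "md5", "sha1") can be passed in.
    """
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def same_content(path_a, path_b, algorithm="md5"):
    """True if both files hash to the same digest."""
    return file_digest(path_a, algorithm) == file_digest(path_b, algorithm)
```

A "Pass" in the lists above corresponds to `same_content(original, copy)` returning True; a "fail" means the two digests differed.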

 

Tests which failed were re-run after switching the port on the switch, all failed

Tests which failed were re-run after switching the cable to the switch, all failed

Tests which failed were re-run after switching the server to the main router, all failed

Tests which failed were re-run after switching the PC to a new cable and connected to the main router, all failed

 

As a network test: I copied the files to a second laptop (Windows sharing) through the main router. The 500MB file failed MD5 and SHA1; the 4kB file passed. I then put two laptops on a separate router, isolated from the rest of the network, and the file transferred correctly. Now that I know that works, tomorrow I'll try isolating other parts of the network.

Syslogs.zip

Link to comment

Alright - first, thank you for taking a look at this, and I apologize for being a bit paranoid.

 

Upon further testing, it seems that my *primary* machine has either a bad network cable or a bad network port. SOB. I do so much work on that machine I just never suspected it was wonky, and I rarely transfer big files except at that machine because it's on the gigabit switch with the server (so I get great throughput).

 

You'll have to excuse me while I go collect the hair (I've been pulling out) that's all over the office and see about gluing it back onto my head.

Link to comment

I may have spoken too soon. I had success with files in the 800MB range, but I'm still having issues with network transfers of larger files (1.5+GB). To eliminate network gear, I connected directly between the server NIC and the client computer. That still produced CRC errors. The errors occur in both directions (to and from the server) and occur over both SMB and FTP.  Multiple copies of the same file result in different checksums.
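In case it helps anyone reproduce the "multiple copies, different checksums" symptom, here is a rough sketch of a repeated-copy test. It is not the procedure used above, just one way it could be scripted: `repeat_copy_check` and `sha1_of` are hypothetical names, and `dest_dir` is assumed to be a mounted share or local folder on the path being tested. One caveat: reading a fresh copy straight back may be served from the client's cache instead of the wire, so a clean result is weaker evidence than a failure.

```python
import hashlib
import os
import shutil


def sha1_of(path, chunk_size=1024 * 1024):
    """SHA-1 hex digest of a file, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def repeat_copy_check(src, dest_dir, rounds=5):
    """Copy src into dest_dir several times and hash every copy.

    Returns (expected_digest, mismatches), where mismatches is a list
    of (copy_path, digest) pairs whose digest differs from the source.
    """
    expected = sha1_of(src)
    mismatches = []
    for i in range(rounds):
        dest = os.path.join(dest_dir, f"copy_{i}.bin")
        shutil.copyfile(src, dest)
        got = sha1_of(dest)
        if got != expected:
            mismatches.append((dest, got))
    return expected, mismatches
```

On a healthy path the mismatch list stays empty. Mismatches whose digests also differ from each other point at random corruption in transit rather than a file that was already bad at the source.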

 

I have a new (Intel) network card on order, but I'm open to any other suggestions. 

Link to comment

I suspect (and am hoping) that the Intel NIC may solve all your problems.  I feel for you, I've been in a similar situation, except the defective chip was on the motherboard.  The system is basically unusable if you can't trust basic operations.  As you know, all memory accesses, bus transfers, and network transfers HAVE to be 100.00000% perfect.  I spent many many hours slowly isolating the source of my corruption, figured out it was a bus-related 4 byte register somewhere on the motherboard, and replaced the motherboard.  My errors were on the order of 1 bit per gigabyte transferred, so transferring small files was useless for testing.  I had to transfer and file compare multi-gigabyte files, over and over.
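To put a rough number on why small files are useless for this kind of testing: if you assume (and it is only an assumption) that errors land independently at an average rate of about one per gigabyte, the chance that a file of a given size picks up at least one error follows a simple Poisson estimate. The function name `p_corrupt` and the 1-per-GB default are illustrative, not measured values for this system.

```python
import math


def p_corrupt(size_bytes, errors_per_gb=1.0):
    """Probability that a transfer of size_bytes contains >= 1 error.

    Assumes independent errors at a constant average rate (Poisson
    model), with 1 GB taken as 1e9 bytes. Illustrative only.
    """
    expected_errors = errors_per_gb * size_bytes / 1e9
    # 1 - exp(-x), written with expm1 to stay accurate for tiny x
    return -math.expm1(-expected_errors)


if __name__ == "__main__":
    for label, size in [("4 kB", 4096), ("500 MB", 500e6), ("1.5 GB", 1.5e9)]:
        print(f"{label:>7}: {p_corrupt(size):.6f}")
```

Under that assumption a 4 kB file is hit only a few times per million transfers, a 500 MB file roughly 39% of the time, and a 1.5 GB file roughly 78% of the time, which fits the pattern of small files always passing while large ones fail.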

 

I took a look at your syslogs.  The SATA cable to Disk1 (Seagate 4TB with serial ending L3Z) needs to be replaced; it is causing BadCRC errors.  These are NOT related to your file corruption, as they are caught, reported, then retried until a perfect packet is sent, so they are not the source of CRC failures elsewhere.  They do result in slowdowns in drive accesses though.

 

At one point, the network (Realtek chipset) was popping in and out.  Once you install the Intel, you'll want to disable this onboard Realtek one.

Link to comment

The Intel NIC looks like it has done the job. I've done 70-80GB of transfers with zero errors to multiple disks and in both directions (reading from and writing to the server). As a bonus, my read transfers are now about double the speed they were with the old NIC.

 

Thanks for the suggestion to disable the onboard NIC - I'm sure I said some choice words about it not being supported back when I installed the board, but didn't actually disable it in the BIOS when I installed the other card. I've also replaced the drive cable on Disk 1 - thanks for that!

 

 

Link to comment
