Help diagnosing data corruption

February 25, 2009

I have an Intel D945GCLF2 board and WD 1TB GP drives.

I'm getting corruption when I copy large files (I'm testing with 7GB ISO images); at least, that's where it is easiest for me to detect.

I get no errors when copying. I did a parity check, and that came back ok.

I've tried copying with xcopy, robocopy, and the Windows shell. I'm not using user shares, and I'm copying from a Vista64 machine.

To test, I copy the file to the server, then do an fc /b to compare source and copy. Running fc multiple times returns the exact same error list, so I don't think it's a problem with reading, or with flaky hardware on the read side (otherwise I'd expect to get errors at different locations every time I read back the file).
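For anyone who wants to run the same comparison without fc, here is a minimal Python sketch of a byte-for-byte compare that lists where two files differ. The function name, chunk size, and result limit are illustrative assumptions, not anything from the original setup.

```python
def diff_offsets(path_a, path_b, chunk_size=1 << 20, limit=1000):
    """Return up to `limit` (offset, byte_a, byte_b) tuples where the files differ.

    Only the common prefix is compared; if the files have different
    sizes, the extra bytes of the longer file are not reported.
    """
    diffs = []
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while len(diffs) < limit:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                # Walk the mismatching chunk byte by byte.
                for i in range(min(len(a), len(b))):
                    if a[i] != b[i]:
                        diffs.append((offset + i, a[i], b[i]))
                        if len(diffs) >= limit:
                            break
            offset += max(len(a), len(b))
    return diffs
```

Running it twice against the same copy and comparing the two result lists is one way to check that the errors really are at fixed locations.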

Any ideas how to go about figuring out where the problem may lie?

I tried to attach a syslog, but the forum software complained that the upload directory was full.

February 25, 2009

1. Use pastebin.com to upload your syslog and put a link here.

2. Run smartctl reports (follow the troubleshooting link in my sig) on each of your drives and post those results too.

3. Run a memtest (you can boot from the unRAID stick and there is a boot option for running a memory test).

Also, if you are overclocking or undervolting or doing any of those kinds of tricks, reset everything to default.

Some high end memory needs extra voltage to run at rated speed. Check to make sure that the memory voltage is correct.

If you copy exactly the same file again, does it get corrupted in exactly the same way again?

February 25, 2009

You have my sympathies. This brings back very bad memories of very long days lost in trying to track down errors like this.

You may be interested in this thread. It is long, so you may want to jump to page 4 or 5, but it is similar to your issues, and covers various methods of testing, as well as some of the possible suspects that could be causing the corruption.

Problems like this could be caused by bad memory (test it overnight), overclocking (restore defaults), bad power (swap the supply out for another), components getting too hot (add cooling), a bad motherboard (replace it, no other choice), or a bad network chipset (replace it). There may be others too.

The first problem to tackle is determining exactly which machine is suspect, and for that I recommend bringing in a third machine to test against. Once you find a pair of machines that always test perfectly between them, then both are ruled out as suspects, leaving the other computer as the guilty one. Then you have to eliminate the various components as suspect, for example, testing memory, swapping out power and network cards, etc.
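The elimination rule above is simple enough to write down. This is only a sketch of the reasoning, and the machine names in the usage note are hypothetical labels: any machine that belongs to a pair that always tests clean is cleared, and whatever is left remains a suspect.

```python
def remaining_suspects(machines, clean_pairs):
    """Return the machines not cleared by any consistently clean pair.

    machines:    list of machine names under test
    clean_pairs: list of (a, b) pairs whose copy-and-compare tests
                 always came back identical
    """
    cleared = set()
    for a, b in clean_pairs:
        # Both ends of a clean transfer are ruled out together.
        cleared.update((a, b))
    return [m for m in machines if m not in cleared]
```

For example, remaining_suspects(["desktop", "laptop", "server"], [("desktop", "laptop")]) leaves only "server". Note that a single clean run proves little with intermittent faults, so a pair should only be listed after repeated clean tests.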

If you don't mind posting the list of errors, I may be able to see a pattern that will point to something, or possibly rule out particular components. And I'm also interested in the answer to Brian's last question.
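As a rough illustration of the kind of pattern an error list can show, the sketch below summarizes a list of (offset, source byte, copied byte) tuples. The input format, the function name, and the 4096-byte block size are assumptions made for the example; what any given pattern actually implies about the hardware is a judgment call, not something this code decides.

```python
from collections import Counter

def summarize_diffs(diffs, block_size=4096):
    """Summarize (offset, byte_a, byte_b) mismatches.

    Returns the XOR masks seen (which bits flipped), the positions of
    the errors within an assumed block size, and the contiguous runs
    of bad bytes as (start_offset, length) tuples.
    """
    xor_masks = Counter(a ^ b for _, a, b in diffs)
    block_positions = Counter(off % block_size for off, _, _ in diffs)
    runs = []
    for off, _, _ in sorted(diffs):
        if runs and off == runs[-1][0] + runs[-1][1]:
            runs[-1][1] += 1          # extends the current run
        else:
            runs.append([off, 1])     # starts a new run
    return {
        "xor_masks": xor_masks,
        "block_positions": block_positions,
        "runs": [tuple(r) for r in runs],
    }
```

A single recurring XOR mask might hint at one stuck or flipped bit, while long runs or errors at regular block positions might point toward a transfer or buffering problem, but both readings are heuristics only.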

February 25, 2009

Author

Syslog here: http://pastebin.com/m38d8f278

smart output from a drive here: http://pastebin.com/m108c878e

The motherboard is pretty new, not overclocked. Drives are connected to a Promise Sata 300 TX4.

I'll drag the machine near a monitor and run memtest on it.

If it's bad memory, I would expect to see random errors. If I recopy the file, I won't get an error in the same place. I'll do more exhaustive retries of running FC multiple times to make sure it always reports the corruption in the same place (after copying once). I'm concerned it's something with the network stack.
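One way to check that repeated reads of the same copy are stable is to hash the file several times and see whether the digest ever changes. This is a minimal sketch; the function names and the number of passes are assumptions for illustration.

```python
import hashlib

def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def reads_are_stable(path, passes=3):
    """True if every read pass over the file yields the same digest."""
    digests = {file_sha256(path) for _ in range(passes)}
    return len(digests) == 1
```

One caveat: the operating system may serve repeated reads from its cache rather than from the disk or the network, so a file larger than the machine's RAM gives a more meaningful result.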

February 26, 2009

Author

It's not bad memory.

Ideas for net testing?

February 26, 2009

I see no issues in the syslog or the one SMART report either. Ethernet protocols are error-checked, so the network is not a likely cause of corruption; corrupted packets would be resent. You can confirm this by running the ifconfig eth0 console command and checking for reported errors.

February 28, 2009

Miwa -

Any updates?

March 3, 2009

Author

I've been out of town, but I need to work on tracking down the problem.

I just did a few more copies and got corruption again, then did an ifconfig eth0, which showed no errors. Here's the output:

eth0      Link encap:Ethernet  HWaddr 00:1c:c0:71:55:41
          inet addr:192.168.1.2  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:7564114 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4759526 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3676430410 (3.4 GiB)  TX bytes:1726858571 (1.6 GiB)
          Interrupt:220 Base address:0x8000
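If the counters need to be checked repeatedly (say, before and after each test copy), the output above can be parsed rather than read by eye. This sketch assumes the older net-tools layout shown here, with "RX packets:" and "TX packets:" lines; newer ifconfig versions print a different layout and would need a different pattern. The function name is hypothetical.

```python
import re

def ifconfig_error_counters(text):
    """Extract RX/TX error counters from old-style ifconfig output.

    Returns a dict such as {"RX_errors": 0, "TX_carrier": 0, ...}.
    """
    counters = {}
    # Lines look like: "RX packets:123 errors:0 dropped:0 overruns:0 frame:0"
    pattern = r"^\s*(RX|TX) packets:\d+ (.*)$"
    for direction, fields in re.findall(pattern, text, re.MULTILINE):
        for name, value in re.findall(r"(\w+):(\d+)", fields):
            counters[f"{direction}_{name}"] = int(value)
    collisions = re.search(r"collisions:(\d+)", text)
    if collisions:
        counters["collisions"] = int(collisions.group(1))
    return counters
```

Any non-zero value in the returned dict would be worth a closer look. All-zero counters, as in the output above, only show that the server's NIC did not detect a problem.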

Next I'm going to swap out the power supply for a brand new Seasonic one I have on the shelf.

March 3, 2009

There are a limited number of things it could be. The PSU angle is possible but unlikely IMO. If that doesn't fix the problem, here is my assessment, along with a suggested way of ruling out each cause.

1. Bad drive. Run the test to multiple drives to see if corruption is limited to one drive

2. Bad cabling / backplane. Hook drive directly to controller with fresh data cables

3. Bad controller. Run the test to drives on different controllers to see if corruption is limited to one controller

4. Bad Memory. Test memory (you already did this, but a more strenuous test may be in order)

5. Bad Network. Try copying a file from the unRAID server to the unRAID server. (Eliminating network completely).

6. Incompatible version of unRAID. Downgrade to 4.3.3 and see if problem recurs

7. Bad Motherboard. Testing for a bad motherboard is hard to do. If it's none of the above, the motherboard is a likely culprit. Replacing it is about the only way I know to test. (You'd want to be able to return it if it wasn't the problem). I have a local Mom and Pop computer store that has helped me test for a bad motherboard for a small fee.

8. CPU. The least likely cause - but does happen. I once had a CPU go bad. It caused random crashes and consistently crashed (at various points) while trying to install a fresh copy of Windows. It affected multiple motherboards with the CPU installed. It was the first one of its kind that my local Mom and Pop shop had ever seen - normally when a CPU goes bad it is totally dead.
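Test 5 in the list above (a copy that never touches the network) can be scripted so it is easy to repeat. This is a sketch only: the function name and trial count are assumptions, and the paths in the usage note are examples.

```python
import filecmp
import shutil

def copy_trials(src, dst, trials=3):
    """Copy src to dst `trials` times and count copies that differ.

    Each copy is verified with a full byte-for-byte comparison.
    """
    failures = 0
    for _ in range(trials):
        shutil.copyfile(src, dst)
        # filecmp caches results by file signature; clear the cache so
        # every trial re-reads both files.
        filecmp.clear_cache()
        if not filecmp.cmp(src, dst, shallow=False):
            failures += 1
    return failures
```

On the server this might be called as copy_trials("/mnt/disk1/test.iso", "/mnt/disk2/test.iso"). Bear in mind that the verification read may be served from the OS cache rather than the disk, so a test file larger than RAM is the safer choice.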

My guess is you have a DMA problem with your motherboard. But that's just a guess.

March 3, 2009

Another similar thread: http://lime-technology.com/forum/index.php?topic=90.0

For that one, it appeared to be the mobo.

Cheers,

Bill

March 3, 2009

I hate to be a nuisance, but in troubleshooting I find that it is always better to cover the basics first, even the most obvious of things (is it plugged in? etc). I have seen too much wasted time, troubleshooting the wrong machine. I really do think you need to verify that you are troubleshooting the correct machine, even though it does *look* highly likely that it is the server at fault.

Your tests were good, and certainly implicate the server as most likely, but don't eliminate the local machine yet. For example, although writes seem to be the problem and reads seem faultless, the creation and transmission of packets from the local machine is different from the reception of packets, so it is still possible that there is a faulty register or firmware in the local NIC, on the transmission side, and it would not show up in the ifconfig stats.

I would like to suggest 2 simple tests, to absolutely confirm that it is the server at fault. Try your iso save and compare to a shared folder on a different machine. And try the same test from that different machine to the unRAID server.

Sorry for being a troubleshooting purist pest!

Once you have isolated it to the unRAID server, Brian's test #5 would be a great one to eliminate the server NIC and networking drivers as suspects. I have to agree that the motherboard looks most likely to me, but as Brian said, you can only determine that by a process of elimination of everything else. The drives and cabling are probably OK, because you would have seen errors in the syslog; they don't corrupt data without reporting errors. After the motherboard, one of the NICs is the next most likely.

Another possibility, from past experience: there have been serious corruption issues when a northbridge chipset on the motherboard was getting too hot. Corruption seemed to go away with extra cooling applied to it. However, I don't think I would completely trust that board ever again, even with extra cooling!

March 3, 2009

Author

Thanks for all the tips. I'll do some more testing from my Vista64 machine to an XP machine, XP machine to server, and do internal disk to disk copies on the server.

I don't think it's cables either, as I'd think that would be a lot more random in reproducing errors. And I'd think a parity check would fail if there was something wrong reading the disks. Since the errors don't seem to be random (well, not random in that once the data is written incorrectly, I can always detect the error in the same spot, every time I test), I doubt it's the PSU too, but I don't know if Fry's has this board in stock for me to do the usual "Fry's Rental".

I'll then try downgrading, tho' that will make me unhappy. The 4.4 stuff fixed whatever issue was making my Popcorn Hour unable to use SMB to the server.

The CPU isn't removable, and the northbridge has a (*loud*) fan on it, so I'd think it's not overheating, plus chipsets can survive being pretty insanely hot. This is an Intel Atom board.

March 6, 2009

Author

I copied the file from disk to disk on the server (used mc), and then did a cmp:

root@Server:/mnt/disk2# cmp test.iso  /mnt/disk1/test.iso
vty-0225.iso /mnt/disk1/vty-0225.iso differ: char 2187825926, line 9783370
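To put that byte number in context, the following sketch converts it to a zero-based offset and shows where it falls relative to some common boundaries. It assumes GNU cmp, which numbers bytes starting at 1; the 512-byte sector and 4096-byte page sizes are assumed typical values, not measurements from this system.

```python
def locate_offset(cmp_char, sector=512, page=4096):
    """Break a 1-based cmp byte number into boundary-relative positions."""
    offset = cmp_char - 1  # cmp counts from 1; file offsets count from 0
    return {
        "offset": offset,
        "gib": offset / 2**30,              # position in GiB
        "bytes_past_2gib": offset - 2**31,  # distance past the 2 GiB mark
        "sector_index": offset // sector,
        "offset_in_sector": offset % sector,
        "offset_in_page": offset % page,
    }
```

For the value reported above, locate_offset(2187825926) puts the first difference a little over 2 GiB into the file. Plain cmp stops at the first mismatch, so this says nothing about how many bytes differ after that point; cmp -l lists every differing byte.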

I guess I'll try downgrading next.

March 6, 2009

Author

Well, that went nowhere. 4.3 doesn't support the network chip on the Atom board.

I'm going to have to get my old RAID5 machine working again, so I can fc the entire unRAID too. Then I have to decide where to go next.

March 6, 2009

"Well, that went nowhere. 4.3 doesn't support the network chip on the atom board."

It was extremely unlikely to make a difference anyway. This is very unlikely to be related to any software or kernel issue.

If you have a slot free, can you add a cheap disk controller, and save to drives on it? That might eliminate or incriminate the onboard drive controller.

Edit: sorry, I'm a bit ignorant of your setup. If you have multiple disk controllers, such as an onboard one and an add-on card, or SATA and IDE, can you try testing to each of them?

March 6, 2009

Author

It's a mini-ITX Atom 330 board. It has only 1 PCI slot, and that has the Promise card in it. Parity is on one of the 2 motherboard SATA ports, I believe.

If I move a drive from the Promise to the motherboard, what's the right way to do it so I don't wipe out parity?

March 6, 2009

Thankfully, unRAID is pretty good at figuring out which drives go with which disk numbers or the parity drive, so you can just switch the cables or reconnect as needed. Sometimes you have to check the Devices tab and re-assign a drive or 2. It's always a good idea to check the Devices tab whenever you move drives around, just to make sure everything is assigned correctly, especially the parity drive. It's also a very good idea to keep a copy of the Devices page: a screenshot, a printout, or drive assignment notes.

March 7, 2009

Author

Another data point: if I do a cp test.iso test2.iso and then cmp them, all on the same disk, there isn't a problem.
