Trouble with Asus P5Q WS (RESOLVED)



I have been using an Abit IC7 MAX3 for a long time now and it has been stable, but slow. The last parity check was about a week ago and took around 23 hours. I just upgraded to an Asus P5Q WS board, and after the first boot I started a parity check, which ran for a while before the server rebooted. I started a parity check once again. This time it completed, but not before a few segfaults and a couple thousand lines in the syslog that looked like this:

 

Apr 29 13:29:49 media kernel: ReiserFS: md9: warning: vs-5150: search_by_key: invalid format found in block 5696591. Fsck?
Apr 29 13:29:49 media kernel: ReiserFS: warning: is_tree_node: node level 133 does not match to the expected one 1
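With a couple thousand lines like these, it can help to tally the warnings per device before deciding which disks to check. The following is only a minimal Python sketch: the function name `count_reiserfs_warnings` and the regular expression are assumptions based on the two sample lines above, not an unRAID tool. Warnings that do not name an md device (like the second sample line) are skipped.

```python
import re
from collections import Counter

# Matches ReiserFS kernel warnings that name an md device, e.g.
# "kernel: ReiserFS: md9: warning: vs-5150: ..."
# The pattern is an assumption derived from the sample lines in this thread.
WARNING_RE = re.compile(r"kernel: ReiserFS: (md\d+): warning: ([\w-]+):")

def count_reiserfs_warnings(lines):
    """Return a Counter keyed by (device, warning code)."""
    counts = Counter()
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts

sample = [
    "Apr 29 13:29:49 media kernel: ReiserFS: md9: warning: vs-5150: "
    "search_by_key: invalid format found in block 5696591. Fsck?",
    "Apr 29 13:29:49 media kernel: ReiserFS: warning: is_tree_node: "
    "node level 133 does not match to the expected one 1",
]

for (device, code), n in count_reiserfs_warnings(sample).items():
    print(device, code, n)
```

To run it against a real log, pass an open file instead of `sample`; the location of the syslog on your system is not assumed here.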

 

The web GUI is working fine, but the Samba shares are not accessible over the network. I can still access the server via telnet, though. I'm not really sure what the best way to troubleshoot this is. I suspect I need to run an fsck, but I'm not sure how to go about doing that. Any other suggestions?

 

Asus P5Q WS

Intel e7400

4GB OCZ RAM

Corsair 750TX

SUPERMICRO AOC-SAT2-MV8

12 mixed SATA disks

 

Thanks for your help.

Link to comment

Wow!  Take a look at your syslog; you won't have to go very far before you see some serious memory corruption!  Random letters have been modified, scattered through the log but in clusters.

 

Although you will need to run reiserfsck later (see the Check Disk File systems page), plus a parity check to correct any parity corruption, there is no point in doing either until you can completely trust your machine.  Either the memory is bad or the timings are off.  Check your memory specs and compare them with the BIOS settings.  I'm guessing you will need to adjust the memory settings in the BIOS to more conservative numbers.  You may also need to adjust the memory voltage up or down.  There is a chance your memory sticks are not supported on that board.  Test it thoroughly, overnight, with the MemTest on the unRAID boot menu.

 

There is a possibility that this is not memory corruption but an unstable bus.  There's no evidence that you are overclocking, but something is really wrong, hopefully just a memory stick or the memory configuration.

 

I note that in some cases, it looks like just a single bit has flipped.  That would implicate a hardware issue.  But in other cases, it looks more like a wild character pointer writing to the wrong block.  That would point to a program bug somewhere, not hardware, but no one else has ever reported this.
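To illustrate the single-bit case, here is a minimal Python sketch that compares an expected string with a corrupted one and reports how many bits differ in each changed byte. The function name `bit_differences` and the sample strings are hypothetical; they are not taken from the actual syslog.

```python
def bit_differences(expected: bytes, observed: bytes):
    """Yield (offset, number_of_differing_bits) for each byte that differs."""
    if len(expected) != len(observed):
        raise ValueError("inputs must have the same length")
    for offset, (a, b) in enumerate(zip(expected, observed)):
        diff = a ^ b  # XOR leaves a 1 in every bit position that changed
        if diff:
            yield offset, bin(diff).count("1")

good = b"kernel"
bad = b"kernal"  # hypothetical corruption: 'e' (0x65) became 'a' (0x61)

print(list(bit_differences(good, bad)))  # [(4, 1)]
```

A changed byte that differs by exactly one bit is what a flipped memory cell would produce. Many differing bits per byte, or whole runs of changed bytes, look more like data written to the wrong place.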

 

For now, I would not write to your data disks under any conditions, and that includes saving files.  They could already be corrupted, and further file system corruption could occur.

 

Joe L. is better at giving advice on checking and correcting your memory setup.  It would also be good to hear from anyone else with that motherboard about what memory they are using and what BIOS settings they found to work best.  By the way, please indicate what memory you have installed.

Link to comment

Thanks for the advice! I'll get the memory test running tomorrow morning and report back my findings.

 

The RAM is OCZ PC2-6400 Value Series (OCZ2V8004GK) - it was cheap. I haven't overclocked anything, but I never did bother to check whether the voltages the motherboard supplied were OK for the RAM. Guess I was a little too excited and didn't exactly follow best practices, like running a memtest before starting everything up.

 

Thanks again for your help here, I've been lurking around these forums for a while and am always amazed at how helpful people are.

 

 

Link to comment

Not just voltage, but also timing and clock frequency need to be set properly.

 

Let us know how the memory test works out. (Both before and AFTER you change any BIOS settings)

 

Joe L.

Link to comment

I had originally run 1 pass of the memtest before starting up unraid, which passed. I ran it again this morning and it was immediately full of errors. I guess that really goes to show the importance of running it overnight in the first place. Unfortunately I don't have time to make any changes right now, but when I get home from work tonight I've got some time to spend with it.

 

 

Link to comment

The BIOS settings might have changed, or the memory has actually failed, or it worked better at the temperature you originally tested at... but glad to see it is probably going to be easy to fix.  It might be a single bad RAM stick, or it could be the voltage, timing, or clock speed set wrong.

 

Joe L.

Link to comment

So it turns out that the timings were incorrect, so I adjusted them. I still had errors, but was able to isolate them to one stick of ram. Anyways, I got new ram and am currently on the 4th pass of memtest and all is well so far. I'll let it run for a few more hours before booting unraid.

 

I assume that I should run a parity check when I load up unraid, and see about which disk needs the fsck. Is there anything else I should be aware of?

Thanks

Link to comment

It looks like the problems are resolved. I ran reiserfsck on all the data drives and two had issues: disk 1 and the cache drive. Before I even knew there was a problem with the build, I had transferred some data from the cache drive to disk 1, so it was no surprise to me that both of them needed fixing. Other than that, it looks like everything is running smoothly. I'm currently running the preclear script on two 1.5 TB drives. I can't believe how much faster it is than my old system!
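For anyone repeating this, here is a minimal Python sketch that prints one read-only `reiserfsck --check` command per data disk. It assumes the data disks appear as `/dev/mdN` (the syslog above mentions md9), and the disk numbers are placeholders. The cache drive's device name isn't given in this thread, so it is left out. The commands are only printed, not executed, so each check can be run deliberately, one disk at a time.

```python
def build_check_commands(disk_numbers):
    """Return one reiserfsck --check argument list per data disk.

    Assumes data disks are exposed as /dev/md<N>; adjust for your system.
    """
    return [["reiserfsck", "--check", f"/dev/md{n}"] for n in disk_numbers]

# Placeholder disk numbers; use the numbers of your own data disks.
for cmd in build_check_commands(range(1, 4)):
    print(" ".join(cmd))
```

See the Check Disk File systems page for the actual procedure, including what to do if a check reports problems.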

 

I'll attach another syslog here. I'd really appreciate it if RobJ could take a quick look to make sure everything is OK. I think it is, but I'm far from an expert when it comes to reading the logs.

 

Thanks again for all the help, it made the process very painless and greatly minimized my chance of data loss.

 

 

 

 

Link to comment
