October 5, 2010 Hello all, I know that this is an issue that has come up from time to time, but I was hoping that maybe I could get help with it because I am unfamiliar with what to really do in such a situation. The other day, while I was using Media Portal to watch a TV show from my UnRAID server, playback of the file stopped immediately after starting. When I tried to access the server through my network, nothing showed up. I went to the UnRAID web configuration and tried to stop the array, but after clicking it and waiting for a few minutes, it just sat there with the "Refresh" button available; when clicked, it would not display the options to Reboot or Shutdown the server, among the other options it gives you. After I hard rebooted the server, UnRAID started a parity check, as it does every time it has been shut down incorrectly. After the check completed in roughly 24 hours, it said it found and corrected one error. I then tried to stop the array and reboot, but it froze on the "Refresh" screen again. I hard rebooted again and ran another parity check right after that one, and when it finished it reported 8 sync errors, which were corrected. Then, AGAIN last night, I ran a parity check and it returned 11 errors, which were corrected. Obviously this is not right, and it has me concerned that there is a serious issue with my server, whether it be the hard drives, cables, SATA controller, etc. I am very unfamiliar with what steps to take to see if there are such issues with my server. I have, however, saved the log file from UnMenu to post on here in case I can get help. Keep in mind that the first parity check I did will not appear in this log because I had to hard reboot after trying to stop my array. I'm very nervous that this is a really serious problem, but I don't want to get too overexcited about it in case it's something simple.
I would greatly appreciate any help that I could get, and if someone could steer me in the right direction as to what to try and do, I would be very grateful. Thank you all for your time; it is much appreciated! syslog-2010-10-04.txt
October 5, 2010 Step 1. Perform a memory test. If the memory is bad, nothing will fix it until you correct it. It might be that the voltage, timing, or clock speed is set wrong. Many BIOSes get it right, some get it wrong. Then it becomes difficult. It could be one disk, or one disk controller. To isolate it, the only technique I am aware of is repeated checksum tests on a specific large file on a specific disk, one disk at a time. Type md5sum large_file_name.ISO again, and again. If the number differs from one run to the next, it is a suspect disk/controller.
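A minimal sketch of that repeated-checksum test, scripted in Python instead of typing md5sum by hand. Assumptions: the path in the usage comment is hypothetical (unRAID data disks are usually mounted as /mnt/disk1, /mnt/disk2, and so on), and the file should be larger than the server's RAM; otherwise Linux may serve the second and later reads from its page cache rather than from the disk, and the test would not exercise the disk or controller at all.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def repeated_checksums(path, runs=5):
    """Hash the same file several times.

    Returns (digests, unstable). unstable is True when more than one
    distinct digest was seen, i.e. the data read back changed between runs.
    """
    digests = [md5_of_file(path) for _ in range(runs)]
    return digests, len(set(digests)) > 1

# Hypothetical usage, one disk at a time:
# digests, unstable = repeated_checksums("/mnt/disk1/large_file_name.ISO")
# print("suspect disk/controller" if unstable else "all runs matched")
```

Identical digests on every run do not prove the hardware is healthy, but a digest that changes on a file nobody is writing to points at the read path for that disk.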
October 5, 2010 Agree with Joe. But in addition, I'd recommend running smartctl reports on each drive (after the memory test). Occasionally there have been incompatible motherboards that have this type of symptom.
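One way to summarize those reports, sketched under assumptions: smartctl (from smartmontools) is installed, drives are addressed as /dev/sdX, and the attribute table in `smartctl -a` output has the usual columns ending in RAW_VALUE. The set of attributes treated as worrying is an illustrative choice, not an official threshold; UDMA_CRC_Error_Count is included because it tends to climb with bad or loose SATA cabling. Note that `smartctl -a` only prints the current report; a long self-test is started separately with `smartctl -t long /dev/sdX`.

```python
import re
import subprocess

# Illustrative watch list (an assumption, not an unRAID or smartmontools rule).
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",  # often rises with bad or loose SATA cables
}

def run_smartctl(device):
    """Return the text of `smartctl -a <device>`, e.g. device='/dev/sdb'.

    check=False because smartctl's exit status is a bitmask that can be
    nonzero even when the report itself printed fine.
    """
    result = subprocess.run(["smartctl", "-a", device],
                            capture_output=True, text=True, check=False)
    return result.stdout

def parse_attributes(report):
    """Map ATTRIBUTE_NAME -> leading integer of RAW_VALUE from the attribute table."""
    attrs = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; RAW_VALUE is the 10th column.
        if len(fields) >= 10 and fields[0].isdigit():
            m = re.match(r"\d+", fields[9])
            if m:
                attrs[fields[1]] = int(m.group())
    return attrs

def suspicious(report):
    """Return the watched attributes whose raw value is nonzero."""
    return {name: raw for name, raw in parse_attributes(report).items()
            if name in WATCHED and raw > 0}

# Hypothetical usage on the server:
# for dev in ["/dev/sda", "/dev/sdb", "/dev/sdc"]:
#     print(dev, suspicious(run_smartctl(dev)))
```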
October 5, 2010 Author Well, I thank you both very much for your quick replies. I am going to run MEMTEST first before anything else, as suggested, but before I do that I would like to let you know that I recently replaced a 2GB HP RAM stick that I had in the server with 2 x 2GB Corsair RAM sticks that I just purchased. I should have run memtest on them initially before anything else, but I never thought they would have errors on them from the factory (big mistake, haha). Would you suggest resetting the motherboard BIOS when installing new RAM, in case it is causing some sort of conflict?
October 5, 2010 The memory itself may be fine, just set up with the wrong timing, clock speed, or voltage. It varies with the specific make and model of memory stick. As I said, some motherboards get the settings right, some get them wrong. If you just changed the memory, it is the most suspect.
October 5, 2010 Author Okay, well, I guess one of my options might be to put that old RAM chip back in, run a parity check to see if any errors come up, and if they do, run a consecutive parity check to see if there are still any errors. But before I do that, I would still like to follow your first suggestion and run Memtest on the current 2 x 2GB chips that I have in there. However, when I select Memtest it doesn't seem to want to do anything. I don't see it performing any of the tests like it usually does when I've run it on other computers in the past. If I remember correctly, I thought I saw a topic on here, while researching my current problem, where someone was having problems with this version of Memtest (4.00). I know that they are up to version 4.10, but I don't know exactly where memtest is installed inside UnRAID and whether it is easily replaceable. Any suggestions? I know I'm not the smartest when it comes to this stuff, but I want to try and do what I'm being recommended.
October 5, 2010 It is a file on your flash drive named memtest.
October 5, 2010 Author Well, it is already late and I'm having trouble extracting Memtest from the newest binary, so I'll mess with it tomorrow after work. I'm very appreciative of your help so far, and I will be sure to report back with the results of Memtest and also the SMART reports on the drives if it comes down to that.
October 5, 2010 Since most memory, such as DDR2/DDR3, has SPD built in, I think the MB should be able to set up the proper clock/voltage automatically during POST, although users can double-check those settings in the BIOS as well. Browsing through your syslog, I am more interested to know why the same block/sector has a parity error in two consecutive parity checks:
Oct 3 15:48:54 Tower kernel: md: parity incorrect: 50525200
Oct 4 07:53:02 Tower kernel: md: parity incorrect: 50525200
http://en.wikipedia.org/wiki/Serial_presence_detect
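To check for that kind of repetition across several saved syslogs, here is a small sketch. It assumes each log (or log excerpt) covers one parity check, and it matches only the "md: parity incorrect: N" kernel message quoted above; the file names in the usage comment are hypothetical.

```python
import re
from collections import defaultdict

# Matches the kernel message quoted above, e.g.
# "Oct 3 15:48:54 Tower kernel: md: parity incorrect: 50525200"
PARITY_RE = re.compile(r"md: parity incorrect: (\d+)")

def flagged_sectors(lines):
    """Set of sector numbers reported as 'parity incorrect' in one log."""
    sectors = set()
    for line in lines:
        m = PARITY_RE.search(line)
        if m:
            sectors.add(int(m.group(1)))
    return sectors

def repeated_sectors(logs):
    """logs maps a log name to its lines, one entry per parity check.

    Returns {sector: [log names]} for sectors flagged in more than one log.
    """
    hits = defaultdict(list)
    for name, lines in logs.items():
        for sector in flagged_sectors(lines):
            hits[sector].append(name)
    return {s: names for s, names in hits.items() if len(names) > 1}

# Hypothetical usage with saved syslogs:
# paths = ["syslog-2010-10-05.txt", "syslog-2010-10-06.txt"]
# logs = {p: open(p).read().splitlines() for p in paths}
# print(repeated_sectors(logs))
```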
October 5, 2010 Because the value of a bit was misread or, more likely, corrupted by bad memory. Then on the next parity check it was read correctly.
October 5, 2010 Author I did notice that the same block came up twice, and I was concerned about that as well, but if it is something I shouldn't be worried about then I'll leave it be for now. More importantly, before I went to sleep last night I decided to try something simple. I opened the case of my server, then unplugged and replugged each end of my SATA cables at the HDD and the MB. I also took the memory out and re-seated it in the DIMM slots for the hell of it. I ran a parity check and then headed off to sleep, expecting nothing to change and to get the same errors that I have been getting. Lo and behold, however, when I woke up this morning and my parity check was at 70%, it had not yet reported any parity sync issues. Now I know I'm not in the clear yet by any means, especially because there is still a decent amount left to check, but I'm staying optimistic that this will be okay. As my friend told me, and also from reading around in these forums, it appears it is common for issues to go back to the SATA cables themselves. The constant rotation of the HDD can, I guess, wiggle the connection loose or something, so I really hope this is the case. If it isn't, then it looks like I'll be trying some of the diagnostic tests discussed earlier (Memtest, SMART, etc.). I really appreciate the attention that I've gotten in this topic. Even if it is something simple like the SATA cables, I'm happy that the community on here is so responsive and willing to help. When it comes to Linux or anything to do with Linux code, I'm pretty much in the dark. I've had very little experience with it and it's a bit intimidating to me. I'm always afraid to go messing around with this without getting help from the more educated folks out there. Hopefully I won't require any more help with this issue, but when I get home from work and am able to check my parity I will report back here with the good or bad news. Thank you all again!
October 5, 2010 NiteTalker, your problem may actually be a fundamental flaw in the logic controller between the memory and SATA controller. Normally Parity Bits do not attempt to allow or cache semaphores. Since multiple interlocking pathways exist within the logic controller it’s possible that the crossbar switching fabric (Google it) is causing a delay in the response which is being misread as a parity error. You also said you switched out your memory so it's likely that the logic controller's fabric hasn’t flushed itself of its old scatter/gather I/O table. You'll also note how deploying 16 bit architectures rather than emulating them in software produce less jagged, more reproducible results if you were to downgrade your memory. I recommend you do some research into your mainboard and determine if it suffers from the aforementioned bug. Post back and let me know!
October 6, 2010 Author Okay, so here is the latest update on my situation. I took the newer set of memory that I had put in my server about a month ago and put it into my HTPC to run Memtest, since I was unable to get it to work off my UnRAID flash drive. I struggled to extract the latest binary of Memtest to put on my flash drive in place of version 4.0 to see if that would work, so putting it in my HTPC was a fast option. I ran it several different times, at first testing each chip individually, and then before I went to sleep I let it run all night for roughly 6 hours with both chips in. After all the testing, it didn't produce any errors of any sort. While I was running memtest, I put my old 2GB memory chip back into my server and ran another parity check, which still yielded parity sync errors. The errors it produced were also in the same blocks/sectors as in the last parity check with the newer memory installed, so I am thinking that memory is now not the issue; however, I could be wrong. I will need to double-check the timings in the BIOS, as suggested by Joe L., to see if the motherboard picked up the correct timings and voltage. I should also perform a memtest on the 2GB memory stick just in case there is something wrong with that chip. This is probably what I'll do first before I try anything else, unless otherwise suggested.
@Arrivalist - I tried searching Google for "crossbar switching fabric" with my motherboard, the Supermicro X7SLA-H, and I didn't seem to find anything. Maybe I need to search harder. I know there are other people on this forum who have that particular motherboard as well, so maybe they know whether this board has that problem. You also mentioned that switching the memory may have caused the logic controller's fabric to not yet flush itself of all data, so maybe that is a problem as well?
@Joe L. - If I were to start testing the hard drives / controller, would you recommend doing long SMART tests on each drive first, or should I perform the repeated checksum tests on a large file on each drive first, as you suggested in an earlier post? I'm not sure how long the checksum tests take, but they're probably faster than the several hours a long SMART test says it takes. I know I'm getting ahead of myself and I should await a response from people here about the memory to see if I need to perform any more tests. I have attached my latest two system logs from my last two parity checks. The one dated October 5th was done with the 2 x 2GB memory chips in, and the one from October 6th was done with the single 2GB chip installed. Like I said, you'll notice that the locations where the parity is incorrect seem to be in the same block/sector. I hope that this issue isn't very serious, but I'm a little worried that it's going to be one of the disks, the SATA cables, the motherboard, or something else entirely. Thank you all again for your time; I know that I'm not the easiest person to help, but I'm hoping that one of these tests will provide a solution. I am probably getting too far ahead of myself and need to slow down, because rushing to fix this will only lead to more problems. I should exhaust all memory testing options before moving on. From this point forward I am going to stick with the 2GB chip, as I know that it worked for the year that I've used it. After checking the timings in the server motherboard for that 2GB chip, I'll run a memtest on it, and if that doesn't find any errors, I'll move to the next step. syslog-2010-10-05.txt syslog-2010-10-06.txt
October 6, 2010 If the newer memory had corrupted parity at certain locations, then putting the older memory back in place would again find those same locations in error and correct them to their true values. To test, you need to do two consecutive parity checks with a given set of memory. The first might fix the errors caused by the prior memory set, and the second should find no errors. It is NOT valid to test the memory in a different PC other than to check for gross failures; it will not be the same timing, clock speed, or voltage. So, go back to your original memory and run two parity checks consecutively. The second should be error-free. The first may correct errors created by the other set of memory.
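The reasoning above can be illustrated with a toy model of single (XOR) parity. This is a simplification, not unRAID's md driver: each "disk" is a short list of bytes, a correcting check rewrites parity wherever it disagrees with the recomputed value, and the hypothetical `flip` argument stands in for bad RAM corrupting one recomputed byte during one run. It reproduces the pattern seen in this thread: the same position is flagged by the check run with bad memory and again by the next check with good memory, and only the check after that comes back clean.

```python
from functools import reduce
from operator import xor

def compute_parity(disks):
    """Single parity: byte i of the parity disk is the XOR of byte i on every data disk."""
    return [reduce(xor, column) for column in zip(*disks)]

def parity_check(disks, parity, flip=None):
    """Toy correcting parity check.

    Recomputes parity at every position and overwrites the stored value
    where it disagrees. `flip` (hypothetical) maps a position to a bit mask
    that corrupts the recomputed value in this run only, standing in for
    bad RAM. Returns the positions reported as "parity incorrect".
    Data disks are never modified; only parity is rewritten.
    """
    flip = flip or {}
    flagged = []
    for i, column in enumerate(zip(*disks)):
        recomputed = reduce(xor, column) ^ flip.get(i, 0)
        if parity[i] != recomputed:
            flagged.append(i)
            parity[i] = recomputed  # "corrected", but wrongly if flip hit this position
    return flagged

# Three data disks, three bytes each; parity starts out valid.
disks = [[0x12, 0x34, 0x56], [0xAB, 0xCD, 0xEF], [0x01, 0x02, 0x03]]
parity = compute_parity(disks)

print(parity_check(disks, parity, flip={1: 0x04}))  # bad RAM: prints [1], parity now wrong
print(parity_check(disks, parity))                  # good RAM: prints [1] again, fixed back
print(parity_check(disks, parity))                  # second check on good RAM: prints []
```

This is why a single clean-looking check after swapping memory proves little; only the second consecutive check with the same memory tells you whether that memory is stable.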
October 6, 2010 Quote: "However, when I select Memtest it doesn't seem to want to do anything. I don't see it performing any of the tests like it usually does when I've run it on other computers in the past. If I remember correctly, I thought I saw a topic on here while researching my current problem where someone was having problems with this version of Memtest (4.00)."
That may have been me: when running memtest 4.0 on my Core i5 system, it simply hung with only part of the screen legends painted. http://lime-technology.com/forum/index.php?topic=7497.0 Whether your Atom system suffers a similar problem, I'm not sure.
Quote: "I know that they are up to version 4.10, but I don't know exactly where memtest is installed inside UnRAID and if it is even easily replaceable. Any suggestions? I know I'm not the smartest when it comes to this stuff, but I want to try and do what I'm being recommended"
It exists as a binary image, called 'memtest', in the top-level directory of the USB flash stick. If I remember correctly, I simply downloaded the bootable binary from this site: http://www.softpedia.com/progDownload/Memtest86-plus-Download-13000.html (I see that there's now a 4.15 beta), extracted the file from the zip, copied it onto my USB stick and renamed it.
October 6, 2010 Author @Joe L. - I was thinking that might be a possibility as well, because bjp999 had mentioned something to this effect in an earlier post. Darn, I wish I had set my computer to do another parity check before I went to work, but I figured I'd give the server a break because it's been doing parity check after parity check for several days now. I was so close to doing it, too, haha. Oh well, I'll make sure to do this as soon as I get home so that when I wake up in the morning it will be complete, and hopefully with no errors. You are right, though, about testing the memory in a different computer. The Gigabyte motherboard in my HTPC likely sets the timings and voltage to something different than my Supermicro board in my server does. At a quick glance I couldn't find an option to manually set the timings and voltage in the BIOS of the Supermicro board, but there must be a way to do it. I'll have to explore the manual to see if it is possible. @PeterB - It might have been you, but I think that the problem I am having is a little bit different. What happens for me is that as soon as memtest runs, absolutely nothing happens other than the screen listing the specs for the memory. None of the tests even start, which is awfully strange. I'm hoping that the newer version will help, though, because like you said, the Atom processor I have may not be compatible with version 4.0 of memtest. So you're saying that all I have to do is remove the .bin extension from the binary and replace the file on my flash drive with that new file? I was sitting here trying to extract the .bin file by manually creating a .cue file for Daemon Tools, but it was not working. ---- Thanks again for all your help, guys, and I will report back once I run the parity check again with the original RAM. I'll keep my fingers crossed for no errors, and if I don't get any then I'll know that it was the problem!
October 6, 2010 Quote: "What happens for me is as soon as memtest runs, absolutely nothing happens other than the screen listing out the specs for the memory. None of the tests even start at all, which is awfully strange."
I think that is exactly the same as what I experienced, just described in a slightly different way. In my case, the reported clock rate specification was way off too. I don't think DDR3 4000 memory is available .... yet!
Quote: "So you're saying that all I have to do is remove the .bin extension from the binary and replace the file on my flash drive with that new file?"
Yes. Well, almost: you have to remove the '86+-4.10' bit too!
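Putting PeterB's steps together: the file in the zip (for example memtest86+-4.10.bin) is the raw bootable image itself, not a CD image in the .bin/.cue sense, so there is nothing to mount with Daemon Tools; it only needs to be copied to the top level of the flash drive under the plain name memtest. A sketch, assuming the zip holds exactly one .bin member and that you pass the flash drive's mount point (often /boot when working on the unRAID server itself, or the drive letter on a desktop PC). Keeping the old image as memtest.old is an extra precaution, not something from the thread.

```python
import shutil
import zipfile
from pathlib import Path

def install_memtest(zip_path, flash_root):
    """Copy the memtest86+ boot image out of a downloaded zip onto the flash
    drive as a file named plain 'memtest' (no '86+-4.10', no '.bin').

    Assumes the zip contains exactly one member ending in '.bin'.
    Returns the path of the installed file.
    """
    flash_root = Path(flash_root)
    target = flash_root / "memtest"
    with zipfile.ZipFile(zip_path) as zf:
        bins = [n for n in zf.namelist() if n.lower().endswith(".bin")]
        if len(bins) != 1:
            raise ValueError(f"expected one .bin in {zip_path}, found {bins}")
        if target.exists():
            # Keep the previous image so it can be restored if the new one misbehaves.
            shutil.copy2(target, flash_root / "memtest.old")
        with zf.open(bins[0]) as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return target

# Hypothetical usage (zip name and mount point are examples only):
# install_memtest("memtest86+-4.10.zip", "/boot")
```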
October 7, 2010 Author Update: Good news! It seems that the memory was indeed the culprit after all. I ran the second of two consecutive parity checks as suggested by Joe L., and when I woke up this morning I was pleasantly surprised that there were no errors of any sort in the system log! For now I think I can finally breathe a sigh of relief that it's not something much more serious. I still would like to be able to use that 4GB of memory in my server, but it clearly seems that my motherboard does not play nice with it. I may try to put it back in the server and run memtest on it there, because running it in the HTPC may have given me different results. I'll also have to look harder into changing the RAM timings in the BIOS, as it may not be picking them up correctly automatically. For now, though, I'm just relieved it was something simple, and I don't think I'll be messing around with it again for a while. The only reason I really wanted 4GB of RAM in there was to help cache_dirs cache more information, since I have a lot of data on this server. If I really want it, though, I'll probably be better off researching memory chips that are known to be compatible with my motherboard. @PeterB - Thank you for the information! I will be sure to replace the memtest file on the flash drive with the newer one after renaming it as you suggested. Hopefully this version will work for me. Thank you to everyone for your help and the time you took to give advice. It seems I was able to get to the bottom of my issue, and I'm very grateful that I got a lot of help with my problem, even if it was something as simple as the memory.
October 7, 2010 Quote: "I still would like to be able to use that 4GB of memory in my server, but it clearly seems that my motherboard does not play nice with it. I may try to put it back in the server and run memtest on it there because running it in the HTPC may have given me different results. I'll also have to look harder into changing the timings for the ram in the BIOS as well as it may not be automatically getting them correctly."
I had the same experience before when I was using a dual Pentium III MB from Tyan; at that time this MB could use only certain memory from Kingston. However, nowadays most MBs should be able to do the job without problems. You might want to check with the MB vendor to find out if there is a new BIOS available. My MSI MB originally had a problem detecting more than 4GB of memory (from different memory vendors); after I reported it to MSI and gave them the memory information, they finally released a new BIOS that fixed the problem.
Archived
This topic is now archived and is closed to further replies.