February 18, 2014

Hello. Please be prepared for a long read...

Here is the situation: I purchased a new LSI MegaRAID 9240-8i / M1015 controller on eBay. It came properly packaged, as a commercial product should be - no issues. The packaging, however, was strictly LSI, with no mention whatsoever of the M1015.

During the long weekend (Toronto...) I wrestled with it, trying to change the firmware to convert it from a RAID controller into an HBA controller. Since I don't have a spare computer for this work, I disabled (removed all hard drives from) my older Unraid server, which was the target for this controller, and attempted the firmware change procedures on it. After many failures (or semi-failures) which I won't go into, I decided to try my luck on my newer Unraid server: disable that too and start all over again, this time with the EFI shell etc. Again there were quite a few problems, as the controller would not crossflash properly with the newest firmware from LSI (9211 or 9210). Each time, the procedure ended with some of the spare disks I was using for testing not seen by the controller (one of the card's ports essentially disabled). I knew the controller was operational because the same disks showed up just fine when it had the RAID firmware on it.

After an inordinate number of hours of working with the controller in both servers, changing BIOS settings, and trying different versions of tools and different approaches, I finally got it working by using the package found here: http://lime-technology.com/forum/index.php?topic=12767.msg124393#msg124393 (Thank you, kkm!). Another great resource, aside from many in this forum, is this one: http://www.servethehome.com/ibm-serveraid-m1015-part-4/, but I guess my specific combination of motherboard and controller just doesn't play well. kkm's package contains an older (2011) version of the firmware, and I'm uneasy about settling on it, but I have been unsuccessful flashing the latest 2108it from the LSI site.
Anyway, I managed to program it in EFI mode on my newer Unraid box. Once done, I moved it to the older server, checked it, and everything continued to work as expected (testing with some spare drives).

I then plugged my Unraid hard drives back into both servers. The new server has only 4 drives, all connected to the motherboard (Asus M5A99X EVO R2.0); the older one (based on a Supermicro H8SMi-2) has 8, and I connected them to the new controller. I turned on both servers, checked the syslog, checked the main screen - no errors. Both servers came up recognizing all drives (I didn't have to identify or assign anything), with the configuration valid. All drives were shown green; the arrays just weren't started yet. Encouraged by the apparent lack of issues, I started both arrays (pressed the Start button) - again, nothing to signal any problem. Final step... I started a parity check (with correction unchecked)... and, well, that didn't go well: both servers started reporting parity sync errors. Lots. I stopped both servers immediately (stopped the parity check).

I ran out of time and could not spend any more on this issue... The last thing I did, though, was start only my newer server, in maintenance mode, and run reiserfsck on all 3 volumes. All came back with absolutely no errors. I then restarted the array and started a full parity check (again with no correction), and left it running overnight. It is almost done now, and on the main screen I see lots of reads, a number of writes (200+ on the parity disk and disk1, 20-30 writes on the other 2 disks), no errors, but thousands of parity sync errors. The server had absolutely no errors when I started this whole nightmare...

After this long story, here are some questions:
- Are these parity sync errors to be expected (considering what happened earlier)?
- Is it safe to run the parity check again and this time let the parity information be updated (let it correct)?
I read in the manual: "The Error statistic displays the number of read and write operations which have failed." My array shows both reads and writes, but no errors, overnight (during the parity check). The server wasn't otherwise used during the night. Further down in the manual: "Parity-Check will march through all data disks in parallel, computing parity and checking it against stored parity on the parity disk. If a mismatch occurs, the parity disk will be updated (written) with the computed data and the Sync Errors counter will be incremented." So... if I read this correctly, the data won't be touched; only the parity disk will be updated. Is that it?!

Obviously I want to make sure the data is preserved, but I'm not sure I understand the source of the parity sync errors, and, should any of my data have been corrupted somewhere, it appears there is no way to tell... Is that so? What is a typical error scenario which Unraid detects and corrects? I'm not sure I understand that.

Obviously I have the other server to deal with, with the new controller and parity sync errors... but that will be next weekend's wrestling... There the issue of data corruption is more problematic, as I knew I had a couple of shabby hard drives (from SMART alerts), but again, when I started this exercise, the main screen didn't show any errors.

Given all this, since I don't want to alter my data disks at this stage, I'm tempted to buy a couple of new large hard drives, start a completely new array, and copy all my disks one by one to this new array: as I free up one disk, run preclear etc. on it, reintroduce it into the array, copy the next one, and repeat until done with all the data checks. What is the opinion of the storage management experts regarding all this, including perhaps suggesting alternate approaches? Any and all thoughts are much appreciated!

hg

PS: I run Unraid v5.0.5, and I have ECC memory in both servers (I can't guarantee I chose the best settings for that, though...
both have a few options, but little straightforward info on the setup options and the reasons for them... therefore I just loaded the BIOS optimal defaults on both). I tested the memory when I installed it, but not extensively. Reading the forum, that in itself is a major job, and I may have to do it if I start from scratch.

PPS: The parity check finished: https://dl.dropboxusercontent.com/u/80470262/unraid%20screenshot/Feb18tower2.png. As I said, I ran it with the "Correct any Parity-Check errors by writing the Parity disk with corrected parity" box unchecked. Therefore there are many sync errors, but nothing should have been corrected, right? However, the screen reports "Parity is valid"... Isn't that strange?! (I believe unmenu has the same reporting issue: while the main screen shows the parity check in progress, with the correct flag unchecked, the unmenu screen displays a "corrected" counter. I have to recheck that and get a screenshot, but I'm almost sure that's the way it is...)
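The manual passage quoted in the post above can be made concrete with a toy model. This is only a sketch, assuming single XOR parity across the data disks; the helper names compute_parity and parity_check and the block values are made up for illustration and are not Unraid code.

```python
from functools import reduce
from operator import xor


def compute_parity(data_disks):
    """XOR the blocks found at the same position on every data disk."""
    return [reduce(xor, blocks) for blocks in zip(*data_disks)]


def parity_check(data_disks, parity_disk, correct=False):
    """Toy parity check (hypothetical helper, not Unraid code).

    Returns (sync_error_positions, parity_after_check).
    The data disks are only read, never written.
    """
    expected = compute_parity(data_disks)
    sync_errors = [
        pos for pos, (want, have) in enumerate(zip(expected, parity_disk))
        if want != have
    ]
    parity_after = expected if correct else list(parity_disk)
    return sync_errors, parity_after


# Three tiny "data disks" of three blocks each (assumed values).
disk1 = [0b1010, 0b1111, 0b0001]
disk2 = [0b0110, 0b0000, 0b0011]
disk3 = [0b0001, 0b1000, 0b0100]
data = [disk1, disk2, disk3]

parity = compute_parity(data)
parity[1] ^= 0b0100  # simulate a stale block on the parity disk

# Non-correcting check: the mismatch is counted, parity is left alone.
errors, unchanged = parity_check(data, parity, correct=False)

# Correcting check: parity is rewritten; a second pass then finds nothing.
_, fixed = parity_check(data, parity, correct=True)
errors_after_fix, _ = parity_check(data, fixed)

# A flipped bit on a data disk produces exactly the same kind of mismatch.
disk2[2] ^= 0b0001
errors_data_flip, _ = parity_check(data, fixed)
```

In this model a sync error only says that the blocks at one position no longer XOR to the stored parity. It cannot say whether the parity disk or one of the data disks holds the bad block, which is why independent checksums or backups are needed to tell the two cases apart.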
February 18, 2014

Attach the syslog. Test the memory by selecting memtest on boot. Let it run overnight.
February 18, 2014 (Author)

Thanks, dgaschk... Any thoughts about my general questions, though?

I've started the memory test: https://dl.dropboxusercontent.com/u/80470262/unraid%20screenshot/IMG_0289%5B1%5D.JPG only to notice yet another oddity: my RAM shows as DDR2 with ECC turned off?! It is definitely ECC RAM and DDR3: Kingston KVR1333D3E9SK2/16G 16GB 2X8GB Kits DDR3-1333 CL9 DIMM ECC Dual Channel Memory.

I've attached some syslogs, including the latest, covering last night until the moment I started memtest: https://www.dropbox.com/sh/g3wi5q6fy6qm5q6/EEdEvvqKlv/syslogs-t2?n=80470262; I haven't had a chance to check them, nor did I feel I needed to until last night. I guess this will be my next reading...

Thank you. hg

Oh... and by the way... on this server I have only 2 major directories, one of which I happened to run md5deep on a couple of days before. It isn't the biggest directory on the server, and it is not entirely relevant (many small files - jpeg and mov), while the other one is quite large and stores large files too (uncompressed DV content). Last night I ran the reverse md5 job (check) against my md5 file and it came back just perfect - it returned no errors whatsoever... I hope everything else is untouched, but I don't know...
February 18, 2014

The writes result from mounting the disks and are normal. The syslog link doesn't work. Please zip and attach. Check the BIOS settings for the RAM.
February 18, 2014 (Author)

I updated the link in my previous reply... sorry. It got truncated when I copied it over. Thank you, dgaschk.
February 18, 2014

The final log is not complete. It doesn't contain the SMART reports at the end. Paste current SMART reports into a post.
February 18, 2014 (Author)

I need to wait for the memtest I started earlier to finish. I'm not sure how long it takes, but I can't get anything else from the server at this stage. The logs I posted were on the USB flash drive, including the last syslog, which I copied over (var/log/syslog) just seconds before I powered down the server. I will run smartctl manually and save/post the output. Thank you
February 19, 2014 (Author)

Memtest results after 18+ hours: at pass 5, no errors.

An interesting new detail: besides the memory being reported (falsely) as DDR2 with ECC off, it is also shown as 64 bits: https://www.dropbox.com/sh/g3wi5q6fy6qm5q6/4V-92cpnjp/unraid%20screenshot/IMG_0297.JPG versus the memory in the old server, which is indeed DDR2, again with ECC shown as off, but 128 bits: https://www.dropbox.com/sh/g3wi5q6fy6qm5q6/W92nl4AIUn/unraid%20screenshot/IMG_0299.JPG

Not sure how this comes about...

I started memtest on my second server only this morning and it is reported OK after one pass. What is considered satisfactory/sufficient testing? Both show OK results, and I will leave them running until the end of the day and then report back.
February 19, 2014

The RAM appears to be OK. There's no 100% guarantee of this, however. Note that the line on the screen says "Settings: ..." You can configure this in the BIOS. The automatic setting is rarely wrong, but it may be in this case. Please paste the SMART reports.
February 20, 2014 (Author)

Hello dgaschk: I've added to the same directory both the SMART report and the syslog since I rebooted a few hours ago. The memory test gave absolutely no errors after 9+ passes. https://www.dropbox.com/sh/g3wi5q6fy6qm5q6/EEdEvvqKlv/syslogs-t2?n=80470262

I discovered that testing the ECC feature can be turned on in the memtest configuration menu. However, chipkill, scrubbing, etc. are not available options in the BIOS (the only things I can do are enable ECC memory and set ganging/unganging). The setup in this section is quite simplistic, and I'm uncertain how the BIOS is really working with the ECC memory. Ganging the two chips presumably should have enabled the 128-bit mode. Memtest, however, continues to report it as 64-bit... this is awkward. The ASUS manual is useless, unfortunately, as it doesn't offer any guidance on the setup. Hoping that a newer BIOS would be more friendly, I downloaded and installed the latest BIOS version from ASUS. It doesn't make any difference with regard to the visible parts...

All that being said and done, I restarted a parity check with no correction. By the looks of it, it shows the same number of errors as 2 days ago: https://dl.dropboxusercontent.com/u/80470262/unraid%20screenshot/t2paritycheck_feb20.png. (added screenshot this morning, Feb 20) The sync errors are still there, and I will have to decide what to do with them.

I've also had a look at my older server, which has run memtest for a good number of hours. Again no errors. https://dl.dropboxusercontent.com/u/80470262/unraid%20screenshot/IMG_0308%5B1%5D.JPG This one is based on the Supermicro H8SMi-2 motherboard, and the ECC menu is available in the BIOS. I turned on all the possible options in there and started memtest with ECC testing turned on. I'm going to stop it tomorrow and see the results...

... a few hours later, no errors with ECC turned on: https://dl.dropboxusercontent.com/u/80470262/unraid%20screenshot/IMG_0314%5B1%5D.JPG
February 20, 2014 (Author)

I'm running mprime on my older server, and so far so good: it has been about 20 minutes since I started the blended test, and nothing major has happened. I guess the memory is fine after all, and the mysterious parity sync errors are unexplainable...
February 21, 2014

There's no concrete indication of the problem. It looks like disks 2 and 3 may have power issues, based on the power-off retract counts. Check the cabling to those drives. If you had checksums, the files could easily be checked. There is a procedure to find which files are located at those positions and then you can manually open the files to check them. The simple course is to run a correcting check and hope the errors are on the parity disk, as is usually the case.
February 21, 2014 (Author)

Thanks, dgaschk. I'll have a look at the power indication. I had checksums for the smaller folder (~200GB, ~40k files), and all of it tested OK against the md5 signatures. I don't have the same information for the larger folder (1.4TB, ~20k files) - I should have, and will create it from now on. Lesson learned.

"There is a procedure to find which files are located at those positions and then you can manually open the files to check them. The simple course is to run a correcting check and hope the errors are on the parity disk, as is usually the case."

I'll search for the procedure in the forums or wiki - I didn't come across it, but then I didn't know what to search for. I'll do a random check, as I won't be able to verify 1.4TB of DV by hand.

Does the correcting check always update the parity drive (if all the drives are in the array and functional)? And is the parity drive only used when rebuilding a lost drive, I guess?
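For the "create checksums from now on" part, a minimal Python sketch of the same idea as an md5deep run followed by a later check is shown here. It is an illustration only: the function names file_md5, build_manifest and verify_manifest are hypothetical, and the in-memory dictionary stands in for the md5 file mentioned in the thread.

```python
import hashlib
import os


def file_md5(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large video files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map every file under root (by relative path) to its MD5 hex digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_md5(full)
    return manifest


def verify_manifest(root, manifest):
    """Return (relative_path, problem) pairs for missing or changed files."""
    problems = []
    for rel_path, expected in sorted(manifest.items()):
        full = os.path.join(root, rel_path)
        if not os.path.isfile(full):
            problems.append((rel_path, "missing"))
        elif file_md5(full) != expected:
            problems.append((rel_path, "changed"))
    return problems
```

A typical use would be to call build_manifest on a share while it is known to be good, store the result somewhere off the array, and call verify_manifest after an incident such as unexplained sync errors. An empty list means every recorded file still matches its checksum.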
February 22, 2014

While I suspect your memory is indeed okay, I would boot with a current MemTest+ CD or flash drive ... v5.01 is 2 1/2 years newer than v4.2, and has far better support for newer chipsets. I suspect it will correctly identify your RAM and run it in the correct dual channel mode. You can create a bootable flash drive with the installer download here: http://www.memtest.org/download/5.01/memtest86+-5.01.usb.installer.zip Then just boot with that flash drive and let it run for 6-8 hours. [I always run it for 24 hours on a new build]
February 22, 2014

Good plan. That will at least give you high confidence that the memory is indeed okay.

As for the parity sync errors ... with no errors reported by any of the disks, these are almost certainly "real" parity errors. In any event, unless you take the time to compare all of the data disks against your backups, there's little you can do except simply correct the errors. [Personally, I NEVER run non-correcting checks ... if there's an error, I want it fixed -- but I also keep checksums of all my data and have a complete set of backups, so I can easily confirm whether or not there are any data errors in a case like that.]

In your situation, I'd (a) run the latest MemTest (as you're already planning to do); (b) run a correcting parity check; and then (c) run a 2nd correcting check and confirm that it has all zeroes (in both the disk errors column and the sync errors).
February 23, 2014 (Author)

"I would boot with a current MemTest+ CD or flash drive ... v5.01 is 2 1/2 years newer than v4.2, and has far better support for newer chipsets. I suspect it will correctly identify your RAM and run it in the correct dual channel mode. You can create a bootable flash drive with the installer download here: http://www.memtest.org/download/5.01/memtest86+-5.01.usb.installer.zip Then just boot with that flash drive and let it run for 6-8 hours. [I always run it for 24 hours on a new build]"

garycase, I took your advice a bit further and added the 32-bit binary to the Unraid 5.0.5 memory stick... https://dl.dropboxusercontent.com/u/80470262/unraid%20screenshot/mt86plus is the binary, for anyone who is interested... I actually reinstalled the Unraid 5.0.5 package from scratch, added my old config files and the key file, modified syslinux.cfg, https://dl.dropboxusercontent.com/u/80470262/unraid%20screenshot/syslinux.cfg and added mt86plus to the root directory.

I include here a few screens for reference: the boot screen, the beginning of the procedure, and the end of the test on my server after 4 hours.

This is the original memtest screen: https://www.dropbox.com/s/8fikxfhnolqol9c/IMG_0297.JPG
The boot screen with the new binary added: https://www.dropbox.com/s/zar0n44gnx5msxk/IMG_0328.JPG
After selecting the v5 check: https://www.dropbox.com/s/23v4iaddmwm0sl2/IMG_0329.JPG
If you don't press either F1 or F2, it defaults to the F1 selection: https://www.dropbox.com/s/z9errnjnng6ipjr/IMG_0325.JPG
If you press F2: https://www.dropbox.com/s/w8rdc5zyysc8dgh/IMG_0333.JPG
During the test: https://www.dropbox.com/s/c267jath7dbp3ua/IMG_0340.JPG
And this is the test result, passed, after running it for 4 hours: https://www.dropbox.com/s/jgyivkbnszrju4s/IMG_0344.JPG

Thanks again for your suggestion. hg
February 23, 2014

Safe to say at this point that your memory's fine. I'd just do the correcting parity check and then run it a 2nd time to confirm all is well. If you have backups, you could then do a comparison to confirm no data was corrupted, but with no indicated disk errors it's very likely everything's fine. I DO have backups, and the few times I've had sync errors, I've taken the time to fully check all the files, and have never found a problem.
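The comparison against backups suggested above could be sketched in Python roughly as follows. This assumes both copies are reachable as ordinary directory trees (for example a share on each server); the helper names relative_files and compare_trees are hypothetical.

```python
import filecmp
import os


def relative_files(root):
    """Collect the relative paths of all files under root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def compare_trees(primary, backup):
    """Compare two trees byte by byte and report what does not line up."""
    in_primary = relative_files(primary)
    in_backup = relative_files(backup)
    differing = sorted(
        rel for rel in in_primary & in_backup
        # shallow=False forces a content comparison, not just size and mtime
        if not filecmp.cmp(os.path.join(primary, rel),
                           os.path.join(backup, rel), shallow=False)
    )
    return {
        "only_in_primary": sorted(in_primary - in_backup),
        "only_in_backup": sorted(in_backup - in_primary),
        "differing": differing,
    }
```

If all three lists come back empty, the two copies are identical. A non-empty "differing" list only shows that the copies disagree, not which side is the damaged one, so it is a starting point for opening those files by hand.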
February 23, 2014 (Author)

Lots of data without backup. Just scattered md5 files, which I check against. Unfortunately it's not a rigorous process, and it takes time (I move stuff around and don't update the md5 files/locations). I created the second server planning to use it as a backup: duplicate everything I have onto it and do just regular maintenance on it, otherwise leaving it untouched... It is a work in progress, but once I've finished with all the checking and the parity sync, I'll finalize rsync-ing the directory structure and I'll feel better.

I therefore had about 1.5TB of data replicated, but then I messed up both servers at the same time - bad decision. Anyway, both servers show no memory problems, and it may have been cable issues, or too much fiddling with the hardware... Last week I replaced most of the SATA cables and one of the power supplies (bought a single-rail one), and I hope I have now reached a stable situation.

Oh, and the "wrestling" with my new MegaRAID SAS 9240-8i controller seems to have ended successfully this afternoon. Time will tell. And the monthly parity checks...

Thanks for the advice and for monitoring my adventure. I tend to be quite verbose in describing all the events...
February 24, 2014

You're heading in the right direction re: backups ... but perhaps a bit too much "fiddling" before you got there. I suspect everything's okay, however -- just do the parity checks and confirm that; then get your backup strategy in place and you'll be fine.