bland328

Everything posted by bland328

  1. Shortly after noon local time today, while I was using TeraCopy (high-performance batch copy utility) under Windows 7 to push a large number of large files to my unRAID server, the unRAID server disappeared off the network. When this happened, I tried hitting both http://tower and http://tower:8080 from a browser, but no luck. The LEDs next to the network port on the unRAID box were still active, so I unplugged and replugged the cable, though I wish I hadn't done that, since it complicated the system log a bit. The unRAID box itself was still alive, and I was able to log in through the console and successfully browse /mnt/user before putting a copy of my system log on the flash drive. A click of the power button got me a beep, but after five minutes or so, the box was still alive. The powerdown console command returned immediately and didn't appear to do anything, either. Eventually, I used the poweroff command, which appeared to get a lot of things stopped before the box powered itself off. After rebooting, the unRAID server appeared on the network without anything else being rebooted or power-cycled, and was perfectly happy. I don't see anything all that interesting in the system log, but I'm no expert yet, either. I'd appreciate other opinions! syslog_2012-06-20.txt
  2. Should I mark this [SOLVED]? I have a workaround, but the defect still exists. Is there a better place for me to log the defect?
  3. Oh...so I can edit the .conf file in the packages folder directly, and not worry about modifying the go script to copy "my own" .conf file in at boot time? Very slick. I did that, as %2D resulted in a literal "%2D" in the address.
  4. If there is a more formal bugtracker in which I should file this, please let me know. The defect I've run into is that the unMENU 1.5 interface for setting ssmtp Configuration Variables doesn't support root email addresses with plus signs in them, even if escaped. I've tried: [email protected] and alert\[email protected] but in both cases, I end up with: alert [email protected] in /usr/local/etc/ssmtp/ssmtp.conf . Is there any workaround right now, other than using the go script to copy a corrected .conf file into place at boot time?
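The go-script workaround mentioned in that post could be sketched roughly as follows. This is a minimal sketch, not unMENU's own mechanism: the function name `install_ssmtp_conf` is hypothetical, and it assumes a hand-corrected copy of the file is kept somewhere on the flash drive.

```shell
#!/bin/bash
# Hypothetical helper: copy a hand-corrected ssmtp.conf over the generated one.
# $1 = corrected file (location is an assumption, e.g. somewhere on the flash drive)
# $2 = destination (the post names /usr/local/etc/ssmtp/ssmtp.conf)
install_ssmtp_conf() {
  local src="$1" dst="$2"
  # Refuse to continue if the corrected file is not there.
  [ -f "$src" ] || { echo "missing $src" >&2; return 1; }
  # Create the destination folder if needed, then overwrite the generated file.
  mkdir -p "$(dirname "$dst")" && cp "$src" "$dst"
}
```

In the go script this would be called as something like `install_ssmtp_conf /boot/custom/ssmtp.conf /usr/local/etc/ssmtp/ssmtp.conf`, where the first path is made up for illustration. It would presumably have to run after whatever line installs the package, since otherwise the generated file could overwrite the corrected one.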
  5. Thanks for the quick response, Joe. In that folder, I see these smtp-related files:
     ...
     -rwxrwxrwx 1 root root 7026 Jun 10 19:35 mail-ssmtp-unmenu-package-3.conf*
     -rwxrwxrwx 1 root root 7045 Jun 7 16:14 mail-ssmtp-unmenu-package-3.conf-2012-06-07-161444.bak*
     -rwxrwxrwx 1 root root 7020 Jun 10 19:10 mail-ssmtp-unmenu-package-3.conf-2012-06-10-191043.bak*
     -rwxrwxrwx 1 root root 7041 Jun 10 19:11 mail-ssmtp-unmenu-package-3.conf-2012-06-10-191105.bak*
     -rwxrwxrwx 1 root root 7042 Jun 10 19:11 mail-ssmtp-unmenu-package-3.conf-2012-06-10-191115.bak*
     -rwxrwxrwx 1 root root 7042 Jun 10 19:12 mail-ssmtp-unmenu-package-3.conf-2012-06-10-191237.bak*
     -rwxrwxrwx 1 root root 7042 Jun 10 19:13 mail-ssmtp-unmenu-package-3.conf-2012-06-10-191306.bak*
     -rwxrwxrwx 1 root root 7042 Jun 10 19:13 mail-ssmtp-unmenu-package-3.conf-2012-06-10-191316.bak*
     -rwxrwxrwx 1 root root 7042 Jun 10 19:13 mail-ssmtp-unmenu-package-3.conf-2012-06-10-191336.bak*
     -rwxrwxrwx 1 root root 7042 Jun 10 19:18 mail-ssmtp-unmenu-package-3.conf-2012-06-10-191832.bak*
     -rwxrwxrwx 1 root root 7041 Jun 10 19:18 mail-ssmtp-unmenu-package-3.conf-2012-06-10-191847.bak*
     -rwxrwxrwx 1 root root 7041 Jun 10 19:34 mail-ssmtp-unmenu-package-3.conf-2012-06-10-193405.bak*
     -rwxrwxrwx 1 root root 7026 Jun 10 19:34 mail-ssmtp-unmenu-package-3.conf-2012-06-10-193439.bak*
     -rwxrwxrwx 1 root root 7041 Jun 10 19:35 mail-ssmtp-unmenu-package-3.conf-2012-06-10-193501.bak*
     -rwxrwxrwx 1 root root 7041 Jun 10 19:35 mail-ssmtp-unmenu-package-3.conf-2012-06-10-193530.bak*
     -rwxrwxrwx 1 root root 7045 Jun 7 16:50 mail-ssmtp-unmenu-package.conf*
     ...
     drwxrwxrwx 3 root root 4096 May 1 21:20 ssmtp/
     -rwxrwxrwx 1 root root 52501 Nov 23 2009 ssmtp_2.64.orig.tar.bz2*
     -rwxrwxrwx 1 root root 2641 Jun 10 19:35 ssmtp_2.64.orig.tar.bz2.manual_install*
     -rwxrwxrwx 1 root root 215040 May 1 21:20 ssmtp_2.64.tar*
     ...
     The ssmtp subfolder contains the ssmtp source code. I don't know how normal or odd the contents of this folder look, since I've never directly dabbled in this folder before. Is your position that I should delete the June 7 version of the mail-ssmtp-unmenu-package.conf file? Should those .bak files go, too? And is there anything else I should tidy up? Thanks for your help!
  6. I'm running unRAID 4.7 and unMENU 1.5, and working on configuring SSMTP. Somewhere along the line, ssmtp_2.64.orig.tar.bz2 started showing up twice on the /pkg_manager page. When I click the "Select" button for either of them, the /pkg_manager?select-ssmtp_2.64.orig.tar.bz2=Select+ssmtp_2.64.orig.tar.bz2 page lists the configuration interface twice. All of this is weird, and indicative of something gone wrong, but not a problem, per se. The problem is that the first listing of Configuration Variables is a blend of my settings and some default ("your_password") settings, and the second listing is all default settings. When I try to edit the Mail ID or Mail Password fields, I have to edit them in both sets and then click the first Save New Values button in order to commit the changes. Is there a config file I can edit to kill one of the SSMTP entries? If so, might there be a "right one" to keep?
  7. Yep...some careful unbending with an X-Acto blade tip got the job done.
  8. It pains me to admit this publicly, but I owe it to the community that has been so helpful, and to anyone who might run into a problem like this down the road: The (used) CPU I put into this build had one corner pin bent flat as a pancake. So, I've also learned something about how my eyesight changes with age. I've also learned that when my wife Googles the symptoms and asks me if maybe the CPU is missing a pin, I should listen. Building your own system is easy, except when it isn't.
  9. @chickensoup: I like most things about my experience with MSI and the MSI motherboard, except that I have a computer that isn't very good at math ;-) In all seriousness, though, it may not be the motherboard's fault, and I'm not now anti-MSI. I do, however, have an underclocked-but-functional server at the moment, and I do wish MSI would've let me secure the replacement board with a credit card so that I could've hopefully had an hour or two of server downtime, instead of a week or two. @jonathanm: That's impressive, re Intel-branded boards. I will keep that in mind for my next build! Thanks, guys!
  10. @chickensoup: Thanks for the feedback. I've determined that it isn't drive problems--this computer even has Prime95 calculation problems. When I underclock the RAM down to 1066, however, I don't see any more problems. That doesn't, however, help me sleep at night. I can give up the speed, but I can't trust my data to a box that is seemingly stable only when underclocked. Though I don't consider it utterly proven that the problem is the motherboard, MSI has VERY responsive (email responses often in less than an hour!) customer service, and they've agreed to RMA the board. They won't, however, ship a replacement until a week or two after they've received mine, so I've decided to go a different way: I just ordered an ASUS M5A78L-M LX PLUS motherboard. I also eBay-ed an AMD Athlon II X2 240 2.8 GHz CPU to go with it. Once those parts are in place, I will have replaced everything but the power supply. I probably should've done this a couple weeks ago. Fingers crossed.
  11. A little more info: after finding and fixing the single Sync Error, I started another Parity-Check, which found one more error, at an offset far, far from the previous. For those new to the thread, the three hard drives were all successfully precleared multiple times, show no signs of SMART woes, and have passed manufacturer drive diagnostics. RAM has already been replaced once, so my loose plan is to loosen RAM timings, then when that doesn't work, try a new power supply, then when that doesn't work, get a new motherboard. If that doesn't work, I guess I'll try repainting the case.
  12. I've had uptime of about one week, and in that week I've added one more 1TB drive, and I've copied about 1.6TB to the array. On occasion, I've run a Parity-Check, and until this morning, I've been clean. This morning, I got one Sync Error which certainly wasn't due to an unclean shutdown, since I've had about a week of uptime. So, now I'm back to mistrusting this system. I'd greatly appreciate hearing who advises loosening up the RAM timing (maybe from 9-9-9-24 to 10-10-10-28, as recommended by @jonathanm and @chickensoup?), and who advises doing something else! Thanks to everyone :-)
  13. @Johnm, the creepy bit is that I don't think I fixed it--I think it "just started working," which isn't ideal. When I got the two new sticks of RAM, they made things worse, and wouldn't even pass Memtest. When I tested them one at a time, they did pass Memtest, and when I put them back together again, everything seemed fine.
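  14. One more thing...here is an updated and somewhat more flexible version of the troubleshooting script from http://lime-technology.com/wiki/index.php/FAQ#How_To_Troubleshoot_Recurring_Parity_Errors, with easy-to-adjust variables at the head of the file, plus additional reporting and logging:

      #!/bin/bash
      # Repeatedly read the same region of a drive and log an MD5 hash per pass.
      # On a healthy system every logged hash should be identical.
      LOG_DIR=/var/log/hashes
      TEST_DRIVE=sdb
      SKIP=25000      # blocks to skip at the start of the drive (dd default: 512-byte blocks)
      BLOCKS=100000   # blocks to read on each pass
      PASSES=10000    # number of times to re-read the region

      mkdir -p "$LOG_DIR"
      cd "$LOG_DIR" || exit 1

      # Record the start time and parameters on screen and in the log.
      echo "$(date)"
      echo "$(date)" >> "$TEST_DRIVE.log"
      echo "PASSES=$PASSES, BLOCKS=$BLOCKS, SKIP=$SKIP"
      echo "PASSES=$PASSES, BLOCKS=$BLOCKS, SKIP=$SKIP" >> "$TEST_DRIVE.log"

      for i in $(seq 1 "$PASSES")
      do
        echo "Begin $TEST_DRIVE, pass $i."
        dd if="/dev/$TEST_DRIVE" skip="$SKIP" count="$BLOCKS" | md5sum -b >> "$TEST_DRIVE.log"
      done
      exit

      Enjoy!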
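Once a run of that script has finished, the log still has to be read somehow. One way to do it is sketched below, assuming the log holds md5sum-style lines (a 32-character hex hash followed by ` *-`) alongside the date and parameter lines. The function name `tally_hashes` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: list each distinct hash in a log with its count,
# most frequent first. Non-hash lines (date, PASSES=...) are ignored.
tally_hashes() {
  grep -E '^[0-9a-f]{32} ' "$1" | awk '{print $1}' | sort | uniq -c | sort -rn
}
```

Calling `tally_hashes /var/log/hashes/sdb.log` on a clean run should print a single line. Any additional lines are reads that came back different from the majority.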
  15. @Joe L.: I'm with you there, Joe, and thanks for all your input. Interestingly, in case you missed it, I currently have the motherboard set to Auto regarding all RAM timing, speed, and voltage! Go figure. And I want to throw two other things out there for anyone learning from this thread: 1) My MSI motherboard does the dual channel thing when paired DIMMs are either in slots 1 & 2 or 3 & 4. Even when I was still using the original Kingston RAM, voodoo went away when I put the DIMMs in slots 1 & 3, thereby disabling dual channel mode. I didn't find any way in the BIOS to turn it off. 2) With the original RAM, leaving the motherboard in Auto mode SOMETIMES got the timings and speed wrong, though it always got the voltage right. It was sometimes underclocking the RAM (no harm done, except for lost speed), and sometimes tightening the timings all the way to 7-7-7 (that's no good at all). Regardless of what the BIOS interface claimed were the current RAM settings, it was useful to fire up Memtest86+ just to see what was reported there. If I learn anything else about my case, I'll post it here. A good rule of thumb: it is never the hardware, except when it is.
  16. If you are just joining us, the summary of my parity issues is this: RAM voodoo. While it is possible that the power supply or some other actor created the RAM issues, I did buy a new pair of dual-channel-ready DIMMs, and did get them working somehow despite some initial weirdness. Right now, my BIOS is set to Auto regarding all RAM timing, speed and voltage issues; the BIOS reports 1.504 volts, and Memtest86+ reports 666Mhz (DDR1333), CAS 9-9-9-24. Yesterday, after about 20 hours of successful Memtest86+ testing, I re-added my two drives to the array, restarted it, and checked parity, which took 500+ minutes. The result was 111 sync errors, which initially horrified me, but then I realized that when I last checked (and repaired) parity a couple weeks ago, the RAM voodoo would've resulted in a bunch of erroneous parity corrections. If you can't trust your RAM, all bets are off :-) After the first pass of parity correction with the 111 sync errors (why aren't these called parity errors?), I ran it again. The result was zero errors, and that's the first time I've ever seen that out of this server. So, I'm going to spend the next week or so beating up on this thing before I start trusting it with real data. If I see any more signs of RAM voodoo, I'm going to try loosening up the RAM timing to 10-10-10-28, as recommended by @jonathanm and @chickensoup. If I get desperate, I might even buy a Corsair power supply to replace the HEC. Can anyone think of anything else I should (or shouldn't!) do? Thank you all.
  17. @Joe L.: Thanks for your feedback re the appropriateness of this thread, and for your snarky opinions about Windows ;-) @jonathanm: I hadn't considered the possibility of loosening things up beyond the RAM specs...I guess there wouldn't be much performance hit, especially since this is a file server, not a gaming box. That will be my next move, assuming I change anything at this point! @Johnm: That sounds smart, though I'm error-free and 20 hours in at this point, plus leaving town in a couple days, so I'm thinking I'll switch to some "real world" unRAID testing. If that goes well, I'll run an exhaustive multi-day Memtest while I'm gone. If that doesn't go well, I probably need to make some changes (10-10-10-28?) before continuing with Memtest. @chickensoup: When you say "changing DIMMs", do you mean trying another set? Or do you just mean moving them around? I'm on my second set of DIMMs (Kingston 1GBx2, then Crucial 2GBx2), and I've already seen that with the Kingstons, putting them in slots 1 & 3 appears to eliminate the problem (see http://lime-technology.com/forum/index.php?topic=19936.msg179372#msg179372). Also, I can find a BIOS setting to disable interleaving, but can't for the life of me find a dual-channel setting. So, as I said @Johnm above, I'm error-free after 20 hours of Memtest. I don't know if I accidentally worked something out, or if a slight change in temperature, barometric pressure or the phase of the moon is going to put me right back where I started. If everything had been just perfect with the new RAM, I'd feel great right now. Given that things were initially worse with the new RAM, I don't know what to think. Shall I loosen up the RAM timings? Leave things alone? Contact MSI? Burn some sage? Thanks very much for all the great feedback, everyone. This is a great community!
  18. Well, I have the new RAM (Crucial CT2CP25664BA1339 4GB 2GBx2 240-pin PC3-10600 DIMM DDR3 Memory Kit), and the results are bizarre. The BIOS and Memtest86 agree that I'm running the RAM at 666MHz (DDR1333), CAS 9-9-9-24. The BIOS confirms 1.504V. I believe these values to be correct, but can't readily confirm it anywhere except the SPD data reported by the BIOS utility. I'm now running 4GB RAM total instead of 2GB, and I don't like the fact that more than one variable has changed. Bad science.
      1. With the new RAM installed in the paired dual channel slots 1 and 2, the system boots, and my 10,000-pass MD5 test of 100,000 disk blocks returns about two or three times as many errors as it did with the original Kingston RAM.
      2. When I fire up Memtest86, everything looks good until Test #6 [Moving inversions, 32 bit pattern], then many errors are reported in the first pass.
      3. I remove the DIMM from slot 2 and fire up Memtest86 again. No errors are reported in two passes.
      4. I replace the DIMM in slot 1 with the other, then Memtest86 again. No errors are reported in two passes.
      5. I put the free DIMM into slot 2 (the two DIMMs are now back in slots 1 and 2, but swapped relative to step 1), and Memtest86 again. No errors are reported in two passes, which is really unexpected, given step 2!
      6. I boot unRAID and retry the 10,000-pass test from step 1. No errors.
      So, is this system haunted? Was I having a physical seating problem with one of the slots, now accidentally remedied? Should I still order a new power supply? I'm happy the box currently appears to be working, but don't trust it at all, since an hour ago it wasn't working with the same two RAM sticks. And, of course, I'm a little concerned that double the RAM might affect the outcome of the 10,000-pass test. Thoughts?
  19. Okay...I have a little more news... If I put one of the two matched DIMMs in slot 1, everything works well. If I put the other of the two matched DIMMs in slot 1, everything works well. If I put the two matched DIMMs in slots 1 and 2, I get my MD5 data corruption problem. This is the recommended Dual Channel configuration according to the motherboard documentation. If I put the two matched DIMMs in slots 1 and 3, everything works well, much to my surprise. So...it is looking like my motherboard has an issue with these specific DIMMs, or maybe with the Dual Channel configuration in general, or with something else I'm missing. New DIMMs (Crucial instead of Kingston) are arriving tomorrow, but at this point I'd put money on them making no difference. (Also, I fully recognize that my issue is no longer about unRAID, per se--this is now just about a motherboard that doesn't like the RAM I put in for some reason, and I'd likely be struggling even if this were a Windows box. Is this thread now inappropriate for these forums?)
  20. @lionelhutz and @chickensoup: Thanks for the advice, guys! @Joe L.: I like your positive attitude. The situation at the moment is this: I pulled one of the two DIMMs, ran my 10,000-pass drive-reading test, and it passed! So, I pulled that DIMM, put just the other DIMM in, and fired up the 10,000-pass test again, expecting (or, at least, hoping for) failures. But it passed, too! So...either my testing results are influenced by less or differently-configured RAM, or this system works well with one DIMM or the other, but not both. They came packaged together as a "Dual Channel" kit, and have been installed in the appropriate slots, per the motherboard manual. I've ordered a different brand of RAM to be delivered tomorrow, in hopes it will play nice with the motherboard. I'm holding off on a new power supply for the moment. Other thoughts?
  21. @Joe L.: The BIOS reports that it is running the DRAM at 1333Mhz and 1.504V. Though the SPD reports that this is 9-9-9 DRAM, I did try 8-8-8, and found that it destabilized the system--plenty of kernel panics. Should I trust what SPD tells me? @chickensoup and @lionelhutz: I don't have an appropriate alternate power supply handy, but will order one. Does anyone recommend anything more than Corsair, Seasonic and PC Power? @lionelhutz: I'll try one module at a time, then order some new RAM when I order a new power supply. I'll have two servers soon... ;-)
  22. @bonienl: Good point...I did forget about the RAM timings recommendation, and now that I look closely at it, I'm a little confused. The DIMMs don't have a timing sticker, unless the timings are somehow encoded in one of the long numbers on there. At the very least there is no #-#-#-# sort of declaration. I can't find any KVR1333D3K2/2GR timing documentation online, but I can find people who sound like they know of what they speak saying this RAM is 9-9-9-24. In the BIOS, the "DIMM Memory SPD Information" page says both DIMMs are Cycle Time=1CLK; TCL=9CLK; TRCD=9CLK; TRP=9CLK; TRAS=24CLK. The BIOS documentation says that if the DRAM Timing Mode is set to Auto, it gets the timing information from the SPD data. So, the word on the street is that the RAM is 9-9-9-24, and the BIOS says it is 9-9-9-24. Is there still a point to me manually setting it to 9-9-9-24? I'm not resistant to doing this--I just want to make sure I'm doing the right thing. @dgaschk: No, it is three drives, and I can't replicate the behavior on another system using those same drives. @ljh89: There is no overclocking or core-unlocking enabled. Thanks for all the feedback!
  23. Thanks for the response. I do indeed have the latest (17.18) BIOS version. Any other thoughts? Anyone? I'm up against a machine that is fast, "stable" (in that it doesn't crash), and occasionally subtly corrupts disk reads.
  24. Well, that was some excellent advice....thanks, guys. I learned a bunch, but I don't know what to do about it! I have discovered that with either of my drives (2TB WD, 1TB Seagate) and with one other "junk drawer" drive (320GB Seagate), if I read 200,000 blocks 10,000 times, calculating an MD5 hash each time, about 0.1% of the MD5 hashes will be different (or, more simply, "wrong"). Even stranger, the hashes are not strictly random when they are wrong. That is, if the hash is "60f3d5b4459a58ba0d4c57cf10e47a3a" 99.9% of the time, I may see that after a couple thousand hashes I get a "c8a48d2d3009d7c897a853a924904029", then the hashes may be right for a couple thousand more reads, and then I may get another "c8a48d2d3009d7c897a853a924904029" hash. Sometimes I'll get a wrong hash that never does repeat itself, but most eventually do. Replacing the SATA cables doesn't make a difference. I have also failed to reproduce these results when testing the same drives on another computer. So...it appears I have built a shiny new nightmare of a file server that corrupts disk reads a statistically significant percentage of the time. Any opinions on what I replace first? RAM? CPU? Power supply? Motherboard? Or am I looking at it wrong?
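The "about 0.1%" figure in that post can be worked out from a hash log rather than by eye. The sketch below assumes one md5sum-style line per pass and treats the most frequent hash as the correct one, which is an assumption about the data, not something the drive can confirm. The name `mismatch_rate` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: take the most frequent hash in the log as the reference
# and report how many passes produced something else.
mismatch_rate() {
  awk 'length($1) == 32 && $1 ~ /^[0-9a-f]+$/ { count[$1]++; total++ }
       END {
         if (total == 0) { print "no hashes found"; exit 1 }
         # The hash seen most often is taken to be the correct one.
         for (h in count) if (count[h] > max) { max = count[h]; ref = h }
         printf "%s %d/%d differ (%.3f%%)\n", ref, total - max, total, 100 * (total - max) / total
       }' "$1"
}
```

For a 10,000-pass run with ten bad reads, this would report 10/10000 differ, i.e. 0.100%.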