bland328

Members
  • Posts

    105

Posts posted by bland328

  1. I'm running unRAID 4.7 and unMENU 1.5, and working on configuring SSMTP.

     

    Somewhere along the line, ssmtp_2.64.orig.tar.bz2 started showing up twice on the /pkg_manager page.

     

    When I click the "Select" button for either of them, the /pkg_manager?select-ssmtp_2.64.orig.tar.bz2=Select+ssmtp_2.64.orig.tar.bz2 page lists the configuration interface twice.

     

    All of this is weird, and indicative of something gone wrong, but not a problem, per se.

     

    The problem is that the first listing of Configuration Variables is a blend of my settings and some default ("your_password") settings, while the second listing is all default settings.

     

    When I try to edit the Mail ID or Mail Password fields, I have to edit them in both sets and then click the first Save New Values button in order to commit the changes.

     

    Is there a config file I can edit to kill one of the SSMTP entries?  If so, might there be a "right one" to keep?

  2. It pains me to admit this publicly, but I owe it to the community that has been so helpful, and to anyone who might run into a problem like this down the road:

     

    The (used) CPU I put into this build had one corner pin bent flat as a pancake.

     

    So, I've also learned something about how my eyesight changes with age.

     

    I've also learned that when my wife Googles the symptoms and asks me if maybe the CPU is missing a pin, I should listen.

     

    Building your own system is easy, except when it isn't :D

  3. @chickensoup: I like most things about my experience with MSI and the MSI motherboard, except that I have a computer that isn't very good at math ;-)  In all seriousness, though, it may not be the motherboard's fault, and I'm not now anti-MSI.  I do, however, have an underclocked-but-functional server at the moment, and I do wish MSI would've let me secure the replacement board with a credit card so that I could've hopefully had an hour or two of server downtime, instead of a week or two.

     

    @jonathanm: That's impressive, re Intel-branded boards.  I will keep that in mind for my next build!

     

    Thanks, guys!

  4. @chickensoup: Thanks for the feedback.  I've determined that it isn't drive problems--this computer even has Prime95 calculation problems.

     

    When I underclock the RAM down to 1066, however, I don't see any more problems.  That doesn't, however, help me sleep at night.  I can give up the speed, but I can't trust my data to a box that is seemingly stable only when underclocked.

     

    Though I don't consider it utterly proven that the problem is the motherboard, MSI has VERY responsive (email responses often in less than an hour!) customer service, and they've agreed to RMA the board.

     

    They won't, however, ship a replacement until a week or two after they've received mine, so I've decided to go a different way: I just ordered an ASUS M5A78L-M LX PLUS motherboard.  I also eBay-ed an AMD Athlon II X2 240 2.8 GHz CPU to go with it.

     

    Once those parts are in place, I will have replaced everything but the power supply.  I probably should've done this a couple weeks ago.

     

    Fingers crossed.

  5. A little more info: after finding and fixing the single Sync Error, I started another Parity-Check, which found one more error, at an offset far, far from the previous.

     

    For those new to the thread, the three hard drives were all successfully precleared multiple times, show no signs of SMART woes, and have passed manufacturer drive diagnostics.

     

    RAM has already been replaced once, so my loose plan is to loosen RAM timings, then when that doesn't work, try a new power supply, then when that doesn't work, get a new motherboard.

     

    If that doesn't work, I guess I'll try repainting the case.

  6. I've had uptime of about one week, and in that week I've added one more 1TB drive, and I've copied about 1.6TB to the array.  On occasion, I've run a Parity-Check, and until this morning, I've been clean.

     

    This morning, I got one Sync Error which certainly wasn't due to an unclean shutdown, since I've had about a week of uptime.

     

    So, now I'm back to mistrusting this system.

     

    I'd greatly appreciate hearing who advises loosening up the RAM timing (maybe from 9-9-9-24 to 10-10-10-28, as recommended by @jonathanm and @chickensoup?), and who advises doing something else!

     

    Thanks to everyone :-)

  7. One more thing...here is an updated and somewhat more flexible version of the troubleshooting script from http://lime-technology.com/wiki/index.php/FAQ#How_To_Troubleshoot_Recurring_Parity_Errors, with easy-to-adjust variables at the head of the file, plus additional reporting and logging:

     

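    #!/bin/bash
    # Repeatedly read the same block range from a drive and log an MD5 hash
    # of each read; on healthy hardware every hash should be identical.
    LOG_DIR=/var/log/hashes
    TEST_DRIVE=sdb      # device name under /dev
    SKIP=25000          # blocks to skip from the start of the drive (dd default: 512 bytes each)
    BLOCKS=100000       # blocks to read per pass
    PASSES=10000        # number of read passes
    
    mkdir -p "$LOG_DIR"
    cd "$LOG_DIR" || exit 1
    # Print the run header to the console and append it to the log.
    date | tee -a "$TEST_DRIVE.log"
    echo "PASSES=$PASSES, BLOCKS=$BLOCKS, SKIP=$SKIP" | tee -a "$TEST_DRIVE.log"
    
    for i in $(seq 1 "$PASSES")
      do
        echo "Begin $TEST_DRIVE, pass $i."
        # dd's transfer statistics go to stderr, so they stay on the console.
        dd if="/dev/$TEST_DRIVE" skip="$SKIP" count="$BLOCKS" | md5sum -b >> "$TEST_DRIVE.log"
      done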

     

    Enjoy!
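
For anyone who would rather see mismatches flagged as they happen instead of reading the log afterwards, here is a minimal sketch of a variant. The function names `read_hash` and `verify_reads` are hypothetical and not part of the script above, and the sketch assumes the first read is the correct one, which is not guaranteed on a box that corrupts reads.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the original script.

read_hash() {
  # $1 = device or file, $2 = blocks to skip, $3 = blocks to read
  # (dd's default block size of 512 bytes is assumed).
  dd if="$1" skip="$2" count="$3" 2>/dev/null | md5sum | cut -d ' ' -f 1
}

verify_reads() {
  # $1..$3 as for read_hash, $4 = total number of passes.
  # The first pass is taken as the reference hash (an assumption).
  local reference current mismatches=0 i
  reference=$(read_hash "$1" "$2" "$3")
  for i in $(seq 2 "$4"); do
    current=$(read_hash "$1" "$2" "$3")
    if [ "$current" != "$reference" ]; then
      echo "pass $i: $current (expected $reference)"
      mismatches=$((mismatches + 1))
    fi
  done
  echo "mismatches: $mismatches"
}
```

Something like `verify_reads /dev/sdb 25000 100000 10000` would mirror the settings at the head of the script above.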

  8. @Joe L.: I'm with you there, Joe, and thanks for all your input.  Interestingly, in case you missed it, I currently have the motherboard set to Auto regarding all RAM timing, speed, and voltage!  Go figure.

     

    And I want to throw two other things out there for anyone learning from this thread:

     

    1) My MSI motherboard does the dual channel thing when paired DIMMs are either in slots 1 & 2 or 3 & 4.  Even when I was still using the original Kingston RAM, voodoo went away when I put the DIMMs in slots 1 & 3, thereby disabling dual channel mode.  I didn't find any way in the BIOS to turn it off.

     

    2) With the original RAM, leaving the motherboard in Auto mode SOMETIMES got the timings and speed wrong, though it always got the voltage right.  It was sometimes underclocking the RAM (no harm done, except for lost speed), and sometimes tightening the timings all the way to 7-7-7 (that's no good at all).  Regardless of what the BIOS interface claimed were the current RAM settings, it was useful to fire up Memtest86+ just to see what was reported there.

     

    If I learn anything else about my case, I'll post it here.

     

    A good rule of thumb: it is never the hardware, except when it is.

  9. If you are just joining us, the summary of my parity issues is this: RAM voodoo.

     

    While it is possible that the power supply or some other actor created the RAM issues, I did buy a new pair of dual-channel-ready DIMMs, and did get them working somehow despite some initial weirdness.

     

    Right now, my BIOS is set to Auto for all RAM timing, speed and voltage settings; the BIOS reports 1.504 volts, and Memtest86+ reports 666MHz (DDR1333), CAS 9-9-9-24.

     

    Yesterday, after about 20 hours of successful Memtest86+ testing, I re-added my two drives to the array, restarted it, and checked parity, which took 500+ minutes.

     

    The result was 111 sync errors, which initially horrified me, but then I realized that when I last checked (and repaired) parity a couple weeks ago, the RAM voodoo would've resulted in a bunch of erroneous parity corrections.

     

    If you can't trust your RAM, all bets are off :-)

     

    After the first pass of parity correction with the 111 sync errors (why aren't these called parity errors?), I ran it again.

     

    The result was zero errors, and that's the first time I've ever seen that out of this server.

     

    So, I'm going to spend the next week or so beating up on this thing before I start trusting it with real data.

     

    If I see any more signs of RAM voodoo, I'm going to try loosening up the RAM timing to 10-10-10-28, as recommended by @jonathanm and @chickensoup.

     

    If I get desperate, I might even buy a Corsair power supply to replace the HEC.

     

    Can anyone think of anything else I should (or shouldn't!) do?

     

    Thank you all.

  10. @Joe L.: Thanks for your feedback re the appropriateness of this thread, and for your snarky opinions about Windows ;-)

     

    @jonathanm: I hadn't considered the possibility of loosening things up beyond the RAM specs...I guess there wouldn't be much performance hit, especially since this is a file server, not a gaming box.  That will be my next move, assuming I change anything at this point!
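
As a rough check on how small that performance hit is, CAS latency in nanoseconds can be worked out from the cycle count and the data rate: the memory clock is half the DDR data rate, so one cycle lasts 2000 / rate nanoseconds. The sketch below is only that arithmetic; `cas_latency_ns` is a hypothetical helper name, and it covers the first timing number only, not the other three.

```shell
#!/bin/bash
# Hypothetical helper: convert a CAS latency in clock cycles to nanoseconds.

cas_latency_ns() {
  # $1 = CAS latency in clock cycles (e.g. 9)
  # $2 = DDR data rate in MT/s (e.g. 1333 for DDR3-1333)
  # Clock in MHz is rate / 2, so one cycle is 2000 / rate nanoseconds.
  awk -v cl="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", cl * 2000 / rate }'
}
```

For DDR3-1333, `cas_latency_ns 9 1333` comes out at about 13.5 ns and `cas_latency_ns 10 1333` at about 15 ns, so loosening from CL9 to CL10 costs roughly 1.5 ns per access.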

     

    @Johnm: That sounds smart, though I'm error-free and 20 hours in at this point, plus leaving town in a couple days, so I'm thinking I'll switch to some "real world" unRAID testing.  If that goes well, I'll run an exhaustive multi-day Memtest while I'm gone.  If that doesn't go well, I probably need to make some changes (10-10-10-28?) before continuing with Memtest.

     

    @chickensoup: When you say "changing DIMMs", do you mean trying another set?  Or do you just mean moving them around?  I'm on my second set of DIMMs (Kingston 1GBx2, then Crucial 2GBx2), and I've already seen that with the Kingstons, putting them in slots 1 & 3 appears to eliminate the problem (see http://lime-technology.com/forum/index.php?topic=19936.msg179372#msg179372).  Also, I can find a BIOS setting to disable interleaving, but can't for the life of me find a dual-channel setting.

     

    So, as I said to @Johnm above, I'm error-free after 20 hours of Memtest.  I don't know if I accidentally worked something out, or if a slight change in temperature, barometric pressure or the phase of the moon is going to put me right back where I started.  If everything had been just perfect with the new RAM, I'd feel great right now.  Given that things were initially worse with the new RAM, I don't know what to think.

     

    Shall I loosen up the RAM timings?  Leave things alone?  Contact MSI?  Burn some sage?

     

    Thanks very much for all the great feedback, everyone.  This is a great community!

  11. Well, I have the new RAM (Crucial CT2CP25664BA1339 4GB 2GBx2 240-pin PC3-10600 DIMM DDR3 Memory Kit), and the results are bizarre.

     

    The BIOS and Memtest86 agree that I'm running the RAM at 666MHz (DDR1333), CAS 9-9-9-24.  The BIOS confirms 1.504V.  I believe these values to be correct, but can't readily confirm it anywhere except the SPD data reported by the BIOS utility.

     

    I'm now running 4GB RAM total instead of 2GB, and I don't like the fact that more than one variable has changed.  Bad science.

     

    1. With the new RAM installed in the paired dual channel slots 1 and 2, the system boots, and my 10,000-pass MD5 test of 100,000 disk blocks returns about two or three times as many errors as it did with the original Kingston RAM.

     

    2. When I fire up Memtest86, everything looks good until Test #6 [Moving inversions, 32 bit pattern], then many errors are reported in the first pass.

     

    3. I remove the DIMM from slot 2 and fire up Memtest86 again.  No errors are reported in two passes.

     

    4. I replace the DIMM in slot 1 with the other, then Memtest86 again.  No errors are reported in two passes.

     

    5. I put the free DIMM into slot 2 (the two DIMMs are now back in slots 1 and 2, but swapped relative to step 1), and Memtest86 again.  No errors are reported in two passes, which is really unexpected, given step 2!

     

    6. I boot unRAID and retry the 10,000-pass test from step 1.  No errors.

     

    So, is this system haunted?  Was I having a physical seating problem with one of the slots, now accidentally remedied?  Should I still order a new power supply?

     

    I'm happy the box currently appears to be working, but don't trust it at all, since an hour ago it wasn't working with the same two RAM sticks.  And, of course, I'm a little concerned that double the RAM might affect the outcome of the 10,000-pass test.

     

    Thoughts?

  12. Okay...I have a little more news...

     

    If I put one of the two matched DIMMs in slot 1, everything works well.

     

    If I put the other of the two matched DIMMs in slot 1, everything works well.

     

    If I put the two matched DIMMs in slots 1 and 2, I get my MD5 data corruption problem.  This is the recommended Dual Channel configuration according to the motherboard documentation.

     

    If I put the two matched DIMMs in slots 1 and 3, everything works well, much to my surprise.

     

    So...it is looking like my motherboard has an issue with these specific DIMMs, or maybe with the Dual Channel configuration in general, or with something else I'm missing.

     

    New DIMMs (Crucial instead of Kingston) are arriving tomorrow, but at this point I'd put money on them making no difference.

     

    (Also, I fully recognize that my issue is no longer about unRAID, per se--this is now just about a motherboard that doesn't like the RAM I put in for some reason, and I'd likely be struggling even if this were a Windows box.  Is this thread now inappropriate for these forums?)

  13. @lionelhutz and @chickensoup: Thanks for the advice, guys!

     

    @Joe L.: I like your positive attitude  :o

     

    The situation at the moment is this: I pulled one of the two DIMMs, ran my 10,000-pass drive-reading test, and it passed!

     

    So, I pulled that DIMM, put just the other DIMM in, and fired up the 10,000-pass test again, expecting (or, at least, hoping for) failures.  But it passed, too!

     

    So...either my testing results are influenced by less or differently-configured RAM, or this system works well with one DIMM or the other, but not both.  They came packaged together as a "Dual Channel" kit, and have been installed in the appropriate slots, per the motherboard manual.

     

    I've ordered a different brand of RAM to be delivered tomorrow, in hopes it will play nice with the motherboard.  I'm holding off on a new power supply for the moment.

     

    Other thoughts?

  14. @Joe L.: The BIOS reports that it is running the DRAM at 1333MHz and 1.504V.  Though the SPD reports that this is 9-9-9 DRAM, I did try 8-8-8, and found that it destabilized the system--plenty of kernel panics.  Should I trust what SPD tells me?

     

    @chickensoup and @lionelhutz: I don't have an appropriate alternate power supply handy, but will order one.  Does anyone recommend anything more than Corsair, Seasonic and PC Power?

     

    @lionelhutz: I'll try one module at a time, then order some new RAM when I order a new power supply.

     

    I'll have two servers soon... ;-)

  15. @bonienl: Good point...I did forget about the RAM timings recommendation, and now that I look closely at it, I'm a little confused.

     

    The DIMMs don't have a timing sticker, unless the timings are somehow encoded in one of the long numbers on there.  At the very least there is no #-#-#-# sort of declaration.

     

    I can't find any KVR1333D3K2/2GR timing documentation online, but I can find people who sound like they know of what they speak saying this RAM is 9-9-9-24.

     

    In the BIOS, the "DIMM Memory SPD Information" page says both DIMMs are Cycle Time=1CLK; TCL=9CLK; TRCD=9CLK; TRP=9CLK; TRAS=24CLK.

     

    The BIOS documentation says that if the DRAM Timing Mode is set to Auto, it gets the timing information from the SPD data.

     

    So, the word on the street is that the RAM is 9-9-9-24, and the BIOS says it is 9-9-9-24.  Is there still a point to me manually setting it to 9-9-9-24?  I'm not resistant to doing this--I just want to make sure I'm doing the right thing.

     

    @dgaschk: No, it is three drives, and I can't replicate the behavior on another system using those same drives.

     

    @ljh89: There is no overclocking or core-unlocking enabled.

     

    Thanks for all the feedback!

  16. Well, that was some excellent advice....thanks, guys.  I learned a bunch, but I don't know what to do about it!

     

    I have discovered that with either of my drives (2TB WD, 1TB Seagate) and with one other "junk drawer" drive (320GB Seagate), if I read 200,000 blocks 10,000 times, calculating an MD5 hash each time, about 0.1% of the MD5 hashes will be different (or, more simply, "wrong").

     

    Even stranger, the hashes are not strictly random when they are wrong.  That is, if the hash is "60f3d5b4459a58ba0d4c57cf10e47a3a" 99.9% of the time, I may see that after a couple thousand hashes I get a "c8a48d2d3009d7c897a853a924904029", then the hashes may be right for a couple thousand more reads, and then I may get another "c8a48d2d3009d7c897a853a924904029" hash.  Sometimes I'll get a wrong hash that never does repeat itself, but most eventually do.
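
A quick way to see this pattern is to tally how often each distinct hash appears in the log. The following is a minimal sketch, assuming a log in which each read contributes one line beginning with a 32-character lowercase hex MD5 hash (as `md5sum` prints it) and any other lines are headers; `tally_hashes` and `mismatch_count` are hypothetical helper names.

```shell
#!/bin/bash
# Hypothetical helpers for summarizing a log of MD5 hashes.

tally_hashes() {
  # $1 = log file. Prints "count hash" lines, most frequent hash first.
  # Lines that do not start with a 32-character hex hash are ignored.
  grep -Eo '^[0-9a-f]{32}' "$1" | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

mismatch_count() {
  # $1 = log file. Counts reads whose hash differs from the most common one,
  # on the assumption that the most common hash is the correct one.
  tally_hashes "$1" | awk 'NR > 1 { n += $1 } END { print n + 0 }'
}
```

A wrong hash that repeats shows up in the `tally_hashes` output with a count greater than one, while a one-off shows a count of 1.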

     

    Replacing the SATA cables doesn't make a difference.

     

    I have also failed to reproduce these results when testing the same drives on another computer.

     

    So...it appears I have built a shiny new nightmare of a file server that corrupts disk reads a statistically significant percentage of the time.

     

    Any opinions on what I replace first?  RAM?  CPU?  Power supply?  Motherboard?

     

    Or am I looking at it wrong?

  17. Thanks for your responses, gentlemen.

     

    The first thing I did upon seeing this result the first time was to run a 24-hour Memtest; no errors were reported.

     

    Also, I did attach SMART results to my original post.

     

    I assume this more complex drive test you both refer to is "reiserfsck --check".  I just did four back-to-back runs on the data drive, and the results are disturbing:

     


     

    Run 1:

    Reiserfs journal '/dev/md1' in blocks [18..8211]: 0 transactions replayed

    Checking internal tree.. \/105 (of 145\/ 68 (of 170\bad_indirect_item: block 212886871: The item (1163 1165 0x1d4cb001 IND (1), len 4048, location 48 entry count 0, fsck need 0, format new) has the bad pointer (117) to the block (213008304), which is in tree already finished

    Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.

    Checking Semantic tree:

    finished

    3 found corruptions can be fixed when running with --fix-fixable

     

    Run 2:

    Reiserfs journal '/dev/md1' in blocks [18..8211]: 0 transactions replayed

    Checking internal tree.. \/ 64 (of 145// 30 (of 170\bad_indirect_item: block 47209215: The item (359 369 0xf104001 IND (1), len 4048, location 48 entry count 0, fsck need 0, format new) has the bad pointer (117) to the block (47270912), w/ 95 (of 145-/152 (of 170-bad_indirect_item: block 182910983: The item (1151 1152 0x17d6001 IND (1), len 4048, location 48 entry count 0, fsck need 0, format new) has the bad pointer (117) to the block (182919168), which is in tree already                        /102 (of 145-/ 45 (of 170\bad_indirect_item: block 212107611: The item (1163 1164 0x551de001 IND (1), len 4048, location 48 entry count 0, fsck need 0, format new) has the bad pointer (117) to the block (212462848), which is in tree already finished

    Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.

    Checking Semantic tree:

    finished

    5 found corruptions can be fixed when running with --fix-fixable

     

    Run 3:

    Reiserfs journal '/dev/md1' in blocks [18..8211]: 0 transactions replayed

    Checking internal tree.. \/ 13 (of 145\/128 (of 170\bad_indirect_item: block 83331021: The item (5 15 0x44d1d001 IND (1), len 4048, location 48 entry count 0, fsck need 0, format new) has the bad pointer (117) to the block (83616512), whi/ 71 (of 145-/  3 (of 170/bad_indirect_item: block 66715745: The item (428 429 0x1778c001 IND (1), len 4048, location 48 entry count 0, fsck need 0, format new) has the bad pointer (117) to the block (66813956), which is in tree already / 95 (of 145-/154 (of 170|bad_indirect_item: block 182910985: The item (1151 1152 0x1fbe001 IND (1), len 4048, location 48 entry count 0, fsck need 0, format new) has the bad pointer (117) to the block (182921216), which is in tree already /126 (of 145-/ 86 (of 170|bad_indirect_item: block 231112707: The item (1231 1257 0x1 IND (1), len 396, location 336 entry count 0, fsck need 0, format new) has the bad pointer (45) to the block (231118080), which is in tree already finished

    Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.

    Checking Semantic tree:

    finished

    6 found corruptions can be fixed when running with --fix-fixable

     

    Run 4:

    Reiserfs journal '/dev/md1' in blocks [18..8211]: 0 transactions replayed

    Checking internal tree.. finished

    Comparing bitmaps..finished

    Checking Semantic tree:

    finished

    No corruptions found

    There are on the filesystem:

            Leaves 24105

            Internal nodes 146

            Directories 46

            Other files 1500

            Data block pointers 24307524 (0 of them are zero)

            Safe links 0

     


     

    So...3 unfixed corruptions found on the first run, then 5 on the next run, then 6 on the next run, then 0 on the last run.

     

    I emphasize that I never repaired any of the reported file system problems, which likely means disk reads are sometimes corrupted, but that the disk is not actually corrupted.
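
If the output of each run is saved to a file, the per-run counts can be pulled out for side-by-side comparison. This is a minimal sketch that assumes only the two summary forms shown in the runs above ("N found corruptions ..." and "No corruptions found"); `corruption_count` is a hypothetical helper name, and other summary wordings may exist that it does not handle.

```shell
#!/bin/bash
# Hypothetical helper: read saved check output on stdin and print the
# number of corruptions reported in its summary line.

corruption_count() {
  awk '
    /found corruptions/ { print $1; found = 1; exit }
    /^[[:space:]]*No corruptions found/ { print 0; found = 1; exit }
    END { if (!found) print "unknown" }
  '
}
```

With each run saved under a hypothetical name such as `run1.txt`, `corruption_count < run1.txt` prints that run's count. Differing counts across runs on an unmodified file system point at the reads rather than the disk contents.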

     

    The disk passes a long SeaTools test, and passes preclear.  RAM tested clean after 24 hours of continuous testing.

     

    Replace the SATA cables?  Replace the power supply?  I'm a capable troubleshooter, but not a Linux or unRAID pro, and it feels to me like everything is suspect.

     

    Any thoughts?

  18. Summary

    I recently built my first unRAID server, and I’m having bizarre parity issues, despite my best efforts and a lot of troubleshooting.

     

    Configuration

    unRAID 4.7

    Motherboard: MSI 870S-G46

    CPU: AMD Athlon II X2 215 2.7 GHz ADX215OCK22GQ

    RAM: 2GB Kingston PC3-10600 DDR3 dual channel (KVR1333D3K2/2GR)

    PS: hec Zephyr MX 750

    Parity drive: 2TB Western Digital WD20EARS

    Disk1 drive: 1TB Seagate 31000528AS

    (I kept it to just the two drives at first, so I could get to know unRAID a bit.)

     

    Problem

    The problem is that after preclearing both drives, firing up the array, copying ~100GB of big multimedia files to the server, seeing zero errors on both drives according to the rightmost column on the Main page, and then clicking Check to start a Parity-Check, I’m told there are dozens of parity errors.

     

    Things you might want to know

    I’m running unRAID 4.7

     

    The drives are reporting 25-26°C

     

    Both drives are “MBR: 4K-aligned”

     

    I've attached SMART reports for both disks, as well as a lengthy, messy syslog (sorry about that)

     

    Troubleshooting I’ve done

    Memtest ran for over 24 hours with no errors reported.

     

    Western Digital and Seagate drive utilities report nothing strange about either drive.

     

    Each drive has been precleared at least three times, never with what I interpret as an error or failure report.

     

    My testing

    I’ve seen this problem consistently after preclearing both drives and starting from scratch three times--yes, that's many days of preclearing!

     

    During the last preclear cycle, I precleared the 2TB drive once while preclearing the 1TB drive twice back-to-back, just to make sure there was plenty of activity involving both drives.

     

    After that last round of preclears, I issued an “initconfig” before repopulating the devices.

     

    No problems were reported during the initial Parity-Sync.

     

    The copying of the 100GB of big files was started after Parity-Sync completed, and seemingly went smoothly--the Windows 7 box pushing the files didn’t complain, anyway.

     

    After the copy was complete, I clicked the Check button, and eventually 39 parity sync errors were fixed!

     

    My questions

    What do I try next regarding the parity failures?

    This is feeling like a hardware failure outside the drives. I’m fearing I’ve chosen the wrong motherboard, or have a lemon.

     

    Am I right that it is fine to preclear all drives with the -A option?

    My understanding is that the Seagate drive doesn’t need this, but it is fine to do so, other than some space lost if I have a huge number of files. For the sake of elegance, simplicity, and never getting it wrong on any given drive in the future, I’m hoping always doing this is fine.

    bland328_SMART_sda.txt

    bland328_SMART_sdb.txt

    bland328_syslog_with_a_few_notes_at_the_top.txt