Help - Please


Chem13


SYMPTOMS:

- System consistently shows parity invalid even after running successful parity check.  

- Random system lock-up requiring hard reboot.  

- Periodically will not show all drives present.  Will list the drive and show “no device”

DISCUSSION:

- I can’t seem to tie the system lock-up to any one activity…they seem to happen more frequently during parity checks

- Lock-up requires a hard reboot by powering off then back on

- Even after a successful parity check, I get invalid parity on the next reboot.

- I have checked the file system on each drive and found no errors

- Files on the drives appear to be unaffected and accessible

 

 

 

 

Help Please:

 

The Lime server has been the best investment I have made ever since my best friend told me about the system.  This is the first problem I have had in many years of operation.

 

My system:

 

It is about 3-4 years old.

Running software version 4.4.2

2 Thermaltake 550 watt power supplies

Asus P4P800 Deluxe motherboard

Intel P4 2.4 GHz CPU with retail cooler

2 Gig Ram

1 Promise IDE card (brand and model recommended on Lime site)

1 Promise SATA card (brand and model recommended on Lime site)

Cooler Master large tower case

 

10 hard drives ranging in size from 750 gigs to 400 gigs

 

Parity:  SAMSUNG HD753LJ                  Size: 750 Gig SATA
Disk 1:  Hitachi HDS721075KLA330          Size: 750 Gig SATA
Disk 2:  Hitachi HDS721075KLA330          Size: 750 Gig SATA
Disk 3:  ST3400633A 3NF1QWPN              Size: 400 Gig IDE
Disk 4:  Hitachi HDS721075KLA330          Size: 750 Gig SATA
Disk 5:  Hitachi HDS721075KLA330          Size: 750 Gig SATA
Disk 6:  MAXTOR STM3500630A 5QG0JAF3      Size: 500 Gig IDE
Disk 7:  ST3400632A 5NF19PLB              Size: 400 Gig IDE
Disk 8:  ST3500630AS                      Size: 500 Gig IDE
Disk 9:  HDS725050KLAT80 KRVA45ZAH9EZYF   Size: 500 Gig IDE

 

Note:  The IDE drives are using round cables (I didn’t realize that was not recommended until I started researching my issues in the forums).  I have been using those cables for 4 years without an issue until now.  I am mentioning it now to be thorough and cover all possibilities.

 

I am in the process of changing out all the IDE drives.  I started with 9 IDE drives and as funds become available, I replace them with SATA.  I’m looking at a couple of 1.5 or 2.0 TB drives as the next upgrade, with one of them becoming the new parity and replacing the 400 Gig IDE (Disk 3), and moving the current parity to replace drive 7 (400 Gig IDE).

 

 

 

Symptoms:

 

- System consistently shows parity invalid even after running successful parity check.  

- Random system lock-up requiring hard reboot.  

- Periodically will not show all drives present.  Will list the drive and show “no device”

 

 

Discussion:

 

I suspect a hardware issue, possibly the Promise controller card or one of the hard drives.  The system randomly locks up and requires a hard reboot through the computer case power switch.  All access through the GUI or by logging onto the system is blocked when this happens.  I can’t seem to tie the lock-up to any particular activity.  It happens when I’m moving files from one drive to another.  It also happens when I’m just running a parity check.  The lock-up seems to happen more frequently when I’m running a parity check.  At least for now, the files seem to be present and unaffected on the hard drives.  I am still able to access them while the parity check is in progress or if I cancel the check and just examine the system while parity shows the red dot (invalid).

 

When I do a hard reboot like this it will start up and immediately proceed into a parity check.  Initially I was noticing that only some of the drives were showing on the GUI (drives 1-4) and the other drives showed as “no device”.  I opened the system case and reseated all the drives and cables and controller cards and at least  for the last 4 reboots this particular issue has disappeared.  All drives are showing as present.

 

I did notice that until I checked all the cards and cables, I was getting this error as the final command on the boot-up screen, “/dev/md*:  no such file or directory”.  After checking all the cables and cards, this error no longer is present and I’m not noticing any error messages on the boot-up screen, but am still having the lock-up and parity problems.

 

With each hard reboot, it shows the parity as being invalid and starts a parity check.  I also am getting an invalid parity when I do a proper shut down through the GUI (stop all drives, and click reboot).

 

I have let it run a full parity check (took about 8 hours) and got the green dot by the parity drive at the completion.  I then did a normal reboot from the GUI and it came up with a red dot by the parity drive saying the parity was invalid.  Sometimes I can run the parity check with no issues and sometimes the system totally freezes in the middle of the parity check, requiring the hard reboot discussed above.

 

I have checked the file system on each of the 9 hard drives and found no errors.  I used:

samba stop

umount /dev/md1

reiserfsck /dev/md1

mount /dev/md1 /mnt/disk1

/usr/sbin/smbd -D

/usr/sbin/nmbd -D

There were no errors found and it didn’t ask me to run --fix-fixable or --rebuild-tree on /dev/md1.
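
For anyone repeating that check on every data disk, here is a minimal sketch that only prints the command sequence for disks 1 through N, so it can be reviewed and then run by hand. It assumes data disk N is /dev/mdN mounted at /mnt/diskN, as in the commands above. The function name fsck_plan is made up for illustration, and --check is just reiserfsck's read-only default spelled out.

```shell
#!/bin/sh
# Print (do not run) the per-disk check sequence for data disks 1..N.
# Assumption: data disk N is /dev/mdN, mounted at /mnt/diskN.
fsck_plan() {
    count=$1
    i=1
    while [ "$i" -le "$count" ]; do
        echo "umount /dev/md$i"
        echo "reiserfsck --check /dev/md$i"
        echo "mount /dev/md$i /mnt/disk$i"
        i=$((i + 1))
    done
}

# Example: list the commands for a 9-disk array
#   fsck_plan 9
```

Samba still has to be stopped before the first umount and restarted after the last mount, as described in the post.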

 

 

 

Link to comment

You may be having multiple problems. The issue where a clean shutdown (using the unRaid GUI) causes a parity check on the reboot may be caused by a bad or failing USB stick. If the computer loses access to the USB stick or can't write to it at shutdown, it causes that symptom. Could be a bad USB stick or something else with the USB port. Your drive problems don't sound related, but not sure.

 

I'd recommend posting a syslog and also doing smartctl reports on each drive. That should give some clues.
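
A rough sketch of collecting one smartctl report per drive into text files that can be attached to a post. The device names and output directory are placeholders, and the SMARTCTL override variable is a made-up convenience so the loop can be dry-run with echo. The only smartctl option used is -a (print all SMART information).

```shell
#!/bin/sh
# Save one SMART report per drive.
# Assumptions: device names and the output directory are placeholders;
# SMARTCTL is a hypothetical override used only for dry runs.
smart_reports() {
    outdir=$1
    shift
    mkdir -p "$outdir" || return 1
    for dev in "$@"; do
        name=$(basename "$dev")
        # -a prints all SMART information for the device
        "${SMARTCTL:-smartctl}" -a "$dev" > "$outdir/smart_$name.txt" 2>&1
    done
}

# Example (substitute your own devices):
#   smart_reports /boot/smart /dev/sda /dev/sdb /dev/hda
```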

Link to comment

I would run a memory test... everything you are describing points to a hardware issue, exactly as you suspected.

 

If you have not checked the bios recently, make sure the voltage and timings for the memory are set correctly.  Run the memory test overnight. (unless it shows errors immediately)

 

Also, make sure all the fans are working properly... perhaps something is overheating.

 

Joe L.

Link to comment

I have to agree, that a memory test is imperative.  Heat issues would be my next guess.  Failing motherboard and power supply issues would be next.  Random lockups are usually associated with memory, heat, and power issues.

 

Your syslog looks good, no issues apparent.  You are however, using a VIA chipset that I don't believe anyone else is using.  Boards based on VIA chipsets have been problematic at best, for unRAID.  If all other suspects check out OK (memory, temps, PSU), you may want to look for a motherboard BIOS upgrade, or return to an earlier version of unRAID, one that previously worked with your motherboard.

 

Round cables (although risky) and the drives are not likely the problem, as you should have seen errors in the syslog related to the drives, and they won't hang the machine either.  Your drives and data are probably all fine.

 

"I also am getting an invalid parity when I do a proper shut down through the GUI (stop all drives, and click reboot)."

Not sure about this.  If the array is all green including the parity drive, and you can access files on the flash drive yourself, when you stop the array, AND if all drives are recognized after reboot, there should not be a parity check.  If not, then there may be an issue with the flash drive, or the system was corrupted and lost access to the flash drive.  Try to view the folders on the flash drive, and then capture a syslog, just before powering off.
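
The "view the flash drive, then capture a syslog" step might look something like this sketch. It assumes the syslog lives at /var/log/syslog and the flash drive is mounted at /boot. Both are defaults that can be overridden, and the function name capture_syslog is invented here.

```shell
#!/bin/sh
# Check that the flash drive can be listed, then copy the syslog onto it.
# Assumptions: /var/log/syslog and /boot are default locations for this sketch.
capture_syslog() {
    src=${1:-/var/log/syslog}
    dest=${2:-/boot}
    # If the flash cannot even be listed, stop here: that is already a clue.
    ls "$dest" > /dev/null || return 1
    cp "$src" "$dest/syslog-$(date +%Y%m%d-%H%M%S).txt" && sync
}

# Example: run just before powering off
#   capture_syslog
```

If the copy fails with a read-only file system error, that points at the flash drive rather than the array.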

Link to comment

I'd recommend posting a syslog and also doing smartctl reports on each drive. That should give some clues.

 

Sorry, didn't realize you had posted syslogs already.  My mobile web doesn't show me attachments - or even tell me that there are attachments.

 

I still think doing smartctl reports is a good idea.  The memory test is also recommended.

Link to comment

To start, I want to say thanks for the help and suggestions.

 

Here is what I have found so far:

 

1.  Ran Smartctl on all drives and parity and didn't see any errors.  The temperature was up a little, but that was because I had the case open and improper air flow.

 

2.  The flash drive seems to be fine in that I can access it, read, and write to it without difficulty

 

3.  The memtest won't complete.  I originally got "invalid or corrupt kernel Image".  I reloaded 4.4.2 onto the flash drive and ran the memtest.  It hangs every time at about the same location.

 

Pass:  7%

Test: 21%

Test: #3 [Moving inversions, 8-bit pattern]

Testing:  0K-16M  1023 relocated

Pattern:  bfbfbfbf

 

walltime:  0:01:21

Cached: 1023M

RsvdMem: 864K

MemMap: e820-Std

Cache: on

ECC: off

Test: Std

Pass: 0

Errors: 0

Ecc Errors:

 

4.  I notice on reboots I get this message:

 

remounting root file-system read-only

mount: can't find / in /etc/fstab or /etc/mtab

 

5.  I am still able to access the data on each of the 9 drives.  It finds the drives after each reboot.  However, each time it shows parity as invalid.  I can run a parity check and it completes without issues and shows green for all drives.  However, when I reboot, it comes back up with each drive accessible, but parity invalid.

 

6.  I have attached some of the smartctl reports and a fresh syslog.  Each smartctl report showed similar information and each said "Pass".

 

I am running a fresh parity check tonight and it will finish sometime around lunch.  I will post if anything different happens.

 

7.  I only had the system lock up one time in about 20 reboots and testing.  That issue seems to have diminished in scope, however, there is no change regarding the parity issue.

 

 

Link to comment


I don't think you need to look too much further if the memory test does not pass.  Bad memory affects everything that uses it (which means it will cause errors in anything that uses it, in different and unpredictable ways: exactly your symptoms).

 

If your motherboard allows you to set the memory voltage and/or timing, make sure it is set for your specific memory.  Premium memory often takes higher voltage than value memory.  The same goes for timing when accessing memory.  Your memory may need different timing settings than the BIOS sets by default.

 

If your BIOS has neither "voltage" nor "timing" settings, or if they are already set as they are supposed to be, then I would do the following, in order:

1. re-seat the memory in its sockets.

2. swap the memory strips between the sockets.

3. replace the memory with good "value" memory. (You don't need premium memory, as we don't overclock. Or are you?)

4. replace the power supply.

5. replace the motherboard.

 

Until you can perform memory tests reliably, you do not need to do anything more.

 

Joe L.

Link to comment

The parity test is still going fine at 42% as I get up and head to work...I will let it run.

 

1.  I agree, the failed memory test was not a good sign.  I don't overclock.  I am using Corsair Valueselect 184-Pin DDR SDRAM DDR400 (PC3200) memory.  The motherboard is set to "auto" for all the memory settings.

 

Since the memory and the settings have worked without issue for a number of years, I tend to think that it may have just gone bad.  Let me switch the memory between slots and reseat each stick and try the memory test again.

 

I can see the memory easily accounting for the random lock-ups, but would that explain the system running fine with all files accessible except an inability to keep a valid parity?  Or am I back to the possibility of multiple issues?

 

I will get the memory test working today one way or the other and update here again.  Thanks for the help and advice.

Link to comment


I would stop the parity check.  If it "fixes" parity based on bad bits read from memory, it will actually be putting the wrong values in parity.

A subsequent parity check (after you fix the memory problem) will then find the incorrect parity values and fix them once more.  So expect a few parity errors on the first parity check AFTER you fix the memory problem.  Then, you should not see any more errors on any subsequent parity checks.
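
To make that concrete, here is a toy illustration, assuming parity is a plain bitwise XOR across the data disks. The byte values and the function name parity_of are invented for the example. One bit flipped in memory while the check is reading a disk produces a "corrected" parity byte that no longer matches what is actually on the disks.

```shell
#!/bin/sh
# Toy model: the parity byte is the XOR of one byte from each data disk.
# Assumption: simple XOR parity; the values are made up.
parity_of() {
    p=0
    for b in "$@"; do
        p=$(( p ^ b ))
    done
    echo "$p"
}

# Bytes really stored on three data disks: 165 60 15
#   parity_of 165 60 15    # what parity should hold
# Same read with one bit flipped in RAM (165 seen as 164):
#   parity_of 164 60 15    # what a "correcting" check would write instead
```

Once the memory is fixed, the next check computes the first value again, sees the mismatch, and repairs it, which is why a few errors on that first clean pass are expected.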

 

It is good that the memory is failing consistently at a specific spot.  To me, that indicates a memory issue and not a motherboard or power supply issue, but time will tell.

 

Good luck...  Do not assume anything just because it worked for years.  Your BIOS battery might be getting weak and settings might have changed in the BIOS.  Your power supply might be putting out a tiny bit less voltage, and the memory voltage was right on the edge of working...

 

I'd verify the memory voltage.  According to the Newegg site, that memory needs 2.5 volts.  According to the Corsair site, the timing is 5-5-5-15 for most of the memory strips like yours.  (The default memory voltage on many motherboards is 1.8 volts... way less than your memory needs.)

According to the following page on the Corsair site, the voltage might need to be as high as 2.75 volts.

http://www.corsairmemory.com/configurator/product_results.aspx?id=1915#other_modules

 

I'd start with setting the timing and voltage explicitly based on your exact memory modules.  Your "auto" setting might not be doing what it needs.

 

Joe L.

Link to comment

As I read this there are two different and distinct possible root causes that come to mind.

 

Theory #1 - You have a bad memory module.  This has been well explained.  The thing that seems odd is that the memtest is crashing.  I have used that tool many times and have never had it crash testing memory.  When it finds bad memory it logs it and goes on.  Unless that bad memory is literally the memory that the memtest program loads into, I consider this crash to be a mystery.  (UPDATE:  Joe's find may help explain.  But it still doesn't explain why the prior version of memtest failed to even load ("invalid or corrupt kernel Image" error).  Still seems fishy, but bad memory can cause fishy problems, so this is definitely still a likely cause.)

 

Theory #2 - The USB stick is bad (or something in the USB motherboard subsystem is bad).  The memory test runs a program called "memtest" off of the flash disk.  If the flash disk is bad, and this program is read incorrectly from the disk, it could cause the program to crash repeatably at the same spot.  (This would also explain the crash when loading the prior version of this file.)  The other thing that makes me believe that this is a possibility is that you are not having any problems with your parity check.  The two scenarios, (1) a crash of the memtest and (2) a problem updating super.dat (which causes the parity check to restart on reboot), lead me to believe this is worth eliminating.

 

There are a couple ways to test this theory.  One is to set up a new USB stick.  If you don't have one, another way is to rename the memtest on the USB stick to memtest.old or something.  And then copy a fresh copy of that file from the .zip file from Limetech.  If the USB stick is going bad, copying a fresh copy onto another part of the USB stick may copy cleanly, but even if not should at least change the observed symptoms. 
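
One way to sanity-check the result is to compare the copy on the stick, byte for byte, against a freshly extracted one. This is only a sketch. Both paths in the example are placeholders (treating /boot as the flash mount point and /tmp/unraid-fresh as the extracted zip are assumptions), and check_copy is a made-up name.

```shell
#!/bin/sh
# Report whether two files are byte-for-byte identical.
# Assumption: both paths are supplied by the caller; nothing is hard-coded.
check_copy() {
    if cmp -s "$1" "$2"; then
        echo "match"
    else
        echo "differ"
    fi
}

# Example, after renaming the old file and copying a fresh one:
#   mv /boot/memtest /boot/memtest.old
#   cp /tmp/unraid-fresh/memtest /boot/memtest
#   check_copy /boot/memtest /tmp/unraid-fresh/memtest
#   check_copy /boot/memtest.old /tmp/unraid-fresh/memtest
```

If the old file differs from the fresh one, the stick (or the path to it) was corrupting the file. If both match, the stick is less likely to be the reason memtest hangs.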

 

The other thing to try is plugging the USB stick into a different port.  I think that each USB "header" controls 2 ports.  You might therefore try plugging it into two different ports to make sure you're on a different header.

 

If you rule out both of these, then the motherboard is looking like a likely target.  I hope it's something simple like memory or the USB.

 

Good luck!

 

Link to comment

Always nice to see errors jump out at you, with a sign saying FIX ME!  Your first syslog had no issues, but this one:

Mar 16 18:18:10 Tower kernel: FAT: Filesystem panic (dev sde1)

Mar 16 18:18:10 Tower kernel:    fat_free_clusters: deleting FAT entry beyond EOF

Mar 16 18:18:10 Tower kernel:    File system has been set read-only

Mar 16 18:18:10 Tower kernel: write_file: error 30 opening /boot/config/super.dat

Mar 16 18:18:10 Tower kernel: md: could not write superblock from /boot/config/super.dat

That would explain why the flash drive is not getting updated, and why you have a parity check on each boot.  It does NOT explain the memory problems, so continue with the good advice above, but first...

 

Shut down your server completely, and take the flash drive to a Windows machine and run Check Disk or Scandisk on it.  Once the flash drive tests perfectly, you should re-extract the unRAID distribution to it, to make sure the unRAID programs, including memtest, are correctly stored.  Now you should be able to boot and run the memory tests.  You will have one more parity check to run, when everything else is fine.  (Looks like all 3 of us were right!  What a team!)
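
For spotting this kind of thing in a saved syslog without reading the whole file, a small sketch follows. The patterns are taken from the log lines quoted above. The function name flash_write_errors is invented, and the log path is passed in because its location is an assumption.

```shell
#!/bin/sh
# Count syslog lines suggesting the flash drive went read-only and
# super.dat could not be written. Patterns come from the excerpt above.
flash_write_errors() {
    grep -c -E 'FAT: Filesystem panic|File system has been set read-only|write_file: error|could not write superblock' "$1"
}

# Example (the path is an assumption):
#   flash_write_errors /boot/syslog.txt
```

Any count above zero means the array state was probably not saved at shutdown, which matches the parity check starting on every boot.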

Link to comment

I just want to mention that I had a similar problem a month or two back with a Gigabyte motherboard.  I ran multiple memtests and they all came back with failures.  I tried replacing the RAM... Still had failures.  I decided to test a new motherboard (ASUS) and once I got that installed and working, I ran another memory test.  I ran it for 24 hours without a single issue (using the original RAM that failed in the Gigabyte board)...  The RAM was fine, the motherboard was my issue.

 

In fact that reminds me, I should grab a syslog for verification of my motherboard for the wiki (note to self)...

 

Cheers,

Matt

Link to comment


Those FAT errors in the syslog could still be only the effects of bad memory.  If the memory caused the FAT file-system to look corrupted, it would emit the same errors.

 

Now, the bad memory could have resulted in a corrupted FAT file-system too... but one thing at a time.  Don't be surprised if the flash drive comes out clean under Windows chkdsk.  And if it is corrupt, don't be surprised if the memory test still fails...

 

(I'm still betting on the memory voltage being set wrong for his memory strips)

Joe L.

Link to comment

It looks like I'm going to have to use my AIG retention bonus and some of the government bailout money and buy a new motherboard, CPU, and memory.

 

--I'm still unable to complete a memory test.  It hangs up at almost the exact same spot each time both with the old memory and the newly purchased memory.

 

1.  I have run chkdsk on the flash drive.  It found and repaired errors.  I then reloaded a fresh version of 4.4.2 along with a fresh memtest with no luck.

 

2.  Since my system has two power supplies, I tried each one independently with no success, going on the assumption that both power supplies going bad at the same time was unlikely.

 

3.  I disconnected all hard drives and ran just the motherboard and memory with no success.

 

4.  As mentioned earlier, I purchased new memory and the memtest still freezes at the exact same spot..about 7% in.

 

5.  The last thing I can try: my best friend has the same server, so I was going to borrow his working flash drive and see if I can run a successful memtest with it.  I did boot my main computer with my flash drive and ran a complete and successful memtest on that computer to verify that the flash drive was capable of doing a memtest.

 

Barring all that, it is looking like it is time to upgrade the motherboard.

 

What motherboard would be recommended?

 

The P5B-VM is discontinued at Newegg.  I have always been a fan of ASUS motherboards, but I was considering the one currently being used in unRAID systems: the Supermicro C2SEE motherboard?

 

I will probably change out the remaining IDE drives when I do this also...get the pain over with all at once.

Link to comment

I forgot to mention that I did go into the bios and change the DRAM settings to manual and used the voltage and other recommended settings as part of the earlier tests.

 

:(

 

My guess is still that you are having USB issues.

 

... Did you try different USB ports to plug the flash into?

... Did you try a different USB stick?  Even with unRAID basic you could see if the memtest ran.

 

Second guess is a bad motherboard, and yes I'd suggest the C2SEE if you go that route.

Link to comment

Follow-up question please.

 

-Tried a new flash drive in multiple USB ports without any success

-installed new motherboard, CPU, and Memory

-System now seems to boot fine

-Parity showing as invalid (as expected)

-Disk 1 showing as unformatted (unexpected)

-all other 8 disks seem fine and fully accessible

 

 

One of the nine hard drives that previously had no issues is now showing as unformatted.  Since the parity still shows as invalid, do I have any options other than to count the data as lost and reformat?  I have tried rebooting several times.  Would using the "clear statistics" button have any effect?  Would relocating the drive from disk 1 to a previously unused designation such as disk 10 have any effect?

 

Just as an update, I obtained and set up a new flash drive with the help of Tom and that had no effect on the problems.  So I ordered the Supermicro motherboard, E5300 CPU, and 2 gig of memory.  When I booted up I still got the parity invalid (as expected) but disk 1 is now showing as unformatted.

 

I just now installed the new hardware.  I'm going to let it run all night to make sure there are no lock-ups or other problems.  I have stopped the parity check until I investigate the unformatted drive issue further.

 

Thanks again for any advice.

Link to comment


If disk1 has got data that was loaded while connected to the old motherboard, it should be recognized.

 

Is it possible that you mixed up disk1 and parity, and assigned them to the wrong slots?  If you did, all may not be lost.  If your parity check started but didn't find any errors, it may be as simple as bringing down the array and reassigning the two disks.  But if you got parity errors, it is likely that disk1 (that was in the parity slot) got creamed.  Even in that case, you could likely put the parity disk back in the right slot and rebuild disk1.  DO NOT WRITE TO THE ARRAY.  Any writes are going to lead to corruption if you switched these disks around.
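
As a toy illustration of why a rebuild is possible in that case, assuming plain XOR parity: XOR-ing the true parity byte with the bytes from every surviving data disk gives back the missing disk's byte. The values and the name rebuild_byte are invented for the example.

```shell
#!/bin/sh
# Toy model: recover a lost disk's byte from the parity byte and the
# bytes of the other data disks. Assumption: simple XOR parity.
rebuild_byte() {
    r=$1          # start from the parity byte
    shift
    for b in "$@"; do
        r=$(( r ^ b ))
    done
    echo "$r"
}

# Parity byte 150 plus the two surviving data bytes 60 and 15:
#   rebuild_byte 150 60 15
```

This only works while the real parity disk is untouched, which is the reason for the warning about not writing to the array.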

 

If you did not get them confused, and disk1 is mysteriously unformatted, I have no explanation.  Capture a syslog and post it.  You could try powering down and making sure that all of your connections to that drive are solid.  You might even try hooking it to another SATA port.

Link to comment

Thanks for all the help and advice!

 

I just wanted to pop back on to say thank you.  I appreciate all the people that take time out of their day to offer help and advice.

 

As an update, I'm up and running and back to normal.  My final issue, the drive that was showing as unformatted...Bjp99 was dead on...I had mixed the parity and drive 1 up when I assigned them (they are identical drives and only differ by the last digit)...it looks like all those years at Texas A&M were wasted.

 

Thanks again.

Link to comment
