
rebuilt data drive but after can't login/access panel?


Hi, I had a drive fail, so I put in a new drive and set it to rebuild overnight. When I left it everything seemed to be going OK, and the estimated time was about 8 hours (which actually seemed pretty fast to me). When I came back this morning, about 8-9 hours later, I could not access the panel, nor could I telnet/ssh in. I really don't want to mess things up, so I have left it running, but I can't "see" anything in terms of status as I can't access the panel/terminal.

 

What are my options here? What are the risks of switching it off and back on? Any thoughts/suggestions would really be appreciated!

 

PS Not sure if it matters (I don't think that it does), but I replaced the defunct 2 TB drive with a 3 TB drive.

Link to comment

Ah right. On the server screen it is showing a whole lot of output of something, looks like

 

[ffffffffstring of numbers] and then references to things like kthread, fork, wait woken, etc

 

at the very end it has:

 

---[end of trace dfc025133c06fd95 ]---

 

and does not give me a command prompt (I have not tried to press any keys or anything though).

 

Am a bit more worried now...

Link to comment

Well, I decided to take a risk and turn the computer off and on again, and thankfully it started up. The drive that I tried to rebuild is showing up with the little yellow triangle, which I guess means "invalid data content". Along with the output/lockup that prompted me to start this post, I am guessing that means the rebuild didn't go well...

 

So... what exactly does this mean, if anything? Should I just try to rebuild the disk again? Or is the replacement disk I used perhaps bad?

 

Thoughts, anyone (hopefully)? I can access the command prompt now, but have no idea what diagnostics to run to figure out what the situation is.

Link to comment

Please see "Need help? Read me first!" to obtain your syslog and attach it here, so perhaps we'll see something that will help.  You might also get a SMART report for that drive.  Next, it might be good to upgrade to the latest release. Not that there will be a fix for you (I think it's unlikely), but it has better diagnostic tools, to give us more info.
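
For anyone following along at the command prompt: a SMART report can be pulled with `smartctl -a` from smartmontools. Below is a minimal Python sketch that wraps that call. The helper names and the device path used in the note afterwards are placeholders for illustration, not the actual drive in this thread.

```python
import subprocess


def smart_report_command(device):
    """Build the smartctl command line for a full SMART report on `device`."""
    # `smartctl -a` prints all SMART information for the given device.
    return ["smartctl", "-a", device]


def save_smart_report(device, out_path):
    """Run smartctl and write its output to `out_path` (hypothetical helper)."""
    result = subprocess.run(
        smart_report_command(device),
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit codes as status bit flags
    )
    with open(out_path, "w") as fh:
        fh.write(result.stdout)
    return result.returncode
```

Something like `save_smart_report("/dev/sdb", "smart_sdb.txt")` would then leave a text file that can be attached to a post, assuming `/dev/sdb` really is the rebuilt drive on your system.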

Link to comment

Thanks for the link. I noticed it said to do a lot of the things before rebooting, but as my system seemed to have locked up I am not sure that was an option. Also, it did almost the same thing again, which I am not sure what to think of. I just had it idle (this was not after another rebuild), it spit out a bunch of gibberish, and I can't access it remotely, so I just took a photo of the screen. It looks like it is continuing to spit out this stuff every once in a while: I can get a command prompt and will be hunting around, then this stuff will show up on my screen, then I can just clear the screen and continue, but I can't seem to get remote access going. Like I said, I can access it directly, so I thought I would post the screenshot and ask if there is anything I can do via the command prompt to get more info before I reboot?

bigtower_error1.jpg

Link to comment

ok, things have gotten worse...

 

I couldn't remember which drive it was to run a SMART test on, so I tried to shut down, and it would not shut down correctly; more garbage spit onto the screen. So I just powered down and tried to restart, but on restart (in safe mode) I get garbage. In effect, the longer this gets drawn out, the more often it throws this garbage/gibberish onto the screen. Now getting to a prompt or the web console is hit or miss, and I am still stuck with one drive that is not rebuilt and an OS that seems corrupted in some way. Not happy times.

 

I was able to boot one time and do a SMART test (attached), and I have posted a few more screenshots (apologies for the low quality) with what I hope is useful information. I have tried to get to the tools diag menu, but the computer keeps freezing before I get to it.

 

I have taken the syslog.zip from the flash drive under /logs, which I assume is the same syslog.zip that is referred to in the "Need help? Read me first!" post (I am just accessing it differently as I can't do it via the web console). If this is not useful, then perhaps there is another way to access it?

 

syslog.zip

bigtower_screen2.jpg

bigtower_screen3.jpg

Link to comment

OK, this is starting to look like memory corruption, because it's crashing in different core parts of the kernel.  There is no point in testing anything else until you know you can trust your memory.  It may not be memory, but we HAVE to know it's good before we can go any further.  Start Memtest from the unRAID boot menu and let it run overnight at least (or less, of course, if it fails).

Link to comment

Some additional comments -

 

* That's an old nForce board, I had terrible problems with mine, but yours I think is about the only nForce4 board that could be trusted.

 

* It has the usual nForce "spurious IRQ7" problem.  To shut it up and keep it from causing some subsystem to be shut down, I reserved it in the BIOS settings, so it couldn't be assigned to anything.  It's been too long for me to remember exactly where I found the setting, but hopefully you can figure it out.

 

* Somehow it appears to have successfully loaded a 32 bit package, ntfs-3g-2010.3.6-i486-1.txz, and that's a serious problem!  You may have to take your flash drive to another station, but make sure there are NO 32 bit or v5 plugins and packages on there!

 

* You are running Powerdown v2.06, which is really old!  We are up to v2.16 now.  It shouldn't affect normal operation, but may hinder shutting down correctly.  Not that that's a problem right now!

 

* Network problems too!  You have both the onboard nForce network chipset enabled and a Yukon network chipset, which was set up first and so became eth0.  It connected 4 minutes after you booted, at gigabit speed (1000 Mbps).  It was up for 6 minutes, then went down for a full minute, then came up at only 100 Mbps.  It never hit gigabit speeds again.  It went up and down for a while, then when the system became worse, only rarely came up, and at only 10 Mbps!  That's slow!  And extremely unreliable!

 

* In this syslog, the system appears to be fine for over 4 hours, then the big crash happens, and it's not operational after that.  I'm unable to determine why it happened.
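
If you want to pull the link up/down history out of a syslog yourself, here is a rough Python sketch of the idea. The message wording in the patterns is an assumption modeled on what the sky2 (Yukon) driver typically logs, and the sample lines are made up for illustration, not copied from the attached syslog, so check both against your own log first.

```python
import re

# Assumed message wording, modeled on the sky2 (Yukon) driver; check it
# against the real syslog before relying on these patterns.
LINK_UP = re.compile(r"(eth\d+): Link is up at (\d+) Mbps")
LINK_DOWN = re.compile(r"(eth\d+): Link is down")


def link_events(lines):
    """Return (timestamp, interface, speed_mbps) tuples; speed 0 means link down."""
    events = []
    for line in lines:
        stamp = line[:15]  # classic syslog prefix, e.g. "Jan  5 10:04:12"
        up = LINK_UP.search(line)
        if up:
            events.append((stamp, up.group(1), int(up.group(2))))
            continue
        down = LINK_DOWN.search(line)
        if down:
            events.append((stamp, down.group(1), 0))
    return events


# Made-up sample lines for illustration only (not from the attached syslog).
SAMPLE = [
    "Jan  5 10:04:12 Tower kernel: sky2 0000:02:00.0: eth0: Link is up at 1000 Mbps, full duplex, flow control both",
    "Jan  5 10:10:40 Tower kernel: sky2 0000:02:00.0: eth0: Link is down",
    "Jan  5 10:11:41 Tower kernel: sky2 0000:02:00.0: eth0: Link is up at 100 Mbps, full duplex, flow control both",
]

for stamp, iface, speed in link_events(SAMPLE):
    print(stamp, iface, "DOWN" if speed == 0 else f"{speed} Mbps")
```

A link that keeps renegotiating downward (1000, then 100, then 10 Mbps) shows up immediately in output like this.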

Link to comment

Some additional comments -

 

* That's an old nForce board, I had terrible problems with mine, but yours I think is about the only nForce4 board that could be trusted.

 

That's good... I hope?

 

* It has the usual nForce "spurious IRQ7" problem.  To shut it up and keep it from causing some subsystem to be shut down, I reserved it in the BIOS settings, so it couldn't be assigned to anything.  It's been too long for me to remember exactly where I found the setting, but hopefully you can figure it out.

I will poke around after the mem test finishes tomorrow. Not sure about it though (it will be somewhat blind poking).

 

* Somehow it appears to have successfully loaded a 32 bit package, ntfs-3g-2010.3.6-i486-1.txz, and that's a serious problem!  You may have to take your flash drive to another station, but make sure there are NO 32 bit or v5 plugins and packages on there!

 

I did not realize I had installed a 32-bit version. I was never able to get SNAP to work (perhaps that is part of the reason why?). I should just be able to uninstall it, assuming I can ever get to the web console again... right?

 

 

* You are running Powerdown v2.06, which is really old!  We are up to v2.16 now.  It shouldn't affect normal operation, but may hinder shutting down correctly.  Not that that's a problem right now!

 

I didn't know I was running such an old version, but if it's not causing any problems I think I will leave well enough alone, as I have much bigger problems at the moment.

 

 

* Network problems too!  You have both the onboard nForce network chipset enabled and a Yukon network chipset, which was set up first and so became eth0.  It connected 4 minutes after you booted, at gigabit speed (1000 Mbps).  It was up for 6 minutes, then went down for a full minute, then came up at only 100 Mbps.  It never hit gigabit speeds again.  It went up and down for a while, then when the system became worse, only rarely came up, and at only 10 Mbps!  That's slow!  And extremely unreliable!

 

Ugh... this could have happened when I set everything to default. Actually, I had a drive go bad a few weeks ago, but since my system was running pretty hot I decided to hold off on the rebuild until I got some additional cooling (a 120mm fan). I got it, installed it, and then booted up and got a bizarre USB error that popped up before I could reach the BIOS setup option. The only way around it was to take the BIOS battery out and put it back in, after which I could get into the BIOS. So I upgraded the BIOS, set the BIOS default options, and was able to boot fine. I rebuilt the data drive and bam, the next day I am seeing these errors start up (probably the 6 hours later you are referring to?).

 

For the networking part... perhaps I can disable one and leave the other and it will behave?

 

 

* In this syslog, the system appears to be fine for over 4 hours, then the big crash happens, and it's not operational after that.  I'm unable to determine why it happened.

 

Well... I am not sure what to think. It all sounds bad, but at least I know more than before. And a very big thank you (very big) for going through the stuff, as I looked at it and have not been able to make much sense of it.

 

I will wait until morning for the mem test results and go into the BIOS and see if I can figure out the IRQ7 and disable one of the network chipsets, perhaps that will help a bit?

Link to comment

* That's an old nForce board, I had terrible problems with mine, but yours I think is about the only nForce4 board that could be trusted.

That's good... I hope?

I don't know, I would think about replacing it.  Silent data corruption is a terrible thing!

 

For the networking part... perhaps I can disable one and leave the other and it will behave?

 

I will wait until morning for the mem test results and go into the BIOS and see if I can figure out the IRQ7 and disable one of the network chipsets, perhaps that will help a bit?

That's exactly what I would do!

Link to comment

Well, the mem test is still running, and looking... really bad (to my novice self). Something about 0 passes and 68 errors says situation not good to me. How do 4 mem sticks fail so spectacularly all at the same time? (I am assuming all; it hasn't finished running, but it's been running for ~14 hours.)

 

There's no point in continuing the test any further.  You don't want a low number of memory errors, you want exactly ZERO!  In a way, this is good, you've found the problem!  Or at least a major part of the problem.

 

The next step is to isolate and test the sticks individually, because it's probably not all of them.  Cut the problem in half: remove 2 sticks and test.  If they are perfect, then the issue is in the other 2.  You want to end your testing knowing EXACTLY which sticks are good and which are trash.  Initial tests don't have to be very long, since any error stops the test.  Once you have what appear to be the only good ones, run a very long test with all of them installed.  If it's an odd number, you'll probably want to purchase another (test it with them!).
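
The halving approach can be sketched in a few lines of Python, just to show the logic. `passes_memtest` here is a stand-in for physically installing only that group of sticks and running Memtest; the stick labels and the "faulty" set are purely hypothetical. The sketch also assumes each fault belongs to an individual stick, which won't hold if the board or a slot is the real problem.

```python
def find_bad_sticks(sticks, passes_memtest):
    """Split the sticks in half repeatedly until every bad one is identified.

    `passes_memtest(group)` stands in for installing only `group` and running
    Memtest; it returns True when the run shows zero errors.
    Assumes faults belong to individual sticks, not to the board or slots.
    """
    if passes_memtest(sticks):
        return []              # whole group is clean, nothing to isolate
    if len(sticks) == 1:
        return list(sticks)    # a single failing stick is the culprit
    mid = len(sticks) // 2
    return (find_bad_sticks(sticks[:mid], passes_memtest)
            + find_bad_sticks(sticks[mid:], passes_memtest))


# Simulated example: pretend stick "C" is faulty (hypothetical labels).
faulty = {"C"}
runs = []


def simulated_memtest(group):
    runs.append(tuple(group))  # record each "test run" we would have to do
    return not any(stick in faulty for stick in group)


bad = find_bad_sticks(["A", "B", "C", "D"], simulated_memtest)
print("bad sticks:", bad)
print("test runs needed:", len(runs))
```

With four sticks and one bad one, this works out to a handful of Memtest runs rather than guessing.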

Link to comment

One small point. It's relatively common to see memory errors caused not by the sticks being bad, but by the motherboard not being able to drive all 4 simultaneously. A bad capacitor can make the power to the RAM noisy, and the lower load with fewer sticks can mask the problem. I wouldn't be at all surprised to see all 4 RAM sticks test perfect individually, and start giving errors when grouped.

 

The fewer chips you can use to get the speed and capacity you want, the better.

Link to comment

Well, I am juggling a few work things so can't put all my time into this at the moment, but I tried to boot with two sticks in and it just froze at the BIOS screen (wouldn't even let me go into the BIOS setup). The same happened when I took out the other sticks and put the previous ones back in. Haven't tried different sticks in different slots yet, but... geez...

Link to comment

