Dead drive, lots of kernel errors, can't access server


ridge


Posted

Running 5.0beta14.

 

The server went down while I was watching TV earlier. I couldn't access the GUI and couldn't get in via telnet, so I had to do a hard reboot. On reboot, unRAID ran a parity check and found one drive with a lot of errors; a screenshot is attached (unfortunately it's the only one I got). The array came back up, but I was completely unable to run anything from the console because the system was generating error message after error message without stopping, and I couldn't escape out of it. Eventually the server stopped responding and I had to do a hard power-down.

 

Attached is a small portion of my syslog (the whole thing zipped is 14MB, so I just took the first 8800 lines or so and zipped that). I'm hoping this is as simple as a dead drive, but I wanted some expert guidance based on the syslog before I go ripping out the drive. What makes me most nervous is that I can't do anything on the server via the console when it's up, because it's generating system messages so fast and without stopping!
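
For anyone wanting to grab a similar chunk, this is roughly how I trimmed it; just a sketch, and the output filename is whatever you pick (zip may not be on every build, so gzip is used here):

head -n 8800 /var/log/syslog > /boot/syslog-head.txt   # first ~8800 lines only
gzip /boot/syslog-head.txt                             # compress for posting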

 

So, next steps? I can run to Fry's in the morning and buy a replacement drive if necessary. Does it look like that's the only problem?

 

Edited to add syslog

 

Thanks for any insight you guys can offer!

Screen-Shot-2012-01-06-at-12_26.png

syslog.txt.zip

Posted

Thanks. I got that much.  ;D

 

From the syslog I posted, does it look like there's anything else going on that I should be aware of before replacing the drive and letting parity rebuild its contents?
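
In the meantime I've been trying to narrow it down myself with something like this (a rough sketch only; syslog.txt is just the name I saved the extract under):

# count error-ish lines per sdX device in the syslog extract
egrep -i 'error|exception|timeout' syslog.txt | egrep -o 'sd[a-z]' | sort | uniq -c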

Posted

1 Raw_Read_Error_Rate    0x002f  197  197  051    Pre-fail  Always      -      1569

197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      15

198 Offline_Uncorrectable  0x0030  200  200  000    Old_age  Offline      -      1
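
For completeness, an attribute table like this can be pulled with smartctl; roughly like so (assuming the suspect drive is /dev/sdi):

smartctl -a /dev/sdi    # full SMART report: identity, health, attributes, error log
smartctl -A /dev/sdi    # just the attribute table shown above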

Posted

>:( reiserfsck didn't finish:

 

###########
reiserfsck --check started at Fri Jan  6 07:18:59 2012
###########
Replaying journal: Done.
Reiserfs journal '/dev/md9' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. \/  7 (of  16\/ 70 (of 170\/116 (of 170\
Message from syslogd@Gallifrey at Fri Jan  6 08:00:13 2012 ...
Gallifrey kernel: EIP: [<c1092787>] mntget+0x7/0xf SS:ESP 0068:c7c0bdf8

Message from syslogd@Gallifrey at Fri Jan  6 08:00:13 2012 ...
Gallifrey kernel: CR2: 00000000213cfe98

Message from syslogd@Gallifrey at Fri Jan  6 08:00:13 2012 ...
Gallifrey kernel: Call Trace:

Message from syslogd@Gallifrey at Fri Jan  6 08:00:13 2012 ...
Gallifrey kernel: Code: 00 00 00 8b 00 e8 81 ff ff ff 85 c0 74 06 8b 50 18 64 ff 02 ba 80 49 4a c1 64 03 15 14 40 4a c1 fe 02 5d c3 55 85 c0 89 e5 74 06 <8b> 50 18 64 ff 02 5d c3 55 ba 80 49 4a c1 89 e5 53 8b 58 48 64 

Message from syslogd@Gallifrey at Fri Jan  6 08:00:13 2012 ...
Gallifrey kernel: Stack:

Message from syslogd@Gallifrey at Fri Jan  6 08:00:13 2012 ...
Gallifrey kernel: Process fuser (pid: 15007, ti=c7c0a000 task=f09b2520 task.ti=c7c0a000)

Message from syslogd@Gallifrey at Fri Jan  6 08:00:13 2012 ...
Gallifrey kernel: Oops: 0000 [#1] SMP 

 

Now what?
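
If I end up re-running the check, I'm thinking something like this to at least keep a copy of the output on the flash drive in case it locks up again (a sketch only; assumes /dev/md9 is still the affected filesystem and it isn't mounted):

reiserfsck --check /dev/md9 2>&1 | tee /boot/reiserfsck-md9.log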

Posted

Connected up a monitor (with the intention of running memtest) to be greeted by a whole page of this:

 

INFO: rcu_sched_state detected stall on CPU 0 (t=1376280 jiffies)

 

Only the numbers after "t=" are different on each line.

 

CPU stalling?

Posted

1 Raw_Read_Error_Rate     0x002f   197   197   051    Pre-fail  Always       -       1569

197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       15

198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -      1

 

sdi may be failing. It needs to be rebuilt or replaced. I'd replace it and then run preclear on it to see what happens to Current_Pending_Sector and Reallocated_Sector_Ct.
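
Something like this before and after the preclear makes the comparison easy (a sketch only; double-check the device letter first, since it can change after a reboot):

smartctl -A /dev/sdi | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'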

Posted

Ha! Well now I feel like a moron! Thanks!

 

Realized I also needed to preclear the disk (it's been a while since I did this the first time), so I'm doing that right now. When it's finished and the disk is added to the array (hopefully parity will rebuild the contents), I'll post back.
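
For my own notes, the invocation was roughly this (the script sits on my flash drive, and the new disk shows up as sdi here; adjust as needed):

cd /boot
./preclear_disk.sh /dev/sdi    # one full preclear cycle on the new disk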

Posted

Set preclear running and went out. Came back to babysit and found this:

================================================================== 1.9
=                unRAID server Pre-Clear disk /dev/sdi
=               cycle 1 of 1, partition start on sector 64 
= Disk Pre-Read in progress: 4% complete
= ( 93,768,192,000  bytes of  2,000,398,934,016  read ) 117 MB/s
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
Disk Temperature: 32C, Elapsed Time:  0:14:06

Message from syslogd@Gallifrey at Sat Jan  7 14:00:02 2012 ...
Gallifrey kernel: EIP: [<c10889d7>] path_lookupat+0x2ef/0x4f9 SS:ESP 0068:c52fbe48

 

System crashed again. Time for memtest?

 

Additionally, will this failure during preclear negatively impact the new drive at all?

Posted

Set preclear running and went out. Came back to babysit and found this:

================================================================== 1.9

Your version of the preclear script is VERY old. Please always use the newest version. (I think you are about four versions behind.)

You can find it here: http://lime-technology.com/forum/index.php?topic=2817.0

 

You'll need to figure out the reason you are crashing. Start with a memory test, and verify that the voltage, clock speed, and timings are all set properly in the BIOS for your specific make and model of memory sticks.
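
If dmidecode is present on your build, it will show what the board thinks the installed DIMMs are running at, which you can compare against the sticks' rated spec (just a sketch):

dmidecode --type memory | egrep -i 'size|speed|part number'   # reported size, configured speed, part number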

Posted

Ran memtest for 24 hours without any errors. Changed the BIOS clock speed as per Joe's advice to the settings specified for the memory (the clock speed of 667 MHz was the only one I could change) and rebooted. Now the server is completely unresponsive. The monitor won't come on and I can't get back into the BIOS!

 

Help!

Posted

OK, thanks dgaschk. Did all that. Server back up and running (after finding settings to make the Flash drive boot!).

 

Now, with no errors from memtest after a solid 24-hour test, what's next for diagnosis?
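
In the meantime I'm leaving a telnet session from another machine watching the log, and copying it to the flash drive so something survives the next crash (the filename is just whatever I picked):

tail -f /var/log/syslog | tee -a /boot/syslog-live.txt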

Posted

Server crashed again with:

INFO: rcu_sched_state detected stall on CPU 0 (t=1376280 jiffies)

lines repeating down the page.
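
If the local log keeps getting cut off at the moment of the crash, I might try pushing kernel messages to another machine with netconsole and catching them there (untested on my setup; the interface and addresses below are placeholders):

# on the server: send kernel messages to 192.168.1.100 over eth0
modprobe netconsole netconsole=@/eth0,6666@192.168.1.100/
# on the other machine: listen for them
nc -u -l -p 6666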

Archived

This topic is now archived and is closed to further replies.
