Dead drive, lots of kernel errors, can't access server

January 6, 2012

Running 5.0beta14.

Server went down while watching TV earlier. Couldn't access the GUI, and couldn't get in via telnet. Had to do a hard reboot. On reboot, unRAID did a parity check and found one drive with a lot of errors. Screenshot is attached (unfortunately it's the only one I got). Array came back up, but I was completely unable to run anything via the console because the system was generating error message after error message without stopping. I couldn't escape out of it. Eventually the server stopped responding and I had to do a hard power down.

Attached is a small portion of my syslog (the whole thing zipped is 14MB, so I just took the first 8800 lines or so and zipped that). I'm hoping this is as simple as a dead drive, but wanted some expert guidance based on the syslog before I go ripping out the drive. What makes me most nervous is I can't do anything on the server via the console when it's up because it's just generating system messages so fast without stopping!

So, next steps? I can run to Fry's in the morning and buy a replacement drive if necessary. Does it look like that's the only problem?

Edited to add syslog

Thanks for any insight you guys can offer!

syslog.txt.zip

January 6, 2012

There is a problem with disk 9.

January 6, 2012

Author

Thanks. I got that much.

From the syslog I posted, does it look like there's anything else going on I should be aware of before replacing the drive and rebuilding the parity?

January 6, 2012

I don't see a syslog.

January 6, 2012

Author

Hrm...didn't post...ok, here it is

syslog.txt.zip

January 6, 2012

The syslog doesn't show anything wrong with disk9. The problem must occur later in the log. Post a SMART report for disk 9.
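Since the full syslog is too big to attach and the first 8800 lines miss the failure, one option is to filter the whole file for suspicious lines and post only those. The following is a minimal sketch, not an unRAID tool: the keyword list and the function name `interesting_lines` are assumptions for illustration, and the sample lines are the kernel messages quoted elsewhere in this thread plus placeholder filler.

```python
import re

# Hypothetical keyword list; adjust it for whatever your own log contains.
PATTERN = re.compile(r"error|oops|stall|exception|fail", re.IGNORECASE)

def interesting_lines(lines, pattern=PATTERN, context=2):
    """Return (line_number, text) for matching lines plus surrounding context."""
    lines = list(lines)
    keep = set()
    for i, line in enumerate(lines):
        if pattern.search(line):
            start = max(0, i - context)
            stop = min(len(lines), i + context + 1)
            keep.update(range(start, stop))
    # Line numbers are 1-based so they match what an editor would show.
    return [(i + 1, lines[i].rstrip("\n")) for i in sorted(keep)]

if __name__ == "__main__":
    # Placeholder filler lines around two messages quoted in this thread.
    sample = [
        "placeholder line 1\n",
        "placeholder line 2\n",
        "Gallifrey kernel: Oops: 0000 [#1] SMP\n",
        "placeholder line 3\n",
        "placeholder line 4\n",
        "INFO: rcu_sched_state detected stall on CPU 0\n",
    ]
    for number, text in interesting_lines(sample, context=1):
        print(f"{number}: {text}")
```

To run it against a real log, pass an open file instead of the sample list, for example `interesting_lines(open("syslog.txt", errors="replace"))`, where `syslog.txt` is whatever name the extracted log has.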

January 6, 2012

Author

SMART report attached.

I'm running a reiserfsck --check on the drive now too. Will post results shortly.

smart.txt

January 6, 2012

  1 Raw_Read_Error_Rate     0x002f   197   197   051    Pre-fail  Always       -       1569
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       15
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1

January 6, 2012

Author

reiserfsck didn't finish:

###########
reiserfsck --check started at Fri Jan  6 07:18:59 2012
###########
Replaying journal: Done.
Reiserfs journal '/dev/md9' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. \/  7 (of  16\/ 70 (of 170\/116 (of 170\
Message from syslogd@Gallifrey at Fri Jan  6 08:00:13 2012 ...
Gallifrey kernel: EIP: [<c1092787>] mntget+0x7/0xf SS:ESP 0068:c7c0bdf8

Message from syslogd@Gallifrey at Fri Jan  6 08:00:13 2012 ...
Gallifrey kernel: CR2: 00000000213cfe98

Message from syslogd@Gallifrey at Fri Jan  6 08:00:13 2012 ...
Gallifrey kernel: Call Trace:

Message from syslogd@Gallifrey at Fri Jan  6 08:00:13 2012 ...
Gallifrey kernel: Code: 00 00 00 8b 00 e8 81 ff ff ff 85 c0 74 06 8b 50 18 64 ff 02 ba 80 49 4a c1 64 03 15 14 40 4a c1 fe 02 5d c3 55 85 c0 89 e5 74 06 <8b> 50 18 64 ff 02 5d c3 55 ba 80 49 4a c1 89 e5 53 8b 58 48 64 

Message from syslogd@Gallifrey at Fri Jan  6 08:00:13 2012 ...
Gallifrey kernel: Stack:

Message from syslogd@Gallifrey at Fri Jan  6 08:00:13 2012 ...
Gallifrey kernel: Process fuser (pid: 15007, ti=c7c0a000 task=f09b2520 task.ti=c7c0a000)

Message from syslogd@Gallifrey at Fri Jan  6 08:00:13 2012 ...
Gallifrey kernel: Oops: 0000 [#1] SMP

Now what?

January 6, 2012

reiserfsck shouldn't crash like that -- it seems you might have a system stability issue.

Have you run a memtest?

January 6, 2012

Author

Connected up a monitor (with the intention of running memtest) to be greeted by a whole page of this:

INFO: rcu_sched_state detected stall on CPU 0 (t=1376280 jiffies)

Only the numbers after "t=" are different on each line.

CPU stalling?

January 6, 2012

  1 Raw_Read_Error_Rate     0x002f   197   197   051    Pre-fail  Always       -       1569
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       15
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1

sdi may be failing. It needs to be rebuilt or replaced. I'd replace it and then run pre-clear on the old drive to see what happens to Current_Pending_Sector and Reallocated_Sector_Ct.
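For anyone who wants to check these attributes without reading the report by eye, here is a minimal sketch that pulls the raw values out of attribute rows like the ones quoted above. It assumes the usual ten-column `smartctl -A` row layout, and the watch list of attribute names is an assumption chosen for illustration, not an official threshold from unRAID or smartmontools.

```python
# Attributes often watched for early signs of trouble. This watch list is an
# assumption for illustration, not an official rule.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}

def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from smartctl-style attribute rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Expected columns: ID, name, flag, value, worst, thresh, type,
        # updated, when_failed, raw_value.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs

def flag_suspect(attrs):
    """Return the watched attribute names whose raw value is above zero."""
    return sorted(name for name in WATCHED if attrs.get(name, 0) > 0)

# The three rows quoted in this thread.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x002f   197   197   051    Pre-fail  Always       -       1569
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       15
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
"""

if __name__ == "__main__":
    attrs = parse_smart_attributes(SAMPLE)
    for name in flag_suspect(attrs):
        print(f"{name}: raw value {attrs[name]}")
```

Running the same check on reports taken before and after a pre-clear shows whether the pending sector count went down or the reallocated count went up.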

January 6, 2012

^^^First...then see if the system is stable.

Looks like you had a kernel panic during the mntget operation?

http://www.mjmwired.net/kernel/Documentation/RCU/stallwarn.txt

http://lxr.free-electrons.com/ident?i=mntget

After the drive is rebuilt, take a closer look at your hardware.

January 6, 2012

Author

OK, off to get a new drive. I'll post more when that's set up.

January 7, 2012

Author

OK, my inexperience is showing. I bought a replacement 2TB drive, followed the instructions here http://lime-technology.com/wiki/index.php?title=Replacing_a_Data_Drive, only to now have Disk 9 showing that it's not installed, and seemingly no way forward. Attached image.

January 7, 2012

Stop the array and assign the new disk as disk 9.

January 7, 2012

Author

Ha! Well now I feel like a moron! Thanks!

Realized I also needed to preclear the disk (been a while since I did this the first time) so I'm doing that right now. When it's finished and added to the array (hopefully parity will have rebuilt the contents) I'll post back.

January 7, 2012

Author

Set preclear running and went out. Came back to babysit and found this:

================================================================== 1.9
=                unRAID server Pre-Clear disk /dev/sdi
=               cycle 1 of 1, partition start on sector 64 
= Disk Pre-Read in progress: 4% complete
= ( 93,768,192,000  bytes of  2,000,398,934,016  read ) 117 MB/s
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
Disk Temperature: 32C, Elapsed Time:  0:14:06

Message from syslogd@Gallifrey at Sat Jan  7 14:00:02 2012 ...
Gallifrey kernel: EIP: [<c10889d7>] path_lookupat+0x2ef/0x4f9 SS:ESP 0068:c52fbe48

System crashed again. Time for memtest?

Additionally, will this failure during preclear negatively impact the new drive at all?
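As a side note on how long the pre-read would have taken, the progress line above gives enough to estimate it. This is a rough sketch only: it assumes the reported 117 MB/s means 10^6 bytes per second and that the rate stays constant, which it usually does not across a whole disk. It also covers just the pre-read phase, not the later preclear phases.

```python
# Figures copied from the preclear progress display in this post.
BYTES_READ = 93_768_192_000
BYTES_TOTAL = 2_000_398_934_016
RATE_MB_PER_S = 117  # assumed to mean 10**6 bytes per second

def percent_done(read, total):
    """Fraction of the disk read so far, as a percentage."""
    return 100.0 * read / total

def hours_remaining(read, total, rate_mb_per_s):
    """Estimated hours left for this phase at a constant transfer rate."""
    remaining_bytes = total - read
    seconds = remaining_bytes / (rate_mb_per_s * 10**6)
    return seconds / 3600

if __name__ == "__main__":
    print(f"{percent_done(BYTES_READ, BYTES_TOTAL):.1f}% read")
    print(f"about {hours_remaining(BYTES_READ, BYTES_TOTAL, RATE_MB_PER_S):.1f} hours left in pre-read")
```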

January 7, 2012

Yeah, something is wrong with the hardware.

An aborted preclear won't affect the hard drive at all.

January 7, 2012

Set preclear running and went out. Came back to babysit and found this:
================================================================== 1.9

Your version of the preclear script is VERY old. Please always use the newest version. (I think you are about 4 versions behind.)
You can find it here: http://lime-technology.com/forum/index.php?topic=2817.0

You'll need to figure out the reason you are crashing. Start with a memory test, and verify that the voltage, clock speed, and timing are all set properly in the BIOS for your specific make and model of memory strips.

January 8, 2012

Author

Ran memtest for 24 hours without any errors. Changed the BIOS clock speed as per Joe's advice to the settings specified for the memory (clock speed to 667MHz was the only one I could change) and rebooted. Now the server is completely unresponsive. Monitor won't come on and I can't get back to the BIOS!

Help!

January 8, 2012

Reset/clear the BIOS. This will be a jumper or contact pads on the MB. See the MB manual.

January 8, 2012

Author

OK, thanks dgaschk. Did all that. Server back up and running (after finding settings to make the Flash drive boot!).

Now, after no errors in memtest after a solid 24 hour test, what's next for diagnosis?

January 8, 2012

Author

I also replaced the preclear script with the latest version, FYI.

January 8, 2012

Author

Server crashed again with:

INFO: rcu_sched_state detected stall on CPU 0 (t=1376280 jiffies)

lines repeating down the page.
