
Dead drive, lots of kernel errors, can't access server

Running 5.0beta14.

 

The server went down while I was watching TV earlier. I couldn't access the GUI or connect via telnet, so I had to do a hard reboot. On reboot, unRAID ran a parity check and found one drive with a lot of errors. A screenshot is attached (unfortunately it's the only one I got). The array came back up, but I couldn't run anything from the console because the system was generating error message after error message without stopping, and I couldn't escape out of it. Eventually the server stopped responding and I had to do a hard power down.

 

Attached is a small portion of my syslog (the whole thing zipped is 14MB, so I just took the first 8800 lines or so and zipped that). I'm hoping this is as simple as a dead drive, but wanted some expert guidance based on the syslog before I go ripping out the drive. What makes me most nervous is I can't do anything on the server via the console when it's up because it's just generating system messages so fast without stopping!

 

So, next steps? I can run to Fry's in the morning and buy a replacement drive if necessary. Does it look like that's the only problem?

 

Edited to add syslog

 

Thanks for any insight you guys can offer!

Screen-Shot-2012-01-06-at-12_26.png.66179b8f4fddbdaec9415fd483a60ac3.png

syslog.txt.zip

  • Author

Thanks. I got that much.  ;D

 

From the syslog I posted, does it look like there's anything else going on I should be aware of before replacing the drive and rebuilding the parity?

  • Author

Hrm...didn't post...ok, here it is

syslog.txt.zip

The syslog doesn't show anything wrong with disk9. The problem must occur later in the log. Post a SMART report for disk 9.
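Since the full syslog is too big to attach, one way to find where the trouble starts is to scan it for a few telltale strings and note the first line number and count for each. A rough Python sketch, assuming the patterns below are the interesting ones (they are taken from messages quoted in this thread; the `summarize` helper and the sample list are made up for illustration):

```python
import re
from collections import Counter

def summarize(lines, patterns):
    """For each regex pattern, record the first matching line number and a match count."""
    compiled = {p: re.compile(p) for p in patterns}
    first = {}
    counts = Counter()
    for number, line in enumerate(lines, start=1):
        for name, rx in compiled.items():
            if rx.search(line):
                counts[name] += 1
                first.setdefault(name, number)  # keep only the earliest hit
    return first, counts

# Patterns are assumptions for this sketch; adjust to the device and errors you care about.
PATTERNS = ["Oops", "Call Trace", "detected stall", r"\bsdi\b"]

# Sample lines copied from messages quoted in this thread, not from the real 14MB log.
SAMPLE = [
    "Gallifrey kernel: Process fuser (pid: 15007, ti=c7c0a000 task=f09b2520 task.ti=c7c0a000)",
    "Gallifrey kernel: Oops: 0000 [#1] SMP",
    "Gallifrey kernel: Call Trace:",
    "INFO: rcu_sched_state detected stall on CPU 0 (t=1376280 jiffies)",
    "INFO: rcu_sched_state detected stall on CPU 0 (t=1376280 jiffies)",
]

if __name__ == "__main__":
    first, counts = summarize(SAMPLE, PATTERNS)
    for pattern in PATTERNS:
        print(pattern, "first at line", first.get(pattern), "count", counts[pattern])
```

On a real log you would pass the open file object instead of `SAMPLE`, then post only the lines around the earliest hit rather than the whole file.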

  • Author

SMART report attached.

 

I'm running reiserfsck --check on the drive now too. Will post results shortly.

smart.txt

1 Raw_Read_Error_Rate    0x002f  197  197  051    Pre-fail  Always      -      1569

197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      15

198 Offline_Uncorrectable  0x0030  200  200  000    Old_age  Offline      -      1

  • Author

>:( reiserfsck didn't finish:

 

###########
reiserfsck --check started at Fri Jan  6 07:18:59 2012
###########
Replaying journal: Done.
Reiserfs journal '/dev/md9' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. \/  7 (of  16\/ 70 (of 170\/116 (of 170\
Message from syslogd@Gallifrey at Fri Jan  6 08:00:13 2012 ...
Gallifrey kernel: EIP: [<c1092787>] mntget+0x7/0xf SS:ESP 0068:c7c0bdf8

Message from syslogd@Gallifrey at Fri Jan  6 08:00:13 2012 ...
Gallifrey kernel: CR2: 00000000213cfe98

Message from syslogd@Gallifrey at Fri Jan  6 08:00:13 2012 ...
Gallifrey kernel: Call Trace:

Message from syslogd@Gallifrey at Fri Jan  6 08:00:13 2012 ...
Gallifrey kernel: Code: 00 00 00 8b 00 e8 81 ff ff ff 85 c0 74 06 8b 50 18 64 ff 02 ba 80 49 4a c1 64 03 15 14 40 4a c1 fe 02 5d c3 55 85 c0 89 e5 74 06 <8b> 50 18 64 ff 02 5d c3 55 ba 80 49 4a c1 89 e5 53 8b 58 48 64 

Message from syslogd@Gallifrey at Fri Jan  6 08:00:13 2012 ...
Gallifrey kernel: Stack:

Message from syslogd@Gallifrey at Fri Jan  6 08:00:13 2012 ...
Gallifrey kernel: Process fuser (pid: 15007, ti=c7c0a000 task=f09b2520 task.ti=c7c0a000)

Message from syslogd@Gallifrey at Fri Jan  6 08:00:13 2012 ...
Gallifrey kernel: Oops: 0000 [#1] SMP 

 

Now what?

reiserfsck shouldn't crash like that -- it seems you might have a system stability issue.

 

Have you run a memtest?

  • Author

Connected up a monitor (with the intention of running memtest) to be greeted by a whole page of this:

 

INFO: rcu_sched_state detected stall on CPU 0 (t=1376280 jiffies)

 

Only the numbers after "t=" are different on each line.

 

CPU stalling?

1 Raw_Read_Error_Rate     0x002f   197   197   051    Pre-fail  Always       -       1569

197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       15

198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -      1

 

sdi may be failing. It needs to be rebuilt or replaced. I'd replace it and then run pre-clear on it to see what happens to Current_Pending_Sector and Reallocated_Sector_Ct.
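To compare these counters before and after a pre-clear, it can help to pull the raw values out of the attribute table mechanically. A minimal Python sketch, assuming the usual ten-column attribute rows as pasted above; the `WATCHED` set and both function names are choices made for this example, not an official list or tool:

```python
# Attributes watched for signs of media trouble (an assumption of this sketch).
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}

def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from SMART attribute table rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # A row has 10 columns: ID, name, flag, value, worst, thresh,
        # type, updated, when_failed, raw_value.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                continue  # skip rows whose raw value is not a plain integer
    return attrs

def flag_suspect(attrs):
    """Keep only watched attributes whose raw value is above zero."""
    return {name: raw for name, raw in attrs.items() if name in WATCHED and raw > 0}

# The three rows quoted in this thread.
REPORT = """\
  1 Raw_Read_Error_Rate     0x002f   197   197   051    Pre-fail  Always       -       1569
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       15
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
"""

if __name__ == "__main__":
    print(flag_suspect(parse_smart_attributes(REPORT)))
```

Running the same check on the report taken after the pre-clear shows whether the pending sectors were cleared or turned into reallocated ones.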

  • Author

OK, off to get a new drive. I'll post more when that's set up.

Stop the array and assign the new disk as disk 9.

  • Author

Ha! Well now I feel like a moron! Thanks!

 

Realized I also needed to preclear the disk (been a while since I did this the first time) so I'm doing that right now. When it's finished and added to the array (hopefully parity will have rebuilt the contents) I'll post back.

  • Author

Set preclear running and went out. Came back to babysit and found this:

================================================================== 1.9
=                unRAID server Pre-Clear disk /dev/sdi
=               cycle 1 of 1, partition start on sector 64 
= Disk Pre-Read in progress: 4% complete
= ( 93,768,192,000  bytes of  2,000,398,934,016  read ) 117 MB/s
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
Disk Temperature: 32C, Elapsed Time:  0:14:06

Message from syslogd@Gallifrey at Sat Jan  7 14:00:02 2012 ...
Gallifrey kernel: EIP: [<c10889d7>] path_lookupat+0x2ef/0x4f9 SS:ESP 0068:c52fbe48

 

System crashed again. Time for memtest?

 

Additionally, will this failure during preclear negatively impact the new drive at all?
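For a sense of how far the pre-read got before the crash, the figures on the status screen above can be worked through directly. A small Python sketch, assuming "MB/s" on that screen means 10^6 bytes per second and that the displayed rate would have held for the rest of the pre-read (both are assumptions; the `progress` function is just for illustration):

```python
def progress(bytes_read, total_bytes, elapsed_s, current_rate_bps):
    """Return (fraction done, average rate in bytes/s, estimated seconds remaining)."""
    fraction = bytes_read / total_bytes
    average_rate = bytes_read / elapsed_s
    # Estimate assumes the currently displayed rate holds until the end.
    remaining_s = (total_bytes - bytes_read) / current_rate_bps
    return fraction, average_rate, remaining_s

if __name__ == "__main__":
    # Values copied from the pre-clear screen: bytes read, disk size, 0:14:06 elapsed, 117 MB/s.
    fraction, avg, remaining = progress(
        93_768_192_000, 2_000_398_934_016, 14 * 60 + 6, 117e6
    )
    print(f"{fraction:.1%} read, average {avg / 1e6:.0f} MB/s, "
          f"about {remaining / 3600:.1f} h of pre-read left")
```

The fraction truncates to the 4% shown on screen, so the crash came very early in the first of the pre-clear phases.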

Yeah, something is wrong with the hardware.

 

An aborted preclear won't affect the hard drive at all.

Set preclear running and went out. Came back to babysit and found this:

================================================================== 1.9

Your version of the preclear script is VERY old. Please always use the newest version (I think you are about 4 versions behind).

You can find it here: http://lime-technology.com/forum/index.php?topic=2817.0

 

You'll need to figure out the reason you are crashing. Start with a memory test, and verify that the voltage, clock speed, and timing are all set properly in the BIOS for your specific make and model of memory modules.

  • Author

Ran memtest for 24 hours without any errors. Changed the BIOS memory settings as per Joe's advice to the values specified for the memory (clock speed, to 667 MHz, was the only one I could change) and rebooted. Now the server is completely unresponsive. The monitor won't come on and I can't get back to the BIOS!

 

Help!

Reset/clear the BIOS. This will be a jumper or contact pads on the MB. See the MB manual.

  • Author

OK, thanks dgaschk. Did all that. Server back up and running (after finding settings to make the Flash drive boot!).

 

Now, after no errors in memtest after a solid 24 hour test, what's next for diagnosis?

  • Author

I also replaced the preclear script with the latest version, FYI.

  • Author

Server crashed again with:

INFO: rcu_sched_state detected stall on CPU 0 (t=1376280 jiffies)

lines repeating down the page.

Archived

This topic is now archived and is closed to further replies.
