ridge Posted January 6, 2012 Posted January 6, 2012 Running 5.0beta14. The server went down while I was watching TV earlier. I couldn't access the GUI or get in via telnet, so I had to do a hard reboot. On reboot, unRAID did a parity check and found one drive with a lot of errors. A screenshot is attached (unfortunately it's the only one I got). The array came back up, but I was completely unable to run anything from the console because the system was generating error message after error message without stopping, and I couldn't escape out of it. Eventually the server stopped responding and I had to do a hard power down. Attached is a small portion of my syslog (the whole thing zipped is 14MB, so I just took the first 8800 lines or so and zipped that). I'm hoping this is as simple as a dead drive, but I wanted some expert guidance based on the syslog before I go ripping out the drive. What makes me most nervous is that I can't do anything on the server via the console when it's up, because it just generates system messages nonstop! So, next steps? I can run to Fry's in the morning and buy a replacement drive if necessary. Does it look like that's the only problem? Edited to add syslog. Thanks for any insight you guys can offer! syslog.txt.zip
ridge Posted January 6, 2012 Author Posted January 6, 2012 Thanks. I got that much. From the syslog I posted, does it look like there's anything else going on I should be aware of before replacing the drive and rebuilding the parity?
ridge Posted January 6, 2012 Author Posted January 6, 2012 Hrm...didn't post...ok, here it is syslog.txt.zip
dgaschk Posted January 6, 2012 Posted January 6, 2012 The syslog doesn't show anything wrong with disk9. The problem must occur later in the log. Post a SMART report for disk 9.
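When a syslog grows to many megabytes of the same message, it can help to collapse the repeats before looking for where the trouble starts. The following is a minimal Python sketch, not something used in this thread. It assumes the classic syslog line layout ("Jan  6 08:00:13 host tag: message"); the names `LINE_RE`, `normalize`, and `summarize` are hypothetical.

```python
import re
from collections import Counter

# Assumed classic syslog layout: "Jan  6 08:00:13 host tag: message".
LINE_RE = re.compile(r"^\w{3}\s+\d+\s+[\d:]{8}\s+\S+\s+(.*)$")


def normalize(message):
    """Mask hex values and decimal numbers so repeated messages collapse."""
    message = re.sub(r"0x[0-9a-fA-F]+", "<hex>", message)
    return re.sub(r"\d+", "<n>", message)


def summarize(lines, top=10):
    """Return the `top` most frequent normalized messages with their counts."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:  # lines that do not fit the assumed layout are skipped
            counts[normalize(match.group(1))] += 1
    return counts.most_common(top)
```

Feeding it the lines of the unzipped syslog (for example via `open("syslog.txt", errors="replace")`) would show which message is flooding the log, so the rarer lines around the first failure are easier to spot.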
ridge Posted January 6, 2012 Author Posted January 6, 2012 SMART report attached. I'm running reiserfsck --check on the drive now too. Will post results shortly. smart.txt
mbryanr Posted January 6, 2012 Posted January 6, 2012
1 Raw_Read_Error_Rate 0x002f 197 197 051 Pre-fail Always - 1569
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 15
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 1
ridge Posted January 6, 2012 Author Posted January 6, 2012 reiserfsck didn't finish:
###########
reiserfsck --check started at Fri Jan 6 07:18:59 2012
###########
Replaying journal: Done.
Reiserfs journal '/dev/md9' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. \/ 7 (of 16\/ 70 (of 170\/116 (of 170\
Message from syslogd@Gallifrey at Fri Jan 6 08:00:13 2012 ...
Gallifrey kernel: EIP: [<c1092787>] mntget+0x7/0xf SS:ESP 0068:c7c0bdf8
Message from syslogd@Gallifrey at Fri Jan 6 08:00:13 2012 ...
Gallifrey kernel: CR2: 00000000213cfe98
Message from syslogd@Gallifrey at Fri Jan 6 08:00:13 2012 ...
Gallifrey kernel: Call Trace:
Message from syslogd@Gallifrey at Fri Jan 6 08:00:13 2012 ...
Gallifrey kernel: Code: 00 00 00 8b 00 e8 81 ff ff ff 85 c0 74 06 8b 50 18 64 ff 02 ba 80 49 4a c1 64 03 15 14 40 4a c1 fe 02 5d c3 55 85 c0 89 e5 74 06 <8b> 50 18 64 ff 02 5d c3 55 ba 80 49 4a c1 89 e5 53 8b 58 48 64
Message from syslogd@Gallifrey at Fri Jan 6 08:00:13 2012 ...
Gallifrey kernel: Stack:
Message from syslogd@Gallifrey at Fri Jan 6 08:00:13 2012 ...
Gallifrey kernel: Process fuser (pid: 15007, ti=c7c0a000 task=f09b2520 task.ti=c7c0a000)
Message from syslogd@Gallifrey at Fri Jan 6 08:00:13 2012 ...
Gallifrey kernel: Oops: 0000 [#1] SMP
Now what?
elkay14 Posted January 6, 2012 Posted January 6, 2012 reiserfsck shouldn't crash like that -- it seems you might have a system stability issue. Have you run a memtest?
ridge Posted January 6, 2012 Author Posted January 6, 2012 Connected up a monitor (with the intention of running memtest) to be greeted by a whole page of this: INFO: rcu_sched_state detected stall on CPU 0 (t=1376280 jiffies) Only the numbers after "t=" are different on each line. CPU stalling?
dgaschk Posted January 6, 2012 Posted January 6, 2012
1 Raw_Read_Error_Rate 0x002f 197 197 051 Pre-fail Always - 1569
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 15
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 1
sdi may be failing. It needs to be rebuilt or replaced. I'd replace it and then run pre-clear on it to see what happens to Current_Pending_Sector and Reallocated_Sector_Ct.
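To check the same counters after a pre-clear pass, the attribute rows can be compared mechanically. Below is a minimal Python sketch, assuming rows in the ten-column layout that `smartctl -A` prints (as in the rows quoted above) and a plain integer as the first token of the raw value. The `WATCHED` set is a rule-of-thumb assumption, not an official failure criterion, and the function names are hypothetical.

```python
# Attributes commonly watched for media trouble when their raw value is non-zero.
# This selection is an assumption for illustration, not an official criterion.
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def parse_attribute(row):
    """Split one attribute row into the fields of interest."""
    fields = row.split()
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "raw": int(fields[9]),  # assumes the raw value starts with an integer
    }


def flag_attributes(rows):
    """Return (name, raw) pairs for watched attributes with a non-zero raw value."""
    flagged = []
    for row in rows:
        attr = parse_attribute(row)
        if attr["name"] in WATCHED and attr["raw"] > 0:
            flagged.append((attr["name"], attr["raw"]))
    return flagged
```

Run against the three rows quoted above, this would flag the 15 pending sectors and the 1 offline-uncorrectable sector; running it again after a pre-clear shows whether those counts dropped to zero or turned into reallocated sectors.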
mbryanr Posted January 6, 2012 Posted January 6, 2012 ^^^First...then see if the system is stable. Looks like you had a kernel panic during the mntget operation? http://www.mjmwired.net/kernel/Documentation/RCU/stallwarn.txt http://lxr.free-electrons.com/ident?i=mntget After the drive is rebuilt, take a closer look at your hardware.
ridge Posted January 6, 2012 Author Posted January 6, 2012 OK, off to get a new drive. I'll post more when that's set up.
ridge Posted January 7, 2012 Author Posted January 7, 2012 OK, my inexperience is showing. I bought a replacement 2TB drive, followed the instructions here http://lime-technology.com/wiki/index.php?title=Replacing_a_Data_Drive, only to now have Disk 9 showing that it's not installed, and seemingly no way forward. Attached image.
dgaschk Posted January 7, 2012 Posted January 7, 2012 Stop the array and assign the new disk as disk 9.
ridge Posted January 7, 2012 Author Posted January 7, 2012 Ha! Well now I feel like a moron! Thanks! Realized I also needed to preclear the disk (been a while since I did this the first time) so I'm doing that right now. When it's finished and added to the array (hopefully parity will have rebuilt the contents) I'll post back.
ridge Posted January 7, 2012 Author Posted January 7, 2012 Set preclear running and went out. Came back to babysit and found this:
================================================================== 1.9
= unRAID server Pre-Clear disk /dev/sdi
= cycle 1 of 1, partition start on sector 64
= Disk Pre-Read in progress: 4% complete
= ( 93,768,192,000 bytes of 2,000,398,934,016 read ) 117 MB/s
=
=
=
=
=
=
=
=
=
= Disk Temperature: 32C, Elapsed Time: 0:14:06
Message from syslogd@Gallifrey at Sat Jan 7 14:00:02 2012 ...
Gallifrey kernel: EIP: [<c10889d7>] path_lookupat+0x2ef/0x4f9 SS:ESP 0068:c52fbe48
System crashed again. Time for memtest? Additionally, will this failure during preclear negatively impact the new drive at all?
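As a rough guide to how long the server has to stay up for the pre-read pass alone, the figures on that screen can be turned into a percentage and a time estimate. This is a back-of-the-envelope Python sketch: it assumes "MB/s" means 10^6 bytes per second and that the displayed rate holds for the rest of the pass, neither of which the preclear output confirms. The function name is hypothetical.

```python
def preclear_progress(bytes_read, bytes_total, rate_mb_per_s):
    """Return (percent complete, estimated seconds remaining) for one pass."""
    percent = 100.0 * bytes_read / bytes_total
    # Assumption: 1 MB = 1,000,000 bytes and the rate stays constant.
    remaining_s = (bytes_total - bytes_read) / (rate_mb_per_s * 1_000_000)
    return percent, remaining_s


percent, remaining_s = preclear_progress(93_768_192_000, 2_000_398_934_016, 117)
print(f"{percent:.1f}% read, roughly {remaining_s / 3600:.1f} h of pre-read left")
```

With the numbers above, the pre-read still had several hours to go, so a crash after 14 minutes says little about the new disk and more about how long the system can stay stable.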
elkay14 Posted January 7, 2012 Posted January 7, 2012 Yeah, something is wrong with the hardware. An aborted preclear won't affect the hard drive at all.
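One way to see that the two crashes did not fault at the same spot is to pull the symbol out of each EIP line. The snippet below is a minimal Python sketch that assumes the 32-bit oops format shown in the posts above ("EIP: [<address>] symbol+0xoffset/0xsize"); `EIP_RE` and `faulting_symbol` are hypothetical names.

```python
import re

# Matches e.g. "EIP: [<c1092787>] mntget+0x7/0xf" (32-bit oops format assumed).
EIP_RE = re.compile(r"EIP: \[<([0-9a-f]+)>\] ([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")


def faulting_symbol(line):
    """Return address, symbol, offset, and size from an EIP line, or None."""
    match = EIP_RE.search(line)
    if match is None:
        return None
    address, symbol, offset, size = match.groups()
    return {
        "address": int(address, 16),
        "symbol": symbol,
        "offset": int(offset, 16),  # bytes into the function
        "size": int(size, 16),      # total function size in bytes
    }
```

Applied to the two EIP lines in this thread, it reports `mntget` for the reiserfsck crash and `path_lookupat` for the preclear crash. Faults in different kernel functions under different workloads are consistent with the hardware suspicion, though they do not prove it.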
Joe L. Posted January 7, 2012 Posted January 7, 2012 Set preclear running and went out. Came back to babysit and found this: ================================================================== 1.9
Your version of the preclear script is VERY old. Please always use the newest version (I think you are about 4 versions behind). You can find it here: http://lime-technology.com/forum/index.php?topic=2817.0 You'll need to figure out the reason you are crashing. Start with a memory test, and verify that the voltage, clock speed, and timing are all set properly in the BIOS for your specific make and model of memory strips.
ridge Posted January 8, 2012 Author Posted January 8, 2012 Ran memtest for 24 hours without any errors. Changed the BIOS clock speed as per Joe's advice to the settings specified for the memory (clock speed to 667MHz was the only one I could change) and rebooted. Now the server is completely unresponsive. The monitor won't come on and I can't get back to the BIOS! Help!
dgaschk Posted January 8, 2012 Posted January 8, 2012 Reset/clear the BIOS. This will be a jumper or contact pads on the MB. See the MB manual.
ridge Posted January 8, 2012 Author Posted January 8, 2012 OK, thanks dgaschk. Did all that. Server back up and running (after finding settings to make the Flash drive boot!). Now, after no errors in memtest after a solid 24 hour test, what's next for diagnosis?
ridge Posted January 8, 2012 Author Posted January 8, 2012 I also replaced the preclear script with the latest version, FYI.
ridge Posted January 8, 2012 Author Posted January 8, 2012 Server crashed again with: INFO: rcu_sched_state detected stall on CPU 0 (t=1376280 jiffies) lines repeating down the page.