HELP - Array won't start!

TQ · September 7, 2015

This all started when my system seemed to not be serving up files anymore. I then ran the df -h command and noticed that the tmpfs /var/log was full. I cleaned up some logs but the system remained unresponsive.
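To see which files are actually filling /var/log, something like the following can help. It is only a minimal sketch: the helper name largest_files is made up for illustration, and it assumes find, du, sort and head are available on the box.

```shell
# Hypothetical helper: list the N largest files under a directory, biggest first.
# Sizes are the kilobyte counts reported by du.
largest_files() {
    dir=${1:-/var/log}
    count=${2:-5}
    find "$dir" -type f -exec du -k {} + 2>/dev/null | sort -rn | head -n "$count"
}
```

Running largest_files /var/log 10 right after df -h /var/log shows whether one runaway log (such as the syslog itself) accounts for the full tmpfs.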

Decided I would reboot. Solves everything, right?

So, after powering off and back on...my array won't start.

I let it sit for over an hour, and the GUI just tells me it's "mounting disks..."

Is something terribly wrong with my system now? Syslog attached as zip.

syslog-09.07.2015.zip

BRiT · September 7, 2015

Something isn't quite right with the filesystem on your drive 5: /dev/sdf -- /dev/md5.

Sep 7 09:20:25 QuizzleUNRAID kernel: mdcmd (6): import 5 8,80 1953514552 ST2000DL003-9VT166_5YD5L8LF

Sep 7 09:20:25 QuizzleUNRAID kernel: md: import disk5: [8,80] (sdf) ST2000DL003-9VT166_5YD5L8LF size: 1953514552

This is where things start to go wrong, when it's trying to mount the filesystem --

Sep 7 09:20:44 QuizzleUNRAID emhttp: shcmd (1501): mkdir /mnt/disk5

Sep 7 09:20:44 QuizzleUNRAID emhttp: shcmd (1502): set -o pipefail ; mount -t reiserfs -o user_xattr,acl,noatime,nodiratime /dev/md5 /mnt/disk5 |& logger

Sep 7 09:20:44 QuizzleUNRAID kernel: REISERFS (device md5): found reiserfs format "3.6" with standard journal

Sep 7 09:20:44 QuizzleUNRAID kernel: REISERFS (device md5): using ordered data mode

Sep 7 09:20:44 QuizzleUNRAID kernel: reiserfs: using flush barriers

Sep 7 09:20:44 QuizzleUNRAID kernel: REISERFS (device md5): journal params: device md5, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30

Sep 7 09:20:44 QuizzleUNRAID kernel: REISERFS (device md5): checking transaction log (md5)

Sep 7 09:21:19 QuizzleUNRAID kernel: sd 1:0:2:0: attempting task abort! scmd(f602a900)

Sep 7 09:21:19 QuizzleUNRAID kernel: sd 1:0:2:0: [sdf] CDB:

Sep 7 09:21:19 QuizzleUNRAID kernel: cdb[0]=0x28: 28 00 00 00 c5 3f 00 00 08 00

Sep 7 09:21:19 QuizzleUNRAID kernel: scsi target1:0:2: handle(0x000b), sas_address(0x4433221102000000), phy(2)

Sep 7 09:21:19 QuizzleUNRAID kernel: scsi target1:0:2: enclosure_logical_id(0x500605b00455dc80), slot(1)

Sep 7 09:21:20 QuizzleUNRAID kernel: sd 1:0:2:0: task abort: SUCCESS scmd(f602a900)

Sep 7 09:21:50 QuizzleUNRAID kernel: sd 1:0:2:0: attempting task abort! scmd(f602a900)

Sep 7 09:21:50 QuizzleUNRAID kernel: sd 1:0:2:0: [sdf] CDB:

Sep 7 09:21:50 QuizzleUNRAID kernel: cdb[0]=0x28: 28 00 00 00 c5 3f 00 00 08 00

Sep 7 09:21:50 QuizzleUNRAID kernel: scsi target1:0:2: handle(0x000b), sas_address(0x4433221102000000), phy(2)

Sep 7 09:21:50 QuizzleUNRAID kernel: scsi target1:0:2: enclosure_logical_id(0x500605b00455dc80), slot(1)

Sep 7 09:21:51 QuizzleUNRAID kernel: sd 1:0:2:0: task abort: SUCCESS scmd(f602a900)

Sep 7 09:22:21 QuizzleUNRAID kernel: sd 1:0:2:0: attempting task abort! scmd(f602a900)

Sep 7 09:22:21 QuizzleUNRAID kernel: sd 1:0:2:0: [sdf] CDB:

Sep 7 09:22:21 QuizzleUNRAID kernel: cdb[0]=0x28: 28 00 00 00 c5 3f 00 00 08 00

Sep 7 09:22:21 QuizzleUNRAID kernel: scsi target1:0:2: handle(0x000b), sas_address(0x4433221102000000), phy(2)

Sep 7 09:22:21 QuizzleUNRAID kernel: scsi target1:0:2: enclosure_logical_id(0x500605b00455dc80), slot(1)

Sep 7 09:22:22 QuizzleUNRAID kernel: sd 1:0:2:0: task abort: SUCCESS scmd(f602a900)

Sep 7 09:22:38 QuizzleUNRAID avahi-daemon[8908]: Invalid response packet from host 10.0.1.10.

Sep 7 09:22:52 QuizzleUNRAID kernel: sd 1:0:2:0: attempting task abort! scmd(f602a900)

Sep 7 09:22:52 QuizzleUNRAID kernel: sd 1:0:2:0: [sdf] CDB:
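As a side note, the CDB line in these entries can be decoded by hand. Opcode 0x28 is the SCSI READ(10) command, in which bytes 2 to 5 hold the starting logical block address and bytes 7 to 8 the transfer length in blocks. The sketch below assumes that layout; decode_read10 is a made-up helper name, not an existing tool.

```shell
# Hypothetical helper: decode a 10-byte SCSI READ(10) CDB.
# Pass the ten bytes as separate hex arguments, exactly as printed in the log.
decode_read10() {
    if [ "$#" -ne 10 ] || [ "$1" != "28" ]; then
        echo "not a READ(10) CDB" >&2
        return 1
    fi
    lba=$(( 0x$3$4$5$6 ))      # bytes 2-5: starting logical block address
    blocks=$(( 0x$8$9 ))       # bytes 7-8: number of blocks to read
    echo "READ(10) lba=$lba blocks=$blocks"
}
```

Calling decode_read10 28 00 00 00 c5 3f 00 00 08 00 prints "READ(10) lba=50495 blocks=8", so every abort in the log belongs to the same 8-block read at LBA 50495 being retried.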

RobJ · September 7, 2015

I suspect that /var/log filled up because of the same type of errors flooding the syslog. It all appears to be fine until it tries to mount Disk 5 and replay the transactions. Because of the previous 'unclean shutdown', all of the drives have transactions to be replayed. It never gets to mount Disk 6 or 7. The same SAS card driver error sequence occurs over and over, with 30 second timeouts. The error sequence tells us nothing about what actually is wrong. Because it occurs during transaction replay, I would reboot and start the array in Maintenance mode and check the file system on Disk 5. See Check Disk File systems, the ReiserFS and v5 section.

However, I don't know for sure it's a problem with the file system, because the errors are coming from the SAS card error handler. It doesn't know or care about file systems.
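To get a feel for how often the sequence repeats, the abort attempts in a saved syslog can be counted per SCSI address. This is a rough sketch, assuming the syslog has been extracted to a plain text file and uses the message format quoted earlier in the thread; count_task_aborts is an illustrative name.

```shell
# Hypothetical helper: count "attempting task abort" lines per SCSI address
# (for example "1:0:2:0") in a syslog file.
count_task_aborts() {
    logfile=$1
    grep 'attempting task abort' "$logfile" \
        | sed -n 's/.*kernel: sd \([0-9:]*\): attempting task abort.*/\1/p' \
        | sort | uniq -c
}
```

If only one address shows up with a large count, that points at a single drive or port and not at the whole controller.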

TQ · September 7, 2015

Thanks for the replies.

I've rebooted and tried to start in Maintenance Mode, but the management interface just keeps trying to load, as if it never actually entered Maintenance Mode properly. In the meantime, I was able to run a check on another disk (to be sure what to expect) and it passed the reiserfsck --check.

###########
reiserfsck --check started at Mon Sep  7 13:13:59 2015
###########
Replaying journal: Trans replayed: mountid 56, transid 503347, desc 6708, len 1, commit 6710, next trans offset 6693
Trans replayed: mountid 56, transid 503348, desc 6711, len 1, commit 6713, next trans offset 6696
Replaying journal: Done.
Reiserfs journal '/dev/md4' in blocks [18..8211]: 2 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..finished
Checking Semantic tree:
finished
No corruptions found
There are on the filesystem:
        Leaves 429447
        Internal nodes 2710
        Directories 5382
        Other files 12238
        Data block pointers 432666611 (6988730 of them are zero)
        Safe links 0

Once I attempted to run the check on Disk 5, it just stalled out.

QuizzleUNRAID:/dev# reiserfsck --check /dev/md5
reiserfsck 3.6.24

Will read-only check consistency of the filesystem on /dev/md5
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes

And that's where it's sat for a while. A new syslog is attached, but I'm unsure whether it will reveal any new info.

syslog-09.07.2015.2.zip

RobJ · September 8, 2015

TQ wrote: "I've rebooted and tried to start in Maintenance Mode, but the management interface just keeps trying to load, as if it never actually entered Maintenance Mode properly."

Because of the previous unclean shutdown, it tried to run a parity check, and as soon as there's any I/O to Disk 5, it or its SCSI device begins spewing the errors, which essentially stops and blocks everything else. Parity check had to be canceled, but I don't know that it would ever give you the chance to do that.

Can you describe what you saw, and why you felt it wasn't in Maintenance mode?

TQ wrote: "In the meantime, I was able to run a check on another disk (to be sure what to expect) and it passed the reiserfsck --check."

That's great, Disk 4 looks fine. I think you must have been in Maintenance mode in order to run it. Not sure how you got there, with those errors monopolizing the machine!

TQ wrote: "Once I attempted to run the check on Disk 5, it just stalled out."

Let's try a SMART short test on the drive -

smartctl -t short /dev/sdf

Wait 5 minutes, then obtain a SMART report.

smartctl -a -d ata /dev/sdf >/boot/smart5a.txt

If that seems to work, try a long test (temporarily disable spin down for the drive) -

smartctl -t long /dev/sdf

Wait 5 hours or so, then obtain a SMART report.

smartctl -a -d ata /dev/sdf >/boot/smart5b.txt

If the long test does not appear to have completed (it says the percentage complete near the bottom of report), wait a few more hours and get the report again.

Attach those SMART reports for us to see.
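The short-test steps above could be wrapped in a small script. This is only a sketch under assumptions: the function name and the SMARTCTL, SMART_WAIT and SMART_DEVTYPE variables are made up here, while the smartctl options themselves (-t short, -a, -d ata) are the ones from the commands above.

```shell
# Illustrative knobs, not smartctl features.
SMARTCTL=${SMARTCTL:-smartctl}
SMART_WAIT=${SMART_WAIT:-300}        # seconds to wait; about 5 minutes as suggested
SMART_DEVTYPE=${SMART_DEVTYPE:-ata}  # matches the "-d ata" used above

# Start a SMART short self-test, wait, then save a full report to a file.
# The return value is whatever the final smartctl call returns.
smart_short_report() {
    dev=$1
    outfile=$2
    "$SMARTCTL" -t short "$dev" || return 1
    sleep "$SMART_WAIT"
    "$SMARTCTL" -a -d "$SMART_DEVTYPE" "$dev" > "$outfile"
}
```

For the drive in question that would be smart_short_report /dev/sdf /boot/smart5a.txt. The long test follows the same pattern with -t long and a much longer wait.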

TQ · September 13, 2015

RobJ wrote: "Can you describe what you saw, and why you felt it wasn't in Maintenance mode?"

The management page never responded after trying to start the array. I could never get into the web mgmt page after that. It just kept "loading".

So I was finally able to try the SMART test as described above. Here is the result:

smartctl -t short /dev/sdf

Wait at least 5 mins...

smartctl -a -d ata /dev/sdf >/boot/smart-sdf.txt

This produces the following output in the file:

root@QuizzleUNRAID:~# cat /boot/smart-sdf.txt |more
smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

Read Device Identity failed: Invalid argument

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

Selftest:

root@QuizzleUNRAID:~# smartctl -l selftest /dev/sdf
smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     48079         -
# 2  Extended offline    Aborted by host               90%     29912         -
# 3  Short offline       Completed without error       00%      9424         -
# 4  Extended offline    Completed without error       00%      9215         -

Not sure what to do next.
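One possible next step, offered as an assumption and not something confirmed in this thread: the self-test log query above was run without -d ata and did return data, so the full report might also work with smartctl's automatic device detection (-d auto) or with -d sat, the type smartctl provides for SATA drives reached through a SCSI/SAS layer. The sketch below tries those variants in turn; the function name and the SMARTCTL variable are made up for illustration.

```shell
SMARTCTL=${SMARTCTL:-smartctl}   # illustrative knob, not a smartctl feature

# Try "smartctl -a" with several device types and print the first report
# that smartctl could actually read from the device.
# Per the smartctl man page, exit-status bits 0 and 1 mean the command line
# did not parse or the device could not be opened/identified; the higher
# bits describe disk health and are not treated as a failure to read here.
smart_report_any() {
    dev=$1
    for devtype in auto sat ata; do
        report=$("$SMARTCTL" -a -d "$devtype" "$dev" 2>&1) && status=0 || status=$?
        if [ $(( status & 3 )) -eq 0 ]; then
            echo "# device type: $devtype"
            echo "$report"
            return 0
        fi
    done
    echo "no device type worked for $dev" >&2
    return 1
}
```

For this drive the call would be smart_report_any /dev/sdf >/boot/smart-sdf.txt. If none of the types work, that in itself is worth reporting back.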
