TQ Posted September 7, 2015

This all started when my system seemed to stop serving files. I ran df -h and noticed that the tmpfs on /var/log was full. I cleaned up some logs, but the system remained unresponsive, so I decided to reboot. Solves everything, right?

After powering off and back on, my array won't start. I let it sit for over an hour, and the GUI just tells me it's "mounting disks..."

Is something terribly wrong with my system now? Syslog attached as zip: syslog-09.07.2015.zip
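For readers following along, the "/var/log is full" symptom can be checked from the console. This is a minimal sketch, assuming a POSIX `df` and `awk`; the helper name `usage_pct` and the 90% threshold are illustrative assumptions, not anything unRAID provides.

```shell
#!/bin/sh
# usage_pct: read `df -P <path>` output on stdin and print the capacity
# figure (without the percent sign) from the first data row.
# The function name is hypothetical, chosen for this example only.
usage_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when /var/log is at least 90% full (the threshold is an assumption).
if [ -d /var/log ]; then
    pct=$(df -P /var/log | usage_pct)
    if [ "${pct:-0}" -ge 90 ] 2>/dev/null; then
        echo "/var/log is ${pct}% full"
        # Show the five largest entries (sizes in KiB) to find the culprit.
        du -sk /var/log/* 2>/dev/null | sort -n | tail -n 5
    fi
fi
```

Clearing space this way only treats the symptom; if something keeps flooding the syslog, the tmpfs will fill up again.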
BRiT Posted September 7, 2015

Something isn't quite right with the filesystem on your drive 5 (/dev/sdf, /dev/md5):

Sep 7 09:20:25 QuizzleUNRAID kernel: mdcmd (6): import 5 8,80 1953514552 ST2000DL003-9VT166_5YD5L8LF
Sep 7 09:20:25 QuizzleUNRAID kernel: md: import disk5: [8,80] (sdf) ST2000DL003-9VT166_5YD5L8LF size: 1953514552

This is where things start to go wrong, when it's trying to mount the filesystem:

Sep 7 09:20:44 QuizzleUNRAID emhttp: shcmd (1501): mkdir /mnt/disk5
Sep 7 09:20:44 QuizzleUNRAID emhttp: shcmd (1502): set -o pipefail ; mount -t reiserfs -o user_xattr,acl,noatime,nodiratime /dev/md5 /mnt/disk5 |& logger
Sep 7 09:20:44 QuizzleUNRAID kernel: REISERFS (device md5): found reiserfs format "3.6" with standard journal
Sep 7 09:20:44 QuizzleUNRAID kernel: REISERFS (device md5): using ordered data mode
Sep 7 09:20:44 QuizzleUNRAID kernel: reiserfs: using flush barriers
Sep 7 09:20:44 QuizzleUNRAID kernel: REISERFS (device md5): journal params: device md5, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
Sep 7 09:20:44 QuizzleUNRAID kernel: REISERFS (device md5): checking transaction log (md5)
Sep 7 09:21:19 QuizzleUNRAID kernel: sd 1:0:2:0: attempting task abort! scmd(f602a900)
Sep 7 09:21:19 QuizzleUNRAID kernel: sd 1:0:2:0: [sdf] CDB:
Sep 7 09:21:19 QuizzleUNRAID kernel: cdb[0]=0x28: 28 00 00 00 c5 3f 00 00 08 00
Sep 7 09:21:19 QuizzleUNRAID kernel: scsi target1:0:2: handle(0x000b), sas_address(0x4433221102000000), phy(2)
Sep 7 09:21:19 QuizzleUNRAID kernel: scsi target1:0:2: enclosure_logical_id(0x500605b00455dc80), slot(1)
Sep 7 09:21:20 QuizzleUNRAID kernel: sd 1:0:2:0: task abort: SUCCESS scmd(f602a900)
Sep 7 09:21:50 QuizzleUNRAID kernel: sd 1:0:2:0: attempting task abort! scmd(f602a900)
Sep 7 09:21:50 QuizzleUNRAID kernel: sd 1:0:2:0: [sdf] CDB:
Sep 7 09:21:50 QuizzleUNRAID kernel: cdb[0]=0x28: 28 00 00 00 c5 3f 00 00 08 00
Sep 7 09:21:50 QuizzleUNRAID kernel: scsi target1:0:2: handle(0x000b), sas_address(0x4433221102000000), phy(2)
Sep 7 09:21:50 QuizzleUNRAID kernel: scsi target1:0:2: enclosure_logical_id(0x500605b00455dc80), slot(1)
Sep 7 09:21:51 QuizzleUNRAID kernel: sd 1:0:2:0: task abort: SUCCESS scmd(f602a900)
Sep 7 09:22:21 QuizzleUNRAID kernel: sd 1:0:2:0: attempting task abort! scmd(f602a900)
Sep 7 09:22:21 QuizzleUNRAID kernel: sd 1:0:2:0: [sdf] CDB:
Sep 7 09:22:21 QuizzleUNRAID kernel: cdb[0]=0x28: 28 00 00 00 c5 3f 00 00 08 00
Sep 7 09:22:21 QuizzleUNRAID kernel: scsi target1:0:2: handle(0x000b), sas_address(0x4433221102000000), phy(2)
Sep 7 09:22:21 QuizzleUNRAID kernel: scsi target1:0:2: enclosure_logical_id(0x500605b00455dc80), slot(1)
Sep 7 09:22:22 QuizzleUNRAID kernel: sd 1:0:2:0: task abort: SUCCESS scmd(f602a900)
Sep 7 09:22:38 QuizzleUNRAID avahi-daemon[8908]: Invalid response packet from host 10.0.1.10.
Sep 7 09:22:52 QuizzleUNRAID kernel: sd 1:0:2:0: attempting task abort! scmd(f602a900)
Sep 7 09:22:52 QuizzleUNRAID kernel: sd 1:0:2:0: [sdf] CDB:
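The repeated `cdb[0]=0x28` line in the log above can be read by hand. In the standard SCSI READ(10) layout, byte 0 is the opcode (0x28), bytes 2 to 5 are the big-endian logical block address, and bytes 7 to 8 are the transfer length in blocks. The sketch below decodes that layout in POSIX shell; the function name `decode_read10` is hypothetical.

```shell
#!/bin/sh
# decode_read10: decode a 10-byte SCSI READ(10) CDB given as one string
# of space-separated hex bytes, as printed in the kernel log.
# The function name is illustrative only.
decode_read10() {
    # Split the argument into positional parameters $1..$10.
    set -- $1
    if [ "$#" -ne 10 ] || [ "$1" != "28" ]; then
        echo "not a READ(10) CDB" >&2
        return 1
    fi
    lba=$(( 0x$3$4$5$6 ))      # bytes 2-5: logical block address
    blocks=$(( 0x$8$9 ))       # bytes 7-8: transfer length in blocks
    echo "opcode=0x$1 lba=$lba blocks=$blocks"
}

decode_read10 "28 00 00 00 c5 3f 00 00 08 00"
```

This prints `opcode=0x28 lba=50495 blocks=8`: every abort is the same 8-block read at the same low block address, so the drive (or the path to it) keeps stalling on one read request rather than failing at random locations. Whether that points to the disk, the cable, or the controller cannot be determined from the CDB alone.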
RobJ Posted September 7, 2015

I suspect that /var/log filled up because of the same type of errors flooding the syslog. Everything appears fine until it tries to mount Disk 5 and replay the journal transactions. Because of the previous 'unclean shutdown', all of the drives have transactions to replay, and it never gets as far as mounting Disk 6 or 7. The same SAS card driver error sequence occurs over and over, with 30-second timeouts. The error sequence tells us nothing about what is actually wrong.

Because it occurs during transaction replay, I would reboot, start the array in Maintenance mode, and check the file system on Disk 5. See Check Disk File systems, the ReiserFS and v5 section. However, I don't know for sure that it's a file system problem, because the errors are coming from the SAS card error handler, which doesn't know or care about file systems.
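To gauge how badly one device is flooding the syslog, the abort messages can be tallied per SCSI address. This is a rough sketch, assuming the log line format shown earlier in the thread; the function name `count_task_aborts` is hypothetical.

```shell
#!/bin/sh
# count_task_aborts: read syslog text on stdin and print, for each SCSI
# address (e.g. 1:0:2:0), how many "attempting task abort!" lines appear.
# The function name is illustrative only.
count_task_aborts() {
    awk '/attempting task abort!/ {
             for (i = 1; i < NF; i++)
                 if ($i == "sd") {
                     addr = $(i + 1)
                     sub(/:$/, "", addr)   # drop the trailing colon
                     n[addr]++
                     break
                 }
         }
         END { for (a in n) print a, n[a] }' | sort
}

# Example against the live log, if it is readable on this machine.
if [ -r /var/log/syslog ]; then
    count_task_aborts < /var/log/syslog
fi
```

If a single address accounts for nearly all of the aborts, that supports the idea that one drive or one port is responsible, although it still does not say which component has failed.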
TQ Posted September 7, 2015 Author

Thanks for the replies.

I've rebooted and tried to start in Maintenance Mode, but the management interface just keeps trying to load, as if it never actually entered Maintenance Mode properly.

In the meantime, I was able to run a check on another disk (to be sure what to expect), and it passed reiserfsck --check:

###########
reiserfsck --check started at Mon Sep 7 13:13:59 2015
###########
Replaying journal:
Trans replayed: mountid 56, transid 503347, desc 6708, len 1, commit 6710, next trans offset 6693
Trans replayed: mountid 56, transid 503348, desc 6711, len 1, commit 6713, next trans offset 6696
Replaying journal: Done.
Reiserfs journal '/dev/md4' in blocks [18..8211]: 2 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..finished
Checking Semantic tree: finished
No corruptions found
There are on the filesystem:
  Leaves 429447
  Internal nodes 2710
  Directories 5382
  Other files 12238
  Data block pointers 432666611 (6988730 of them are zero)
  Safe links 0

Once I attempted to run the check on Disk 5, it just stalled out:

QuizzleUNRAID:/dev# reiserfsck --check /dev/md5
reiserfsck 3.6.24
Will read-only check consistency of the filesystem on /dev/md5
Will put log info to 'stdout'
Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes

And that's where it has sat for a while. A new syslog is attached, but I'm unsure whether it contains any new information: syslog-09.07.2015.2.zip
RobJ Posted September 8, 2015

"I've rebooted and tried to start in Maintenance Mode, but the management interface just keeps trying to load like it never actually entered Maintenance Mode properly."

Because of the previous unclean shutdown, it tried to run a parity check, and as soon as there's any I/O to Disk 5, the drive or its SCSI device begins spewing the errors, which essentially blocks everything else. The parity check had to be canceled, but I don't know that it would ever give you the chance to do that. Can you describe what you saw, and why you felt it wasn't in Maintenance mode?

"In the meantime, I was able to run a check on another disk (to be sure what to expect) and it passed the reiserfsck --check."

That's great, Disk 4 looks fine. I think you must have been in Maintenance mode in order to run it. Not sure how you got there, with those errors monopolizing the machine!

"Once I attempted to run the check on Disk 5, it just stalled out."

Let's try a SMART short test on the drive:

smartctl -t short /dev/sdf

Wait 5 minutes, then obtain a SMART report:

smartctl -a -d ata /dev/sdf >/boot/smart5a.txt

If that seems to work, try a long test (temporarily disable spin down for the drive):

smartctl -t long /dev/sdf

Wait 5 hours or so, then obtain a SMART report:

smartctl -a -d ata /dev/sdf >/boot/smart5b.txt

If the long test does not appear to have completed (the report shows the percentage complete near the bottom), wait a few more hours and get the report again. Attach those SMART reports for us to see.
TQ Posted September 13, 2015 Author

"Can you describe what you saw, and why you felt it wasn't in Maintenance mode?"

The management page never responded after I tried to start the array. I could never get into the web management page after that. It just kept "loading".

I was finally able to try the SMART test as described above. Here is what I ran:

smartctl -t short /dev/sdf

Wait at least 5 mins...

smartctl -a -d ata /dev/sdf >/boot/smart-sdf.txt

This produces the following output in the file:

root@QuizzleUNRAID:~# cat /boot/smart-sdf.txt |more
smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

Read Device Identity failed: Invalid argument

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
Selftest:

root@QuizzleUNRAID:~# smartctl -l selftest /dev/sdf
smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error        00%            48079  -
# 2  Extended offline    Aborted by host                90%            29912  -
# 3  Short offline       Completed without error        00%             9424  -
# 4  Extended offline    Completed without error        00%             9215  -

Not sure what to do next.
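The self-test log above is the one report that did come back, and its first entry can be pulled out mechanically when polling for a test to finish. This is a minimal sketch, assuming the column layout shown in this thread and a two-word test description such as "Short offline"; the function name `latest_selftest` is hypothetical.

```shell
#!/bin/sh
# latest_selftest: read `smartctl -l selftest` output on stdin and print
# the status, remaining percentage and power-on hours of entry "# 1"
# (the most recent test). Assumes a two-word test description.
# The function name is illustrative only.
latest_selftest() {
    awk '$1 == "#" && $2 == "1" {
             # Find the "NN%" column; the status words come before it.
             for (i = 6; i <= NF; i++)
                 if ($i ~ /^[0-9]+%$/) break
             if (i > NF) exit 1
             status = $5
             for (j = 6; j < i; j++) status = status " " $j
             printf "status=%s remaining=%s hours=%s\n", status, $i, $(i + 1)
             exit
         }'
}

# Typical use on the server (not run here):
#   smartctl -l selftest /dev/sdf | latest_selftest
```

For the log in this post, entry 1 reads as a short test that completed without error at 48079 power-on hours. Note also that the command which succeeded was run without `-d ata`, while the failing report used `-d ata`; it may be worth retrying the full report as `smartctl -a /dev/sdf`, though whether that works on this controller is an assumption.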