unable to stop array per the GUI, and put into maintenance mode

JustinChase · September 27, 2014

When I check the box and hit "Stop", it appears to go through the steps to stop the array, but the array never shows as stopped in the GUI.

It appears to have actually stopped the array, as none of the user shares is available now, but the GUI won't update, which means I can't put the array into maintenance mode.

I've had lots of issues with unformatted drives, red balls, cache drive, etc, and after getting an initial response from Tom via email asking for more info, and telling me not to do anything, I've not heard anything else for about 3 days, so I guess I need to just figure out a resolution on my own.

That seems to need to start with me running reiserfsck on the unformatted drive. For that, it seems I need to put the array into maintenance mode, which I'm unable to do.

I wish I could get more help from Tom, but he's busy getting beta10 ready.

So, does anyone else know how to force the array into maintenance mode, so I can try to recover my drive?

thanks in advance.

jphipps · September 27, 2014

From a shell you can try a "df -k" to check for mounted filesystems to see if any are still mounted. My guess is there is a process that has a file/directory in use such as a shell. You might be able to manually unmount any left over or check to see what PID has the filesystem in use.

Worst case you may have to reboot to kill the process and then put it in maintenance mode.
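
As a rough sketch of the check described above, the snippet below lists any array data-disk mounts that are still active by reading the kernel's mount table. The function name `list_array_mounts` and the `/mnt/diskN` pattern are assumptions based on the paths seen in this thread, not something unRAID provides.

```shell
#!/bin/sh
# Print every mount point of the form /mnt/diskN that is still mounted.
# The optional argument (a file in /proc/mounts format) is only there so the
# function can be tried against sample data; it defaults to /proc/mounts.
list_array_mounts() {
    table="${1:-/proc/mounts}"
    while read -r dev mnt rest; do
        case $mnt in
            /mnt/disk*[!0-9]*) ;;            # e.g. /mnt/disks: not a data disk
            /mnt/disk[0-9]*) echo "$mnt" ;;  # /mnt/disk1, /mnt/disk9, ...
        esac
    done < "$table"
}

# Anything printed here is still mounted and can keep the array from stopping.
list_array_mounts
```

For each path printed, `fuser -vm <path>` lists the processes that have the filesystem in use.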

JustinChase · September 27, 2014

not sure what this means, but disk9 is the unformatted disk in my array...

root@media:~# df -k
Filesystem      1K-blocks       Used  Available Use% Mounted on
/dev/sda1         3781376     107376    3674000   3% /boot
/dev/md9       2930177100 1505697032 1424480068  52% /mnt/disk9

jphipps · September 27, 2014

It must not really be unformatted, since it is still mounted. You can try "umount /mnt/disk9" and see whether that unmounts it and lets the array finish going offline. If it is in use, you can use the fuser command to see which PID is active, but that doesn't always show anything.
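
A cautious sketch of those two steps follows. `DRY_RUN`, `run` and `release_mount` are hypothetical names for this example; with `DRY_RUN` left at its default of 1 the commands are only printed, so nothing is unmounted until it is set to 0.

```shell
#!/bin/sh
# Sketch only: show who is holding a mount point, then unmount it.
# DRY_RUN is a safety switch invented for this example; while it is 1
# (the default) the commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

release_mount() {
    mnt="$1"
    # fuser -v: verbose listing, -m: every process using the mounted filesystem
    run fuser -vm "$mnt"
    run umount "$mnt"
}

release_mount /mnt/disk9
```

If fuser names a process, ending that process (often just a shell sitting in a directory on the disk) before the umount is gentler than rebooting.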

JustinChase · September 27, 2014

thanks for the info.

So, if it's not really unformatted, that should be 'good', right?

How would I get unRAID to recognize it as such? I've restarted the server a couple of times since it started showing this way, but from putty (shutdown -r now), since I can't stop the array to do it from the GUI.

I really don't want to make things worse than they already are.

I just saw someone question the fact that the upgrade now shows 10x1 as an option in the extensions screen, so maybe beta10 is ready for release now, and maybe Tom will get back to me, but my server is pretty useless to me, and I'm wasting time I could spend on finding and fixing duplicates by just waiting for Tom to get back to me.

I don't want to sound like I'm slighting Tom; getting the next beta ready is more important than one person's problem, but it's frustrating when you happen to be that one person with the problem.

jphipps · September 27, 2014

Actually I guess the /mnt/diskX are really meta devices, so that still could be just the emulation of that drive. To see the real drive you would have to stop the array and mount the actual device to see what is really on the drive.

Yeah, it is always tough dealing with storage issues. There are usually several routes, and once you do something, you usually can't undo it...

JustinChase · September 27, 2014

hmm, it's actually the meta drive I want to access, since I've moved about 1/2 of the data off of that drive before it went to unformatted. If I 'fix' the actual drive, it will probably show all the files, including the ones I've already moved elsewhere.

However, I'd rather deal with finding and fixing duplicates than having to figure out what I'm missing altogether and have to re-create the missing data.

Your idea to umount disk9 did work, and I was able to stop the array per the GUI afterwards.

I decided to try restarting unRAID after this to see if it might 'find' disk9 again, and show it as formatted, but that didn't work. When it restarted, it still showed as unformatted.

I guess I will get it stopped again, and run reiserfsck on disk9 and get to work on fixing duplicates, assuming it finds anything on the disk at all.

jphipps · September 27, 2014

One thing you have to watch: if the disk is red-balled, then it is an emulated disk, so any repairs won't be written to the real physical disk unless you do a disk replacement and have unRAID rebuild the real disk from the virtual copy.

JustinChase · September 27, 2014

hmmm, so if I run

reiserfsck --check /dev/md9

are you saying that it will check the virtual/emulated disk, and not the actual hard drive? If so, that's actually a good thing for me, as I want to 'fix' the emulated disk, so I can get it to show only the files that remain after I moved about 1/2 of them off the emulated drive.

I guess I'll find out soon, since I did run that command, and it's processing as I type.

thanks again for all your help with all my issues the last week or so!!

itimpi · September 27, 2014

Yes - if a disk is red-balled then unRAID has stopped writing to it and using the 'md' device is working against the emulated disk. You need to do a disk rebuild to get the data written to a real physical disk.

jphipps · September 27, 2014

Yeah, if it is red-balled, that would be the case: you are repairing the virtual copy. The only danger is that if you have another disk failure, you could lose that emulated copy.

No problem. I know how nerve-racking it is when you are in fear of losing data.

JustinChase · September 27, 2014

So, here are the results. I don't know what my best next action would be...

root@media:~# reiserfsck --check /dev/md9

reiserfsck 3.6.24

Will read-only check consistency of the filesystem on /dev/md9

Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes

###########

reiserfsck --check started at Sat Sep 27 10:32:37 2014

###########

Replaying journal: Done.

Reiserfs journal '/dev/md9' in blocks [18..8211]: 0 transactions replayed

Checking internal tree.. \/ 7 (of 16\/ 45 (of 103// 1 (of 170-block 190808067: The number of items (6) is incorrect, should be (0)

the problem in the internal node occured (190808067), whole subtree is skipped / 12 (of 16// 27 (of 170// 98 (of 170\block 440476206: The level of the node (29666) is not correct, (1) expected

the problem in the internal node occured (440476206), whole subtree is skipped finished

Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.

Bad nodes were found, Semantic pass skipped

2 found corruptions can be fixed only when running with --rebuild-tree

###########

reiserfsck finished at Sat Sep 27 11:40:34 2014

###########

itimpi · September 27, 2014

You now want to run with the --rebuild-tree option to get a valid file system back. Most of the time this gets back virtually all the files, but if there was severe corruption then some will be missing.

There is another option that can be used with --rebuild-tree, which is --scan-whole-partition. This reads every sector on the disk looking for what look like files. It can recover more data than running without it, but it can also result in spurious (e.g. partial or deleted) files being found and added to the lost+found folder.
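
To make the check-then-repair sequence concrete, here is a small sketch that reads a saved `reiserfsck --check` log and prints the follow-up command its summary line points to. Nothing is run against the disk. `suggest_next_step` is a hypothetical helper, and the wording matched for the `--fix-fixable` case is an assumption, since only the `--rebuild-tree` summary line appears in this thread.

```shell
#!/bin/sh
# Sketch: map the summary line of a saved check log to the next command.
# Usage: suggest_next_step <logfile> <device>
suggest_next_step() {
    log="$1"
    dev="$2"
    if grep -q -e 'only when running with --rebuild-tree' "$log"; then
        echo "reiserfsck --rebuild-tree $dev"
    elif grep -q -e 'running with --fix-fixable' "$log"; then
        # Assumed wording for the less severe case.
        echo "reiserfsck --fix-fixable $dev"
    else
        echo "no repair suggested by the log"
    fi
}

# Example using the summary line from the check output posted above.
sample="$(mktemp)"
echo '2 found corruptions can be fixed only when running with --rebuild-tree' > "$sample"
suggest_next_step "$sample" /dev/md9
rm -f "$sample"
```

For the log in this thread the helper prints the plain --rebuild-tree command; adding --scan-whole-partition remains a manual decision because of the spurious files it can bring back.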

JustinChase · September 27, 2014

thanks. The wiki is very specific that it has to be done 'right', so to confirm I should run this...

reiserfsck --rebuild-tree /dev/md9

exactly as shown above, correct?

Once done, it should show the drive in the GUI as formatted again, but still red-balled, but that should allow me to copy/move files off of that disk onto another disk in my array, leaving disk9 essentially 'empty'.

Once that's done, I should be able to format disk9 (as XFS), then it can be added back into the array, and I should be good to go again.

Does that sound about right, or am I missing something?

thanks again

jphipps · September 27, 2014

If all goes well and it comes back up correctly as a valid filesystem and is still red balled, you still have to rebuild that on a physical drive by either replacing the drive and letting it rebuild, or start with a new config and let it either build/check parity.

itimpi · September 27, 2014

Basically correct.

Not sure if simply reformatting the disk will remove the red-ball status! I suspect you may still have to go through a rebuild step to clear this state. Note, however, that as far as I know you can do the reformat to XFS on the 'virtual' disk before the rebuild. This would have the advantage of being quick, and if you wanted to, you could start moving files back during the rebuild process (although it would slow down the rebuild). You may prefer to wait for the rebuild to finish.

JustinChase · September 27, 2014

bad news it seems...

root@media:~# reiserfsck --rebuild-tree /dev/md9
reiserfsck 3.6.24

*************************************************************
** Do not  run  the  program  with  --rebuild-tree  unless **
** something is broken and MAKE A BACKUP  before using it. **
** If you have bad sectors on a drive  it is usually a bad **
** idea to continue using it. Then you probably should get **
** a working hard drive, copy the file system from the bad **
** drive  to the good one -- dd_rescue is  a good tool for **
** that -- and only then run this program.                 **
*************************************************************

Will rebuild the filesystem (/dev/md9) tree
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
Replaying journal: Done.
Reiserfs journal '/dev/md9' in blocks [18..8211]: 0 transactions replayed
###########
reiserfsck --rebuild-tree started at Sat Sep 27 13:15:12 2014
###########

Pass 0:
####### Pass 0 #######
Loading on-disk bitmap .. ok, 376654954 blocks marked used
Skipping 30567 blocks (super block, journal, bitmaps) 376624387 blocks will be read
0%..block 67015897: The number of items (1) is incorrect, should be (0) - corrected
block 67015897: The free space (17) is incorrect, should be (4072) - corrected
block 68140233: The number of items (6144) is incorrect, should be (1) - corrected
block 68140233: The free space (59648) is incorrect, should be (464) - corrected
pass0: vpf-10110: block 68140233, item (0): Unknown item type found [201326593 385872640 0x49ffff00  (15)] - deleted
block 70212653: The number of items (65521) is incorrect, should be (1) - corrected
block 70212653: The free space (33) is incorrect, should be (2037) - corrected
pass0: vpf-10110: block 70212653, item (0): Unknown item type found [4293459968 92209184 0xca908bf  (15)] - deleted
block 71292588: The number of items (1107) is incorrect, should be (1) - corrected
block 71292588: The free space (165) is incorrect, should be (2144) - corrected
pass0: vpf-10110: block 71292588, item (0): Unknown item type found [69599234 155844917 0x6540a03  (15)] - deleted
block 72139248: The number of items (1) is incorrect, should be (0) - corrected
block 72139248: The free space (7) is incorrect, should be (4072) - corrected
block 73023393: The number of items (1024) is incorrect, should be (1) - corrected
block 73023393: The free space (512) is incorrect, should be (3280) - corrected
pass0: vpf-10110: block 73023393, item (0): Unknown item type found [67109119 50332160 0x1e000c00  (15)] - deleted
block 73111698: The number of items (65414) is incorrect, should be (1) - corrected
block 73111698: The free space (256) is incorrect, should be (3793) - corrected
pass0: vpf-10110: block 73111698, item (0): Unknown item type found [67174655 277938189 0x20007  (15)] - deleted
block 73344453: The number of items (1024) is incorrect, should be (1) - corrected
block 73344453: The free space (768) is incorrect, should be (3792) - corrected
pass0: vpf-10110: block 73344453, item (0): Unknown item type found [83886082 33555968 0xad000b00  (15)] - deleted
block 73481095: The number of items (1) is incorrect, should be (0) - corrected
block 73481095: The free space (4) is incorrect, should be (4072) - corrected
block 73559406: The number of items (1) is incorrect, should be (0) - corrected
block 73559406: The free space (65442) is incorrect, should be (4072) - corrected
.block 76566909: The number of items (1280) is incorrect, should be (1) - corrected
block 76566909: The free space (256) is incorrect, should be (3792) - corrected
pass0: vpf-10110: block 76566909, item (0): Unknown item type found [50331905 16777472 0xde000e00  (15)] - deleted
block 77580715: The number of items (1) is incorrect, should be (0) - corrected
block 77580715: The free space (65480) is incorrect, should be (4072) - corrected
block 79472804: The number of items (2304) is incorrect, should be (1) - corrected
block 79472804: The free space (4352) is incorrect, should be (3536) - corrected
pass0: vpf-10110: block 79472804, item (0): Unknown item type found [117441026 134222080 0x24001300  (15)] - deleted
block 79475198: The number of items (768) is incorrect, should be (1) - corrected
block 79475198: The free space (63744) is incorrect, should be (2768) - corrected
pass0: vpf-10110: block 79475198, item (0): Unknown item type found [33554433 218103296 0x95000a00  (15)] - deleted
block 80145536: The number of items (768) is incorrect, should be (1) - corrected
block 80145536: The free space (256) is incorrect, should be (1488) - corrected
pass0: vpf-10110: block 80145536, item (0): Unknown item type found [83886849 167772672 0x70000600  (15)] - deleted
.20%                                            left 292991467, 26021 /sec
The problem has occurred looks like a hardware problem. If you have
bad blocks, we advise you to get a new hard drive, because once you
get one bad block  that the disk  drive internals  cannot hide from
your sight,the chances of getting more are generally said to become
much higher  (precise statistics are unknown to us), and  this disk
drive is probably not expensive enough  for you to you to risk your
time and  data on it.  If you don't want to follow that follow that
advice then  if you have just a few bad blocks,  try writing to the
bad blocks  and see if the drive remaps  the bad blocks (that means
it takes a block  it has  in reserve  and allocates  it for use for
of that block number).  If it cannot remap the block,  use badblock
option (-B) with  reiserfs utils to handle this block correctly.

bread: Cannot read the block (122410614): (Input/output error).

Aborted

suggestions?

itimpi · September 27, 2014

That is the most common response from reiserfsck if a drive comes up as unformatted. You need to run it as suggested with --rebuild-tree to fix the issues. In most cases this clears the problem with virtually no data loss.

Note that at this point, since the drive is red-balled, you are writing to the emulated 'md9' device, not the physical drive. You still have the physical drive (I would remove it from the server to play safe) to fall back on.

JustinChase · September 27, 2014

That was the result of running

reiserfsck --rebuild-tree /dev/md9

anything else I can try?

dgaschk · September 27, 2014

Post a screen shot of unRAID Main and a SMART report for disk9.

JustinChase · September 27, 2014

screenshot attached.

I ran smartctl -a -d ata /dev/sdg >/boot/smart.txt, but did not manually force a smart test recently. not sure if this report is valid, or useful in this case.

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.16.2-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model:     WDC WD30EZRX-00D8PB0
Serial Number:    WD-WCC4N1252623
LU WWN Device Id: 5 0014ee 25f94d9ce
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Sep 27 17:20:55 2014 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
				was completed without error.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		(40560) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 407) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x7035)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   178   177   021    Pre-fail  Always       -       6058
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       170
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       697
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       12
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       5
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       3836
194 Temperature_Celsius     0x0022   122   113   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
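
As a quick way to read a report like this one, the sketch below scans a saved smartctl report and prints the three sector-health attributes (IDs 5, 197 and 198) only when their RAW_VALUE is not zero. Picking those three attributes is this example's assumption about what is worth flagging, and `flag_sector_attrs` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: flag non-zero raw values for the sector-health SMART attributes.
# Usage: flag_sector_attrs <saved smartctl report>
flag_sector_attrs() {
    report="$1"
    # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    while read -r id name flag value worst thresh type updated failed raw rest; do
        case $name in
            Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)
                if [ "$raw" != "0" ]; then
                    echo "$name raw=$raw"
                fi
                ;;
        esac
    done < "$report"
}

# /boot/smart.txt is where the report in this thread was saved.
if [ -f /boot/smart.txt ]; then
    flag_sector_attrs /boot/smart.txt
fi
```

For the report above all three raw values are 0, so nothing would be printed. The self-test log is also empty, so an extended test (`smartctl -t long /dev/sdg`, roughly 407 minutes according to the report) would give a more meaningful picture of the physical drive.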

JustinChase · September 27, 2014

So, I assume the drive is screwed beyond remedy, or at least the parity representation of it; correct?

If I just pull the actual drive, stick it in another machine, run a parity check on it, then stick it back into the server, I still couldn't rebuild the parity representation onto this drive, correct?

Is there anything I can do at this point to retrieve the data off of disk9, or do I just need to move on?

I know I probably sound desperate, but I'm sick of not having my server, and I'm disappointed that Tom has not responded to my requests for help, so I really just want to put this behind me and get whatever remains of my data back into use, as some is better than nothing.

Any ideas?

jphipps · September 28, 2014

I would stop the array completely and mount the /dev/sdX1 partition, and you should be able to see the data on the drive. If the drive mounts fine and the data on it looks good, I think I would just start a new config and put all the drives back in as they were; at least you should then have all the original disk9 data in the array. Then you would just need to worry about the dups.

JustinChase · September 28, 2014

how do I

...mount the /dev/sdX1 partition and you should be able to see the data on the drive.

I tried...

root@media:/mnt# mount /dev/sdg1

and got...

mount: can't find /dev/sdg1 in /etc/fstab or /etc/mtab

jphipps · September 28, 2014

You have to give it a mount point and possibly a filesystem type:

mkdir /mnt/somedisk

mount /dev/sdg1 /mnt/somedisk

If it says you need a filesystem type, just add a -t {filesystem type} after mount.
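
A small sketch of the same steps, with one addition: it asks for a read-only mount (`-o ro`) so that browsing the files does not modify them. The read-only option and the `build_mount_cmd` helper are additions of this example, not part of the advice above, and the commands are printed rather than executed.

```shell
#!/bin/sh
# Sketch: print the mount command for a device, mount point and optional
# filesystem type. Usage: build_mount_cmd <device> <dir> [fstype]
build_mount_cmd() {
    dev="$1"
    dir="$2"
    fstype="${3:-}"   # optional, e.g. reiserfs
    if [ -n "$fstype" ]; then
        echo "mount -o ro -t $fstype $dev $dir"
    else
        echo "mount -o ro $dev $dir"
    fi
}

# Commands for the disk discussed in this thread.
echo "mkdir -p /mnt/somedisk"
build_mount_cmd /dev/sdg1 /mnt/somedisk reiserfs
```

When finished looking around, `umount /mnt/somedisk` releases the disk again.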
