May 12, 2015
Hi guys, we had a power outage the other day, and little did I know that the UPS was kaput and dropped the server as soon as the power went out. Ever since then drive 1 has been 'red balling'. I know, I can hear you say "yep, the drive is stuffed". Well, no...

So I have done a bit of googling and reading here and found several things:
- Check the SMART data. I did, and it's clean (the drive is a virtually brand new WD 2TB Red).
- If you are sure everything is fine, then set the tower as clean and restart. I did that; an hour or so later, red ball.

I even went as far as killing the config and starting again. It was green until parity stopped, and blam, red ball.

After a bit more searching I found another thread where drive 20 is doing the same thing, but nothing has been resolved. As part of that thread it was suggested to check all the connections, so I stopped the tower and checked all the connections to the drives (it's only a small system: 3 data, cache and parity). I still have drive 1 as a red ball.

Any ideas? (Please be specific, as I have little experience.) Attaching syslog and SMART status report. System is running unRAID 5.05.

syslog.txt smart.txt
May 12, 2015
This syslog does not show any drive errors, nothing that would have caused a red ball. However, I noticed that Disk 1 was not spun up with the others, which makes me wonder whether it is still red-balled from before. I'm unclear what you are doing to resolve the red ball.

By the way, I noticed that you have IDE emulation turned on for some of your SATA drives, in particular the Parity drive. When you next boot, go into the BIOS settings, look for the SATA mode, and change it to a native SATA mode, preferably AHCI; anything but IDE emulation mode. It should be quicker. Your BIOS is dated 2008, and I see a lot of resource disabling and workarounds. You might check for a BIOS upgrade for that motherboard.
May 12, 2015 Author
Hi RobJ, thanks for those tips re the BIOS. I will plug a screen in and do them the next chance I get.

I am seeing no read or write activity from drive 1 in the GUI. I can browse to the files in the directories but not use them (say, play a music file).

What have I done to fix it?
- Reseated cables.
- Set the 'config' as good and restarted the array.
- Cleared the config, set the drives up again and started the array, keeping parity.
- Cleared the config, set the drives up again and started the array, making new parity.

I have done nothing else, as I can't find a decent example to go by, so if you can think of something else you want me to do, let me know.

CH

syslog.txt smart.txt
May 12, 2015
Once a drive red-balls, it will stay red-balled until you take an action that clears the red ball (e.g. rebuild it). It is not clear to me whether you are saying that you take a recovery action and then it red-balls again, or that you have never taken a recovery action after the original red ball.

Also, while a disk is red-balled, unRAID stops reading from and writing to the physical drive and instead emulates it using the combination of the other drives plus parity.
May 12, 2015 Author
Hmmm, OK, it seems I have missed something while trying to self-help. So from what you are saying, I have probably not done anything that will clear the red ball.

Now that we have established that I have done nothing that will help this situation, can you please point me at what I need to do? You say 'rebuild' the drive, but this has always confused me. I know that unRAID keeps parity for this, but I have never had to use it. Consequently I have no idea how to do it. The GUI does not seem to have an option for it, so how do I do it?

PS: I fully plan on upgrading to V6 as soon as I have got this sorted, if that changes your action plan. And thanks for your time!
May 12, 2015
The recovery action is the same for v5 and v6, so upgrading at this point would not achieve anything.

A red ball basically means that a write failed for some reason. This can lead to file system corruption that needs fixing with reiserfsck. If you do have file system corruption, then attempts to write to that drive can fail and cause the drive to be red-balled again.

From your earlier posts it appears that you may have tried to recover by doing a 'new config', then reassigning the drives and rebuilding parity. Is this so? If this is what you did, then it seems likely to me that you have file system corruption, or that the drive really is failing. However, I saw nothing in the SMART attributes to suggest the drive really has a problem. Note that if you went the new config route, any data written to that drive after the red ball will have been lost, as that is not the correct recovery procedure for avoiding data loss.

What I would suggest doing at this stage is:
- Stop the array and restart it in Maintenance mode.
- From a telnet/console session, run a command of the form reiserfsck --check /dev/md?? where ?? corresponds to the disk number that has issues. This command will run for some time, and if you are using telnet you must not close the telnet window while it is running or the command will abort.

The results of that command will indicate whether there actually is file system corruption and will suggest a recovery action if any is found. The recommendation will most likely be to take whatever action it suggests, but you may want to check back here before doing so.

The final step in any recovery will be to 'rebuild' the failed disk, but before trying that it seems sensible to get to the root cause of your current issue.
May 12, 2015 Author
Hi there. OK, yes, I did start a new config, and I do remember seeing that note. But this array mainly holds my media collection (music and videos), so it doesn't change much. Anything that is lost will not be a big concern; I have not added anything to the array recently.

So I ran that command, see below.

-=-=-=-=-=-=-=-=-=-=-
tower login: ########
Linux 3.9.11p-unRAID.
root@tower:~# reiserfsck --check /dev/sdc
reiserfsck 3.6.24

Will read-only check consistency of the filesystem on /dev/sdc
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes

reiserfs_open: the reiserfs superblock cannot be found on /dev/sdc.
Failed to open the filesystem.

If the partition table has not been changed, and the partition is valid and it really contains a reiserfs partition, then the superblock is corrupted and you need to run this utility with --rebuild-sb.

root@tower:~#
-=-=-=-=-=-=-=-

So, run the command offered?

Chris.
May 12, 2015
That is the wrong device name. It should be of the form /dev/md?? where ?? corresponds to the disk number in unRAID, not the sd? letter variant. If you MUST use the raw device name, then you have to include the partition (i.e. /dev/sdc1), but doing so will invalidate the current parity. Using the /dev/md?? type names does not invalidate parity and removes the need to specify the partition.
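The naming rule above can be sketched in a few lines of Python. This is only an illustration: the helper names (md_device, check_command, is_raw_device) are made up for this example, the assumption that array data disks are numbered from 1 is mine, and the code only builds the command line without running anything against a disk.

```python
import re
from typing import List


def md_device(disk_number: int) -> str:
    """Map an unRAID array disk number to its md device name, e.g. 1 -> /dev/md1."""
    # Assumption for this sketch: array data disks are numbered from 1.
    if disk_number < 1:
        raise ValueError("array disk numbers are assumed to start at 1")
    return f"/dev/md{disk_number}"


def check_command(disk_number: int) -> List[str]:
    """Build (but do not run) the read-only check command from the thread."""
    return ["reiserfsck", "--check", md_device(disk_number)]


def is_raw_device(path: str) -> bool:
    """True for raw names such as /dev/sdc or /dev/sdc1, which the thread warns against."""
    return re.fullmatch(r"/dev/sd[a-z]+\d*", path) is not None


if __name__ == "__main__":
    print(" ".join(check_command(1)))
    print(is_raw_device("/dev/sdc"))
```

For disk 1 this yields the /dev/md1 form recommended above, and the is_raw_device check flags the /dev/sdc name used in the failed attempt.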
May 12, 2015
"OK, entered the right command and it's running now."

Good. I'm not sure how long it takes for a 2TB disk, but expect something like an hour. I think it also depends on the number of files on the disk: more files means it takes longer.
May 12, 2015 Author
Try several hours!

###########
reiserfsck --check started at Tue May 12 20:03:46 2015
###########
Replaying journal: Done.
Reiserfs journal '/dev/md1' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..finished
Checking Semantic tree: finished
No corruptions found
There are on the filesystem:
        Leaves 397506
        Internal nodes 2404
        Directories 1093
        Other files 6585
        Data block pointers 401575586 (0 of them are zero)
        Safe links 0
###########
reiserfsck finished at Tue May 12 22:14:18 2015
###########
root@tower:~#
=-=-=-=-=-=-=-=-=-=

I tried stopping and restarting (not rebooting) the tower into normal operation, but it's still red.

Chris. (Attaching new syslog.)

syslog2.txt
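For reference, the run time can be read straight off the two timestamps in the output above. A minimal Python sketch, assuming the timestamp layout shown in that output and an English locale for the day and month names (the elapsed helper is just a name picked for this example):

```python
from datetime import datetime, timedelta

# Timestamp layout as printed by reiserfsck in the post above.
FMT = "%a %b %d %H:%M:%S %Y"


def elapsed(start: str, finish: str) -> timedelta:
    """Return the time between two reiserfsck-style timestamps."""
    return datetime.strptime(finish, FMT) - datetime.strptime(start, FMT)


run = elapsed("Tue May 12 20:03:46 2015", "Tue May 12 22:14:18 2015")
print(run)  # 2:10:32
```

So the check on this 2TB disk took a little over two hours.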
May 13, 2015
http://lime-technology.com/wiki/index.php/Troubleshooting#What_do_I_do_if_I_get_a_red_ball_next_to_a_hard_disk.3F
May 13, 2015 Author
Hi dgaschk, thanks for taking the time to find that link, but that is the info I found before (mentioned near the top of the thread, although I didn't link to it), and I followed it to the best of my ability. Marking the config as good and restarting the array, or clearing the config as described in the article, does nothing. Nor does clearing the config altogether and starting again with a full parity rebuild, as I said further up. If I have missed an important step please point it out; as I have said, I am not that good with Linux.
May 13, 2015
The good news is that reiserfsck found no corruption, so you have not suffered any data loss. In some ways that is a shame, as corruption would have explained why the drive red-balls again.

The bad news is that we still do not know why the drive red-balls. At this stage the most likely culprit is a SATA or power cable not making a perfect connection. You can also get more obscure causes, such as a RAM stick going faulty or a disk controller card not being perfectly seated, but these tend to produce more random errors rather than affecting a single specific drive, and are thus unlikely.

I would suggest that at this point you get the SMART report for all the drives and check that none of them have "Pending sectors" that are not 0, as this can interfere with any attempt to resolve the red ball by doing a rebuild. Reallocated sectors are also something to watch, but a non-zero value there is not an issue as long as it is small and does not increase over time. Having said that, I prefer my drives to have 0 reallocated sectors.

In terms of resolving the current red ball, if you look at the link given a few posts earlier, the recommended methods are to follow the steps in the "Re-enable the Drive" or "Replace the Drive" sections. Both of these will trigger a rebuild: the first back onto the same drive (effectively overwriting its data with the same data) and the second onto a new drive. As you have said that you had not done a rebuild earlier, I assume that neither of these procedures was followed?
May 13, 2015 Author
Hi. Well, it seems there is something I didn't understand in those instructions, because that is what I followed, or at least I thought I did.

I checked all the drives and I do have issues with one of them, my cache drive. That drive is quite old, so no surprise there.

-=-=-=-=-=-
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       5
-=-=-=-=-=-
(SMART report attached.)

So that one is on its way out. The drive in question, drive 1 or sdc, is still showing as good, with zero for everything mentioned in the link above.

So I will go over that 'howto' linked above and see what I did wrong last time. What I suspect I did last time was just rebuild parity, not rebuild data.

Thanks for your time, and I will report back once I get somewhere.
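The check described in this thread (pending sectors must be 0, reallocated sectors should be small and stable) can be sketched in Python against the two attribute lines quoted above. This is a rough illustration, not a full SMART parser: it assumes the usual ten-column attribute table layout with the raw value in the last column, and the function names are invented for this example.

```python
from typing import Dict

# The two attribute rows quoted in the post above (cache drive).
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       5
"""


def raw_values(report: str) -> Dict[str, int]:
    """Collect attribute name -> raw value from SMART attribute table rows."""
    values: Dict[str, int] = {}
    for line in report.splitlines():
        fields = line.split()
        # Assumed columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                values[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer; skip this row
    return values


def has_pending_sectors(values: Dict[str, int]) -> bool:
    """True when the pending sector count is non-zero, as flagged in the thread."""
    return values.get("Current_Pending_Sector", 0) > 0


values = raw_values(SAMPLE)
print(values)
print(has_pending_sectors(values))
```

On the sample rows this reports 0 reallocated sectors and 5 pending sectors, which is the condition called out for the cache drive.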
May 13, 2015
OK - issues with the cache drive do not affect the array data drives, so that is not a problem here. It would be nice to get this finally resolved for you.
May 13, 2015 Author
OK, so I think I know what happened. I followed this procedure while at work, via a remote PC, and I am thinking that I didn't start and stop the array with the drive removed. I have just done the steps again, and the drive showed a blue dot, then an orange dot, and it's now telling me that it is rebuilding data. It's going to take ages!

Once I am done with that drive I will shut down the array and fix those other issues you pointed out with the BIOS. I am looking now to see if I can get a newer BIOS.

Thanks again for your time!
May 13, 2015 Author
Oh dear. Don't worry, the rebuild is still progressing, but it seems the BIOS is very old. There are 7 iterations after the one that is installed! Installed is F4; they are at F11 (stable), with a beta for F12. At least I don't have to flash all of those!
May 14, 2015
Run a parity check. The Pending Sectors count must return to zero. See here: http://lime-technology.com/wiki/index.php/Troubleshooting#Resolving_a_Pending_Sector