[SOLVED] Problems with my array after upgrading to v6.6.6


hwilker


I just upgraded from v6.1 to v6.6.6. The upgrade itself went smoothly, but a series of events after rebooting left me unsure how to proceed. (A diagnostics zip file is attached for the last boot of the system described below.)

 

Please note that nowhere in the actions I describe below did I mount the array.

 

When the system first came up after upgrading, it indicated that two disks had UDMA CRC errors (see image001). Some rooting around on the internet suggested that such errors relate to the transfer of data from the disk to the host, and that the best course of action is to check the cable connections and the seating of the disks.
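For reference, the UDMA CRC error count is SMART attribute 199, which you can read with smartctl. A minimal sketch; the device name /dev/sdb is a placeholder and the smartctl call is commented out, so the demo runs against a sample output line instead:

```shell
# On the server you would run something like (device name is hypothetical):
#   smartctl -A /dev/sdb | grep -i crc
# Sample attribute line, filtered the same way the real output would be:
sample='199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 7'
echo "$sample" | awk '{print $2 ": " $NF}'
```

The last field is the raw count of CRC errors seen on the link between the drive and the controller.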

 

I did this and rebooted the system. The UDMA CRC errors didn't reappear, but a potentially more serious problem was reported: disk 2 was missing. More curious, the drive that had been disk 2 (serial number ending in 5HSG) wasn't even offered as an option to remount (I do have a hot spare that was presented as available to replace the 'missing' disk 2).

 

I shut down the system and tried moving disk 2 to another empty bay (this may have been a mistake, but I thought the physical position of a disk was no longer significant in v6) and restarted the system. Disk 2 was still reported as missing, so I shut down again, put disk 2 back into its original location, and rebooted.

 

When the system came back up it still indicated that disk 2 was missing, but now the physical drive (5HSG) was presented as an option to place in the disk 2 slot (see image002). I once again shut the system down and rebooted.

 

Again, disk 2 was reported as missing and in an error state, but now the drive was shown as selected for disk 2 (image003). (This is the current state of the system and the basis for the diagnostics file attached to this post.)

 

My general question is how best to proceed. Specifically, I presume that if I start the array in this state, it would rebuild onto the physical disk (5HSG) as if it were a replacement for the disk that used to occupy the disk 2 slot. I presume that would be OK, but is there a better way, presuming the data on disk 2 is valid; some way to just rebuild the configuration? If there is, is it more risky than just rebuilding the disk?

 

Any advice or suggestions about the best way to proceed would be appreciated.

 

Thanks
 

tower-diagnostics-20190222-1320.zip

image001.PNG

image002.PNG

image003.PNG


Thanks. I started the array and received the following message (image004), along with an indication of problems on disk2 in the Device column (the red 'X' on the left).

 

But the array itself seems intact. I can mount the array on my PC, and if I access disk2 directly over the network via the TOWER, I can see disk2 and access its content.

 

Can you advise what I should do next?

 

Thanks.

 

image004.PNG


Unraid has disabled disk2 due to a write error and is now emulating its contents using the combination of the other data disks plus the parity disk (which is why you can still see its contents). Once a disk is disabled, Unraid stops writing to it. However, this means the array is now unprotected until the problem disk is recovered to a normal state; another disk getting disabled would cause data loss, so you want to get the array back to a good state ASAP.
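The emulation works because parity stores the bitwise XOR of the corresponding data on every data disk, so any one missing disk can be recomputed from parity plus the survivors. A toy sketch with made-up byte values:

```shell
# Toy example: three data disks each contribute one byte at the same offset.
d1=170; d2=92; d3=15              # made-up byte values for illustration
parity=$(( d1 ^ d2 ^ d3 ))        # what the parity disk stores for this offset
# If disk 2 drops out, its byte is recovered from parity and the survivors:
recovered=$(( parity ^ d1 ^ d3 ))
echo "$recovered"                 # matches d2 (92)
```

The same relation is why writes to the emulated disk still work: Unraid updates parity as if the missing disk's contents had changed, so a later rebuild reproduces the up-to-date data.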

 

The disk is quite likely to be OK, as write errors are frequently due to external factors (e.g. cabling, power). If you post your diagnostics zip file (obtained via Tools->Diagnostics), you will likely get some feedback on whether the disk looks OK. The recovery process involves rebuilding to a good disk to bring the physical disk back in line with the contents that Unraid is currently emulating.


Thanks. I've attached the diagnostics zip file as suggested.

 

By the way, I've gone through the rebuild process over the years, both to replace a failed disk and to replace a disk with a larger one. But from a UI perspective, I'm unclear how to tell the array to rebuild an existing disk. If I recall correctly, you place the new, unrecognized disk into the old disk's slot in the array (disk2 in this case), start the array, and the system does the rest. How do you tell Unraid to rebuild using an existing disk?

 

Anyway, I'll await the analysis of the diagnostics file before proceeding. I, too, think there is nothing wrong with the disk itself, as prior to upgrading to 6.6 (from 6.1) there were no indications of problems, and the SMART status indicates it's OK (see image005). But if there are problems with it, I've got an 8 TB hot spare ready to go. I just don't want to 'waste' that spare at this point if I don't have to, since I've got plenty of free space in my array.

 

Thanks for your help.

tower-diagnostics-20190223-1502.zip

image005.PNG


Thanks.

 

My build uses a Norco 4224 case with two M1015 SAS controllers and eight onboard SATA ports. The drive in question is connected to one of the SAS controllers, so replacing the cable to disk2 means replacing a SAS cable (SFF-8087, if I remember correctly) that serves four drives. So, before dealing with replacing the cable, I thought I'd try a few things:

 

I followed the instructions for reconstructing the drive, up until the final step of actually starting the reconstruction. Upon re-assigning the disk, I received a warning that there were UDMA CRC errors on disk 2 (see image006).

 

So I shut down the system, moved the disk into one of the slots controlled directly by the motherboard's SATA ports, and rebooted. This time I received no notices or warnings (see image007).

 

Now I'm left with a couple of questions and some decisions to make. As noted in the first post in this thread, immediately after upgrading to 6.6 I received notifications of UDMA CRC errors on two other drives (drive6 and drive8; after that initial boot those drives haven't reported any problems). Those two drives are also attached to SAS controllers, but on different cables from disk2 and from each other. That would mean three of my four SAS cables all started reporting problems at the same time, since none of this was being reported until the upgrade to 6.6. (Dumb question: does 6.6 perhaps incorporate diagnostics that are in some way more 'sensitive' to UDMA CRC errors than 6.1?) It does seem odd to me that these errors would suddenly start to appear just after upgrading the OS, and on three different cables.
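One thing worth noting: SMART attribute 199 is a lifetime counter that only ever increases, so one possibility (not confirmed here) is that a newer release simply notifies on a raw count it is seeing for the first time rather than on newly occurring errors. You can distinguish new errors from old ones by saving a baseline count and comparing after a reboot or rebuild. A sketch with sample values; on a live system the counts would come from smartctl (commented out below):

```shell
# Hypothetical workflow: record the raw CRC count for a disk, reboot or
# rebuild, then compare. On a live system the count would come from:
#   smartctl -A /dev/sdb | awk '$1 == 199 { print $NF }'
baseline=7        # count saved before the reboot (sample value)
current=7         # count read afterwards (sample value)
if [ "$current" -gt "$baseline" ]; then
    echo "new CRC errors since baseline"
else
    echo "no new CRC errors"
fi
```

If the count holds steady across a rebuild, the errors are historical; if it climbs, the link is still having problems.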

 

At any rate, I'm now confronted with the question of whether, given the UDMA CRC errors the system is apparently throwing off, it is safe or advisable to attempt a rebuild with disk2 in its new slot. Or, given the possibility of new UDMA CRC errors occurring during the rebuild, should I just go the safe route and replace the SAS cables before proceeding?

 

I know there is probably no definitive answer, but the fact that all these errors are coming up on different cables, all at once, at exactly the moment I upgraded the software makes me wary.

image006.PNG

image007.PNG
