November 4, 2010
One of my HDs showed up with a red ball, so I shut down the server, replaced the failed drive with a new HD of the same capacity, and rebooted. unRaid reported that a failed disk had been replaced, and I checked the option to start the array and rebuild/re-expand the file system. However, the new HD, which initially showed up with a blue ball, is now showing a red ball, and no reconstruction is going on. I attached the syslog. Any help? Thanks, Lev
syslog-2010-11-03_1.txt_2.zip
November 4, 2010
Basically, writes to disk 5 are failing. It could be the disk itself, the cabling to it, or the disk controller port. I'd re-seat the cables first, then try a different cable, then a different port. (If you use a different port, you'll need to use the "Devices" page to assign the disk to the right logical slot in your array.)
November 4, 2010 (Author)
Joe, I am using a second SASLP-MV8 card that I just added to my server. It runs two HDs: the one that "failed" and the cache. I connected the SAS cable to the second connector on the controller card and also moved the HD in question to a different slot. The cables are seated well. Still the same problem. Any other thoughts? Lev
syslog-2010-11-03_2.txt.zip
November 4, 2010 (Author)
Now the rebuild has started, but at a snail's pace, and the monitor connected to the unRaid server shows a REISERFS abort message I've never seen. Very frustrating... The syslog is over 10 MB, so I can't even post it! PS. Here is the latest syslog after I changed controller cards.
syslog-2010-11-04_2.txt.zip
November 4, 2010 (Author)
I've tried every possible permutation I can think of. I switched controller cards, I tried the regular onboard SATA connectors, and I shuffled drives between slots. Nothing seems to help. The PCI controller card (2 SATA ports on one card), which worked in the past, now will not even let the system finish booting. The boot screen is stuck in endless messages about UDMA 133 and various other sdj-related errors. Can there be something wrong with the program itself, or something other than the controller card/wire/HD? I am attaching the latest syslog and going to get a sledgehammer...
syslog-2010-11-04_3.txt.zip
November 4, 2010
By any chance, is the new disk you got an "advanced format" disk, such as a WD EARS disk? Also, your MB looks like it has an Nvidia north bridge chipset; is this correct?
Nov 3 20:29:30 Tower kernel: ata10: status=0x41 { DriveReady Error }
Nov 3 20:29:30 Tower kernel: ata10: error=0x04 { DriveStatusError }
Nov 3 20:29:30 Tower kernel: sd 7:0:3:0: [sde] Result: hostbyte=0x00 driverbyte=0x08
Nov 3 20:29:30 Tower kernel: sd 7:0:3:0: [sde] Sense Key : 0xb [current] [descriptor]
Nov 3 20:29:30 Tower kernel: Descriptor sense data with sense descriptors (in hex):
Nov 3 20:29:30 Tower kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Nov 3 20:29:30 Tower kernel: 00 00 00 3e
Nov 3 20:29:30 Tower kernel: sd 7:0:3:0: [sde] ASC=0x0 ASCQ=0x0
Nov 3 20:29:30 Tower kernel: sd 7:0:3:0: [sde] CDB: cdb[0]=0x28: 28 00 00 00 00 3f 00 00 08 00
Nov 3 20:29:30 Tower kernel: end_request: I/O error, dev sde, sector 63
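
For readers trying to interpret the excerpt above, here is a minimal Python sketch that decodes the ATA status/error bytes and the SCSI CDB from those log lines. The helper names (`decode_bits`, `decode_read10`) are made up for illustration, and only a handful of well-known register bits are listed; it is a reading aid, not a diagnostic tool.

```python
# Well-known bits of the ATA status and error registers (not an exhaustive list).
STATUS_BITS = [(0x80, "BSY"), (0x40, "DRDY"), (0x20, "DF"), (0x08, "DRQ"), (0x01, "ERR")]
ERROR_BITS = [(0x80, "ICRC"), (0x40, "UNC"), (0x10, "IDNF"), (0x04, "ABRT")]


def decode_bits(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for mask, name in table if value & mask]


def decode_read10(cdb_hex):
    """Parse a 10-byte SCSI READ(10) CDB given as space-separated hex bytes.

    Returns (lba, block_count). Raises ValueError for any other command.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return lba, blocks


if __name__ == "__main__":
    print(decode_bits(0x41, STATUS_BITS))   # status=0x41 from the log
    print(decode_bits(0x04, ERROR_BITS))    # error=0x04 from the log
    print(decode_read10("28 00 00 00 00 3f 00 00 08 00"))
```

Read this way, status 0x41 is "device ready, error flagged", error 0x04 is an aborted command (which the kernel prints as DriveStatusError), and the CDB is a read starting at the same sector 63 that the end_request line reports.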
November 4, 2010 (Author)
That is correct. I have an MSI 680i quad SLI motherboard. I have 2 WD EARS HDs on that board; one has been there for quite some time now (disk 5), and the other was added a few days ago. Both drives have pins 7 and 8 connected (jumpered). The first WD EARS drive is the one now showing up with a red ball, and when I tried to replace it with a Seagate drive I was not able to rebuild the data. Thanks to parity, I was able to copy all of the data from "drive 5" over the network to another PC. Theoretically I no longer need that drive in the array. Should I delete it, then add a new drive and copy the data from my PC back onto the unRaid array? I can also try replacing the motherboard. I recently bought a Supermicro MBD-C2SEE-O board and can move my system to it. Should I try that as a last resort?
November 4, 2010
According to your syslog, the problem starts here. It looks to me like the system tries to probe this disk, the operation eventually times out, and the system then falls into a retry loop until it gives up.
-------------------------- From your syslog --------------------------
Nov 3 20:29:30 Tower kernel: sas: command 0xc37dcd80, task 0xc3ed4140, timed out: BLK_EH_NOT_HANDLED
Nov 3 20:29:30 Tower kernel: sas: Enter sas_scsi_recover_host
Nov 3 20:29:30 Tower kernel: sas: trying to find task 0xc3ed4140
Nov 3 20:29:30 Tower kernel: sas: sas_scsi_find_task: aborting task 0xc3ed4140
Nov 3 20:29:30 Tower kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 1701:mvs_abort_task:rc= 5
Nov 3 20:29:30 Tower kernel: sas: sas_scsi_find_task: querying task 0xc3ed4140
Nov 3 20:29:30 Tower kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 1645:mvs_query_task:rc= 5
Nov 3 20:29:30 Tower kernel: sas: sas_scsi_find_task: task 0xc3ed4140 failed to abort
Nov 3 20:29:30 Tower kernel: sas: task 0xc3ed4140 is not at LU: I_T recover
Nov 3 20:29:30 Tower kernel: sas: I_T nexus reset for dev 0300000000000000
Nov 3 20:29:30 Tower kernel: sas: I_T 0300000000000000 recovered
Nov 3 20:29:30 Tower kernel: sas: --- Exit sas_scsi_recover_host
On LGA775, because of cross-licensing, companies such as Nvidia could make their own north bridge chips for x86 processors. Some of those motherboards are aimed mainly at CrossFire or SLI, and I am not sure they fully support anything beyond that purpose. As you can see in the spec on Newegg, the 4 PCI-e slots can be arranged in two different ways depending on how you use them, which tells you there is some "spicy" ingredient in there. With i3/i5/i7 now, you cannot find any north bridge chip from Nvidia.
(The 4 PCI Express interface will operate at either x8+x8+x16+x8 or x16+x16+x8 mode)
http://www.newegg.com/Product/Product.aspx?Item=N82E16813130080
At this moment, I think your options are:
(a) Try seating the 2nd SASLP-MV8 in a different PCI-e slot to see if it helps. Also make sure there is no loose cable and that there is enough juice from your PSU. Unless you have plenty of headroom, always double-check power consumption every time you add hardware.
(b) Switch to a "regular" MB like this C2SEE, but make sure you do some verification on the C2SEE first before starting to migrate.
November 4, 2010 (Author)
Thank you for the reply. I tried moving the SAS cards to different PCI-E slots on the motherboard, even swapping with the video card. I probably changed the arrangement of the SAS controllers 4 or 5 different ways. I also tried plugging the WD and the replacement Seagate HDs directly into the motherboard's SATA connectors, and I tried a PCI SATA board that has been running in this system for some time with the present configuration without a problem. I tried swapping the WD/Seagate drives with one of the other drives on the connectors, and the same problem persisted, with drive 5 being the culprit. What did you mean by "...but make sure you do some verification on this C2SEE first before starting to migrate"? What kind of verification? If I just replace the motherboard and plug all the HDs into it, what else should I do to make sure that the system will not have a problem?
November 4, 2010
"I tried swapping the WD/Seagate drives with one of the other drives on the connectors and the same problem persisted, with drive 5 being the culprit."
You can move this disk 5 to another machine, such as a Windows/Mac box, and use reiserfs tools to check it out. If you can read data from the disk on a different machine, then it is unlikely to be a disk issue.
"What did you mean by saying '...but make sure you do some verification on this C2SEE first before starting to migrate'? What kind of verification?"
What I meant is: if you have spare components, build an unRAID system with the free basic license on this MB and run some burn-in tests, such as memtest, copying large amounts of data in and out of the system, running a couple of parity checks, etc. Although this MB is the official one used by Limetech, that doesn't mean "the one" you got will be trouble-free. If this is not feasible, then just do your migration and run the burn-in tests afterwards. I am doing the same thing because I also plan to migrate to this MB.
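
The "copy large amounts of data in and out" part of such a burn-in can be sketched roughly as below. This is only an illustration under stated assumptions: the function name `write_and_verify` and the scratch file name are invented here, and the target directory would be whatever share or mount point you want to exercise. It writes random data, reads it back, and compares SHA-256 digests.

```python
import hashlib
import os
import tempfile


def write_and_verify(directory, size_mb=16, chunk_mb=1):
    """Write random data to a scratch file in `directory`, read it back,
    and return True if the SHA-256 of what was read matches what was written."""
    path = os.path.join(directory, "burnin_scratch.bin")  # hypothetical file name
    chunk_size = chunk_mb * 1024 * 1024
    written = hashlib.sha256()
    try:
        with open(path, "wb") as f:
            for _ in range(size_mb // chunk_mb):
                chunk = os.urandom(chunk_size)
                written.update(chunk)
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the device
        read_back = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                read_back.update(chunk)
        return written.digest() == read_back.digest()
    finally:
        if os.path.exists(path):
            os.remove(path)  # always clean up the scratch file


if __name__ == "__main__":
    # Demo against a temporary directory; point it at a real share to test hardware.
    with tempfile.TemporaryDirectory() as scratch:
        print(write_and_verify(scratch, size_mb=4))
```

One caveat: a read-back that immediately follows the write may be served from the operating system's cache rather than the disk, so for a meaningful test the data set should be much larger than RAM, or the verification should run after a reboot.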
November 4, 2010 (Author)
GK20, I am going to migrate my unRaid components later tonight or tomorrow and will report back with an update. I am still troubled, though, by the fact that the drive in the system now is not a WD EARS and it is still not being rebuilt properly. On the main page it does not even show a valid temperature; it shows up as 0°C... As for the EARS drive, that one is now hooked up to my PC, and the information that was on drive 5 has been copied onto it. I am also copying information from the other EARS drive in my unRaid over to it. It's taking some time, but I want to make sure the data from both EARS drives is backed up before the move... If, once I have all of my components in the new system (Supermicro motherboard), drive 5 still fails to be recognized/rebuilt, what then? I have a second unRaid license, and I guess I can build a second independent server and slowly move all of my data from one unRaid box to the other... That would really be unpleasant... :'(
November 4, 2010
"it does not even have a valid temperature displayed for it - 0°C is what it shows up as..."
If the system cannot get status back from the disk, this temperature display is expected.
"As far as the EARS drive, that one is now hooked up to my PC and the information that was on drive 5 has been copied onto it."
It looks like this disk is still OK. While you have it connected to a different machine, maybe you can also check the SMART readout on this disk to see if there is anything abnormal. So far it looks to me like the issue is with the MB or the 2nd SASLP-MV8. End to end, from CPU to disk, there are many components involved.
"If when I have all of my components in the new system (Supermicro motherboard), and the drive 5 still fails to be recognized/rebuild, what then?"
Maybe you should consider buying a lottery ticket. Joking aside, there is no need to rush; just take your time and hope for the best. I think you should have a better chance with this MB. Although unRAID claims it can support any x86 system, I myself would not choose anything other than Intel/AMD.
November 4, 2010 (Author)
With my luck, buying a lottery ticket is definitely not the smartest choice! I just picked up my RAM and will start the build later tonight. Thanks for your help. Lev
November 7, 2010 (Author)
OK. I am officially out of ideas... I have replaced my motherboard, RAM and PSU. I have replaced controllers and cables and tried several different HDs. I triple-checked each connection, and yet I am still getting the red ball next to the same drive, which is refusing to rebuild. What am I doing wrong? I have all the information from that drive elsewhere and can take it out of the array without any loss of data. Can I do that? If yes, how? I can then re-add the drive and copy the information back.
Mobo: Supermicro MBD-C2SEE-O
syslog-2010-11-07_1.txt.zip
November 7, 2010
"I have all the information from that drive elsewhere and can take it out of the array without any loss of data to me. Can I do that? If yes, how?"
You have a single disk failure now, but you have already successfully copied the data off this disk on a different machine (correct me if I am wrong). So if you are willing to let go of the parity data (not the disk), you can un-assign the failed disk (the one you are replacing right now) and let unRAID rebuild parity using the rest of the disks still in the array. Once the parity recalculation has finished, you can decide whether or not to add this disk and its data back.
"Mobo: Supermicro MBD-C2SEE-O"
Essentially, you still have the same error.
Nov 6 20:27:19 Tower kernel: sas: command 0xc3e039c0, task 0xc3d18000, timed out: BLK_EH_NOT_HANDLED
Nov 6 20:27:19 Tower kernel: sas: Enter sas_scsi_recover_host
Nov 6 20:27:19 Tower kernel: sas: trying to find task 0xc3d18000
Nov 6 20:27:19 Tower kernel: sas: sas_scsi_find_task: aborting task 0xc3d18000
Nov 6 20:27:19 Tower kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 1701:mvs_abort_task:rc= 5
Nov 6 20:27:19 Tower kernel: sas: sas_scsi_find_task: querying task 0xc3d18000
Nov 6 20:27:19 Tower kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 1645:mvs_query_task:rc= 5
Nov 6 20:27:19 Tower kernel: sas: sas_scsi_find_task: task 0xc3d18000 failed to abort
Nov 6 20:27:19 Tower kernel: sas: task 0xc3d18000 is not at LU: I_T recover
Nov 6 20:27:19 Tower kernel: sas: I_T nexus reset for dev 0600000000000000
Nov 6 20:27:19 Tower kernel: sas: I_T 0600000000000000 recovered
Nov 6 20:27:19 Tower kernel: sas: --- Exit sas_scsi_recover_host
Nov 6 20:27:19 Tower kernel: ata5: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
Nov 6 20:27:19 Tower kernel: ata5: status=0x41 { DriveReady Error }
Nov 6 20:27:19 Tower kernel: ata5: error=0x04 { DriveStatusError }
Since you have a different MB now, and the disk is good because you could copy data off it on a different machine, the only thing left is the controller card. After some searching online, it looks like this is a known issue in the mvsas driver. Refer to the following two links for information; it looks like it only happens when using TWO AOC-SASLP-MV8 cards.
http://hardforum.com/showpost.php?s=f62b3ad7dee7d877ff7ad31c83e7578e&p=1035429090&postcount=432
http://hardforum.com/showthread.php?t=1397855&page=22
At this moment, I have no idea how you can work around this problem.
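
Since the syslogs in this thread ran past 10 MB, one way to see whether the same error signature keeps recurring is to tally it rather than read it by eye. The following is a minimal sketch; the function name `summarize` and the counter key names are made up for illustration, and the regular expressions are written only against the line formats quoted in this thread, so they may need adjusting for other kernels.

```python
import re
from collections import Counter

# Patterns modeled on the log lines quoted in this thread.
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")
SAS_TIMEOUT = re.compile(r"sas: command 0x[0-9a-f]+, task 0x[0-9a-f]+, timed out")
ATA_ERROR = re.compile(r"(ata\d+): error=0x([0-9a-f]+)")


def summarize(lines):
    """Count SAS command timeouts, block-layer I/O errors per device,
    and ATA error reports per port in an iterable of syslog lines."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            counts["io_error:" + match.group(1)] += 1
        if SAS_TIMEOUT.search(line):
            counts["sas_timeout"] += 1
        match = ATA_ERROR.search(line)
        if match:
            counts["ata_error:" + match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Nov 6 20:27:19 Tower kernel: sas: command 0xc3e039c0, task 0xc3d18000, timed out: BLK_EH_NOT_HANDLED",
        "Nov 6 20:27:19 Tower kernel: ata5: error=0x04 { DriveStatusError }",
        "Nov 3 20:29:30 Tower kernel: end_request: I/O error, dev sde, sector 63",
    ]
    for key, count in sorted(summarize(sample).items()):
        print(key, count)
```

For a real log, the same function can be fed an open file, for example `summarize(open(path, errors="replace"))`, where `path` is wherever the syslog was extracted. If the counts all point at one device or port across hardware swaps, that is a hint the cause sits somewhere the swaps did not touch.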
November 8, 2010 (Author)
I've done everything possible at this point: replaced the motherboard and CPU, replaced the power supply, replaced controller cards and cables, rearranged the HDs, and used 3 different 2 TB hard drives! Nothing worked; the previous drive still showed up as missing and could not be rebuilt. In the middle of things I upgraded from 4.5.4 to 4.5.6. Nothing helped. I felt pretty lost, having lost many nights of sleep over 8 data drives... Just now, after backing up the old setup files, I completely reformatted the flash drive, and... unRAID booted up nearly instantly, and I did not see any errors while booting. Because I loaded a fresh system, I guess the information about the array is missing, and all drives show up as blue. What is my next step? The parity should still be there, correct? And drive #5 is new. I can copy the old setup files back to the flash drive, reboot, and see what happens... Any thoughts? Thanks, Lev
syslog-2010-11-07_2.txt.zip
November 8, 2010 (Author)
Update... I copied the contents of the config folder onto the flash drive, reassigned all drives, and the data rebuild started!!! :D I'll drop a note tomorrow once I know more. 575 minutes to go... ;D
November 9, 2010 (Author)
I was able to reconstruct the contents of drive 5 and decided to run a parity check one more time. Now I am getting a red ball on disk 1. I seriously doubt that another drive could have failed so soon; it is too much of a coincidence. Considering that the array was able to rebuild itself after I reinstalled the flash drive software, could this point to a failing flash drive? Any thoughts?
syslog-2010-11-09_1.zip
November 10, 2010 (Author)
I'd like to update: problem solved. I think... As I mentioned in one of my posts, I have a Norco 4220 case. When I received a new fan board I decided to do a little rearrangement of the drives in the case, and I guess that is when the troubles began. To make a long story short, when the drives were out of their cages and connected to the SATA/SAS controllers outside the case, and hence had separate power connectors, everything worked; but as soon as I placed the drives into the rack, the errors began. I could not understand why, as the same SATA/SAS connectors were being used. Finally, I decided to add additional power cables (Molex) to the backplanes of the Norco, and the server booted up perfectly.
When I bought the case I specifically called Norco and asked about the dual power connectors and whether I needed to use both. I was told that the second connector was for a redundant power supply and that I did not have to use it. I guess the backplanes do in fact need extra power, more than one connector can provide. Not all rows needed the extra "boost", but some do. By the way, I have an OCZ 700W PSU and I also tried a Corsair 950W PSU. Both had the same issues.
So, hopefully my experience with this case will help others avoid this potential pitfall. If you are having problems booting or are getting errors and are using a 4220, 4020 or 4224 case, add an additional power connector from a separate cable.
November 22, 2010
This helps! I had some strange problems with my 4224. Tom's support response asked if my drives were getting enough power, and I said of course: I have a Seasonic 750W unit and the case is only pulling a total of 250W. I guess the drives may be underpowered if not all connectors are connected. Will try this ASAP and see if things improve. Thanks!
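
One possible reading of the two posts above is that total PSU headroom says little about what a single backplane feed has to carry. The arithmetic can be sketched as below. Every number in the demo is a placeholder assumption, not a measurement from this thread or a Norco specification: 4 drives per backplane, 2.0 A per drive on the 12 V rail during spin-up, and a 6.0 A comfortable limit per feed. Check the drive datasheets and connector ratings for real values.

```python
def amps_per_feed(drives, amps_per_drive, feeds):
    """Current each power feed carries if the load splits evenly across feeds."""
    if feeds < 1:
        raise ValueError("need at least one feed")
    return drives * amps_per_drive / feeds


if __name__ == "__main__":
    DRIVES_PER_BACKPLANE = 4   # assumption, for illustration only
    SPINUP_AMPS_12V = 2.0      # assumption: per-drive 12 V current at spin-up
    FEED_LIMIT_AMPS = 6.0      # assumption: what one cable/connector handles comfortably

    for feeds in (1, 2):
        load = amps_per_feed(DRIVES_PER_BACKPLANE, SPINUP_AMPS_12V, feeds)
        verdict = "within" if load <= FEED_LIMIT_AMPS else "over"
        print(f"{feeds} feed(s): {load:.1f} A per feed, {verdict} the assumed limit")
```

With these placeholder figures a single feed carries twice the current of a dual-feed setup, which is consistent with the observation that a 700 W or 950 W PSU did not help until the second connector was attached.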
November 22, 2010 (Author)
Since connecting the additional Molex connectors I have been running a total of 10 HDs without a single problem. Good luck. Lev
This topic is now archived and is closed to further replies.