JimPhreak Posted February 28, 2016
Having 3 disks fail in the past 3 weeks has me convinced that something else in my server must be going bad and causing these errors. It has been a different disk each time, and the failed disks are on different SAS-to-4-SATA cables. What should I be doing to isolate my issue?
ashman70 Posted February 28, 2016
What kind of failures? Bad sectors, SMART errors? What model drives? Is your server on a UPS?
JimPhreak (Author) Posted February 28, 2016
> What kind of failures? Bad sectors, SMART errors? What model drives? Is your server on a UPS?
Two of the disks are 3TB WD Reds; the other is an 8TB Seagate SMR. No SMART errors. I've attached the SMART report for the latest failed disk, and here are the latest errors from the syslog:
Feb 28 14:01:29 SPE-UNRAID kernel: sd 2:0:5:0: [sdj] Synchronizing SCSI cache
Feb 28 14:01:29 SPE-UNRAID kernel: sd 2:0:5:0: [sdj] UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Feb 28 14:01:29 SPE-UNRAID kernel: sd 2:0:5:0: [sdj] CDB: opcode=0x88 88 00 00 00 00 00 5f 02 71 b0 00 00 00 08 00 00
Feb 28 14:01:29 SPE-UNRAID kernel: blk_update_request: I/O error, dev sdj, sector 1593995696
Feb 28 14:01:29 SPE-UNRAID kernel: md: disk5 read error, sector=1593995632
Feb 28 14:01:29 SPE-UNRAID kernel: sd 2:0:5:0: [sdj] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Feb 28 14:01:29 SPE-UNRAID kernel: mpt2sas0: removing handle(0x000f), sas_addr(0x5001e677b7db5fed)
Feb 28 14:01:40 SPE-UNRAID kernel: md: disk5 write error, sector=1593995632
Feb 28 14:01:40 SPE-UNRAID kernel: md: recovery thread woken up ...
Feb 28 14:01:40 SPE-UNRAID kernel: write_file: write error 4
Feb 28 14:01:40 SPE-UNRAID kernel: md: could not write superblock from /boot/config/super.dat
Feb 28 14:01:40 SPE-UNRAID kernel: md: recovery thread has nothing to resync
Feb 28 14:01:40 SPE-UNRAID kernel: scsi 2:0:14:0: Direct-Access ATA WDC WD30EFRX-68E 0A82 PQ: 0 ANSI: 6
The server is not currently on a UPS. On a related note: how do I re-enable the disabled disk, if I'm convinced it is not bad, without having to do a full rebuild?
WDC_WD30EFRX-68EUZN0_WD-WCC4NNARETAN-diagnostics-20160228.txt
ashman70 Posted February 28, 2016
Were all these disks on the same controller card? Or are they connected to the motherboard?
JimPhreak (Author) Posted February 28, 2016
> Were all these disks on the same controller card? Or are they connected to the motherboard?
They are all connected to this SAS expander, which is connected to an M1015 controller.
ashman70 Posted February 28, 2016
So you've had multiple drives fail on both of your servers? Or just one server?
JimPhreak (Author) Posted February 28, 2016
> So you've had multiple drives fail on both of your servers? Or just one server?
Multiple drives on the same server. I'm just not sure how to go about isolating the issue. Plus, the disk failures have been weeks apart too, so it's hard to pinpoint.
ashman70 Posted February 28, 2016
So multiple drives, same controller? Is it possible to remove the controller from the picture and use onboard ports?
JimPhreak (Author) Posted February 28, 2016
> So multiple drives, same controller? Is it possible to remove the controller from the picture and use onboard ports?
I have 12 disks attached to the controller (8 in the array, 4 in the cache pool). I only have 6 onboard SATA ports.
JorgeB Posted February 28, 2016
It can just happen. A little over a month ago I had three 3TB WD Greens fail on the same server within two weeks.
JimPhreak (Author) Posted February 28, 2016
> It can just happen. A little over a month ago I had three 3TB WD Greens fail on the same server within two weeks.
I don't believe any of the disks are actually bad, though. None have any SMART errors that I can see, and they passed both short and long SMART tests. I also tested them with WD Data Lifeguard and all passed. After I replaced the first 2 failed disks, I ran 3 preclears on them as well, and they completed those runs fine.
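For anyone wanting to reproduce the checks described above, the short and long SMART self-tests can be run from the unRAID console with smartctl. The device name below is an example only; substitute the disk you are actually testing:

```shell
# /dev/sdj is an example device name; replace it with the disk under test.
smartctl -a /dev/sdj           # dump SMART attributes and the self-test log
smartctl -t short /dev/sdj     # queue a short self-test (a few minutes)
smartctl -t long /dev/sdj      # queue an extended self-test (several hours)
smartctl -l selftest /dev/sdj  # review the results once a test completes
```

Note that a drive can pass these self-tests and still drop off the bus the way disk5 did, since the failure here looks like a link-level reset rather than a media error.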
JorgeB Posted February 28, 2016
Sometimes disks fail and SMART still looks healthy, although I agree that having 3 fail like that is improbable (less so if they were bought together). Do you have another server? When I have doubts in a case like this, I use the disk in another server; if it fails again, I know it's really a bad disk.
JimPhreak (Author) Posted February 28, 2016
> Sometimes disks fail and SMART still looks healthy, although I agree that having 3 fail like that is improbable (less so if they were bought together). Do you have another server? When I have doubts in a case like this, I use the disk in another server; if it fails again, I know it's really a bad disk.
I have a backup server, but it is fully populated with drives, so there is no room to add an additional drive at this time. I'm confused about what triggered these errors, though, as I don't believe any data has been written to my array since late last night. You can even see that the last mover run moved nothing:
Feb 28 12:00:01 SPE-UNRAID logger: mover started
Feb 28 12:00:01 SPE-UNRAID logger: skipping "Docker"
Feb 28 12:00:01 SPE-UNRAID logger: skipping "Downloads"
Feb 28 12:00:01 SPE-UNRAID logger: skipping "vdisks"
Feb 28 12:00:01 SPE-UNRAID logger: mover finished
Feb 28 12:45:52 SPE-UNRAID kernel: mdcmd (260): spindown 3
Feb 28 13:16:08 SPE-UNRAID kernel: mdcmd (261): spindown 1
Feb 28 13:59:00 SPE-UNRAID php: /usr/local/emhttp/plugins/dynamix.docker.manager/scripts/docker 'restart' 'unifi'
Feb 28 13:59:01 SPE-UNRAID kernel: vethb1e47cc: renamed from eth0
Feb 28 13:59:01 SPE-UNRAID kernel: docker0: port 8(veth7819134) entered disabled state
Feb 28 13:59:01 SPE-UNRAID kernel: docker0: port 8(veth7819134) entered disabled state
Feb 28 13:59:01 SPE-UNRAID avahi-daemon[5487]: Withdrawing workstation service for vethb1e47cc.
Feb 28 13:59:01 SPE-UNRAID avahi-daemon[5487]: Withdrawing workstation service for veth7819134.
Feb 28 13:59:01 SPE-UNRAID kernel: device veth7819134 left promiscuous mode
Feb 28 13:59:01 SPE-UNRAID kernel: docker0: port 8(veth7819134) entered disabled state
Feb 28 13:59:01 SPE-UNRAID kernel: device veth87a37ae entered promiscuous mode
Feb 28 13:59:01 SPE-UNRAID kernel: docker0: port 8(veth87a37ae) entered forwarding state
Feb 28 13:59:01 SPE-UNRAID kernel: docker0: port 8(veth87a37ae) entered forwarding state
Feb 28 13:59:01 SPE-UNRAID avahi-daemon[5487]: Withdrawing workstation service for vethc67a6bd.
Feb 28 13:59:01 SPE-UNRAID kernel: docker0: port 8(veth87a37ae) entered disabled state
Feb 28 13:59:01 SPE-UNRAID kernel: eth0: renamed from vethc67a6bd
Feb 28 13:59:01 SPE-UNRAID kernel: docker0: port 8(veth87a37ae) entered forwarding state
Feb 28 13:59:01 SPE-UNRAID kernel: docker0: port 8(veth87a37ae) entered forwarding state
Feb 28 13:59:05 SPE-UNRAID emhttp: cmd: /usr/local/emhttp/plugins/dynamix.docker.manager/scripts/docker logs --tail=350 -f unifi
Feb 28 13:59:16 SPE-UNRAID kernel: docker0: port 8(veth87a37ae) entered forwarding state
Feb 28 14:01:29 SPE-UNRAID kernel: sd 2:0:5:0: [sdj] Synchronizing SCSI cache
Feb 28 14:01:29 SPE-UNRAID kernel: sd 2:0:5:0: [sdj] UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Feb 28 14:01:29 SPE-UNRAID kernel: sd 2:0:5:0: [sdj] CDB: opcode=0x88 88 00 00 00 00 00 5f 02 71 b0 00 00 00 08 00 00
Feb 28 14:01:29 SPE-UNRAID kernel: blk_update_request: I/O error, dev sdj, sector 1593995696
Feb 28 14:01:29 SPE-UNRAID kernel: md: disk5 read error, sector=1593995632
Feb 28 14:01:29 SPE-UNRAID kernel: sd 2:0:5:0: [sdj] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Feb 28 14:01:29 SPE-UNRAID kernel: mpt2sas0: removing handle(0x000f), sas_addr(0x5001e677b7db5fed)
Feb 28 14:01:40 SPE-UNRAID kernel: md: disk5 write error, sector=1593995632
Feb 28 14:01:40 SPE-UNRAID kernel: md: recovery thread woken up ...
Feb 28 14:01:40 SPE-UNRAID kernel: write_file: write error 4
Feb 28 14:01:40 SPE-UNRAID kernel: md: could not write superblock from /boot/config/super.dat
Feb 28 14:01:40 SPE-UNRAID kernel: md: recovery thread has nothing to resync
Feb 28 14:01:40 SPE-UNRAID kernel: scsi 2:0:14:0: Direct-Access ATA WDC WD30EFRX-68E 0A82 PQ: 0 ANSI: 6
Feb 28 14:01:40 SPE-UNRAID kernel: scsi 2:0:14:0: SATA: handle(0x000f), sas_addr(0x5001e677b7db5fed), phy(13), device_name(0x0000000000000000)
Feb 28 14:01:40 SPE-UNRAID kernel: scsi 2:0:14:0: SATA: enclosure_logical_id(0x5001e677b7db5fff), slot(13)
Feb 28 14:01:40 SPE-UNRAID kernel: scsi 2:0:14:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
Feb 28 14:01:40 SPE-UNRAID kernel: sd 2:0:14:0: Attached scsi generic sg9 type 0
Feb 28 14:01:40 SPE-UNRAID kernel: sd 2:0:14:0: [sdl] 5860533168 512-byte logical blocks: (3.00 TB/2.72 TiB)
Feb 28 14:01:40 SPE-UNRAID kernel: sd 2:0:14:0: [sdl] 4096-byte physical blocks
Feb 28 14:01:40 SPE-UNRAID kernel: sd 2:0:14:0: [sdl] Write Protect is off
Feb 28 14:01:40 SPE-UNRAID kernel: sd 2:0:14:0: [sdl] Mode Sense: 7f 00 10 08
Feb 28 14:01:40 SPE-UNRAID kernel: sd 2:0:14:0: [sdl] Write cache: enabled, read cache: enabled, supports DPO and FUA
Feb 28 14:01:40 SPE-UNRAID kernel: sdl: sdl1
Feb 28 14:01:40 SPE-UNRAID kernel: sd 2:0:14:0: [sdl] Attached SCSI disk
Feb 28 14:02:02 SPE-UNRAID sSMTP[9523]: Creating SSL connection to host
Feb 28 14:02:02 SPE-UNRAID sSMTP[9523]: SSL connection using ECDHE-RSA-AES128-GCM-SHA256
JorgeB Posted February 28, 2016
If you think the disks are ok, the next things I would try are a different controller and power supply.
JimPhreak (Author) Posted February 28, 2016
> If you think the disks are ok, the next things I would try are a different controller and power supply.
The power supply is less than 6 months old and is pretty solid. http://www.newegg.com/Product/Product.aspx?Item=N82E16817151124
I can maybe swap the M1015s between my main and backup server and see if that makes any difference. In the meantime, how can I re-enable the failed disk without having to do a rebuild?
JorgeB Posted February 28, 2016
If you're sure nothing was written to it after it red-balled, you can do a New Config, re-assign all disks, check "Parity is already valid", and start the array. Double-check that the parity disk is in the parity slot before starting the array.
BobPhoenix Posted February 28, 2016
Have you updated the firmware on the M1015 and RES2SV240 expander? Firmware for the RES2SV240 is available here: https://downloadcenter.intel.com/download/21686/RAID-Expander-RES2SV240-RES2CV240-RES2CV360-Firmware It can be updated from the unRAID command line; that is how I updated mine a year or so ago.
JimPhreak (Author) Posted February 28, 2016
> Have you updated the firmware on the M1015 and RES2SV240 expander? Firmware for the RES2SV240 is available here: https://downloadcenter.intel.com/download/21686/RAID-Expander-RES2SV240-RES2CV240-RES2CV360-Firmware It can be updated from the unRAID command line; that is how I updated mine a year or so ago.
I haven't updated the M1015 firmware in some time, and I have never updated the RES2SV240 since I bought it. Can I update both from the unRAID command line?
Frank1940 Posted February 28, 2016
> If you think the disks are ok, the next things I would try are a different controller and power supply.
> The power supply is less than 6 months old and is pretty solid. http://www.newegg.com/Product/Product.aspx?Item=N82E16817151124
OK, I had a look at this PS. I think it is a bit quick to say that it is adequate for the job. It is only a 450W unit. While the current rating on the 12V bus is 37A (444W), I doubt it can ever supply that maximum current on that bus, as the MB, CPU, fans, etc. will also be drawing power, which must be accounted for within that 450W maximum. You did not provide any other information on those components or the other HDs. Remember that when hard drives spin up they require a high initial starting current, and there are times when all of the drives are required to spin up at the same time. If that PS goes into power-limit mode, it will most likely affect every output voltage coming out. You can find more on PSs in this thread: http://lime-technology.com/forum/index.php?topic=12219.0
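To put rough numbers on the spin-up concern above, here is a back-of-the-envelope 12V budget. The per-drive spin-up current and the board allowance are assumed typical figures, not measurements from this server:

```shell
#!/bin/bash
# Back-of-the-envelope 12V rail budget at simultaneous spin-up.
# All figures below are assumptions, not measured values.
drives=12      # disks in this server
spinup_a=2     # ~2A at 12V per 3.5" drive during spin-up (typical estimate)
board_a=5      # rough allowance for MB/CPU/fans on the 12V rail (guess)
rail_a=37      # the PSU's rated 12V output

peak=$(( drives * spinup_a + board_a ))
echo "Estimated peak 12V draw: ${peak}A of a ${rail_a}A-rated rail"
```

Under these assumptions the peak lands under the rail's rating but without much headroom, which is why staggered spin-up (or measuring the real draw) matters before blaming or clearing the PSU.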
BobPhoenix Posted February 28, 2016
> Have you updated the firmware on the M1015 and RES2SV240 expander? Firmware for the RES2SV240 is available here: https://downloadcenter.intel.com/download/21686/RAID-Expander-RES2SV240-RES2CV240-RES2CV360-Firmware It can be updated from the unRAID command line; that is how I updated mine a year or so ago.
> I haven't updated the M1015 firmware in some time, and I have never updated the RES2SV240 since I bought it. Can I update both from the unRAID command line?
You can upgrade the RES2SV240 from the unRAID command line; I've done it that way. I've always done the M1015 from a DOS boot flash drive, but there are other methods, including UEFI. If you have crossflashed it to be an LSI 9211-8i, then it looks like it would work based on the firmware downloads here: http://www.avagotech.com/products/server-storage/host-bus-adapters/sas-9211-8i#downloads But I would see if someone else can confirm it, since I've only done it from a DOS boot flash drive.
JimPhreak (Author) Posted February 28, 2016
> OK, I had a look at this PS. I think it is a bit quick to say that it is adequate for the job. It is only a 450W unit. While the current rating on the 12V bus is 37A (444W), I doubt it can ever supply that maximum current on that bus, as the MB, CPU, fans, etc. will also be drawing power, which must be accounted for within that 450W maximum.
This is the board I'm using, in addition to the 12 disks: http://www.supermicro.com/products/motherboard/Xeon/D/X10SDV-TLN4F.cfm
> You can upgrade the RES2SV240 from the unRAID command line; I've done it that way. I've always done the M1015 from a DOS boot flash drive, but there are other methods, including UEFI. If you have crossflashed it to be an LSI 9211-8i, then it looks like it would work based on the firmware downloads here: http://www.avagotech.com/products/server-storage/host-bus-adapters/sas-9211-8i#downloads But I would see if someone else can confirm it, since I've only done it from a DOS boot flash drive.
Thanks Bob, I will try and update both when I can and see if that helps.
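For anyone following along, a flashing session for a crossflashed 9211-8i might look something like the sketch below, using LSI's sas2flash utility. The firmware file name is an example, not the actual download, and a failed flash can leave the card unusable, so verify the file against the download page first:

```shell
# Example session; 2118it.bin is a placeholder for the IT-mode
# firmware image extracted from the 9211-8i download package.
sas2flash -listall            # confirm the controller and its current firmware version
sas2flash -o -f 2118it.bin    # flash the new IT-mode firmware (advanced mode)
sas2flash -listall            # verify the new version took
```

Flashing the RES2SV240 expander uses Intel's separate updater from the download linked earlier, not sas2flash, so follow the instructions bundled with that package.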
steve1977 Posted February 29, 2016
I believe this is exactly the same issue I face, and I am using exactly the same controller card. I have been struggling with this for a year already, with probably 10+ disks disabled on this controller; none of them were actually faulty. A few things you can save yourself the time of trying:
1) A firmware update of the M1015 was done and did not help.
2) After moving a failing disk to an on-board controller, that disk never failed again.
3) Replacing the M1015 with a new M1015 did not improve things.
4) I have an 850W PSU, so it is also not a PSU issue.
5) Replacing all cabling also did not help.
Yesterday someone raised the suggestion not to spin down disks. That's what I will try next. Let's keep each other updated if/once we find a solution.
JimPhreak (Author) Posted February 29, 2016
> I believe this is exactly the same issue I face, and I am using exactly the same controller card. I have been struggling with this for a year already, with probably 10+ disks disabled on this controller; none of them were actually faulty. A few things you can save yourself the time of trying:
> 1) A firmware update of the M1015 was done and did not help.
> 2) After moving a failing disk to an on-board controller, that disk never failed again.
> 3) Replacing the M1015 with a new M1015 did not improve things.
> 4) I have an 850W PSU, so it is also not a PSU issue.
> 5) Replacing all cabling also did not help.
> Yesterday someone raised the suggestion not to spin down disks. That's what I will try next. Let's keep each other updated if/once we find a solution.
Thanks for the info, steve. The fact that you have tried all of that to no avail doesn't leave me optimistic. Sure, I can try not spinning my drives down, but that's not really a solution to me, considering that the ability to spin drives down is one of the big reasons I use unRAID over other solutions.
steve1977 Posted February 29, 2016
You are right, and this was indeed one of the reasons for me as well. By now I mind it less, and overall I see the benefit of unRAID versus the others. I am currently rebuilding and will try stopping the spin-down over the next couple of weeks. This thread has more context: http://lime-technology.com/forum/index.php?topic=46985.0 Please also keep me posted if you find some other solution. I also wouldn't mind switching to another controller, but the M1015 appears to be the most commonly used and best supported.
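As a stopgap while testing the no-spin-down theory, spin-down can also be suppressed per disk from the console with hdparm. The device name is an example, and note that unRAID issues its own spindown commands (the mdcmd spindown entries in the logs above), so the permanent setting belongs in Settings, Disk Settings; hdparm only affects the drive's internal standby timer:

```shell
# /dev/sdj is an example device name; repeat for each disk under test.
hdparm -S 0 /dev/sdj   # disable the drive's internal standby (spin-down) timer
hdparm -C /dev/sdj     # report the drive's current power state (active/idle/standby)
```

If the drop-offs stop once the drives never spin down, that points at spin-up (power delivery or the expander re-handshaking links) rather than the disks themselves.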
JimPhreak (Author) Posted February 29, 2016
In the next month I'll be upgrading and converging my storage server into a 2-in-1 server running unRAID in one VM for my bulk media, and Napp-it with ZFS in another for my VMs and dockers. I'm just waiting for a motherboard to be released for sale which has an onboard LSI controller that can connect 16 drives, as well as dual 10-gigabit SFP+ ports. So I won't be using the M1015 in my main server anymore going forward (I still will in my backup server, but I haven't had any drive failure issues in that one yet).