
3 failed disks in the past 3 weeks


JimPhreak


Having 3 disks fail in the past 3 weeks has me convinced that something else in my server must be going bad and causing these errors. It has been a different disk each time, and the failed disks are on different SAS-to-4×SATA breakout cables. What should I be doing to isolate the issue?


> What kind of failures? Bad sectors, SMART errors? What model drives? Is your server on a UPS?

 

Two of the disks are 3TB WD Reds; the other is an 8TB Seagate SMR drive. No SMART errors. I've attached the SMART report for the latest failed disk, and here are the latest errors from the syslog:

 

Feb 28 14:01:29 SPE-UNRAID kernel: sd 2:0:5:0: [sdj] Synchronizing SCSI cache
Feb 28 14:01:29 SPE-UNRAID kernel: sd 2:0:5:0: [sdj] UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Feb 28 14:01:29 SPE-UNRAID kernel: sd 2:0:5:0: [sdj] CDB: opcode=0x88 88 00 00 00 00 00 5f 02 71 b0 00 00 00 08 00 00
Feb 28 14:01:29 SPE-UNRAID kernel: blk_update_request: I/O error, dev sdj, sector 1593995696
Feb 28 14:01:29 SPE-UNRAID kernel: md: disk5 read error, sector=1593995632
Feb 28 14:01:29 SPE-UNRAID kernel: sd 2:0:5:0: [sdj] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Feb 28 14:01:29 SPE-UNRAID kernel: mpt2sas0: removing handle(0x000f), sas_addr(0x5001e677b7db5fed)
Feb 28 14:01:40 SPE-UNRAID kernel: md: disk5 write error, sector=1593995632
Feb 28 14:01:40 SPE-UNRAID kernel: md: recovery thread woken up ...
Feb 28 14:01:40 SPE-UNRAID kernel: write_file: write error 4
Feb 28 14:01:40 SPE-UNRAID kernel: md: could not write superblock from /boot/config/super.dat
Feb 28 14:01:40 SPE-UNRAID kernel: md: recovery thread has nothing to resync
Feb 28 14:01:40 SPE-UNRAID kernel: scsi 2:0:14:0: Direct-Access    ATA      WDC WD30EFRX-68E 0A82 PQ: 0 ANSI: 6
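For anyone hitting similar drops, the signature above (I/O error, the mpt2sas handle being removed, then the drive re-appearing under a new device name) can be pulled out of a full syslog with a small filter. This is just a sketch; the `scan_syslog` helper name and the temp-file path are made up for illustration, not unRAID tools:

```shell
#!/bin/sh
# scan_syslog is a hypothetical helper: it filters a saved syslog for the
# lines that mark a disk being dropped by the controller.
scan_syslog() {
    grep -E 'blk_update_request: I/O error|md: disk[0-9]+ (read|write) error|mpt2sas[0-9]*: removing handle' "$1"
}

# Demo against a few of the lines quoted above:
cat > /tmp/sample_syslog.txt <<'EOF'
Feb 28 14:01:29 SPE-UNRAID kernel: blk_update_request: I/O error, dev sdj, sector 1593995696
Feb 28 14:01:29 SPE-UNRAID kernel: md: disk5 read error, sector=1593995632
Feb 28 14:01:29 SPE-UNRAID kernel: mpt2sas0: removing handle(0x000f), sas_addr(0x5001e677b7db5fed)
Feb 28 14:01:40 SPE-UNRAID kernel: md: recovery thread woken up ...
EOF
scan_syslog /tmp/sample_syslog.txt   # prints the three error lines, not the recovery line
```

Running it against a full syslog shows at a glance whether every drop happens on the same controller, at the same SAS address, or at the same time of day.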

 

Server is not currently on a UPS.

 

 

On a related note, how do I re-enable the disabled disk if I'm convinced it is not bad without having to do a full rebuild of the disk?

WDC_WD30EFRX-68EUZN0_WD-WCC4NNARETAN-diagnostics-20160228.txt


> It can just happen. A little over a month ago I had three 3TB WD Greens fail on the same server within two weeks.

 

I don't believe any of the disks are actually bad, though. None have any SMART errors that I can see, and they passed both short and long SMART self-tests. I also tested them with WD Data Lifeguard and all passed.

 

After I replaced the first 2 failed disks, I also ran 3 preclear passes on them, which completed fine.
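For drives that "fail" like this, the SMART attributes worth checking are the reallocated/pending sector counts and, since cabling or the controller is suspected here, the CRC error count (attribute 199 climbs with bad SATA links rather than bad platters). A sketch using an embedded sample table; on the real server the report would come from something like `smartctl -a /dev/sdX`, and the values below are illustrative, not taken from the attached diagnostics:

```shell
#!/bin/sh
# Pull the failure-predicting attributes out of a saved SMART report.
# The sample table below is made up for illustration.
cat > /tmp/smart_report.txt <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
EOF

# Print attribute name and raw value for IDs 5, 197, 198, 199.
awk '$1 ~ /^(5|197|198|199)$/ { print $2, $NF }' /tmp/smart_report.txt
```

A nonzero raw value on 199 with zeros everywhere else would point at cables, backplane, or controller rather than the disk itself; all zeros, as reported in this thread, is consistent with the disks being healthy.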


> Sometimes disks fail and SMART looks healthy, although I agree that having 3 fail like that is improbable (less so if they were bought together). Do you have another server? When I have doubts in a case like this I use the disk in another server; if it fails again I know it's really a bad disk.

 

I have a backup server, but it is fully populated with drives, so there is no room to add an additional drive at this time.

 

I'm confused about what triggered these errors, though, as I don't believe any data has been written to my array since late last night. You can even see that the last mover run finished without moving anything:

 

Feb 28 12:00:01 SPE-UNRAID logger: mover started
Feb 28 12:00:01 SPE-UNRAID logger: skipping "Docker"
Feb 28 12:00:01 SPE-UNRAID logger: skipping "Downloads"
Feb 28 12:00:01 SPE-UNRAID logger: skipping "vdisks"
Feb 28 12:00:01 SPE-UNRAID logger: mover finished
Feb 28 12:45:52 SPE-UNRAID kernel: mdcmd (260): spindown 3
Feb 28 13:16:08 SPE-UNRAID kernel: mdcmd (261): spindown 1
Feb 28 13:59:00 SPE-UNRAID php: /usr/local/emhttp/plugins/dynamix.docker.manager/scripts/docker 'restart' 'unifi'
Feb 28 13:59:01 SPE-UNRAID kernel: vethb1e47cc: renamed from eth0
Feb 28 13:59:01 SPE-UNRAID kernel: docker0: port 8(veth7819134) entered disabled state
Feb 28 13:59:01 SPE-UNRAID kernel: docker0: port 8(veth7819134) entered disabled state
Feb 28 13:59:01 SPE-UNRAID avahi-daemon[5487]: Withdrawing workstation service for vethb1e47cc.
Feb 28 13:59:01 SPE-UNRAID avahi-daemon[5487]: Withdrawing workstation service for veth7819134.
Feb 28 13:59:01 SPE-UNRAID kernel: device veth7819134 left promiscuous mode
Feb 28 13:59:01 SPE-UNRAID kernel: docker0: port 8(veth7819134) entered disabled state
Feb 28 13:59:01 SPE-UNRAID kernel: device veth87a37ae entered promiscuous mode
Feb 28 13:59:01 SPE-UNRAID kernel: docker0: port 8(veth87a37ae) entered forwarding state
Feb 28 13:59:01 SPE-UNRAID kernel: docker0: port 8(veth87a37ae) entered forwarding state
Feb 28 13:59:01 SPE-UNRAID avahi-daemon[5487]: Withdrawing workstation service for vethc67a6bd.
Feb 28 13:59:01 SPE-UNRAID kernel: docker0: port 8(veth87a37ae) entered disabled state
Feb 28 13:59:01 SPE-UNRAID kernel: eth0: renamed from vethc67a6bd
Feb 28 13:59:01 SPE-UNRAID kernel: docker0: port 8(veth87a37ae) entered forwarding state
Feb 28 13:59:01 SPE-UNRAID kernel: docker0: port 8(veth87a37ae) entered forwarding state
Feb 28 13:59:05 SPE-UNRAID emhttp: cmd: /usr/local/emhttp/plugins/dynamix.docker.manager/scripts/docker logs --tail=350 -f unifi
Feb 28 13:59:16 SPE-UNRAID kernel: docker0: port 8(veth87a37ae) entered forwarding state
Feb 28 14:01:29 SPE-UNRAID kernel: sd 2:0:5:0: [sdj] Synchronizing SCSI cache
Feb 28 14:01:29 SPE-UNRAID kernel: sd 2:0:5:0: [sdj] UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Feb 28 14:01:29 SPE-UNRAID kernel: sd 2:0:5:0: [sdj] CDB: opcode=0x88 88 00 00 00 00 00 5f 02 71 b0 00 00 00 08 00 00
Feb 28 14:01:29 SPE-UNRAID kernel: blk_update_request: I/O error, dev sdj, sector 1593995696
Feb 28 14:01:29 SPE-UNRAID kernel: md: disk5 read error, sector=1593995632
Feb 28 14:01:29 SPE-UNRAID kernel: sd 2:0:5:0: [sdj] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Feb 28 14:01:29 SPE-UNRAID kernel: mpt2sas0: removing handle(0x000f), sas_addr(0x5001e677b7db5fed)
Feb 28 14:01:40 SPE-UNRAID kernel: md: disk5 write error, sector=1593995632
Feb 28 14:01:40 SPE-UNRAID kernel: md: recovery thread woken up ...
Feb 28 14:01:40 SPE-UNRAID kernel: write_file: write error 4
Feb 28 14:01:40 SPE-UNRAID kernel: md: could not write superblock from /boot/config/super.dat
Feb 28 14:01:40 SPE-UNRAID kernel: md: recovery thread has nothing to resync
Feb 28 14:01:40 SPE-UNRAID kernel: scsi 2:0:14:0: Direct-Access    ATA      WDC WD30EFRX-68E 0A82 PQ: 0 ANSI: 6
Feb 28 14:01:40 SPE-UNRAID kernel: scsi 2:0:14:0: SATA: handle(0x000f), sas_addr(0x5001e677b7db5fed), phy(13), device_name(0x0000000000000000)
Feb 28 14:01:40 SPE-UNRAID kernel: scsi 2:0:14:0: SATA: enclosure_logical_id(0x5001e677b7db5fff), slot(13)
Feb 28 14:01:40 SPE-UNRAID kernel: scsi 2:0:14:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
Feb 28 14:01:40 SPE-UNRAID kernel: sd 2:0:14:0: Attached scsi generic sg9 type 0
Feb 28 14:01:40 SPE-UNRAID kernel: sd 2:0:14:0: [sdl] 5860533168 512-byte logical blocks: (3.00 TB/2.72 TiB)
Feb 28 14:01:40 SPE-UNRAID kernel: sd 2:0:14:0: [sdl] 4096-byte physical blocks
Feb 28 14:01:40 SPE-UNRAID kernel: sd 2:0:14:0: [sdl] Write Protect is off
Feb 28 14:01:40 SPE-UNRAID kernel: sd 2:0:14:0: [sdl] Mode Sense: 7f 00 10 08
Feb 28 14:01:40 SPE-UNRAID kernel: sd 2:0:14:0: [sdl] Write cache: enabled, read cache: enabled, supports DPO and FUA
Feb 28 14:01:40 SPE-UNRAID kernel: sdl: sdl1
Feb 28 14:01:40 SPE-UNRAID kernel: sd 2:0:14:0: [sdl] Attached SCSI disk
Feb 28 14:02:02 SPE-UNRAID sSMTP[9523]: Creating SSL connection to host
Feb 28 14:02:02 SPE-UNRAID sSMTP[9523]: SSL connection using ECDHE-RSA-AES128-GCM-SHA256


> If you think the disks are OK, the next things I would try are a different controller and power supply.

 

Power supply is less than 6 months old and is pretty solid. http://www.newegg.com/Product/Product.aspx?Item=N82E16817151124

 

I can maybe swap the M1015s between my main and backup servers and see if that makes any difference.

 

 

In the meantime, how can I re-enable the failed disk without having to do a rebuild?


> Have you updated the firmware on the M1015 and RES2SV240 expander?
>
> Firmware for the RES2SV240 is available here: https://downloadcenter.intel.com/download/21686/RAID-Expander-RES2SV240-RES2CV240-RES2CV360-Firmware
>
> It can be updated in unRAID from the command line; that is how I updated mine a year or so ago.

I haven't updated the M1015 firmware in some time, and I have never updated the RES2SV240 since I bought it. Can I update both from the unRAID command line?


> If you think the disks are OK, the next things I would try are a different controller and power supply.
>
> Power supply is less than 6 months old and is pretty solid. http://www.newegg.com/Product/Product.aspx?Item=N82E16817151124

 

 

OK, I had a look at this PS. I think it is a bit quick to say that it is adequate for the job. It is only a 450W unit. While the current rating on the 12V bus is 37A (444W), I doubt it can ever supply that maximum current on that bus, as the MB, CPU, fans, etc. will also be drawing power which must be accounted for within that 450W maximum. You did not provide any other information on those components or the other HDs. Remember that when hard drives spin up they draw a high initial starting current, and there are times when all of the drives are required to spin up at the same time. If that PS goes into power-limit mode, it will most likely affect every output voltage.

 

You can find more on power supplies in this thread:

 

    http://lime-technology.com/forum/index.php?topic=12219.0
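To put the spin-up concern in rough numbers, here is a back-of-envelope check of the 12V budget. The 2A-per-drive spin-up figure is an assumed typical datasheet value, not a measurement from this server, and the drive count of 12 is taken from the posts below:

```shell
#!/bin/sh
# Rough 12V budget for simultaneous spin-up of 12 drives.
# 2 A/drive at spin-up is an assumed typical figure, not measured here.
DRIVES=12
SPINUP_AMPS_PER_DRIVE=2
RAIL_AMPS=37   # 12V rating of the 450W PSU discussed above

SPINUP_AMPS=$((DRIVES * SPINUP_AMPS_PER_DRIVE))
HEADROOM_WATTS=$(( (RAIL_AMPS - SPINUP_AMPS) * 12 ))
echo "spin-up draw: ${SPINUP_AMPS} A; remaining 12V headroom: ${HEADROOM_WATTS} W"
```

With those assumptions, a simultaneous spin-up draws about 24 A, leaving roughly 156 W on the 12V rail for the motherboard, CPU and fans, which is why a 450W unit can be marginal even though it looks adequate at idle.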

 

 


> Have you updated the firmware on the M1015 and RES2SV240 expander? [...] It can be updated in unRAID from the command line; that is how I updated mine a year or so ago.
>
> I haven't updated the M1015 firmware in some time and never have on the RES2SV240 since I bought it. Can I update both from the unRAID command line?

You can upgrade the RES2SV240 from the unRAID command line; I've done it that way.

 

 

I've always done the M1015 from a DOS boot flash, but there are other methods, including UEFI. If you have cross-flashed it to an LSI 9211-8i, then it looks like it would work based on the firmware downloads here: http://www.avagotech.com/products/server-storage/host-bus-adapters/sas-9211-8i#downloads

 

 

But I would see if someone else can confirm it, since I've only done it from a DOS boot flash.


> OK, I had a look at this PS. It is only a 450W unit. [...] You did not provide any other information on those components or the other HDs. [...]

This is the board I'm using, in addition to the 12 disks: http://www.supermicro.com/products/motherboard/Xeon/D/X10SDV-TLN4F.cfm

 

 

> Have you updated the firmware on the M1015 and RES2SV240 expander? [...]
>
> You can upgrade the RES2SV240 from the unRAID command line; I've done it that way. I've always done the M1015 from a DOS boot flash, but there are other methods, including UEFI. [...]

 

Thanks Bob, I will try and update both when I can and see if that helps.


I believe this is exactly the same issue I face, and I am using exactly the same controller card. I have been struggling with this for a year already, with probably 10+ disks disabled on this controller over that time; none of them was actually faulty.

 

A few things I have already tried, to save you time: 1) a firmware update of the M1015 did not help; 2) disks that failed never failed again after being moved to an on-board controller; 3) replacing the M1015 with a new M1015 did not improve things; 4) I have an 850W PSU, so it is not a PSU issue either; 5) replacing all cabling also did not help.

 

Yesterday someone raised the suggestion not to spin down disks. That's what I will try next. Let's keep each other updated if/once we find a solution.


> I believe this is exactly the same issue I face and I am using exactly the same controller card. [...] Yesterday someone raised the suggestion not to spin down disks. That's what I will try next. [...]

 

Thanks for the info, Steve. The fact that you have tried all of that to no avail doesn't leave me optimistic. Sure, I can try not spinning my drives down, but that's not really a solution to me, considering that the ability to spin drives down is one of the big reasons I use unRAID over other solutions.


You are right, and that was indeed also one of the reasons for me. By now I mind it less and see the overall benefit of unRAID versus the others. I am currently rebuilding and will try disabling spin-down over the next couple of weeks. This thread has more context: http://lime-technology.com/forum/index.php?topic=46985.0

 

Please also keep me posted if you find some other solution. I also would not mind switching to another controller, but the M1015 appears to be the most commonly used / best supported.


In the next month I'll be upgrading/converging my storage server into a 2-in-1 box running unRAID in one VM for my bulk media and Napp-it with ZFS in another for my VMs and Dockers. I'm just waiting for this motherboard to be released for sale; it has an onboard LSI controller that can connect 16 drives, as well as dual 10-gig SFP+ ports. So I won't be using the M1015 in my main server going forward (I still will in my backup server, but I haven't had any drive-failure issues in that one yet).


Archived

This topic is now archived and is closed to further replies.
