
Disk failed repeatedly


mic.88


Hi UNRAID Community,

I have an issue with different drives failing but the SMART health reports don't show any errors.

Attached is the diagnostics report, taken just after disk3 failed at around 19:00 and got disabled.

Disk3 and disk4 also failed several days ago, but the SMART values look good in my opinion. I did some testing on my workbench PC, and disk3 did not show any signs of failures or errors.

It might be a controller/cable issue (LSI SAS2308, FWVersion(20.00.07.00), ChipRevision(0x05), BiosVersion(07.39.02.00)), but the system has been running for more than 3 years now without major issues.

I have now replaced disk3 with a new (tested) one, but I don't think the disk is the root cause here.

Please help me analyse the diag-logs :)

disk3.png

alpha-unraid-diagnostics-20231029-1938.zip

Link to comment

I had some disk issues that I believe I narrowed down to spin-down. I had a disk holding only my media library that I would spin down, since it wasn't used all that often. In a fit of frustration I also swapped out the SATA cables, as I was using those thin cables that are bound together. So somewhere between the new cables and turning off spin-down, the errors have gone away.

Link to comment
  • 3 weeks later...

The spin-down seems to be the issue here.

The array has now been running for 2 weeks with spin-down set to never.

Yesterday I started setting individual disks to spin down after 15 minutes.

Almost all of my disks work with this.

Only my Exos X18 disks (Disk11, /dev/sdk) don't.

 

Nov 16 10:19:32 Alpha-Unraid kernel: mdcmd (74): set md_num_stripes 1280
Nov 16 10:19:32 Alpha-Unraid kernel: mdcmd (75): set md_queue_limit 80
Nov 16 10:19:32 Alpha-Unraid kernel: mdcmd (76): set md_sync_limit 5
Nov 16 10:19:32 Alpha-Unraid kernel: mdcmd (77): set md_write_method
Nov 16 10:19:32 Alpha-Unraid emhttpd: spinning down /dev/sdk
Nov 16 10:20:01 Alpha-Unraid flash_backup: adding task: /usr/local/emhttp/plugins/dynamix.my.servers/scripts/UpdateFlashBackup update
Nov 16 10:20:47 Alpha-Unraid kernel: sd 11:0:5:0: attempting task abort!scmd(0x000000000658e7e8), outstanding for 7050 ms & timeout 7000 ms
Nov 16 10:20:47 Alpha-Unraid kernel: sd 11:0:5:0: [sdk] tag#6626 CDB: opcode=0x85 85 06 20 00 d8 00 00 00 00 00 4f 00 c2 00 b0 00
Nov 16 10:20:47 Alpha-Unraid kernel: scsi target11:0:5: handle(0x000e), sas_address(0x5001517e3b0bf0aa), phy(10)
Nov 16 10:20:47 Alpha-Unraid kernel: scsi target11:0:5: enclosure logical id(0x5001e677b6dbbfff), slot(10)
Nov 16 10:20:47 Alpha-Unraid kernel: sd 11:0:5:0: device_block, handle(0x000e)
Nov 16 10:20:49 Alpha-Unraid kernel: sd 11:0:5:0: device_unblock and setting to running, handle(0x000e)
Nov 16 10:20:50 Alpha-Unraid kernel: sd 11:0:5:0: task abort: SUCCESS scmd(0x000000000658e7e8)
Nov 16 10:20:50 Alpha-Unraid kernel: sd 11:0:5:0: [sdk] Synchronizing SCSI cache
Nov 16 10:20:50 Alpha-Unraid kernel: sd 11:0:5:0: [sdk] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=DRIVER_OK
Nov 16 10:20:50 Alpha-Unraid kernel: mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x5001517e3b0bf0aa)
Nov 16 10:20:50 Alpha-Unraid kernel: mpt2sas_cm0: removing handle(0x000e), sas_addr(0x5001517e3b0bf0aa)
Nov 16 10:20:50 Alpha-Unraid kernel: mpt2sas_cm0: enclosure logical id(0x5001e677b6dbbfff), slot(10)
Nov 16 10:20:50 Alpha-Unraid emhttpd: read SMART /dev/sdk
Nov 16 10:20:50 Alpha-Unraid kernel: mpt2sas_cm0: handle(0xe) sas_address(0x5001517e3b0bf0aa) port_type(0x1)
Nov 16 10:20:51 Alpha-Unraid kernel: scsi 11:0:15:0: Direct-Access     ATA      ST16000NM000J-2T SC02 PQ: 0 ANSI: 6
Nov 16 10:20:51 Alpha-Unraid kernel: scsi 11:0:15:0: SATA: handle(0x000e), sas_addr(0x5001517e3b0bf0aa), phy(10), device_name(0x0000000000000000)
Nov 16 10:20:51 Alpha-Unraid kernel: scsi 11:0:15:0: enclosure logical id (0x5001e677b6dbbfff), slot(10)
Nov 16 10:20:51 Alpha-Unraid kernel: scsi 11:0:15:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
Nov 16 10:20:51 Alpha-Unraid kernel: scsi 11:0:15:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
Nov 16 10:20:51 Alpha-Unraid kernel: sd 11:0:15:0: Attached scsi generic sg10 type 0
Nov 16 10:20:51 Alpha-Unraid kernel: sd 11:0:15:0: Power-on or device reset occurred
Nov 16 10:20:51 Alpha-Unraid kernel: end_device-11:0:11: add: handle(0x000e), sas_addr(0x5001517e3b0bf0aa)
Nov 16 10:20:51 Alpha-Unraid kernel: sd 11:0:15:0: [sdt] 31251759104 512-byte logical blocks: (16.0 TB/14.6 TiB)
Nov 16 10:20:51 Alpha-Unraid kernel: sd 11:0:15:0: [sdt] 4096-byte physical blocks
Nov 16 10:20:51 Alpha-Unraid kernel: sd 11:0:15:0: [sdt] Write Protect is off
Nov 16 10:20:51 Alpha-Unraid kernel: sd 11:0:15:0: [sdt] Mode Sense: 7f 00 10 08
Nov 16 10:20:51 Alpha-Unraid kernel: sd 11:0:15:0: [sdt] Write cache: enabled, read cache: enabled, supports DPO and FUA
Nov 16 10:20:51 Alpha-Unraid kernel: sdt: sdt1
Nov 16 10:20:51 Alpha-Unraid kernel: sd 11:0:15:0: [sdt] Attached SCSI disk
Nov 16 10:20:52 Alpha-Unraid unassigned.devices: Disk with ID 'ST16000NM000J-2TW103_ZR5AGC4V ()' is not set to auto mount.
Nov 16 10:20:53 Alpha-Unraid emhttpd: error: hotplug_devices, 1706: No such file or directory (2): tagged device ST16000NM000J-2TW103_ZR5AGC4V was (sdk) is now (sdt)

Nov 16 10:20:53 Alpha-Unraid emhttpd: read SMART /dev/sdt
Nov 16 10:20:53 Alpha-Unraid kernel: emhttpd[10420]: segfault at 67c ip 000056528418775f sp 00007ffd34b8b620 error 4 in emhttpd[565284172000+24000] likely on CPU 8 (core 10, socket 0)
Nov 16 10:20:53 Alpha-Unraid kernel: Code: c4 36 01 00 48 89 45 f8 48 8d 05 f9 23 01 00 48 89 45 f0 e9 79 01 00 00 8b 45 ec 89 c7 e8 1a 88 ff ff 48 89 45 d8 48 8b 45 d8 <8b> 80 7c 06 00 00 85 c0 0f 94 c0 0f b6 c0 89 45 d4 48 8b 45 e0 48
Nov 16 10:21:01 Alpha-Unraid flash_backup: adding task: /usr/local/emhttp/plugins/dynamix.my.servers/scripts/UpdateFlashBackup update
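
To see which command actually timed out, the CDB bytes in the "attempting task abort" line can be decoded. Below is a minimal sketch, assuming the standard SAT layout of an ATA PASS-THROUGH(16) CDB (opcode 0x85); the function name and the small SMART lookup table are made up for illustration and are not part of any Unraid tool. For the CDB logged above it reports a SMART command (0xB0) with feature 0xD8, which would suggest a SMART request hit the disk shortly after it was spun down. That is an interpretation of the log, not a confirmed diagnosis.

```python
# Hypothetical helper: decode the ATA PASS-THROUGH(16) CDB from the kernel log.
# Byte layout follows the SCSI/ATA Translation (SAT) definition of opcode 0x85.

SMART_FEATURES = {
    0xD0: "SMART READ DATA",
    0xD5: "SMART READ LOG",
    0xD8: "SMART ENABLE OPERATIONS",
    0xD9: "SMART DISABLE OPERATIONS",
    0xDA: "SMART RETURN STATUS",
}


def decode_ata_passthrough16(cdb_hex: str) -> dict:
    """Split a 16-byte ATA PASS-THROUGH CDB into its ATA register fields."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 16 or cdb[0] != 0x85:
        raise ValueError("expected a 16-byte CDB starting with opcode 0x85")

    fields = {
        "protocol": (cdb[1] >> 1) & 0x0F,  # 3 means "non-data"
        "extend": cdb[1] & 0x01,           # 1 would mean a 48-bit command
        "ck_cond": bool(cdb[2] & 0x20),    # caller wants ATA status returned
        "features": cdb[4],
        "sector_count": cdb[6],
        "lba_mid": cdb[10],
        "lba_high": cdb[12],
        "command": cdb[14],
    }
    # A SMART command is 0xB0 with the signature 0x4F/0xC2 in LBA mid/high.
    fields["is_smart"] = (
        fields["command"] == 0xB0
        and fields["lba_mid"] == 0x4F
        and fields["lba_high"] == 0xC2
    )
    fields["description"] = (
        SMART_FEATURES.get(fields["features"], "unknown SMART subcommand")
        if fields["is_smart"]
        else "not a SMART command"
    )
    return fields


# CDB copied from the "tag#6626 CDB: opcode=0x85" line above.
decoded = decode_ata_passthrough16("85 06 20 00 d8 00 00 00 00 00 4f 00 c2 00 b0 00")
print(decoded["description"])
```

The table only lists a few common SMART subcommands, so other feature values are reported as unknown rather than guessed.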

 

Link to comment

Thanks, I'll try this as soon as the parity rebuild is done.

I have two IronWolf disks in the array that don't have the issue.

When this issue first came up, the failing disks were Western Digital disks.

I replaced the WD disks with new Exos X18 disks.

I'll update the post as soon as it's done and tested.

Link to comment

The EPC and lowCurrentSpinup settings did not solve the issue.

I noticed this the last time too but did not mention it here: when a disk goes into error mode, the web GUI becomes more or less read-only. The Stop Array button does not work and no settings can be changed anymore. The containers and VMs are still up and I can access shares. I have to push the power button to shut down and restart the array.

Very strange
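
The syslog excerpt from Nov 16 shows emhttpd segfaulting right after the disk came back under a new device name, which might be why the web GUI stops accepting changes. That is only a guess. A rough sketch for pulling the relevant events out of a syslog so the sequence is easier to follow; the pattern names, function names and the `SAMPLE` list are made up for illustration, and the sample lines are shortened copies of lines from that excerpt.

```python
import re
from datetime import datetime

# Assumed patterns, based only on the wording of the log lines posted above.
PATTERNS = {
    "spin_down": re.compile(r"emhttpd: spinning down (/dev/\w+)"),
    "task_abort": re.compile(r"attempting task abort"),
    "renamed": re.compile(r"was \((\w+)\) is now \((\w+)\)"),
    "segfault": re.compile(r"emhttpd\[\d+\]: segfault"),
}


def parse_ts(line):
    """Parse the 'Nov 16 10:19:32' prefix (syslog has no year, so 1900 is used)."""
    return datetime.strptime(line[:15], "%b %d %H:%M:%S")


def scan(lines):
    """Return (timestamp, event_name) for every line matching a pattern."""
    events = []
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                events.append((parse_ts(line), name))
    return events


def seconds_from_spin_down_to_abort(events):
    """Seconds between the first spin-down and the first task abort after it."""
    spin = next((t for t, n in events if n == "spin_down"), None)
    if spin is None:
        return None
    abort = next((t for t, n in events if n == "task_abort" and t >= spin), None)
    if abort is None:
        return None
    return int((abort - spin).total_seconds())


# Shortened copies of lines from the excerpt in this thread.
SAMPLE = [
    "Nov 16 10:19:32 Alpha-Unraid emhttpd: spinning down /dev/sdk",
    "Nov 16 10:20:47 Alpha-Unraid kernel: sd 11:0:5:0: attempting task abort!",
    "Nov 16 10:20:53 Alpha-Unraid emhttpd: error: hotplug_devices, 1706: "
    "tagged device was (sdk) is now (sdt)",
    "Nov 16 10:20:53 Alpha-Unraid kernel: emhttpd[10420]: segfault at 67c",
]

events = scan(SAMPLE)
delay = seconds_from_spin_down_to_abort(events)
for ts, name in events:
    print(ts.strftime("%H:%M:%S"), name)
print("seconds from spin-down to task abort:", delay)
```

To run it against a real log, pass the lines of the syslog file from the diagnostics zip to `scan()` instead of `SAMPLE`.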

 

alpha-unraid-diagnostics-20231118-1739.zip

Link to comment
