Mattlevant Posted June 9, 2023

I recently had two 4 TB WD Red CMR drives die (or I believe they died). Both had been giving corrupt filesystem errors for weeks, not at the same time, but one on one day and the other on another, happening repeatedly. I just ran xfs_repair and carried on. Then I noticed both disk 3 and disk 4 had dropped offline, giving I/O errors, with xfs_repair unable to find a superblock. I tried multiple different cables, different power connection combinations, etc. In the end I replaced them with two brand new Seagate IronWolf 6 TB drives (losing a lot of data in the process, as I only had single parity, but it's all media that can be redownloaded, which I am in the process of doing).

The new drives have been powered for maybe three days total, and I've just had a notification that disk 4 has been disabled due to errors. I don't believe it has actually failed, but am I right in thinking I can just stop the array, remove disk 4, start in maintenance mode, stop, and then re-add it to rebuild and re-enable?

I did come across this thread while doing some quick googling; has this been found to be a fix and worth doing?

Diagnostics attached in case anything in there points to the cause. I'm using an LSI 9201 in IT mode if my memory is correct, but I may be wrong on the exact model.

tower-diagnostics-20230609-0945.zip
JorgeB Posted June 9, 2023

Looks more like a power/connection problem; check/replace cables and try again. These, though, are logged like a disk problem, so it's a good idea to run an extended SMART test on disk1:

Jun 8 12:29:56 Tower kernel: sd 9:0:0:0: [sdb] tag#205 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=6s
Jun 8 12:29:56 Tower kernel: sd 9:0:0:0: [sdb] tag#205 Sense Key : 0x3 [current]
Jun 8 12:29:56 Tower kernel: sd 9:0:0:0: [sdb] tag#205 ASC=0x11 ASCQ=0x0
Jun 8 12:29:56 Tower kernel: sd 9:0:0:0: [sdb] tag#205 CDB: opcode=0x88 88 00 00 00 00 00 01 64 66 98 00 00 02 f0 00 00
Jun 8 12:29:56 Tower kernel: critical medium error, dev sdb, sector 23357080 op 0x0:(READ) flags 0x0 phys_seg 94 prio class 0
Jun 8 12:29:56 Tower kernel: md: disk1 read error, sector=23357016
Jun 8 12:29:56 Tower kernel: md: disk1 read error, sector=23357024
Jun 8 12:29:56 Tower kernel: md: disk1 read error, sector=23357032
Jun 8 12:29:56 Tower kernel: md: disk1 read error, sector=23357040
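An aside on reading those two sector numbers: the kernel line reports the raw device sector ("dev sdb, sector 23357080"), while the md lines report the sector relative to the start of the data partition. A small sketch (my own illustration, assuming the typical Unraid data partition start at sector 64, which is consistent with the 64-sector difference in the log) showing how the two values line up:

```python
# Map a raw-device error sector to the md/array-relative sector.
# PARTITION_START = 64 is an assumption (typical Unraid data-partition
# alignment in 512-byte sectors), consistent with the log above.
PARTITION_START = 64

def device_to_md_sector(dev_sector: int, part_start: int = PARTITION_START) -> int:
    """Translate a kernel 'dev sdX, sector N' value to the md-relative sector."""
    return dev_sector - part_start

# The syslog reports 'critical medium error, dev sdb, sector 23357080'
# and 'md: disk1 read error, sector=23357016' for the same event:
print(device_to_md_sector(23357080))  # 23357016, matching the first md line
```

This is only useful for sanity-checking that a kernel medium error and an md read error refer to the same physical region of the disk.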
Mattlevant Posted June 9, 2023

Yeah, I've got a few errors on disk 1, but disk 4 is the one that has been disabled. Is it likely a power issue for both disk 1 and disk 4? I've got an 800 W supply in there; surely that's enough for the drives I have?
Mattlevant Posted June 9, 2023

4 minutes ago, JorgeB said:
  Looks more like a power/connection problem, check/replace cables and try again. [...]

Missed quoting this in my response above; see that reply.
JorgeB Posted June 9, 2023

2 minutes ago, Mattlevant said:
  is it likely a power issue for both disk 1 & disk 4?

No, just for disk4; as mentioned, disk1 is logged as an actual disk problem.

6 minutes ago, JorgeB said:
  run an extended SMART test
Mattlevant Posted June 9, 2023

4 minutes ago, JorgeB said:
  No, for disk4, like mentioned disk1 is logged as an actual disk problem

Ah sorry, I misunderstood initially, as disk 4 wasn't specified in your response. I'll try a different cable (I have 2 spares on my LSI card and 2 regular SATA cables I can plug into the mainboard). Running the extended self-test on disk 1 now.

For reference, SMART attributes attached for disk 1.
JorgeB Posted June 9, 2023

SMART attributes look fine, but it's logged as a disk issue both in the syslog and in the extended SMART info, and there are multiple errors from earlier; all these UNC @ LBA entries are not a good sign:

Error 55 [6] occurred at disk power-on lifetime: 57993 hours (2416 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 01 64 66 98 40 00  Error: UNC at LBA = 0x01646698 = 23357080
  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 02 f0 00 08 00 00 01 64 66 98 40 00  1d+22:23:27.698  READ FPDMA QUEUED
  60 00 10 00 00 00 00 80 ac a6 20 40 00  1d+22:23:27.698  READ FPDMA QUEUED
  60 01 f0 00 10 00 00 01 64 64 a8 40 00  1d+22:23:27.690  READ FPDMA QUEUED
  60 04 00 00 08 00 00 01 64 60 a8 40 00  1d+22:23:27.690  READ FPDMA QUEUED
  60 03 80 00 00 00 00 01 64 5d 28 40 00  1d+22:23:27.689  READ FPDMA QUEUED

Error 54 [5] occurred at disk power-on lifetime: 57786 hours (2407 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 54 ad 34 f8 40 00  Error: UNC at LBA = 0x54ad34f8 = 1420637432
  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 04 00 00 00 00 00 54 ad 34 f8 40 00  2d+11:10:42.878  READ FPDMA QUEUED
  60 04 00 00 08 00 00 54 a7 54 f8 40 00  2d+11:10:41.508  READ FPDMA QUEUED
  60 04 00 00 00 00 00 54 a7 50 f8 40 00  2d+11:10:41.508  READ FPDMA QUEUED
  60 04 00 00 00 00 00 54 a7 4c f8 40 00  2d+11:10:41.502  READ FPDMA QUEUED
  60 04 00 00 00 00 00 54 a7 48 f8 40 00  2d+11:10:41.499  READ FPDMA QUEUED

Error 53 [4] occurred at disk power-on lifetime: 57317 hours (2388 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 2a ac 0e 68 40 00  Error: UNC at LBA = 0x2aac0e68 = 715918952
  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 20 00 00 00 00 2a ac 0e 68 40 00  5d+06:18:02.593  READ FPDMA QUEUED
  60 00 20 00 18 00 00 2a a3 ff 28 40 00  5d+06:18:02.510  READ FPDMA QUEUED
  60 00 20 00 10 00 00 2a ab ff 28 40 00  5d+06:18:02.510  READ FPDMA QUEUED
  60 00 20 00 08 00 00 2a a3 ff 48 40 00  5d+06:18:02.510  READ FPDMA QUEUED
  60 00 20 00 00 00 00 2a ab ff 48 40 00  5d+06:18:02.510  READ FPDMA QUEUED

Error 52 [3] occurred at disk power-on lifetime: 57128 hours (2380 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 ad a0 e7 38 40 00  Error: UNC at LBA = 0x1ada0e738 = 7207970616
  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 01 00 00 00 00 01 ad a0 e7 38 40 00  2d+04:22:59.028  READ FPDMA QUEUED
  60 00 08 00 00 00 00 e8 fc c7 d0 40 00  2d+04:22:55.579  READ FPDMA QUEUED
  60 00 20 00 18 00 00 2a a0 87 e8 40 00  2d+04:22:55.552  READ FPDMA QUEUED
  60 00 20 00 10 00 00 2a a8 87 e8 40 00  2d+04:22:55.552  READ FPDMA QUEUED
  60 00 20 00 08 00 00 2a a0 88 08 40 00  2d+04:22:55.552  READ FPDMA QUEUED

Error 51 [2] occurred at disk power-on lifetime: 56999 hours (2374 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 00 00 00 40 40 00  Error: UNC at LBA = 0x00000040 = 64
  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 08 00 00 00 00 00 00 00 40 40 00  2d+16:09:06.123  READ FPDMA QUEUED
  60 00 08 00 00 00 00 e8 e8 15 d8 40 00  2d+16:09:02.836  READ FPDMA QUEUED
  60 00 20 00 18 00 00 2b 70 1f a8 40 00  2d+16:09:02.758  READ FPDMA QUEUED
  60 00 20 00 10 00 00 2b 78 1f a8 40 00  2d+16:09:02.758  READ FPDMA QUEUED
  60 00 20 00 08 00 00 2b 70 1f c8 40 00  2d+16:09:02.758  READ FPDMA QUEUED

Error 50 [1] occurred at disk power-on lifetime: 54448 hours (2268 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 02 77 7c 46 08 40 00  Error: UNC at LBA = 0x2777c4608 = 10594567688
  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 01 00 00 00 00 02 77 7c 46 08 40 00  15d+09:54:26.733  READ FPDMA QUEUED
  60 04 00 00 00 00 02 77 76 92 08 40 00  15d+09:54:24.824  READ FPDMA QUEUED
  60 04 00 00 00 00 02 77 76 8e 08 40 00  15d+09:54:24.822  READ FPDMA QUEUED
  60 04 00 00 18 00 02 77 76 8a 08 40 00  15d+09:54:24.818  READ FPDMA QUEUED
  60 04 00 00 10 00 02 77 76 86 08 40 00  15d+09:54:24.818  READ FPDMA QUEUED

Error 49 [0] occurred at disk power-on lifetime: 53679 hours (2236 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 84 21 6a 40 40 00  Error: UNC at LBA = 0x184216a40 = 6511749696
  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 04 00 00 10 00 01 84 21 6a 40 40 00  11d+20:38:22.772  READ FPDMA QUEUED
  60 04 00 00 08 00 01 84 21 66 40 40 00  11d+20:38:22.772  READ FPDMA QUEUED
  60 04 00 00 00 00 01 84 21 62 40 40 00  11d+20:38:22.772  READ FPDMA QUEUED
  60 04 00 00 08 00 01 84 21 5e 40 40 00  11d+20:38:22.771  READ FPDMA QUEUED
  60 04 00 00 00 00 01 84 21 5a 40 40 00  11d+20:38:22.770  READ FPDMA QUEUED

Error 48 [23] occurred at disk power-on lifetime: 51049 hours (2127 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 38 74 01 28 40 00  Error: UNC at LBA = 0x38740128 = 947126568
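One way to see why that error history is worrying is to look at when the UNC errors occurred and where on the disk they landed. A quick sketch (editorial illustration using the power-on hours and LBAs from the log above, nothing else) shows the errors have been accumulating for thousands of power-on hours, at sectors scattered across the whole drive rather than in one bad spot:

```python
# (power-on hours, failing LBA) pairs taken from errors 48-55 above.
errors = [
    (57993, 0x01646698),   # Error 55
    (57786, 0x54ad34f8),   # Error 54
    (57317, 0x2aac0e68),   # Error 53
    (57128, 0x1ada0e738),  # Error 52
    (56999, 0x00000040),   # Error 51
    (54448, 0x2777c4608),  # Error 50
    (53679, 0x184216a40),  # Error 49
    (51049, 0x38740128),   # Error 48
]

span_hours = max(h for h, _ in errors) - min(h for h, _ in errors)
print(f"UNC errors span {span_hours} power-on hours (~{span_hours / 24:.0f} days of power-on time)")

# Sorted LBAs: everywhere from near sector 64 up past the 10-billion mark.
print(sorted(lba for _, lba in errors))
```

Intermittent but widely scattered read failures over such a long window fit the "might be failing, might just be marginal" picture, which is why the extended test is the deciding factor.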
Mattlevant Posted June 9, 2023

1 hour ago, JorgeB said:
  SMART attributes look fine, but it's logged as a disk issue both on the syslog and on extended SMART info, and there are multiple errors from earlier, all these UNC @ LBA are not a good sign: [...]

So are you saying disk 1 is likely to be the next one to fail?
JorgeB Posted June 9, 2023

It might already be bad; the SMART test will confirm, since these errors can sometimes be intermittent.
Mattlevant Posted June 10, 2023

21 hours ago, JorgeB said:
  It might be already bad, SMART test will confirm, since these errors can sometimes be intermittent.

Disk 1's extended SMART test completed without error. New diagnostics attached in case they show anything.

tower-diagnostics-20230610-0956.zip
JorgeB Posted June 10, 2023

That means the disk is OK, at least for now. Keep monitoring, but on any more read errors I would consider replacing it.
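For the "keep monitoring" part, the SMART attributes most worth watching on a drive with a UNC history are Reallocated_Sector_Ct (ID 5), Current_Pending_Sector (197) and Offline_Uncorrectable (198). A minimal sketch of such a check (the attribute IDs are standard SMART; the helper and the sample readings are invented for illustration):

```python
# Minimal monitoring sketch: flag a drive when the SMART attributes most
# associated with failing media move off zero. Sample values are invented.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}

def smart_warnings(raw_values: dict) -> list:
    """Return one warning string per watched attribute with a non-zero raw value."""
    return [
        f"{name} = {raw_values[attr_id]}"
        for attr_id, name in WATCHED.items()
        if raw_values.get(attr_id, 0) > 0
    ]

# Hypothetical readings: a clean disk, then one starting to degrade.
print(smart_warnings({5: 0, 197: 0, 198: 0}))  # []
print(smart_warnings({5: 8, 197: 2, 198: 0}))  # two warnings
```

In practice the raw values would come from something like the smartctl output already attached to this thread; the point is simply that a passing extended test plus all-zero values on those three attributes is the "OK for now" state, and any of them rising is the replace signal.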
Mattlevant Posted June 10, 2023

1 hour ago, JorgeB said:
  That means the disk is OK, at least for now, keep monitoring but any more read errors I would consider replacing it.

Just come back to find disk 3 has errors now, so that's disks 1, 3 and 4 all giving errors. Can you check disk 3? Disks 3 and 4 are my new Seagate drives. Running the extended self-test on 3 now, but maybe you can see something in the diagnostics.

tower-diagnostics-20230610-1258.zip
Mattlevant Posted June 10, 2023 Author Share Posted June 10, 2023 (edited) rebooted server and now ive lost disk 1 🤦♂️ edit: its back online after swapping cables about Edited June 10, 2023 by Mattlevant Quote Link to comment
JorgeB Posted June 11, 2023

20 hours ago, Mattlevant said:
  can you just check disk 3?

It's not logged as a disk problem, and the disk looks healthy. Most likely power/connection; check/replace cables and/or try a different PSU.
Mattlevant Posted June 11, 2023

17 minutes ago, JorgeB said:
  It's not logged as a disk problem, and the disk looks healthy, most likely power/connection, check/replace cables and/or try a different PSU.

I've just lost disk 2 this morning, so I'm beginning to think it's an issue with my LSI card causing the drop-outs. Do the diagnostics show anything for that, or is that something that isn't logged?

I'm currently bodging together a power and SATA connector from my old Dell server (a proprietary 8-pin to a female Molex) because I've run out of spare SATA cables.
JorgeB Posted June 11, 2023

Though not as common, it could be a bad controller. If power/cables don't help, try a different one.
Mattlevant Posted June 11, 2023

1 minute ago, JorgeB said:
  Though not as common it could be a bad controller, if power/cables don't help try a different one.

I just don't see how I can be dropping so many drives so often. All my drives can't be bad, and all the SAS-to-SATA cables coming from the controller can't be bad either. The common denominator is the LSI card; everything goes through it. Going to try bypassing it completely and see what happens.
Gragorg Posted June 11, 2023

I have seen some people have problems when all their drives are powered from the same cable from the power supply.
Mattlevant Posted June 11, 2023

1 hour ago, Gragorg said:
  I have seen some have problems if all your drives are powered on the same cable from the power supply.

3 on one cable and 3 on another. But an 800 W supply should be plenty?
JonathanM Posted June 11, 2023

2 hours ago, Mattlevant said:
  But 800w supply should be plenty

Total wattage isn't the relevant number here. Every added drive on a single feed drops the voltage available to all drives on that feed by a small amount. If that voltage sags enough during simultaneous drive spin-up, you will see communication errors on those drives at that point. The amount of voltage drop on a feed is affected by the thickness and length of the wire, and also by any connections, like the slip fit between adapters and the modular connections on the PSU.
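To make the voltage-sag point concrete, here is a rough Ohm's-law sketch. The resistance and spin-up current figures are illustrative assumptions, not measurements of this system; the shape of the result is the point: a shared feed that is fine with two or three drives can dip below spec when six spin up at once.

```python
# Rough sketch of voltage sag on a shared 12 V drive-power feed.
# All numbers below are illustrative assumptions, not measured values.
FEED_RESISTANCE_OHMS = 0.08   # assumed round-trip wire + connector resistance
SPINUP_CURRENT_A = 2.0        # assumed 12 V spin-up current per 3.5" drive
NOMINAL_V = 12.0
MIN_OK_V = NOMINAL_V * 0.95   # ATX allows roughly -5% on the 12 V rail

def volts_at_drives(n_drives: int) -> float:
    """Ohm's law: the shared feed drops I_total * R before reaching the drives."""
    i_total = n_drives * SPINUP_CURRENT_A
    return NOMINAL_V - i_total * FEED_RESISTANCE_OHMS

for n in (2, 3, 6):
    v = volts_at_drives(n)
    print(f"{n} drives spinning up: {v:.2f} V {'ok' if v >= MIN_OK_V else 'SAG'}")
```

With these assumed numbers, 2 or 3 drives stay within tolerance while 6 on one feed dip below the ~11.4 V floor, which is why splitting drives across separate PSU cables (as suggested above) can fix "random" drop-outs even on a high-wattage supply.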
Mattlevant Posted June 12, 2023

13 hours ago, JonathanM said:
  Not relevant to cable count. Every added drive on a single feed drops the voltage available to all drives on that feed by a small amount. [...]

Makes sense, I suppose, but it was random which drives got the errors; it's not like it was the last drive on the string every time.

Anyway, I've currently got 4 drives powered via my bodged Molex-to-SATA power. The way I've connected it, there are 2 on each string on a Y-split from one 4-pin Molex, and the other 2 drives are on a single string from the dedicated SATA power cable from the PSU. So 3 strings of 2 now, and I've also totally bypassed the LSI card.

I'm unsure whether it was power related or the LSI card was at fault, but it's currently online and ran a whole parity check, taking 18 hours, without any drive errors (quite a few parity errors, 500k or so, which I'm currently running a correcting check to fix).