Mattlevant Posted June 9, 2023

I recently had two 4 TB WD Red CMR drives die (or I believe they died). Both had been giving corrupt filesystem errors for weeks, not at the same time, but one on one day and the other on another, happening repeatedly. I just ran xfs_repair and carried on. Then I noticed both disk 3 and disk 4 had dropped offline, giving I/O errors, with xfs_repair unable to find a superblock. I tried multiple different cables, different power connection combinations, etc. In the end I replaced them with two brand new Seagate IronWolf 6 TB drives (losing a lot of data in the process, as I only had single parity, but it's all media that can be redownloaded, which I am in the process of doing).

The new drives have been powered for maybe three days total, and I've just had a notification that disk 4 has been disabled due to errors. I don't believe it has actually failed, but am I right in thinking I can just stop the array, remove disk 4, start in maintenance mode, stop, and then re-add it to rebuild and re-enable?

I did come across this thread while doing some quick googling; has this been found to be a fix and worth doing?

Diagnostics attached in case anything in there points to the cause. I'm using an LSI 9201 in IT mode if my memory is correct, but I may be wrong on the exact model.

tower-diagnostics-20230609-0945.zip
JorgeB Posted June 9, 2023

Looks more like a power/connection problem; check/replace cables and try again. These, though, are logged like a disk problem, so it's a good idea to run an extended SMART test on disk1:

Jun 8 12:29:56 Tower kernel: sd 9:0:0:0: [sdb] tag#205 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=6s
Jun 8 12:29:56 Tower kernel: sd 9:0:0:0: [sdb] tag#205 Sense Key : 0x3 [current]
Jun 8 12:29:56 Tower kernel: sd 9:0:0:0: [sdb] tag#205 ASC=0x11 ASCQ=0x0
Jun 8 12:29:56 Tower kernel: sd 9:0:0:0: [sdb] tag#205 CDB: opcode=0x88 88 00 00 00 00 00 01 64 66 98 00 00 02 f0 00 00
Jun 8 12:29:56 Tower kernel: critical medium error, dev sdb, sector 23357080 op 0x0:(READ) flags 0x0 phys_seg 94 prio class 0
Jun 8 12:29:56 Tower kernel: md: disk1 read error, sector=23357016
Jun 8 12:29:56 Tower kernel: md: disk1 read error, sector=23357024
Jun 8 12:29:56 Tower kernel: md: disk1 read error, sector=23357032
Jun 8 12:29:56 Tower kernel: md: disk1 read error, sector=23357040
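An aside on reading those two sector numbers: the kernel line reports the raw device sector ("dev sdb, sector 23357080"), while the md lines report the sector relative to the start of the data partition. A small sketch (my own illustration, assuming the typical Unraid data partition start at sector 64, which is consistent with the 64-sector difference in the log) showing how the two values line up:

```python
# Map a raw-device error sector to the md/array-relative sector.
# PARTITION_START = 64 is an assumption (typical Unraid data-partition
# alignment in 512-byte sectors), consistent with the log above.
PARTITION_START = 64

def device_to_md_sector(dev_sector: int, part_start: int = PARTITION_START) -> int:
    """Translate a kernel 'dev sdX, sector N' value to the md-relative sector."""
    return dev_sector - part_start

# The syslog reports 'critical medium error, dev sdb, sector 23357080'
# and 'md: disk1 read error, sector=23357016' for the same event:
print(device_to_md_sector(23357080))  # 23357016, matching the first md line
```

This is only useful for sanity-checking that a kernel medium error and an md read error refer to the same physical region of the disk.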
Mattlevant Posted June 9, 2023

Yeah, I've got a few errors on disk 1, but disk 4 is the one that has been disabled. Is it likely a power issue for both disk 1 and disk 4? I've got an 800 W supply in there; surely that's enough for the drives I have?
Mattlevant Posted June 9, 2023

4 minutes ago, JorgeB said:
  Looks more like a power/connection problem, check/replace cables and try again. [...]

Missed quoting this in my response above; see that reply.
JorgeB Posted June 9, 2023

2 minutes ago, Mattlevant said:
  is it likely a power issue for both disk 1 & disk 4?

No, just for disk4; as mentioned, disk1 is logged as an actual disk problem.

6 minutes ago, JorgeB said:
  run an extended SMART test
Mattlevant Posted June 9, 2023

4 minutes ago, JorgeB said:
  No, for disk4, like mentioned disk1 is logged as an actual disk problem

Ah sorry, I misunderstood initially, as disk 4 wasn't specified in your response. I'll try a different cable (I have 2 spares on my LSI card and 2 regular SATA cables I can plug into the mainboard). Running the extended self-test on disk 1 now.

For reference, SMART attributes attached for disk 1.
JorgeB Posted June 9, 2023

SMART attributes look fine, but it's logged as a disk issue both in the syslog and in the extended SMART info, and there are multiple errors from earlier; all these UNC @ LBA entries are not a good sign:

Error 55 [6] occurred at disk power-on lifetime: 57993 hours (2416 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 01 64 66 98 40 00  Error: UNC at LBA = 0x01646698 = 23357080
  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 02 f0 00 08 00 00 01 64 66 98 40 00  1d+22:23:27.698  READ FPDMA QUEUED
  60 00 10 00 00 00 00 80 ac a6 20 40 00  1d+22:23:27.698  READ FPDMA QUEUED
  60 01 f0 00 10 00 00 01 64 64 a8 40 00  1d+22:23:27.690  READ FPDMA QUEUED
  60 04 00 00 08 00 00 01 64 60 a8 40 00  1d+22:23:27.690  READ FPDMA QUEUED
  60 03 80 00 00 00 00 01 64 5d 28 40 00  1d+22:23:27.689  READ FPDMA QUEUED

Error 54 [5] occurred at disk power-on lifetime: 57786 hours (2407 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 54 ad 34 f8 40 00  Error: UNC at LBA = 0x54ad34f8 = 1420637432
  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 04 00 00 00 00 00 54 ad 34 f8 40 00  2d+11:10:42.878  READ FPDMA QUEUED
  60 04 00 00 08 00 00 54 a7 54 f8 40 00  2d+11:10:41.508  READ FPDMA QUEUED
  60 04 00 00 00 00 00 54 a7 50 f8 40 00  2d+11:10:41.508  READ FPDMA QUEUED
  60 04 00 00 00 00 00 54 a7 4c f8 40 00  2d+11:10:41.502  READ FPDMA QUEUED
  60 04 00 00 00 00 00 54 a7 48 f8 40 00  2d+11:10:41.499  READ FPDMA QUEUED

Error 53 [4] occurred at disk power-on lifetime: 57317 hours (2388 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 2a ac 0e 68 40 00  Error: UNC at LBA = 0x2aac0e68 = 715918952
  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 20 00 00 00 00 2a ac 0e 68 40 00  5d+06:18:02.593  READ FPDMA QUEUED
  60 00 20 00 18 00 00 2a a3 ff 28 40 00  5d+06:18:02.510  READ FPDMA QUEUED
  60 00 20 00 10 00 00 2a ab ff 28 40 00  5d+06:18:02.510  READ FPDMA QUEUED
  60 00 20 00 08 00 00 2a a3 ff 48 40 00  5d+06:18:02.510  READ FPDMA QUEUED
  60 00 20 00 00 00 00 2a ab ff 48 40 00  5d+06:18:02.510  READ FPDMA QUEUED

Error 52 [3] occurred at disk power-on lifetime: 57128 hours (2380 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 ad a0 e7 38 40 00  Error: UNC at LBA = 0x1ada0e738 = 7207970616
  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 01 00 00 00 00 01 ad a0 e7 38 40 00  2d+04:22:59.028  READ FPDMA QUEUED
  60 00 08 00 00 00 00 e8 fc c7 d0 40 00  2d+04:22:55.579  READ FPDMA QUEUED
  60 00 20 00 18 00 00 2a a0 87 e8 40 00  2d+04:22:55.552  READ FPDMA QUEUED
  60 00 20 00 10 00 00 2a a8 87 e8 40 00  2d+04:22:55.552  READ FPDMA QUEUED
  60 00 20 00 08 00 00 2a a0 88 08 40 00  2d+04:22:55.552  READ FPDMA QUEUED

Error 51 [2] occurred at disk power-on lifetime: 56999 hours (2374 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 00 00 00 40 40 00  Error: UNC at LBA = 0x00000040 = 64
  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 08 00 00 00 00 00 00 00 40 40 00  2d+16:09:06.123  READ FPDMA QUEUED
  60 00 08 00 00 00 00 e8 e8 15 d8 40 00  2d+16:09:02.836  READ FPDMA QUEUED
  60 00 20 00 18 00 00 2b 70 1f a8 40 00  2d+16:09:02.758  READ FPDMA QUEUED
  60 00 20 00 10 00 00 2b 78 1f a8 40 00  2d+16:09:02.758  READ FPDMA QUEUED
  60 00 20 00 08 00 00 2b 70 1f c8 40 00  2d+16:09:02.758  READ FPDMA QUEUED

Error 50 [1] occurred at disk power-on lifetime: 54448 hours (2268 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 02 77 7c 46 08 40 00  Error: UNC at LBA = 0x2777c4608 = 10594567688
  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 01 00 00 00 00 02 77 7c 46 08 40 00  15d+09:54:26.733  READ FPDMA QUEUED
  60 04 00 00 00 00 02 77 76 92 08 40 00  15d+09:54:24.824  READ FPDMA QUEUED
  60 04 00 00 00 00 02 77 76 8e 08 40 00  15d+09:54:24.822  READ FPDMA QUEUED
  60 04 00 00 18 00 02 77 76 8a 08 40 00  15d+09:54:24.818  READ FPDMA QUEUED
  60 04 00 00 10 00 02 77 76 86 08 40 00  15d+09:54:24.818  READ FPDMA QUEUED

Error 49 [0] occurred at disk power-on lifetime: 53679 hours (2236 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 84 21 6a 40 40 00  Error: UNC at LBA = 0x184216a40 = 6511749696
  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 04 00 00 10 00 01 84 21 6a 40 40 00  11d+20:38:22.772  READ FPDMA QUEUED
  60 04 00 00 08 00 01 84 21 66 40 40 00  11d+20:38:22.772  READ FPDMA QUEUED
  60 04 00 00 00 00 01 84 21 62 40 40 00  11d+20:38:22.772  READ FPDMA QUEUED
  60 04 00 00 08 00 01 84 21 5e 40 40 00  11d+20:38:22.771  READ FPDMA QUEUED
  60 04 00 00 00 00 01 84 21 5a 40 40 00  11d+20:38:22.770  READ FPDMA QUEUED

Error 48 [23] occurred at disk power-on lifetime: 51049 hours (2127 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 38 74 01 28 40 00  Error: UNC at LBA = 0x38740128 = 947126568
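One way to see why that error history is worrying is to look at when the UNC errors occurred and where on the disk they landed. A quick sketch (editorial illustration using the power-on hours and LBAs from the log above, nothing else) shows the errors have been accumulating for thousands of power-on hours, at sectors scattered across the whole drive rather than in one bad spot:

```python
# (power-on hours, failing LBA) pairs taken from errors 48-55 above.
errors = [
    (57993, 0x01646698),   # Error 55
    (57786, 0x54ad34f8),   # Error 54
    (57317, 0x2aac0e68),   # Error 53
    (57128, 0x1ada0e738),  # Error 52
    (56999, 0x00000040),   # Error 51
    (54448, 0x2777c4608),  # Error 50
    (53679, 0x184216a40),  # Error 49
    (51049, 0x38740128),   # Error 48
]

span_hours = max(h for h, _ in errors) - min(h for h, _ in errors)
print(f"UNC errors span {span_hours} power-on hours (~{span_hours / 24:.0f} days of power-on time)")

# Sorted LBAs: everywhere from near sector 64 up past the 10-billion mark.
print(sorted(lba for _, lba in errors))
```

Intermittent but widely scattered read failures over such a long window fit the "might be failing, might just be marginal" picture, which is why the extended test is the deciding factor.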
Mattlevant Posted June 9, 2023

1 hour ago, JorgeB said:
  SMART attributes look fine, but it's logged as a disk issue both on the syslog and on extended SMART info, and there are multiple errors from earlier, all these UNC @ LBA are not a good sign: [...]

So are you saying disk 1 is likely to be the next one to fail?
JorgeB Posted June 9, 2023

It might already be bad; the SMART test will confirm, since these errors can sometimes be intermittent.
Mattlevant Posted June 10, 2023

21 hours ago, JorgeB said:
  It might be already bad, SMART test will confirm, since these errors can sometimes be intermittent.

Disk 1's extended SMART test completed without error. New diagnostics attached in case they show anything.

tower-diagnostics-20230610-0956.zip
JorgeB Posted June 10, 2023

That means the disk is OK, at least for now. Keep monitoring, but on any more read errors I would consider replacing it.
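For the "keep monitoring" part, the SMART attributes most worth watching on a drive with a UNC history are Reallocated_Sector_Ct (ID 5), Current_Pending_Sector (197) and Offline_Uncorrectable (198). A minimal sketch of such a check (the attribute IDs are standard SMART; the helper and the sample readings are invented for illustration):

```python
# Minimal monitoring sketch: flag a drive when the SMART attributes most
# associated with failing media move off zero. Sample values are invented.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}

def smart_warnings(raw_values: dict) -> list:
    """Return one warning string per watched attribute with a non-zero raw value."""
    return [
        f"{name} = {raw_values[attr_id]}"
        for attr_id, name in WATCHED.items()
        if raw_values.get(attr_id, 0) > 0
    ]

# Hypothetical readings: a clean disk, then one starting to degrade.
print(smart_warnings({5: 0, 197: 0, 198: 0}))  # []
print(smart_warnings({5: 8, 197: 2, 198: 0}))  # two warnings
```

In practice the raw values would come from something like the smartctl output already attached to this thread; the point is simply that a passing extended test plus all-zero values on those three attributes is the "OK for now" state, and any of them rising is the replace signal.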
Mattlevant Posted June 10, 2023

1 hour ago, JorgeB said:
  That means the disk is OK, at least for now, keep monitoring but any more read errors I would consider replacing it.

Just come back to find disk 3 has errors now, so that's disks 1, 3 and 4 all giving errors. Can you check disk 3? Disks 3 and 4 are my new Seagate drives. Running the extended self-test on 3 now, but maybe you can see something in the diagnostics.

tower-diagnostics-20230610-1258.zip
Mattlevant Posted June 10, 2023 Author Share Posted June 10, 2023 (edited) rebooted server and now ive lost disk 1 🤦♂️ edit: its back online after swapping cables about Edited June 10, 2023 by Mattlevant Quote Link to comment
JorgeB Posted June 11, 2023

20 hours ago, Mattlevant said:
  can you just check disk 3?

It's not logged as a disk problem, and the disk looks healthy. Most likely power/connection; check/replace cables and/or try a different PSU.
Mattlevant Posted June 11, 2023

17 minutes ago, JorgeB said:
  It's not logged as a disk problem, and the disk looks healthy, most likely power/connection, check/replace cables and/or try a different PSU.

I've just lost disk 2 this morning, so I'm beginning to think it's an issue with my LSI card causing the drop-outs. Do the diagnostics show anything for that, or is that something that isn't logged?

I'm currently bodging together a power and SATA connector from my old Dell server (a proprietary 8-pin to a female Molex) because I've run out of spare SATA cables.
JorgeB Posted June 11, 2023

Though not as common, it could be a bad controller. If power/cables don't help, try a different one.
Mattlevant Posted June 11, 2023

1 minute ago, JorgeB said:
  Though not as common it could be a bad controller, if power/cables don't help try a different one.

I just don't see how I can be dropping so many drives so often. All my drives can't be bad, and all the SAS-to-SATA cables coming from the controller can't be bad either. The common denominator is the LSI card; everything goes through it. Going to try bypassing it completely and see what happens.
Gragorg Posted June 11, 2023

I have seen some people have problems when all their drives are powered from the same cable from the power supply.
Mattlevant Posted June 11, 2023

1 hour ago, Gragorg said:
  I have seen some have problems if all your drives are powered on the same cable from the power supply.

3 on one cable and 3 on another. But an 800 W supply should be plenty?
JonathanM Posted June 11, 2023

2 hours ago, Mattlevant said:
  But 800w supply should be plenty

Total wattage isn't the relevant number here. Every added drive on a single feed drops the voltage available to all drives on that feed by a small amount. If that voltage sags enough during simultaneous drive spin-up, you will see communication errors on those drives at that point. The amount of voltage drop on a feed is affected by the thickness and length of the wire, and also by any connections, like the slip fit between adapters and the modular connections on the PSU.
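To make the voltage-sag point concrete, here is a rough Ohm's-law sketch. The resistance and spin-up current figures are illustrative assumptions, not measurements of this system; the shape of the result is the point: a shared feed that is fine with two or three drives can dip below spec when six spin up at once.

```python
# Rough sketch of voltage sag on a shared 12 V drive-power feed.
# All numbers below are illustrative assumptions, not measured values.
FEED_RESISTANCE_OHMS = 0.08   # assumed round-trip wire + connector resistance
SPINUP_CURRENT_A = 2.0        # assumed 12 V spin-up current per 3.5" drive
NOMINAL_V = 12.0
MIN_OK_V = NOMINAL_V * 0.95   # ATX allows roughly -5% on the 12 V rail

def volts_at_drives(n_drives: int) -> float:
    """Ohm's law: the shared feed drops I_total * R before reaching the drives."""
    i_total = n_drives * SPINUP_CURRENT_A
    return NOMINAL_V - i_total * FEED_RESISTANCE_OHMS

for n in (2, 3, 6):
    v = volts_at_drives(n)
    print(f"{n} drives spinning up: {v:.2f} V {'ok' if v >= MIN_OK_V else 'SAG'}")
```

With these assumed numbers, 2 or 3 drives stay within tolerance while 6 on one feed dip below the ~11.4 V floor, which is why splitting drives across separate PSU cables (as suggested above) can fix "random" drop-outs even on a high-wattage supply.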
Mattlevant Posted June 12, 2023

13 hours ago, JonathanM said:
  Not relevant to cable count. Every added drive on a single feed drops the voltage available to all drives on that feed by a small amount. [...]

Makes sense, I suppose, but it was random which drives got the errors; it's not like it was the last drive on the string every time.

Anyway, I've currently got 4 drives powered via my bodged Molex-to-SATA power. The way I've connected it, there are 2 on each string on a Y-split from one 4-pin Molex, and the other 2 drives are on a single string from the dedicated SATA power cable from the PSU. So 3 strings of 2 now, and I've also totally bypassed the LSI card.

I'm unsure whether it was power related or the LSI card was at fault, but it's currently online and ran a whole parity check, taking 18 hours, without any drive errors (quite a few parity errors, 500k or so, which I'm currently running a correcting check to fix).