Yet More Drive Issues - Any ideas?



Disk 2 has been disabled twice this week.

 

Both diagnostics are attached. The drive passes an extended SMART test, and I can't see any errors in the log apart from:

 

Jan 22 00:02:35 Tower kernel: sd 6:0:3:0: [sdl] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
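For anyone reading along, the `hostbyte` value in that line can be looked up against the host status codes defined in the Linux kernel's SCSI headers. This is a minimal Python sketch, assuming the log line format shown above; the table is only a subset of the kernel's `DID_*` codes, and the function name `decode_hostbyte` is purely illustrative.

```python
import re

# Subset of the Linux SCSI host status codes (the DID_* constants in the kernel headers).
HOST_BYTES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
}


def decode_hostbyte(log_line):
    """Return (raw value, symbolic name) for the hostbyte field of a kernel log line."""
    match = re.search(r"hostbyte=0x([0-9a-fA-F]+)", log_line)
    if match is None:
        raise ValueError("no hostbyte field in line")
    value = int(match.group(1), 16)
    return value, HOST_BYTES.get(value, "unknown")


line = ("Jan 22 00:02:35 Tower kernel: sd 6:0:3:0: [sdl] Synchronize Cache(10) "
        "failed: Result: hostbyte=0x01 driverbyte=0x00")
print(decode_hostbyte(line))
```

`0x01` maps to `DID_NO_CONNECT`, meaning the kernel could no longer reach the device when it tried to flush the cache. That would fit a drive dropping off the controller rather than a media error, although that reading is an interpretation and not something the log states outright.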

 

The first time it showed an irrationally high number of "Reads" as can be made out in the screenshot below (sorry about the red filter).

 

[Screenshot: Disk 2 showing the very high "Reads" count]

 

The first time, while the drive was disabled, I copied the emulated data to another drive while running an extended SMART test. Once it passed, I rebuilt to the same drive (note there was little to rebuild, as I had moved the data off).

 

2 evenings later the same disk goes down - albeit without the ridiculous number of reads.

 

The drive is attached to an LSI 8-port card. 3 other drives are on the same 4x SATA to SAS breakout cable. It has been like this for 6-8 weeks without any issues (on these drives/controller at least).

 

Expert Opinion please?

tower-diagnostics-20210122-1019.zip tower-diagnostics-20210120-0351.zip


@JorgeB Thanks for the reply, as always.

 

I haven't yet done what you suggested as the box is hard to get to and I have to empty a cupboard before even attempting to open it up. In the meantime I rebuilt to the same disk.

 

Alas it disabled itself again the next night!

 

However, what I've noticed is that all 3 of the error states have occurred at almost the same time...

 

20 JAN 01:17 - Tower: Alert [TOWER] - Disk 2 in error state (disk dsbl) SAMSUNG_HD103SJ_S2C8J9GZA00505 (sdl)

22 JAN 01:20 - Tower: Alert [TOWER] - Disk 2 in error state (disk dsbl) SAMSUNG_HD103SJ_S2C8J9GZA00505 (sdl)

24 JAN 01:22 - Tower: Alert [TOWER] - Disk 2 in error state (disk dsbl) SAMSUNG_HD103SJ_S2C8J9GZA00505 (sdl)

 

This can't be a coincidence. 
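To put a number on how tightly the three alerts cluster, here is a small Python sketch. It assumes the year is 2021 (taken from the diagnostics file names) and an English locale for the month abbreviations; the helper names are illustrative only.

```python
from datetime import datetime

# Alert timestamps copied from the three notifications above.
ALERTS = ["20 JAN 01:17", "22 JAN 01:20", "24 JAN 01:22"]


def parse_alert(stamp, year=2021):
    """Parse a 'DD MON HH:MM' notification stamp; the year is an assumption."""
    return datetime.strptime(f"{stamp} {year}", "%d %b %H:%M %Y")


times = [parse_alert(s) for s in ALERTS]

# Minutes after midnight for each alert, and the spread between earliest and latest.
minutes_after_midnight = [t.hour * 60 + t.minute for t in times]
spread = max(minutes_after_midnight) - min(minutes_after_midnight)

# Gap between consecutive alerts.
gaps = [later - earlier for earlier, later in zip(times, times[1:])]

print(minutes_after_midnight, spread)
for gap in gaps:
    print(gap)
```

All three alerts fall within a 5 minute band of the day. The two-day spacing probably says more about when each rebuild finished than about the fault itself, so the time of day is the more interesting signal.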

 

Log attached from the latest error. Any clues as to what the server is doing at this time every night?

 

Shouldn't be Mover, CA AppData Backup or SSD Trim. 

tower-diagnostics-20210125-1422.zip


Here is the Cron output

 

Linux 4.19.107-Unraid.
root@Tower:~# crontab -l
# If you don't want the output of a cron job mailed to you, you have to direct
# any output to /dev/null.  We'll do this here since these jobs should run
# properly on a newly installed system.  If a script fails, run-parts will
# mail a notice to root.
#
# Run the hourly, daily, weekly, and monthly cron jobs.
# Jobs that need different timing may be entered into the crontab as before,
# but most really don't need greater granularity than this.  If the exact
# times of the hourly, daily, weekly, and monthly cron jobs do not suit your
# needs, feel free to adjust them.
#
# Run hourly cron jobs at 47 minutes after the hour:
47 * * * * /usr/bin/run-parts /etc/cron.hourly 1> /dev/null
#
# Run daily cron jobs at 4:40 every day:
40 4 * * * /usr/bin/run-parts /etc/cron.daily 1> /dev/null
#
# Run weekly cron jobs at 4:30 on the first day of the week:
30 4 * * 0 /usr/bin/run-parts /etc/cron.weekly 1> /dev/null
#
# Run monthly cron jobs at 4:20 on the first day of the month:
20 4 1 * * /usr/bin/run-parts /etc/cron.monthly 1> /dev/null
0 1 * * 2 /usr/local/emhttp/plugins/ca.backup2/scripts/backup.php &>/dev/null 2>&1
root@Tower:~#

 

Any clues?

On 1/25/2021 at 10:45 PM, itimpi said:

Might you not also want the contents of the file /etc/cron.d/root  to see if that is running anything at those times?

 

# Generated docker monitoring schedule:
10 0 * * 1 /usr/local/emhttp/plugins/dynamix.docker.manager/scripts/dockerupdate.php check &> /dev/null

# Generated system monitoring schedule:
*/1 * * * * /usr/local/emhttp/plugins/dynamix/scripts/monitor &> /dev/null

# Generated mover schedule:
40 3 * * 5 /usr/local/sbin/mover &> /dev/null

# Generated parity check schedule:
0 0 1 * * /usr/local/sbin/mdcmd check NOCORRECT &> /dev/null

# Generated plugins version check schedule:
10 0 * * 1 /usr/local/emhttp/plugins/dynamix.plugin.manager/scripts/plugincheck &> /dev/null

# Generated Unraid OS update check schedule:
11 0 * * 1 /usr/local/emhttp/plugins/dynamix.plugin.manager/scripts/unraidcheck &> /dev/null

# Generated cron settings for docker autoupdates
0 0 * * 0 /usr/local/emhttp/plugins/ca.update.applications/scripts/updateDocker.php >/dev/null 2>&1
# Generated cron settings for plugin autoupdates
0 0 * * * /usr/local/emhttp/plugins/ca.update.applications/scripts/updateApplications.php >/dev/null 2>&1

# CRON for CA background scanning of applications
0 * * * * php /usr/local/emhttp/plugins/community.applications/scripts/notices.php > /dev/null 2>&1

# Generated ssd trim schedule:
0 2 * * 1 /sbin/fstrim -a -v | logger &> /dev/null

# Generated system data collection schedule:
*/1 * * * * /usr/local/emhttp/plugins/dynamix.system.stats/scripts/sa1 1 1 &> /dev/null

 

Any clues?
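One way to answer the "what runs at that time" question mechanically is to test every schedule from the two listings against a time window. The sketch below is a deliberately small cron matcher in Python, not a full cron implementation: it handles `*`, `*/n`, single numbers, ranges and comma lists, treats day-of-week `0` as Sunday, and does not handle `7` or month/day names. The commands are shortened to labels, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

# Schedules copied from `crontab -l` and /etc/cron.d/root above (commands shortened).
CRON_LINES = """\
47 * * * * run-parts /etc/cron.hourly
40 4 * * * run-parts /etc/cron.daily
30 4 * * 0 run-parts /etc/cron.weekly
20 4 1 * * run-parts /etc/cron.monthly
0 1 * * 2 ca.backup2 backup.php
10 0 * * 1 dockerupdate.php check
*/1 * * * * dynamix monitor
40 3 * * 5 mover
0 0 1 * * mdcmd check NOCORRECT
10 0 * * 1 plugincheck
11 0 * * 1 unraidcheck
0 0 * * 0 updateDocker.php
0 0 * * * updateApplications.php
0 * * * * notices.php
0 2 * * 1 fstrim -a -v
*/1 * * * * sa1 1 1
"""


def field_matches(field, value, low=0):
    """Match one cron field against a value; `low` is the field's minimum."""
    for part in field.split(","):
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if base == "*":
            start, end = low, value
        elif "-" in base:
            first, last = base.split("-")
            start, end = int(first), int(last)
        else:
            start = int(base)
            end = value if step_text else start
        if start <= value <= end and (value - start) % step == 0:
            return True
    return False


def job_fires(schedule, when):
    """True if the five-field schedule matches the given minute."""
    minute, hour, dom, month, dow = schedule
    dom_ok = field_matches(dom, when.day, low=1)
    dow_ok = field_matches(dow, when.isoweekday() % 7)  # cron: 0 = Sunday
    # Standard cron rule: if both day fields are restricted, either may match.
    day_ok = (dom_ok or dow_ok) if dom != "*" and dow != "*" else (dom_ok and dow_ok)
    return (field_matches(minute, when.minute) and field_matches(hour, when.hour)
            and field_matches(month, when.month, low=1) and day_ok)


def jobs_in_window(start, end):
    """Map each job label to the HH:MM times it fires between start and end."""
    hits = {}
    moment = start
    while moment <= end:
        for line in CRON_LINES.splitlines():
            parts = line.split(None, 5)
            if job_fires(parts[:5], moment):
                hits.setdefault(parts[5], []).append(moment.strftime("%H:%M"))
        moment += timedelta(minutes=1)
    return hits


if __name__ == "__main__":
    # Window around the alert on 20 Jan, then around the kernel message on 22 Jan.
    for start, end in [(datetime(2021, 1, 20, 1, 15), datetime(2021, 1, 20, 1, 25)),
                       (datetime(2021, 1, 22, 0, 0), datetime(2021, 1, 22, 0, 5))]:
        print(start, sorted(jobs_in_window(start, end)))
```

The second window is included because the kernel message quoted in the first post is stamped 00:02:35, more than an hour before the 01:20 alert that night, so the jobs firing at midnight may be at least as relevant as anything running around 01:20.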

  • 3 weeks later...

Woes continue.

 

Firstly, I remade the SATA power cables to this bank of drives as I didn't like them. I connected everything back up, did a short SMART test, and rebuilt to the same drive. That night the same drive was disabled at 01:40.

 

Then I switched the SAS port it was on and rebuilt to the same disk; that night the same drive was disabled at 01:42.

 

WTF is going on here? I'm happy to accept the drive might be bad despite passing SMART tests, but I don't believe it disabling itself at such consistent times is just a coincidence...

 

Latest diagnostics attached, but I doubt they will tell us anything new.

 

Shall I just give up and remove the drive?

tower-diagnostics-20210212-2340.zip

8 hours ago, air_marshall said:

but disabling itself at such consistent times I don't believe is just a coincidence....

You may have a ghost in the machine...

 

There are signs of the problem earlier:

 

Feb 12 00:01:28 Tower kernel: sd 10:0:2:0: attempting task abort! scmd(0000000003923bdb)
Feb 12 00:01:28 Tower kernel: sd 10:0:2:0: [sdk] tag#6275 CDB: opcode=0x85 85 09 0e 00 00 00 02 00 07 00 00 00 00 00 2f 00
Feb 12 00:01:28 Tower kernel: scsi target10:0:2: handle(0x000b), sas_address(0x4433221104000000), phy(4)
Feb 12 00:01:28 Tower kernel: scsi target10:0:2: enclosure logical id(0x500605b00991da10), slot(7)
Feb 12 00:01:32 Tower kernel: sd 10:0:2:0: task abort: SUCCESS scmd(0000000003923bdb)
Feb 12 00:02:08 Tower kernel: sd 10:0:2:0: device_block, handle(0x000b)
Feb 12 00:02:10 Tower kernel: sd 10:0:2:0: device_unblock and setting to running, handle(0x000b)
Feb 12 00:02:10 Tower kernel: sd 10:0:2:0: [sdk] Synchronizing SCSI cache
Feb 12 00:02:10 Tower kernel: sd 10:0:2:0: [sdk] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Feb 12 00:02:10 Tower kernel: mpt2sas_cm0: removing handle(0x000b), sas_addr(0x4433221104000000)
Feb 12 00:02:10 Tower kernel: mpt2sas_cm0: enclosure logical id(0x500605b00991da10), slot(7)

 

I would swap that disk with another from the onboard SATA controller to see if it changes anything.
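The command that was aborted in the excerpt above can be decoded from its CDB bytes. Opcode `0x85` is the SCSI ATA PASS-THROUGH (16) command, which wraps an ATA command for a SATA drive behind a SAS controller. The Python sketch below unpacks the fields using the byte layout from the SCSI/ATA Translation standard as I understand it; the lookup tables are small subsets and the function name is illustrative.

```python
# Subset of ATA PASS-THROUGH protocol values and ATA command codes.
PROTOCOLS = {3: "Non-data", 4: "PIO Data-In", 5: "PIO Data-Out", 6: "DMA"}
ATA_COMMANDS = {
    0x2F: "READ LOG EXT",
    0xB0: "SMART",
    0xE5: "CHECK POWER MODE",
    0xEC: "IDENTIFY DEVICE",
}


def decode_ata_passthrough16(cdb_hex):
    """Decode a 16-byte ATA PASS-THROUGH CDB given as space-separated hex."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x85:
        raise ValueError("not an ATA PASS-THROUGH (16) CDB")
    protocol = (cdb[1] >> 1) & 0x0F
    command = cdb[14]
    return {
        "protocol": PROTOCOLS.get(protocol, hex(protocol)),
        "extend": bool(cdb[1] & 0x01),
        "features": (cdb[3] << 8) | cdb[4],
        "count": (cdb[5] << 8) | cdb[6],
        "lba_low": cdb[8],   # for READ LOG EXT this byte carries the log address
        "device": cdb[13],
        "command": ATA_COMMANDS.get(command, hex(command)),
    }


# CDB bytes copied from the "tag#6275 CDB" log line above.
decoded = decode_ata_passthrough16("85 09 0e 00 00 00 02 00 07 00 00 00 00 00 2f 00")
for name, value in decoded.items():
    print(f"{name}: {value}")
```

The result is a `READ LOG EXT` request for log address `0x07`, which I believe is the extended SMART self-test log. If that is right, the command that timed out was a SMART-style status query rather than ordinary array I/O, which may be worth keeping in mind when looking at what polls the drives around midnight. Treat that as a lead to check, not a diagnosis.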


Thanks again @JorgeB, given time and case constraints I'll shrink the array for now and investigate further at another time.

 

PITA, that'll be 3 drives I've had to drop in as many months since my re-casing project 😞
