air_marshall Posted January 22, 2021

Disk 2 has been disabled twice this week. Both sets of diagnostics are attached. The drive passes an extended SMART test, and I can't see any errors in the log apart from:

    Jan 22 00:02:35 Tower kernel: sd 6:0:3:0: [sdl] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00

The first time, it showed an implausibly high number of "Reads", as can be made out in the screenshot below (sorry about the red filter). While the disk was disabled I copied the emulated data to another drive and ran an extended SMART test at the same time. Once the SMART test passed I rebuilt to the same drive (note there was little to rebuild, as I had moved the data off). Two evenings later the same disk went down again, albeit without the ridiculous number of reads.

The drive is attached to an LSI 8-port card. Three other drives are on the same 4 SATA to SAS cable. It has been like this for 6-8 weeks without any issues (on these drives/controller at least).

Expert opinion, please?

tower-diagnostics-20210122-1019.zip tower-diagnostics-20210120-0351.zip

JorgeB Posted January 22, 2021

The drive dropped offline both times, so there's no SMART report for it. Assuming the SMART report looks good, it's most likely a connection problem. Try swapping cables with another drive, on the same or on a different controller.

air_marshall Posted January 25, 2021 Author

@JorgeB Thanks for the reply, as always. I haven't yet done what you suggested, as the box is hard to get to and I have to empty a cupboard before even attempting to open it up. In the meantime I rebuilt to the same disk. Alas, it disabled itself again the next night!

However, what I've noticed is that all 3 of the error states have occurred at almost the same time:

    20 JAN 01:17 - Tower: Alert [TOWER] - Disk 2 in error state (disk dsbl) SAMSUNG_HD103SJ_S2C8J9GZA00505 (sdl)
    22 JAN 01:20 - Tower: Alert [TOWER] - Disk 2 in error state (disk dsbl) SAMSUNG_HD103SJ_S2C8J9GZA00505 (sdl)
    24 JAN 01:22 - Tower: Alert [TOWER] - Disk 2 in error state (disk dsbl) SAMSUNG_HD103SJ_S2C8J9GZA00505 (sdl)

This can't be a coincidence. The log from the latest error is attached. Any clues as to what the server is doing at this time every night? It shouldn't be Mover, CA AppData Backup or SSD Trim.

tower-diagnostics-20210125-1422.zip

trurl Posted January 25, 2021

6 minutes ago, air_marshall said:
> what the server is doing at this time every night?

What do you get from the command line with this?

    crontab -l

air_marshall Posted January 25, 2021 Author

Here is the cron output:

    Linux 4.19.107-Unraid.
    root@Tower:~# crontab -l
    # If you don't want the output of a cron job mailed to you, you have to direct
    # any output to /dev/null. We'll do this here since these jobs should run
    # properly on a newly installed system. If a script fails, run-parts will
    # mail a notice to root.
    #
    # Run the hourly, daily, weekly, and monthly cron jobs.
    # Jobs that need different timing may be entered into the crontab as before,
    # but most really don't need greater granularity than this. If the exact
    # times of the hourly, daily, weekly, and monthly cron jobs do not suit your
    # needs, feel free to adjust them.
    #
    # Run hourly cron jobs at 47 minutes after the hour:
    47 * * * * /usr/bin/run-parts /etc/cron.hourly 1> /dev/null
    #
    # Run daily cron jobs at 4:40 every day:
    40 4 * * * /usr/bin/run-parts /etc/cron.daily 1> /dev/null
    #
    # Run weekly cron jobs at 4:30 on the first day of the week:
    30 4 * * 0 /usr/bin/run-parts /etc/cron.weekly 1> /dev/null
    #
    # Run monthly cron jobs at 4:20 on the first day of the month:
    20 4 1 * * /usr/bin/run-parts /etc/cron.monthly 1> /dev/null
    0 1 * * 2 /usr/local/emhttp/plugins/ca.backup2/scripts/backup.php &>/dev/null 2>&1
    root@Tower:~#

Any clues?

itimpi Posted January 25, 2021

8 hours ago, trurl said:
> What do you get from the command line with this?
> crontab -l

Might you not also want the contents of the file /etc/cron.d/root, to see if that is running anything at those times?

air_marshall Posted January 28, 2021 Author

On 1/25/2021 at 10:45 PM, itimpi said:
> Might you not also want the contents of the file /etc/cron.d/root to see if that is running anything at those times?

    # Generated docker monitoring schedule:
    10 0 * * 1 /usr/local/emhttp/plugins/dynamix.docker.manager/scripts/dockerupdate.php check &> /dev/null
    # Generated system monitoring schedule:
    */1 * * * * /usr/local/emhttp/plugins/dynamix/scripts/monitor &> /dev/null
    # Generated mover schedule:
    40 3 * * 5 /usr/local/sbin/mover &> /dev/null
    # Generated parity check schedule:
    0 0 1 * * /usr/local/sbin/mdcmd check NOCORRECT &> /dev/null
    # Generated plugins version check schedule:
    10 0 * * 1 /usr/local/emhttp/plugins/dynamix.plugin.manager/scripts/plugincheck &> /dev/null
    # Generated Unraid OS update check schedule:
    11 0 * * 1 /usr/local/emhttp/plugins/dynamix.plugin.manager/scripts/unraidcheck &> /dev/null
    # Generated cron settings for docker autoupdates
    0 0 * * 0 /usr/local/emhttp/plugins/ca.update.applications/scripts/updateDocker.php >/dev/null 2>&1
    # Generated cron settings for plugin autoupdates
    0 0 * * * /usr/local/emhttp/plugins/ca.update.applications/scripts/updateApplications.php >/dev/null 2>&1
    # CRON for CA background scanning of applications
    0 * * * * php /usr/local/emhttp/plugins/community.applications/scripts/notices.php > /dev/null 2>&1
    # Generated ssd trim schedule:
    0 2 * * 1 /sbin/fstrim -a -v | logger &> /dev/null
    # Generated system data collection schedule:
    */1 * * * * /usr/local/emhttp/plugins/dynamix.system.stats/scripts/sa1 1 1 &> /dev/null

Any clues?

itimpi Posted January 28, 2021

Looks like the only thing scheduled to start around then is the CA Backup plug-in, scheduled to start at 1:00 am. That should not lead to your problem, though, unless it is triggering something non-obvious.

trurl Posted January 28, 2021

1 minute ago, itimpi said:
> CA Backup plug-in scheduled to start at 1:00 am

That runs on Tue; his syslog timestamps are on Wed.

I always just go to corntab.com instead of trying to remember how to parse these.

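For anyone who would rather check this offline, here is a minimal Python sketch of the same comparison. It is only an illustration: `describe_simple_cron` and `weekday_of` are hypothetical helper names, and the parser assumes the simple form seen in this thread (numeric minute and hour, day-of-week either a number or `*`). Ranges, lists, and step values are not handled.

```python
from datetime import datetime

# Cron day-of-week numbering: 0 (or 7) is Sunday, 1 is Monday, ... 6 is Saturday.
CRON_DOW = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]


def describe_simple_cron(entry: str) -> str:
    """Describe a cron entry with numeric minute/hour and a numeric or '*' day-of-week.

    Hypothetical helper for illustration only. The day-of-month and month
    fields are read but ignored, so this suits entries like "0 1 * * 2".
    """
    minute, hour, _dom, _month, dow = entry.split()[:5]
    when = f"{int(hour):02d}:{int(minute):02d}"
    day = "every day" if dow == "*" else CRON_DOW[int(dow) % 7]
    return f"{when} on {day}"


def weekday_of(stamp: str) -> str:
    """Return the weekday name for a 'YYYY-MM-DD HH:MM' timestamp."""
    parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
    # isoweekday() gives Mon=1 .. Sun=7; modulo 7 maps it onto cron numbering.
    return CRON_DOW[parsed.isoweekday() % 7]


# Schedule fields of the CA Backup line from `crontab -l` (command omitted).
backup_schedule = describe_simple_cron("0 1 * * 2")

# Alert times reported earlier in the thread.
alerts = ["2021-01-20 01:17", "2021-01-22 01:20", "2021-01-24 01:22"]
alert_days = [weekday_of(a) for a in alerts]

print("CA Backup runs at", backup_schedule)
print("Alerts fell on", alert_days)
```

None of the three alert dates falls on the backup's scheduled weekday, which supports the point that CA Backup is not what fires at that time.
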
air_marshall Posted February 12, 2021 Author

Woes continue. Firstly, I remade the SATA power cables to this bank of drives, as I didn't like them. I connected it back up, did a short SMART test, and rebuilt to the same drive. That night the same drive was disabled at 01:40am. Then I switched the SAS port it was on and rebuilt to the same disk; that night the same drive was disabled at 01:42am.

WTF is going on here? I'm happy to accept the drive might be bad despite passing SMART tests, but I don't believe that disabling itself at such consistent times is just a coincidence....

Latest diagnostics attached, but I doubt they tell us anything new. Shall I just give up and remove the drive?

tower-diagnostics-20210212-2340.zip

trurl Posted February 13, 2021

Nothing is assigned as disk1, is that expected?

Disk2 is disabled and doesn't appear to be connected, since there is no SMART report for it.

JorgeB Posted February 13, 2021

8 hours ago, air_marshall said:
> but disabling itself at such consistent times I don't believe is just a co-incidence....

You may have a ghost in the machine... There are signs of the problem earlier:

    Feb 12 00:01:28 Tower kernel: sd 10:0:2:0: attempting task abort! scmd(0000000003923bdb)
    Feb 12 00:01:28 Tower kernel: sd 10:0:2:0: [sdk] tag#6275 CDB: opcode=0x85 85 09 0e 00 00 00 02 00 07 00 00 00 00 00 2f 00
    Feb 12 00:01:28 Tower kernel: scsi target10:0:2: handle(0x000b), sas_address(0x4433221104000000), phy(4)
    Feb 12 00:01:28 Tower kernel: scsi target10:0:2: enclosure logical id(0x500605b00991da10), slot(7)
    Feb 12 00:01:32 Tower kernel: sd 10:0:2:0: task abort: SUCCESS scmd(0000000003923bdb)
    Feb 12 00:02:08 Tower kernel: sd 10:0:2:0: device_block, handle(0x000b)
    Feb 12 00:02:10 Tower kernel: sd 10:0:2:0: device_unblock and setting to running, handle(0x000b)
    Feb 12 00:02:10 Tower kernel: sd 10:0:2:0: [sdk] Synchronizing SCSI cache
    Feb 12 00:02:10 Tower kernel: sd 10:0:2:0: [sdk] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
    Feb 12 00:02:10 Tower kernel: mpt2sas_cm0: removing handle(0x000b), sas_addr(0x4433221104000000)
    Feb 12 00:02:10 Tower kernel: mpt2sas_cm0: enclosure logical id(0x500605b00991da10), slot(7)

I would swap that disk with another from the onboard SATA controller to see if it changes anything.

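Earlier signs like these are easy to miss in a long syslog, so here is a rough Python sketch for pulling them out. The marker strings are simply taken from the excerpt above and are an assumed, non-exhaustive list; `find_drop_events` is a hypothetical helper name, and the sample text is a shortened copy of the log lines in this post, not a full syslog. For context, `hostbyte=0x01` corresponds to `DID_NO_CONNECT` in the Linux SCSI layer, which fits a device that has gone offline.

```python
import re

# Shortened sample copied from the syslog excerpt in this thread.
SYSLOG = """\
Feb 12 00:01:28 Tower kernel: sd 10:0:2:0: attempting task abort! scmd(0000000003923bdb)
Feb 12 00:01:32 Tower kernel: sd 10:0:2:0: task abort: SUCCESS scmd(0000000003923bdb)
Feb 12 00:02:10 Tower kernel: sd 10:0:2:0: [sdk] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Feb 12 00:02:10 Tower kernel: mpt2sas_cm0: removing handle(0x000b), sas_addr(0x4433221104000000)
"""

# Substrings treated as signs of a device dropping (assumed list, not exhaustive).
MARKERS = (
    "attempting task abort",
    "Synchronize Cache(10) failed",
    "removing handle",
)

# Matches "Mon DD HH:MM:SS host kernel: message" as seen in the excerpt.
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: (?P<msg>.*)$"
)


def find_drop_events(text):
    """Return (timestamp, message) pairs for kernel lines containing a marker."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match and any(marker in match.group("msg") for marker in MARKERS):
            events.append((match.group("stamp"), match.group("msg")))
    return events


events = find_drop_events(SYSLOG)
for stamp, msg in events:
    print(stamp, "|", msg)
```

To run it against a real log, read the syslog file from the diagnostics zip into a string and pass that to `find_drop_events` in place of `SYSLOG`.
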
air_marshall Posted February 13, 2021 Author

11 hours ago, trurl said:
> Nothing is assigned as disk1, is that expected?
> Disk2 is disabled and doesn't appear to be connected since there is no SMART report for it.

I had to drop disk1 because it failed a while ago, and I shrank the array.

air_marshall Posted February 13, 2021 Author

5 hours ago, JorgeB said:
> You may have a ghost in the machine... There are signs of the problem earlier:
>
> Feb 12 00:01:28 Tower kernel: sd 10:0:2:0: attempting task abort! scmd(0000000003923bdb)
> Feb 12 00:01:28 Tower kernel: sd 10:0:2:0: [sdk] tag#6275 CDB: opcode=0x85 85 09 0e 00 00 00 02 00 07 00 00 00 00 00 2f 00
> Feb 12 00:01:28 Tower kernel: scsi target10:0:2: handle(0x000b), sas_address(0x4433221104000000), phy(4)
> Feb 12 00:01:28 Tower kernel: scsi target10:0:2: enclosure logical id(0x500605b00991da10), slot(7)
> Feb 12 00:01:32 Tower kernel: sd 10:0:2:0: task abort: SUCCESS scmd(0000000003923bdb)
> Feb 12 00:02:08 Tower kernel: sd 10:0:2:0: device_block, handle(0x000b)
> Feb 12 00:02:10 Tower kernel: sd 10:0:2:0: device_unblock and setting to running, handle(0x000b)
> Feb 12 00:02:10 Tower kernel: sd 10:0:2:0: [sdk] Synchronizing SCSI cache
> Feb 12 00:02:10 Tower kernel: sd 10:0:2:0: [sdk] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
> Feb 12 00:02:10 Tower kernel: mpt2sas_cm0: removing handle(0x000b), sas_addr(0x4433221104000000)
> Feb 12 00:02:10 Tower kernel: mpt2sas_cm0: enclosure logical id(0x500605b00991da10), slot(7)
>
> I would swap that disk with another from the onboard SATA controller to see if it changes anything.

Thanks again @JorgeB. Given time and case constraints, I'll shrink the array for now and investigate further at another time. PITA, that'll be 3 drives I've had to drop in as many months since my re-casing project 😞
