air_marshall




  1. Thanks again @JorgeB, given time and case constraints I'll shrink the array for now and investigate further at another time. PITA, that'll be 3 drives I've had to drop in as many months since my re-casing project 😞
  2. I had to drop disk1 because it failed a while ago and I shrank the array.
  3. Woes continue. First, I remade the SATA power cables to this bank of drives as I didn't like them. I connected it back up, ran a short SMART test and rebuilt to the same drive. That night the same drive was disabled at 01:40. Then I switched the SAS port it was on and rebuilt to the same disk, and that night the same drive was disabled at 01:42. WTF is going on here? I'm happy to accept the drive might be bad despite passing SMART tests, but I don't believe it disabling itself at such consistent times is just a coincidence. Latest diagnostics attached, but I doubt they will tell us anything new. Shall I just give up and remove the drive? tower-diagnostics-20210212-2340.zip
  4. # Generated docker monitoring schedule:
     10 0 * * 1 /usr/local/emhttp/plugins/dynamix.docker.manager/scripts/dockerupdate.php check &> /dev/null

     # Generated system monitoring schedule:
     */1 * * * * /usr/local/emhttp/plugins/dynamix/scripts/monitor &> /dev/null

     # Generated mover schedule:
     40 3 * * 5 /usr/local/sbin/mover &> /dev/null

     # Generated parity check schedule:
     0 0 1 * * /usr/local/sbin/mdcmd check NOCORRECT &> /dev/null

     # Generated plugins version check schedule:
     10 0 * * 1 /usr/local/emhttp/plugins/dynamix.plugin.manager/scripts/plugincheck &> /dev/null

     # Generated Unraid OS update check schedule:
     11 0 * * 1 /usr/local/emhttp/plugins/dynamix.plugin.manager/scripts/unraidcheck &> /dev/null

     # Generated cron settings for docker autoupdates
     0 0 * * 0 /usr/local/emhttp/plugins/ca.update.applications/scripts/updateDocker.php >/dev/null 2>&1

     # Generated cron settings for plugin autoupdates
     0 0 * * * /usr/local/emhttp/plugins/ca.update.applications/scripts/updateApplications.php >/dev/null 2>&1

     # CRON for CA background scanning of applications
     0 * * * * php /usr/local/emhttp/plugins/community.applications/scripts/notices.php > /dev/null 2>&1

     # Generated ssd trim schedule:
     0 2 * * 1 /sbin/fstrim -a -v | logger &> /dev/null

     # Generated system data collection schedule:
     */1 * * * * /usr/local/emhttp/plugins/dynamix.system.stats/scripts/sa1 1 1 &> /dev/null

     Any clues?
  5. Here is the cron output.

     Linux 4.19.107-Unraid.
     root@Tower:~# crontab -l
     # If you don't want the output of a cron job mailed to you, you have to direct
     # any output to /dev/null. We'll do this here since these jobs should run
     # properly on a newly installed system. If a script fails, run-parts will
     # mail a notice to root.
     #
     # Run the hourly, daily, weekly, and monthly cron jobs.
     # Jobs that need different timing may be entered into the crontab as before,
     # but most really don't need greater granularity than this. If the exact
     # times of the hourly, daily, weekly, and monthly cron jobs do not suit your
     # needs, feel free to adjust them.
     #
     # Run hourly cron jobs at 47 minutes after the hour:
     47 * * * * /usr/bin/run-parts /etc/cron.hourly 1> /dev/null
     #
     # Run daily cron jobs at 4:40 every day:
     40 4 * * * /usr/bin/run-parts /etc/cron.daily 1> /dev/null
     #
     # Run weekly cron jobs at 4:30 on the first day of the week:
     30 4 * * 0 /usr/bin/run-parts /etc/cron.weekly 1> /dev/null
     #
     # Run monthly cron jobs at 4:20 on the first day of the month:
     20 4 1 * * /usr/bin/run-parts /etc/cron.monthly 1> /dev/null
     0 1 * * 2 /usr/local/emhttp/plugins/ca.backup2/scripts/backup.php &>/dev/null 2>&1
     root@Tower:~#

     Any clues?
  6. @JorgeB Thanks for the reply, as always. I haven't yet done what you suggested as the box is hard to get to and I have to empty a cupboard before even attempting to open it up. In the meantime I rebuilt to the same disk. Alas, it disabled itself again the next night! However, what I've noticed is that all 3 of the error states have occurred at almost the same time:

     20 JAN 01:17 - Tower: Alert [TOWER] - Disk 2 in error state (disk dsbl) SAMSUNG_HD103SJ_S2C8J9GZA00505 (sdl)
     22 JAN 01:20 - Tower: Alert [TOWER] - Disk 2 in error state (disk dsbl) SAMSUNG_HD103SJ_S2C8J9GZA00505 (sdl)
     24 JAN 01:22 - Tower: Alert [TOWER] - Disk 2 in error state (disk dsbl) SAMSUNG_HD103SJ_S2C8J9GZA00505 (sdl)

     This can't be a coincidence. Log attached from the latest error. Any clues as to what the server is doing at this time every night? It shouldn't be Mover, CA AppData Backup or SSD Trim. tower-diagnostics-20210125-1422.zip
  7. Disk 2. Twice this week this drive has been disabled. Both diagnostics attached. It passes an extended SMART test, and I can't see any errors in the log apart from:

     Jan 22 00:02:35 Tower kernel: sd 6:0:3:0: [sdl] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00

     The first time it showed an irrationally high number of "Reads", as can be made out in the screenshot below (sorry about the red filter). That first time, whilst it was disabled, I copied the emulated data to another drive whilst running an extended SMART test. When the SMART test passed I rebuilt to the same drive (note there was little to rebuild as I had moved the data off). 2 evenings later the same disk went down, albeit without the ridiculous number of reads. The drive is attached to an LSI 8-port card. 3 other drives are on the same 4 SATA to SAS cable. It has been like this for 6-8 weeks without any issues (on these drives/controller at least). Expert opinion please? tower-diagnostics-20210122-1019.zip tower-diagnostics-20210120-0351.zip
  8. Thanks @JorgeB. I'll still have to rebuild the disk right? No way to avoid doing that? I'll replace the power cable. I could also put it on a different controller / port and see if it re-occurs at a later date. No critical data on it.
  9. Having rebuilt to the same disk for disk5 and removed disk1 from the array, all has been stable and fine for a couple of weeks. A parity check has also completed since the parity rebuild. Today, when the server was again seemingly doing very little, I received an alert of read errors on the same disk as before, disk5. It's now back to the disabled status. Here is a diagnostics report. I have NOT rebooted the server or done anything else this time, so hopefully I can get some more pointed assistance. I note this from the Disk Log Information for the disk:

     Dec 7 21:50:59 Tower kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
     Dec 7 21:50:59 Tower kernel: ata1.00: failed command: READ DMA EXT
     Dec 7 21:50:59 Tower kernel: ata1.00: cmd 25/00:00:78:f9:37/00:01:ff:00:00/e0 tag 14 dma 131072 in
     Dec 7 21:50:59 Tower kernel: ata1.00: status: { DRDY }
     Dec 7 21:50:59 Tower kernel: ata1: hard resetting link
     Dec 7 21:51:03 Tower kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
     Dec 7 21:51:03 Tower kernel: ata1.00: configured for UDMA/133
     Dec 7 21:51:03 Tower kernel: ata1: EH complete
     Dec 11 04:46:44 Tower kernel: ata1: link is slow to respond, please be patient (ready=0)
     Dec 11 04:46:49 Tower kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
     Dec 11 04:46:49 Tower kernel: ata1.00: configured for UDMA/133
     Dec 11 08:39:15 Tower kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
     Dec 11 08:39:15 Tower kernel: ata1.00: failed command: READ DMA EXT
     Dec 11 08:39:15 Tower kernel: ata1.00: cmd 25/00:08:d0:cf:11/00:00:fb:00:00/e0 tag 19 dma 4096 in
     Dec 11 08:39:15 Tower kernel: ata1.00: status: { DRDY }
     Dec 11 08:39:15 Tower kernel: ata1: hard resetting link
     Dec 11 08:39:21 Tower kernel: ata1: link is slow to respond, please be patient (ready=0)
     Dec 11 08:39:25 Tower kernel: ata1: COMRESET failed (errno=-16)
     Dec 11 08:39:25 Tower kernel: ata1: hard resetting link
     Dec 11 08:39:31 Tower kernel: ata1: link is slow to respond, please be patient (ready=0)
     Dec 11 08:39:35 Tower kernel: ata1: COMRESET failed (errno=-16)
     Dec 11 08:39:35 Tower kernel: ata1: hard resetting link
     Dec 11 08:39:41 Tower kernel: ata1: link is slow to respond, please be patient (ready=0)
     Dec 11 08:40:10 Tower kernel: ata1: COMRESET failed (errno=-16)
     Dec 11 08:40:10 Tower kernel: ata1: limiting SATA link speed to 3.0 Gbps
     Dec 11 08:40:10 Tower kernel: ata1: hard resetting link
     Dec 11 08:40:15 Tower kernel: ata1: COMRESET failed (errno=-16)
     Dec 11 08:40:15 Tower kernel: ata1: reset failed, giving up
     Dec 11 08:40:15 Tower kernel: ata1.00: disabled
     Dec 11 08:40:15 Tower kernel: ata1: EH complete
     Dec 11 08:40:15 Tower kernel: sd 1:0:0:0: [sdb] tag#20 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
     Dec 11 08:40:15 Tower kernel: sd 1:0:0:0: [sdb] tag#20 CDB: opcode=0x88 88 00 00 00 00 00 fb 11 cf d0 00 00 00 08 00 00
     Dec 11 08:40:15 Tower kernel: print_req_error: I/O error, dev sdb, sector 4212248528
     Dec 11 08:40:23 Tower kernel: sd 1:0:0:0: [sdb] tag#23 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
     Dec 11 08:40:23 Tower kernel: sd 1:0:0:0: [sdb] tag#23 CDB: opcode=0x8a 8a 00 00 00 00 00 fb 11 cf d0 00 00 00 08 00 00
     Dec 11 08:40:23 Tower kernel: print_req_error: I/O error, dev sdb, sector 4212248528

     When I was removing disk1 I took the opportunity to check, replace and re-route the SATA cable for disk5. When visiting the Disk 5 Settings page, the system is unable to read SMART attributes or Disk Capabilities, and the disk cannot be spun up. So it seems it has also dropped offline on the controller. tower-diagnostics-20201211-1238.zip
  10. does docker-compose persist after reboot if you install it this way?
  11. So I copied the emulated contents of disk5 off the array onto a USB drive; no faults or errors occurred during this process. Having completed another extended SMART test on disk5 without issue, and given that there was no critical data on disk5 and I had no spare disk and was not willing to spend on one, I elected to rebuild to the same drive. This completed, but with errors concurrent with read errors from disk1. Why has this happened during the rebuild but not during the copy of the same emulated data beforehand? Realistically, how much data are we talking about as being corrupt? The latest diagnostics are attached detailing the sector errors, which I should add are now increasing with drive usage, so disk1 is definitely on the way out. Regarding data on disk1: if parity data is not part of the corrupted rebuild data on disk5 (I doubt there is any way of confirming this), I should be able to either a) replace disk1 and rebuild without data corruption, or b) (if I am happy to shrink the array, which I am) unassign disk1, copy the emulated data to another drive without data corruption, and then shrink the array using the methods in the documentation. Should I run a non-correcting parity check first? There is some critical data on disk1; I have already copied that off to an external source with the drive still assigned and mounted, and there were no errors during this copy process. tower-diagnostics-20201121-1408.zip
  12. Ok thanks. I'm still trying to comprehend the situation here. So in all cases I should address disk5 first, regardless of the SMART errors on disk1? Errors during the rebuild of disk5 may result from the SMART issues with disk1. Is that correct? Therefore, if I retain disk5 in its current state, I may be able to rescue data from it if required? If disk5 is currently mounted, what is wrong with just copying all the data off it? Or is the data actually reconstructed from parity despite the mount? IF I get a successful rebuild of disk5, what do I need to do to address disk1 if I am happy to make it redundant?
  13. Thank you for trying to help me. I think I'm struggling for drives in order to do this, and having just spent on 2x4TB units I'm not inclined to get another. I have a spare 1TB 2.5" spinner that I can use and that hasn't been used much; at the moment I've been using it externally to copy the critical data off disk1 and disk5. I also have a spare, slow, old 500GB 2.5" spinner that used to be my cache before I upgraded to an SSD for this purpose. I can probably also free up disk9, a new 4TB unit. Should I attempt to use ddrescue before attempting a rebuild to disk5? I will set about copying everything off disk5 now to the external 1TB drive I have in the meantime. At least then I have it, so if there are any issues with a rebuild I might be in a better place.
  14. I'm not sure what you mean. Disk 5 still shows as disabled but all data remains available due to emulation. /mnt/disk5 is also listed and available. There is only 300GB on disk 5, so I could transfer it off to a USB drive before attempting a rebuild to the same 4TB drive. Regarding issues with the rebuild, could these result from the SMART errors on disk1 that have also appeared? Thank you for trying to help me understand this situation.
  15. Thanks @JorgeB. The extended SMART test passed OK and I did a parity swap followed by a rebuild and all was well. I now have bigger issues! See my other post! Eeeeeek
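
The question in posts 4 to 6 (which scheduled job, if any, runs shortly before the 01:17 to 01:22 alerts) can be checked mechanically against the two crontab listings. The following is only a minimal sketch, not something taken from the thread: the job labels, the function names and the 30-minute look-back window are my own, the year 2021 is inferred from the diagnostics filenames, and the matcher handles just the basic cron syntax that appears in these listings (`*`, `*/n`, single values, ranges and comma lists).

```python
from datetime import datetime, timedelta

# Schedules copied from the crontab listings in posts 4 and 5.
# The labels are illustrative names, not Unraid identifiers.
SCHEDULES = {
    "docker update check": "10 0 * * 1",
    "system monitor": "*/1 * * * *",
    "mover": "40 3 * * 5",
    "parity check": "0 0 1 * *",
    "plugin version check": "10 0 * * 1",
    "Unraid OS update check": "11 0 * * 1",
    "docker autoupdate": "0 0 * * 0",
    "plugin autoupdate": "0 0 * * *",
    "CA background scan": "0 * * * *",
    "ssd trim": "0 2 * * 1",
    "system stats (sa1)": "*/1 * * * *",
    "cron.hourly": "47 * * * *",
    "cron.daily": "40 4 * * *",
    "cron.weekly": "30 4 * * 0",
    "cron.monthly": "20 4 1 * *",
    "CA backup": "0 1 * * 2",
}


def field_matches(field, value, lowest=0):
    """Return True if one cron field ('*', '*/5', '1-3', '0,30') matches value."""
    for part in field.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/")
            step = int(step_text)
        if part == "*":
            start, end = lowest, None
        elif "-" in part:
            low, high = part.split("-")
            start, end = int(low), int(high)
        else:
            start = end = int(part)
        in_range = value >= start and (end is None or value <= end)
        if in_range and (value - start) % step == 0:
            return True
    return False


def cron_matches(expr, moment):
    """Return True if a five-field cron expression fires at the given minute."""
    minute, hour, dom, month, dow = expr.split()
    cron_dow = (moment.weekday() + 1) % 7  # cron: 0 = Sunday; Python: 0 = Monday
    time_ok = (
        field_matches(minute, moment.minute)
        and field_matches(hour, moment.hour)
        and field_matches(month, moment.month, lowest=1)
    )
    dom_ok = field_matches(dom, moment.day, lowest=1)
    # Cron also accepts 7 for Sunday.
    dow_ok = field_matches(dow, cron_dow) or (cron_dow == 0 and field_matches(dow, 7))
    # When both day fields are restricted, cron fires if either one matches.
    day_ok = (dom_ok or dow_ok) if dom != "*" and dow != "*" else (dom_ok and dow_ok)
    return time_ok and day_ok


def jobs_before(alert, window_minutes=30):
    """List the labels of jobs that fire in the window_minutes before an alert."""
    fired = set()
    for offset in range(window_minutes + 1):
        moment = alert - timedelta(minutes=offset)
        for label, expr in SCHEDULES.items():
            if cron_matches(expr, moment):
                fired.add(label)
    return sorted(fired)


# Alert times from post 6 (year assumed to be 2021).
ALERTS = [
    datetime(2021, 1, 20, 1, 17),
    datetime(2021, 1, 22, 1, 20),
    datetime(2021, 1, 24, 1, 22),
]

for alert in ALERTS:
    print(alert.strftime("%a %d %b %H:%M"), jobs_before(alert))
```

Under these assumptions the weekly CA backup entry (`0 1 * * 2`) fires on Tuesdays only, while the three alerts fall on a Wednesday, a Friday and a Sunday, so it would not explain the pattern. The check covers only the jobs in these two listings; anything scheduled inside a Docker container, a VM or a plugin's own timer would not show up here.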