air_marshall

Disk Replacement - Data Rebuild w/Errors - Next Steps/Advice?

air_marshall replied to air_marshall's topic in General Support

Extended SMART report for sde attached. The GUI reports that the test completed without error. tower-smart-20240326-2039.zip

Disk Replacement - Data Rebuild w/Errors - Next Steps/Advice?

air_marshall replied to air_marshall's topic in General Support

Ok, extended sde SMART test is running. Will post results. Same for sdj, the disc I replaced in the array. Any ideas why there is such a gap in my logs? syslog-previous ends Mar 21 12:30:57. Any place else I can go looking for them? Log seems to be flooded with an nvidia error that I've not seen before!

Disk Replacement - Data Rebuild w/Errors - Next Steps/Advice?

air_marshall replied to air_marshall's topic in General Support

Started 22 Mar 1051. Finished 22 Mar 2254. No reboot or power cycle since!

Data-Rebuild
2024-03-22, 22:53:15 (Friday)
4 TB
12 hr, 2 min, 49 sec
92,2 MB/s
OK
41
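As a quick sanity check on the rebuild figures above, the reported speed can be recomputed from the size and duration. This is a minimal sketch, assuming "4 TB" means a decimal 4 × 10^12 bytes and "MB/s" means 10^6 bytes per second; the function name is made up for illustration.

```python
def rebuild_speed_mb_s(size_bytes: float, hours: int, minutes: int, seconds: int) -> float:
    """Average throughput in MB/s (decimal megabytes) for a rebuild."""
    elapsed_seconds = hours * 3600 + minutes * 60 + seconds
    return size_bytes / elapsed_seconds / 1e6


# Assumption: 4 TB is taken as 4e12 bytes; duration is the 12 hr, 2 min, 49 sec from the post.
speed = rebuild_speed_mb_s(4e12, 12, 2, 49)
print(f"{speed:.1f} MB/s")
```

The result lands on the same 92.2 MB/s shown in the history entry, so the rebuild ran at a normal average speed despite the 41 errors.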

air_marshall started following Yet More Drive Issues - Any ideas? and Disk Replacement - Data Rebuild w/Errors - Next Steps/Advice? March 26

Disk Replacement - Data Rebuild w/Errors - Next Steps/Advice?

air_marshall posted a topic in General Support

Hi Guys, Long story short, I had some 'thermal management' issues in my 'comms room' that may have pushed some of my older drives closer to EOL. Drive sdj threw a couple of SMART errors, so I set about replacing that one first with a pre-cleared 4TB drive, sdi. During the data rebuild, drive sde threw 41 read errors, which translated to 41 errors in the rebuild. The last parity check without errors was 07 Mar 24. See attached diagnostics. sdj is still attached to the machine and hasn't been touched since I removed it from the array. Data on that drive is largely expendable; the non-expendable data is backed up elsewhere. I have another 4TB drive waiting to be pre-cleared that is available either to replace another drive or to add to the array. What are my next steps before I make this situation worse? TIA tower-diagnostics-20240326-0923.zip

(6.12.3) USB Boot Drive Read Only after every reboot/shutdown

air_marshall replied to air_marshall's topic in General Support

Good thinking. Done. First reboot from the freshly prepared and restored stick has given the same result. Server GUI not accessible. Shares not mounted. SSH available. Looks like the USB is read-only again.

root@Tower:~# diagnostics
mkdir: cannot create directory ‘/boot/logs’: Input/output error
Starting diagnostics collection...
tail: cannot open '/boot/bz*.sha256' for reading: No such file or directory
sed: can't read /tower-diagnostics-20230829-0942/config/go.txt: No such file or directory
done.
ZIP file '/boot/logs/tower-diagnostics-20230829-0942.zip' created.

Last lines of syslog using:

Aug 29 09:42:42 Tower kernel: FAT-fs (sda1): Directory bread(block 30620) failed
Aug 29 09:42:42 Tower kernel: FAT-fs (sda1): Directory bread(block 30621) failed
Aug 29 09:42:42 Tower kernel: FAT-fs (sda1): Directory bread(block 30622) failed
Aug 29 09:42:42 Tower kernel: FAT-fs (sda1): Directory bread(block 30623) failed
Aug 29 09:42:42 Tower kernel: FAT-fs (sda1): Directory bread(block 30624) failed
Aug 29 09:42:42 Tower kernel: FAT-fs (sda1): Directory bread(block 30625) failed
Aug 29 09:42:42 Tower kernel: FAT-fs (sda1): Directory bread(block 30626) failed
Aug 29 09:42:42 Tower kernel: FAT-fs (sda1): Directory bread(block 30627) failed

Does this mean the USB is bad OR files in the USB config folder are corrupted? Diagnostics file attached from the last shutdown before erasing and reinstalling the USB. A USB 2 port is in use. Best path forward at this point? tower-diagnostics-20230829-0914-shutdown.zip
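When the syslog fills with FAT-fs lines like the ones quoted above, it can help to collapse them into block ranges to see whether the failures are contiguous. The following is a minimal sketch, not an Unraid tool: the function name is made up, and the regular expression assumes the exact "Directory bread(block N) failed" wording seen in this post.

```python
import re


def failed_block_ranges(log_text: str) -> list[tuple[int, int]]:
    """Collapse 'Directory bread(block N) failed' lines into (first, last) block ranges."""
    blocks = sorted({int(n) for n in re.findall(r"Directory bread\(block (\d+)\) failed", log_text)})
    ranges: list[tuple[int, int]] = []
    for block in blocks:
        if ranges and block == ranges[-1][1] + 1:
            # Extends the current run of consecutive blocks.
            ranges[-1] = (ranges[-1][0], block)
        else:
            ranges.append((block, block))
    return ranges


# Sample built from the block numbers quoted in the post (30620 to 30627).
sample = "\n".join(
    f"Aug 29 09:42:42 Tower kernel: FAT-fs (sda1): Directory bread(block {n}) failed"
    for n in range(30620, 30628)
)
print(failed_block_ranges(sample))
```

On the quoted lines this yields a single run of eight consecutive blocks. That alone does not say whether the flash device or the filesystem is at fault.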

(6.12.3) USB Boot Drive Read Only after every reboot/shutdown

air_marshall posted a topic in General Support

Almost every time I reboot or shut down, my server comes back up with no GUI and no shares mounted. SSH is available, and the 'diagnostics' command results in a 'directory doesn't exist' error. Running a chkdsk repair on the USB drive results in a successful boot. Diagnostics file is attached. I believe this one was captured automatically after installing Tips and Tweaks.

TL;DR Server was off for c. 12 months during our house renovation. Upon reboot I've had multiple issues. The first seemed to be related to docker not being terminated on array stop/reboot/shutdown. I found the forum post, used the commands to stop it and was then able to reboot/shut down cleanly. Server upgraded to latest stable. I set about reformatting my cache to ZFS using a spaceinvader video. I created a pool, but on copying back to the cache pool one of the drives had significant ATA issues, and that drive eventually disappeared (I suspect it went bad, but I've now pulled the SATA cable on it). So back to a single-drive cache.

Current situation is as above. I've carried out most suggestions from the 'Unclean Shutdown Thread': docker is disabled, no VMs, and the Tips & Tweaks plugin is installed to capture diagnostics on shutdown, with time-outs set at 7 mins. There isn't a delay in unmounting drives; however, I feel like something is hanging on reboot/shutdown that results in the USB drive being mounted "read-only" on almost every reboot. Fixed by pulling the USB and doing a repair in Windows, but obviously that's not sustainable. Any/all help is much appreciated. I'm a long-time Unraid user and these are the first stability issues I've experienced. TIA tower-diagnostics-20230827-1032.zip

Yet More Drive Issues - Any ideas?

air_marshall replied to air_marshall's topic in General Support

Thanks again @JorgeB, given time and case constraints I'll shrink the array for now and investigate further at another time. PITA, that'll be 3 drives I've had to drop in as many months since my re-casing project 😞

Yet More Drive Issues - Any ideas?

air_marshall replied to air_marshall's topic in General Support

I had to drop disk1 because it failed a while ago, and I shrank the array.

Yet More Drive Issues - Any ideas?

air_marshall replied to air_marshall's topic in General Support

Woes continue. Firstly I re-manufactured the SATA power cables to this bank of drives as I didn't like them. Connected it back up, did a short SMART test, rebuilt to the same drive. That night the same drive was disabled at 01:40. Then I switched the SAS port it was on and rebuilt to the same disk, and that night the same drive was disabled at 01:42. WTF is going on here? Happy to accept the drive might be bad despite passing SMART tests, but I don't believe it disabling itself at such consistent times is just a coincidence... Latest diagnostics attached, but I doubt they will tell us anything new. Shall I just give up and remove the drive? tower-diagnostics-20210212-2340.zip

Yet More Drive Issues - Any ideas?

air_marshall replied to air_marshall's topic in General Support

# Generated docker monitoring schedule:
10 0 * * 1 /usr/local/emhttp/plugins/dynamix.docker.manager/scripts/dockerupdate.php check &> /dev/null
# Generated system monitoring schedule:
*/1 * * * * /usr/local/emhttp/plugins/dynamix/scripts/monitor &> /dev/null
# Generated mover schedule:
40 3 * * 5 /usr/local/sbin/mover &> /dev/null
# Generated parity check schedule:
0 0 1 * * /usr/local/sbin/mdcmd check NOCORRECT &> /dev/null
# Generated plugins version check schedule:
10 0 * * 1 /usr/local/emhttp/plugins/dynamix.plugin.manager/scripts/plugincheck &> /dev/null
# Generated Unraid OS update check schedule:
11 0 * * 1 /usr/local/emhttp/plugins/dynamix.plugin.manager/scripts/unraidcheck &> /dev/null
# Generated cron settings for docker autoupdates
0 0 * * 0 /usr/local/emhttp/plugins/ca.update.applications/scripts/updateDocker.php >/dev/null 2>&1
# Generated cron settings for plugin autoupdates
0 0 * * * /usr/local/emhttp/plugins/ca.update.applications/scripts/updateApplications.php >/dev/null 2>&1
# CRON for CA background scanning of applications
0 * * * * php /usr/local/emhttp/plugins/community.applications/scripts/notices.php > /dev/null 2>&1
# Generated ssd trim schedule:
0 2 * * 1 /sbin/fstrim -a -v | logger &> /dev/null
# Generated system data collection schedule:
*/1 * * * * /usr/local/emhttp/plugins/dynamix.system.stats/scripts/sa1 1 1 &> /dev/null

Any clues?

Yet More Drive Issues - Any ideas?

air_marshall replied to air_marshall's topic in General Support

Here is the Cron output

Linux 4.19.107-Unraid.
root@Tower:~# crontab -l
# If you don't want the output of a cron job mailed to you, you have to direct
# any output to /dev/null. We'll do this here since these jobs should run
# properly on a newly installed system. If a script fails, run-parts will
# mail a notice to root.
#
# Run the hourly, daily, weekly, and monthly cron jobs.
# Jobs that need different timing may be entered into the crontab as before,
# but most really don't need greater granularity than this. If the exact
# times of the hourly, daily, weekly, and monthly cron jobs do not suit your
# needs, feel free to adjust them.
#
# Run hourly cron jobs at 47 minutes after the hour:
47 * * * * /usr/bin/run-parts /etc/cron.hourly 1> /dev/null
#
# Run daily cron jobs at 4:40 every day:
40 4 * * * /usr/bin/run-parts /etc/cron.daily 1> /dev/null
#
# Run weekly cron jobs at 4:30 on the first day of the week:
30 4 * * 0 /usr/bin/run-parts /etc/cron.weekly 1> /dev/null
#
# Run monthly cron jobs at 4:20 on the first day of the month:
20 4 1 * * /usr/bin/run-parts /etc/cron.monthly 1> /dev/null
0 1 * * 2 /usr/local/emhttp/plugins/ca.backup2/scripts/backup.php &>/dev/null 2>&1
root@Tower:~#

Any clues?
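One way to narrow down "what is the server doing at this time every night" is to test each schedule quoted in the two crontab posts against the window in which the disk was disabled (01:17 to 01:42 across the reports in this thread). The sketch below is a rough stand-in for a real cron parser. It only understands `*`, `*/n`, single numbers, ranges and comma lists, it looks at the minute and hour fields only (day-of-month and day-of-week are ignored), and the job labels and the 01:15 to 01:45 window are choices made for illustration.

```python
def field_matches(field: str, value: int) -> bool:
    """True if one cron field (supports *, */n, a, a-b, a-b/n, comma lists) matches value."""
    for part in field.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/")
            step = int(step_text)
        if part == "*":
            if value % step == 0:
                return True
        elif "-" in part:
            low, high = (int(x) for x in part.split("-"))
            if low <= value <= high and (value - low) % step == 0:
                return True
        elif int(part) == value:
            return True
    return False


def fires_in_window(schedule: str, start: tuple[int, int], end: tuple[int, int]) -> bool:
    """True if the minute/hour fields fire at least once between start and end (inclusive)."""
    minute_field, hour_field = schedule.split()[:2]
    for t in range(start[0] * 60 + start[1], end[0] * 60 + end[1] + 1):
        hour, minute = divmod(t, 60)
        if field_matches(hour_field, hour) and field_matches(minute_field, minute):
            return True
    return False


# Schedules copied from the two crontab listings; the labels are shorthand, not official names.
jobs = {
    "dockerupdate": "10 0 * * 1", "monitor": "*/1 * * * *", "mover": "40 3 * * 5",
    "parity check": "0 0 1 * *", "plugincheck": "10 0 * * 1", "unraidcheck": "11 0 * * 1",
    "updateDocker": "0 0 * * 0", "updateApplications": "0 0 * * *", "notices": "0 * * * *",
    "fstrim": "0 2 * * 1", "sa1": "*/1 * * * *", "cron.hourly": "47 * * * *",
    "cron.daily": "40 4 * * *", "cron.weekly": "30 4 * * 0", "cron.monthly": "20 4 1 * *",
    "ca.backup2": "0 1 * * 2",
}
candidates = [name for name, schedule in jobs.items() if fires_in_window(schedule, (1, 15), (1, 45))]
print(candidates)
```

With these listings only the two every-minute jobs fall inside the window, which hints that the trigger may not be one of the listed cron entries. Anything started from the /etc/cron.* directories, by a plugin, or by the drive and controller themselves is outside what this check can see.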

Yet More Drive Issues - Any ideas?

air_marshall replied to air_marshall's topic in General Support

@JorgeB Thanks for the reply, as always. I haven't yet done what you suggested as the box is hard to get to and I have to empty a cupboard before even attempting to open it up. In the meantime I rebuilt to the same disk. Alas, it disabled itself again the next night! However, what I've noticed is that all 3 of the error states have occurred at almost the same time...

20 JAN 01:17 - Tower: Alert [TOWER] - Disk 2 in error state (disk dsbl) SAMSUNG_HD103SJ_S2C8J9GZA00505 (sdl)
22 JAN 01:20 - Tower: Alert [TOWER] - Disk 2 in error state (disk dsbl) SAMSUNG_HD103SJ_S2C8J9GZA00505 (sdl)
24 JAN 01:22 - Tower: Alert [TOWER] - Disk 2 in error state (disk dsbl) SAMSUNG_HD103SJ_S2C8J9GZA00505 (sdl)

This can't be a coincidence. Log attached from the latest error. Any clues as to what the server is doing at this time every night? Shouldn't be Mover, CA AppData Backup or SSD Trim. tower-diagnostics-20210125-1422.zip

Yet More Drive Issues - Any ideas?

air_marshall posted a topic in General Support

Disk 2. Twice this week this drive has been disabled. Both diagnostics attached. It passes an extended SMART test, and I can't see any errors in the log apart from:

Jan 22 00:02:35 Tower kernel: sd 6:0:3:0: [sdl] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00

The first time it showed an irrationally high number of "Reads", as can be made out in the screenshot below (sorry about the red filter). That first time, whilst it was disabled, I copied the emulated data to another drive while running an extended SMART test. Once it passed, I rebuilt to the same drive (note there was little to rebuild as I had moved the data off). Two evenings later the same disk went down, albeit without the ridiculous number of reads. The drive is attached to an LSI 8-port card. 3 other drives are on the same 4 SATA to SAS cable. It's been like this for 6-8 weeks without any issues (on these drives/controller at least). Expert opinion please? tower-diagnostics-20210122-1019.zip tower-diagnostics-20210120-0351.zip

Multiple disk issues. Unsure of data integrity state. Best path forward.

air_marshall replied to air_marshall's topic in General Support

Thanks @JorgeB. I'll still have to rebuild the disk right? No way to avoid doing that? I'll replace the power cable. I could also put it on a different controller / port and see if it re-occurs at a later date. No critical data on it.

Multiple disk issues. Unsure of data integrity state. Best path forward.

air_marshall replied to air_marshall's topic in General Support

After rebuilding disk5 to the same disk and removing disk1 from the array, everything was stable and fine for a couple of weeks. A parity check has also completed since the parity rebuild. Today, when the server was again seemingly doing very little, I received an alert of read errors on the same disk as before, disk5. It's now back to disabled status. Here is a diagnostics report. I have NOT rebooted the server or done anything else this time, so hopefully I can get some more pointed assistance. I note this from the Disk Log Information for the disk:

Dec 7 21:50:59 Tower kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 7 21:50:59 Tower kernel: ata1.00: failed command: READ DMA EXT
Dec 7 21:50:59 Tower kernel: ata1.00: cmd 25/00:00:78:f9:37/00:01:ff:00:00/e0 tag 14 dma 131072 in
Dec 7 21:50:59 Tower kernel: ata1.00: status: { DRDY }
Dec 7 21:50:59 Tower kernel: ata1: hard resetting link
Dec 7 21:51:03 Tower kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 7 21:51:03 Tower kernel: ata1.00: configured for UDMA/133
Dec 7 21:51:03 Tower kernel: ata1: EH complete
Dec 11 04:46:44 Tower kernel: ata1: link is slow to respond, please be patient (ready=0)
Dec 11 04:46:49 Tower kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 11 04:46:49 Tower kernel: ata1.00: configured for UDMA/133
Dec 11 08:39:15 Tower kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 11 08:39:15 Tower kernel: ata1.00: failed command: READ DMA EXT
Dec 11 08:39:15 Tower kernel: ata1.00: cmd 25/00:08:d0:cf:11/00:00:fb:00:00/e0 tag 19 dma 4096 in
Dec 11 08:39:15 Tower kernel: ata1.00: status: { DRDY }
Dec 11 08:39:15 Tower kernel: ata1: hard resetting link
Dec 11 08:39:21 Tower kernel: ata1: link is slow to respond, please be patient (ready=0)
Dec 11 08:39:25 Tower kernel: ata1: COMRESET failed (errno=-16)
Dec 11 08:39:25 Tower kernel: ata1: hard resetting link
Dec 11 08:39:31 Tower kernel: ata1: link is slow to respond, please be patient (ready=0)
Dec 11 08:39:35 Tower kernel: ata1: COMRESET failed (errno=-16)
Dec 11 08:39:35 Tower kernel: ata1: hard resetting link
Dec 11 08:39:41 Tower kernel: ata1: link is slow to respond, please be patient (ready=0)
Dec 11 08:40:10 Tower kernel: ata1: COMRESET failed (errno=-16)
Dec 11 08:40:10 Tower kernel: ata1: limiting SATA link speed to 3.0 Gbps
Dec 11 08:40:10 Tower kernel: ata1: hard resetting link
Dec 11 08:40:15 Tower kernel: ata1: COMRESET failed (errno=-16)
Dec 11 08:40:15 Tower kernel: ata1: reset failed, giving up
Dec 11 08:40:15 Tower kernel: ata1.00: disabled
Dec 11 08:40:15 Tower kernel: ata1: EH complete
Dec 11 08:40:15 Tower kernel: sd 1:0:0:0: [sdb] tag#20 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Dec 11 08:40:15 Tower kernel: sd 1:0:0:0: [sdb] tag#20 CDB: opcode=0x88 88 00 00 00 00 00 fb 11 cf d0 00 00 00 08 00 00
Dec 11 08:40:15 Tower kernel: print_req_error: I/O error, dev sdb, sector 4212248528
Dec 11 08:40:23 Tower kernel: sd 1:0:0:0: [sdb] tag#23 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Dec 11 08:40:23 Tower kernel: sd 1:0:0:0: [sdb] tag#23 CDB: opcode=0x8a 8a 00 00 00 00 00 fb 11 cf d0 00 00 00 08 00 00
Dec 11 08:40:23 Tower kernel: print_req_error: I/O error, dev sdb, sector 4212248528

When I was removing disk1 I took the opportunity to check, replace and re-route the SATA cable for disk5. When visiting the Disk 5 Settings page, the system is unable to read SMART attributes or Disk Capabilities, and the disk cannot be spun up. So it seems it has also dropped offline on the controller. tower-diagnostics-20201211-1238.zip
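The two CDB lines in the log above can be decoded by hand to confirm they refer to the sector the kernel reports. This is a minimal sketch, assuming the standard 16-byte SCSI READ(16)/WRITE(16) layout (opcode in byte 0, big-endian LBA in bytes 2 to 9, transfer length in bytes 10 to 13); the function name is made up for illustration.

```python
def decode_rw16(cdb_hex: str) -> dict:
    """Decode a 16-byte SCSI READ(16)/WRITE(16) CDB given as space-separated hex bytes."""
    raw = bytes.fromhex(cdb_hex)  # bytes.fromhex skips the spaces between byte pairs
    opcodes = {0x88: "READ(16)", 0x8A: "WRITE(16)"}
    if len(raw) != 16 or raw[0] not in opcodes:
        raise ValueError("not a READ(16)/WRITE(16) CDB")
    return {
        "op": opcodes[raw[0]],
        "lba": int.from_bytes(raw[2:10], "big"),      # starting logical block address
        "blocks": int.from_bytes(raw[10:14], "big"),  # number of blocks requested
    }


# CDB bytes copied from the two log lines in the post.
read_cmd = decode_rw16("88 00 00 00 00 00 fb 11 cf d0 00 00 00 08 00 00")
write_cmd = decode_rw16("8a 00 00 00 00 00 fb 11 cf d0 00 00 00 08 00 00")
print(read_cmd)
print(write_cmd)
```

Both commands decode to LBA 4212248528 with a length of 8 blocks, which matches the "sector 4212248528" in the I/O error lines. A failed read followed by a failed write to the same sector is typically the point at which Unraid disables a disk, since a failed write means the disk can no longer be kept in sync with parity.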
