Hey there, I have been using Unraid for about two months. For about a week now, some errors have been occurring.

Two weeks ago I installed a new drive that I had precleared beforehand, and I think everything went fine. I do not know why, but the server had at least one unclean shutdown, my cache SSD showed a SMART warning, and I think there was one error in the Main tab as well.

I moved the server into another room to do a parity check. I had problems accessing the webUI, which is why I had to do a lot of unclean shutdowns (my bad). After I fixed the network problems I ran a non-correcting parity check, and it showed around 41560 sync errors in the first 2-4 hours. After that there were no more sync errors. I thought there was a problem with a SATA cable, so I installed a new one. Afterwards I ran a correcting parity check that corrected all sync errors, and then another non-correcting one that showed no more errors.

Some days ago I used unBALANCE to move some files (1.77 TB) to another disk. Yesterday I was streaming a movie from the server and had some dropped frames (could be madVR or something else), but it had me worried, so I checked the server afterwards and saw an absurd number of reads and writes on my disks:

Parity drive: around 3.2 million reads and 3.2 million writes
First data drive: 37 million reads and 2000 writes
Second data drive: 16.5 million reads and 3.5 million writes
Cache: 60000 reads and 5.1 million writes

This morning the cache had 500000 more writes than yesterday afternoon. The cache trims at 7 am, and I logged into the server at 7:11 am and 7:25 am.
At 7:28 am this red message came in, followed by some others:
Tower kernel: ata3.00: exception Emask 0x10 SAct 0x2 SErr 0x4090000 action 0xe frozen
Stupidly, I shut down the server to change the SATA cable again without saving the diagnostics. I changed the cable and even moved the disk to another SATA port on the motherboard. I powered the server back on, and while I was looking at the webUI the server had another unclean shutdown. After that it came back up and started a parity check, which I aborted.
I looked into the syslog and saw a similar red message again:
Sep 17 10:28:17 Tower kernel: ata4.00: exception Emask 0x50 SAct 0x20000000 SErr 0x4890800 action 0xe frozen
Sep 17 10:28:17 Tower kernel: ata4.00: irq_stat 0x0c400040, interface fatal error, connection status changed
Now it is ata4, and I am quite sure that this is the port I moved the data drive to. Does that mean my drive is dying, or is it another faulty cable?
What should my next steps be? I would really hate to lose any of my data.
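In case it helps with reading the two messages, here is a minimal sketch that splits the SErr values from the log lines above into their individual flags. The bit positions and short names are an assumption on my part, based on the SError flag names used by the Linux libata driver; the helper name decode_serr is made up for this post. It only lists which flags are set and does not say whether the drive, the cable, or the port is at fault.

```python
# SError bit positions and short names.
# Assumption: these match the flag names used by Linux libata.
SERR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serr(value):
    """Return the names of all known SError flags set in value, lowest bit first."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


if __name__ == "__main__":
    # The two SErr values copied from the syslog lines in this post.
    for raw in ("0x4090000", "0x4890800"):
        print(raw, decode_serr(int(raw, 16)))
```

With that table, 0x4090000 (ata3) breaks down into PHYRdyChg, 10B8B and DevExch, and 0x4890800 (ata4) into HostInt, PHYRdyChg, 10B8B, LinkSeq and DevExch.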
tower-diagnostics-20200917-1041.zip