Lots of problems with my server

silou · September 17, 2020

Hey there, I have been using Unraid for about two months. For about a week now, some errors have been occurring. Two weeks ago I installed a new drive that I had precleared beforehand, and I think everything went fine. I do not know why, but the server had at least one unclean shutdown, my cache SSD showed a SMART warning, and I think there was one error in the Main tab as well.

I moved the server into another room to do a parity check. I had problems accessing the webUI, which is why I had to do a lot of unclean shutdowns (my bad). After I fixed the network problems I ran a non-correcting parity check, and it showed around 41560 sync errors in the first 2-4 hours. After that there were no more sync errors. I thought there was a problem with a SATA cable, so I installed a new one. Afterwards I ran a correcting parity check that corrected all sync errors, and another non-correcting one that showed no more errors.

Some days ago I used unbalance to move some files (1.77 TB) to another disk. Yesterday I was streaming a movie from the server and had some dropped frames (could be madVR or something else), but it had me worried, so I checked the server afterwards and saw an absurd number of reads and writes on my disks. The parity drive had around 3.2 million reads and writes, the first data drive had 37 million reads and 2000 writes, the second data drive had 16.5 million reads and 3.5 million writes, and the cache had 60000 reads and 5.1 million writes. This morning the cache had 500000 more writes than yesterday afternoon. The cache trims at 7 am, and I logged into the server at 7:11 am and 7:25 am.

At 7:28 am this red message came in, followed by some others: Tower kernel: ata3.00: exception Emask 0x10 SAct 0x2 SErr 0x4090000 action 0xe frozen

Stupidly, I shut down the server to change the SATA cable again without saving the diagnostics. So I changed the cable and even moved the disk to another SATA port on the motherboard. I powered the server back on, and while I was looking at the webUI the server had another unclean shutdown. After that it came back up and started a parity check, which I aborted.

I looked into the syslog and saw a similar red message again:

Sep 17 10:28:17 Tower kernel: ata4.00: exception Emask 0x50 SAct 0x20000000 SErr 0x4890800 action 0xe frozen

Sep 17 10:28:17 Tower kernel: ata4.00: irq_stat 0x0c400040, interface fatal error, connection status changed
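For anyone comparing messages like these across reboots, here is a minimal Python sketch that pulls the port and the hex fields out of the exception lines and counts them per port. The regular expression is modelled only on the lines quoted in this thread, and the function names `parse_ata_exceptions` and `count_by_port` are made up for illustration. If the errors follow the disk to a new port, as they appear to here (ata3 before the move, ata4 after), that hints at something travelling with the disk rather than at one motherboard port, but the counts alone do not prove it.

```python
import re
from collections import Counter

# Pattern modelled on the libata exception lines quoted in this thread
# (an assumption: other kernels or controllers may format them differently).
ATA_EXCEPTION = re.compile(
    r"kernel: (?P<port>ata\d+)(?:\.\d+)?: exception "
    r"Emask (?P<emask>0x[0-9a-fA-F]+) "
    r"SAct (?P<sact>0x[0-9a-fA-F]+) "
    r"SErr (?P<serr>0x[0-9a-fA-F]+)"
)


def parse_ata_exceptions(lines):
    """Return one dict per exception line, with the hex fields as ints."""
    events = []
    for line in lines:
        match = ATA_EXCEPTION.search(line)
        if match:
            events.append({
                "port": match.group("port"),
                "emask": int(match.group("emask"), 16),
                "sact": int(match.group("sact"), 16),
                "serr": int(match.group("serr"), 16),
            })
    return events


def count_by_port(events):
    """Count exception events per ATA port name (e.g. 'ata4')."""
    return Counter(event["port"] for event in events)


# The three syslog lines quoted in the posts above.
SAMPLE = [
    "Tower kernel: ata3.00: exception Emask 0x10 SAct 0x2 SErr 0x4090000 action 0xe frozen",
    "Sep 17 10:28:17 Tower kernel: ata4.00: exception Emask 0x50 SAct 0x20000000 SErr 0x4890800 action 0xe frozen",
    "Sep 17 10:28:17 Tower kernel: ata4.00: irq_stat 0x0c400040, interface fatal error, connection status changed",
]

if __name__ == "__main__":
    for port, count in sorted(count_by_port(parse_ata_exceptions(SAMPLE)).items()):
        print(port, count)
```

The `irq_stat` line is skipped on purpose, since it carries no Emask/SAct/SErr fields.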

Now it is ata4, and I am quite sure that this is the port I moved the data drive to. Does that mean my drive is dying, or is it another faulty cable?

What should my next steps be? I would really hate to lose any of my data.

tower-diagnostics-20200917-1041.zip

JorgeB · September 17, 2020

Replace/swap both power and SATA cables on disk1.

silou · September 17, 2020

I use a PicoPSU with a SATA Y-splitter. I have another one lying around. Will try and report back. Thanks!

silou · September 17, 2020

Changed the power and SATA cables and now it seems to work. At least there is no red warning yet. How should I proceed? Wait a bit and start a correcting parity check?

JorgeB · September 17, 2020

Run a non-correcting check for a little while, half an hour or so. If everything is OK, stop it and start a correcting check.

silou · September 17, 2020

I was about to do that, but as soon as I clicked on start, the server shut down. The server is powered by a 90W PicoPSU with a 72W Salcar power supply. All the drives were spun down, so I suspect that the power supply cannot handle all the drives spinning up at the same time. Time to get a new power supply, or is something else at fault?
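The suspicion above can be checked with rough arithmetic. The following is only a back-of-the-envelope Python sketch: the 90 W and 72 W figures come from the post, while the drive count, the per-drive spin-up peak, and the base load of the rest of the system are placeholder assumptions that should be replaced with values from the drive datasheets and a real measurement. `spinup_headroom_watts` is a hypothetical helper name.

```python
def spinup_headroom_watts(supply_limit_w, drive_count, spinup_w_per_drive, base_load_w):
    """Watts left over when every drive spins up at once (negative means overloaded)."""
    peak_w = base_load_w + drive_count * spinup_w_per_drive
    return supply_limit_w - peak_w


# From the post: a 90 W PicoPSU fed by a 72 W brick.
# The smaller of the two figures is the effective ceiling.
SUPPLY_LIMIT_W = min(90, 72)

# ASSUMPTIONS, not from the thread: 3 spinning drives (parity plus two data),
# a 25 W spin-up peak per drive, and 30 W for board, CPU and SSD.
ASSUMED_DRIVES = 3
ASSUMED_SPINUP_W = 25.0
ASSUMED_BASE_LOAD_W = 30.0

if __name__ == "__main__":
    headroom = spinup_headroom_watts(
        SUPPLY_LIMIT_W, ASSUMED_DRIVES, ASSUMED_SPINUP_W, ASSUMED_BASE_LOAD_W
    )
    print(f"Headroom at simultaneous spin-up: {headroom:.0f} W")
```

With these placeholder numbers the result is negative, which would fit a shutdown at the moment all drives spin up. With different assumptions the conclusion can change, so treat it as a way to frame the question, not as a measurement.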

JonathanM · September 17, 2020

1 hour ago, silou said:

Time to get a new power supply

This.

silou · September 18, 2020

Parity check is done. 14810 errors were found and corrected. Should I run another one or get a new power supply first?

silou · September 18, 2020

Another error occurred 10 minutes ago... Should I take the SSD out? I have attached the SMART report. The appdata, domains and system folders are stored on the SSD. How do I back them up correctly?

Unraid Cache disk SMART health [197]: 18-09-2020 13:44

Warning [TOWER] - current pending sector is 1179648
Intenso_SSD_Sata_III_AA000000000000004083 (sdb)

Edit: Half an hour later I got this message:

Unraid Cache disk SMART message [197]: 18-09-2020 14:14

Notice [TOWER] - current pending sector returned to normal value
Intenso_SSD_Sata_III_AA000000000000004083 (sdb)

I backed up the appdata folder and the flash drive with the CA Appdata Backup plugin. These errors are really scary...

tower-smart-20200918-1354.zip

Edited September 18, 2020 by silou

JorgeB · September 18, 2020

If the error was just the SMART warning, it's likely a false positive; there are other SSDs with similar issues.
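One small observation that fits the false-positive reading: the reported value 1179648 is exactly 18 × 65536, so its low 16 bits are zero. The Python sketch below only does that arithmetic. Whether this SSD's firmware actually packs something other than a sector count into the upper bytes of attribute 197 is an assumption that the thread does not confirm, and `split_raw_value` is a made-up helper name.

```python
def split_raw_value(raw):
    """Split a SMART raw value into (high bits above 16, low 16 bits)."""
    return raw >> 16, raw & 0xFFFF


# Value from the notification quoted earlier in the thread.
REPORTED_PENDING = 1179648

if __name__ == "__main__":
    high, low = split_raw_value(REPORTED_PENDING)
    print(hex(REPORTED_PENDING), high, low)  # 0x120000 18 0
```

A genuine count of pending sectors has no particular reason to land on an exact multiple of 65536, so a value like this is at least worth a second look before pulling the SSD.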

silou · September 25, 2020

I changed the power supply yesterday and ran a non-correcting check overnight. There were 0 errors found, BUT there was an ata error again. This time it was the parity disk, which showed one UDMA CRC error this morning. I read that this error is not a big deal unless the number increases, but the Dashboard shows an error and a thumbs down. Is there any way to acknowledge the error and get a nice green thumbs up again? The parity disk is currently undergoing an extended SMART test. I will report back once the test finishes.

BTW: Are there any short SATA cables that are proven to work well? I bought these and it seems like they are the culprit of all errors. I already threw three of them into the trash and the rest will follow once I have a new set.

tower-smart-20200925-1331.zip

JorgeB · September 25, 2020

1 hour ago, silou said:

Is there any way to acknowledge the error and get a nice green thumb up again?

Click on it and choose acknowledge.
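Since a UDMA CRC error only matters if the count keeps climbing, it can help to note the raw value now and compare it later. Below is a minimal Python sketch that reads attribute 199 from the text printed by `smartctl -A`. The sample text is illustrative and not taken from the diagnostics attached in this thread, the column layout is assumed to be the usual ten-column ATA attribute table, and `crc_error_count` and `crc_errors_increased` are hypothetical helper names.

```python
UDMA_CRC_ATTRIBUTE_ID = "199"

# Illustrative sample only, NOT taken from the diagnostics in this thread.
SAMPLE_SMARTCTL_OUTPUT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1
"""


def crc_error_count(smartctl_text):
    """Return the raw UDMA CRC error count from `smartctl -A` text, or None if absent."""
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Assumed layout: ten columns, attribute ID first, RAW_VALUE tenth.
        if len(fields) >= 10 and fields[0] == UDMA_CRC_ATTRIBUTE_ID:
            return int(fields[9])
    return None


def crc_errors_increased(previous_count, current_count):
    """True when the counter has grown since the last recorded value."""
    return current_count > previous_count


if __name__ == "__main__":
    current = crc_error_count(SAMPLE_SMARTCTL_OUTPUT)
    print("current CRC count:", current)
    print("increased since last check:", crc_errors_increased(1, current))
```

The counter does not reset, so a stable value after swapping cables is the sign to look for.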
