Lots of problems with my server


silou


Hey there, I have been using Unraid for about two months. For about a week now, some errors have been occurring. Two weeks ago I installed a new drive that I had precleared beforehand, and I think everything went fine. I do not know why, but the server had at least one unclean shutdown, my cache SSD showed a SMART warning, and I think there was one error in the Main tab as well.

I moved the server into another room to do a parity check. I had problems accessing the webUI, which is why I had to do a lot of unclean shutdowns (my bad). After I fixed the network problems I ran a non-correcting parity check, and it showed around 41560 sync errors in the first 2-4 hours. After that there were no more sync errors. I thought there was a problem with a SATA cable, so I installed a new one. Afterwards I ran a correcting parity check that corrected all sync errors, and another non-correcting one that showed no more errors.

Some days ago I used unbalance to move some files (1.77 TB) to another disk. Yesterday I was streaming a movie from the server and had some dropped frames (could be madVR or something else), but it had me worried, so I checked the server afterwards and saw an absurd number of reads and writes on my disks. The parity drive had around 3.2 million reads and writes, the first data drive had 37 million reads and 2000 writes, the second data drive had 16.5 million reads and 3.5 million writes, and the cache had 60000 reads and 5.1 million writes. This morning the cache had 500000 more writes than yesterday afternoon. The cache trims at 7am and I logged into the server at 7.11am and 7.25am.

 

At 7.28am this red message and some others afterwards came in: Tower kernel: ata3.00: exception Emask 0x10 SAct 0x2 SErr 0x4090000 action 0xe frozen

 

Stupid me shut down the server to change the SATA cable again without saving the diagnostics. So I changed the cable and even moved the disk to another SATA port on the mobo. I powered the server back on, and while I was looking at the webUI the server had an unclean shutdown again. After that it came back up and started a parity check, which I aborted.

 

I looked into the syslog and saw a similar red message again:

 

Sep 17 10:28:17 Tower kernel: ata4.00: exception Emask 0x50 SAct 0x20000000 SErr 0x4890800 action 0xe frozen

Sep 17 10:28:17 Tower kernel: ata4.00: irq_stat 0x0c400040, interface fatal error, connection status changed
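
For reference, the `SErr` value in these lines is a bit field (the SATA SError register), and decoding it shows which kind of link event the controller reported. The snippet below is only a sketch: the bit names are the labels the Linux libata driver prints, as far as I know them, and `decode_serr` and `parse_line` are hypothetical helper names, not part of Unraid or any tool.

```python
import re

# Bit position -> label, following the names the Linux libata driver prints
# for the SATA SError register (listed here to the best of my knowledge).
SERR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}

# Matches the "ataN.NN: exception ... SErr 0x..." lines quoted in the post.
LINE_RE = re.compile(r"(ata\d+(?:\.\d+)?): exception .*?SErr (0x[0-9a-fA-F]+)")


def decode_serr(value):
    """Return the labels of all known bits set in an SErr value."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


def parse_line(line):
    """Return (port, [labels]) for an ata exception line, or None if no match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return match.group(1), decode_serr(int(match.group(2), 16))


if __name__ == "__main__":
    lines = [
        "Tower kernel: ata3.00: exception Emask 0x10 SAct 0x2 SErr 0x4090000 action 0xe frozen",
        "Sep 17 10:28:17 Tower kernel: ata4.00: exception Emask 0x50 SAct 0x20000000 SErr 0x4890800 action 0xe frozen",
    ]
    for line in lines:
        print(parse_line(line))
```

Both values decode to link-level flags such as `PHYRdyChg` and `10B8B`. Flags like these are commonly associated with cabling, connector, or power problems, but on their own they do not rule out the drive or the controller.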

 

Now it is ata4, and I am quite sure that this is the port I moved the data drive to. Does that mean that my drive is dying, or is it another faulty cable?

What should my next steps be? I would really hate to lose any of my data :(

tower-diagnostics-20200917-1041.zip


I was about to do that, but as soon as I clicked on start, the server shut down. The server is powered by a 90W PicoPSU with a 72W Salcar power supply. All the drives were spun down, so I suspect that the power supply cannot handle all the drives spinning up at the same time. Time to get a new power supply, or is something else at fault?
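
The spin-up suspicion can be sanity-checked with some rough arithmetic. This is only a sketch: the 72 W figure comes from the post, and the drive count is inferred from the disks listed earlier (parity, two data drives, one cache SSD). Every per-component wattage below is an assumed placeholder, not a measured or datasheet value, and `peak_spinup_watts` is a hypothetical helper name.

```python
def peak_spinup_watts(n_hdd, hdd_spinup_w, n_ssd, ssd_w, base_w):
    """Worst-case draw if every hard drive spins up at the same moment."""
    return n_hdd * hdd_spinup_w + n_ssd * ssd_w + base_w


SUPPLY_W = 72.0        # from the post: Salcar brick feeding the 90 W PicoPSU

# Everything below is an ASSUMPTION for illustration only.
HDD_COUNT = 3          # parity + two data drives, as listed in the first post
HDD_SPINUP_W = 25.0    # assumed peak per 3.5" drive during spin-up
SSD_COUNT = 1          # the cache SSD
SSD_W = 3.0            # assumed
BASE_W = 30.0          # assumed board + CPU + fans under light load

if __name__ == "__main__":
    peak = peak_spinup_watts(HDD_COUNT, HDD_SPINUP_W, SSD_COUNT, SSD_W, BASE_W)
    print(f"estimated peak: {peak:.0f} W, supply: {SUPPLY_W:.0f} W")
    if peak > SUPPLY_W:
        print("estimated peak exceeds the supply rating")
    else:
        print("estimated peak is within the supply rating")
```

With real numbers from the drive datasheets (12 V spin-up current in particular) the result may look quite different, so treat this as a way to structure the estimate rather than as a verdict.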


Another error occurred 10 minutes ago... Should I take the SSD out? I have attached the SMART report. The appdata, domains and system folders are stored on the SSD. How do I back them up correctly?

 

 

Unraid Cache disk SMART health [197]: 18-09-2020 13:44

Warning [TOWER] - current pending sector is 1179648
Intenso_SSD_Sata_III_AA000000000000004083 (sdb)

 

Edit: Half an hour later I got this message:

 

Unraid Cache disk SMART message [197]: 18-09-2020 14:14

Notice [TOWER] - current pending sector returned to normal value
Intenso_SSD_Sata_III_AA000000000000004083 (sdb)

 

I backed up the appdata folder and the flash drive with the CA Appdata Backup plugin. These errors are really scary...

 

 

tower-smart-20200918-1354.zip

Edited by silou

I changed the power supply yesterday and ran a non-correcting check overnight. There were 0 errors found, BUT there was an ata error again. This time it was the parity disk, which showed one UDMA CRC error this morning. I read that this error is not a big deal unless the number increases, but the Dashboard shows an error and a thumbs-down. Is there any way to acknowledge the error and get a nice green thumbs-up again? The parity disk is currently undergoing an extended SMART test. I will report back once the test finishes.
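
Since what matters is whether the count keeps increasing, two SMART reports taken some time apart can be compared. The snippet below is a minimal sketch that assumes the attribute table layout printed by `smartctl -A` (ID in the first column, raw value in the tenth). The sample text is made up for illustration and is not taken from the attached reports, and the function names are hypothetical.

```python
def parse_smart_attributes(text):
    """Map attribute ID -> raw value from a `smartctl -A` style table."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(parts) >= 10 and parts[0].isdigit():
            try:
                attrs[int(parts[0])] = int(parts[9])
            except ValueError:
                continue  # raw value is not a plain integer, skip the row
    return attrs


def increased(before, after, watch=(197, 199)):
    """Return {id: (old, new)} for watched attributes whose raw value grew."""
    return {
        attr: (before[attr], after[attr])
        for attr in watch
        if attr in before and attr in after and after[attr] > before[attr]
    }


# Made-up sample data, NOT from the attached SMART reports.
SAMPLE_BEFORE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1
"""
SAMPLE_AFTER = SAMPLE_BEFORE.replace("-       1", "-       3")

if __name__ == "__main__":
    old = parse_smart_attributes(SAMPLE_BEFORE)
    new = parse_smart_attributes(SAMPLE_AFTER)
    print(increased(old, new))
```

An empty result between two reports means neither attribute 197 (pending sectors) nor 199 (UDMA CRC errors) has grown in the meantime.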

 

BTW: Are there any short SATA cables that are proven to work well? I bought these and it seems like they are the culprit behind all the errors. I have already thrown three of them in the trash, and the rest will follow once I have a new set.

tower-smart-20200925-1331.zip

