Drive Error Red X


GTP


@trurl Thanks for your help today! You called it! It must have been a loose SATA cable, because so far there are no errors and I am up to 3%. It says the rebuild will take about two more days to complete, but I will post here once it is done. Thanks again for all your help getting me back up and running. As for the docker config, no hurry of course because of the rebuild, but what did you see that I could do better in my docker config?

 

Best,

 

GTP

Link to comment

@trurl I spoke too soon. It did get further along, but when I just checked at 4.2% it says that the disk has 334,580 errors again. I have attached diagnostics. I am going to try a different SATA cable, and if that fails I am thinking I should just junk the drive. I am lost as to why this is happening. Thanks again, and if it's too late to respond tonight, no worries. I appreciate all your help, and the array can wait.

 

Thank you,

 

GTP

 

I am sorry, I am editing this post because when I went back to the main page it now says I have zero errors. I don't know why that happened. I refreshed before I posted just to verify, and it still showed errors; then I started the post while I was downloading the diagnostics. It also had yellow pop-ups that said the array had found errors, and I had to dismiss them before I went to diagnostics. I'm sorry, I don't want you to think I'm helpless, but this was weird. I swear I saw the error count on the disk. At first I was hoping it might be on disk 10 and that the entire SATA chain was bad, but it wasn't. It was disk 11, and I saw the total for that disk at the bottom. Sorry for the alarm and the long explanation. Everything seems fine. Have a nice night.

 

419937871_waitwhut.thumb.png.355fe1abd57b3a16752230f359bda737.png

 

tower-diagnostics-20200903-2053.zip

Edited by GTP
Wait whut! Then I tried to explain more
Link to comment
On 9/3/2020 at 9:13 PM, GTP said:

what did you see that I could do better in my docker config?

You have allocated 150G to docker.img. 20G should be more than enough. I am running 17 dockers and they use less than half of 20G. But I see you are already using more than 20G, which is more than should be needed.

 

Have you had problems filling docker.img? Making it larger won't fix anything, it will just make it take longer to fill. The usual reason for filling docker.img is an app writing into docker.img instead of to mapped storage. Common mistakes are specifying a path in the app that doesn't match a container path exactly, including upper/lower case, or specifying a relative path.
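
As a rough illustration of those two path mistakes, here is a minimal Python sketch that checks whether a path entered in an app falls under one of the container's mapped paths. The container paths and app paths are hypothetical examples, not taken from these diagnostics, and the helper name `is_mapped` is made up for this sketch. The comparison is case-sensitive, as Linux paths are.

```python
from pathlib import PurePosixPath

def is_mapped(app_path: str, container_paths: list[str]) -> bool:
    """Return True if app_path is absolute and lies under a mapped container path."""
    path = PurePosixPath(app_path)
    if not path.is_absolute():
        # A relative path resolves inside the container, i.e. inside docker.img.
        return False
    for mapped in container_paths:
        root = PurePosixPath(mapped)
        # Case-sensitive comparison: /Downloads does not match /downloads.
        if path == root or root in path.parents:
            return True
    return False

# Hypothetical container paths, for illustration only.
mappings = ["/config", "/downloads"]
print(is_mapped("/downloads/complete", mappings))   # matches a mapped path
print(is_mapped("/Downloads/complete", mappings))   # wrong case, not mapped
print(is_mapped("downloads/complete", mappings))    # relative path, not mapped
```

Anything the check rejects would be written inside the container's own filesystem, which lives in docker.img.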

 

Your appdata, domains, system shares are set to cache-no. They do have files on cache, but also files on the array. These shares should be cache-prefer (or cache-only after you get them all moved to cache). You want these shares set to keep all their files on cache.

 

These shares are used by dockers/VMs. If the files are on the array, docker/VM performance will be impacted by slower parity writes, and dockers/VMs will keep array disks spinning since these files are always open.

 

Getting all this fixed will require several steps. Let me know when parity sync has completed and we can start working on this.

Link to comment

@trurl The rebuild for disk 11 finished, and I have attached the diagnostics from after it was done, plus some screenshots of disk 8 showing a few errors. Disk 8 has now completely failed. I have included diagnostics from before and after the disk 8 errors. I also took a screenshot of the error in the syslog. I have not rebooted, stopped the array, or even touched it since I got the error last night. I don't want to mess this up when two disks have errored so close to each other. I ordered a 12 TB parity drive so I can have dual parity, and hopefully if this ever happens again I can feel a little safer. Please let me know if you need anything else. I tried to capture every screen and every diagnostic all weekend long; I just didn't want to bother you during the long weekend.

 

Thank you,

 

GTP

 

Disk 8 failure.png

Fatel Error.png

udma crc error count.png

warning.png

normal operation.png

error 18.png

after parity rebuild diagnostics.zip tower-diagnostics-20200907-2127.zip

Link to comment

SMART attributes for disk8 look OK. CRC errors are connection issues, not disk problems. Run an extended SMART test on disk8.

21 minutes ago, GTP said:

The rebuild for disk 11 finished

We didn't discuss rebuilding disk11. According to syslog and the "Parity disk returned to normal operation" message, you were rebuilding parity, not disk11.

Sep  3 18:37:07 Tower kernel: md: recovery thread: recon P ...
Sep  3 18:37:07 Tower kernel: write_file: error 30 opening /boot/config/super.dat
Sep  3 18:37:07 Tower kernel: md: could not write superblock file: /boot/config/super.dat

That snippet also indicates problems writing to flash so you have another issue.
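
For anyone wondering what "error 30" means there, this small Python sketch decodes it. It assumes the number the md driver prints is a standard Linux errno value, which is an assumption on my part rather than something the log states; the function name `decode_write_error` is made up for this sketch. On Linux, errno 30 is EROFS (read-only file system), which would fit a flash drive that has dropped to read-only.

```python
import errno
import os
import re

def decode_write_error(line):
    """Return (errno name, description, path) for a 'write_file: error N opening PATH' line."""
    m = re.search(r"write_file: error (\d+) opening (\S+)", line)
    if m is None:
        return None
    code = int(m.group(1))
    # Assumption: the kernel message reports a standard errno number.
    return errno.errorcode.get(code, "unknown"), os.strerror(code), m.group(2)

line = "Sep  3 18:37:07 Tower kernel: write_file: error 30 opening /boot/config/super.dat"
print(decode_write_error(line))
```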

 

There is a gap in syslog from those diagnostics. Maybe you have other syslogs in /var/log that didn't get included, and they would contain a rebuild of disk11 if you did indeed rebuild disk11. And your log folders are full. The logs that did get included suggest lots of repetition of the same connection issues and some ACPI errors I don't know about. Maybe that is why the diagnostics are missing the disk11 rebuild.

 

Another reason your log folders might be full is this next thing I just noticed.

 

In addition to the docker configuration problems I mentioned earlier, I just noticed that you are installing every package from NerdTools. Why? Do you even know what some of those are? I recommend not installing anything you are not using. With NerdTools, unless it is something you use frequently, you might even consider uninstalling any package after you have finished using it. One of those packages is atop, which is notorious for filling log folders with its own logs.

 

Link to comment

I don't recall whether it has been discussed in this thread or not, but with the large number of disks you have, perhaps your power supply is marginal.

 

Uninstall any Nerdpack packages you are not using, especially atop.

 

At this point, you are going to have to shut down just to get your logs cleared. And you should put your flash drive in your PC and let it run checkdisk. While it is there, make a backup of flash. Do you always keep a current backup of flash?

 

Your latest post has left me confused about the current state of your server. It isn't even clear what the correct order of those screenshots should be.

 

After fixing flash in your PC, reboot and post new diagnostics.

 

 

Link to comment

No, you were right, I was rebuilding parity. Apologies. OK, I have begun the extended SMART test on disk 8. And no, I do not know what the tools are; I just thought I had plenty of space and would use them eventually. I am uninstalling the Nerd Tools as instructed. I actually run parallel power supplies just for that reason. The disks in question are on the main power supply, though. I will absolutely make a backup of flash.

Sorry about the screenshots, I thought it would order them by date. The order is 4-2-3-1.

 

Thank you,

 

GTP

Link to comment

Below I have posted the diagnostics after shutdown. I have also repaired the flash drive and made a copy. I had to uninstall the Nerd Tools after the shutdown because it wouldn't display them before, but they are uninstalled now. I also could not complete an extended SMART test on disk 8; it kept saying "interrupted: reset by host". I will start the SMART test on disk 8 again now that I have rebooted. Thank you, GTP

tower-diagnostics-20200908-1024.zip

Edited by GTP
error message clarification
Link to comment
3 minutes ago, trurl said:

Those diagnostics say parity is invalid. I assume parity rebuild never completed.

 

Post a screenshot of Main - Array Devices showing parity disk.

@trurl You are correct. It looks like after I restarted, it is attempting to rebuild parity again. I have attached the screenshot.

 

Best,

 

GTP

Parity rebuild 0809.png

Link to comment

There were already some read errors on disk10, which means parity won't be 100% correct. It doesn't look like a disk problem. I would start by updating the firmware on both LSIs:

 

Sep  8 10:13:56 Tower kernel: mpt2sas_cm0: LSISAS2008: FWVersion(10.00.08.00)

Very old, current release is 20.00.07.00

 

Sep  8 10:13:56 Tower kernel: mpt2sas_cm1: LSISAS2116: FWVersion(20.00.06.00)

All p20 releases except 20.00.07.00 have known issues.
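
If you want to check this yourself after flashing, a minimal Python sketch like the following pulls the firmware versions out of syslog and compares them with 20.00.07.00. The regular expression is written against the two mpt2sas lines quoted above and may need adjusting for other driver versions; the names `CURRENT` and `firmware_versions` are made up for this sketch.

```python
import re

CURRENT = (20, 0, 7, 0)  # 20.00.07.00, the release named above

def firmware_versions(syslog_text):
    """Yield (controller, chip, version tuple) for each mpt2sas FWVersion line."""
    pattern = re.compile(r"(mpt2sas_cm\d+): (\w+): FWVersion\(([\d.]+)\)")
    for m in pattern.finditer(syslog_text):
        version = tuple(int(part) for part in m.group(3).split("."))
        yield m.group(1), m.group(2), version

# The two lines quoted from the diagnostics above.
log = """\
Sep  8 10:13:56 Tower kernel: mpt2sas_cm0: LSISAS2008: FWVersion(10.00.08.00)
Sep  8 10:13:56 Tower kernel: mpt2sas_cm1: LSISAS2116: FWVersion(20.00.06.00)
"""

for ctrl, chip, version in firmware_versions(log):
    status = "up to date" if version >= CURRENT else "older than 20.00.07.00"
    print(ctrl, chip, version, status)
```

Comparing the versions as integer tuples avoids the pitfalls of comparing the dotted strings directly.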

 

It could also be a power/cable problem.

Link to comment

Or maybe it couldn't record the fact that parity rebuild completed because it couldn't write to your flash drive.

 

Looks like you may still be having flash problems:

Sep  8 10:16:15 Tower root: error: /plugins/preclear.disk/Preclear.php: wrong csrf_token
Sep  8 10:16:16 Tower kernel: sd 0:0:0:0: [sda] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Sep  8 10:16:16 Tower kernel: sd 0:0:0:0: [sda] tag#0 Sense Key : 0x6 [current] 
Sep  8 10:16:16 Tower kernel: sd 0:0:0:0: [sda] tag#0 ASC=0x28 ASCQ=0x0 
Sep  8 10:16:16 Tower kernel: sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x28 28 00 00 ef 0e 78 00 00 80 00
Sep  8 10:16:16 Tower kernel: print_req_error: I/O error, dev sda, sector 15666808
Sep  8 10:16:16 Tower kernel: print_req_error: I/O error, dev sda, sector 15666808
Sep  8 10:16:16 Tower kernel: Buffer I/O error on dev sda, logical block 1958351, async page read

That wrong csrf_token message at the top of that snippet is unrelated, but can be quite annoying as it repeats throughout syslog. Here is the FAQ on that:

 

More than annoying, because they are going to fill your logs, are these:

Sep  8 10:24:10 Tower root: ACPI action volumedown is not defined
Sep  8 10:24:10 Tower root: ACPI action volumedown is not defined
Sep  8 10:24:10 Tower root: ACPI action volumedown is not defined
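
To see which messages are filling the log, a rough Python sketch like this counts repeated lines after stripping the timestamp and host name. The prefix pattern assumes the "Mon DD HH:MM:SS host" layout seen in the snippets in this thread; `PREFIX` and `top_messages` are names made up for this sketch.

```python
import re
from collections import Counter

# Strip the "Mon DD HH:MM:SS host " prefix so identical messages compare equal.
PREFIX = re.compile(r"^\w{3}\s+\d+\s+\d\d:\d\d:\d\d\s+\S+\s+")

def top_messages(lines, n=5):
    """Return the n most frequent messages as (message, count) pairs."""
    counts = Counter(PREFIX.sub("", line.strip()) for line in lines if line.strip())
    return counts.most_common(n)

# Lines quoted from the syslog snippets in this thread.
sample = [
    "Sep  8 10:16:15 Tower root: error: /plugins/preclear.disk/Preclear.php: wrong csrf_token",
    "Sep  8 10:24:10 Tower root: ACPI action volumedown is not defined",
    "Sep  8 10:24:10 Tower root: ACPI action volumedown is not defined",
    "Sep  8 10:24:10 Tower root: ACPI action volumedown is not defined",
]

for message, count in top_messages(sample):
    print(count, message)
```

Run against the full syslog from a diagnostics zip (read with `open(...)` and passed in line by line), the top few entries should show what is doing most of the writing.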

And, you're having connection problems on disk1:

Sep  8 10:13:56 Tower kernel: ata3.00: ATA-9: WDC WD80EMAZ-00WJTA0, 2SG4RN3J, 83.H0A83, max UDMA/133
...
Sep  8 10:16:06 Tower kernel: md: import disk1: (sdl) WDC_WD80EMAZ-00WJTA0_2SG4RN3J size: 7814026532 
...
Sep  8 10:24:16 Tower kernel: ata3: lost interrupt (Status 0x50)
Sep  8 10:24:16 Tower kernel: ata3: limiting SATA link speed to 1.5 Gbps
Sep  8 10:24:16 Tower kernel: ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4050002 action 0xe frozen
Sep  8 10:24:16 Tower kernel: ata3: SError: { RecovComm PHYRdyChg CommWake DevExch }
Sep  8 10:24:16 Tower kernel: ata3.00: failed command: READ DMA
Sep  8 10:24:16 Tower kernel: ata3.00: cmd c8/00:30:10:40:35/00:00:00:00:00/e0 tag 0 dma 24576 in
Sep  8 10:24:16 Tower kernel:         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x14 (ATA bus error)
Sep  8 10:24:16 Tower kernel: ata3.00: status: { DRDY }
Sep  8 10:24:16 Tower kernel: ata3: hard resetting link
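
A quick way to see which ports are affected is to group these kernel messages by ATA port. This is only a sketch: the keyword list is taken from the lines quoted above and is not a complete list of libata error messages, and `ata_events` is a name made up here.

```python
import re
from collections import defaultdict

# Message prefixes seen in the snippet above; not an exhaustive list.
KEYWORDS = (
    "lost interrupt",
    "limiting SATA link speed",
    "failed command",
    "hard resetting link",
)

def ata_events(lines):
    """Group link-error messages by ATA port, e.g. {'ata3': [...]}."""
    pattern = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")
    events = defaultdict(list)
    for line in lines:
        m = pattern.search(line)
        if m and m.group(2).startswith(KEYWORDS):
            events[m.group(1)].append(m.group(2))
    return dict(events)

# Lines quoted from the syslog snippet above.
sample = [
    "Sep  8 10:24:16 Tower kernel: ata3: lost interrupt (Status 0x50)",
    "Sep  8 10:24:16 Tower kernel: ata3: limiting SATA link speed to 1.5 Gbps",
    "Sep  8 10:24:16 Tower kernel: ata3.00: status: { DRDY }",
    "Sep  8 10:24:16 Tower kernel: ata3.00: failed command: READ DMA",
    "Sep  8 10:24:16 Tower kernel: ata3: hard resetting link",
]

for port, messages in ata_events(sample).items():
    print(port, len(messages), "link events")
```

The port can then be matched to a disk through the model and serial number on its "ataN.00: ATA-9: ..." line, as done for disk1 above.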

 

Link to comment

@JorgeB Thank you for the insight. I am buying some new cables this week for my SAS card. As far as the firmware version goes, should I stop the parity rebuild to upgrade the firmware? The cables won't be here until Thursday. Also, I found this site for the firmware upgrade: https://kb.sandisk.com/app/answers/detail/a_id/11192/~/lsi-sas2008-firmware%2Fbios-download-for-lightning-pcie-enterprise-ssa Is this what I want? Thank you both for your help!

 

-GTP

Link to comment

Any idea what these are about?

5 minutes ago, trurl said:

More than annoying because it is going to fill your logs is these:


Sep  8 10:24:10 Tower root: ACPI action volumedown is not defined
Sep  8 10:24:10 Tower root: ACPI action volumedown is not defined
Sep  8 10:24:10 Tower root: ACPI action volumedown is not defined

 

Link to comment
1 minute ago, trurl said:

I had not gotten that far into your syslog, but yes parity build is going to be bad due to all the read errors on disk10.

 

Also, are you sure you uninstalled nerdpack?

 

Yes, Nerdpack is uninstalled, but I had to do it after the shutdown and restart. Should I interrupt the parity rebuild and do another shutdown and restart now that Nerdpack is uninstalled? I also ran chkdsk on the flash drive; it did find errors and repaired them. I then copied the contents of the disk to a local folder. Should I create a new flash disk? I have tons of other flash drives.

 

Thank you,

 

GTP

Link to comment
