Drive Error Red X

September 3, 2020

Author

@trurl I have attached the log. Thank you.

tower-syslog-20200903-2216.zip

Edited September 3, 2020 by GTP

September 3, 2020

Community Expert

That isn't diagnostics, that is only syslog. Diagnostics contains syslog and many other things. But syslog probably has most of what we need now so I will take a look. You did manage to post diagnostics in the first post. It is on the Tools page.

September 3, 2020

Community Expert

That is a lot of read errors on disk 11. Maybe the disk is failing. Go ahead and post those Diagnostics since that is the easiest way to get me everything else including SMART for all disks.

September 3, 2020

Author

@trurl Sorry about that, I'm dumb. Here is the most recent diagnostics file. Thank you.

GTP

tower-diagnostics-20200903-1724.zip

Edited September 3, 2020 by GTP

September 3, 2020

Community Expert

SMART for disk11 still looks OK.

Stop, shut down, check connections, SATA and power, both ends, including any splitters. Change the SATA cable if you have another. Then reboot and Start parity sync again.

September 3, 2020

Author

OK I'll give it a shot, @trurl

Thank you,

GTP

September 4, 2020

Author

@trurl Thanks for your help today! You called it! It must have been a loose SATA cable, because so far there are no errors and I am up to 3%. It says it will take about two more days for the rebuild to complete, but I will post here once it is done. Thanks again for all your help getting me back up and running. As for the docker config, there is no hurry because of the rebuild, but what did you see that I could do better in my docker config?

Best,

GTP

September 4, 2020

Author

@trurl I spoke too soon. It did get further along, but when I just checked it at 4.2% it said that the disk has 334,580 errors again. I have attached diagnostics and I am going to try a different SATA cable; if that fails, I am thinking I should just junk the drive. I am lost as to why this is happening. Thanks again, and if it's too late to respond tonight, no worries. I appreciate all your help and the array can wait.

Thank you,

GTP

I am sorry, I am editing this post because when I went back to the main page it now says I have zero errors. I don't know why that happened. I refreshed before I posted just to verify and it still showed errors, and then I started the post while I was downloading the diagnostics. There were also yellow pop-ups saying the array had found errors, which I had to dismiss before I went to diagnostics. I'm sorry, I don't want you to think I'm helpless, but this was weird. I swear I saw the error count on the disk: at first I was hoping it might be on disk 10, which would mean the entire chain of SATA was bad, but it wasn't. It was disk 11, and I saw the total from that disk at the bottom. Sorry for the alarm and the long explanation. Everything seems fine. Have a nice night.

tower-diagnostics-20200903-2053.zip

Edited September 4, 2020 by GTP
Wait whut! Then I tried to explain more

September 4, 2020

Community Expert

On 9/3/2020 at 9:13 PM, GTP said:

what did you see that I could do better in my docker config?

You have allocated 150G to docker.img. 20G should be more than enough. I am running 17 dockers and they use less than half of 20G. But I see you are already using more than 20G, so more than should be needed.

Have you had problems filling docker.img? Making it larger won't fix anything, it will just make it take longer to fill. The usual reason for filling docker.img is app writing into docker.img instead of to mapped storage. Common mistakes are specifying a path in the app that doesn't correspond to a container path in upper/lower case or specifying a relative path.
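To make those two common mistakes concrete, here is a minimal Python sketch. It only illustrates the idea and is not part of Unraid or Docker; the function name `is_mapped` and the example mappings are made up for illustration. It checks whether a path typed into an app falls under one of the container paths you mapped, comparing case-sensitively and treating relative paths as unmapped.

```python
from pathlib import PurePosixPath


def is_mapped(app_path: str, container_paths: list[str]) -> bool:
    """Return True if app_path lies inside one of the mapped container paths.

    The comparison is case-sensitive, as Linux paths are, and a relative
    path is never treated as mapped.
    """
    path = PurePosixPath(app_path)
    if not path.is_absolute():
        return False
    for mapped in container_paths:
        root = PurePosixPath(mapped)
        if path == root or root in path.parents:
            return True
    return False


# Hypothetical container-side mappings for one app.
mappings = ["/config", "/downloads"]

for candidate in ["/downloads/complete", "/Downloads/complete", "downloads/complete"]:
    verdict = "mapped" if is_mapped(candidate, mappings) else "NOT mapped (likely written inside docker.img)"
    print(candidate, "->", verdict)
```

Only the first candidate matches a mapping. The second differs in upper/lower case and the third is relative, which are exactly the two mistakes described above.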

Your appdata, domains, system shares are set to cache-no. They do have files on cache, but also files on the array. These shares should be cache-prefer (or cache-only after you get them all moved to cache). You want these shares set to keep all their files on cache.

These shares are used by dockers/VMs. If the files are on the array, docker/VM performance will be impacted by slower parity writes, and dockers/VMs will keep array disks spinning since these files are always open.
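If you want to see for yourself where the files of a share currently live, something like the following rough Python sketch could be run on the server. It assumes the usual Unraid layout, with the pool mounted at /mnt/cache and array disks at /mnt/disk1, /mnt/disk2 and so on; the function name `share_locations` is just for illustration.

```python
from pathlib import Path


def share_locations(share: str, mnt_root: str = "/mnt") -> dict[str, int]:
    """Count the files of one share on the cache and on each array disk."""
    counts: dict[str, int] = {}
    root = Path(mnt_root)
    if not root.is_dir():
        return counts
    for mount in sorted(root.iterdir()):
        name = mount.name
        # Only look at 'cache' and 'diskN'; skip 'user', which merges them all.
        is_array_disk = name.startswith("disk") and name[4:].isdigit()
        if name != "cache" and not is_array_disk:
            continue
        share_dir = mount / share
        if share_dir.is_dir():
            n = sum(1 for p in share_dir.rglob("*") if p.is_file())
            if n:
                counts[name] = n
    return counts


if __name__ == "__main__":
    for share in ("appdata", "domains", "system"):
        print(share, share_locations(share))
```

A share that is entirely on cache would show only a `cache` entry. Any `diskN` entries mean some of its files are still on the array.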

Getting all this fixed will require several steps. Let me know when parity sync has completed and we can start working on this.

September 8, 2020

Author

@trurl The rebuild for disk 11 finished. I have attached the diagnostics from after it was done, plus some screenshots of disk 8 showing a few errors. Disk 8 has now completely failed. I have included diagnostics from before and after the disk 8 errors, and I also took a screenshot of the error in the syslog. I have not rebooted, stopped the array or even touched it since I got the error last night. I don't want to mess this up when I have two disks that have errored so close to each other. I ordered a 12 TB parity drive so I can have dual parity and hopefully feel a little safer if this ever happens again. Please let me know if you need anything else. I tried to capture every screen and every diagnostic all weekend long; I just didn't want to bother you during the long weekend.

Thank you,

GTP

after parity rebuild diagnostics.zip tower-diagnostics-20200907-2127.zip

September 8, 2020

Community Expert

SMART attributes for disk8 look OK. CRC errors are connection issues, not disk problems. Run an extended SMART test on disk8.

21 minutes ago, GTP said:

The rebuild for disk 11 finished

We didn't discuss rebuilding disk11. According to syslog and the "Parity disk returned to normal operation", you were rebuilding parity not disk11.

Sep  3 18:37:07 Tower kernel: md: recovery thread: recon P ...
Sep  3 18:37:07 Tower kernel: write_file: error 30 opening /boot/config/super.dat
Sep  3 18:37:07 Tower kernel: md: could not write superblock file: /boot/config/super.dat

That snippet also indicates problems writing to flash so you have another issue.
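For anyone wondering how to read "error 30": assuming that number is a standard Linux errno value (which is how these kernel messages usually read, but it is an assumption here), Python can translate it.

```python
import errno
import os

# The number taken from "write_file: error 30" in the syslog snippet above.
code = 30

# Look up the symbolic name and the human-readable message for this errno.
print(errno.errorcode.get(code), "-", os.strerror(code))
```

On Linux, errno 30 is EROFS, "Read-only file system", which would fit a flash drive that has dropped to read-only after a filesystem problem.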

There is a gap in syslog from those diagnostics. Maybe you have other syslogs in /var/log that didn't get included and they would contain a rebuild of disk11 if you did indeed rebuild disk11. And your log folders are full. What logs did get included suggest lots of repetition of the same connection issues and some ACPI errors I don't know about. Maybe that is why the diagnostics is missing disk11 rebuild.

Another reason your log folders might be full is this next thing I just noticed.

In addition to the docker configuration problems I mentioned earlier, I just noticed that you are installing every package from NerdTools. Why? Do you even know what some of those are? I recommend not installing anything you are not using. With NerdTools, unless it is something you use frequently, you might even consider uninstalling any package after you have finished using it. One of those packages is atop, which is notorious for filling log folders with its own logs.
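To see what is actually filling the log folder, a small Python sketch like this lists the largest files under /var/log. It is only one way to look, and the function name `largest_files` is made up for illustration.

```python
from pathlib import Path


def largest_files(root: str = "/var/log", top: int = 10) -> list[tuple[int, str]]:
    """Return up to `top` (size_in_bytes, path) pairs, largest first."""
    files = []
    for p in Path(root).rglob("*"):
        try:
            # Skip symlinks so the same file is not counted twice.
            if p.is_file() and not p.is_symlink():
                files.append((p.stat().st_size, str(p)))
        except OSError:
            # Files can rotate away or be unreadable while we walk the tree.
            continue
    return sorted(files, reverse=True)[:top]


if __name__ == "__main__":
    for size, path in largest_files():
        print(f"{size / 1024 / 1024:8.1f} MiB  {path}")
```

If one package's logs dominate the list, that package is the likely culprit.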

September 8, 2020

Community Expert

I don't recall whether it has been discussed in this thread or not, but with the large number of disks you have, perhaps your power supply is marginal.

Uninstall any Nerdpack packages you are not using, especially atop.

At this point, you are going to have to shut down just to get your logs cleared. And you should put your flash drive in your PC and let it checkdisk. While there, make a backup of flash. Do you always keep a current backup of flash?

Your latest post has left me confused about the current state of your server. It isn't even clear what the correct order of those screenshots should be.

After fixing flash in your PC, reboot and post new diagnostics.

September 8, 2020

Author

3 minutes ago, trurl said:
SMART attributes for disk8 look OK. CRC errors are connection issues, not disk problems. Run an extended SMART test on disk8.

We didn't discuss rebuilding disk11. According to syslog and the "Parity disk returned to normal operation", you were rebuilding parity not disk11.
Sep  3 18:37:07 Tower kernel: md: recovery thread: recon P ...
Sep  3 18:37:07 Tower kernel: write_file: error 30 opening /boot/config/super.dat
Sep  3 18:37:07 Tower kernel: md: could not write superblock file: /boot/config/super.dat
That snippet also indicates problems writing to flash so you have another issue.

There is a gap in syslog from those diagnostics. Maybe you have other syslogs in /var/log that didn't get included and they would contain a rebuild of disk11 if you did indeed rebuild disk11. And your log folders are full. What logs did get included suggest lots of repetition of the same connection issues and some ACPI errors I don't know about. Maybe that is why the diagnostics is missing disk11 rebuild.

Another reason your log folders might be full is this next thing I just noticed.

In addition to the docker configuration problems I mentioned earlier, I just noticed that you are installing every package from NerdTools. Why? Do you even know what some of those are? I recommend not installing anything you are not using. With NerdTools, unless it is something you use frequently, you might even consider uninstalling any package after you have finished using it. One of those packages is atop, which is notorious for filling log folders with its own logs.

No, you were right, I was rebuilding parity. Apologies. OK, I have begun the extended SMART test on disk 8. And no, I do not know what the tools are; I just thought I had plenty of space and would use them eventually. I am uninstalling the nerd tools as instructed. I actually run parallel power supplies for just that reason, though the disks in question are on the main power supply. I will absolutely make a backup of flash.

Sorry about the screenshots I thought it would order them by date. It goes 4-2-3-1.

Thank you,

GTP

September 8, 2020

Author

Below I have posted the diagnostics after shutdown. I have also repaired the flash drive and made a copy. I had to uninstall the nerd tools after the shutdown because it wouldn't display the nerd tools, but they are uninstalled now. I also could not complete an extended SMART test on disk 8; it kept saying "interrupted: reset by host". I will start the SMART test on disk 8 again now that I have rebooted. Thank you, GTP

tower-diagnostics-20200908-1024.zip

Edited September 8, 2020 by GTP
error message clarification

September 8, 2020

Community Expert

Those diagnostics say parity is invalid. I assume parity rebuild never completed.

Post a screenshot of Main - Array Devices showing parity disk.

September 8, 2020

Author

3 minutes ago, trurl said:

Those diagnostics say parity is invalid. I assume parity rebuild never completed.

Post a screenshot of Main - Array Devices showing parity disk.

@trurl You are correct. It looks like after I restarted, it is attempting to rebuild parity again. I have attached the screenshot.

Best,

GTP

September 8, 2020

Community Expert

There were already some read errors on disk10, which means parity won't be 100% correct. It doesn't look like a disk problem. I would start by updating the firmware on both LSIs:

Sep  8 10:13:56 Tower kernel: mpt2sas_cm0: LSISAS2008: FWVersion(10.00.08.00)

Very old, current release is 20.00.07.00

Sep  8 10:13:56 Tower kernel: mpt2sas_cm1: LSISAS2116: FWVersion(20.00.06.00)

All p20 releases except 20.00.07.00 have known issues.

It could also be a power/cable problem.
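The firmware check above can be repeated against any syslog with a short Python sketch like this one. The regular expression is written against the two FWVersion lines quoted in this post, and the target release is the 20.00.07.00 named above; the function names are made up for illustration.

```python
import re

# Matches lines such as "... mpt2sas_cm0: LSISAS2008: FWVersion(10.00.08.00)".
FW_RE = re.compile(r"(mpt2sas_cm\d+): (\w+): FWVersion\(([\d.]+)\)")
CURRENT = "20.00.07.00"  # release named in the post above


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '20.00.06.00' into (20, 0, 6, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def firmware_report(lines):
    """Return (controller, chip, version, is_outdated) for each FWVersion line."""
    report = []
    for line in lines:
        m = FW_RE.search(line)
        if m:
            ctrl, chip, ver = m.groups()
            report.append((ctrl, chip, ver, parse_version(ver) < parse_version(CURRENT)))
    return report


syslog = [
    "Sep  8 10:13:56 Tower kernel: mpt2sas_cm0: LSISAS2008: FWVersion(10.00.08.00)",
    "Sep  8 10:13:56 Tower kernel: mpt2sas_cm1: LSISAS2116: FWVersion(20.00.06.00)",
]

for ctrl, chip, ver, outdated in firmware_report(syslog):
    print(ctrl, chip, ver, "outdated" if outdated else "current")
```

Both controllers in this syslog come out as outdated relative to 20.00.07.00.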

September 8, 2020

Community Expert

Or maybe it couldn't record the fact that parity rebuild completed because it couldn't write to your flash drive.

Looks like you may still be having flash problems:

Sep  8 10:16:15 Tower root: error: /plugins/preclear.disk/Preclear.php: wrong csrf_token
Sep  8 10:16:16 Tower kernel: sd 0:0:0:0: [sda] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Sep  8 10:16:16 Tower kernel: sd 0:0:0:0: [sda] tag#0 Sense Key : 0x6 [current] 
Sep  8 10:16:16 Tower kernel: sd 0:0:0:0: [sda] tag#0 ASC=0x28 ASCQ=0x0 
Sep  8 10:16:16 Tower kernel: sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x28 28 00 00 ef 0e 78 00 00 80 00
Sep  8 10:16:16 Tower kernel: print_req_error: I/O error, dev sda, sector 15666808
Sep  8 10:16:16 Tower kernel: print_req_error: I/O error, dev sda, sector 15666808
Sep  8 10:16:16 Tower kernel: Buffer I/O error on dev sda, logical block 1958351, async page read

That wrong csrf_token message at the top of that snippet is unrelated, but can be quite annoying as it repeats throughout syslog. Here is the FAQ on that:

More than annoying, because they are going to fill your logs, are these:

Sep  8 10:24:10 Tower root: ACPI action volumedown is not defined
Sep  8 10:24:10 Tower root: ACPI action volumedown is not defined
Sep  8 10:24:10 Tower root: ACPI action volumedown is not defined

And, you're having connection problems on disk1:

Sep  8 10:13:56 Tower kernel: ata3.00: ATA-9: WDC WD80EMAZ-00WJTA0, 2SG4RN3J, 83.H0A83, max UDMA/133
...
Sep  8 10:16:06 Tower kernel: md: import disk1: (sdl) WDC_WD80EMAZ-00WJTA0_2SG4RN3J size: 7814026532 
...
Sep  8 10:24:16 Tower kernel: ata3: lost interrupt (Status 0x50)
Sep  8 10:24:16 Tower kernel: ata3: limiting SATA link speed to 1.5 Gbps
Sep  8 10:24:16 Tower kernel: ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4050002 action 0xe frozen
Sep  8 10:24:16 Tower kernel: ata3: SError: { RecovComm PHYRdyChg CommWake DevExch }
Sep  8 10:24:16 Tower kernel: ata3.00: failed command: READ DMA
Sep  8 10:24:16 Tower kernel: ata3.00: cmd c8/00:30:10:40:35/00:00:00:00:00/e0 tag 0 dma 24576 in
Sep  8 10:24:16 Tower kernel:         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x14 (ATA bus error)
Sep  8 10:24:16 Tower kernel: ata3.00: status: { DRDY }
Sep  8 10:24:16 Tower kernel: ata3: hard resetting link
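To get a quick count of link problems like these per ATA port across a whole syslog, a rough Python sketch along these lines could help. The list of message prefixes is taken only from the lines quoted above and is certainly not exhaustive; the names `link_events` and `EVENTS` are made up for illustration.

```python
import re
from collections import Counter

# Captures the port ("ata3") and the message, ignoring a ".00" device suffix.
ATA_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")

# Message prefixes seen in the excerpt above; extend as needed.
EVENTS = (
    "lost interrupt",
    "limiting SATA link speed",
    "exception Emask",
    "failed command",
    "hard resetting link",
)


def link_events(lines) -> Counter:
    """Count known connection-related messages per (port, event)."""
    counts: Counter = Counter()
    for line in lines:
        m = ATA_RE.search(line)
        if not m:
            continue
        port, msg = m.groups()
        for event in EVENTS:
            if msg.startswith(event):
                counts[(port, event)] += 1
    return counts


sample = [
    "Sep  8 10:13:56 Tower kernel: ata3.00: ATA-9: WDC WD80EMAZ-00WJTA0, 2SG4RN3J, 83.H0A83, max UDMA/133",
    "Sep  8 10:24:16 Tower kernel: ata3: lost interrupt (Status 0x50)",
    "Sep  8 10:24:16 Tower kernel: ata3: limiting SATA link speed to 1.5 Gbps",
    "Sep  8 10:24:16 Tower kernel: ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4050002 action 0xe frozen",
    "Sep  8 10:24:16 Tower kernel: ata3.00: failed command: READ DMA",
    "Sep  8 10:24:16 Tower kernel: ata3: hard resetting link",
]

for (port, event), n in sorted(link_events(sample).items()):
    print(port, event, n)
```

A port that keeps accumulating resets and bus errors points at the cable, connector or power for the disk on that port rather than at the disk itself.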

September 8, 2020

Author

1 minute ago, JorgeB said:
There were already some read errors on disk10, this means parity won't be 100% correct, doesn't look like a disk problem, I would start by updating both LSIs firmware:
Sep  8 10:13:56 Tower kernel: mpt2sas_cm0: LSISAS2008: FWVersion(10.00.08.00)
Very old, current release is 20.00.07.00
Sep  8 10:13:56 Tower kernel: mpt2sas_cm1: LSISAS2116: FWVersion(20.00.06.00)
All p20 releases except 20.00.07.00 have known issues.

It could also be a power/cable problem.

@JorgeB Thank you for the insight. I am buying some new cables this week for my SAS card; they won't be here until Thursday. As for the firmware version, should I stop the parity rebuild to upgrade the firmware? Also, I found this site for the firmware upgrade: https://kb.sandisk.com/app/answers/detail/a_id/11192/~/lsi-sas2008-firmware%2Fbios-download-for-lightning-pcie-enterprise-ssa Is this what I want? Thank you both for your help!

-GTP

September 8, 2020

Community Expert

6 minutes ago, JorgeB said:

There were already some read errors on disk10

I had not gotten that far into your syslog, but yes parity build is going to be bad due to all the read errors on disk10.

Also, are you sure you uninstalled nerdpack?

September 8, 2020

Community Expert

Any idea what these are about?

5 minutes ago, trurl said:
More than annoying because it is going to fill your logs is these:
Sep  8 10:24:10 Tower root: ACPI action volumedown is not defined
Sep  8 10:24:10 Tower root: ACPI action volumedown is not defined
Sep  8 10:24:10 Tower root: ACPI action volumedown is not defined

September 8, 2020

Community Expert

3 minutes ago, GTP said:

should I stop the parity rebuild to upgrade the firmware.

I would say you should stop using your server until then.

September 8, 2020

Community Expert

4 minutes ago, GTP said:

should I stop the parity rebuild to upgrade the firmware.

I would, since it won't be valid anyway, then try again.

September 8, 2020

Community Expert

8 minutes ago, trurl said:

Looks like you may still be having flash problems:

But you also need to fix these.

September 8, 2020

Author

1 minute ago, trurl said:

I had not gotten that far into your syslog, but yes parity build is going to be bad due to all the read errors on disk10.

Also, are you sure you uninstalled nerdpack?

Yes, Nerdpack is uninstalled, but I had to do it after the shutdown and restart. Should I interrupt the parity rebuild and do another shutdown and restart now that Nerdpack is uninstalled? I also ran chkdsk on the flash drive; it did find errors and repaired them. I then copied the contents of the disk to a local folder. Should I create a new flash disk? I have tons of other flash drives.

Thank you,

GTP
