Neejrow

Everything posted by Neejrow

  1. Errors seem to be increasing. Here is the latest SMART report. Number of Reallocated Logical Sectors = 40, so it has gone from 8, to 16, to 32, to 40. Am I likely going to need to replace this drive? I'll power off and try another SATA cable. tower-smart-20230202-1521.zip
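     (For anyone reading along later: one way to watch that number between reports is to dump the SMART attribute table from the console, which I believe works on stock Unraid since smartmontools is included; /dev/sde is simply the letter this disk happens to have at the moment:)
     root@Tower:~# smartctl -A /dev/sde
     (The Reallocated_Sector_Ct line is the relevant one here, I believe.)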
  2. Rebuild finally finished. However, the reallocated sector count has seemingly gone up from 8 to 16. Attached latest diagnostics. Thoughts on what to do? Is this just the system "catching up" on errors after doing the rebuild, or am I likely to see these errors continue to increase? Thanks tower-diagnostics-20230202-1507.zip
  3. Well you were definitely right about that. It just finished without error. I don't know if I'm supposed to upload the smart report too or diagnostics log but I guess I'll do it just in case. Been a long while since I had to rebuild a drive. I think I put it in maintenance mode to do it. I'll figure it out. Will update thread tomorrow with results. tower-smart-20230202-0007.zip tower-diagnostics-20230202-0013.zip
  4. That's fine, as long as there's no particular benefit to running it with the array on versus off. I'll just do it with the array off. I'll report back once it's complete.
  5. Okay, thanks. So I should run that before trying to start the array? Or would it be better to start the array and then run the extended test? And then after I run the extended test I should run a rebuild? How do I do that exactly? Do I start the array in maintenance mode or something?
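     (For reference, I gather the extended test can also be started from the console with smartctl, roughly like this, with /dev/sde standing in for whichever letter the disk has. It runs in the background on the drive itself, and the result ends up in the drive's self-test log once it finishes:)
     root@Tower:~# smartctl -t long /dev/sde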
  6. Just got another email (12:10PM):
     Event: Unraid Disk 5 SMART health [199]
     Subject: Warning [TOWER] - udma crc error count is 299
     Description: ST8000VN004-2M2101_WKD3HD7N (sde)
     Importance: warning
     Server has now been turned off. I'm going to take it off the LSI card and move it back to a SATA cable on the motherboard SATA ports, but now I'm hesitant to turn the server back on, so I'll wait for some advice before powering it up again.
  7. UnRAID 6.10.3, B550M Mortar Wifi, Ryzen 3900X
     Hello, I got a couple of emails this morning from my server to say a disk had errors. It's been quite a while since the last time I had any errors, so I was a little unsure what to do and am looking for some guidance.
     Here is the first email, around 5:30am (Mover starts at 5am):
     Event: Unraid array errors
     Subject: Warning [TOWER] - array has errors
     Description: Array has 1 disk with read errors
     Importance: warning
     Disk 5 - ST8000VN004-2M2101_WKD3HD7N (sde) (errors 2048)
     Then here is the second email, around 10:00am:
     Event: Unraid Disk 5 SMART health [5]
     Subject: Warning [TOWER] - reallocated sector ct is 8
     Description: ST8000VN004-2M2101_WKD3HD7N (sde)
     Importance: warning
     I had downloaded a couple of hundred GB of stuff yesterday, and when the mover ran overnight only this specific drive had any writes to it from mover, so I think the error count came from mover.
     Now this is where I potentially made a mistake that I will regret. I had noticed that a whole bunch of the stuff I downloaded had not been properly moved across folders by Lidarr and was "stuck" in a completed downloads folder. I knew a lot of the files were not very important to me, so I stupidly went into Windows Explorer, opened my unRAID network share, and simply deleted the files from that completed downloads folder that were not being moved. It was probably around 300GB of files. I realised as soon as it started deleting the files that I probably should not have touched any data... but what's done is done, I suppose.
     I have taken a diagnostics snapshot from before I restarted the server, and another one after. I also ran a SMART short test before and after, and saved the SMART reports. (I'm still not actually sure how I check a SMART short test. I run it, it shows some % progress and then says done, but where do I find the result?)
     I turned the server off, moved the drive's SATA connection from my motherboard to my LSI SAS card, and when I turned the server back on the disk was still disabled, but this time it also says udma CRC error count 70 (which it also emailed me about).
     3rd email (11:40am):
     Event: Unraid Disk 5 SMART health [199]
     Subject: Warning [TOWER] - udma crc error count is 70
     Description: ST8000VN004-2M2101_WKD3HD7N (sde)
     Importance: warning
     I'm going to turn the server off again now after posting this, and move the drive back to a new SATA cable on a motherboard SATA port instead of the LSI card. But I'm just after advice on what my next steps should be?
     tower-diagnostics-20230201-0958-Pre Restart.zip tower-diagnostics-20230201-1024-Post Restart.zip tower-smart-20230201-1019.zip tower-smart-20230201-1145 - Cable change mobo to SAS card.zip
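     (Partly answering my own question about where the short test result goes: it ends up in the drive's self-test log rather than in the attribute list, and assuming smartmontools on stock Unraid it can be printed with the command below, /dev/sde being whatever letter the disk currently has. I believe the same log also appears near the end of the downloadable SMART report.)
     root@Tower:~# smartctl -l selftest /dev/sde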
  8. Unfortunately my board is out of SATA ports; that's why I bought this LSI card a few weeks ago. I'll just have to hope the new cable fixes it, and until then I'll leave the drives running with no spin down. I suppose if the issue occurs after that, I'll have to look for a new LSI card.
  9. Yes, I am sure. I double-checked again today that EPC and Low Current Spinup were disabled on the 2 drives.
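     (For anyone else landing here with the same Ironwolf-on-LSI combination: going purely from memory of the SeaChest guide mentioned further down this page, the commands were something along these lines, with the binary name suffix and the /dev/sgY target depending on your download and system, so please double-check against the guide itself rather than trusting this:)
     SeaChest_PowerControl -d /dev/sg7 --EPCfeature disable
     SeaChest_Configure -d /dev/sg7 --lowCurrentSpinup disable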
  10. I have decided to set the 2 drives on my LSI card to never spin down while I wait for my new mini SAS SFF-8087 to SATA cables to arrive. But I also have a question: why do my drives randomly just decide to spin up? As far as I can tell, nobody tried to access any of my media files. I manually spun down all my drives (except the 2 drives on the LSI card), and I just noticed a few minutes ago that one of them spun up:
      Mar 16 17:07:55 Tower nmbd[24170]:
      Mar 16 17:07:55 Tower nmbd[24170]: *****
      Mar 16 17:24:08 Tower kernel: mdcmd (45): set md_num_stripes 1280
      Mar 16 17:24:08 Tower kernel: mdcmd (46): set md_queue_limit 80
      Mar 16 17:24:08 Tower kernel: mdcmd (47): set md_sync_limit 5
      Mar 16 17:24:08 Tower kernel: mdcmd (48): set md_write_method
      Mar 16 17:39:57 Tower emhttpd: read SMART /dev/sde
      Is there a way for me to get a more detailed version of the logs that might indicate why this drive (sde) just randomly decided to spin up?
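      (One follow-up thought on the random spin-up question: while a disk is unexpectedly active, listing what has files open on it can sometimes point at the culprit. lsof is on stock Unraid as far as I know, and when given a mount point it lists everything open on that filesystem; /mnt/disk5 is just my guess at which array slot sde is in:)
      root@Tower:~# lsof /mnt/disk5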
  11. Just had another error, this time on disk sdh (sg7), which I am pretty sure is not the same drive as the original error, although I am not 100% sure on this. Can anyone confirm between the 2 logs? I'm not great at understanding which drive is which. The other error log says disk sdf errored, but I'm not totally familiar with unRAID and whether it assigns the same drive letters to the same drives after each reboot. Either way, I know both times it's been drives that are connected to the LSI card, as only 2 are on that and the rest are connected to the motherboard SATA ports. Any suggestions? I'm about to order a new cable, because the 2 cables I have came included with the LSI card from the eBay seller. I'm worried they may not be of the highest quality, as the listing was pretty cheap. Note: the fixes for disabling EPC and Low Current Spinup had already been applied to the drive that errored this time. tower-diagnostics-20210316-1513.zip
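      (Something that helped with the "which letter is which drive" confusion: the drive serial numbers are embedded in the persistent device links, so the command below maps every sdX back to its model and serial no matter how the letters get shuffled between boots:)
      root@Tower:~# ls -l /dev/disk/by-id/ | grep -v part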
  12. Just an update. I ran the fixes listed in the thread above, and then did a full system parity check + rebuild on the errored drive. Everything went smoothly. Hopefully that solves my issue!
  13. I just want to say thank you very much for this guide. Made it very simple to fix. Hopefully this resolves my issue that I just ran into today!
  14. Hi, I just got an email from my server notifying me that Disk 7 is in an error state. I'm not exactly sure what to do here. I have taken a diagnostics log of the server. I notice the disk is connected to the SAS card that I got a few weeks ago. I initially had issues with it giving a small number of read errors (4), but since then it's been fine until today. I was updating some of my Docker apps around the time this occurred, although that should be unrelated to the disk error as all my Docker stuff is on the cache SSD. Can anyone take a look at the logs to figure out what might be the issue? Is my SAS card or 4-to-1 cable a bit dodgy or something? Also, how should I proceed? tower-diagnostics-20210314-1836.zip
  15. Just an update. No read errors since making the changes. Hopefully it continues like this.
  16. Sorry, I am a little confused. When you say HBA, do you mean my SAS PCIe card? (Attached SCSI controller: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)) The eBay listing for it says "FLASHED TO LATEST LSI P20 IT MODE FIRMWARE", so I assumed I do not need to update it, and I am not even sure how I would go about updating it either. I found a forum post via Google about my card from May 2020 saying that the latest firmware is 20.0.07.0, and that is what my card says it is running in its BIOS.
      I tried looking into this and found nothing related to ASPM for PCIe power saving in the BIOS. I did a BIOS update to make sure I was fully up to date, and I also switched my SAS card to the bottom PCIe 16x slot on my board (which runs at PCIe 3.0 x4), just in case the top slot was causing some sort of compatibility issue since it runs at PCIe 4.0 natively. I also set that (top) slot to run in gen3 mode just in case, although it will be empty from now on. I have also switched the cable to the second set of mini-SAS cables which came with this SAS card, and used the 2nd mini-SAS port on the card too.
      I am currently running a bunch of movies off the drive connected to the SAS card to force it to do a lot of reads, to see if I can trigger any errors again. But so far so good (it's only been about 15 minutes though). I took a few photos in the BIOS of the SAS card, in case any of this information is useful to someone who might know whether this is the latest version of the firmware etc. Anyway, I will let the drive run for a few hours reading data and see if it gets any more read errors. After that I'll force it to spin down and see if that causes any read errors again too. Will update this post tomorrow.
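      (One extra check I came across: the firmware version the card is actually running is also printed by the mpt2sas/mpt3sas driver when it loads, so it can be pulled out of the kernel log without rebooting into the card BIOS. The exact wording varies by driver version, but the line containing FWVersion should be the one to look for:)
      root@Tower:~# dmesg | grep -iE "mpt[23]sas" | grep -i version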
  17. System specs: Ryzen 3900X, MSI B550M Mortar Wifi, unRAID 6.9.0-rc2, headless, no GPU.
      Hello, I installed a SAS card yesterday (into my top PCIe 4.0 16x slot) that was recommended to me by someone on the unRAID Discord server: an LSI 9207-8i. It seemed to be working fine; however, this morning I got an email from my server to say that there was an array read error. I have never had one of these before. I checked, and it's the drive which I connected using the SAS card. It's not a new drive, it has been running in the server for many months, but I decided to connect it via the SAS card instead of directly to a SATA port on the motherboard, to test the card. The server booted up fine, there were no errors, and the drive was detected and running, so I assumed it was fine until this morning. I did not "flash" the SAS card or anything, as the eBay listing says it's already flashed and works for unRAID.
      I have attached the system log below, and reading it I can see that the read errors occurred when the drive spun down. I have all my drives set to spin down after 4 hours of no use, because it's a Plex server only right now and so the drives don't get used too often. (Is 4 hours too soon? They are all NAS drives and I am only doing it for power saving, basically; I would like some opinions on what a more optimal setup is.)
      Feb 23 11:52:49 Tower emhttpd: spinning down /dev/sde
      Feb 23 13:58:01 Tower kernel: sd 7:0:0:0: attempting task abort!scmd(0x00000000bec51a11), outstanding for 15197 ms & timeout 15000 ms
      Feb 23 13:58:01 Tower kernel: sd 7:0:0:0: [sde] tag#3307 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
      Feb 23 13:58:01 Tower kernel: scsi target7:0:0: handle(0x0009), sas_address(0x4433221107000000), phy(7)
      Feb 23 13:58:01 Tower kernel: scsi target7:0:0: enclosure logical id(0x500605b009db1330), slot(4)
      Feb 23 13:58:05 Tower kernel: sd 7:0:0:0: task abort: SUCCESS scmd(0x00000000bec51a11)
      Feb 23 13:58:05 Tower kernel: sd 7:0:0:0: [sde] tag#3832 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=19s
      Feb 23 13:58:05 Tower kernel: sd 7:0:0:0: [sde] tag#3832 Sense Key : 0x2 [current]
      Feb 23 13:58:05 Tower kernel: sd 7:0:0:0: [sde] tag#3832 ASC=0x4 ASCQ=0x0
      Feb 23 13:58:05 Tower kernel: sd 7:0:0:0: [sde] tag#3832 CDB: opcode=0x88 88 00 00 00 00 01 00 d2 c1 10 00 00 00 20 00 00
      Feb 23 13:58:05 Tower kernel: blk_update_request: I/O error, dev sde, sector 4308779280 op 0x0:(READ) flags 0x0 phys_seg 4 prio class 0
      Feb 23 13:58:05 Tower kernel: md: disk4 read error, sector=4308779216
      Feb 23 13:58:05 Tower kernel: md: disk4 read error, sector=4308779224
      Feb 23 13:58:05 Tower kernel: md: disk4 read error, sector=4308779232
      Feb 23 13:58:05 Tower kernel: md: disk4 read error, sector=4308779240
      Feb 23 13:58:05 Tower emhttpd: read SMART /dev/sde
      Feb 23 13:58:16 Tower emhttpd: read SMART /dev/sdb
      Feb 23 13:58:16 Tower emhttpd: read SMART /dev/sdc
      Feb 23 13:58:25 Tower emhttpd: read SMART /dev/sdg
      Feb 23 13:58:26 Tower kernel: sd 7:0:0:0: Power-on or device reset occurred
      Feb 23 13:58:26 Tower emhttpd: read SMART /dev/sdd
      Feb 23 13:59:01 Tower sSMTP[17393]: Creating SSL connection to host
      Feb 23 13:59:03 Tower sSMTP[17393]: SSL connection using TLS_AES_256_GCM_SHA384
      Feb 23 13:59:08 Tower sSMTP[17393]: Sent mail for [email protected] (221 2.0.0 closing connection
      That's about all I can think of to provide. I have uploaded the diagnostics log and also the SMART report of the drive that had the read errors. Please let me know if there is anything else I need to provide.
      Thanks tower-diagnostics-20210223-1459.zip tower-smart-20210223-1454.zip
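      Edit: a small tip I picked up while testing the spin-down behaviour, in case it helps anyone else reading: the drive's power state can be queried from the console without forcing it awake, which makes it easier to tell whether sde really was in standby when the errors hit. Both commands are on stock Unraid as far as I know, though I'm not certain how well hdparm behaves through the LSI card:
      root@Tower:~# hdparm -C /dev/sde                # reports active/idle or standby
      root@Tower:~# smartctl -n standby -A /dev/sde   # skips the SMART query if the drive is spun down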
  18. Thankfully I have a backup of all my photos / important documents. So I have not lost anything important and am willing to just move ahead. All I have lost is some Movies and some NVIDIA Shadowplay game footage which I probably wouldn't really ever do anything with anyway.
  19. There is actually over 60GB of files in lost+found, and I have already found a bunch of missing data just from having a quick look through my Plex / Shadowplay folders. A quick Google search tells me I basically can't do anything with most of these files in lost+found, and that anything that isn't a photo (the photos seem to still work) is corrupted and unusable?
  20. Okay, so I decided to just put the flash drive into my Windows PC, make the folder, and put the .txz file inside, then booted up unRAID and ran xfs_repair -v, and it did its thing. Unfortunately the console output was enormous and I was unable to scroll all the way back to the beginning. Is there any way to view a log of what happened somewhere? Anyway, I just started the array and the drive is mounted! Thank you so much @johnnie.black, you have saved the day. Should I do some sort of parity check now to make sure everything is fine?
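      (Noting this for next time, since I couldn't scroll back through the console: piping the repair output through tee keeps a full copy of the log, e.g. on the flash drive; the path here is just an example:)
      root@Tower:~# xfs_repair -v /dev/md3 2>&1 | tee /boot/xfs_repair-disk3.log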
  21. Sorry, I'm incredibly new to this and to Linux in general. Can I simply use something like the Krusader Docker container to add this file via a GUI, or do I need to type in some console commands to download the file and make a folder etc. on the flash drive? Or have I misunderstood completely, and I can just plug my unRAID USB stick into my Windows PC, make the folder, and drag that .txz file onto the USB?
  22. Yes, I was simply just mounting the board & drives into a case. I am not using SAS cards or anything, just the motherboard SATA ports for 5 HDDs and the cache on an M.2 NVMe slot on the board. Could it be a faulty SATA data cable? I may have changed the cable inadvertently, because I had about 7 SATA data cables there in a tangle but only 5 in use.
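      (Related to the cable question: from what I've read since, a flaky SATA data cable tends to show up in SMART as a rising UDMA CRC error count rather than as reallocated sectors, so a quick check like the one below, for whichever letter the disk has, is one way to tell whether a cable swap helped. The raw count never resets, so what matters is whether it keeps climbing:)
      root@Tower:~# smartctl -A /dev/sdX | grep -i crc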
  23. root@Tower:~# xfs_repair -v /dev/md3
      Phase 1 - find and verify superblock...
      writing modified primary superblock
      Primary superblock bad after phase 1!
      Exiting now.
  24. root@Tower:~# xfs_repair -nv /dev/md3
      Phase 1 - find and verify superblock...
      would write modified primary superblock
      Primary superblock would have been modified.
      Cannot proceed further in no_modify mode.
      Exiting now.
      I take it this is a bad thing?
  25. Should I run -nv first to be safe and post that here?