sawdustfarmer Posted October 16, 2018

Hi all,

When I try to access my Movies folder, everything is gone. Plex can still access everything in it, and I can see the files in MC when browsing disks, but not when browsing users. Every time I try to access the folder from Windows, "BTRFS error (device md1): parent transid verify failed" appears in the system log.

disk1 is reporting errors and I'm not sure why; only last month I removed it from the array, formatted it and put it back in. The SMART diagnostics don't show anything to be concerned about (though I may be wrong). I think this is unrelated, because last month I didn't lose a whole share folder like this. I can still browse disk1 through MC.

This could be linked to a sudden power loss which caused an unclean shutdown. I have ordered a new 8TB Red and an APC UPS to try to fix the issue, but they won't arrive for another 6 days.

If anyone can help, that would be greatly appreciated. I'm running an HP Gen8 Microserver.

tower-diagnostics-20181016-2229.zip
trurl Posted October 16, 2018

1 hour ago, sawdustfarmer said:
disk1 is reporting errors and i'm not sure why, only last month I removed it from the array formatted it and put it back in

Everything about this sentence is wrong.

The disk is DISABLED. Removing a disk from the array, formatting it, and putting it back so that Unraid will use it again isn't actually possible without doing a lot more than just that. Are there any more details you can provide as to exactly what you did? Please seek advice on the forum before doing anything else with the disks, until you understand how Unraid works.

Unraid DISABLES a disk when a write to it fails. Most of the time, I/O failures are not due to any actual disk problem. Bad connections are by far the most frequent cause, followed by controller issues. Actual disk problems are far less common.

When a write to a disk in the parity array fails, Unraid disables the disk, but it still uses the data to update parity. After the disk is disabled, Unraid won't actually use the disk again until it is rebuilt, because its data is out of sync and no longer valid. Instead, Unraid emulates the disk by calculating its data from parity plus all the other disks in the array. It not only reads the emulated disk through the parity calculation, it also writes the emulated disk by updating parity. So the disabled disk isn't used, but the emulated disk can still be read and written.

Also, unless you have more parity disks than disabled disks, you are no longer protected by parity, since Unraid requires parity plus all the other disks to be working in order to rebuild a disk.

Shut down and check all connections. You must rebuild the disk. Do you have a spare? If not, there is a method to get Unraid to rebuild to the same disk.
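To make the emulation described above concrete, here is a minimal sketch. It assumes a single parity disk computed as a bytewise XOR across the data disks; the tiny 4-byte "disks" and all names are hypothetical, and this is an illustration of the idea rather than Unraid's actual code.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical 4-byte "disks"; a real array works on whole sectors.
disk1 = bytes([0x10, 0x20, 0x30, 0x40])
disk2 = bytes([0x01, 0x02, 0x03, 0x04])
disk3 = bytes([0xAA, 0xBB, 0xCC, 0xDD])

# Parity is the XOR of all data disks.
parity = xor_blocks([disk1, disk2, disk3])

# disk1 is disabled: reads are served from parity plus the other disks.
emulated_disk1 = xor_blocks([parity, disk2, disk3])

# A write to the emulated disk only changes parity; the physical disk1
# is never touched, which is why it ends up out of sync.
new_disk1 = bytes([0xFF, 0x20, 0x30, 0x40])
parity = xor_blocks([new_disk1, disk2, disk3])
emulated_after_write = xor_blocks([parity, disk2, disk3])
```

In this sketch the emulated read matches what was on disk1, and after the emulated write the reconstructed contents differ from the physical disk, which is why a rebuild is needed before the disk can be used again.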
JorgeB Posted October 16, 2018 (edited)

Disk1 is failing and needs to be replaced.

Oct 16 13:20:13 Tower kernel: BTRFS error (device md1): parent transid verify failed on 582139904 wanted 116345 found 116343

This is usually a fatal filesystem error. After replacing the disk, it will need to be reformatted and the data restored. You can use this to help back up the data from the emulated disk if needed:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=543490

It should also be possible to copy most data from the old disk1, though some files will have read errors.

Edited October 16, 2018 by johnnie.black
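For anyone scanning their own syslog for the same message, the following is a small sketch that pulls the fields out of the line quoted above. The regular expression and the function name are illustrative assumptions, not part of any btrfs tool. The "wanted" value is the generation the parent block expects and "found" is the generation actually read, so a lower "found" suggests an older copy of the metadata.

```python
import re

# Pattern written against the exact message quoted in this thread.
TRANSID_RE = re.compile(
    r"BTRFS error \(device (?P<device>\w+)\): parent transid verify failed "
    r"on (?P<bytenr>\d+) wanted (?P<wanted>\d+) found (?P<found>\d+)"
)


def parse_transid_error(line):
    """Return the fields of a 'parent transid verify failed' line, or None."""
    match = TRANSID_RE.search(line)
    if match is None:
        return None
    wanted = int(match["wanted"])
    found = int(match["found"])
    return {
        "device": match["device"],
        "bytenr": int(match["bytenr"]),
        "wanted": wanted,
        "found": found,
        # How many generations the block on disk lags behind its parent.
        "behind": wanted - found,
    }


line = (
    "Oct 16 13:20:13 Tower kernel: BTRFS error (device md1): parent transid "
    "verify failed on 582139904 wanted 116345 found 116343"
)
info = parse_transid_error(line)
```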
trurl Posted October 16, 2018

5 minutes ago, johnnie.black said:
Disk1 is failing and needs to be replaced.

Is this because of SMART attribute 200? I forgot to look at that one.
JorgeB Posted October 16, 2018

Just now, trurl said:
Is this because of SMART attribute 200? I forgot to look at that one.

Yep, that together with recent UNC @ LBA errors (aka read errors):

Error 55 [6] occurred at disk power-on lifetime: 12255 hours (510 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 02 b8 00 00 3d d7 a7 b0 e0 00 Error: UNC 696 sectors at LBA = 0x3dd7a7b0 = 1037543344

And also how the read errors appear in the syslog:

Oct 14 23:10:37 Tower kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 14 23:10:37 Tower kernel: ata1.00: irq_stat 0x40000001
Oct 14 23:10:37 Tower kernel: ata1.00: failed command: READ DMA EXT
Oct 14 23:10:37 Tower kernel: ata1.00: cmd 25/00:40:10:28:ad/00:05:c9:01:00/e0 tag 17 dma 688128 in
Oct 14 23:10:37 Tower kernel: res 51/40:40:c0:2c:ad/00:05:c9:01:00/e0 Emask 0x9 (media error)
Oct 14 23:10:37 Tower kernel: ata1.00: status: { DRDY ERR }
Oct 14 23:10:37 Tower kernel: ata1.00: error: { UNC }
Oct 14 23:10:37 Tower kernel: ata1.00: configured for UDMA/133
Oct 14 23:10:37 Tower kernel: sd 1:0:0:0: [sdb] tag#17 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Oct 14 23:10:37 Tower kernel: sd 1:0:0:0: [sdb] tag#17 Sense Key : 0x3 [current]
Oct 14 23:10:37 Tower kernel: sd 1:0:0:0: [sdb] tag#17 ASC=0x11 ASCQ=0x4
Oct 14 23:10:37 Tower kernel: sd 1:0:0:0: [sdb] tag#17 CDB: opcode=0x88 88 00 00 00 00 01 c9 ad 28 10 00 00 05 40 00 00
Oct 14 23:10:37 Tower kernel: print_req_error: I/O error, dev sdb, sector 7678535696
Oct 14 23:10:37 Tower kernel: md: disk1 read error, sector=7678535632
Oct 14 23:10:37 Tower kernel: md: disk1 read error, sector=7678535640
Oct 14 23:10:37 Tower kernel: md: disk1 read error, sector=7678535648

Media errors almost always mean a disk problem.

With those 3 indications I'm 99.99% sure it's a failing disk. OP can confirm by running an extended SMART test; it should result in "read failure".
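As a rough aid for reading output like the above, here is a sketch that groups the "md: diskN read error" lines by disk and double-checks the hexadecimal values in the SMART error entry. The regular expression and helper name are assumptions made for illustration; the sample lines and hex values are copied from the post above.

```python
import re
from collections import defaultdict

# Matches the md driver's read-error lines as they appear in this thread.
MD_READ_ERROR = re.compile(r"md: (disk\d+) read error, sector=(\d+)")


def md_read_errors(lines):
    """Group failed md sectors by array disk name."""
    errors = defaultdict(list)
    for line in lines:
        match = MD_READ_ERROR.search(line)
        if match:
            errors[match.group(1)].append(int(match.group(2)))
    return dict(errors)


syslog = [
    "Oct 14 23:10:37 Tower kernel: ata1.00: error: { UNC }",
    "Oct 14 23:10:37 Tower kernel: md: disk1 read error, sector=7678535632",
    "Oct 14 23:10:37 Tower kernel: md: disk1 read error, sector=7678535640",
    "Oct 14 23:10:37 Tower kernel: md: disk1 read error, sector=7678535648",
]
errors = md_read_errors(syslog)

# Values from the SMART error entry: the LBA and the COUNT bytes "02 b8".
lba = int("3dd7a7b0", 16)
sector_count = int("02b8", 16)
```

The converted values agree with the decimal figures smartctl printed in the same entry (LBA 1037543344, 696 sectors).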
sawdustfarmer Posted October 16, 2018

8 hours ago, trurl said:
Everything about this sentence is wrong. The disk is DISABLED. Removing a disk from the array, formatting it, and putting it back so that Unraid will use it again isn't actually possible without doing a lot more than just that. Are there any more details you can provide as to exactly what you did? Please seek advice on the forum before doing anything else with the disks, until you understand how Unraid works.

Sorry, that was poorly worded. I used the instructions here to re-enable the drive (format was the wrong word):

https://wiki.unraid.net/Troubleshooting#Re-enable_the_drive

From what I understood of the extended SMART report, everything seemed OK. I checked for a bad connection and it all seemed OK. In hindsight I should have asked someone else to look over it.

8 hours ago, johnnie.black said:
Disk1 is failing and needs to be replaced. Oct 16 13:20:13 Tower kernel: BTRFS error (device md1): parent transid verify failed on 582139904 wanted 116345 found 116343 This is usually a fatal filesystem error. After replacing the disk, it will need to be reformatted and the data restored. You can use this to help back up the data from the emulated disk if needed: https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=543490 It should also be possible to copy most data from the old disk1, though some files will have read errors.

Thank you. I found that topic while trying to research the missing files in my Movies share and thought I had better ask for advice before going further. I'll wait until my new 8TB arrives and back up the data.

So you think the reason I can't access the files in the Movies share is Disk1 failing? I thought parity would emulate the files when a disk dies or is disabled, and that you could use Unraid as normal until it's replaced (without parity protection, of course).

Why have I lost access to ALL the files in the share, when things like Plex and MC can still access them? Or is this something bigger than just a failing disk? Sorry if these questions seem a bit dumb, I'm just trying to understand the situation.

8 hours ago, johnnie.black said:
With those 3 indications I'm 99.99% it's a failing disk, OP can confirm by running an extended SMART test, it should result in "read failure"

I started the extended SMART test. I'll get back to you with the results when I get home.
trurl Posted October 17, 2018

1 hour ago, sawdustfarmer said:
format was the wrong word

Format means "write an empty filesystem to this disk". Many people are confused about this. Sometimes they even format a disk instead of rebuilding it. And Unraid updates parity for any write operation, including formatting, so after a format the data can't be rebuilt.
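The point about formatting can be sketched in a few lines. This assumes single parity computed as a bytewise XOR, and it uses a block of zero bytes as a stand-in for an empty filesystem; every name and value here is hypothetical and only illustrates the reasoning.

```python
def xor_bytes(*blocks):
    """XOR equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)


# Hypothetical contents: one disabled (emulated) disk and one healthy disk.
data_before = b"MOVIES!!"
other_disk = b"\x11\x22\x33\x44\x55\x66\x77\x88"
parity = xor_bytes(data_before, other_disk)

# Formatting the emulated disk is just another write, so parity is
# recalculated to describe the new, empty contents.
empty_fs = b"\x00" * 8  # stand-in for a freshly written empty filesystem
parity = xor_bytes(empty_fs, other_disk)

# A rebuild now reproduces the formatted contents, not the old data.
rebuilt = xor_bytes(parity, other_disk)
```

After the format, the rebuild yields the empty contents, so parity no longer holds anything that could bring the old files back.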
JorgeB Posted October 17, 2018

The problem accessing your data isn't the missing disk; it's that the filesystem on the emulated disk is corrupt.
sawdustfarmer Posted October 17, 2018

20 hours ago, johnnie.black said:
OP can confirm by running an extended SMART test, it should result in "read failure"

Attached are my extended SMART test results.

WDC_WD40EFRX-68N32N0_WD-WCC7K7LX57HD-20181017-2123.txt
JorgeB Posted October 17, 2018

The SMART test passed, but I still have no doubt the problem is the disk. Unlike pending sectors, these errors aren't always reproducible; e.g., it could fail the next extended test.
sawdustfarmer Posted October 22, 2018

I finally copied everything from Disk1 that I could (just under 4TB). A few files failed to copy, but nothing I can't re-download.

But now I have more issues. My system completed its weekly parity check while I was away and things have gone to custard. I've lost all options to shut down, the page just won't load, and I can't download diagnostics; it just fails. All I've got are these screenshots.
sawdustfarmer Posted October 22, 2018

It looks like I can still access files through MC. Should I be grabbing as much as I can and just start again?
JorgeB Posted October 22, 2018

If the copy is done, type reboot on the console. If it doesn't reboot after a few minutes, you might need to force it.
sawdustfarmer Posted October 22, 2018

I had to force reboot it. I can log in through MC but the web GUI doesn't load. I don't have a monitor with a VGA port to plug into the server to see what's going on.
JorgeB Posted October 22, 2018

Type diagnostics on the console and post the zip.
sawdustfarmer Posted October 22, 2018

I'm using PuTTY and it says 'starting diagnostics collection...' and doesn't seem to go anywhere. I can see a Diagnostics folder in the root directory, but the only folder with anything in it is /system. I grabbed everything there one by one.

I left my logging window open and it keeps updating, so I have attached it.

Tower Diag 20181022-2254.zip
JorgeB Posted October 22, 2018

The server doesn't appear to have rebooted. Did you force it, e.g., by pressing reset?
sawdustfarmer Posted October 22, 2018

Maybe it didn't. I pressed the button and it beeped like it does when it boots. I'll long press it to shut down.
sawdustfarmer Posted October 22, 2018

Alright, that did the trick! Here are the diagnostics. Do I need to bring the array online?

tower-diagnostics-20181022-2335.zip
JorgeB Posted October 22, 2018

What's your plan now: replacing disk1 with a new disk, or rebuilding to the same one? I assume there's no more important data there, since it will likely need to be formatted.
sawdustfarmer Posted October 22, 2018

Important data is backed up offline and to cloud services.

This was my plan, tell me if it's a bad idea...

- Remove disk1 (4TB), replace with the new 8TB as disk1
- Let the parity rebuild (this is my only option, isn't it? I can't completely remove the drive and files because unRAID will see it's missing?)
- Create a new Movies share, Movies2, on disk1 (8TB)
- Copy all the files from Movies on disk1, disk2 and disk3 to Movies2 on disk1 (8TB)
- Delete the Movies share, rename Movies2 to Movies (this will keep the dockers accessing it happy)

But after last night, with the unRAID GUI not responding and acting weird after the parity check, I'm worried that more is corrupt than just disk1.

In one of my previous screenshots you can see disk2 came back with 399 errors after the parity check. The notification at 9.59am says 'array has 2 disks with read errors', then a few hours later at 6.47pm 'read check finished (0 errors)'. Does that mean they're fixed, or do they still exist?

And then the logging window shows 'tower kernel: md: do_drive_cmd: disk2: ATA_OP e3 ioctl error: -5'. Is that a sign disk2 has issues?

After seeing my diagnostics, what would you recommend moving forward?
JorgeB Posted October 22, 2018

Do you have the diagnostics showing the errors on disk2? If you haven't rebooted since, upload the current ones. If you don't have them, still upload the current diags.
sawdustfarmer Posted October 22, 2018

10 hours ago, sawdustfarmer said:
Here's a diagnostics. Do I need to bring the array on-line? tower-diagnostics-20181022-2335.zip

I ended up doing a forced shutdown by holding the power button because of the unresponsive/half-loaded GUI. After powering the server back on, I ran the diagnostics while the array was offline. I haven't brought the array back online since the forced shutdown.
JorgeB Posted October 23, 2018

I had already seen those diags, but they don't show any errors on disk2.
sawdustfarmer Posted October 23, 2018

Errors on Disk2 are now at 0. I have physically removed Disk1.

There was an odd sound coming from one of the drives and I suspected it was Disk1; the sound seems to have gone after removing it. I put it in as an unassigned drive and grabbed a file or two off it with no issue, hoping to hear the noise again. (One of the files I grabbed I couldn't copy before, and it seems OK. Does that mean it's corrupt on the parity and not on disk1?)

In other news, my Movies share folder is accessible again?!

What do you think is the best method for getting everything going again, just rebuilding the parity?

tower-diagnostics-20181023-2202.zip