
Unraid 6 vs SuperMicro AOC-SAS2LP-MV8, Offline Drives, ReiserFS Corruptions


ldrax


Hi guys,
I built my unRAID server back in 2013, and other than regular parity checks, I left the machine pretty much as it was (both hardware and software). The OS was 5.0rc8a.

Long story short, I upgraded to latest unRAID 6 a few days ago,

1. following the guide somewhere on the wiki, I did 'upgrade install', and then

2. set new perms.


- Then I ran a parity check (non-correcting), just to be sure. Halfway through, one of the disks suddenly failed and got disabled.


I did a quick check of the syslog and found a lot of messages with sas this, sas that. I googled them and came across threads in this forum saying that the SuperMicro AOC-SAS2LP-MV8 might not be compatible with unRAID 6. My bad for not doing thorough research, but I swear this card was among the most highly recommended ones back then, so I never thought it would be an issue.

If I don't want the trouble of switching to a different controller, can I just roll back the OS to 5.0rc8a?
Was there an upgrade step that would not be compatible or make it irreversible? (From what I read, filesystem is still reiserfs, so wouldn't be a problem here, would it?)

Thanks in advance!

Link to comment

You will not be able to use any of the plugins currently installed on your ver. 6 system.  (Ver. 6 is 64-bit and ver. 5 is 32-bit.)  Basically, you need to reverse the ver. 5 to ver. 6 upgrade procedure.  See here for those instructions:

 

     https://lime-technology.com/forums/topic/37354-upgrading-unraid-from-version-5-to-version-6/

and

     http://lime-technology.com/wiki/index.php/Upgrading_to_UnRAID_v6

 

EDIT:  I will point out that you can get an LSI card for between $60 and $100 on eBay.  You will even find them already flashed to IT mode at that price, which makes it a plug-and-play replacement.

Link to comment

I haven't had a chance to make use of the new (LSI) card yet. I turned on the machine today (it had been off since the system disabled the one drive), and to my horror, another disk is also marked X (disabled) now. I guess even with the new LSI card, rebuilding is out of the question now.
 

Link to comment

Don't panic!!!   Many times there is nothing 'wrong' with the disk and the data on it.   (You need to post up your diagnostics file.  Tools   >>>   Diagnostics  )   Be sure to tell us which disks are disabled by the last four digits of the serial numbers.  I would not be using the server much (Especially writing to it) until you get the new controller.  

 

When you install the LSI card, triple check that all of the SATA connectors (both data and power) are firmly seated in their sockets.  Often, folks will loosen an adjacent connector when working with another SATA cable.

EDIT:  I would add "(now two drives offline)" to the thread title by editing the first post.

Link to comment
  • 3 weeks later...

Hi again Frank, it's been a while. Due to other priorities, I didn't get back to my unRAID system until recently. Thanks so much for the tips.

 

I had to summarize the timeline myself to reassess the issue. Here goes:

The initial action was to upgrade from 5.0rc8a to 6-latest-stable, but a slew of problems arose.
 

1. 'upgrade install'.

2. set new perms. (system reported no error during execution).
3. non-corrective parity check: halted halfway, and one drive (say, Drive C) became disabled (one drive offline).
4. upon syslog inspection, and googling, possible issue with Super Micro card.
5. OFF the machine, and left it OFF while considering what to do.
6. ON the machine, another drive (say, drive E) marked disabled, so now it's two drives offline.
7. OFF the machine, and left it OFF while considering what to do.

This was where I was as per the last post timestamp in this thread above.
So carrying on:

8. Got an LSI Card, installed it. Drive E is shown to be back. only drive C still disabled. Next course will be to replace drive C.
9. Precleared new drive to replace drive C (on another machine), then installed to this system, performed rebuild. System completed with no error.
10. Performed Non-corrective parity check: many sync errors reported.
11. Performed Corrective parity check: system completed with no error reported (other than the sync errors which were corrected during the process).
12. Let a few days pass with the system running.
13. Performed Non-corrective parity check: no sync errors reported.

Phew, thought it's all back and good by now! I slept well.
A few days later, I decided to upgrade the capacity of 2 drives.

14. Precleared 2 new drives on another machine.
15. Installed 1 precleared drive to the array to replace drive, say, drive F. Rebuilt the array, no error during process.
16. Installed 1 precleared drive to replace drive, say, drive G. Rebuilt the array, no error during process.
17. Performed Non-corrective parity check: no sync errors reported.

I slept well again.
But not for long now.

18. Out of habit, ran a du command (disk usage) on drive C (replaced in step 9 above). It returned permission denied errors on many files, and ls showed ? ? in the user and group columns for them.  Was this due to an unsuccessful, unreported error during step 2 (set new perms)?
19. From the GUI, ran reiserfsck --check on drive C. Its suggestion: run --rebuild-tree.
20. From the GUI, ran --rebuild-tree on drive C. After a couple of hours, the operation finished with many errors, and subsequent df, du and ls commands showed a large number of files missing. Comparing the df output from before and after the --rebuild-tree, I estimate about 100GB++ of data was lost.  Nothing in the lost+found directory. Since the data here is not that important, I let it go.

21. Curious, I repeated steps 18-20 for drive F (which I replaced and upgraded in step 15 above). Drive F suffered even more filesystem corruption. Upon completion of --rebuild-tree, I had lost about 1.2++ TB of files! And nothing in the lost+found directory.  (UPDATE: lost+found contains 1TB++ of data. Good to know the data is there; now the problem is how to identify what those files are, and their file names.)
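For the lost+found problem in step 21, one possible starting point is to sort the recovered files by type using the first few bytes of their content. The sketch below is only an illustration under assumptions: the signature table is a small hand-picked sample, the function names (guess_type, survey, summarize) are made up for this example, and it cannot recover the original file names, only hint at what each file is.

```python
import os
from collections import Counter

# Leading "magic" bytes for a few common formats. Illustrative, not exhaustive.
SIGNATURES = [
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip (also docx/xlsx/odt)"),
    (b"\x1f\x8b", "gzip"),
    (b"\x1a\x45\xdf\xa3", "matroska/webm"),
    (b"ID3", "mp3 (ID3 tag)"),
    (b"fLaC", "flac"),
]


def guess_type(path, read_size=16):
    """Return a best-guess format name from the first bytes of a file."""
    try:
        with open(path, "rb") as fh:
            head = fh.read(read_size)
    except OSError:
        return "unreadable"
    for magic, name in SIGNATURES:
        if head.startswith(magic):
            return name
    # MP4/MOV files carry the "ftyp" marker at byte offset 4.
    if head[4:8] == b"ftyp":
        return "mp4/mov"
    return "unknown"


def survey(root):
    """Walk root and return (path, size_in_bytes, guessed_type) per file."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                size = -1  # size could not be read
            rows.append((path, size, guess_type(path)))
    return rows


def summarize(rows):
    """Count how many files were guessed for each type."""
    return Counter(kind for _path, _size, kind in rows)
```

Calling summarize(survey("/mnt/disk1/lost+found")) would give a per-type count; the path here is an assumption and depends on which disk was repaired. Large video or archive files can then be opened individually to work out what they were.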

I'm still wrapping my head around this.
But just curious: where did I go wrong, or what procedure or routine did I miss all this while?
I have resorted to storing my backup on unRAID for at least 2 reasons, namely the ability to  i) rebuild one failing drive from parity, and ii) plug any single drive (read only, if need be) into another machine to inspect and retrieve files, in case of an unRAID emergency.


I still have the original drives of C, F, G. I mounted them on different machine to check if corruptions are there (of course they are).
Next possible course of action:  run reiserfsck check and rebuild-tree on these original drives, hoping for better recoverability? I'm not optimistic, as doing the same thing might only produce the same result.



 

Link to comment
7 hours ago, ldrax said:

I still have the original drives of C, F, G. I mounted them on different machine to check if corruptions are there (of course they are).

This would suggest the filesystem corruption is an old problem, not related to the rebuilds or your recent issues with the disabled disks. As to what caused it, that's difficult to say; it could be a hardware problem like bad RAM, a bad controller, etc. Also, never forget that any type of RAID (or unRAID) is never a backup.

Link to comment
34 minutes ago, Frank1940 said:

I have no ideas for you at this time except this.  This is an old thread.  To assure that it gets looked at, I would suggest that you edit the first post in the thread and change the title of the Thread to add  (NEW ISSUES) to the beginning  (or end) of it.

 

Thanks Frank, noted!

 

21 minutes ago, johnnie.black said:

This would suggest the filesystem corruption is an old problem, not related to the rebuilds or your recent issues with the disabled disks. As to what caused it, that's difficult to say; it could be a hardware problem like bad RAM, a bad controller, etc. Also, never forget that any type of RAID (or unRAID) is never a backup.

 

Yeah it could be an old problem. I just hope it didn’t happen during the upgrade process.

 

Which then leads to a question: how do I regularly check the filesystem health? For the many years the system sat under my table, I only performed a parity check every now and then, which apparently couldn’t catch filesystem corruption.

 

Point noted about RAID not being a backup.

Link to comment
2 minutes ago, ldrax said:

which apparently couldn’t catch filesystem corruption.

It can't, and parity can't help with that problem either: if the filesystem gets corrupted, it can't be fixed by rebuilding the disk, since the rebuild will reproduce the same corruption.
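To illustrate why, here is a toy sketch, assuming single parity computed as a byte-wise XOR across the data disks. The three 16-byte "disks" and the helper names are made up for the example. Parity only knows bytes, not filesystems, so it hands back exactly what was on the disk, damage included.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(surviving_disks, parity):
    """Reconstruct the one missing data disk from the survivors plus parity."""
    return xor_blocks(surviving_disks + [parity])


# Three toy data disks. disk2 already holds damaged filesystem metadata that
# was written to it (and therefore folded into parity) at some point earlier.
disk1 = b"good-data-000001"
disk2 = b"CORRUPT!metadata"
disk3 = b"good-data-000003"
parity = xor_blocks([disk1, disk2, disk3])

# disk2 is replaced and rebuilt from the other disks and parity.
rebuilt = rebuild([disk1, disk3], parity)
print(rebuilt)  # b'CORRUPT!metadata'
```

The rebuilt disk is bit-for-bit identical to the old one, which is also why a parity check reports no sync errors on a disk whose filesystem is damaged.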

 

Filesystem corruption mostly happens because of unclean shutdowns and/or bad hardware, especially RAM or the controller. After an unclean shutdown it's good practice to do a filesystem check. As for hardware, use ECC RAM when possible, as it will protect against memory errors; if you're not using ECC, run Memtest regularly. There's not much you can do about a controller going bad, but that's rather uncommon.

 

Other than that you can also run regular filesystem checks, especially during the next weeks to see if it keeps happening.
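As a complement to filesystem checks (not a replacement for them), a checksum manifest can show whether file contents have gone missing or changed between two points in time. This is a different technique from fsck: it looks at file data, not filesystem structure. A minimal sketch, with function names invented for this example:

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files don't fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_digest(full)
    return manifest


def compare(old, new):
    """Return (missing, changed, added) relative paths between two manifests."""
    missing = sorted(p for p in old if p not in new)
    changed = sorted(p for p in old if p in new and old[p] != new[p])
    added = sorted(p for p in new if p not in old)
    return missing, changed, added
```

The manifest is a plain dict, so it can be saved with json.dump and compared against a fresh one later. On a disk that only holds backups, any entry in missing or changed that you did not cause yourself is worth investigating.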

 

 

Link to comment

Unclean shutdowns are usually due to power outages.  These can be prevented by using a UPS to allow for a proper shutdown of your server after an outage occurs.  However, if you have installed certain Dockers, plugins or VMs, there can be problems in doing a clean shutdown.  One way to test for these types of issues is to try to stop the array manually while they are running.  If you can't stop the array, there is a good possibility that you won't get a clean shutdown.  As I understand it, unRAID will attempt an orderly shutdown of all processes and, as a last resort, kill them.  This latter action could result in file corruption, but it should not result in filesystem corruption.

Link to comment

Just to add insult to injury, I must note that ReiserFS is pretty much as dead as Hans Reiser's wife that he killed. https://en.wikipedia.org/wiki/Hans_Reiser

 

I'd start planning to migrate your data to XFS or BTRFS drives. You can do it one drive at a time. You must have a completely empty drive to change the filesystem; then you can copy the data from one of the ReiserFS drives to the newly formatted drive, change the format on the drive you just copied from, and repeat until done.

 

If you don't have enough free space to move data around and empty your largest drive, you may have to purchase another drive or shuffle data off externally to get it done.
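The shuffle described above can be planned on paper first. Below is a rough sketch, assuming each step copies one whole ReiserFS disk onto the single currently empty disk, and ignoring filesystem overhead. The Disk class, plan_migration function and the sizes used are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    name: str
    size_gb: int  # capacity of the disk
    used_gb: int  # data currently stored on it


def plan_migration(reiserfs_disks, empty_disk):
    """Return ordered (source, target) name pairs, or raise ValueError if stuck.

    After each copy the source disk is reformatted and becomes the empty
    target for the next step.
    """
    remaining = list(reiserfs_disks)
    target = empty_disk
    steps = []
    while remaining:
        fits = [d for d in remaining if d.used_gb <= target.size_gb]
        if not fits:
            raise ValueError(f"nothing left fits on {target.name}")
        # Simple greedy choice: move the fullest disk that still fits.
        source = max(fits, key=lambda d: d.used_gb)
        steps.append((source.name, target.name))
        remaining.remove(source)
        target = source
    return steps


disks = [
    Disk("disk1", 2000, 1800),
    Disk("disk2", 3000, 2500),
    Disk("disk3", 4000, 1000),
]
print(plan_migration(disks, Disk("disk4", 4000, 0)))
# [('disk2', 'disk4'), ('disk1', 'disk2'), ('disk3', 'disk1')]
```

If the function raises, that is the situation described above: there is not enough free space, and another drive or some external storage is needed to get started.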

Link to comment
1 hour ago, Frank1940 said:

Unclean shutdowns are usually due to power outages.  These can be prevented by using a UPS to allow for a proper shutdown of your server after an outage occurs.  However, if you have installed certain Dockers, plugins or VMs, there can be problems in doing a clean shutdown.  One way to test for these types of issues is to try to stop the array manually while they are running.  If you can't stop the array, there is a good possibility that you won't get a clean shutdown.  As I understand it, unRAID will attempt an orderly shutdown of all processes and, as a last resort, kill them.  This latter action could result in file corruption, but it should not result in filesystem corruption.


Yeah, it did experience power outages a couple of times, after which I ran a parity check but didn't think to do an fsck. Now I know better.

 

 

1 hour ago, jonathanm said:

Just to add insult to injury, I must note that ReiserFS is pretty much as dead as Hans Reiser's wife that he killed. https://en.wikipedia.org/wiki/Hans_Reiser

 

I'd start planning to migrate your data to XFS or BTRFS drives. You can do it one drive at a time. You must have a completely empty drive to change the filesystem; then you can copy the data from one of the ReiserFS drives to the newly formatted drive, change the format on the drive you just copied from, and repeat until done.

 

If you don't have enough free space to move data around and empty your largest drive, you may have to purchase another drive or shuffle data off externally to get it done.


Will work on that migration then! Thanks for the pointers!

Link to comment
  • 2 weeks later...
On 6/28/2018 at 2:21 PM, Frank1940 said:

Unclean shutdowns are usually due to power outages.

 

ReiserFS has a tradition of being very good at handling unclean shutdowns. It will replay the journal and restore a correct file system. Obviously, the file system may still miss some of the last changes before the power loss, so files involved in updates just before the power loss may be missing or contain partial data. But the journal will replay all complete transactions.

 

It's more common that broken RFS file systems are caused by machine errors - no journal can protect against a machine that goofs and writes incorrect data, or writes the data to an incorrect location on the disk.
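A toy model may make both points concrete. This is not the ReiserFS on-disk journal format, just a sketch of the general idea with an invented record layout: on replay, only transactions that reached their commit record are applied, and whatever data a committed transaction carries is applied faithfully, right or wrong.

```python
def replay(journal, disk):
    """Apply only the writes of transactions that reached a commit record.

    journal is a list of records:
      ("begin", txid), ("write", txid, block_no, data), ("commit", txid)
    disk is a dict mapping block numbers to data.
    """
    committed = {rec[1] for rec in journal if rec[0] == "commit"}
    for rec in journal:
        if rec[0] == "write" and rec[1] in committed:
            _kind, _txid, block_no, data = rec
            disk[block_no] = data
    return disk


journal = [
    ("begin", 1),
    ("write", 1, 10, "dir entry: a.txt"),
    ("write", 1, 11, "inode: a.txt"),
    ("commit", 1),
    ("begin", 2),
    ("write", 2, 12, "dir entry: b.txt"),
    # Power is lost here: transaction 2 never reached its commit record.
]
disk = replay(journal, {})
print(sorted(disk))  # [10, 11]

# A committed transaction carrying bad data (say, from faulty RAM) is
# replayed just as faithfully; the journal cannot tell it is wrong.
bad = [("begin", 3), ("write", 3, 10, "garbage"), ("commit", 3)]
replay(bad, disk)
print(disk[10])  # garbage
```

After the power loss the structure is consistent: a.txt is fully there and b.txt is fully absent. The second replay shows the limit: consistency of the journal says nothing about correctness of the data it was given.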

Link to comment

Archived

This topic is now archived and is closed to further replies.
