Mageia forum

by **jiml8** » Nov 7th, '16, 23:05

My system has crashed twice in 24 hours. These are the first incidents I have had in at least 6 months; I have been rebooting the machine only when a new kernel update came through and it has been thoroughly reliable.

The first crash left a trace in the log; a kernel oops involving a page fault occurred in the middle of the night, and the machine went out of control shortly after I started working on it in the morning yesterday. The second crash occurred this morning. I had been using it for about an hour when kwin stopped responding to most inputs though the mouse would move. I switched to a console and killed kwin, then tried to stop dm. Some kernel issue (an oops or a panic...not sure which) occurred and I had to hit the reset button. Nothing was left in the logs.

After the system came up and I restarted my user session, I started some diagnostics and discovered that I could not dismount any volume and commands such as df and lsof were non-responsive. I dropped to runlevel 1 to do maintenance, but still could not dismount any volumes and still could not use df and lsof.

I rebooted into a usb stick, and went through every volume. I found corruption on the volume containing my system and the one that contained my home. I also found some on the volume that contains /tmp.

What I do not know (yet) is whether this corruption caused the crashes or is an effect of the crashes. The corruption in all cases was incorrect free block count, which is not usually too serious.

So, we will see. I do hope the problem is solved.

by **gohlip** » Nov 8th, '16, 05:03

Try booting to a text prompt (or tty) and then do an update/upgrade.
Run the update a second time to make sure all updates have completed.
There have been several updates recently that may help.

by **jiml8** » Nov 8th, '16, 09:49

Right now I'm just using this thread to document anomalies in my system. Might be useful if these crashes turn into a "thing" and I ask for help here, and in any case it might prove useful to someone in the future. Who can say? When a solid system suddenly turns flaky, there has to be a reason.

So, this evening I just went through my /var/log/journald directory and found a lot of old and apparently orphaned journals, some dating back to 2014. So, I stopped the journal daemon, deleted all that stuff, and restarted the journal daemon.

Then, I was looking around /var/log, and decided to dump trim.log. My system has 2 SSDs in it, and I have a daily cron job trim them at about 4 AM, and I keep a log of what got trimmed. I noticed that partition /dev/mapper/crypt_sdc5 had 252.7 GiB trimmed last night.

Now, that partition is 431G and has 235G available on it, so this is a really huge trim. This is the partition that contains /var, and the usual nightly trim runs about 1-2 GiB. This partition also contains a number of virtual machines including 4 VMs that are usually running...and those 4 were all running last night when I went to bed, and were still running this morning when I was working before the crash.

For the trim to be that huge, something had to write all those locations then be deleted. I have no idea what that could be, but I now suspect that this filesystem was corrupted before the crashes, and this might be the indicator that this filesystem actually caused the crashes.

So, time will tell.

by **jiml8** » Nov 8th, '16, 10:05

Hmmm...

I also just now noticed that /dev/mapper/crypt_sdd5 had a 97 GiB trim last night. Typical for it is 1-1.5 GiB a night. This volume is on my other SSD, and it contains both /home and a number of virtual machines, including 4 that were running when the crash occurred (I had 8 VMs up). This partition is 457G and has 74G available. This partition also was corrupted when I checked it this morning (as was /dev/mapper/crypt_sdd5).

Also, I see that /dev/sdc1 (unencrypted, and contains / ) showed an 11.6 GB trim last night. I found / corrupted this morning when I worked on it. This partition is 28G and has 11G available.

I did not check /dev/mapper/crypt_sdd1 because it is swap.

Allowing for the difference between G (or GB) and GiB, it looks to me like something happened that caused both SSDs to either be filled or to think they were filled, resulting in really big trims. Not sure what would cause that.
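For reference, the unit difference mentioned above can be worked out directly. This is only a minimal Python sketch: it assumes the trim log reports binary GiB, and it does not assume which unit the free-space "G" figures use, since that depends on how they were obtained (for example, `df -h` uses powers of 1024 while `df -H` uses powers of 1000, and the thread does not say which was used). The function names are made up for illustration.

```python
GIB = 2**30  # bytes in one gibibyte (binary)
GB = 10**9   # bytes in one gigabyte (decimal)


def gib_to_gb(gib: float) -> float:
    """Convert binary GiB to decimal GB."""
    return gib * GIB / GB


def trimmed_vs_free(trimmed_gib: float, free_g: float) -> dict:
    """Ratio of trimmed space to free space under both readings of 'G'."""
    return {
        "if_free_is_GiB": trimmed_gib / free_g,
        "if_free_is_GB": gib_to_gb(trimmed_gib) / free_g,
    }


if __name__ == "__main__":
    # Figures quoted in the posts above: (name, trimmed GiB, free "G")
    for name, trimmed, free in [
        ("crypt_sdc5", 252.7, 235),
        ("crypt_sdd5", 97.0, 74),
    ]:
        print(name, round(gib_to_gb(trimmed), 1), "GB",
              trimmed_vs_free(trimmed, free))
```

Under either reading the trimmed amount comes out at least as large as the reported free space, which fits the idea that all free space was trimmed. The free-space figures were read some time after the trim ran, so the comparison is only approximate.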

My sda and sdb are both hard drives, so I wouldn't get this kind of information for them.

by **wintpe** » Nov 8th, '16, 10:53

Hi Jim,
Sometimes filesystem corruption can be caused by bad memory.
Running memtest may prove something, but more often I find it never catches the issue.
It's been a while now since I experienced this, but I did have a system a few years ago that had the same issue, and every time I fsck'd it the problem got worse.
That system also did not have ECC memory.
Just something to think of, as I can see you were looking for a software reason, and it might just be faulty hardware.

regards peter

by **jiml8** » Nov 11th, '16, 07:30

Software trouble is a lot more common than hardware trouble; I will usually start there and proceed to hardware if the evidence warrants it.

In any case, for the last two days there has been no further evidence of trouble, and the trim log shows the trims for the last two nights have been at appropriate levels.

Pending further evidence, I am going to say the file system corruption led to the problems. I am not totally happy saying that, but it appears to be OK now.

by **jiml8** » Nov 29th, '16, 01:38

17 days later, no further trouble. I guess a corrupted filesystem was the heart of the problem. Why was it corrupted? Who can say?

by **yankee495** » Dec 17th, '16, 18:26

jiml8 wrote: 17 days later, no further trouble. I guess a corrupted filesystem was the heart of the problem. Why was it corrupted? Who can say?

Hello,

Sorry to hear that, I just hate it when that happens! I've found that when I have unexplained disk/file corruption that I know I didn't do, it usually turns out to be hardware. Overheating can cause it and so can a bad or loose SATA cable. It's also possible that you had a very short power failure unless you have a UPS. Speaking of power, a power supply that is getting flaky can cause it.

It happened to me one time with a new SATA cable. It had run for a while, maybe a few weeks, and I couldn't believe it was the cable, but it was. Overclocking can cause it too, but that usually shows up at the time you overdo it. It is possible there is a bug that struck just your system from a combination of hardware, installed/running software and drivers, and those are the worst to track down. I'd think that wasn't it but you never know. It could be a firmware bug in the SSD too; sometimes they have one that will crop up under the right circumstances that you may not be able to repeat.

One other thing is my cat. I had things happening one time and it was the cat walking on the keyboard and I had been thinking it was the kids and they were lying to me. I assume it's still running good which is a good thing but it may mean you'll never know what caused it. Be sure and ask your neighbors about the power failure thing when/if it happens again, they may have noticed the lights blink or something.

Since I've gotten better with Linux I don't goof it up too much and I've been lucky I guess with no hardware failures in a while. This past summer we had a couple of days where the power went on and off 3 or 4 times in a row real fast. I turned it off when I wasn't using it for a while back then. Like all of us I have a lot of work in my system and I really need to get a UPS to protect it and give it time to shut down or something.

I wish I could be more help, but that's why these unexplained crashes are so bad: there is really nothing you can do without better clues. Good luck and happy holidays and happy new year! We're going to have Mageia 6 soon and maybe that'll change something that caused it.

Oh, just a couple of days ago my neighbor had problems with her computer just shutting down. Since it turned cold she had the heat on and there was a heat vent blowing right in the back of the computer, causing it to overheat and shut down. A lot of people have problems with dirty fans and when they turn the heat on the computer goes flaky. I assume you've checked the fans/temp etc. If not, you may want to check that.

by **jiml8** » Dec 17th, '16, 23:20

My system has been solid since my last post on this thread. However, there have been two kernel updates, which have forced two reboots.

After seeing this latest response on this thread, I took a look at trim.log again (which I have not done for a month) and I saw this:

    *** Fri, 09 Dec 2016 04:02:01 -0700 ***
    /: 11 GiB (11840512000 bytes) trimmed
    /mnt/sdc5: 298 GiB (319940829184 bytes) trimmed
    /mnt/sdd5: 97.9 GiB (105104257024 bytes) trimmed

So, something happened between 4 AM Dec 8 and 4 AM Dec 9 that motivated trims of all free space on both SSDs...which is a symptom I saw when the system went bad a month and a half ago. However, the following night (and all nights subsequently) the trims looked appropriate.
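Since these oversized trims were only noticed by reading trim.log by hand, a small checker could flag them automatically. The following is a rough Python sketch, not what is actually running on this system: it assumes the log lines have the shape shown above (`<mount>: <size> (<N> bytes) trimmed`), and the "more than 10 times the median" threshold and the function names are arbitrary choices for illustration.

```python
import re
from statistics import median

# Matches lines such as "/mnt/sdc5: 298 GiB (319940829184 bytes) trimmed".
# Mount points containing whitespace are not handled by this pattern.
TRIM_RE = re.compile(r"^(?P<mount>\S+): .*\((?P<bytes>\d+) bytes\) trimmed$")


def parse_trim_lines(lines):
    """Yield (mount, bytes_trimmed) for each line that looks like a trim record."""
    for line in lines:
        m = TRIM_RE.match(line.strip())
        if m:
            yield m.group("mount"), int(m.group("bytes"))


def flag_outliers(history, factor=10):
    """history maps mount -> list of bytes trimmed per night.

    Returns {mount: [values]} for nights whose trim exceeded `factor`
    times that mount's median nightly trim.
    """
    flagged = {}
    for mount, values in history.items():
        typical = median(values)
        big = [v for v in values if typical > 0 and v > factor * typical]
        if big:
            flagged[mount] = big
    return flagged


if __name__ == "__main__":
    sample = [
        "*** Fri, 09 Dec 2016 04:02:01 -0700 ***",
        "/: 11 GiB (11840512000 bytes) trimmed",
        "/mnt/sdc5: 298 GiB (319940829184 bytes) trimmed",
        "/mnt/sdd5: 97.9 GiB (105104257024 bytes) trimmed",
    ]
    for mount, nbytes in parse_trim_lines(sample):
        print(mount, nbytes)
```

Feeding `flag_outliers` a per-mount history built from several nights of log entries would single out nights like Dec 9 without anyone having to open the log.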

Now, in that time frame, the latest kernel update came through and I did allow it to happen. So, there were indeed some big changes to the system at that time. I would expect that to affect / and maybe /mnt/sdc5 (contains /var as a symlink from / ) but not /mnt/sdd5 (home).

So, I guess I don't know what is going on here, and I don't know whether I should allow this to worry me or not.

I do use a UPS on both my workstation and my NAS.

I am debating now whether to take the system down to run filesystem checks. This is not a trivial decision; I have work to do and booting my full environment takes about half an hour.

by **jiml8** » Dec 18th, '16, 00:18

Well, it bothered me enough that I took the system down to test it.

User logout failed for me. This is a common problem for me, so I just switched to a console and killed the display manager. I then tried to unmount a volume that I should have been able to unmount (/dev/mapper/crypt_sdb1) and it would not unmount...it claimed to be busy though it had no open file handles per lsof.

So, I dropped to runlevel 1 (telinit 1) and tried to unmount it. That failed. So I tried a df and the console hung.
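A command like df that hangs this way is often stuck in uninterruptible sleep (state "D"), waiting on a filesystem that is not answering. One way to look for that from another console is to scan /proc for processes in that state. This is a minimal Python sketch of such a check, offered as a suggestion rather than something that was run here; the function names are made up, and it relies only on the documented layout of `/proc/<pid>/stat` (pid, command name in parentheses, then the state letter).

```python
import os


def parse_stat(text):
    """Parse one /proc/<pid>/stat record into (pid, comm, state).

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the LAST ')'.
    """
    left, right = text.rsplit(")", 1)
    pid, comm = left.split("(", 1)
    return int(pid), comm, right.split()[0]


def blocked_processes(proc_root="/proc"):
    """Return [(pid, comm)] for processes in uninterruptible sleep ('D')."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError):
            continue  # process exited meanwhile, or record was unreadable
        if state == "D":
            blocked.append((pid, comm))
    return blocked


if __name__ == "__main__":
    if os.path.isdir("/proc"):
        for pid, comm in blocked_processes():
            print(pid, comm)
```

Processes that stay in this list for a long time are a hint about what is blocking, although this does not by itself say which filesystem or device is at fault.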

I hit the reset button, and booted into a USB stick. Checked all filesystems. /dev/mapper/crypt_sda1 (which has /tmp symlinked to it), /dev/mapper/crypt_sdc5 (which has /var symlinked to it), and /dev/sdc1 (system) all needed to have the journal recovered and free block count adjusted (which is not a surprise when I hit reset) but showed no filesystem errors otherwise. /dev/mapper/crypt_sdb1 showed no errors at all.

Symlinking /var and /tmp rather than mounting as partitions is something I have done for many years. It saves me having to dedicate a (likely inappropriately sized) partition to the purpose. These directories are linked to large partitions, and the fact that they are directories lets them resize - both up and down - as required by the specific conditions.
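With /var and /tmp symlinked like this, it is not always obvious which mounted filesystem is actually being kept busy by them. The following Python sketch resolves a path through its symlinks and picks the mount point that holds it, using /proc/mounts-style text. It is an illustration only: the function names are made up, and the sample target `/mnt/sdc5/var` used in the comments is a guess at the layout, not something confirmed in this thread.

```python
import os


def parse_mounts(text):
    """Return [(mount_point, device)] from /proc/mounts-style text.

    Note: /proc/mounts escapes spaces in paths as \\040; that is not
    decoded here.
    """
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            mounts.append((fields[1], fields[0]))
    return mounts


def owning_mount(real_path, mounts):
    """Pick the longest mount point that is a path prefix of real_path."""
    best = None
    for mount_point, device in mounts:
        prefix = mount_point.rstrip("/") + "/"
        if real_path == mount_point or (real_path + "/").startswith(prefix):
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, device)
    return best


if __name__ == "__main__":
    if os.path.exists("/proc/mounts"):
        with open("/proc/mounts") as fh:
            table = parse_mounts(fh.read())
        for path in ("/var", "/tmp"):
            real = os.path.realpath(path)  # follows symlinks, e.g. to /mnt/sdc5/var
            print(path, "->", real, "on", owning_mount(real, table))
```

Knowing which mount a symlinked directory really lives on can help narrow down why that mount reports busy at shutdown.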

However, in these days of systemd, this is becoming problematic. I find I can't dismount these things on shutdown, so on those rare occasions when I do have to reboot I always have to hit the reset button after reaching the shutdown target. This should not hurt anything, but I cannot say for sure it does not hurt anything.

I will note that I booted into a Mageia 4 USB stick to check the filesystems, and when I rebooted after checking, the Mageia 4 stick also hung at the shutdown target, requiring me to hit reset.

by **filip** » Dec 18th, '16, 15:08

Please don't hit the Reset button or power switch before you try the gentler Linux restart, sometimes called the Magic SysRq key restart.

by **doktor5000** » Dec 18th, '16, 18:50

Well, if he cannot unmount filesystems manually, that won't help that much, and he can also manually trigger a sync before the reset. And I'm totally sure Jim is aware of magic sysrq, but thanks for pointing this out.
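For anyone reading later, the sync-before-reset idea can also be done without the keyboard combination by writing the command letters to `/proc/sysrq-trigger` as root. Below is a minimal Python sketch under that assumption; the function names and the two-second pause are illustrative choices, and only three letters are allowed here: `s` (sync), `u` (remount filesystems read-only) and `b` (immediate reboot). As noted above, it may not help much if the filesystems are already wedged.

```python
import time

SYSRQ_TRIGGER = "/proc/sysrq-trigger"  # standard Linux path; writing needs root


def _write_trigger(letter, path=SYSRQ_TRIGGER):
    """Send one SysRq command letter to the kernel."""
    with open(path, "w") as fh:
        fh.write(letter)


def sysrq_sequence(letters="su", write=_write_trigger, pause=2.0):
    """Send SysRq letters in order: 's' sync, 'u' remount read-only,
    'b' reboot immediately (without further syncing or unmounting).

    The default "su" deliberately stops short of rebooting.
    """
    allowed = {"s", "u", "b"}
    for letter in letters:
        if letter not in allowed:
            raise ValueError(f"unsupported SysRq letter: {letter!r}")
    for letter in letters:
        write(letter)
        time.sleep(pause)  # the sync is not instantaneous; give it a moment
```

Calling `sysrq_sequence("su")` as root and then pressing reset, or calling `sysrq_sequence("sub")` to let the kernel reboot, is roughly the scripted equivalent of the tail end of the usual key sequence.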

by **jiml8** » Dec 19th, '16, 05:51

Well, the reality is that VMware Workstation causes issues on the host. I usually have anywhere from 6 to 9 VMs running at a time, and which ones are running changes from time to time. I have no fewer than three physical NICs and 5 virtual networks running in this system all the time. I have tweaked the kernel in a number of ways to make things work better, and the fact is that this machine usually is up for a very long time between reboots; it is stable.

But there are issues. I have grown used to the issues, which I label a consequence of extreme complexity. The issues include things like networking getting slow, inability to log out, volumes I can't dismount, and sometimes Workstation crashing and leaving the VMs running...with their handles to Workstation lost. I have scripted a number of typical repair and recovery scenarios to employ when Workstation causes one of these problems. I've gotten pretty good at cleaning things up quickly and without rebooting, actually, and I don't really have to do it all that often.

So, issues on shutdown rank very low on my list of priorities to solve. I very rarely shut down and try hard to avoid rebooting. Nonetheless, it often makes it hard to tell if I have a filesystem problem.

I've given thought to eliminating this OS host/guest paradigm by switching to a hypervisor which would virtualize everything and put all operating systems on the same footing. This would be a huge reconfiguration though, so while I consider it I have not attempted it.

with a thud...
