DrKonqi ❤️ coredumpd

Wednesday, 25 May 2022

Get some popcorn and strap in for a long one! I shall delight you with some insights into crash handling and all that unicorn sparkle material.

Since Plasma 5.24, DrKonqi, Plasma’s infamous crash reporter, has been able to route crashes through coredumpd, and it is amazing - albeit a bit unused. That is why I’m telling you about it now: it has matured a bit and is even more amazing - albeit still unused. I hope that will change.

To explain what any of this does I have to explain some basics first, so we are on the same page…

Most applications made by KDE rely on KCrash, a KDE framework that implements crash handling, to, well, handle crashes. How this works depends a bit on the operating system, but one way or another, when an application encounters a fault it first stops to think for a moment about the meaning of life and whatever else. We call that “catching the crash”. During that time frame we can apply further diagnostics that later help figure out what went wrong. On POSIX systems specifically, we generate a backtrace and send that off to our bugzilla for handling by a developer. That is, in essence, the job of DrKonqi.

Currently DrKonqi operates in a mode generally dubbed “just-in-time debugging”. When a crash occurs, KCrash immediately starts DrKonqi, DrKonqi attaches GDB to the still-running process, GDB creates a backtrace, and then DrKonqi sends the trace along with metadata to bugzilla.
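To make the “attach GDB to the still running process” step a bit more tangible, here is a minimal sketch of what such an invocation could look like. It is an illustration only, not DrKonqi’s actual command line: the helper name `gdb_attach_command` and the example PID are made up, and only the standard GDB options `-p`, `-batch` and `-ex` are assumed.

```python
import shlex


def gdb_attach_command(pid: int) -> list[str]:
    """Build a gdb command that attaches to a live process and dumps backtraces.

    Hypothetical helper for illustration; DrKonqi's real invocation may differ.
    """
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    return [
        "gdb",
        "-p", str(pid),                       # attach to the running process
        "-batch",                             # run the commands below, then exit
        "-ex", "thread apply all backtrace",  # one backtrace per thread
    ]


if __name__ == "__main__":
    # 12345 is a placeholder PID, not a real process.
    print(shlex.join(gdb_attach_command(12345)))
```

The important bit is that the process has to still be around when this runs, which is exactly what makes just-in-time debugging time-sensitive.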

Just-in-time debugging is often useful on developer machines because you can easily switch to interactive debugging and also have a more complete picture of the environmental system state. For user systems it is a bit awkward though. You may not have time to deal with the report right now, you may have no internet connection, or the crash may even be impossible to trace because of technical complications that arise during just-in-time debugging from how POSIX signals work (threads continue running :O), and so on.

In short: just-in-time really shouldn’t be the default.

Enter coredumpd.

Coredumpd, or systemd-coredump to use its proper name, is part of systemd and acts as a kernel core handler. Ah, that’s a mouthful again. Let’s backtrace (pun intended)… earlier when I was talking about KCrash I only told part of the story. When a fault occurs it doesn’t necessarily mean that the application has to crash, it could also neatly exit. It is only when the application takes no further action to alleviate the problem that the Linux kernel will jump in and do some rudimentary crash handling, forcefully. Very rudimentary indeed: it simply takes the memory state of the process and dumps it into a file. This is then aptly called a core dump. It’s kind of like a snapshot of the state of the process when the fault occurred and allows for debugging after the fact. Now things get interesting, don’t they? :)

So… KCrash can simply do nothing and let the Linux kernel do the work, and the Linux kernel can also be lazy and delegate the work to a so-called core handler, an application that handles the core dumping. Well, here we are. That core handler can be coredumpd, making it the effective crash handler.
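If you are curious which core handler your kernel delegates to, the setting lives in `/proc/sys/kernel/core_pattern`: a value starting with `|` means cores get piped into a program instead of being written to a file by the kernel itself. Below is a small sketch for inspecting that value. The function names are my own invention, and the handler path used in the comment is only what a typical systemd installation looks like, so treat it as an assumption.

```python
from pathlib import Path
from typing import Optional

CORE_PATTERN_PATH = Path("/proc/sys/kernel/core_pattern")


def core_handler(pattern: str) -> Optional[str]:
    """Return the handler program if the pattern pipes cores to one, else None."""
    pattern = pattern.strip()
    if not pattern.startswith("|"):
        # Plain file name template: the kernel writes the core file itself.
        return None
    parts = pattern[1:].split()
    return parts[0] if parts else None


def uses_systemd_coredump(pattern: str) -> bool:
    """Heuristic check: does the handler binary look like systemd-coredump?"""
    handler = core_handler(pattern)
    return handler is not None and handler.endswith("systemd-coredump")


if __name__ == "__main__":
    # On a typical systemd setup the pattern looks roughly like
    # "|/usr/lib/systemd/systemd-coredump %P %u ..." (the path may vary by distro).
    if CORE_PATTERN_PATH.exists():
        current = CORE_PATTERN_PATH.read_text()
        print("core handler:", core_handler(current) or "none (kernel writes the file)")
```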

What’s the point you ask? – We get to be lazy!

Also, core dumping has one huge advantage that is also its disadvantage (depending on how you look at it): when a core dumps, the process is no longer running. When backtracing a core dump you are looking at a snapshot of the past, not a still-running process. That means you can deal with crashes now or in 5 minutes or in 10 hours. So long as the core dump is available on disk you can trace the cause of the crash. This is further improved by coredumpd also storing a whole lot of metadata in journald. All put together it allows us to run DrKonqi after-the-fact, instead of just-in-time. Amazing! I’m sure you will agree.
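To give an idea of what that journald metadata looks like, here is a rough sketch that boils a single journal entry down to the bits a crash handler might care about. The `COREDUMP_*` field names are the ones systemd-coredump documents; everything else (the `CrashSummary` type, the `summarize` helper, and the sample values) is made up for illustration. Such an entry could come from something like `journalctl -o json`, where every value arrives as a string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrashSummary:
    pid: int
    signal: int
    exe: str
    core_file: Optional[str]  # None when no core file was kept on disk


def summarize(entry: dict) -> CrashSummary:
    """Pick the interesting COREDUMP_* fields out of one journal entry."""
    return CrashSummary(
        pid=int(entry["COREDUMP_PID"]),
        signal=int(entry["COREDUMP_SIGNAL"]),
        exe=entry.get("COREDUMP_EXE", "<unknown>"),
        core_file=entry.get("COREDUMP_FILENAME"),
    )


# Invented sample entry; the values do not come from a real crash.
SAMPLE_ENTRY = {
    "COREDUMP_PID": "4242",
    "COREDUMP_SIGNAL": "11",
    "COREDUMP_EXE": "/usr/bin/example-app",
    "COREDUMP_FILENAME": "/var/lib/systemd/coredump/core.example-app.zst",
}

if __name__ == "__main__":
    print(summarize(SAMPLE_ENTRY))
```

Because all of this is sitting in the journal, nothing about it depends on the crashed process still being alive.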

For the user everything looks the same, but under the hood we’ve gotten rid of various race conditions and gotten crash persistence across reboots for free!

Among other things this gives us the ability to look at past crashes, a GUI for which will be included in Plasma 5.25. Future plans also include the ability to file bug reports long after the fact.

Inner Workings

The way this works behind the scenes is somewhat complicated but should be easy enough to follow:

The application produces a fault
KCrash writes KCrash-specific metadata into a file on disk and doesn’t exit
The kernel issues a core dump via coredumpd
The systemd unit coredump@ starts
At the same time drkonqi-coredump-processor@ starts
The processor@ waits for coredump@ to finish its task of dumping the core
The processor@ starts drkonqi-coredump-launcher@ in user scope
launcher@ starts DrKonqi with the same arguments as though it had been started just-in-time
DrKonqi assembles all the data to produce a crash report
The user is greeted by a crash notification just like just-in-time debugging
The entire crash reporting procedure is the same
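The one ordering constraint worth dwelling on in the steps above is that processor@ starts alongside coredump@ but must not proceed until the core is fully written. As a toy model of that handoff (this is plain Python threading, not how systemd actually sequences units, and all names are invented), it can be sketched as:

```python
import threading


def run_pipeline() -> list[str]:
    """Toy model: the processor starts early but waits for the core dump."""
    log: list[str] = []
    core_written = threading.Event()

    def coredump_unit() -> None:
        log.append("coredump@: core written")
        core_written.set()  # signal that the dump is complete

    def processor_unit() -> None:
        core_written.wait()  # block until coredump@ has finished
        log.append("processor@: core ready, starting launcher@")
        log.append("launcher@: starting DrKonqi")

    # Start the processor first to show that it really does wait.
    processor = threading.Thread(target=processor_unit)
    coredump = threading.Thread(target=coredump_unit)
    processor.start()
    coredump.start()
    processor.join()
    coredump.join()
    return log


if __name__ == "__main__":
    for step in run_pipeline():
        print(step)
```

The processor only ever logs after the event is set, so the order of the log is the same on every run even though both threads start at nearly the same time.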

Use It!

If you are using KDE neon unstable edition, you have already been using coredumpd-based crash reporting for months! You haven’t even noticed, have you? ;)

If not, here’s your chance to join the after-the-fact club of cool kids.

Set `KCRASH_DUMP_ONLY=1` in your `/etc/environment` and make sure your distribution has enabled the relevant systemd units accordingly.
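If you want to double check that the variable actually made it into the file, a quick sketch like the following will do. It is a deliberately simplified parser of the `KEY=VALUE` format (it ignores trailing comments and other finer points), the helper names are hypothetical, and treating exactly `1` as “enabled” is my assumption based on the value given above.

```python
from pathlib import Path
from typing import Optional


def env_file_value(text: str, key: str) -> Optional[str]:
    """Return the last value assigned to key in KEY=VALUE style text, if any."""
    value = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        name, sep, val = line.partition("=")
        if sep and name.strip() == key:
            value = val.strip().strip("\"'")  # drop surrounding quotes
    return value


def dump_only_requested(text: str) -> bool:
    """Assumption: the value '1' is what switches KCrash to dump-only mode."""
    return env_file_value(text, "KCRASH_DUMP_ONLY") == "1"


if __name__ == "__main__":
    env_path = Path("/etc/environment")
    if env_path.exists():
        print("KCRASH_DUMP_ONLY=1 set:", dump_only_requested(env_path.read_text()))
```

Remember that `/etc/environment` is only read at login, so log out and back in before expecting anything to change.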