|
|
Subscribe / Log in / New account

Warning about WARN_ON()

LWN.net needs you!

Without subscribers, LWN would simply not exist. Please consider signing up for a subscription and helping to keep LWN publishing

By Jonathan Corbet
April 18, 2024
Kernel developers, like conscientious developers for many projects, will often include checks in the code for conditions that are never expected to occur, but which would indicate a serious problem should that expectation turn out to be incorrect. For years, developers have been encouraged (to put it politely) to avoid using assertions that crash the machine for such conditions unless there is truly no alternative. Increasingly, though, use of the kernel's WARN_ON() family of macros, which developers were told to use instead, is also being discouraged.

A longstanding way to test for a condition that cannot be recovered from is the BUG_ON() macro, which includes a test for the unexpected condition:

    /* This can never happen, honest, would I lie? */
    BUG_ON(foo_ptr == NULL);

A BUG_ON() call leads directly to a kernel panic, resulting (usually) in the machine being rebooted. There are times when there is no alternative, but use of BUG_ON() has been discouraged for years. Crashing the machine deprives the user of any chance of reacting to the problem or saving work and can make it harder to track down the source of the problem. Even so, there are something like 12,000 BUG_ON() instances in the kernel (not counting BUILD_BUG_ON(), which only affects the build process and is not discouraged in the same way).

Instead, developers are told to use WARN_ON(), which puts a traceback into the kernel log but does not crash the machine (in theory, at least, but keep reading). The kernel's coding-style document says:

Do not add new code that uses any of the BUG() variants, such as BUG(), BUG_ON(), or VM_BUG_ON(). Instead, use a WARN*() variant, preferably WARN_ON_ONCE(), and possibly with recovery code. Recovery code is not required if there is no reasonable way to at least partially recover.

Increasingly, though, developers are being told to avoid using WARN_ON() as well. There are a couple of reasons for that. One is that any WARN_ON() that can be triggered from user space can be used, at a minimum, to spam the system log, obscuring other events and perhaps affecting system performance. The other reason, though, applies even to WARN_ON() calls that cannot be triggered in this way.

The kernel contains a sysctl knob called panic_on_warn. It does exactly what its name suggests: if this option is set, any WARN_ON() call will cause the system to panic. In essence, it turns WARN_ON() calls into BUG_ON() calls. This option is set by users who see any warning as a sufficiently suspicious event that, when one happens, it is better to kill the system and start over. Such users include many Android devices and the host kernels run at cloud providers (and beyond). Any WARN_ON() that actually triggers, in other words, has the potential to bring down a lot of machines.

The same coding-style document advises developers that this outcome is something that panic_on_warn users have explicitly opted into:

However, the existence of panic_on_warn users is not a valid reason to avoid the judicious use WARN*(). That is because, whoever enables panic_on_warn has explicitly asked the kernel to crash if a WARN*() fires, and such users must be prepared to deal with the consequences of a system that is somewhat more likely to crash.

The current pressure against WARN_ON() use is not entirely consistent with this advice, though. Thus, Alex Elder was recently motivated to send a patch changing the advice given in the coding-style document. Gone is the language suggesting that panic_on_warn users were getting what they asked for; the new text reads:

The existence of this option is not a valid reason to avoid the judicious use of warnings. There are other options: ``dev_warn*()`` and ``pr_warn*()`` issue warnings but do **not** cause the kernel to crash. Use these if you want to prevent such panics.

Christoph Hellwig was quick to call this change "wronger than wrong": "If you set panic_on_warn you get to keep the pieces". Laurent Pinchart pointed out that the suggested alternatives are not the same; they are much easier to ignore and, thus, less effective at getting developers to fix the problem that the warning is trying to draw attention to. Greg Kroah-Hartman, though, was happy to see this change. The recommendation to avoid panic_on_warn has been ignored, he said, so new WARN_ON() calls should not be added.

To summarize the situation: over the years, BUG_ON() has been seen as so destructive that developers are simply told not to use it at all. The WARN_ON() macro has, instead, taken its place; but in settings where panic_on_warn is set, the end result of a WARN_ON() call is essentially the same. So, naturally, use of WARN_ON() is also now discouraged much of the time.

Whether the proposed documentation change will be applied is unclear; the kernel's befuddled documentation maintainer, who has happily not been appointed the arbiter of the kernel's coding style, makes a point of not applying coding-style changes in the absence of a clear consensus. It is not clear that a consensus on this change exists currently. Regardless of that change, though, developers will continue to be encouraged toward logging functions like pr_warn() instead of WARN_ON() — until somebody inevitably adds a panic_on_pr_warn sysctl knob and the whole process starts over again.

Index entries for this article
KernelWarnings


(Log in to post comments)

Warning about WARN_ON()

Posted Apr 18, 2024 14:50 UTC (Thu) by ukleinek (subscriber, #56625) [Link]

When I was faced the last time with review feedback to not add a WARN_ON I had the same thought you had at the end of the article about panic_on_pr_warn.

However I suggest to use number suffixes to keep the function names somewhat handy also after say 10 iterations. So I suggest to add a WARN_ON1() that should be recommended until this one is frowned upon, too. That's when we introduce WARN_ON2(). The panic_on_... parameter would just get the same suffix. Otherwise we end up with panic_on_warn_now_really_2026_honestly_yesIknowThisIsABadIdea_new.

Warning about WARN_ON()

Posted Apr 18, 2024 20:33 UTC (Thu) by roc (subscriber, #30627) [Link]

This is a fantastic idea, bravo!

Warning about WARN_ON()

Posted Apr 18, 2024 22:34 UTC (Thu) by snajpa (subscriber, #73467) [Link]

*if* the WARN_ONs contained a kernel version, year, year+month (or ...?), one could teach the panic_on_warn to recognize a cut off, that would be used to limit the warns that trigger a panic to already-known subset (so the kernel can be upgraded, panic_on_warn stays at the same value for most of the fleet while keeping an eye on just a subset where it gets bumped)...

but that's a big if :-D it would probably be ugly to annotate all WARN_ONs with a tag for when they appeared - and not all kernels get built from a git clone... perhaps there could be an indirect file where that information could live, auto-generated from git?

not that I care personally to implement it, I am happy with the current panic_on_warn as is, without complaints

Warning about WARN_ON()

Posted Apr 19, 2024 17:32 UTC (Fri) by Heretic_Blacksheep (subscriber, #169992) [Link]

Theoretically it should just be given a comment like "//added <date> for <specific check case that shouldn't happen> NOTE: if you panic on this, that's YOUR choice". Just because people abuse a sysctl tweak doesn't mean the rest of its users should lose a useful tool, which is kinda what that documentation change suggests.

Warning about WARN_ON()

Posted Apr 19, 2024 11:40 UTC (Fri) by tialaramex (subscriber, #21167) [Link]

Right, this is like the Euphemism Treadmill. From a engineering point of view this is futile, and so it would be appropriate for someone to authoritatively say no, we're not wasting our time on this.

Warning about WARN_ON()

Posted Apr 18, 2024 15:04 UTC (Thu) by shironeko (subscriber, #159952) [Link]

it really sounds like not adding new use of WARN defeats the point of panic_on_warn. If we go down that route why not just make panic_on_warn a no-op instead?

If the idea is that current use of WARN is too liberal and people are using it in reachable code, the perhaps the better idea is to define strictly when WARN is appropriate? or create different macros for different use-cases. rather than a blunt "don't use this".

Warning about WARN_ON()

Posted May 13, 2024 7:05 UTC (Mon) by cpitrat (subscriber, #116459) [Link]

Exactly that! Though some people do want panic_on_warn. So make panic_on_warn a no-op and introduce a really_panic_on_warn for users who know what they are doing :-)

Warning about WARN_ON()

Posted Apr 18, 2024 15:09 UTC (Thu) by pbonzini (subscriber, #60935) [Link]

100% agreed with Christoph here.

pr_warn is a completely different thing than WARN_ON. If syzbot triggers a WARN it's because something ought to be impossible; it's exactly what fuzzing is supposed to detect. A pr_warn instead would go completely undetected and is mostly pointless.

WARN_ON vs. WARN_ON_ONCE is a different story: WARN_ON_ONCE does not spam the syslog so it's better than WARN_ON almost always. But the fact that panic_on_warn is used in production systems where you cannot see the logs is not the developers' problem.

Personally I'm going to add WARN_ONs to KVM liberally just like we did so far.

Warning about WARN_ON()

Posted Apr 19, 2024 7:22 UTC (Fri) by david.hildenbrand (subscriber, #108299) [Link]

... unfortunately I (the author of that change) has to learn on lwn.net about that discussion. To me, that says a lot about how much time was spent looking into the history of that documentation, which is part of the introducing commit [1].

So yes, I'll also continue to use and ACK *sane* use of WARN_ON_ONCE. Just as documented.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/...

Warning about WARN_ON()

Posted Apr 18, 2024 15:17 UTC (Thu) by epa (subscriber, #39769) [Link]

Surely the panic_on_warn flag should also apply to dev_warn* and pr_warn* calls?
Or at least they should have their own flags to panic -- for people who really want that?

Warning about WARN_ON()

Posted Apr 24, 2024 17:01 UTC (Wed) by crystalwood (subscriber, #168140) [Link]

Why would you want that? Those prints aren't necessarily meant to indicate a kernel fault; e.g. it could be complaining about hardware or firmware, and it might happen on every boot.

Warning about WARN_ON()

Posted May 13, 2024 7:12 UTC (Mon) by cpitrat (subscriber, #116459) [Link]

I think this was sarcastic

Warning about WARN_ON()

Posted Apr 18, 2024 15:22 UTC (Thu) by mfuzzey (subscriber, #57966) [Link]

The nice (from a developpers point of view) thing about WARN_ON is that gives a stack trace which can be helpful in understanding why the situation occured which dev_warn() and pr_warn() don't do. The stack trace is also "noisier" which seems to help getting the problem reported.

Maybe WARN_ON_NEVER_PANIC() or WARN_ON_MAY_PANIC() ?

But the reasoning for avoiding WARN_ON() related to the panic_on_warn option seems very strange to me.
If the system administrator / device manufacturer decides that rebooting on a warning is important to them that's their perogative right?
If people stop using WARN_ON() then panic_on_warn becomes pointless and the said administrators lose whatever they thought they were gaining by panic_on_warn.

Warning about WARN_ON()

Posted Apr 24, 2024 16:55 UTC (Wed) by crystalwood (subscriber, #168140) [Link]

I vote for WHINE_ON()

Warning about WARN_ON()

Posted Apr 18, 2024 15:31 UTC (Thu) by atnot (subscriber, #124910) [Link]

Every time this situation comes up, I think of how much it mirrors the postel's law fallacy. There, while in the short term being lax about accepting slighly incorrect code reduces the amount of breakage, in practice over time it just proliferates that breakage unnoticed, until someone writes something that actually adheres to the spec and everything blows up. The solution to this has generally been much stricter conformance test suites and tightening up of protocols.

It's not quite the same thing, but I do wonder how much better tooling we would have and how much more bugs would be found early and how much more robust things would be if coding errors reliably caused some sort of panic.

There was a blogpost recently that strengthened this to me:
> Take Unix for example. If you call close on a file descriptor you never opened, you get an error code back. If you call open and hand it a null pointer instead of a pathname? You get an error code back. Both of these are violations of a system call’s preconditions, and both are handled through the same error mechanism that handles “file not found” and other cases that can happen in a correct program.
> On Hubris, if you break a system call’s preconditions, your task is immediately destroyed with no opportunity to do anything else.
> But combined with the kernel’s habit of faulting any task that looks at it funny, this makes the system’s behavior very unusual compared to most operating systems.
> And it’s been great.
> Initially I was concerned that I’d made the kernel too aggressive, but in practice, this has meant that errors are caught very early in development. A fault is hard to miss, and literally cannot be ignored the way an error code might be.
https://cliffle.com/blog/hubris-reply-fault/

As noted, it also has the advantage of removing an absolute ton of untested error handling code for things that should never occur.

I do recognize that this is a bit unrealistic with the way the kernel is currently written and suddenly changing course on it would be extremely painful. But I think one has to be careful to not make this situation worse.

Recovery because you Expect Failure

Posted Apr 18, 2024 19:03 UTC (Thu) by k3ninho (subscriber, #50375) [Link]

I've got a number of things I'm thinking about which earn the label 'anti-fragile behaviour'. We can make systemic choices to never leave Undefined Behaviour as just UB, because it always needs clear direction; we can have a 'failure domain' where WARN_ON would enter a debugging state if there are developer credentials in place (whether a development workstation or monitoring a prod system for a rare set of causes) or else to attempt recovery while anticipating failure; BUG_ON might write system state to predefined protected store and, if this BUG_ON has not been triggered since power-on, flag it as being hit and KEXEC a fresh kernel.

Being explicit about recovery, about unplanned failure, allows people to plan for it and to be robust when it goes wrong. I have spare CPU cycles and gigabytes of RAM and non-volatile storage, and I will accept lower performance to protect against outages.

K3n.

Warning about WARN_ON()

Posted Apr 19, 2024 0:48 UTC (Fri) by khim (subscriber, #9252) [Link]

> It's not quite the same thing, but I do wonder how much better tooling we would have and how much more bugs would be found early and how much more robust things would be if coding errors reliably caused some sort of panic.

It would have never worked. Remember that Tony Hoare story about Algol users and FORTRAN users. That was decades ago.

And story back then was exactly the same as it is today: Algol users though it's extremely silly to remove runtime checks even in production while FORTRAN users were more of I don't care about the subscript error I just want it to run variety.

And the whole thing with FORTRAN lovers (who, later, become a BASIC lovers, PHP lovers, JavaScript lovers, worse is better) works like that because of that Lemon socialism (aka privatization of gains and socialized of losses).

If you, personally, benefit from the ability to run code with subscript errors (or any other gratuitous bugs) and may shift the blame (and accompanying losses!) when the whole thing would, inevitably, fall aparts on someone else then, of course, you would want to have ensure that there are as few checks as possible. Very often that “someone else” is the poor guy who actually pays for the whole things when s/he purchases cheap chineese gadget.

If, on the other hand, you are developing software which is not enabler or some kind of shovelware, but something you, personally, want to develop and use for extended period of time… then you would want to apply fail fast approach as much as possible.

Linus managed to marry these two approaches to a large extent and this is precisely what made Linux kernel the most popular OS kernel in existence, but, of course, there are always intrinsic tension between proponents of these two approaches.

Warning about WARN_ON()

Posted Apr 18, 2024 16:04 UTC (Thu) by rgb (subscriber, #57129) [Link]

So the kernel has the inverse of a trigger warning which needs to be protected by a trigger warning. I need to take my medicine now.

Warning about WARN_ON()

Posted Apr 18, 2024 16:40 UTC (Thu) by iabervon (subscriber, #722) [Link]

It seems to me like Android devices and the host kernels at cloud providers are not supposed to be debugged (from inside the live system) or provide a way for users to save their work. Cloud providers want their systems to panic on any condition that's unanticipated, because that fits neatly into their handling of hardware failures, and they want debugging of these conditions to happen on test systems that are not following that policy.

Warning about WARN_ON()

Posted Apr 18, 2024 16:53 UTC (Thu) by mb (subscriber, #50428) [Link]

panic_on_pr_warn=1 when?

> The recommendation to avoid panic_on_warn has been ignored

So?
"It hurts, if I do this."
I don't see the problem.

Warning about WARN_ON()

Posted Apr 18, 2024 17:16 UTC (Thu) by adobriyan (subscriber, #30858) [Link]

panic_on_warn is hilarious.

Those protecting against exploits with panic_on_oops should "#define WARN_ON BUG_ON" and spare us from yet another kernel option.

Warning about WARN_ON()

Posted Apr 18, 2024 22:42 UTC (Thu) by kees (subscriber, #27264) [Link]

Compile-time-only behavioral control options means people using Distro kernels are out of luck. It's important to keep these kinds of things at least as a boot-time flag.

Warning about WARN_ON()

Posted Apr 24, 2024 1:47 UTC (Wed) by jwarnica (subscriber, #27492) [Link]

Or they are in-luck because their primary source of support - volunteer or paid - has made an explicit choice to limit the scope of support, and thus, be better able to credibly provide support for that.

Warning about WARN_ON()

Posted Apr 18, 2024 21:11 UTC (Thu) by roc (subscriber, #30627) [Link]

My experiences at Mozilla taught me that you want two kinds of assertions:
A) "Something is wrong and it is dangerous to proceed. Emit debugging information and terminate." (e.g. when you suspect memory corruption)
B) "Something is wrong but it's not important enough to terminate. Emit debugging information and continue." (e.g. when something is at the wrong position on the screen)

People often complained that severity-B warnings were too easy to ignore (e.g. while testing) and wanted to make them severity-A instead. They were wrong. You instantly realize that's a bad idea when you're trying to debug a severity-A problem and you can't because you keep hitting a different severity-B problem that is incorrectly being treated as severity-A. You do need telemetry and management discipline to make sure that severity-B problems are tabulated and not ignored.

Warning about WARN_ON()

Posted Apr 19, 2024 15:27 UTC (Fri) by shironeko (subscriber, #159952) [Link]

Just temporarily remove that other assert locally while you are debugging?

Also fast forward to when one realize that other error is exactly what caused the error you are debugging.

Warning about WARN_ON()

Posted Apr 20, 2024 4:31 UTC (Sat) by NYKevin (subscriber, #129325) [Link]

The problem is that the definition of "important" varies with the use case. I would say we need to consider at least three different situations:

1. A "pet" where all of the code running in userspace (regardless of UID) is either trusted or (adequately) sandboxed.
2. A "cattle" where all of the code running in userspace is trusted or sandboxed.
3. A "cattle" where some of the code running in userspace is not trusted and not (adequately) sandboxed.

In case (1), panicking is a big deal. It could cause serious data loss. We do not want to do it. In cases (2) and (3), there is other infrastructure that will deal with a panic. We will not lose (much) data, and whatever data loss does happen is "priced in" (i.e. we have accepted that we may lose it and planned accordingly). In cases (1) and (2), continuing in an invalid or unexpected case is somewhat likely to be safe (though obviously not certain). In case (3), we know that userspace is actively trying to take over the system at least some of the time, and continuing in an invalid or unexpected case would potentially make it easier for them to do that.

There is also:

4. A "pet " where some of the code running in userspace is not trusted and not (adequately) sandboxed.

This describes the average user's home computer. Unfortunately, in this case, we're screwed, because we have to choose between possibly losing data, and possibly letting an attacker take over the system.

In this thread, I think most people are taking the position that (4) is a lost cause (or at least, it requires finer-grained security than panic_on_warn can provide - see for example seccomp, SELinux, etc.), and therefore panic_on_warn should be calibrated to the expectations of (3) and maybe (2). That entails using WARN_ON liberally in contexts where a genuinely unexpected or invalid state arises, but conservatively or not at all in more mundane situations that may simply reflect a userspace-level misconfiguration. (4) can then be configured with panic_on_warn disabled, and other security measures can be applied to the extent practicable.

Warning about WARN_ON()

Posted Apr 18, 2024 21:58 UTC (Thu) by nevets (subscriber, #11875) [Link]

I have a strict requirement in my code that all WARN_ON() calls are used only to show that a bug happened in the kernel. If it is hit, it's a bug (I agree that this should never be used for incorrect user space input).

Panic on WARN should never trigger, and if it does, then a fix to the kernel is urgently needed.

I truly believe that this use should be encouraged.

Warning about WARN_ON()

Posted Apr 18, 2024 22:46 UTC (Thu) by kees (subscriber, #27264) [Link]

This is exactly the intent of WARN*(). It should be use for expected-to-be-totally-unreachable condition. Killing the entire system (e.g. BUG) reduces the likelihood that it will get caught/debugged/etc. But if userspace can reach it, something is very wrong. (And this is why many deployments set panic_on_warn: something impossible happened: we can no longer trust the integrity of the system.)

Warning about WARN_ON()

Posted Apr 21, 2024 16:08 UTC (Sun) by marcH (subscriber, #57642) [Link]

> The recommendation to avoid panic_on_warn has been ignored

Of course it has been "ignored", because:

> we can no longer trust the integrity of the system

That's the simple reason why this advice has been ignored: because it's really bad!

"Something that should never happen, happened. Let's pretend nothing important happened and keep running anyway this monolithic kernel written in the programming language of memory corruptions". Unbelievable.

If some crazy people or unusual use cases want panic_on_warn to be OFF for $reasons then let them - but don't let them force that bad choice on everybody else by changing the semantics of existing APIs. That's even more unbelievable.

Warning about WARN_ON()

Posted Apr 21, 2024 16:21 UTC (Sun) by mb (subscriber, #50428) [Link]

Yes, I also have panic-on-{oops, warn, ...} and panic-reboot enabled on my server systems.

It makes a whole lot of sense to reboot when the system got into a state that was thought to be impossible.
Going on as if nothing happened is not an option for me.

I do understand why this behavior would not be ideal on desktop systems.
There I'd like to be notified and make the decision manually.

Suggesting pr_warn for "impossible" states is taking away this decision from administrators. Which is wrong. Kernel developers cannot decide what to do in these cases, because there's no single correct answer.

Warning about WARN_ON()

Posted Apr 21, 2024 18:01 UTC (Sun) by marcH (subscriber, #57642) [Link]

> Suggesting pr_warn for "impossible" states is taking away this decision from administrators. Which is wrong. Kernel developers cannot decide what to do in these cases, because there's no single correct answer.

This. The job of kernel _developers_ is to 1) provide APIs with a "scale" of error levels defined as best as possible. As simple a scale as possible, but not simpler 2) Thoroughly enforce it through code reviews. The latter is not easy because error handling is hard to test, rarely ever tested, formally defining "levels" is also hard and many developers don't care about errors. So it'll never be perfect. But it's not optional and must be done as best as possible.

Then, what to do with these error levels is a POLICY decision that belongs to the _administrator_ and specific use cases. BTW the _same_ administrator can be dealing with errors differently depending on the use cases.

"Mechanism, not policy". Default values matter and sometimes they must be adjusted (slowly and carefully) but developers should never remove some knobs or change their semantics only because of the perception that "some" administrators don't understand them. First, developers can't possibly understand all use cases. Then what about all the administrators who have been using the knobs as intended the whole time? Why should they suffer?

Warning about WARN_ON()

Posted Apr 24, 2024 2:02 UTC (Wed) by jwarnica (subscriber, #27492) [Link]

>"Mechanism, not policy". Default values matter and sometimes they must be adjusted (slowly and carefully) but developers should never remove some knobs or change their semantics only because of the perception that "some" administrators don't understand them. First, developers can't possibly understand all use cases. Then what about all the administrators who have been using the knobs as intended the whole time? Why should they suffer?

Opensource allows people to get into things deeper than credible levels of safety.

Absolutely it is true that "developers can't possibly understand all use cases", which suggest their offer of software that works should be credible. Not in any legal sense of "warranty for a particular purpose" but some level of credible thinking, analysis, testing. If you've tested 13 scenarios, say you've tested 13 scenarios, not 255 scenarios since that is the next smallest data type.

If the downstream user wants to open up the code and test more, fine. They can do that. But that wasn't what the upstream provided and implied they tested; the user then is on the hook for what happens.

Warning about WARN_ON()

Posted Apr 24, 2024 18:23 UTC (Wed) by marcH (subscriber, #57642) [Link]

> But that wasn't what the upstream provided and implied they tested; the user then is on the hook for what happens.

The number of clones, branches, commits/versions and the Kconfig combinatorial explosion is so huge that "tested upstream" does not really mean anything. At best you could have metric measuring some distance from a git tag from Linus and the ,config he uses but would that be very useful?

The linux kernel is not a product, it's a very large set of legos.

"Upstream" is not a location, it's a direction. It's in the name.


Copyright © 2024, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds