4 models for escalating access permissions during emergencies

Just-in-time permissions for operational staff are a security best practice. But how do you manage them for incidents and outages?

By Lee Atchison, Contributor, InfoWorld

When building modern applications, managing access permissions during operational events is tricky.

Security best practices, such as the principle of least privilege, dictate that engineers, both developers and operations engineers, should have as little access as possible to the production application and its infrastructure. Sometimes business requirements or industry regulations restrict access to production even more severely. But even where no such requirement applies, least privilege still holds for everyone, including the engineers responsible for managing on-call operational issues.

However, this can become an issue when an on-call engineer must deal with a problem in a production application. When permissions are tightly controlled, on-call engineers occasionally need additional permissions to resolve production issues. Sometimes simple acts, such as rebooting a production server, are beyond the normal permissions of an on-call engineer but may be required in an emergency.

How do on-call engineers perform these activities during an emergency? By performing a permission escalation. This is an action that temporarily gives an engineer additional permissions in order to perform emergency procedures that would normally be beyond their permission allowances.

But how can engineers escalate their permissions without the escalation mechanism itself becoming a security vulnerability?

There are four ways to accomplish permission escalation that can grant your on-call engineers the permissions they require while limiting the security vulnerabilities inherent in granting the additional permissions. Each has advantages and disadvantages.

BTG or Break the Glass

The BTG or Break the Glass model allows an on-call engineer to request an escalation of permissions from an automated system, to be used only in an emergency situation. When the engineer requests these additional permissions, the system grants them, but it immediately logs the request and notifies the appropriate managers. Because the engineer is aware of this notification, they know they can only “Break the Glass” during an actual emergency, and that they will have to explain their actions later, such as during an upcoming incident review. This makes it unlikely that someone will request these additional permissions except when absolutely necessary.

This model is typically very easy to implement in a production environment, and it gives engineers the flexibility they require during an emergency. However, the review process is reactive, not proactive: management can review what caused the engineer to escalate their permissions only after the fact. If a disgruntled engineer requests a BTG to get permissions to perform a nefarious activity, management will know about it only after it occurs, not before. Because of this, the BTG model is a good fit for moderately secure environments, but it is not acceptable when a high level of security or protection against malicious insiders is required.
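The grant, log, and notify flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the `BreakGlassService` class, its one-hour `ttl`, and the in-memory lists standing in for audit storage and management notifications are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class BreakGlassService:
    """Grants temporary elevated access; logs and notifies on every request."""
    ttl: timedelta = timedelta(hours=1)                 # assumed lifetime of a grant
    audit_log: list = field(default_factory=list)       # stand-in for durable audit storage
    notifications: list = field(default_factory=list)   # stand-in for paging or email
    grants: dict = field(default_factory=dict)          # engineer -> expiry time

    def break_glass(self, engineer: str, reason: str, now: datetime) -> datetime:
        expires = now + self.ttl
        # Record and notify before the grant takes effect, so that no
        # escalation can happen without leaving a trace for the review.
        self.audit_log.append(
            {"engineer": engineer, "reason": reason, "at": now, "expires": expires}
        )
        self.notifications.append(
            f"BTG: {engineer} escalated at {now.isoformat()}: {reason}"
        )
        self.grants[engineer] = expires
        return expires

    def is_elevated(self, engineer: str, now: datetime) -> bool:
        # The grant lapses on its own once the expiry time has passed.
        expires = self.grants.get(engineer)
        return expires is not None and now < expires
```

Note that nothing in the sketch blocks the request. As the text says, the deterrent is the notification and the later review, not a gate.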

Logged escalation

In the logged escalation model, when an engineer needs to perform certain privileged activities beyond their normal permission level, they use specific commands that are logged and monitored for inappropriate access. For example, if an engineer requires access to a protected private network, they may log in to a bastion host, a server that gives them access to the protected private network. The bastion host logs all activities the engineer performs, making them available for examination after the event. The intent is to ensure that no bad actor got in and performed inappropriate actions on the protected network, yet legitimate activities can still take place normally.

Like the BTG model, the logged escalation model lets any engineer with appropriate permissions access the production network. Unlike BTG, that access can be used for any appropriate purpose, not only in an emergency. However, these users have access only from within a “glass house” environment, one where every action taken is made visible for later analysis. This provides many of the same advantages as the BTG model, with similar disadvantages: management learns of nefarious activity only after it has occurred and has no ability to prevent it beforehand.
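A “glass house” can be approximated by routing every privileged command through a wrapper that records it. The following Python sketch assumes a hypothetical `LoggedSession` class and an injected `execute` hook that actually runs the command. A real bastion host would also ship its records somewhere the engineer cannot alter them.

```python
from datetime import datetime, timezone
from typing import Callable


class LoggedSession:
    """Runs privileged commands only through a wrapper that records each one."""

    def __init__(self, engineer: str, execute: Callable[[str], str]):
        self.engineer = engineer
        self._execute = execute   # hypothetical hook that actually runs the command
        self.records = []         # stand-in for an append-only, off-host log

    def run(self, command: str) -> str:
        entry = {
            "engineer": self.engineer,
            "command": command,
            "at": datetime.now(timezone.utc),
            "status": "started",
        }
        # Record first, so a command that fails or hangs still leaves a trace.
        self.records.append(entry)
        try:
            output = self._execute(command)
        except Exception as exc:
            entry["status"] = f"error: {exc}"
            raise
        entry["status"] = "ok"
        return output
```

As in the model itself, the wrapper does not decide whether a command is appropriate. It only guarantees that a reviewer can see what was run, by whom, and when.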

Two-person escalation

Two-person escalation is an enhancement of the BTG model, where the BTG escalation is allowed only if two independent people are working together on the problem, and they both authorize the escalation. Then, by policy, all actions they take under BTG must be reviewed by both parties, and both parties must be involved in all escalation activities performed.

This is a significant security improvement over the basic BTG model, because it largely prevents a single disgruntled employee from damaging production simply by issuing a BTG escalation. Instead, two employees must work together to perform any action. No single employee can damage a system without a second employee as an accomplice, which significantly improves overall security.

The two-person escalation model can be harder to enforce, because its effectiveness depends on people following the policy. The policy must prevent one engineer from helping another obtain two-person BTG access and then leaving that lone engineer with exclusive, unmonitored use of it. If that can happen, the purpose of the two-person escalation model is defeated.
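The authorization half of this model is straightforward to enforce in software, as the Python sketch below suggests. The `TwoPersonEscalation` class and its method names are hypothetical. The sketch requires two distinct approvers before the escalation becomes active, and it requires every action to name both an actor and a witness.

```python
class TwoPersonEscalation:
    """Escalation becomes active only after two distinct engineers authorize it."""

    def __init__(self, incident: str):
        self.incident = incident
        self.approvers = set()

    def authorize(self, engineer: str) -> None:
        # A set ignores a repeat approval from the same person.
        self.approvers.add(engineer)

    @property
    def active(self) -> bool:
        return len(self.approvers) >= 2

    def perform(self, actor: str, witness: str, action: str) -> str:
        if not self.active:
            raise PermissionError("escalation needs two distinct approvers")
        # Actor and witness must be two different people, both of whom approved.
        if actor == witness or not {actor, witness} <= self.approvers:
            raise PermissionError("actor and witness must be two different approvers")
        return f"{action} performed by {actor}, witnessed by {witness}"
```

What the code cannot check is whether the named witness is actually watching. That part remains a matter of policy, which is why this model is harder to enforce than plain BTG.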

Limited-scope tools

For maximum security, the limited-scope tools model is the best choice. This involves creating custom tooling that performs the specific activities needed for operational maintenance of a production application. If you need to perform some action beyond your normal capabilities, you invoke it through a tool that is custom-designed to perform that action.

For example, if an on-call engineer must reboot a production server, they would normally log in to the production server as “root” and reboot the server. This requires a level of permissions that is unacceptable in most production environments. However, imagine a web console that gives an operations engineer the push-button ability to initiate a reboot of production servers. They can, using their normal permissions, perform a specific action that would normally require escalated permissions.

The advantage of the limited-scope tools model is that it gives the user the exact capabilities they require, and only those capabilities. This preserves the principle of least privilege, yet gives the operator the specific capabilities they require. The custom tool typically also provides the benefits of the logged escalation model by keeping track of who uses the tool and when it is used, so the activities involved can be tracked and examined later during an incident review.
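The reboot example can be sketched as a tool that holds the privileged capability itself and exposes only one narrow action. In this Python illustration, the `RebootTool` class, the server allowlist, the operator list, and the injected `reboot` backend are all assumptions made for the example.

```python
from datetime import datetime, timezone
from typing import Callable


class RebootTool:
    """Exposes exactly one privileged action: rebooting an allowlisted server."""

    def __init__(self, allowed_servers: set, operators: set,
                 reboot: Callable[[str], None]):
        self._allowed_servers = allowed_servers
        self._operators = operators
        self._reboot = reboot   # hypothetical privileged backend, held by the tool, not the user
        self.usage_log = []     # who used the tool, on what, and when

    def reboot_server(self, operator: str, server: str) -> None:
        allowed = operator in self._operators and server in self._allowed_servers
        # Log every attempt, including refused ones, for the incident review.
        self.usage_log.append({
            "operator": operator,
            "server": server,
            "at": datetime.now(timezone.utc),
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{operator} may not reboot {server}")
        self._reboot(server)
```

The operator never receives root access. They can trigger a reboot of a listed server and nothing else, and each use is recorded.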

The downside of the limited-scope tools model is that it doesn’t provide a generic escalation model, but gives access only to specific capabilities that were imagined ahead of time, using tooling created to allow that action. If you must perform some action that requires escalated permissions, and no tool is available to perform that action, you as an operations engineer may be simply out of luck.

As such, while limited-scope tools is the best, safest model overall, it often is not used alone but in conjunction with one of the other models for unanticipated permissions that might be needed. However, this can lessen the security advantages inherent in this model.

Which model is best?

All four methods are accepted practices, but each works for some businesses and is unacceptable to others. In practice, a combination of multiple methods is usually employed. By selecting the processes, complexity, and operational monitoring appropriate for your business and industry, you can implement permission escalation without unduly compromising your application and its security.

Lee Atchison is a recognized thought leader in cloud computing and application modernization. With more than three decades of experience in product development, architecting, scaling, and modernization, Lee has worked at Amazon, Amazon Web Services (AWS), New Relic, and other modern application organizations. He is widely quoted in many publications and has been a featured speaker across the globe. Lee’s most recent book is Architecting for Scale (O’Reilly Media). You can check out his books, courses, articles, and speaking sessions at Twitter and LinkedIn.