Beyond Cloudflare Access: building zero-trust ARO platforms with Keycloak and oauth2-proxy

Zee
#kubernetes #aro #keycloak #zero-trust #identity

Cloudflare Access is a good product. We deploy it. We recommend it for the right workloads. But I keep meeting teams who reach for it to authenticate an in-cluster application running on ARO or AKS, and that’s the wrong tool for that boundary.

The pattern looks tidy on a whiteboard. Put Cloudflare Access in front of the app. Users authenticate at the edge. The application receives a signed JWT. Done. No identity code to write.

In production it creates problems you don’t see until something breaks. Let me explain the topology we use instead, and why.

The trust boundary problem

Cloudflare Access authenticates at Cloudflare’s edge. That’s the whole point. Your identity decision happens outside your cluster, outside your VNet, outside the boundary your auditor drew on the architecture diagram.

For a public marketing site, that boundary is fine. The thing being protected is low-risk and the edge is a reasonable place to make the call.

For an in-cluster workload handling regulated data, you’ve now made an external SaaS the authority for access into your most sensitive environment. The application trusts a header. If anything can reach the application’s port without traversing Cloudflare, and the application doesn’t verify the token’s signature itself, the header can be forged and the auth is gone. So you bolt on mTLS, or service tokens, or IP allowlisting back to Cloudflare’s ranges, and the simple pattern is no longer simple.
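To make the forged-header risk concrete, here is a minimal sketch. It assumes a hypothetical `X-Identity` header and uses an HMAC with a made-up shared key as a stand-in for the edge’s signature; a real edge-issued JWT is signed asymmetrically and checked against the provider’s published public keys, so treat this as the shape of the problem, not an implementation.

```python
import hashlib
import hmac
from typing import Mapping, Optional

# Hypothetical key for illustration only. It stands in for the edge's signing
# key; real edge tokens are verified with public keys, not a shared secret.
EDGE_KEY = b"demo-only-key"


def sign(identity: str) -> str:
    """Signature the (simulated) edge attaches alongside the identity."""
    return hmac.new(EDGE_KEY, identity.encode(), hashlib.sha256).hexdigest()


def naive_identity(headers: Mapping[str, str]) -> Optional[str]:
    """Trust the identity header as-is. Anyone who can reach the port can set it."""
    return headers.get("X-Identity")


def verified_identity(headers: Mapping[str, str]) -> Optional[str]:
    """Accept the identity only if its signature checks out."""
    identity = headers.get("X-Identity")
    signature = headers.get("X-Identity-Signature", "")
    if identity is None:
        return None
    if hmac.compare_digest(signature, sign(identity)):
        return identity
    return None


# A request that came through the edge carries a valid signature.
via_edge = {
    "X-Identity": "alice@example.com",
    "X-Identity-Signature": sign("alice@example.com"),
}
# A request sent straight to the application's port, bypassing the edge.
forged = {"X-Identity": "admin@example.com"}

print(naive_identity(forged))     # the forged identity is accepted
print(verified_identity(forged))  # rejected: no valid signature
```

The naive handler is only safe if the network guarantees nothing reaches it except through the authenticating hop, which is exactly the guarantee the edge pattern struggles to give you.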

The deeper issue is conceptual. Identity for an in-cluster workload belongs inside the cluster’s trust domain, enforced by something the cluster controls. When the identity decision lives at a third-party edge, your zero-trust posture has a hole in it that’s hard to reason about and harder to evidence.

The latency cost

Trace a request. A user in Lagos hits your ARO app behind Cloudflare Access. The request goes to the nearest Cloudflare PoP, gets evaluated against the Access policy, then routes to your origin. If your origin is South Africa North, the request has now travelled to a Cloudflare edge, had a policy decision made, and travelled on to Johannesburg.

For a single page load nobody notices. For a chatty internal application making dozens of API calls, every round trip pays for the edge hop. We’ve measured 60-120 ms of added latency on internal tooling that didn’t need to leave the region at all.
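A back-of-the-envelope sketch of how that adds up. It assumes the 60-120 ms applies to each call and that the calls happen one after another; the figure of 24 calls is hypothetical, and parallel calls would hide part of the cost.

```python
def added_latency_ms(sequential_calls: int, per_call_ms: float) -> float:
    """Total extra wait when every call pays the edge hop in sequence."""
    return sequential_calls * per_call_ms


calls = 24  # hypothetical: "dozens" of API calls behind one screen
low = added_latency_ms(calls, 60)
high = added_latency_ms(calls, 120)

print(f"{calls} sequential calls add {low / 1000:.2f}-{high / 1000:.2f} s")
```

Under those assumptions a single screen picks up somewhere between one and a half and three seconds of waiting that has nothing to do with the application itself.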

The in-cluster pattern keeps the auth decision next to the workload. oauth2-proxy runs as a sidecar or a shared deployment in the same cluster. The token validation is a local call. The IdP, Keycloak, runs in the same cluster or the same VNet. No request leaves the boundary to decide whether a request is allowed.

Cert and key management

Cloudflare Access needs your origin reachable from Cloudflare’s edge. Unless you run a tunnel, that means a public hostname, a certificate Cloudflare trusts, and a way to stop anyone bypassing Cloudflare to hit the origin directly. In practice teams end up managing origin certificates, the Cloudflare Tunnel daemon (cloudflared) or a set of firewall rules pinned to Cloudflare IP ranges, and the Access service tokens for machine-to-machine calls. Three key surfaces, all external-facing, all needing rotation.

The in-cluster pattern consolidates this. cert-manager issues and rotates TLS certificates automatically, from Let’s Encrypt for anything public or from an internal CA for east-west traffic. External Secrets Operator pulls the OIDC client secrets and signing keys from Azure Key Vault and syncs them into the namespaces that need them. Rotation is a Key Vault operation that propagates without anyone touching a manifest. The keys never leave your boundary.

The topology we actually deploy

Here’s the shape of it for an ARO platform handling regulated workloads.

Application Gateway with WAF sits at the front. It terminates public TLS, runs OWASP rule sets, and routes to the cluster ingress. This is your edge, and it’s an edge you own inside your own subscription.

oauth2-proxy sits in front of each protected workload, either as a sidecar or as a per-namespace deployment behind the ingress. It handles the OIDC dance: redirect unauthenticated users to the IdP, validate the returned token, set the session cookie, forward authenticated requests to the workload with identity headers the workload can trust because nothing reaches the workload except through the proxy, enforced by network policy.
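On the workload side, consuming those identity headers can be as small as the sketch below. It assumes oauth2-proxy is configured to pass user headers upstream, in which case it sends `X-Forwarded-Email` and a comma-separated `X-Forwarded-Groups`; the `Caller` type and function name are hypothetical, and a real web framework would normalise header casing for you.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class Caller:
    email: str
    groups: Tuple[str, ...]


def caller_from_headers(headers: Mapping[str, str]) -> Optional[Caller]:
    """Build the caller identity from headers set by the proxy.

    Only safe when network policy guarantees that the proxy is the sole
    source of traffic to this workload.
    """
    email = headers.get("X-Forwarded-Email", "").strip()
    if not email:
        return None  # no identity: treat as unauthenticated
    raw_groups = headers.get("X-Forwarded-Groups", "")
    groups = tuple(g.strip() for g in raw_groups.split(",") if g.strip())
    return Caller(email=email, groups=groups)


caller = caller_from_headers({
    "X-Forwarded-Email": "alice@example.com",
    "X-Forwarded-Groups": "platform-admins, auditors",
})
print(caller)
```

The workload does no OIDC of its own here. It leans entirely on the path being closed, which is why the network policy is part of the design and not an afterthought.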

Keycloak is the IdP, deployed in-cluster or in the platform VNet. It federates up to Entra ID, so users still sign in with their corporate identity and your Conditional Access policies still apply. Keycloak handles the OIDC clients, the token issuance, the group-to-role mapping, and the session lifecycle. It’s the authority, and it lives inside your boundary.

External Secrets Operator syncs the Keycloak client secrets, the oauth2-proxy cookie secrets, and any signing material from Azure Key Vault. Nothing sensitive is committed to Git.
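As an example of the material that ends up in Key Vault, here is a sketch of generating an oauth2-proxy cookie secret. It assumes, as the oauth2-proxy documentation describes, that the secret should be 16, 24, or 32 random bytes, supplied URL-safe base64 encoded; the function name is mine.

```python
import base64
import secrets


def new_cookie_secret(num_bytes: int = 32) -> str:
    """Return a random cookie secret, URL-safe base64 encoded."""
    if num_bytes not in (16, 24, 32):
        raise ValueError("cookie secret must be 16, 24, or 32 bytes")
    return base64.urlsafe_b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


cookie_secret = new_cookie_secret()
print(len(base64.urlsafe_b64decode(cookie_secret)), "random bytes generated")
```

The value goes into Key Vault once and reaches the cluster through the External Secrets Operator sync. It should never be printed, logged, or pasted into a manifest.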

cert-manager handles every certificate. Public ingress certs from Let’s Encrypt. Internal mTLS between proxy and workload from an internal issuer.

Network policies enforce the path. The workload accepts traffic only from its oauth2-proxy. The proxy accepts traffic only from the ingress. The ingress accepts traffic only from the Application Gateway. Each hop is a deliberate, evidenced allow rule.
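The innermost hop, workload accepting traffic only from its proxy, can be sketched as a standard Kubernetes NetworkPolicy. The snippet builds the manifest as a Python dict and prints it as JSON, which `kubectl apply -f -` accepts. The namespace, labels, policy name, and port are all hypothetical, and it assumes the proxy runs as its own pod in the same namespace as the workload.

```python
import json
from typing import Dict


def allow_only_from(
    name: str,
    namespace: str,
    target_labels: Dict[str, str],
    source_labels: Dict[str, str],
    port: int,
) -> dict:
    """NetworkPolicy letting only pods with source_labels reach the target pods."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Pods this policy protects.
            "podSelector": {"matchLabels": target_labels},
            # Selecting the pods for Ingress denies all inbound traffic
            # that is not matched by a rule below.
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    # A bare podSelector matches pods in the policy's own namespace.
                    "from": [{"podSelector": {"matchLabels": source_labels}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


policy = allow_only_from(
    name="billing-from-proxy-only",          # hypothetical names throughout
    namespace="billing",
    target_labels={"app": "billing"},
    source_labels={"app": "billing-oauth2-proxy"},
    port=8080,
)
print(json.dumps(policy, indent=2))
```

The ingress-to-proxy hop follows the same shape, but the ingress controller usually lives in another namespace, so that rule would also need a `namespaceSelector`. A sidecar proxy shares the pod’s network namespace with the workload, so there the policy has to target the pod as a whole.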

Sessions, tokens, and revocation

The two patterns also differ in how they handle the lifecycle of a session, which matters more than it sounds.

With the in-cluster pattern, Keycloak owns the session. You set token lifetimes, refresh behaviour, and idle timeouts to your policy. When you need to revoke access, you do it at the IdP and the next token validation fails. Force a logout across every application, expire a compromised session, or tighten lifetimes for a sensitive workload, all from one place you control. For regulated environments where “revoke this person’s access immediately” is an auditable requirement, having the session authority inside your boundary is the difference between a config change and a support ticket to a third party.

oauth2-proxy holds the session cookie and refreshes tokens against Keycloak as they expire. Because the proxy sits in the request path for every call, a revoked or expired token is caught the next time the proxy validates or refreshes the session, on an interval you configure, not whenever an edge cache decides to recheck. You get tight, predictable control over how long access persists after a credential is pulled.
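The expiry half of that check is simple enough to sketch. This is not oauth2-proxy’s code, only the shape of the decision: read the `exp` claim from the token and refresh once it is within a skew window. The sample token is unsigned and built inline for illustration, the 30-second skew is an assumption, and a real validator must verify the signature before trusting any claim.

```python
import base64
import json


def _b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding JWTs strip."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def seconds_until_expiry(jwt: str, now: int) -> int:
    """Read the exp claim from the payload. Does NOT verify the signature."""
    payload = json.loads(_b64url_decode(jwt.split(".")[1]))
    return int(payload["exp"]) - now


def needs_refresh(jwt: str, now: int, skew_seconds: int = 30) -> bool:
    """True once the token is expired or about to expire."""
    return seconds_until_expiry(jwt, now) <= skew_seconds


# Illustrative, unsigned token that expires at t=1000 (Unix seconds).
sample_token = ".".join([
    _b64url_encode(json.dumps({"alg": "none"}).encode()),
    _b64url_encode(json.dumps({"sub": "alice", "exp": 1000}).encode()),
    "placeholder-signature",
])

print(seconds_until_expiry(sample_token, now=900))
print(needs_refresh(sample_token, now=980))
```

When the refresh call reaches Keycloak and the session has been revoked, the refresh fails and the user is sent back to sign in. The token lifetime therefore sets the upper bound on how long a pulled credential keeps working.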

What this gives you that Cloudflare Access doesn’t

Every identity decision happens inside your boundary, made by software you run, against an IdP you control, federated to the corporate directory your auditors already trust. When the auditor asks “where is access to this workload decided and how is it logged,” the answer is one diagram and one log source, all inside the tenant.

You also get richer authorization. Keycloak does fine-grained role mapping, group membership, client scopes, and token claims that your application reads directly. oauth2-proxy can pass through groups and emails as headers, or you let the application validate the JWT itself for proper API-level authorization. Cloudflare Access gives you allow or deny at the edge. The in-cluster pattern gives you identity all the way down to the application’s authorization logic.
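Here is a sketch of what reading those claims looks like in the application. It assumes the token has already been signature-verified and decoded into a dict, and that roles arrive in Keycloak’s usual `realm_access.roles` claim; the role names and helper are hypothetical.

```python
from typing import Any, Mapping


def has_realm_role(claims: Mapping[str, Any], role: str) -> bool:
    """Check a realm role on an already-verified Keycloak access token."""
    roles = claims.get("realm_access", {}).get("roles", [])
    return role in roles


# Claims as they might look after verification (values are illustrative).
claims = {
    "preferred_username": "alice",
    "email": "alice@example.com",
    "realm_access": {"roles": ["claims-reader", "offline_access"]},
}

print(has_realm_role(claims, "claims-reader"))    # allowed to read
print(has_realm_role(claims, "claims-approver"))  # not allowed to approve
```

This is the difference between a gate and an authorization model: the same token that got the user through the proxy also tells the application what that user may do once inside.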

When Cloudflare Access is the right tool

I’m not arguing against the product. There are boundaries where it’s exactly right.

Public marketing sites and docs behind a staff login. Low risk, no in-cluster identity needs, and the edge is a sensible place to gate access. We use it here ourselves.

Simple SaaS dashboards with no sensitive backend, where you want SSO in front of an off-the-shelf tool and don’t want to run identity infrastructure for it.

Low-risk internal tools that live outside the cluster entirely. A status page, an internal wiki, a dashboard with no regulated data. Cloudflare Access in front of these is less work than standing up oauth2-proxy and it’s a proportionate control.

The deciding question is the boundary. If the thing you’re protecting lives inside a cluster that has its own identity and authorization needs, put the identity decision inside that cluster. If it’s a standalone web property, the edge is fine.

Operational reality: outages and audit logs

Two things separate the patterns under stress.

When Cloudflare has an outage, and they do, your internal auth flow is down. Users can’t reach the in-cluster application even though the application, the cluster, and the IdP are all healthy. You’ve coupled the availability of internal tooling to an external provider’s edge. The in-cluster pattern fails only when your own cluster fails, which is the same failure domain as the workload itself. There’s no additional external dependency to take you down.

Audit logs tell the same story. With Cloudflare Access, your access decisions are in Cloudflare’s logs, your application logs are in your SIEM, and correlating a denied access at the edge with activity in the cluster means stitching two systems together across an organisational boundary. With the in-cluster pattern, Keycloak’s events, oauth2-proxy’s logs, the ingress logs, and the workload logs all flow to the same Log Analytics workspace or the same SIEM. One query answers “who accessed this and when.”
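In practice that one query would be written in your SIEM’s query language. The Python below only illustrates the logic, and the event shape (`time`, `user`, `source`, `event` fields) is an assumption for the example, not the real log schema of any of these components.

```python
from datetime import datetime
from typing import Dict, Iterable, List

Event = Dict[str, str]


def access_timeline(user: str, *sources: Iterable[Event]) -> List[Event]:
    """Merge events from several log sources into one timeline for a user."""
    events = [e for source in sources for e in source if e["user"] == user]
    return sorted(events, key=lambda e: datetime.fromisoformat(e["time"]))


# Hypothetical, already-normalised events from three sources.
keycloak_events = [
    {"time": "2024-05-01T09:00:00+00:00", "user": "alice", "source": "keycloak", "event": "LOGIN"},
    {"time": "2024-05-01T09:05:00+00:00", "user": "bob", "source": "keycloak", "event": "LOGIN_ERROR"},
]
proxy_events = [
    {"time": "2024-05-01T09:00:02+00:00", "user": "alice", "source": "oauth2-proxy", "event": "session created"},
]
workload_events = [
    {"time": "2024-05-01T09:00:05+00:00", "user": "alice", "source": "workload", "event": "GET /reports/42"},
]

for entry in access_timeline("alice", workload_events, keycloak_events, proxy_events):
    print(entry["time"], entry["source"], entry["event"])
```

The point is that the join is trivial when every source lands in the same place with a shared user identifier. Across an organisational boundary, the same join starts with an export and a field-mapping exercise.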

The day-two cost, stated honestly

I won’t pretend the in-cluster pattern is free to run. You now operate Keycloak, which is a real piece of software with its own database, its own upgrades, and its own availability requirements. If Keycloak is down, your auth is down, so it gets the same high-availability treatment as anything else on the critical path: multiple replicas, a backed-up database, and a tested restore.

The honest mitigation is that these are components you likely run already on a mature platform. cert-manager and External Secrets Operator are standard on any serious cluster. Keycloak runs as a workload like your others, monitored by the same Prometheus, logged to the same workspace, deployed through the same GitOps pipeline. It folds into the platform you already operate rather than standing apart as a special case. Cloudflare Access has no day-two cost to you because someone else runs it, and that’s exactly the trade: you take on operation in exchange for keeping the identity decision inside your boundary. For a regulated workload, that trade is worth making. For a marketing site, it isn’t.

Pick the right tool for the right boundary

Cloudflare Access protects edges. Use it for edges. In-cluster workloads have identity needs that belong inside the cluster, enforced by infrastructure you run, logged where the rest of your evidence lives, and dependent on nothing outside your boundary to make an access decision.

The in-cluster pattern is more to stand up on day one: oauth2-proxy, Keycloak, External Secrets Operator, cert-manager, network policies. Once it’s running it folds into the platform you already operate, and it’s easier to audit, faster for users, and resilient to the failures that matter. For any ARO or AKS platform carrying regulated workloads, that’s the trade we make every time.

If you’re designing identity for a Kubernetes platform and want a second opinion on where the boundaries should sit, our cybersecurity team does exactly this work. Book a consultation and we’ll walk through your topology and tell you, honestly, which workloads need in-cluster identity and which are fine behind an edge.