architecture · serverless · streaming

Why a proxy beats a serverless function for AI key safety

2026-05-28 · Skelf-Research

The instinctive answer to “how do I hide my OpenAI key from the browser” is “stick a serverless function in front of it”. This works. You get a Lambda or a Vercel function or a Cloudflare Worker, you set OPENAI_API_KEY as a secret, you write twenty lines that forward the request, and you sleep at night.
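For reference, the twenty lines look roughly like this. It is a minimal sketch in the fetch-style handler shape that Worker-like runtimes use; `Env`, `buildUpstreamInit` and `handleRequest` are names we made up for illustration, and how the handler is exported and wired to a route differs per platform.

```typescript
// Sketch only: a fetch-style forwarder. `Env`, `buildUpstreamInit` and
// `handleRequest` are illustrative names, not part of any platform API.
type Env = { OPENAI_API_KEY: string };

const OPENAI_URL = "https://api.openai.com/v1/chat/completions";

// Build the outgoing request. None of the browser's headers are copied,
// so a client cannot smuggle in its own Authorization header.
function buildUpstreamInit(body: string, env: Env) {
  return {
    method: "POST",
    headers: {
      "content-type": "application/json",
      authorization: `Bearer ${env.OPENAI_API_KEY}`,
    },
    body,
  };
}

// Read the JSON body, forward it with the secret attached, and hand the
// upstream response (including its streaming body) back unchanged.
async function handleRequest(request: Request, env: Env): Promise<Response> {
  if (request.method !== "POST") {
    return new Response("method not allowed", { status: 405 });
  }
  const upstream = await fetch(OPENAI_URL, buildUpstreamInit(await request.text(), env));
  return new Response(upstream.body, {
    status: upstream.status,
    headers: { "content-type": upstream.headers.get("content-type") ?? "application/json" },
  });
}
```

The key lives only in the platform's secret store and never reaches the browser, which is the whole point of the exercise.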

For some shapes of app, this is the right answer and you should stop reading. For others — most others, in our experience — a long-running proxy is the better fit. This post is about why, and where the boundaries are.

What a serverless function does well

Serverless wins on operational simplicity. You do not run a process. You do not provision a VM. You do not worry about cold reboots eating your queue. The platform takes the request, runs your handler, returns the response, and bills you for milliseconds. For a low-traffic hobby app calling OpenAI a few times a day, this is unambiguously correct.

It also wins on isolation. Each invocation has its own memory. If your function crashes, it crashes one invocation; the next is unaffected. For workloads that look like “one request in, one response out, no state in between”, serverless is the platonic ideal.

Where serverless starts to bend

The interesting thing about AI calls is that they are not “one request in, one response out”. They are streaming. The chat completion endpoint returns a Server-Sent Events stream that can run for thirty seconds, sixty seconds, sometimes longer. The user expects to see tokens arrive as they are generated, not in a single thump at the end.

Streaming and serverless are uneasy housemates. They can be made to work, but every platform has caveats:

AWS Lambda: response streaming works through function URLs, but the invocation timeout caps you at fifteen minutes, and streamed responses still have a payload size limit with its own edge cases.
Vercel functions: streaming works on the edge runtime, but the edge runtime exposes only a subset of Node's APIs, and some of the SDKs you might bring along don’t run on it. The Node runtime has a different timeout regime.
Cloudflare Workers: streaming works, but a Worker has CPU time limits separate from wall time, and you can run into “exceeded CPU” errors in the middle of a long generation.

None of these are dealbreakers. All of them are surprises. The first time one bites you in production, you spend a day understanding the platform’s specific streaming model. The second time, you find a different way to do it.

A long-running proxy has none of these surprises. It is a Node process. It opens a connection to the upstream, it copies bytes through, the user gets tokens in real time. There is no platform-imposed timeout, because the proxy is not on a billing meter that punishes wall time; the only timeouts are the ones you configure yourself.
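As a rough illustration of "copies bytes through", here is a sketch of a pass-through server built on Node's http and https modules. It is not Perishable's implementation; `upstreamHeaders` and `createStreamingProxy` are hypothetical names, and it leaves out authentication, rate limiting and timeouts entirely.

```typescript
import * as http from "node:http";
import * as https from "node:https";

// Headers sent upstream: the client's own credentials are dropped and the
// server-side key is attached. `apiKey` would come from the environment.
function upstreamHeaders(incoming: http.IncomingHttpHeaders, apiKey: string): http.OutgoingHttpHeaders {
  const headers: http.OutgoingHttpHeaders = {
    "content-type": incoming["content-type"] ?? "application/json",
    accept: incoming["accept"] ?? "text/event-stream",
    authorization: `Bearer ${apiKey}`,
  };
  if (incoming["content-length"]) headers["content-length"] = incoming["content-length"];
  return headers;
}

// The request body is piped up and the response body is piped down chunk by
// chunk, so SSE events reach the browser as the upstream emits them.
function createStreamingProxy(apiKey: string): http.Server {
  return http.createServer((req, res) => {
    const upstream = https.request(
      {
        host: "api.openai.com",
        path: "/v1/chat/completions",
        method: "POST",
        headers: upstreamHeaders(req.headers, apiKey),
      },
      (upRes) => {
        res.writeHead(upRes.statusCode ?? 502, {
          "content-type": upRes.headers["content-type"] ?? "application/json",
          "cache-control": "no-cache",
        });
        upRes.pipe(res); // no buffering of the whole reply
      },
    );
    upstream.on("error", () => {
      if (!res.headersSent) res.writeHead(502);
      res.end();
    });
    res.on("close", () => upstream.destroy()); // client went away: drop the upstream call too
    req.pipe(upstream);
  });
}
```

Calling `createStreamingProxy(process.env.OPENAI_API_KEY ?? "").listen(8080)` would start it; the port is arbitrary.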

Cold starts and tail latency

Serverless functions sleep between requests. The first request after a quiet period pays a cold-start tax, which for a Node function with a few dependencies is somewhere between 200ms and 1500ms depending on the platform and the deployment region.

Most of the time this is fine. For an AI app where the user is already waiting for a model response, an extra second on the first request hides inside the model’s own latency. But if your app has bursty traffic patterns — a wave of users at 9am, nothing all night — you will see the cold start every morning, and your users will notice.

A long-running proxy has one cold start, at process boot. After that, warm connections to the upstream are reused. Tail latency is predictable: the overhead the proxy itself adds at P99 stays close to what it adds at P50.

Connection reuse

OpenAI’s API, like most modern HTTP APIs, supports keep-alive. Your client and the API server set up a TLS connection once and reuse it for many requests. A long-running proxy can hold that connection open across thousands of requests.

A serverless function cannot count on that. Each cold-started container opens a fresh TLS connection. Even warm containers, after a period of idleness, will have closed the connection. The TLS handshake adds 50–150ms to the first request on each cold connection.
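In a Node process, connection reuse is mostly a matter of sharing one agent. The sketch below uses the standard `https.Agent` options; the socket limits are illustrative numbers and `requestOptions` is a hypothetical helper.

```typescript
import * as https from "node:https";

// One agent for the life of the process. keepAlive leaves idle sockets open,
// so later requests skip the TCP and TLS handshakes.
const upstreamAgent = new https.Agent({
  keepAlive: true,
  maxSockets: 64, // illustrative cap on concurrent upstream connections
  maxFreeSockets: 16, // idle sockets kept warm between bursts
});

// Every upstream call passes the same agent, so every call can draw from
// the same pool of already-open connections.
function requestOptions(path: string): https.RequestOptions {
  return { host: "api.openai.com", path, method: "POST", agent: upstreamAgent };
}
```

The pool only helps while the process stays alive, which is exactly what a function container cannot promise.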

Again, none of this is a dealbreaker. It is a tax. The tax is small for low-volume apps and meaningful for high-volume ones.

Per-client state

A serverless function is, by design, stateless. If you want to rate-limit a specific client across multiple invocations, you need an external store. Redis. DynamoDB. KV. Something. Now you have two moving parts and a network round trip on every request.

A proxy holds state in memory. The rate-limit counter for a client is just a number in a map. Session state is a JWT (so no state at all in the common path; the proxy verifies the signature and goes). When you do need a denylist or a revocation list, it’s a Set somewhere.
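To make "a number in a map" concrete, here is a minimal fixed-window limiter. The class name, the client key, the limit and the window length are all assumptions for the sake of the example; it is not Perishable's limiter.

```typescript
// Fixed-window counter per client, held entirely in process memory.
type Bucket = { windowStart: number; count: number };

class RateLimiter {
  private buckets = new Map<string, Bucket>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed. `now` is injectable for testing.
  allow(clientId: string, now: number = Date.now()): boolean {
    const bucket = this.buckets.get(clientId);
    if (!bucket || now - bucket.windowStart >= this.windowMs) {
      // First request from this client, or the previous window has expired.
      this.buckets.set(clientId, { windowStart: now, count: 1 });
      return true;
    }
    if (bucket.count >= this.limit) return false;
    bucket.count += 1;
    return true;
  }
}

// A denylist is the same idea: a Set of revoked session ids.
const revokedSessions = new Set<string>();
```

A real limiter would also evict stale buckets so the map does not grow without bound, and the counters reset when the process restarts. Both are usually acceptable for a single-instance proxy.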

For a fingerprint-bound, rate-limited proxy like Perishable, the state-locality matters. You can stand up an extra Redis if you want. You don’t have to.
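The stateless common path can be sketched with nothing but node:crypto. This is an assumption-heavy illustration: the claim names `exp` and `fp`, the HS256 choice and both function names are ours, and Perishable's actual token format may differ.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

type Claims = { exp: number; fp: string }; // hypothetical claim names

function sign(data: string, secret: string): Buffer {
  return createHmac("sha256", secret).update(data).digest();
}

// Mint a compact HS256 token. Included only so the example is self-contained.
function issueToken(claims: Claims, secret: string): string {
  const header = Buffer.from(JSON.stringify({ alg: "HS256", typ: "JWT" })).toString("base64url");
  const payload = Buffer.from(JSON.stringify(claims)).toString("base64url");
  const sig = sign(`${header}.${payload}`, secret).toString("base64url");
  return `${header}.${payload}.${sig}`;
}

// Check signature, expiry and fingerprint without touching any store.
// The HMAC is always SHA-256 here, whatever the token header claims.
function verifyToken(token: string, secret: string, fingerprint: string, nowSec: number): Claims | null {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const [header = "", payload = "", sig = ""] = parts;
  const expected = sign(`${header}.${payload}`, secret);
  const given = Buffer.from(sig, "base64url");
  if (given.length !== expected.length || !timingSafeEqual(given, expected)) return null;
  const claims = JSON.parse(Buffer.from(payload, "base64url").toString("utf8")) as Claims;
  if (claims.exp <= nowSec || claims.fp !== fingerprint) return null;
  return claims;
}
```

Because the fingerprint sits inside the signed payload, a token copied to another device fails verification without any lookup.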

Observability you can actually attach

A proxy you own is a process you can attach a debugger to, tail logs from, instrument with whatever profiler you like, and run a heap dump against. A serverless function is a black box. You get the platform’s logs, the platform’s metrics, and whatever instrumentation you remembered to add at deploy time.

In an outage, having a Node process running on a known host with a known PID is calming. In an outage, having “function invocation 12abcde9 exceeded memory” is a ticket waiting to be filed.

When serverless is actually right

There are real cases where the serverless choice is correct:

Genuinely bursty, low-baseline traffic where the cost model of pay-per-invocation beats pay-per-hour-for-an-idle-VM.
Non-streaming endpoints — embeddings, classification, anything where the response is small and prompt — where the streaming awkwardness doesn’t apply.
Edge-distributed reads where you want the proxy near the user and you’re willing to deal with KV-based state.

If your shape matches those, by all means run a function. Perishable also runs perfectly well as a long-running process behind a single function endpoint if you really want; the model just becomes “function → proxy → upstream”, with the function as a thin authentication boundary.

The recommendation

For the typical case, a client-side AI app that streams completions and wants fingerprint-bound short-lived sessions, run a process: npx perishable-proxy on a small VM, or a container next to your existing services. You get streaming that works, latency that’s predictable, and state where you can see it.

The serverless version of this design exists, and we will probably ship a Worker-flavoured Perishable at some point. Until then: a process is the simpler answer for the workload that actually matters.

Operationally, “a process” is not scary

There is a residual nervousness around running a long-lived process that comes from a decade of “serverless is the new default” framing. In practice, what a small Node service asks of you beyond a serverless function, for this workload, is roughly:

One small VM or container, somewhere. $5–$20 a month for a hobby app. A pod in your existing cluster for an enterprise.
A systemd unit or a Procfile or a Dockerfile. The deployment story you already use for everything else.
A health check endpoint and your existing monitoring.

That is the entire surface. There is no autoscaler to configure, no cold-start budget to tune, no provider-specific streaming caveat to work around. You are running a Node process. The hardest decision is which logger to use.
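The health check from that list can be as small as this. It is a sketch: the `/healthz` path, the response shape and the function names are assumptions, and in practice you would mount the route on the proxy's own server rather than run a second one.

```typescript
import * as http from "node:http";

// Body of the health response, kept as a pure function so it is easy to test.
function healthBody(startedAt: number, now: number): { status: string; uptimeSeconds: number } {
  return { status: "ok", uptimeSeconds: Math.floor((now - startedAt) / 1000) };
}

// Answers GET /healthz with a small JSON document and 404s everything else.
function createHealthServer(startedAt: number = Date.now()): http.Server {
  return http.createServer((req, res) => {
    if (req.method === "GET" && req.url === "/healthz") {
      res.writeHead(200, { "content-type": "application/json" });
      res.end(JSON.stringify(healthBody(startedAt, Date.now())));
      return;
    }
    res.writeHead(404);
    res.end();
  });
}
```

Point your existing monitoring at that path and alert on anything other than a 200.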

For teams that have ruthlessly committed to a serverless-only model, introducing one always-on component feels like backsliding. It is not. Some workloads want long-running processes. Streaming AI calls are one of those workloads. Trying to bend serverless to fit is a cost you pay forever; running a process is a cost you pay once.

A yoghurt aisle does not need autoscaling. It needs a fridge.
