We built one CAPTCHA solver for 15 providers. Here's the architecture.
TL;DR — BCTRL runtimes ship a hosted CAPTCHA solver behind a single API call. It detects and solves reCAPTCHA, Cloudflare Turnstile, hCaptcha, GeeTest, Arkose, DataDome, Amazon WAF, Friendly Captcha, Lemin, MTCaptcha, Prosopo, Altcha, Basilisk, Yidun, and Tendi, with two generic AI-vision fallbacks for anything it doesn’t recognize. You never tell it which type is on the page. This post is about how it’s built — the provider-adapter registry, why interactive challenges like sliders come down to one vision call plus code-computed motion, and three approaches we shipped, measured, and deleted.
One call, no type parameter
The entire public surface is an invocation:
const solve = await bctrl.runtimes.invocations.createAndWait(runtime.id, {
  action: "solveCaptcha",
  target: "active",
});
There’s no type: "recaptcha" field, and that’s deliberate. Your automation doesn’t know what it’s going to hit: a site that served Turnstile yesterday serves GeeTest behind a different CDN rule today. The solver runs inside the runtime, inspects the page, and figures out what it’s looking at.
Failure is structured, because your code needs to branch on it: invocation.captcha_not_found means nothing was blocking the page and your flow can proceed; invocation.captcha_solve_failed means a challenge was found but not solved, and details.retryable tells you whether another attempt is worth the credits. Every attempt also lands on the run timeline as captcha.solve.started, captcha.solve.succeeded, or captcha.solve.failed events, so you can measure how often challenges actually fire in production instead of guessing.
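A minimal sketch of that branching, assuming the failure surfaces as an object with a `code` string and an optional `details.retryable` flag. Only the two error codes and `retryable` come from this post; the `SolveError` shape, the `nextStep` helper, and the `attemptsLeft` budget are hypothetical names for illustration, not the BCTRL SDK's types.

```typescript
// Hypothetical error shape: the field names `code` and `details` are assumed.
interface SolveError {
  code: string;
  details?: { retryable?: boolean };
}

type NextStep = "proceed" | "retry" | "abort";

// Decide what the automation should do after a solveCaptcha invocation.
// `error` is null when the invocation succeeded.
function nextStep(error: SolveError | null, attemptsLeft: number): NextStep {
  if (error === null) return "proceed"; // challenge solved
  if (error.code === "invocation.captcha_not_found") return "proceed"; // nothing was blocking the page
  if (error.code === "invocation.captcha_solve_failed") {
    const retryable = error.details?.retryable === true;
    return retryable && attemptsLeft > 0 ? "retry" : "abort";
  }
  return "abort"; // unknown failure: do not spend credits blindly
}
```

Treating `captcha_not_found` as a success path keeps the calling flow linear: the caller can invoke the solver defensively before a sensitive step without special-casing pages that served no challenge.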
The adapter registry
Inside, every provider is an adapter implementing one interface: how to recognize this captcha in the DOM (frame hints, sitekey markers), how to read its state, what counts as solved (token fields appearing in hidden inputs), and what actions the solving loop may take. Detection runs the adapters in a fixed order — the cheap, high-confidence checks first (reCAPTCHA, Turnstile, hCaptcha), then the long tail, then the managed providers, and finally two generic adapters that handle anything unrecognized with pure vision.
The interface is the point. When a new provider shows up, supporting it means writing one adapter and adding it to the list — detection ordering, solved-state probing, and the solving loop come for free. That’s how the list got to fifteen without the orchestrator growing a fifteen-way switch statement.
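The registry idea can be sketched in a few lines. This is an illustration under stated assumptions, not the production interface: `PageSnapshot`, `CaptchaAdapter`, and `detectProvider` are hypothetical names, the adapters are reduced to detection only, and the URL heuristics are deliberately simplistic stand-ins for real frame hints and sitekey markers.

```typescript
// Simplified stand-in for the page; a real adapter would inspect the live DOM.
interface PageSnapshot {
  frameUrls: string[];
  html: string;
}

// Reduced adapter interface: detection only (state reading and actions omitted).
interface CaptchaAdapter {
  name: string;
  detect(page: PageSnapshot): boolean;
}

const recaptcha: CaptchaAdapter = {
  name: "recaptcha",
  detect: (p) => p.frameUrls.some((u) => u.includes("/recaptcha/")),
};

const turnstile: CaptchaAdapter = {
  name: "turnstile",
  detect: (p) => p.frameUrls.some((u) => u.includes("challenges.cloudflare.com")),
};

// Generic fallback: matches anything that merely looks like a captcha.
const genericVision: CaptchaAdapter = {
  name: "generic-vision",
  detect: (p) => /captcha/i.test(p.html),
};

// Order matters: cheap, high-confidence checks first, generic fallback last.
const registry: CaptchaAdapter[] = [recaptcha, turnstile, genericVision];

// Return the name of the first adapter that recognizes the page, or null.
function detectProvider(page: PageSnapshot): string | null {
  for (const adapter of registry) {
    if (adapter.detect(page)) return adapter.name;
  }
  return null;
}
```

Adding a provider in this sketch means writing one more object and placing it in `registry`; `detectProvider` never changes.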
Knowing what “solved” means turns out to matter as much as solving. Most token captchas signal success by writing into hidden form fields, but the fields differ per provider, and some — GeeTest among them — require several fields to appear together before the challenge is actually passed. Adapters declare these as data, and a single consolidated DOM probe checks all of them at once. The solver doesn’t declare victory because the puzzle looks done; it declares victory when the page holds the token the site’s backend will verify.
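Declaring solved-state as data might look like the sketch below. The field names are the ones these providers conventionally write into forms (GeeTest shown in its v3 form); the `solvedFields` table and the `solvedProviders` helper are hypothetical, and `fieldValues` stands in for the result of a single DOM probe that collected every hidden input of interest in one pass.

```typescript
// Each provider declares the hidden fields that must ALL be non-empty
// before the challenge counts as solved.
const solvedFields: Record<string, string[]> = {
  recaptcha: ["g-recaptcha-response"],
  hcaptcha: ["h-captcha-response"],
  turnstile: ["cf-turnstile-response"],
  geetest: ["geetest_challenge", "geetest_validate", "geetest_seccode"],
};

// `fieldValues` maps field name -> current value, as gathered by one DOM probe.
// Returns every provider whose full set of fields is populated.
function solvedProviders(fieldValues: Record<string, string>): string[] {
  return Object.entries(solvedFields)
    .filter(([, fields]) => fields.every((f) => (fieldValues[f] ?? "").trim() !== ""))
    .map(([provider]) => provider);
}
```

With only `geetest_validate` populated, the function reports nothing as solved; GeeTest appears in the result only once all three fields hold values.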
The GeeTest slider, or: three deleted approaches
Token captchas are bookkeeping. Interactive challenges are where it gets interesting, and the GeeTest slider is the best war story because we got it wrong three times first.
The task: a puzzle piece sits at the left edge of an image, a piece-shaped gap sits somewhere in it, and you drag a handle so the piece lands in the gap. A vision model has to ground two x-coordinates. Everything else is engineering.
Attempt one: let the model do the math. Ask the model for the drag start and end, or for toX = fromX + distance. It can see the piece and the gap fine; it cannot reliably add two numbers under spatial grounding load. Off-by-enough, consistently.
Attempt two: annotation-style grounding. Three separate pointing calls using the model’s annotation format. The output shape varied per target, there was no clean retry path, it cost three calls instead of one, and accuracy didn’t improve. Reverted.
Attempt three: a verify-and-nudge loop. Hold the handle, check the position, nudge, check again. It worked sometimes and oscillated sometimes, and it turned every solve into a multi-round conversation. Deleted entirely.
What shipped: one strict JSON-schema vision call that returns four numbers — handle position, piece position, gap position — with reasoning disabled, because for pure spatial grounding the chain-of-thought hurts more than it helps. Code computes the drag as gapX − pieceX and executes it. The model points; arithmetic is the code’s job. The prompt disambiguates piece from gap structurally rather than by color — the piece is fully-drawn content near the left edge, the gap is a hole where content is missing — with a built-in self-check: if the two x-values match, you pointed at the gap twice.
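The split between pointing and arithmetic can be sketched as follows. The response shape is an assumption: this sketch guesses that the four numbers are the handle's x and y plus the piece and gap x-coordinates, and the names `Grounding`, `parseGrounding`, and `planDrag` are hypothetical. Rejecting a gap that lies to the left of the piece is an extra guard added here, beyond the equal-values self-check described above.

```typescript
// Assumed response shape; the real schema may differ.
interface Grounding {
  handleX: number;
  handleY: number;
  pieceX: number;
  gapX: number;
}

interface Drag {
  fromX: number;
  fromY: number;
  toX: number;
  toY: number;
}

// Validate the model's JSON strictly: every field must be a finite number.
function parseGrounding(raw: string): Grounding {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("grounding response is not a JSON object");
  }
  const rec = data as Record<string, unknown>;
  for (const k of ["handleX", "handleY", "pieceX", "gapX"]) {
    if (typeof rec[k] !== "number" || !Number.isFinite(rec[k] as number)) {
      throw new Error(`grounding field ${k} is missing or not a finite number`);
    }
  }
  return rec as unknown as Grounding;
}

// The model only points; the drag distance is computed in code.
function planDrag(g: Grounding): Drag {
  const distance = g.gapX - g.pieceX;
  // Equal x-values mean the model pointed at the gap twice; a negative
  // distance means piece and gap were swapped. Either way, re-ground.
  if (distance <= 0) {
    throw new Error("piece and gap x-values match or are reversed");
  }
  return { fromX: g.handleX, fromY: g.handleY, toX: g.handleX + distance, toY: g.handleY };
}
```

For a response of `{"handleX":40,"handleY":300,"pieceX":12,"gapX":187}`, the computed distance is 175 and the drag ends at x = 215.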
Drags that read as human
GeeTest doesn’t just check where the piece lands. It scores how it got there. A mathematically perfect linear drag is the easiest bot signal there is, so the motion layer plays the drag back the way a hand would: a Bézier path as the spatial base, roughly a pixel of Gaussian tremor on top, a minimum-jerk timing profile (progress along the path follows 10t³ − 15t⁴ + 6t⁵, which yields the bell-shaped velocity curve of human reaching movements), log-normal timing between input events at a realistic report rate, and always at least one small corrective settle at the end, because humans overshoot and fix it, every time.
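A compact sketch of those ingredients, under loud assumptions: every constant here (overshoot of 2 to 5 px, 8 ms median interval, the spread of the log-normal, the settle pause) is invented for illustration, as are the names `humanDrag`, `minJerk`, and `gaussian`. The random source is injected so the output can be reproduced in tests; a quadratic Bézier is used for brevity.

```typescript
type Point = { x: number; y: number; dtMs: number };
type Rng = () => number; // uniform in [0, 1)

// Minimum-jerk progress: fraction of the path covered at normalized time t.
const minJerk = (t: number): number => 10 * t ** 3 - 15 * t ** 4 + 6 * t ** 5;

// Standard normal sample via the Box-Muller transform.
function gaussian(rng: Rng): number {
  const u = 1 - rng(); // in (0, 1], avoids log(0)
  const v = rng();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Quadratic Bezier through one control point, evaluated at s in [0, 1].
function bezier(p0: number, p1: number, p2: number, s: number): number {
  return (1 - s) ** 2 * p0 + 2 * (1 - s) * s * p1 + s ** 2 * p2;
}

function humanDrag(fromX: number, toX: number, y: number, steps: number, rng: Rng): Point[] {
  const overshootX = toX + 2 + 3 * rng(); // aim a few px past the target
  const ctrlX = (fromX + overshootX) / 2;
  const ctrlY = y + 12 * (rng() - 0.5); // slight vertical bow in the path
  const points: Point[] = [];
  for (let i = 1; i <= steps; i++) {
    const s = minJerk(i / steps); // slow start, fast middle, slow finish
    points.push({
      x: bezier(fromX, ctrlX, overshootX, s) + gaussian(rng), // ~1 px tremor
      y: bezier(y, ctrlY, y, s) + gaussian(rng),
      dtMs: Math.exp(Math.log(8) + 0.25 * gaussian(rng)), // log-normal, median 8 ms
    });
  }
  // Corrective settle: a short pause, then a hop back onto the exact target.
  points.push({ x: toX, y, dtMs: 80 + 60 * rng() });
  return points;
}
```

Each returned point would be replayed as one pointer-move event after waiting `dtMs`; the final point is the settle that brings the handle back from the overshoot.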
The recovery logic is tuned to the provider’s behavior rather than to generic retries. GeeTest’s “try again” cooldown is three seconds, so the solver’s wait outlasts it. Repeated bad reads trigger a challenge refresh rather than another doomed drag; repeated refreshes fail the invocation honestly. And sometimes GeeTest rejects a coordinate-perfect drag anyway — its risk scoring runs deeper than geometry — which resolves over retries, and is exactly why retryable exists in the error details.
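One way to picture that recovery policy is as a small pure state machine. Only the three-second cooldown comes from this post; the 500 ms margin, the thresholds of two bad reads and two refreshes, and the names `recover`, `Outcome`, and `RecoveryState` are assumptions made for the sketch.

```typescript
type Outcome = "solved" | "bad_read" | "rejected";
type Action = "done" | "wait_and_retry" | "refresh" | "fail";

interface RecoveryState {
  badReads: number;  // consecutive bad vision reads on the current challenge
  refreshes: number; // challenge refreshes used so far
}

const COOLDOWN_MS = 3000;          // GeeTest's "try again" cooldown
const WAIT_MS = COOLDOWN_MS + 500; // assumed margin so the wait outlasts it
const MAX_BAD_READS = 2;           // assumed threshold
const MAX_REFRESHES = 2;           // assumed threshold

function recover(
  state: RecoveryState,
  outcome: Outcome,
): { action: Action; waitMs: number; state: RecoveryState } {
  if (outcome === "solved") return { action: "done", waitMs: 0, state };
  if (outcome === "rejected") {
    // Geometry was fine but risk scoring said no: wait out the cooldown and retry.
    return { action: "wait_and_retry", waitMs: WAIT_MS, state };
  }
  const badReads = state.badReads + 1;
  if (badReads < MAX_BAD_READS) {
    return { action: "wait_and_retry", waitMs: WAIT_MS, state: { ...state, badReads } };
  }
  if (state.refreshes >= MAX_REFRESHES) {
    // Repeated refreshes did not help: fail the invocation honestly.
    return { action: "fail", waitMs: 0, state: { ...state, badReads } };
  }
  // Repeated bad reads: ask for a fresh challenge instead of another doomed drag.
  return { action: "refresh", waitMs: 0, state: { badReads: 0, refreshes: state.refreshes + 1 } };
}
```

This sketch leaves the overall attempt budget to the caller, so a run of `rejected` outcomes is bounded by whatever retry limit wraps the loop.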
🎬 Content TODO: screen recording of a GeeTest slider solve from the live view — the tremor and settle hop are visible at normal speed and make the motion section land. 10–15 seconds, no narration.
The numbers
📊 Content TODO: per-provider solve-rate table — N attempts, success %, median solve latency, date-stamped, with the test setup disclosed (stealth level, proxy type, region). Run against the public demo pages per provider. This table is what gets the post cited; the architecture is what makes it credible.
The honest part
A solver is the fallback, not the strategy. Most challenges never fire when the runtime runs with stealth hardening and a residential proxy, because the cheapest captcha to solve is the one that was never served. That’s a different post — but it’s why the solver, the stealth layer, and the proxy config live behind the same runtime API instead of being three products you glue together.
If you want to see it work, the cookbook recipe is a complete file: Playwright hits the reCAPTCHA demo, the solver clears it, Playwright continues. Run it on a free account and watch the solve in the live view.