serp.fast

How we test

What we measure, and how

A short, specific description of the test harness behind our reviews. We publish this so a reader can audit the numbers and a vendor can reproduce them.

When we test a tool directly we record four things: end-to-end latency at the 50th and 95th percentile, success rate against a fixed set of real-world targets, normalised cost per thousand requests, and JS rendering coverage on a JavaScript-heavy corpus. The harness is a thin TypeScript runner that issues requests through each vendor's public SDK or HTTP API, records timing with performance.now() on the client, and validates responses against a reference parse so a 200 status with a malformed body is not counted as a success. Each tool runs against the same target list at the same time of day, from the same network egress, in batches of a few hundred requests with a short randomised delay between calls.
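To make the shape of that loop concrete, here is a minimal sketch of a per-request measurement pass. The names (Target, Sample, fetchViaVendor, referenceParse) are illustrative rather than the harness's actual internals; the real runner adds per-vendor clients, batching, and result persistence.

type Target = { url: string; query?: string };

interface Sample {
  target: Target;
  latencyMs: number;
  ok: boolean; // true only if the body passed the reference parse
}

async function runBatch(
  targets: Target[],
  fetchViaVendor: (t: Target) => Promise<string>,
  referenceParse: (body: string, t: Target) => boolean,
): Promise<Sample[]> {
  const samples: Sample[] = [];
  for (const target of targets) {
    const start = performance.now(); // end-to-end, measured on the client
    let ok = false;
    try {
      const body = await fetchViaVendor(target);
      // A 200 with a malformed body fails the reference parse, so it is not counted as a success.
      ok = referenceParse(body, target);
    } catch {
      ok = false; // network errors, blocks, and CAPTCHAs count as failures
    }
    samples.push({ target, latencyMs: performance.now() - start, ok });
    // Short randomised delay between calls; the exact range here is illustrative.
    await new Promise((resolve) => setTimeout(resolve, 250 + Math.random() * 500));
  }
  return samples;
}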

What we measure

Latency P50 and P95 are reported in milliseconds, end-to-end from request issuance to fully parsed response. We report both because the median hides the long tail that matters when an agent has a wall-clock budget. Success rate is the share of requests that returned a parseable response with the expected fields; CAPTCHAs, blocks, and partial pages all count as failures. Normalised cost per thousand requests takes the vendor's posted price and re-expresses it for a comparable workload (for example, including a JS-rendering surcharge if the workload requires it, or pricing a search call inclusive of any per-result fees). JS rendering coverage is the success rate on a subset of targets that require a real browser; it is reported separately because some tools are excellent on static HTML and weak on rendered pages.
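As a sketch of how those numbers fall out of a batch of samples (assuming the Sample shape from the sketch above; the nearest-rank percentile and the flat per-request price are simplifications):

function percentile(sortedAscendingMs: number[], p: number): number {
  // Nearest-rank percentile on an ascending-sorted array of latencies.
  const rank = Math.ceil((p / 100) * sortedAscendingMs.length);
  return sortedAscendingMs[Math.max(0, rank - 1)];
}

function summarise(samples: Sample[], pricePerRequestUsd: number) {
  const latencies = samples.map((s) => s.latencyMs).sort((a, b) => a - b);
  const successes = samples.filter((s) => s.ok).length;
  return {
    p50Ms: percentile(latencies, 50),
    p95Ms: percentile(latencies, 95),
    successRate: successes / samples.length,
    // Normalised cost per thousand requests; a real workload may also need a
    // JS-rendering surcharge or per-result fees folded into the per-request price.
    costPer1kUsd: pricePerRequestUsd * 1000,
  };
}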

For AI search APIs (Exa, Tavily, Perplexity, LinkUp, Brave Search API, You.com, Jina, Parallel) we add a recall-style metric: given a fixed query set with known relevant documents, how many of the relevant documents appear in the top ten results. This is imperfect – evaluating search quality is genuinely hard – but it is more honest than reporting only latency on a category whose entire purpose is relevance.
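A sketch of that recall computation, assuming each query in the fixed set has a hand-labelled set of relevant document URLs (the names are illustrative, and exact URL matching is a simplification; canonicalisation matters in practice):

function recallAtTen(
  relevantByQuery: Map<string, Set<string>>, // query -> known relevant URLs
  resultsByQuery: Map<string, string[]>,     // query -> returned URLs in rank order
): number {
  let found = 0;
  let total = 0;
  for (const [query, relevant] of relevantByQuery) {
    const topTen = new Set((resultsByQuery.get(query) ?? []).slice(0, 10));
    for (const url of relevant) {
      total += 1;
      if (topTen.has(url)) found += 1;
    }
  }
  return total === 0 ? 0 : found / total;
}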

What we use to test

The runner itself is a thin script, not a product. We write tests in TypeScript, run them with tsx, and persist raw results to a SQLite file so a later re-run can be diffed against the prior one. For headless rendering we use Playwright, because it is the de facto reference for rendered scraping. Network egress is from a US-based provider; we note this on every results table because latency from another region will look different. Where a vendor offers an SDK we use the SDK; where they do not, we use the documented HTTP API. We do not use undocumented endpoints.
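For the rendered subset, the fetch path looks roughly like this, a sketch using Playwright's public API with timeout, proxy, and block-detection handling omitted:

import { chromium } from "playwright";

async function fetchRendered(url: string): Promise<string> {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    // Wait for the network to go quiet so client-rendered content is present
    // before the body is handed to the reference parse.
    await page.goto(url, { waitUntil: "networkidle" });
    return await page.content();
  } finally {
    await browser.close();
  }
}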

Why we do not accept paid reviews

Some publications in this space accept payment for a review or for a higher placement. We do not. The reason is structural: a test result is only useful to a reader if the same harness is applied to every tool on the same basis, and money distorts that basis. Vendors are welcome to send a free tier or trial credits – many do – but the test plan and the verdict are ours. If a vendor disputes a number, we will re-run it; we will not change it because they paid. The editorial standards page describes the wider position on independence.

Last updated: