Methodology
Everything below describes exactly how Compare STT works under the hood. The full source code is open on GitHub—nothing is hidden.
1. Providers
Each provider is called through its official SDK or REST API, with no preprocessing, normalization, or prompt engineering applied to the audio. Every provider receives the exact same audio buffer and MIME type.
| Provider | Model | Settings |
|---|---|---|
| Gladia | Solaria | Default, language detection on, code-switching off |
| Deepgram | Nova 3 | Smart format, detect language |
| AssemblyAI | Universal-3 Pro | Language detection |
| ElevenLabs | Scribe v2 | Default |
| Speechmatics | Enhanced | Language: auto, enhanced operating point |
| Mistral | Voxtral Mini | Word-level timestamps |
The two providers in a matchup are called in parallel (Promise.all), so neither gains a latency advantage that could influence the user's perception. Transient errors (network, 429, 5xx) are retried up to 2 times with exponential backoff.
```typescript
const [resultA, resultB] = await Promise.all([
  transcribeForProvider(providerA.slug, audioBuffer, mimeType),
  transcribeForProvider(providerB.slug, audioBuffer, mimeType),
]);
```
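The retry behavior can be sketched as a small wrapper around each provider call (the helper name `withRetry` and the delay constants are illustrative, not the actual implementation):

```typescript
// Illustrative retry wrapper: retries transient failures (network, 429, 5xx)
// up to 2 times with exponential backoff. Delay values are assumptions.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 2,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break;
      // Exponential backoff: 500 ms, then 1000 ms, ...
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}
```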
2. Matchmaking
Provider pairs are not selected uniformly at random. Instead, the system uses a least-played pair strategy to ensure balanced coverage across all possible matchups:
- Enumerate all possible unordered pairs of providers (with 6 providers, that's 15 pairs).
- Count how many votes each pair has received so far (grouping by sorted provider IDs so A-vs-B and B-vs-A are the same pair).
- Find the minimum count across all pairs.
- Pick randomly among pairs that have this minimum count.
- Randomly swap which provider appears as “Model A” vs “Model B” (50/50 coin flip).
This means every pair gets roughly the same number of comparisons over time, preventing popular pairs from dominating the dataset.
```typescript
// Canonical key for an unordered provider pair
const pairKey = (p: [string, string]) => [...p].sort().join(":");

// Count votes per unordered pair
const pairCounts = await prisma.vote.groupBy({
  by: ["providerAId", "providerBId"],
  _count: true,
});

// Aggregate into unordered pair counts
const countMap = new Map<string, number>();
for (const row of pairCounts) {
  const key = [row.providerAId, row.providerBId].sort().join(":");
  countMap.set(key, (countMap.get(key) || 0) + row._count);
}

// Pick among least-played pairs (`pairs` holds all 15 unordered pairs)
const minCount = Math.min(
  ...pairs.map((p) => countMap.get(pairKey(p)) || 0)
);
const leastPlayed = pairs.filter(
  (p) => (countMap.get(pairKey(p)) || 0) === minCount
);
const chosen = leastPlayed[Math.floor(Math.random() * leastPlayed.length)];

// Random A/B assignment
const swap = Math.random() < 0.5;
```

3. Blind voting
The user sees “Model A” and “Model B” with no indication of which provider produced which transcription. Provider identities are only revealed after the vote is submitted.
To prevent tampering, the match assignment (session ID + provider A ID + provider B ID) is signed with an HMAC-SHA256 token before being sent to the client. When the vote comes back, the server verifies this token. This prevents a client from forging or replaying votes for arbitrary provider pairs.
```typescript
// Sign: server → client (embedded in transcribe response)
const payload = `${sessionId}.${providerAId}.${providerBId}`;
const signature = crypto
  .createHmac("sha256", signingKey)
  .update(payload)
  .digest("base64url");
return `${payload}.${signature}`;
```
```typescript
// Verify: client → server (submitted with vote)
// Recompute the HMAC over the payload and compare; reject on mismatch
```

Each vote records exactly three things: which two providers were compared and who won (or null for a tie). No audio, no transcriptions, and no user identifiers are stored.
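The verification side can be sketched like this (the function name and error handling are illustrative; `timingSafeEqual` is used to avoid leaking signature information through timing):

```typescript
import { createHmac, timingSafeEqual } from "crypto";

// Illustrative verifier for the signed match token described above.
// Token format: `${sessionId}.${providerAId}.${providerBId}.${signature}`.
// The signature is base64url, so it never contains a dot; the last dot
// therefore cleanly separates payload from signature.
function verifyMatchToken(token: string, signingKey: string): boolean {
  const lastDot = token.lastIndexOf(".");
  if (lastDot === -1) return false;
  const payload = token.slice(0, lastDot);
  const signature = token.slice(lastDot + 1);
  const expected = createHmac("sha256", signingKey)
    .update(payload)
    .digest("base64url");
  // Constant-time comparison; lengths must match for timingSafeEqual
  const a = Buffer.from(signature);
  const b = Buffer.from(expected);
  return a.length === b.length && timingSafeEqual(a, b);
}
```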
4. ELO rating system
Rankings use the ELO rating system, the same approach used to rank chess players. The implementation:
- Starting rating: 1500 for every provider.
- K-factor: 32 (standard for systems with moderate churn).
- Expected score: E(A) = 1 / (1 + 10^((R_B - R_A) / 400))
- Win: winner scores 1, loser scores 0.
- Tie: both providers score 0.5.
- Update: R' = R + K × (actual - expected)
Ratings are computed from a single chronological pass over all votes. The leaderboard displays exact ELO scores, sorted by descending rating.
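As a worked example of a single update, consider two providers that both sit at the initial 1500 rating. The values below follow directly from the formulas above:

```typescript
// Worked example: one ELO update with K = 32 and equal starting ratings.
const K = 32;
const expectedScore = (rA: number, rB: number) =>
  1 / (1 + Math.pow(10, (rB - rA) / 400));

let winner = 1500;
let loser = 1500;
const eW = expectedScore(winner, loser); // equal ratings → expected score 0.5
const eL = expectedScore(loser, winner); // 0.5 by symmetry
winner += K * (1 - eW); // 1500 + 32 × 0.5 = 1516
loser += K * (0 - eL);  // 1500 - 32 × 0.5 = 1484
```

The larger the rating gap, the smaller the reward for the expected winner: beating a much lower-rated provider moves the rating only slightly, while an upset moves it a lot.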
```typescript
const K = 32;
const INITIAL_RATING = 1500;

function expectedScore(ratingA: number, ratingB: number): number {
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

for (const vote of votes) {
  const a = ratings.get(vote.providerAId)!; // starts at INITIAL_RATING
  const b = ratings.get(vote.providerBId)!;
  const expectedA = expectedScore(a.rating, b.rating);
  const expectedB = expectedScore(b.rating, a.rating);
  // Win → 1/0, Tie → 0.5/0.5
  const scoreA =
    vote.winnerId === null ? 0.5 : vote.winnerId === vote.providerAId ? 1 : 0;
  const scoreB = 1 - scoreA;
  a.rating += K * (scoreA - expectedA);
  b.rating += K * (scoreB - expectedB);
}
```

5. Leaderboard visibility
Rankings are blurred until a minimum number of votes has been collected. While blurred, provider order is randomized to prevent premature conclusions from insufficient data.
Once revealed, providers are sorted by exact ELO rating (descending).
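The threshold check and randomized ordering might look like the following sketch (the `MIN_VOTES` constant, the row shape, and the function name are assumptions, not the real schema):

```typescript
// Illustrative: hide exact rankings until enough votes exist.
const MIN_VOTES = 100; // hypothetical threshold

interface ProviderRow {
  name: string;
  rating: number;
}

function leaderboardOrder(rows: ProviderRow[], totalVotes: number): ProviderRow[] {
  if (totalVotes < MIN_VOTES) {
    // Blurred: Fisher-Yates shuffle so the order carries no signal
    const shuffled = [...rows];
    for (let i = shuffled.length - 1; i > 0; i--) {
      const j = Math.floor(Math.random() * (i + 1));
      [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
    }
    return shuffled;
  }
  // Revealed: sort by descending ELO rating
  return [...rows].sort((a, b) => b.rating - a.rating);
}
```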
6. Anti-gaming measures
- Blind comparison: provider identities are hidden during voting, so brand preference cannot bias votes.
- Random A/B assignment: which provider appears as “A” or “B” is a coin flip, preventing position bias.
- HMAC-signed match tokens: votes are cryptographically tied to the match they were issued for, preventing forged or replayed votes.
- Balanced matchmaking: least-played pair selection ensures no provider pair is over- or under-represented.
- No stored audio: audio is deleted from temporary storage immediately after transcription completes, regardless of outcome.
- Open source: the entire codebase is public, so anyone can audit the implementation.
7. Known limitations
- User-submitted audio only: the dataset is not controlled. Audio quality, language, accent, and content vary by user. This is intentional (real-world diversity) but means results may not match performance on specific benchmarks.
- Subjective judging: users decide what “better” means. Some may prioritize accuracy, others formatting or punctuation. ELO reflects aggregate human preference, not a single objective metric.
- No normalization: transcriptions are compared as returned by each provider, including differences in casing, punctuation, and formatting. This matches real-world usage but means a provider with better formatting may score higher even with identical word accuracy.
- Sample size: with a small number of votes, rankings can be volatile. The leaderboard is hidden until the minimum vote threshold is reached for this reason.
8. Data storage
The database stores one compact record per vote:

```prisma
model Vote {
  id          String   @id
  sessionId   String   // anonymous session identifier
  providerAId String   // first provider in the matchup
  providerBId String   // second provider in the matchup
  winnerId    String?  // winner ID, or null for ties
  createdAt   DateTime // when the vote was cast
}
```

No audio recordings, no transcriptions, no IP addresses, no user accounts. The session ID is a random UUID generated client-side with no link to any user identity.
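Generating such a session ID might look like this sketch in Node (`crypto.randomUUID()` produces a random v4 UUID; browsers expose the same call via the Web Crypto API):

```typescript
import { randomUUID } from "crypto";

// Illustrative: generate an anonymous session ID. In the browser this
// would typically be created once and cached locally, never tied to an
// account or IP address.
function newSessionId(): string {
  return randomUUID(); // random v4 UUID, no link to user identity
}
```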
9. Sponsorship & API costs
None of the providers listed above have offered free API keys or credits for this project. All API calls are paid for by Gladia, which sponsors the full cost of running every transcription across every provider. We thank them for making this independent comparison possible.
Questions about the methodology? Open an issue on GitHub or email us.