early access · media infra for agents & automations

messy audio in.
clean JSON out.

One call eats the 4-hour file, the speaker mess and the format hell — and hands your agent a stable schema. No ffmpeg. No babysitting.

vocce — zsh

✓ tools registered — try “transcribe this call, give me action items”

needs a key — agent.v1 · failed jobs never billed

see a workflow

// runs in your agent, ships to your stack

Claude Code · Cursor · Gemini · MCP · CLI · REST API · n8n · Zapier · Make · GitHub Actions · Notion · HubSpot

// real output, not a mockup

See what comes back.

Every sample below is an actual file the live engine produced — play it, read it, then run it on your own.

transcript.txt

auto-detected · Vietnamese

Ngoài kia bao nhiêu câu ca tình nhân thơ tháng
Các thế gian nghe như con sông trôi ngẹn ngào
Và anh buồn sai bao nhiêu lệch trên bức trấn
Để biết em rời đi không khi nào về lại
Nói đi em bao dân gió trong đời oan cháy
Mà sao em văn thân đi suốt kinh doanh

● produced by the live engine · run it on your file

// The actual hard part

Transcription is free. Making it reliable at scale isn't.

Anyone can call Whisper. What burns weeks is everything around it. Vocce owns those parts so your agent or automation makes one call and trusts the result.

Multi-hour file chunking

Split a 4-hour recording, transcribe in parallel, and stitch it back with continuous, correct timestamps — no drift at the seams.

handles 10GB+ uploads
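
As an illustration of the stitching step described above, here is a minimal sketch of how chunk-relative timestamps can be mapped back onto the original timeline. It is not Vocce's implementation; `Segment`, `stitch_chunks`, and the `(offset, segments)` layout are hypothetical names chosen for the example, and it assumes each chunk's start offset in the source file is known exactly.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str


def stitch_chunks(chunks: list[tuple[float, list[Segment]]]) -> list[Segment]:
    """Merge per-chunk segments onto the original file's timeline.

    Each chunk is (offset_seconds, segments); segment times are relative
    to the start of that chunk. Chunks may arrive in any order because
    they were transcribed in parallel.
    """
    stitched: list[Segment] = []
    # Sort by offset only, so out-of-order chunks land in timeline order.
    for offset, segments in sorted(chunks, key=lambda chunk: chunk[0]):
        for seg in segments:
            stitched.append(Segment(seg.start + offset, seg.end + offset, seg.text))
    return stitched


# Two chunks that finished out of order (hypothetical data).
chunks = [
    (1800.0, [Segment(0.5, 3.0, "second half")]),
    (0.0, [Segment(12.4, 15.0, "first half")]),
]
timeline = stitch_chunks(chunks)
```

Adding the known chunk offset, rather than accumulating transcribed durations, is what keeps small per-chunk errors from compounding into drift at the seams.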

Speaker diarization

Stable speaker maps that stay consistent across the whole file, with low-confidence segments flagged instead of silently guessed.

speaker_map.json
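
A small sketch of what "flagged instead of silently guessed" can look like on the consuming side. The `confidence` and `needs_review` fields and the 0.6 threshold are assumptions made for illustration; the actual contents of speaker_map.json are not shown on this page.

```python
def flag_low_confidence(segments: list[dict], threshold: float = 0.6) -> list[dict]:
    """Return copies of the segments with an explicit review flag.

    Assumes each segment carries a speaker label and a confidence score
    in [0, 1]; both field names are hypothetical.
    """
    return [
        {**seg, "needs_review": seg["confidence"] < threshold}
        for seg in segments
    ]


segments = [
    {"speaker": "A", "start": 12.4, "confidence": 0.93},
    {"speaker": "B", "start": 18.1, "confidence": 0.41},
]
flagged = flag_low_confidence(segments)
```

Downstream code can then route flagged segments to a human or skip them, instead of trusting a speaker label the engine was unsure about.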

Idempotent jobs & retries

Re-send the same job, get the same job_id. Transient backend failures retry automatically. Failed jobs aren't billed.

exactly-once semantics
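
One common way to get "same job in, same job_id out" is to derive a deterministic key from the request itself. The sketch below shows that technique in general terms; `idempotency_key`, `JobRegistry`, and the `job_` prefix are hypothetical and are not taken from Vocce's API.

```python
import hashlib
import json


def idempotency_key(source_url: str, options: dict) -> str:
    """Hash the request into a stable key; option ordering does not matter."""
    canonical = json.dumps(
        {"source": source_url, "options": options},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class JobRegistry:
    """In-memory stand-in for a job store keyed by idempotency key."""

    def __init__(self) -> None:
        self._jobs: dict[str, str] = {}

    def submit(self, source_url: str, options: dict) -> str:
        key = idempotency_key(source_url, options)
        # A repeated submission returns the existing job instead of a new one.
        if key not in self._jobs:
            self._jobs[key] = f"job_{key[:12]}"
        return self._jobs[key]


registry = JobRegistry()
first = registry.submit("https://example.com/call.mp4", {"diarize": True, "language": "auto"})
retry = registry.submit("https://example.com/call.mp4", {"language": "auto", "diarize": True})
other = registry.submit("https://example.com/call.mp4", {"diarize": False, "language": "auto"})
```

With a key like this, a client can safely retry after a timeout: the retry maps to the job that already exists rather than starting (and billing) a second one.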

Format hell, solved

MOV, M4A, WebM, weird bitrates, video with no normalized audio: Vocce cleans the audio, normalizes loudness, compresses it for ASR, and accepts either a URL or an upload. You never touch ffmpeg.

20+ formats · clean · normalize · compress

Webhooks & one schema

Push completed results to your hook, queue, or automation. The output schema is identical across MCP, CLI, API, and every node.

agent.v1 schema
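
Webhook deliveries are often retried by the sender, so a receiver usually wants to be safe against seeing the same job twice. The handler below is a minimal sketch of that pattern. The `status` field and its `"completed"` value are assumptions for illustration; only `job_id` is named on this page.

```python
import json


def handle_webhook(raw_body: bytes, seen_job_ids: set[str]) -> str:
    """Process a completed-job notification at most once per job_id."""
    payload = json.loads(raw_body)

    # Hypothetical field: only act on jobs reported as completed.
    if payload.get("status") != "completed":
        return "ignored"

    job_id = payload["job_id"]
    if job_id in seen_job_ids:
        return "duplicate"

    seen_job_ids.add(job_id)
    # Real code would hand the payload to a queue or CRM sync here.
    return "processed"


seen: set[str] = set()
body = json.dumps({"job_id": "job_123", "status": "completed"}).encode("utf-8")
results = [handle_webhook(body, seen), handle_webhook(body, seen)]
```

In production the `seen` set would live in a database or cache rather than in memory, so that deduplication survives restarts.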

// The deliverable

Not a wall of text. Output your code can act on.

One job returns every artifact downstream tools need — structured, versioned, and stable enough to build on.

transcript.agent.json

{
  "speaker_turns": [
    { "speaker": "A", "start": 12.4, "text": "..." }
  ],
  "chapters": ["Problem", "Decision", "Next steps"],
  "artifacts": ["transcript.md", "subtitles.srt"]
}
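
To show what "output your code can act on" means in practice, here is a minimal sketch that loads the sample above into typed objects. It covers only the fields visible in the sample; the full agent.v1 schema may contain more, and the class and function names are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class SpeakerTurn:
    speaker: str
    start: float  # seconds from the start of the file
    text: str


@dataclass
class AgentTranscript:
    speaker_turns: list[SpeakerTurn]
    chapters: list[str]
    artifacts: list[str]


def parse_agent_json(raw: str) -> AgentTranscript:
    """Parse the fields shown in the sample; unknown extra fields are ignored."""
    data = json.loads(raw)
    turns = [
        SpeakerTurn(t["speaker"], float(t["start"]), t["text"])
        for t in data["speaker_turns"]
    ]
    return AgentTranscript(turns, list(data["chapters"]), list(data["artifacts"]))


sample = """
{
  "speaker_turns": [
    { "speaker": "A", "start": 12.4, "text": "..." }
  ],
  "chapters": ["Problem", "Decision", "Next steps"],
  "artifacts": ["transcript.md", "subtitles.srt"]
}
"""
doc = parse_agent_json(sample)
subtitle_files = [name for name in doc.artifacts if name.endswith(".srt")]
```

Picking fields by name, rather than unpacking the whole object, keeps the parser working if later schema versions add fields.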

// Distribution that actually exists

Drop Vocce into the tools you already use.

Real discovery happens where people build: the MCP registry, automation marketplaces, and your CI. Stable tool names and one schema across every channel.

install

{
  "mcpServers": {
    "vocce": {
      "command": "npx",
      "args": ["@vocce/transcribe-mcp"],
      "env": { "VOCCE_API_KEY": "vc_..." }
    }
  }
}

MCP starter ↓ · CLI starter ↓ · Skill starter ↓

MCP Registry

Published server metadata so it's installable & indexable.

vocce

n8n / Make / Zapier

Pre-built “Transcribe media” nodes users search and add.

automation marketplaces

GitHub Action

Drop into CI to transcribe release media or PR demos.

vocce/transcribe-action

Stable contract

Tool names never drift, so agents form reliable habits.

create_transcription_job

// Flagship workflow

Turn customer calls into CRM-ready intelligence.

One example of going deep, not wide. The same primitives power podcast publishing, research interviews, and compliance captioning — but here's the workflow teams pay for first.

sales & success teams

Recording in. Action in your CRM out.

Drop a call recording (or point Vocce at the URL). It cleans the audio, separates speakers, extracts objections, commitments, and next steps, then pushes a structured summary straight into HubSpot, Salesforce, or Notion via webhook. No re-recording, no manual notes, no glued-together pipeline.

Try it on a file · Meeting notes page

Call recording: upload or URL, any format, any length

↓

Vocce, one call: clean · diarize · transcribe · extract

↓

Structured intelligence: objections, commitments, next steps, owners

↓

Pushed to your CRM: webhook → HubSpot / Salesforce / Notion

// Where Vocce fits

Between a raw API and a heavy app.

We're honest about the lane: not the cheapest STT, not a meeting bot. The reliable layer in the middle.

Raw ASR APIs

Cheap minutes, no pipeline.

Deepgram, AssemblyAI, Whisper give you text. You still build chunking, diarization stitching, retries, and delivery. Vocce is that layer.

Transcription apps

Great UI, closed box.

Otter and TurboScribe are built for humans in a dashboard. Vocce is built to be called by code, with a stable schema and webhooks.

DIY pipeline

Works until it doesn't.

Your own ffmpeg + queue + ASR script breaks on the 4-hour file at 2am. Vocce is the maintained version of that script.

// Try it in the browser

No SDK needed to see the output.

Drop a file and get a free 3-minute preview: quality report, the first subtitle lines, and a sample of the agent JSON. When the export matters, pay per pack or wire up the API.

✓ 3-minute free preview · ✓ No card required · ✓ MP3, MP4, M4A, WAV, MOV

vocce · transcribe · ● live

// High-intent entry points

One engine. Many ad-ready pages.

The same reliable pipeline behind focused landing pages — each opens with the exact job the visitor searched for.

Convert audio to text, then export the whole pack.

Turn video into transcripts, captions, and reusable notes.

Convert MP4 into text, captions, and clean handoff files.

Turn quick voice notes into usable text.

Create subtitles from audio or video without rebuilding the workflow.

Translate subtitles without breaking the timeline.

Drop a video. Get a soundtrack that fits.

Turn podcast episodes into transcripts, notes, and reusable content.

Upload a meeting recording. Get decisions and next steps.

Give your AI agent a media processing tool.

Automate transcription from the terminal.

Give agents transcripts they can actually reason over.

Clean up messy voice recordings.

Make big, messy media easier to transcribe.

Convert M4A to MP3 in seconds.

Convert WAV to MP3 without the bulk.

Convert FLAC to MP3 that plays everywhere.

Convert AAC to MP3 cleanly.

Convert OGG to MP3 that just works.

Turn MP4 video into MP3 audio.

Convert any audio to any format.

Convert MOV to MP4 without re-encoding pain.

Pull the audio out of any video.

Compress MP3 files without killing the sound.

Compress any audio file down to size.

Get any video under 10MB.

Make your clip fit Discord's upload limit.

Shrink video small enough to email.

Compress MP4 files fast.

Compress any video, predictably.

Make any recording sound professional.

Convert MP3 to text you can use.

Turn lectures into notes worth keeping.

Transcribe entire courses, lesson by lesson.

Get the transcript of a YouTube video.

Summarize YouTube videos in minutes.

Notes that write themselves.

Generate WebVTT captions for the web.

Translate subtitles into any language.

Translate SRT files without breaking sync.

Turn any YouTube video into an MP3.

Download any YouTube video as MP4.

Download YouTube videos, then do more with them.

Download TikTok videos, watermark-free.

Grab the audio from any TikTok.

Download Instagram Reels and videos.

Download videos from X (Twitter).

Download Facebook videos as MP4.

// FAQ

Answers that reduce build risk.

What does Vocce do that a raw ASR API can't?

A raw API returns text for one clean file. Vocce owns the pipeline around it: chunking and stitching multi-hour files without timestamp drift, speaker diarization with a stable map, format normalization, idempotent jobs, retries, and webhook delivery — returned as one versioned schema.

Where can I call it from?

MCP, CLI, REST API, and pre-built nodes for n8n, Zapier, Make, plus a GitHub Action. Tool names and the output schema are identical across every channel.

Do failed jobs cost credits?

No. Upload failures and failed jobs don't burn credits. Processing is only charged when a job completes successfully, with a clear failure reason exposed to your code.

Which exports are supported?

TXT, Markdown, DOCX, SRT, VTT, clean audio, compressed audio, summary Markdown, a quality report, and the agent JSON schema for downstream tools.

Is it ready for my backend?

The frontend posts jobs to a configurable API endpoint and fails closed if the backend isn't connected. Wire it to your upload, queue, ffmpeg, ASR, and export pipeline; the contract is documented in backend-contract.md.

stop rebuilding the same media pipeline.

Get an API key, install the MCP, or just drop a file. One call, one schema, every channel.

get an API key · try a free preview
