Audio Caption

Critical - WCAG Level A

What the issue is: The page uses an <audio> element to present prerecorded speech or other audio content but provides neither captions nor a transcript. For HTML5 audio, captions are supplied as timed text tracks (e.g., WebVTT) via the <track> element with kind="captions"; for audio-only content, a full text transcript serves as the alternative. Without one of these, people who are deaf or hard of hearing cannot access the spoken information.
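For illustration, a minimal failing case might look like this (the file name is hypothetical):

```html
<!-- Fails: prerecorded audio with no caption track and no transcript -->
<audio controls src="interview.mp3"></audio>
```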

Why it matters: Users who are deaf or hard of hearing rely on captions or transcripts to access the content. Screen reader users may prefer a transcript because it can be read non-linearly. Captions also help people in noisy environments, people with cognitive disabilities, and second-language learners. WCAG 1.2.2, Captions (Prerecorded), requires captions for prerecorded synchronized media; for prerecorded audio-only content, WCAG 1.2.1 requires a text alternative such as a transcript.

How to fix it:

  1. Create a timed text file in a supported format (WebVTT .vtt) that contains synchronized captions including speaker IDs and non-speech sounds.

  2. Add a <track> element inside the <audio> element: <track kind="captions" srclang="en" label="English" src="captions_en.vtt">.

  3. Provide a human-readable transcript on the page (or linked) for audio-only or for users who prefer a text alternative; link it from the audio with aria-describedby or a visible link.

  4. Offer caption tracks for every language you publish, and ensure captions are accurate and synchronized. Note that most browsers do not natively render <track> captions for <audio> elements, so if your player doesn’t surface them, implement a JS player that reads the WebVTT and displays captions in a visible region.
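Step 1 above can be sketched as a short WebVTT file with speaker identification and a non-speech sound (speakers, content, and timings are illustrative):

```
WEBVTT

00:00:00.000 --> 00:00:04.000
<v Host>Welcome to the show.

00:00:04.000 --> 00:00:06.500
[upbeat theme music]

00:00:06.500 --> 00:00:10.000
<v Guest>Thanks for having me.
```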
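Steps 2 and 3 can then be combined in the markup; a sketch, where the file names and the transcript id are hypothetical:

```html
<audio controls aria-describedby="transcript-note">
  <source src="interview.mp3" type="audio/mpeg">
  <track kind="captions" srclang="en" label="English" src="captions_en.vtt" default>
</audio>
<!-- Visible link to a full transcript, also announced via aria-describedby -->
<p id="transcript-note">
  <a href="#transcript">Read the full transcript of this interview</a>
</p>
```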
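For step 4, the first thing a custom caption display needs is the cue data itself. A minimal WebVTT cue-parser sketch, assuming simple cues (no cue settings, no NOTE or STYLE blocks); the function name is ours, not a standard API:

```javascript
// Parse a WebVTT string into an array of {start, end, text} cues,
// with start/end in seconds. Assumes simple, well-formed cues.
function parseVtt(vtt) {
  // Accepts "HH:MM:SS.mmm" or "MM:SS.mmm" timestamps.
  const toSeconds = (ts) =>
    ts.split(':').map(Number).reduce((acc, p) => acc * 60 + p, 0);

  const cues = [];
  // Blocks are separated by blank lines; the WEBVTT header block
  // contains no "-->" and is skipped.
  for (const block of vtt.trim().split(/\n\s*\n/)) {
    const lines = block.trim().split('\n');
    const timingIdx = lines.findIndex((l) => l.includes('-->'));
    if (timingIdx === -1) continue; // header or malformed block
    const [start, end] = lines[timingIdx]
      .split('-->')
      .map((s) => toSeconds(s.trim()));
    cues.push({ start, end, text: lines.slice(timingIdx + 1).join('\n') });
  }
  return cues;
}
```

In the browser, such cues (or the track element's own cuechange events, where supported) can drive a visible caption region kept in sync with the audio element's currentTime.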

Best practices: Use WebVTT, include non-speech information and speaker names, supply multiple language tracks with correct srclang, provide a persistent, keyboard-focusable caption toggle, and proofread captions (avoid relying solely on auto-captions). Use aria-describedby on the audio to point to the transcript and mark the transcript region with role="region" and aria-label.
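A sketch of that transcript markup (the id, label text, and content are illustrative):

```html
<div id="transcript" role="region" aria-label="Transcript: interview">
  <h2>Transcript</h2>
  <p><strong>Host:</strong> Welcome to the show.</p>
  <p><strong>Guest:</strong> Thanks for having me.</p>
</div>
```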

Common mistakes: Omitting srclang, using subtitles instead of captions (subtitles assume the user can hear), relying only on a transcript for synchronized media where captions are required, embedding captions only in a visual-only player that is inaccessible to assistive tech, and not proofreading auto-generated captions.