Transcript Sync

I’m looking for a simple way to get a transcript to automatically scroll as I play embedded audio or video on the same page.
If there’s a solution that will work with an embedded YouTube video, then that would be great.

Here are the features I’m looking for:

  • When the video is played, the transcript auto scrolls in sync.
  • When the video is scrubbed forward or backward, the transcript will automatically jump to the new timecode.
  • The user has the ability to click a phrase/word/syllable/letter in the transcript and the video player will jump to that timecode.
  • The user has the ability to turn off automated scrolling of the transcript in favor of free scrolling and searching.
  • The user has the ability to toggle off click-to-seek (clicking a word to seek to that point in the video), so that words in the transcript can contain hyperlinks to pertinent information such as Wikipedia entries, Twitter profiles of people being mentioned, other videos being mentioned, etc.
  • This library/player can also generate a URL fragment which can be appended to the end of the page URL (in a similar fashion to appending a timecode to the end of a YouTube URL) which can pass a timecode to the player on page load for the purpose of linking others to a specific quote.
  • Beyond supplying only a beginning time, we might offer the user the ability to highlight the appropriate phrases and have a collection of (not necessarily consecutive) timecodes be auto generated + represented in as concise + human readable a manner as possible.
    (In other words I want the ability to construct a playlist of moments selected within the audio or video which correspond to the desired quotes.)
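The first three features above all reduce to one lookup: given a playback time, which phrase of the transcript is being spoken? Here is a minimal sketch of that lookup, assuming the transcript is already available as a list of timed phrases sorted by start time. The names `TranscriptCue`, `findActiveCue`, and `demoCues` are hypothetical and not taken from any existing library.

```typescript
// One timed phrase of the transcript (times in seconds).
interface TranscriptCue {
  start: number;
  end: number;
  text: string;
}

// Return the index of the cue being spoken at `time`, or -1 if none is.
// Assumes `cues` is sorted by start time and that cues do not overlap.
function findActiveCue(cues: TranscriptCue[], time: number): number {
  let lo = 0;
  let hi = cues.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const cue = cues[mid];
    if (time < cue.start) {
      hi = mid - 1; // the active cue, if any, is earlier
    } else if (time >= cue.end) {
      lo = mid + 1; // the active cue, if any, is later
    } else {
      return mid;
    }
  }
  return -1; // `time` falls in a gap between cues
}

// Made-up sample data, for illustration only.
const demoCues: TranscriptCue[] = [
  { start: 0, end: 2.5, text: "Hello and welcome." },
  { start: 2.5, end: 6, text: "Today we talk about transcripts." },
  { start: 8, end: 11, text: "Let's begin." },
];
```

On a page with an HTML5 `<video>` or `<audio>` element, the same function could be called with the element's `currentTime` whenever it fires a `timeupdate` event, which covers both normal playback and scrubbing. Clicking a phrase would go the other way: set `currentTime` to that cue's `start`.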

If you turn on the transcript under the TED Talk video player, it offers a few of these features which may help you visualize more clearly what I am after.
I believe that their transcript viewer previously had automated scrolling of the transcript to keep it in sync with the video but this appears to no longer be the case.

A quick search on Bing also turned up TimeJump for deep linking inside of audio or video on a page.
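To illustrate the deep-linking idea, here is a rough sketch of reading a timecode out of a URL fragment and writing one back. The `#t=start,end` format is an assumption loosely modelled on the temporal syntax of W3C Media Fragments; it is not necessarily the format TimeJump uses. The function names are hypothetical.

```typescript
interface FragmentTimes {
  start: number;      // seconds
  end: number | null; // seconds, or null when only a start was given
}

// Convert "90", "1:30" or "0:01:30.5" to seconds; returns null if malformed.
function parseClock(value: string): number | null {
  if (!/^\d+(\.\d+)?$|^(\d+:)?\d{1,2}:\d{2}(\.\d+)?$/.test(value)) return null;
  return value.split(":").reduce((total, part) => total * 60 + Number(part), 0);
}

// Parse a hash such as "#t=1:30" or "#t=90,120" into start/end seconds.
function parseTimeFragment(hash: string): FragmentTimes | null {
  const match = /^#?t=([^,]+)(?:,(.+))?$/.exec(hash);
  if (!match) return null;
  const start = parseClock(match[1]);
  if (start === null) return null;
  if (match[2] === undefined) return { start, end: null };
  const end = parseClock(match[2]);
  if (end === null || end <= start) return null;
  return { start, end };
}

// Build the fragment to append to the page URL when sharing a quote.
function buildTimeFragment(start: number, end?: number): string {
  return end === undefined ? `#t=${start}` : `#t=${start},${end}`;
}
```

On page load, the result of `parseTimeFragment(location.hash)` could be used to set the media element's `currentTime`.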

YouTube is good insofar as:

  • It provides a free hosting solution.
  • It provides the only free solution I’m aware of to automatically synchronize a transcript to the audio of a video and auto-generate subtitles.
    (The automated line breaks are usually borderline useless; however, this caption/subtitle file can be downloaded, manually edited, and resynchronized once we have the first automatic rough draft.)
  • While on a video’s page on YouTube.com any timecode mentioned in the description or comments is a hyperlink which will seek the player to that hour/minute/second.
  • We can easily append a timecode to the end of a YouTube URL to seek to a particular quote on page load.
  • YouTube allows us to easily create video playlists, and within a playlist any video can be trimmed to a selection of at least 15 seconds.
  • The same video can be added to a single playlist multiple times if we need multiple discrete quotes from a single video.

However YouTube is limited to the degree that:

  • When embedding a YouTube video on an external site, I do not know of a way to have timecode links on that page affect the video player’s progress bar.
  • Because of this, while I can display a transcript alongside the video and have a link to a timecode for each line/phrase in the transcript, it will want to open the link to that YouTube video in a new tab on YouTube.com rather than seeking to that point in the video player that I’ve embedded on the same page.
  • I know of no way for the transcript to know where the video player is progress-wise, and therefore do not know how to automatically scroll the transcript to the part being spoken in the video if the user manually scrubs forward or backward.
  • While the ability is there to temporally crop playback of individual videos in a YouTube playlist, this only works on laptops and desktops.
    On any mobile OS where the video is passed to the native media player, it will always play the entire video. If we add the same video to a YouTube playlist multiple times in order to play several non-consecutive portions, then on a mobile OS or video game console the full-length video will play multiple times rather than only the selected portions.
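One possible route around the first three limitations is YouTube's IFrame Player API, whose player objects expose `getCurrentTime()` and `seekTo(seconds, allowSeekAhead)` for videos embedded on an external page. The sketch below is written against a minimal interface containing only those two methods, so it does not load the API itself. `SeekablePlayer`, `TimedLine`, `watchPlayer`, and `seekToLine` are hypothetical names, and the 250 ms polling interval is an arbitrary assumption.

```typescript
// The two methods used here mirror those on YouTube's IFrame API player object.
interface SeekablePlayer {
  getCurrentTime(): number;
  seekTo(seconds: number, allowSeekAhead: boolean): void;
}

interface TimedLine {
  start: number; // seconds
  text: string;
}

// Index of the last line whose start time is at or before `time` (-1 if none).
function lineIndexAt(lines: TimedLine[], time: number): number {
  let index = -1;
  lines.forEach((line, i) => {
    if (line.start <= time) index = i;
  });
  return index;
}

// Poll the player and report whenever the spoken line changes. Polling also
// catches manual scrubbing, since the reported time simply jumps.
// Returns a function that stops the polling.
function watchPlayer(
  player: SeekablePlayer,
  lines: TimedLine[],
  onLineChange: (index: number) => void,
  intervalMs = 250,
): () => void {
  let last = -2; // sentinel so the first tick always reports
  const tick = (): void => {
    const index = lineIndexAt(lines, player.getCurrentTime());
    if (index !== last) {
      last = index;
      onLineChange(index);
    }
  };
  tick();
  const timer = setInterval(tick, intervalMs);
  return () => clearInterval(timer);
}

// A click on a transcript line seeks the embedded player in place,
// instead of opening the video in a new tab.
function seekToLine(player: SeekablePlayer, lines: TimedLine[], index: number): void {
  player.seekTo(lines[index].start, true);
}
```

The `onLineChange` callback is where the page would scroll the transcript and highlight the current line.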

If there’s no way to get this type of control in concert with YouTube’s Flash and HTML5 video players, then a solution which takes an MP4 and an SRT as input and renders them with something universal like JavaScript would be ideal, as long as the performance is good. Progressive enhancement to use Canvas, or anything else which helps us avoid slowdown and gives us more presentation options, would be welcome.
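For the SRT half of that input, a small parser is enough to turn the subtitle file into timed phrases. The following is a sketch under simplifying assumptions: well-formed blocks separated by blank lines, timestamps with exactly three millisecond digits, and no positioning data after the end timestamp. `SrtCue`, `srtTimeToSeconds`, and `parseSrt` are hypothetical names.

```typescript
interface SrtCue {
  index: number;
  start: number; // seconds
  end: number;   // seconds
  text: string;
}

// Convert an SRT timestamp such as "00:01:02,500" to seconds.
function srtTimeToSeconds(stamp: string): number {
  const match = /^(\d+):(\d{2}):(\d{2})[,.](\d{3})$/.exec(stamp.trim());
  if (!match) throw new Error(`Bad SRT timestamp: ${stamp}`);
  const [, h, m, s, ms] = match;
  return Number(h) * 3600 + Number(m) * 60 + Number(s) + Number(ms) / 1000;
}

// Parse the text of an .srt file into cues, skipping malformed blocks.
function parseSrt(source: string): SrtCue[] {
  const cues: SrtCue[] = [];
  // Normalise Windows/Mac newlines, then split on blank lines between blocks.
  const blocks = source.replace(/\r\n?/g, "\n").trim().split(/\n{2,}/);
  for (const block of blocks) {
    const lines = block.split("\n");
    if (lines.length < 3) continue; // need index, timing, and at least one text line
    const times = lines[1].split("-->");
    if (times.length !== 2) continue;
    cues.push({
      index: Number(lines[0]),
      start: srtTimeToSeconds(times[0]),
      end: srtTimeToSeconds(times[1]),
      text: lines.slice(2).join("\n"),
    });
  }
  return cues;
}
```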

Bonus points if it:

  • can work on mobile devices rather than the video being shuffled off to the mobile OS’ default media player (because the media player will not support any custom transcript viewer or obey commands to only play specified portions of the media)
  • can apply dynamic animations to the text currently being heard, such as printing the phrase to screen character by character or a subtle highlight moving across different phrases, based on the number of characters between a beginning timecode and finishing timecode for a given phrase.
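The character-by-character effect in the last item comes down to a small piece of arithmetic: spread the phrase's characters evenly between its beginning and finishing timecodes. A minimal sketch, with hypothetical function names:

```typescript
// Number of characters of `text` to show at `time`, spreading the characters
// evenly between the phrase's start and end timecodes (all in seconds).
function visibleCharCount(text: string, start: number, end: number, time: number): number {
  if (time <= start) return 0;
  if (time >= end) return text.length;
  const progress = (time - start) / (end - start); // strictly between 0 and 1 here
  return Math.floor(progress * text.length);
}

// The part of the phrase that has been "printed to screen" so far.
function revealedText(text: string, start: number, end: number, time: number): string {
  return text.slice(0, visibleCharCount(text, start, end, time));
}
```

For example, a phrase of 11 characters running from 10 s to 12 s would show its first 5 characters at 11 s. A moving highlight could use the same count as the position of its leading edge.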

Please let me know if you’re aware of anything like this.
I’m @natelawrence on Twitter.

Provisos:

  • Because we are dealing with heavily compressed web audio + video which comes in fairly large chunks, even if we had very precise timecodes, the player may only be able to seek to the beginning of the nearest chunk of audio.

Related concepts:

  • Verbatim transcript collapses to Proofread transcript.
    In transcripts of impromptu conversation, which contain rephrasings, stutters, etc., it would help to have transcript markup that lets the transcript viewer toggle between showing verbal missteps (repeated words, “um”s, “uh”s, etc.) and hiding them in a proofread version. The two versions should not be stored as separate text files, so that visual continuity is maintained by all the text that the verbatim and proofread transcripts have in common.
  • A cross-site audio / video playlist webservice which allows me to add any embeddable web video to a single central audio + video playlist, specify that I want only a selected portion of the audio or video to play, and add captions/subtitles/synchronized transcripts, independent of the ability of the audio or video player from the site of origin to display such things.
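The first related concept, a verbatim transcript that collapses to a proofread one, could be represented by storing the shared text once and attaching a replacement only where the two versions differ. This is one possible shape for such markup, not an existing format; `TranscriptPiece`, `ViewMode`, and `renderTranscript` are hypothetical names and the sample sentence is made up.

```typescript
// Shared text carries only `text`. A difference also carries `proofread`;
// an empty `proofread` means the spoken words are hidden in the proofread view.
interface TranscriptPiece {
  text: string;       // what was actually said
  proofread?: string; // replacement for the proofread view, where it differs
}

type ViewMode = "verbatim" | "proofread";

// Produce the text for one view from the single stored transcript.
function renderTranscript(pieces: TranscriptPiece[], mode: ViewMode): string {
  return pieces
    .map((piece) =>
      mode === "proofread" && piece.proofread !== undefined ? piece.proofread : piece.text,
    )
    .join("");
}

// Made-up sample: only three of the six pieces differ between the views.
const samplePieces: TranscriptPiece[] = [
  { text: "I want to speak " },
  { text: "um, ", proofread: "" },
  { text: "Pacifically", proofread: "specifically" },
  { text: " about " },
  { text: "the the ", proofread: "the " },
  { text: "transcript." },
];
```

Because each piece keeps its identity across both views, a viewer could animate only the pieces that have a `proofread` value and leave the rest in place.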

 

EDIT:
I originally wrote the above entry so that I could link to it in a comment on someone’s short blog post on TimeJump (so as not to write an entire chapter in their comment section), but my actual comment there included a bunch of other related thoughts I’ve had on this topic for many years and ended up being as long as (if not longer than) this post. There is some redundancy, but for my own sake here is the full text of that comment.

Eric, hi.

I’m wondering if you’ve ever come across anything that would include concepts from TimeJump, but add more ideas as well.

I jotted out a quick sketch of what I’m looking for on my blog.

The gist of it is that, given a chunk of audio or video and a full transcript of what was said therein combined with subtitle-style timecodes that correspond to different phrases in the text, I want to be able to present the transcript and the source media together and have the text auto scroll to display the part of the transcript that is currently being heard.

I tend to transcribe audio as thoroughly as possible on first pass (including stuttering, “um”s, “uh”s, rewordings, etc.) just to be sure that I’ve not missed anything.

I then make a copy which I proofread and clean up, trying to punctuate the original words to be as grammatically correct as possible and remove any repeated phrases to convey the speaker’s intended meaning and to provide a text that will translate well. This also includes correcting any slips of the tongue (say someone is illustrating a point and says “Abraham” when they obviously meant to say “Moses”, or accidentally says “Pacifically” rather than “specifically”).

This results in two transcripts, but I would like to be able to present the option to the viewer to toggle between these on the fly as the transcript plays.

My ideal is that I could simply mark up the proofread differences into the verbatim transcript and not have a duplicate copy of all the text they have in common.

This is good for two reasons:

  1. File size/download time is cut by not redundantly storing the majority of the words twice.
  2. The transcript player can keep visual continuity on the words that don’t change and simply animate the repeated phrases collapsing to the proofread replacement so that our viewers do not lose their place.
  • The transcript is good for translation.
  • The transcript is good for searching.
  • The transcript is good for hyperlinking to a specific point in the video.
  • The transcript is good for hyperlinking to references (as you mention).
  • The transcript is good for those who cannot hear or cannot hear well, and for cases where the video is not of sufficient quality to lip read or has cut away to illustrative material while the speaker continues.

Above and beyond these requirements:

  • I’m looking for a way to highlight multiple sections of a single time-synchronized-transcript, which could auto generate the appropriate timecodes and then present a playlist of selections within a single piece of audio or video.
  • I’m looking for a way to have a playlist of multiple audio + video clips (potentially from different websites) and retain the ability to play multiple selections of arbitrary length from within each of them.
    (Imagine you have some audio journals and some video journals throughout a creative project and you want to play back all statements regarding a particular subject that is mentioned at least once in 75% of the recordings.)
  • I am looking for a way for the player to buffer ahead the specific selections of audio or video that we have specified that we will be playing.
  • It should always be made clear to the user, through text, that what they have listened to is only a portion of the whole.
  • I am looking for a way for the selections of audio + video to work on mobile devices and video game consoles without the operating system’s default media player intercepting the HTML5 audio or video and simply playing back the entire recording.
  • I am looking for a way to automatically generate timecodes for the beginning of each word in a transcript (currently, uploading a video to YouTube along with a uniquely formatted transcript is the only way that I know of).
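The first item in this list, turning highlighted sections into a playlist of selections, can be sketched as a merge of time ranges: adjacent or overlapping highlighted phrases become one clip, and the clips are then printed in a short human-readable form. All names here are hypothetical, and the `m:ss-m:ss` output format is just one assumed way of keeping it concise.

```typescript
interface TimedPhrase {
  start: number; // seconds
  end: number;   // seconds
  text: string;
}

interface ClipRange {
  start: number;
  end: number;
}

// Merge the highlighted phrases into as few clips as possible: phrases that
// touch or overlap in time (within `gap` seconds) become a single clip.
function clipsFromSelection(phrases: TimedPhrase[], selected: number[], gap = 0): ClipRange[] {
  const ranges = selected
    .map((i) => phrases[i])
    .filter((p) => p !== undefined) // ignore indexes outside the transcript
    .map((p) => ({ start: p.start, end: p.end }))
    .sort((a, b) => a.start - b.start);
  const clips: ClipRange[] = [];
  for (const range of ranges) {
    const last = clips[clips.length - 1];
    if (last !== undefined && range.start <= last.end + gap) {
      last.end = Math.max(last.end, range.end); // extend the current clip
    } else {
      clips.push(range); // start a new clip
    }
  }
  return clips;
}

// Format whole seconds as m:ss (minutes are not wrapped into hours).
function formatClock(seconds: number): string {
  const whole = Math.floor(seconds);
  const m = Math.floor(whole / 60);
  const s = whole % 60;
  return `${m}:${s < 10 ? "0" : ""}${s}`;
}

// e.g. "0:00-0:09, 1:05-1:20"
function describeClips(clips: ClipRange[]): string {
  return clips.map((c) => `${formatClock(c.start)}-${formatClock(c.end)}`).join(", ");
}
```

A player would then step through the clips in order, seeking to each `start` and moving on when playback reaches the matching `end`.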

My ideal is actually a timecode for the beginning and end of each syllable.
This gives us several things:

  1. When we switch from the verbatim transcript to the proofread transcript, we could actually skip the pieces of audio that we have folded out of view, allowing us to cut out stammering, rephrasing, unnecessarily long pauses while a speaker reviews their notes, etc.
  2. At this point, we can have the subtitles/captions/transcript print itself to screen on a per-character basis, synchronized with the speaker’s speech. We can thus visualize the speaker’s words materializing into existence on the page as they are spoken, because we simply divide the amount of time that a syllable takes to play back by the number of characters in that syllable. This keeps the text in step with the nuances of the speaker’s pace, whereas averaging the total number of letters in a word over the whole word may be too loose to be pleasant whenever one syllable is held for a longer period of time.
  3. This gives us the ability to generate a timecode for every letter in the transcript to seek to any point in the media (within the constraints of the compressed format that the media is stored in).
  4. It would require more markup, but we could actually animate someone forming their wording when they are rephrasing a sentence by use of crossed out text, deleting a phrase that will be replaced to complete the sentence, etc. My instinct, since I was in elementary school, was that this would help people think more clearly about grammar and punctuation.
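Items 1 to 3 can be sketched with two small functions, assuming syllable-level timecodes are already available (which, as noted, is the hard part). The first divides each syllable's duration evenly among its characters to give every letter a timecode; the second decides where playback should jump when it reaches audio that the proofread view has folded away. `TimedSyllable`, `HiddenSpan`, `characterTimes`, and `skipHidden` are hypothetical names.

```typescript
interface TimedSyllable {
  text: string;
  start: number; // seconds
  end: number;   // seconds
}

interface CharTime {
  char: string;
  time: number; // seconds at which this character should appear
}

// Give every character its own timecode by dividing each syllable's
// duration evenly among that syllable's characters.
function characterTimes(syllables: TimedSyllable[]): CharTime[] {
  const result: CharTime[] = [];
  for (const syllable of syllables) {
    const step = (syllable.end - syllable.start) / syllable.text.length;
    for (let i = 0; i < syllable.text.length; i++) {
      result.push({ char: syllable.text.charAt(i), time: syllable.start + i * step });
    }
  }
  return result;
}

// A stretch of audio hidden in the proofread view (stammer, long pause, etc.).
interface HiddenSpan {
  start: number;
  end: number;
}

// In proofread mode: if `time` falls inside hidden audio, return the time to
// jump to (the end of that span); otherwise return `time` unchanged.
function skipHidden(time: number, hidden: HiddenSpan[]): number {
  for (const span of hidden) {
    if (time >= span.start && time < span.end) return span.end;
  }
  return time;
}
```

As the provisos earlier point out, the media format may only allow seeking to the start of the nearest chunk, so the per-letter times are better treated as display timings than as exact seek targets.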

My summary seems to have surpassed my original post in some ways, and there are more ideas here than my core requirements, which are:

For a single piece of audio/video on a webpage and a time-synced transcript I would like to:

  1. Display them side by side, showing the text that is currently being spoken
    (resynchronizing when the user seeks to a previous or future point in the recording).
    The idea is that we can add our own transcript/subtitles for a video or audio that did not include one.
  2. Click on a phrase within the transcript which corresponds to a segment of time to scrub the media player to that timecode (as TimeJump does on page load).
    See the transcript below the video player at TED.com for an example.
  3. Allow the user to stop the transcript’s auto scroll to search the text for a particular keyword + then seek to that time in the audio/video.
  4. Allow the user to toggle off time hyperlinks in order to click on hyperlinks to external resources mentioned (a passage on Bible.com, a Wikipedia article, another church’s website, someone’s profile on Twitter, etc.). Perhaps it would suffice to hold down Ctrl and then click on a hyperlink embedded in the transcript to avoid seeking to that time in the media.
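The interplay in item 4 between time hyperlinks, external hyperlinks, the toggle, and the Ctrl key can be written down as a small decision function. This is only one reading of the behaviour described above; `ClickAction`, `ClickContext`, and `resolveClick` are hypothetical names.

```typescript
type ClickAction = "seek" | "follow-link" | "none";

interface ClickContext {
  hasTimecode: boolean; // the clicked word maps to a time in the media
  hasLink: boolean;     // the clicked word is also an external hyperlink
  seekOnClick: boolean; // the user's "click to seek" toggle
  ctrlKey: boolean;     // Ctrl was held during the click
}

// Decide what a click on a transcript word should do.
function resolveClick(ctx: ClickContext): ClickAction {
  // Ctrl-click, or seeking toggled off, means the user wants links to behave as links.
  const wantsLink = ctx.ctrlKey || !ctx.seekOnClick;
  if (wantsLink) return ctx.hasLink ? "follow-link" : "none";
  if (ctx.hasTimecode) return "seek";
  // Seeking is on, but this word has no timecode: fall back to its link, if any.
  return ctx.hasLink ? "follow-link" : "none";
}
```

In a browser, `ctrlKey` would come straight from the click event's `ctrlKey` property.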