Automatic and Accurate Captioning
Allan Knight and Kevin C. Almeroth
Network and Multimedia Systems Lab
Department of Computer Science
University of California, Santa Barbara
  1. Introduction
  2. Suppose that an instructor possesses a collection of multimedia that includes video as well as a transcript of the dialog from the video. Synchronizing these two pieces of content to create a captioned video presentation is useful and valuable. However, synchronizing them presents several issues that must be addressed in order to successfully create the combined video and captions.

    Several issues arise in this scenario. The first is accurately converting speech into text using a Speech Recognition System (SRS). Because state-of-the-art SRSs do not yet achieve an acceptable level of accuracy, their accuracy must be improved before reliable synchronization is possible. Second, aligning recognized words with the spoken dialog can be difficult because the text transcript may not match the audio word-for-word; words that are recognized accurately may have been edited out of the transcript, and the SRS may also recognize some words incorrectly. Third, because not every word in a transcript is likely to be recognized by the SRS, techniques are needed to estimate, as accurately as possible, when the unrecognized words of the transcript were spoken. Finally, it is useful to indicate to content creators how inaccurate these estimated times may be. This paper outlines a process and an architecture that offer solutions to these issues.

    Current research in multimedia integration deals with real-time transcription and the improvement of SRSs, but not with how to synchronize transcripts in the absence of near-perfect automatic transcription. The research that does look at this problem involves situations where some timing information exists before synchronization begins. An example is when news programs are transcribed manually in real time, resulting in captions offset by several seconds from where the words are actually spoken. This body of research also does not address how to identify and estimate possible error in generated time-stamps.

    Our AutoCap system automatically synchronizes pre-segmented transcripts with video. These transcript segments, along with the times they were spoken, are combined to form captions. Unfortunately, no SRS in its normal configuration can achieve the accuracy necessary to synchronize transcripts without introducing an unreasonable amount of error for some or all of the captions. AutoCap overcomes this limitation by creating a better language model for use with the SRS that increases the accuracy to the necessary level. The role of the SRS is to collect as many groups of contiguous words, called utterances, as possible, along with the time each word in the utterance was spoken. Once the SRS has collected all of these utterances, AutoCap creates captions by aligning the utterances to the transcript. For those words that are not recognized, AutoCap estimates when the words were spoken, along with an error bound that gives the content creator an idea of caption accuracy. The result of this process is a collection of accurately time-stamped captions that can be displayed with the video.

    The basic requirement for synchronizing a transcript with video is to determine the exact time the first word of each caption is spoken. To achieve this objective, AutoCap must have as input the spoken audio from a video program and an edited transcript of this video, pre-segmented into captions. The output is a caption file with the segmented text along with a time-stamp for each caption.

    The AutoCap process for automatically synchronizing captions is accomplished in four phases. First, a new language model is created for a video by processing the video's transcript using a statistical language modeling toolkit. This language model is created to increase the accuracy of the speech recognition system, and is paramount in creating an accurate collection of time-stamped captions. Once this language model is created, the final three phases of synchronization are performed.
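
    AutoCap performs this language modeling step automatically, presumably via the CMU Cam Toolkit listed in the requirements below. As a rough sketch only (the file names and the exact options used by AutoCap are assumptions), building an n-gram language model from a transcript with that toolkit looks something like the following:

      %> text2wfreq < transcript.txt > transcript.wfreq
      %> wfreq2vocab < transcript.wfreq > transcript.vocab
      %> text2idngram -vocab transcript.vocab < transcript.txt > transcript.idngram
      %> idngram2lm -idngram transcript.idngram -vocab transcript.vocab -arpa transcript.arpa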

    The second phase of AutoCap, speech recognition, collects all recognizable utterances along with the time-stamp of each of their words. This collection of utterances is then aligned with the transcript in the alignment phase to create time-stamps for as many captions as possible. Finally, in the fourth phase, for each caption that is not time-stamped because the first word of the caption is not recognized by the SRS, an estimated time-stamp is calculated based on the nearest time-stamped words. This final phase, known as estimation, produces the final output: a caption file containing the segmented text and the associated time-stamps. The details of each of these phases are outlined in the following sections.
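
    To make the estimation phase concrete, the snippet below is a minimal sketch, not AutoCap's actual code: it assumes each transcript word has an index and, if it was recognized, a time-stamp, and it estimates a missing time by linearly interpolating between the nearest time-stamped neighbors. The span between those two anchors is also one plausible source for the error bound mentioned above.

      // Illustrative sketch only: estimate when an unrecognized word was spoken
      // by interpolating between the nearest time-stamped words on either side.
      public class TimeEstimationSketch {
          static double estimateTime(double[] times, boolean[] recognized, int i) {
              int lo = i, hi = i;
              while (lo >= 0 && !recognized[lo]) lo--;           // nearest earlier recognized word
              while (hi < times.length && !recognized[hi]) hi++; // nearest later recognized word
              if (lo < 0 && hi >= times.length) return 0.0;      // nothing was recognized at all
              if (lo < 0) return times[hi];
              if (hi >= times.length) return times[lo];
              double f = (double) (i - lo) / (hi - lo);
              return times[lo] + f * (times[hi] - times[lo]);
          }

          public static void main(String[] args) {
              double[] times = { 1.0, 0.0, 0.0, 4.0 };             // seconds; 0.0 where unknown
              boolean[] recognized = { true, false, false, true };
              System.out.println(estimateTime(times, recognized, 2)); // prints 3.0
          }
      }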

  3. Installation
  4. Before installing AutoCap, it is important that the following software requirements are met. These requirements include the libraries and utilities needed to properly run AutoCap. Be sure that all requirements are met before moving on to the Installing AutoCap section. Links to all the requirements are listed in the section Useful Links.

    1. Requirements for running AutoCap
    2. The following libraries and system utilities are necessary for the proper operation of AutoCap:

      • Java 1.4 VM or greater
      • MPlayer 1.0pre4-3.3.3 or greater
      • CMU Cam Toolkit 2.0
      • Perl 5.8 or greater
      • Java Media Framework (JMF) 2.1.1e or greater
      • Sphinx SRS 4.0 or greater
      • Java Speech API (JSAPI)
      • ant build system 1.6.2 or later

    3. Downloading AutoCap
    4. The latest version of AutoCap is version 1.0. Click here to obtain the tar file. The tar file contains all the code, build scripts and Java libraries needed to run AutoCap. Once this tar file has been downloaded proceed to the next section, Installing AutoCap.

    5. Installing AutoCap
    6. Once the tar file has been downloaded, installing AutoCap is as simple as untarring the source and building the AutoCap JAR file. To accomplish this task, issue the following commands:

      %> tar -xjf autocap_<version>.tbz2
      %> cd autocap_<version>
      %> ant

      If no error messages were observed after ant finished building AutoCap, then move on to the next section, Running AutoCap.

  5. Running AutoCap
  6. Before running AutoCap it helps to understand the proper formats of the input and output files. Java only supports certain types of media files, and the output file currently uses a proprietary file format that is not defined by any public standard. Finally, the execution environment needs to be set up correctly for AutoCap to run. Follow the directions below carefully, and AutoCap should run properly.

    1. Input Files
    2. AutoCap accepts two files for each run. The first of these files is a media file. Supported media types include WMA, WMV and AVI. See the link to JMF below for more details on the media formats accepted by AutoCap.

      The second file accepted by AutoCap is the transcript file. This XML file contains the transcript pre-segmented into time-stamped segments, called captions. See the next section, Output File, for an explanation of this file format. The only difference between the input file and the output file is that the output file will always have the time-stamps filled in.
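
      For reference, an input transcript file therefore looks like the output example in the next section, but with the time-stamps not yet filled in. How an unfilled time-stamp is represented (shown here as an empty value) is an assumption for illustration:

       <xml>
         <captions>
           <time value="">
             <language name="English" text="caption 1"/>
           </time>
         </captions>
       </xml>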

    3. Output File
    4. The file format output by AutoCap is a non-standard transcript file developed by QAD Inc. for its educational material. This file format does not adhere to any recognized standards for XML meta-data.

      The QAD caption file format consists of an XML file that is capable of storing the captions along with the time each caption is spoken in an associated video. These captions can be in multiple languages. The following is an example of what is contained within a QAD caption file:

       <xml>
         <captions>
           <time value="00:00:02">
             <language="English" text="caption 1"/>
           </time> 
           <time value="00:00:10">
             <language="English" text="caption 2"/>
           </time> 
         </captions>
       </xml>
       

      The above XML represents the captions as a node called "captions" with multiple child nodes called "time". Each "time" node represents a caption and has multiple "language" children. These "language" children contain the caption text that is to be displayed for a particular language. One "language" element is displayed based on the preferred language of the viewer of the material.

      Other formats can be supported by supplying alternative implementations of the TranscriptFileWriter interface, as sketched below.
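
      The exact methods declared by TranscriptFileWriter are not documented here, so the following is only a hypothetical sketch of what an alternative implementation might look like; the method name, the Caption class, and the TabSeparatedFileWriter class are invented for illustration and assume the interface amounts to writing a list of time-stamped captions to a file.

       // Hypothetical sketch; AutoCap's real TranscriptFileWriter interface may
       // declare different methods. Caption and TabSeparatedFileWriter are
       // invented here purely for illustration.
       import java.io.FileWriter;
       import java.io.IOException;
       import java.util.Iterator;
       import java.util.List;

       interface TranscriptFileWriter {
           void writeTranscript(List captions, String outputPath) throws IOException;
       }

       class Caption {
           String startTime;
           String text;
           Caption(String startTime, String text) { this.startTime = startTime; this.text = text; }
       }

       class TabSeparatedFileWriter implements TranscriptFileWriter {
           // Writes one caption per line as "start-time <tab> text" instead of
           // the QAD XML format described above.
           public void writeTranscript(List captions, String outputPath) throws IOException {
               FileWriter out = new FileWriter(outputPath);
               for (Iterator it = captions.iterator(); it.hasNext();) {
                   Caption c = (Caption) it.next();
                   out.write(c.startTime + "\t" + c.text + "\n");
               }
               out.close();
           }
       }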

    5. Execution Environment
    6. Before executing AutoCap it is important that the PATH and CLASSPATH environment variables are set. Execute the following commands to set these variables in the bash shell:

      %> export PATH=$PATH:path_to_autocap/scripts:path_to_autocap/bin\
      >:path_to_cmu_cam/bin
      %> export CLASSPATH=.:path_to_autocap/lib/jmf.jar\
      >:path_to_autocap/lib/sphinx4.jar
    7. Understanding Absolute Beam Width
    8. The Absolute Beam Width (ABW) is the size of the active list used during the recognition of each utterance. This active list contains all the possible chains of word automatons, that is, all the likely matches for the utterance the SRS is currently processing. Each possible choice in the active list is proposed by the acoustic model and then pruned to the final choice using a language model. The final choice is then returned as the text of the utterance. Because there is no way to determine the optimal match until the entire utterance is processed, the SRS must store all possible chains in the active list. However, comparing each word to each chain in the active list is processor intensive. By limiting, or pruning, the size of the active list, the SRS is able to decrease the required processing time. This limit on the size of the active list is called the ABW. Therefore, as each phoneme of an utterance is processed, the list is pruned to no larger than the ABW. Doubling the value of the ABW roughly doubles the processing time required by the SRS.
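
      To make the pruning idea concrete, the following is a conceptual sketch only, not Sphinx's actual implementation: after each step the active list is sorted by score and truncated to the ABW, so only the most promising hypotheses survive.

      // Conceptual sketch of absolute-beam-width pruning; not Sphinx code.
      import java.util.ArrayList;
      import java.util.Collections;
      import java.util.Comparator;
      import java.util.List;

      class Hypothesis {
          double score;   // combined acoustic and language model score
          String words;   // partial chain of recognized words
          Hypothesis(double score, String words) { this.score = score; this.words = words; }
      }

      class ActiveListSketch {
          // Keep only the absoluteBeamWidth best-scoring hypotheses after each step.
          static List prune(List active, int absoluteBeamWidth) {
              List sorted = new ArrayList(active);
              Collections.sort(sorted, new Comparator() {
                  public int compare(Object a, Object b) {
                      return Double.compare(((Hypothesis) b).score, ((Hypothesis) a).score);
                  }
              });
              return new ArrayList(sorted.subList(0, Math.min(absoluteBeamWidth, sorted.size())));
          }
      }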

      To change the ABW parameter for a run and make AutoCap do more or less work, modify the following line inside config.xml:

      <property name="absoluteBeamWidth" value="500"/>

      Change the number 500 to a value that produces acceptable results. Experience has shown that values of 500, 1000, or 2000 produce reasonable results.

    9. Running AutoCap
    10. Once the execution environment is set up properly and an appropriate ABW value is selected, execution of AutoCap is possible. To automatically produce caption time-stamps for a video or audio file along with its transcript in the described format, issue the following command:

      %> autocap -media=<media file> -transcript <transcript file>

      After the command is given, the audio is extracted from the input file and a language model is created specific to the input transcript. Once the audio is extracted and the language model is created, caption alignment begins. This phase of execution is quite long and, for an ABW value of 500, can take as long as the duration of the media in the input file. If all goes well, the output of the run is a file called captiondata.xml. Unfortunately, this transcript file is not viewable by any known media player. See the next section, Viewing the Output, to find out how to view the results.

    11. Viewing the Output
    12. Once AutoCap has run and a caption file with time-stamps has been produced, the result is of little use without a way to view the media with the aligned captions. For this purpose AutoCap supplies a Java viewer, the CaptionedMediaPlayer, which can be used to view the results with the following command:

      %> java edu.ucsb.nmsl.tools.CaptionedMediaPlayer  captiondata.xml

  7. Useful Links