- Introduction
Suppose that an instructor possesses a collection of multimedia
that includes video as well as a transcript of the dialog from
the video. Synchronizing these two pieces of content to create a
captioned video presentation is both useful and valuable. However,
this synchronization presents several issues that must be
addressed before the combined video and captions can be created
successfully.
Several issues arise in this scenario. The first involves
accurately converting speech into text using a Speech Recognition
System (SRS): since state-of-the-art SRSs do not currently achieve
an acceptable level of accuracy, their accuracy must be improved
before reliable synchronization is possible. Second,
aligning recognized words with the spoken dialog could be
difficult since the text transcript may not match word-for-word
with what is spoken in the audio. This mismatch occurs when words
that were recognized accurately have been edited out of the
transcript, and resolving it is not trivial since the SRS may also
recognize words incorrectly.
Third, because not every word in a transcript is likely to be
recognized by the SRS, techniques need to be developed that
estimate, as accurately as possible, when the unrecognized words
of the transcript were spoken. Finally, it would be useful to
indicate to content creators how inaccurate these estimated times
may be. This paper outlines a process and an architecture that
offer solutions to these issues.
Current research in multimedia integration deals with real-time
transcription and the improvement of SRSs, but not with how to
synchronize transcripts in the absence of near-perfect automatic
transcription. The research that does look at this problem
involves situations where some timing information exists before
synchronization begins. An example of this is when news programs
are transcribed manually in real time, resulting in captions
offset by several seconds from where the words are actually
spoken. Moreover, this body of research does not address how to
identify and estimate the possible error in generated time-
stamps.
Our AutoCap system automatically synchronizes pre-
segmented transcripts with video. These transcript segments, along
with the time they were spoken, are combined to form captions.
Unfortunately, no SRS in its normal configuration can achieve the
accuracy necessary to synchronize transcripts without introducing
an unreasonable amount of error for some or all of the captions.
AutoCap overcomes this limitation by creating a better language
model for use with the SRS that increases the accuracy to the
necessary level. The role of the SRS is to collect as many groups
of contiguous words, called utterances, as possible, along with
the time each word in the utterance was spoken. Once the SRS has
collected all
of these utterances, AutoCap creates captions by aligning the
utterances to the transcript. For those words that are not
recognized, AutoCap estimates when the words were spoken along
with an error bound that gives the content creator an idea of
caption accuracy. The result of this process is a collection of
accurately time-stamped captions that can be displayed with the
video.
The basic requirement for synchronizing a transcript with video
is to determine the exact time the first word of each caption is
spoken. To achieve this objective, AutoCap must have as input the
spoken audio from a video program and an edited transcript of
this video, pre-segmented into captions. The output is a caption
file with the segmented text along with a time-stamp for each
caption.
The AutoCap process for automatically synchronizing captions is
accomplished in four phases. First, a new language model is
created for a video by processing the video's transcript using a
statistical language modeling toolkit. This language model is
created to increase the accuracy of the speech recognition
system, and is paramount in creating an accurate collection of
time-stamped captions. Once this language model is created, the
final three phases of synchronization are performed.
The second phase of AutoCap, speech recognition, collects all
recognizable utterances along with the time-stamp of each of its
words. This collection of utterances is then aligned with the
transcript in the alignment phase to create time-stamps for as
many captions as possible. Finally, in the fourth phase, for each
caption that is not time-stamped, because the first word of the
caption is not recognized by the SRS, an estimated time-stamp is
calculated based on the nearest time-stamped words. This final
phase, known as estimation, produces the final output: a caption
file containing the segmented text and the associated time-stamps.
The details of each of these phases are outlined in the following
sections.
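To make the estimation phase more concrete, the following is a
minimal sketch of how a missing time-stamp could be estimated from
the nearest time-stamped words. It assumes simple linear
interpolation between the surrounding recognized words and uses
half of that interval as the error bound; the class and method
names are illustrative and are not taken from the AutoCap source.

import java.util.Locale;

/**
 * Minimal sketch of the estimation idea: place an unrecognized word
 * between its nearest time-stamped neighbors by linear interpolation.
 * Illustrative only; not code from AutoCap.
 */
public class TimestampEstimator {

    /** An estimated time, in seconds, together with a worst-case error bound. */
    public static class Estimate {
        public final double seconds;
        public final double errorBound;
        public Estimate(double seconds, double errorBound) {
            this.seconds = seconds;
            this.errorBound = errorBound;
        }
    }

    /**
     * Estimate when the word at position wordIndex is spoken, given the
     * nearest recognized words before (prevIndex, prevTime) and after
     * (nextIndex, nextTime) it in the transcript.
     */
    public static Estimate estimate(int wordIndex,
                                    int prevIndex, double prevTime,
                                    int nextIndex, double nextTime) {
        // Fraction of the way from the previous anchor word to the next one.
        double fraction = (double) (wordIndex - prevIndex) / (nextIndex - prevIndex);
        double seconds = prevTime + fraction * (nextTime - prevTime);
        // The word cannot lie outside the anchor interval, so half of that
        // interval serves as a simple worst-case error bound.
        double errorBound = (nextTime - prevTime) / 2.0;
        return new Estimate(seconds, errorBound);
    }

    public static void main(String[] args) {
        // Word 12 is unrecognized; words 10 and 15 were recognized at 31.2s and 35.0s.
        Estimate e = estimate(12, 10, 31.2, 15, 35.0);
        System.out.printf(Locale.US, "estimated time: %.2fs (+/- %.2fs)%n",
                e.seconds, e.errorBound);
    }
}

For example, an unrecognized word two fifths of the way between
recognized words spoken at 31.2 and 35.0 seconds is estimated at
roughly 32.72 seconds with a worst-case error of 1.9 seconds.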
- Installation
Before installing AutoCap, it is important that the following
software requirements are met. These requirements include the
libraries and utilities required to run AutoCap properly.
Be sure that all requirements are met before moving on to the
Installing AutoCap section. Links to all of the requirements are
listed in the section Useful Links.
- Requirements for running AutoCap
The following libraries and system utilities are necessary
for the proper operation of AutoCap:
- Java 1.4 VM or greater
- MPlayer 1.0pre4-3.3.3 or greater
- CMU Cam Toolkit 2.0
- Perl 5.8 or greater
- Java Media Framework (JMF) 2.1.1e or
greater
- Sphinx SRS 4.0 or greater
- Java Sound API (JSAPI)
- ant build system 1.6.2 or later
- Downloading AutoCap
The latest version of AutoCap is version 1.0. Click here to
obtain the tar file. The tar file contains all the code,
build scripts and Java libraries needed to run AutoCap.
Once this tar file has been downloaded proceed to the next
section, Installing AutoCap.
- Installing AutoCap
Once the tar file has been downloaded, installing AutoCap
is as simple as untarring the source and building the
AutoCap JAR file. To accomplish this task issue the
following commands:
%> tar -xjf autocap_<version>.tbz2
%> cd autocap_<version>
%> ant
If no error messages are observed after ant finishes building
AutoCap, then move on to the next section,
Running AutoCap.
- Running AutoCap
Before running AutoCap, it helps to understand the proper
formats of the input and output files. Java only supports
certain types of media files, and the output file currently uses
a proprietary file format that is not defined by any public
standard. Finally, the execution environment needs to be set
up correctly for AutoCap to run properly. Follow the directions
below carefully, and AutoCap should run without problems.
- Input Files
AutoCap accepts two files for each run. The first of these
files is a media file. Supported media types include WMA,
WMV and AVI. See the link to JMF below for more details on
the media formats accepted by AutoCap.
The second file accepted by AutoCap is the transcript file.
This XML file contains the transcript pre-segmented into time-
stamped segments, called captions. See the next section,
Output File for an explanation of this file format.
The only difference between the input file and the output
file is that the output file will always have the time-stamps
filled in.
- Output File
The file format output by AutoCap is a non-standard
transcript file developed by QAD Inc. for its educational
material. This file format does not adhere to any recognized
standards for XML meta-data.
The QAD caption file format consists of an XML file that is
capable of storing the captions along with the time the
caption is spoken in an associated video. These captions can
be in multiple languages. The following is an example of what
is contained within a QAD caption file:
<xml>
  <captions>
    <time value="00:00:02">
      <language name="English" text="caption 1"/>
    </time>
    <time value="00:00:10">
      <language name="English" text="caption 2"/>
    </time>
  </captions>
</xml>
The above XML represents the captions as a node called
"captions" with multiple child nodes called "time". Each "time"
node represents a caption and has multiple "language"
children. These "language" children hold the caption text that
is to be displayed for a particular language. One "language"
element is displayed based on the preferred language of the
viewer of the material.
Other caption formats can be supported by providing alternative
implementations of the TranscriptFileWriter interface.
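As a rough illustration of this extension point, the sketch below
shows what an alternative writer that emits SubRip (.srt) output
might look like. The Caption class and write method here are
placeholders invented for this example; the actual
TranscriptFileWriter interface in the AutoCap source defines the
real method signatures an implementation must follow.

import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;

/**
 * Illustrative sketch of an alternative caption writer that emits SubRip
 * (.srt) output instead of the QAD XML format. In AutoCap this logic would
 * sit behind the TranscriptFileWriter interface; the Caption class and the
 * write method below are placeholders, not AutoCap's real API.
 */
public class SrtCaptionWriter {

    /** Placeholder caption: start and end times in seconds plus the caption text. */
    public static class Caption {
        final double start;
        final double end;
        final String text;
        public Caption(double start, double end, String text) {
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    /** Write the captions as a SubRip (.srt) file. */
    public void write(List<Caption> captions, File output) throws IOException {
        try (PrintWriter out = new PrintWriter(output)) {
            int index = 1;
            for (Caption c : captions) {
                out.println(index++);                              // SRT cue number
                out.println(format(c.start) + " --> " + format(c.end));
                out.println(c.text);
                out.println();                                     // blank line ends the cue
            }
        }
    }

    /** Format seconds as the SRT time-stamp HH:MM:SS,mmm. */
    private static String format(double seconds) {
        long ms = Math.round(seconds * 1000);
        return String.format("%02d:%02d:%02d,%03d",
                ms / 3_600_000, (ms / 60_000) % 60, (ms / 1000) % 60, ms % 1000);
    }
}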
- Execution Environment
Before executing AutoCap it is important that the PATH and
CLASSPATH environment variables are set. Execute the
following commands to set these variables in the bash shell:
%> export PATH=$PATH:path_to_autocap/scripts:path_to_autocap/bin\
>:path_to_cmu_cam/bin
%> export CLASSPATH=.:path_to_autocap/lib/jmf.jar\
>:path_to_autocap/lib/sphinx4.jar
- Understanding Absolute Beam Width
The Absolute Beam Width (ABW) is the size of the
active list used during the recognition of each utterance. This
active list contains all the possible chains of word automatons.
The active list also contains all the likely matches for an
utterance the SRS is currently processing. Each possible choice
in the active list is chosen by the acoustic model and then
pruned to the final choice using a language model. The final
choice is then returned as the text of the utterance. Because
there is no possible way to determine the optimal match until
the entire utterance is processed, the SRS must store all
possible chains in the active list. However, comparing each word
to each chain in the active list is processor intensive. By
limiting, or pruning, the size of the active list, the SRS is
able to decrease the required processing time. This limit on the
size of the active list is called the ABW. Therefore, as each
phoneme of an utterance is processed, the list is pruned to no
larger than the ABW. Doubling the value of the ABW roughly
doubles the processing time required by the SRS.
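The pruning step itself is straightforward. The following minimal
sketch, an illustration of the idea rather than code taken from
AutoCap or Sphinx, shows how an active list of scored hypotheses
might be cut back to the ABW after each processing step.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/**
 * Minimal sketch of absolute-beam-width pruning. Illustrative only; not
 * code from AutoCap or Sphinx.
 */
public class BeamPruner {

    /** A partial recognition hypothesis (a chain of words) and its score. */
    public static class Hypothesis {
        final String words;
        final double score; // higher means more likely
        public Hypothesis(String words, double score) {
            this.words = words;
            this.score = score;
        }
    }

    /** Keep only the top absoluteBeamWidth hypotheses, discarding the rest. */
    public static List<Hypothesis> prune(List<Hypothesis> activeList, int absoluteBeamWidth) {
        List<Hypothesis> sorted = new ArrayList<>(activeList);
        // Sort best-first by score, then truncate the list to the beam width.
        sorted.sort(Comparator.comparingDouble((Hypothesis h) -> h.score).reversed());
        return sorted.size() <= absoluteBeamWidth
                ? sorted
                : new ArrayList<>(sorted.subList(0, absoluteBeamWidth));
    }
}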
To change the ABW parameter for a run and make AutoCap do more
or less work, modify the following line inside config.xml:
<property name="absoluteBeamWidth" value="500"/>
Change the number 500 to a value that produces acceptable
results. Experience has shown that values of 500, 1000, or
2000 produce reasonable results.
- Running AutoCap
Once the execution environment is set up properly and the
appropriate ABW value is selected, execution of AutoCap is
possible. To automatically produce caption time-stamps for a
video or audio file along with its transcript in the
described format, issue the following command:
%> autocap -media=<media file> -transcript <transcript file>
After the command is given, the audio is extracted from the
input file and a language model specific to the input
transcript is created. Once the audio is extracted and the
language model is created, caption alignment begins. This phase
of execution is quite long; for an ABW value of 500 it can take
as long as the duration of the media in the input file. If all
goes well, the output of the run is a file called
captiondata.xml. Unfortunately, this transcript file is not
viewable by any standard media player. See the next section,
Viewing the Output, to find out how to view the results.
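If you would like to inspect the generated time-stamps before
viewing them with the media, a small utility such as the sketch
below can list them. It is not part of AutoCap; it simply reads
the value attribute of each time element, assuming the caption
file is well-formed XML as in the example shown earlier.

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/**
 * Small utility sketch (not part of AutoCap) that prints the time-stamp of
 * each caption in a caption file such as captiondata.xml.
 */
public class CaptionTimeLister {
    public static void main(String[] args) throws Exception {
        String file = args.length > 0 ? args[0] : "captiondata.xml";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File(file));
        NodeList times = doc.getElementsByTagName("time");
        for (int i = 0; i < times.getLength(); i++) {
            Element time = (Element) times.item(i);
            // Each <time> element carries the caption's start time in its value attribute.
            System.out.println("caption " + (i + 1) + " starts at " + time.getAttribute("value"));
        }
    }
}

To try it, compile the class and pass the caption file as its
argument:

%> javac CaptionTimeLister.java
%> java CaptionTimeLister captiondata.xml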
- Viewing the Output
Once AutoCap has run and produced a caption file with
time-stamps, the file is of little use unless the media can be
viewed with the aligned captions. To solve this problem, AutoCap
supplies a Java viewer, CaptionedMediaPlayer, that displays the
media file with the aligned captions. To view the results, issue
the following command:
%> java edu.ucsb.nmsl.tools.CaptionedMediaPlayer captiondata.xml
- Useful Links