Abstract.
Large Language Models (LLMs) have made significant strides in text generation and comprehension, with recent advancements extending into multimodal LLMs that integrate visual and audio inputs. However, these models continue to struggle with fine-grained, cross-modal temporal understanding, particularly when correlating events across audio and video streams.
We address these challenges with two key contributions: a new dataset, OCTAV (Omni Context and Temporal Audio Video), and a new model, OMCAT (Omni Context Aware Transformer). OCTAV is a novel dataset designed to capture event transitions across both modalities. OMCAT is a model that leverages RoTE (Rotary Time Embeddings), an extension of RoPE, to enhance temporal grounding and computational efficiency in time-anchored tasks.
Through a robust three-stage training pipeline (feature alignment, instruction tuning, and OCTAV-specific training), OMCAT excels in cross-modal temporal understanding. Our model demonstrates state-of-the-art performance on Audio-Visual Question Answering (AVQA) tasks and the OCTAV benchmark, showcasing significant gains in temporal reasoning and cross-modal alignment, as validated through comprehensive experiments and ablation studies.
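
The abstract describes RoTE only as an extension of RoPE for time-anchored tasks, so the following is a rough sketch of the general idea rather than the authors' implementation. It assumes that the rotation angle is driven by a timestamp in seconds instead of a token index, and it reuses the standard RoPE frequency schedule; the function names `rotary_embed` and `dot` and the `base` parameter are illustrative choices.

```python
import math


def rotary_embed(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `position`.

    With an integer token index as `position` this is ordinary RoPE.
    Passing a timestamp in seconds instead is the assumption made here
    to illustrate a time-based variant.
    """
    if len(vec) % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, len(vec), 2):
        # Standard RoPE schedule: lower frequencies for later pairs.
        freq = base ** (-i / len(vec))
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


# Attention scores between rotated vectors depend only on the time gap:
q = [1.0, 0.0, 0.5, -0.5]
k = [0.2, 0.3, -0.1, 0.4]
score_early = dot(rotary_embed(q, 1.5), rotary_embed(k, 4.0))
score_late = dot(rotary_embed(q, 11.5), rotary_embed(k, 14.0))
print(abs(score_early - score_late) < 1e-9)  # True: both gaps are 2.5 s
```

Under this assumption, two frames sampled at 2.0 and 2.5 seconds are always 0.5 apart in the embedding's terms, regardless of how many tokens lie between them in the sequence.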
OMCAT achieves remarkable performance on natural videos, as shown below.
Video | Question-Answer |
---|---|
A segment from Whiplash movie trailer |
User: What musical instruments can be heard in the video? OMCAT: The audio features a drum beat, a bass guitar, and a keyboard playing a melody. |
A segment from Wicked movie trailer |
User: What musical instruments can be heard in the video? OMCAT: The audio features a percussion instrument playing a rhythmic beat, followed by a synthesized sound effect. |
YouTube video |
User: Describe what is happening in the video before the sound of man speaking with start and end timestamps. OMCAT: The man is speaking from 7 to 34 seconds. From 0 to 7 seconds, the detergent is being sprayed from a bottle. User: What is shown in the video between timestamps 20 and 39 seconds? OMCAT: A person is shown spraying a bottle of cleaner on a car's dashboard and then wiping it off. |
UnAV-100 dataset |
User: Please summarize what is happening in the video at the start, specifying the start and end timestamps. Ground Truth: At the start of the video, from 0 to 7.5 seconds, there is lawn mowing happening. OMCAT: At the start of the video, from 0 to 8.3 seconds, there is lawn mowing happening. User: Please summarize what is happening in the video after the first instance of lawn mowing, specifying the start and end timestamps. Ground Truth: After the first instance of lawn mowing from 0 to 7.5 seconds, there is another instance of lawn mowing from 16.5 to 46 seconds. OMCAT: The first instance of lawn mowing is from 0 to 8.3 seconds. From 12.42 to 43 seconds, there is lawn mowing again. |
UnAV-100 dataset |
User: Describe what is happening in the video with start and end timestamps. Ground Truth: There are lions roaring in the video from 3.7 to 42.5 seconds. OMCAT: There are lions roaring in the video from 0 to 38.3 seconds. |
UnAV-100 dataset |
User: Include start and end time details when summarizing what is happening in the video at the start. Ground Truth: At the start of the video from 0 to 36 seconds, there is an acoustic guitar and a ukulele being played. OMCAT: At the start of the video from 0 to 36 seconds, there is a ukulele being played and from 0 to 36 seconds, there is an acoustic guitar being played. |
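
One way to quantify how close the predicted timestamps above are to the ground truth is temporal intersection-over-union (IoU), a common measure for temporal grounding. This page does not state which metric the authors report, so the sketch below is only an illustration applied to the UnAV-100 examples in the table; the function name `temporal_iou` is hypothetical.

```python
def temporal_iou(pred, truth):
    """Temporal IoU between two (start, end) intervals given in seconds."""
    pred_start, pred_end = pred
    true_start, true_end = truth
    # Overlap is zero when the intervals do not intersect.
    inter = max(0.0, min(pred_end, true_end) - max(pred_start, true_start))
    union = (pred_end - pred_start) + (true_end - true_start) - inter
    return inter / union if union > 0 else 0.0


# (prediction, ground truth) pairs taken from the examples above.
examples = {
    "lawn mowing, first instance": ((0.0, 8.3), (0.0, 7.5)),
    "lawn mowing, second instance": ((12.42, 43.0), (16.5, 46.0)),
    "lions roaring": ((0.0, 38.3), (3.7, 42.5)),
}

for name, (pred, truth) in examples.items():
    print(f"{name}: {temporal_iou(pred, truth):.3f}")
# lawn mowing, first instance: 0.904
# lawn mowing, second instance: 0.789
# lions roaring: 0.814
```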
OMCAT achieves state-of-the-art performance on the OCTAV-ST ActivityNet benchmark, which consists of single-turn question-answer pairs for videos that are anchored on single sound events. Below, we show some examples from OMCAT and compare them to Video LLaMA 2, an audio-visual language model.
Video | Question-Answer | Question-Answer |
---|---|---|
User: Please briefly describe what is happening in the video before the sound of engine with exact start and end time information. Ground Truth: The sound of engine is from 8.2 to 9.1 seconds. From 0 to 8.2 seconds, we see the machine folded and put away. Video LLaMA 2: In the video, a man is seen sitting on an exercise machine and pulling a handle. The sound of an engine can be heard from 12.1 seconds to 20.1 seconds. OMCAT: The sound of engine is from 8.5 to 10.1 seconds. From 0 to 8.5 seconds, we see a man folding a fitness machine for storage. |
User: Please briefly describe what is happening in the video after the sound of engine with exact start and end time information. Ground Truth: The sound of engine is from 8.2 to 9.1 seconds. From 9.1 to 17 seconds, we see the man workout again. Video LLaMA 2: A man is using a rowing machine and pulling a handle attached to a string. The machine is digital and has a compact storage system. OMCAT: The sound of engine is from 8.5 to 10.1 seconds. From 10.1 to 17 seconds, the man continues to exercise on the machine. |
|
User: Describe what is happening in the video before the sound of street music with start and end timestamps. Ground Truth: The sound of street music is from 2.8 to 4.9 seconds. From 0 to 2.8 seconds, a man is seen standing on a bowling lane holding a ball. Video LLaMA 2: Before the sound of street music, a man is seen standing in a bowling alley, holding a bowling ball. He then throws the ball down the lane. The exact start time of this event is 0.0 second, and the end time is 5.0 second. OMCAT: The sound of street music is from 2.9 to 4.9 seconds. From 0 to 2.9 seconds, a man is seen standing in a bowling alley holding a ball. |
User: Provide a brief description of what is happening in the video after the sound of street music along with start and end timestamps. Ground Truth: The sound of street music is from 2.8 to 4.9 seconds. From 4.9 to 10 seconds, he throws the ball down the lane in the end. Video LLaMA 2: A man is seen bowling in a bowling alley. He then throws a bowling ball down the lane and hits several pins. OMCAT: The sound of street music is from 2.8 to 4.9 seconds. From 4.9 to 10 seconds, he throws the ball down and hits several pins. |
|
User: Please briefly describe what is happening in the video before the sound of clock alarm with exact start and end time information. Ground Truth: The sound of clock alarm is from 7 to 8.4 seconds. From 0 to 7 seconds, a man is standing on a field. Video LLaMA 2: In the video, a man is seen throwing a javelin. The sound of the clock alarm occurs from 10.0 second to 15.0 second. OMCAT: The sound of clock alarm is from 7.2 to 8.4 seconds. From 0 to 7.2 seconds, a man is seen standing on a large field holding a shot put. |
User: Describe what is happening in the video after the sound of clock alarm with start and end timestamps. Ground Truth: The sound of clock alarm is from 7 to 8.4 seconds. From 8.4 to 18 seconds, he picks up a ball and spins around before throwing it onto the field. Video LLaMA 2: A man throws a discus while a group of people watches. OMCAT: The sound of clock alarm is from 7.8 to 9 seconds. From 9 to 18 seconds, he spins and throws the ball. |
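
The answers in these examples state their timestamps in a regular "from X to Y seconds" pattern, so comparing a model's output with the ground truth first requires pulling those intervals out of free text. The snippet below is a minimal sketch of such a step; the regular expression and the helper name `extract_spans` are assumptions made for illustration and are not described on this page.

```python
import re

# Matches phrases such as "from 8.5 to 10.1" or "From 0 to 8.5".
_SPAN = re.compile(
    r"from\s+(\d+(?:\.\d+)?)\s+to\s+(\d+(?:\.\d+)?)", re.IGNORECASE
)


def extract_spans(answer):
    """Return every (start, end) pair, in seconds, mentioned in `answer`."""
    return [(float(start), float(end)) for start, end in _SPAN.findall(answer)]


omcat_answer = (
    "The sound of engine is from 8.5 to 10.1 seconds. "
    "From 0 to 8.5 seconds, we see a man folding a fitness machine for storage."
)
print(extract_spans(omcat_answer))  # [(8.5, 10.1), (0.0, 8.5)]
```

In this answer format the first interval is the anchoring sound event and the second is the described video segment. Answers that phrase their timestamps differently, such as the Video LLaMA 2 output "0.0 second, and the end time is 5.0 second", would need a looser pattern.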
OMCAT achieves state-of-the-art performance on the OCTAV-ST Youcook2 benchmark. To show the advanced capabilities of OMCAT on this benchmark, we compare it with a fine-tuned version of GroundingGPT below.
Video | Question-Answer | Question-Answer |
---|---|---|
User: Provide a brief description of what is happening in the video before the sound of sneeze along with start and end timestamps. Ground Truth: The sound of sneeze is from 3 to 8 seconds. From 0 to 3 seconds, the shrimps are defrosted and the water is drained. GroundingGPT: The sound of sneeze is from 2.4 to 7.2 seconds. From 0 to 2.4 seconds, the shrimp is drained of water. OMCAT: The sound of sneeze is from 3.1 to 7.7 seconds. From 0 to 3.1 seconds, the shrimp is being drained of water. |
User: Please briefly describe what is happening in the video after the sound of sneeze with exact start and end time information. Ground Truth: The sound of sneeze is from 3 to 8 seconds. From 8 to 17 seconds, the tails of the shrimps are cut off and the shrimps are chopped into rough pieces about inch thick and set aside. GroundingGPT: The sound of sneeze is from 2.4 to 7.2 seconds. From 7.2 to 17 seconds, the shrimp is drained and put on a plate. OMCAT: The sound of sneeze is from 3.1 to 7.7 seconds. From 7.7 to 17 seconds, the shrimp is being peeled and cut. |
|
User: Provide a brief description of what is happening in the video before the sound of rooster along with start and end timestamps. Ground Truth: The sound of rooster is from 3 to 5 seconds. From 0 to 3 seconds, pepperoni is being added. GroundingGPT: The sound of rooster is from 2.8 to 5 seconds. From 0 to 2.8 seconds, the pizza is being baked in the oven. OMCAT: The sound of rooster is from 3 to 4.8 seconds. From 0 to 3 seconds, the pepperoni is being placed on the pizza. |
User: Provide a brief description of what is happening in the video after the sound of rooster along with start and end timestamps. Ground Truth: The sound of rooster is from 3 to 5 seconds. From 5 to 10 seconds, the pizza is being wrapped. GroundingGPT: The sound of rooster is from from 2.8 to 5 seconds. From 5 to 10 seconds, the pizza is being baked in the oven. OMCAT: The sound of rooster is from 3 to 4.8 seconds. From 4.8 to 10 seconds, the pizza is being baked in the oven. |
|
User: Provide a brief description of what is happening in the video before the sound of children playing along with start and end timestamps. Ground Truth: The sound of children playing is from 15.99 to 19 seconds. From 0 to 15.99 seconds, the batter is being spread on the pan. GroundingGPT: The sound of children playing is from 17.7 to 19.4 seconds. From 0 to 17.7 seconds, the batter is being spread on the pan. OMCAT: The sound of children playing is from 15.6 to 19.1 seconds. From 0 to 15.6 seconds, the batter is being spread on the pan. |
User: Describe what is happening in the video after the sound of children playing with start and end timestamps. Ground Truth: The sound of children playing is from 15.99 to 19 seconds. From 19 to 26 seconds, the masala is being spread on the dosa. GroundingGPT: The sound of children playing is from 17.6 to 18.8 seconds. From 18.8 to 26 seconds, the batter is being spread on the pan. OMCAT: The sound of children playing is from 17.3 to 18.9 seconds. From 17.3 to 18.8 seconds, the dosa is being removed from the pan and served on a plate. |
Below, we show some examples from our proposed OCTAV-MT benchmark, which consists of multi-turn dialogues on videos that are anchored on multiple sound events.
Video | Dialogue | ||
---|---|---|---|
User: Give a short description of what is happening in the video before the second sound of Acoustic guitar, including the precise start and end time details. Ground Truth: The second sound of Acoustic guitar is from 13 to 20 seconds. From 6 to 20 seconds, flour, onion seed, salt and baking powder are mixed. OMCAT: The second sound of Acoustic guitar is from 12.7 to 19.1 seconds. From 0 to 19.1 seconds, the flour is being added to the bread mixture. |
User: Ensure to mention the start and end timestamps when explaining what is happening in the video after the second sound of Acoustic guitar. Ground Truth: The second sound of Acoustic guitar is from 13 to 20 seconds. From 20 to 27 seconds, the dough is kneaded by hand. OMCAT: The second sound of Acoustic guitar is from 12.7 to 19.1 seconds. From 19.1 to 27 seconds, the dough is being kneaded. |
User: Provide a brief description of what is happening in the video before the first sound of Acoustic guitar, along with the start and end times. Ground Truth: The first sound of Acoustic guitar is from 5 to 6 seconds. From 0 to 5 seconds, yeast and sugar are mixed in a bowl of warm water. OMCAT: The first sound of Acoustic guitar is from 5.4 to 7.2 seconds. From 0 to 5.4 seconds, the dough is being kneaded. (Failure case: The model incorrectly predicts the event, as it is not clearly visible from the visual frames themselves that yeast and sugar are being mixed.) |
|
User: When describing what is happening in the video after the first sound of laugh? Answer with start and end timestamps, include the exact start and end time information. Ground Truth: The first sound of laugh is from 5.8 to 7 seconds. From 7 to 17.4, two hair stylists stand next to a young man as they speak to a camera about Justin Bieber's hairstyle look. OMCAT: The first sound of laugh is from 4.1 to 6.2 seconds. From 6.2 to 16.6 seconds, the stylist uses a brush and blow dryer to dry the model's hair. (Failure case: The model incorrectly predicts the video event.) |
User: Could you describe what is happening in the video before the second sound of laugh? Answer with start and end timestamps, including the exact start and end times? Ground Truth: The second sound of laugh is from 17.4 to 18.5 seconds. From 7 to 18.5 seconds, two hair stylists stand next to a young man as they speak to a camera about Justin Bieber's hairstyle look. OMCAT: The second sound of laugh is from 7.7 to 9 seconds. From 0 to 7.7 seconds, the hair stylists rubs some pomade on the man's hair. (Failure case: The model incorrectly predicts the timing of the second laugh, hence the incorrect video caption. However, the video caption predicted for the timestamp 0 to 7.7 seconds is correct.) |
User: Give a short description of what is happening in the video before the first sound of laugh? Answer with start and end timestamps, including the precise start and end time details. Ground Truth: The first sound of laugh is from 6 to 7 seconds. From 0 to 6 seconds, a woman sprays mist onto the hair of young man and she continues on to cut his hair, blow dry it, and finishes off by styling it. OMCAT: The first sound of laugh is from 7.7 to 9.7 seconds. From 0 to 7.7 seconds, the stylist uses a brush to style the boy's hair. |