OMCAT
Omni Context Aware Transformer


Abstract.

Large Language Models (LLMs) have made significant strides in text generation and comprehension, with recent advancements extending into multimodal LLMs that integrate visual and audio inputs. However, these models continue to struggle with fine-grained, cross-modal temporal understanding, particularly when correlating events across audio and video streams.
We address these challenges with two key contributions: a new dataset, OCTAV (Omni Context and Temporal Audio Video), and a new model, OMCAT (Omni Context Aware Transformer). First, OCTAV is a novel dataset designed to capture event transitions across audio and video. Second, OMCAT is a model that leverages RoTE (Rotary Time Embeddings), an extension of RoPE, to enhance temporal grounding and computational efficiency in time-anchored tasks.
Through a three-stage training pipeline (feature alignment, instruction tuning, and OCTAV-specific training), OMCAT excels in cross-modal temporal understanding. Our model demonstrates state-of-the-art performance on Audio-Visual Question Answering (AVQA) tasks and the OCTAV benchmark, showing significant gains in temporal reasoning and cross-modal alignment, as validated through comprehensive experiments and ablation studies.
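
The abstract does not spell out how RoTE works, so the following is only a minimal sketch of the standard RoPE rotation it extends. As an assumption for illustration, the rotation angle is driven by a timestamp in seconds rather than a token index; the function names, the `base` value, and the toy vectors are hypothetical and are not taken from the OMCAT implementation.

```python
import math

def rotate(vec, t, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles t * theta_i (RoPE-style).

    `t` is the position fed to the rotation. Standard RoPE uses a token
    index; using a timestamp in seconds here is an assumption.
    """
    d = len(vec)
    assert d % 2 == 0, "embedding dimension must be even"
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)          # per-pair frequency
        angle = t * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query/key vectors (hypothetical values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Two query/key pairs with the same 2.5 s gap but different absolute times.
score_a = dot(rotate(q, 4.0), rotate(k, 1.5))
score_b = dot(rotate(q, 10.0), rotate(k, 7.5))

# The attention score depends only on the time difference.
assert math.isclose(score_a, score_b, abs_tol=1e-9)
```

The property checked at the end, that the score depends only on the offset between the two positions, is what makes rotary embeddings a natural fit for relative timing.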

Temporal question answering on natural videos

OMCAT achieves state-of-the-art performance on the UnAV-MT-100 dataset, a version of the UnAV-100 dataset with multiple question-answer pairs.

Video Question-Answer
User: Please summarize what is happening in the video at the start, specifying the start and end timestamps.

Ground Truth: At the start of the video, from 0 to 7.5 seconds, there is lawn mowing happening.

OMCAT: At the start of the video, from 0 to 8.3 seconds, there is lawn mowing happening.


User: Please summarize what is happening in the video after the first instance of lawn mowing, specifying the start and end timestamps.

Ground Truth: After the first instance of lawn mowing from 0 to 7.5 seconds, there is another instance of lawn mowing from 16.5 to 46 seconds.

OMCAT: The first instance of lawn mowing is from 0 to 8.3 seconds. From 12.42 to 43 seconds, there is lawn mowing again.

User: Describe what is happening in the video with start and end timestamps.

Ground Truth: There are lions roaring in the video from 3.7 to 42.5 seconds.

OMCAT: There are lions roaring in the video from 0 to 38.3 seconds.

User: Include start and end time details when summarizing what is happening in the video at the start.

Ground Truth: At the start of the video from 0 to 36 seconds, there is an acoustic guitar and a ukulele being played.

OMCAT: At the start of the video from 0 to 36 seconds, there is a ukulele being played and from 0 to 36 seconds, there is an acoustic guitar being played.
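
One common way to quantify how close a predicted time span is to the ground truth is temporal intersection-over-union (IoU). The sketch below applies it to three of the examples above. It is an illustration only: this page does not state which metric is used for UnAV-MT-100, and the helper name `temporal_iou` is hypothetical.

```python
def temporal_iou(pred, truth):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    (ps, pe), (ts, te) = pred, truth
    inter = max(0.0, min(pe, te) - max(ps, ts))
    union = (pe - ps) + (te - ts) - inter
    return inter / union if union > 0 else 0.0

# (OMCAT prediction, ground truth) pairs copied from the examples above.
examples = {
    "lawn mowing (first)": ((0.0, 8.3), (0.0, 7.5)),
    "lawn mowing (second)": ((12.42, 43.0), (16.5, 46.0)),
    "lions roaring": ((0.0, 38.3), (3.7, 42.5)),
}

for name, (pred, truth) in examples.items():
    print(f"{name}: IoU = {temporal_iou(pred, truth):.2f}")
# lawn mowing (first): IoU = 0.90
# lawn mowing (second): IoU = 0.79
# lions roaring: IoU = 0.81
```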

Single-turn temporal and cross-modal question answering

OMCAT achieves state-of-the-art performance on the OCTAV-ST ActivityNet benchmark, which consists of single-turn question-answer pairs
for videos that are anchored on single sound events. Below, we show some examples from OMCAT and compare them to Video LLaMA 2, an audio-visual language model.

Video Question-Answer Question-Answer
User: Please briefly describe what is happening in the video before the sound of engine with exact start and end time information.

Ground Truth: The sound of engine is from 8.2 to 9.1 seconds. From 0 to 8.2 seconds, we see the machine folded and put away.

Video LLaMA 2: In the video, a man is seen sitting on an exercise machine and pulling a handle. The sound of an engine can be heard from 12.1 seconds to 20.1 seconds.

OMCAT: The sound of engine is from 8.5 to 10.1 seconds. From 0 to 8.5 seconds, we see a man folding a fitness machine for storage.

User: Please briefly describe what is happening in the video after the sound of engine with exact start and end time information.

Ground Truth: The sound of engine is from 8.2 to 9.1 seconds. From 9.1 to 17 seconds, we see the man workout again.

Video LLaMA 2: A man is using a rowing machine and pulling a handle attached to a string. The machine is digital and has a compact storage system.

OMCAT: The sound of engine is from 8.5 to 10.1 seconds. From 10.1 to 17 seconds, the man continues to exercise on the machine.

User: Describe what is happening in the video before the sound of street music with start and end timestamps.

Ground Truth: The sound of street music is from 2.8 to 4.9 seconds. From 0 to 2.8 seconds, a man is seen standing on a bowling lane holding a ball.

Video LLaMA 2: Before the sound of street music, a man is seen standing in a bowling alley, holding a bowling ball. He then throws the ball down the lane. The exact start time of this event is 0.0 second, and the end time is 5.0 second.

OMCAT: The sound of street music is from 2.9 to 4.9 seconds. From 0 to 2.9 seconds, a man is seen standing in a bowling alley holding a ball.

User: Provide a brief description of what is happening in the video after the sound of street music along with start and end timestamps.

Ground Truth: The sound of street music is from 2.8 to 4.9 seconds. From 4.9 to 10 seconds, he throws the ball down the lane in the end.

Video LLaMA 2: A man is seen bowling in a bowling alley. He then throws a bowling ball down the lane and hits several pins.

OMCAT: The sound of street music is from 2.8 to 4.9 seconds. From 4.9 to 10 seconds, he throws the ball down and hits several pins.

User: Please briefly describe what is happening in the video before the sound of clock alarm with exact start and end time information.

Ground Truth: The sound of clock alarm is from 7 to 8.4 seconds. From 0 to 7 seconds, a man is standing on a field.

Video LLaMA 2: In the video, a man is seen throwing a javelin. The sound of the clock alarm occurs from 10.0 second to 15.0 second.

OMCAT: The sound of clock alarm is from 7.2 to 8.4 seconds. From 0 to 7.2 seconds, a man is seen standing on a large field holding a shot put.

User: Describe what is happening in the video after the sound of clock alarm with start and end timestamps.

Ground Truth: The sound of clock alarm is from 7 to 8.4 seconds. From 8.4 to 18 seconds, he picks up a ball and spins around before throwing it onto the field.

Video LLaMA 2: A man throws a discus while a group of people watches.

OMCAT: The sound of clock alarm is from 7.8 to 9 seconds. From 9 to 18 seconds, he spins and throws the ball.
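
The answers above follow a fixed template ("The sound of X is from A to B seconds. ..."), which makes the sound span easy to extract and compare. The sketch below parses that template with a regular expression and computes a mean absolute boundary error. The pattern, the helper names, and the error measure are assumptions made for illustration, not the evaluation protocol of the benchmark.

```python
import re

# Matches the templated sentence "... sound of <label> is from <a> to <b> seconds".
SOUND_SPAN = re.compile(
    r"sound of (?P<label>.+?) is from "
    r"(?P<start>\d+(?:\.\d+)?) to (?P<end>\d+(?:\.\d+)?) seconds"
)

def parse_sound_span(answer):
    """Return (label, start, end), or None if the answer is off-template."""
    m = SOUND_SPAN.search(answer)
    if m is None:
        return None
    return m["label"], float(m["start"]), float(m["end"])

def boundary_error(pred, truth):
    """Mean absolute error of the start and end boundaries, in seconds."""
    return (abs(pred[1] - truth[1]) + abs(pred[2] - truth[2])) / 2

# Answers copied from the engine example above.
truth = parse_sound_span("The sound of engine is from 8.2 to 9.1 seconds. "
                         "From 0 to 8.2 seconds, we see the machine folded and put away.")
omcat = parse_sound_span("The sound of engine is from 8.5 to 10.1 seconds. "
                         "From 0 to 8.5 seconds, we see a man folding a fitness machine for storage.")
free_form = parse_sound_span("The sound of an engine can be heard from 12.1 seconds to 20.1 seconds.")

print(omcat)                                   # ('engine', 8.5, 10.1)
print(f"{boundary_error(omcat, truth):.2f}")   # 0.65
print(free_form)                               # None: the answer does not follow the template
```

A free-form answer such as the Video LLaMA 2 one returns `None` here, so a real evaluation would need a more tolerant parser or a separate handling path for off-template outputs.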

OMCAT achieves state-of-the-art performance on the OCTAV-ST YouCook2 benchmark.
To show the advanced capabilities of OMCAT on this benchmark, we compare it below with a fine-tuned version of GroundingGPT.

Video Question-Answer Question-Answer
User: Provide a brief description of what is happening in the video before the sound of sneeze along with start and end timestamps.

Ground Truth: The sound of sneeze is from 3 to 8 seconds. From 0 to 3 seconds, the shrimps are defrosted and the water is drained.

GroundingGPT: The sound of sneeze is from 2.4 to 7.2 seconds. From 0 to 2.4 seconds, the shrimp is drained of water.

OMCAT: The sound of sneeze is from 3.1 to 7.7 seconds. From 0 to 3.1 seconds, the shrimp is being drained of water.

User: Please briefly describe what is happening in the video after the sound of sneeze with exact start and end time information.

Ground Truth: The sound of sneeze is from 3 to 8 seconds. From 8 to 17 seconds, the tails of the shrimps are cut off and the shrimps are chopped into rough pieces about inch thick and set aside.

GroundingGPT: The sound of sneeze is from 2.4 to 7.2 seconds. From 7.2 to 17 seconds, the shrimp is drained and put on a plate.

OMCAT: The sound of sneeze is from 3.1 to 7.7 seconds. From 7.7 to 17 seconds, the shrimp is being peeled and cut.

User: Provide a brief description of what is happening in the video before the sound of rooster along with start and end timestamps.

Ground Truth: The sound of rooster is from 3 to 5 seconds. From 0 to 3 seconds, pepperoni is being added.

GroundingGPT: The sound of rooster is from 2.8 to 5 seconds. From 0 to 2.8 seconds, the pizza is being baked in the oven.

OMCAT: The sound of rooster is from 3 to 4.8 seconds. From 0 to 3 seconds, the pepperoni is being placed on the pizza.

User: Provide a brief description of what is happening in the video after the sound of rooster along with start and end timestamps.

Ground Truth: The sound of rooster is from 3 to 5 seconds. From 5 to 10 seconds, the pizza is being wrapped.

GroundingGPT: The sound of rooster is from 2.8 to 5 seconds. From 5 to 10 seconds, the pizza is being baked in the oven.

OMCAT: The sound of rooster is from 3 to 4.8 seconds. From 4.8 to 10 seconds, the pizza is being baked in the oven.

User: Provide a brief description of what is happening in the video before the sound of children playing along with start and end timestamps.

Ground Truth: The sound of children playing is from 15.99 to 19 seconds. From 0 to 15.99 seconds, the batter is being spread on the pan.

GroundingGPT: The sound of children playing is from 17.7 to 19.4 seconds. From 0 to 17.7 seconds, the batter is being spread on the pan.

OMCAT: The sound of children playing is from 15.6 to 19.1 seconds. From 0 to 15.6 seconds, the batter is being spread on the pan.

User: Describe what is happening in the video after the sound of children playing with start and end timestamps.

Ground Truth: The sound of children playing is from 15.99 to 19 seconds. From 19 to 26 seconds, the masala is being spread on the dosa.

GroundingGPT: The sound of children playing is from 17.6 to 18.8 seconds. From 18.8 to 26 seconds, the batter is being spread on the pan.

OMCAT: The sound of children playing is from 17.3 to 18.9 seconds. From 17.3 to 18.8 seconds, the dosa is being removed from the pan and served on a plate.

Multi-turn temporal and cross-modal dialogue

Below, we show some examples from our proposed OCTAV-MT benchmark, which consists of multi-turn dialogues
on videos that are anchored on multiple sound events.

Video Dialogue
User: Give a short description of what is happening in the video before the second sound of Acoustic guitar, including the precise start and end time details.

Ground Truth: The second sound of Acoustic guitar is from 13 to 20 seconds. From 6 to 20 seconds, flour, onion seed, salt and baking powder are mixed.

OMCAT: The second sound of Acoustic guitar is from 12.7 to 19.1 seconds. From 0 to 19.1 seconds, the flour is being added to the bread mixture.

User: Ensure to mention the start and end timestamps when explaining what is happening in the video after the second sound of Acoustic guitar.

Ground Truth: The second sound of Acoustic guitar is from 13 to 20 seconds. From 20 to 27 seconds, the dough is kneaded by hand.

OMCAT: The second sound of Acoustic guitar is from 12.7 to 19.1 seconds. From 19.1 to 27 seconds, the dough is being kneaded.

User: Provide a brief description of what is happening in the video before the first sound of Acoustic guitar, along with the start and end times.

Ground Truth: The first sound of Acoustic guitar is from 5 to 6 seconds. From 0 to 5 seconds, yeast and sugar are mixed in a bowl of warm water.

OMCAT: The first sound of Acoustic guitar is from 5.4 to 7.2 seconds. From 0 to 5.4 seconds, the dough is being kneaded.

(Failure case: The model predicts the wrong event because it is not clearly visible in the video frames that yeast and sugar are being mixed.)
User: When describing what is happening in the video after the first sound of laugh? Answer with start and end timestamps, include the exact start and end time information.

Ground Truth: The first sound of laugh is from 5.8 to 7 seconds. From 7 to 17.4, two hair stylists stand next to a young man as they speak to a camera about Justin Bieber's hairstyle look.

OMCAT: The first sound of laugh is from 4.1 to 6.2 seconds. From 6.2 to 16.6 seconds, the stylist uses a brush and blow dryer to dry the model's hair.

(Failure case: The model incorrectly predicts the video event.)
User: Could you describe what is happening in the video before the second sound of laugh? Answer with start and end timestamps, including the exact start and end times?

Ground Truth: The second sound of laugh is from 17.4 to 18.5 seconds. From 7 to 18.5 seconds, two hair stylists stand next to a young man as they speak to a camera about Justin Bieber's hairstyle look.

OMCAT: The second sound of laugh is from 7.7 to 9 seconds. From 0 to 7.7 seconds, the hair stylists rubs some pomade on the man's hair.

(Failure case: The model incorrectly localizes the second sound of laugh, so the video caption is also incorrect. However, the caption predicted for 0 to 7.7 seconds is correct.)
User: Give a short description of what is happening in the video before the first sound of laugh? Answer with start and end timestamps, including the precise start and end time details.

Ground Truth: The first sound of laugh is from 6 to 7 seconds. From 0 to 6 seconds, a woman sprays mist onto the hair of young man and she continues on to cut his hair, blow dry it, and finishes off by styling it.

OMCAT: The first sound of laugh is from 7.7 to 9.7 seconds. From 0 to 7.7 seconds, the stylist uses a brush to style the boy's hair.