OM-CAT: Omni Context Aware Transformer

Video	Question-Answer	Question-Answer
	User: Please briefly describe what is happening in the video before the sound of engine with exact start and end time information. Ground Truth: The sound of engine is from 8.2 to 9.1 seconds. From 0 to 8.2 seconds, we see the machine folded and put away. Video LLaMA 2: In the video, a man is seen sitting on an exercise machine and pulling a handle. The sound of an engine can be heard from 12.1 seconds to 20.1 seconds. OMCAT: The sound of engine is from 8.5 to 10.1 seconds. From 0 to 8.5 seconds, we see a man folding a fitness machine for storage.	User: Please briefly describe what is happening in the video after the sound of engine with exact start and end time information. Ground Truth: The sound of engine is from 8.2 to 9.1 seconds. From 9.1 to 17 seconds, we see the man workout again. Video LLaMA 2: A man is using a rowing machine and pulling a handle attached to a string. The machine is digital and has a compact storage system. OMCAT: The sound of engine is from 8.5 to 10.1 seconds. From 10.1 to 17 seconds, the man continues to exercise on the machine.
	User: Descibe what is happening in the video before the sound of street music with start and end timestamps. Ground Truth: The sound of street music is from 2.8 to 4.9 seconds. From 0 to 2.8 seconds, a man is seen standing on a bowling lane holding a ball. Video LLaMA 2: Before the sound of street music, a man is seen standing in a bowling alley, holding a bowling ball. He then throws the ball down the lane. The exact start time of this event is 0.0 second, and the end time is 5.0 second. OMCAT: The sound of street music is from 2.9 to 4.9 seconds. From 0 to 2.9 seconds, a man is seen standing in a bowling alley holding a ball.	User: Provide a brief description of what is happening in the video after the sound of street music along with start and end timestamps. Ground Truth: The sound of street music is from 2.8 to 4.9 seconds. From 4.9 to 10 seconds, he throws the ball down the lane in the end. Video LLaMA 2: A man is seen bowling in a bowling alley. He then throws a bowling ball down the lane and hits several pins. OMCAT: The sound of street music is from 2.8 to 4.9 seconds. From 4.9 to 10 seconds, he throws the ball down and hits several pins.
	User: Please briefly describe what is happening in the video before the sound of clock alarm with exact start and end time information. Ground Truth: The sound of clock alarm is from 7 to 8.4 seconds. From 0 to 7 seconds, a man is standing on a field. Video LLaMA 2: In the video, a man is seen throwing a javelin. The sound of the clock alarm occurs from 10.0 second to 15.0 second. OMCAT: The sound of clock alarm is from 7.2 to 8.4 seconds. From 0 to 7.2 seconds, a man is seen standing on a large field holding a shot put.	User: Descibe what is happening in the video after the sound of clock alarm with start and end timestamps. Ground Truth: The sound of clock alarm is from 7 to 8.4 seconds. From 8.4 to 18 seconds, he picks up a ball and spins around before throwing it onto the field. Video LLaMA 2: A man throws a discus while a group of people watches. OMCAT: The sound of clock alarm is from 7.8 to 9 seconds. From 9 to 18 seconds, he spins and throws the ball.

OMCAT achieves state-of-the-art performance on the OCTAV-ST YouCook2 benchmark.
To illustrate OMCAT's capabilities on this benchmark, we compare it with a fine-tuned version of GroundingGPT below.

Video	Question-Answer	Question-Answer
	User: Provide a brief description of what is happening in the video before the sound of sneeze along with start and end timestamps. Ground Truth: The sound of sneeze is from 3 to 8 seconds. From 0 to 3 seconds, the shrimps are defrosted and the water is drained. GroundingGPT: The sound of sneeze is from 2.4 to 7.2 seconds. From 0 to 2.4 seconds, the shrimp is drained of water. OMCAT: The sound of sneeze is from 3.1 to 7.7 seconds. From 0 to 3.1 seconds, the shrimp is being drained of water.	User: Please briefly describe what is happening in the video after the sound of sneeze with exact start and end time information. Ground Truth: The sound of sneeze is from 3 to 8 seconds. From 8 to 17 seconds, the tails of the shrimps are cut off and the shrimps are chopped into rough pieces about inch thick and set aside. GroundingGPT: The sound of sneeze is from 2.4 to 7.2 seconds. From 7.2 to 17 seconds, the shrimp is drained and put on a plate. OMCAT: The sound of sneeze is from 3.1 to 7.7 seconds. From 7.7 to 17 seconds, the shrimp is being peeled and cut.
	User: Provide a brief description of what is happening in the video before the sound of rooster along with start and end timestamps. Ground Truth: The sound of rooster is from 3 to 5 seconds. From 0 to 3 seconds, pepperoni is being added. GroundingGPT: The sound of rooster is from 2.8 to 5 seconds. From 0 to 2.8 seconds, the pizza is being baked in the oven. OMCAT: The sound of rooster is from 3 to 4.8 seconds. From 0 to 3 seconds, the pepperoni is being placed on the pizza.	User: Provide a brief description of what is happening in the video after the sound of rooster along with start and end timestamps. Ground Truth: The sound of rooster is from 3 to 5 seconds. From 5 to 10 seconds, the pizza is being wrapped. GroundingGPT: The sound of rooster is from from 2.8 to 5 seconds. From 5 to 10 seconds, the pizza is being baked in the oven. OMCAT: The sound of rooster is from 3 to 4.8 seconds. From 4.8 to 10 seconds, the pizza is being baked in the oven.
	User: Provide a brief description of what is happening in the video before the sound of children playing along with start and end timestamps. Ground Truth: The sound of children playing is from 15.99 to 19 seconds. From 0 to 15.99 seconds, the batter is being spread on the pan. GroundingGPT: The sound of children playing is from 17.7 to 19.4 seconds. From 0 to 17.7 seconds, the batter is being spread on the pan. OMCAT: The sound of children playing is from 15.6 to 19.1 seconds. From 0 to 15.6 seconds, the batter is being spread on the pan.	User: Descibe what is happening in the video after the sound of children playing with start and end timestamps. Ground Truth: The sound of children playing is from 15.99 to 19 seconds. From 19 to 26 seconds, the masala is being spread on the dosa. GroundingGPT: The sound of children playing is from 17.6 to 18.8 seconds. From 18.8 to 26 seconds, the batter is being spread on the pan. OMCAT: The sound of children playing is from 17.3 to 18.9 seconds. From 17.3 to 18.8 seconds, the dosa is being removed from the pan and served on a plate.
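
The answers in the tables above share a common pattern: the sound event is localized first ("The sound of ... is from A to B seconds"), followed by the description of the video segment. As a rough illustration of how such answers could be compared against the ground truth, the sketch below extracts the first stated interval and computes its temporal overlap (intersection over union) with a reference interval. This is an assumption made for illustration, not the evaluation protocol used for OCTAV; the names `first_interval` and `temporal_iou` are hypothetical.

```python
import re

# Matches phrases such as "from 3.1 to 7.7 seconds" (format taken from the
# example answers above; the regex itself is an illustrative assumption).
_INTERVAL = re.compile(
    r"from\s+(\d+(?:\.\d+)?)\s+to\s+(\d+(?:\.\d+)?)\s+seconds",
    re.IGNORECASE,
)


def first_interval(answer):
    """Return the first (start, end) interval stated in an answer, or None."""
    match = _INTERVAL.search(answer)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def temporal_iou(pred, truth):
    """Intersection over union of two (start, end) intervals in seconds."""
    intersection = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # Sneeze example from the table above.
    truth = first_interval("The sound of sneeze is from 3 to 8 seconds.")
    answers = {
        "GroundingGPT": "The sound of sneeze is from 2.4 to 7.2 seconds.",
        "OMCAT": "The sound of sneeze is from 3.1 to 7.7 seconds.",
    }
    for name, answer in answers.items():
        pred = first_interval(answer)
        print(name, round(temporal_iou(pred, truth), 2))
```

For the sneeze example, the OMCAT interval overlaps the ground truth more closely (about 0.92) than the GroundingGPT interval (about 0.75). The overlap only measures sound localization; it says nothing about whether the accompanying description is correct.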

Multi-turn temporal and cross-modal dialogue

Below, we show examples from our proposed OCTAV-MT benchmark, which consists of multi-turn dialogues
about videos anchored on multiple sound events.

Video	Dialogue
	User: Give a short description of what is happening in the video before the second sound of Acoustic guitar, including the precise start and end time details. Ground Truth: The second sound of Acoustic guitar is from 13 to 20 seconds. From 6 to 20 seconds, flour, onion seed, salt and baking powder are mixed. OMCAT: The second sound of Acoustic guitar is from 12.7 to 19.1 seconds. From 0 to 19.1 seconds, the flour is being added to the bread mixture.	User: Ensure to mention the start and end timestamps when explaining what is happening in the video after the second sound of Acoustic guitar. Ground Truth: The second sound of Acoustic guitar is from 13 to 20 seconds. From 20 to 27 seconds, the dough is kneaded by hand. OMCAT: The second sound of Acoustic guitar is from 12.7 to 19.1 seconds. From 19.1 to 27 seconds, the dough is being kneaded.	User: Provide a brief description of what is happening in the video before the first sound of Acoustic guitar, along with the start and end times. Ground Truth: The first sound of Acoustic guitar is from 5 to 6 seconds. From 0 to 5 seconds, yeast and sugar are mixed in a bowl of warm water. OMCAT: The first sound of Acoustic guitar is from 5.4 to 7.2 seconds. From 0 to 5.4 seconds, the dough is being kneaded. (Failure case: the model predicts the wrong event, because it is not clearly visible from the visual frames themselves that yeast and sugar are being mixed.)
	User: When describing what is happening in the video after the first sound of laugh? Answer with start and end timestamps, include the exact start and end time information. Ground Truth: The first sound of laugh is from 5.8 to 7 seconds. From 7 to 17.4, two hair stylists stand next to a young man as they speak to a camera about Justin Bieber's hairstyle look. OMCAT: The first sound of laugh is from 4.1 to 6.2 seconds. From 6.2 to 16.6 seconds, the stylist uses a brush and blow dryer to dry the model's hair. (Failure case: the model incorrectly predicts the video event.)	User: Could you describe what is happening in the video before the second sound of laugh? Answer with start and end timestamps, including the exact start and end times? Ground Truth: The second sound of laugh is from 17.4 to 18.5 seconds. From 7 to 18.5 seconds, two hair stylists stand next to a young man as they speak to a camera about Justin Bieber's hairstyle look. OMCAT: The second sound of laugh is from 7.7 to 9 seconds. From 0 to 7.7 seconds, the hair stylists rubs some pomade on the man's hair. (Failure case: the model mislocalizes the second sound of laugh, which leads to an incorrect video caption. However, the caption predicted for the interval from 0 to 7.7 seconds is correct.)	User: Give a short description of what is happening in the video before the first sound of laugh? Answer with start and end timestamps, including the precise start and end time details. Ground Truth: The first sound of laugh is from 6 to 7 seconds. From 0 to 6 seconds, a woman sprays mist onto the hair of young man and she continues on to cut his hair, blow dry it, and finishes off by styling it. OMCAT: The first sound of laugh is from 7.7 to 9.7 seconds. From 0 to 7.7 seconds, the stylist uses a brush to style the boy's hair.

Video	Question-Answer
A segment from Whiplash movie trailer	User: What musical instruments can be heard in the video? OMCAT: The audio features a drum beat, a bass guitar, and a keyboard playing a melody.
A segment from Wicked movie trailer	User: What musical instruments can be heard in the video? OMCAT: The audio features a percussion instrument playing a rhythmic beat, followed by a synthesized sound effect.
YouTube video	User: Describe what is happening in the video before the sound of man speaking with start and end timestamps. OMCAT: The man is speaking from 7 to 34 seconds. From 0 to 7 seconds, the detergent is being sprayed from a bottle. User: What is shown in the video between timestamps 20 and 39 seconds? OMCAT: A person is shown spraying a bottle of cleaner on a car's dashboard and then wiping it off.
UnAV-100 dataset	User: Please summarize what is happening in the video at the start, specifying the start and end timestamps. Ground Truth: At the start of the video, from 0 to 7.5 seconds, there is lawn mowing happening. OMCAT: At the start of the video, from 0 to 8.3 seconds, there is lawn mowing happening. User: Please summarize what is happening in the video after the first instance of lawn mowing, specifying the start and end timestamps. Ground Truth: After the first instance of lawn mowing from 0 to 7.5 seconds, there is another instance of lawn mowing from 16.5 to 46 seconds. OMCAT: The first instance of lawn mowing is from 0 to 8.3 seconds. From 12.42 to 43 seconds, there is lawn mowing again.
UnAV-100 dataset	User: Describe what is happening in the video with start and end timestamps. Ground Truth: There are lions roaring in the video from 3.7 to 42.5 seconds. OMCAT: There are lions roaring in the video from 0 to 38.3 seconds.
UnAV-100 dataset	User: Include start and end time details when summarizing what is happening in the video at the start. Ground Truth: At the start of the video from 0 to 36 seconds, there is an acoustic guitar and a ukulele being played. OMCAT: At the start of the video from 0 to 36 seconds, there is a ukulele being played and from 0 to 36 seconds, there is an acoustic guitar being played.

Temporal question answering on natural videos

Single-turn temporal and cross-modal question answering
