image.png

Mitigating catastrophic forgetting in VQA curriculum learning for VLMs

By Serge Malo and Francis Picard

Dataset examples

image.png

We have selected the Spatial457 dataset for our project. It is a VQA dataset on spatial reasoning whose questions are divided into levels of increasing difficulty. We use 5 of the 7 available levels. Here are some qualitative examples.

image.png

| Question Level | Question | Answer |
|---|---|---|
| L1 - Single Object | What shape is the large brown thing? | truck |
| L2 - Multiple Objects | Are there any other things that have the same size as the brown thing? | False |
| L3 - 2D Spatial relationship | What is the color of the object behind the object that is to the right of the mountain bike? | blue |
| L4 - Orientation (Pose) | What is the size of the object that faces the same direction as the yellow thing? | brown |
| L5 - 6D Spatial relationship | What color is the car in front of the large brown truck to the right of the small yellow object? | blue |

image.png

| Question Level | Question | Answer |
|---|---|---|
| L1 - Single Object | There is a gray thing; what shape is it? | double bus |
| L2 - Multiple Objects | Is the number of tiny cyan dirt bikes greater than the number of red cruisers? | False |
| L3 - 2D Spatial relationship | Do the object to the left of the tiny gray bus and the small object that is in front of the big car have the same color? | False |
| L4 - Orientation (Pose) | What is the color of the thing that faces the same direction as the double bus? | red |
| L5 - 6D Spatial relationship | There is a blue thing that is right of the small object on the left side of the biplane; what shape is it? | sedan |

image.png

| Question Level | Question | Answer |
|---|---|---|
| L1 - Single Object | How big is the purple bicycle? | large |
| L2 - Multiple Objects | Are there more big fighters than big bicycles? | False |
| L3 - 2D Spatial relationship | What number of cars are to the right of the brown object? | 0 |
| L4 - Orientation (Pose) | What is the shape of the object that occludes the yellow object? | suv |
| L5 - 6D Spatial relationship | The big mountain bike that is behind the yellow thing is what color? | purple |

image.png

| Question Level | Question | Answer |
|---|---|---|
| L1 - Single Object | The tiny minivan has what color? | yellow |
| L2 - Multiple Objects | What number of objects are large cyan choppers or motorbikes? | 3 |
| L3 - 2D Spatial relationship | How many objects are either things to the left of the gray tandem bike or gray objects that are behind the small blue thing? | 6 |
| L4 - Orientation (Pose) | What is the shape of the object which faces to the right? | wagon |
| L5 - 6D Spatial relationship | How many objects are either objects that are right of the gray tandem bike or objects behind the bicycle? | 3 |

Methods

We fine-tune Qwen2-VL-2B using our CurriculuMoE Adapters. We freeze the vision components of the VLM and augment the MLP blocks in each decoder layer of the language model with our adapters. During training, we use the annotated task labels to direct tokens to their respective Routers. We apply Soft Annealing Routing to ensure that new Experts learn early in training while still allowing them to complement previously acquired skills.
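To make the idea of annealed routing more concrete, here is a minimal sketch of one possible reading of it. The exact schedule we use is not described in this post, so everything below is an illustrative assumption: the linear decay, the function names (`anneal_coefficient`, `soft_annealed_weights`), and the choice to blend a one-hot assignment to the new Expert with the Router's softmax output. Early in training the new Expert receives almost all of the routing weight, and by the end the Router's own distribution over all Experts takes over.

```python
import math


def anneal_coefficient(step: int, total_steps: int) -> float:
    """Hypothetical linear schedule: 1.0 at step 0, decaying to 0.0 at total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return max(0.0, 1.0 - step / total_steps)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_annealed_weights(router_logits, new_expert_idx, step, total_steps):
    """Blend a forced one-hot assignment with the router's learned distribution.

    alpha = 1.0 -> every token goes to the new expert (it is guaranteed to learn).
    alpha = 0.0 -> the router alone decides, so older experts can contribute.
    This blending rule is an assumption made for illustration.
    """
    alpha = anneal_coefficient(step, total_steps)
    learned = softmax(router_logits)
    return [
        alpha * (1.0 if i == new_expert_idx else 0.0) + (1.0 - alpha) * p
        for i, p in enumerate(learned)
    ]


if __name__ == "__main__":
    logits = [0.5, 1.0, -2.0]  # made-up router logits for three experts
    for step in (0, 50, 100):
        print(step, soft_annealed_weights(logits, new_expert_idx=2, step=step, total_steps=100))
```

Because both terms of the blend are probability distributions, the resulting weights always sum to one, so the combined Expert output stays on the same scale throughout training.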

image.png

At inference time, we assume that task labels are not available. To decide which Router should receive the tokens, we train a Task Classifier to predict the task associated with a given input. In our setup, tasks are defined entirely by the type of question; they do not depend on the input image. Our Task Classifier consists of an all-MiniLM-L6-v2 sentence transformer encoder augmented with a lightweight classification head composed of two linear layers.
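The classification head described above can be sketched in PyTorch as follows. This is a rough illustration and not our exact implementation: the hidden width (128), the ReLU between the two linear layers, and the names `TaskClassifierHead` and `predict_task` are assumptions. The input size of 384 matches the sentence embedding dimension of all-MiniLM-L6-v2, and the five outputs correspond to the five Spatial457 levels we use.

```python
import torch
from torch import nn

EMBED_DIM = 384  # sentence embedding size produced by all-MiniLM-L6-v2
NUM_TASKS = 5    # the five Spatial457 levels used in this project


class TaskClassifierHead(nn.Module):
    """Two linear layers on top of a question embedding (hidden size is assumed)."""

    def __init__(self, embed_dim: int = EMBED_DIM, hidden_dim: int = 128,
                 num_tasks: int = NUM_TASKS):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),  # assumed non-linearity; the post only specifies two linear layers
            nn.Linear(hidden_dim, num_tasks),
        )

    def forward(self, question_embeddings: torch.Tensor) -> torch.Tensor:
        # Returns unnormalised task logits of shape (batch, num_tasks).
        return self.net(question_embeddings)


def predict_task(head: TaskClassifierHead, question_embeddings: torch.Tensor) -> torch.Tensor:
    """Return the index of the most likely task for each question embedding."""
    head.eval()
    with torch.no_grad():
        return head(question_embeddings).argmax(dim=-1)


if __name__ == "__main__":
    head = TaskClassifierHead()
    fake_embeddings = torch.randn(4, EMBED_DIM)  # stand-in for real encoder output
    print(predict_task(head, fake_embeddings))
```

In practice the random embeddings would be replaced by the encoder output, for example `SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2").encode(questions, convert_to_tensor=True)` from the `sentence-transformers` library. The predicted index then selects which Router handles the tokens of that question.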

image.png