Mitigating catastrophic forgetting in VQA curriculum learning for VLMs

Dataset examples

We use the Spatial457 dataset for our project, a VQA dataset on spatial reasoning. Its questions are divided into levels of increasing difficulty, and we use 5 of the 7 available levels. Here are some qualitative examples.

Question Level	Question	Answer
L1 - Single Object	What shape is the large brown thing?	truck
L2 - Multiple Objects	Are there any other things that have the same size as the brown thing?	False
L3 - 2D Spatial relationship	What is the color of the object behind the object that is to the right of the mountain bike?	blue
L4 - Orientation (Pose)	What is the size of the object that faces the same direction as the yellow thing?	brown
L5 - 6D Spatial relationship	What color is the car in front of the large brown truck to the right of the small yellow object?	blue

Question Level	Question	Answer
L1 - Single Object	There is a gray thing; what shape is it?	double bus
L2 - Multiple Objects	Is the number of tiny cyan dirt bikes greater than the number of red cruisers?	False
L3 - 2D Spatial relationship	Do the object to the left of the tiny gray bus and the small object that is in front of the big car have the same color?	False
L4 - Orientation (Pose)	What is the color of the thing that faces the same direction as the double bus?	red
L5 - 6D Spatial relationship	There is a blue thing that is right of the small object on the left side of the biplane; what shape is it?	sedan

Question Level	Question	Answer
L1 - Single Object	How big is the purple bicycle?	large
L2 - Multiple Objects	Are there more big fighters than big bicycles?	False
L3 - 2D Spatial relationship	What number of cars are to the right of the brown object?	0
L4 - Orientation (Pose)	What is the shape of the object that occludes the yellow object?	suv
L5 - 6D Spatial relationship	The big mountain bike that is behind the yellow thing is what color?	purple

Question Level	Question	Answer
L1 - Single Object	The tiny minivan has what color?	yellow
L2 - Multiple Objects	What number of objects are large cyan choppers or motorbikes?	3
L3 - 2D Spatial relationship	How many objects are either things to the left of the gray tandem bike or gray objects that are behind the small blue thing?	6
L4 - Orientation (Pose)	What is the shape of the object which faces to the right?	wagon
L5 - 6D Spatial relationship	How many objects are either objects that are right of the gray tandem bike or objects behind the bicycle?	3

Methods

We fine-tune Qwen2-VL-2B using our CurriculuMoE Adapters. We freeze the vision components of the VLM and augment the MLP blocks in each decoder layer of the language model with our adapters. During training, we use the annotated task labels to direct tokens to their respective Routers. We apply Soft Annealing Routing to ensure that new Experts are learned early in training while still allowing them to complement previously acquired skills.
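To make the label-driven routing concrete, here is a minimal PyTorch sketch of one way an adapter with per-task Routers could wrap an MLP block. It is not the actual CurriculuMoE implementation: the class names `TaskRoutedAdapter` and `AdaptedMLP`, the bottleneck experts, the dense softmax mixture, and the additive combination with the MLP output are all assumptions made for illustration. Soft Annealing Routing is left out because its schedule is not described here.

```python
import torch
from torch import nn


class TaskRoutedAdapter(nn.Module):
    """Hypothetical adapter: one router per task mixes a shared pool of experts."""

    def __init__(self, hidden_dim, num_tasks, num_experts, bottleneck=16):
        super().__init__()
        # Small bottleneck experts (an assumed design, not taken from the text).
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, bottleneck),
                nn.GELU(),
                nn.Linear(bottleneck, hidden_dim),
            )
            for _ in range(num_experts)
        ])
        # One router per task; the task label selects which router scores the tokens.
        self.routers = nn.ModuleList([
            nn.Linear(hidden_dim, num_experts) for _ in range(num_tasks)
        ])

    def forward(self, hidden, task_id):
        # hidden: (batch, seq, hidden_dim); task_id: one int shared by the batch.
        weights = torch.softmax(self.routers[task_id](hidden), dim=-1)       # (B, S, E)
        expert_out = torch.stack([e(hidden) for e in self.experts], dim=-1)  # (B, S, H, E)
        return (expert_out * weights.unsqueeze(-2)).sum(dim=-1)              # (B, S, H)


class AdaptedMLP(nn.Module):
    """Wraps an existing MLP block and adds the adapter output to it (assumed)."""

    def __init__(self, mlp, adapter):
        super().__init__()
        self.mlp = mlp
        self.adapter = adapter

    def forward(self, hidden, task_id):
        return self.mlp(hidden) + self.adapter(hidden, task_id)


torch.manual_seed(0)
# A plain linear layer stands in for the decoder's MLP block in this toy example.
block = AdaptedMLP(nn.Linear(8, 8), TaskRoutedAdapter(8, num_tasks=5, num_experts=3))
tokens = torch.randn(2, 4, 8)
out = block(tokens, task_id=1)  # tokens labelled with task 1 go through router 1
```

Passing a different `task_id` changes only which Router weights the experts; the experts themselves are shared across tasks in this sketch.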

At inference time, we assume that task labels are not available. To decide which Router the tokens should be sent to, we train a Task Classifier to predict the task associated with a given input. In our setup, tasks are defined entirely by the type of question; they do not depend on the input image. Our Task Classifier consists of an all-MiniLM-L6-v2 sentence transformer encoder augmented with a lightweight classification head composed of two linear layers.
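The classification head described above could look roughly like the following PyTorch sketch. It covers only the head, fed with random tensors in place of real encoder output so that it runs without downloading the encoder. The 384-dimensional input matches the embedding size of all-MiniLM-L6-v2. The hidden width of 128, the ReLU between the two linear layers, and the names `TaskClassifierHead` and `predict_task` are assumptions, not details from our setup.

```python
import torch
from torch import nn

EMBED_DIM = 384  # sentence-embedding size produced by all-MiniLM-L6-v2
NUM_TASKS = 5    # the five Spatial457 levels used here (L1 to L5)


class TaskClassifierHead(nn.Module):
    """Two linear layers mapping a question embedding to task logits."""

    def __init__(self, embed_dim=EMBED_DIM, hidden_dim=128, num_tasks=NUM_TASKS):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),  # hidden_dim is an assumed value
            nn.ReLU(),                         # assumed non-linearity
            nn.Linear(hidden_dim, num_tasks),
        )

    def forward(self, embeddings):
        return self.net(embeddings)  # unnormalised logits, one per task


def predict_task(head, embeddings):
    """Return the index of the most likely task for each question embedding."""
    head.eval()
    with torch.no_grad():
        return head(embeddings).argmax(dim=-1)


torch.manual_seed(0)
head = TaskClassifierHead()
fake_embeddings = torch.randn(4, EMBED_DIM)  # stand-in for encoded questions
task_ids = predict_task(head, fake_embeddings)
```

With the sentence-transformers library, real inputs could be obtained with `SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2").encode(questions, convert_to_tensor=True)`. The predicted index would then select the Router used for that question at inference time.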