
By Serge Malo and Francis Picard

We selected the Spatial457 dataset for our project. It is a visual question answering (VQA) dataset focused on spatial reasoning, with questions divided into levels of increasing difficulty. We use 5 of the 7 available levels. Here are some qualitative examples.

| Question Level | Question | Answer |
|---|---|---|
| L1 - Single Object | What shape is the large brown thing? | truck |
| L2 - Multiple Objects | Are there any other things that have the same size as the brown thing? | False |
| L3 - 2D Spatial relationship | What is the color of the object behind the object that is to the right of the mountain bike? | blue |
| L4 - Orientation (Pose) | What is the size of the object that faces the same direction as the yellow thing? | brown |
| L5 - 6D Spatial relationship | What color is the car in front of the large brown truck to the right of the small yellow object? | blue |

| Question Level | Question | Answer |
|---|---|---|
| L1 - Single Object | There is a gray thing; what shape is it? | double bus |
| L2 - Multiple Objects | Is the number of tiny cyan dirt bikes greater than the number of red cruisers? | False |
| L3 - 2D Spatial relationship | Do the object to the left of the tiny gray bus and the small object that is in front of the big car have the same color? | False |
| L4 - Orientation (Pose) | What is the color of the thing that faces the same direction as the double bus? | red |
| L5 - 6D Spatial relationship | There is a blue thing that is right of the small object on the left side of the biplane; what shape is it? | sedan |

| Question Level | Question | Answer |
|---|---|---|
| L1 - Single Object | How big is the purple bicycle? | large |
| L2 - Multiple Objects | Are there more big fighters than big bicycles? | False |
| L3 - 2D Spatial relationship | What number of cars are to the right of the brown object? | 0 |
| L4 - Orientation (Pose) | What is the shape of the object that occludes the yellow object? | suv |
| L5 - 6D Spatial relationship | The big mountain bike that is behind the yellow thing is what color? | purple |

| Question Level | Question | Answer |
|---|---|---|
| L1 - Single Object | The tiny minivan has what color? | yellow |
| L2 - Multiple Objects | What number of objects are large cyan choppers or motorbikes? | 3 |
| L3 - 2D Spatial relationship | How many objects are either things to the left of the gray tandem bike or gray objects that are behind the small blue thing? | 6 |
| L4 - Orientation (Pose) | What is the shape of the object which faces to the right? | wagon |
| L5 - 6D Spatial relationship | How many objects are either objects that are right of the gray tandem bike or objects behind the bicycle? | 3 |

We fine-tune Qwen2-VL-2B using our CurriculuMoE Adapters. We freeze the vision components of the VLM and augment the MLP blocks in each decoder layer of the language model with our adapters. During training, we use the annotated task labels to direct tokens to their respective Routers. We apply Soft Annealing Routing to ensure that new Experts are learned early in training while still allowing them to complement previously acquired skills.
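
The exact form of Soft Annealing Routing is not spelled out here, so the following is only a sketch of one possible reading: early in training the gate weights are pushed toward the newly added Expert, and over time they relax toward whatever distribution the Router produces. The linear schedule, the blending formula, and the names `anneal_weight` and `soft_annealed_gates` are all assumptions made for illustration, not the actual implementation.

```python
from typing import List


def anneal_weight(step: int, total_steps: int) -> float:
    """Linear schedule from 0.0 at the start to 1.0 at the end (assumed choice)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(max(step / total_steps, 0.0), 1.0)


def soft_annealed_gates(
    router_probs: List[float], new_expert: int, step: int, total_steps: int
) -> List[float]:
    """Blend a one-hot gate on the new expert with the router's distribution.

    At step 0 all weight goes to `new_expert`, so it is forced to learn.
    By `total_steps` the gates equal `router_probs`, so earlier experts
    can contribute again. This blend is a hypothetical formulation.
    """
    lam = anneal_weight(step, total_steps)
    return [
        (1.0 - lam) * (1.0 if i == new_expert else 0.0) + lam * p
        for i, p in enumerate(router_probs)
    ]


# Hypothetical router output over three experts; expert 2 is the new one.
router_probs = [0.5, 0.3, 0.2]
for step in (0, 50, 100):
    gates = soft_annealed_gates(router_probs, new_expert=2, step=step, total_steps=100)
    print(step, gates)
```

Because the blend is a convex combination of two distributions, the gate weights still sum to 1 at every step, which keeps the scale of the adapter output stable while the schedule moves.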

At inference time, we assume that task labels are not available. To decide which Router should receive the tokens, we train a Task Classifier that predicts the task associated with a given input. In our setup, tasks are defined entirely by the type of question; they do not depend on the input image. Our Task Classifier consists of an all-MiniLM-L6-v2 sentence transformer encoder augmented with a lightweight classification head composed of two linear layers.
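
To make the inference-time flow concrete, here is a minimal, dependency-free sketch of how a predicted task could select a Router. The `classify` callable is a hypothetical stand-in for the Task Classifier (encoder plus two-layer head); the stub below returns fixed logits purely so the example runs, and the short task names in `TASKS` are placeholders rather than identifiers from our code.

```python
import math
from typing import Callable, List, Sequence, Tuple

# The five Spatial457 levels used in this project, in curriculum order.
# The short labels are placeholders chosen for this example.
TASKS = ["L1", "L2", "L3", "L4", "L5"]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_router(
    question: str, classify: Callable[[str], Sequence[float]]
) -> Tuple[str, float]:
    """Return the predicted task label and its probability for one question.

    Only the question text is used: tasks depend on the question type,
    not on the image, so the image never reaches the classifier.
    """
    probs = softmax(classify(question))
    best = max(range(len(TASKS)), key=probs.__getitem__)
    return TASKS[best], probs[best]


def stub_classifier(question: str) -> List[float]:
    """Hypothetical stand-in: fixed logits instead of a trained model."""
    return [0.1, 0.2, 0.3, 2.5, 0.4]


task, confidence = select_router(
    "What is the shape of the object which faces to the right?", stub_classifier
)
print(task, round(confidence, 3))
```

In a real pipeline the stub would be replaced by the trained classifier, and the returned label would pick which Router processes the tokens of that sample.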
