Arcana: Improving Multi-modal Large Language Model through Boosting Vision Capabilities

1Nanjing University of Science and Technology,    2Baidu VIS,
3Huazhong University of Science and Technology
*Indicates Equal Contribution   ✉Indicates Corresponding Author

Motivation

The visual perception capabilities of MLLMs directly impact their performance. It is well known that the main factors influencing MLLMs' visual perception are data and structure. Arcana aims to enhance the visual perception capabilities of MLLMs by addressing both aspects.


Sampled VQA examples involving color, quantity, small objects, and localization tasks, showcasing the importance of visual recognition capabilities for multimodal large language models (MLLMs).

High-quality and Fine-grained Instruction Dataset

On the data side, open-source multimodal data is scarce, and the available datasets contain limited visual components, preventing MLLMs from gaining sufficient visual perception capabilities from these sources. To this end, we developed a data engine that annotates multimodal data while ensuring a diversity of visual factors. The specific process is as follows:


The Arcana data engine involves three crucial steps: (1) augmenting and filtering basic image annotations, (2) obtaining text information and a description for each annotated region, and (3) using a large language model to integrate these visual annotations into different types of captions.
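
As a rough illustration of how these three steps compose, the sketch below wires them together as plain Python callables. The detect, keep, describe, and compose functions are hypothetical placeholders supplied by the caller, not part of the released engine.

# Hypothetical sketch of the three-step data engine above; the detect, keep,
# describe, and compose callables are placeholders, not the released pipeline.

def build_annotations(image_paths, detect, keep, describe, compose):
    samples = []
    for path in image_paths:
        # (1) augment basic image annotations (boxes, categories, OCR) and filter noisy ones.
        regions = [region for region in detect(path) if keep(region)]
        # (2) obtain text information and a description for each annotated region.
        for region in regions:
            region["description"] = describe(path, region["box"])
        # (3) use a large language model to integrate the visual annotations
        #     into image-level and region-level captions.
        samples.append({"image": path, "regions": regions, "captions": compose(regions)})
    return samples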

Arcana_VG-COCO

The Arcana_VG-COCO release contains the following annotation files; a minimal loading sketch is given after the file descriptions.

captions.json (Image-level captions): The file contains 86k conversations with image-level detailed/short captions generated by Arcana's data engine. Each element is formatted as follows:
    {
      "id": "image_id",
      "image": "image_path",,
      "conversations":[
        { "from": "human", "value": "<image>\nProvide a detailed description
        of the following image" },
        { "from": "gpt", "value": "The scene depicts an airport setting with a 
        sizable white and green airplane, likely a jumbo jet, stationed on the
        runway. The airplane showcases a color scheme of red and green,
        highlighting its unique look."}
      ]
    }
                      
region_captions.json (Region-level captions): The file contains 86k conversations with region-level captions generated by Arcana's data engine. Each element is formatted as follows:
    {
      "id": "image_id",
      "image": "image_path",,
      "conversations":[
        { "from": "human", "value": "<image>\nProvide a comprehensive
        depiction of the area bounded by [0.286, 0.342, 0.658, 0.484]." },
        {"from": "gpt", "value": "The area features a large red double-decker
        bus with a vibrant advertisement displayed on its side."},
         ...
      ]
    }
                      
detections.json (Object detections): The file contains 86k conversations about object detections generated by Arcana's data engine. Each element is formatted as follows:
    {
      "id": "image_id",
      "image": "image_path",,
      "conversations":[
        { "from": "human", "value": "<image>\nDetect the category encompasses
        the region defined by the coordinates [0.324, 0.523, 0.338, 0.538]?" },
        {"from": "gpt", "value": "According to the taxonomy, this region is 
        categorized as a window."},
        ...
      ]
    }
                      
ocrs.json (OCR annotations): The file contains 13k conversations about OCR content in images generated by Arcana's data engine. Each element is formatted as follows:
    {
      "id": "image_id",
      "image": "image_path",,
      "conversations":[
        { "from": "human", "value": "<image>\nProvide the text in 
        the area bounded by [0.142, 0.863, 0.225, 0.892]." },
        { "from": "gpt", "value": "The text is 2726B2V."},
        { "from": "human", "value": "Tell me where the text 2726B2V is in
         this picture." },
        { "from": "gpt", "value": "The text is in [0.142, 0.863, 0.225, 0.892]." }
      ]
    }
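
Since all four files share the same conversation layout, one loader covers them. The sketch below assumes each .json file stores a list of the records shown above (adjust if the release ships JSON Lines instead) and that the bracketed boxes are normalized [x1, y1, x2, y2] coordinates; the path in the usage example is a placeholder.

import json
import re

# Minimal loader sketch for the Arcana_VG-COCO files.
BOX_PATTERN = re.compile(r"\[([\d.]+),\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)\]")


def load_conversations(path):
    """Return the list of {id, image, conversations} records from one file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def extract_boxes(text):
    """Pull the bracketed boxes out of a prompt or answer, assuming they are
    normalized [x1, y1, x2, y2] coordinates."""
    return [tuple(float(v) for v in m.groups()) for m in BOX_PATTERN.finditer(text)]


def to_pixel_box(box, width, height):
    """Convert a normalized box to pixel coordinates for visualization."""
    x1, y1, x2, y2 = box
    return (x1 * width, y1 * height, x2 * width, y2 * height)


if __name__ == "__main__":
    # Placeholder path: point this at the downloaded region_captions.json.
    for record in load_conversations("region_captions.json")[:1]:
        for turn in record["conversations"]:
            print(turn["from"], extract_boxes(turn["value"]) or turn["value"][:60])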
                      

Model

On the structural side, the language-driven decoder couples visual and language modalities within the same space, disregarding their unique characteristics and potentially causing information confusion or blurring. Furthermore, the frozen visual encoder cannot provide robust visual features, and directly fine-tuning it with a small dataset can affect its generalization capabilities. Toward this end, Arcana introduces MM-LoRA, which constructs a multimodal decoder to preserve the unique characteristics of different modalities. We also propose a Query Ladder adapter (QLadder) for the visual encoder, which retains the pre-trained image encoder's capabilities while introducing a small number of visual tokens to significantly enhance the model's ability to learn and represent visual information.
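
A minimal PyTorch sketch of the MM-LoRA idea, assuming each decoder linear layer also receives a boolean mask marking which tokens are visual: a shared frozen projection plus separate low-rank branches for visual and language tokens, with the total rank split according to the β + γ = 1 convention described below. The class name, arguments, and initialization are illustrative assumptions, not Arcana's implementation.

import torch
import torch.nn as nn


class MMLoRALinear(nn.Module):
    """Illustrative MM-LoRA layer (not Arcana's code): a frozen base projection
    plus separate low-rank branches for visual and language tokens."""

    def __init__(self, base: nn.Linear, rank: int = 16, beta: float = 0.5,
                 alpha: float = 32.0):
        super().__init__()
        self.base = base
        for param in self.base.parameters():
            param.requires_grad = False  # the pretrained decoder weight stays frozen

        # Split the total rank between the two modalities: r_v = beta * r and
        # r_l = gamma * r with beta + gamma = 1.
        r_v = max(1, int(round(beta * rank)))
        r_l = max(1, rank - r_v)
        d_in, d_out = base.in_features, base.out_features
        self.scale = alpha / rank  # shared LoRA scaling factor (a simplification)

        self.vis_down = nn.Linear(d_in, r_v, bias=False)
        self.vis_up = nn.Linear(r_v, d_out, bias=False)
        self.txt_down = nn.Linear(d_in, r_l, bias=False)
        self.txt_up = nn.Linear(r_l, d_out, bias=False)
        nn.init.zeros_(self.vis_up.weight)  # adapters start as a zero update
        nn.init.zeros_(self.txt_up.weight)

    def forward(self, x: torch.Tensor, visual_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in); visual_mask: (batch, seq) bool, True for image tokens.
        out = self.base(x)
        delta_vis = self.vis_up(self.vis_down(x)) * self.scale
        delta_txt = self.txt_up(self.txt_down(x)) * self.scale
        routed = torch.where(visual_mask.unsqueeze(-1), delta_vis, delta_txt)
        return out + routed

Because β + γ = 1, the two branches together add roughly the same number of trainable parameters as a single LoRA of rank r, matching the parameter budget noted in the MM-LoRA figure caption below.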


(a) The architecture of Arcana.
(b) The training pipeline of Arcana. MM-LoRA is optional during the pre-training phase.


(a) The framework of MM-LoRA vs. LoRA. MM-LoRA introduces two new hyperparameters, β and γ, to control the ranks of the visual and language LoRAs, respectively. Notably, we set β + γ = 1 to ensure that MM-LoRA has the same number of parameters as LoRA.
(b) The architecture of the visual encoder includes the QLadder adapter and CLIP. The QLadder adapter consists of cross-attention and FFN layers, with weights initialized from those of CLIP.
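
A minimal sketch of the QLadder adapter as described above: a small set of learnable query tokens repeatedly cross-attends to features from the frozen CLIP image encoder through cross-attention and FFN blocks. The dimensions, number of queries, and which CLIP layers feed each block are assumptions, and the CLIP-weight initialization mentioned in the caption is omitted for brevity.

import torch
import torch.nn as nn


class QLadderBlock(nn.Module):
    """One ladder step: cross-attention from the queries to a CLIP feature map,
    followed by an FFN (illustrative; Arcana initializes these from CLIP weights)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, queries, clip_feats):
        q = self.norm_q(queries)
        kv = self.norm_kv(clip_feats)
        attn_out, _ = self.cross_attn(q, kv, kv)
        queries = queries + attn_out
        return queries + self.ffn(queries)


class QLadder(nn.Module):
    """Learnable query tokens climbing a ladder over frozen CLIP layer outputs."""

    def __init__(self, dim: int, num_queries: int = 32, num_layers: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.blocks = nn.ModuleList(QLadderBlock(dim) for _ in range(num_layers))

    def forward(self, clip_layer_feats):
        # clip_layer_feats: list of (batch, num_patches, dim) tensors taken from
        # several frozen CLIP layers (which layers to tap is an assumption here).
        q = self.queries.expand(clip_layer_feats[0].size(0), -1, -1)
        for block, feats in zip(self.blocks, clip_layer_feats):
            q = block(q, feats)
        return q

The returned query tokens would then be appended to the frozen CLIP output tokens before the multimodal projector, so the pre-trained encoder's features are left untouched.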

Experiments

Our Arcana model achieves competitive performance among existing 7B models. To validate the effectiveness of QLadder and MM-LoRA, we designed a series of experiments. Additionally, to ensure fairness, we use only LLaVA-v1.5 data for the ablation experiments.
