Arcana: Improving Multi-modal Large Language Model through Boosting Vision Capabilities

1Nanjing University of Science and Technology,    2Baidu VIS,
3Huazhong University of Science and Technology
*Indicates Equal Contribution   ✉Indicates Corresponding Author

Motivation

The visual perception capabilities of MLLMs directly impact their performance. It is well known that the main factors influencing MLLMs' visual perception are data and structure. Arcana aims to enhance the visual perception capabilities of MLLMs by addressing both aspects.


Sampled VQA examples involving color, quantity, small objects, and localization tasks, showcasing the importance of visual recognition capabilities for multimodal large language models (MLLMs).

High-quality and Fine-grained Instruction Dataset

On the data side, open-source multimodal data is scarce, and the available datasets contain limited visual components, preventing MLLMs from gaining sufficient visual perception capabilities from these sources. To this end, we developed a data engine that annotates multimodal data while ensuring a diversity of visual factors. The specific process is as follows:


The Arcana data engine involves three crucial steps: (1) augmenting and filtering basic image annotations, (2) obtaining text information and a description for each annotated region, and (3) using a large language model to integrate these visual annotations into different types of captions.
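
As a rough illustration of how these three steps compose, the sketch below wires them together as plain Python callables. The detect, keep, describe, and compose functions are hypothetical placeholders supplied by the caller, not part of the released engine.

# Hypothetical sketch of the three-step data engine above; the detect, keep,
# describe, and compose callables are placeholders, not the released pipeline.

def build_annotations(image_paths, detect, keep, describe, compose):
    samples = []
    for path in image_paths:
        # (1) augment basic image annotations (boxes, categories, OCR) and filter noisy ones.
        regions = [region for region in detect(path) if keep(region)]
        # (2) obtain text information and a description for each annotated region.
        for region in regions:
            region["description"] = describe(path, region["box"])
        # (3) use a large language model to integrate the visual annotations
        #     into image-level and region-level captions.
        samples.append({"image": path, "regions": regions, "captions": compose(regions)})
    return samples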

Arcana_VG-COCO

The Arcana_VG-COCO release contains the following annotation files; a minimal loading sketch is given after the file descriptions.

captions.json (Image-level captions): The file contains 86k conversations with image-level detailed/short captions generated by Arcana's data engine. Each element is formatted as follows:
    {
      "id": "image_id",
      "image": "image_path",,
      "conversations":[
        { "from": "human", "value": "<image>\nProvide a detailed description
        of the following image" },
        { "from": "gpt", "value": "The scene depicts an airport setting with a 
        sizable white and green airplane, likely a jumbo jet, stationed on the
        runway. The airplane showcases a color scheme of red and green,
        highlighting its unique look."}
      ]
    }
                      
region_captions.json (Region-level captions): The file contains 86k conversations with region-level captions generated by Arcana's data engine. Each element is formatted as follows:
    {
      "id": "image_id",
      "image": "image_path",,
      "conversations":[
        { "from": "human", "value": "<image>\nProvide a comprehensive
        depiction of the area bounded by [0.286, 0.342, 0.658, 0.484]." },
        {"from": "gpt", "value": "The area features a large red double-decker
        bus with a vibrant advertisement displayed on its side."},
         ...
      ]
    }
                      
detections.json (Object detections): The file contains 86k conversations about object detections generated by Arcana's data engine. Each element is formatted as follows:
    {
      "id": "image_id",
      "image": "image_path",,
      "conversations":[
        { "from": "human", "value": "<image>\nDetect the category encompasses
        the region defined by the coordinates [0.324, 0.523, 0.338, 0.538]?" },
        {"from": "gpt", "value": "According to the taxonomy, this region is 
        categorized as a window."},
        ...
      ]
    }
                      
ocrs.json (OCR annotations): The file contains 13k conversations about OCR content in images generated by Arcana's data engine. Each element is formatted as follows:
    {
      "id": "image_id",
      "image": "image_path",,
      "conversations":[
        { "from": "human", "value": "<image>\nProvide the text in 
        the area bounded by [0.142, 0.863, 0.225, 0.892]." },
        { "from": "gpt", "value": "The text is 2726B2V."},
        { "from": "human", "value": "Tell me where the text 2726B2V is in
         this picture." },
        { "from": "gpt", "value": "The text is in [0.142, 0.863, 0.225, 0.892]." }
      ]
    }
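
Since all four files share the same conversation layout, one loader covers them. The sketch below assumes each .json file stores a list of the records shown above (adjust if the release ships JSON Lines instead) and that the bracketed boxes are normalized [x1, y1, x2, y2] coordinates; the path in the usage example is a placeholder.

import json
import re

# Minimal loader sketch for the Arcana_VG-COCO files.
BOX_PATTERN = re.compile(r"\[([\d.]+),\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)\]")


def load_conversations(path):
    """Return the list of {id, image, conversations} records from one file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def extract_boxes(text):
    """Pull the bracketed boxes out of a prompt or answer, assuming they are
    normalized [x1, y1, x2, y2] coordinates."""
    return [tuple(float(v) for v in m.groups()) for m in BOX_PATTERN.finditer(text)]


def to_pixel_box(box, width, height):
    """Convert a normalized box to pixel coordinates for visualization."""
    x1, y1, x2, y2 = box
    return (x1 * width, y1 * height, x2 * width, y2 * height)


if __name__ == "__main__":
    # Placeholder path: point this at the downloaded region_captions.json.
    for record in load_conversations("region_captions.json")[:1]:
        for turn in record["conversations"]:
            print(turn["from"], extract_boxes(turn["value"]) or turn["value"][:60])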
                      

Model

On the structural side, the language-driven decoder couples visual and language modalities within the same space, disregarding their unique characteristics and potentially causing information confusion or blurring. Furthermore, the frozen visual encoder cannot provide robust visual features, and directly fine-tuning it with a small dataset can affect its generalization capabilities. Toward this end, Arcana introduces MM-LoRA, which constructs a multimodal decoder to preserve the unique characteristics of different modalities. We also propose a Query Ladder adapter (QLadder) for the visual encoder, which retains the pre-trained image encoder's capabilities while introducing a small number of visual tokens to significantly enhance the model's ability to learn and represent visual information.
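
A minimal PyTorch sketch of the MM-LoRA idea, assuming each decoder linear layer also receives a boolean mask marking which tokens are visual: a shared frozen projection plus separate low-rank branches for visual and language tokens, with the total rank split according to the β + γ = 1 convention described below. The class name, arguments, and initialization are illustrative assumptions, not Arcana's implementation.

import torch
import torch.nn as nn


class MMLoRALinear(nn.Module):
    """Illustrative MM-LoRA layer (not Arcana's code): a frozen base projection
    plus separate low-rank branches for visual and language tokens."""

    def __init__(self, base: nn.Linear, rank: int = 16, beta: float = 0.5,
                 alpha: float = 32.0):
        super().__init__()
        self.base = base
        for param in self.base.parameters():
            param.requires_grad = False  # the pretrained decoder weight stays frozen

        # Split the total rank between the two modalities: r_v = beta * r and
        # r_l = gamma * r with beta + gamma = 1.
        r_v = max(1, int(round(beta * rank)))
        r_l = max(1, rank - r_v)
        d_in, d_out = base.in_features, base.out_features
        self.scale = alpha / rank  # shared LoRA scaling factor (a simplification)

        self.vis_down = nn.Linear(d_in, r_v, bias=False)
        self.vis_up = nn.Linear(r_v, d_out, bias=False)
        self.txt_down = nn.Linear(d_in, r_l, bias=False)
        self.txt_up = nn.Linear(r_l, d_out, bias=False)
        nn.init.zeros_(self.vis_up.weight)  # adapters start as a zero update
        nn.init.zeros_(self.txt_up.weight)

    def forward(self, x: torch.Tensor, visual_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in); visual_mask: (batch, seq) bool, True for image tokens.
        out = self.base(x)
        delta_vis = self.vis_up(self.vis_down(x)) * self.scale
        delta_txt = self.txt_up(self.txt_down(x)) * self.scale
        routed = torch.where(visual_mask.unsqueeze(-1), delta_vis, delta_txt)
        return out + routed

Because β + γ = 1, the two branches together add roughly the same number of trainable parameters as a single LoRA of rank r, matching the parameter budget noted in the MM-LoRA figure caption below.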


(a) The architecture of Arcana.
(b) The training pipeline of Arcana. MM-LoRA is optional during the pre-training phase.


(a) The framework of MM-LoRA vs. LoRA. MM-LoRA introduces two new hyperparameters, β and γ, to control the ranks of the visual and language LoRAs, respectively. Notably, we set β + γ = 1 to ensure that MM-LoRA has the same number of parameters as LoRA.
(b) The architecture of the visual encoder includes the QLadder adapter and CLIP. The QLadder adapter consists of cross-attention and FFN layers, with weights initialized from those of CLIP.
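
A minimal sketch of the QLadder adapter as described above: a small set of learnable query tokens repeatedly cross-attends to features from the frozen CLIP image encoder through cross-attention and FFN blocks. The dimensions, number of queries, and which CLIP layers feed each block are assumptions, and the CLIP-weight initialization mentioned in the caption is omitted for brevity.

import torch
import torch.nn as nn


class QLadderBlock(nn.Module):
    """One ladder step: cross-attention from the queries to a CLIP feature map,
    followed by an FFN (illustrative; Arcana initializes these from CLIP weights)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, queries, clip_feats):
        q = self.norm_q(queries)
        kv = self.norm_kv(clip_feats)
        attn_out, _ = self.cross_attn(q, kv, kv)
        queries = queries + attn_out
        return queries + self.ffn(queries)


class QLadder(nn.Module):
    """Learnable query tokens climbing a ladder over frozen CLIP layer outputs."""

    def __init__(self, dim: int, num_queries: int = 32, num_layers: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.blocks = nn.ModuleList(QLadderBlock(dim) for _ in range(num_layers))

    def forward(self, clip_layer_feats):
        # clip_layer_feats: list of (batch, num_patches, dim) tensors taken from
        # several frozen CLIP layers (which layers to tap is an assumption here).
        q = self.queries.expand(clip_layer_feats[0].size(0), -1, -1)
        for block, feats in zip(self.blocks, clip_layer_feats):
            q = block(q, feats)
        return q

The returned query tokens would then be appended to the frozen CLIP output tokens before the multimodal projector, so the pre-trained encoder's features are left untouched.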

Experiments

Our Arcana model achieves competitive performance among existing 7B models. To validate the effectiveness of QLadder and MM-LoRA, we designed a series of experiments. Additionally, to ensure fairness, we use only LLaVA-v1.5 data for the ablation experiments.
