Published: Jul 4, 2025 by Boammani Aser Lompo
In this short tutorial, we'll walk through how to fine-tune a Vision-Language Model (VLM) on a custom dataset.
Our case study will use Qwen2.5-VL-7B-Instruct, a relatively compact VLM, which we'll fine-tune on the Visual-TableQA dataset.
Visual-TableQA is a recently introduced benchmark designed for reasoning over visual tables, and although modest in size, with about 7k training samples, it provides a rich and challenging setting for experimentation.
Setup
First, let's install the required libraries and import them into our environment:
!pip install torch==2.6.0 torchvision torchaudio
!pip install https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.3.12/flash_attn-2.8.0+cu124torch2.6-cp310-cp310-linux_x86_64.whl
!pip install -U "trl>=0.9.6" "transformers>=4.42" "peft>=0.12.0" "accelerate>=0.33.0" "bitsandbytes>=0.43.3" "datasets>=2.18" qwen-vl-utils pillow
import os, torch
from datasets import load_dataset
import torch.nn.functional as F
from transformers import (AutoProcessor, AutoTokenizer, BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor, EarlyStoppingCallback)
from qwen_vl_utils import process_vision_info
from peft import LoraConfig, get_peft_model, PeftModel
from trl import SFTConfig, SFTTrainer
from PIL import Image, ImageOps
import re
use_bf16 = torch.cuda.is_bf16_supported()
Next, let's download the dataset directly from Hugging Face:
DATASET_REPO = "AI-4-Everyone/Visual-TableQA"
ds = load_dataset(DATASET_REPO)
train = ds.get("train")
evald = ds.get("validation")
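Before going further, it helps to peek at one example. Each sample provides a rendered table image together with a question and a reference answer; the image, question, and answer fields are the ones we rely on throughout this tutorial:
print(len(train), len(evald))   # number of training / validation samples
sample = train[0]
print(sample["question"])       # the question asked about the table
print(sample["answer"])         # the reference answer
sample["image"]                 # the table image (rendered inline in a notebook)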
Preprocessing
Now we can move on to data preprocessing.
This step focuses on formatting the dataset properly before feeding it into the model.
system_message = """You are a Vision Language Model specialized in interpreting visual data from charts and diagrams images.
Answer the questions strictly from the image, with clear, rigorous step-by-step justification. Stay concise, but include all reasoning thatās relevant."""
def to_pil(img):
    return img if isinstance(img, Image.Image) else Image.fromarray(img)
def format_data(sample):
    return [
        {
            "role": "system",
            "content": [{"type": "text", "text": system_message}],
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "image": to_pil(sample["image"])},
                {"type": "text", "text": sample["question"]},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": sample["answer"]}],
        },
    ]
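Formatting a sample therefore yields a three-turn conversation (system, user with the image and question, assistant with the answer), which is the chat structure the processor expects:
msgs = format_data(train[0])
print([m["role"] for m in msgs])   # ['system', 'user', 'assistant']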
def collate_fn(examples):
    msgs = [format_data(sample) for sample in examples]
    # The full conversation (including the assistant answer) is serialized, so no generation prompt is appended
    texts = [processor.apply_chat_template(m, tokenize=False, add_generation_prompt=False) for m in msgs]
    image_inputs, _ = process_vision_info(msgs)
    batch = processor(text=texts, images=image_inputs, return_tensors="pt", padding=True, truncation=False, max_length=None)
    labels = batch["input_ids"].detach().clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    # collect image/video token ids in a model-agnostic way
    img_ids = set()
    for attr in ("image_token_id", "video_token_id"):
        v = getattr(model.config, attr, None)
        if v is not None:
            img_ids.add(v)
    # fallback to common special tokens if present
    for tok in ("<image>", "<img>", "<video>", "<|vision_start|>", "<|vision_end|>"):
        tid = processor.tokenizer.convert_tokens_to_ids(tok)
        if tid not in (None, -1, processor.tokenizer.unk_token_id):
            img_ids.add(tid)
    if img_ids:
        mask = torch.isin(labels, torch.tensor(sorted(img_ids), dtype=labels.dtype, device=labels.device))
        labels[mask] = -100
    batch["labels"] = labels
    return batch
Tier A
Next, we begin the first phase of fine-tuning by splitting the model parameters into two groups: Tier A and Tier B.
- Tier A includes all parameters except those in the Vision Tower.
- Tier B corresponds to the Vision Tower parameters. Since these layers are highly sensitive, especially when working with a relatively small dataset like Visual-TableQA, we avoid modifying them directly. Instead, we only update the attention projections in the last four vision blocks, leaving the rest of the Vision Tower untouched.
For fine-tuning, we use Low-Rank Adaptation (LoRA). LoRA keeps the original model weights frozen while introducing lightweight low-rank adapters that guide the model towards better task performance.
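As a rough illustration of the idea (a toy sketch, not the actual PEFT internals, and with made-up dimensions), an adapted layer keeps its frozen weight W and adds a low-rank correction built from two small trainable matrices:
# Toy LoRA sketch: dimensions are illustrative, not the model's real sizes
d_in, d_out, r, alpha = 1024, 1024, 16, 8
W = torch.randn(d_out, d_in)            # frozen base weight
A = torch.randn(r, d_in) * 0.01         # trainable down-projection
B = torch.zeros(d_out, r)               # trainable up-projection, zero-initialised
W_adapted = W + (alpha / r) * (B @ A)   # effective weight used by the adapted layer
print(A.numel() + B.numel(), "trainable params vs", W.numel(), "frozen")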
MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"
OUTPUT_DIR_TIER_A = "qwen-vl-sft-lora-tableqa-tierA"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
MODEL_ID, device_map="auto", torch_dtype=torch.bfloat16 if use_bf16 else torch.float16,
attn_implementation="flash_attention_2", low_cpu_mem_usage=True,)
model.config.use_cache = False
model.gradient_checkpointing_enable()
model.config.pretraining_tp = 1
min_pixels = 256*28*28   # lower bound on the resized image area (roughly, each 28x28 pixel block maps to one visual token)
max_pixels = 2560*28*28  # upper bound, so very large table images get downscaled
processor = AutoProcessor.from_pretrained(MODEL_ID, min_pixels=min_pixels, max_pixels=max_pixels)
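With the model and processor loaded, we can optionally sanity-check the collate_fn from the preprocessing step on a couple of samples and confirm that padding and vision tokens are excluded from the loss:
probe = collate_fn([train[0], train[1]])
print(probe["input_ids"].shape)                          # (2, padded_seq_len)
print((probe["labels"] == -100).float().mean().item())   # fraction of positions ignored by the loss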
TARGETS = ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "multi_modal_projector"]
r, lora_alpha = 16, 8
peft_cfg = LoraConfig(r=r, lora_alpha=lora_alpha, lora_dropout=0.05, bias="none", target_modules=TARGETS, task_type="CAUSAL_LM")
peft_model = get_peft_model(model, peft_cfg)
Let's print the number of trainable parameters:
peft_model.print_trainable_parameters()
trainable params: 47,589,376 || all params: 8,339,756,032 || trainable%: 0.5706
Now we can launch the training phase.
For this, we'll rely on the TRL library's training manager, which streamlines the entire process.
All we need to do is specify the training arguments; TRL will take care of the rest.
args = SFTConfig(
output_dir=OUTPUT_DIR_TIER_A, num_train_epochs=2,
# ---- batch / memory ----
per_device_train_batch_size=3, per_device_eval_batch_size=4,
gradient_accumulation_steps=8, dataloader_pin_memory=True, # global batch ~= 24
# ---- stability & speed ----
gradient_checkpointing=True, gradient_checkpointing_kwargs={"use_reentrant": False},
bf16=use_bf16, fp16=not use_bf16, tf32=True,
# ---- optimization (LoRA-friendly) ----
learning_rate=1e-4, lr_scheduler_type="cosine", warmup_ratio=0.07, max_grad_norm=1.0, weight_decay=0.01, optim="adamw_torch_fused",
adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-8,
# ---- logging / eval / save ----
logging_steps=20, eval_strategy="steps", eval_steps=100,
save_strategy="steps", save_steps=100, load_best_model_at_end=True,
metric_for_best_model="eval_loss", greater_is_better=False, save_total_limit=3,
remove_unused_columns=False, dataset_kwargs={"skip_prepare_dataset": True},
dataset_text_field="",  # keep TRL from looking for "text"
)
callbacks = [EarlyStoppingCallback(early_stopping_patience=2, early_stopping_threshold=0.0)]
trainer = SFTTrainer(
model=peft_model, args=args, train_dataset=train, eval_dataset=evald, data_collator=collate_fn, callbacks=callbacks,
)
trainer.train()
trainer.save_model(OUTPUT_DIR_TIER_A) # saves the adapters
processor.save_pretrained(OUTPUT_DIR_TIER_A)
Before wrapping up this phase, we need to merge the newly trained adapters back into the base model weights.
MERGED_DIR = "qwen-vl-merged-tierA-tableqa"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
MODEL_ID, device_map="auto",
torch_dtype=torch.bfloat16 if use_bf16 else torch.float16, attn_implementation="flash_attention_2",
low_cpu_mem_usage=True)
peft_model = PeftModel.from_pretrained(model, OUTPUT_DIR_TIER_A, is_trainable=False)
merged = peft_model.merge_and_unload()
merged.save_pretrained(MERGED_DIR)
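If you plan to reload the merged checkpoint in a fresh session, it is convenient (optional; in this notebook we keep using the in-memory processor) to also save the processor next to it:
processor.save_pretrained(MERGED_DIR)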
Tier B
Now we move on to the second phase of training.
We reload the merged model, gather the Tier B parameters, and define a new set of arguments to pass to the trainer.
OUTPUT_DIR_TIER_B = "qwen-vl-sft-lora-tableqa-tierB"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
MERGED_DIR, device_map="auto", torch_dtype="auto",
attn_implementation="flash_attention_2", low_cpu_mem_usage=True)
model.config.use_cache = False
model.gradient_checkpointing_enable()
model.config.pretraining_tp = 1
tierB_cfg = LoraConfig(
r=8, lora_alpha=32, lora_dropout=0.10, bias="none", task_type="CAUSAL_LM",
target_modules=["attn.qkv", "attn.proj"],
)
peft_model = get_peft_model(model, tierB_cfg)
for _, p in peft_model.named_parameters():
    p.requires_grad_(False)
# capture indices from your paths: model.visual.blocks.<idx>.
layer_re = re.compile(r"model\.visual\.blocks\.(\d+)\.")
# discover how many blocks the vision tower has
vision_layers = sorted({int(m.group(1)) for n, _ in peft_model.named_modules()
                        if (m := layer_re.search(n))})
N = 4  # tune last 4; adjust 2-6 if needed
lastN = set(vision_layers[-N:])
def is_lastN_vision_attn_lora(name: str) -> bool:
    if "lora_" not in name:
        return False
    m = layer_re.search(name)
    if not m:
        return False
    idx = int(m.group(1))
    if idx not in lastN:
        return False
    # only attention projections
    return ("attn.qkv" in name) or ("attn.proj" in name)
for n, p in peft_model.named_parameters():
    if is_lastN_vision_attn_lora(n):
        p.requires_grad_(True)

# (Optional) keep the multimodal projector from Tier A trainable a bit more
for n, p in peft_model.named_parameters():
    if "multi_modal_projector" in n and ("lora_" in n) and ("default" in n):
        p.requires_grad_(True)
peft_model.print_trainable_parameters()
trainable params: 245,760 || all params: 8,294,132,736 || trainable%: 0.0030
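Only about 0.003% of the weights remain trainable. To confirm that these are exactly the attention adapters in the last four vision blocks (plus, optionally, the projector), we can list them:
for n, p in peft_model.named_parameters():
    if p.requires_grad:
        print(n)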
args_tierB = SFTConfig(
output_dir=OUTPUT_DIR_TIER_B, num_train_epochs=1,
per_device_train_batch_size=3, per_device_eval_batch_size=4, gradient_accumulation_steps=8, gradient_checkpointing=True,
gradient_checkpointing_kwargs={"use_reentrant": False}, bf16=use_bf16, fp16=not use_bf16, tf32=True,
learning_rate=2e-5,  # reduced as a safety measure for the sensitive vision layers
lr_scheduler_type="cosine", warmup_ratio=0.07, max_grad_norm=1.0, weight_decay=0.01,
optim="adamw_torch_fused", logging_steps=20, eval_strategy="steps", eval_steps=100,
save_strategy="steps", save_steps=100, load_best_model_at_end=True,
metric_for_best_model="eval_loss", greater_is_better=False, save_total_limit=3,
remove_unused_columns=False, dataset_kwargs={"skip_prepare_dataset": True}, dataset_text_field="",
)
callbacks = [EarlyStoppingCallback(early_stopping_patience=2)]
trainer = SFTTrainer(
model=peft_model, args=args_tierB, train_dataset=train, eval_dataset=evald,
data_collator=collate_fn, processing_class=processor, callbacks=callbacks)
trainer.train()
trainer.save_model(OUTPUT_DIR_TIER_B) # saves the adapters
processor.save_pretrained(OUTPUT_DIR_TIER_B)
Final Model Loading and Inference
base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
MERGED_DIR, device_map="auto", torch_dtype="auto",
attn_implementation="flash_attention_2", low_cpu_mem_usage=True
)
model = PeftModel.from_pretrained(base, OUTPUT_DIR_TIER_B)
del base
torch.cuda.empty_cache()
model.eval()
def generate_text_from_sample(model, sample, max_new_tokens=5000, device="cuda"):
    # Keep only the system and user turns; the assistant answer is what we want to generate
    msgs = format_data(sample)[:2]
    # Prepare the text input by applying the chat template
    text_input = processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    # Process the visual input from the sample
    image_inputs, _ = process_vision_info(msgs)
    # Prepare the inputs for the model and move them to the specified device
    model_inputs = processor(text=[text_input], images=image_inputs, return_tensors="pt").to(device)
    # Generate text with the model
    generated_ids = model.generate(**model_inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Trim the generated ids to remove the input ids
    trimmed_generated_ids = [out_ids[len(in_ids):] for in_ids, out_ids in zip(model_inputs.input_ids, generated_ids)]
    # Decode the output text
    output_text = processor.batch_decode(
        trimmed_generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    return output_text[0]  # Return the first decoded output text
output = generate_text_from_sample(model, train[0])
output
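Finally, we can compare the generated answer against the dataset's reference answer for the same sample:
print("Question:", train[0]["question"])
print("Reference answer:", train[0]["answer"])
print("Model output:", output)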