[ACM MM 2025] InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing


    ¹National Taiwan University, ²National Taiwan Normal University

    Teaser

    InstructFLIP: a unified instruction-tuned framework that leverages vision-language models and a meta-domain strategy to achieve efficient face anti-spoofing generalization without redundant cross-domain training.

    Abstract

    Face anti-spoofing (FAS) aims to construct a robust system that can withstand diverse attacks. While recent efforts have concentrated mainly on cross-domain generalization, two significant challenges persist: limited semantic understanding of attack types and redundancy across domains. We address the first by integrating vision-language models (VLMs) to enhance the perception of visual input. For the second challenge, we employ a meta-domain strategy to learn a unified model that generalizes well across multiple domains. Our proposed InstructFLIP is a novel instruction-tuned framework that leverages VLMs to enhance generalization via textual guidance trained solely on a single domain. At its core, InstructFLIP explicitly decouples instructions into content and style components, where content-based instructions focus on the essential semantics of spoofing, and style-based instructions consider variations related to the environment and camera characteristics. Extensive experiments demonstrate the effectiveness of InstructFLIP by outperforming SOTA models in accuracy and substantially reducing training redundancy across diverse domains in FAS.

    Pipeline


    Overview of the proposed InstructFLIP framework for FAS.

    We first encode the input image x, obtaining the separated content and style features f_c and f_s. The dual-branch architecture presented in (a) and (b) performs instruction tuning according to the corresponding expertise of each branch. Finally, the prediction of whether x is a spoof is carried out in (c) by a classifier 𝒞, based on the fused queries and f_c, coupled with the cue map generated by 𝒢.
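    The sketch below illustrates this data flow in PyTorch. It is a minimal sketch only: the module names, shapes, and the simple attention-based fusion are illustrative assumptions, and the language-model side of instruction tuning is omitted, so it is not the released implementation.

    import torch
    import torch.nn as nn

    class InstructFLIPSketch(nn.Module):
        """Illustrative data-flow sketch only; not the official InstructFLIP code."""

        def __init__(self, dim=768, num_queries=32):
            super().__init__()
            # Toy patch embedding standing in for the image encoder of x.
            self.encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
            # Projections that separate content features f_c and style features f_s.
            self.content_proj = nn.Linear(dim, dim)
            self.style_proj = nn.Linear(dim, dim)
            # Learnable queries for the content branch (a) and style branch (b).
            self.content_queries = nn.Parameter(torch.randn(num_queries, dim))
            self.style_queries = nn.Parameter(torch.randn(num_queries, dim))
            # Fuses the branch queries with the content features.
            self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            # G: generates a per-patch cue map; C: final live/spoof classifier.
            self.cue_generator = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))
            self.classifier = nn.Linear(dim, 2)

        def forward(self, x):
            feats = self.encoder(x).flatten(2).transpose(1, 2)   # (B, N, dim) patch tokens
            f_c = self.content_proj(feats)                       # content features f_c
            f_s = self.style_proj(feats)                         # style features f_s (consumed by branch (b))
            queries = torch.cat([self.content_queries, self.style_queries], dim=0)
            queries = queries.unsqueeze(0).expand(x.size(0), -1, -1)
            fused, _ = self.fuse(queries, f_c, f_c)              # fused queries attend to f_c
            cue_map = self.cue_generator(f_c)                    # cue map from G, one score per patch
            logits = self.classifier(fused.mean(dim=1) + f_c.mean(dim=1))  # classifier C
            return logits, cue_map, f_s

    # Example: a 224x224 input yields (2, 2) logits and a (2, 196, 1) cue map.
    logits, cue_map, _ = InstructFLIPSketch()(torch.randn(2, 3, 224, 224))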

    Quantitative Results


    Unified FAS benchmark results on MCIO and WCS datasets.

    InstructFLIP consistently outperforms existing FAS methods across all metrics and datasets in the unified MCIO and WCS benchmarks. It achieves the lowest HTER and the highest AUC and TPR@FPR=1%, reducing HTER by up to 47% relative to the prior SOTA (CFPL). These gains demonstrate InstructFLIP’s ability to robustly distinguish live from spoofed faces while maintaining high accuracy and low false positives. Performance on the CASIA-CeFA (C) dataset is comparatively modest, pointing to room for improved sensitivity to cultural and environmental subtleties. Overall, subtable (c) confirms InstructFLIP as a balanced and generalizable solution for real-world FAS applications.

    * The CA dataset is used for training, with evaluation conducted on both MCIO and WCS benchmarks. Subtable (a) presents results on MCIO, (b) reports on WCS, and (c) summarizes average performance across all datasets. The best and second-best results are highlighted in red and blue, respectively.
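    For reference, the sketch below shows one common way to compute the three reported metrics (HTER, AUC, TPR@FPR=1%) from liveness scores. The thresholding policy used here (the EER threshold on the evaluation scores) is an assumption for illustration and may differ from the exact protocol used in the paper.

    import numpy as np
    from sklearn.metrics import roc_auc_score, roc_curve

    def fas_metrics(scores, labels):
        """scores: higher = more likely live; labels: 1 = live (bona fide), 0 = spoof (attack)."""
        fpr, tpr, thresholds = roc_curve(labels, scores)
        auc = roc_auc_score(labels, scores)

        # TPR at FPR = 1%, interpolated along the ROC curve.
        tpr_at_fpr1 = float(np.interp(0.01, fpr, tpr))

        # HTER = (FAR + FRR) / 2 at a fixed threshold; the EER threshold is used here as an example
        # (in practice the threshold is usually fixed on a development set).
        eer_idx = int(np.argmin(np.abs(fpr - (1 - tpr))))
        thr = thresholds[eer_idx]
        far = np.mean(scores[labels == 0] >= thr)   # spoof samples accepted as live
        frr = np.mean(scores[labels == 1] < thr)    # live samples rejected as spoof
        hter = (far + frr) / 2
        return hter, auc, tpr_at_fpr1

    # Toy usage with synthetic scores.
    rng = np.random.default_rng(0)
    scores = np.concatenate([rng.normal(1.0, 0.5, 500), rng.normal(-1.0, 0.5, 500)])
    labels = np.concatenate([np.ones(500, dtype=int), np.zeros(500, dtype=int)])
    print(fas_metrics(scores, labels))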

    Qualitative Results


    Illustration of samples predicted by the proposed method.

    Figures (a) and (b) show successful predictions from InstructFLIP, where the model accurately distinguishes live from spoof faces with confident fake scores. It also correctly infers style attributes such as lighting, environment, and camera quality, highlighting strong generalization to diverse domains. In contrast, Figures (c) and (d) illustrate failure cases: the model misclassifies a spoofed poster as a real face due to misleading texture and gloss, and incorrectly labels a real face as a spoofed PC screen, likely due to overfitting to reflective patterns. These errors emphasize the challenge of capturing fine-grained material cues and image sharpness.

    * Red indicates incorrect answers.


    Comparison with Open VLMs.

    We compare InstructFLIP with InstructBLIP and GPT-4o using content and style-based instructions on both spoofed and live images. InstructFLIP consistently provides accurate predictions, identifying spoof types like Pad screens and correctly recognizing real faces, while also handling style attributes such as lighting and environment effectively. In contrast, InstructBLIP often misclassifies spoofed samples and struggles with complex style cues. GPT-4o avoids making explicit judgments, offering general suggestions instead, which limits its applicability. These results highlight InstructFLIP’s superior contextual understanding and task adaptability among open VLMs.

    * Red indicates incorrect answers and gray represents indirect or ambiguous responses.
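    For context, the snippet below sketches what a content-based and a style-based question/answer pair might look like when prompting a VLM in this comparison. The wording and label vocabulary are hypothetical illustrations, not the exact instructions or answers used in the paper.

    # Hypothetical instruction formats (illustrative wording only, not the paper's actual prompts).
    content_instruction = {
        "question": "Is the face in this image live or a spoof? If it is a spoof, what is the attack medium?",
        "answer": "It is a spoof; the face is replayed on a Pad screen.",
    }
    style_instruction = {
        "question": "Describe the illumination, environment, and camera quality of this image.",
        "answer": "Normal indoor lighting, a cluttered background, and a low-quality camera.",
    }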

    BibTeX

    @inproceedings{lin2025instructflip,
      title={InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing},
      author={Kun-Hsiang Lin and Yu-Wen Tseng and Kang-Yang Huang and Jhih-Ciang Wu and Wen-Huang Cheng},
      booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
      year={2025}
    }