Face anti-spoofing (FAS) aims to construct a robust system that can withstand diverse attacks. While recent efforts have concentrated mainly on cross-domain generalization, two significant challenges persist: limited semantic understanding of attack types and training redundancy across domains. We address the first by integrating vision-language models (VLMs) to enhance the perception of visual input. For the second challenge, we employ a meta-domain strategy to learn a unified model that generalizes well across multiple domains. Our proposed InstructFLIP is a novel instruction-tuned framework that leverages VLMs to enhance generalization via textual guidance, trained solely on a single domain. At its core, InstructFLIP explicitly decouples instructions into content and style components, where content-based instructions focus on the essential semantics of spoofing, and style-based instructions capture variations related to the environment and camera characteristics. Extensive experiments demonstrate the effectiveness of InstructFLIP: it outperforms SOTA models in accuracy and substantially reduces training redundancy across diverse domains in FAS.
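To make the content/style decoupling concrete, here is a minimal sketch of a dual-branch design in which each instruction type queries the image features separately. This is not the repository's implementation; the class names (`InstructionBranch`, `DualBranchFAS`), dimensions, fusion mechanism, and class counts are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class InstructionBranch(nn.Module):
    """Fuses one instruction embedding (content OR style) with image tokens via cross-attention."""

    def __init__(self, text_dim=512, img_dim=512, num_classes=2):
        super().__init__()
        self.proj = nn.Linear(text_dim, img_dim)                 # map instruction embedding to image space
        self.fuse = nn.MultiheadAttention(img_dim, num_heads=8, batch_first=True)
        self.head = nn.Linear(img_dim, num_classes)

    def forward(self, img_tokens, text_emb):
        # img_tokens: (B, N, img_dim), text_emb: (B, text_dim)
        query = self.proj(text_emb).unsqueeze(1)                 # (B, 1, img_dim)
        fused, _ = self.fuse(query, img_tokens, img_tokens)      # instruction attends to image tokens
        return self.head(fused.squeeze(1))                       # (B, num_classes)


class DualBranchFAS(nn.Module):
    """Toy model: the content branch reasons about spoof semantics (attack type),
    the style branch about environment/camera attributes. Class counts are hypothetical."""

    def __init__(self, num_spoof_types=6, num_style_classes=4):
        super().__init__()
        self.content_branch = InstructionBranch(num_classes=num_spoof_types)
        self.style_branch = InstructionBranch(num_classes=num_style_classes)

    def forward(self, img_tokens, content_emb, style_emb):
        return (self.content_branch(img_tokens, content_emb),
                self.style_branch(img_tokens, style_emb))


if __name__ == "__main__":
    model = DualBranchFAS()
    img_tokens = torch.randn(2, 196, 512)        # e.g., ViT patch tokens
    content_emb = torch.randn(2, 512)            # encoded content instruction
    style_emb = torch.randn(2, 512)              # encoded style instruction
    content_logits, style_logits = model(img_tokens, content_emb, style_emb)
    print(content_logits.shape, style_logits.shape)  # torch.Size([2, 6]) torch.Size([2, 4])
```

The point of the separation is that spoof-relevant semantics and nuisance (style) factors are supervised through different instruction channels rather than a single entangled head.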
InstructFLIP consistently outperforms existing FAS methods across all metrics and datasets in the unified MCIO and WCS benchmarks. It achieves the lowest HTER and the highest AUC and TPR@FPR=1%, reducing HTER by up to 47% relative to the prior SOTA (CFPL). These gains demonstrate InstructFLIP's ability to robustly distinguish live from spoofed faces while maintaining high accuracy and low false positives. Performance on the CASIA-CeFA (C) dataset is comparatively modest, which points to room for improving sensitivity to cultural and environmental subtleties. Overall, Table 2c confirms InstructFLIP as a balanced and generalizable solution for real-world FAS applications.
* The CA dataset is used for training, with evaluation conducted on both MCIO and WCS benchmarks. Subtable (a) presents results on MCIO, (b) reports on WCS, and (c) summarizes average performance across all datasets. The best and second-best results are highlighted in red and blue, respectively.
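For reference, the reported metrics can be computed from per-sample liveness scores roughly as sketched below. This is a generic evaluation sketch, not the repository's evaluation code; in particular, the decision threshold for HTER is assumed to be fixed beforehand (commonly at the EER point of a development set), and label/score conventions may differ.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score


def evaluate_fas(scores, labels, threshold):
    """Compute HTER, AUC, and TPR@FPR=1% from liveness scores.

    scores: higher = more likely live; labels: 1 = live (bona fide), 0 = spoof (attack).
    threshold: decision threshold, assumed to be chosen on a development set.
    """
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    live, spoof = scores[labels == 1], scores[labels == 0]

    far = np.mean(spoof >= threshold)      # spoofs accepted as live
    frr = np.mean(live < threshold)        # live faces rejected as spoof
    hter = (far + frr) / 2                 # Half Total Error Rate

    auc = roc_auc_score(labels, scores)

    fpr, tpr, _ = roc_curve(labels, scores)
    tpr_at_fpr1 = np.interp(0.01, fpr, tpr)  # TPR at FPR = 1%

    return hter, auc, tpr_at_fpr1
```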
```bibtex
@inproceedings{lin2025instructflip,
  title={InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing},
  author={Kun-Hsiang Lin and Yu-Wen Tseng and Kang-Yang Huang and Jhih-Ciang Wu and Wen-Huang Cheng},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  year={2025}
}
```