Face recognition systems are designed to be robust against changes in head pose, illumination, and image blur during capture. Because of this tolerance, a malicious person who presents a face photo of a registered user may be able to pass authentication illegitimately. Such spoofing attacks must be detected before face recognition is performed. The major spoofing attacks, shown in Fig. 1, include the “print attack,” in which a printed face image is presented, and the “display attack,” in which a face video is displayed on a device. Detecting these attacks requires capturing local and/or global differences, such as in texture and depth, between live and spoofed face images. Because the shallow layers of a Vision Transformer (ViT) extract local features and its deep layers extract global features, we investigate spoofing attack detection that exploits this characteristic. We propose a spoofing attack detection method that uses the intermediate features of ViT and introduces two data augmentation methods. Experiments on the SiW dataset demonstrate the effectiveness of the proposed method.
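To make the idea of using intermediate ViT features concrete, the following is a minimal sketch in PyTorch, not the method evaluated in this paper. It assumes a toy ViT-style encoder and assumes that the class token from every block is concatenated before a live/spoof classifier; the abstract does not specify how the intermediate features are combined. The names `TinyViT` and `IntermediateFeatureClassifier`, the layer sizes, and the fusion by concatenation are all hypothetical, and the two data augmentation methods are not shown.

```python
import torch
import torch.nn as nn


class TinyViT(nn.Module):
    """Toy ViT-style encoder (hypothetical) that returns the class token of every block."""

    def __init__(self, image_size=32, patch_size=8, dim=64, depth=4, heads=4):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding: a strided convolution splits the image into patches.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(
                dim, heads, dim_feedforward=4 * dim,
                batch_first=True, norm_first=True,
            )
            for _ in range(depth)
        ])

    def forward(self, x):
        x = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)     # (B, 1, dim)
        x = torch.cat([cls, x], dim=1) + self.pos_embed     # (B, N + 1, dim)
        features = []
        for block in self.blocks:
            x = block(x)
            # Keep the class token after each block: shallow blocks first,
            # deep blocks last.
            features.append(x[:, 0])
        return features


class IntermediateFeatureClassifier(nn.Module):
    """Assumed fusion: concatenate per-block class tokens, then classify."""

    def __init__(self, backbone, dim=64, depth=4):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(dim * depth, 2)  # logits for live vs. spoof

    def forward(self, x):
        features = self.backbone(x)
        return self.head(torch.cat(features, dim=-1))


torch.manual_seed(0)
model = IntermediateFeatureClassifier(TinyViT()).eval()
with torch.no_grad():
    logits = model(torch.randn(2, 3, 32, 32))  # two random 32x32 RGB images
```

Here `logits` has one row per input image and one column per class. In practice a pretrained ViT backbone would replace the toy encoder, and the choice of which blocks to tap and how to fuse them is a design decision that this sketch leaves open.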