Computer Structures Laboratory
Graduate School of Information Sciences
Tohoku University

2025
Graphical Abstract
Leveraging Intermediate Features of Vision Transformer for Face Anti-Spoofing

IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops

Face recognition systems are designed to be robust against changes in head pose, illumination, and blurring during image capture. If a malicious person presents a face photo of the registered user, they may illegally bypass the authentication process. Such spoofing attacks need to be detected before face recognition. In this paper, we propose a spoofing attack detection method based on Vision Transformer (ViT) to detect minute differences between live and spoofed face images. The proposed method utilizes the intermediate features of ViT, which strike a good balance between the local and global features important for spoofing attack detection, to calculate the loss in training and the score in inference. The proposed method also introduces two data augmentation methods, face anti-spoofing data augmentation and patch-wise data augmentation, to improve the accuracy of spoofing attack detection. We demonstrate the effectiveness of the proposed method through experiments using the OULU-NPU and SiW datasets. The project page is available at: https://gsisaoki.github.io/FAS-ViT-CVPRW/.
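As a rough illustration of scoring with intermediate features, the sketch below pools the patch tokens of one ViT block and measures their distance to a live-class prototype. The block index, mean pooling, and cosine distance are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def spoof_score(layer_tokens, live_prototype, layer=6):
    """Anti-spoofing score from one intermediate ViT block (illustrative).

    layer_tokens: dict mapping block index -> (num_patches, dim) patch tokens.
    Returns the cosine distance of the pooled tokens to a live-class
    prototype, so larger scores suggest a spoofed image.
    """
    pooled = layer_tokens[layer].mean(axis=0)     # average-pool patch tokens
    pooled = pooled / np.linalg.norm(pooled)      # L2-normalize
    return 1.0 - float(pooled @ live_prototype)   # cosine distance in [0, 2]

# Toy usage with random tokens standing in for real ViT features.
rng = np.random.default_rng(0)
tokens = {6: rng.normal(size=(196, 768))}         # 14x14 patches, dim 768
proto = rng.normal(size=768)
proto /= np.linalg.norm(proto)
score = spoof_score(tokens, proto)
```

In an actual pipeline the prototype would be learned from live training samples, and the same intermediate features would feed the training loss.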
Graphical Abstract
ErpGS: Novel View Synthesis of 360-degree Images Using 3D Gaussian Regularization

31st Symposium on Sensing via Image Information (SSII)

A 360-degree camera projects the full surroundings of the camera onto the image plane using equirectangular projection (ERP), producing an ERP image (360-degree image). Since ERP images can cover a wide field of view with few shots, novel view synthesis (NVS) methods based on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have been proposed that exploit this advantage. NVS from ERP images must handle the large distortion introduced by the ERP projection, but the methods proposed to date do not fully address it. In particular, 3DGS generates large 3D Gaussians under the influence of ERP distortion, which degrades the rendering accuracy of novel views. In this paper, we propose ErpGS, an NVS method for ERP images based on 3D Gaussian regularization. ErpGS handles ERP distortion by introducing geometric regularization, regularization on the scale of 3D Gaussians, distortion-aware weighting, and view-specific masking. Through performance evaluation experiments on public datasets, we demonstrate that ErpGS renders novel views more accurately than conventional methods.
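One of the four components, the scale regularization, can be sketched as a penalty on over-sized Gaussians combined with distortion-aware weighting. The threshold, the 1/cos(latitude) weight, and all names below are illustrative assumptions rather than ErpGS's actual formulation.

```python
import numpy as np

def scale_regularizer(scales, latitudes, tau=0.05):
    """Penalize 3D Gaussians whose largest axis exceeds tau (illustrative).

    scales:    (N, 3) per-axis scales of the 3D Gaussians.
    latitudes: (N,) latitude (radians) of each Gaussian's projection in the
               ERP image; ERP stretch grows roughly as 1/cos(latitude)
               toward the poles, so over-sized Gaussians there are
               weighted more heavily.
    """
    excess = np.maximum(scales.max(axis=1) - tau, 0.0)   # over-sized amount
    weight = 1.0 / np.maximum(np.cos(latitudes), 1e-3)   # distortion weight
    return float((weight * excess).mean())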
Graphical Abstract
Image quality improvement in single plane-wave imaging using deep learning

Ultrasonics

In ultrasound image diagnosis, single plane-wave imaging (SPWI), which can acquire ultrasound images at more than 1,000 fps, has been used to observe detailed tissue and evaluate blood flow. SPWI achieves high temporal resolution by sacrificing the spatial resolution and contrast of ultrasound images. To improve spatial resolution and contrast in SPWI, coherent plane-wave compounding (CPWC) is used to obtain high-quality ultrasound images, i.e., compound images, by coherent addition of radio-frequency (RF) signals acquired by transmitting plane waves in different directions. Although CPWC produces high-quality ultrasound images, its temporal resolution is lower than that of SPWI. To address this problem, some methods have been proposed to reconstruct an ultrasound image comparable to a compound image from RF signals obtained by transmitting a small number of plane waves in different directions. These methods do not fully consider the properties of RF signals, resulting in lower image quality compared to a compound image. In this paper, we propose methods to reconstruct high-quality ultrasound images in SPWI, with image quality comparable to CPWC, by considering the characteristics of the RF signals of a single plane wave. The proposed methods employ encoder–decoder models based on 1D U-Net, 2D U-Net, and their combination, trained by minimizing a loss that considers the point-spread effect of plane waves and the frequency spectrum of the RF signals. We also create a public large-scale SPWI/CPWC dataset for developing and evaluating deep-learning methods.
Through a set of experiments using the public dataset and our dataset, we demonstrate that the proposed methods can reconstruct higher-quality ultrasound images from RF signals in SPWI than conventional methods.
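A training loss that accounts for the frequency spectrum of RF signals, as described above, might combine a time-domain term with a spectral-magnitude term. The plain L1 form and the weighting below are assumptions for illustration (the point-spread term is omitted).

```python
import numpy as np

def rf_loss(pred_rf, target_rf, alpha=0.5):
    """Time-domain L1 plus L1 on the RF magnitude spectrum (illustrative).

    pred_rf, target_rf: (channels, samples) RF signals.
    """
    time_term = np.abs(pred_rf - target_rf).mean()
    pred_spec = np.abs(np.fft.rfft(pred_rf, axis=-1))
    tgt_spec = np.abs(np.fft.rfft(target_rf, axis=-1))
    spec_term = np.abs(pred_spec - tgt_spec).mean()
    return float(time_term + alpha * spec_term)
```

Minimizing such a loss pushes the network to match not only the waveform but also its spectral content, which matters for speckle and contrast.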
2024
Graphical Abstract
Enhancing Remote Adversarial Patch Attacks on Face Detectors with Tiling and Scaling

Asia-Pacific Signal and Information Processing Association Annual Summit and Conference

This paper discusses the attack feasibility of Remote Adversarial Patch (RAP) attacks targeting face detectors. A RAP that targets face detectors is similar to one that targets general object detectors, but the former faces issues in the attack process that the latter does not. (1) Face detectors must detect objects at various scales; in particular, small faces occupy only a small area during CNN feature extraction, so the region that influences the inference result is also small. (2) Face detection is a two-class classification, so there is a large gap in characteristics between the classes, which makes it difficult to redirect the inference result to a different class. In this paper, we propose a new patch placement method and a new loss function, one for each problem. The proposed patches targeting face detectors showed superior detection-obstruction effects compared to patches targeting general object detectors.
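The tiling idea for the scale problem can be sketched as pasting copies of one optimized patch at several scales, so that faces of different sizes always overlap some copy. The grid spacing and nearest-neighbour upscaling below are illustrative assumptions, not the paper's exact placement rule.

```python
import numpy as np

def tile_patch(image, patch, scales=(1, 2)):
    """Tile an adversarial patch over the image at several scales (illustrative).

    image: (H, W, 3) scene, patch: (h, w, 3) optimized patch.
    """
    out = image.copy()
    h, w = image.shape[:2]
    for s in scales:
        big = np.kron(patch, np.ones((s, s, 1)))   # nearest-neighbour upscale
        ph, pw = big.shape[:2]
        for y in range(0, h - ph + 1, ph * 2):     # leave gaps between copies
            for x in range(0, w - pw + 1, pw * 2):
                out[y:y + ph, x:x + pw] = big
    return out
```

In a real attack the patch content itself is optimized against the detector's loss; this sketch only shows the placement geometry.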
Graphical Abstract
Multibiometrics Using a Single Face Image

Asia-Pacific Signal and Information Processing Association Annual Summit and Conference

Multibiometrics, which authenticates individuals using multiple biometric traits to improve recognition performance instead of relying on a single trait, has been investigated. Previous studies have combined individually acquired biometric traits or have not fully considered the convenience of the system. Focusing on a single face image, we propose a novel multibiometric method that combines five biometric traits, i.e., face, iris, periocular, nose, and eyebrow, all extracted from a single face image. The proposed method does not sacrifice the convenience of biometrics since only a single face image is used as input. Through a variety of experiments using the CASIA Iris Distance database, we demonstrate the effectiveness of the proposed multibiometric method.
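At a high level, combining the five traits could be a weighted score-level fusion; the equal weights and the [0, 1] score convention below are illustrative assumptions, not necessarily the fusion rule used in the paper.

```python
TRAITS = ["face", "iris", "periocular", "nose", "eyebrow"]

def fuse_scores(scores, weights=None):
    """Weighted-sum fusion of per-trait matching scores (illustrative).

    scores: dict trait -> matching score, assumed normalized to [0, 1].
    """
    if weights is None:
        weights = {t: 1.0 / len(TRAITS) for t in TRAITS}  # equal weights
    return sum(weights[t] * scores[t] for t in TRAITS)
```

With trait-specific weights, more reliable traits (e.g. face) can dominate the fused decision while weaker ones still contribute.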
Graphical Abstract
High-resolution Microscopic Image Dataset of Freshwater Plankton in Japanese Lakes and Reservoirs (FREPJ): I. Zooplankton

Bulletin of the National Museum of Nature and Science. Series B, Botany

Plankton are important organisms that structure food webs in aquatic ecosystems and are also effective environmental indicators. However, the identification and enumeration of these organisms for environmental monitoring is challenging in terms of sustainability and accuracy. To overcome these difficulties, we collected plankton images usable for developing an AI-based plankton monitoring system. As the first in a series of plankton image collections, we built an image dataset of zooplankton. This dataset contains a total of 88,653 images of 214 freshwater zooplankton taxa collected from 87 lakes and reservoirs located in different areas of the Japanese archipelago. To obtain these images, zooplankton samples collected at various locations were first scanned using an intelligent microscope, and high-resolution photographs containing multiple plankton individuals were taken. Then, each plankton individual was cropped and extracted from the photographs as a single image, classified and labeled with multiple taxonomic ranks (phylum, class, order, family, genus, and species), and stored in the dataset. The present dataset will be useful not only as an atlas of freshwater zooplankton in Japan, but also for the construction, training, and evaluation of an automatic plankton identification and enumeration system based on machine learning.
Graphical Abstract
LabellessFace: Fair metric learning for face recognition without attribute labels

IEEE International Joint Conference on Biometrics

Demographic bias is one of the major challenges for face recognition systems. The majority of existing studies on demographic bias depend heavily on specific demographic groups or demographic classifiers, making it difficult to address performance for unrecognized groups. This paper introduces "LabellessFace", a novel framework that mitigates demographic bias in face recognition without the demographic group labeling typically required for fairness considerations. We propose a novel fairness enhancement metric called the class favoritism level, which assesses the extent of favoritism towards specific classes across the dataset. Leveraging this metric, we introduce the fair class margin penalty, an extension of existing margin-based metric learning. This method dynamically adjusts learning parameters based on class favoritism levels, promoting fairness across all attributes. By treating each class as an individual in face recognition systems, we facilitate learning that minimizes biases in authentication accuracy among individuals. Comprehensive experiments demonstrate that our proposed method enhances fairness while maintaining authentication accuracy.
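One plausible reading of the fair class margin penalty is a per-class adjustment of the angular margin in margin-based losses such as ArcFace: classes the model already favours get a slightly smaller margin, disfavoured ones a larger one. The sign convention, linear form, and `beta` parameter below are all assumptions for illustration.

```python
import numpy as np

def per_class_margin(base_margin, favoritism, beta=0.1):
    """Adjust the angular margin per class from favoritism levels (illustrative).

    favoritism: (C,) class favoritism level, higher = more favoured.
    Centring keeps the mean margin equal to base_margin.
    """
    centered = favoritism - favoritism.mean()
    return base_margin - beta * centered   # favoured class -> smaller margin
```

Since each class is one enrolled individual, equalizing the effective margin pressure across classes works toward equalizing per-person accuracy without any demographic labels.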
Graphical Abstract
FSErasing: Improving face recognition with data augmentation using face parsing

IET Biometrics

We propose original semantic labels for detailed face parsing to improve the accuracy of face recognition by focusing on parts of the face. The part labels used in conventional face parsing are defined based on biological features, and thus one label is given to a large region, such as skin. Our semantic labels are defined by separating large-area parts based on the structure of the face and distinguishing the left and right sides of all parts, to account for head pose changes, occlusion, and other factors. Utilizing this capability to assign detailed part labels to face images, we propose a novel data augmentation method based on detailed face parsing, called Face Semantic Erasing (FSErasing), to improve the performance of face recognition. FSErasing randomly masks a part of the face image based on the detailed part labels, and therefore applies erasing-type data augmentation that considers the characteristics of the face. Through experiments using public face image datasets, we demonstrate that FSErasing is effective for improving the performance of face recognition and face attribute estimation. In face recognition, adding FSErasing when training ResNet-34 with Softmax using CelebA improves the average accuracy by 0.354 points and the average equal error rate (EER) by 0.312 points; with ArcFace, the average accuracy and EER improve by 0.752 and 0.802 points, respectively. For ResNet-50 with Softmax using CASIA-WebFace, the average accuracy improves by 0.442 points and the average EER by 0.452 points; with ArcFace, the average accuracy and EER improve by 0.228 and 0.500 points, respectively. In face attribute estimation, adding FSErasing as a data augmentation method in training with CelebA improves the estimation accuracy by 0.54 points.
We also apply our detailed face parsing model to visualize face recognition models and demonstrate its higher explainability than general visualization methods.
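The core of FSErasing, erasing one semantically chosen face part instead of a random rectangle, can be sketched as below; filling the part with uniform noise and treating label 0 as background are illustrative assumptions.

```python
import numpy as np

def fs_erase(image, part_map, rng=None):
    """Erase one randomly chosen semantic face part (illustrative).

    image:    (H, W, 3) uint8 face image.
    part_map: (H, W) integer part labels from face parsing; 0 = background,
              which is never erased.
    """
    if rng is None:
        rng = np.random.default_rng()
    labels = np.unique(part_map)
    labels = labels[labels != 0]
    target = rng.choice(labels)                  # pick one face part
    out = image.copy()
    mask = part_map == target
    out[mask] = rng.integers(0, 256, size=(int(mask.sum()), image.shape[-1]))
    return out
```

Unlike rectangle-based Random Erasing, the erased region follows part boundaries, so the network is forced to rely on the remaining facial parts.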