Bayesian Optimization Meets Self-Distillation

Abstract

Bayesian optimization (BO) has contributed greatly to improving model performance by suggesting promising hyperparameter configurations iteratively based on observations from multiple training trials. However, only partial knowledge (i.e., the measured performances of trained models and their hyperparameter configurations) from previous trials is transferred. On the other hand, Self-Distillation (SD) only transfers partial knowledge learned by the task model itself. To fully leverage the various knowledge gained from all training trials, we propose the BOSS framework, which combines BO and SD. BOSS suggests promising hyperparameter configurations through BO and carefully selects pre-trained models from previous trials for SD, which are otherwise abandoned in the conventional BO process. BOSS achieves significantly better performance than both BO and SD in a wide range of tasks including general image classification, learning with noisy labels, semi-supervised learning, and medical image analysis tasks.

Motivation

Bayesian optimization (BO) is an iterative process that suggests promising hyperparameters based on previous observations. Unfortunately, the knowledge acquired by the network during these iterations is typically disregarded. However, recent studies on self-distillation (SD) have demonstrated that transferring knowledge from a previously trained model of identical capacity can improve the performance of the model.
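As a concrete illustration, a typical self-distillation objective combines the standard cross-entropy loss with a KL-divergence term toward the softened predictions of a frozen teacher of the same architecture. The minimal PyTorch sketch below shows such a loss; the weighting alpha and the temperature are illustrative defaults, not the exact settings used in the paper.

# Minimal self-distillation loss sketch (illustrative defaults, not the paper's exact settings).
# `student_logits` and `teacher_logits` come from two networks of identical architecture;
# the teacher is a frozen copy trained in a previous round.
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, targets,
                           alpha=0.5, temperature=4.0):
    """Cross-entropy on the labels plus KL divergence to the teacher's softened outputs."""
    ce = F.cross_entropy(student_logits, targets)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return (1.0 - alpha) * ce + alpha * kd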

BOSS Framework

By simply performing these steps in an alternating manner (left to right), we propagate both the conditional probability learned over hyperparameter configurations (depicted with graphs) and the knowledge learned by each task network, resulting in large performance gains in the final model.

BOSS (Bayesian Optimization meets Self-diStillation) combines BO and SD to fully leverage the knowledge obtained from previous trials. The framework suggests the hyperparameter configuration that is most likely to improve performance based on previous observations. It then carefully selects pre-trained networks from previous trials for the next round of training with SD. The process is iterative, allowing the network to consistently improve upon previous trials.
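The overall procedure can be summarized as the loop sketched below. This is a high-level sketch under stated assumptions: suggest, select_init, train_with_sd, and evaluate are hypothetical callables standing in for the BO suggestion step, the rule for picking teacher/student initializations from past checkpoints, one round of SD training, and validation, respectively; the paper's exact selection rule is not reproduced here.

def boss_loop(suggest, select_init, train_with_sd, evaluate, num_rounds):
    """Sketch of the BOSS outer loop (hypothetical helper callables).

    suggest(observations)            -> hyperparameter config proposed by BO
    select_init(checkpoints)         -> (teacher_weights, student_weights) from past trials
    train_with_sd(config, t_w, s_w)  -> model trained with self-distillation
    evaluate(model)                  -> validation score
    """
    observations, checkpoints = [], []
    for _ in range(num_rounds):
        config = suggest(observations)                        # BO suggests a promising config
        teacher_w, student_w = select_init(checkpoints)       # reuse weights from earlier trials
        model = train_with_sd(config, teacher_w, student_w)   # one SD training round
        score = evaluate(model)
        observations.append((config, score))                  # feed the observation back to BO
        checkpoints.append((score, model))                    # keep the weights for later rounds
    return max(checkpoints, key=lambda c: c[0])[1]            # best model found so far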

Experiments

Performance on object classification tasks. Top-1 accuracy (%) on CIFAR-10/100 and Tiny-ImageNet with VGG-16.
Performance on learning with noisy labels tasks. Top-1 accuracy (%) on CIFAR-100 with VGG-16.
Performance on semi-supervised learning tasks. Top-1 accuracy (%) on CIFAR-100 with VGG-16.
Performance on medical image analysis tasks.

The effectiveness of BOSS was evaluated across various computer vision tasks such as object classification, learning with noisy labels, and semi-supervised learning. It was also tested with medical image analysis tasks, including medical image classification and segmentation.

While random search succeeds in improving the performance of the baseline, BO further boosts performance by adaptively suggesting promising configurations. SD also achieves enhanced performance compared to the baseline, as expected. However, the effectiveness of SD and BO varies across datasets and tasks. In contrast, BOSS consistently improves performance by a large margin, leveraging the advantages of both methods.

Ablation Study and Analysis

The paper presents ablation studies and analytical experiments to investigate the design choices of the BOSS algorithm.

Ablation Study

Comparison of different choices for utilizing pretrained (PT) weights on CIFAR-100 with VGG-16.

Employing pretrained weights for either the student or the teacher network results in improved performance compared to standard BO. Furthermore, utilizing pretrained weights for both the teacher and the student networks leads to even greater performance gains. This indicates that the knowledge of the teacher and the student can create a positive synergy.

Effect of Pretrained Weight

Identical models have the same initialization weights for both teacher and student, while asymmetric models have different initialization weights.

For all possible combinations of teacher and student networks, we conducted a single round of self-distillation. As expected, utilizing distinct pretrained models for the teacher and student networks leads to better performance. Our analysis suggests that initializing the student and teacher networks with different models is crucial to benefit from warm-starting the student model.
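As an illustration of asymmetric initialization, the sketch below loads the teacher and the student from two different checkpoints saved in earlier trials and freezes the teacher, so that only the warm-started student is trained. It assumes torchvision's VGG-16 as the task network (matching the experiments) and that each checkpoint file stores a plain state_dict; the helper name and checkpoint paths are hypothetical.

# Illustrative sketch: asymmetric initialization for one round of self-distillation.
# Assumes each checkpoint file stores a VGG-16 state_dict saved from an earlier trial.
import torch
from torchvision.models import vgg16

def build_teacher_student(teacher_ckpt, student_ckpt, num_classes=100):
    teacher = vgg16(num_classes=num_classes)
    student = vgg16(num_classes=num_classes)
    teacher.load_state_dict(torch.load(teacher_ckpt, map_location="cpu"))
    student.load_state_dict(torch.load(student_ckpt, map_location="cpu"))  # warm start from a *different* trial
    teacher.eval()                          # the teacher stays frozen during SD
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher, student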

Conclusion

Through extensive experiments across various settings and tasks, we demonstrate that BOSS achieves significant performance improvements that are consistently better than standard BO or SD on their own. Based on the presented evidence, we believe that marrying BO and SD is a powerful approach to training models that should be further explored by the research community.

Citation

@inproceedings{lee2023bayesian,
  author={Lee, HyunJae and Song, Heon and Lee, Hyeonsoo and Lee, Gi-hyeon and Park, Suyeong and Yoo, Donggeun},
  title={Bayesian Optimization Meets Self-Distillation},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month={October},
  year={2023},
  pages={TBU}
}