In this study, we investigated the extent to which Vision Language Models (VLMs) possess sensibilities similar to those of humans. We focused on two phenomena: color impressions, which strongly shape the sensory aspects of vision, and sound symbolism, which reflects linguistic and auditory sensibilities. For the experiments, we constructed a new evolving image generation system based on the CONRAD algorithm, which evolves images according to human evaluations. Our system can incorporate evaluations from VLMs as well as from humans. Using this system, we analyzed the sensibilities of VLMs. The experimental results suggested similarities between human and VLM sensibilities in both color impressions and sound symbolism. Notably, VLMs showed sound-symbolic sensibilities similar to those of humans even for pseudo-words that we newly generated. These findings suggest that VLM evaluations and feedback may be effective to some degree in tasks that have previously required human evaluations or annotations related to sensibility.

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.