12  Tutorial VI: Reasoning models and images

12.1 Virtual RStudio

12.2 Materials for this session

Reasoning models

Image analysis

  • Vision model: ministral-3
  • Paper with an interesting application of vision models: Meltzer et al. (2025)

We used state-of-the-art zero-shot classification with multimodal large language models (MLLM) in order to classify individual video frames. Based on the aforementioned literature of manual content analyses of music videos, we selected three major dimensions: (1) revealing or suggestive clothes, (2) sexually suggestive moves (including dancing), and (3) sexually suggestive poses and facial expressions.
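
To make the idea of zero-shot frame classification more concrete, here is a minimal sketch in base R. It only builds a prompt from the three dimensions in the quote and parses a reply. It is not the pipeline used by Meltzer et al. (2025). The prompt wording, the `name=0/1` reply format, and the helper names `build_prompt()` and `parse_reply()` are assumptions made for illustration, and the model reply is hard-coded rather than produced by an actual vision model.

```r
# The three coding dimensions named in the quoted passage.
# The short names (clothing, moves, poses) are hypothetical labels.
dimensions <- c(
  clothing = "revealing or suggestive clothes",
  moves    = "sexually suggestive moves (including dancing)",
  poses    = "sexually suggestive poses and facial expressions"
)

# Build a zero-shot prompt: no labelled examples, only category descriptions.
# The wording is an assumption, not the prompt from the paper.
build_prompt <- function(dims) {
  items <- paste0(names(dims), ": ", dims, collapse = "\n")
  paste0(
    "Look at the attached video frame. For each dimension below, ",
    "answer 1 if it is present and 0 if it is not.\n",
    items, "\n",
    "Reply with one line per dimension in the form name=0 or name=1."
  )
}

# Parse a reply such as "clothing=1\nmoves=0\nposes=1" into a named
# integer vector. Dimensions missing from the reply stay NA.
parse_reply <- function(reply, dims) {
  lines <- trimws(strsplit(reply, "\n", fixed = TRUE)[[1]])
  pattern <- "^([a-z]+)[[:space:]]*=[[:space:]]*([01])$"
  hits <- regmatches(lines, regexec(pattern, lines))
  hits <- Filter(function(m) length(m) == 3, hits)
  found <- setNames(
    as.integer(vapply(hits, `[`, "", 3)),
    vapply(hits, `[`, "", 2)
  )
  out <- setNames(rep(NA_integer_, length(dims)), names(dims))
  keep <- intersect(names(found), names(dims))
  out[keep] <- found[keep]
  out
}

prompt <- build_prompt(dimensions)

# Hypothetical model reply, hard-coded instead of calling a vision model.
example_reply <- "clothing=1\nmoves=0\nposes=1"
codes <- parse_reply(example_reply, dimensions)
print(codes)
```

In practice, `prompt` would be sent together with one image per frame to a vision model such as the one listed above, and `parse_reply()` would be applied to each response. Asking for a fixed reply format keeps the parsing step simple and makes malformed answers easy to spot as `NA`.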

Example videos