Cross-modal retrieval is an AI search technique that lets you find data in one modality using a query from an entirely different modality. You type a description, and a search engine hands you back a matching image. Or you upload a photo, and it finds related audio clips. That’s cross-modal retrieval doing its thing.
The “modal” part comes from “modality,” which just means a type of data, like text, images, audio, or video. “Cross-modal” means you’re working across two or more of those types. So cross-modal retrieval is when you use one type of data to search for another.
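To make the idea concrete, here is a toy sketch of a text query searching over images. It assumes one common approach, which the description above does not spell out: both modalities are turned into vectors in a shared space, and the closest image vector wins. The file names, the three-number vectors, and the `search_images` helper are all made up for illustration; a real system would get its vectors from trained text and image encoders.

```python
import math

# Hypothetical, hand-written vectors standing in for the output of a real
# image encoder. Assumption: text and images live in one shared vector space.
IMAGE_INDEX = {
    "beach_sunset.jpg": [0.9, 0.1, 0.0],
    "city_at_night.jpg": [0.1, 0.9, 0.1],
    "forest_trail.jpg": [0.0, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search_images(query_vector, index, top_k=1):
    """Rank image names by similarity to the query vector, best first."""
    ranked = sorted(
        index,
        key=lambda name: cosine_similarity(query_vector, index[name]),
        reverse=True,
    )
    return ranked[:top_k]


# Pretend this is the embedding of the text query
# "sun setting over the ocean" (made-up numbers, not real encoder output).
text_query_vector = [0.8, 0.2, 0.1]

best_match = search_images(text_query_vector, IMAGE_INDEX)
print(best_match)
```

The query here started life as text and the results are images, which is the "cross" in cross-modal. Swap the roles (an image vector as the query, a text index to search) and the same ranking function works unchanged.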