Sunday 11:00–11:45 in LG7

Cross-modal Representation Learning

Tanmoy Mukherjee, Maryam Abdollahyan

Audience level:


In this talk, we introduce an alternative to vector representations, based on Gaussian distributions, and present several learning algorithms that use distributions for cross-modal representation learning. We discuss the advantages of distribution-based learning and demonstrate examples in word/concept representation learning, image annotation and retrieval, and zero-shot learning.


In the past, approaches to lexical semantics such as distributional semantic models (DSMs), which rely on corpus-extracted vectors (e.g. topic models), have provided good approximations of word meaning. Today, despite considerable progress in computer vision driven by advances in deep learning, cross-modal representation learning remains constrained by vector representations. Techniques based on vector representations have a number of limitations: first, they do not necessarily capture intra-class variability; second, they do not express the uncertainty associated with assigning target concepts to input data; and finally, similarity between objects is commonly computed as the dot product between their corresponding vectors, a symmetric measure that does not allow asymmetric comparisons.

In our work, we attempt to go beyond vector representations and move towards representations based on probability distributions. In particular, we use Gaussian embeddings, which innately incorporate uncertainty and form a basis upon which various divergences can be defined (e.g. f-divergences). In addition, we consider transferring representations learnt from images to text to obtain better distributional semantic models. We draw inspiration from earlier work on kernels on distributions to propose an embedding space for words and images for efficient cross-modal representation learning, and show how one can transfer from the visual modality to the language modality or vice versa.
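
To make the asymmetry point concrete, here is a minimal sketch rather than the method presented in the talk. It assumes each concept is embedded as a Gaussian with diagonal covariance and compares two concepts with the KL divergence, which is one member of the f-divergence family. The function name `kl_diag_gaussians`, the two-dimensional embeddings and the concept labels are hypothetical choices made only for illustration.

```python
import numpy as np


def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(P || Q) for Gaussians P, Q with diagonal covariance.

    mu_* are mean vectors and var_* are per-dimension variances (all > 0).
    """
    mu_p, var_p, mu_q, var_q = (
        np.asarray(a, dtype=float) for a in (mu_p, var_p, mu_q, var_q)
    )
    return 0.5 * np.sum(
        var_p / var_q                  # trace term
        + (mu_q - mu_p) ** 2 / var_q   # mean-shift term
        - 1.0                          # dimensionality term
        + np.log(var_q / var_p)        # log-determinant ratio
    )


# Toy, hand-picked embeddings (assumption): a specific concept with low
# variance and a broader concept with high variance around the same mean.
dog_mu, dog_var = np.zeros(2), np.full(2, 1.0)
animal_mu, animal_var = np.zeros(2), np.full(2, 4.0)

kl_specific_to_broad = kl_diag_gaussians(dog_mu, dog_var, animal_mu, animal_var)
kl_broad_to_specific = kl_diag_gaussians(animal_mu, animal_var, dog_mu, dog_var)

# A dot product of the two mean vectors is the same in both directions,
# whereas the two KL values differ.
print(kl_specific_to_broad, kl_broad_to_specific)
```

In this toy setting the divergence from the narrow distribution to the broad one is smaller than the reverse, which is the kind of directional relation a dot product between point vectors cannot express. The variances also give each concept an explicit measure of spread, which is one way to represent intra-class variability and uncertainty.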

Finally, we demonstrate examples in learning word representation, image annotation and retrieval, and zero-shot learning.