Zero-Shot Learning

Overview

Zero-Shot Learning (ZSL) is a machine learning paradigm where a model is trained to recognize or perform tasks on classes or concepts it has not seen during training. Instead of relying on direct examples of unseen classes, ZSL models typically leverage auxiliary information that describes these unseen classes, often in the form of attributes, textual descriptions, or embeddings from other modalities.

In a multimodal context, ZSL often involves transferring knowledge from a rich modality (like text, where class descriptions are available) to another modality (like vision) to classify images of unseen objects, or to generate content across modalities for unseen concepts.

Core Concepts

  • Generalized Zero-Shot Learning (GZSL)

    Generalized Zero-Shot Learning (GZSL) is a more realistic and challenging variant in which the model is evaluated on samples from both seen (training) classes and unseen classes at test time. The model must not only recognize unseen classes but also avoid misclassifying unseen samples as seen classes (and vice versa); in practice, predictions tend to be strongly biased towards the seen classes, and GZSL methods aim to mitigate that bias.
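
    Assuming an embedding-based classifier of the kind described in the next subsections, the difference shows up at test time purely in the candidate label set, as in the minimal sketch below (the class "prototypes" and dimensions are illustrative placeholders):

    import torch
    import torch.nn.functional as F

    # Placeholder semantic prototypes (e.g., attribute or text embeddings), L2-normalized.
    seen_prototypes = F.normalize(torch.randn(40, 512), dim=-1)    # 40 seen classes
    unseen_prototypes = F.normalize(torch.randn(10, 512), dim=-1)  # 10 unseen classes
    image_embedding = F.normalize(torch.randn(1, 512), dim=-1)     # stand-in for an encoded test image

    # Conventional ZSL: the candidate set contains only unseen classes.
    zsl_pred = (image_embedding @ unseen_prototypes.t()).argmax(dim=-1)

    # GZSL: the candidate set is the union of seen and unseen classes, so any
    # bias towards seen classes directly hurts unseen-class accuracy.
    all_prototypes = torch.cat([seen_prototypes, unseen_prototypes], dim=0)
    gzsl_pred = (image_embedding @ all_prototypes.t()).argmax(dim=-1)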

  • Key Idea: Semantic Embedding Space

    Most ZSL approaches rely on learning a mapping from input features (e.g., visual features of an image) to a semantic embedding space. This semantic space is often derived from or aligned with auxiliary information that describes class properties.

    • Attribute-based ZSL: Classes are described by a set of predefined attributes (e.g., for animals: 'has_stripes', 'eats_meat', 'is_mammal'). An image is mapped into this attribute space, and classification is done by finding the closest class attribute vector (a minimal sketch follows this list).
    • Text-based ZSL (e.g., CLIP): Class names or textual descriptions are encoded into a rich semantic space using powerful language models. Visual features are then projected into this space (or a shared space), and an image of an unseen class is assigned to the class whose text embedding is closest to the image embedding. CLIP is a prime example of this approach and enables powerful zero-shot image classification.
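
    A minimal sketch of attribute-based zero-shot classification, assuming visual features from a pre-trained backbone and a hypothetical projection visual_to_attribute learned on seen classes; the class names, attribute vectors, and dimensions are illustrative:

    import torch
    import torch.nn.functional as F

    # Hypothetical attribute vectors for unseen classes (rows: classes, columns: attributes).
    # In practice these come from expert annotations or class-level metadata.
    unseen_class_names = ["zebra", "tiger", "dolphin"]
    class_attributes = torch.tensor([
        # has_stripes, eats_meat, is_mammal, lives_in_water
        [1.0, 0.0, 1.0, 0.0],  # zebra
        [1.0, 1.0, 1.0, 0.0],  # tiger
        [0.0, 1.0, 1.0, 1.0],  # dolphin
    ])

    # 'visual_to_attribute' stands in for a projection trained on seen classes only;
    # 'image_features' stands in for real backbone features.
    visual_to_attribute = torch.nn.Linear(2048, class_attributes.size(1))
    image_features = torch.randn(1, 2048)

    with torch.no_grad():
        predicted_attributes = visual_to_attribute(image_features)
        # Classify by the nearest class attribute vector (cosine similarity here).
        sims = F.cosine_similarity(predicted_attributes, class_attributes)
        print("Predicted class:", unseen_class_names[sims.argmax().item()])
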
  • Embedding-based Methods (e.g., Cross-Modal Mapping)

    These methods aim to learn a mapping function that projects features from one modality (e.g., visual) into the semantic space of another (e.g., textual embeddings of class labels or attributes), or into a common shared embedding space.

    Examples include DeViSE (Deep Visual-Semantic Embedding Model), ALE (Attribute Label Embedding), and SJE (Structured Joint Embedding).

    Training often involves minimizing a distance or compatibility function between the projected visual features and the semantic embeddings of the corresponding classes for seen data.
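
    A minimal sketch of training such a mapping with a DeViSE-style hinge ranking loss; the projection layer, feature dimensions, margin, and the pre-computed class_embeddings are illustrative assumptions rather than the exact formulation of any one paper:

    import torch
    import torch.nn.functional as F

    # Assumed dimensions: 2048-d visual features, 300-d semantic (e.g., word-vector) space.
    visual_dim, semantic_dim, num_seen_classes, margin = 2048, 300, 40, 0.1

    projection = torch.nn.Linear(visual_dim, semantic_dim)  # learned visual-to-semantic mapping
    class_embeddings = F.normalize(torch.randn(num_seen_classes, semantic_dim), dim=-1)
    optimizer = torch.optim.Adam(projection.parameters(), lr=1e-4)

    def ranking_loss(visual_features, labels):
        """Hinge ranking loss: the correct class embedding should score higher
        than every other class embedding by at least `margin`."""
        projected = F.normalize(projection(visual_features), dim=-1)  # (B, semantic_dim)
        scores = projected @ class_embeddings.t()                     # (B, num_seen_classes)
        correct = scores.gather(1, labels.unsqueeze(1))               # (B, 1)
        hinge = (margin + scores - correct).clamp(min=0)              # (B, num_seen_classes)
        mask = F.one_hot(labels, num_seen_classes).bool()
        return hinge.masked_fill(mask, 0.0).sum(dim=1).mean()         # ignore the correct class

    # One illustrative training step on dummy data.
    features = torch.randn(8, visual_dim)
    labels = torch.randint(0, num_seen_classes, (8,))
    optimizer.zero_grad()
    ranking_loss(features, labels).backward()
    optimizer.step()

    At test time, an image is projected into the semantic space and assigned to the nearest class embedding among the unseen (or, for GZSL, all) classes.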

  • Generative Methods

    Generative ZSL approaches learn to synthesize visual features (or even images) for unseen classes based on their semantic descriptions (attributes or text embeddings). Once synthetic features are generated for unseen classes, the ZSL problem can be converted into a standard supervised classification problem by training a classifier on both real features of seen classes and synthetic features of unseen classes.

    Techniques like Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs) are often used for this feature generation. This approach can be particularly effective for GZSL by alleviating the bias towards seen classes.
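
    A minimal sketch of the generative recipe, assuming class attribute vectors and features from a pre-trained extractor; the simple conditional generator below stands in for the GAN/VAE generators used in practice, and all names and dimensions are illustrative:

    import torch
    import torch.nn as nn

    attr_dim, noise_dim, feat_dim = 85, 64, 2048  # illustrative dimensions

    # Conditional generator: (class attributes, noise) -> synthetic visual feature.
    generator = nn.Sequential(
        nn.Linear(attr_dim + noise_dim, 1024),
        nn.ReLU(),
        nn.Linear(1024, feat_dim),
    )

    def synthesize_features(class_attrs, num_per_class):
        """Generate synthetic visual features for each class attribute vector."""
        attrs = class_attrs.repeat_interleave(num_per_class, dim=0)
        noise = torch.randn(attrs.size(0), noise_dim)
        return generator(torch.cat([attrs, noise], dim=-1))

    # After the generator has been trained against seen-class features (GAN/VAE
    # objective omitted here), synthesize features for unseen classes and train an
    # ordinary softmax classifier on real seen-class + synthetic unseen-class features.
    unseen_attrs = torch.rand(10, attr_dim)      # stand-in attribute vectors for 10 unseen classes
    synthetic_unseen = synthesize_features(unseen_attrs, num_per_class=50)
    classifier = nn.Linear(feat_dim, 40 + 10)    # e.g., 40 seen + 10 unseen classes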

  • Contrastive Learning (e.g., CLIP)

    Contrastive Language-Image Pre-training (CLIP) and similar models learn a shared embedding space where corresponding image and text pairs have high similarity. For zero-shot image classification:

    1. Embed the input image using the image encoder.
    2. For each candidate class (seen or unseen), create a textual prompt (e.g., "an image of a [class_name]") and embed it using the text encoder.
    3. Calculate the cosine similarity between the image embedding and each text prompt embedding.
    4. The class corresponding to the highest similarity is predicted.

    This approach has demonstrated remarkable ZSL performance without explicit attribute annotations or complex mapping functions, relying instead on large-scale pre-training on image-text pairs.
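
    For concreteness, the same four steps with an actual pre-trained CLIP checkpoint, sketched against the Hugging Face transformers interface (the checkpoint name and the blank placeholder image are illustrative):

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    model.eval()

    class_names = ["emu", "hummingbird", "platypus", "armadillo"]  # candidate (unseen) classes
    prompts = [f"a photo of a {name}" for name in class_names]     # step 2: textual prompts
    image = Image.new("RGB", (224, 224))                           # stand-in; load a real image in practice

    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)                        # steps 1-3: encode both modalities, compute similarities
    probs = outputs.logits_per_image.softmax(dim=-1)     # step 4: highest similarity wins
    print(class_names[probs.argmax(dim=-1).item()])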

  • Key Difficulties

    • Domain Shift / Bias: Models trained on seen classes might not generalize well to unseen classes if there's a significant difference in their distributions, or if the semantic descriptions (attributes/text) don't perfectly capture distinguishing visual characteristics. Seen classes often dominate predictions in GZSL.
    • Quality of Semantic Information: The effectiveness of ZSL heavily depends on the quality, discriminability, and completeness of the auxiliary semantic information (attributes or text descriptions). Noisy or ambiguous descriptions lead to poor performance.
    • Hubness Problem: In high-dimensional embedding spaces, some points (hubs) can become nearest neighbors to a disproportionately large number of other points, affecting retrieval and classification accuracy.
    • Scalability: As the number of classes (seen and unseen) grows, the semantic space becomes more crowded, classes become harder to discriminate, and nearest-neighbor search over class embeddings becomes more costly.
    • Evaluation: Choosing appropriate metrics for ZSL, especially GZSL (e.g., harmonic mean of seen and unseen class accuracies), is crucial to avoid misleading conclusions.
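
    A minimal sketch of the standard GZSL summary metric mentioned above: per-group accuracy on seen and unseen classes combined via their harmonic mean, which is high only when the model does well on both (the numbers are illustrative):

    def harmonic_mean_accuracy(acc_seen, acc_unseen):
        """Harmonic mean of seen- and unseen-class accuracy, the usual GZSL metric."""
        if acc_seen + acc_unseen == 0:
            return 0.0
        return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

    # A model heavily biased towards seen classes still scores poorly overall.
    print(harmonic_mean_accuracy(0.80, 0.15))  # ~0.25 despite high seen-class accuracy
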
  • Examples

    • Object Recognition in Images/Videos: Classifying objects belonging to categories not seen during training, based on their textual descriptions or attributes.
    • Action Recognition: Recognizing human actions in videos for which no training examples were available.
    • Cross-modal Retrieval: Retrieving images/videos based on textual descriptions of unseen concepts (and vice-versa).
    • Robotics: Enabling robots to understand and interact with novel objects or environments described through language or attributes.
    • Automated Content Tagging/Categorization: Assigning tags or categories to images/videos even for emerging or rare concepts.

Implementation

  • Conceptual CLIP-based Zero-Shot Classification (PyTorch-like)

    Illustrates the inference process for ZSL image classification using a pre-trained CLIP-like model.
    
    import math
    import torch
    import torch.nn.functional as F
    # Assume a pre-trained CLIP-like model with image and text encoders is available, e.g.
    # from previous_examples import CLIPLikeModel, ImageEncoder, TextEncoder
    
    # Dummy CLIP-like model for conceptual illustration
    class DummyCLIPModel(torch.nn.Module):
        def __init__(self):
            super().__init__()
            # These would be complex, pre-trained encoders in reality; here they
            # simply return random embeddings of the right shape.
            self.image_encoder = lambda x: torch.randn(x.size(0) if x.dim() > 3 else 1, 512)  # dummy image encoder
            self.text_encoder = lambda text_list: torch.randn(len(text_list), 512)            # dummy text encoder
            # Learned temperature, initialised as in CLIP.
            self.logit_scale = torch.nn.Parameter(torch.ones([]) * math.log(1 / 0.07))
    
        def encode_image(self, image):
            return F.normalize(self.image_encoder(image), dim=-1)
    
        def encode_text(self, text_list):
            return F.normalize(self.text_encoder(text_list), dim=-1)
    
        def forward(self, image, text_prompts_list):
            image_features = self.encode_image(image)
            text_features = self.encode_text(text_prompts_list)
            
            logit_scale = self.logit_scale.exp()
            logits = logit_scale * image_features @ text_features.t()
            return logits
    
    # --- Zero-Shot Classification Example ---
    model_clip = DummyCLIPModel()  # swap in an actual pre-trained CLIP model here
    model_clip.eval()
    
    unseen_class_names = ["emu", "hummingbird", "platypus", "armadillo"]
    text_prompts = [f"a photo of a {c}" for c in unseen_class_names]
    
    # Stand-in for a preprocessed image tensor of an unseen-class object, e.g.
    # input_image = preprocess_image('path/to/image.jpg')  # (1, C, H, W)
    input_image = torch.randn(1, 3, 224, 224)
    
    with torch.no_grad():
        # Logits for the image against all class prompts; with the dummy encoders
        # the prediction is random, whereas a real CLIP model gives meaningful results.
        logits = model_clip(input_image, text_prompts)
        probabilities = F.softmax(logits, dim=-1)
        predicted_class_index = probabilities.argmax().item()
        predicted_class_name = unseen_class_names[predicted_class_index]
        confidence = probabilities.max().item()
    
    print(f"Predicted class: {predicted_class_name} with confidence {confidence:.4f}")
    print(f"Probabilities: {probabilities.squeeze(0).tolist()}")
    

Interview Examples

What is the difference between Zero-Shot Learning (ZSL) and Generalized Zero-Shot Learning (GZSL)? Why is GZSL harder?

Explain how attribute-based ZSL works.

What is the 'hubness' problem in ZSL, and how can it affect performance?

Practice Questions

1. Explain the core concepts of Zero-Shot Learning. (Easy)

Hint: Think about the fundamental principles

2. How would you implement zero-shot classification in a production environment? (Hard)

Hint: Consider scalability and efficiency

3. What are the practical applications of Zero-Shot Learning? (Medium)

Hint: Consider both academic and industry use cases