
Improving Search Quality for Non-English Queries with Fine-tuned Multilingual CLIP Models

  1. OpenAI
    We’re going to look at a model that LAION has trained on a broad multilingual dataset: the [xlm-roberta-base-ViT-B-32](https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k) CLIP model, which pairs the [ViT-B/32](https://github.com/google-research/vision_transformer) image encoder with the [XLM-RoBERTa](https://huggingface.co/xlm-roberta-large) multilingual language model. Both encoders are pre-trained (a loading sketch follows this list).


  2. Common Crawl
    LAION then co-trained the two encoders on the multilingual LAION-5B dataset, which contains 5.85 billion image-text pairs: 2.2 billion of these pairs are captioned in 100+ non-English languages, with the rest in English or carrying text that can’t be attributed to any single language (place names and other proper nouns, for example). The pairs are taken from a sampling of images and their HTML alt-text in the Common Crawl web archive (a ranking sketch follows this list).

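To make the architecture above concrete, here is a minimal sketch of loading that checkpoint and embedding queries in several languages. It assumes the `open_clip_torch` package and pulls the weights straight from the Hugging Face repo linked above via the `hf-hub:` prefix; the example queries and the L2-normalisation step are illustrative choices, not part of the original article.

```python
import torch
import open_clip

# Load the pre-trained multilingual CLIP checkpoint from the Hugging Face Hub
# (assumes the open_clip_torch package is installed).
model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k"
)
tokenizer = open_clip.get_tokenizer(
    "hf-hub:laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k"
)
model.eval()

# The same query in three languages -- the XLM-RoBERTa text tower maps them
# all into the same embedding space as the ViT-B/32 image tower.
queries = ["a photo of a dog", "ein Foto von einem Hund", "犬の写真"]
tokens = tokenizer(queries)

with torch.no_grad():
    text_embeddings = model.encode_text(tokens)
    # L2-normalise so that dot products are cosine similarities.
    text_embeddings = text_embeddings / text_embeddings.norm(dim=-1, keepdim=True)

print(text_embeddings.shape)  # one embedding vector per query
```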
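Because the two encoders were co-trained on those image-text pairs, a non-English query can be scored directly against image embeddings. The sketch below ranks a handful of local images against a Spanish query; the file names and the softmax scaling factor of 100 are illustrative assumptions, and the checkpoint is the same one loaded above.

```python
import torch
from PIL import Image
import open_clip

# Load the co-trained encoders (same checkpoint as in the sketch above).
model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k"
)
tokenizer = open_clip.get_tokenizer(
    "hf-hub:laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k"
)
model.eval()

# Candidate images to rank (hypothetical file names).
image_paths = ["cat.jpg", "dog.jpg", "snowy_mountain.jpg"]
images = torch.stack([preprocess(Image.open(p)) for p in image_paths])

# A non-English query: "a dog playing in the snow" in Spanish.
query = tokenizer(["un perro jugando en la nieve"])

with torch.no_grad():
    image_embeddings = model.encode_image(images)
    text_embedding = model.encode_text(query)
    image_embeddings = image_embeddings / image_embeddings.norm(dim=-1, keepdim=True)
    text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)

    # Cosine similarities between the query and every candidate image,
    # converted into a probability-like ranking over the candidates.
    scores = (100.0 * text_embedding @ image_embeddings.T).softmax(dim=-1)

for path, score in zip(image_paths, scores.squeeze(0).tolist()):
    print(f"{path}: {score:.3f}")
```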
