GitHub All Languages Trending

The latest build: 2024-06-14Source of data: GitHubTrendingRSS

State-of-the-art Machine Learning for the web. Run Transformers directly in your browser, with no need for a server!

transformers.js javascript library logo

NPMNPM DownloadsjsDelivr HitsLicenseDocumentation

State-of-the-art Machine Learning for the web. Run Transformers directly in your browser, with no need for a server!

Transformers.js is designed to be functionally equivalent to Hugging Face's transformers python library, meaning you can run the same pretrained models using a very similar API. These models support common tasks in different modalities, such as:

  • Natural Language Processing: text classification, named entity recognition, question answering, language modeling, summarization, translation, multiple choice, and text generation.
  • Computer Vision: image classification, object detection, and segmentation.
  • Audio: automatic speech recognition and audio classification.
  • Multimodal: zero-shot image classification.

Transformers.js uses ONNX Runtime to run models in the browser. The best part about it, is that you can easily convert your pretrained PyTorch, TensorFlow, or JAX models to ONNX using Optimum.

For more information, check out the full documentation.

Quick tour

It's super simple to translate from existing code! Just like the python library, we support the pipeline API. Pipelines group together a pretrained model with preprocessing of inputs and postprocessing of outputs, making it the easiest way to run models with the library.

Python (original)Javascript (ours)
from transformers import pipeline# Allocate a pipeline for sentiment-analysispipe = pipeline('sentiment-analysis')out = pipe('I love transformers!')# [{'label': 'POSITIVE', 'score': 0.999806941}]
import { pipeline } from '@xenova/transformers';// Allocate a pipeline for sentiment-analysislet pipe = await pipeline('sentiment-analysis');let out = await pipe('I love transformers!');// [{'label': 'POSITIVE', 'score': 0.999817686}]

You can also use a different model by specifying the model id or path as the second argument to the pipeline function. For example:

// Use a different model for sentiment-analysislet pipe = await pipeline('sentiment-analysis', 'Xenova/bert-base-multilingual-uncased-sentiment');


To install via NPM, run:

npm i @xenova/transformers

Alternatively, you can use it in vanilla JS, without any bundler, by using a CDN or static hosting. For example, using ES Modules, you can import the library with:

<script type="module"> import { pipeline } from '[email protected]';</script>


Want to jump straight in? Get started with one of our sample applications/templates:

Whisper WebSpeech recognition w/ Whispercode, demo
Doodle DashReal-time sketch-recognition gameblog, code, demo
Code PlaygroundIn-browser code completion websitecode, demo
Semantic Image Search (client-side)Search for images with textcode, demo
Semantic Image Search (server-side)Search for images with text (Supabase)code, demo
Vanilla JavaScriptIn-browser object detectionvideo, code, demo
ReactMultilingual translation websitecode, demo
Text to speech (client-side)In-browser speech synthesiscode, demo
Browser extensionText classification extensioncode
ElectronText classification applicationcode
Next.js (client-side)Sentiment analysis (in-browser inference)code, demo
Next.js (server-side)Sentiment analysis (Node.js inference)code, demo
Node.jsSentiment analysis APIcode
Demo siteA collection of demoscode, demo

Check out the Transformers.js template on Hugging Face to get started in one click!

Custom usage

By default, Transformers.js uses hosted pretrained models and precompiled WASM binaries, which should work out-of-the-box. You can customize this as follows:


import { env } from '@xenova/transformers';// Specify a custom location for models (defaults to '/models/').env.localModelPath = '/path/to/models/';// Disable the loading of remote models from the Hugging Face Hub:env.allowRemoteModels = false;// Set location of .wasm files. Defaults to use a CDN.env.backends.onnx.wasm.wasmPaths = '/path/to/files/';

For a full list of available settings, check out the API Reference.

Convert your models to ONNX

We recommend using our conversion script to convert your PyTorch, TensorFlow, or JAX models to ONNX in a single command. Behind the scenes, it uses Optimum to perform conversion and quantization of your model.

python -m scripts.convert --quantize --model_id <model_name_or_path>

For example, convert and quantize bert-base-uncased using:

python -m scripts.convert --quantize --model_id bert-base-uncased

This will save the following files to ./models/:

bert-base-uncased/ config.json tokenizer.json tokenizer_config.json onnx/ model.onnx model_quantized.onnx

For the full list of supported architectures, see the Optimum documentation.

Supported tasks/models

Here is the list of all tasks and architectures currently supported by Transformers.js. If you don't see your task/model listed here or it is not yet supported, feel free to open up a feature request here.

To find compatible models on the Hub, select the "transformers.js" library tag in the filter menu (or visit this link). You can refine your search by selecting the task you're interested in (e.g., text-classification).


Natural Language Processing

Fill-Maskfill-maskMasking some of the words in a sentence and predicting which words should replace those masks.(docs)
Question Answeringquestion-answeringRetrieve the answer to a question from a given text.(docs)
Sentence Similaritysentence-similarityDetermining how similar two texts are.(docs)
SummarizationsummarizationProducing a shorter version of a document while preserving its important information.(docs)
Table Question Answeringtable-question-answeringAnswering a question about information from a given table.
Text Classificationtext-classification or sentiment-analysisAssigning a label or class to a given text.(docs)
Text Generationtext-generationProducing new text by predicting the next word in a sequence.(docs)
Text-to-text Generationtext2text-generationConverting one text sequence into another text sequence.(docs)
Token Classificationtoken-classification or nerAssigning a label to each token in a text.(docs)
TranslationtranslationConverting text from one language to another.(docs)
Zero-Shot Classificationzero-shot-classificationClassifying text into classes that are unseen during training.(docs)
Feature Extractionfeature-extractionTransforming raw data into numerical features that can be processed while preserving the information in the original dataset.(docs)


Depth Estimationdepth-estimationPredicting the depth of objects present in an image.(docs)
Image Classificationimage-classificationAssigning a label or class to an entire image.(docs)
Image Segmentationimage-segmentationDivides an image into segments where each pixel is mapped to an object. This task has multiple variants such as instance segmentation, panoptic segmentation and semantic segmentation.(docs)
Image-to-Imageimage-to-imageTransforming a source image to match the characteristics of a target image or a target image domain.(docs)
Mask Generationmask-generationGenerate masks for the objects in an image.
Object Detectionobject-detectionIdentify objects of certain defined classes within an image.(docs)
Video Classificationn/aAssigning a label or class to an entire video.
Unconditional Image Generationn/aGenerating images with no condition in any context (like a prompt text or another image).
Image Feature Extractionimage-feature-extractionTransforming raw data into numerical features that can be processed while preserving the information in the original image.(docs)


Audio Classificationaudio-classificationAssigning a label or class to a given audio.(docs)
Audio-to-Audion/aGenerating audio from an input audio source.
Automatic Speech Recognitionautomatic-speech-recognitionTranscribing a given audio into text.(docs)
Text-to-Speechtext-to-speech or text-to-audioGenerating natural-sounding speech given text input.(docs)


Tabular Classificationn/aClassifying a target category (a group) based on set of attributes.
Tabular Regressionn/aPredicting a numerical value given a set of attributes.


Document Question Answeringdocument-question-answeringAnswering questions on document images.(docs)
Image-to-Textimage-to-textOutput text from a given image.(docs)
Text-to-Imagetext-to-imageGenerates images from input text.
Visual Question Answeringvisual-question-answeringAnswering open-ended questions based on an image.
Zero-Shot Audio Classificationzero-shot-audio-classificationClassifying audios into classes that are unseen during training.(docs)
Zero-Shot Image Classificationzero-shot-image-classificationClassifying images into classes that are unseen during training.(docs)
Zero-Shot Object Detectionzero-shot-object-detectionIdentify objects of classes that are unseen during training.(docs)

Reinforcement Learning

Reinforcement Learningn/aLearning from actions by interacting with an environment through trial and error and receiving rewards (negative or positive) as feedback.


  1. ALBERT (from Google Research and the Toyota Technological Institute at Chicago) released with the paper ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
  2. Audio Spectrogram Transformer (from MIT) released with the paper AST: Audio Spectrogram Transformer by Yuan Gong, Yu-An Chung, James Glass.
  3. BART (from Facebook) released with the paper BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
  4. BEiT (from Microsoft) released with the paper BEiT: BERT Pre-Training of Image Transformers by Hangbo Bao, Li Dong, Furu Wei.
  5. BERT (from Google) released with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
  6. Blenderbot (from Facebook) released with the paper Recipes for building an open-domain chatbot by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
  7. BlenderbotSmall (from Facebook) released with the paper Recipes for building an open-domain chatbot by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
  8. BLOOM (from BigScience workshop) released by the BigScience Workshop.
  9. CamemBERT (from Inria/Facebook/Sorbonne) released with the paper CamemBERT: a Tasty French Language Model by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
  10. Chinese-CLIP (from OFA-Sys) released with the paper Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
  11. CLAP (from LAION-AI) released with the paper Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
  12. CLIP (from OpenAI) released with the paper Learning Transferable Visual Models From Natural Language Supervision by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
  13. CLIPSeg (from University of Göttingen) released with the paper Image Segmentation Using Text and Image Prompts by Timo Lüddecke and Alexander Ecker.
  14. CodeGen (from Salesforce) released with the paper A Conversational Paradigm for Program Synthesis by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
  15. CodeLlama (from MetaAI) released with the paper Code Llama: Open Foundation Models for Code by Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve.
  16. ConvBERT (from YituTech) released with the paper ConvBERT: Improving BERT with Span-based Dynamic Convolution by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
  17. ConvNeXT (from Facebook AI) released with the paper A ConvNet for the 2020s by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie.
  18. ConvNeXTV2 (from Facebook AI) released with the paper ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders by Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie.
  19. DeBERTa (from Microsoft) released with the paper DeBERTa: Decoding-enhanced BERT with Disentangled Attention by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
  20. DeBERTa-v2 (from Microsoft) released with the paper DeBERTa: Decoding-enhanced BERT with Disentangled Attention by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
  21. DeiT (from Facebook) released with the paper Training data-efficient image transformers & distillation through attention by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
  22. Depth Anything (from University of Hong Kong and TikTok) released with the paper Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao.
  23. DETR (from Facebook) released with the paper End-to-End Object Detection with Transformers by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.
  24. DINOv2 (from Meta AI) released with the paper DINOv2: Learning Robust Visual Features without Supervision by Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski.
  25. DistilBERT (from HuggingFace), released together with the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into DistilGPT2, RoBERTa into DistilRoBERTa, Multilingual BERT into DistilmBERT and a German version of DistilBERT.
  26. DiT (from Microsoft Research) released with the paper DiT: Self-supervised Pre-training for Document Image Transformer by Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei.
  27. Donut (from NAVER), released together with the paper OCR-free Document Understanding Transformer by Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park.
  28. DPT (from Intel Labs) released with the paper Vision Transformers for Dense Prediction by René Ranftl, Alexey Bochkovskiy, Vladlen Koltun.
  29. EfficientNet (from Google Brain) released with the paper EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks by Mingxing Tan, Quoc V. Le.
  30. ELECTRA (from Google Research/Stanford University) released with the paper ELECTRA: Pre-training text encoders as discriminators rather than generators by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
  31. ESM (from Meta AI) are transformer protein language models. ESM-1b was released with the paper Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences by Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. ESM-1v was released with the paper Language models enable zero-shot prediction of the effects of mutations on protein function by Joshua Meier, Roshan Rao, Robert Verkuil, Jason Liu, Tom Sercu and Alexander Rives. ESM-2 and ESMFold were released with the paper Language models of protein sequences at the scale of evolution enable accurate structure prediction by Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, Alexander Rives.
  32. Falcon (from Technology Innovation Institute) by Almazrouei, Ebtesam and Alobeidli, Hamza and Alshamsi, Abdulaziz and Cappelli, Alessandro and Cojocaru, Ruxandra and Debbah, Merouane and Goffinet, Etienne and Heslow, Daniel and Launay, Julien and Malartic, Quentin and Noune, Badreddine and Pannier, Baptiste and Penedo, Guilherme.
  33. FastViT (from Apple) released with the paper FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization by Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel and Anurag Ranjan.
  34. FLAN-T5 (from Google AI) released in the repository google-research/t5x by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei
  35. GLPN (from KAIST) released with the paper Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
  36. GPT Neo (from EleutherAI) released in the repository EleutherAI/gpt-neo by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy.
  37. GPT NeoX (from EleutherAI) released with the paper GPT-NeoX-20B: An Open-Source Autoregressive Language Model by Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach
  38. GPT-2 (from OpenAI) released with the paper Language Models are Unsupervised Multitask Learners by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
  39. GPT-J (from EleutherAI) released in the repository kingoflolz/mesh-transformer-jax by Ben Wang and Aran Komatsuzaki.
  40. GPTBigCode (from BigCode) released with the paper SantaCoder: don't reach for the stars! by Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra.
  41. HerBERT (from, AGH University of Science and Technology) released with the paper KLEJ: Comprehensive Benchmark for Polish Language Understanding by Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik.
  42. Hubert (from Facebook) released with the paper HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
  43. LongT5 (from Google AI) released with the paper LongT5: Efficient Text-To-Text Transformer for Long Sequences by Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang.
  44. LLaMA (from The FAIR team of Meta AI) released with the paper LLaMA: Open and Efficient Foundation Language Models by Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample.
  45. Llama2 (from The FAIR team of Meta AI) released with the paper Llama2: Open Foundation and Fine-Tuned Chat Models by Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushka rMishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing EllenTan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom.
  46. M2M100 (from Facebook) released with the paper Beyond English-Centric Multilingual Machine Translation by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
  47. MarianMT Machine translation models trained using OPUS data by Jörg Tiedemann. The Marian Framework is being developed by the Microsoft Translator Team.
  48. mBART (from Facebook) released with the paper Multilingual Denoising Pre-training for Neural Machine Translation by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
  49. mBART-50 (from Facebook) released with the paper Multilingual Translation with Extensible Multilingual Pretraining and Finetuning by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
  50. Mistral (from Mistral AI) by The Mistral AI team: Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed.
  51. MMS (from Facebook) released with the paper Scaling Speech Technology to 1,000+ Languages by Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli.
  52. MobileBERT (from CMU/Google Brain) released with the paper MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou.
  53. MobileViT (from Apple) released with the paper MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer by Sachin Mehta and Mohammad Rastegari.
  54. MobileViTV2 (from Apple) released with the paper Separable Self-attention for Mobile Vision Transformers by Sachin Mehta and Mohammad Rastegari.
  55. MPNet (from Microsoft Research) released with the paper MPNet: Masked and Permuted Pre-training for Language Understanding by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
  56. MPT (from MosaiML) released with the repository llm-foundry by the MosaicML NLP Team.
  57. MT5 (from Google AI) released with the paper mT5: A massively multilingual pre-trained text-to-text transformer by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
  58. NLLB (from Meta) released with the paper No Language Left Behind: Scaling Human-Centered Machine Translation by the NLLB team.
  59. Nougat (from Meta AI) released with the paper Nougat: Neural Optical Understanding for Academic Documents by Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic.
  60. OPT (from Meta AI) released with the paper OPT: Open Pre-trained Transformer Language Models by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al.
  61. OWL-ViT (from Google AI) released with the paper Simple Open-Vocabulary Object Detection with Vision Transformers by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.
  62. OWLv2 (from Google AI) released with the paper Scaling Open-Vocabulary Object Detection by Matthias Minderer, Alexey Gritsenko, Neil Houlsby.
  63. Phi (from Microsoft) released with the papers - Textbooks Are All You Need by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li, Textbooks Are All You Need II: phi-1.5 technical report by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee.
  64. Qwen2 (from the Qwen team, Alibaba Group) released with the paper Qwen Technical Report by Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou and Tianhang Zhu.
  65. ResNet (from Microsoft Research) released with the paper Deep Residual Learning for Image Recognition by Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun.
  66. RoBERTa (from Facebook), released together with the paper RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
  67. RoFormer (from ZhuiyiTechnology), released together with the paper RoFormer: Enhanced Transformer with Rotary Position Embedding by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
  68. SegFormer (from NVIDIA) released with the paper SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
  69. Segment Anything (from Meta AI) released with the paper Segment Anything by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
  70. SigLIP (from Google AI) released with the paper Sigmoid Loss for Language Image Pre-Training by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer.
  71. SpeechT5 (from Microsoft Research) released with the paper SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.
  72. SqueezeBERT (from Berkeley) released with the paper SqueezeBERT: What can computer vision teach NLP about efficient neural networks? by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
  73. StableLm (from Stability AI) released with the paper StableLM 3B 4E1T (Technical Report) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.
  74. Starcoder2 (from BigCode team) released with the paper StarCoder 2 and The Stack v2: The Next Generation by Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries.
  75. Swin Transformer (from Microsoft) released with the paper Swin Transformer: Hierarchical Vision Transformer using Shifted Windows by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
  76. Swin2SR (from University of Würzburg) released with the paper Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration by Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte.
  77. T5 (from Google AI) released with the paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
  78. T5v1.1 (from Google AI) released in the repository google-research/text-to-text-transfer-transformer by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
  79. Table Transformer (from Microsoft Research) released with the paper PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents by Brandon Smock, Rohith Pesala, Robin Abraham.
  80. TrOCR (from Microsoft), released together with the paper TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
  81. UniSpeech (from Microsoft Research) released with the paper UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
  82. UniSpeechSat (from Microsoft Research) released with the paper UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
  83. Vision Transformer (ViT) (from Google AI) released with the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
  84. ViTMatte (from HUST-VL) released with the paper ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers by Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang.
  85. VITS (from Kakao Enterprise) released with the paper Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech by Jaehyeon Kim, Jungil Kong, Juhee Son.
  86. Wav2Vec2 (from Facebook AI) released with the paper wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
  87. Wav2Vec2-BERT (from Meta AI) released with the paper Seamless: Multilingual Expressive and Streaming Speech Translation by the Seamless Communication team.
  88. WavLM (from Microsoft Research) released with the paper WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
  89. Whisper (from OpenAI) released with the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.
  90. XLM (from Facebook) released together with the paper Cross-lingual Language Model Pretraining by Guillaume Lample and Alexis Conneau.
  91. XLM-RoBERTa (from Facebook AI), released together with the paper Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
  92. YOLOS (from Huazhong University of Science & Technology) released with the paper You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection by Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, Wenyu Liu.

NocoBase is a scalability-first, open-source no-code/low-code platform for building business applications and enterprise solutions.

English |

We'd love your support!

nocobase%2Fnocobase | Trendshift

NocoBase - Scalability-first, open-source no-code platform | Product Hunt

Recent major updates

What is NocoBase

NocoBase is a scalability-first, open-source no-code development platform.
Instead of investing years of time and millions of dollars in research and development, deploy NocoBase in a few minutes and you'll have a private, controllable, and extremely scalable no-code development platform!


Online Demo:


Contact Us:
[email protected]

Distinctive features

1. Data model-driven

Most form-, table-, or process-driven no-code products create data structures directly in the user interface, such as Airtable, where adding a new column to a table is adding a new field. This has the advantage of simplicity of use, but the disadvantage of limited functionality and flexibility to meet the needs of more complex scenarios.

NocoBase adopts the design idea of separating the data structure from the user interface, allowing you to create any number of blocks (data views) for the data collections, with different type, styles, content, and actions in each block. This balances the simplicity of no-code operation with the flexibility of native development.


2. What you see is what you get

NocoBase enables the development of complex and distinctive business systems, but this does not mean that complex and specialized operations are required. With a single click, configuration options are displayed on the usage interface, and administrators with system configuration privileges can directly configure the user interface in a WYSIWYG manner.


3. Everything is implemented as plugins

NocoBase adopts plugin architecture, all new functions can be realized by developing and installing plugins, and expanding the functions is as easy as installing an APP on your phone.



NocoBase supports three installation methods:

  • Installing With Docker (Recommended)

    Suitable for no-code scenarios, no code to write. When upgrading, just download the latest image and reboot.

  • Installing from create-nocobase-app CLI

    The business code of the project is completely independent and supports low-code development.

  • Installing from Git source code

    If you want to experience the latest unreleased version, or want to participate in the contribution, you need to make changes and debug on the source code, it is recommended to choose this installation method, which requires a high level of development skills, and if the code has been updated, you can git pull the latest code.

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4V. GPT-4V

image InternVL Family: Closing the Gap to Commercial Multimodal Models with Open-Source Suites A Pioneering Open-Source Alternative to GPT-4V

[][ Blog][ InternVL 1.0 Paper][ InternVL 1.5 Report][ Chat Demo][ HuggingFace Demo]

[ Quick Start][ Community-hosted API][ ]

OpenGVLab%2FInternVL | Trendshift   image


  • 2024/06/04: InternVL 1.5 achieved state-of-the-art in the Image MLLM category on the Video-MME dataset, demonstrating strong generalization across multiple images, surpassing many specialized Video MLLMs and nearing the top open-source video model, LLaVA-Next-Video.
  • 2024/05/30: We release ShareGPT-4o, a groundbreaking large-scale resource that we plan to open-source with 200K meticulously annotated images, 10K videos with highly descriptive captions, and 10K audio files with detailed descriptions.
  • 2024/05/29: We release the Mini-InternVL-Chat series, which includes two models: Mini-InternVL-Chat-2B-V1-5 and Mini-InternVL-Chat-4B-V1-5. Our small models achieve impressive performance with minimal size: the 2B model delivers 80% of the performance with only 8% of the model size, and the 4B model achieves 90% of the performance with just 16% of the model size. For more details, please check our blog.
  • 2024/05/28: Thanks to the lmdeploy team for providing AWQ quantization support. The 4-bit model is available at OpenGVLab/InternVL-Chat-V1-5-AWQ.
  • 2024/05/13: InternVL can now be used as the text encoder for diffusion models to support multilingual generation natively in over 110 languages worldwide. See MuLan for more details.
  • 2024/04/28: We release the INT8 version of InternVL-Chat-V1-5, see HF link.
  • 2024/04/28: We achieve the SOTA performance (75.74) on the Infographics VQA benchmark, see here.
  • 2024/04/18: InternVL-Chat-V1-5 has been released at HF link, approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc.
  • 2024/02/27: InternVL is accepted by CVPR 2024!
  • 2024/02/24: InternVL-Chat models have been included in the VLMEvalKit.
  • 2024/02/21: InternVL-Chat-V1-2-Plus achieves SOTA performance on MathVista (59.9), MMBench (83.8), and MMVP (58.7). See our blog for more details.
  • 2024/02/12: InternVL-Chat-V1-2 has been released. It achieves 51.6 on MMMU val and 82.3 on MMBench test. For more details, please refer to our blog, SFT data or try our demo. The model is now available on HuggingFace, and both training/evaluation data and scripts are open-sourced.
  • 2024/02/04: InternVL-Chat-V1-1 achieves 44.67% on MMVP, higher than GPT-4V!
  • 2024/01/27: We release 448 resolution model, achieving 76.6 on MMBench dev, see here.
  • 2024/01/24: InternVL-Chat-V1-1 is released, it supports Chinese and has stronger OCR capability, see here.
  • 2024/01/16: We release our customized mmcv/mmsegmentation/mmdetection code, integrated with DeepSpeed, which can be used for training large-scale object detection and semantic segmentation models.


  • Installation

    • How to install the environment? [link]
  • Training or Fine-tuning

    • How to reproduce the SFT stage of InternVL-Chat-V1-2? [link]
    • How to fine-tune InternVL-Chat-V1-2 on a custom dataset? [link]
    • How to fine-tune the Mini-InternVL-Chat series on a custom dataset? [TODO]
  • Benchmark Test

    Due to minor implementation differences between this codebase and VLMEvalKit, slight discrepancies in performance metrics may occur when testing the same model.

    • How to evaluate InternVL-Chat-V1-5? [link]
    • How to evaluate InternVL-Chat-V1-5 using VLMEvalKit? (Recommend) [link]
    • How to evaluate Mini-InternVL-Chat-2B-V1-5 using VLMEvalKit? (Recommend) [link]
    • How to evaluate Mini-InternVL-Chat-4B-V1-5 using VLMEvalKit? (Recommend) [link]
  • Deployment

    • How to deploy a local demo? [link]
    • How to run InternVL-1.5 8bit with Nvidia V100 GPU? [link][]
    • How to perform batch inference? [link]
    • Inference Acceleration by LMDeploy [link][]

Compared with SOTA VLLMs


What is InternVL?

InternVL scales up the ViT to 6B parameters and aligns it with LLM.

Model Zoo

Vision Large Language Model

MiniInternVLChat4BV152024.05.28HF link 16% of the model size, 90% of the performance
Mini-InternVL-Chat-2B-V1-52024.05.19HF link 8% of the model size, 80% of the performance
InternVL-Chat-V1-5-AWQ2024.05.28HF linkThe 4-bit version of InternVL-Chat-V1-5
InternVL-Chat-V1-5-Int82024.04.28HF linkThe 8-bit version of InternVL-Chat-V1-5
InternVL-Chat-V1-52024.04.18HF linksupport 4K image; super strong OCR; Approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc. (new)
InternVL-Chat-V1-2-Plus2024.02.21HF linkmore SFT data and stronger
InternVL-Chat-V1-22024.02.11HF linkscaling up LLM to 34B
InternVL-Chat-V1-12024.01.24HF linksupport Chinese and stronger OCR
InternVL-Chat-19B-448px2024.02.03HF link448 resolution
InternVL-Chat-19B2023.12.25HF linkEnglish multimodal dialogue
InternVL-Chat-13B2023.12.25HF linkEnglish multimodal dialogue

Vision-Language Foundation Model

InternViT-300M-448px2024.05.25HF linkdistilled small vision foundation model with 300M parameters (new)
InternViT-6B-448px-V1-52024.04.20HF linksupport dynamic resolution, super strong OCR (new)
InternViT-6B-448px-V1-22024.02.11HF link448 resolution
InternViT6B448pxV102024.01.30HF link448 resolution
InternViT-6B-224px2023.12.22HF linkvision foundation model
InternVL-14B-224px2023.12.22HF linkvision-language foundation model, InternViT-6B + QLLaMA, can be used for image-text retrival like CLIP

What can InternVL do?

Visual Perception (click to expand)
  • Linear-Probe Image Classification [see details]

    ViT-22B uses the private JFT-3B dataset.

    InternViT-6B (ours)5.9B88.290.479.977.589.869.1
  • Semantic Segmentation [see details]

    methoddecoder#param (train/total)crop sizemIoU
    OpenCLIP-G (frozen)Linear0.3M / 1.8B51239.3
    ViT-22B (frozen)Linear0.9M / 21.7B50434.6
    InternViT-6B (frozen)Linear0.5M / 5.9B50447.2 (+12.6)
    ViT-22B (frozen)UperNet0.8B / 22.5B50452.7
    InternViT-6B (frozen)UperNet0.4B / 6.3B50454.9 (+2.2)
    ViT-22BUperNet22.5B / 22.5B50455.3
    InternViT-6BUperNet6.3B / 6.3B50458.9 (+3.6)
  • Zero-Shot Image Classification [see details]

    InternVL-C (ours)83.283.895.577.373.980.6
  • Multilingual Zero-Shot Image Classification [see details]

    EN: English, ZH: Chinese, JP: Japanese, Ar: Arabic, IT: Italian

    methodIN-1K (EN)IN-1K (ZH)IN-1K (JP)IN-1K (AR)IN-1K (IT)
    InternVL-C (ours)83.264.561.544.965.7
  • Zero-Shot Video Classification [see details]

    InternVL-C (ours)171.071.365.7
    InternVL-C (ours)879.478.871.5
Cross-Modal Retrieval (click to expand)
  • English Zero-Shot Image-Text Retrieval [see details]

    InternVL-C (ours)94.799.699.981.796.098.270.689.093.554.177.384.686.6
    InternVL-G (ours)95.799.799.985.097.098.674.991.395.258.681.388.088.8
  • Chinese Zero-Shot Image-Text Retrieval [see details]

    InternVL-C (ours)90.398.899.775.192.996.468.892.096.768.991.996.589.0
    InternVL-G (ours)92.999.499.877.794.897.371.493.997.773.894.498.190.9
  • Multilingual Zero-Shot Image-Text Retrieval on XTD [see details]

    InternVL-C (ours)97.395.795.195.696.092.293.395.595.1
    InternVL-G (ours)98.697.796.596.796.995.194.896.196.6
Multimodal Dialogue (see "Compared with SOTA VLLMs")

Quick Start with Huggingface

using InternViT-6B (click to expand)
import torchfrom PIL import Imagefrom transformers import AutoModel, CLIPImageProcessormodel = AutoModel.from_pretrained( 'OpenGVLab/InternViT-6B-224px', torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, trust_remote_code=True).cuda().eval()image ='./examples/image1.jpg').convert('RGB')image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-224px')pixel_values = image_processor(images=image, return_tensors='pt').pixel_valuespixel_values = = model(pixel_values)
using InternVL-C(ontrastive) and InternVL-G(enerative) (click to expand)
import torchfrom PIL import Imagefrom transformers import AutoModel, CLIPImageProcessorfrom transformers import AutoTokenizermodel = AutoModel.from_pretrained( 'OpenGVLab/InternVL-14B-224px', torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, trust_remote_code=True).cuda().eval()image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternVL-14B-224px')tokenizer = AutoTokenizer.from_pretrained( 'OpenGVLab/InternVL-14B-224px', use_fast=False, add_eos_token=True)tokenizer.pad_token_id = 0 # set pad_token_id to 0images = ['./examples/image1.jpg').convert('RGB'),'./examples/image2.jpg').convert('RGB'),'./examples/image3.jpg').convert('RGB')]prefix = 'summarize:'texts = [ prefix + 'a photo of a red panda', # English prefix + '', # Chinese prefix + '' # Japanese]pixel_values = image_processor(images=images, return_tensors='pt').pixel_valuespixel_values = = tokenizer(texts, return_tensors='pt', max_length=80, truncation=True, padding='max_length').input_ids.cuda()# InternVL-Clogits_per_image, logits_per_text = model( image=pixel_values, text=input_ids, mode='InternVL-C')probs = logits_per_image.softmax(dim=-1)# tensor([[9.9609e-01, 5.2185e-03, 6.0070e-08],# [2.2949e-02, 9.7656e-01, 5.9903e-06],# [3.2932e-06, 7.4863e-05, 1.0000e+00]], device='cuda:0',# dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)# InternVL-Glogits_per_image, logits_per_text = model( image=pixel_values, text=input_ids, mode='InternVL-G')probs = logits_per_image.softmax(dim=-1)# tensor([[9.9609e-01, 3.1738e-03, 3.6322e-08],# [8.6060e-03, 9.9219e-01, 2.8759e-06],# [1.7583e-06, 3.1233e-05, 1.0000e+00]], device='cuda:0',# dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)# please set add_eos_token to False for generationtokenizer.add_eos_token = Falseimage ='./examples/image1.jpg').convert('RGB')pixel_values = image_processor(images=image, return_tensors='pt').pixel_valuespixel_values = = tokenizer("English caption:", return_tensors='pt')pred = model.generate( pixel_values=pixel_values, input_ids=tokenized.input_ids.cuda(), attention_mask=tokenized.attention_mask.cuda(), num_beams=5, min_new_tokens=8,)caption = tokenizer.decode(pred[0].cpu(), skip_special_tokens=True).strip()# English caption: a red panda sitting on top of a wooden platform
using InternVL-Chat (click to expand)
from transformers import AutoTokenizer, AutoModelimport torchimport torchvision.transforms as Tfrom PIL import Imagefrom torchvision.transforms.functional import InterpolationModeIMAGENET_MEAN = (0.485, 0.456, 0.406)IMAGENET_STD = (0.229, 0.224, 0.225)def build_transform(input_size): MEAN, STD = IMAGENET_MEAN, IMAGENET_STD transform = T.Compose([ T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img), T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC), T.ToTensor(), T.Normalize(mean=MEAN, std=STD) ]) return transformdef find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size): best_ratio_diff = float('inf') best_ratio = (1, 1) area = width * height for ratio in target_ratios: target_aspect_ratio = ratio[0] / ratio[1] ratio_diff = abs(aspect_ratio - target_aspect_ratio) if ratio_diff < best_ratio_diff: best_ratio_diff = ratio_diff best_ratio = ratio elif ratio_diff == best_ratio_diff: if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]: best_ratio = ratio return best_ratiodef dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False): orig_width, orig_height = image.size aspect_ratio = orig_width / orig_height # calculate the existing image aspect ratio target_ratios = set( (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if i * j <= max_num and i * j >= min_num) target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1]) # find the closest aspect ratio to the target target_aspect_ratio = find_closest_aspect_ratio( aspect_ratio, target_ratios, orig_width, orig_height, image_size) # calculate the target width and height target_width = image_size * target_aspect_ratio[0] target_height = image_size * target_aspect_ratio[1] blocks = target_aspect_ratio[0] * target_aspect_ratio[1] # resize the image resized_img = image.resize((target_width, target_height)) processed_images = [] for i in range(blocks): box = ( (i % (target_width // image_size)) * image_size, (i // (target_width // image_size)) * image_size, ((i % (target_width // image_size)) + 1) * image_size, ((i // (target_width // image_size)) + 1) * image_size ) # split the image split_img = resized_img.crop(box) processed_images.append(split_img) assert len(processed_images) == blocks if use_thumbnail and len(processed_images) != 1: thumbnail_img = image.resize((image_size, image_size)) processed_images.append(thumbnail_img) return processed_imagesdef load_image(image_file, input_size=448, max_num=6): image ='RGB') transform = build_transform(input_size=input_size) images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num) pixel_values = [transform(image) for image in images] pixel_values = torch.stack(pixel_values) return pixel_valuespath = "OpenGVLab/InternVL-Chat-V1-5"# If you have an 80G A100 GPU, you can put the entire model on a single GPU.model = AutoModel.from_pretrained( path, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, trust_remote_code=True).eval().cuda()# Otherwise, you need to set device_map='auto' to use multiple GPUs for inference.# model = AutoModel.from_pretrained(# path,# torch_dtype=torch.bfloat16,# low_cpu_mem_usage=True,# trust_remote_code=True,# device_map='auto').eval()tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)# set the max number of tiles in `max_num`pixel_values = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()generation_config = dict( num_beams=1, max_new_tokens=512, do_sample=False,)# single-round single-image conversationquestion = "" # Please describe the picture in detailresponse =, pixel_values, question, generation_config)print(question, response)# multi-round single-image conversationquestion = "" # Please describe the picture in detailresponse, history =, pixel_values, question, generation_config, history=None, return_history=True)print(question, response)question = "" # Please write a poem according to the pictureresponse, history =, pixel_values, question, generation_config, history=history, return_history=True)print(question, response)# multi-round multi-image conversationpixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()pixel_values =, pixel_values2), dim=0)question = "" # Describe the two pictures in detailresponse, history =, pixel_values, question, generation_config, history=None, return_history=True)print(question, response)question = "" # What are the similarities and differences between these two picturesresponse, history =, pixel_values, question, generation_config, history=history, return_history=True)print(question, response)# batch inference (single image per sample)pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()image_counts = [pixel_values1.size(0), pixel_values2.size(0)]pixel_values =, pixel_values2), dim=0)questions = ["Describe the image in detail."] * len(image_counts)responses = model.batch_chat(tokenizer, pixel_values, image_counts=image_counts, questions=questions, generation_config=generation_config)for question, response in zip(questions, responses): print(question) print(response)

Inference Acceleration by LMDeploy

We recommend using LMDeploy, if InternVL-Chat model inference optimization is required.

In the following subsections, we will introduce the usage of LMDeploy with the InternVL-Chat-V1-5 model as an example.

First of all, please setup the inference environment as follows:

conda create -n internvl python=3.10 -yconda activate internvlpip install timm torchvision==0.17.2pip install lmdeploy

LMDeploy pypi package depends on CUDA 12.x by default. For a CUDA 11.x environment, please refer to the installation guide.

Offline Inference Pipeline

from lmdeploy import pipelinefrom lmdeploy.vl import load_imagepipe = pipeline('OpenGVLab/InternVL-Chat-V1-5')image = load_image('examples/image2.jpg')response = pipe(('describe this image', image))print(response)

For more on using the VLM pipeline, including multi-image inference or multi-turn chat, please overview this guide.

Online Inference Service

LMDeploy supports one-click packaging of the VLM model into an OpenAI service, providing seamless integration with the OpenAI API.

The service can be launched by one command as below:

lmdeploy serve api_server OpenGVLab/InternVL-Chat-V1-5

The arguments of api_server can be viewed through the command lmdeploy serve api_server -h, for instance, --tp to set tensor parallelism, --session-len to specify the max length of the context window, --cache-max-entry-count to adjust the GPU mem ratio for k/v cache etc.

For more details, including service startup with docker, RESTful API information, and openai integration methods, please refer to this guide.


This project is released under the MIT license. Parts of this project contain code and models from other sources, which are subject to their respective licenses.


If you find this project useful in your research, please consider cite:

@article{chen2023internvl, title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks}, author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng}, journal={arXiv preprint arXiv:2312.14238}, year={2023}}@article{chen2024far, title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites}, author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others}, journal={arXiv preprint arXiv:2404.16821}, year={2024}}


InternVL is built with reference to the code of the following projects: OpenAI CLIP, Open CLIP, CLIP Benchmark, EVA, InternImage, ViT-Adapter, MMSegmentation, Transformers, DINOv2, BLIP-2, Qwen-VL, and LLaVA-1.5. Thanks for their awesome work!

If you want to join our WeChat group, please scan the following QR Code to add our assistant as a Wechat friend: