Vision & Multimedia Processing
530 models available
Explore image generation, video AI, speech recognition, and music synthesis models.
stable-diffusion-xl-base-1.0
license: openrail++; tags: text-to-image, stable-diffusion. SDXL consists of an ensemble of experts pipeline for latent diffusion: In a first step, the ba...
Kokoro-82M
license: apache-2.0; language: en; base_model: yl4579/StyleTTS2-LJSpeech; pipeline_tag: text-to-speech. Kokoro is an open-weight TTS model with 82 million parameters. D...
whisper-large-v3
language: en, zh, de, es, ru, ko, fr, ja, pt, tr, pl, ca, nl, ar, sv, it, id, hi, fi, vi, he, uk, el, ms, cs, ro, da, hu, ta, no, th, ur, hr, ...
XTTS-v2
license: other; license_name: coqui-public-model-license; license_link: https://coqui.ai/cpml; library_name: coqui; pipeline_tag: text-to-speech; widget text: "Once when I was si...
whisper-large-v3-turbo
language: en, zh, de, es, ru, ko, fr, ja, pt, tr, pl, ca, nl, ar, sv, it, id, hi, fi, vi, he, uk, el, ms, cs, ro, da, hu, ta, no, th, ur, hr...
blip-image-captioning-large
pipeline_tag: image-to-text; tags: image-captioning; languages: en; license: bsd-3-clause. Model card for image captioning pretrained on COCO dataset - base architecture (w...
speaker-diarization-3.1
No description available.
stable-diffusion-v1-5
license: creativeml-openrail-m; tags: stable-diffusion, stable-diffusion-diffusers, text-to-image; inference: true. Modifications to the original model card are in red or ...
vit-gpt2-image-captioning
tags: image-to-text, image-captioning; license: apache-2.0; widget src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg, example_title: Savanna...
detr-resnet-50
license: apache-2.0; tags: object-detection, vision; datasets: coco; widget src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg, example_titl...
vit-base-patch16-224
license: apache-2.0; tags: vision, image-classification; datasets: imagenet-1k, imagenet-21k; widget src: https://huggingface.co/datasets/mishig/sample_images/resolve/mai...
blip-image-captioning-base
pipeline_tag: image-to-text; tags: image-captioning; languages: en; license: bsd-3-clause. Model card for image captioning pretrained on COCO dataset - base architecture (w...
whisper-small
language: en, zh, de, es, ru, ko, fr, ja, pt, tr, pl, ca, nl, ar, sv, it, id, hi, fi, vi, he, uk, el, ms, cs, ro, da, hu, ta, no, th, ur, hr, ...
whisper-tiny
language: en, zh, de, es, ru, ko, fr, ja, pt, tr, pl, ca, nl, ar, sv, it, id, hi, fi, vi, he, uk, el, ms, cs, ro, da, hu, ta, no, th, ur, hr, ...
table-transformer-detection
license: mit; widget src: https://www.invoicesimple.com/wp-content/uploads/2018/06/Sample-Invoice-printable.png, example_title: Invoice. Table Transformer (DETR) model trai...
wav2vec2-base-960h
language: en; datasets: librispeech_asr; tags: audio, automatic-speech-recognition, hf-asr-leaderboard; license: apache-2.0; widget example_title: Librispeech sample 1, src...
distil-large-v3
language: en; license: mit; library_name: transformers; tags: audio, automatic-speech-recognition, transformers.js; widget example_title: LibriSpeech sample 1, src: https:/...
FLUX.1-dev
No description available.
speaker-diarization
No description available.
FLUX.1-schnell
No description available.
F5-TTS
license: cc-by-nc-4.0; pipeline_tag: text-to-speech; library_name: f5-tts; datasets: amphion/Emilia-Dataset. Download F5-TTS or E2 TTS and place under ckpts/. Github: https://...
BiRefNet
library_name: birefnet; tags: background-removal, mask-generation, Dichotomous Image Segmentation, Camouflaged Object Detection, Salient Object Detection, pytorch_model_h...
chatterbox
license: mit; language: ar, da, de, el, en, es, fi, fr, he, hi, it, ja, ko, ms, nl, no, pl, pt, ru, sv, sw, tr, zh; pipeline_tag: text-to-speech; tags: t...
sd-turbo
pipeline_tag: text-to-image; inference: false. SD-Turbo is a fast generative text-to-image model that can synthesize photorealistic images from a text prompt in a sing...