deep-floyd/IF

The DeepFloyd Lab at StabilityAI has introduced DeepFloyd IF, an open-source text-to-image model that generates photorealistic images based on text prompts. The model is composed of a frozen text encoder and three cascaded pixel diffusion modules that generate images of increasing resolution. The model utilizes a frozen text encoder based on the T5 transformer to extract text embeddings, which are then fed into a UNet architecture enhanced with cross-attention and attention pooling. The model outperforms current state-of-the-art models, achieving a zero-shot FID score of 6.66 on the COCO dataset. The model is highly efficient and requires a minimum of 16GB vRAM to use. The model is available for use in Dream, Style Transfer, Super Resolution, and Inpainting modes. The model is integrated with the Hugging Face Diffusers library, which allows users to customize the image generation process and inspect intermediate results easily. The models available in this codebase have known limitations and biases, and the code is released under a bespoke license. The creators of DeepFloyd IF are Alex Shonenkov, Misha Konstantinov, Daria Bakshandaeva, Christoph Schuhmann, Ksenia Ivanova, and Nadiia Klokova. The model was trained with invaluable support from StabilityAI and its CEO Emad Mostaque, LAION, and Huggingface teams.

full article

Leave a Reply