DeepFloyd, a research group backed by Stability AI, has unveiled DeepFloyd IF, a text-to-image model that can integrate text into images. Trained on a dataset of more than a billion images paired with text, DeepFloyd IF can create an image from a prompt like “a teddy bear wearing a shirt that reads ‘Deep Floyd’,” optionally in a range of styles. To generate images, DeepFloyd IF stacks several processes together in a modular architecture. The model is particularly good at understanding complex prompts, including the spatial relationships they describe. It can generate legible, correctly spelled text in images and can understand prompts in multiple languages. DeepFloyd IF is expected to unlock a wave of new generative art possibilities, including logo design, web design, posters, billboards, and even memes. However, the images it generates are not quite as aesthetically pleasing as those of some other diffusion models. The model may also be biased, as texts and images from communities and cultures that use other languages are likely to be insufficiently accounted for. And like other open-source generative models, DeepFloyd IF could be used for harm, such as generating pornographic celebrity deepfakes and graphic depictions of violence.
