InstructPix2Pix: Learning to Follow Image Editing Instructions
This paper introduces a method for editing images from human instructions: given an input image and a written instruction telling the model what to do, the model follows the instruction to edit the image. Training data is generated by combining the knowledge of two large pretrained models, a language model (GPT-3) and a text-to-image model (Stable Diffusion), to produce a large dataset of image-editing examples. The proposed conditional diffusion model, InstructPix2Pix, is trained on this generated data and generalizes to real images and user-written instructions at inference time. Because edits are performed in a single forward pass, without per-example fine-tuning or inversion, the model edits images quickly and produces compelling results for a diverse collection of input images and written instructions.
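To make the single-forward-pass editing concrete, here is a minimal inference sketch using the Hugging Face diffusers library's StableDiffusionInstructPix2PixPipeline. The model id, input URL, instruction text, and guidance values are assumptions for illustration, not part of the paper itself.

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

# Load a released InstructPix2Pix checkpoint (model id assumed for illustration).
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

# Any RGB input image works; the URL here is a placeholder.
image = load_image("https://example.com/input.jpg")

# A single forward pass performs the edit; no per-example fine-tuning or inversion.
edited = pipe(
    "make it look like a watercolor painting",  # the written instruction
    image=image,
    num_inference_steps=20,
    guidance_scale=7.5,        # strength of conditioning on the instruction
    image_guidance_scale=1.5,  # how closely to stay near the input image
).images[0]

edited.save("edited.jpg")
```

The two guidance scales reflect the model's dual conditioning: one controls adherence to the text instruction, the other controls fidelity to the input image, and trading them off changes how aggressive the edit is.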