
How Can Diffusion Make Something New? by Dr. Timothy Smith


Photo Source: Unsplash



As things fall apart, it can become clearer how they were put together.


The birth of large language models (LLMs) and the ensuing avalanche of chatbots now permeate every level of the digital world, from text search results in Google and Microsoft's Edge to automated computer code generation in tools such as OpenAI's ChatGPT and Anthropic's Claude. These powerful LLMs, built on language-processing models known as transformers, grabbed the headlines; at the same time, however, another type of model emerged on the scene: the diffusion model. Diffusion models such as DALL-E from OpenAI and Stable Diffusion from Stability AI generate images from simple text prompts. The name DALL-E references both the surrealist painter Salvador Dalí and the Pixar robot character WALL-E.

 

LLMs such as ChatGPT learn by reading billions of words from the internet and digital libraries, capturing which words are most likely to come before and after any given word. During training, the program hides words and asks the model to guess them, much like a fill-in-the-blank test. Each time the model checks a guess against the hidden word, it adjusts the probabilities it has learned, gradually getting better at predicting likely text. Diffusion models approach the problem of generating images in a very different way from this fill-in-the-blank-and-check-the-answer routine.
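The fill-in-the-blank idea can be sketched in a few lines of code. This toy version simply counts which word appears between each pair of neighbors in a tiny made-up corpus and guesses the most common one; a real LLM learns these probabilities with a neural network over billions of words, so treat the corpus, the table, and the `fill_blank` helper as illustrative assumptions only.

```python
from collections import Counter

# Toy fill-in-the-blank "training": count which word appears in each
# surrounding context, then guess the most probable word for a blank.
# (Real LLMs learn these probabilities with neural networks, not tables.)
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat sat on the rug",
]

# Record word counts keyed by (previous word, next word) context.
context_counts = {}
for sentence in corpus:
    words = sentence.split()
    for i in range(1, len(words) - 1):
        ctx = (words[i - 1], words[i + 1])
        context_counts.setdefault(ctx, Counter())[words[i]] += 1

def fill_blank(prev_word, next_word):
    """Guess the hidden word from its two neighbors."""
    counts = context_counts.get((prev_word, next_word))
    return counts.most_common(1)[0][0] if counts else None

print(fill_blank("the", "sat"))  # "cat": it fills this blank most often
```

Asked to fill "the ___ sat", the table picks "cat" because that word filled the blank most often during training, which is the same statistical intuition behind a real model's word guesses.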

 

Diffusion models learn by randomly changing little bits of an image over time. The model adds random noise to a picture, step by step, until the entire picture becomes an unrecognizable blur, and it remembers how the noise changed the picture at each step. Think of a picture such as the Mona Lisa by Leonardo da Vinci or a snapshot of your dog. Now imagine slowly and randomly changing little bits of the picture until the background has no features at all and all that remains are hints of, perhaps, the eyes and nose. Eventually, with more random noise, the picture looks like a featureless cloud. The program remembers how the noise transformed the original picture and associates that record with a description of the original image: the Mona Lisa, or your dog named Henry.
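This gradual blurring can be simulated directly. The sketch below blends a little Gaussian noise into a stand-in 8x8 "picture" on each step and keeps snapshots along the way, just as the training process records how noise changed the image. The 0.1 step size and the tiny array are illustrative assumptions; real models such as Stable Diffusion use a carefully designed noise schedule over full-sized images.

```python
import numpy as np

# Forward diffusion sketch: repeatedly blend a small amount of Gaussian
# noise into an "image" until nothing recognizable remains.
rng = np.random.default_rng(0)
image = np.linspace(0.0, 1.0, 64).reshape(8, 8)  # stand-in 8x8 picture

noisy = image.copy()
snapshots = [noisy.copy()]  # remember how noise changed the picture
for step in range(50):
    noise = rng.normal(0.0, 1.0, size=noisy.shape)
    # Blend: keep most of the picture, mix in a little fresh noise.
    noisy = np.sqrt(1 - 0.1) * noisy + np.sqrt(0.1) * noise
    snapshots.append(noisy.copy())

# Resemblance to the original fades as the picture turns to pure noise.
print(np.corrcoef(image.ravel(), snapshots[1].ravel())[0, 1])   # still high
print(np.corrcoef(image.ravel(), snapshots[-1].ravel())[0, 1])  # near zero
```

After one step the noisy picture still correlates strongly with the original; after fifty steps the correlation is close to zero, the numerical version of the "featureless cloud."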

 

After training on images and random noise, the model can be run in reverse: it generates new images by removing noise from a noisy starting point, guided by a requested outcome, such as an image like the Mona Lisa or a picture of a dog. Try thinking of a drop of deep blue ink just above a glass bowl of crystal-clear water. Now imagine the drop hitting the water. At first, the ink drop remains dark and round, but over time it spreads out in wispy tendrils until the ink diffuses throughout the bowl, turning the water a uniform pale blue. The diffusion model captures how this "noise" slowly breaks up the drop of ink. Suppose you wanted an image of a drop of blue ink just above a bowl of water. A diffusion model can remove noise from a random image, step by step, until it arrives at an image of the ink drop. Because the noising and denoising steps incorporate randomness, the generated image will not be identical to any image the model trained on, only similar.
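The reverse direction can be sketched in the same spirit. A trained model would use a neural network to predict the noise to subtract at each step; here a stand-in "denoiser" simply nudges pixels toward a single remembered training image, an illustrative assumption rather than a real network. Starting from pure noise, the loop removes a little predicted noise each step while injecting a little fresh randomness, so the result resembles the training image without exactly reproducing it.

```python
import numpy as np

# Reverse (denoising) diffusion sketch. The "predicted noise" here comes
# from comparing against one remembered target image -- a stand-in for
# what a trained neural network would predict.
rng = np.random.default_rng(1)
target = np.linspace(0.0, 1.0, 64).reshape(8, 8)  # the "training image"

image = rng.normal(0.0, 1.0, size=(8, 8))  # start from pure noise
for step in range(100):
    predicted_noise = image - target            # what the model "thinks" is noise
    image = image - 0.05 * predicted_noise      # remove a little of it
    image = image + 0.01 * rng.normal(size=image.shape)  # keep some randomness

# The result is close to the training image but not identical to it,
# because every denoising step injects fresh randomness.
print(float(np.abs(image - target).mean()))
```

The final average pixel difference is small but never exactly zero, which mirrors why a real diffusion model produces images that are similar to, but never copies of, what it saw during training.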

 

Diffusion models such as DALL-E and Stable Diffusion make generating images, such as a cat driving a car or wearing a tuxedo, easy and fast. Fun examples like these should not obscure the more serious applications of diffusion models in pharmaceutical research, visual effects for movies and photography, and surveillance imagery. In pharmaceutical research, diffusion models accelerate the design of new therapeutics with desired properties such as effectiveness, novelty, and safety. In surveillance, diffusion models help identify criminal suspects in poor-quality images; however, the threat to privacy grows with models that can unblur the faces of people who want to remain private. Diffusion models work in an intriguingly different way from LLMs, and they complicate the tension between privacy and safety. Ultimately, these tools speed up necessary research and give creatives a new way to generate imagery.





Dr. Smith’s career in scientific and information research spans bioinformatics, artificial intelligence, toxicology, and chemistry. He has published a number of peer-reviewed scientific papers and has spent the past seventeen years developing advanced analytics, machine learning, and knowledge management tools to enable research and support high-level decision making. Tim completed his Ph.D. in Toxicology at Cornell University and a Bachelor of Science in Chemistry at the University of Washington.


You can buy his book on Amazon in paperback and in Kindle format here.





 


