
Introducing the Diffusion Taxonomy

The goal of the project is to let anyone online quickly understand what kinds of images are possible for specific concepts with these diffusion models. We want to improve the models by exposing what they can and cannot do. It is the largest online searchable database of Stable Diffusion generated images, with a taxonomy for anyone to research. You can check it out here. We are not focused on the large, complex prompts you can find at other places. With Diffusion Taxonomy, you can quickly find out whether the model can represent a specific topic, or find semantically related concepts to use instead.


We have collected our own taxonomy of concepts that spans different types of objects, tools, cars, clothes, mammals, birds, emotions, verbs, adjectives, and more. You can browse them all here. We created the taxonomy using a combination of WordNet, ImageNet, Open Images, government data, and our own classification. This is where most of the work went and, to be honest, it is not something we are completely done with yet. Getting the right taxonomy is a combination of art and science. We don't want to just classify every concept into a big tree; we also want to make it easily searchable and usable. So we continue to refine it and add new classifications. Currently, the taxonomy is made up of ~21,000 concepts, and we are adding more taxonomies as we speak.
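To make the tree-plus-search idea concrete, here is a minimal sketch of how such a concept taxonomy could be represented. The concept names and structure below are illustrative examples only, not our actual data or code:

```python
from dataclasses import dataclass, field

# Illustrative sketch: a concept taxonomy as a tree of named nodes.
# The node names here are made-up examples, not the real taxonomy.
@dataclass
class Concept:
    name: str
    children: list["Concept"] = field(default_factory=list)

    def add(self, name: str) -> "Concept":
        child = Concept(name)
        self.children.append(child)
        return child

    def find(self, name: str) -> "Concept | None":
        """Depth-first search for a concept by name."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None

    def path_to(self, name: str) -> "list[str] | None":
        """Root-to-concept path, useful for rendering breadcrumbs."""
        if self.name == name:
            return [self.name]
        for child in self.children:
            sub = child.path_to(name)
            if sub is not None:
                return [self.name] + sub
        return None

# Build a tiny example hierarchy.
root = Concept("entity")
animal = root.add("animal")
mammal = animal.add("mammal")
mammal.add("dog")
mammal.add("cat")

print(root.path_to("cat"))  # ['entity', 'animal', 'mammal', 'cat']
```

A real version would of course load the tree from the WordNet/ImageNet-derived data rather than building it by hand, but the search-and-breadcrumb pattern is the same.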

There are already a few databases, such as DiffusionDB and Lexica, but their goals are all different.


The definition of compositionality in artificial intelligence refers to the ability of AIs and humans to understand concepts as the product of other, smaller concepts. If you understand red, fat, run, and dog, you should be able to understand "red fat running dog". Many artificial intelligence researchers believe that solving compositionality may "solve AI".

When these diffusion models first came out, many researchers were excited to see that they seemed to understand some form of compositionality; after all, you can feed them any string and they will render an image for it. After much testing, though, it's clear that these models' understanding of the world is still very basic and does not seem to be truly compositional. You can see this by trying simple queries like "red fat running dog", "dog under a car", or "car under a dog": the model messes up on basic things. We aim to explore compositionality in future updates.
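A quick way to probe this yourself is to generate prompt pairs that swap the roles of the objects and then compare the renders side by side. A minimal sketch, where the word lists are just examples and not a fixed benchmark:

```python
from itertools import permutations

# Illustrative sketch: build relation-swapped prompt pairs to probe
# compositionality, e.g. "dog under a car" vs. "car under a dog".
objects = ["dog", "car"]             # example nouns, not a benchmark
relations = ["under", "on top of"]   # example spatial relations

prompts = [
    f"{a} {rel} a {b}"
    for rel in relations
    for a, b in permutations(objects, 2)
]

for p in prompts:
    print(p)
# dog under a car
# car under a dog
# dog on top of a car
# car on top of a dog
```

Feeding each pair to the model and checking whether the swapped prompts actually produce swapped scenes is a simple, repeatable compositionality test.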

A single keyword such as "blackberry" can have multiple concepts embedded in it:

When you search for "apple", you get images of both the apple fruit and Apple, the iPhone maker:

So these keywords can have multiple concepts embedded in the model's neural network weights.

You can force specific images to come out with apples or blackberries, but this is where prompt engineering comes in: you have to figure out what combination of words will get you something like "apple logos hanging off an apple tree next to a blueberry bush".

Currently, the database has 1.5 million images and it is growing every day.

For many concepts, Stable Diffusion generates a bad image, most likely because the text the image was trained on is inaccurate. I don't know what the data-cleaning process is at the company behind Stable Diffusion, but I think it's important for them to continue to clean the data so that their world model becomes more accurate across every human concept.

An analysis of adjectives and verbs

The majority of our taxonomy focuses on nouns and objects. That is something we can concretely focus on and discuss. But many things we want to convey are not nouns; instead, they describe the scene. Let's look at some adjectives like young, old, furry, smelly, and fat.

Does the model really understand them? Let's see what happens when we combine these adjectives with a noun like "cat".

How about a fat cat, an ugly cat, an old cat, a young cat, wet cats, furry cats, or a smelly cat? For many of these, there is no simple way to visualize concepts like smelly or smart. A Google search for "smelly cat" does show images of cats with fume lines coming out of them. What would you expect a smelly cat to look like in a diffusion model?

Let's take those same adjectives and apply them to another noun.
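Combinations like the ones above are just a cross product of an adjective list and a noun list, which makes it easy to generate a full prompt grid for side-by-side comparison. A small sketch, with illustrative word lists:

```python
# Sketch: cross the adjectives discussed above with a couple of nouns
# to build a prompt grid for comparing renders. Lists are examples.
adjectives = ["fat", "ugly", "old", "young", "wet", "furry", "smelly"]
nouns = ["cat", "dog"]

grid = {noun: [f"a {adj} {noun}" for adj in adjectives] for noun in nouns}

print(grid["cat"][0])   # a fat cat
print(grid["dog"][-1])  # a smelly dog
```

Rendering each row of the grid with a fixed seed makes it easy to see which adjectives the model actually changes the image for and which it silently ignores.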

Next steps

We are growing and rearranging the taxonomy every day, so you may see some links break. We are connecting each leaf-node concept to other sites such as Wikipedia and WordNet, and we are adding more images to the database. We would like to show images from Midjourney and DALL·E 2, but those are either not available to query or very expensive. If you know anyone we can talk to at those companies, please let us know. If you have feedback to make this better, please contact us.

We are working on making the database available offline; it will be up shortly.