my musings on bits, life, and universe

A Universe of Images

Imagine a 1920 × 1080 resolution image, commonly referred to as 1080p. It has a distinct channel for each primary colour. So, in total, there are 3 channels, i.e. for red, green, and blue; commonly known as RGB. Let's calculate how many intensity values it has. Each colour channel is a grid of (1920 × 1080) pixels. So, for all the colour channels combined, there are 1920 × 1080 × 3 = 6,220,800 values (6.2208 × 10^6), in the order of millions.
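As a quick sanity check, the arithmetic above fits in a few lines of Python:

```python
# Back-of-the-envelope count of the intensity values in a 1080p RGB image.
width, height, channels = 1920, 1080, 3

pixels = width * height       # spatial pixels per channel grid
values = pixels * channels    # one intensity value per pixel per channel

print(pixels)  # 2073600
print(values)  # 6220800
```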

Now, each of these roughly 6 million values is an intensity from 0 to 255, which gives 256 distinct possibilities. If you line up these 6 million values in a single straight line, you will get a vector with a dimension in the order of millions. This vector represents a point in a vector space whose dimension is in the order of millions, i.e. millions of perfectly orthogonal directions, or basis vectors, exist in this space.

Now, the image of the girl you are seeing is a point in this space. Every image you have ever seen in your life, or will ever see in the future, is a distinct point in this vector space. This is what we call image space. This space consists of all images that can exist in this universe. There is a point in this image space which represents an image of Einstein, and another which represents your favourite singer. There are points in this space which correspond to images of all humans that will be born in the future, and also all humans who were born in the past. It is the set of all possible images that a human eye can ever perceive. It contains images of galaxies which humanity has not yet discovered, and of the future billionaires the world is yet to see. Just reflecting on this fact should blow your mind.

Now, let's calculate exactly how many such images exist in this space. Since the pixel intensities aren't real numbers but integers, each of the values in our roughly 6-million-dimensional vector can only be one of 256 distinct values. So, in total, there are 256 raised to the power of 6,220,800 possible combinations. The scale of this number is something one can't possibly comprehend. Let me tell you, this is much, much larger than the estimated number of atoms in the universe. This is a combinatorial explosion.
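To get a feel for the scale, you can count the decimal digits of this number without ever materialising it, using logarithms:

```python
import math

# How large is 256 ** 6_220_800? Count its decimal digits and compare
# with the estimated number of atoms in the observable universe (~10**80).
dims = 1920 * 1080 * 3                       # 6,220,800 values per image
digits = int(dims * math.log10(256)) + 1     # roughly 15 million digits

print(digits)
print(81)  # the atom count, by contrast, has only about 81 digits
```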

Now, there are only a finite number of points in this image space that represent an image which makes sense to humans — or, to put it another way, are legible or meaningful to us. All these images that make sense to humans are clustered in this high-dimensional vector space. This is also very intuitive. The point representing a dog of one breed would be close to the point representing a dog of another breed, because their high-level features are roughly similar — which is reflected in the 6-million-dimensional vectors being very similar.

Let's do a thought experiment. Can you generate random values for all these pixels, i.e. create a 6-million-dimensional vector, that would represent an image meaningful to humans? It is possible, but highly unlikely, because most of the image space is "empty" in the sense that it contains images which don't mean anything to human eyes and can be considered random noise.
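You can try this thought experiment yourself: sample every value uniformly at random, and you almost surely get an image of pure static.

```python
import random

# A random point in image space: each of the 6,220,800 values is drawn
# uniformly from 0..255. Reshaped into a 1080p image, this is virtually
# guaranteed to look like television static, not anything meaningful.
random.seed(0)
width, height, channels = 1920, 1080, 3
noise = random.choices(range(256), k=width * height * channels)

print(len(noise))                              # 6220800
print(min(noise) >= 0 and max(noise) <= 255)   # True
```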

Now, let's say I randomly pick a point in this high-dimensional image space. Can I convert this illegible, noisy image into something meaningful to human eyes? It turns out we can. Let’s say you have a genie which gives you a direction each time you ask: "In which direction of the high-dimensional space does any good image exist with respect to my current location?" You move a little in that direction, and you ask the genie again: “I am here now, give me a direction in which I should step so that I get closer to a point which represents a good image.” You do this again and again. Eventually, you reach a point in this image space which corresponds to an image of a cat, dog, or mountain i.e. any image that means something to the human eye.

Now, this genie is the generative model, which plays the crucial role in generating the image. It is a neural network which, given that you are at a point in this high-dimensional input space, gives you the direction in which you should move.

This act of moving from a random point in the space closer and closer to a point representing a meaningful image is known as denoising. This is what diffusion models, the ones that generate photorealistic images that blow our minds, do at their core. There are many non-trivial technical details and nuances to it, but the crux of what they do is as described above.
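The genie-asking loop can be sketched in a few lines of Python. The `genie` function below is a toy stand-in of my own invention: it simply points toward one fixed "good image", whereas a real diffusion model predicts this direction with a trained neural network. The step size and step count are arbitrary illustrative choices.

```python
import random

# A tiny 12-dimensional "image" standing in for the 6-million-dimensional one.
TARGET = [128.0] * 12

def genie(point):
    # Toy stand-in for the trained network: return the direction from
    # `point` toward the good image.
    return [t - p for t, p in zip(TARGET, point)]

random.seed(0)
point = [random.uniform(0, 255) for _ in range(12)]   # random starting point

step_size = 0.1
for _ in range(100):                  # ask the genie again and again
    direction = genie(point)
    point = [p + step_size * d for p, d in zip(point, direction)]

# After many small steps, we land very close to the meaningful image.
print(max(abs(p - t) for p, t in zip(point, TARGET)) < 1.0)  # True
```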

Now, let's talk about the training of such a model. It turns out you do not need infinite amounts of manually labeled data to train such a network. We can completely train this network in a self-supervised fashion.

Let's say you take a good image and add some noise to it. This noise is basically a direction in the image space that takes you away from the point representing the original image. Now, you generate a pair: the noisy image, and the direction in which you've moved from the original point in image space to reach this noisy one.

To train the network, you use this noisy image and the reverse of the direction you moved in — the one that would take you back toward the original image. You can see we've now generated a supervised pair from an unsupervised process — something we can learn from.
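Here is a small sketch of that pair-generation step. The function name `make_training_pair` is mine, not a library API, and the Gaussian noise scale is an arbitrary choice; real diffusion training uses a carefully designed noise schedule.

```python
import random

random.seed(0)

def make_training_pair(clean, noise_scale=25.0):
    # Self-supervised pair creation: corrupt a clean image with noise and
    # keep the reverse of that noise as the training target.
    noise = [random.gauss(0.0, noise_scale) for _ in clean]
    noisy = [c + n for c, n in zip(clean, noise)]
    target_direction = [-n for n in noise]   # points back toward `clean`
    return noisy, target_direction

clean_image = [100.0, 150.0, 200.0, 50.0]    # toy 4-value "image"
noisy, target = make_training_pair(clean_image)

# Moving the noisy image along the target direction recovers the original.
restored = [x + d for x, d in zip(noisy, target)]
print([round(v, 6) for v in restored])  # [100.0, 150.0, 200.0, 50.0]
```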

Now, the goal is to learn such a direction for every point in the image space — i.e., a direction that takes you closer to a good image, wherever you currently are.

This neural network essentially learns a vector field in this high-dimensional image space — a direction for each point that points toward a cluster of human-perceptible images.

That's it. When I reflected on this, I was awestruck by the power of this seemingly simple idea.