GILL, a New Chatbot that Generates Both Text and Images Simultaneously
TheTechOasis
Talking to an AI chatbot that generates impressive text like ChatGPT is really cool.
Now, talking to an AI chatbot that generates text AND images is way cooler… and something ChatGPT doesn't offer.
But GILL, Carnegie Mellon University's new AI chatbot, does.
And you can see it with your own eyes below:
[Demo GIF]
GILL is the first-ever AI chatbot capable of both understanding and producing images and text.
And if being a first in the most innovative industry in the world isn't impressive enough, it was also cheap as hell to train.
But how does GILL work?
When Text Meets Images
Creating an AI model that works well in one modality (text, images, etc.) is already pretty challenging. But creating a model that works across several modalities is something rarely seen.
In fact, when GPT-4 was presented to the world, everyone was amazed that a single model could understand both text and images.
And even though its image processing features aren't released yet (you can check them here), GPT-4 is only multimodal when it comes to processing data, not generating it.
GILL, on the other hand, can not only process both modalities but also generate both, as seen in the GIF above.
But to achieve this, they had to overcome an important challenge.
Making Images and Text "Talk"
Even though we humans understand both images and text naturally, this isn't an easy concept to explain to machines.
To do so, we transform text and images into vectors, called embeddings, that capture their meaning in a form machines can work with.
These embeddings represent the model's understanding of our world: vectors that represent similar concepts are grouped together, while unrelated ones sit far apart.
For instance, in a text-only embedding space like the one ChatGPT has, concepts like "A little boy is walking" and "A little boy is running" are close together, while both sit far from "Look how sad my cat is".
If you ever wondered how machines understand our world, there you have it.
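To make the idea concrete, here is a minimal sketch of measuring how close two sentences sit in an embedding space. It uses the open-source sentence-transformers library and the all-MiniLM-L6-v2 model purely as illustrative stand-ins; GILL's actual embeddings come from its own models:

```python
# Minimal sketch: embed sentences and compare them with cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "A little boy is walking",
    "A little boy is running",
    "Look how sad my cat is",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Similar concepts score high; unrelated ones score low.
print(util.cos_sim(embeddings[0], embeddings[1]))  # high (walking vs. running)
print(util.cos_sim(embeddings[0], embeddings[2]))  # low  (boy vs. sad cat)
```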
But the problem is that every model has its own representation of our world, which means that ChatGPT's embeddings (text-to-text) aren't valid, for instance, for Stable Diffusion (text-to-image).
This is a problem because, if we want a model to process and generate multiple modalities at once, we need an embedding space that brings the embeddings from all modalities into one shared space.
And that's precisely what GILL does.
When Maps Are The Solution
GILL does five things:
- Process text
- Process images
- Generate text
- Generate images
- Retrieve images
To do this, it needs to be capable of:
- Processing images and text simultaneously
- Deciding when to output text or an image
- When outputting an image, deciding whether to retrieve it from a database or generate it from scratch
To pull this off, the researchers did the following (sketched in code after this list):
- To process images and text, they trained a transformation matrix that maps image embeddings into something the text-only LLM can understand.
- Next, they enhanced the LLM so that it could not only generate text but also signal, via special image tokens, when an image was required.
- If the model signaled it was time to output an image, a decision module would judge, based on how naturally the required image occurs in the real world, whether to retrieve it from a database or generate it with Stable Diffusion.
- In simple terms, if you asked GILL for a "pig with wings", the model would understand that the image had to be generated, as no such thing exists to retrieve.
- Finally, if the image had to be generated, GILLMapper, a neural network, transformed the LLM's image-token embeddings into embeddings that the text-to-image model, Stable Diffusion, could understand.
Et voilà.
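For the technically curious, here is a minimal PyTorch sketch of those moving parts. Every dimension, class name, and the decision threshold below are illustrative assumptions drawn from the description above, not GILL's actual implementation:

```python
import torch
import torch.nn as nn

D_IMG, D_LLM, D_SD = 1024, 4096, 768  # hypothetical embedding sizes

# 1) Transformation matrix: project visual-encoder embeddings into the
#    LLM's embedding space so images enter the model as pseudo-tokens.
image_to_llm = nn.Linear(D_IMG, D_LLM)

# 4) GILLMapper stand-in: map the LLM's image-token states into the
#    conditioning space of the text-to-image model (Stable Diffusion).
class GILLMapper(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(D_LLM, D_LLM), nn.GELU(), nn.Linear(D_LLM, D_SD)
        )

    def forward(self, img_token_states):
        return self.net(img_token_states)

def respond(image_emb, llm, decision_score_fn, mapper, retriever, sd):
    """One conversational turn, following the steps above. Every callable
    passed in is a stand-in for one of GILL's real components."""
    pseudo_tokens = image_to_llm(image_emb)          # 1) image -> LLM space
    hidden_states, wants_image = llm(pseudo_tokens)  # 2) LLM answers and signals
                                                     #    whether an image is needed
    if not wants_image:
        return hidden_states                         # text-only answer
    # 3) Decision module: common scenes are retrieved, unusual ones
    #    (a "pig with wings") are generated. The threshold is made up.
    if decision_score_fn(hidden_states) > 0.5:
        return retriever(hidden_states)
    return sd(mapper(hidden_states))                 # 4) map, then generate
```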
As this is hard to follow, let's see an example:
Left side of the image:
The model receives an image of several cookies and the request "How should we display them?"
Then, the model takes the image, transforms it as we described earlier, and feeds it in alongside the text embedding of the request
Right side:
The LLM, GILL's core, outputs the response, which includes not only text but also an image.
GILL's decision module then decides whether it needs to retrieve a valid image or generate a new one
In this case, it decides that the retrieved image (stacked cookies) isn't the best option and generates a new one
Using GILLMapper, the LLM's output image tokens are transformed into a valid image embedding for Stable Diffusion (SD in the image), which then generates the final image
Finally, the answer is given, "I think they look best… between them.", accompanied by the generated image
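Tying this back to the sketch above, the cookies turn could be driven end to end like this, with every component stubbed out (nothing here is GILL's real code; the stubs simply force the generate branch the example describes):

```python
# Hypothetical, stubbed-out run of the respond() sketch for the cookies turn.
llm = lambda toks: (torch.randn(1, D_LLM), True)  # pretend an image token fired
decision_score_fn = lambda h: 0.1                 # low score -> generate, not retrieve
retriever = lambda h: "retrieved image (stacked cookies)"
sd = lambda cond: "generated image"               # Stable Diffusion stand-in
mapper = GILLMapper()

cookies_photo = torch.randn(1, D_IMG)             # stand-in for the encoded photo
print(respond(cookies_photo, llm, decision_score_fn, mapper, retriever, sd))
# -> "generated image": retrieval was rejected, just like in the example
```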
And just like that, you have the next generation of conversational chatbots: AI chatbots that can understand and generate both images and text, providing an experience similar to the one you have with your friends and family.
GILL is just the beginning.
Key AI concepts you've learned by reading this newsletter:
- Multimodal generation
- Embedding spaces
- Text & image chatbots
Top AI news for the week
- OpenAI launches a million-dollar grant to empower teams looking to build AI-based cybersecurity solutions
- Entering the creepy world of AI deepfakes that bring crime victims to life
- Ex-OpenAI employee's startup shows video of their newest robot home butler
- Greatest Turing test ever proves chatbots still aren't at a human level
- A new open-source king, the Falcon LLM
- When gaming and AI collide: watch NVIDIA's new video demo where you can talk to the characters in the game… with your own voice
- Many global AI and non-AI leaders sign a statement on AI risk and the need to mitigate it