The future is multi-modal, and needs your attention right now
The combination of computer vision with analytical and creative AI is revolutionary. Multi-modal LLMs are going to change everything. Here are some very real reasons why you should be paying attention
In the annals of technological progress, there are moments, fleeting yet transformative, that chart a new direction for humanity. The evolution of the multi-modal LLM is one such profound inflection point.
A multi-modal LLM (large language model) can take input as any combination of written language, images, and audio. It can process complex instructions and give creative or analytical responses in different modalities. Unless you’ve been living under a rock for the last 12 months, you’ve likely interacted with an LLM like ChatGPT. Multi-modal LLMs are essentially ChatGPT with the ability to see and hear.
OpenAI made GPT-4 with vision publicly available last week to ChatGPT Plus subscribers, and having played around with it for a few days, I can confidently say that you need to stop and pay attention. Whatever it is that you do, there’s a high chance you can do it better by leveraging GPT-4 with vision. I don’t intend to be alarmist, but the potential for impact is unprecedented. This isn’t merely a leap, it’s transcendence.
In the rest of this article, I’ll focus on how multi-modal LLMs can be a powerful tool at every step of developing software products, though their impact is likely just as revolutionary across other domains, with the potential for effective co-pilots in science, engineering, and medicine alike.
Visual dialogue for your flowcharts and mind-maps
As a product manager, I spend a lot of time in FigJam, drawing and charting thoughts and workflows (FigJam is a whiteboarding tool, much like Miro or Freeform). Not only does it help me think, it’s a great way to communicate visually with my team. I work with designers who do the same for user stories and low-fidelity mockups. Back when I worked as an engineer, every big project started with an engineering design that visually mapped the components that would work together and fit into the system.
The point I am trying to make is that a lot of your work is visual in nature, even if you’re not cognisant of it.
State-of-the-art AI can now help. To converse with a machine about a diagram, to share the intricacies of design, and to receive insightful feedback—this is no longer the realm of science fiction. Product-building is on the cusp of an era where collaboration takes on a more profound meaning, reducing the space between ideation and realisation.
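To make that concrete, here’s a minimal sketch of what “conversing with a diagram” can look like today, using the OpenAI Python SDK. The model name, the file name, and the prompt are assumptions for illustration; check the current API docs for whichever vision-capable model is available to you.

```python
# A minimal sketch of "talking to a diagram" with the OpenAI Python SDK (v1.x).
# Assumptions: OPENAI_API_KEY is set in the environment, you have access to a
# vision-capable model (the name "gpt-4-vision-preview" may differ by the time
# you read this), and your board is exported as diagram.png (hypothetical file).
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode the exported diagram as a base64 data URL.
with open("diagram.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "This is a FigJam flowchart of our onboarding funnel. "
                         "Where are the dead ends or missing states?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
    max_tokens=500,
)
print(response.choices[0].message.content)
```

The same pattern works for anything you can export as an image: a FigJam board, a photographed napkin sketch, or a screenshot of a competitor’s flow.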
The design reviewer you need and deserve
Just as I found solace in FigJam to seamlessly communicate ideas, Figma serves as the linchpin for designers. Crafted user journeys, blueprints of digital experiences, and the very narrative of user interfaces are often beautifully articulated within its bounds. And it's not just about creating designs; it's about telling stories, illustrating pathways, and sculpting digital landscapes.
But here's the magic: what if those landscapes could talk back? What if your carefully designed mockups could converse with you, offering insights and suggestions? The future I envision is one where Figma designs don't just sit silently. Instead, with the integration of advanced AI, they come alive, resonating with wisdom, bridging the chasm between your creative genius and the tangible digital world. The act of designing is set to transcend into a dialogue, making ideation richer and realisation closer.
The data analyst you’ve always wanted
If you work on drawing insights from numerical or categorical data, I am willing to bet your workflow pretty much looks like the following:
1. figure out what questions you want answered
2. take relevant raw data from the source
3. (maybe) structure it in an analysable format (CSVs, Excel, etc.)
4. structure your query (SQL query, R/Python code, Excel function)
5. get a numerical answer (the output of your program)
6. convert it into a format that’s easier to understand (chart image, PDF, Word report, etc.)
7. present your findings
What if I told you that steps 3 through 6 can now be reliably automated? That’s the promise of ChatGPT’s Advanced Data Analysis mode. It’s been out for a couple of months, and I am not the only one who’s been impressed.
Some have called it effectively GPT-4.5, due to the delta of improvements in data analysis it shows above and beyond GPT-4.
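To ground the claim, here’s a rough sketch of the kind of code Advanced Data Analysis writes and executes on your behalf when you upload a CSV and ask a question in plain English. The file name and column names below are hypothetical; the point is that steps 3 through 6 collapse into a conversation.

```python
# A sketch of the pandas code ChatGPT's Advanced Data Analysis typically
# generates behind the scenes. "subscriptions.csv" and the columns
# ("signup_date", "plan", "revenue") are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

# Step 3: structure the raw data into an analysable format.
df = pd.read_csv("subscriptions.csv", parse_dates=["signup_date"])

# Steps 4-5: structure the query and get a numerical answer.
monthly = (
    df.groupby([df["signup_date"].dt.to_period("M"), "plan"])["revenue"]
    .sum()
    .unstack(fill_value=0)
)
print(monthly.tail())

# Step 6: convert it into a format that's easier to understand.
monthly.plot(kind="bar", stacked=True, figsize=(10, 5),
             title="Monthly revenue by plan")
plt.tight_layout()
plt.savefig("monthly_revenue.png")
```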
The engineering architecture critic
Finalising your architecture is the most important decision you make as an engineer. More often than not, it’s a one-way-door decision. In almost all cases, you’re making it with incomplete information, knowing full well that if the assumptions you’re making about the future are wrong, it will be an expensive decision to go back on. That’s why good engineers spend more time on design than on implementation, and why engineering reviews are drawn-out processes with tiers of reviewers.
You now have help.
You can quite literally take screenshots of your engineering design, have a to-and-fro conversation about the context, and get a well-informed review. Augment your expertise.
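Here’s a hedged sketch of what that to-and-fro can look like, again with the OpenAI Python SDK. The screenshot URL and model name are assumptions; the important part is the shape of the conversation: you keep appending messages so the model retains the full context of the review.

```python
# A sketch of an iterative architecture review. The model name and the
# screenshot URL are assumptions for illustration.
from openai import OpenAI

client = OpenAI()

def image_part(url: str) -> dict:
    """Wrap an image URL (or data: URL) as a vision message part."""
    return {"type": "image_url", "image_url": {"url": url}}

messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "This is the proposed architecture for our "
         "ingestion pipeline. What failure modes am I not accounting for?"},
        image_part("https://example.com/design-screenshot.png"),  # hypothetical
    ]},
]

first = client.chat.completions.create(
    model="gpt-4-vision-preview", messages=messages, max_tokens=600
)
review = first.choices[0].message.content
print(review)

# Push back with context the model couldn't see in the diagram, keeping the
# earlier turns in the message list so the conversation stays grounded.
messages.append({"role": "assistant", "content": review})
messages.append({"role": "user", "content":
    "The queue is capped at 10k messages and consumers are stateless. "
    "Does that change your view of the backpressure risk?"})

second = client.chat.completions.create(
    model="gpt-4-vision-preview", messages=messages, max_tokens=600
)
print(second.choices[0].message.content)
```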
Parting thoughts
It’s a brave new world. Technology isn't just an isolated tool; it's an extension of our minds and a collaborator of our thoughts. From visual design ideation to in-depth data analysis, the fusion of computer vision with LLMs represents an evolution in human-AI interaction. The days of one-sided commands are behind us. Now, we're on the precipice of dynamic conversations with our creations.
Imagine a world where software isn't just a passive canvas, but a creative partner. A world where data doesn't sit still in spreadsheets but converses with you, revealing hidden patterns and insights. And, where complex engineering decisions aren't solely reliant on human foresight but are complemented by machine intelligence, narrowing the risk of oversight. That world isn't decades away; it's unfolding right now.
The onus is on us to not just witness this revolution but to actively engage with it. By harnessing the potential of multi-modal LLMs, we're not just streamlining our work but reshaping the very fabric of how innovation happens. The synergy between man and machine is burgeoning, presenting a realm of possibilities we're only beginning to fathom.
So, if there's one thing you take away from this article, let it be this: We're no longer at the mercy of technology's potential; we're its co-authors. The multi-modal future beckons, and it's time we rise to its call, crafting tales of innovation, creativity, and collaborative brilliance.