
AI glasses + multimodal AI = a massive new industry – Computerworld



The powerful role of video in multimodal AI

Multimodal AI processes text, audio, photos and video together. (And to be clear, it can pull the “text” directly from the audio, photos or video: it “reads” or extracts the words it sees or hears, then feeds that text into the mix.)
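To make that input mixing concrete, here is a minimal sketch of a single multimodal request, assuming the OpenAI Python SDK and a vision-capable model such as gpt-4o; the image path and prompt are illustrative. The same call shows how the model can pull the text it “reads” out of a photo.

```python
# Minimal sketch: one multimodal request mixing text and an image.
# Assumes the OpenAI Python SDK (`pip install openai`) and an API key
# in the OPENAI_API_KEY environment variable; names are illustrative.
import base64
from openai import OpenAI

client = OpenAI()

def describe_and_read(image_path: str) -> str:
    """Send a photo plus a text prompt; ask the model to describe the
    scene and transcribe any words visible in the image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # any multimodal (vision-capable) model
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Describe this scene, and transcribe any words "
                             "you can read in the image."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                ],
            }
        ],
    )
    return response.choices[0].message.content

print(describe_and_read("storefront.jpg"))  # hypothetical example image
```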

Adding video brings the user-computer interface much closer to the human experience. While AI can’t think or understand, its ability to take in video and other inputs puts it and the person using it (who is also multimodal) on the same page about their physical surroundings or whatever they are both paying attention to.

For example, during the Google I/O keynote, engineers back at Google DeepMind headquarters watched the livestream alongside Project Astra, which (like OpenAI’s new model) can see and “read” whatever is on your computer screen. They posted a video on X showing an engineer chatting with the AI about the keynote playing on screen.

Another fun demo showed GPT-4o in action. In that video, an OpenAI engineer uses a smartphone running GPT-4o, with its camera, to describe what it sees in response to the comments and questions of a second GPT-4o instance running on another smartphone.
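For a rough idea of how that back-and-forth could be wired up, here is a minimal sketch assuming the OpenAI Python SDK: one instance answers using a camera frame, the other has no camera and reacts with follow-up questions. The real demo ran over GPT-4o’s live voice and video interface, not this simplified turn-based text loop, and the model name, file names and prompts are assumptions.

```python
# Sketch of the two-phone demo structure: instance A sees camera frames
# and answers; instance B only hears A's answers and asks follow-ups.
# Assumes the OpenAI Python SDK and OPENAI_API_KEY in the environment.
import base64
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # illustrative model name

def encode_frame(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def camera_instance_answer(question: str, frame_path: str) -> str:
    """Instance A: answers a question using the current camera frame."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Answer using what you see: {question}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_frame(frame_path)}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

def blind_instance_follow_up(answer: str) -> str:
    """Instance B: has no camera; reacts to A's answer with a follow-up."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system",
             "content": "You cannot see anything. Ask short follow-up questions "
                        "about the scene the other assistant describes."},
            {"role": "user", "content": answer},
        ],
    )
    return resp.choices[0].message.content

# One round of the relay: B asks, A looks and answers, B follows up.
question = "What is in front of the camera right now?"
answer = camera_instance_answer(question, "frame_0001.jpg")  # hypothetical frame
print(answer)
print(blind_instance_follow_up(answer))
```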

In both demos, the phone does what another person could do: walk around with you and answer your questions about the objects, people and information in the physical world.

Advertisers see video in multimodal AI as a way to gauge the emotional impact of their ads. “Emotions emerge through technology like Project Astra, which can process the real world through the lens of a mobile phone camera. It continually processes images and information that it sees and can return answers, even after it has moved past the object,” Laurie Sullivan wrote in an opinion piece on MediaPost.
