[Source: Global Network]
If you scroll through your WeChat Moments, you have probably noticed the "Ghibli" trend sweeping social platforms recently; even friends who have been silent for ages are posting their own Ghibli-style photos. Most of these images come from the native image generation feature added in GPT-4o. At launch, the feature was available only to ChatGPT Plus, Pro, and Team subscribers. Unlike earlier models, ChatGPT can now generate images containing as many as 10 to 20 distinct objects in a single frame, greatly enhancing its creative capabilities.
Since the feature launched, social media platforms have been flooded with Ghibli-style images. Users have creatively experimented with a range of subjects, from personal photos of family and friends to works inspired by internet culture.
So much so that OpenAI CEO Sam Altman posted a wry update on his personal social account: "I spent ten years trying to use AI to help humans cure disease. For the first seven and a half years almost nobody paid attention to me, and for the next two and a half years everyone criticized me. Then I woke up one day to hundreds of messages telling me I had been turned into a Ghibli-style pretty boy." At the same time, he said he hoped everyone would treat the feature calmly, as its sudden virality had left OpenAI's computing capacity stretched thin and put enormous pressure on the system.
Regarding this update, industry expert Wang Yuquan believes that GPT-4o's image technology may look like a minor feature, but it marks the moment when, in the field of graphic design, creativity was formally unbound from technical skill, and it will quickly give rise to a barrier-free innovation ecosystem.
In fact, when OpenAI first launched the image feature, the industry assumed OpenAI had merely integrated DALL-E into the GPT model, a minor, unremarkable update. After all, as early as 2023, people had already witnessed Midjourney's AI image generation: just enter a few keywords and you get a series of AI images, from which you can pick your favorite.
Compared with Midjourney, what has drawn industry attention to GPT-4o this time is its ability to "edit as it draws". GPT-4o abandons the "stepwise denoising" mechanism that traditional diffusion models rely on and instead adopts an autoregressive generation method, giving users far greater flexibility and control. Users can easily steer the results and fine-tune the generated content at any time, instead of generating a large batch of images and laboriously sifting through them for one that meets their needs. Chinese is a good example: GPT-4o can now accurately render Chinese text, and there is no need to enter keywords; simply type a description and you get an accurate picture. It also supports continuous, fine-grained revision: change the character's hair color or shoes, and it responds immediately.
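The contrast described above can be sketched in toy form. This is purely a conceptual illustration, not OpenAI's actual implementation: a diffusion-style loop rewrites the entire image at every denoising step, while an autoregressive generator emits one token at a time conditioned on what came before, which is what makes targeted edits (change one token, regenerate only what follows) natural.

```python
import random

def diffusion_generate(steps=5, size=4, seed=0):
    """Diffusion-style sketch: start from pure noise and, at every step,
    rewrite the WHOLE image, nudging each value toward the target (1.0)."""
    rng = random.Random(seed)
    image = [rng.random() for _ in range(size)]  # pure noise
    for _ in range(steps):
        image = [x + 0.5 * (1.0 - x) for x in image]  # denoise everything
    return image

def autoregressive_generate(prompt_tokens, length=4):
    """Autoregressive sketch: emit one token at a time, each conditioned
    on everything generated so far (here, a toy sum-mod-10 'model')."""
    out = list(prompt_tokens)
    for _ in range(length):
        out.append(sum(out) % 10)  # next token depends on full context
    return out[len(prompt_tokens):]

def autoregressive_edit(tokens, position, new_token):
    """Targeted revision: change one token (e.g. 'hair color') and
    regenerate only what follows, rather than re-sampling everything."""
    kept = tokens[:position] + [new_token]
    tail = autoregressive_generate(kept, length=len(tokens) - len(kept))
    return kept + tail
```

For instance, `autoregressive_edit([3, 6, 2, 4], 1, 9)` keeps the first token, swaps the second, and regenerates only the remainder, which is the property the article's "change while drawing" behavior rests on.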
In this regard, many industry experts believe the 4o update appears to have crossed the critical threshold of "replacing manual labor". In the coming years, design and illustration will return to "creativity and taste": AI tools not only turn designers' imaginations into reality but also exponentially increase their creative efficiency.
In addition, Wang Yuquan noted that behind GPT-4o's new capability lies OpenAI's broader exploration of "multimodality" in large models, and that demonstrating multimodal ability is expected to be the main direction of the upcoming GPT-5.
In his view, there is a basic industry consensus that "multimodality will significantly reduce large-model hallucination". Mainstream large models, including DeepSeek, mainly take text as input and produce output from it; once the input text is wrong, the output goes wrong, and training on erroneous data can likewise detach the model from reality. The advantage of multimodality is that, after input, the data can be cross-checked and corroborated from different angles, helping ensure the accuracy of the output.
It is worth mentioning that domestic large models are also making progress in this area. ByteDance's Doubao model has launched SeedEdit, which likewise enables "natural-language retouching": users need only enter simple natural language to make varied edits to an image. At present, Doubao's image generation feature is completely free and unlimited, which may even save users the cost of a subscription.
It is foreseeable that, as AI image editing technology continues to develop, phones and computers may integrate this capability in the future. At that point, whether a novice who knows little about image processing or a professional well versed in the craft, anyone will be able to wield the technology with ease and present their own sense of beauty in a more intuitive, vivid way.