Microsoft introduced Kosmos-1, a neural network that accepts text, images, audio, and video content as input.
The researchers call the system a “multimodal large language model” (MLLM). In their view, such algorithms will form the basis of artificial general intelligence (AGI), capable of performing tasks at a human level.
“As a basic part of intelligence, multimodal perception is a necessity for achieving AGI, in terms of knowledge acquisition and grounding in the real world,” the researchers said.
According to examples from the paper, Kosmos-1 can:
- analyze images and answer questions about them;
- read text from pictures;
- create image captions;
- score 22–26% on a visual IQ test.
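To illustrate the basic idea behind such models, here is a minimal toy sketch (not Microsoft’s code, and the token names are hypothetical): a multimodal language model treats “text plus image” as a single sequence, splicing image embeddings between special boundary tokens so an ordinary language model can process the whole prompt.

```python
def build_multimodal_sequence(parts):
    """Flatten interleaved text/image parts into one token list.

    parts: list of ("text", str) or ("image", str) tuples.
    Text is split into word tokens; each image becomes a
    <image> ... </image> span around a placeholder token.
    """
    tokens = []
    for kind, content in parts:
        if kind == "text":
            tokens.extend(content.split())
        elif kind == "image":
            # In a real model this placeholder would be a sequence of
            # patch embeddings produced by a vision encoder.
            tokens.extend(["<image>", f"[emb:{content}]", "</image>"])
        else:
            raise ValueError(f"unknown part kind: {kind}")
    return tokens

# A visual question-answering style prompt: text and an image interleaved.
prompt = [
    ("text", "Question:"),
    ("image", "cat.png"),
    ("text", "What animal is this? Answer:"),
]
print(build_multimodal_sequence(prompt))
```

The point of the sketch is only the interleaving: once images are mapped into the same token sequence as text, capabilities like captioning and visual question answering reduce to ordinary next-token prediction.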
Microsoft trained Kosmos-1 on data from the Internet, including the 800 GB English-language text corpus The Pile and the Common Crawl web archive. After training, the researchers evaluated the model’s abilities on several tasks:
- language understanding and generation;
- text classification without optical character recognition (OCR);
- image captioning;
- visual question answering;
- web page question answering;
- zero-shot image classification.
According to Microsoft, Kosmos-1 outperformed current models on many of these tasks. The researchers plan to publish the project’s source code on GitHub in the near future.
Recall that in January, Microsoft introduced VALL-E, a simulator that can mimic a human voice from a short audio sample.


