
After six years of development, AI has finally surpassed humans in "interpreting images and understanding their meanings".

2026-04-06

Introduction: On August 12, a key breakthrough was achieved on the internationally authoritative machine vision question answering leaderboard, the VQA Leaderboard: Alibaba DAMO Academy set a new record with an accuracy of 81.26%, the first time AI has surpassed the human benchmark in "reading images and understanding meaning".

Not long ago, Alibaba's AI model surpassed human accuracy in recognizing news text on CLUE, the authoritative Chinese language understanding benchmark. Now Alibaba DAMO Academy has also surpassed humans in "image comprehension" on VQA, a first in the six years since the leaderboard was established.


After AI surpassed human scores in visual recognition in 2015 and in text understanding in 2018, it has now made a comparable advance in multimodal technology.

"Poetry is painting without form, and painting is poetry with form." This is how Zhang Shunmin, a poet of the Song Dynasty, described the connection between language and vision. "Reading pictures and understanding meaning," that is, comprehending information through vision, is a basic human ability, but it is a cognitive task that demands a great deal from AI.

Addressing this challenge is crucial for the development of general artificial intelligence. Over the past decade, AI has made rapid progress in single-modal skills such as chess, vision, and text understanding. However, in higher-order cognitive tasks involving cross-modal vision-text understanding, AI has consistently failed to reach human levels.

The VQA Challenge, established to tackle this problem, has been held at the world's top computer vision conferences, ICCV and CVPR, since 2015. It has attracted many top institutions, including Microsoft, Facebook, Stanford University, Alibaba, and Baidu, and has produced the world's largest and most recognized VQA (Visual Question Answering) dataset, containing more than 200,000 real photos and 1.1 million questions.

VQA is one of the most challenging tasks in AI. In the test, the AI must generate the correct natural language answer based on a given image and a natural language question.

This means that a single AI model must integrate complex computer vision and natural language technologies: first scan the image information, then combine it with an understanding of the text question, use multimodal techniques to learn the correlation between text and images, accurately locate the relevant image regions, and finally answer the question using common sense and reasoning.
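The steps above can be sketched as a toy pipeline. Everything here is a deliberately simplified stand-in (labelled regions instead of a CNN detector, a bag-of-words question encoding, word-overlap "attention"), not DAMO Academy's actual system:

```python
# Toy sketch of the VQA pipeline: encode the question, attend to the
# most relevant image region, and read off the answer.

def encode_question(question, vocab):
    # Bag-of-words encoding: 1.0 for each vocabulary word present.
    words = question.lower().rstrip("?").split()
    return [1.0 if w in words else 0.0 for w in vocab]

def attend(regions, question_vec, vocab):
    # Cross-modal "attention" stand-in: score each region by whether
    # its label appears among the question's vocabulary words.
    scored = []
    for label, attribute in regions:
        score = sum(q for w, q in zip(vocab, question_vec) if w == label)
        scored.append((score, label, attribute))
    return max(scored)  # most relevant region wins

def answer(regions, question, vocab):
    q_vec = encode_question(question, vocab)
    _, _, attribute = attend(regions, q_vec, vocab)
    # Answer selection: here the attended region's attribute is the answer.
    return attribute

# Each "region" is a (label, attribute) pair standing in for detector output.
regions = [("cat", "black"), ("ball", "red")]
vocab = ["cat", "ball", "color"]
print(answer(regions, "What color is the ball?", vocab))  # -> red
```

A real system replaces each of these stand-ins with learned components: a region detector for the image, a language model for the question, and trained attention for the cross-modal matching.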

In June of this year, Alibaba's DAMO Academy won the VQA 2021 Challenge out of 55 submissions, leading the second-place team by about 1 percentage point and last year's champion by 3.4 percentage points. Two months later, DAMO Academy once again set a new VQA Leaderboard global record with an accuracy of 81.26%, surpassing the human benchmark of 80.83% for the first time.

The core challenge of VQA lies in joint reasoning and cognition of multimodal information, that is, semantic mapping and alignment of different modalities within a unified model.
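The idea of semantic alignment can be illustrated with a shared embedding space: image and text encoders map their inputs to vectors in the same space, and related content ends up close together. The embeddings and the cosine-similarity matching rule below are illustrative assumptions, not the actual DAMO Academy models:

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors in the shared space.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings, assumed already projected into one shared semantic
# space by separately trained image and text encoders.
image_embeddings = {
    "photo_of_dog": [0.9, 0.1, 0.0],
    "photo_of_car": [0.0, 0.2, 0.95],
}
query = [0.85, 0.15, 0.05]  # toy text embedding for the word "dog"

# Alignment in action: the text query retrieves the nearest image.
best_match = max(image_embeddings,
                 key=lambda name: cosine(image_embeddings[name], query))
print(best_match)  # -> photo_of_dog
```

Training pushes matching image–text pairs toward high similarity and mismatched pairs toward low similarity, which is what makes retrieval and question answering across modalities possible in one model.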

It is understood that the DAMO Academy NLP and vision teams systematically designed the AI vision-text reasoning system, integrating many algorithmic innovations, including diverse visual feature representations, multimodal pre-trained models, adaptive cross-modal semantic fusion and alignment, and knowledge-driven multi-skill AI integration, raising AI's ability to "read images and understand meaning" to a new level.

VQA technology has a wide range of applications, including text and image reading, cross-modal search, visual question answering for the blind, medical consultation, and intelligent driving, and may revolutionize the way humans interact with computers.

The report indicates that this is not the first time Alibaba's DAMO Academy has surpassed human benchmarks in key areas of AI. In 2018, DAMO Academy made history by enabling machines to surpass human reading comprehension for the first time in the Stanford SQuAD challenge, attracting attention from overseas media.

Since the beginning of this year, DAMO Academy has moved rapidly on foundational AI technology. It has successively released M6, the first ultra-large-scale multimodal pre-trained model from a Chinese technology company, and PLUG, the first ultra-large-scale Chinese language model. It has also open-sourced AliceMind (https://github.com/alibaba/AliceMind), a deep language model system developed over three years that has topped six major international authoritative NLP leaderboards, including GLUE.

