[Global Network Technology Comprehensive Report] On March 10, Zhiyuan Robot officially released its first general-purpose embodied foundation model, the Zhiyuan Qiyuan large model (Genie Operator-1, or GO-1), and introduced what it describes as the first Vision-Language-Latent-Action (ViLLA) architecture.
According to the company, the architecture combines a VLM (vision-language multimodal large model) with an MoE (Mixture of Experts). The VLM acquires general scene perception and language understanding from large-scale Internet image-text data. Within the MoE, the Latent Planner acquires general action understanding from large volumes of cross-embodiment and human manipulation video data, and the Action Expert acquires fine-grained action execution from million-scale real-robot data.
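The data flow described above (VLM features, then a latent plan, then concrete actions) can be illustrated with a toy sketch. This is not the GO-1 implementation: every function name, dimension, and formula below is a hypothetical stand-in, and the "models" are simple deterministic arithmetic rather than trained networks. The discrete latent tokens are also an assumption made for illustration only.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    image: List[float]   # toy flattened "pixels" (assumption, not a real camera frame)
    instruction: str     # natural-language command


def vlm_encode(obs: Observation, dim: int = 8) -> List[float]:
    """Hypothetical stand-in for the VLM: fuse image and text into one feature vector."""
    if not obs.image:
        raise ValueError("image must not be empty")
    text_feat = [float(ord(c) % 7) for c in obs.instruction[:dim]]
    text_feat += [0.0] * (dim - len(text_feat))      # pad short instructions
    img_mean = sum(obs.image) / len(obs.image)
    return [t + img_mean for t in text_feat]


def latent_planner(features: List[float], num_tokens: int = 4,
                   codebook_size: int = 16) -> List[int]:
    """Hypothetical stand-in for the Latent Planner: map features to latent action tokens."""
    chunk = len(features) // num_tokens
    tokens = []
    for i in range(num_tokens):
        part = features[i * chunk:(i + 1) * chunk]
        tokens.append(int(sum(part) * 10) % codebook_size)
    return tokens


def action_expert(features: List[float], tokens: List[int],
                  horizon: int = 3, dof: int = 7) -> List[List[float]]:
    """Hypothetical stand-in for the Action Expert: decode a short joint-command sequence."""
    actions = []
    for t in range(horizon):
        token = tokens[t % len(tokens)]
        step = [math.tanh(0.1 * (features[j % len(features)] + token))
                for j in range(dof)]
        actions.append(step)
    return actions


def villa_step(obs: Observation) -> List[List[float]]:
    """Chain the three stages in the order the article describes."""
    features = vlm_encode(obs)
    tokens = latent_planner(features)
    return action_expert(features, tokens)


if __name__ == "__main__":
    obs = Observation(image=[0.5] * 16, instruction="pour water")
    for step in villa_step(obs):
        print([round(a, 3) for a in step])
```

The point of the sketch is only the staging: perception and language are fused first, an intermediate latent plan is produced second, and low-level commands are decoded last, so each stage could in principle be trained on a different kind of data.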
Zhiyuan Robot stated that the three components work together, allowing the model to learn from human videos and generalize quickly from a small number of samples, which lowers the barrier to entry for embodied intelligence. The company said the model has been successfully deployed on several of its robot bodies and continues to evolve, pushing embodied intelligence to a new level.
Zhiyuan Robot said that the GO-1 model will accelerate the adoption of embodied intelligence. In the company's view, robots will evolve from task-specific tools into autonomous agents with general intelligence, play a greater role in fields such as commerce, industry, and the home, and move toward a more general, all-purpose intelligent future. (Sihan)