8-Person Team Replicates GPT-4o in Six Months and Open Sources It
Discover How a Small European Team Created Moshi, a Multilingual, Real-Time AI Model
Recently, an open-source real-time multimodal model comparable to GPT-4o has gained popularity.
This open-source model, called Moshi, comes from Kyutai, a small French nonprofit AI research lab with only eight members. Moshi can listen, speak, and see.
Moshi was released for free, and even the father of PyTorch congratulated the team, revealing that its members were his former colleagues at FAIR.
Turing Award winner Yann LeCun shared that "Moshi can understand English with a French accent." The team reportedly developed this model in just six months.
Moshi seems to respond faster than humans, often answering questions before they're fully asked.
For instance, when someone said, "I'm planning to climb Mount Everest next month, and I'm wondering...", Moshi interrupted with, "That's amazing! What gear do you need?" The person replied, "That's exactly what I wanted to discuss. What do you think I need to bring?" Moshi then offered professional advice on climbing equipment and addressed safety concerns.
Moshi also cracked jokes: "You definitely don't want to climb in sandals."
The research team showcased Moshi's ability to express and understand emotions across various speaking styles. For example, they had Moshi recite a poem in a French accent.
When the poem turned out to be too long, the researchers interrupted Moshi, and it immediately stopped reciting.