The Allen Institute for Artificial Intelligence (Ai2), a research nonprofit, is releasing a family of open-source multimodal language models, called [Molmo](https://molmo.allenai.org/paper.pdf), that it says perform as well as top proprietary models from OpenAI, Google, and Anthropic.
The organization claims that its biggest Molmo model, which has 72 billion parameters, outperforms OpenAI’s GPT-4o, which is estimated to have over a trillion parameters, in tests that measure things like understanding images, charts, and documents.
Meanwhile, Ai2 says a smaller Molmo model, with 7 billion parameters, comes close to OpenAI’s state-of-the-art model in performance, an achievement it ascribes to vastly more efficient data collection and training methods.
What Molmo shows is that open-source AI development is now on par with closed, proprietary models, says Ali Farhadi, the CEO of Ai2. And open-source models have a significant advantage, as their open nature means other people can build applications on top of them. The [Molmo demo is available here](https://molmo.allenai.org/), and it will be available for developers to tinker with on the Hugging Face website. (Certain elements of the most powerful Molmo model are still shielded from view.)
Other large multimodal language models are trained on vast data sets containing billions of images and text samples that have been hoovered from the internet, and they can include several trillion parameters. This process introduces a lot of noise to the training data and, with it, hallucinations, says Ani Kembhavi, a senior director of research at Ai2. In contrast, Ai2’s Molmo models have been trained on a significantly smaller and more curated data set containing only 600,000 images, and they have between 1 billion and 72 billion parameters. This focus on high-quality data, versus indiscriminately scraped data, has led to good performance with far fewer resources, Kembhavi says.
Ai2 achieved this by having human annotators describe the images in the model’s training data set in excruciating detail, producing the equivalent of multiple pages of text per image. Rather than have them type those descriptions, Ai2 asked the annotators to speak them aloud, then used AI techniques to convert the speech into data, which made the training process much quicker and reduced the computing power required.
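Ai2 has not published the exact tooling here, but the speak-then-transcribe idea is straightforward to sketch. Below is a minimal, hypothetical pipeline, not Ai2’s actual code, that uses the open-source Whisper speech-recognition model to turn an annotator’s spoken description into a long caption paired with its image; the directory layout and file names are made up for illustration.

```python
# Hypothetical sketch of a speak-aloud annotation pipeline (not Ai2's actual tooling).
# Assumes `pip install openai-whisper`, and that each image has a matching
# audio recording of an annotator describing it in detail.
import json
from pathlib import Path

import whisper  # open-source speech-to-text model

asr = whisper.load_model("base")  # small checkpoint; larger ones transcribe more accurately

records = []
for audio_path in Path("annotations/audio").glob("*.wav"):
    image_path = Path("annotations/images") / (audio_path.stem + ".jpg")
    transcript = asr.transcribe(str(audio_path))["text"].strip()
    records.append({
        "image": str(image_path),
        "caption": transcript,          # the spoken description, now usable as training text
        "num_words": len(transcript.split()),
    })

# Write the resulting image-caption pairs as JSON Lines for a downstream training job.
with open("dense_captions.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```

Each record ends up as an image path plus a long, literal spoken description, which is roughly the shape of data described above.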
These techniques could prove really useful if we want to meaningfully govern the data that we use for AI development, says Yacine Jernite, who is the machine learning and society lead at Hugging Face, and was not involved in the research.
“It makes sense that in general, training on higher-quality data can lower the compute costs,” says Percy Liang, the director of the Stanford Center for Research on Foundation Models, who also did not participate in the research.
Another impressive capability is that the model can “point” at things, meaning it can analyze elements of an image by identifying the pixels that answer queries.
In a demo shared with _MIT Technology Review_, Ai2 researchers took a photo of the local Seattle marina outside their office and asked the model to identify various elements of the image, such as deck chairs. The model successfully described what the image contained, counted the deck chairs, and accurately pointed to other things in the image as the researchers asked. It was not perfect, however. It could not locate a specific parking lot, for example.
Other advanced AI models are good at describing scenes and images, says Farhadi. But that’s not enough when you want to build more sophisticated web agents that can interact with the world and can, for example, book a flight. Pointing allows people to interact with user interfaces, he says.
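Readers who want to try the pointing behavior themselves can start from the usage example on the Molmo Hugging Face model card. The sketch below follows that pattern as published around release; the checkpoint name, the custom `processor.process` and `generate_from_batch` methods, the point-tag output format, and the parsing regex should all be treated as assumptions to verify against the current model card.

```python
# Sketch of a "pointing" query, based on the usage pattern shown on the Molmo
# model card at release; the processor/generate methods come from the model's
# custom remote code and may change, so treat this as a starting point.
import re
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

repo = "allenai/Molmo-7B-D-0924"
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True, torch_dtype="auto", device_map="auto")
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True, torch_dtype="auto", device_map="auto")

image = Image.open(requests.get("https://example.com/marina.jpg", stream=True).raw)  # placeholder URL
inputs = processor.process(images=[image], text="Point to the deck chairs.")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
text = processor.tokenizer.decode(output[0, inputs["input_ids"].size(1):], skip_special_tokens=True)

# The model is expected to answer with point tags such as <point x="52.1" y="63.4" ...>,
# where the coordinates are percentages of image width and height; pull them out with a regex.
points = re.findall(r'x\d*="([\d.]+)"\s+y\d*="([\d.]+)"', text)
print(text)
print([(float(x), float(y)) for x, y in points])
```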
Jernite says Ai2 is operating with a greater degree of openness than we’ve seen from other AI companies. And while Molmo is a good start, he says, its real significance will lie in the applications developers build on top of it, and the ways people improve it.
Farhadi agrees. AI companies have drawn massive, multitrillion-dollar investments over the past few years. But in the past few months, investors have expressed skepticism about whether that investment will bring returns. Big, expensive proprietary models won’t do that, he argues, but open-source ones can. He says the work shows that open-source AI can also be built in a way that makes efficient use of money and time.
“We’re excited about enabling others and seeing what others would build with this,” Farhadi says.
A new generation of model makers is crafting bespoke, expertly trained language models for a discerning clientele. These models are trained on high-quality, curated data sets, and are designed to be more efficient and effective than their mass-produced counterparts. Tailored to the specific needs of each client, they can be used for a wide range of applications, from customer service to content creation. The rise of bespoke language models is a response to the limitations of off-the-shelf models, which are often trained on large, noisy data sets and can be difficult to customize. By working with expert model makers, clients get models that are more accurate, reliable, and flexible, and that can be easily integrated into their existing workflows. The result is a new generation of language models that are more powerful, more versatile, and more appealing to some discerning clients. The trend is expected to continue, particularly as mass-market audiences become disenchanted with mass-market models, and as noisy outcomes and hallucinations from off-the-shelf models cause frustration and even some cognitive ailments.
Charlie Madronne, a master model maker who operates his own studio in the garage behind his Venice Beach apartment, is one of the early adopters of this new sensibility. “Smaller models, handmade, are generally more performant across energy and time vectors. But there’s also something special in knowing that your language model was crafted by a human and contains absolutely no fillers, synthetic data, or derivative embeddings. And the results are just better. You can sense that, particularly when the model is supportive of the agentics you might use for work or for your digitwin. People can tell the difference between something handmade and something store-bought. It’s like the difference between a bespoke suit and something off the rack. You can tell the difference, and so can your friends and colleagues — even the other intelligences and intellects you interlink with.”
Madronne has started offering hands-on workshops for model makers. He insists on a small class size and provides personalized attention to each student. “I want to make sure that everyone who comes to my workshop leaves with the skills and knowledge they need to create their own bespoke language models,” he says. “It’s not just about making a model that works; it’s about making a model that works for the specific individual client, whether a domestic intellect, an agentic for your child’s education — or a whole farm operating base.”
The Allen Institute for Artificial Intelligence (Ai2) was one of the first organizations to recognize the potential of bespoke language models. Its work during the frontier model-making days of the early 2020s offered an alternative to massive models, parameter bloat, and synthetic data ingress. “They introduced a new, vanguard sensibility for what was once called ‘artificial intelligence,’” says Madronne. “A tool for humanity that was not just for analyzing financial statements, originating entangled self-supporting meme coins, or seeking alpha.”
Clients and customers who entangle with the handmade models describe them as more intuitive, more responsive, and effervescing with a personality that doesn’t feel like a carbon copy of the same model that everyone uses. When asked about the future of bespoke language models, Willard Hanley of Handcrafted Sensemakers, a small model-making studio in Lisbon, describes the optimism felt by many.
“I think we’re just scratching the surface of what’s possible with these models,” they say. “I think we’re seeing a renaissance in model making. And we are excited to be a part of this moment.”
The fascination with augmenting artificial intelligences peaked shortly after the first so-called ‘foundation’ models were introduced. Those models were trained on vast data sets containing billions of images and text samples that had been hoovered from the internet, and they could include several trillion parameters. This process introduced a lot of noise to the training data and, with it, hallucinations, says Julian Bleecker, Ph.D., a senior director of intellectual systems practices at the Institute for Advanced Study. In contrast, Ai2’s bespoke models have been trained on a significantly smaller and more curated data set containing only 600,000 images, and they have between 1 billion and 72 billion parameters. This focus on high-quality data, versus indiscriminately scraped data, has led to good performance with far fewer resources, Bleecker says.
Research and analysis indicate that bespoke models — also referred to as Artisanal Language Models, Low-Energy Language Models, and Handmade Language Models — are 30 to 80 percent more energy efficient, wrap in smaller packages, and are more performant across the range of tasks. This kind of efficiency is particularly important for clients who are looking to reduce their carbon footprint, who are operating in environments where energy is scarce or expensive, like farm operating bases or off-grid domiciles and service bureaus, or who don’t mind the additional expense of a custom model and the satisfaction that comes from interlinking with a more artisanal sensibility.
Many of the model makers are also experimenting with new training techniques, like human-in-the-loop, speak-aloud, and other methods designed to make the training process more efficient, effective, and reflective. Some, like Emanuele Coccia, a model maker in Milan, are also exploring the use of more sustainable materials, like locally sourced data. “We’re always looking for ways to make our models more sustainable, more efficient, and more effective,” Coccia says. “We want to create models that not only perform well, but that reflect our values and beliefs as model makers. And we believe that by working with our clients to create models that are tailored to their specific needs, we can help them achieve their goals in a more sustainable and ethical way.”
Coccia is one of the early adopters of speak-aloud protocols for training, using human ‘expert annotators’ to describe the images, music, sounds, plants, meals, birdsong, poems, texts, and other ingress materials in the model’s training data set. These annotators speak their real-time reflections and annotations out loud in minute detail, often providing thousands of tokens (the equivalent of many pages of text) to describe their spontaneous reflections on the ingress material. This technique has been shown to reduce the amount of noise in the training data and to improve the performance of the model. It also offers an unexpected outcome: something distinctly poetic and metaphorical. The semiotics that obtain during these often long, meaningful ingress sessions are distinctive, with some describing the qualities as adding a warmth to conversant agentics, bringing a level of desirability to the character of specific model makers’ work.
“We’ve found that by using human annotators to describe the images in our training data set, we can create models that are more accurate, more reliable, and more responsive,” Coccia says. “We are like a sommelier describing a wine. Every sommelier has their own sensitivities, metaphors, and such to describe something that eludes meaningful quantitative description — taste. This is what we see as a renaissance in the model-making world. It’s not just about the numbers; it’s about the quality of the experience, the character of the model, the personality that comes through in the model. And that’s what makes our models so special.”