Today Databricks released Dolly 2.0, the next version of the large language model (LLM) with ChatGPT-like human interactivity (aka instruction-following) that the company released just two weeks ago.
The company says Dolly 2.0 is the first open-source, instruction-following LLM fine-tuned on a transparent, freely available dataset that is itself open-sourced for commercial use. That means Dolly 2.0 can be used in commercial applications without paying for API access or sharing data with third parties.
According to Databricks CEO Ali Ghodsi, while there are other LLMs out there that can be used for commercial purposes, “They won’t talk to you like Dolly 2.0.” And, he explained, users can modify and improve the training data because it is made freely available under an open-source license. “So you can make your own version of Dolly,” he said.
Databricks released the dataset used to fine-tune Dolly 2.0
Databricks said that as part of its ongoing commitment to open source, it is also releasing the dataset on which Dolly 2.0 was fine-tuned, called databricks-dolly-15k. This is a corpus of more than 15,000 records generated by thousands of Databricks employees, and Databricks says it is the “first open source, human-generated instruction corpus specifically designed to enable large language models to exhibit the magical interactivity of ChatGPT.”
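To make the idea of a human-generated instruction corpus concrete, here is a minimal sketch of what a single record in a dataset like databricks-dolly-15k might look like and how it could be validated before fine-tuning. The field names (`instruction`, `context`, `response`, `category`) and the sample content are assumptions for illustration; consult the dataset's documentation for the actual schema.

```python
import json

# Hypothetical example of one instruction record in the style of
# databricks-dolly-15k; field names are assumed for illustration.
record = {
    "instruction": "Summarize why open-source licensing matters for LLMs.",
    "context": "",
    "response": "An open license lets anyone use, modify and redistribute "
                "the model and its training data, including commercially.",
    "category": "general_qa",
}

def validate_record(rec: dict) -> bool:
    """Check that a record carries the minimum fields an
    instruction-following fine-tune would need."""
    return bool(rec.get("instruction")) and bool(rec.get("response"))

# Instruction corpora are commonly stored one JSON object per line (JSONL),
# so a round-trip through json confirms the record serializes cleanly.
line = json.dumps(record)
parsed = json.loads(line)
assert validate_record(parsed)
```

Because each record is a self-contained prompt/response pair written by a person rather than scraped from another model's output, the corpus avoids the licensing entanglements discussed below.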
There has been a wave of instruction-following, ChatGPT-like LLM releases over the past two months that are considered open-source by many definitions (or offer some level of openness or gated access). One was Meta’s LLaMA, which in turn inspired others like Alpaca, Koala, Vicuna and Databricks’ Dolly 1.0.
Many of these “open” models, however, were under “industrial capture,” said Ghodsi, because they were trained on datasets whose terms purport to limit commercial use, such as the 52,000-record question-and-answer dataset from the Stanford Alpaca project, which was generated using output from OpenAI’s ChatGPT. OpenAI’s terms of use, he explained, include a rule barring the use of its output to develop services that compete with OpenAI.
Databricks, however, figured out how to get around this issue: Dolly 2.0 is a 12 billion-parameter language model based on the open-source EleutherAI Pythia model family and fine-tuned exclusively on a small, open-source corpus of instruction records (databricks-dolly-15k) generated by Databricks employees. This dataset’s licensing terms allow it to be used, modified and extended for any purpose, including academic or commercial applications.
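For readers who want a sense of how such an instruction-tuned model is queried in practice, below is a minimal sketch using the Hugging Face transformers pipeline API. The model id `databricks/dolly-v2-12b`, the prompt template, and the `generate` helper are assumptions for illustration; check the model card for the exact recommended usage, and note that actually running the model requires substantial GPU memory.

```python
def build_prompt(instruction: str, context: str = "") -> str:
    """Wrap a user instruction in a generic instruction-tuning template.
    This template is illustrative, not Databricks' actual one."""
    if context:
        return (
            "Below is an instruction paired with context. "
            "Write a response that completes the request.\n\n"
            f"Instruction: {instruction}\nContext: {context}\nResponse:"
        )
    return (
        "Below is an instruction. Write a response that completes "
        f"the request.\n\nInstruction: {instruction}\nResponse:"
    )

def generate(instruction: str) -> str:
    # Downloads roughly 24 GB of weights and needs a GPU; defined here
    # as a sketch but not invoked.
    from transformers import pipeline
    pipe = pipeline(
        "text-generation",
        model="databricks/dolly-v2-12b",  # assumed model id
        trust_remote_code=True,
        device_map="auto",
    )
    out = pipe(build_prompt(instruction), max_new_tokens=128)
    return out[0]["generated_text"]

print(build_prompt("Explain what an instruction-following LLM is."))
```

Because the weights and the fine-tuning corpus are both openly licensed, this kind of call can be made in a commercial product without the API-access or data-sharing constraints mentioned above.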
Models trained on ChatGPT output have, up until now, been in a legal gray area. “The whole community has been tiptoeing around this and everybody’s releasing these models, but none of them could be used commercially,” said Ghodsi. “So that’s why we’re super excited.”
Dolly 2.0 is small but mighty
A Databricks blog post emphasized that like the original Dolly, the 2.0 version is not state-of-the-art, but “exhibits a surprisingly capable level of instruction-following behavior given the size of the training corpus.” The post adds that the level of effort and expense necessary to build powerful AI technologies is “orders of magnitude less than previously imagined.”
“Everyone else wants to go bigger, but we’re actually interested in smaller,” Ghodsi said of Dolly’s diminutive size. “Second, it’s high-quality. We looked over all the answers.”
Ghodsi added that he believes Dolly 2.0 will start a “snowball” effect, where others in the AI community can join in and come up with other alternatives. The limit on commercial use, he explained, was a big obstacle to overcome: “We’re excited now that we finally found a way around it. I promise you’re going to see people applying the 15,000 questions to every model that exists out there, and they’re going to see how many of these models suddenly become kind of magical, where you can interact with them.”