Understanding differences between various LLMs
Large Language Models (or LLMs) have revolutionized natural language processing (NLP) and enabled magical functionality well outside of traditional NLP tasks. This functionality includes everything from auto-generating PRs on your GitHub repo to automating marketing copy. In this blog post, we will develop a mental model for the landscape of LLMs, focusing on five key differentiating characteristics: access, licensing, size, context length, and language/domain. Understanding these characteristics is crucial for researchers, developers, and users as they navigate the diverse world of LLMs.
(1) Access: Open-access models, such as Falcon, MPT, Pythia, and others, are often available to researchers and developers in the form of downloadable sets of parameters and configuration files from a hub like Hugging Face (see, for example, this model card for Falcon-7B). The downloaded parameters (or weights) can be combined with code (e.g., Hugging Face’s transformers library) to perform text completions.
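For example, here is a minimal sketch of pulling an open-access model from the Hugging Face Hub and generating a completion with the transformers library. The model ID and generation settings are illustrative, and a 7B model needs a sizable GPU (or quantization) to run comfortably:

```python
# A minimal sketch: download open-access weights from the Hugging Face Hub
# and run a text completion locally with the transformers library.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b"  # illustrative open-access model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("The key differences between LLMs are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```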
Closed models, such as those from OpenAI, Cohere, and Anthropic, are only available to researchers and developers via an API or an enterprise deployment. The parameters and code implementations of these models are not publicly accessible, and use of the models may require some combination of paid subscriptions, enterprise partnerships, licensing agreements, and custom (product-specific) terms and conditions.
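Interacting with a closed model looks quite different: you send text to a hosted endpoint and pay per token. Here is a sketch using OpenAI's Python client; the model name is illustrative, and other providers (Cohere, Anthropic) offer their own clients with similar chat/completion endpoints:

```python
# A sketch of calling a closed model through a hosted API (OpenAI's Python
# client shown here). Assumes OPENAI_API_KEY is set in the environment;
# the model name is illustrative.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # pick whatever model your plan allows
    messages=[{"role": "user", "content": "Summarize this ticket in one sentence: ..."}],
)
print(response.choices[0].message.content)
```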
Implications for developers: Closed models still have better performance on a wide range of tasks compared to open-access LLMs (at the time of writing). This comes at the cost of uncertain data privacy (it is not always clear what happens to the data you send to these models) and sometimes shocking prices (you are charged per token for your usage). However, most developers don’t need to do a huge variety of things with LLMs. If you are looking to accomplish a specific task at scale with an LLM (e.g., data extraction), an open-access LLM might do the trick. You can host these models in your own infrastructure (using tools like SageMaker, Baseten, or Modal), ensure privacy, and create robust, customized AI systems.
(2) Licensing: LLMs are subject to a variety of licensing terms, which can dictate their usage, distribution, and commercial viability. Some models are open and free for commercial use (like the MPT family of models from MosaicML), allowing developers to leverage their capabilities without restrictions. Others may have open licenses with specific usage restrictions, such as those found in OpenRAIL licenses (for models like BLOOM) or custom licenses that prohibit commercial utilization (for models like Llama). Closed models typically have their licensing terms determined by the organization or company that develops them.
Implications for developers: Check the licensing of the models you are using. Especially for open-access models, you will not be able to leverage certain models in commercial products. One pro tip is to filter models on the Hugging Face Hub by license, as sketched below.
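The huggingface_hub client can do this filtering programmatically. The sketch below assumes license metadata is exposed as a "license:<id>" tag filter (which is how the Hub website's license filter works); attribute names may vary slightly between library versions:

```python
# A sketch of listing commercially friendly models on the Hugging Face Hub
# by filtering on a license tag. Assumes "license:apache-2.0" works as a tag filter.
from huggingface_hub import HfApi

api = HfApi()
models = api.list_models(
    filter="license:apache-2.0",  # only Apache-2.0 licensed models
    sort="downloads",
    direction=-1,
    limit=10,
)
for m in models:
    print(m.modelId)  # may be `m.id` in newer huggingface_hub versions
```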
(3) Size or Parameter Count: The size of an LLM is typically measured by the number of parameters it contains. Models with more parameters tend to perform better across a variety of tasks. However, increased model size also comes with increased computational resource requirements, making larger models more demanding to train and deploy. A model like Dolly-3B, for example, has around 3 billion parameters and can be deployed on a system with a consumer GPU. A model like Falcon-40B, on the other hand, cannot be deployed on a standard consumer GPU without some quantization, distillation, or other model optimization trickery.
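To make the arithmetic concrete, here is a rough sketch of counting parameters and estimating the memory footprint of the weights. The Hub ID is the one Databricks publishes for its 3B Dolly model (my assumption that this matches the model referenced above), and the 2-bytes-per-parameter figure assumes fp16 weights:

```python
# A rough sketch: count a model's parameters and estimate how much GPU memory
# the weights alone need in fp16 (2 bytes per parameter). Activations and the
# KV cache add more on top of this.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-3b")
n_params = sum(p.numel() for p in model.parameters())

print(f"{n_params / 1e9:.1f}B parameters, ~{n_params * 2 / 1e9:.0f} GB of weights in fp16")
```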
We don’t always know the parameter counts of closed models. In terms of deployment, this isn’t an issue for you, because the commercial company offering the model does the hosting. It may, however, influence the performance of the model in terms of both output quality and inference time.
Implications for developers: Larger models often exhibit enhanced language generation capabilities due to the complexity they capture when trained on massive amounts of data. However, the computational demands and latency associated with large models may restrict their accessibility to organizations with significant computing resources, limiting their widespread adoption.
(4) Context Length: Another important characteristic that distinguishes LLMs is their ability to handle different lengths of context. That is, you can stuff more or less into the input prompt. Some models are designed to excel with short-form prompts, where the focus is on processing concise text snippets or sentences. Other models (like Anthropic’s Claude or MosaicML’s StoryWriter) are trained on long-form content and accept very large prompts as input (on the order of 65k–100k tokens). This longer context enables developers to process larger bodies of text, such as paragraphs, articles, or even entire books.
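Whatever model you pick, it is worth checking that your prompt actually fits before you send it. Here is a sketch using an open model's tokenizer; the model ID, input file, and 2,048-token window are illustrative (for closed APIs you would use the provider's tokenizer or a library like tiktoken):

```python
# A sketch of checking whether a prompt fits a model's context window.
# The model ID and context length are illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
context_window = 2048  # Falcon-7B's trained context length

prompt = open("long_article.txt").read()  # hypothetical input file
n_tokens = len(tokenizer.encode(prompt))
if n_tokens > context_window:
    print(f"Prompt is {n_tokens} tokens and exceeds the {context_window}-token window; chunk it.")
```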
Implications for developers: If you are retrieving long-form content (e.g., large articles) for context in your LLM prompts, you will either need to use a model with a large context window or you will need to segment the content into “chunks.” You might not have a choice if you want to search and reason over very large bodies of text. In these cases, you will need to utilize something like a vector database (Chroma, Weaviate, Pinecone, etc.) to segment, embed, and query relevant chunks of your external knowledge, and you will need to inject that retrieved knowledge into shorter prompts. Frameworks like LlamaIndex and LangChain can help here, as sketched below.
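The chunking step might look like the following, using LangChain's character-based splitter. Chunk sizes and the input file are illustrative; in a full retrieval setup you would embed these chunks into a vector database and pull back only the most relevant ones at query time:

```python
# A sketch of splitting long content into overlapping chunks that fit a
# shorter context window. Chunk sizes are illustrative.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(open("long_article.txt").read())  # hypothetical input file
print(f"Split into {len(chunks)} chunks; embed these and retrieve only the relevant ones.")
```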
(5) Language(s) or Domain(s): Finally, LLMs differ in terms of the languages they support and the domains they are trained on. Some models are focused on English, while others, like Flan, support multiple languages. Additionally, certain LLMs are trained on specific domains like medicine, law, or finance, while others are designed for general domain tasks.
Implications for developers: If you are processing text in a language outside of the top 10 or so languages of the world, most large language models won’t work very well for you. You could take a multilingual model like XLM-RoBERTa or Flan and fine-tune it with language data you have access to, or you could try to machine translate your prompts and outputs (see the sketch below). Similarly, you may need to fine-tune if you are working in a very specialized domain that doesn’t have a purpose-built LLM already available.
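The translate-in, translate-out approach can be prototyped with open translation models. This sketch uses Helsinki-NLP's OPUS-MT German/English models purely as illustrative stand-ins; swap in the pair for your language if one exists:

```python
# A sketch of the "machine translate your prompts and outputs" approach:
# translate into English, run your English-centric LLM, translate back.
# The OPUS-MT model IDs are illustrative stand-ins for your language pair.
from transformers import pipeline

to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")
from_en = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

prompt_en = to_en("Wie funktionieren große Sprachmodelle?")[0]["translation_text"]
# ... send prompt_en to your English-centric LLM and get answer_en back ...
answer_en = "Large language models predict the next token in a sequence."  # placeholder LLM output
print(from_en(answer_en)[0]["translation_text"])
```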