Text generation and conversational technologies have been around for ages. With the recent boom of text generation models like GPT-4 and open-source alternatives (Falcon, MPT and more!) going mainstream, these technologies are here to stay and will be integrated more deeply into everyday products. In this post, I’ll give a brief background on how they work, the types of text generation models, and the tools in the Hugging Face ecosystem that enable building products with open-source alternatives, along with the challenges, open questions and how we respond to them.
Small Background on Text Generation
Text generation models are essentially trained with the objective of completing text. An early challenge in working with these models was controlling both the coherence and the diversity of the text through inference parameters and discriminative biases. Outputs that sounded more coherent were less creative and closer to the original training data, and didn’t sound like something a human would say. Recent developments overcame these challenges, and user-friendly UIs enabled everyone to try these models out.
Having a wider variety of open-source text generation models enables companies to keep their data private (it is one of their intellectual properties!), adapt models to their domains faster, and cut inference costs instead of relying on closed, paid APIs.
Simply put, these models are first trained with the objective of text completion, and later optimized using a process called reinforcement learning from human feedback (RLHF). This optimization mainly targets how natural and coherent the text sounds, rather than the validity of the answer. You can get more information about this process here. In this post, we will not go through the details of it.
One thing you need to know about before we move on is fine-tuning. This is the process of taking a very large model and transferring the knowledge contained in it to a specific use case, called a downstream task. These tasks can come in the form of instructions. As model size grows, models generalize better to instructions that do not exist in the fine-tuning data.
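To make the coherence versus diversity trade-off concrete, here is a minimal sketch of one common inference parameter, temperature. Temperature is not named above, so treat it as an illustrative assumption; the function name and the toy scores are hypothetical. Dividing the model’s raw scores (logits) by a low temperature concentrates probability on the top token (more coherent, less diverse), while a high temperature flattens the distribution (more diverse).

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    highest = max(scaled)
    exps = [math.exp(score - highest) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical scores for three candidate next tokens.
toy_logits = [2.0, 1.0, 0.5]
cold = softmax_with_temperature(toy_logits, temperature=0.5)
hot = softmax_with_temperature(toy_logits, temperature=2.0)
print("low temperature: ", [round(p, 3) for p in cold])
print("high temperature:", [round(p, 3) for p in hot])
```

With the low temperature, the first token takes a larger share of the probability than it does with the high temperature, which is the coherence side of the trade-off.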
As of now, there are two main types of text generation models. Models that complete text are referred to as causal language models and can be seen below. The best-known examples are GPT-3 and BLOOM. These models are trained on large amounts of text in which the latter part of the text is masked, so that the model learns to complete the given text.
All causal language models on Hugging Face Hub can be found here.
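The training setup for causal language models can be sketched in a few lines. This is a simplified illustration, not how any particular model’s data pipeline is implemented: the function name is hypothetical, and real models work on subword token IDs rather than whole words. Each training example hides everything after a given position and asks the model to predict the next token.

```python
def next_token_examples(tokens):
    """Build (visible context, token to predict) pairs from one token sequence."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Toy example with whole words standing in for tokens.
examples = next_token_examples(["The", "cat", "sat", "down"])
for context, target in examples:
    print(context, "->", target)
```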
The second type of text generation model is commonly referred to as a text-to-text generation model. These models are trained on text pairs, which can be questions and answers, or instructions and responses. The most popular ones are T5 and BART (which as of now aren’t state-of-the-art). Google has recently released the FLAN-T5 series of models. FLAN is a recent technique developed for instruction fine-tuning, and FLAN-T5 is essentially T5 fine-tuned using FLAN. As of now, the FLAN-T5 series of models is state-of-the-art and open-source, available on the Hugging Face Hub. Below you can see an illustration of how these models work.
GPT-3 itself is a causal language model, while the models behind ChatGPT (which is the UI for GPT-series models) are fine-tuned through RLHF on prompts that can consist of conversations or instructions. It’s an important distinction to make between these models. On the Hugging Face Hub, you can find causal language models, text-to-text models, and causal language models fine-tuned on instructions (which we’ll link to later in this blog post).
Snippets for using these models are given either in the model repository or on the documentation page of that model type in Hugging Face.
Most of the available text generation models are either closed-source or have licenses that limit commercial use. As of now, the MPT and Falcon models are fully open-source and have open-source friendly licenses (Apache 2.0) that allow commercial use. These models are causal language models. Versions fine-tuned on various instruction datasets exist on the Hugging Face Hub and come in various sizes depending on your needs.
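As a rough idea of what such a snippet looks like, here is a minimal sketch using the transformers library for both model types. The checkpoints gpt2 and google/flan-t5-small are my own picks because they are small, not recommendations from this post, and the helper name complete is hypothetical. It assumes transformers and PyTorch are installed; the first call downloads the model weights.

```python
def complete(prompt, model_id="gpt2", seq2seq=False, max_new_tokens=30):
    """Generate text with a causal (default) or text-to-text (seq2seq=True) model."""
    # Imported lazily so that defining this helper does not require the library.
    from transformers import (
        AutoModelForCausalLM,
        AutoModelForSeq2SeqLM,
        AutoTokenizer,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model_class = AutoModelForSeq2SeqLM if seq2seq else AutoModelForCausalLM
    model = model_class.from_pretrained(model_id)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example calls (each one downloads a checkpoint, so they are left commented out):
# print(complete("Once upon a time"))
# print(complete("Translate to German: Good morning", "google/flan-t5-small", seq2seq=True))
```

A causal model returns the prompt followed by its continuation, while a text-to-text model returns only the generated response.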
MPT-30B-Chat has a CC-BY-NC-SA license (non-commercial use only), while MPT-30B-Instruct has a CC-BY-SA 3.0 license and can be used commercially. Falcon-7B-Instruct has an Apache 2.0 license that allows commercial use. Another popular model is OpenAssistant, built on Meta’s LLaMA model. LLaMA has a restrictive license, so OpenAssistant checkpoints built on LLaMA don’t have fully open-source licenses, but there are other OpenAssistant models built on open-source models like Falcon or Pythia that can be used.
Existing instruction datasets are either crowd-sourced or built from the outputs of existing models (e.g. the models behind ChatGPT). The Alpaca dataset created by Stanford was generated from the outputs of OpenAI models, and OpenAI’s terms of use don’t allow those outputs to be used for training competing models. On the other hand, there are various crowd-sourced instruction datasets with open-source licenses, like oasst1 (created by thousands of people voluntarily!) or databricks/databricks-dolly-15k. Models fine-tuned on these datasets can be distributed.
How can you use these models?
Response times and handling concurrent users remain a challenge when serving these models. To address this, Hugging Face has released text-generation-inference (TGI), an open-source serving solution for large language models built with Rust, Python and gRPC.
TGI currently powers HuggingChat, the chat UI for large language models. Currently it has OpenAssistant as its backend. You can chat as much as you want with HuggingChat, and enable the search feature for validated responses. You can also give feedback on each response so that model authors can train better models. The UI of HuggingChat is also open-sourced (yes 🤯) and soon there will be a Docker image release on Hugging Face Spaces (the app store of machine learning) so you can have your very own HuggingChat instance.
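Once a TGI server is running, you talk to it over HTTP. Below is a minimal sketch of a client for its /generate endpoint using only the Python standard library. The address http://localhost:8080 is an assumption about where you have started the server, and the helper names are hypothetical.

```python
import json
from urllib import request


def build_generate_payload(prompt, max_new_tokens=64):
    """Build the JSON body for a TGI /generate request."""
    return {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}


def generate(prompt, base_url="http://localhost:8080", max_new_tokens=64):
    """Send a prompt to a running TGI server and return the generated text."""
    body = json.dumps(build_generate_payload(prompt, max_new_tokens)).encode("utf-8")
    req = request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as response:
        return json.loads(response.read())["generated_text"]


# Requires a server listening at base_url, so the call is left commented out:
# print(generate("What is deep learning?"))
```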
How to find the best model as of now?
Hugging Face hosts an LLM leaderboard here. The leaderboard is built from models that people upload: metrics that evaluate the text generation task are calculated on Hugging Face’s clusters and later added to the leaderboard. If you can’t find the language or domain you’re looking for, you can filter them here.
Models created with love by Hugging Face with BigScience and BigCode
Hugging Face has two main large language models, BLOOM 🌸 and StarCoder 🌟. StarCoder is a causal language model trained on code from GitHub (with 80+ programming languages 🤯). It’s not fine-tuned on instructions, and thus it serves more as a coding assistant that completes given code, e.g. translating Python to C++, explaining concepts (what’s recursion?) or acting as a terminal. You can try all of the StarCoder checkpoints in this application. It also comes with a VSCode extension.
BLOOM is a causal language model trained on 46 languages and 13 programming languages. It is the first open-source model to have more parameters than GPT-3. You can find the available checkpoints in the BLOOM documentation.
Bonus: Parameter Efficient Fine Tuning (PEFT)
If you’d like to fine-tune one of the existing large models on your own instruction dataset, it is nearly impossible to do so on consumer hardware and later deploy the result (since the instruction models are the same size as the original checkpoints used for fine-tuning). PEFT is a library that allows you to fine-tune only a small part of the parameters for more efficiency. With PEFT, you can do low-rank adaptation (LoRA), prefix tuning, prompt tuning and p-tuning.
That’s all for this blog post. I’m planning to write another one as new tools and models are released. Please let me know what you think or build!
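To see why LoRA helps, a bit of arithmetic is enough. LoRA freezes a weight matrix of shape d_out x d_in and trains two small matrices of shapes d_out x r and r x d_in instead. The sketch below counts the trainable parameters for a single matrix; the 4096 x 4096 size and rank 8 are illustrative assumptions, not values taken from a specific model, and the function name is hypothetical.

```python
def lora_parameter_counts(d_out, d_in, rank):
    """Return (full fine-tuning, LoRA) trainable parameter counts for one weight matrix."""
    full = d_out * d_in
    # LoRA trains B (d_out x rank) and A (rank x d_in) and keeps the original weights frozen.
    lora = rank * (d_out + d_in)
    return full, lora


full, lora = lora_parameter_counts(d_out=4096, d_in=4096, rank=8)
print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank 8):    {lora:,} trainable parameters")
print(f"LoRA share:       {100 * lora / full:.2f}% of the full matrix")
```

For this example matrix, LoRA trains 65,536 parameters instead of 16,777,216, which is well under 1% of the original.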