Remember the first time you used ChatGPT, and how amazing it was to converse with a machine and make it concoct fantasy stories, poems, and jokes. ChatGPT is powered by a Large Language Model (LLM), and LLMs like the one behind ChatGPT are at the origin of the AI craze that took the world by storm. For the first time, people could interact with machines using everyday language in a seamless, almost human, way. Following this breakthrough, thousands of AI-reliant businesses have sprouted, most of them using LLMs to start and grow new products. And to make this work, they all use APIs. In this article, we will define LLM APIs, explain how to use them, and benchmark the leading options.
What’s a Large Language Model?
A large language model is an AI model whose defining characteristic is its sheer size. It is the product of technological breakthroughs in AI accelerators — the hardware that runs AI models — which are now able to process colossal amounts of text data. Trained mainly on data harvested from the Internet, LLMs “learn” to write words and sentences by applying a probabilistic approach.
For example, the Mistral 7B model packs 7.3 billion parameters. Though impressive, this is considerably smaller than the largest models out there.
In short, large language models have learned to respond “I’m fine, thank you!” to a “Hey, how are you?” prompt, because they’ve seen this pattern thousands of times.
This is where the LLMs’ core strength lies: in their ability to produce articulate language, transforming “computer language” into one that is human-friendly.
But this strength also hides some weaknesses. Their reliance on a probabilistic approach means that LLMs understand neither the prompts they receive nor the responses they write, which can lead to misunderstandings or hallucinations.
Nonetheless, when given the right data and command, LLMs deliver on an impressive number of tasks, from API calls to script generation, opening enormous business opportunities in virtually all industries.
How to use and configure LLM APIs
APIs are what enable users to integrate LLMs into their applications. They open the gateway to the application’s full potential: you bring the data, service, etc. — the API brings the AI layer.
At Blobr we use your APIs to build high-performing AI Copilots.
But, we also use the LLM APIs to automate business tasks: for example, we integrated the OpenAI API into Google Sheets to analyze and categorize content.
Anyone can gain from using LLM APIs, but becoming familiar with the API parameters can be tough. Nevertheless, knowing how to configure the LLM is key to generating the best outcomes for your use case.
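To make the discussion concrete, here is a minimal sketch of a call to the OpenAI Chat API, using the openai Python package (pre-1.0 syntax); the API key and prompt are placeholders. The parameters discussed below all slot into this kind of call.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder: use your own key

# A minimal chat completion: one user message, default parameters.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hey, how are you?"}],
)

print(response.choices[0].message.content)
```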
Choosing the model
Every LLM provider usually offers several models, with different sets of pricing.
The older, resource-optimized ones are cheaper, whereas the newest, more resource-hungry ones are more expensive.
The model you choose will be determined by your use case. If you want to use the LLM for sentiment analysis, the older models are wholly adequate. But if you need more imaginative outcomes, you might want to take a look at the latest models.
Configuring the parameters
Parameters are the levers at your disposal, enabling you to configure the LLM in order to get the best outputs.
System
Only available on some modes and for some LLMs.
For example, in the Chat mode of the OpenAI API, you can add a description of the role you want the system to take. This will help the LLM adapt its answers and produce better outputs.
Like any prompt engineering, finding the System prompt that will generate the most satisfying outputs can take several tries.
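As a sketch, here is how a System prompt is passed on OpenAI’s Chat API: it is simply the first message, with its role set to "system".

```python
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        # The system message describes the role the model should take.
        {"role": "system", "content": "You are a concise support agent for a SaaS product."},
        {"role": "user", "content": "My CSV export keeps failing. What should I check?"},
    ],
)
print(response.choices[0].message.content)
```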
Number of Tokens
For LLMs, outputs (like prompts) are measured in tokens, not words.
A short word is usually one token, but longer words are decomposed into several tokens.
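You can check how a given OpenAI model splits text into tokens with the tiktoken library; a quick sketch:

```python
import tiktoken

# Load the tokenizer used by GPT-3.5 Turbo.
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

tokens = enc.encode("Tokenization decomposes longer words.")
print(len(tokens))                        # token count, not word count
print([enc.decode([t]) for t in tokens])  # how the sentence was split
```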
This parameter can be used to limit or expand the LLM output. It is useful for two reasons:
First, it can help cut costs. The token is the base unit of monetization: all LLM APIs are pay-as-you-go. And even though a token’s price is usually very low, the bill at the end of the month can quickly mount up — especially if you are implementing the API in your app.
But more importantly, it can prevent the LLM from generating too much volume. If your use case is limited, like sentence completion or classification, you might want to restrict the output to a few words. LLMs can be a bit wordy; this is how you avoid the problem.
Don’t forget that though you can benefit from restricting output, most LLMs also have a token limit of their own.
Token limits usually sit around 1,024 for smaller models and can reach 2,048, or even 100,000 for the most advanced Anthropic models.
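On the OpenAI API this cap is the max_tokens parameter (Anthropic’s equivalent is max_tokens_to_sample); a sketch:

```python
import openai

# Cap the completion at 5 tokens: plenty for a one-word classification.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Classify the sentiment of: 'Great product!'"}],
    max_tokens=5,
)
```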
Temperature
From 0 to 1 (or 2 or more, depending on the LLM), this is the lever that will inject unpredictability into the LLM.
At 0, the LLM will produce predictable, consistent, and repetitive answers.
At 1, it will be more creative, and you’ll get newer, more original outputs.
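A sketch of the same prompt run at both ends of the scale on the OpenAI API:

```python
import openai

# Same prompt, two temperatures: 0 is near-deterministic, 1 adds variety.
for temperature in (0, 1):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Suggest a name for a coffee shop."}],
        temperature=temperature,
    )
    print(temperature, response.choices[0].message.content)
```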
Top-P & Top-K
In order to understand the two following parameters, first you need to know a bit more about how LLMs work.
When given a prompt, the LLM produces its output by choosing the words (or tokens) with the highest probability. Remember the patterns we talked about earlier? They are what the LLM uses to establish a probability score. This score also determines a token’s ranking.
Top-P adds up the sum of probabilities
Used to set nucleus sampling, it also goes from 0 to 1.
That number sets a cumulative probability threshold for the tokens.
For example, if you set Top-P at 0.5, the LLM will only consider the smallest set of most probable tokens whose combined probability reaches 50%.
Top-K takes the most probable tokens
The second parameter is also based on a token’s probability score.
Here the parameter is not set as a sum of probabilities but as a rank: the model samples only from the K most probable tokens.
For example, on the Cohere API, you can push that parameter up to 500, meaning — if pushed that far — that the LLM will compose its output using the 500 most probable tokens.
Pushed too far, and combined with a high Temperature, both parameters add so much randomness that the LLM’s responses quickly border on nonsense.
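As a sketch: OpenAI exposes Top-P (but not Top-K), while Cohere’s Python SDK exposes both, as p and k.

```python
import openai
import cohere

# OpenAI: nucleus sampling only.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write a tagline for a bakery."}],
    top_p=0.5,  # sample from the smallest token set totalling 50% probability
)

# Cohere: both levers are available.
co = cohere.Client("YOUR_API_KEY")  # placeholder key
generation = co.generate(
    prompt="Write a tagline for a bakery.",
    p=0.75,  # Top-P
    k=50,    # Top-K: sample only from the 50 most probable tokens
)
print(generation.generations[0].text)
```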
Stop Sequence
This parameter can be used alongside the token limit.
Stop sequence determines the length of the output with a string, not a number of tokens. It can be a full stop, a comma, or any string you choose.
It is useful to control the output’s length, but also to enforce patterns, according to your needs.
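On the OpenAI API the parameter is called stop and accepts up to four sequences; a sketch:

```python
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Give me a slogan for a gym."}],
    stop=["\n"],  # cut generation at the first line break: one-line answers only
)
```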
Presence and Frequency Penalties
These two parameters can help the LLM produce more diverse outputs, based on what’s already been produced.
The Presence penalty takes into account whether a token has already been used — in both the prompt and the preceding output — and applies a penalty consistent with the level of your setting.
At 0, there is no penalty; at the maximum (1 or 2, depending on the API), a token is heavily penalized after its first appearance.
The Frequency penalty works the same way, except that instead of applying a penalty after a single appearance, it takes the token’s frequency into account.
In short, the more a token appears, the higher the penalty.
Be careful with both these parameters! Though they are necessary for the LLM to produce more diverse and creative outputs, when pushed too far they can also lead to nonsensical sentences.
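Both penalties are plain keyword arguments on the OpenAI API (which accepts values from -2 to 2); a sketch:

```python
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write a short poem about the sea."}],
    presence_penalty=0.6,   # penalize tokens that have appeared at all
    frequency_penalty=0.8,  # penalize tokens in proportion to how often they appear
)
```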
Other parameters
Best of
Only available on some modes and for some LLMs.
You can also ask the LLM to generate several outputs and display only the best.
Show probabilities
You can opt to see the probabilities at play. This option highlights the words according to their score.
This provides helpful insight into how the LLM works and the way in which the parameters impact outcomes.
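Both options map to parameters on OpenAI’s legacy Completions endpoint (not the Chat one): best_of for server-side selection, and logprobs as the closest API equivalent of the Playground’s “Show probabilities” view. A sketch:

```python
import openai

response = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt="The capital of France is",
    max_tokens=5,
    best_of=3,   # generate 3 completions server-side, return the best one
    logprobs=5,  # also return the 5 most probable tokens at each position
)
print(response.choices[0].text)
print(response.choices[0].logprobs.top_logprobs)
```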
Fine-tuning, Function Calling, LangChain… Ways to Customize the LLM
The above parameters may not be sufficient, and your use case might require advanced training in order to increase the accuracy of the outputs.
Fine-tuning
The basic principle of fine-tuning is to give the LLM your own dataset to analyze and ingest. This technique has proved to be quite successful.
Fine-tuning is mainly used to:
- Set a particular tone — it can start to talk like you! — or a format.
- Improve results and avoid hallucinations or mistakes.
- Provide edge-case examples to help the LLM broaden its understanding of a dataset.
- Train the LLM to handle specific domains, like medical or legal subjects; or specific knowledge.
Building the dataset can be a challenging task: it usually takes north of a hundred examples for a fine-tuned model to begin to make an impact.
The other limit of fine-tuning is the finite nature of the data pool. If your use case changes over time, making the initial fine-tuning less relevant, you will have to enrich the fine-tuned model further — or start to see inaccuracies.
You can find additional documentation on fine-tuning on the Cohere and OpenAI websites.
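A minimal sketch of OpenAI’s 2023-era fine-tuning flow with the openai package (pre-1.0 syntax); the file name is a placeholder:

```python
import openai

# examples.jsonl: one chat transcript per line, e.g.:
# {"messages": [{"role": "system", "content": "You are a witty assistant."},
#               {"role": "user", "content": "Hey, how are you?"},
#               {"role": "assistant", "content": "Splendid, as ever. And you?"}]}

# 1. Upload the training file.
training_file = openai.File.create(file=open("examples.jsonl", "rb"), purpose="fine-tune")

# 2. Launch the fine-tuning job on GPT-3.5 Turbo.
job = openai.FineTuningJob.create(training_file=training_file.id, model="gpt-3.5-turbo")
print(job.id)  # poll this job until it completes, then call the resulting model
```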
Function Calling
This OpenAI feature lets the LLM decide when and how to make API calls and handle the responses.
For example, function calling can help a chatbot determine when to call an API to answer a question, convert a natural-language request into an API call, or extract structured data and use it in a reply.
This lets the chatbot retrieve data from APIs or databases, giving it access to external data or your own.
Here’s the OpenAI documentation for function calling.
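A sketch of the flow: you describe a function (here a hypothetical get_weather, invented for illustration) in JSON Schema, and the model replies with the name and arguments of the call it wants to make; your code then performs the actual request.

```python
import json
import openai

# A hypothetical function, described in JSON Schema.
functions = [{
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    functions=functions,
)

message = response.choices[0].message
if message.get("function_call"):
    # The model returns structured arguments; your code makes the real API call.
    args = json.loads(message.function_call.arguments)
    print(message.function_call.name, args)  # e.g. get_weather {'city': 'Paris'}
```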
LangChain, a framework for building custom apps
LangChain was launched in the midst of the ChatGPT craze of late 2022. It acts as a framework designed to ease the development of applications using LLMs.
The hundreds of integrations make it possible to create apps that connect numerous LLMs to even more sources.
The LangChain framework is comparable to OpenAI’s function calling, and acts in a similar way:
- The LLM receives and interprets the prompt,
- LangChain connects the LLM to one or several sources, and sends the request (e.g. databases, document loaders, wrapped APIs),
- The source receives the request and answers it,
- LangChain receives the answer and transmits it to the LLM,
- The LLM interprets the answers and produces the output.
But unlike Function calling, LangChain is agnostic. You can connect the LLM of your choice to the source.
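A minimal sketch using the 2023-era LangChain API, wiring a prompt template to OpenAI; swapping ChatOpenAI for another provider’s wrapper leaves the chain untouched.

```python
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# The LLM wrapper is interchangeable: the chain does not care which provider it is.
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
prompt = PromptTemplate.from_template("Summarize this in one sentence: {text}")
chain = LLMChain(llm=llm, prompt=prompt)

print(chain.run(text="LangChain connects LLMs to databases, document loaders, and APIs."))
```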
The Leading LLM APIs
Though OpenAI has the leading large language models, there are many more out there. New companies now offer API access to competitive LLMs. So how do you choose the right one? Your choice depends on your use case: if you want to classify things, a limited model will do the trick, but if you need more advanced generative capabilities, the latest models will be better suited.
Look at the additional functionalities as well: fine-tuning and function calling are far from common features, even though tools like LangChain can help you make your flow LLM-agnostic.
To compare the models, we looked at 4 different criteria:
- The models available. Usually two are displayed: a cheaper, older model and a more recent one.
- The token limits — how many tokens can an LLM process before reaching its limit? The higher the limit, the more context the model can take in, which translates into more accurate and precise outputs.
- The price for 1,000 completion tokens. APIs are all pay-as-you-go, meaning that you pay for what you use. Two sets of pricing are usually applied: an input price, covering the tokens in your prompt, and an output price, covering the answer generated by the LLM. We quote the output price, which is the more expensive of the two.
- The modes available, or ways you can customize the model, like fine-tuning.
OpenAI
The clear, Microsoft-backed front-runner, with the most advanced API. The OpenAI APIs enable you to integrate GPT models into any app. They are behind the GitHub, Bing, and Windows copilots. And, of course, ChatGPT.
Models available: GPT-3.5 Turbo, GPT-4 — davinci-002 and babbage-002 are available only for fine-tuning
Token limits: 4,097 to 16,385 tokens for GPT-3.5 Turbo, 8,192 to 32,768 for GPT-4.
Price for 1,000 completion tokens: From $0.002 for GPT-3.5 Turbo (4K context) to $0.12 for GPT-4 (32K context).
Modes available: Completion, fine-tuning, function calling.
Anthropic
Anthropic, Amazon’s choice in the AI war, launched its Claude 2 model in July 2023. Compared to the OpenAI models, the main difference is its token limit, which is 3 times bigger than GPT-4’s.
Anthropic’s approach is more B2B than OpenAI’s, with a focus on making the models safer and more customizable. Notion AI, for example, is partly powered by Anthropic models.
Models available: Claude Instant, Claude 2
Token limits: 100,000 tokens for both models.
Price for 1,000 completion tokens: From $0.0055 for Claude Instant to $0.0336 for Claude 2.
Modes available: Completion.
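A minimal sketch with Anthropic’s 2023-era Python SDK and its completions API; the key is a placeholder.

```python
from anthropic import Anthropic, HUMAN_PROMPT, AI_PROMPT

client = Anthropic(api_key="YOUR_API_KEY")  # placeholder key

completion = client.completions.create(
    model="claude-2",
    max_tokens_to_sample=300,  # Anthropic's equivalent of max_tokens
    prompt=f"{HUMAN_PROMPT} Summarize the plot of Moby-Dick.{AI_PROMPT}",
)
print(completion.completion)
```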
Cohere
Another emerging scale-up, Cohere offers a wide range of AI tools, including summarizing, classifying, and reranking. Like Anthropic, it targets B2B and has built tools for companies to implement.
Cohere’s Coral is connected to the Internet but can also be connected to a company’s data.
Token limits: Not available.
Price for 1,000 completion tokens: $0.002 for Generate and Chat.
Modes available: Completion, Fine-tuning, Web search.
LLaMA
Developed by Meta, the LLaMA models are free and open-source. You can request to download the full model weights on the Meta website. Like other local LLMs, LLaMA is designed to be installed on your own hardware, which means you have to host the model yourself.
Hosting the model on your side requires some coding knowledge and (most importantly!) a powerful machine, but it also means you have a free hand to fine-tune it at will.
Though it doesn’t have an official API per se, many third-party tools have created API access for this model. Access to those APIs may be monetized. If you wish to use the model without the hassle of installing it, this can be a good compromise.
Models available: LLaMA 2 7B, 13B, and 70B.
Token limits: 4,096 tokens
Price for 1,000 completion tokens: Free if hosted on-premise. Websites like Deepinfra offer API access to LLaMA 2 70B at $0.001.
Modes available: Completion, Fine-tuning.
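One common way to run LLaMA locally is the llama-cpp-python package; a sketch, assuming you have already downloaded a quantized model file (the path is a placeholder):

```python
from llama_cpp import Llama

# Load a quantized LLaMA 2 model file from disk (GGUF format).
llm = Llama(model_path="./llama-2-7b-chat.Q4_K_M.gguf")

output = llm(
    "Q: Name three French cheeses. A:",
    max_tokens=64,
    stop=["Q:"],  # stop before the model starts inventing a new question
)
print(output["choices"][0]["text"])
```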
Mistral
The French alternative only came into existence in early 2023 but has already made a name for itself.
In only a few months, Mistral AI released its first LLM, Mistral 7B, trained on smaller data pools than LLaMA while delivering more accurate results than the LLaMA 13B model.
Like LLaMA, the model is open-source and can be downloaded for on-premise use. Third parties offer API access to the model.
Models available: Mistral 7B, and Mistral 7B Instruct, fine-tuned for chat completion.
Token limits: 8,000
Price for 1,000 completion tokens: Free if hosted on-premise.
Modes available: Completion, Fine-tuning.
Large language model APIs are a powerful way to integrate AI into your daily workflow and automate tasks. Implementation can be rough, but it is well worth the trouble. The cost is normally very low and the customization possibilities fairly broad.
All you have to do is determine what you want AI to deliver, which will help you choose the right model. If your customization requirements are high, you will need to be able to fine-tune the LLM. And if you think you will be using it all the time (and are able to install and run it), an open-source model is your answer.