Migration Assistance - Get expert help migrating your business workloads to Neon.
Engineering

Announcing pg_tiktoken: A Postgres Extension for Fast BPE Tokenization

Analyze and Process Text Data in Neon with pg_tiktoken's Fast BPE Tokenization

We’re excited to announce the release of the pg_tiktoken extension on Neon. This new Postgres extension provides fast and efficient tokenization using the BPE (Byte Pair Encoding) algorithm. pg_tiktoken is a wrapper around OpenAI’s tokenizer, known for its speed and performance in handling natural language processing tasks.

pg_tiktoken solves the problem of tokenizing text data within a Postgres database. The tiktoken_encode function allows you to tokenize text inputs and returns a tokenized output, making it easier to analyze and process text data for various applications. The tiktoken_count function enables users to return the number of tokens in a text, which is useful for checking text length limits, like those imposed by OpenAI’s language models.

What are text tokens

Language models process text in chunks known as tokens. In English, a token can range from a single character to a complete word such as “a” or “apple.” In certain languages, tokens can even be shorter than one character or longer than one word.

For instance, the sentence “Neon is Serverless Postgres” is divided into seven tokens: [“Ne”, “on”, ” is”, ” Server”, “less”, ” Post”, “gres”].

Get started with pg_tiktoken

With pg_tiktoken, you can tokenize text data within your Postgres database, making it easier to analyze and process the data for various applications. The extension supports various text inputs, including multiple languages and special characters, and is optimized for speed and efficiency.

To start with pg_tiktoken, you need to install the extension: 

CREATE EXTENSION pg_tiktoken

Once the extension is installed, you can use the pg_tiktoken function within your queries to tokenize text data within your database. The function takes a text input and returns a tokenized output:

SELECT tiktoken_encode('text-davinci-003', 'The universe is a vast and captivating mystery, waiting to be explored and understood.');

tiktoken_encode

---------------

{464,6881,318,257,5909,290,3144,39438,10715,11,4953,284,307,18782,290,7247,13}

This will tokenize the input text using the BPE algorithm and return the tokenized output.

You can also return the number of tokens in a text using the `tiktoken_count` function:

SELECT tiktoken_count('text-davinci-003', 'The universe is a vast and captivating mystery, waiting to be explored and understood.');

tiktoken_count

---------------

17

Supported models

tiktoken_count and tiktoken_encode functions accept both encoding and OpenAI model names as the first argument:

tiktoken_count(<encoding or model>,<text>)

Here is the list of supported models:

Encoding nameOpenAI models
cl100k_baseChatGPT models, text-embedding-ada-002
p50k_baseCode models, text-davinci-002, text-davinci-003
p50k_editUse for edit models like text-davinci-edit-001, code-davinci-edit-001
r50k_base (or gpt2)GPT-3 models like davinci

Use pg_tiktoken with the ChatGPT model

You can persist the message history in Postgres and query the database to get messages that fit within OpenAI’s model limits. 

Here is an example of the “message” table:

CREATE TABLE message (
  role VARCHAR(50) NOT NULL, -- equals to 'system', 'user' or 'assistant'
  content TEXT NOT NULL,
  created TIMESTAMP NOT NULL DEFAULT NOW(),
  n_tokens INTEGER -- number of content tokens
);

The gpt-3.5-turbo model requires the following parameters:

{
  "model": "gpt-3.5-turbo",
  "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."}
    ]
}

The “messages” parameter consists of an array of message objects. Each object within the array should contain two key pieces of information: the role of the message sender (either “system,” “user,” or “assistant”) and the actual message content. Conversations can be brief, consisting of just one message, or they may continue for multiple pages as long as the sum of the message tokens is within the 4096 limit. 

You can use the following query to insert role, content, and the number of tokens into the database:

INSERT INTO message (role, content, n_tokens)
VALUES ('user', 'Hello, how are you?', tiktoken_count('Hello, how are you?'));

Manage your text tokens

When you have a conversation with more tokens than a model can handle (like more than 4096 tokens for gpt-3.5-turbo), you’ll need to shorten your text to fit the limit. But be careful. If you remove a message from the conversation, the model will not know about it anymore.

Also, you might receive incomplete replies if your conversation is too long. For instance, if your gpt-3.5-turbo conversation is 4090 tokens long, you’ll only get a reply of 6 tokens.

The query below allows you to get messages up to your desired token limits:

WITH cte AS (
  SELECT role, content, created, n_tokens,
         SUM(tokens) OVER (ORDER BY created DESC) AS cumulative_sum
  FROM message
)

SELECT role, content, created, n_tokens, cumulative_sum
FROM cte
WHERE cumulative_sum <= <MAX_HISTORY_TOKENS>;

`MAX_HISTORY_TOKENS` represents the conversation history you want to keep for the chat completion and follow the following formula:

MAX_HISTORY_TOKENS = MODEL_MAX_TOKENS – NUM_SYSTEM_TOKENS – NUM_COMPLETION_TOKENS

Let’s use the example seen above and assume that the desired completion length is 100 tokens (`NUM_COMPLETION_TOKENS=90`).

MAX_HISTORY_TOKENS = 4096 – 6 – 90 = 4000

{
  "model": "gpt-3.5-turbo", // MODEL_MAX_TOKENS = 4096
  "messages": [
         {"role": "system", "content": "You are a helpful assistant."}, // NUM_SYSTEM_TOKENS = 6
         {"role": "user", "content": "Who won the world series in 2020?"},
         {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
         {"role": ...}
         .
         .
         .
         {"role": "user", "content": "Great! Have a great day."// MAX_HISTORY_TOKENS = 4000
    ] 
}

Conclusion

In summary, the pg_tiktoken extension provides fast and efficient tokenization using the BPE algorithm within Postgres, making it easier to analyze and process text data for various applications. We explored how to use the tiktoken_count function to query the database and retrieve messages that fit within OpenAI’s model limits and how to avoid failing API calls. 

What is your use case for the pg_tiktoken extension and AI? We would love to hear your feedback and learn about features you would like to see added to the  pg_tiktoken in future updates.

We look forward to seeing the natural language processing applications you will build with pg_tiktoken!