Zetuo Kunlun Klustron Vector Data Management
Since OpenAI released ChatGPT in 2022, generative AI has sparked a revolution across industries, with major companies and research institutions worldwide investing heavily in the field. As AGI applications are deployed, vector databases are set to gain widespread use. The current landscape includes dedicated vector databases such as Milvus and Pinecone, as well as relational-database extensions such as PostgreSQL's PGVector and PGEmbedding.
PostgreSQL offers vector database functionality through the pgvector plugin. Starting with version 1.3, Zetuo Kunlun Klustron, a distributed database, supports vector data management by integrating PGVector and enhances its capabilities by supporting PostgresML for in-database machine learning. Zetuo Kunlun Klustron Database V1.3 is scheduled for release in late March. Stay tuned.
This article provides an overview of Klustron's vector data management and demonstrates its basic usage through examples.
01 Introduction to Zetuo Kunlun Klustron's Integrated Data Management Capabilities
Rooted in the MySQL and PostgreSQL open-source ecosystems, Zetuo Kunlun Klustron's distributed database seamlessly integrates and aligns with these communities. It not only supports a wide range of applications from both MySQL and PostgreSQL but also embraces third-party functional extensions from their open-source communities. This compatibility positions Klustron as a multi-model AI data foundation capable of supporting a diverse array of application scenarios.
Klustron's fusion capabilities, powered by PostgreSQL's community components, facilitate:
- Multi-Model Data Integration: Accommodates complex data types like JSON, GIS, and vectors, in addition to the relational data model.
- Diverse Data Source Integration: Manages data from within the cluster, external databases, and external object storage.
- Open-Source Ecosystem Extensions Integration: Incorporates tools such as PostGIS, PostgresML, PGVector, and beyond.
- Custom Algorithm Integration: Supports multi-language (Python, Java, Perl, Lua, JavaScript, PL/SQL) stored procedures.
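As a small illustration of the last point, a custom algorithm can be packaged as a stored function and executed inside the database. The following PL/pgSQL sketch is illustrative only (the function name and logic are not part of Klustron's API); it computes the L2 norm of a float array:
-- Illustrative sketch: a user-defined function running inside the database
CREATE OR REPLACE FUNCTION l2_norm(v float8[]) RETURNS float8 AS $$
DECLARE
  s float8 := 0;
  x float8;
BEGIN
  FOREACH x IN ARRAY v LOOP
    s := s + x * x;
  END LOOP;
  RETURN sqrt(s);
END;
$$ LANGUAGE plpgsql;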
Tailored for machine learning, in-database computation, and vector data management needs, Klustron stands as a robust AI data foundation. It empowers users to utilize extensive server computing resources (CPU, memory, GPU) within the database cluster, significantly enhancing efficiency while reducing costs.
Klustron extends beyond PostgreSQL’s single-server hardware limitations, leveraging vast server resources for data management and query analysis functionalities.
Furthermore, Klustron's multi-level parallel query processing significantly boosts the performance of OLAP analysis tasks, stored procedures, and machine learning computations—a leap beyond the capabilities of standalone PostgreSQL systems.
02 Vector Data Management Requirements Analysis
From a technical viewpoint, current large-model embeddings and image vectorization produce vectors with hundreds to thousands of dimensions. At 8 bytes per floating-point dimension, each vector occupies roughly 1 KB to 16 KB; a 1,536-dimension embedding, for example, takes about 12 KB. For many practical business systems, the vector data to be managed scales from terabytes to petabytes or more.
The most common query over vector data is a similarity (distance) query, such as identifying the 100 images most similar to a target image out of 100 million. Technically, this means finding the 100 vectors closest to a target vector among 100 million vectors. Though 100 million images may seem like a lot, in a business scenario involving street-camera captures of vehicles and pedestrians, the total daily captures from all cameras in any major city in China could reach or even exceed 100 million images.
Given the high dimensions and usually large volume of vectors, similarity (distance) computation requires substantial computing resources. The storage scale and computational demands of vector data quickly reach the limits of single-server resources, naturally calling for distributed database systems for management.
In practical business scenarios, various data types are often needed concurrently. Beyond common types like numeric, datetime, and string, spatial location (GIS), JSON, and vector types are also used. For example, in the scenario above, each camera's ID and the image capture time must be stored alongside the image vectors. To narrow the search scope, filtering criteria such as the capture time range and camera attributes (geographic area, street, administrative region, and technical features) are commonly added. Actual queries therefore involve table joins: the camera information table (cameras) and the image vector table (img_vecs) are joined after filtering for applicable cameras and image vectors, as sketched below.
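A query of this shape might look like the following sketch. The schemas of cameras and img_vecs, the column names, and the target image ID are all assumptions for illustration:
-- Hypothetical schema: cameras(camera_id, district, ...),
-- img_vecs(img_id, camera_id, capture_time, embedding vector(512))
SELECT v.img_id,
       v.embedding <=> t.embedding AS cosine_distance
FROM img_vecs v
JOIN cameras c ON v.camera_id = c.camera_id,
     (SELECT embedding FROM img_vecs WHERE img_id = 12345) AS t
WHERE c.district = 'Example District'
  AND v.capture_time >= '2024-03-01' AND v.capture_time < '2024-03-02'
ORDER BY cosine_distance
LIMIT 100;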
Managing vector data within a relational database system simplifies development for application systems. Using separate vector and relational databases would necessitate implementing cross-database table joins at the application layer, significantly increasing development and maintenance work. Moreover, logical errors due to inconsistent cross-database queries are difficult to solve cost-effectively and universally.
Zetuo Kunlun Klustron's distributed database, by supporting PGVector integration on compute nodes, incorporates PGVector's vector data management capabilities. Users can add columns for complex data types like vectors, GIS, and JSON to tables that already include various scalar data type columns. They can access vectors, GIS, JSON, and scalar data types within the same transaction. In Klustron, vector data is stored in Klustron-storage nodes, not within compute nodes. In the Klustron 1.3 version, vector data is stored as JSON data in storage nodes.
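For example, complex-type columns can be added to an existing table alongside its scalar columns (the table and column names below are illustrative, and pgvector is assumed to be installed):
-- Illustrative only: extend an existing table with JSON and vector columns
ALTER TABLE cameras ADD COLUMN specs jsonb;
ALTER TABLE cameras ADD COLUMN feature vector(512);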
03 Challenges with Vector Indexing
In the previously mentioned scenario, where the image vector table img_vecs is filtered and joined with the camera information table cameras based on camera ID, the resulting dataset includes a vector column. However, sorting this vector column by relevance to a target vector cannot directly use the vector index img_vecs_idx of the img_vecs table: some vectors covered by img_vecs_idx may be filtered out by the query conditions, and the vectors returned by the index could include some that do not meet the other query criteria. Therefore, classic vector indexing algorithms like HNSW require special modifications to fit the flexible query patterns of relational databases.
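For reference, stock pgvector declares an approximate-nearest-neighbor index as follows (shown only to illustrate the syntax; as explained below, Klustron 1.3.1 does not yet provide vector indexes):
CREATE INDEX img_vecs_idx ON img_vecs USING hnsw (embedding vector_cosine_ops);
Such an index accelerates single-table ORDER BY embedding <=> <target> LIMIT k queries, but as described above, it cannot be applied directly once filters and joins are involved.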
Another challenge lies in how OLTP relational databases, including PGVector, typically implement indexes: every field of a row is stored in the main row, and each field covered by an index is stored again in the corresponding index row. A field used in N indexes is therefore stored N+1 times. Given the considerable size of vector fields, the space overhead of this duplicate storage is significant.
Given these challenges, Klustron 1.3.1 does not implement vector indexing; we anticipate introducing a vector indexing feature that addresses them in Klustron 1.3.2. Meanwhile, the Klustron cluster's multi-level parallel query processing can partially mitigate the performance cost of vector similarity sorting over large datasets. Ideally, a single query can use all CPUs across all servers in the cluster: vector similarity sorting executes in a distributed fashion on all storage nodes, and the compute nodes merge the sorted streams to complete the final global sort. The compute nodes in Klustron 1.3 perform this automatically.
04 Zetuo Kunlun Klustron Vector Data Management: Usage Examples
4.1 Preparation
This guide covers two experimental scenarios. The first requires only a Klustron instance, while the second (a Q&A knowledge base) necessitates an accessible OpenAI key and appropriate network settings for accessing OpenAI's API services. Ensure Python 3 and related libraries are installed for the Python code to run smoothly.
4.1.1 Klustron Installation and Configuration
Klustron Environment Details:
- XPanel: http://192.168.0.152:40180/KunlunXPanel/#/cluster
- Compute Node: 192.168.0.152, Port: 47001
- Storage Node (shard1): 192.168.0.153, Port: 57003 (Primary)
- Storage Node (shard2): 192.168.0.155, Port: 57005 (Primary)
- Klustron is installed under the kl user.
4.1.2 Confirming pgvector Configuration
Open a shell session, connect to a Klustron database compute node, and log into the Klustron database as shown:
kl@kunlun1:~$ source /data/kl/env.sh
kl@kunlun1:~$ psql -h 192.168.0.152 -p 47001 postgres
postgres=# CREATE EXTENSION vector WITH SCHEMA public ;
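You can confirm that the extension is installed by querying the system catalog:
postgres=# SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';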
4.2 Basic Experiments
4.2.1 A Simple Usage Example of pgvector:
Create a table (items) to store vector types for embeddings:
postgres=# CREATE TABLE items (
id bigserial PRIMARY KEY,
item text,
embedding vector(2)
);
Insert vector data into the table:
postgres=# INSERT INTO
items (item, embedding)
VALUES ('Apple', '[1, 1]'), ('Banana', '[1.2, 0.8]'), ('Cat', '[6, 0.4]');
Query the recently inserted data:
postgres=# select * from items ;
Example results (the three rows just inserted):
 id |  item  | embedding
----+--------+-----------
  1 | Apple  | [1,1]
  2 | Banana | [1.2,0.8]
  3 | Cat    | [6,0.4]
(3 rows)
Calculate the cosine similarity between the cat and the apple and banana using the cosine distance operator <=> (cosine similarity = 1 - cosine distance):
postgres=# SELECT item, 1 - (embedding <=> '[6,0]') AS cosine_similarity
FROM items ORDER BY cosine_similarity DESC;
Example results:
In the results (the values follow from the inserted vectors):
- The cat scores about 0.998, a nearly perfect match;
- The banana scores about 0.83, showing some similarity to the cat;
- The apple scores about 0.71, the lowest similarity to the cat.
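Besides the cosine distance operator <=>, pgvector also provides <-> for Euclidean (L2) distance and <#> for negative inner product. For example, to find the two items nearest to the target vector by Euclidean distance:
postgres=# SELECT item, embedding <-> '[6,0]' AS euclidean_distance
FROM items ORDER BY euclidean_distance LIMIT 2;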
4.2.2 Q&A Knowledge Base Scenario with pgvector
4.2.2.1 Environment Requirements
This experiment requires an OpenAI key. Ensure you have a valid API key and appropriate network access to OpenAI services, or you can set up an LLM environment without network restrictions. Note that the test code may need adjustments.
4.2.2.2 Data Preparation
This experiment uses text content from an online document, splitting and storing it in the Klustron database. The key phase of data preparation involves transforming domain-specific knowledge into text embeddings and efficiently storing and matching this information. Utilizing LLM's semantic understanding capabilities allows for obtaining answers and suggestions relevant to specific domains.
First, install the following dependencies:
root@kunlun1:/home/kl# pip3 install openai==0.28 psycopg2 tiktoken requests beautifulsoup4 numpy
Create a test table (rds_pg_help_docs in this case) for storing knowledge base content:
kl@kunlun1:~$ source /data/kl/env.sh
kl@kunlun1:~$ psql -h 192.168.0.152 -p 47001 postgres
postgres=# CREATE TABLE rds_pg_help_docs (
id bigserial PRIMARY KEY,
title text, -- Document title
description text, -- Description
doc_chunk text, -- Document chunk
token_size int, -- Token count of document chunk
embedding vector(1536)); -- Text embedding (text-embedding-ada-002 returns 1536 dimensions)
Create a .py file (knowledge_chunk_storage.py as an example) to split and store knowledge base document content in the Klustron database. Example code:
knowledge_chunk_storage.py
---------------------------------------------
import openai
import psycopg2
import tiktoken
import requests
from bs4 import BeautifulSoup

EMBEDDING_MODEL = "text-embedding-ada-002"
tokenizer = tiktoken.get_encoding("cl100k_base")

# Connect to Klustron database
conn = psycopg2.connect(database="postgres",
                        host="192.168.0.152",
                        user="abc",
                        password="abc",
                        port="47001")
conn.autocommit = True

# OpenAI API Key
openai.api_key = '<Secret API Key>'

# Custom splitting method (for demonstration only)
def get_text_chunks(text, max_chunk_size):
    chunks_ = []
    soup_ = BeautifulSoup(text, 'html.parser')
    content = ''.join(soup_.strings).strip()
    length = len(content)
    start = 0
    while start < length:
        end = start + max_chunk_size
        if end >= length:
            end = length
        chunk_ = content[start:end]
        chunks_.append(chunk_)
        start = end
    return chunks_

# Specify the webpage to split
url = 'https://help.aliyun.com/document_detail/148038.html'
response = requests.get(url)
if response.status_code == 200:
    # Fetch webpage content
    web_html_data = response.text
    soup = BeautifulSoup(web_html_data, 'html.parser')
    # Extract title (H1 tag)
    title = soup.find('h1').text.strip()
    # Extract description (content of p tag with class='shortdesc')
    description = soup.find('p', class_='shortdesc').text.strip()
    # Split and store
    chunks = get_text_chunks(web_html_data, 500)
    for chunk in chunks:
        doc_item = {
            'title': title,
            'description': description,
            'doc_chunk': chunk,
            'token_size': len(tokenizer.encode(chunk))
        }
        query_embedding_response = openai.Embedding.create(
            model=EMBEDDING_MODEL,
            input=chunk,
        )
        doc_item['embedding'] = query_embedding_response['data'][0]['embedding']
        cur = conn.cursor()
        insert_query = '''
        INSERT INTO rds_pg_help_docs
        (title, description, doc_chunk, token_size, embedding) VALUES (%s, %s, %s, %s, %s);
        '''
        cur.execute(insert_query, (
            doc_item['title'], doc_item['description'], doc_item['doc_chunk'], doc_item['token_size'],
            doc_item['embedding']))
        conn.commit()
else:
    print('Failed to fetch web page')
Notes:
- Replace openai.api_key = '<Secret API Key>' with your OpenAI key.
- Update the psycopg2.connect(…) parameters to connect to your Klustron database environment.
Run the Python program:
kl@kunlun1:~$ export all_proxy="socks5h://127.0.0.1:1080" # Example local proxy configuration
kl@kunlun1:~$ python3 knowledge_chunk_storage.py
Execution example:
Enter the database and run the following command to check if the knowledge base document content has been split and stored as vector data.
kl@kunlun1:~$ psql -h 192.168.0.152 -p 47001 postgres
postgres=# SELECT id,token_size,SUBSTRING(title FROM -1 FOR 30), SUBSTRING(description FROM -1 FOR 30), SUBSTRING(doc_chunk FROM -1 FOR 30),SUBSTRING(embedding FROM -1 FOR 50) from rds_pg_help_docs;
Example query results:
Note: Because a full row is too long to display, the query above uses SUBSTRING to show only a portion of each field.
4.2.2.3 Q&A Testing
In the Python project, create a .py file (chatbot.py as an example), create questions, and compare their similarity with the content in the database to return results.
chatbot.py
------------------------------------------
import openai
import psycopg2
from psycopg2.extras import DictCursor

GPT_MODEL = "gpt-3.5-turbo"
EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_COMPLETIONS_MODEL = "text-davinci-003"
MAX_TOKENS = 1024

# OpenAI API Key
openai.api_key = '<Secret API Key>'

# Embed the question text
prompt = 'How to create an RDS PostgreSQL instance'
prompt_response = openai.Embedding.create(
    model=EMBEDDING_MODEL,
    input=prompt,
)
prompt_embedding = prompt_response['data'][0]['embedding']

# Connect to Klustron database
conn = psycopg2.connect(database="postgres",
                        host="192.168.0.152",
                        user="abc",
                        password="abc",
                        port="47001")
conn.autocommit = True

def answer(prompt_doc, prompt):
    improved_prompt = f"""
    Follow the document and steps below to answer the following question:
    (1) First, analyze the content in the document to see if it's relevant to the question.
    (2) Next, responses should only use content from the document, be as detailed as possible, and output in markdown format.
    (3) Finally, if the question is not related to RDS PostgreSQL, reply with "I am not very familiar with topics outside of RDS PostgreSQL."
    Document:
    \"\"\"
    {prompt_doc}
    \"\"\"
    Question: {prompt}
    """
    response = openai.Completion.create(
        model=GPT_COMPLETIONS_MODEL,
        prompt=improved_prompt,
        temperature=0.2,
        max_tokens=MAX_TOKENS
    )
    print(f"{response['choices'][0]['text']}\n")

similarity_threshold = 0.78
max_matched_doc_counts = 8

# Use pgvector to filter documents with a similarity above a certain threshold
similarity_search_sql = f'''
SELECT doc_chunk, token_size, 1 - (embedding <=> '{prompt_embedding}') AS similarity
FROM rds_pg_help_docs WHERE 1 - (embedding <=> '{prompt_embedding}') > {similarity_threshold} ORDER BY id LIMIT {max_matched_doc_counts};
'''
cur = conn.cursor(cursor_factory=DictCursor)
cur.execute(similarity_search_sql)
matched_docs = cur.fetchall()

total_tokens = 0
prompt_doc = ''
print('Answer: \n')

# Concatenate matched chunks into the prompt, answering in batches of up to ~1000 tokens
for matched_doc in matched_docs:
    if total_tokens + matched_doc['token_size'] <= 1000:
        prompt_doc += f"\n---\n{matched_doc['doc_chunk']}"
        total_tokens += matched_doc['token_size']
        continue
    answer(prompt_doc, prompt)
    total_tokens = 0
    prompt_doc = ''

answer(prompt_doc, prompt)
Notes:
- Replace openai.api_key = '<Secret API Key>' with your OpenAI key.
- Update the psycopg2.connect(…) parameters to connect to your Klustron database environment.
Run the Python program:
kl@kunlun1:~$ python3 chatbot.py
In the execution window, you'll see an answer similar to the following:
Note: You can optimize the splitting method and question prompt for more accurate and comprehensive answers. This is just a demonstration.
The Zetuo product development team also provides another interesting scenario of using pgvector for text-to-image search.
- Source code link: https://github.com/zettadb/techtalk/blob/main/pgvector%E7%9A%84%E5%BA%94%E7%94%A8%E5%92%8C%E5%8E%9F%E7%90%86/clip.ipynb
- For a description of this demo, refer to: https://doc.klustron.com/zh/Application_and_principle_of_pgvector.html#-1