Test Drive: sqlite-vec
Published on .
Today I’m test driving sqlite-vec (a vector search SQLite extension that runs anywhere!) in order to search a database of company descriptions with semantic natural language and without relying on exact keyword matches. The goal is to be able to search something like “sellers of technology parts to pharmaceutical companies”
The results will be ranked by comparing the similarity of our search phrase to the descriptions both encoded as vector embeddings to enable semantic search, where words or phrases with similar meanings have vectors that are close to each other. Lower cosine distance means more similar vectors.
The process will be:
- Ingest a dataset of companies. I’ll use https://www.kaggle.com/datasets/zanderluke777/company-business-descriptions which has around 23,000 companies with descriptions.
- Create and store embeddings of the company descriptions. I’ll use
text-embedding-3-small
from OpenAI and sqlite-rembed ( A SQLite extension for generating text embeddings from remote APIs) - Search the embeddings! (e.g.
company_embeddings match rembed('text-embedding-3-small', 'sellers of technology parts to televison manufacturers')
The first thing to do is set up a sqlite database, load the company descriptions out of an Excel file, then insert into the database.
"""vector similarity search with sqlite"""
import openpyxl
import sqlite3
import sqlite_vec
def connect():
db = sqlite3.connect("companies.db")
db.enable_load_extension(True)
db.load_extension("/Users/ntaylor/src/sqlite-rembed/dist/release/rembed0.dylib")
sqlite_vec.load(db)
db.enable_load_extension(False)
return db
db = connect()
db.execute("""INSERT INTO temp.rembed_clients(name, options) VALUES ('text-embedding-3-small', 'openai');""")
db.execute("""create table if not exists companies(name text, description text)""")
db.execute("""create virtual table vec_companies using vec0(company_embeddings float[1536])""")
workbook = openpyxl.load_workbook('/Users/ntaylor/Downloads/Extended Company Descriptions 2.xlsx')
for row in workbook.active.iter_rows(values_only=True):
db.execute("""insert into companies values (?, ?)""", (row[1], row[2]))
Code language: Python (python)
Next step is generating the embeddings. I’m just doing a sample, because OpenAI’s rate limits are too low to create all 2500 embeddings in one batch. This batch of 500 will take a few minutes since calling OpenAI involves some latency and work on their side.
db.execute("""insert into vec_companies(rowid, company_embeddings)
select rowid, rembed('text-embedding-3-small', description)
from companies
limit 500""")
Code language: Python (python)
Finally, we can search! Here I’m looking for 'sellers of technology parts to televison manufacturers'
which results in Samsung, Sharp and LG. I think that’s pretty good since it contains a typo and has a meaningful rank compared to select name from companies where lower(description) like '%television%'
. The traditional solution for search, which can at least rank results like this, is full-text-search to calculate an inverted index and keep track of where keywords occur within strings.
db.execute("""with matches as (
select
rowid,
distance
from vec_companies
where company_embeddings match rembed('text-embedding-3-small', 'sellers of technology parts to televison manufacturers')
order by distance
limit 3
)
select
name,
distance
from matches
left join companies on companies.rowid = matches.rowid;""").fetchall()
Code language: Python (python)
That’s it! I may publish a serialized archive of the full embeddings to save you a step if you’re following along, and I may compare to the FTS results. Questions? Feedback? Contact nattaylor@gmail.com