Introduction to Vector Embeddings and Vector Databases
In the era of Artificial Intelligence, traditional keyword-based databases are insufficient for handling unstructured data like text, images, and audio. Let us look at vector databases and the role they play in AI systems.
1. What are Vector Embeddings?
A vector embedding represents a piece of unstructured data as a list of floating-point numbers (a vector) in a high-dimensional mathematical space.
These vectors are generated by Machine Learning models (such as OpenAI's text-embedding-3-small or open-source embedding models hosted on Hugging Face).
Capturing Semantic Meaning
The key property of embeddings is that items with similar meanings are placed close to each other in this mathematical space:
- The sentences "I love puppies" and "Dogs are my favorite animals" will have vectors positioned close to each other because their meanings are related.
- The sentences "I love puppies" and "Database tables require primary keys" will have vectors positioned far apart because their meanings are unrelated.
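One common way to measure how "close" two vectors are is cosine similarity. The following is a minimal sketch, assuming hand-made 3-dimensional vectors as stand-ins for real embeddings. The numbers are invented for illustration and are not the output of any model; real embeddings typically have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings; values are invented, not model output.
puppies = [0.9, 0.8, 0.1]        # "I love puppies"
dogs = [0.8, 0.9, 0.2]           # "Dogs are my favorite animals"
primary_keys = [0.1, 0.0, 0.9]   # "Database tables require primary keys"

sim_related = cosine_similarity(puppies, dogs)
sim_unrelated = cosine_similarity(puppies, primary_keys)

# The related pair scores higher than the unrelated pair.
print(sim_related > sim_unrelated)
```

With these toy values, the related pair scores close to 1 and the unrelated pair scores much lower, which is the behavior the two examples above describe.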
2. Keyword Search vs Semantic Search
Traditional databases use index structures to match the exact characters of a keyword:
- Keyword Search: If you search for "automobile", the database returns only rows containing that exact word. Rows that mention only "car" or "vehicle" are missed.
- Semantic Search: Translates your search query into a vector embedding, then searches the database for the rows whose vectors are closest to it. A query for "automobile" can return rows containing "car" because their vector embeddings are mathematically close.
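The difference between the two approaches can be sketched in a few lines. This is an illustrative example only: the rows, the 2-dimensional embeddings, and the query embedding are all invented, whereas a real system would obtain every vector from the same embedding model.

```python
import math

# Hypothetical rows with invented 2-dimensional embeddings
# (axis 0 roughly "vehicles", axis 1 roughly "food"); not model output.
rows = [
    {"text": "Used car for sale", "embedding": [0.95, 0.05]},
    {"text": "Automobile repair manual", "embedding": [0.90, 0.10]},
    {"text": "Fresh apple pie recipe", "embedding": [0.05, 0.95]},
]

def keyword_search(query, rows):
    """Return texts that contain the query as a case-insensitive substring."""
    return [r["text"] for r in rows if query.lower() in r["text"].lower()]

def cosine_distance(a, b):
    """Return 1 - cosine similarity, so smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

def semantic_search(query_embedding, rows, k):
    """Return the texts of the k rows whose embeddings are closest to the query."""
    ranked = sorted(rows, key=lambda r: cosine_distance(query_embedding, r["embedding"]))
    return [r["text"] for r in ranked[:k]]

# Assumed embedding for the query "automobile" (invented for this sketch).
query_embedding = [0.92, 0.08]

keyword_hits = keyword_search("automobile", rows)
semantic_hits = semantic_search(query_embedding, rows, k=2)

print(keyword_hits)
print(semantic_hits)
```

Here the keyword search finds only the row that literally contains "automobile", while the semantic search also returns the "car" row because its vector lies near the query vector.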
3. What is a Vector Database?
A vector database is a data store designed to save and query high-dimensional vectors efficiently. While dedicated vector databases (like Pinecone or Milvus) exist, relational databases can be extended to support vector operations.
For PostgreSQL, the pgvector extension adds a vector column type and similarity operators, letting your relational database double as a vector database. You can run SQL operations on both relational columns and vector columns in the same query.
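As a rough sketch of what such a combined query might look like, the Python snippet below only builds the SQL statements and parameters; it does not connect to a database. The table name `documents`, its columns, the `faq` category value, and the 3-dimensional vector size are hypothetical choices for illustration. Running the statements would require a PostgreSQL server with pgvector installed and a driver such as psycopg, whose `%s` placeholder style is assumed here.

```python
def to_vector_literal(values):
    """Format a list of numbers in pgvector's text form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"

# Hypothetical schema: relational columns and a vector column in one table.
CREATE_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
    id bigserial PRIMARY KEY,
    category text,
    body text,
    embedding vector(3)  -- dimension must match the embedding model; 3 is a toy value
);
"""

# Filter on a relational column, then rank by cosine distance (the <=> operator).
SEARCH_SQL = """
SELECT id, body
FROM documents
WHERE category = %s
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""

# Invented query embedding; a real one would come from an embedding model.
query_embedding = [0.1, 0.2, 0.3]
search_params = ("faq", to_vector_literal(query_embedding), 5)

print(SEARCH_SQL)
print(search_params)
```

The `WHERE` clause filters on an ordinary relational column, and the `ORDER BY` clause ranks the remaining rows by vector distance, so both kinds of column take part in a single statement. pgvector also provides `<->` for Euclidean distance if that metric suits the embedding model better.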