Vector Databases and Weaviate: Building Semantic Search Applications

Introduction

In the ever-evolving landscape of data storage and retrieval systems, vector databases have emerged as a revolutionary technology that bridges the gap between traditional databases and modern AI applications. Unlike conventional databases that excel at exact matches and structured queries, vector databases are designed to understand the meaning and context behind data, enabling similarity-based search capabilities that power next-generation applications.

In this article, we’ll explore:

  • The fundamental concepts behind vector databases and how they differ from traditional systems
  • The architecture and key components of vector databases
  • An in-depth look at Weaviate, one of the leading open-source vector database platforms
  • How to implement powerful semantic search applications with Weaviate
  • Best practices, optimization strategies, and real-world use cases

Whether you’re new to vector databases or looking to enhance your existing knowledge, this guide will provide the theoretical foundation you need to understand and leverage this powerful technology.

1. Understanding Vector Databases: The Fundamentals

What Are Vector Databases?

A vector database is a specialized database system designed to store, manage, and query high-dimensional vector representations of data. These vectors (also called embeddings) capture the semantic meaning of items such as text, images, audio, or any other data that can be represented in a numerical format.

The core idea behind vector databases is simple yet powerful: by converting data into vectors that preserve semantic relationships, we can perform similarity searches based on meaning rather than exact matches. This enables a wide range of applications that were previously difficult or impossible with traditional database systems.

Vector Embeddings: The Building Blocks

At the heart of vector databases are embeddings—numerical representations of data in a high-dimensional space. Here’s how they work:

  1. Embedding Generation: Data (such as text, images, or audio) is processed through an embedding model that converts it into a vector of floating-point numbers.
  2. Dimensional Representation: These vectors typically have hundreds or thousands of dimensions, with each dimension capturing some aspect of the data’s semantic properties.
  3. Semantic Proximity: Items with similar meanings or characteristics are positioned close to each other in this high-dimensional space.

For example, in a text embedding space, the words “happy” and “joyful” would be positioned closer to each other than “happy” and “automobile” because they share more semantic similarity.

Illustration of text embeddings in vector space

Vector Databases vs. Traditional Databases

To appreciate the value of vector databases, it’s important to understand how they differ from traditional database systems:

FeatureTraditional DatabasesVector Databases
Data StorageStructured records, tables, documentsVector embeddings + metadata
Query TypeExact matches, range queries, joinsSimilarity search, nearest neighbors
Search Paradigm“Find exact records matching criteria”“Find similar items to this reference”
IndexingB-trees, hash indexes, inverted indexesApproximate Nearest Neighbor (ANN) indexes
Use CasesTransactional data, structured reportingSemantic search, recommendations, classification
Query ComplexitySQL, complex query languagesVector similarity + optional filtering

The fundamental difference lies in the approach to understanding data: traditional databases treat data as literal values, while vector databases understand the semantic relationships between data points.

2. The Architecture of Vector Databases

Core Components of a Vector Database

Modern vector databases consist of several key components working together:

  1. Vector Storage Layer: Responsible for efficiently storing high-dimensional vector data
  2. Indexing Mechanism: Enables fast retrieval of similar vectors
  3. Metadata Storage: Stores additional information about each vector (original text, categories, timestamps, etc.)
  4. Query Processing Engine: Handles vector similarity searches and optional filtering
  5. API Layer: Provides interfaces for data insertion and querying

These components work together to deliver the core functionality of storing and retrieving vectors based on similarity.

Approximate Nearest Neighbor (ANN) Algorithms

One of the biggest challenges in vector databases is efficiently finding similar vectors in high-dimensional spaces. Exact nearest-neighbor searches become computationally prohibitive as the dataset grows. This is where Approximate Nearest Neighbor (ANN) algorithms come in.

Popular ANN algorithms used in vector databases include:

  • HNSW (Hierarchical Navigable Small World): Creates a multi-layer graph structure for efficient navigation
  • IVF (Inverted File Index): Partitions the vector space into clusters for faster search
  • PQ (Product Quantization): Compresses vectors while preserving similarity relationships
  • FAISS (Facebook AI Similarity Search): Combines multiple techniques for scalable similarity search

These algorithms make tradeoffs between search accuracy and performance, typically allowing for sub-millisecond query times even with millions of vectors.

Vector Database Architecture Models

Vector databases can be implemented with different architectural approaches:

  1. Standalone Vector Databases: Purpose-built systems designed exclusively for vector storage and retrieval (e.g., Weaviate, Pinecone)
  2. Vector Extensions to Traditional Databases: Adding vector capabilities to existing database systems (e.g., PostgreSQL with pgvector)
  3. Hybrid Architectures: Combining vector search with traditional search capabilities

Each approach has its advantages, with standalone vector databases typically offering the best performance for pure vector-based workloads.

3. Weaviate: An Open-Source Vector Database

What is Weaviate?

Weaviate is a powerful open-source vector database that combines the capabilities of a vector search engine with the features of a traditional database system. It’s designed to provide semantic search capabilities with the reliability and scalability expected in production environments.

Key features of Weaviate include:

  • GraphQL API: Intuitive interface for queries and mutations
  • RESTful API: Alternative interface for standard HTTP operations
  • Hybrid Search: Combining vector search with keyword-based retrieval
  • Modular Architecture: Pluggable vectorizer modules for different embedding models
  • Schema Definition: Flexible data modeling capabilities
  • Cross-References: Ability to link objects within the database
  • Classification Capabilities: Built-in machine learning for data classification

Weaviate Architecture

Weaviate’s architecture consists of several core components:

  1. Vector Index: The HNSW-based vector index for similarity search
  2. Object Store: Storage for both vector data and associated metadata
  3. GraphQL Interface: The primary API for interacting with Weaviate
  4. Vectorization Modules: Pluggable components for generating embeddings
  5. Query Planner: Optimizes and executes search queries

This architecture allows Weaviate to deliver high-performance vector search while maintaining a user-friendly interface.

Data Modeling in Weaviate

Weaviate uses a flexible schema-based approach to data modeling. Each data object in Weaviate:

  1. Belongs to a class (similar to a table in relational databases)
  2. Has properties (attributes that describe the object)
  3. Has an associated vector embedding (automatically generated or imported)
  4. Can have cross-references to other objects

A simple Weaviate schema for a collection of articles might look like this:

{
  "classes": [
    {
      "class": "Article",
      "description": "A news or blog article",
      "properties": [
        {
          "name": "title",
          "dataType": ["text"],
          "description": "The title of the article"
        },
        {
          "name": "content",
          "dataType": ["text"],
          "description": "The main content of the article"
        },
        {
          "name": "publishedDate",
          "dataType": ["date"],
          "description": "When the article was published"
        },
        {
          "name": "author",
          "dataType": ["Author"],
          "description": "The author of the article"
        }
      ]
    },
    {
      "class": "Author",
      "description": "A person who writes articles",
      "properties": [
        {
          "name": "name",
          "dataType": ["text"],
          "description": "The author's full name"
        },
        {
          "name": "biography",
          "dataType": ["text"],
          "description": "The author's biographical information"
        }
      ]
    }
  ]
}

This schema-based approach makes Weaviate approachable for developers familiar with traditional database systems.

4. Key Capabilities of Vector Databases

The most prominent capability of vector databases is semantic search—finding results based on meaning rather than keywords. This includes:

  1. Conceptual Matching: Finding results that match the concept of a query, even if they use different terminology
  2. Contextual Understanding: Recognizing the intent behind queries and matching accordingly
  3. Multilingual Support: Finding relevant content across different languages based on semantic similarity

Semantic search significantly improves search quality over traditional keyword-based approaches, especially for complex or nuanced queries.

While vector search excels at capturing semantic relationships, it may miss exact keyword matches that are relevant. Hybrid search combines the best of both worlds:

  1. Vector Search: Captures semantic similarity and conceptual matching
  2. Keyword Search: Ensures exact term matches are not missed
  3. Fusion Methods: Intelligently combines results from both approaches

Hybrid search typically delivers better overall search quality than either approach alone, especially for general-purpose search applications.

Advanced vector databases support multimodal search—the ability to search across different types of media using a single query:

  1. Cross-Modal Search: Using one modality to search another (e.g., text-to-image search)
  2. Unified Embeddings: Representing different data types in a shared vector space
  3. Multimodal Queries: Combining multiple modalities in a single query

This capability enables powerful applications such as searching for images using text descriptions, or finding text content related to an image.

Classification and Recommendation

Beyond search, vector databases enable other important capabilities:

  1. Automatic Classification: Categorizing new items based on similarity to existing categorized items
  2. Recommendation Systems: Suggesting related items based on vector similarity
  3. Anomaly Detection: Identifying outliers in vector space

These capabilities make vector databases valuable not just for search applications, but for a wide range of AI-enhanced systems.

5. Advantages of Vector Databases

Enhanced Search Quality

Vector databases offer significant improvements in search quality:

  1. Understanding Intent: Capturing the meaning behind queries rather than just matching keywords
  2. Handling Synonyms: Recognizing different terms with the same meaning
  3. Concept Matching: Finding results related to the same concept even with different terminology
  4. Context Sensitivity: Understanding the nuance and context of queries

These improvements translate to better user experiences and more effective information retrieval.

Scalability and Performance

Modern vector databases are designed for production-grade performance:

  1. Sub-Second Query Times: Even with millions of vectors
  2. Horizontal Scaling: Distributing workloads across multiple nodes
  3. Efficient Storage: Optimized for high-dimensional vector data
  4. Low Latency: For real-time applications

This scalability makes vector databases suitable for everything from small applications to enterprise-scale deployments.

Integration with AI Workflows

Vector databases fit seamlessly into modern AI workflows:

  1. LLM Integration: Enhancing large language models with retrieval capabilities
  2. Embedding Pipeline Integration: Working directly with embedding models
  3. Data Processing Support: Handling continuous updates and new data
  4. API-First Design: Easy integration with existing systems

This integration potential makes vector databases a key component in the AI infrastructure stack.

6. Common Use Cases for Vector Databases

Vector databases excel at finding relevant documents based on meaning:

  • Enterprise Knowledge Bases: Helping employees find relevant information
  • Legal Document Search: Finding case law and precedents based on concepts
  • Research Databases: Discovering relevant papers across different terminology
  • Content Management: Organizing and retrieving content based on themes

Recommendation Systems

Vector similarity powers effective recommendation systems:

  • Product Recommendations: “Customers who viewed this also viewed…”
  • Content Discovery: Finding articles, videos, or music with similar themes
  • Person Matching: Dating apps, professional networking, team formation
  • Educational Resources: Suggesting learning materials on similar topics

Beyond text, vector databases handle multimedia content effectively:

  • Reverse Image Search: Finding similar images to a reference
  • Content-Based Media Retrieval: Finding videos or audio with similar content
  • Digital Asset Management: Organizing and retrieving media based on content
  • Visual Product Search: Finding products that look similar to a reference

RAG (Retrieval Augmented Generation)

Vector databases are a key component in RAG systems:

  • Query-Based Document Retrieval: Finding relevant documents for an LLM
  • Knowledge Grounding: Providing factual information to ground LLM responses
  • Domain Adaptation: Allowing general LLMs to access domain-specific knowledge
  • Fact Verification: Retrieving information to verify LLM-generated content

7. Implementing Weaviate in Applications

Setting Up a Weaviate Server

The first step in working with Weaviate is setting up a server. This can be done in several ways:

  1. Using Docker (most common method)
  2. Cloud-hosted options
  3. Building from source

In the implementation section, I’ll provide code examples for setting up a Weaviate server using Docker.

Defining Your Data Schema

Before importing data, you need to define your schema. This involves:

  1. Defining classes (collections of similar objects)
  2. Specifying properties for each class
  3. Setting up cross-references between classes
  4. Configuring vectorizer modules

The implementation section will include code for defining and creating a schema in Weaviate.

Importing Data into Weaviate

Once your schema is defined, you can import data. This typically involves:

  1. Preparing your data in the correct format
  2. Batching data for efficient imports
  3. Handling errors and validation
  4. Monitoring the import process

I’ll provide code examples for efficient data importing in the implementation section.

Hybrid search combines vector search with traditional keyword search. Implementation involves:

  1. Configuring both search types
  2. Setting appropriate weights for each
  3. Tuning parameters for optimal results
  4. Handling and processing results

The implementation code will demonstrate how to set up and execute hybrid search queries.

8. Best Practices and Optimization Strategies

Schema Design Best Practices

Effective schema design is crucial for Weaviate performance:

  1. Appropriate Property Types: Using the right data types for each property
  2. Vectorization Configuration: Choosing the right vectorizer for your data
  3. Cross-Reference Planning: Designing efficient object relationships
  4. Indexing Strategy: Configuring HNSW parameters for your use case

Data Import Optimization

Efficient data importing ensures faster time-to-value:

  1. Batch Processing: Grouping objects for more efficient imports
  2. Parallel Processing: Utilizing multiple threads for faster imports
  3. Incremental Updates: Strategies for updating existing data
  4. Error Handling: Gracefully managing import failures

Query Optimization Techniques

Optimizing queries improves both performance and result quality:

  1. Vector Search Parameters: Tuning distance metrics and k values
  2. Hybrid Search Configuration: Balancing vector and keyword components
  3. Filter Optimization: Efficient use of metadata filters
  4. Result Processing: Handling and transforming search results

Scaling and Production Considerations

For production deployments, consider these best practices:

  1. Resource Allocation: CPU, memory, and storage requirements
  2. Backup Strategies: Ensuring data durability
  3. Monitoring and Logging: Tracking performance and errors
  4. High Availability: Configuring for redundancy and failover

9. Challenges and Limitations

Vector Database Challenges

While powerful, vector databases come with certain challenges:

  1. Cold Start Problem: Initial setup without existing embeddings
  2. Embedding Drift: Changes in embedding models over time
  3. Explainability: Understanding why certain results are returned
  4. Resource Intensity: Higher computational requirements than traditional databases

Addressing Common Pitfalls

To avoid common issues with vector databases:

  1. Test Search Quality: Regularly evaluate search results with real queries
  2. Monitor Performance: Track query times and resource utilization
  3. Update Embeddings: Regenerate embeddings when models change
  4. Tune Incrementally: Make small changes and measure impact

Implementation: Development with Weaviate

Following the steps at https://weaviate.io/developers/weaviate/quickstart, we create an endpoint and API key for ourselves.

We start a project locally or from a GitHub repository and run the following commands in sequence.

npm init -y
npm install axios openai weaviate-client

After creating our project and installing the necessary components, we use the following script to convert sample data from https://dummyjson.com/products into embeddings.

import axios from 'axios';
import { writeFileSync } from 'fs';
import { OpenAI } from 'openai';
import fs from 'fs';

// Initialize OpenAI API client
const openai = new OpenAI({
  apiKey: '', // Add your OpenAI API key here
});

/**
 * Converts product title to URL-friendly format
 * @param {string} title - Product title
 * @returns {string} - Formatted URI string
 */
function formatProductUri(title) {
  if (!title) return '';
  return title.toLowerCase().replace(/\s+/g, '-');
}

/**
 * Processes product data and generates embeddings
 * @param {Object} product - Raw product data
 * @returns {Object} - Formatted product with vector embedding
 */
async function formatProductData(product) {
  try {
    // Combine product data as string for embedding
    const productString = JSON.stringify(product);

    // Generate embedding using OpenAI
    const embeddingResponse = await openai.embeddings.create({
      model: 'text-embedding-ada-002',
      input: productString,
    });

    const embedding = embeddingResponse.data[0].embedding;

    // Return product in the required format
    return {
      data: product,
      id: formatProductUri(product?.title) + '-' + product?.id,
      vectorContent: embedding,
    };
  } catch (error) {
    console.error(`Error processing product: ${product.title}`, error);
    throw error;
  }
}

/**
 * Main function to fetch products from API and process them
 * Generates vector embeddings for each product
 */
async function fetchAndProcessProducts() {
  try {
    console.log('Fetching products...');
    const response = await axios.get('https://dummyjson.com/products?limit=200');
    const products = response.data.products;
    fs.writeFileSync('products.json', JSON.stringify(products, null, 2));

    console.log(`Fetched ${products.length} products. Processing...`);

    // Process all products
    const processedProducts = [];
    let index = 0;

    // Process each product sequentially to avoid rate limits
    for (const product of products) {
      index++;
      console.log(`Processing product ${index}/${products.length}: ${product.title}`);

      try {
        const formattedProduct = await formatProductData(product);
        processedProducts.push(formattedProduct);

        // Add delay every 10 products to avoid OpenAI rate limits
        if (index % 10 === 0) {
          console.log('Waiting 1 second for rate limit...');
          await new Promise((resolve) => setTimeout(resolve, 1000));
        }
      } catch (error) {
        console.error(`Error processing product: ${product.title}`, error);
        // Continue despite errors
        continue;
      }
    }

    // Save results to JSON file
    writeFileSync('processed_products.json', JSON.stringify(processedProducts, null, 2));
    console.log("Processing complete. Results saved to 'processed_products.json'");
  } catch (error) {
    console.error('Error occurred:', error);
  }
}

// Start the process
fetchAndProcessProducts();

This script will give us a JSON file named ‘processed_products.json’ that contains both the data itself and its embedded version. Later, we import this data into the Weaviate cloud environment using the script below. Unlike what was explained above, I’m not using the schema approach here as I’m trying to explain the concept and help you understand the logic behind it.

/**
 * Import script for loading product data into Weaviate vector database
 * This script reads product data from a JSON file and imports it into a Weaviate collection
 */
import weaviate, { generateUuid5 } from 'weaviate-client';
import { readFileSync } from 'fs';

// Load product data from JSON file
const products = JSON.parse(readFileSync('./processed_products.json', 'utf8'));

// Weaviate cloud service connection details
const WCD_URL = ''; // Add your WCD URL here
const WCD_API_KEY = ''; // Add your WCD API key here

const wcdUrl = WCD_URL;
const wcdApiKey = WCD_API_KEY;

// Validate connection details are available
if (!wcdUrl || !wcdApiKey) {
  throw new Error('Missing required environment variables WCD_URL or WCD_API_KEY');
}

// Initialize Weaviate client
const client = await weaviate.connectToWeaviateCloud(wcdUrl, {
  authCredentials: new weaviate.ApiKey(wcdApiKey),
});

/**
 * Main function to import product data into Weaviate
 */
async function main() {
  try {
    console.log('Starting data import to Weaviate...');
    console.log('products recieved', products.length);

    // Transform product data into Weaviate objects
    const dataObjects = products.map((item) => ({
      properties: {
        data: JSON.stringify(item.data),
      },
      // Generate a deterministic UUID based on collection name and item ID
      id: generateUuid5('Ecommerce', item.id),
      // Use vector if available, otherwise let Weaviate generate it
      vector: Array.isArray(item.vectorContent) ? item.vectorContent : undefined,
    }));
    console.log('dataObjects received', dataObjects.length);

    // Get reference to the Ecommerce collection
    const collection = await client.collections.get('Ecommerce');
    // Insert all objects in a batch operation
    const result = await collection.data.insertMany(dataObjects);
    console.log('Insertion response: ', result);

    console.log('Data import completed');
  } catch (error) {
    console.error('Error:', error);
  }
}

// Run the main function
main().catch(console.error);

With this script, we upload the created JSON file to the Weaviate system. There are two important aspects here:

  1. If your data contains its own ID value, you can specify it separately in the properties section or define it with a different property name because Weaviate has its own Id parameter which must be in UUID format, so we cannot use the data’s id field directly there. However, we can use Weaviate’s own UUID generator as shown in the example.
  2. The other field is the collection name. This actually represents the class name. This collection name is used in all operations such as searching, deleting data, etc.

After importing our data, we move on to searching the data. Above, I mentioned various search types both specific to Weaviate and vector databases in general. In the script below, I used Weaviate’s GraphQL feature and hybrid search capability.

// Import required libraries
// axios for making HTTP requests
import axios from 'axios';
// OpenAI SDK for generating embeddings
import OpenAI from 'openai';

// Initialize OpenAI client with API key
const openai = new OpenAI({
  apiKey: '', //Add your openAI API key
});

// Array of questions to process
const questions = ['do you have macbook?'];

/**
 * Generate an embedding vector for the given text using OpenAI's API
 * @param {string} text - The text to generate an embedding for
 * @returns {Promise<Array<number>>} The embedding vector
 */
async function getEmbedding(text) {
  const response = await openai.embeddings.create({
    model: 'text-embedding-ada-002',
    input: text,
  });
  return response.data[0].embedding;
}

/**
 * Search Weaviate database using hybrid search (text + vector)
 * @param {string} question - The question to search for
 * @param {Array<number>} vector - The embedding vector of the question
 * @returns {Promise<Object>} Search results and metadata
 */
async function searchWeaviate(question, vector) {
  // Track response time
  const startTime = Date.now();

  // Prepare GraphQL query for Weaviate
  const data = JSON.stringify({
    query: `{
      Get {
        Ecommerce (
          hybrid: {
            query: "${question}"
            alpha: 0.5,
            vector: ${JSON.stringify(vector)}
          }
          limit: 3
        ) {
          data
          _additional { score }
        }
      }
    }`,
  });

  // Configure request parameters
  const config = {
    method: 'post',
    url: 'https://ltwfnj0szusp5t1natqdw.c0.us-west3.gcp.weaviate.cloud/v1/graphql',
    headers: {
      'Content-Type': 'application/json',
      Authorization: 'Bearer AUV3GwGDrHKwVCJpFFxKc6ppUl7xXF54uQAM',
    },
    data: data,
  };

  try {
    // Execute the request
    const response = await axios.request(config);
    const endTime = Date.now();

    // Format and return results
    return {
      question: question,
      results: response.data.data.Get.Ecommerce.map((item) => item.data),
      responseTime: `${endTime - startTime}ms`,
    };
  } catch (error) {
    // Handle errors
    console.error(`Error for question "${question}":`, error);
    return {
      question: question,
      error: error.message,
      responseTime: null,
    };
  }
}

/**
 * Main function to process all questions
 */
async function main() {
  for (const question of questions) {
    try {
      // Generate embedding for the question
      const embedding = await getEmbedding(question);

      // Search Weaviate using the question and its embedding
      const result = await searchWeaviate(question, embedding);

      // Output results
      console.log('\n-------------------');
      console.log('Question:', result.question);
      console.log('Products:', result.results);
      console.log('Response Time:', result.responseTime);
    } catch (error) {
      console.error(`Failed to process question "${question}":`, error);
    }
  }
}

// Execute the main function
main();

This script takes the question(s) in the questions array and sequentially searches for them on Weaviate servers. While doing this, it uses

const data = JSON.stringify({
  query: `{
      Get {
        Ecommerce (
          hybrid: {
            query: "${question}"
            alpha: 0.5,
            vector: ${JSON.stringify(vector)}
          }
          limit: 3
        ) {
          data
          _additional { score }
        }
      }
    }`,
});

This part uses Weaviate’s hybrid search feature. In the query section, the direct question itself is included. In the vector section, the question’s embedded version is used. This way, it performs a hybrid search using both text and vector. The alpha parameter determines the weight of the search. As it increases, the weight of vector search increases, and as it decreases, the weight of keyword search increases. In the limit section, we specify the number of returned data items. In the data section, we specify the property name we want to be returned. The score section shows the scores determined for search results.

Conclusion

Vector databases like Weaviate represent a significant advancement in how we store, retrieve, and understand data. By capturing semantic relationships and enabling similarity-based search, they unlock new capabilities that were previously impractical with traditional database systems.

As AI continues to transform software development, vector databases will play an increasingly central role in connecting language models, search systems, and user interfaces. Understanding their capabilities, architecture, and implementation strategies is becoming an essential skill for developers working with AI-enhanced applications.