
Production-Ready Text Embeddings with WebAssembly: WasmEdge + GGML

· 8 min read
fr4nk
Software Engineer
Hugging Face

Building production ML inference services that run anywhere, from a Raspberry Pi to a cloud edge node, calls for a lighter-weight approach than the usual Python stack. This article walks through a complete implementation of a text embedding API using WasmEdge, GGML, and Rust: a 136KB WASM module paired with a 1.8MB async HTTP server that produces an embedding in roughly 100-200ms per request.

Full implementation: github.com/porameht/wasmedge-ggml-llama-embedding

What We're Building

A production-ready embedding API that transforms text into 384-dimensional vectors using:

  • WebAssembly runtime - Cross-platform portability (ARM, x86, RISC-V)
  • GGML quantization - Efficient model inference (46MB model)
  • Async HTTP server - High-throughput request handling with Tokio
  • Dual-binary architecture - Flexible deployment (WASM-only or full server)
  • Zero-config deployment - Environment variable configuration

Use cases:

  • Semantic search - Index and query millions of documents by meaning
  • RAG pipelines - Retrieval-augmented generation for LLMs
  • Recommendation engines - Find similar products, content, or users
  • Duplicate detection - Identify semantically similar items
  • Edge computing - Run inference locally on IoT devices
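
Most of these use cases boil down to comparing embedding vectors by cosine similarity. As a minimal illustration (this helper is not part of the repository), comparing two vectors looks like:

```rust
/// Cosine similarity between two embedding vectors: dot(a, b) / (|a| * |b|).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

fn main() {
    // Toy 3-dim vectors stand in for the real 384-dim embeddings.
    let a = [1.0_f32, 0.0, 0.0];
    let b = [0.0_f32, 1.0, 0.0];
    let c = [1.0_f32, 0.0, 0.0];
    println!("{:.1}", cosine_similarity(&a, &b)); // orthogonal -> 0.0
    println!("{:.1}", cosine_similarity(&a, &c)); // identical  -> 1.0
}
```

Semantic search then reduces to embedding the query once and ranking stored vectors by this score.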

The Challenge: Traditional ML Deployment

Traditional ML deployment approaches come with significant overhead:

| Approach | Binary Size | Dependencies | Portability |
| --- | --- | --- | --- |
| Python + PyTorch | N/A | Python runtime + packages | Platform-dependent |
| ONNX Runtime | ~20MB | Platform-specific builds | Good |
| TensorFlow Serving | ~500MB | Heavy dependencies | Platform-dependent |
| WasmEdge + GGML | 136KB | WasmEdge only | True cross-platform |

Why WebAssembly for edge computing:

  • Truly cross-platform - Same binary runs on ARM, x86, RISC-V
  • Minimal footprint - 136KB WASM module + 1.8MB server
  • Sandboxed execution - Built-in isolation for multi-tenant environments
  • No runtime dependencies - Just WasmEdge, no Python/Node.js required

System architecture: an async HTTP server front-end that spawns the sandboxed WASM inference module via WasmEdge.

Why WebAssembly for Edge ML?

WebAssembly (Wasm) is a natural fit for ML inference at the edge:

1. True Cross-Platform Deployment

Deploy the same binary to any edge device:

# Build once for WebAssembly
cargo build --target wasm32-wasip1 --release

# Run on ANY edge device - no recompilation needed
# Raspberry Pi (ARM)
wasmedge model.wasm

# Edge server (x86)
wasmedge model.wasm

# IoT gateway (RISC-V)
wasmedge model.wasm

2. Edge-First Security

Critical for multi-tenant edge environments:

# Sandboxed execution - perfect for edge multi-tenancy
wasmedge --dir .:. model.wasm

# Explicit resource control
wasmedge --nn-preload default:GGML:AUTO:model.gguf model.wasm

3. Production Performance Metrics

Verified specifications from the implementation:

| Metric | Value | Notes |
| --- | --- | --- |
| WASM Binary | 136KB | Portable inference module |
| Server Binary | 1.8MB | Async HTTP API wrapper |
| Model Size | 46MB | Quantized GGUF format |
| Cold Start | 2-3 seconds | Model load + initialization |
| Inference Latency | 100-200ms | Per embedding request |
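
A back-of-envelope consequence of those latency figures (assuming requests are handled one at a time, one WasmEdge process per request):

```rust
fn main() {
    // Throughput bounds implied by 100-200ms per inference.
    let latencies_ms = [100.0_f64, 200.0];
    for l in latencies_ms {
        // req/s per worker = 1000 ms / latency
        println!("{} ms/request -> {:.0} req/s per worker", l, 1000.0 / l);
    }
}
```

So a single sequential worker sustains roughly 5-10 embeddings per second; higher throughput comes from the async server fanning requests out concurrently.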

The Implementation

Dual-Binary Architecture - Two specialized components working together:

Component 1: WASM Embedding Module (136KB)

src/wasm.rs - The core inference engine

Responsibilities:

  • Load GGUF models via WASI-NN
  • Process text through GGML backend
  • Output pure JSON embeddings
  • Handle context management
  • CLI interface for direct usage

Build command:

cargo build --bin wasm-embedding --target wasm32-wasip1 --release \
--features wasm --no-default-features

Component 2: HTTP Server (1.8MB)

src/server.rs - Production API wrapper

Responsibilities:

  • Async HTTP handling (Warp + Tokio)
  • Process management (spawn WasmEdge)
  • CORS support for web integration
  • Health monitoring
  • Error handling and logging
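
Per request, the server's work reduces to assembling one `wasmedge` invocation and spawning it. A stdlib-only sketch of the argument construction (the function name and signature here are illustrative, not taken from the repository):

```rust
/// Build the argv passed to `wasmedge` for one embedding request,
/// mirroring the CLI invocation shown later in this article.
fn wasmedge_args(model_path: &str, wasm_path: &str, text: &str) -> Vec<String> {
    vec![
        "--dir".to_string(),
        ".:.".to_string(), // grant the sandbox access to the working directory
        "--nn-preload".to_string(),
        format!("default:GGML:AUTO:{}", model_path),
        wasm_path.to_string(),
        "default".to_string(), // graph alias registered by --nn-preload
        text.to_string(),
    ]
}

fn main() {
    let args = wasmedge_args("model.gguf", "embedding.wasm", "Hello world");
    println!("wasmedge {}", args.join(" "));
}
```

Keeping this as plain argv construction makes the server logic trivial to test without a model or a WasmEdge install present.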

Build command:

cargo build --bin embedding-api-server --release

Why this architecture?

  • Flexibility - Use WASM alone or full server
  • Performance - Async server handles concurrency
  • Portability - WASM runs anywhere
  • Maintainability - Clear separation of concerns

Let's dive into the implementation details.

1. Environment Configuration

Load runtime settings from environment variables for zero-recompilation deployment:

fn get_options_from_env() -> Value {
    let mut options = json!({});
    // Each value is parsed as JSON, so booleans and numbers come through typed
    // (e.g. enable_log=true, ctx_size=512); an invalid value panics at startup.
    if let Ok(val) = env::var("enable_log") {
        options["enable-log"] = serde_json::from_str(val.as_str()).unwrap()
    }
    if let Ok(val) = env::var("ctx_size") {
        options["ctx-size"] = serde_json::from_str(val.as_str()).unwrap()
    }
    if let Ok(val) = env::var("batch_size") {
        options["batch-size"] = serde_json::from_str(val.as_str()).unwrap()
    }
    if let Ok(val) = env::var("threads") {
        options["threads"] = serde_json::from_str(val.as_str()).unwrap()
    }
    options
}


Available options:

  • enable_log - Detailed logging (token counts, versions)
  • ctx_size - Context window size (default: 512)
  • batch_size - Batch processing size (default: 512)
  • threads - CPU threads for inference (default: 4)
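
Note that the function above simply omits keys whose variables are unset, leaving the defaults to the GGML backend. If you would rather materialize the documented defaults yourself, a stdlib-only helper (hypothetical, not in the repository) could look like:

```rust
use std::env;

/// Read a numeric env var, falling back to a default when unset or invalid.
fn env_usize(name: &str, default: usize) -> usize {
    env::var(name).ok().and_then(|v| v.parse().ok()).unwrap_or(default)
}

fn main() {
    // Defaults match the values documented above.
    let ctx_size = env_usize("ctx_size", 512);
    let batch_size = env_usize("batch_size", 512);
    let threads = env_usize("threads", 4);
    println!("ctx={} batch={} threads={}", ctx_size, batch_size, threads);
}
```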

2. WASI-NN Graph Initialization

The core of WASI-NN integration - loading the GGML model:

let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
    .config(options.to_string())
    .build_from_cache(model_name)
    .expect("Create GraphBuilder Failed");

let mut context = graph
    .init_execution_context()
    .expect("Init Context Failed");


The WASI-NN flow: build the graph from the cached GGUF model, initialize an execution context, set the input tensor, run compute, and read back the output.

3. HTTP Server Implementation

The production API wraps the WASM module with an async HTTP server:

async fn generate_embedding(&self, text: &str) -> Result<EmbedResponse> {
    // Spawn WasmEdge with the model preloaded via WASI-NN
    // (tokio::process::Command; reading stdout needs tokio::io::AsyncReadExt)
    let mut child = Command::new("wasmedge")
        .arg("--dir").arg(".:.")
        .arg("--nn-preload")
        .arg(&format!("default:GGML:AUTO:{}", self.model_path.display()))
        .arg(&self.wasm_path)
        .arg("default")
        .arg(text)
        .stdout(Stdio::piped())
        .spawn()?;

    // The module writes a single JSON object to stdout
    let mut stdout = Vec::new();
    if let Some(mut out) = child.stdout.take() {
        out.read_to_end(&mut stdout).await?;
    }

    let output = String::from_utf8_lossy(&stdout);
    let parsed: serde_json::Value = serde_json::from_str(output.trim())?;

    // NOTE: these unwraps panic if the module's output is malformed
    Ok(EmbedResponse {
        n_embedding: parsed["n_embedding"].as_u64().unwrap() as usize,
        embedding: parsed["embedding"].as_array().unwrap()
            .iter().filter_map(|v| v.as_f64()).collect(),
    })
}


API Endpoints:

# Generate embedding
curl -X POST http://localhost:3000/embed \
-H "Content-Type: application/json" \
-d '{"text":"What is the capital of France?"}'

# Health check
curl http://localhost:3000/health

# API info
curl http://localhost:3000/

4. Tensor Processing

Proper tensor dimension handling for embeddings:

fn set_data_to_context(context: &mut GraphExecutionContext, data: Vec<u8>) -> Result<(), Error> {
    // Input is raw UTF-8 bytes; tokenization happens inside the GGML backend
    context.set_input(0, TensorType::U8, &[data.len()], &data)
}

fn get_data_from_context(context: &GraphExecutionContext, index: usize) -> String {
    // Preallocate a buffer large enough for any embedding the model can emit
    const MAX_OUTPUT_BUFFER_SIZE: usize = 4096 * 20 + 128;
    let mut output_buffer = vec![0u8; MAX_OUTPUT_BUFFER_SIZE];
    let mut output_size = context.get_output(index, &mut output_buffer).unwrap();
    output_size = std::cmp::min(MAX_OUTPUT_BUFFER_SIZE, output_size);
    String::from_utf8_lossy(&output_buffer[..output_size]).to_string()
}


Why 4096 * 20 + 128?

  • Most embedding models output ≤ 4096 dimensions
  • Each float printed as string: ~20 bytes
  • 128 bytes for JSON structure ({"n_embedding":...})

5. Output Format

HTTP Response:

{
  "n_embedding": 384,
  "embedding": [0.5426, -0.0384, -0.0364, ..., 0.1234]
}

WASM CLI Output:

$ wasmedge --dir .:. --nn-preload default:GGML:AUTO:model.gguf \
wasmedge-ggml-llama-embedding.wasm default "Hello world"

{"n_embedding":384,"embedding":[0.5426,-0.0384,-0.0364,...]}

The WASM module outputs pure JSON (no extra text), making it easy to parse in any language or integrate with shell pipelines.
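
To make that concrete, even a dependency-free consumer can pick fields out of the output. A rough stdlib-only extraction (string scanning for illustration only; real code should use serde_json as the server does):

```rust
/// Extract the integer value of "n_embedding" from the module's JSON output.
/// Naive string scan for illustration; use a JSON parser in real code.
fn n_embedding(json: &str) -> Option<usize> {
    let key = "\"n_embedding\":";
    let start = json.find(key)? + key.len();
    let rest = &json[start..];
    // Value ends at the next comma or closing brace
    let end = rest.find(|c| c == ',' || c == '}')?;
    rest[..end].trim().parse().ok()
}

fn main() {
    let out = r#"{"n_embedding":384,"embedding":[0.5426,-0.0384]}"#;
    println!("{:?}", n_embedding(out)); // Some(384)
}
```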

The Model: All-MiniLM-L6-v2

We use the all-MiniLM-L6-v2 model in GGUF format:

| Specification | Value |
| --- | --- |
| Output Dimensions | 384 |
| Model Size | 46MB (f16 quantized) |
| Max Sequence Length | 256 tokens |
| Performance | 100-200ms per request |
| Use Case | General-purpose embeddings |

Download from HuggingFace:

curl -LO https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/resolve/main/all-MiniLM-L6-v2-ggml-model-f16.gguf

Model comparison:

| Model | Dimensions | Size | Inference | Quality |
| --- | --- | --- | --- | --- |
| MiniLM-L6 | 384 | 46MB | 100-200ms | Good |
| BERT-Base | 768 | 440MB | ~30ms | Better |
| MPNet | 768 | 440MB | ~35ms | Better |
| E5-Large | 1024 | 1.3GB | ~100ms | Best |

Building and Running

Quick Start

# 1. Install WasmEdge with WASI-NN plugin
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugins wasi_nn-ggml
source $HOME/.wasmedge/env

# 2. Install Rust target
rustup target add wasm32-wasip1

# 3. Download model
curl -LO https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/resolve/main/all-MiniLM-L6-v2-ggml-model-f16.gguf

# 4. Build both binaries
./build-wasm.sh # Builds WASM module (136KB)
cargo build --bin embedding-api-server --release # Builds HTTP server (1.8MB)

# 5. Run the server
./target/release/embedding-api-server

Build time: ~10 seconds (both binaries)
Output sizes: 136KB (WASM) + 1.8MB (server)

Build Configuration

The Cargo.toml uses feature flags for optimal binary sizes:

[features]
default = ["server"]
wasm = ["wasmedge-wasi-nn"]
server = ["warp", "tokio", "serde", "anyhow"]

[profile.release]
opt-level = 3 # Maximum optimization
lto = true # Link-time optimization
strip = true # Strip debug symbols

Build targets:

# WASM module only (edge devices)
cargo build --bin wasm-embedding --target wasm32-wasip1 --release \
--features wasm --no-default-features

# HTTP server only (cloud/production)
cargo build --bin embedding-api-server --release

Size comparison:

WASM debug:     450KB
WASM release:   136KB (70% reduction)
Server release: 1.8MB

Resources: