
Production-Ready Text Embeddings with WebAssembly: WasmEdge + GGML

· 8 min read
fr4nk
Software Engineer
Hugging Face

Building production ML inference services that run anywhere, from a Raspberry Pi to a cloud edge node, calls for a lighter-weight approach than the usual Python stack. This article walks through a complete implementation of a text embedding API using WasmEdge, GGML, and Rust: a 136KB WASM module paired with a 1.8MB async HTTP server that produces an embedding in roughly 100-200ms per request.

Full implementation: github.com/porameht/wasmedge-ggml-llama-embedding

What We're Building

A production-ready embedding API that transforms text into 384-dimensional vectors using:

  • WebAssembly runtime - Cross-platform portability (ARM, x86, RISC-V)
  • GGML quantization - Efficient model inference (46MB model)
  • Async HTTP server - High-throughput request handling with Tokio
  • Dual-binary architecture - Flexible deployment (WASM-only or full server)
  • Zero-config deployment - Environment variable configuration

Use cases:

  • Semantic search - Index and query millions of documents by meaning
  • RAG pipelines - Retrieval-augmented generation for LLMs
  • Recommendation engines - Find similar products, content, or users
  • Duplicate detection - Identify semantically similar items
  • Edge computing - Run inference locally on IoT devices
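
Most of these use cases boil down to comparing embedding vectors by cosine similarity. As a minimal illustration (this helper is not part of the repository), comparing two vectors looks like:

```rust
/// Cosine similarity between two embedding vectors: dot(a, b) / (|a| * |b|).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

fn main() {
    // Toy 3-dim vectors stand in for the real 384-dim embeddings.
    let a = [1.0_f32, 0.0, 0.0];
    let b = [0.0_f32, 1.0, 0.0];
    let c = [1.0_f32, 0.0, 0.0];
    println!("{:.1}", cosine_similarity(&a, &b)); // orthogonal -> 0.0
    println!("{:.1}", cosine_similarity(&a, &c)); // identical  -> 1.0
}
```

Semantic search then reduces to embedding the query once and ranking stored vectors by this score.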

The Challenge: Traditional ML Deployment

Traditional ML deployment approaches come with significant overhead:

| Approach | Binary Size | Dependencies | Portability |
| --- | --- | --- | --- |
| Python + PyTorch | N/A | Python runtime + packages | Platform-dependent |
| ONNX Runtime | ~20MB | Platform-specific builds | Good |
| TensorFlow Serving | ~500MB | Heavy dependencies | Platform-dependent |
| WasmEdge + GGML | 136KB | WasmEdge only | True cross-platform |

Why WebAssembly for edge computing:

  • Truly cross-platform - Same binary runs on ARM, x86, RISC-V
  • Minimal footprint - 136KB WASM module + 1.8MB server
  • Sandboxed execution - Built-in isolation for multi-tenant environments
  • No runtime dependencies - Just WasmEdge, no Python/Node.js required

System architecture: an async HTTP server front-end that spawns the sandboxed WASM inference module via WasmEdge.

Why WebAssembly for Edge ML?

WebAssembly (Wasm) is a natural fit for ML inference at the edge:

1. True Cross-Platform Deployment

Deploy the same binary to any edge device:

# Build once for WebAssembly
cargo build --target wasm32-wasip1 --release

# Run on ANY edge device - no recompilation needed
# Raspberry Pi (ARM)
wasmedge model.wasm

# Edge server (x86)
wasmedge model.wasm

# IoT gateway (RISC-V)
wasmedge model.wasm

2. Edge-First Security

Critical for multi-tenant edge environments:

# Sandboxed execution - perfect for edge multi-tenancy
wasmedge --dir .:. model.wasm

# Explicit resource control
wasmedge --nn-preload default:GGML:AUTO:model.gguf model.wasm

3. Production Performance Metrics

Verified specifications from the implementation:

| Metric | Value | Notes |
| --- | --- | --- |
| WASM Binary | 136KB | Portable inference module |
| Server Binary | 1.8MB | Async HTTP API wrapper |
| Model Size | 46MB | Quantized GGUF format |
| Cold Start | 2-3 seconds | Model load + initialization |
| Inference Latency | 100-200ms | Per embedding request |
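
A back-of-envelope consequence of those latency figures (assuming requests are handled one at a time, one WasmEdge process per request):

```rust
fn main() {
    // Throughput bounds implied by 100-200ms per inference.
    let latencies_ms = [100.0_f64, 200.0];
    for l in latencies_ms {
        // req/s per worker = 1000 ms / latency
        println!("{} ms/request -> {:.0} req/s per worker", l, 1000.0 / l);
    }
}
```

So a single sequential worker sustains roughly 5-10 embeddings per second; higher throughput comes from the async server fanning requests out concurrently.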

The Implementation

Dual-Binary Architecture - Two specialized components working together:

Component 1: WASM Embedding Module (136KB)

src/wasm.rs - The core inference engine

Responsibilities:

  • Load GGUF models via WASI-NN
  • Process text through GGML backend
  • Output pure JSON embeddings
  • Handle context management
  • CLI interface for direct usage

Build command:

cargo build --bin wasm-embedding --target wasm32-wasip1 --release \
--features wasm --no-default-features

Component 2: HTTP Server (1.8MB)

src/server.rs - Production API wrapper

Responsibilities:

  • Async HTTP handling (Warp + Tokio)
  • Process management (spawn WasmEdge)
  • CORS support for web integration
  • Health monitoring
  • Error handling and logging
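
Per request, the server's work reduces to assembling one `wasmedge` invocation and spawning it. A stdlib-only sketch of the argument construction (the function name and signature here are illustrative, not taken from the repository):

```rust
/// Build the argv passed to `wasmedge` for one embedding request,
/// mirroring the CLI invocation shown later in this article.
fn wasmedge_args(model_path: &str, wasm_path: &str, text: &str) -> Vec<String> {
    vec![
        "--dir".to_string(),
        ".:.".to_string(), // grant the sandbox access to the working directory
        "--nn-preload".to_string(),
        format!("default:GGML:AUTO:{}", model_path),
        wasm_path.to_string(),
        "default".to_string(), // graph alias registered by --nn-preload
        text.to_string(),
    ]
}

fn main() {
    let args = wasmedge_args("model.gguf", "embedding.wasm", "Hello world");
    println!("wasmedge {}", args.join(" "));
}
```

Keeping this as plain argv construction makes the server logic trivial to test without a model or a WasmEdge install present.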

Build command:

cargo build --bin embedding-api-server --release

Why this architecture?

  • Flexibility - Use WASM alone or full server
  • Performance - Async server handles concurrency
  • Portability - WASM runs anywhere
  • Maintainability - Clear separation of concerns

Let's dive into the implementation details.

1. Environment Configuration

Load runtime settings from environment variables for zero-recompilation deployment:

fn get_options_from_env() -> Value {
    let mut options = json!({});
    // Each value is parsed as JSON, so booleans and numbers come through typed
    // (e.g. enable_log=true, ctx_size=512); an invalid value panics at startup.
    if let Ok(val) = env::var("enable_log") {
        options["enable-log"] = serde_json::from_str(val.as_str()).unwrap()
    }
    if let Ok(val) = env::var("ctx_size") {
        options["ctx-size"] = serde_json::from_str(val.as_str()).unwrap()
    }
    if let Ok(val) = env::var("batch_size") {
        options["batch-size"] = serde_json::from_str(val.as_str()).unwrap()
    }
    if let Ok(val) = env::var("threads") {
        options["threads"] = serde_json::from_str(val.as_str()).unwrap()
    }
    options
}


Available options:

  • enable_log - Detailed logging (token counts, versions)
  • ctx_size - Context window size (default: 512)
  • batch_size - Batch processing size (default: 512)
  • threads - CPU threads for inference (default: 4)
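
Note that the function above simply omits keys whose variables are unset, leaving the defaults to the GGML backend. If you would rather materialize the documented defaults yourself, a stdlib-only helper (hypothetical, not in the repository) could look like:

```rust
use std::env;

/// Read a numeric env var, falling back to a default when unset or invalid.
fn env_usize(name: &str, default: usize) -> usize {
    env::var(name).ok().and_then(|v| v.parse().ok()).unwrap_or(default)
}

fn main() {
    // Defaults match the values documented above.
    let ctx_size = env_usize("ctx_size", 512);
    let batch_size = env_usize("batch_size", 512);
    let threads = env_usize("threads", 4);
    println!("ctx={} batch={} threads={}", ctx_size, batch_size, threads);
}
```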

2. WASI-NN Graph Initialization

The core of WASI-NN integration - loading the GGML model:

let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
    .config(options.to_string())
    .build_from_cache(model_name)
    .expect("Create GraphBuilder Failed");

let mut context = graph
    .init_execution_context()
    .expect("Init Context Failed");


The WASI-NN flow: build the graph from the cached GGUF model, initialize an execution context, set the input tensor, run compute, and read back the output.

3. HTTP Server Implementation

The production API wraps the WASM module with an async HTTP server:

async fn generate_embedding(&self, text: &str) -> Result<EmbedResponse> {
    // Spawn WasmEdge with the model preloaded via WASI-NN
    // (tokio::process::Command; reading stdout needs tokio::io::AsyncReadExt)
    let mut child = Command::new("wasmedge")
        .arg("--dir").arg(".:.")
        .arg("--nn-preload")
        .arg(&format!("default:GGML:AUTO:{}", self.model_path.display()))
        .arg(&self.wasm_path)
        .arg("default")
        .arg(text)
        .stdout(Stdio::piped())
        .spawn()?;

    // The module writes a single JSON object to stdout
    let mut stdout = Vec::new();
    if let Some(mut out) = child.stdout.take() {
        out.read_to_end(&mut stdout).await?;
    }

    let output = String::from_utf8_lossy(&stdout);
    let parsed: serde_json::Value = serde_json::from_str(output.trim())?;

    // NOTE: these unwraps panic if the module's output is malformed
    Ok(EmbedResponse {
        n_embedding: parsed["n_embedding"].as_u64().unwrap() as usize,
        embedding: parsed["embedding"].as_array().unwrap()
            .iter().filter_map(|v| v.as_f64()).collect(),
    })
}


API Endpoints:

# Generate embedding
curl -X POST http://localhost:3000/embed \
-H "Content-Type: application/json" \
-d '{"text":"What is the capital of France?"}'

# Health check
curl http://localhost:3000/health

# API info
curl http://localhost:3000/

4. Tensor Processing

Proper tensor dimension handling for embeddings:

fn set_data_to_context(context: &mut GraphExecutionContext, data: Vec<u8>) -> Result<(), Error> {
    // Input is raw UTF-8 bytes; tokenization happens inside the GGML backend
    context.set_input(0, TensorType::U8, &[data.len()], &data)
}

fn get_data_from_context(context: &GraphExecutionContext, index: usize) -> String {
    // Preallocate a buffer large enough for any embedding the model can emit
    const MAX_OUTPUT_BUFFER_SIZE: usize = 4096 * 20 + 128;
    let mut output_buffer = vec![0u8; MAX_OUTPUT_BUFFER_SIZE];
    let mut output_size = context.get_output(index, &mut output_buffer).unwrap();
    output_size = std::cmp::min(MAX_OUTPUT_BUFFER_SIZE, output_size);
    String::from_utf8_lossy(&output_buffer[..output_size]).to_string()
}


Why 4096 * 20 + 128?

  • Most embedding models output ≤ 4096 dimensions
  • Each float printed as string: ~20 bytes
  • 128 bytes for JSON structure ({"n_embedding":...})

5. Output Format

HTTP Response:

{
  "n_embedding": 384,
  "embedding": [0.5426, -0.0384, -0.0364, ..., 0.1234]
}

WASM CLI Output:

$ wasmedge --dir .:. --nn-preload default:GGML:AUTO:model.gguf \
wasmedge-ggml-llama-embedding.wasm default "Hello world"

{"n_embedding":384,"embedding":[0.5426,-0.0384,-0.0364,...]}

The WASM module outputs pure JSON (no extra text), making it easy to parse in any language or integrate with shell pipelines.
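
To make that concrete, even a dependency-free consumer can pick fields out of the output. A rough stdlib-only extraction (string scanning for illustration only; real code should use serde_json as the server does):

```rust
/// Extract the integer value of "n_embedding" from the module's JSON output.
/// Naive string scan for illustration; use a JSON parser in real code.
fn n_embedding(json: &str) -> Option<usize> {
    let key = "\"n_embedding\":";
    let start = json.find(key)? + key.len();
    let rest = &json[start..];
    // Value ends at the next comma or closing brace
    let end = rest.find(|c| c == ',' || c == '}')?;
    rest[..end].trim().parse().ok()
}

fn main() {
    let out = r#"{"n_embedding":384,"embedding":[0.5426,-0.0384]}"#;
    println!("{:?}", n_embedding(out)); // Some(384)
}
```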

The Model: All-MiniLM-L6-v2

We use the all-MiniLM-L6-v2 model in GGUF format:

| Specification | Value |
| --- | --- |
| Output Dimensions | 384 |
| Model Size | 46MB (f16 quantized) |
| Max Sequence Length | 256 tokens |
| Performance | 100-200ms per request |
| Use Case | General-purpose embeddings |

Download from HuggingFace:

curl -LO https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/resolve/main/all-MiniLM-L6-v2-ggml-model-f16.gguf

Model comparison:

| Model | Dimensions | Size | Inference | Quality |
| --- | --- | --- | --- | --- |
| MiniLM-L6 | 384 | 46MB | 100-200ms | Good |
| BERT-Base | 768 | 440MB | ~30ms | Better |
| MPNet | 768 | 440MB | ~35ms | Better |
| E5-Large | 1024 | 1.3GB | ~100ms | Best |

Building and Running

Quick Start

# 1. Install WasmEdge with WASI-NN plugin
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugins wasi_nn-ggml
source $HOME/.wasmedge/env

# 2. Install Rust target
rustup target add wasm32-wasip1

# 3. Download model
curl -LO https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/resolve/main/all-MiniLM-L6-v2-ggml-model-f16.gguf

# 4. Build both binaries
./build-wasm.sh # Builds WASM module (136KB)
cargo build --bin embedding-api-server --release # Builds HTTP server (1.8MB)

# 5. Run the server
./target/release/embedding-api-server

Build time: ~10 seconds (both binaries)
Output sizes: 136KB (WASM) + 1.8MB (server)

Build Configuration

The Cargo.toml uses feature flags for optimal binary sizes:

[features]
default = ["server"]
wasm = ["wasmedge-wasi-nn"]
server = ["warp", "tokio", "serde", "anyhow"]

[profile.release]
opt-level = 3 # Maximum optimization
lto = true # Link-time optimization
strip = true # Strip debug symbols

Build targets:

# WASM module only (edge devices)
cargo build --bin wasm-embedding --target wasm32-wasip1 --release \
--features wasm --no-default-features

# HTTP server only (cloud/production)
cargo build --bin embedding-api-server --release

Size comparison:

WASM debug:     450KB
WASM release:   136KB (70% reduction)
Server release: 1.8MB

Resources: