Production-Ready Text Embeddings with WebAssembly: WasmEdge + GGML
Building production ML inference services that run anywhere, from a Raspberry Pi to a cloud edge node, calls for a different approach than the usual Python-based serving stack. This article walks through a complete implementation of a text embedding API using WasmEdge, GGML, and Rust: a 136KB WASM module paired with a 1.8MB async HTTP server that returns an embedding in roughly 100-200ms per request.
Full implementation: github.com/porameht/wasmedge-ggml-llama-embedding
What We're Building
A production-ready embedding API that transforms text into 384-dimensional vectors using:
- WebAssembly runtime - Cross-platform portability (ARM, x86, RISC-V)
- GGML quantization - Efficient model inference (46MB model)
- Async HTTP server - High-throughput request handling with Tokio
- Dual-binary architecture - Flexible deployment (WASM-only or full server)
- Zero-config deployment - Environment variable configuration
Use cases:
- Semantic search - Index and query millions of documents by meaning
- RAG pipelines - Retrieval-augmented generation for LLMs
- Recommendation engines - Find similar products, content, or users
- Duplicate detection - Identify semantically similar items
- Edge computing - Run inference locally on IoT devices
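Every one of these use cases reduces to comparing vectors. As a minimal illustration (plain Rust, no external crates; the toy 3-dimensional vectors stand in for the 384-dimensional ones the API returns), cosine similarity is the comparison behind semantic search and duplicate detection:
/// Cosine similarity between two embedding vectors: close to 1.0 means
/// semantically similar, near 0.0 means unrelated.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "embeddings must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

fn main() {
    let query = [0.10_f32, 0.80, 0.30];
    let doc_a = [0.12_f32, 0.79, 0.31]; // close in meaning
    let doc_b = [-0.70_f32, 0.10, -0.50]; // unrelated
    println!("query vs doc_a: {:.3}", cosine_similarity(&query, &doc_a));
    println!("query vs doc_b: {:.3}", cosine_similarity(&query, &doc_b));
}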
The Challenge: Traditional ML Deployment
Traditional ML deployment approaches come with significant overhead:
| Approach | Binary Size | Dependencies | Portability |
|---|---|---|---|
| Python + PyTorch | N/A | Python runtime + packages | Platform-dependent |
| ONNX Runtime | ~20MB | Platform-specific builds | Good |
| TensorFlow Serving | ~500MB | Heavy dependencies | Platform-dependent |
| WasmEdge + GGML | 136KB | WasmEdge only | True cross-platform |
Why WebAssembly for edge computing:
- Truly cross-platform - Same binary runs on ARM, x86, RISC-V
- Minimal footprint - 136KB WASM module + 1.8MB server
- Sandboxed execution - Built-in isolation for multi-tenant environments
- No runtime dependencies - Just WasmEdge, no Python/Node.js required
System architecture: the native HTTP server accepts requests and spawns a WasmEdge process running the 136KB WASM module; the module loads the GGUF model through WASI-NN and writes the resulting JSON embedding back over stdout.
Why WebAssembly for Edge ML?
WebAssembly (Wasm) is a strong fit for ML inference on edge devices:
1. True Cross-Platform Deployment
Deploy the same binary to any edge device:
# Build once for WebAssembly
cargo build --target wasm32-wasip1 --release
# Run on ANY edge device - no recompilation needed
# Raspberry Pi (ARM)
wasmedge model.wasm
# Edge server (x86)
wasmedge model.wasm
# IoT gateway (RISC-V)
wasmedge model.wasm
2. Edge-First Security
Critical for multi-tenant edge environments:
# Sandboxed execution - perfect for edge multi-tenancy
wasmedge --dir .:. model.wasm
# Explicit resource control
wasmedge --nn-preload default:GGML:AUTO:model.gguf model.wasm
3. Production Performance Metrics
Verified specifications from the implementation:
| Metric | Value | Notes |
|---|---|---|
| WASM Binary | 136KB | Portable inference module |
| Server Binary | 1.8MB | Async HTTP API wrapper |
| Model Size | 46MB | Quantized GGUF format |
| Cold Start | 2-3 seconds | Model load + initialization |
| Inference Latency | 100-200ms | Per embedding request |
The Implementation
Dual-Binary Architecture - Two specialized components working together:
Component 1: WASM Embedding Module (136KB)
src/wasm.rs - The core inference engine
Responsibilities:
- Load GGUF models via WASI-NN
- Process text through GGML backend
- Output pure JSON embeddings
- Handle context management
- CLI interface for direct usage
Build command:
cargo build --bin wasm-embedding --target wasm32-wasip1 --release \
--features wasm --no-default-features
Component 2: HTTP Server (1.8MB)
src/server.rs - Production API wrapper
Responsibilities:
- Async HTTP handling (Warp + Tokio)
- Process management (spawn WasmEdge)
- CORS support for web integration
- Health monitoring
- Error handling and logging
Build command:
cargo build --bin embedding-api-server --release
Why this architecture?
- Flexibility - Use WASM alone or full server
- Performance - Async server handles concurrency
- Portability - WASM runs anywhere
- Maintainability - Clear separation of concerns
Let's dive into the implementation details.
1. Environment Configuration
Load runtime settings from environment variables for zero-recompilation deployment:
use std::env;
use serde_json::{json, Value};

// Build the plugin config JSON from environment variables.
fn get_options_from_env() -> Value {
    let mut options = json!({});
    if let Ok(val) = env::var("enable_log") {
        options["enable-log"] = serde_json::from_str(val.as_str()).unwrap();
    }
    if let Ok(val) = env::var("ctx_size") {
        options["ctx-size"] = serde_json::from_str(val.as_str()).unwrap();
    }
    if let Ok(val) = env::var("batch_size") {
        options["batch-size"] = serde_json::from_str(val.as_str()).unwrap();
    }
    if let Ok(val) = env::var("threads") {
        options["threads"] = serde_json::from_str(val.as_str()).unwrap();
    }
    options
}
Available options:
- enable_log - Detailed logging (token counts, versions)
- ctx_size - Context window size (default: 512)
- batch_size - Batch processing size (default: 512)
- threads - CPU threads for inference (default: 4)
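As a sketch of how this plugs into startup (get_options_from_env is the function above; the command line is only an example):
fn main() {
    // If the server is launched as, for example:
    //   ctx_size=1024 threads=8 enable_log=true ./embedding-api-server
    // the function above yields {"ctx-size":1024,"threads":8,"enable-log":true}.
    // Note the mapping: environment variables use underscores, the plugin's
    // config keys use hyphens. Unset variables are omitted, so the plugin
    // falls back to its defaults.
    let options = get_options_from_env();

    // The JSON string is later handed verbatim to the GGML plugin via
    // GraphBuilder::config(options.to_string()).
    println!("plugin config: {}", options);
}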
2. WASI-NN Graph Initialization
The core of WASI-NN integration - loading the GGML model:
// The model alias passed to build_from_cache must match the name given to
// --nn-preload on the command line (here: "default").
let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
    .config(options.to_string())
    .build_from_cache(model_name)
    .expect("Create GraphBuilder Failed");

// The execution context holds the per-request state: input tensor, compute
// results, and output buffer.
let mut context = graph
    .init_execution_context()
    .expect("Init Context Failed");
The WASI-NN flow: build the graph from the preloaded model, create an execution context, write the prompt into the input tensor, run compute, and read the result string back from the output buffer.
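Condensed into one request path, the module's inference loop looks roughly like this (a sketch; set_data_to_context and get_data_from_context are the helpers shown in the tensor-processing section below, and output index 0 is an assumption):
use wasmedge_wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding};

fn embed(model_name: &str, prompt: &str) -> String {
    // 1. Build the graph from the model preloaded via --nn-preload.
    let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
        .config(get_options_from_env().to_string())
        .build_from_cache(model_name)
        .expect("Create GraphBuilder Failed");

    // 2. Create a fresh execution context for this request.
    let mut context = graph
        .init_execution_context()
        .expect("Init Context Failed");

    // 3. Feed the prompt bytes as input tensor 0.
    set_data_to_context(&mut context, prompt.as_bytes().to_vec())
        .expect("Set input failed");

    // 4. Run inference.
    context.compute().expect("Compute failed");

    // 5. Read the JSON result back out of output 0.
    get_data_from_context(&context, 0)
}

fn main() {
    // CLI contract: wasmedge ... module.wasm <model-alias> <text>
    let args: Vec<String> = std::env::args().collect();
    println!("{}", embed(&args[1], &args[2]));
}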
3. HTTP Server Implementation
The production API wraps the WASM module with an async HTTP server:
// Needs: tokio::process::Command, std::process::Stdio,
// tokio::io::AsyncReadExt, anyhow::Result.
async fn generate_embedding(&self, text: &str) -> Result<EmbedResponse> {
    // Spawn WasmEdge with the model preloaded under the "default" alias;
    // the WASM module prints a single JSON object to stdout.
    let mut child = Command::new("wasmedge")
        .arg("--dir").arg(".:.")
        .arg("--nn-preload")
        .arg(format!("default:GGML:AUTO:{}", self.model_path.display()))
        .arg(&self.wasm_path)
        .arg("default")
        .arg(text)
        .stdout(Stdio::piped())
        .spawn()?;

    // Read everything the module wrote, then reap the child process.
    let mut stdout = Vec::new();
    if let Some(mut out) = child.stdout.take() {
        out.read_to_end(&mut stdout).await?;
    }
    child.wait().await?;

    // The module's output is pure JSON: {"n_embedding":N,"embedding":[...]}
    let output = String::from_utf8_lossy(&stdout);
    let parsed: serde_json::Value = serde_json::from_str(output.trim())?;
    Ok(EmbedResponse {
        n_embedding: parsed["n_embedding"].as_u64().unwrap() as usize,
        embedding: parsed["embedding"].as_array().unwrap()
            .iter().filter_map(|v| v.as_f64()).collect(),
    })
}
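The route wiring itself is not shown above; here is a minimal sketch of how a handler like this is typically exposed with Warp (the Engine struct, field names, and handler names are assumptions for illustration, not the repository's exact code):
use std::{path::PathBuf, sync::Arc};
use serde::{Deserialize, Serialize};
use warp::Filter;

#[derive(Deserialize)]
struct EmbedRequest {
    text: String,
}

#[derive(Serialize)]
struct EmbedResponse {
    n_embedding: usize,
    embedding: Vec<f64>,
}

// Assumed holder for the paths used by generate_embedding().
struct Engine {
    model_path: PathBuf,
    wasm_path: PathBuf,
}

impl Engine {
    // In the real server this is the method shown in the snippet above.
    async fn generate_embedding(&self, _text: &str) -> anyhow::Result<EmbedResponse> {
        unimplemented!("spawn wasmedge as shown above")
    }
}

async fn handle_embed(
    req: EmbedRequest,
    engine: Arc<Engine>,
) -> Result<impl warp::Reply, warp::Rejection> {
    match engine.generate_embedding(&req.text).await {
        Ok(resp) => Ok(warp::reply::json(&resp)),
        Err(_) => Err(warp::reject::reject()),
    }
}

#[tokio::main]
async fn main() {
    let engine = Arc::new(Engine {
        model_path: "all-MiniLM-L6-v2-ggml-model-f16.gguf".into(),
        wasm_path: "wasmedge-ggml-llama-embedding.wasm".into(),
    });
    let with_engine = warp::any().map(move || engine.clone());

    // POST /embed {"text": "..."} -> embedding JSON
    let embed = warp::path("embed")
        .and(warp::post())
        .and(warp::body::json())
        .and(with_engine)
        .and_then(handle_embed);

    // GET /health -> liveness probe
    let health = warp::path("health").map(|| "ok");

    // CORS so browser clients can call the API directly.
    let routes = embed.or(health).with(warp::cors().allow_any_origin());
    warp::serve(routes).run(([0, 0, 0, 0], 3000)).await;
}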
API Endpoints:
# Generate embedding
curl -X POST http://localhost:3000/embed \
-H "Content-Type: application/json" \
-d '{"text":"What is the capital of France?"}'
# Health check
curl http://localhost:3000/health
# API info
curl http://localhost:3000/
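The same endpoint can also be called programmatically; a sketch of a Rust client using the reqwest and serde crates (these crates are not part of this repository, the snippet is illustration only):
use serde::Deserialize;

#[derive(Deserialize)]
struct EmbedResponse {
    n_embedding: usize,
    embedding: Vec<f32>,
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let resp: EmbedResponse = reqwest::Client::new()
        .post("http://localhost:3000/embed")
        .json(&serde_json::json!({ "text": "What is the capital of France?" }))
        .send()
        .await?
        .json()
        .await?;

    println!("{} dimensions, first value {:.4}", resp.n_embedding, resp.embedding[0]);
    Ok(())
}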
4. Tensor Processing
Proper tensor dimension handling for embeddings:
use wasmedge_wasi_nn::{Error, GraphExecutionContext, TensorType};

// The input is the raw UTF-8 prompt, passed as a 1-D U8 tensor whose single
// dimension is the byte length.
fn set_data_to_context(context: &mut GraphExecutionContext, data: Vec<u8>) -> Result<(), Error> {
    context.set_input(0, TensorType::U8, &[data.len()], &data)
}

// The output is read into a fixed-size buffer and returned as a (lossy) UTF-8
// string; the reported size is clamped in case it exceeds the allocation.
fn get_data_from_context(context: &GraphExecutionContext, index: usize) -> String {
    const MAX_OUTPUT_BUFFER_SIZE: usize = 4096 * 20 + 128;
    let mut output_buffer = vec![0u8; MAX_OUTPUT_BUFFER_SIZE];
    let mut output_size = context.get_output(index, &mut output_buffer).unwrap();
    output_size = std::cmp::min(MAX_OUTPUT_BUFFER_SIZE, output_size);
    String::from_utf8_lossy(&output_buffer[..output_size]).to_string()
}
Why 4096 * 20 + 128?
- Most embedding models output ≤ 4096 dimensions
- Each float printed as a string takes roughly 20 bytes
- 128 bytes cover the surrounding JSON structure ({"n_embedding":...})
That works out to 82,048 bytes, comfortably more than the ~8KB a 384-dimensional embedding actually occupies.
5. Output Format
HTTP Response:
{
"n_embedding": 384,
"embedding": [0.5426, -0.0384, -0.0364, ..., 0.1234]
}
WASM CLI Output:
$ wasmedge --dir .:. --nn-preload default:GGML:AUTO:model.gguf \
wasmedge-ggml-llama-embedding.wasm default "Hello world"
{"n_embedding":384,"embedding":[0.5426,-0.0384,-0.0364,...]}
The WASM module outputs pure JSON (no extra text), making it easy to parse in any language or integrate with shell pipelines.
The Model: All-MiniLM-L6-v2
We use the all-MiniLM-L6-v2 model in GGUF format:
| Specification | Value |
|---|---|
| Output Dimensions | 384 |
| Model Size | 46MB (f16 quantized) |
| Max Sequence Length | 256 tokens |
| Performance | 100-200ms per request |
| Use Case | General-purpose embeddings |
Download from HuggingFace:
curl -LO https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/resolve/main/all-MiniLM-L6-v2-ggml-model-f16.gguf
Model comparison:
| Model | Dimensions | Size | Inference | Quality |
|---|---|---|---|---|
| MiniLM-L6 | 384 | 46MB | 100-200ms | Good |
| BERT-Base | 768 | 440MB | ~30ms | Better |
| MPNet | 768 | 440MB | ~35ms | Better |
| E5-Large | 1024 | 1.3GB | ~100ms | Best |
Building and Running
Quick Start
# 1. Install WasmEdge with WASI-NN plugin
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugins wasi_nn-ggml
source $HOME/.wasmedge/env
# 2. Install Rust target
rustup target add wasm32-wasip1
# 3. Download model
curl -LO https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/resolve/main/all-MiniLM-L6-v2-ggml-model-f16.gguf
# 4. Build both binaries
./build-wasm.sh # Builds WASM module (136KB)
cargo build --bin embedding-api-server --release # Builds HTTP server (1.8MB)
# 5. Run the server
./target/release/embedding-api-server
Build time: ~10 seconds (both binaries)
Output sizes: 136KB (WASM) + 1.8MB (server)
Build Configuration
The Cargo.toml uses feature flags for optimal binary sizes:
[features]
default = ["server"]
wasm = ["wasmedge-wasi-nn"]
server = ["warp", "tokio", "serde", "anyhow"]
[profile.release]
opt-level = 3 # Maximum optimization
lto = true # Link-time optimization
strip = true # Strip debug symbols
Build targets:
# WASM module only (edge devices)
cargo build --bin wasm-embedding --target wasm32-wasip1 --release \
--features wasm --no-default-features
# HTTP server only (cloud/production)
cargo build --bin embedding-api-server --release
Size comparison:
WASM debug: 450KB
WASM release: 136KB (70% reduction)
Server release: 1.8MB
Resources:
- Full implementation: github.com/porameht/wasmedge-ggml-llama-embedding
- Model: https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF