engram: shared memory for ai tools

the problem

chatgpt used to be the one tool for everything. folders for health, finances, work, personal stuff—your whole life organized into conversations. one tool that knew you.

that's not how it works anymore. claude code for implementation. poke or chatgpt for thinking on mobile. cursor for a different codebase. gemini for a quick lookup. the tools multiplied, and the context got fragmented.

the issue isn't losing a coding session—it's losing the high-level thread. you're planning something in claude code on your laptop. you step out, pull up poke on your phone, and you're starting from zero. not the exact conversation—just the basics. what you're working on, what you've decided, what your preferences are. every tool has its own silo.

engram is a fix for this. it's an MCP server that acts as a shared memory layer across all your ai tools. open source, self-hostable.

stack: python, fastapi, sqlite, openai embeddings, MCP

cost: ~$5-10/month (railway + openai)

github: github.com/namanxajmera/engram

how it works

engram stores facts—your preferences, your projects, your decisions—and retrieves them when any connected tool needs context. every tool reads and writes to the same memory.

flowchart LR
    A[Claude Code] -->|read/write| E[Engram<br/>Memory Service]
    B[Poke] -->|read/write| E
    C[Cursor] -->|read/write| E
    D[Any MCP Client] -->|read/write| E
    E -->|store| F[SQLite<br/>+ Vectors]
    E -->|embed| G[OpenAI<br/>Embeddings]

when claude code learns something worth remembering, it stores it. when poke starts a new conversation, it searches for context first. memory accumulates across tools, and every tool benefits from what the others stored.
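a minimal sketch of that shared store, assuming a single sqlite table. the table layout, the function names, and the substring search are all placeholders for illustration, not engram's actual schema; real retrieval is covered in the search pipeline section.

```python
import sqlite3
from datetime import datetime, timezone


def open_store(path=":memory:"):
    """Open (or create) the shared memory database. Schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS memories ("
        " id INTEGER PRIMARY KEY,"
        " content TEXT NOT NULL,"
        " source TEXT NOT NULL,"      # which tool wrote the fact
        " created_at TEXT NOT NULL)"
    )
    return conn


def store(conn, content, source):
    """Write one fact, tagged with the tool that stored it."""
    cur = conn.execute(
        "INSERT INTO memories (content, source, created_at) VALUES (?, ?, ?)",
        (content, source, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return cur.lastrowid


def search(conn, term):
    """Placeholder substring match; stands in for the real search pipeline."""
    rows = conn.execute(
        "SELECT content, source FROM memories WHERE content LIKE ? ORDER BY id",
        (f"%{term}%",),
    )
    return rows.fetchall()


conn = open_store()
store(conn, "prefers fastapi over flask", source="claude-code")
store(conn, "planning a memory service called engram", source="poke")
print(search(conn, "fastapi"))  # a fact written by one tool, readable by any other
```

every tool talks to the same database, so the source column is the only thing that distinguishes who wrote what.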

the transport is streamable HTTP, not stdio. in plain terms: many MCP servers use stdio, where the client launches the server as a local subprocess and talks to it over a pipe, so each server process serves one tool on one machine. engram runs as a web server, so claude code, poke, and any other tool can all connect to the same instance simultaneously.

the search pipeline

storing is easy. retrieval is the whole game.

you stored "switched from cursor to claude code in january." you search "what coding tools do i use?" keyword search finds nothing—zero overlapping words. you need search that understands meaning, not just spelling.

engram runs three searches in parallel. think of it like asking three people with different skills the same question and combining their answers:

flowchart TD
    Q[Search Query] --> E[Convert to Numbers<br/>embedding]
    E --> V[1. Meaning Match<br/>vector similarity]
    Q --> B[2. Keyword Match<br/>BM25]
    E --> R[3. Question Match<br/>reverse HyDE]
    V --> F[Merge Rankings]
    B --> F
    R --> F
    F --> X[Re-score Top Results<br/>cross-encoder]
    X --> T[Final Results]

1. meaning match. text gets converted into coordinates in a high-dimensional space, like plotting words on a map where related concepts sit close together. "coding tools", "cursor", and "claude code" all land in the same neighborhood, even though they share no exact words. search = find the nearest neighbors on this map.

2. keyword match. sometimes the map overthinks it. "JIRA-1234" is not a concept—it's an exact string. BM25 is a decades-old text search algorithm that counts word overlap, weighted by rarity. runs on sqlite's built-in full-text search. fast, dumb, and exactly right for this case.

3. question match (reverse HyDE). this one is less obvious.

the problem: searching a question against stored statements doesn't match well. "what framework do you use?" and "prefers fastapi over flask" are about the same thing, but they use completely different language and sit far apart on the map.

flowchart LR
    A["store: prefers<br/>fastapi over flask"] --> B[save to SQLite<br/>done, ~1s]
    A -.->|background| D[LLM generates<br/>questions]
    D --> E["Q: what python framework<br/>does this person use?"]
    D --> F["Q: flask or fastapi<br/>preference?"]
    E --> G[embed & store<br/>alongside memory]
    F --> G

the fix: when a memory is stored, a background job generates questions it would answer. "prefers fastapi over flask" → "what python framework?" / "flask or fastapi?" those questions get stored alongside the fact. now at search time, it's matching question against question—same language patterns, same region of the map. much better hit rate.

the cost is a few seconds of LLM work per write. writes are rare, reads are constant, so it's a good trade.
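a toy sketch of the question-match idea. two loud assumptions: embed() here is a bag-of-words counter standing in for a real embedding model, and generate_questions() is a canned stub standing in for the background LLM call. neither is engram's code; the point is only to show why question-against-question scores higher than question-against-fact.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: word counts as a sparse vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def generate_questions(fact):
    """Hypothetical stub for the LLM that writes questions a fact would answer."""
    canned = {
        "prefers fastapi over flask": [
            "what python framework does this person use?",
            "flask or fastapi preference?",
        ]
    }
    return canned.get(fact, [])


def index_memory(fact):
    """At write time: embed the fact and each generated question."""
    return {
        "content": fact,
        "fact_vec": embed(fact),
        "question_vecs": [embed(q) for q in generate_questions(fact)],
    }


def score(query, memory):
    """At search time: compare the query to the fact and to its stored questions."""
    q = embed(query)
    direct = cosine(q, memory["fact_vec"])
    via_questions = max((cosine(q, v) for v in memory["question_vecs"]), default=0.0)
    return direct, via_questions


memory = index_memory("prefers fastapi over flask")
direct, via_questions = score("what framework do you use?", memory)
print(f"query vs fact: {direct:.2f}, query vs stored questions: {via_questions:.2f}")
```

with a real embedding model the direct score would not be zero, but the gap between the two scores is the effect reverse HyDE is after.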

method	good at	bad at
meaning	"coding tools" finds "cursor", "claude code"	exact IDs, ticket numbers
keyword	"JIRA-1234" finds exactly that	different words, same meaning
question	"what framework?" finds "prefers fastapi"	a few extra seconds per write
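the keyword row can be tried directly with sqlite's FTS5 extension. this is a sketch, not engram's schema: the table name, column name, and sample rows are made up, and it assumes your python's sqlite build includes FTS5 (most do).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical full-text table; FTS5 ships with most sqlite builds.
conn.execute("CREATE VIRTUAL TABLE memories_fts USING fts5(body)")
conn.executemany(
    "INSERT INTO memories_fts (body) VALUES (?)",
    [
        ("switched from cursor to claude code in january",),
        ("JIRA-1234 tracks the login redirect bug",),
        ("prefers fastapi over flask",),
    ],
)


def keyword_search(conn, query, limit=5):
    """Return rows containing every query term, best BM25 score first."""
    terms = query.split()
    if not terms:
        return []
    # Quote each term so characters like "-" are treated as text, not operators.
    match = " ".join('"' + t.replace('"', '""') + '"' for t in terms)
    rows = conn.execute(
        "SELECT body FROM memories_fts WHERE memories_fts MATCH ? "
        "ORDER BY bm25(memories_fts) LIMIT ?",  # lower bm25() means a better match
        (match, limit),
    )
    return [r[0] for r in rows]


print(keyword_search(conn, "JIRA-1234"))                    # exact ID: found
print(keyword_search(conn, "what coding tools do i use?"))  # same meaning, different words: nothing
```

the second query is the failure case from the table: the stored fact is relevant, but no stored row contains all of those words.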

merging the results. three ranked lists, three incompatible scoring systems. can't compare a 0.85 similarity score to a 12.3 keyword score—different units. reciprocal rank fusion ignores the scores and only looks at position. if something ranks #1 in one search and #3 in another, that's a strong signal regardless of the raw numbers.
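reciprocal rank fusion fits in a few lines. a sketch under assumptions: the constant k=60 is the value commonly used for RRF, not a confirmed engram setting, and the memory IDs are invented.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists by position only: each hit adds 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest combined score first; raw similarity/BM25 scores never enter.
    return sorted(scores, key=lambda item: scores[item], reverse=True)


# Hypothetical memory IDs returned by the three searches, best first.
meaning = ["m1", "m2", "m3"]
keyword = ["m3", "m1"]
question = ["m1", "m4"]

fused = reciprocal_rank_fusion([meaning, keyword, question])
print(fused)
```

m1 appears near the top of all three lists, so it wins; m3 shows up in two lists and beats anything that appears in only one.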

final reranking. the top candidates get re-scored by a cross-encoder—a small local model (23MB, no API calls) that reads the query and each result together as a pair. more accurate than the initial searches, which evaluate query and result independently. too slow for every memory, but ideal for the shortlist.

whole pipeline: under one second. repeated queries skip the API call entirely via an embedding cache.

temporal versioning

facts change. your tech stack evolves, you switch tools, you change your mind. engram doesn't overwrite; it versions. when a memory is updated, the old version moves into a history array, along with its timestamps and the tool that wrote it.

{
  "content": "Tech stack: Python, FastAPI, SQLite, OpenAI embeddings",
  "source": "claude-code",
  "valid_at": "2026-03-22",
  "history": [
    {
      "content": "Tech stack: Python, FastAPI, SQLite, ONNX embeddings",
      "source": "claude-code",
      "valid_at": "2026-03-21",
      "invalid_at": "2026-03-22"
    }
  ]
}

nothing gets deleted. if claude code updates a fact that poke stored last week, both versions are preserved. any tool can see the full timeline of how a fact evolved, who changed it, and when.
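the update step can be sketched as a pure function over the record shape shown above. the function name and the choice to append to the end of history are assumptions; only the field names come from the example record.

```python
import copy
import json


def update_memory(memory, new_content, source, today):
    """Return a new version of the record; the old one is kept in history."""
    updated = copy.deepcopy(memory)
    previous = {
        "content": memory["content"],
        "source": memory["source"],
        "valid_at": memory["valid_at"],
        "invalid_at": today,  # the old version stops being valid today
    }
    updated["history"] = updated.get("history", []) + [previous]
    updated.update(content=new_content, source=source, valid_at=today)
    return updated


memory = {
    "content": "Tech stack: Python, FastAPI, SQLite, ONNX embeddings",
    "source": "claude-code",
    "valid_at": "2026-03-21",
    "history": [],
}
updated = update_memory(
    memory,
    "Tech stack: Python, FastAPI, SQLite, OpenAI embeddings",
    source="claude-code",
    today="2026-03-22",
)
print(json.dumps(updated, indent=2))
```

the original record is left untouched, which makes the function easy to test and keeps the write path free of in-place edits.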

passive context

beyond the search tools, engram exposes three MCP resources. these are read-only feeds that a client loads when it connects (in clients that support MCP resources). no tool call needed, no agent decision required:

engram://profile — identity, preferences, working style
engram://recent — 10 most recently updated memories
engram://projects — all project context

this means every new conversation starts with baseline context already loaded. the agent knows who you are and what you're working on before you type anything.
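a sketch of how the three resources could map onto stored memories. the URIs are the ones listed above; everything else (the category and updated_at fields, the function names, the in-memory list) is hypothetical and only meant to show the shape.

```python
def build_resources(memories):
    """Map each resource URI to a zero-argument reader over the memory list."""

    def by_category(category):
        return [m for m in memories if m["category"] == category]

    def recent(limit=10):
        # ISO date strings sort chronologically, so newest comes first.
        return sorted(memories, key=lambda m: m["updated_at"], reverse=True)[:limit]

    return {
        "engram://profile": lambda: by_category("profile"),
        "engram://recent": recent,
        "engram://projects": lambda: by_category("project"),
    }


memories = [
    {"content": "prefers fastapi over flask", "category": "profile", "updated_at": "2026-03-20"},
    {"content": "engram: shared memory for ai tools", "category": "project", "updated_at": "2026-03-22"},
]
resources = build_resources(memories)
for uri, read in resources.items():
    print(uri, [m["content"] for m in read()])
```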

the stack

layer	what	why
runtime	python 3.12 / fastapi	fast to build, great MCP SDK support
protocol	MCP (streamable HTTP)	multiple tools connect simultaneously
database	sqlite + sqlite-vec + FTS5	one file for everything: data, vectors, full-text search
embeddings	openai text-embedding-3-small	1536 dimensions, ~$0.02 per million tokens
reranker	ONNX ms-marco-MiniLM-L-6-v2	runs locally (23MB), no API costs
question gen	gpt-5.4-nano	generates reverse HyDE questions, cheapest model that works
hosting	railway + persistent volume	~$5-10/month, deploys with one command

no postgres, no redis, no infrastructure to manage. the entire database is one sqlite file. the reranker runs locally on the server. deploys with railway up and two environment variables.

v1 used local ONNX embeddings (all-MiniLM-L6-v2) to keep everything self-contained with zero external dependencies. switched to openai embeddings because the search quality was noticeably better: 1536 dimensions capture more nuance than 384. the cross-encoder reranker stayed local because it does a different job (scoring pairs of texts against each other, not converting text to numbers) and the local model handles that well.

getting agents to use it proactively

you can't just expose MCP tools and hope agents call them. they won't unless you tell them to. two things that work:

tool descriptions as behavioral instructions. most MCP tool descriptions explain what the tool does. engram's also say when—the search tool says "use proactively at the start of every conversation," the store tool says "use when you learn something worth remembering across sessions."
system prompt reinforcement. in each tool's config (CLAUDE.md, poke's system prompt, etc.), add explicit instructions: search at conversation start, store when you learn something new, tag writes with your tool name. tool descriptions plus system prompts together get agents to use memory without being asked.
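what "descriptions as behavioral instructions" can look like, sketched as MCP-style tool definitions (name, description, inputSchema). the tool names, the first sentence of each description, and the argument schemas are hypothetical; the "when to use" sentences are the ones quoted above.

```python
import json

# Hypothetical tool definitions; only the "use ..." sentences come from the text.
SEARCH_TOOL = {
    "name": "search_memory",
    "description": (
        "Search shared memory for facts about the user. "
        "Use proactively at the start of every conversation."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

STORE_TOOL = {
    "name": "store_memory",
    "description": (
        "Store a fact in shared memory. "
        "Use when you learn something worth remembering across sessions."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "content": {"type": "string"},
            "source": {"type": "string", "description": "Name of the tool writing the fact."},
        },
        "required": ["content", "source"],
    },
}

print(json.dumps([SEARCH_TOOL, STORE_TOOL], indent=2))
```

the first sentence says what the tool does; the second says when to reach for it. requiring source on writes is one way to enforce the "tag writes with your tool name" rule at the schema level.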

setup

# clone and deploy
git clone https://github.com/namanxajmera/engram
cd engram && railway up

# connect claude code
claude mcp add engram https://your-url/mcp \
  -t http -s user \
  -H "Authorization: Bearer YOUR_API_KEY"

# any other MCP client: point at /mcp with a Bearer token

what's next

right now, memories get stored when an agent decides something is worth remembering. that works but it's passive—it depends on the agent's judgment in the moment, and agents miss things.

the next step is auto-distillation: a scheduled job that reads through conversation transcripts, identifies facts, preferences, and decisions, and writes them to engram automatically. the memory layer gets richer over time without any individual tool or person having to think about it.

flowchart LR
    A[conversation<br/>transcripts] -->|scheduled job| B[LLM extracts<br/>facts & decisions]
    B --> C[writes to<br/>engram]
    C --> D[all tools get<br/>richer context]
    D -.->|next conversation| A
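since this part is not built yet, the following is only a guess at the shape of the loop. fake_extractor() is a stub standing in for the LLM, and the case-insensitive de-duplication and the "auto-distill" source tag are assumptions added for the sketch.

```python
def distill(transcripts, extract_facts, existing, source="auto-distill"):
    """Run the extractor over each transcript and keep facts not already stored."""
    seen = {m["content"].lower() for m in existing}
    new_memories = []
    for transcript in transcripts:
        for fact in extract_facts(transcript):
            cleaned = fact.strip()
            key = cleaned.lower()
            if key and key not in seen:  # skip blanks and duplicates
                seen.add(key)
                new_memories.append({"content": cleaned, "source": source})
    return new_memories


def fake_extractor(transcript):
    """Hypothetical stand-in for the LLM: keeps lines marked 'decision:'."""
    return [
        line.split(":", 1)[1]
        for line in transcript.splitlines()
        if line.lower().startswith("decision:")
    ]


transcripts = [
    "user: let's plan storage\ndecision: use sqlite for storage\nuser: ok",
    "decision: Use SQLite for storage\ndecision: deploy on railway",
]
new_memories = distill(transcripts, fake_extractor, existing=[])
print(new_memories)
```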

ai tools don't need to share conversations. they need to share context. a memory layer with good search is enough to make every tool aware of who you are and what you're doing, from the first message.

the name comes from neuroscience—an engram is the physical trace of a memory in the brain.