DevPulse: Architecture of a Real-Time AI Workspace

DevPulse started from a simple premise: project data should not be frozen. In most developer tools, AI features operate on a snapshot — data that was indexed last night, or whenever a background job ran. For a workspace where developers are actively writing tasks, updating documentation, and querying the AI mid-flow, that model produces a constant mismatch between what exists and what the AI knows about.

The architecture of DevPulse is an attempt to close that gap. This post describes the system as a whole — how the components connect, where the key boundaries sit, and what tradeoffs were made at each layer.

System overview

At the highest level, DevPulse is a three-plane system: a client plane handling user interaction and rendering, a compute plane handling AI orchestration and business logic, and a data plane handling persistence and vector indexing. All three run on GCP/Firebase, which constrains the infrastructure choices but eliminates operational overhead for a small team.

[Diagram: three-plane system overview. Client plane — React + Vite UI and state management, Firebase SDK realtime listeners, lazy-loaded Mermaid renderer, Tailwind styling; invokes Cloud Functions via HTTPS callables. Compute plane — Cloud Functions Gen 2 on Cloud Run: saveAppEntity (write + synchronous vectorize), chatQueryProject (embed + retrieve + generate), importProject (parse unstructured input into structured entities), and AI hook middleware (pre/post LLM) shared by other features such as slides and timesheets. Data plane — Firestore: structured store (projects / tasks / docs) and vector store (projectVectors/chunks) with a findNearest COSINE index. External AI — Google Gemini 1.5 Flash for generation, text-embedding-004 for 768-dimension vectors.]

The client plane is stateless with respect to AI: it fires requests and reacts to results. AI orchestration is entirely in the compute plane, which means the client never holds a model context or maintains conversation state directly. State is reconstructed on each Cloud Function invocation from Firestore.

The write pipeline and synchronous indexing

The central architectural decision in DevPulse is where vectorization happens relative to the user-visible write confirmation. Two options exist: trigger-based async indexing, where Firestore fires an event after the write and a separate function handles vectorization in the background; and synchronous callable indexing, where a single Cloud Function handles both the structured write and the vector write before returning a response.

The async approach is simpler and cheaper to build, but it creates a consistency window — a period where the structured data exists but the vector does not. For a workspace where a developer adds a task and asks the AI about it immediately, this window is observable and breaks the user's mental model of the system.

The synchronous approach eliminates the consistency window at the cost of coupling write latency to the embedding API's response time. The tail latency of the embedding call (p99 ~1.8s under typical load) becomes the tail latency of the save operation. This is an acceptable tradeoff for a low-write-frequency workspace. It would not be acceptable for a system processing hundreds of writes per second.

[Diagram: write-pipeline consistency tradeoff. Async (trigger): user write → data saved and confirmed immediately → consistency window of roughly 5–15s while the trigger fires and the background function embeds and indexes → vector ready. Sync (callable): user write → saveAppEntity runs ① setDoc → ② embed → ③ vector doc → confirmation only when data and vector are both ready, leaving zero consistency gap. Sync adds embedding latency to the write path (p50 ~400ms, p99 ~1.8s) — acceptable when write frequency is low and query freshness is the priority, not suitable for high-frequency write systems. Choose based on SLO, not on what is easier to build.]
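The synchronous path can be sketched as a single function that performs all three steps before returning. This is an illustrative shape, not DevPulse's actual implementation: function and field names are hypothetical, and the Firestore write and Gemini embedding call are injected as dependencies so the pipeline's structure is visible without SDK specifics.

```typescript
// Hedged sketch of a synchronous write-and-index pipeline. The caller's
// "saved" confirmation arrives only after all three steps succeed.
interface SaveDeps {
  setDoc: (path: string, data: Record<string, unknown>) => Promise<void>;
  embed: (text: string) => Promise<number[]>;
}

async function saveAppEntity(
  deps: SaveDeps,
  projectId: string,
  entityId: string,
  entity: { title: string; body: string }
): Promise<{ ok: true }> {
  // ① Structured write: the entity document itself.
  await deps.setDoc(`projects/${projectId}/entities/${entityId}`, entity);

  // ② Embed: this call's tail latency is now part of the save latency.
  const vector = await deps.embed(`${entity.title}\n${entity.body}`);

  // ③ Vector write: only after this succeeds does the client see "saved",
  // so there is no window where the AI is unaware of the new entity.
  await deps.setDoc(`projectVectors/${projectId}/chunks/${entityId}`, {
    vector,
    sourceEntity: entityId,
  });

  return { ok: true };
}
```

The injected dependencies also make the failure semantics explicit: if the embedding call fails, the caller sees the save fail, rather than silently ending up with a document the AI cannot retrieve.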

The RAG engine: query architecture

The query path for AI chat is a four-stage pipeline: query embedding, vector retrieval, context assembly, and generation. Each stage has architectural implications that compound. A weak decision at the retrieval stage cannot be fixed by a better prompt at the generation stage.

[Diagram: chatQueryProject pipeline. ① Query embed — user message → text-embedding-004 → float32[768], using the same model as index-time embedding. ② Vector retrieval — findNearest(), COSINE, topK=6, scoped to projectId, executed in Firestore infrastructure rather than an in-memory scan. ③ Context assembly — top-6 chunks folded into a grounded system prompt; answer only from context, no gap-filling. ④ Generation — gemini-1.5-flash startChat() with history and system prompt; conversation state maintained per session.]

Two architectural choices here are load-bearing. First, the embedding model used at query time must be identical to the one used at index time. Using a different model — even a newer version of the same model — produces embedding spaces that are not comparable, and retrieval becomes meaningless. Second, the context assembly stage uses a "ground-only" instruction: the model is explicitly told not to fill gaps with its parametric knowledge. This produces refusals for out-of-scope questions, which is the correct behavior for a grounded workspace tool. The alternative produces confident hallucinations about project-specific data.
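The ground-only instruction amounts to a prompt-construction step. A minimal sketch of that context-assembly stage, with wording that is an assumption rather than DevPulse's actual prompt:

```typescript
// Illustrative context assembly: retrieved chunks are folded into a system
// prompt carrying an explicit ground-only instruction. The exact phrasing
// is hypothetical; the structural point is that the refusal behavior is
// configured here, not in the generation model.
function buildGroundedPrompt(chunks: string[]): string {
  const context = chunks.map((c, i) => `[${i + 1}] ${c}`).join("\n\n");
  return [
    "Answer using ONLY the context below.",
    "If the context does not contain the answer, say you do not have",
    "enough information. Do not fill gaps from general knowledge.",
    "",
    "Context:",
    context,
  ].join("\n");
}
```

Numbering the chunks also makes it cheap to ask the model to cite which chunk supported each claim, if attribution is ever needed.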

Vector store architecture

Storing vectors in Firestore without a native index is an O(n) problem — every query must load the entire collection into the function's memory, compute similarity, and return the top results. At 1,000 chunks that is approximately 6MB per query (768 float64 values ≈ 6KB per chunk). At 10,000 chunks it exceeds Cloud Function memory limits. The cost per query scales linearly with the number of stored vectors.
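The naive approach the paragraph above describes, made concrete — every query scores every stored vector in application code. This is a baseline sketch (names are illustrative), and it is exactly the work a native vector index moves out of the function:

```typescript
// O(n) in-function similarity scan: the approach that breaks at scale.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function naiveTopK(
  query: number[],
  chunks: { id: string; vector: number[] }[],
  k: number
): { id: string; score: number }[] {
  // Every stored vector is loaded into memory and scored per query.
  return chunks
    .map((c) => ({ id: c.id, score: cosine(query, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```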

Firestore Vector Search, which reached GA in late 2024, solves this at the infrastructure level. A vector index on the chunks sub-collection allows Firestore to perform approximate nearest-neighbor search internally and return only the top-k results to the function. Query cost and latency become essentially constant regardless of collection size, bounded by result count rather than total vector count.

[Diagram: vector store scaling. Without a vector index, the Cloud Function loads all N chunks into RAM (~6MB per 1,000 chunks), runs an O(n) JavaScript cosine loop, and returns the top 6 — this breaks around 10k chunks. With Firestore Vector Search, findNearest() runs ANN search inside Firestore infrastructure and returns only the top-k, so query cost is bounded by result count and the approach scales to 1M+ vectors. The naive scan has no index overhead (everything is scanned at query time); Vector Search requires one-time index creation per collection.]

The data model in Firestore uses a two-level hierarchy: a projectVectors root collection, scoped by projectId, with a chunks sub-collection inside each project. This scoping means vector search is automatically bounded to the relevant project — there is no need for a filter predicate to isolate one tenant's data from another's. The vector field must be stored as a Firestore VectorValue type (not a plain array) for the index to be applicable.
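The two-level hierarchy can be sketched as a document-shape builder. The field names here are assumptions for illustration; the structural point is that the chunk path embeds the projectId, so a nearest-neighbor query over one project's chunks sub-collection is tenant-scoped by construction.

```typescript
// Hypothetical chunk-document builder for the two-level hierarchy:
// projectVectors/{projectId}/chunks/{chunkId}. In real code the `vector`
// field must be written as a Firestore VectorValue, not this plain array,
// or the vector index will not apply to it.
function chunkDoc(
  projectId: string,
  chunkId: string,
  vector: number[],
  text: string
): { path: string; data: { vector: number[]; text: string } } {
  return {
    path: `projectVectors/${projectId}/chunks/${chunkId}`,
    data: { vector, text },
  };
}
```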

Chunking and the quality of retrieval

Retrieval quality is a function of how text is split before embedding, not just which retrieval algorithm is used. A hard character cap — splitting every 4,000 characters — is the default choice because it is simple to implement. It produces two classes of retrieval errors that compound at scale.

The first is boundary degradation. When a split occurs mid-sentence, the resulting chunk begins or ends with a sentence fragment. Embedding models produce noisier, less discriminative vectors for incomplete linguistic units than for semantically complete ones. The vector for a half-sentence is a noisy signal. At retrieval time, a query that should match a chunk may fail because the most relevant section of the original text was split across two adjacent chunks, neither of which individually contains enough semantic signal to score highly.

The second is metadata dilution. In a multi-entity workspace — projects, tasks, documents, configs — mid-document chunks lose their entity context. A chunk extracted from the middle of a task description carries no indication that it belongs to a specific task with a specific title. A query referencing that task by name will not retrieve the chunk because the embedding distance is high.

[Diagram: chunking comparison. Hard character split — "...the authentication flow is handled ✗CUT✗ by the middleware layer which..." leaves a sentence fragment with no entity context; the query "auth middleware status" scores low because the split destroyed the signal. Paragraph split with title prefix — "[TASK] Implement OAuth middleware / The authentication flow is handled by the middleware layer, which validates tokens before forwarding requests downstream." — scores high because the entity title aligns with the query intent.]

Both problems are addressed by a paragraph-aware split with a 3,800-character budget per chunk, combined with a title prefix injected at the beginning of every chunk — including continuation chunks from multi-part entities. The title prefix ensures that even mid-document chunks carry enough entity context to score well against queries that reference the entity by name. This is a retrieval precision improvement that does not require a model change, a reindex, or any change to the query pipeline.
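The strategy above can be sketched in a few lines: split on paragraph boundaries under a per-chunk character budget, and prefix every chunk — including continuations — with the entity title. The `[TASK]`-style prefix format is an assumption for illustration, not DevPulse's exact serialization.

```typescript
// Paragraph-aware chunker with an entity-title prefix on every chunk.
// Assumed prefix format; oversized single paragraphs are left intact in
// this sketch rather than force-split.
function chunkEntity(title: string, body: string, budget = 3800): string[] {
  const prefix = `[TASK] ${title}\n`;
  const paragraphs = body.split(/\n\s*\n/);
  const chunks: string[] = [];
  let current = "";
  for (const p of paragraphs) {
    // Start a new chunk when adding this paragraph would exceed the budget.
    if (current && (prefix + current + "\n\n" + p).length > budget) {
      chunks.push(prefix + current);
      current = p;
    } else {
      current = current ? current + "\n\n" + p : p;
    }
  }
  if (current) chunks.push(prefix + current);
  return chunks;
}
```

Because splits only ever land on paragraph boundaries, no chunk begins or ends mid-sentence, and because the title prefix is repeated, a continuation chunk still embeds near queries that name the entity.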

AI middleware: the hook architecture

Every AI feature in DevPulse — chat, project import, timesheet parsing, slide generation — makes one or more calls to the Gemini API. Without a shared middleware layer, each feature independently handles concerns like usage tracking, safety filtering, and quality gating. The logic is duplicated, monitoring coverage is inconsistent, and adding a new policy (say, PII scrubbing) requires touching every call site.

The hook architecture solves this with a simple pre/post pipeline that wraps every Gemini call. Pre-hooks transform the input before it reaches the model. Post-hooks inspect and optionally modify the output. The pipeline is applied uniformly regardless of which feature triggered the call.

[Diagram: AI hook middleware pipeline. Feature (chat / import / timesheet) → pre-hooks (PII scrub, input filter, usage log, token count) → Gemini 1.5 Flash generation → post-hooks (quality gate with retry on weak output, hallucination/confidence check) → response to client. What the pipeline enables: PII scrubbing applied to every feature input without touching feature code; a quality gate where weak responses (low confidence, explicit "I don't know") automatically trigger expanded RAG retrieval (topK=12); and centralized usage tracking — token counts, latency, and feature attribution logged for cost allocation and monitoring.]

The quality gate post-hook deserves particular attention as an architectural pattern. When the model returns a response that signals insufficient context — phrases like "I don't have enough information" or a low confidence score — the hook can automatically re-run the retrieval stage with a larger top-k value (6 → 12) and regenerate. This gives the system a self-correcting mechanism for retrieval failures without requiring the client to implement retry logic or the developer to manually tune retrieval parameters per feature.
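The pattern can be sketched as a wrapper that threads input through pre-hooks, calls the model, applies the quality-gate retry, then threads output through post-hooks. Retrieval and generation are injected stubs and all names are hypothetical; the weak-response check here is a crude regex stand-in for DevPulse's actual signal.

```typescript
// Minimal pre/post hook pipeline with a one-shot quality-gate retry.
type Hook = (text: string) => string;

interface CallDeps {
  retrieve: (topK: number) => string[];
  generate: (prompt: string, context: string[]) => string;
}

function runWithHooks(
  deps: CallDeps,
  preHooks: Hook[],
  postHooks: Hook[],
  userPrompt: string
): string {
  // Pre-hooks transform input (e.g. PII scrubbing) before the model sees it.
  const prompt = preHooks.reduce((p, h) => h(p), userPrompt);

  let output = deps.generate(prompt, deps.retrieve(6));

  // Quality gate: a weak answer triggers one expanded-retrieval retry (6 → 12).
  if (/don't have enough information/i.test(output)) {
    output = deps.generate(prompt, deps.retrieve(12));
  }

  // Post-hooks inspect and optionally modify output before it reaches the client.
  return postHooks.reduce((o, h) => h(o), output);
}
```

Because every feature routes its calls through this one wrapper, adding a new policy means adding one hook, not editing every call site.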

Frontend rendering: Mermaid and the bundle problem

Developers want diagrams. Mermaid.js is the practical choice for a developer-facing tool because it renders from markdown-style syntax and covers the most common diagram types — flowcharts, sequence diagrams, Gantt charts — that appear in project documentation.

Mermaid v11 introduced a dynamic import architecture: the core library lazy-loads diagram subtypes on demand. This is correct in theory — you should not load a Gantt chart parser if the user only ever renders flowcharts. In practice, it creates a production problem when combined with Vite's content-hash chunk strategy.

After a new deployment, old chunk filenames are removed. A user whose browser has cached the old app shell will try to dynamically import a diagram subtype chunk at a URL that no longer exists. The import fails, the diagram does not render, and there is no graceful fallback — the component simply breaks. This is a category of failure distinct from slow rendering or incorrect output.

[Diagram: stale-chunk failure and fix. Problem — an old app shell in the browser cache calls import() on mermaid-flowchart-abc123.js, receives a 404, and the diagram breaks. Fix — on a new deploy, the service worker detects the update and force-reloads the shell; new hash URLs are then in place and every import() resolves the correct chunk. Bundle note: lazy import per diagram type keeps the initial bundle small; a manualChunks mega-bundle solves stale imports but blocks first render — the wrong tradeoff.]

The fix has two parts that must both be present. Content-hash filenames — Vite's default behavior — ensure that chunk URLs are stable within a deployment. A service worker that detects version changes and forces a reload of the app shell ensures that stale chunk references from old deployments are never attempted. Together, they prevent the failure case. The alternative approach — bundling all Mermaid diagram types into a single large vendor chunk — removes the stale import problem but introduces a synchronous load cost that blocks initial render. Both problems exist independently. Both require their own fix.
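A complementary client-side mitigation — not described in the DevPulse implementation, sketched here as an assumption — is to wrap the dynamic import so a failed chunk load triggers at most one reload instead of a permanently broken component. Storage and reload are injected so the sketch stays testable; in a browser these would be sessionStorage and location.reload().

```typescript
// Illustrative fallback for a stale dynamic import: reload once (letting
// the service worker serve the fresh shell) rather than render a broken
// component. A session-scoped flag guards against reload loops.
async function importWithReloadFallback<T>(
  loader: () => Promise<T>,
  reload: () => void,
  storage: { get(k: string): string | null; set(k: string, v: string): void }
): Promise<T | null> {
  try {
    return await loader();
  } catch {
    if (storage.get("chunk-reload") !== "1") {
      storage.set("chunk-reload", "1");
      reload();
    }
    return null;
  }
}
```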

Architecture summary

DevPulse is not a complex system by modern standards. The interest is in how a small number of design decisions — where to place the indexing boundary, how to use vector search primitives correctly, how to represent chunk metadata, how to abstract LLM calls — determine whether the system behaves coherently at the level of user experience.

Layer · Key decision · What it enables
Write pipeline · Sync callable over async trigger · Zero consistency window between write and AI awareness
Vector retrieval · Firestore Vector Search (ANN) · Constant-cost retrieval regardless of collection size
Chunking · Paragraph split + entity title prefix · Higher retrieval precision without model change
Generation · Ground-only instruction · Eliminates confident hallucination on project-specific data
AI middleware · Pre/post hook pipeline · Monitoring, safety, quality gating written once across all features
Frontend · Content hash + SW reload · Eliminates stale chunk failures after deploy

The stack itself is conventional. The value is in the specific choices at each layer and the understanding of what each choice costs and what it buys. An async trigger is cheaper to build and breaks user trust. A synchronous callable is harder to build and maintains it. That exchange is the work.
