DevPulse: Architecture of a Real-Time AI Workspace

DevPulse started from a simple premise: project data should not be frozen. In most developer tools, AI features operate on a snapshot — data that was indexed last night, or whenever a background job ran. For a workspace where developers are actively writing tasks, updating documentation, and querying the AI mid-flow, that model produces a constant mismatch between what exists and what the AI knows about.

The architecture of DevPulse is an attempt to close that gap. This post describes the system as a whole — how the components connect, where the key boundaries sit, and what tradeoffs were made at each layer.

System overview

At the highest level, DevPulse is a three-plane system: a client plane handling user interaction and rendering, a compute plane handling AI orchestration and business logic, and a data plane handling persistence and vector indexing. All three run on GCP/Firebase, which constrains the infrastructure choices but eliminates operational overhead for a small team.

[Diagram: three-plane system overview. Client plane — React + Vite UI and state management, Firebase SDK realtime listeners, lazy-loaded Mermaid renderer, Tailwind styling; invokes Cloud Functions via HTTPS callables. Compute plane — Cloud Functions Gen 2 on Cloud Run: saveAppEntity (write + synchronous vectorize), chatQueryProject (embed + retrieve + generate), importProject (parse unstructured input into structured entities), and AI hook middleware (pre/post LLM) shared by other features such as slides and timesheets. Data plane — Firestore: structured store (projects / tasks / docs) and vector store (projectVectors/chunks) with a findNearest COSINE index. External AI — Google Gemini 1.5 Flash for generation, text-embedding-004 for 768-dimension vectors.]

The client plane is stateless with respect to AI: it fires requests and reacts to results. AI orchestration is entirely in the compute plane, which means the client never holds a model context or maintains conversation state directly. State is reconstructed on each Cloud Function invocation from Firestore.

The write pipeline and synchronous indexing

The central architectural decision in DevPulse is where vectorization happens relative to the user-visible write confirmation. Two options exist: trigger-based async indexing, where Firestore fires an event after the write and a separate function handles vectorization in the background; and synchronous callable indexing, where a single Cloud Function handles both the structured write and the vector write before returning a response.

The async approach is simpler and cheaper to build, but it creates a consistency window — a period where the structured data exists but the vector does not. For a workspace where a developer adds a task and asks the AI about it immediately, this window is observable and breaks the user's mental model of the system.

The synchronous approach eliminates the consistency window at the cost of coupling write latency to the embedding API's response time. The tail latency of the embedding call (p99 ~1.8s under typical load) becomes the tail latency of the save operation. This is an acceptable tradeoff for a low-write-frequency workspace. It would not be acceptable for a system processing hundreds of writes per second.

[Diagram: write-pipeline consistency tradeoff. Async (trigger): user write → data saved and confirmed immediately → consistency window of roughly 5–15s while the trigger fires and the background function embeds and indexes → vector ready. Sync (callable): user write → saveAppEntity runs ① setDoc → ② embed → ③ vector doc → confirmation only when data and vector are both ready, leaving zero consistency gap. Sync adds embedding latency to the write path (p50 ~400ms, p99 ~1.8s) — acceptable when write frequency is low and query freshness is the priority, not suitable for high-frequency write systems. Choose based on SLO, not on what is easier to build.]
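The synchronous path can be sketched as a single function that performs all three steps before returning. This is an illustrative shape, not DevPulse's actual implementation: function and field names are hypothetical, and the Firestore write and Gemini embedding call are injected as dependencies so the pipeline's structure is visible without SDK specifics.

```typescript
// Hedged sketch of a synchronous write-and-index pipeline. The caller's
// "saved" confirmation arrives only after all three steps succeed.
interface SaveDeps {
  setDoc: (path: string, data: Record<string, unknown>) => Promise<void>;
  embed: (text: string) => Promise<number[]>;
}

async function saveAppEntity(
  deps: SaveDeps,
  projectId: string,
  entityId: string,
  entity: { title: string; body: string }
): Promise<{ ok: true }> {
  // ① Structured write: the entity document itself.
  await deps.setDoc(`projects/${projectId}/entities/${entityId}`, entity);

  // ② Embed: this call's tail latency is now part of the save latency.
  const vector = await deps.embed(`${entity.title}\n${entity.body}`);

  // ③ Vector write: only after this succeeds does the client see "saved",
  // so there is no window where the AI is unaware of the new entity.
  await deps.setDoc(`projectVectors/${projectId}/chunks/${entityId}`, {
    vector,
    sourceEntity: entityId,
  });

  return { ok: true };
}
```

The injected dependencies also make the failure semantics explicit: if the embedding call fails, the caller sees the save fail, rather than silently ending up with a document the AI cannot retrieve.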

The RAG engine: query architecture

The query path for AI chat is a four-stage pipeline: query embedding, vector retrieval, context assembly, and generation. Each stage has architectural implications that compound. A weak decision at the retrieval stage cannot be fixed by a better prompt at the generation stage.

[Diagram: chatQueryProject pipeline. ① Query embed — user message → text-embedding-004 → float32[768], using the same model as index-time embedding. ② Vector retrieval — findNearest(), COSINE, topK=6, scoped to projectId, executed in Firestore infrastructure rather than an in-memory scan. ③ Context assembly — top-6 chunks folded into a grounded system prompt; answer only from context, no gap-filling. ④ Generation — gemini-1.5-flash startChat() with history and system prompt; conversation state maintained per session.]

Two architectural choices here are load-bearing. First, the embedding model used at query time must be identical to the one used at index time. Using a different model — even a newer version of the same model — produces embedding spaces that are not comparable, and retrieval becomes meaningless. Second, the context assembly stage uses a "ground-only" instruction: the model is explicitly told not to fill gaps with its parametric knowledge. This produces refusals for out-of-scope questions, which is the correct behavior for a grounded workspace tool. The alternative produces confident hallucinations about project-specific data.
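The ground-only instruction amounts to a prompt-construction step. A minimal sketch of that context-assembly stage, with wording that is an assumption rather than DevPulse's actual prompt:

```typescript
// Illustrative context assembly: retrieved chunks are folded into a system
// prompt carrying an explicit ground-only instruction. The exact phrasing
// is hypothetical; the structural point is that the refusal behavior is
// configured here, not in the generation model.
function buildGroundedPrompt(chunks: string[]): string {
  const context = chunks.map((c, i) => `[${i + 1}] ${c}`).join("\n\n");
  return [
    "Answer using ONLY the context below.",
    "If the context does not contain the answer, say you do not have",
    "enough information. Do not fill gaps from general knowledge.",
    "",
    "Context:",
    context,
  ].join("\n");
}
```

Numbering the chunks also makes it cheap to ask the model to cite which chunk supported each claim, if attribution is ever needed.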

Vector store architecture

Storing vectors in Firestore without a native index is an O(n) problem — every query must load the entire collection into the function's memory, compute similarity, and return the top results. At 1,000 chunks that is approximately 6MB per query (768 float64 values ≈ 6KB per chunk). At 10,000 chunks it exceeds Cloud Function memory limits. The cost per query scales linearly with the number of stored vectors.
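The naive approach the paragraph above describes, made concrete — every query scores every stored vector in application code. This is a baseline sketch (names are illustrative), and it is exactly the work a native vector index moves out of the function:

```typescript
// O(n) in-function similarity scan: the approach that breaks at scale.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function naiveTopK(
  query: number[],
  chunks: { id: string; vector: number[] }[],
  k: number
): { id: string; score: number }[] {
  // Every stored vector is loaded into memory and scored per query.
  return chunks
    .map((c) => ({ id: c.id, score: cosine(query, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```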

Firestore Vector Search, which reached GA in late 2024, solves this at the infrastructure level. A vector index on the chunks sub-collection allows Firestore to perform approximate nearest-neighbor search internally and return only the top-k results to the function. Query cost and latency become essentially constant regardless of collection size, bounded by result count rather than total vector count.

[Diagram: vector store scaling. Without a vector index, the Cloud Function loads all N chunks into RAM (~6MB per 1,000 chunks), runs an O(n) JavaScript cosine loop, and returns the top 6 — this breaks around 10k chunks. With Firestore Vector Search, findNearest() runs ANN search inside Firestore infrastructure and returns only the top-k, so query cost is bounded by result count and the approach scales to 1M+ vectors. The naive scan has no index overhead (everything is scanned at query time); Vector Search requires one-time index creation per collection.]

The data model in Firestore uses a two-level hierarchy: a projectVectors root collection, scoped by projectId, with a chunks sub-collection inside each project. This scoping means vector search is automatically bounded to the relevant project — there is no need for a filter predicate to isolate one tenant's data from another's. The vector field must be stored as a Firestore VectorValue type (not a plain array) for the index to be applicable.
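The two-level hierarchy can be sketched as a document-shape builder. The field names here are assumptions for illustration; the structural point is that the chunk path embeds the projectId, so a nearest-neighbor query over one project's chunks sub-collection is tenant-scoped by construction.

```typescript
// Hypothetical chunk-document builder for the two-level hierarchy:
// projectVectors/{projectId}/chunks/{chunkId}. In real code the `vector`
// field must be written as a Firestore VectorValue, not this plain array,
// or the vector index will not apply to it.
function chunkDoc(
  projectId: string,
  chunkId: string,
  vector: number[],
  text: string
): { path: string; data: { vector: number[]; text: string } } {
  return {
    path: `projectVectors/${projectId}/chunks/${chunkId}`,
    data: { vector, text },
  };
}
```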

Chunking and the quality of retrieval

Retrieval quality is a function of how text is split before embedding, not just which retrieval algorithm is used. A hard character cap — splitting every 4,000 characters — is the default choice because it is simple to implement. It produces two classes of retrieval errors that compound at scale.

The first is boundary degradation. When a split occurs mid-sentence, the resulting chunk begins or ends with a sentence fragment. Embedding models produce noisier, less discriminative vectors for incomplete linguistic units than for semantically complete ones. The vector for a half-sentence is a noisy signal. At retrieval time, a query that should match a chunk may fail because the most relevant section of the original text was split across two adjacent chunks, neither of which individually contains enough semantic signal to score highly.

The second is metadata dilution. In a multi-entity workspace — projects, tasks, documents, configs — mid-document chunks lose their entity context. A chunk extracted from the middle of a task description carries no indication that it belongs to a specific task with a specific title. A query referencing that task by name will not retrieve the chunk because the embedding distance is high.

[Diagram: chunking comparison. Hard character split — "...the authentication flow is handled ✗CUT✗ by the middleware layer which..." leaves a sentence fragment with no entity context; the query "auth middleware status" scores low because the split destroyed the signal. Paragraph split with title prefix — "[TASK] Implement OAuth middleware / The authentication flow is handled by the middleware layer, which validates tokens before forwarding requests downstream." — scores high because the entity title aligns with the query intent.]

Both problems are addressed by a paragraph-aware split with a 3,800-character budget per chunk, combined with a title prefix injected at the beginning of every chunk — including continuation chunks from multi-part entities. The title prefix ensures that even mid-document chunks carry enough entity context to score well against queries that reference the entity by name. This is a retrieval precision improvement that does not require a model change, a reindex, or any change to the query pipeline.
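The strategy above can be sketched in a few lines: split on paragraph boundaries under a per-chunk character budget, and prefix every chunk — including continuations — with the entity title. The `[TASK]`-style prefix format is an assumption for illustration, not DevPulse's exact serialization.

```typescript
// Paragraph-aware chunker with an entity-title prefix on every chunk.
// Assumed prefix format; oversized single paragraphs are left intact in
// this sketch rather than force-split.
function chunkEntity(title: string, body: string, budget = 3800): string[] {
  const prefix = `[TASK] ${title}\n`;
  const paragraphs = body.split(/\n\s*\n/);
  const chunks: string[] = [];
  let current = "";
  for (const p of paragraphs) {
    // Start a new chunk when adding this paragraph would exceed the budget.
    if (current && (prefix + current + "\n\n" + p).length > budget) {
      chunks.push(prefix + current);
      current = p;
    } else {
      current = current ? current + "\n\n" + p : p;
    }
  }
  if (current) chunks.push(prefix + current);
  return chunks;
}
```

Because splits only ever land on paragraph boundaries, no chunk begins or ends mid-sentence, and because the title prefix is repeated, a continuation chunk still embeds near queries that name the entity.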

AI middleware: the hook architecture

Every AI feature in DevPulse — chat, project import, timesheet parsing, slide generation — makes one or more calls to the Gemini API. Without a shared middleware layer, each feature independently handles concerns like usage tracking, safety filtering, and quality gating. The logic is duplicated, monitoring coverage is inconsistent, and adding a new policy (say, PII scrubbing) requires touching every call site.

The hook architecture solves this with a simple pre/post pipeline that wraps every Gemini call. Pre-hooks transform the input before it reaches the model. Post-hooks inspect and optionally modify the output. The pipeline is applied uniformly regardless of which feature triggered the call.

[Diagram: AI hook middleware pipeline. Feature (chat / import / timesheet) → pre-hooks (PII scrub, input filter, usage log, token count) → Gemini 1.5 Flash generation → post-hooks (quality gate with retry on weak output, hallucination/confidence check) → response to client. What the pipeline enables: PII scrubbing applied to every feature input without touching feature code; a quality gate where weak responses (low confidence, explicit "I don't know") automatically trigger expanded RAG retrieval (topK=12); and centralized usage tracking — token counts, latency, and feature attribution logged for cost allocation and monitoring.]

The quality gate post-hook deserves particular attention as an architectural pattern. When the model returns a response that signals insufficient context — phrases like "I don't have enough information" or a low confidence score — the hook can automatically re-run the retrieval stage with a larger top-k value (6 → 12) and regenerate. This gives the system a self-correcting mechanism for retrieval failures without requiring the client to implement retry logic or the developer to manually tune retrieval parameters per feature.
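The pattern can be sketched as a wrapper that threads input through pre-hooks, calls the model, applies the quality-gate retry, then threads output through post-hooks. Retrieval and generation are injected stubs and all names are hypothetical; the weak-response check here is a crude regex stand-in for DevPulse's actual signal.

```typescript
// Minimal pre/post hook pipeline with a one-shot quality-gate retry.
type Hook = (text: string) => string;

interface CallDeps {
  retrieve: (topK: number) => string[];
  generate: (prompt: string, context: string[]) => string;
}

function runWithHooks(
  deps: CallDeps,
  preHooks: Hook[],
  postHooks: Hook[],
  userPrompt: string
): string {
  // Pre-hooks transform input (e.g. PII scrubbing) before the model sees it.
  const prompt = preHooks.reduce((p, h) => h(p), userPrompt);

  let output = deps.generate(prompt, deps.retrieve(6));

  // Quality gate: a weak answer triggers one expanded-retrieval retry (6 → 12).
  if (/don't have enough information/i.test(output)) {
    output = deps.generate(prompt, deps.retrieve(12));
  }

  // Post-hooks inspect and optionally modify output before it reaches the client.
  return postHooks.reduce((o, h) => h(o), output);
}
```

Because every feature routes its calls through this one wrapper, adding a new policy means adding one hook, not editing every call site.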

Frontend rendering: Mermaid and the bundle problem

Developers want diagrams. Mermaid.js is the practical choice for a developer-facing tool because it renders from markdown-style syntax and covers the most common diagram types — flowcharts, sequence diagrams, Gantt charts — that appear in project documentation.

Mermaid v11 introduced a dynamic import architecture: the core library lazy-loads diagram subtypes on demand. This is correct in theory — you should not load a Gantt chart parser if the user only ever renders flowcharts. In practice, it creates a production problem when combined with Vite's content-hash chunk strategy.

After a new deployment, old chunk filenames are removed. A user whose browser has cached the old app shell will try to dynamically import a diagram subtype chunk at a URL that no longer exists. The import fails, the diagram does not render, and there is no graceful fallback — the component simply breaks. This is a category of failure distinct from slow rendering or incorrect output.

[Diagram: stale-chunk failure and fix. Problem — an old app shell in the browser cache calls import() on mermaid-flowchart-abc123.js, receives a 404, and the diagram breaks. Fix — on a new deploy, the service worker detects the update and force-reloads the shell; new hash URLs are then in place and every import() resolves the correct chunk. Bundle note: lazy import per diagram type keeps the initial bundle small; a manualChunks mega-bundle solves stale imports but blocks first render — the wrong tradeoff.]

The fix has two parts that must both be present. Content-hash filenames — Vite's default behavior — ensure that chunk URLs are stable within a deployment. A service worker that detects version changes and forces a reload of the app shell ensures that stale chunk references from old deployments are never attempted. Together, they prevent the failure case. The alternative approach — bundling all Mermaid diagram types into a single large vendor chunk — removes the stale import problem but introduces a synchronous load cost that blocks initial render. Both problems exist independently. Both require their own fix.
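A complementary client-side mitigation — not described in the DevPulse implementation, sketched here as an assumption — is to wrap the dynamic import so a failed chunk load triggers at most one reload instead of a permanently broken component. Storage and reload are injected so the sketch stays testable; in a browser these would be sessionStorage and location.reload().

```typescript
// Illustrative fallback for a stale dynamic import: reload once (letting
// the service worker serve the fresh shell) rather than render a broken
// component. A session-scoped flag guards against reload loops.
async function importWithReloadFallback<T>(
  loader: () => Promise<T>,
  reload: () => void,
  storage: { get(k: string): string | null; set(k: string, v: string): void }
): Promise<T | null> {
  try {
    return await loader();
  } catch {
    if (storage.get("chunk-reload") !== "1") {
      storage.set("chunk-reload", "1");
      reload();
    }
    return null;
  }
}
```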

Architecture summary

DevPulse is not a complex system by modern standards. The interest is in how a small number of design decisions — where to place the indexing boundary, how to use vector search primitives correctly, how to represent chunk metadata, how to abstract LLM calls — determine whether the system behaves coherently at the level of user experience.

Layer · Key decision · What it enables
Write pipeline · Sync callable over async trigger · Zero consistency window between write and AI awareness
Vector retrieval · Firestore Vector Search (ANN) · Constant-cost retrieval regardless of collection size
Chunking · Paragraph split + entity title prefix · Higher retrieval precision without model change
Generation · Ground-only instruction · Eliminates confident hallucination on project-specific data
AI middleware · Pre/post hook pipeline · Monitoring, safety, quality gating written once across all features
Frontend · Content hash + SW reload · Eliminates stale chunk failures after deploy

The stack itself is conventional. The value is in the specific choices at each layer and the understanding of what each choice costs and what it buys. An async trigger is cheaper to build and breaks user trust. A synchronous callable is harder to build and maintains it. That exchange is the work.
