Introduction
A liberal arts education pushes you to study the systems you live inside: to observe them, decompose them, and learn what forces shaped them.
At Macalester, I learned two different ways to do that.
Anthropology trained me in ethnography: watch real behavior in context, look for patterns across environments, separate what's fundamental from what's local, and name the constraints that shape outcomes.
Chemistry trained me in lab discipline: isolate variables, test hypotheses, respect intermediates, and learn quickly what breaks when reality pushes back.
That method transfers cleanly to software engineering.
Production codebases are living systems. They encode decisions about scale, failure, cost, and latency. Those decisions aren't always documented, but they leave traces in architecture. If you study enough implementations, patterns start to emerge.
This summer I asked myself what it means to grow as a frontend-focused engineer in a world where tools like Claude, Cursor, and Gemini can generate a polished UI in minutes.
My answer was to go deeper into the systems the UI sits on top of.
I work at Lytics (a Customer Data Platform, now part of Contentstack). My job is building the interfaces people use to send events, understand their data, and debug what's happening inside the system. To design those interfaces well, I often need a clearer mental model of the layers behind a single line of code like:
jstag.send()
The client-side story is familiar: capture an event, serialize it, ship it over the network. Over winter break I studied frontend SDK architecture and built SDK Kit to validate what I was learning, then used it to build Auth HI! (a Chrome extension). That covered what happens before the request leaves the browser.
But what's always pulled at me is what happens after.
In every CDP, analytics product, or tracking SDK, you call something like jstag.send(), analytics.track(), or posthog.capture(), and the event eventually shows up in a dashboard. The interesting engineering lives in between.
Where does the event go after it leaves the browser? How does it get validated, deduplicated, ordered, buffered, and stored? What turns "an HTTP request" into "a production event collection system"?
I gave myself a weekend in January to find out.
This essay is what I learned by studying four production event collectors (RudderStack, Snowplow, PostHog, and Lytics) and then building a Python library, eventkit, to test whether I actually understood the patterns. The code isn't production-ready. But it is production-shaped: it reflects the constraints that keep showing up once you process events at scale—durability, cost, observability, and throughput.
If you've ever wondered what actually happens after send(), or you want a repeatable method for learning production architecture by studying what survived, this is the breakdown.
This is not a tutorial. It’s a systems essay: a comparative study of production event collectors, followed by a small build to test whether those patterns actually hold under real constraints. It’s written for engineers who spend most of their time at the edges of systems—often in frontend or product-facing roles—and want a clearer mental model of what lives behind a single call to send().
I. Methodology
The method mirrors how I learned to work in a chemistry lab: start with the literature, design the experiment, document what you observe, then test whether you actually understood by reproducing the reaction yourself.
Here, the "literature" was production code.
Field sites: why these four systems
I chose four event collectors with distinct architectural philosophies. Each is widely used in production, and together they span different trade-offs around validation, durability, scale, and developer experience.
For each system, I cloned the repository, traced event flow from the HTTP boundary to storage, and documented recurring patterns and divergences in research notes.
RudderStack (Go, open source) is designed around multi-destination routing. Events are collected once and fanned out to multiple destinations. Collection and delivery are deliberately separated stages.
Snowplow (Scala, open source) takes a schema-first approach. Events must reference a schema. The collector is permissive, but downstream validation is strict. Invalid events are preserved as structured failures rather than silently dropped.
PostHog (Python/Rust hybrid, open source) optimizes for different constraints at different layers. Python for the application layer where iteration speed matters. Rust for ingestion where throughput matters.
Lytics (Go, production CDP where I work) processes millions of events daily. I traced a dual-path architecture: the same event stream feeding two consumers with different latency and cost requirements.
What I looked for
Across all four systems, I asked the same set of questions:
Validation: Where does validation happen? At the HTTP boundary or downstream? What happens to invalid events: rejection, logging, dead-letter streams?
Sequencing: How are events routed deterministically? How are duplicates prevented? Hashing, partitioning, or idempotency windows?
Resilience: Where is durability guaranteed? Write-ahead logs, queues, acknowledgment semantics? What happens on crash or restart?
Storage: Where does an event go first? Object storage, stream, database, warehouse? Why that order?
Scale: What patterns enable throughput under load? Batching, buffering, partitioned workers?
The focus was on separating decisions shaped by universal pressures from those driven by local constraints.
From Hypothesis to Implementation
Comparative study generates hypotheses. Building forces you to commit to them.
I built eventkit in Python—a language I hadn't worked in before—starting with a weekend sprint in January. The constraint was intentional. If the patterns held up in an unfamiliar ecosystem, they were likely fundamental rather than incidental to Go, Scala, or a specific stack.
I used Cursor with Claude Sonnet 4.5 to accelerate cross-repository analysis. That made it feasible to ask focused questions (how graceful shutdown is handled, where durability is enforced) and verify the answers across multiple systems in parallel.
The implementation followed a spec-driven workflow. I wrote research notes first, documented trade-offs, then coded against those decisions. Each major subsystem has an accompanying note (for example: 011-ring-buffer-wal.md, 013-gcs-bigquery-storage.md) capturing not just what I built, but why: what production constraint it addressed and which systems informed the design.
The first version came together over the weekend. The second took another week, after watching the initial implementation fail under production constraints I hadn't fully internalized. Each failure sent me back to the field sites to understand how they handled the same problem.
That loop—study, build, break, refine—is the method.
II. Systems Under Study
Each system revealed a different set of priorities. Together, they span the design space for event collection: strict versus flexible validation, single-language versus polyglot architectures, real-time versus batch-optimized pipelines.
RudderStack: Separation of Concerns
RudderStack's architecture cleanly separates collection from routing. The gateway accepts events and writes them immediately to JobsDB, their internal durable queue. Transformation and delivery to downstream destinations happen asynchronously, outside the request path.
What stood out is how explicitly they treat collection as a reliability boundary. Once JobsDB acknowledges a write, the event is considered durable. Delivery failures do not affect ingestion. Retries and fan-out are handled downstream.
The pattern shows up everywhere in production systems: accept events quickly, make them durable, and defer correctness and delivery. The HTTP endpoint's job is durability, not completeness.
Snowplow: Schema as Contract
Snowplow enforces structure through Iglu, its schema registry. Every event references a schema version. The collector is permissive—it accepts events as they arrive—but the Enrich stage validates strictly.
Events that fail validation are not dropped. They are written to a dedicated "bad rows" stream with structured error information. This preserves failures as data, making them debuggable and replayable once schemas are fixed.
What stood out is the explicit contract: validation is unavoidable, but it doesn't block collection. The trade-off is upfront schema work and slower iteration, in exchange for downstream systems that can trust event shape. You discover errors immediately, not months later in a warehouse.
PostHog: Language for the Job
PostHog splits its architecture across Python and Rust. Django handles the application layer—APIs, admin, and product logic—while Rust (capture-rs) handles high-throughput ingestion.
The split is deliberate. Python optimizes for developer velocity where business logic changes frequently. Rust optimizes for throughput where events arrive at scale. ClickHouse stores events for fast analytical queries.
What stood out is the refusal to force a single language across the system. Performance-critical paths are isolated and optimized without sacrificing iteration speed elsewhere. Hybrid architectures let you apply the right constraints at each layer instead of compromising globally.
Lytics: Dual-Path for Different Constraints
Lytics processes the same event stream along two independent paths. EventPlexer consumes from Pub/Sub, batches events, and writes Parquet files to GCS for cost-efficient analytics. StreamReader consumes from the same topic but updates real-time profiles backed by Spanner.
What stood out is how explicitly latency and cost are separated. EventPlexer optimizes for throughput and storage efficiency, flushing every few minutes. StreamReader optimizes for immediacy, updating profiles in under a second.
A single processing path would compromise both goals. Batching breaks real-time guarantees; real-time processing at scale is expensive. Two consumers, each optimized for its constraint, resolve the tension cleanly.
III. The Event Pipeline Under Pressure
Event collection looks simple from the outside: receive a request, store the data, show it in a dashboard. In practice, every production system is shaped by the same set of pressures—unreliable networks, retries, out-of-order delivery, bursty traffic, and cost constraints at scale.
Section III examines how mature systems respond to those pressures by breaking the pipeline into distinct responsibilities:
Validation (III.A) defines where correctness lives—what's checked at ingestion, what's deferred, and how systems prevent bad data from silently corrupting downstream state.
Sequencing & Deduplication (III.B) determine how events are routed deterministically, how order is preserved without global coordination, and how retries are prevented from inflating analytics.
Buffering & Batching (III.C) balance throughput, latency, and failure risk by deciding how long events sit in memory, when they flush, and where durability is guaranteed.
Storage (III.D) reflects the final set of trade-offs: what data is hot vs cold, which queries must be fast, and how cost, latency, and durability shape where events ultimately live.
Each subsection studies the same four production systems through a different lens. Together, they show that scalable event pipelines aren't built by a single clever abstraction, but by isolating pressures and addressing them independently—validation, ordering, buffering, and storage—then composing those solutions into a resilient whole.
A. Validation
The problem
Events arrive in inconsistent shapes. A Segment SDK sends { "type": "track", "userId": "123" }. A webhook sends { "event_type": "click", "user": "123" }. Legacy systems send whatever they've always sent.
Validation determines what's acceptable. Reject too much and you lose data. Accept too much and garbage propagates downstream—corrupting analytics, breaking profiles, and making dashboards untrustworthy.
Every production system makes this choice explicitly or implicitly: where validation happens, how strict it is, and what happens when it fails.
RudderStack: Minimal validation at the edge
RudderStack validates lightly at the HTTP boundary. The goal is fast acceptance and durable storage, not semantic correctness.
At ingestion, validators check only what's necessary to route and store an event—field presence, request metadata, message IDs. They use fast-path JSON queries (gjson) to avoid full deserialization.
func (p *messageIDValidator) Validate(payload []byte, properties *stream.MessageProperties) (bool, error) {
    return gjson.GetBytes(payload, "messageId").String() != "", nil
}
The insight here is architectural, not syntactic: the HTTP boundary exists to guarantee durability, not correctness. Once an event is acknowledged and queued, deeper validation can happen asynchronously without risking data loss.
What survived in production: Validate just enough at the edge to guarantee durability, and defer semantic correctness to downstream processing where failures are cheaper to handle.
Snowplow: Schema enforcement downstream
Snowplow separates collection from validation even more aggressively. Events must reference an Iglu schema, but the collector accepts them regardless. Strict validation happens in the Enrich stage.
Each event is parsed as self-describing JSON and validated against its declared schema. Failures aren't dropped—they're written to a dedicated "bad rows" stream with structured error metadata and the original payload preserved.
sdj <- IorT.fromEither(SelfDescribingData.parse(json))
supersedingSchema <- check(client, sdj, registryLookup)
This turns validation failures into first-class data. You can fix schemas, replay events, and audit failures after the fact.
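As a rough illustration of what that looks like as data (a simplified sketch, not Snowplow's actual bad-rows schema), a preserved failure carries the error, the schema the event claimed, and the untouched payload:

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class BadRow:
    """A validation failure preserved as data instead of being dropped."""
    error: str        # why validation failed
    schema: str       # schema the event claimed to conform to
    payload: str      # original payload, byte-for-byte, for later replay
    failed_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

bad = BadRow(
    error="missing required property 'user_id'",
    schema="iglu:com.example/checkout/jsonschema/1-0-0",
    payload='{"event": "checkout", "cart_total": 42.5}',
)

Because the original payload travels with the error, fixing the schema and re-running enrichment is a replay, not a data recovery project.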
The trade-off is explicit: slower iteration in exchange for downstream trust. You discover data shape errors immediately, not months later in a warehouse.
What survived in production: Separate collection from validation, but enforce schemas downstream so invalid data becomes structured, replayable failures rather than silent corruption.
PostHog: Flexible parsing with defensive controls
PostHog prioritizes flexibility at ingestion. The Rust-based capture endpoint accepts multiple encodings (gzip, base64, form data), normalizes payloads, and extracts events before applying validation.
let request = RawRequest::from_bytes(
    data,
    compression,
    metadata.request_id,
    state.event_payload_size_limit,
    path.as_str().to_string(),
)?;
Validation in PostHog isn't just about shape. Rate limiting and billing quotas are enforced alongside parsing. Invalid or over-quota requests return 200 OK to avoid SDK retries, protecting system stability over strict correctness.
The insight here: validation also protects the system, not just the data. Abuse prevention, cost control, and ingestion safety are part of the validation surface.
What survived in production: Treat validation as system protection as much as data correctness—normalize aggressively, enforce limits, and prioritize ingestion stability over strict rejection.
Lytics: Deferred validation with flexible identity
Lytics takes a permissive approach at collection. Events are accepted with minimal checks and validated downstream through processing rules.
The distinguishing factor is identity. Rather than enforcing fields like userId or anonymousId, Lytics allows any field to serve as identity—email, account ID, custom identifiers—defined in processing rules, not at ingestion.
This enables rapid iteration and heterogeneous data sources: Stripe webhooks, form submissions, mobile events, custom integrations—all without schema coordination.
The trade-off is downstream complexity. Flexibility at the edge demands stronger validation later to prevent incoherent profiles.
What survived in production: Accept heterogeneous events at ingestion, and push identity resolution and semantic validation into a flexible processing layer that can evolve without blocking data flow.
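A minimal sketch of that idea (my reconstruction, not Lytics' rule engine): identity fields are named in configuration rather than hard-coded, so any field can anchor a profile.

# Hypothetical processing rule: fields allowed to act as identity references.
IDENTITY_FIELDS = ["email", "user_id", "account_id", "phone"]

def extract_identities(event: dict) -> dict[str, str]:
    """Collect whichever identity fields this event happens to carry."""
    return {
        name: str(event[name])
        for name in IDENTITY_FIELDS
        if event.get(name)
    }

# A Stripe webhook and a form submission resolve to the same profile
# as long as they share at least one identity value.
extract_identities({"event_type": "payment", "email": "a@example.com"})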
Design space
Different implementations, similar decisions:
When to validate:
- Edge (RudderStack): fast checks, defer semantics
- Downstream (Snowplow): strict validation after collection
- Hybrid (PostHog): normalize early, enforce limits and structure later
- Deferred (Lytics): minimal ingestion, rules in processing
How strict:
- Strict (Snowplow): schemas required
- Lightweight (RudderStack): presence checks
- Permissive (PostHog, Lytics): normalize and adapt downstream
What happens on failure:
- Preserve and replay (Snowplow)
- Log and observe (Lytics)
- Filter defensively (PostHog)
In eventkit
I implemented an adapter-based approach inspired by Lytics' flexibility and RudderStack's lightweight ingestion.
Adapters validate only what's required to normalize an event into a canonical shape. Schema-based validation is optional and pluggable.
class EventAdapter(Protocol):
    def validate(self, raw_event: dict) -> AdapterResult: ...
    def transform(self, raw_event: dict) -> CanonicalEvent: ...
Strict validation (Snowplow-style) can be layered in by swapping adapters, not rewriting the pipeline.
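As a sketch of what that swap could look like (assuming the third-party jsonschema package and the same AdapterResult and CanonicalEvent types; eventkit's real adapters differ in detail), a Snowplow-flavored adapter enforces a JSON Schema behind the same interface:

import jsonschema  # assumed dependency, used only in this sketch

class StrictSchemaAdapter:
    """Adapter that enforces a JSON Schema before normalization."""

    def __init__(self, schema: dict):
        self.schema = schema

    def validate(self, raw_event: dict) -> AdapterResult:
        try:
            jsonschema.validate(instance=raw_event, schema=self.schema)
            return AdapterResult(valid=True)
        except jsonschema.ValidationError as exc:
            return AdapterResult(valid=False, error=exc.message)

    def transform(self, raw_event: dict) -> CanonicalEvent:
        # Safe to map directly: shape was guaranteed by the schema check.
        return CanonicalEvent(**raw_event)

The pipeline keeps calling validate and transform; only the strictness behind them changes.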
What I learned: validation isn't binary. It's a spectrum. The real question is where you draw the boundary between ingestion reliability and semantic correctness.
Trade-offs
Strict validation improves downstream trust but slows iteration and risks rejecting edge cases. Permissive validation preserves data and flexibility but pushes complexity downstream and increases debugging cost.
The consistent production pattern is clear: separate collection from validation. Accept events fast. Make them durable. Validate where failures are observable, recoverable, and replayable.
That separation shows up in every system that survived scale.
Validation Pattern (Universal):
Accept events fast and make them durable at the edge; perform strict, observable, and recoverable validation downstream where failures can be analyzed, replayed, and corrected.
B. Sequencing & Deduplication
The problem
Events rarely arrive in the order they were generated. Network delays, client-side buffering, retries, and queue replays all scramble sequence. An identify event may arrive after the track it was meant to precede. A purchase may process before an email is set.
Duplicates compound the problem. At-least-once delivery guarantees mean retries are expected. Without deduplication, every retry inflates counts, corrupts analytics, and mutates profile state incorrectly.
Order matters for correctness. Profile state depends on sequence. Distributed systems therefore need deterministic routing: all events for the same entity must land on the same processor, or ordering becomes impossible to enforce.
RudderStack: Hash-based worker assignment
RudderStack guarantees per-user ordering by routing events deterministically to workers using consistent hashing on user identity.
func (gw *Handle) findUserWebRequestWorker(userID string) *userWebRequestWorkerT {
    index := int(math.Abs(float64(misc.GetHash(userID) % gw.conf.maxUserWebRequestWorkerProcess)))
    return gw.userWebRequestWorkers[index]
}
The hash function is simple: hash the user ID, modulo by worker count. Each worker processes its queue sequentially. Once an event for a user is routed to worker n, all subsequent events for that user go to the same worker.
The key architectural move is not the hash itself, but routing consistency. Sequential processing inside a worker preserves order without any global coordination.
Snowplow: Eventually consistent ordering, dedup at load
Snowplow does not attempt to preserve ordering at ingestion. Events flow through the collector into Kafka and downstream processing stages. Ordering is reconstructed logically using timestamps rather than enforced through routing.
The Enrich stage computes a derived_tstamp, falling back through available timestamps when necessary:
val derivedTstamp = enriched.getDerived_tstamp() match {
  case null => calculateDerivedTimestamp(raw, enriched)
  case dt => dt
}
Deduplication happens later, at the storage layer. Each event carries a UUID event_id, and loaders rely on warehouse semantics (MERGE, INSERT IGNORE, or equivalent) to prevent duplicates.
Snowplow's stance is explicit: ordering is eventual, not strict. Timestamp accuracy matters more than routing determinism. Deduplication is deferred until events are queryable.
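To make "dedup at load" concrete, here is one way it can look in a BigQuery-style warehouse (illustrative table names, not Snowplow's loader code): new rows merge in only when their event_id has not been seen.

from google.cloud import bigquery  # assumed client for this sketch

client = bigquery.Client()

# The staging table holds the freshly loaded batch; event_id is the UUID
# each event carries from collection onward.
dedup_sql = """
MERGE `analytics.events` AS target
USING `analytics.events_staging` AS source
ON target.event_id = source.event_id
WHEN NOT MATCHED THEN
  INSERT ROW
"""

client.query(dedup_sql).result()  # duplicates in the batch are simply skipped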
PostHog: Kafka partitioning + storage-level deduplication
PostHog delegates sequencing to Kafka. Events are published with a partition key derived from distinct_id and project token.
let partition_key = format!("{}:{}", event.token, event.distinct_id);
Kafka guarantees order within a partition, so all events for a given identity are consumed sequentially. PostHog does not implement custom routing logic—it relies entirely on Kafka's partitioning guarantees.
Deduplication happens in ClickHouse using ReplacingMergeTree. Duplicate rows collapse during merges based on primary key and timestamp. The cost of deduplication is paid at merge or query time, not ingestion.
This is a deliberate trade-off: ingestion remains fast and stateless; deduplication cost is amortized later.
Lytics: Sequencer service with bounded dedup windows
Lytics uses a dedicated sequencer to route events based on identity references. Unlike systems that hash a single ID, Lytics hashes all available identity fields together.
From research notes, the sequencer:
- Extracts identity references (email, userId, phone, etc.)
- Sorts them for determinism
- Hashes the combined set
- Routes to a partition via modulo
Each partition has its own worker and deduplication window. The inflight library tracks recently seen message_id values per partition. If a duplicate arrives while the original is still in-flight, it is dropped.
This approach scales horizontally. Deduplication state is bounded and partition-local rather than global.
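A minimal sketch of that shape (my reconstruction from the notes, not Lytics' code): hash the sorted identity set for routing, and keep a bounded, partition-local window of recently seen message IDs.

import hashlib
from collections import OrderedDict

def route(identities: list[str], num_partitions: int) -> int:
    """Hash all identity references together, sorted for determinism."""
    key = "|".join(sorted(identities))
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_partitions

class DedupWindow:
    """Bounded, partition-local memory of recently seen message IDs."""

    def __init__(self, max_entries: int = 10_000):
        self.seen: OrderedDict[str, None] = OrderedDict()
        self.max_entries = max_entries

    def is_duplicate(self, message_id: str) -> bool:
        if message_id in self.seen:
            return True
        self.seen[message_id] = None
        if len(self.seen) > self.max_entries:
            self.seen.popitem(last=False)  # evict oldest; state stays bounded
        return False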
Design space
The differences are in execution; the choices repeat:
Axis: Routing for order vs. timestamps for order
Systems: RudderStack, PostHog, and Lytics route deterministically (hash-based partitioning); Snowplow relies on timestamps and storage-layer dedup.
Cost you pay: Deterministic routing can create hot partitions when individual entities generate high-volume traffic. Timestamp-based ordering shifts correctness to clock quality and downstream merge logic.
In eventkit
I implemented hash-based partitioning with fallback when identity is missing:
def get_partition_id(self, event: CanonicalEvent) -> int:
    routing_key = (
        f"userId:{event.user_id}" if event.user_id
        else f"anonymousId:{event.anonymous_id}" if event.anonymous_id
        else f"event:{event.message_id}"
    )
    return abs(hash(routing_key)) % self.num_partitions
Deduplication happens in the buffer layer (Section III.C). The sequencer provides partition affinity; the buffer maintains per-partition dedup state.
Recurring move: Deterministic routing buys partition-local order; anything else pushes ordering into timestamps + storage semantics.
C. Buffering & Batching
The problem
Writing events one at a time is inefficient. Network round-trips dominate latency, and storage systems charge overhead per write. A single Firestore write can take 5–10ms. Write 1,000 events individually and you spend seconds. Batch them and you spend milliseconds.
Batching trades latency for throughput. Events accumulate in memory and flush together, either when the buffer fills or a timeout expires. But buffering introduces risk: events in memory disappear on crash.
Production systems balance three forces:
- Throughput: batch aggressively
- Latency: flush frequently
- Durability: don't lose data
Every buffering strategy is a negotiation between these constraints.
RudderStack: Worker queues with batched persistence
RudderStack buffers events inside per-worker queues. Each worker processes requests sequentially and batches writes to JobsDB (Postgres).
type userWebRequestWorkerT struct {
    webRequestQ chan *webRequestT
    batchSize   int
}
The in-memory queue enables fast acceptance and ordering guarantees. Durability comes from JobsDB: once a batch is written to Postgres, events survive crashes. The in-memory buffer itself is intentionally ephemeral.
The architectural move is clear: use memory for speed, storage for safety. Workers buffer just long enough to amortize database overhead, then persist.
What survived at scale: Buffer events in per-worker queues for fast acceptance and ordering; batch writes to durable storage (Postgres) to amortize overhead; accept that in-memory buffers are ephemeral.
Snowplow: In-memory buffers with periodic flush
Snowplow's collectors buffer events in memory and flush them periodically to sinks like S3, Kinesis, or Pub/Sub.
trait Sink {
  def storeRawEvents(events: List[Array[Byte]], key: String): Unit
}
Flushes are triggered by size or time. S3 sinks write batches as files; Kinesis sinks use putRecords to batch API calls.
The trade-off is explicit. Collector buffers are fast but ephemeral. Snowplow accepts this risk because durability is guaranteed downstream once events reach storage or streams.
This is a recurring pattern: optimize ingestion for throughput, rely on later stages for durability.
What survived in production: Buffer events in memory with size and time triggers; flush to object storage (S3) or streams (Kinesis) in batches; accept ephemeral buffers at ingestion, guarantee durability downstream.
PostHog: Kafka as the buffer
PostHog avoids buffering at the capture layer almost entirely. Events flow directly from capture-rs into Kafka.
let partition_key = format!("{}:{}", event.token, event.distinct_id);
Kafka provides the buffer: replicated, partitioned, durable, and backpressure-aware. Producers block when brokers are overloaded, naturally throttling ingestion.
Consumers read from Kafka in batches and write to ClickHouse using bulk inserts.
self.client.insert_batch("events", events).await
The insight here is inversion: buffering is infrastructure, not application logic. PostHog keeps ingestion stateless and pushes complexity into Kafka and ClickHouse, where batching and durability are native.
What survived in production: Delegate buffering to durable infrastructure (Kafka); keep ingestion stateless; batch at the consumer level when writing to storage; let infrastructure handle backpressure.
Lytics: Layered buffering across the pipeline
Lytics buffers events at multiple stages, each optimized for a different constraint:
- HTTP layer: minimal buffering, fast acknowledgment
- Pub/Sub: durable queue between collection and processing
- Processing: per-partition buffers before storage writes
Per-partition buffers use dual triggers:
- Size-based: flush when full (default ~100 events)
- Time-based: flush periodically (default ~5 seconds)
Each partition is independent, preventing hot users from blocking others. Pub/Sub guarantees durability; buffers optimize storage writes to GCS and BigQuery.
The key idea is layering: ephemeral buffers for speed, durable queues for safety, batched storage for cost.
What survived in production: Layer buffering across pipeline stages—ephemeral at collection for speed, durable queues for safety, per-partition batching for storage efficiency; prevent hot-partition blocking through isolation.
Design space
The same decision points show up in each system:
Where buffering happens:
- In-memory at ingestion (RudderStack, Snowplow, Lytics)
- External durable queue (PostHog via Kafka)
- Hybrid pipelines (most production systems)
Flush triggers:
- Size-based
- Time-based
- Dual triggers (dominant pattern)
Durability model:
- Ephemeral buffers + durable storage (RudderStack, Snowplow)
- Durable queue + ephemeral consumers (PostHog)
- Layered durability across stages (Lytics)
Granularity:
- Per-worker (RudderStack)
- Per-partition (Lytics)
- Global (Snowplow collectors)
- Delegated entirely to infrastructure (PostHog)
In eventkit
I implemented per-partition buffering with dual flush triggers and defensive bounds.
class Buffer:
    async def enqueue(self, event: TypedEvent, partition_id: int):
        # Try to make room by flushing the full partition first
        if len(self.buffers[partition_id]) >= self.max_size:
            await self._flush_partition(partition_id)
        # Still full after the flush attempt: shed load rather than grow unbounded
        if len(self.buffers[partition_id]) >= self.max_size:
            raise BufferFullError()
        self.buffers[partition_id].append(event)
A background task flushes buffers on a timer. Size-based flushes optimize throughput; time-based flushes cap latency. A hard max_size prevents unbounded memory growth when storage slows.
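The timer side is a small asyncio loop (a sketch of the shape; the interval and method names follow the snippet above, not necessarily the final code):

import asyncio

async def flush_on_interval(buffer: "Buffer", interval_seconds: float = 5.0) -> None:
    """Time-based trigger: sweep every partition on a fixed cadence so
    low-volume partitions never sit in memory indefinitely."""
    while True:
        await asyncio.sleep(interval_seconds)
        for partition_id in list(buffer.buffers.keys()):
            if buffer.buffers[partition_id]:
                await buffer._flush_partition(partition_id)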
This made the trade-off explicit: buffering is where latency, throughput, and failure risk meet.
Trade-offs
Large buffers improve throughput but increase memory usage and data loss risk on crash.
Small buffers reduce latency but increase write amplification and cost.
In-memory buffers are fast but fragile. Durable queues survive crashes but add operational complexity.
Per-partition buffers isolate hot traffic but complicate shutdown and memory management.
The consistent production pattern is clear: buffer briefly in memory, flush on size and time, bound buffers defensively, and rely on durable queues or storage for safety.
Buffering Pattern (Universal):
Buffer events in memory using dual size- and time-based flush triggers to balance throughput and latency; bound buffer growth defensively; rely on durable queues or storage for crash safety rather than in-memory state.
D. Storage Patterns
The problem
Where events land determines everything that follows: what questions you can ask, how quickly you can answer them, and how much it costs to keep the system running.
Every production system makes explicit trade-offs around:
- Durability: can you afford to lose events?
- Queryability: what kinds of questions matter?
- Latency: how fast must answers arrive?
- Cost: what's the budget at scale?
Storage isn't an implementation detail. It shapes the entire architecture.
RudderStack: Queue first, warehouse later
RudderStack treats storage as a reliability boundary. Events are written immediately to JobsDB (Postgres), which serves as the durable source of truth.
func (jd *HandleT) Store(jobs []*JobT) error {
    txn, _ := jd.dbHandle.Begin()
    stmt := txn.Prepare(pq.CopyIn("jobs", "workspace_id", "parameters", "custom_val"))
    for _, job := range jobs {
        stmt.Exec(job.WorkspaceID, job.Parameters, job.CustomVal)
    }
    return txn.Commit()
}
Once events are durable in Postgres, downstream routers deliver them asynchronously to customer-configured destinations: S3, BigQuery, Snowflake, and others.
The architectural stance is explicit: durability lives in the queue, not the warehouse. Warehouses are consumers. The queue is authoritative.
Pattern: Write events to a durable queue (Postgres) as the source of truth; fan out to customer-specified warehouses asynchronously; treat the queue as authoritative and warehouses as consumers.
Snowplow: Object storage as the archive
Snowplow writes enriched events to object storage (S3 or GCS) as its primary archive. Warehouses load from object storage on a schedule.
def store(events: List[Array[Byte]], key: String): Unit = {
  storage.create(blobInfo, content.toByteArray)
}
Files are written as newline-delimited JSON or TSV. Loaders batch-import these files into Redshift, BigQuery, or Snowflake.
Here, object storage is the source of truth. Warehouses are materialized views derived from it. Replay is trivial. Cost is low. Latency is acceptable for batch analytics.
This design prioritizes durability, cost efficiency, and replayability over immediacy.
The architecture that emerged: Write events to object storage (S3/GCS) as the durable archive; load to warehouses on a schedule for querying; accept batch latency in exchange for cost efficiency and replayability.
PostHog: ClickHouse for real-time analytics
PostHog writes events directly into ClickHouse, a columnar OLAP database optimized for analytical queries.
self.client.insert("events")?.write(&EventRow { ... }).await?;
ClickHouse unifies storage and query engine. Events are queryable within seconds. Deduplication is handled during merges via ReplacingMergeTree.
The trade-off is operational. ClickHouse must be deployed, replicated, backed up, and tuned. In exchange, PostHog gets sub-second analytical queries and real-time dashboards without a separate warehouse layer.
This is a conscious decision: pay operational complexity to buy latency.
Pattern: Write events directly to a columnar OLAP database (ClickHouse) for real-time analytics; unify storage and query engine; accept operational complexity in exchange for sub-second queries.
Lytics: Storage matched to access patterns
Lytics uses multiple storage systems in parallel, each optimized for a specific workload.
- GCS (archive): EventPlexer writes Parquet files for long-term storage and warehouse ingestion. Cheap, durable, replayable.
- Spanner (profiles): StreamReader updates user profiles with sub-second read latency and strong consistency.
- BigTable (metrics): Aggregated metrics are written to a time-series–optimized store for high-throughput queries.
The insight here is constraint-driven design. Cold data goes to cheap storage. Hot data goes to low-latency systems. Analytical workloads and transactional workloads are kept separate.
This architecture is more complex, but it avoids forcing a single storage system to serve incompatible access patterns.
Pattern: Use multiple storage systems matched to access patterns—object storage for archives, transactional databases for profiles, time-series stores for metrics; separate analytical and transactional workloads.
Design space
Each system surfaces the same set of trade-offs:
Storage taxonomy:
- Queues/streams (Kafka, Pub/Sub): durability + replay
- Object storage (S3, GCS): cheap archive
- Warehouses (BigQuery, Snowflake): batch analytics
- OLAP databases (ClickHouse): real-time analytics
- Transactional databases (Spanner, Postgres): low-latency state
Cost you pay: Storage is where you choose which costs you pay—compute now, ops now, or latency later.
In eventkit
I implemented object storage as the source of truth, with BigQuery as the query layer:
class GCSEventStore:
    async def store_batch(self, events):
        df = self._events_to_dataframe(events)
        path = f"events/date={date}/events-{uuid}.parquet"
        await self._write_parquet(df, path)
A background loader batches files into BigQuery every few minutes. This mirrors Snowplow and RudderStack: cheap, durable object storage as the source of truth; a warehouse for queryability. The latency trade-off (5–10 minutes) is acceptable for batch analytics.
Recurring move: Archives, analytics, and real-time access want different storage systems. Single-store architectures optimize one dimension; multi-store architectures pay complexity to optimize several.
Taken together, these systems converge on the same architecture for a reason. Reliable event pipelines don't rely on a single abstraction to do everything well. They isolate concerns under pressure. Validation is separated from collection so data can be made durable before it's judged. Sequencing is handled through deterministic routing rather than global coordination. Buffering absorbs burstiness and amortizes storage costs without assuming memory is safe. Storage is tiered to match access patterns, not forced into a single database.
None of these choices are accidental, and none are free. Each represents a trade-off made explicit in code: where correctness is enforced, where ordering is guaranteed, where latency is paid, and where cost is optimized. Studying these systems side by side makes the pattern clear—not because they share implementations, but because they've converged on the same boundaries under real operational pressure.
That convergence is the lesson. Production systems survive not by eliminating complexity, but by placing it deliberately.
IV. eventkit — Building to Validate Understanding
Comparative study generates hypotheses about how event collection works. Building requires committing to those hypotheses in code.
I gave myself a weekend in January to implement an event collector in Python—a language I hadn't worked in before. The constraint was deliberate. If the patterns from Section III held in an unfamiliar ecosystem, they were likely fundamental rather than tied to a specific language or stack.
The first version came together over the weekend. The second took another week, after the initial implementation encountered constraints I hadn't fully internalized.
The Naive Implementation
The first pass confirmed that the patterns mapped. An HTTP endpoint, adapter-based validation, hash-based sequencing, size-and-time buffering, and storage writes assembled into a functioning pipeline.
@app.post("/v1/track")
async def track(request: TrackRequest):
    raw_event = request.dict()

    # Adapt: Segment format → canonical event
    result = adapter.validate(raw_event)
    if not result.valid:
        raise HTTPException(400, detail=result.error)
    event = adapter.transform(raw_event)

    # Route to partition by identity
    partition = sequencer.get_partition_id(event)

    # Buffer until flush
    await buffer.enqueue(event, partition)

    return {"success": True}
Events flowed from HTTP through validation, sequencing, buffering, and into Firestore. Tests passed. The architecture resembled the systems studied earlier, but the gaps appeared once failure conditions were considered.
Durability Surfaces as the First Constraint
A colleague reviewing the code asked a simple question:
"What happens if the process crashes while events are in the buffer?"
The answer followed directly from the design: events still in memory are lost.
In-memory buffers optimize for speed. In an event collection system, they also define a window where accepted events exist only in process memory. A crash during that window results in silent loss.
In every production system I studied, the same rule showed up: durability before acknowledgment.
RudderStack writes to Postgres. Snowplow writes to cloud sinks. PostHog publishes to Kafka. Lytics writes to a local WAL.
Different mechanics, same boundary: don't return "accepted" until the event can survive a crash.
This pattern was already documented in my notes (011-ring-buffer-wal.md). Implementing it exposed why it exists.
Implementing Durability with a Ring Buffer
Lytics' architecture provided a concrete reference. Their API writes events to a local BoltDB store before returning 202. A background worker publishes from BoltDB to Pub/Sub. Unpublished events remain available across restarts.
The core separation is between acknowledgment and publication.
In eventkit, SQLite with WAL mode replaces BoltDB. The structure remains the same.
class SQLiteRingBuffer(RingBuffer):
    def __init__(self, db_path: str, max_size: int, retention_hours: int):
        self.conn = sqlite3.connect(db_path, check_same_thread=False)
        self.conn.execute("PRAGMA journal_mode=WAL")
        self.conn.execute("PRAGMA synchronous=NORMAL")
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS ring_buffer (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                event_data TEXT NOT NULL,
                created_at TEXT NOT NULL,
                published BOOLEAN DEFAULT FALSE,
                published_at TEXT
            )
        """)

    def write(self, event: RawEvent) -> int:
        event_json = event.model_dump_json()
        now = datetime.now(UTC).isoformat()
        cursor = self.conn.execute(
            "INSERT INTO ring_buffer (event_data, created_at) VALUES (?, ?)",
            (event_json, now)
        )
        self.conn.commit()
        return cursor.lastrowid
The API endpoint writes to the ring buffer and returns immediately:
@app.post("/collect", status_code=202)
async def collect(request: Request, ring_buffer: RingBuffer = Depends(get_ring_buffer)):
    raw_event = await parse_raw_event(request)
    ring_buffer.write(raw_event)
    return {"status": "accepted"}
A background publisher drains unpublished entries and enqueues them downstream, marking them published after successful writes.
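A sketch of that publisher loop (simplified; read_unpublished and mark_published are illustrative helper names, and real code adds error handling and backoff):

import asyncio

async def publish_loop(ring_buffer, queue, poll_interval: float = 1.0) -> None:
    """Drain unpublished ring-buffer entries into the downstream queue."""
    while True:
        rows = ring_buffer.read_unpublished(limit=100)  # hypothetical helper
        if not rows:
            await asyncio.sleep(poll_interval)
            continue
        for row_id, event in rows:
            await queue.enqueue(event)          # durable handoff downstream
            ring_buffer.mark_published(row_id)  # only after a successful write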
This change closed the acknowledgment window, reduced request latency, and preserved events across crashes.
Cost Emerges as the Next Constraint
Firestore supported early experimentation. At higher volumes, its cost profile becomes significant: $0.18/GB for storage and $0.06 per 100K reads.
Production CDPs converge on object storage as the first durable layer:
- Segment writes newline-delimited JSON to S3
- RudderStack writes Parquet to S3 or GCS
- Snowplow writes TSV to S3
- Lytics writes Parquet to GCS
The common structure places object storage at the center of durability, with warehouses layered on top for analytics. GCS costs $0.02/GB, and batch-loading into BigQuery avoids streaming ingestion fees.
eventkit adopts the same structure:
class GCSEventStore:
    async def store_batch(self, events: list[TypedEvent]) -> None:
        by_date = self._group_by_date(events)
        for date, date_events in by_date.items():
            df = self._events_to_dataframe(date_events)
            table = pa.Table.from_pandas(df)
            path = f"events/date={date}/events-{uuid.uuid4().hex[:8]}.parquet"
            # fs: filesystem handle for gs:// paths (e.g. gcsfs)
            with fs.open(f"gs://{self.bucket}/{path}", "wb") as f:
                pq.write_table(table, f, compression="snappy")
A loader polls GCS every five minutes and batch-loads new files into BigQuery.
This arrangement reduces storage cost and enables replay without additional infrastructure.
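Conceptually, the loader is a scheduled batch-load job rather than a streaming insert (a sketch using the google-cloud-bigquery client; bucket and table names are placeholders):

from google.cloud import bigquery

def load_new_files(uris: list[str], table_id: str = "my-project.analytics.events") -> None:
    """Batch-load Parquet files from GCS into BigQuery; no streaming-insert fees."""
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    job = client.load_table_from_uri(uris, table_id, job_config=job_config)
    job.result()  # block until the load job completes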
Observability Becomes Necessary
As the system grew more asynchronous, visibility diminished. Debugging required structured context rather than ad hoc output.
PostHog's logging configuration provided a reference point: structlog, context propagation, and JSON output in production. Eventkit adopts the same approach.
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.stdlib.add_log_level,
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer(),
    ],
)
Operational events include structured context:
logger.info(
    "event.processed",
    event_type=event.type,
    event_id=event.message_id,
    user_id=event.user_id,
    partition=partition_id,
    latency_ms=latency,
)
Prometheus metrics expose system state: ring buffer depth, buffer flush rates, and storage latency.
These signals make system behavior observable during failure and load.
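The instruments themselves are plain prometheus_client metrics (a representative sketch; the metric names are illustrative):

from prometheus_client import Counter, Gauge, Histogram

RING_BUFFER_DEPTH = Gauge(
    "eventkit_ring_buffer_depth", "Unpublished events waiting in the ring buffer"
)
BUFFER_FLUSHES = Counter(
    "eventkit_buffer_flushes_total", "Buffer flushes", ["trigger"]  # size vs time
)
STORAGE_LATENCY = Histogram(
    "eventkit_storage_write_seconds", "Latency of batch writes to object storage"
)

# Typical instrumentation points:
#   RING_BUFFER_DEPTH.set(unpublished_count)
#   BUFFER_FLUSHES.labels(trigger="size").inc()
#   with STORAGE_LATENCY.time(): await store.store_batch(events)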
Production Architecture
The resulting architecture reflects production constraints encountered during iteration:
Production Architecture: Durability, cost optimization, and observability
Current state:
- ~3,200 lines of Python
- ~800 lines of tests
- ~95% test coverage
Key components:
- Ring buffer with WAL
- Cost-optimized storage
- Queue abstraction
- Structured logging and metrics
What Building Taught
Production patterns emerge from failure conditions. Those conditions surface only when a system is exercised under realistic constraints.
The learning process followed a repeatable loop:
- Study production systems
- Implement a working pipeline
- Encounter failure modes
- Re-examine existing designs
- Incorporate structural changes
The initial version demonstrated feasibility. The revised version demonstrated resilience.
The patterns from Section III—durability before acknowledgment, batching for efficiency, object storage for cost—translated directly to Python. SQLite replaced BoltDB. Pub/Sub replaced Kafka. The mechanics shifted while the structure remained consistent.
Working in an unfamiliar language removed shortcuts. The design had to stand on its own.
It did.
V. Trade-offs & Design Decisions
Event collection systems operate under multiple constraints at once. Design choices reflect which constraints dominate in a given environment. Examining those choices reveals the shape of the problem space rather than a single correct architecture.
Validation: Correctness and Flexibility
Validation determines how event structure is enforced as data enters the system.
Snowplow enforces schemas early in the pipeline. Events reference Iglu schemas, and validation failures are captured as structured bad rows during enrichment. This approach establishes strong guarantees for downstream consumers and supports reliable analytics assumptions.
Lytics accepts events without centralized schema enforcement. Identity fields and transformations are defined in processing rules. This accommodates heterogeneous sources and enables rapid integration, with validation logic applied during processing.
RudderStack performs lightweight validation at ingestion, focusing on routing metadata and structural presence checks. Semantic validation occurs downstream, often within destination-specific logic.
Pattern: Validation strategy aligns with trust boundaries. Systems that rely on strict downstream assumptions validate early. Systems optimized for integration breadth preserve events and surface validation outcomes later.
What determines the choice:
- Team iteration speed
- Data quality expectations
- Diversity of event sources
- Downstream query assumptions
Ordering: Coordination and Partition-Local Guarantees
Ordering determines how event sequences affect state.
RudderStack and Lytics route events deterministically using identity-based hashing. Events for the same user consistently arrive at the same partition, allowing sequential processing without global coordination.
PostHog relies on Kafka for ordering. Partition keys derived from identity and project context ensure consistent routing. Kafka provides ordering guarantees within partitions, keeping application logic stateless.
Snowplow reconstructs logical order using timestamps during enrichment and querying. This approach reduces ingestion constraints and shifts correctness toward timestamp quality.
Pattern: Partition-local ordering satisfies most event-processing needs and scales independently. Timestamp-based reconstruction supports high-throughput ingestion with fewer routing constraints.
What determines the choice:
- State mutation requirements
- Expected throughput
- Query semantics
- Available infrastructure
Durability: Latency and Guarantees
Durability defines when events become safe from loss.
RudderStack persists events synchronously to JobsDB (Postgres) before acknowledgment. This establishes durability at the ingestion boundary.
Lytics persists events to a local ring buffer backed by BoltDB before acknowledgment. Publication to Pub/Sub proceeds asynchronously, allowing fast response times with durable local storage.
Snowplow's collectors hold events briefly in memory and flush them to S3 or Kinesis, relying on cloud storage and stream guarantees reached shortly after acknowledgment rather than on local persistence.
PostHog publishes events to Kafka, with durability determined by replication configuration and acknowledgment strategy.
Pattern: Event collection systems ensure durability before acknowledgment. The primary distinction lies in where durability is implemented: local disk, transactional database, or replicated queue.
What determines the choice:
- Acceptable loss thresholds
- Latency budgets
- Operational complexity
- Infrastructure familiarity
Batching: Throughput and Latency
Batching governs how systems amortize write costs.
Snowplow batches aggressively, flushing to object storage at multi-minute intervals. This approach emphasizes write efficiency and cost control.
PostHog batches at the consumer layer. Kafka provides buffering, and ClickHouse receives grouped inserts that balance latency and efficiency.
Lytics uses dual triggers for flushing: size-based and time-based. High-volume partitions flush through size thresholds, while low-volume partitions flush periodically.
Pattern: Dual-trigger batching balances throughput and latency across uneven traffic patterns.
What determines the choice:
- Query freshness requirements
- Storage cost sensitivity
- Traffic distribution
- Storage backend behavior
Storage: Cost and Query Latency
Storage decisions determine how data is accessed and how quickly insights are available.
Snowplow and RudderStack persist events to object storage first, then load into warehouses. Object storage provides durable, low-cost archives, while warehouses support analytical queries.
PostHog writes directly to ClickHouse. Events become queryable within seconds, supported by a high-performance analytical database.
Lytics employs tiered storage. GCS stores archives, Spanner supports low-latency profile access, and BigTable serves time-series metrics. Each system aligns with a specific access pattern.
Pattern: Object storage supports durable archives, warehouses enable analytical queries, and specialized databases serve low-latency access. Tiered storage aligns cost and performance with usage.
What determines the choice:
- Query latency requirements
- Budget constraints
- Data access frequency
- Operational capacity
Language: Velocity and Performance
Language choice influences development speed and runtime behavior.
PostHog uses Python for application logic that evolves frequently and Rust for ingestion paths where throughput is critical. This division aligns language characteristics with system demands.
RudderStack, Snowplow, and Lytics use single-language architectures. This simplifies deployment and shared abstractions while balancing performance and developer productivity within one ecosystem.
eventkit is implemented entirely in Python. This supports rapid iteration and learning, with throughput bounded by the language runtime.
Pattern: Hybrid architectures align language choice with system layers. Single-language systems simplify operations while balancing competing needs globally.
What determines the choice:
- Team expertise
- Frequency of logic changes
- Performance targets
- Tolerance for operational complexity
The Meta-Pattern
Trade-offs define the architecture of event collection systems.
The systems examined converge because they respond to the same forces: durability, cost, latency, scale, and operational complexity. Differences reflect which constraints dominate in each context.
Snowplow emphasizes structural guarantees for downstream analytics. Lytics prioritizes integration breadth across heterogeneous sources. PostHog optimizes for real-time analytical access. RudderStack supports fan-out to multiple destinations.
Comparative study reveals these forces by examining multiple implementations. Each system illuminates part of the design space. Together, they make the constraints visible.
That visibility is the goal.
VI. First Python Project
This was my first Python project. Working in an unfamiliar language kept attention on structure and behavior.
The ecosystem felt recognizable. My day-to-day work is in TypeScript and Node, and Python's modern stack converges on similar ideas: async/await for concurrency, gradual typing through mypy, FastAPI providing a request model similar to Express, and Pydantic filling a role close to Zod. The syntax differed, but the conceptual models transferred directly.
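For instance, a request model in Pydantic reads much like a Zod schema would in TypeScript (a hypothetical model, not eventkit's actual one):

from pydantic import BaseModel, Field

class IncomingEvent(BaseModel):
    """Parsing and validation in one declaration, Zod-style."""
    event: str
    user_id: str | None = Field(default=None, alias="userId")
    properties: dict = Field(default_factory=dict)

# FastAPI validates request bodies against models like this automatically,
# much as Express + Zod middleware would.
IncomingEvent.model_validate({"event": "page_view", "userId": "u_123"})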
One feature stood out in particular: Protocol. Python's structural typing enabled a modular architecture with minimal ceremony.
from typing import Protocol
class EventAdapter(Protocol):
    def adapt(self, raw: RawEvent) -> AdapterResult:
        """Convert raw event to typed event."""
        ...
Any object that satisfies the interface shape qualifies as an adapter. There is no inheritance hierarchy and no registration step. Compatibility is defined entirely by structure.
class SegmentAdapter:
    def adapt(self, raw: RawEvent) -> AdapterResult:
        # Validate Segment format
        # Transform to canonical event
        ...

class SnowplowAdapter:
    def adapt(self, raw: RawEvent) -> AdapterResult:
        # Different validation rules
        # Same output shape
        ...
Adapters could be exchanged without modifying downstream code. The same approach extended across the system: storage backends (EventStore), queue implementations (EventQueue), validation layers, and subscription coordinators. Each subsystem remained pluggable and testable in isolation.
This reinforced the overall architecture. Components were loosely coupled. Interfaces stayed stable while implementations varied. The design aligned with the patterns observed in the production systems studied earlier.
The system resolved into a set of interchangeable parts:
- EventQueue protocol → AsyncQueue, PubSubQueue
- EventStore protocol → GCS-backed storage
- RingBuffer protocol → SQLite WAL implementation
- EventSubscriptionCoordinator → dual-path consumption pattern
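Because compatibility is structural, a test can substitute an in-memory queue for Pub/Sub without touching pipeline code (a sketch of the pattern; the enqueue signature is illustrative, not eventkit's exact API):

from typing import Protocol

class EventQueue(Protocol):
    async def enqueue(self, event: dict) -> None: ...

class InMemoryQueue:
    """Test double that satisfies EventQueue purely by shape."""
    def __init__(self) -> None:
        self.items: list[dict] = []

    async def enqueue(self, event: dict) -> None:
        self.items.append(event)

async def pipeline_step(queue: EventQueue, event: dict) -> None:
    await queue.enqueue(event)  # works with InMemoryQueue or PubSubQueue alike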
Despite being my first Python project, the architecture remained coherent. The patterns carried the design.
VII. Conclusion
This work began with a focused question: what happens after jstag.send()?
That call initiates validation, sequencing, buffering, and storage. Each stage encodes decisions about durability, cost, latency, and scale. Those decisions determine whether an event system continues to function as volume and complexity increase.
Studying four production event collectors revealed recurring structures. Building eventkit tested whether those structures were understood. Observing failure modes clarified why the patterns exist.
Adapters for validation, hash-based routing for sequencing, and write-ahead logs for durability appear across RudderStack, Snowplow, PostHog, and Lytics. The implementations differ. The forces shaping them remain consistent.
The method made those forces visible.
At Macalester, anthropology emphasized studying systems in context: observing behavior, identifying patterns, and separating universal pressures from local variation. Chemistry emphasized process discipline: isolating variables, respecting intermediates, and verifying results under stress. Both approaches transfer directly to engineering work.
Production codebases encode responses to scale, failure modes, operational limits, and cost. Those responses leave architectural traces. Studying multiple implementations side by side reveals which decisions arise from shared constraints and which reflect situational trade-offs.
Open source makes this accessible. Systems processing millions of events per day are available to inspect. Each exposes a different slice of the problem space. RudderStack emphasizes multi-destination routing. Snowplow centers schema governance. PostHog prioritizes real-time analytics. Lytics balances heterogeneous ingestion with tiered storage. Comparative reading reveals the underlying structure.
The working loop is straightforward:
- Identify a system to understand
- Study multiple implementations
- Separate universal pressures from contextual decisions
- Build a working version
- Observe failure modes
- Revisit the systems with refined questions
- Iterate
Building in Python for the first time enforced discipline. Familiar language patterns were unavailable as shortcuts. The patterns had to operate on their own terms. They held.
The weekend sprint produced a functional system. The following week hardened it against durability, cost, and observability constraints. SQLite WAL provided local durability. GCS and BigQuery addressed storage economics. Dual-path consumption mirrored Lytics' architecture. Prometheus metrics exposed system behavior.
The resulting components are production-shaped. They reflect the same pressures production CDPs face: durability before acknowledgment, batching for efficiency, tiered storage for cost control, and observability for operation.
This validates the method. Comparative study reveals structure. Building tests understanding. Iteration distinguishes universal patterns from contextual choices.
eventkit will continue to surface questions—performance tuning, load behavior, identity resolution—but it has already served its purpose. The patterns hold. The method works. The learning scales.
That's what I'm keeping.
Related
Related posts
- Building SDK Kit - Frontend patterns
- Respecting Intermediates - Lab methodology
- Those Who Like to Build - Why we build
Referenced systems
- RudderStack - Open-source CDP
- Snowplow - Behavioral data platform
- PostHog - Product analytics
eventkit (coming soon)
- GitHub repository - Source code
- Documentation - API reference