Complexity Analysis

Overview

PRSense is designed so that the cost of checking each PR grows sub-linearly with the number of PRs already indexed, using probabilistic data structures (Bloom filters) and approximate algorithms (approximate nearest-neighbour search).


Time Complexity

Per-PR Processing Pipeline

| Stage | Operation | Complexity | Notes |
|---|---|---|---|
| 1 | Bloom filter check | O(k) | k = 5 hash functions |
| 2 | Embedding generation | O(t) | t = text length |
| 3 | ANN search | O(log n) | n = total PRs |
| 4 | Candidate scoring | O(c·d) | c = candidates (20), d = dimensions (768) |
| 5 | Attribution update | O(1) | HashMap insertion |

Total per PR: O(t + log n + c·d)

For typical values (t=500, n=1M, c=20, d=768):

  • Text processing: ~500 ops
  • ANN search: ~20 ops (log₂(1M) ≈ 20)
  • Scoring: ~15,360 ops (20 × 768)
  • Total: ~16,000 operations ≈ 2ms on a modern CPU
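
As a sketch of how the five stages compose per PR (every name, type, and the 0.85 threshold below are hypothetical placeholders, not PRSense's actual API):

```typescript
// Illustrative sketch of the five-stage pipeline; every name here is a
// hypothetical placeholder, not PRSense's actual API.
interface PR { id: string; title: string; body: string; files: Set<string> }
interface Candidate { id: string; score: number }

interface Pipeline {
  bloomMightContain(pr: PR): boolean                            // Stage 1: O(k)
  embed(text: string): Promise<Float32Array>                    // Stage 2: O(t), or O(1) on cache hit
  annSearch(vector: Float32Array, k: number): Candidate[]       // Stage 3: O(log n)
  scoreCandidate(pr: PR, candidateId: string): number           // Stage 4: O(d + files) per candidate
  addAttribution(duplicatePr: string, originalPr: string): void // Stage 5: O(1)
}

async function checkPr(p: Pipeline, pr: PR, threshold = 0.85): Promise<string | null> {
  if (!p.bloomMightContain(pr)) return null               // most novel PRs exit here
  const vector = await p.embed(`${pr.title}\n${pr.body}`) // usually the slowest stage
  const candidates = p.annSearch(vector, 20)              // c = 20 candidates
  let best: Candidate | null = null
  for (const c of candidates) {                           // O(c·d) over the whole loop
    const score = p.scoreCandidate(pr, c.id)
    if (best === null || score > best.score) best = { id: c.id, score }
  }
  if (best !== null && best.score >= threshold) {
    p.addAttribution(pr.id, best.id)                      // record the duplicate edge
    return best.id
  }
  return null
}
```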

Embedder Latency Comparison

The bottleneck is stage 2 (embedding generation).

| Provider | Latency (avg) | Throughput | Cost | Notes |
|---|---|---|---|---|
| OpenAI (API) | 200-500ms | Limited by rate limit | $$ | High accuracy, network dependent |
| ONNX (Local) | 50-100ms | CPU dependent | Free | Faster, privacy-first, offline |
| Cached | < 1ms | >10k/sec | Free | O(1) lookup (Feature 4) |

Batch Processing Complexity

Using checkBatch() (Feature 3) amortizes overhead:

T_batch = T_overhead + (n · T_per_pr) / Parallelism

| Batch Size | Total Time (Serial) | Total Time (Parallel) | Speedup |
|---|---|---|---|
| 1 | 500ms | 500ms | 1x |
| 10 | 5000ms | 800ms | ~6x |
| 100 | 50s | 5s | ~10x |

Why?

  • Network requests are pipelined
  • Database connections are reused
  • Embedding model weights stay loaded in RAM
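
A sketch of this batching pattern under stated assumptions: a bounded worker pool keeps connections and model weights warm while bounding memory use. The signature and parallelism default are illustrative, not the actual checkBatch() API:

```typescript
// Hypothetical batching sketch; the real checkBatch() internals may differ.
async function checkBatch<T, R>(
  items: T[],
  checkOne: (item: T) => Promise<R>,
  parallelism = 8,
): Promise<R[]> {
  const results: R[] = new Array(items.length)
  let next = 0
  // A fixed pool of workers reuses connections and keeps model weights
  // loaded, instead of paying the setup overhead once per PR.
  const workers = Array.from({ length: parallelism }, async () => {
    while (next < items.length) {
      const i = next++
      results[i] = await checkOne(items[i])
    }
  })
  await Promise.all(workers)
  return results
}
```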

Space Complexity

Storage Requirements

| Component | Per-PR | Total (1M PRs) | Notes |
|---|---|---|---|
| Bloom filter | — | 1 MB | Shared across all PRs |
| Embeddings | 6 KB | 6 GB | 2 × 768 dims × 4 bytes |
| Metadata | 200 bytes | 200 MB | Cached in memory |
| Attribution graph | 20 bytes | 20 MB | Sparse edges |
| ANN index | 20 bytes | 20 MB | HNSW overhead |
| Total | ~6 KB | ~6.2 GB | |

Scaling Projections

| PRs | Storage | RAM Required |
|---|---|---|
| 10K | 60 MB | 100 MB |
| 100K | 600 MB | 1 GB |
| 1M | 6 GB | 8 GB |
| 10M | 60 GB | 80 GB |
| 100M | 600 GB | Requires distributed system |

Bloom Filter Analysis

Parameters

  • Size (m): 8192 bits = 1 KB
  • Hash functions (k): 5
  • Expected items (n): 1000

False Positive Rate

FPR = (1 - e^(-kn/m))^k
    = (1 - e^(-5·1000/8192))^5
    ≈ 2.0%
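
The same estimate in code (standard Bloom filter math, independent of any PRSense internals):

```typescript
// Theoretical false-positive rate for a Bloom filter with m bits,
// k hash functions, and n inserted items.
function bloomFpr(m: number, k: number, n: number): number {
  return Math.pow(1 - Math.exp((-k * n) / m), k)
}

bloomFpr(8192, 5, 1000) // ≈ 0.0199, i.e. about 2%
```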

Optimal Configuration

For different scales:

| PRs (n) | Bits (m) | Hashes (k) | FPR | Memory |
|---|---|---|---|---|
| 1K | 8 Kbit | 5 | ~2% | 1 KB |
| 10K | 80 Kbit | 5 | ~2% | 10 KB |
| 100K | 800 Kbit | 5 | ~2% | 100 KB |
| 1M | 8 Mbit | 5 | ~2% | 1 MB |
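
If a lower false-positive rate is needed than the fixed ~8 bits per item used above, the standard sizing formulas m = -n·ln(p)/(ln 2)² and k = (m/n)·ln 2 give the parameters for a target FPR. A small helper (general Bloom filter theory, not part of PRSense's API):

```typescript
// Standard Bloom filter sizing: bits and hash count for n items at target FPR p.
function bloomSize(n: number, p: number): { bits: number; hashes: number } {
  const bits = Math.ceil((-n * Math.log(p)) / Math.LN2 ** 2)
  const hashes = Math.max(1, Math.round((bits / n) * Math.LN2))
  return { bits, hashes }
}

bloomSize(1000, 0.01) // ≈ { bits: 9586, hashes: 7 } → roughly 1.2 KB per 1K PRs
```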

ANN Index Complexity

Build Time

Using HNSW algorithm:

T_build = O(n · log n · d)

For 1M PRs with 768-dim vectors:

T_build ≈ 1M · 20 · 768 = 15.4B operations
        ≈ 30 seconds (single-threaded)
        ≈ 3 seconds (10 threads)

Query Time

T_query = O(log n · ef_search)

Where ef_search = exploration factor (typically 50-100).

For 1M PRs:

T_query ≈ log₂(1M) · 50 = 1000 operations
        ≈ 0.1 ms

Memory Overhead

HNSW stores:

  • Vectors: n × d × 4 bytes
  • Graph edges: n × M × 4 bytes (M ≈ 16 neighbors)

For 1M PRs:

Memory = 1M · (768·4 + 16·4) bytes
       = 3.1 GB + 64 MB
       ≈ 3.2 GB
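
A quick estimator matching the calculation above. It follows the simple model used here (raw float32 vectors plus ~M neighbour ids per node); a real HNSW index also keeps a few sparse upper layers:

```typescript
// Rough HNSW memory estimate: raw vectors plus graph edges.
function hnswMemoryBytes(n: number, d: number, M = 16): number {
  const vectors = n * d * 4 // float32 components, 4 bytes each
  const edges = n * M * 4   // ~M neighbour ids (4 bytes each) per node
  return vectors + edges
}

hnswMemoryBytes(1_000_000, 768) / 1e9 // ≈ 3.1 GB
```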

Scoring Complexity

Cosine Similarity

function cosine(a: Float32Array, b: Float32Array): number {
  // O(d) where d = vector dimension
  const d = a.length
  let dot = 0, normA = 0, normB = 0
  for (let i = 0; i < d; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

Complexity: O(d) = O(768) ≈ 768 operations

Jaccard Similarity

function jaccard(a: Set<string>, b: Set<string>): number {
  // O(min(|a|, |b|)): iterate the smaller set, probe the larger
  const [smaller, larger] = a.size <= b.size ? [a, b] : [b, a]
  let intersection = 0
  for (const item of smaller) {
    if (larger.has(item)) intersection++
  }
  return intersection / (a.size + b.size - intersection)
}

Complexity: O(min(|a|, |b|)), typically 10-50 files

Total Scoring

Per candidate:

T_score = 2 · O(d) + O(files)
        = 2 · 768 + 30
        ≈ 1566 operations
        ≈ 0.1 ms

For 20 candidates:

T_total = 20 · 0.1 ms = 2 ms
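
Putting the pieces together for one candidate, as the cost model above assumes: two cosine comparisons (e.g. title and body embeddings, matching the 2 × 768-dim storage figure) plus one Jaccard comparison over changed files. The weights and field names are illustrative, not PRSense's actual scoring function; cosine() and jaccard() are the helpers defined above:

```typescript
// Hypothetical combined score; weights and field choices are assumptions.
interface ScoredPR {
  titleVec: Float32Array // 768 dims
  bodyVec: Float32Array  // 768 dims
  files: Set<string>
}

function scoreCandidate(a: ScoredPR, b: ScoredPR): number {
  const title = cosine(a.titleVec, b.titleVec) // O(d)
  const body = cosine(a.bodyVec, b.bodyVec)    // O(d)
  const files = jaccard(a.files, b.files)      // O(min(|a|, |b|))
  return 0.4 * title + 0.4 * body + 0.2 * files
}
```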

Attribution Graph Complexity

Operations

Add Edge: O(1)

parent.set(duplicatePr, originalPr)      // O(1) HashMap
children.get(originalPr).add(duplicatePr) // O(1) Set

Get Original: O(depth)

while (parent.has(current)) {
  current = parent.get(current)  // O(1) per hop
}

Typical depth: 2-3 levels (most duplicates are 1 hop from original)

Get All Duplicates: O(descendants)

// DFS traversal
while (stack.length > 0) {
  const node = stack.pop()!
  for (const child of children.get(node) ?? []) {
    result.push(child)
    stack.push(child)
  }
}

Average case: 5-10 duplicates per PR
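
A minimal sketch tying the three operations together; the class and method names are illustrative, not PRSense's actual API:

```typescript
// Illustrative attribution graph; names are hypothetical placeholders.
class AttributionGraph {
  private parent = new Map<string, string>()        // duplicate -> original
  private children = new Map<string, Set<string>>() // original -> duplicates

  // O(1): one Map insert plus one Set insert
  addEdge(duplicatePr: string, originalPr: string): void {
    this.parent.set(duplicatePr, originalPr)
    if (!this.children.has(originalPr)) this.children.set(originalPr, new Set())
    this.children.get(originalPr)!.add(duplicatePr)
  }

  // O(depth): follow parent pointers to the root
  getOriginal(pr: string): string {
    let current = pr
    while (this.parent.has(current)) {
      current = this.parent.get(current)!
    }
    return current
  }

  // O(descendants): DFS over the children sets
  getAllDuplicates(pr: string): string[] {
    const result: string[] = []
    const stack = [pr]
    while (stack.length > 0) {
      const node = stack.pop()!
      for (const child of this.children.get(node) ?? []) {
        result.push(child)
        stack.push(child)
      }
    }
    return result
  }
}
```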


Bottleneck Analysis

Latency Breakdown (1M PRs, 20 candidates)

Bloom filter:       0.001 ms  ( 0.05%)
Embedding lookup:   1.000 ms  (50.00%)  ← Bottleneck
ANN search:         0.100 ms  ( 5.00%)
Scoring (20 cands): 0.800 ms  (40.00%)
Attribution:        0.001 ms  ( 0.05%)
Other:              0.098 ms  ( 4.90%)
─────────────────────────────────────
Total:             ~2.0 ms   (100.00%)

Optimization Strategies

  1. Cache embeddings → Eliminates ~99% of latency (Feature 4); see the sketch after this list
    • First run: 500ms
    • Subsequent runs: <2ms
  2. Reduce ANN candidates (20 → 10) → Scoring cost halves, ~20% lower total latency
  3. Quantize vectors (float32 → int8) → 4× smaller, 2× faster
  4. Batch processing → Amortize overhead
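
Strategy 1 can be as little as a content-addressed map in front of the embedder. The sketch below assumes an in-memory Map and a SHA-256 key; Feature 4's actual cache may be keyed and persisted differently:

```typescript
import { createHash } from "node:crypto"

// Hypothetical embedding cache: O(1) lookup keyed by a content hash.
const embeddingCache = new Map<string, Float32Array>()

async function cachedEmbed(
  text: string,
  embed: (t: string) => Promise<Float32Array>,
): Promise<Float32Array> {
  const key = createHash("sha256").update(text).digest("hex")
  const hit = embeddingCache.get(key)
  if (hit) return hit              // < 1ms: skips the 50-500ms model call
  const vector = await embed(text) // first run pays the full embedding cost
  embeddingCache.set(key, vector)
  return vector
}
```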

Scalability Limits

Single Machine (8 GB RAM)

  • Max PRs: ~1M
  • Latency: ~2ms per PR
  • Throughput: 500 PRs/sec (single-threaded)

Multi-Machine (Distributed)

  • Max PRs: ~100M
  • Strategy: Shard by repository
  • Latency: ~10ms (network overhead)
  • Throughput: 10K PRs/sec (20 machines)

GPU Acceleration

  • Cosine similarity: 100× faster on GPU
  • Batch size: 1000 PRs
  • Throughput: 50K PRs/sec

Cross-Repository Scaling

For organization-wide checks (Feature 8):

T_cross_repo = T_single · R

Where R = number of repositories.

| Repos | PRs Total | Search Strategy | Latency |
|---|---|---|---|
| 1 | 1K | Local Memory | 2ms |
| 10 | 10K | Local Memory | 5ms |
| 100 | 100K | Postgres Index | 20ms |
| 1000 | 1M | Sharded Cluster | 50ms |
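
Sharding by repository (the strategy noted under Multi-Machine above) only needs a stable hash of the repository name, so every check for a given repo lands on the same shard. The hash function and shard count here are illustrative:

```typescript
// Hypothetical shard routing: a stable FNV-1a hash of the repo name picks
// the shard, keeping all PRs from one repository in the same ANN index.
function shardFor(repo: string, shardCount: number): number {
  let h = 2166136261 // FNV-1a 32-bit offset basis
  for (let i = 0; i < repo.length; i++) {
    h ^= repo.charCodeAt(i)
    h = Math.imul(h, 16777619) // FNV prime
  }
  return (h >>> 0) % shardCount
}

shardFor("acme/widgets", 20) // → shard index in [0, 20)
```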

Worst-Case Scenarios

All PRs pass Bloom filter

Time = n · (ANN + scoring)
     = 1M · 2ms
     = 2000 seconds
     = 33 minutes

Mitigation: Bloom filter properly tuned → <1% false positives

Very long PR descriptions

Embedding time = O(text_length)
Max practical: 10,000 tokens ≈ 10ms

Mitigation: Truncate to 512 tokens

Deep attribution chains

getOriginal time = O(depth)
Pathological: 100 hops ≈ 0.1ms

Mitigation: Detect cycles, limit depth to 10
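
A sketch of that mitigation: cap the walk at a fixed depth and stop if a node repeats. The depth limit of 10 comes from the line above; the rest is an illustrative implementation:

```typescript
// getOriginal with the cycle and depth guards described above.
function getOriginal(parent: Map<string, string>, pr: string, maxDepth = 10): string {
  const seen = new Set<string>([pr])
  let current = pr
  for (let depth = 0; depth < maxDepth && parent.has(current); depth++) {
    const next = parent.get(current)!
    if (seen.has(next)) break // cycle detected: stop at the last good node
    seen.add(next)
    current = next
  }
  return current
}
```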


Comparison to Naive Approach

Brute Force: Compare all pairs

Time = O(n²) = 1M² = 1 trillion comparisons
     ≈ 1000 hours (infeasible)

PRSense: Bloom + ANN + Scoring

Time = O(n · log n) = 1M · 20 = 20M comparisons
     ≈ 40 seconds

Speedup: 90,000× faster than brute force


Future Optimizations

  1. Product Quantization: 8× smaller vectors
  2. Inverted Index: O(1) file overlap checks
  3. GPU Batching: 100× faster scoring
  4. Distributed ANN: Horizontal scaling beyond 10M PRs