Complexity Analysis

Overview

PRSense is designed so that the cost of checking each PR grows sub-linearly with the number of PRs already indexed, using probabilistic data structures (Bloom filters) and approximate algorithms (approximate nearest-neighbour search).


Time Complexity

Per-PR Processing Pipeline

| Stage | Operation | Complexity | Notes |
|---|---|---|---|
| 1 | Bloom filter check | O(k) | k = 5 hash functions |
| 2 | Embedding generation | O(t) | t = text length |
| 3 | ANN search | O(log n) | n = total PRs |
| 4 | Candidate scoring | O(c·d) | c = candidates (20), d = dimensions (768) |
| 5 | Attribution update | O(1) | HashMap insertion |

Total per PR: O(t + log n + c·d)

For typical values (t=500, n=1M, c=20, d=768):

  • Text processing: ~500 ops
  • ANN search: ~20 ops (log₂(1M) ≈ 20)
  • Scoring: ~15,360 ops (20 × 768)
  • Total: ~16,000 operations ≈ 2ms on a modern CPU
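
As a sketch of how the five stages compose per PR (every name, type, and the 0.85 threshold below are hypothetical placeholders, not PRSense's actual API):

```typescript
// Illustrative sketch of the five-stage pipeline; every name here is a
// hypothetical placeholder, not PRSense's actual API.
interface PR { id: string; title: string; body: string; files: Set<string> }
interface Candidate { id: string; score: number }

interface Pipeline {
  bloomMightContain(pr: PR): boolean                            // Stage 1: O(k)
  embed(text: string): Promise<Float32Array>                    // Stage 2: O(t), or O(1) on cache hit
  annSearch(vector: Float32Array, k: number): Candidate[]       // Stage 3: O(log n)
  scoreCandidate(pr: PR, candidateId: string): number           // Stage 4: O(d + files) per candidate
  addAttribution(duplicatePr: string, originalPr: string): void // Stage 5: O(1)
}

async function checkPr(p: Pipeline, pr: PR, threshold = 0.85): Promise<string | null> {
  if (!p.bloomMightContain(pr)) return null               // most novel PRs exit here
  const vector = await p.embed(`${pr.title}\n${pr.body}`) // usually the slowest stage
  const candidates = p.annSearch(vector, 20)              // c = 20 candidates
  let best: Candidate | null = null
  for (const c of candidates) {                           // O(c·d) over the whole loop
    const score = p.scoreCandidate(pr, c.id)
    if (best === null || score > best.score) best = { id: c.id, score }
  }
  if (best !== null && best.score >= threshold) {
    p.addAttribution(pr.id, best.id)                      // record the duplicate edge
    return best.id
  }
  return null
}
```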

Embedder Latency Comparison

The bottleneck is stage 2 (embedding generation).

| Provider | Latency (avg) | Throughput | Cost | Notes |
|---|---|---|---|---|
| OpenAI (API) | 200-500ms | Limited by rate limit | $$ | High accuracy, network dependent |
| ONNX (Local) | 50-100ms | CPU dependent | Free | Faster, privacy-first, offline |
| Cached | < 1ms | >10k/sec | Free | O(1) lookup (Feature 4) |

Batch Processing Complexity

Using checkBatch() (Feature 3) amortizes overhead:

T_batch = T_overhead + (n · T_per_pr) / Parallelism

| Batch Size | Total Time (Serial) | Total Time (Parallel) | Speedup |
|---|---|---|---|
| 1 | 500ms | 500ms | 1x |
| 10 | 5000ms | 800ms | ~6x |
| 100 | 50s | 5s | ~10x |

Why?

  • Network requests are pipelined
  • Database connections are reused
  • Embedding model weights stay loaded in RAM
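
A sketch of this batching pattern under stated assumptions: a bounded worker pool keeps connections and model weights warm while bounding memory use. The signature and parallelism default are illustrative, not the actual checkBatch() API:

```typescript
// Hypothetical batching sketch; the real checkBatch() internals may differ.
async function checkBatch<T, R>(
  items: T[],
  checkOne: (item: T) => Promise<R>,
  parallelism = 8,
): Promise<R[]> {
  const results: R[] = new Array(items.length)
  let next = 0
  // A fixed pool of workers reuses connections and keeps model weights
  // loaded, instead of paying the setup overhead once per PR.
  const workers = Array.from({ length: parallelism }, async () => {
    while (next < items.length) {
      const i = next++
      results[i] = await checkOne(items[i])
    }
  })
  await Promise.all(workers)
  return results
}
```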

Space Complexity

Storage Requirements

| Component | Per-PR | Total (1M PRs) | Notes |
|---|---|---|---|
| Bloom filter | — | 1 MB | Shared across all PRs |
| Embeddings | 6 KB | 6 GB | 2 × 768 dims × 4 bytes |
| Metadata | 200 bytes | 200 MB | Cached in memory |
| Attribution graph | 20 bytes | 20 MB | Sparse edges |
| ANN index | 20 bytes | 20 MB | HNSW overhead |
| Total | ~6 KB | ~6.2 GB | |

Scaling Projections

| PRs | Storage | RAM Required |
|---|---|---|
| 10K | 60 MB | 100 MB |
| 100K | 600 MB | 1 GB |
| 1M | 6 GB | 8 GB |
| 10M | 60 GB | 80 GB |
| 100M | 600 GB | Requires distributed system |

Bloom Filter Analysis

Parameters

  • Size (m): 8192 bits = 1 KB
  • Hash functions (k): 5
  • Expected items (n): 1000

False Positive Rate

FPR = (1 - e^(-kn/m))^k
    = (1 - e^(-5·1000/8192))^5
    ≈ 2.0%
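
The same estimate in code (standard Bloom filter math, independent of any PRSense internals):

```typescript
// Theoretical false-positive rate for a Bloom filter with m bits,
// k hash functions, and n inserted items.
function bloomFpr(m: number, k: number, n: number): number {
  return Math.pow(1 - Math.exp((-k * n) / m), k)
}

bloomFpr(8192, 5, 1000) // ≈ 0.0199, i.e. about 2%
```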

Optimal Configuration

For different scales:

| PRs (n) | Bits (m) | Hashes (k) | FPR | Memory |
|---|---|---|---|---|
| 1K | 8 Kbit | 5 | ~2% | 1 KB |
| 10K | 80 Kbit | 5 | ~2% | 10 KB |
| 100K | 800 Kbit | 5 | ~2% | 100 KB |
| 1M | 8 Mbit | 5 | ~2% | 1 MB |
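
If a lower false-positive rate is needed than the fixed ~8 bits per item used above, the standard sizing formulas m = -n·ln(p)/(ln 2)² and k = (m/n)·ln 2 give the parameters for a target FPR. A small helper (general Bloom filter theory, not part of PRSense's API):

```typescript
// Standard Bloom filter sizing: bits and hash count for n items at target FPR p.
function bloomSize(n: number, p: number): { bits: number; hashes: number } {
  const bits = Math.ceil((-n * Math.log(p)) / Math.LN2 ** 2)
  const hashes = Math.max(1, Math.round((bits / n) * Math.LN2))
  return { bits, hashes }
}

bloomSize(1000, 0.01) // ≈ { bits: 9586, hashes: 7 } → roughly 1.2 KB per 1K PRs
```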

ANN Index Complexity

Build Time

Using HNSW algorithm:

T_build = O(n · log n · d)

For 1M PRs with 768-dim vectors:

T_build ≈ 1M · 20 · 768 = 15.4B operations
        ≈ 30 seconds (single-threaded)
        ≈ 3 seconds (10 threads)

Query Time

T_query = O(log n · ef_search)

Where ef_search = exploration factor (typically 50-100).

For 1M PRs:

T_query ≈ log₂(1M) · 50 = 1000 operations
        ≈ 0.1 ms

Memory Overhead

HNSW stores:

  • Vectors: n × d × 4 bytes
  • Graph edges: n × M × 4 bytes (M ≈ 16 neighbors)

For 1M PRs:

Memory = 1M · (768·4 + 16·4) bytes
       = 3.1 GB + 64 MB
       ≈ 3.2 GB
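
A quick estimator matching the calculation above. It follows the simple model used here (raw float32 vectors plus ~M neighbour ids per node); a real HNSW index also keeps a few sparse upper layers:

```typescript
// Rough HNSW memory estimate: raw vectors plus graph edges.
function hnswMemoryBytes(n: number, d: number, M = 16): number {
  const vectors = n * d * 4 // float32 components, 4 bytes each
  const edges = n * M * 4   // ~M neighbour ids (4 bytes each) per node
  return vectors + edges
}

hnswMemoryBytes(1_000_000, 768) / 1e9 // ≈ 3.1 GB
```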

Scoring Complexity

Cosine Similarity

function cosine(a: Float32Array, b: Float32Array): number {
  // O(d) where d = vector dimension
  const d = a.length
  let dot = 0, normA = 0, normB = 0
  for (let i = 0; i < d; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

Complexity: O(d) = O(768) ≈ 768 operations

Jaccard Similarity

function jaccard(a: Set<string>, b: Set<string>): number {
  // O(min(|a|, |b|)): iterate the smaller set, probe the larger
  const [smaller, larger] = a.size <= b.size ? [a, b] : [b, a]
  let intersection = 0
  for (const item of smaller) {
    if (larger.has(item)) intersection++
  }
  return intersection / (a.size + b.size - intersection)
}

Complexity: O(min(|a|, |b|)), typically 10-50 files

Total Scoring

Per candidate:

T_score = 2 · O(d) + O(files)
        = 2 · 768 + 30
        ≈ 1566 operations
        ≈ 0.1 ms

For 20 candidates:

T_total = 20 · 0.1 ms = 2 ms
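
Putting the pieces together for one candidate, as the cost model above assumes: two cosine comparisons (e.g. title and body embeddings, matching the 2 × 768-dim storage figure) plus one Jaccard comparison over changed files. The weights and field names are illustrative, not PRSense's actual scoring function; cosine() and jaccard() are the helpers defined above:

```typescript
// Hypothetical combined score; weights and field choices are assumptions.
interface ScoredPR {
  titleVec: Float32Array // 768 dims
  bodyVec: Float32Array  // 768 dims
  files: Set<string>
}

function scoreCandidate(a: ScoredPR, b: ScoredPR): number {
  const title = cosine(a.titleVec, b.titleVec) // O(d)
  const body = cosine(a.bodyVec, b.bodyVec)    // O(d)
  const files = jaccard(a.files, b.files)      // O(min(|a|, |b|))
  return 0.4 * title + 0.4 * body + 0.2 * files
}
```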

Attribution Graph Complexity

Operations

Add Edge: O(1)

parent.set(duplicatePr, originalPr)      // O(1) HashMap
children.get(originalPr).add(duplicatePr) // O(1) Set

Get Original: O(depth)

while (parent.has(current)) {
  current = parent.get(current)  // O(1) per hop
}

Typical depth: 2-3 levels (most duplicates are 1 hop from original)

Get All Duplicates: O(descendants)

// DFS traversal
while (stack.length > 0) {
  const node = stack.pop()!
  for (const child of children.get(node) ?? []) {
    result.push(child)
    stack.push(child)
  }
}

Average case: 5-10 duplicates per PR
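
A minimal sketch tying the three operations together; the class and method names are illustrative, not PRSense's actual API:

```typescript
// Illustrative attribution graph; names are hypothetical placeholders.
class AttributionGraph {
  private parent = new Map<string, string>()        // duplicate -> original
  private children = new Map<string, Set<string>>() // original -> duplicates

  // O(1): one Map insert plus one Set insert
  addEdge(duplicatePr: string, originalPr: string): void {
    this.parent.set(duplicatePr, originalPr)
    if (!this.children.has(originalPr)) this.children.set(originalPr, new Set())
    this.children.get(originalPr)!.add(duplicatePr)
  }

  // O(depth): follow parent pointers to the root
  getOriginal(pr: string): string {
    let current = pr
    while (this.parent.has(current)) {
      current = this.parent.get(current)!
    }
    return current
  }

  // O(descendants): DFS over the children sets
  getAllDuplicates(pr: string): string[] {
    const result: string[] = []
    const stack = [pr]
    while (stack.length > 0) {
      const node = stack.pop()!
      for (const child of this.children.get(node) ?? []) {
        result.push(child)
        stack.push(child)
      }
    }
    return result
  }
}
```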


Bottleneck Analysis

Latency Breakdown (1M PRs, 20 candidates)

Bloom filter:       0.001 ms  ( 0.05%)
Embedding lookup:   1.000 ms  (50.00%)  ← Bottleneck
ANN search:         0.100 ms  ( 5.00%)
Scoring (20 cands): 0.800 ms  (40.00%)
Attribution:        0.001 ms  ( 0.05%)
Other:              0.098 ms  ( 4.90%)
─────────────────────────────────────
Total:             ~2.0 ms   (100.00%)

Optimization Strategies

  1. Cache embeddings → Eliminates ~99% of latency (Feature 4); see the sketch after this list
    • First run: 500ms
    • Subsequent runs: <2ms
  2. Reduce ANN candidates (20 → 10) → Scoring cost halves, ~20% lower total latency
  3. Quantize vectors (float32 → int8) → 4× smaller, 2× faster
  4. Batch processing → Amortize overhead
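
Strategy 1 can be as little as a content-addressed map in front of the embedder. The sketch below assumes an in-memory Map and a SHA-256 key; Feature 4's actual cache may be keyed and persisted differently:

```typescript
import { createHash } from "node:crypto"

// Hypothetical embedding cache: O(1) lookup keyed by a content hash.
const embeddingCache = new Map<string, Float32Array>()

async function cachedEmbed(
  text: string,
  embed: (t: string) => Promise<Float32Array>,
): Promise<Float32Array> {
  const key = createHash("sha256").update(text).digest("hex")
  const hit = embeddingCache.get(key)
  if (hit) return hit              // < 1ms: skips the 50-500ms model call
  const vector = await embed(text) // first run pays the full embedding cost
  embeddingCache.set(key, vector)
  return vector
}
```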

Scalability Limits

Single Machine (8 GB RAM)

  • Max PRs: ~1M
  • Latency: ~2ms per PR
  • Throughput: 500 PRs/sec (single-threaded)

Multi-Machine (Distributed)

  • Max PRs: ~100M
  • Strategy: Shard by repository
  • Latency: ~10ms (network overhead)
  • Throughput: 10K PRs/sec (20 machines)

GPU Acceleration

  • Cosine similarity: 100× faster on GPU
  • Batch size: 1000 PRs
  • Throughput: 50K PRs/sec

Cross-Repository Scaling

For organization-wide checks (Feature 8):

T_cross_repo = T_single · R

Where R = number of repositories.

| Repos | PRs Total | Search Strategy | Latency |
|---|---|---|---|
| 1 | 1K | Local Memory | 2ms |
| 10 | 10K | Local Memory | 5ms |
| 100 | 100K | Postgres Index | 20ms |
| 1000 | 1M | Sharded Cluster | 50ms |
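
Sharding by repository (the strategy noted under Multi-Machine above) only needs a stable hash of the repository name, so every check for a given repo lands on the same shard. The hash function and shard count here are illustrative:

```typescript
// Hypothetical shard routing: a stable FNV-1a hash of the repo name picks
// the shard, keeping all PRs from one repository in the same ANN index.
function shardFor(repo: string, shardCount: number): number {
  let h = 2166136261 // FNV-1a 32-bit offset basis
  for (let i = 0; i < repo.length; i++) {
    h ^= repo.charCodeAt(i)
    h = Math.imul(h, 16777619) // FNV prime
  }
  return (h >>> 0) % shardCount
}

shardFor("acme/widgets", 20) // → shard index in [0, 20)
```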

Worst-Case Scenarios

All PRs pass Bloom filter

Time = n · (ANN + scoring)
     = 1M · 2ms
     = 2000 seconds
     = 33 minutes

Mitigation: Bloom filter properly tuned → <1% false positives

Very long PR descriptions

Embedding time = O(text_length)
Max practical: 10,000 tokens ≈ 10ms

Mitigation: Truncate to 512 tokens

Deep attribution chains

getOriginal time = O(depth)
Pathological: 100 hops ≈ 0.1ms

Mitigation: Detect cycles, limit depth to 10
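
A sketch of that mitigation: cap the walk at a fixed depth and stop if a node repeats. The depth limit of 10 comes from the line above; the rest is an illustrative implementation:

```typescript
// getOriginal with the cycle and depth guards described above.
function getOriginal(parent: Map<string, string>, pr: string, maxDepth = 10): string {
  const seen = new Set<string>([pr])
  let current = pr
  for (let depth = 0; depth < maxDepth && parent.has(current); depth++) {
    const next = parent.get(current)!
    if (seen.has(next)) break // cycle detected: stop at the last good node
    seen.add(next)
    current = next
  }
  return current
}
```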


Comparison to Naive Approach

Brute Force: Compare all pairs

Time = O(n²) = 1M² = 1 trillion comparisons
     ≈ 1000 hours (infeasible)

PRSense: Bloom + ANN + Scoring

Time = O(n · log n) = 1M · 20 = 20M comparisons
     ≈ 40 seconds

Speedup: 90,000× faster than brute force


Future Optimizations

  1. Product Quantization: 8× smaller vectors
  2. Inverted Index: O(1) file overlap checks
  3. GPU Batching: 100× faster scoring
  4. Distributed ANN: Horizontal scaling beyond 10M PRs