Data Structures

Overview

PRSense uses carefully chosen data structures to balance performance, accuracy, and maintainability.

1. Bloom Filter

Purpose

Probabilistic set for O(1) duplicate rejection.

Implementation

class BloomFilter {
private bits: Uint8Array // Bit array
private size: number // Array size
private hashes: number // Number of hash functions
}

Properties

Space: 8192 bytes (8 KB) by default
False positive rate: ~1% with 5 hash functions
False negatives: 0% (guaranteed)

Tuning

m = -n * ln(p) / (ln(2)^2)
k = (m / n) * ln(2)

Where:

m = bit array size
n = expected number of PRs
p = target false positive rate
k = optimal number of hash functions

Example

For 1000 PRs with 1% FP rate:

m ≈ 9585 bits
k ≈ 7 hash functions

2. Attribution Graph

Purpose

Directed acyclic graph (DAG) tracking PR duplication lineage.

Structure

class AttributionGraph {
private parent: Map<number, number> // child → parent
private children: Map<number, Set<number>> // parent → children
}

Invariants

Acyclic: No PR can be its own ancestor
Single parent: Each PR has at most one original
Multiple children: Original PR can have many duplicates

Visual Example

PR #100 (original)
├── PR #101 (duplicate)
├── PR #102 (duplicate)
└── PR #103 (duplicate)
└── PR #104 (duplicate of duplicate)

Operations

Operation	Time	Description
`addEdge(dup, orig)`	O(1)	Link duplicate to original
`getOriginal(pr)`	O(depth)	Find root ancestor
`getAllDuplicates(pr)`	O(n)	Get all descendants

3. Embedding Storage

Format

interface EmbeddingSet {
text: Float32Array // 384-768 dimensions
diff: Float32Array // 384-768 dimensions
}

Memory Considerations

For 1M PRs with 768-dim embeddings:

storage = 1M × 2 embeddings × 768 dims × 4 bytes/float
= 6.1 GB

Optimization Strategies

Quantization: 4-byte float → 1-byte int8 (4× smaller)
PCA reduction: 768 dims → 256 dims (3× smaller)
Sparse storage: Only store non-zero values

4. Candidate Queue

Purpose

Priority queue for re-ranking ANN results.

Structure

interface Candidate {
prId: number
scoreHint: number // Initial ANN score
}

Usage

// ANN returns candidates with approximate scores
const candidates = annIndex.search(embedding, k=20)

// Re-rank with exact multi-signal scoring
candidates.sort((a, b) => 
exactScore(b) - exactScore(a)
)

Batch Results (Feature 3)

interface BatchResult {
prId: number
bestMatch?: {
originalPrId: number
confidence: number
type: 'DUPLICATE' | 'POSSIBLE'
}
processingTimeMs: number
}

5. File Path Set

Purpose

Fast file overlap computation.

Implementation

type FilePathSet = Set<string>

Why Set?

O(1) membership test: set.has(path)
Native deduplication: No manual handling
Memory efficient: Stores unique paths only

Example

const files1 = new Set(['auth/login.ts', 'auth/utils.ts'])
const files2 = new Set(['auth/login.ts', 'auth/session.ts'])

// Jaccard similarity
const intersection = [...files1].filter(f => files2.has(f)).length
const union = files1.size + files2.size - intersection
const score = intersection / union // 0.33

6. Score Explanation (Feature 2)

Breakdown Structure

interface ScoreBreakdown {
total: number // 0.0 - 1.0
components: {
text: number // Weighted text similarity
diff: number // Weighted diff similarity
files: number // Weighted file overlap
}
weights: {
text: number
diff: number
files: number
}
confidence: 'HIGH' | 'MEDIUM' | 'LOW'
}

Configurable Weights (Feature 5)

interface WeightConfig {
textWeight: number // Default: 0.45
diffWeight: number // Default: 0.35
fileWeight: number // Default: 0.20
// Sum must equal 1.0
}

7. Cross-Repo Detection (Feature 8)

Match Structure

interface CrossRepoMatch {
sourcePr: {
repoId: string
prId: number
}
targetPr: {
repoId: string
prId: number
}
score: number
}

8. PR Metadata Cache

Purpose

Avoid re-fetching PR data during scoring.

Structure

interface PRMetadata {
prId: number
repoId: number
authorId: number
title: string
description: string
createdAt: number
mergedAt?: number
}

Indexing

const prCache = new Map<number, PRMetadata>()

Eviction Policy

LRU (Least Recently Used): Keep hot PRs in memory
TTL (Time To Live): Expire old PRs after N days
Size-based: Cap at max memory usage

9. ANN Index

Interface

interface ANNIndex {
search(vector: Float32Array, k: number): Candidate[]
}

Implementation Options

HNSW (Hierarchical Navigable Small World)

Build time: O(n log n)
Query time: O(log n)
Memory: ~20 bytes per vector
Best for: <10M vectors

IVF (Inverted File Index)

Build time: O(n · k) for k clusters
Query time: O(n_probe + k)
Memory: ~8 bytes per vector
Best for: >10M vectors

Memory Layout Summary

For 1M PRs in a production system:

Component	Size	Notes
Bloom filter	1 MB	8192 bits
Embeddings	6 GB	768-dim float32
Attribution graph	20 MB	~20 edges per PR
PR metadata	200 MB	Cached subset
ANN index	20 MB	HNSW overhead
Total	~6.2 GB	Fits in single machine

Design Tradeoffs

Bloom Filter

Pro: Fast O(1) rejection
Con: False positives possible
Choice: Acceptable for early filtering

In-Memory Storage

Pro: Low latency
Con: Memory constraints
Choice: Optimize for <10M PRs

Float32 Embeddings

Pro: Better accuracy than quantized
Con: 4× larger than int8
Choice: Accuracy over space (for now)

Future Optimizations

Disk-backed storage for cold PRs
GPU acceleration for batch scoring
Distributed index for >100M PRs
Compressed embeddings (Product Quantization)