Algorithms
Overview
PRSense uses a multi-stage pipeline combining probabilistic data structures, approximate nearest neighbor search, and weighted scoring. It is designed for both real-time interactive use and high-throughput batch processing in CI/CD pipelines.
1. Bloom Filter (Fast Rejection)
Purpose
Early rejection of obviously unrelated PRs in O(1) time.
Algorithm
BloomFilter.add(key):
for i in [1..k]:
index = hash(key, seed=i) % size
bits[index] = 1
BloomFilter.mightContain(key):
for i in [1..k]:
index = hash(key, seed=i) % size
if bits[index] == 0:
return false
return true
Properties
- False positives: Allowed (conservatively passes PRs to next stage)
- False negatives: Forbidden (never rejects a potential match)
- Space: O(n) where n = number of PRs
- Time: O(k) where k = number of hash functions (typically 5)
Hash Function
Simple polynomial rolling hash for deterministic behavior across environments.
2. Embedding Generation
Purpose
Convert PR text and diff into fixed-size vectors for similarity comparison.
Pipeline
EmbeddingPipeline.run(title, description, diff):
text = title + "\n" + description
textEmbedding = embedder.embedText(text)
diffEmbedding = embedder.embedDiff(diff)
return { textEmbedding, diffEmbedding }
Supported Backends
PRSense supports pluggable embedders:
- OpenAI (Default): Uses
text-embedding-3-smallfor high accuracy. - ONNX (Offline): Uses
all-MiniLM-L6-v2running locally via ONNX Runtime. Privacy-first option requiring no API keys or external data transfer (Feature 7).
Optimization
- Caching: LRU cache stores computed embeddings to reduce API costs and latency for repeated checks (Feature 4).
3. Approximate Nearest Neighbor (ANN) Search
Purpose
Find top-k candidate PRs similar to the query PR.
Interface
ANNIndex.search(vector, k) → Candidate[]
Implementation Options
- Memory (Default): Exact brute-force search (fast for <10k PRs).
- PostgreSQL: Uses
pgvectorwith HNSW indexing for efficient similarity search at scale. - SQLite: Uses vector extensions when available.
HNSW (Hierarchical Navigable Small World)
HNSW is a graph-based algorithm for approximate nearest neighbor search:
- Builds a multi-layer graph where higher layers contain fewer, well-connected nodes
- Search starts from the top layer and greedily descends to find nearest neighbors
- Provides O(log n) query time with high recall (>95%)
Complexity
- Build: O(n log n)
- Query: O(log n) expected with HNSW/IVF
4. Multi-Signal Scoring (Ranker)
Purpose
Combine multiple similarity signals into a single confidence score.
Formula
score = w₁ · cosine(text₁, text₂)
+ w₂ · cosine(diff₁, diff₂)
+ w₃ · jaccard(files₁, files₂)
Configurable Weights
Weights can be tuned via setWeights() to adapt to different repository styles (Feature 5).
| Component | Default Weight | Represents |
|---|---|---|
| w₁ Text | 0.45 | Intent matching (What is this trying to do?) |
| w₂ Diff | 0.35 | Implementation matching (How does it do it?) |
| w₃ Files | 0.20 | Structural similarity (Where is it changing?) |
Cosine Similarity
cosine(a, b) = (a · b) / (||a|| × ||b||)
- Range: [-1, 1], normalized to [0, 1] for scoring.
Jaccard Similarity
jaccard(A, B) = |A ∩ B| / |A ∪ B|
- Measures overlap of touched files.
5. Threshold Classification
Purpose
Convert continuous score into discrete confidence levels.
Thresholds
classify(score):
if score >= 0.90: return HIGH
if score >= 0.82: return MEDIUM
return LOW
Rationale
- 0.90: Very high confidence (safe for automated labels/comments).
- 0.82: Medium confidence (requires human review).
- <0.82: Low confidence (ignored to prevent noise).
6. Decision Engine
Purpose
Map confidence levels to concrete actions.
Rules
decide(score, originalPrId):
if score >= 0.90:
return { type: 'DUPLICATE', originalPr }
if score >= 0.82:
return { type: 'POSSIBLE', originalPr }
return { type: 'IGNORE' }
Explainability
Using checkDetailed(), the engine provides a full breakdown of why a decision was made, including the individual contribution of text, diff, and file signals (Feature 2).
Batch Mode
For CI/CD pipelines processing multiple PRs, use checkBatch() (Feature 3):
checkBatch(prs: PRData[]) → BatchResult[]
- Processes PRs in parallel for throughput
- Returns individual results for each PR
- Ideal for scheduled jobs scanning open PRs
7. Cross-Repository Detection
Purpose
Detect duplicates across multiple repositories within an organization (Feature 8).
Algorithm
CrossRepoDetector.check(pr, repos[]):
for repo in repos:
candidates = repo.index.search(pr.embedding, k=10)
for candidate in candidates:
score = ranker.score(pr, candidate)
if score >= threshold:
yield { repo, candidate, score }
Use Cases
- Monorepo migrations (detect duplicate fixes across packages)
- Organization-wide PR deduplication
- Fork synchronization
8. Attribution Graph
Purpose
Track duplicate lineage to preserve authorship credit.
Data Structure
class AttributionGraph:
parent: Map<prId, originalPrId>
children: Map<prId, Set<duplicatePrIds>>
Enables O(1) duplicate grouping and O(depth) lookup of original PRs.
Complexity Summary
| Operation | Time | Space |
|---|---|---|
| Bloom filter check | O(k) | O(m) bits |
| Embedding generation | O(text_length) | O(d) per PR |
| ANN search | O(log n) | O(n·d) |
| Scoring | O(d) | O(1) |
| Attribution graph | O(depth) | O(edges) |
Where:
- k = hash functions (5)
- m = bloom filter size (8192)
- n = number of PRs
- d = embedding dimension (384-1536)
PRSense