Scoring System
Multi-Signal Approach
PRSense combines three independent signals to detect duplicate PRs:
- Text Similarity (title + description)
- Diff Similarity (code changes)
- File Overlap (modified files)
Scoring Formula
final_score = w₁ · text_sim + w₂ · diff_sim + w₃ · file_sim
Default Weights
w₁ = 0.45 (text)
w₂ = 0.35 (diff)
w₃ = 0.20 (files)
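The formula and default weights above can be sketched as a small function. The type and function names here (Signals, Weights, finalScore) are illustrative assumptions, not PRSense's actual API; only the formula and the three default values come from this document.

```typescript
// Hypothetical shapes for illustration; not taken from the PRSense codebase.
interface Signals {
  text: number; // text_sim, expected in [0, 1]
  diff: number; // diff_sim, expected in [0, 1]
  file: number; // file_sim, expected in [0, 1]
}

interface Weights {
  text: number; // w1
  diff: number; // w2
  file: number; // w3
}

// Default weights as listed above; they sum to 1.
const DEFAULT_WEIGHTS: Weights = { text: 0.45, diff: 0.35, file: 0.20 };

// final_score = w1 * text_sim + w2 * diff_sim + w3 * file_sim
function finalScore(s: Signals, w: Weights = DEFAULT_WEIGHTS): number {
  return w.text * s.text + w.diff * s.diff + w.file * s.file;
}

const exampleScore = finalScore({ text: 0.92, diff: 0.85, file: 0.70 });
console.log(exampleScore.toFixed(2)); // "0.85"
```

Because the weights sum to 1 and each signal is in [0, 1], the combined score also stays in [0, 1].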
Rationale
Text Weight (45%)
- Captures intent: Same bug description = likely duplicate
- Language-agnostic: Works across all repos
- Early signal: Available before code is written
Diff Weight (35%)
- Captures implementation: Similar code changes
- High precision: Exact matches are strong signals
- Language-specific: Better for detecting copy-paste
File Weight (20%)
- Captures structure: Same files modified
- Fast to compute: No ML needed
- Hard signal: Same files + high text sim = very likely duplicate
Signal 1: Text Similarity
Input
text = title + "\n" + description
Method
Cosine similarity of embeddings:
text_sim = cosine(embed(text₁), embed(text₂))
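As a minimal sketch of the cosine step, assuming embeddings arrive as plain number arrays of equal length. The embed function itself is not shown, since the document does not specify the embedding API; the zero-vector behavior is also an assumption.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|).
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  // Assumption: a zero vector carries no information, so report no similarity.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors standing in for real embeddings.
console.log(cosine([1, 0], [0, 1])); // 0 (orthogonal)
```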
Example
PR #1: "Fix login bug when password is empty"
PR #2: "Resolve authentication issue with blank passwords"
// Same intent, different wording
text_sim = 0.87 // High similarity
Edge Cases
- Empty description: Use title only
- Very long text: Truncate to 512 tokens
- Non-English: Use multilingual embeddings
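The first two edge cases can be sketched as follows. buildText and MAX_TOKENS are hypothetical names, and whitespace-separated words are used as a rough stand-in for model tokens; a real implementation would truncate with the embedding model's own tokenizer.

```typescript
const MAX_TOKENS = 512;

// Builds the embedding input from a PR title and an optional description.
function buildText(title: string, description?: string | null): string {
  const desc = (description ?? "").trim();
  // Empty description: fall back to the title only.
  const text = desc.length === 0 ? title.trim() : `${title.trim()}\n${desc}`;

  // Assumption: approximate tokens by whitespace-separated words.
  const tokens = text.split(/\s+/).filter((t) => t.length > 0);
  if (tokens.length <= MAX_TOKENS) return text;

  // Very long text: keep the first MAX_TOKENS words (line breaks are lost here).
  return tokens.slice(0, MAX_TOKENS).join(" ");
}

console.log(buildText("Fix login bug when password is empty", ""));
// "Fix login bug when password is empty"
```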
Signal 2: Diff Similarity
Input
diff --git a/auth/login.ts b/auth/login.ts
@@ -10,7 +10,7 @@
 function login(password: string) {
-  if (password) {
+  if (password && password.length > 0) {
     authenticate()
   }
 }
Method
Cosine similarity of diff embeddings:
diff_sim = cosine(embed(diff₁), embed(diff₂))
Preprocessing
- Remove whitespace-only changes
- Normalize variable names
- Focus on structural changes
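A rough sketch of the first preprocessing step, dropping whitespace-only changes from a unified diff. The function name preprocessDiff is hypothetical, and variable-name normalization is not shown because it would need a language-aware parser. Stripping all whitespace is a deliberate simplification that can merge adjacent tokens.

```typescript
// Returns the changed lines of a unified diff with whitespace-only changes removed.
function preprocessDiff(diff: string): string[] {
  const removed: string[] = [];
  const added: string[] = [];

  for (const line of diff.split("\n")) {
    // Skip file and hunk headers.
    if (
      line.startsWith("diff --git") ||
      line.startsWith("--- ") ||
      line.startsWith("+++ ") ||
      line.startsWith("@@")
    ) {
      continue;
    }
    const body = line.slice(1).replace(/\s+/g, ""); // ignore all whitespace
    if (body.length === 0) continue;
    if (line.startsWith("-")) removed.push(body);
    else if (line.startsWith("+")) added.push(body);
    // Context lines (leading space) are ignored.
  }

  // A removed line that reappears as an added line (after stripping
  // whitespace) is a whitespace-only change: cancel the pair.
  const pending = new Map<string, number>();
  for (const r of removed) pending.set(r, (pending.get(r) ?? 0) + 1);

  const keptAdded: string[] = [];
  for (const a of added) {
    const n = pending.get(a) ?? 0;
    if (n > 0) pending.set(a, n - 1);
    else keptAdded.push(a);
  }

  // Whatever is still pending are the removed lines that were not cancelled.
  const result: string[] = [];
  for (const r of removed) {
    const n = pending.get(r) ?? 0;
    if (n > 0) {
      result.push("-" + r);
      pending.set(r, n - 1);
    }
  }
  for (const a of keptAdded) result.push("+" + a);
  return result;
}
```

The returned lines could then be joined and embedded in place of the raw diff.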
Example
// Similar fix, different variable names
diff1: "if (password) → if (password && password.length > 0)"
diff2: "if (pwd) → if (pwd && pwd.length > 0)"
diff_sim = 0.92 // Very high similarity
Signal 3: File Overlap
Input
files = Set of modified file paths
Method
Jaccard similarity:
file_sim = |files₁ ∩ files₂| / |files₁ ∪ files₂|
Example
PR #1 files: ['auth/login.ts', 'auth/utils.ts']
PR #2 files: ['auth/login.ts', 'auth/session.ts']
intersection = 1 // auth/login.ts
union = 3 // all unique files
file_sim = 1/3 = 0.33
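The same calculation as a short sketch. The function name jaccard and the choice to return 0 when both sets are empty are assumptions; the document does not say how that case is handled.

```typescript
// Jaccard similarity of two lists of file paths: |A ∩ B| / |A ∪ B|.
function jaccard(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  // Assumption: two PRs with no files share nothing, so report 0.
  if (setA.size === 0 && setB.size === 0) return 0;

  let intersection = 0;
  setA.forEach((path) => {
    if (setB.has(path)) intersection++;
  });
  const union = setA.size + setB.size - intersection;
  return intersection / union;
}

const fileSim = jaccard(
  ["auth/login.ts", "auth/utils.ts"],
  ["auth/login.ts", "auth/session.ts"],
);
console.log(fileSim.toFixed(2)); // "0.33"
```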
Edge Cases
- Refactors: Low file overlap, but high diff similarity
- Renames: Track file moves via git history
- New files: Contribute to union, not intersection
Score Interpretation
Range
All scores normalized to [0, 1]:
- 1.0 = Identical
- 0.5 = Moderately similar
- 0.0 = Completely unrelated
Thresholds
| Range | Level | Action |
|---|---|---|
| ≥ 0.90 | HIGH | Auto-flag as DUPLICATE |
| 0.82 to < 0.90 | MEDIUM | Suggest to maintainer (POSSIBLE) |
| < 0.82 | LOW | Ignore |
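The thresholds in the table map to a simple classifier. This is a sketch; the Level type and classify function are illustrative names, not part of the documented API.

```typescript
type Level = "HIGH" | "MEDIUM" | "LOW";

// Maps a final score in [0, 1] to the confidence level from the table.
function classify(score: number): Level {
  if (score >= 0.90) return "HIGH"; // auto-flag as DUPLICATE
  if (score >= 0.82) return "MEDIUM"; // suggest to maintainer (POSSIBLE)
  return "LOW"; // ignore
}

console.log(classify(0.97)); // "HIGH"
console.log(classify(0.85)); // "MEDIUM"
console.log(classify(0.70)); // "LOW"
```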
Examples
Case 1: Exact Duplicate
text_sim = 0.95 (same wording)
diff_sim = 0.98 (identical code)
file_sim = 1.00 (same files)
final = 0.45(0.95) + 0.35(0.98) + 0.20(1.00) = 0.97
→ HIGH (DUPLICATE)
Case 2: Different Implementation
text_sim = 0.88 (similar description)
diff_sim = 0.60 (different approach)
file_sim = 0.50 (some overlap)
final = 0.45(0.88) + 0.35(0.60) + 0.20(0.50) = 0.71
→ LOW (IGNORE)
Case 3: Edge Case
text_sim = 0.92 (clear duplicate)
diff_sim = 0.85 (minor variations)
file_sim = 0.70 (mostly same files)
final = 0.45(0.92) + 0.35(0.85) + 0.20(0.70) = 0.85
→ MEDIUM (POSSIBLE)
Weight Tuning
Methodology
- Collect labeled data: 1000+ PR pairs marked as duplicate/not-duplicate
- Grid search: Try weight combinations in 0.05 steps
- Optimize F1 score: Balance precision and recall
- Validate: Test on held-out set
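The grid-search step could look roughly like the sketch below. It assumes each labeled pair already has its three signal values computed, and that a pair is predicted as a duplicate when its score reaches 0.82 (the MEDIUM lower bound); both the LabeledPair shape and that decision threshold are assumptions. Held-out validation is not shown.

```typescript
interface LabeledPair {
  text: number;
  diff: number;
  file: number;
  duplicate: boolean; // ground-truth label
}

type WeightTriple = [number, number, number];

// F1 = 2TP / (2TP + FP + FN) for a given weight triple and decision threshold.
function f1Score(pairs: LabeledPair[], w: WeightTriple, threshold: number): number {
  let tp = 0;
  let fp = 0;
  let fn = 0;
  for (const p of pairs) {
    const score = w[0] * p.text + w[1] * p.diff + w[2] * p.file;
    const predicted = score >= threshold;
    if (predicted && p.duplicate) tp++;
    else if (predicted && !p.duplicate) fp++;
    else if (!predicted && p.duplicate) fn++;
  }
  const denom = 2 * tp + fp + fn;
  return denom === 0 ? 0 : (2 * tp) / denom;
}

// Tries every (w1, w2, w3) on a 0.05 grid with w1 + w2 + w3 = 1.
function gridSearch(
  pairs: LabeledPair[],
  threshold = 0.82,
): { weights: WeightTriple; f1: number } {
  const STEPS = 20; // 1 / 0.05; integer steps avoid floating-point drift
  let best: { weights: WeightTriple; f1: number } = { weights: [1, 0, 0], f1: -1 };
  for (let i = 0; i <= STEPS; i++) {
    for (let j = 0; j <= STEPS - i; j++) {
      const k = STEPS - i - j;
      const w: WeightTriple = [i / STEPS, j / STEPS, k / STEPS];
      const f1 = f1Score(pairs, w, threshold);
      if (f1 > best.f1) best = { weights: w, f1 };
    }
  }
  return best;
}
```

With 0.05 steps there are 231 weight combinations, so an exhaustive search is cheap even for a few thousand labeled pairs.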
Results (Example)
Weights: [0.45, 0.35, 0.20]
Precision: 92%
Recall: 78%
F1: 0.84
Alternative Weights
Conservative (fewer false positives)
w₁ = 0.50, w₂ = 0.40, w₃ = 0.10
→ More weight on text/diff, less weight on files
Aggressive (fewer false negatives)
w₁ = 0.40, w₂ = 0.30, w₃ = 0.30
→ More weight on file overlap
Cross-Repository (Feature 8)
When checking across repos, file paths often differ. We recommend a lower file weight:
w₁ = 0.50, w₂ = 0.45, w₃ = 0.05
→ Rely on text/diff, largely ignore file paths
Note: Use detector.setWeights() to adjust these dynamically (Feature 5).
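The weight sets listed in this document can be collected as named presets and sanity-checked before use. This is a sketch only: PRESETS and isValidWeights are hypothetical names, and the exact signature of detector.setWeights() is not given here, so the call itself is left out.

```typescript
interface WeightSet {
  text: number; // w1
  diff: number; // w2
  file: number; // w3
}

// Weight sets as listed in this document.
const PRESETS: Record<string, WeightSet> = {
  default: { text: 0.45, diff: 0.35, file: 0.20 },
  conservative: { text: 0.50, diff: 0.40, file: 0.10 },
  aggressive: { text: 0.40, diff: 0.30, file: 0.30 },
  crossRepo: { text: 0.50, diff: 0.45, file: 0.05 },
};

// Weights should be non-negative and sum to 1 so the final score stays in [0, 1].
function isValidWeights(w: WeightSet): boolean {
  const nonNegative = w.text >= 0 && w.diff >= 0 && w.file >= 0;
  const sum = w.text + w.diff + w.file;
  return nonNegative && Math.abs(sum - 1) < 1e-9;
}

const allValid = Object.keys(PRESETS).every((key) => {
  const w = PRESETS[key];
  return w !== undefined && isValidWeights(w);
});
console.log(allValid); // true
```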
Normalization
Cosine Similarity
Cosine similarity lies in [-1, 1]; negative values are clamped to 0 so the result falls in [0, 1]:
cosine_norm = max(0, cosine(a, b))
Jaccard Similarity
Naturally in [0, 1], no normalization needed.
Explainability
Score Breakdown
{
"prId": 101,
"candidate": 100,
"score": 0.85,
"breakdown": {
"text": { "value": 0.92, "weight": 0.45, "contribution": 0.414 },
"diff": { "value": 0.85, "weight": 0.35, "contribution": 0.298 },
"file": { "value": 0.70, "weight": 0.20, "contribution": 0.140 }
}
}
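A breakdown of this shape can be produced directly from the signal values and weights. The field names mirror the JSON above; the function names (part, explain) are illustrative assumptions. Each contribution is value × weight, and the score is the sum of the three contributions (about 0.852 for these values).

```typescript
interface Part {
  value: number;
  weight: number;
  contribution: number; // value * weight
}

interface Explanation {
  prId: number;
  candidate: number;
  score: number;
  breakdown: { text: Part; diff: Part; file: Part };
}

function part(value: number, weight: number): Part {
  return { value, weight, contribution: value * weight };
}

// Builds the per-signal breakdown; the score is the sum of the contributions.
function explain(
  prId: number,
  candidate: number,
  values: { text: number; diff: number; file: number },
  weights = { text: 0.45, diff: 0.35, file: 0.20 },
): Explanation {
  const breakdown = {
    text: part(values.text, weights.text),
    diff: part(values.diff, weights.diff),
    file: part(values.file, weights.file),
  };
  const score =
    breakdown.text.contribution +
    breakdown.diff.contribution +
    breakdown.file.contribution;
  return { prId, candidate, score, breakdown };
}

const explanation = explain(101, 100, { text: 0.92, diff: 0.85, file: 0.70 });
console.log(JSON.stringify(explanation, null, 2));
```

Contributions are left unrounded here; round them only when rendering.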
Visualization
Text: ████████████████████ (0.92)
Diff: ████████████████ (0.85)
Files: ██████████ (0.70)
───────────────────
Final: ████████████████▌ (0.85) → MEDIUM
Performance
Latency
- Embedding lookup: 1ms (cached)
- Cosine computation:
  - 0.1ms (OpenAI: 1536-dim)
  - 0.05ms (ONNX: 384-dim)
- Jaccard computation: 0.01ms (sets)
- Total: ~2ms per candidate
Throughput
- Single-threaded: 500 candidates/sec
- Parallelized: 5000 candidates/sec (10 cores)
Future Improvements
- Learned weights: Train a small MLP to combine signals
- Temporal decay: Weight recent PRs higher (Planned v1.1)
- Graph Neural Networks: Model dependency graphs (Planned v1.2)