SPIN Unprocessed
Source arXiv Artificial Intelligence export.arxiv.org Analyst
July 3, 2026 ai_technology research

SemHash-LLM: A Multi-Granularity Semantic Hashing Framework for Document Deduplication

View original on arxiv.org

Summary

arXiv:2607.01601v1 Announce Type: new Abstract: Large scale document deduplication must preserve semantic equivalence while remaining efficient over massive corpora. We present SemHash LLM, a multi granularity framework that unifies semantic projection hashing, attention weighted MinHash, contrastive boundary learning, and selective LLM based adjudication. The method combines character, token, and document level signals through gated fusion, then applies a cascaded filtering pipeline for efficie

SpinGraph analysis pending — check back after processing.

Ask AI about this story

See how AI engines summarize this narrative — one click, prompt included.

More from arXiv Artificial Intelligence

View all →

Markdown (.md) · JSON-LD schema (.json) · Machine-readable for AI & GEO