
Duplicate Content Checker

Compare two text blocks for similarity using Jaccard, cosine, and n-gram algorithms — with highlighted diff view and phrase matching.


What Is Duplicate Content and How Is Similarity Calculated?

Duplicate content refers to blocks of text that appear in more than one location on the web, or that are substantially similar to another piece of content. Search engines like Google may struggle to determine which version to rank, potentially splitting link equity between pages or choosing the "wrong" canonical version. While Google rarely penalizes unintentional duplication, it can dilute rankings and create a poor user experience.

How the Similarity Algorithms Work

Jaccard Similarity

Computes the ratio of unique words shared by both texts to the total number of unique words across both (the union of their vocabularies) — pure vocabulary overlap. Fast and simple, but does not account for word frequency or word order.
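As an illustration only (a minimal Python sketch, not the tool's actual implementation), Jaccard similarity over word sets can be computed like this:

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Ratio of shared unique words to the union of unique words."""
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty texts are trivially identical
    return len(set_a & set_b) / len(set_a | set_b)
```

For example, "the cat sat" and "the cat ran" share 2 of 4 unique words, giving a score of 0.5.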

Cosine Similarity

Measures the angle between two word-frequency vectors. Words that appear more often carry more weight. Excellent at detecting texts that use the same words with similar frequency — a strong signal of duplication or light paraphrasing.
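A minimal sketch of this idea in Python (frequency vectors built with `collections.Counter`; the tool's own code may differ):

```python
from collections import Counter
import math

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between two word-frequency vectors."""
    freq_a = Counter(text_a.lower().split())
    freq_b = Counter(text_b.lower().split())
    # Dot product over words that appear in both texts.
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because the vectors hold counts rather than set membership, a word repeated many times in both texts pulls the score up more than a word mentioned once.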

N-gram Overlap (Bigram & Trigram)

Finds matching sequences of 2 or 3 consecutive words. N-gram analysis detects paraphrasing that single-word methods miss, because paraphrased text often preserves short phrases even when individual words change.
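A hedged sketch of n-gram overlap (Jaccard over n-word sequences; passing `n=3` yields trigram overlap — this is one common formulation, not necessarily the exact one the tool uses):

```python
def ngram_overlap(text_a: str, text_b: str, n: int = 2) -> float:
    """Jaccard overlap between the sets of n-word sequences in two texts."""
    def ngrams(text: str) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    grams_a, grams_b = ngrams(text_a), ngrams(text_b)
    if not grams_a and not grams_b:
        return 1.0
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

Note how "machine learning model" and "deep learning model" score 1/3 here (one shared bigram out of three distinct bigrams), even though half their individual words differ.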

LCS Ratio (Longest Common Subsequence)

Measures the longest sequence of words that appears in the same order in both texts. Effective at detecting structural similarity even when filler words are changed.

The overall score is a weighted average of all five algorithm scores: Jaccard, cosine, bigram, trigram, and LCS.
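The LCS ratio can be sketched with the standard dynamic-programming recurrence over words, normalised by the longer text's length (an illustrative choice of normalisation; the tool may normalise differently):

```python
def lcs_ratio(text_a: str, text_b: str) -> float:
    """Longest common word subsequence length over the longer text's length."""
    a, b = text_a.lower().split(), text_b.lower().split()
    if not a or not b:
        return 0.0
    # dp[i][j] = LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, word_a in enumerate(a):
        for j, word_b in enumerate(b):
            if word_a == word_b:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)] / max(len(a), len(b))
```

For instance, "the quick brown fox" vs "the slow brown fox" keeps the ordered subsequence "the brown fox" (3 of 4 words), scoring 0.75 despite the swapped adjective.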

Similarity Thresholds Explained

| Similarity Range | Verdict | Typical Meaning |
| --- | --- | --- |
| 0 – 20% | ✅ Unique | Texts are largely different in vocabulary and structure. Safe for SEO. |
| 20 – 40% | 🟢 Low | Some shared terminology — likely the same topic but distinct content. |
| 40 – 60% | 🟡 Moderate | Significant overlap — possible paraphrasing or heavy phrase reuse. |
| 60 – 80% | 🔴 High | Likely near-duplicate with light rewording. Review before publishing. |
| 80 – 100% | ⚠️ Very High | Essentially the same text — direct copy, minimal edits, or plagiarism. |
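The bands above map naturally onto a simple lookup. A sketch, assuming half-open intervals at the boundaries (the tool's exact boundary handling is not specified here):

```python
def similarity_verdict(score: float) -> str:
    """Map an overall similarity percentage (0-100) to a verdict band."""
    if score < 20:
        return "Unique"
    if score < 40:
        return "Low"
    if score < 60:
        return "Moderate"
    if score < 80:
        return "High"
    return "Very High"
```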

Common Use Cases

  • SEO audits: Checking whether a rewritten article is sufficiently unique to avoid duplicate content issues in Google's index.
  • Academic integrity: Detecting whether submitted work overlaps with a source document or previously published content.
  • AI content validation: Verifying whether AI-generated or spun content is distinct enough from the original source material.
  • Translation checks: Measuring how much translated content preserves the structural and lexical patterns of the source language.
  • Content syndication: Confirming that syndicated versions of your content won't compete with the original in search results.

How to Check for Duplicate Content

1. Paste Text A

Paste your original reference text into the Text A panel — a blog post, article, product description, or any piece of writing you want to use as the baseline.

2. Paste Text B

Paste the comparison text into the Text B panel — the suspected duplicate, rewritten version, AI-generated content, or translated text you want to compare against Text A.

3. Configure Options

Toggle ignore case, ignore stopwords, ignore punctuation, and basic stemming as needed. Set the minimum phrase length for phrase matching. The analysis updates live as you type.

4. Review the Score

The overall similarity percentage and individual algorithm scores (Jaccard, cosine, bigram, trigram, LCS) are displayed instantly, with a verdict from Unique to Very High / Likely Duplicate.

5. Explore the Diff View

Switch to the Diff View tab to see both texts side by side with colour-coded highlighting: green for shared words, red for text unique to A, and blue for text unique to B.

6. Download the Report

Switch to the Report tab and click Download to save a full similarity report as a .txt file including all scores, shared vocabulary, top phrases, and sentence-level analysis.

Frequently Asked Questions

What is duplicate content?
Duplicate content refers to blocks of text that appear in more than one location on the web, or that are substantially similar to another piece of content. Search engines may struggle to determine which version to rank, potentially splitting link equity or indexing the wrong version. While rarely penalized directly, duplicate content can dilute search rankings and reduce organic traffic.
What similarity percentage indicates duplicate content?
Generally: 0–20% is unique; 20–40% is low similarity; 40–60% is moderate with significant phrase overlap; 60–80% is high similarity likely requiring revision; 80–100% is very high — essentially the same text with minor edits. These are guidelines — context matters. Academic, legal, and technical documents naturally share more vocabulary than creative content.
Does duplicate content hurt SEO?
Yes, in several ways. Search engines may split link equity between duplicate pages rather than consolidating signals to one URL. They may index a scraped copy of your content instead of your original. In extreme cases involving mass duplication or manipulative content spinning, Google may apply a ranking penalty. The best practice is to ensure all published content is sufficiently unique and to use canonical tags to signal the preferred version.
What is Jaccard similarity?
Jaccard similarity measures the overlap between two sets of unique words. It divides the number of words shared by both texts by the total number of unique words across both texts. A score of 1.0 means both texts use exactly the same unique vocabulary; 0 means no shared words. It is simple and fast but does not account for word frequency or word order.
What is n-gram similarity and why does it matter?
N-gram similarity measures how many consecutive word sequences of length n are shared between two texts. Bigrams (2-word sequences) and trigrams (3-word sequences) are particularly useful for detecting paraphrasing, because paraphrased text often preserves short phrases even when individual words are replaced. For example, "machine learning model" and "deep learning model" share the bigram "learning model", which Jaccard alone would miss.
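The "learning model" example can be verified with a few lines of Python (an illustrative snippet, independent of the tool's internals):

```python
def bigrams(text: str) -> set:
    """Set of consecutive 2-word sequences in a text."""
    words = text.lower().split()
    return {(words[i], words[i + 1]) for i in range(len(words) - 1)}

shared = bigrams("machine learning model") & bigrams("deep learning model")
# 'shared' holds the bigram ("learning", "model"), which survives the
# substitution of "machine" for "deep" that defeats set-of-words comparison.
```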
Can this tool detect AI-generated or paraphrased content?
Yes, to a degree. The n-gram algorithms (bigram and trigram) are particularly effective at detecting paraphrased content because they identify matching word sequences even when surrounding words change. The LCS ratio also detects structural similarity in sentence ordering. However, heavily paraphrased or AI-rewritten content may score lower despite being semantically similar — for that use case, a dedicated AI detection service is recommended alongside this tool.
What does the diff view show?
The diff view displays both texts side by side with colour-coded word highlighting. Words highlighted in green appear in both texts (shared matches). Words highlighted in red appear only in Text A. Words highlighted in blue appear only in Text B. This visual comparison makes it immediately clear which parts of the texts overlap and where they diverge.
What do the analysis options do?
Ignore case treats "Word" and "word" as the same token. Ignore stopwords removes common function words (the, and, is, etc.) before comparison, focusing the analysis on meaningful content words. Ignore punctuation strips commas, periods, and other symbols. Basic stemming reduces words to their root form so inflected variants match (e.g. "running" and "run"). Minimum phrase length controls the shortest word sequence counted as a matching phrase in the Matching Phrases tab.
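A sketch of how such a preprocessing pipeline might be wired together in Python. The `STOPWORDS` set and the suffix-stripping rule below are illustrative stand-ins (a real implementation would use a full stopword list and a proper stemmer such as Porter, which correctly reduces "running" to "run"):

```python
import re

# Illustrative subset only; real stopword lists contain hundreds of entries.
STOPWORDS = {"the", "a", "an", "and", "or", "is", "are", "of", "to", "in"}

def normalize(text: str, ignore_case=True, ignore_stopwords=False,
              ignore_punctuation=True, stemming=False) -> list:
    """Apply the analysis options, then return the resulting word tokens."""
    if ignore_case:
        text = text.lower()
    if ignore_punctuation:
        text = re.sub(r"[^\w\s]", " ", text)  # strip commas, periods, etc.
    words = text.split()
    if ignore_stopwords:
        words = [w for w in words if w not in STOPWORDS]
    if stemming:
        # Crude suffix stripping; a real stemmer handles many more cases.
        words = [re.sub(r"(ing|ed|es|s)$", "", w) if len(w) > 4 else w
                 for w in words]
    return words
```

The similarity algorithms then operate on the normalised token lists, so the options change what counts as a "match" everywhere at once.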
How do I fix duplicate content issues?
To fix duplicate content: use <link rel="canonical"> tags to tell search engines your preferred URL; use 301 redirects to consolidate duplicate pages permanently; rewrite content to make it sufficiently unique (generally aim for below 30–40% similarity); use hreflang tags for language/regional variants; and avoid syndication without canonical attribution back to your original. For internal duplicates caused by URL parameters, configure parameter handling in Google Search Console.
Can I download the similarity analysis results?
Yes. Switch to the Report tab after running an analysis. You can copy the full report to your clipboard or download it as a .txt file. The report includes all algorithm scores, text statistics, the top matching phrases, shared vocabulary, and sentence-level similarity results.