Uploaded by Rashidul Hassan

Summary of TF-IDF

Why are we using TF-IDF with n-gram similarity?
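We use TF-IDF with n-gram similarity to detect plagiarism at the sentence level
because it weights rare, distinctive word sequences more heavily than common
ones, which gives a more discriminating measure of similarity between two
sentences than raw word overlap.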
N-gram similarity along with TF-IDF: This is a technique for measuring the
similarity between two texts based on the frequency of n-grams in the texts,
where n-grams are contiguous sequences of n words.
TF-IDF stands for term frequency-inverse document frequency. It is a statistical
measure used to evaluate a word's importance to a document within a corpus. Term
frequency measures how often a word appears in a document. Inverse document
frequency measures how rare a word is across the documents of the corpus. The
final TF-IDF score for a word is calculated by multiplying the term frequency
and inverse document frequency scores.
For term i in document j:
$$W_{i,j} = TF_{i,j} \cdot \log\left(\frac{N}{DF_i}\right)$$
Here,
W_{i,j} is the TF-IDF score of term i in document j.
TF_{i,j} is the number of occurrences of term i in document j.
N is the total number of documents.
DF_i is the number of documents containing term i.
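As a rough illustration of the formula above, the following sketch computes the
TF-IDF weight of every term in a small corpus. The function name tf_idf, the
whitespace tokenization, the sample documents, and the use of the natural
logarithm are all assumptions made for the example; the summary does not specify
a log base or a tokenizer.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, with weight = TF * log(N / DF)."""
    # Assumption: lowercase and split on whitespace as a stand-in tokenizer.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # DF: number of documents that contain each term at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw occurrence counts within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = ["the cat sat", "the dog sat", "the bird flew"]
scores = tf_idf(docs)
print(scores[0])
```

A term that occurs in every document, such as "the" here, gets a weight of 0
because log(N / N) = 0, while a term found in only one document gets the
largest weight.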
TF-IDF with n-gram similarity to detect plagiarism at the sentence level: We
would first preprocess our data by breaking down our documents into individual
sentences and then breaking down each sentence into n-grams of length n. Next,
we would calculate the TF-IDF scores for each n-gram in each sentence using
the formula:
TF-IDF(w, d) = (TF(w, d) / n) * log(N / DF(w))
Here,
w is an n-gram, treated as a single term.
d is a sentence.
n is the number of words in sentence d (not the n-gram length).
N is the total number of sentences.
TF(w, d) is the number of times w occurs in d.
DF(w) is the number of sentences containing w.
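The preprocessing and scoring steps described above could be sketched as
follows. The helper names (split_sentences, word_ngrams, sentence_tf_idf), the
punctuation-based sentence splitter, the default of bigrams, and the natural
logarithm are assumptions chosen for illustration, not details given in this
summary.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def word_ngrams(sentence, size=2):
    """Return the contiguous word n-grams of a sentence as space-joined strings."""
    words = re.findall(r"\w+", sentence.lower())
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def sentence_tf_idf(sentences, size=2):
    """Score each n-gram w in each sentence d as (TF(w, d) / n) * log(N / DF(w))."""
    grams_per_sentence = [word_ngrams(s, size) for s in sentences]
    word_counts = [len(re.findall(r"\w+", s)) for s in sentences]  # n per sentence
    total = len(sentences)  # N

    # DF(w): number of sentences that contain the n-gram w.
    df = Counter(g for grams in grams_per_sentence for g in set(grams))

    vectors = []
    for grams, n_words in zip(grams_per_sentence, word_counts):
        tf = Counter(grams)
        vectors.append(
            {g: (tf[g] / n_words) * math.log(total / df[g]) for g in tf}
        )
    return vectors


text = "The cat sat on the mat. The cat sat on the sofa. Dogs bark loudly."
sentences = split_sentences(text)
vectors = sentence_tf_idf(sentences)
print(vectors[0])
```

Each sentence ends up as a sparse mapping from n-gram to score. N-grams shared
by the first two sentences (for example "the cat") score lower than n-grams
unique to one sentence (for example "the mat").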
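Convert to vector: Calculate TF-IDF(w, d) for every n-gram w in a sentence d;
the resulting scores form that sentence's vector.
Compare similarity: Compute the cosine similarity score between two sentence
vectors and compare it with the threshold value. A score above the threshold
indicates potential plagiarism.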
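A minimal sketch of the comparison step, assuming each sentence vector is stored
as a dictionary from n-gram to TF-IDF score. The function names, the sample
vectors, and the threshold of 0.8 are hypothetical; the summary does not state
which threshold value is used.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors stored as {n-gram: score} dicts."""
    dot = sum(w * vec_b.get(g, 0.0) for g, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero or empty vector is treated as matching nothing
    return dot / (norm_a * norm_b)


def is_suspicious(vec_a, vec_b, threshold=0.8):
    """Flag a sentence pair whose similarity reaches the (assumed) threshold."""
    return cosine_similarity(vec_a, vec_b) >= threshold


# Hand-written example vectors, not real TF-IDF output.
a = {"the cat": 0.5, "cat sat": 0.5}
b = {"the cat": 0.5, "cat sat": 0.5}
c = {"dogs bark": 1.0}

print(is_suspicious(a, b), is_suspicious(a, c))
```

Vectors that share no n-grams have a similarity of 0, and identical vectors have
a similarity of 1, so the threshold sets how much n-gram overlap is tolerated
before a pair is flagged.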