Why are we using TF-IDF with n-gram similarity? We use TF-IDF with n-gram similarity to detect plagiarism at the sentence level because it provides a more accurate measure of the similarity between two sentences than raw word overlap.

N-gram similarity along with TF-IDF:
N-gram similarity combined with TF-IDF is a technique for measuring the similarity between two texts based on the frequency of the n-grams they contain, where an n-gram is a contiguous sequence of n words.

TF-IDF stands for term frequency-inverse document frequency. It is a statistical measure used to evaluate how important a word is to a document within a corpus. Term frequency measures how often a word appears in a document. Inverse document frequency measures how rare a word is across the corpus. The final TF-IDF score for a word is calculated by multiplying its term frequency and inverse document frequency scores. For term i in document j:

W(i,j) = TF(i,j) * log(N / DF(i))

Here, W(i,j) is the TF-IDF score of term i in document j,
TF(i,j) is the number of occurrences of term i in document j,
N is the total number of documents, and
DF(i) is the number of documents containing term i.

TF-IDF with n-gram similarity to detect plagiarism at the sentence level:
We would first preprocess our data by breaking each document into individual sentences and then breaking each sentence into n-grams of length n. Next, we would calculate the TF-IDF score for each n-gram in each sentence using the formula:

TF-IDF(w, d) = (TF(w, d) / n) * log(N / DF(w))

Here, w is an n-gram,
d is a sentence,
n is the number of words in the sentence (a length-normalization factor),
N is the total number of sentences,
TF(w, d) is the frequency of w in d, and
DF(w) is the number of sentences containing w.

Convert to vectors: represent each sentence as a vector of its TF-IDF(w, d) scores.
Compare similarity: compute the cosine similarity between the two sentence vectors and compare it with a threshold value; pairs scoring above the threshold are flagged as potential plagiarism.
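The pipeline above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the whitespace tokenizer, the bigram default (n = 2), and the length normalization by the number of n-grams per sentence (a close stand-in for the "number of words" in the formula) are all assumptions made for the sketch, and the similarity threshold shown in the usage example is hypothetical.

```python
import math
from collections import Counter

def ngrams(sentence, n=2):
    """Split a sentence into contiguous word n-grams (whitespace tokenizer assumed)."""
    words = sentence.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def tfidf_vectors(sentences, n=2):
    """Compute a sparse TF-IDF vector (dict) per sentence over its n-grams.

    Mirrors the formula in the text:
        TF-IDF(w, d) = (TF(w, d) / len_d) * log(N / DF(w))
    where len_d is taken here as the number of n-grams in the sentence.
    """
    grams = [ngrams(s, n) for s in sentences]
    N = len(sentences)
    df = Counter()                      # DF(w): number of sentences containing w
    for g in grams:
        df.update(set(g))
    vectors = []
    for g in grams:
        tf = Counter(g)                 # TF(w, d): frequency of w in this sentence
        vec = {w: (tf[w] / len(g)) * math.log(N / df[w]) for w in tf}
        vectors.append(vec)
    return vectors

def cosine_similarity(a, b):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Usage might look like the following; sentences sharing several bigrams score well above unrelated ones, and a pair is flagged when its score exceeds the (hypothetical) threshold:

```python
sentences = [
    "the quick brown fox jumps",
    "the quick brown fox leaps",
    "a completely different sentence here",
]
vecs = tfidf_vectors(sentences, n=2)
score = cosine_similarity(vecs[0], vecs[1])
flagged = score > 0.25   # threshold value is an assumption for this sketch
```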