The paper proposes a novel approach for detecting and eliminating duplicate and near-duplicate web pages in order to increase the efficiency of web crawling. The proposed technique also aims to support document classification and document clustering in web content mining by eliminating near-duplicate documents. To this end, a novel algorithm is proposed to evaluate the similarity of the content of two documents.
Duplicate Detection (DD) Algorithm
Step 1: Consider the stemmed keywords of the web page.
Step 2: Based on the starting character (A-Z) of each keyword, the hash values are assumed to begin with 1-26 respectively.
Step 3: Scan every keyword of the sample and compare it with the database (DB); initially the DB contains no key values. Whenever a new keyword is found, generate its hash value and store that key value in a temporary DB.
Step 4: Repeat Step 3 until all keywords have been processed.
Step 5: Store all hash values of the given sample in a local DB (here, an array list is used); a sketch of this hash-value generation is given after the list.
Step 6: Repeat Steps 1 to 5 for N samples.
Step 7: Once all the selected samples have been processed, calculate the similarity measure between the hash values stored in the local DB and those of the web pages in the repository.
Step 8: From the similarity measure, a report on the samples is generated with scores expressed as percentages; pages that are at least 80% similar are considered near duplicates (a sketch of this scoring also follows the list).
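Steps 1-6 amount to building a per-page list of keyword hash values. A minimal sketch in Python is given below; note that beyond the "first letter maps to 1-26" rule in Step 2 the paper does not fix the hash function, so the suffix used here (a sum of character codes) and the names generate_hash and page_hash_values are illustrative assumptions, not the authors' implementation.

def generate_hash(keyword: str) -> str:
    # Step 2: hash values begin with 1-26 according to the starting letter A-Z.
    prefix = ord(keyword[0].upper()) - ord('A') + 1
    # Assumed suffix (not specified in the paper): sum of character codes.
    suffix = sum(ord(c) for c in keyword.lower())
    return f"{prefix}-{suffix}"

def page_hash_values(stemmed_keywords):
    # Steps 3-5: generate a hash value for each new keyword and store all
    # values for the sample in a local list (the paper's "array list").
    temporary_db = {}   # keyword -> hash value, initially empty
    local_db = []       # hash values of this sample
    for word in stemmed_keywords:
        if word not in temporary_db:     # new keyword found
            temporary_db[word] = generate_hash(word)
        local_db.append(temporary_db[word])
    return local_db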
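Steps 7-8 then score each sample against the repository. The paper does not spell out the similarity measure at this point, so the sketch below uses a Jaccard-style overlap of the two hash-value sets as one plausible choice; the 80% threshold is taken from Step 8, while similarity and near_duplicate_report are hypothetical names.

def similarity(hashes_a, hashes_b) -> float:
    # Percentage overlap of two hash-value sets (assumed Jaccard measure).
    set_a, set_b = set(hashes_a), set(hashes_b)
    if not set_a and not set_b:
        return 100.0
    return 100.0 * len(set_a & set_b) / len(set_a | set_b)

def near_duplicate_report(sample_hashes, repository, threshold=80.0):
    # Step 8: report a percentage score per repository page and flag pages
    # scoring at or above the threshold as near duplicates.
    report = []
    for page_id, page_hashes in repository.items():
        score = similarity(sample_hashes, page_hashes)
        report.append((page_id, score, score >= threshold))
    return report

Using sets discards keyword frequency, which matches the step list's focus on the presence of hash values rather than their counts; a frequency-weighted measure would be an equally valid reading of the paper.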