IIT Guwahati Researchers develop method to correct Wikipedia surface name errors

To validate the developed method, the research team compared snapshots of English Wikipedia from 2018 and 2022

Update: 2026-03-06 04:00 GMT

GUWAHATI, March 6: Indian Institute of Technology (IIT) Guwahati researchers have developed a multilingual and scalable method to identify and correct Surface Name Errors (SNEs) in Wikipedia, thus helping improve information reliability for both human users and artificial intelligence (AI) systems.

A surface name is the text used in a Wikipedia article to mention or link to another entity. An SNE occurs when this text is incorrect.

A study conducted by the IIT Guwahati research team found that about three to six per cent of all entity mentions in Wikipedia contain SNEs. While these errors may appear minor, they have significant implications.

For human users, an incorrect surface name can reduce the perceived credibility and reliability of the information provided.

Similarly, many machine learning and deep learning models use Wikipedia as a core dataset. Such errors in surface names can negatively impact AI tasks and model performance, said the research team.

To address this challenge, Prof Amit Awekar, Associate Professor at the Department of Computer Science and Engineering of IIT Guwahati, along with MTech student Anuj Khare (batch of 2022), built a method that uses mathematical frequency patterns, making it adaptable across languages. The developed method follows a three-step approach to classify SNEs.

The first step involved scanning Wikipedia and converting every link into a quadruplet that records the page where the link appears, the page it points to, the surface name used in the link, and the surrounding textual context.
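As an illustration of this first step, the following is a minimal sketch that turns standard `[[Target|surface]]` wikitext links into quadruplets. The names `LinkRecord` and `extract_links`, the regular expression, and the 40-character context window are assumptions made for this example, not details reported by the researchers.

```python
import re
from typing import List, NamedTuple


class LinkRecord(NamedTuple):
    """One link occurrence, stored as a quadruplet (hypothetical layout)."""
    source_page: str   # page where the link appears
    target_page: str   # page the link points to
    surface_name: str  # text shown for the link
    context: str       # text surrounding the link


# Matches [[Target]] and [[Target|surface text]] in wikitext.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def extract_links(source_page: str, wikitext: str, window: int = 40) -> List[LinkRecord]:
    """Return one LinkRecord per wikilink found in the given wikitext."""
    records = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        # A link without a pipe uses the target title as its surface name.
        surface = (match.group(2) or target).strip()
        start = max(0, match.start() - window)
        end = min(len(wikitext), match.end() + window)
        records.append(LinkRecord(source_page, target, surface, wikitext[start:end]))
    return records


records = extract_links(
    "Assam", "The city of [[Guwahati|Gawahati]] lies on the [[Brahmaputra]]."
)
```

A real pipeline would also need to handle templates, redirects, and section anchors, which this sketch ignores.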

In the next step, the method reviewed each surface name and considered it correct only if it appeared at least 10 times and accounted for at least five per cent of all links pointing to a specific page.

Surface names that did not meet these criteria were flagged as potential errors.
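The frequency rule described above can be sketched in a few lines. The thresholds of 10 occurrences and five per cent come from the article; the function name `flag_surface_names` and the input layout of `(target_page, surface_name)` pairs are assumptions for illustration only.

```python
from collections import Counter, defaultdict

MIN_COUNT = 10    # a surface name must appear at least this many times
MIN_SHARE = 0.05  # and make up at least this share of links to the page


def flag_surface_names(links):
    """Map each target page to the set of surface names flagged as potential errors.

    `links` is an iterable of (target_page, surface_name) pairs.
    """
    counts = defaultdict(Counter)
    for target, surface in links:
        counts[target][surface] += 1

    flagged = {}
    for target, surfaces in counts.items():
        total = sum(surfaces.values())
        flagged[target] = {
            surface
            for surface, n in surfaces.items()
            if n < MIN_COUNT or n / total < MIN_SHARE
        }
    return flagged


links = (
    [("Guwahati", "Guwahati")] * 95
    + [("Guwahati", "Gauhati")] * 12
    + [("Guwahati", "Gawahati")] * 3
)
print(flag_surface_names(links))  # {'Guwahati': {'Gawahati'}}
```

In this made-up sample, "Gauhati" passes both checks (12 occurrences, about 11 per cent of links), while "Gawahati" fails the minimum count and is flagged.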

In the final step, it categorized the detected errors into ‘typing mistakes’, such as ‘Gawahati’ instead of ‘Guwahati’, or ‘entity span errors’, where extra or incorrect words are mistakenly included in the link.
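The article does not say how the two categories are told apart, so the sketch below uses a simple heuristic of its own: a flagged name that contains an accepted name plus extra words is treated as an entity span error, and a flagged name that is very similar in spelling to an accepted name is treated as a typing mistake. The function `categorize`, the 0.8 similarity threshold, and the "unclassified" fallback are all assumptions, not the researchers' procedure.

```python
from difflib import SequenceMatcher
from typing import List


def categorize(flagged: str, accepted: List[str], typo_threshold: float = 0.8) -> str:
    """Label a flagged surface name using an illustrative heuristic."""
    flagged_tokens = flagged.lower().split()

    # Entity span error: an accepted name appears as a run of whole words,
    # but the link text includes additional words around it.
    for name in accepted:
        name_tokens = name.lower().split()
        n = len(name_tokens)
        if len(flagged_tokens) > n and any(
            flagged_tokens[i:i + n] == name_tokens
            for i in range(len(flagged_tokens) - n + 1)
        ):
            return "entity span error"

    # Typing mistake: spelling is close to an accepted name.
    for name in accepted:
        if SequenceMatcher(None, flagged.lower(), name.lower()).ratio() >= typo_threshold:
            return "typing mistake"

    return "unclassified"


print(categorize("Gawahati", ["Guwahati"]))          # typing mistake
print(categorize("city of Guwahati", ["Guwahati"]))  # entity span error
```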

The researchers tested the method on eight languages (English, Sanskrit, German, Italian, Urdu, Hindi, Marathi, and Gujarati) and reported accurate results.

Speaking about the real-world application of the developed method, Prof Awekar said, “This work shows us that we should not trust the data from the web blindly, both for human use and training AI models. Good data is the beginning of any good AI model and downstream application.”

To validate the developed method, the research team compared snapshots of English Wikipedia from 2018 and 2022 and found that about 30 per cent of the errors predicted by the method had been corrected on Wikipedia over four years, which the team cites as evidence of its accuracy.
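The snapshot comparison amounts to a simple proportion, sketched below. It assumes each predicted error is stored as a `(source_page, target_page, surface_name)` triple and counts an error as corrected when that exact triple is absent from the later snapshot; both the representation and the function name `corrected_fraction` are hypothetical, and the sample data is invented.

```python
def corrected_fraction(predicted_errors, later_links):
    """Share of predicted errors that no longer appear in a later snapshot."""
    predicted = set(predicted_errors)
    if not predicted:
        return 0.0
    still_present = predicted & set(later_links)
    return (len(predicted) - len(still_present)) / len(predicted)


# Ten invented predictions from an earlier snapshot; seven survive in the later one.
predicted_old = [("Page%d" % i, "Guwahati", "Gawahati") for i in range(10)]
links_new = predicted_old[:7]
print(corrected_fraction(predicted_old, links_new))  # 0.3
```

A link can also vanish because the surrounding sentence was deleted rather than fixed, so a measure like this is only an approximate signal.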

Wikipedia is maintained by volunteers worldwide, and the developed method can help editors identify hidden typos and linking errors that might otherwise remain unnoticed for years, Prof Awekar said. The Wikipedia community has accepted more than 99 per cent of the manual corrections suggested by the researchers.


By Staff Reporter
