Information Arbitrage in Multi-Lingual Wikipedia
Eytan Adar, Michael Skinner, Dan Weld
The rapid globalization of Wikipedia is generating a parallel,
multi-lingual corpus of unprecedented scale. Pages for the same
topic in many different languages emerge both as a result of
manual translation and independent development. Unfortunately,
these pages may appear at different times and vary in size, scope, and
quality. Furthermore, differential growth rates cause the
conceptual mapping between articles in different languages to be
both complex and dynamic. These disparities provide the
opportunity for a powerful form of information arbitrage:
leveraging articles in one or more languages to improve the
content in another. Analyzing four large language domains
(English, Spanish, French, and German), we present Ziggurat, an
automated system for aligning Wikipedia infoboxes, creating new
infoboxes as necessary, filling in missing information, and
detecting discrepancies between parallel pages. Our method uses
self-supervised learning, and our experiments demonstrate the
method’s feasibility, even in the absence of dictionaries.
PDF (614Kb), WSDM'09, Barcelona, Spain, Feb. 9-12, 2009