Efficient approximate entity extraction with edit distance constraints

UTSePress Research/Manakin Repository

dc.contributor.author Wang, Wei en_US
dc.contributor.author Xiao, Chuan en_US
dc.contributor.author Lin, Xuemin en_US
dc.contributor.author Zhang, Chengqi en_US
dc.contributor.editor Çetintemel, Ugur; Zdonik, Stanley B.; Kossmann, Donald; Tatbul, Nesime en_US
dc.date.accessioned 2010-06-17T04:37:15Z
dc.date.available 2010-06-17T04:37:15Z
dc.date.issued 2009 en_US
dc.identifier 2009001769 en_US
dc.identifier.citation Wang Wei et al. 2009, 'Efficient approximate entity extraction with edit distance constraints', ACM, Rhode Island, USA, pp. 759-770. en_US
dc.identifier.issn 978-1-60558-551-2 en_US
dc.identifier.other E1 en_US
dc.identifier.uri http://hdl.handle.net/10453/12354
dc.description.abstract Named entity recognition aims at extracting named entities from unstructured text. A recent trend in named entity recognition is to find approximate matches in the text with respect to a large dictionary of known entities, as the domain knowledge encoded in the dictionary helps to improve extraction performance. In this paper, we study the problem of approximate dictionary matching with edit distance constraints. Compared to existing studies using token-based similarity constraints, our problem definition enables us to capture typographical or orthographical errors, both of which are common in entity extraction tasks yet may be missed by token-based similarity constraints. Our problem is technically challenging because existing approaches based on q-gram filtering perform poorly when the dictionary contains many short entities. Our proposed solution is based on an improved neighborhood generation method employing novel partitioning and prefix pruning techniques. We also propose an efficient document processing algorithm that minimizes unnecessary comparisons and enumerations and hence achieves good scalability. We have conducted extensive experiments on several publicly available named entity recognition datasets. The proposed algorithm outperforms alternative approaches by up to an order of magnitude. en_US
dc.language English en_US
dc.publisher ACM en_US
dc.relation.isbasedon NA en_US
dc.title Efficient approximate entity extraction with edit distance constraints en_US
dc.parent Proceedings of the 35th SIGMOD international conference on Management of data en_US
dc.publocation Rhode Island, USA en_US
dc.identifier.startpage 759 en_US
dc.identifier.endpage 770 en_US
dc.cauo.name FEIT.Faculty of Engineering & Information Technology en_US
dc.conference Verified OK en_US
dc.for 080109 en_US
dc.personcode 0000059260 en_US
dc.personcode 0000059261 en_US
dc.personcode 0000059262 en_US
dc.personcode 011221 en_US
dc.percentage 70 en_US
dc.classification.name Pattern Recognition and Data Mining en_US
dc.classification.type FOR-08 en_US
dc.custom ACM Special Interest Group on Management of Data Conference en_US
dc.date.activity 20090629 en_US
dc.location.activity Rhode Island, USA en_US
dc.description.keywords NA en_US
dc.staffid 011221 en_US
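
The problem stated in the abstract, approximate dictionary matching under an edit distance constraint, can be illustrated with a naive baseline. The sketch below is not the neighborhood generation method the paper proposes; it only shows what counts as a match. The function names `edit_distance` and `naive_extract` and the threshold parameter `tau` are assumptions introduced for illustration and do not come from the record.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def naive_extract(document, dictionary, tau):
    """Return (start, substring, entity) triples with edit distance <= tau.

    Hypothetical brute-force baseline: every substring whose length is
    within tau of the entity's length is compared against the entity.
    """
    matches = []
    for entity in dictionary:
        lo = max(1, len(entity) - tau)
        hi = len(entity) + tau
        for start in range(len(document)):
            for length in range(lo, hi + 1):
                end = start + length
                if end > len(document):
                    break
                candidate = document[start:end]
                if edit_distance(candidate, entity) <= tau:
                    matches.append((start, candidate, entity))
    return matches
```

Calling `naive_extract("visit sydnei now", ["sydney"], 1)` reports the misspelled substring `sydnei` at offset 6 among its matches, which a token-based exact comparison would miss. The brute-force scan compares every candidate substring against every entity, which is the kind of cost that filtering and pruning techniques aim to avoid.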
