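Journal of Harbin Institute of Technology. Supervised by: Ministry of Industry and Information Technology of the People's Republic of China. Sponsored by: Harbin Institute of Technology. Editor-in-Chief: LI Longqiu. ISSN 0367-6234. CN 23-1235/T.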

Cite this article: LIN Jian-fang, NIU Cheng, LI Sheng, ZHENG De-quan. Automatic collocation extraction using web feedback data[J]. Journal of Harbin Institute of Technology, 2010, 42(2): 281. DOI: 10.11918/j.issn.0367-6234.2010.02.023

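A collocation extraction method based on Web data feedback
LIN Jian-fang¹, NIU Cheng², LI Sheng¹, ZHENG De-quan¹
1. MOE-MS Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology; 2. Microsoft Research Asia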

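Abstract:

To improve the precision of collocation extraction, a new extraction method based on Internet data is proposed. Traditional statistical methods for collocation extraction are all corpus-based and are often limited by corpus size, whereas Internet data contain rich knowledge and information. Using a Web-based measure of lexical relatedness, the method takes the number of pages that Google returns for a collocation as a proxy for its frequency in a corresponding corpus, and applies three classical statistical association measures: co-occurrence frequency, mutual information, and the chi-square test. Experimental results show that both recall and precision are better than those of the corresponding corpus-based methods, which indicates that the large volume of data on the Internet can feasibly be applied to a variety of natural language processing tasks.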

Keywords: collocation; co-occurrence frequency; mutual information; chi-square test; corpus; Web

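DOI: 10.11918/j.issn.0367-6234.2010.02.023

Classification number: TP391.1

Funding: Key Program of the National Natural Science Foundation of China (60736044); Exploratory Project of the National Science and Technology Development Program (2006AA01Z150)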

Automatic collocation extraction using web feedback data

LIN Jian-fang¹, NIU Cheng², LI Sheng¹, ZHENG De-quan¹

1. MOE-MS Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology, Harbin 150001, China; 2. Microsoft Research Asia, Beijing 100080, China

Abstract:

To improve the precision of collocation extraction, this paper proposes a new method based on Internet data. Because traditional corpus-based collocation extraction is constrained by corpus size, we acquire collocations from the Web, which contains plentiful information and knowledge. Three classical association measures, namely co-occurrence frequency, mutual information, and the χ² test, are used to extract collocations automatically. Experimental results show that the Web-based approach outperforms the traditional corpus-based approach in both precision and recall. Data from the Internet may therefore be applicable to many NLP tasks.

Key words: collocation; co-occurrence frequency; mutual information; χ² test; corpora; Web
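
The abstract names three association measures computed from Web counts rather than corpus counts. The following is a minimal sketch of that idea, not the authors' implementation: it uses the standard textbook forms of pointwise mutual information and the 2×2 chi-square statistic, and it treats page counts as if they were consistent frequencies. The function names, the candidate pairs, every count, and the total index size `N_PAGES` are hypothetical values chosen for illustration; no search engine is queried.

```python
import math

def cooccurrence(n_xy, n_x, n_y, n_total):
    """Raw co-occurrence count; the other arguments are ignored."""
    return float(n_xy)

def mutual_information(n_xy, n_x, n_y, n_total):
    """Pointwise mutual information (base 2) estimated from counts."""
    if n_xy <= 0 or n_x <= 0 or n_y <= 0:
        return float("-inf")
    return math.log2((n_xy * n_total) / (n_x * n_y))

def chi_square(n_xy, n_x, n_y, n_total):
    """Pearson chi-square statistic for the 2x2 table of a word pair."""
    a = n_xy                         # pages containing both words
    b = n_x - n_xy                   # first word without the second
    c = n_y - n_xy                   # second word without the first
    d = n_total - n_x - n_y + n_xy   # pages containing neither word
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n_total * (a * d - b * c) ** 2 / denom

def rank(candidates, n_total, measure):
    """Sort candidate pairs by the given measure, highest score first."""
    scored = [(pair, measure(*counts, n_total))
              for pair, counts in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical page counts: (pages with both words, pages with word 1,
# pages with word 2). N_PAGES is an assumed size of the search index.
N_PAGES = 10_000_000
CANDIDATES = {
    ("strong", "tea"): (900, 50_000, 20_000),
    ("powerful", "tea"): (10, 40_000, 20_000),
}

for measure in (cooccurrence, mutual_information, chi_square):
    print(measure.__name__, rank(CANDIDATES, N_PAGES, measure))
```

With these made-up counts all three measures place ("strong", "tea") first. Note that the chi-square statistic is unsigned, so a pair that co-occurs less often than chance can still score above zero; a practical system would likely combine it with a check on the direction of association.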
