Information Retrieval 6: tf-idf
Internet Information Search 6: tf-idf and Vector Spaces
College of Computer and Communication, Hunan University
Liu Yufeng

Review
- Chinese word segmentation
- Dictionary compression
- Postings list compression
- tf-idf
- How do we construct an index?
- What strategies can we use with limited main memory?

Scoring documents
- We wish to return, in order, the documents most likely to be useful to the searcher.
- How can we rank-order the docs in the corpus with respect to a query?
- Assign a score, say in [0, 1], for each doc on each query.
- Begin with a perfect world: no spammers.
  - Nobody stuffing keywords into a doc to make it match queries.
  - More on "adversarial IR" under web search.

Linear zone combinations
- First generation of scoring methods: use a linear combination of Booleans, e.g.
    Score = 0.6*(match in zone 1) + 0.3*(match in zone 2) + 0.05*(match in zone 3) + 0.05*(match in zone 4)
- Each such expression takes on a value in {0, 1}, so the overall score is in [0, 1].
- For this example the scores can only take on a finite set of values: what are they?

Exercise
- On the query bill OR rights, suppose that we retrieve the following docs from the various zone indexes (postings for bill and rights under the Author, Title and Body zones): 1, 5, 2, 8, 3, 3, 5, 9, 2, 5, 1, 5, 8, 3, 9, 9.
- Compute the score for each doc based on the weightings 0.6, 0.3, 0.1.

General idea
- We are given a weight vector whose components sum up to 1; there is a weight for each zone/field.
- Given a Boolean query, we assign a score to each doc by adding up the weighted contributions of the zones/fields.
- Typically users want to see the K highest-scoring docs.

Index support for zone combinations
- In the simplest version we have a separate inverted index for each zone.
- Variant: have a single index with a separate dictionary entry for each term and zone, e.g. bill.author, bill.title, bill.body, each with its own postings list.
- Of course, compress zone names like author/title/body.

Zone combinations index
- The above scheme is still wasteful: each term is potentially replicated for each zone.
- In a slightly better scheme, we encode the zone in the postings:
    bill → 1.author, 1.body, 2.author, 2.body, 3.title
    rights → 3.title, 3.body, 5.title, 5.body
- At query time, accumulate contributions to the total score of a document from the various postings.
- As before, the zone names get compressed.

Score accumulation
- As we walk the postings for the query bill OR rights, we accumulate scores for each doc in a linear merge as before: doc 1: 0.7, doc 2: 0.7, doc 3: 0.4, doc 5: 0.4.
- Note: we get both bill and rights in the Title field of doc 3, but score it no higher.
- Should we give more weight to more hits?

Term-document count matrices
- Consider the number of occurrences of a term in a document.
- Bag of words model: each document is a vector of counts (a column of the term-document matrix).
- Bag of words view of a doc: the doc "John is quicker than Mary." is indistinguishable from the doc "Mary is quicker than John."
- Which of the indexes discussed so far distinguish these two docs?

Counts vs. frequencies
- WARNING: in a lot of IR literature, "frequency" is used to mean "count".
- Thus term frequency in IR literature is used to mean number of occurrences in a doc, not divided by document length (which would actually make it a frequency).
- We will conform to this misnomer: in saying term frequency we mean the number of occurrences of a term in a document.

Term frequency tf
- Long docs are favored because they're more likely to contain query terms.
- Can fix this to some extent by normalizing for document length.
- But is raw tf the right measure?

Document frequency
- Document frequency (df) may be better: df = number of docs in the corpus containing the term.
    Word        cf        df
    ferrari     10422     17
    insurance   10440     3997
- Document/collection frequency weighting is only possible in a known (static) collection.
- So how do we make use of df?

tf x idf term weights
- The tf x idf measure combines:
  - term frequency (tf, or wf): some measure of term density in a doc;
  - inverse document frequency (idf): a measure of the informativeness of a term, i.e. its rarity across the whole corpus.
- idf could just be based on the raw count of documents the term occurs in (idf_i = 1/df_i), but by far the most commonly used version is:
    idf_i = log(N / df_i), where N is the number of documents in the corpus.
- See Kishore Papineni, NAACL 2, 2002 for theoretical justification.

Summary: tf x idf (or tf.idf)
- Assign a tf.idf weight to each term i in each document d:
    w_{i,d} = tf_{i,d} * log(N / df_i)
- The weight increases with the number of occurrences within a doc.
- It increases with the rarity of the term across the whole corpus.

More on tf: real-valued term-document matrices
- Use a function (scaling) of the count of a word in a document, commonly wf = 1 + log(tf) for tf > 0.
- Bag of words model: each document is a vector in R^v.
- Here: log-scaled tf.idf. Note that weights can be greater than 1!

Documents as vectors
- Each doc j can now be viewed as a vector of wf x idf values, one component for each term.
- So we have a vector space: terms are axes and docs live in this space; even with stemming, there may be 20,000+ dimensions.
- (The corpus of documents gives us a matrix, which we could also view as a vector space in which words live: transposable data.)

Why turn docs into vectors?
- First application: query-by-example. Given a doc d, find others "like" it.
- Now that d is a vector, find vectors (docs) "near" it.

Intuition
- Postulate: documents that are "close together" in the vector space talk about the same things.
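The weighted zone scoring of the "General idea" slides can be sketched in a few lines of Python. The zone names and weights follow the exercise (Author 0.6, Title 0.3, Body 0.1); the example document is illustrative, not taken from the slides' table:

```python
# Weighted zone scoring: a zone contributes its weight if the Boolean
# query matches in that zone; the weights sum to 1, so scores lie in [0, 1].
WEIGHTS = {"author": 0.6, "title": 0.3, "body": 0.1}

def zone_score(doc_zones, query_terms):
    """doc_zones maps zone name -> set of terms occurring in that zone.
    For an OR query, a zone matches if any query term occurs in it."""
    return sum(w for zone, w in WEIGHTS.items()
               if doc_zones.get(zone, set()) & query_terms)

# Illustrative document: bill in Author and Title, rights in Title and Body.
doc = {"author": {"bill"}, "title": {"bill", "rights"}, "body": {"rights"}}
print(round(zone_score(doc, {"bill", "rights"}), 2))  # 1.0: all zones match
print(round(zone_score(doc, {"rights"}), 2))          # 0.4: Title + Body
```

Note that a zone contributes its weight at most once, which is exactly why doc 3 in the slides gets no extra credit for containing both bill and rights in its Title field.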
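The score-accumulation walk over zone-encoded postings can also be sketched directly. The postings for bill and rights are the ones given on the slides; applying the exercise's 0.6/0.3/0.1 weights to them is my assumption, though it reproduces the accumulated scores shown (0.7, 0.7, 0.4, 0.4):

```python
from collections import defaultdict

WEIGHTS = {"author": 0.6, "title": 0.3, "body": 0.1}

# Zone-encoded postings from the slides: term -> list of (doc, zone).
postings = {
    "bill":   [(1, "author"), (1, "body"), (2, "author"),
               (2, "body"), (3, "title")],
    "rights": [(3, "title"), (3, "body"), (5, "title"), (5, "body")],
}

def accumulate(terms):
    """Linear merge over the postings of an OR query. A zone counts once
    per doc, so bill and rights both hitting doc 3's Title do not raise
    its score (as the slides point out)."""
    matched = defaultdict(set)            # doc -> set of matched zones
    for t in terms:
        for doc, zone in postings.get(t, []):
            matched[doc].add(zone)
    return {doc: round(sum(WEIGHTS[z] for z in zones), 2)
            for doc, zones in sorted(matched.items())}

print(accumulate(["bill", "rights"]))  # {1: 0.7, 2: 0.7, 3: 0.4, 5: 0.4}
```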
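The tf x idf weighting w_{i,d} = tf_{i,d} * log(N/df_i) summarized above can be computed over a toy corpus as follows; the base-10 logarithm and the example documents are my own choices for illustration:

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists. Returns one dict per doc mapping
    term -> tf * idf, with idf_i = log10(N / df_i)."""
    N = len(docs)
    df = Counter()                   # document frequency of each term
    for doc in docs:
        df.update(set(doc))          # count each term at most once per doc
    weights = []
    for doc in docs:
        tf = Counter(doc)            # "term frequency" = raw count in doc
        weights.append({t: tf[t] * math.log10(N / df[t]) for t in tf})
    return weights

docs = [["john", "is", "quicker", "than", "mary"],
        ["mary", "is", "quicker", "than", "john"],
        ["ferrari", "is", "fast"]]
w = tfidf_weights(docs)
print(w[0]["is"])      # 0.0: "is" occurs in every doc, so its idf is 0
print(w[0] == w[1])    # True: bag of words cannot tell the two docs apart
```

The last line makes the slides' point concrete: under the bag-of-words model, "John is quicker than Mary" and "Mary is quicker than John" get identical vectors.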
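For query-by-example, "near" in the vector space is conventionally measured with cosine similarity between the weight vectors (a preview of material the slides only gesture at here); a minimal sketch over sparse dict vectors, with illustrative weights:

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors (dicts term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

d1 = {"ferrari": 2.0, "fast": 1.0}
d2 = {"ferrari": 1.0, "fast": 0.5}   # same direction, half the length
d3 = {"insurance": 3.0}              # no terms in common with d1
print(round(cosine(d1, d2), 3))  # 1.0: parallel vectors, maximally similar
print(cosine(d1, d3))            # 0.0: orthogonal, nothing shared
```

Using the angle rather than raw distance means a long document and a short one about the same topic still come out "close", which fits the postulate on the final slide.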