Data Warehousing and Data Mining (Part 2)



Database System Concepts, 6th Edition
©Silberschatz, Korth and Sudarshan

Chapter 20: Data Analysis
- Decision Support Systems
- Data Warehousing
- Data Mining
- Classification
- Association Rules
- Clustering

Decision Support Systems
- Decision-support systems are used to make business decisions, often based on data collected by on-line transaction-processing systems.
- Examples of business decisions:
  - What items to stock?
  - What insurance premium to change?
  - To whom to send advertisements?
- Examples of data used for making decisions
  - Retail sales transaction details
  - Customer profiles (income, age, gender, etc.)

Decision-Support Systems: Overview
- Data analysis tasks are simplified by specialized tools and SQL extensions
  - Example tasks
    - For each product category and each region, what were the total sales in the last quarter and how do they compare with the same quarter last year
    - As above, for each product category and each customer category
- Statistical analysis packages (e.g., S++) can be interfaced with databases
  - Statistical analysis is a large field, but not covered here
- Data mining seeks to discover knowledge automatically in the form of statistical rules and patterns from large databases.
- A data warehouse archives information gathered from multiple sources, and stores it under a unified schema, at a single site.
  - Important for large businesses that generate data from multiple divisions, possibly at multiple sites
  - Data may also be purchased externally

Data Warehousing
- Data sources often store only current data, not historical
  data
- Corporate decision making requires a unified view of all organizational data, including historical data
- A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site
  - Greatly simplifies querying, permits study of historical trends
  - Shifts decision support query load away from transaction processing systems

Data Warehousing [figure not preserved]

Design Issues
- When and how to gather data
  - Source driven architecture: data sources transmit new information to warehouse, either continuously or periodically (e.g., at night)
  - Destination driven architecture: warehouse periodically requests new information from data sources
  - Keeping warehouse exactly synchronized with data sources (e.g., using two-phase commit) is too expensive
    - Usually OK to have slightly out-of-date data at warehouse
    - Data/updates are periodically downloaded from online transaction processing (OLTP) systems.
- What schema to use
  - Schema integration

More Warehouse Design Issues
- Data cleansing
  - E.g., correct mistakes in addresses (misspellings, zip code errors)
  - Merge address lists from different sources and purge duplicates
- How to propagate updates
  - Warehouse schema may
    be a (materialized) view of schema from data sources
- What data to summarize
  - Raw data may be too large to store on-line
  - Aggregate values (totals/subtotals) often suffice
  - Queries on raw data can often be transformed by query optimizer to use aggregate values

Warehouse Schemas
- Dimension values are usually encoded using small integers and mapped to full values via dimension tables
- Resultant schema is called a star schema
  - More complicated schema structures
    - Snowflake schema: multiple levels of dimension tables
    - Constellation: multiple fact tables

Data Warehouse Schema [figure not preserved]

Data Mining
- Data mining is the process of semi-automatically analyzing large databases to find useful patterns
- Prediction based on past history
  - Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ..) and past history
  - Predict if a pattern of phone calling card usage is likely to be fraudulent
- Some examples of prediction mechanisms:
  - Classification
    - Given a new item whose class is unknown, predict to which class it belongs
  - Regression formulae
    - Given a set of mappings for an unknown function, predict the function result for a new parameter value

Data Mining (Cont.)
- Descriptive Patterns
  - Associations
    - Find books that are often bought by "similar" customers. If a new such customer buys one such book, suggest the others too.
  - Associations may be used as a first step in detecting causation
    - E.g., association between exposure to chemical X and cancer
  - Clusters
    - E.g., typhoid cases were clustered in an area surrounding a contaminated well
    - Detection of clusters remains important in detecting epidemics

Classification Rules
- Classification rules help assign new objects to classes.
  - E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk?
- Classification rules for above example could use a variety of data, such as educational level, salary, age, etc.
  - $\forall$ person $P$, $P.\text{degree} = \text{masters}$ and $P.\text{income} > 75{,}000 \Rightarrow P.\text{credit} = \text{excellent}$
  - $\forall$ person $P$, $P.\text{degree} = \text{bachelors}$ and ($P.\text{income} \geq 25{,}000$ and $P.\text{income} \leq 75{,}000$) $\Rightarrow P.\text{credit} = \text{good}$
- Rules are not necessarily exact: there may be some misclassifications
- Classification rules can be shown compactly as a decision tree.
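
As an illustration only, the two example rules above could be written as a small function. This is a minimal sketch: the name `classify_credit`, the argument order, and the choice to return `None` when neither rule applies are assumptions made for the example, not part of the slides.

```python
from typing import Optional


def classify_credit(degree: str, income: float) -> Optional[str]:
    """Apply the two example rules; return None when neither rule fires."""
    # Rule 1: masters degree and income above 75,000 => excellent credit
    if degree == "masters" and income > 75_000:
        return "excellent"
    # Rule 2: bachelors degree and income between 25,000 and 75,000 => good credit
    if degree == "bachelors" and 25_000 <= income <= 75_000:
        return "good"
    # The two rules do not cover every applicant
    return None


# Hypothetical applicants, given as (degree, income)
applicants = [("masters", 90_000), ("bachelors", 40_000), ("bachelors", 10_000)]
for degree, income in applicants:
    print(degree, income, classify_credit(degree, income))
```

The third applicant matches neither rule, which shows why a full classifier needs rules (or a decision tree) that cover every case.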

Decision Tree [figure not preserved]

Construction of Decision Trees
- Training set: a data sample in which the classification is already known.
- Greedy top down generation of decision trees.
  - Each internal node of the tree partitions the data into groups based on a partitioning attribute, and a partitioning condition for the node
  - Leaf node:
    - all (or most) of the items at the node belong to the same class, or
    - all attributes have been considered, and no further partitioning is possible.

Best Splits
- Pick best attributes and conditions on which to partition
- The purity of a set $S$ of training instances can be measured quantitatively in several ways.
  - Notation: number of classes = $k$, number of instances = $|S|$, fraction of instances in class $i$ = $p_i$.
- The Gini measure of purity is defined as
  $\mathrm{Gini}(S) = 1 - \sum_{i=1}^{k} p_i^2$
  - When all instances are in a single class, the Gini value is 0
  - It reaches its maximum (of $1 - 1/k$) if each class has the same number of instances.
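
The Gini definition above can be checked numerically with a short sketch. The function name `gini` and the representation of a training set as a plain list of class labels are assumptions made for this example.

```python
from collections import Counter


def gini(labels):
    """Gini(S) = 1 - sum of p_i^2, where p_i is the fraction of instances in class i."""
    n = len(labels)
    if n == 0:
        raise ValueError("Gini is undefined for an empty set")
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# A single class gives 0; k equally sized classes give 1 - 1/k.
print(gini(["low", "low", "low", "low"]))
print(gini(["low", "medium", "high"]))
```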

Best Splits (Cont.)
- Another measure of purity is the entropy measure, which is defined as
  $\mathrm{entropy}(S) = -\sum_{i=1}^{k} p_i \log_2 p_i$
- When a set $S$ is split into multiple sets $S_i$, $i = 1, 2, \ldots, r$, we can measure the purity of the resultant set of sets as:
  $\mathrm{purity}(S_1, S_2, \ldots, S_r) = \sum_{i=1}^{r} \frac{|S_i|}{|S|}\, \mathrm{purity}(S_i)$
- The information gain due to particular split of $S$ into $S_i$, $i = 1, 2, \ldots, r$:
  $\text{Information-gain}(S, \{S_1, S_2, \ldots, S_r\}) = \mathrm{purity}(S) - \mathrm{purity}(S_1, S_2, \ldots, S_r)$

Best Splits (Cont.)
- Measure of "cost" of a split:
  $\text{Information-content}(S, \{S_1, S_2, \ldots, S_r\}) = -\sum_{i=1}^{r} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$
- Information-gain ratio:
  $\text{Information-gain ratio} = \frac{\text{Information-gain}(S, \{S_1, S_2, \ldots, S_r\})}{\text{Information-content}(S, \{S_1, S_2, \ldots, S_r\})}$
- The best split is the one that gives the maximum information gain ratio

Finding Best Splits
- Categorical attributes (with no meaningful order):
  - Multi-way split, one child for each value
  - Binary split: try all possible breakup of values into two sets, and pick the best
- Continuous-valued attributes (can be sorted in a meaningful order)
  - Binary split:
    - Sort values, try each as a split point
      - E.g., if values are 1, 10, 15, 25, split at $\leq 1$, $\leq 10$, $\leq 15$
    - Pick the value that gives best split
  - Multi-way split:
    - A series of binary splits on the same attribute has roughly equivalent effect
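
The measures and the binary-split search for a continuous-valued attribute can be put together in a short sketch. It uses entropy as the purity measure, as in the formulas above. All function names (`entropy`, `information_gain`, `best_binary_split`, and so on) and the class labels attached to the values 1, 10, 15, 25 are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """entropy(S) = -sum of p_i * log2(p_i) over the classes present in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_purity(parts):
    """Size-weighted purity (entropy) of the sets S_1 .. S_r produced by a split."""
    total = sum(len(p) for p in parts)
    return sum(len(p) / total * entropy(p) for p in parts)


def information_gain(labels, parts):
    return entropy(labels) - split_purity(parts)


def information_content(parts):
    """The 'cost' of a split: entropy of the partition sizes themselves."""
    total = sum(len(p) for p in parts)
    return -sum(len(p) / total * math.log2(len(p) / total) for p in parts if p)


def gain_ratio(labels, parts):
    content = information_content(parts)
    return information_gain(labels, parts) / content if content > 0 else 0.0


def best_binary_split(values, labels):
    """Try 'value <= v' for every distinct v except the largest; keep the best ratio."""
    best = None
    for v in sorted(set(values))[:-1]:
        left = [lab for x, lab in zip(values, labels) if x <= v]
        right = [lab for x, lab in zip(values, labels) if x > v]
        ratio = gain_ratio(labels, [left, right])
        if best is None or ratio > best[1]:
            best = (v, ratio)
    return best


# Values from the slide example; the class labels are made up for illustration.
values = [1, 10, 15, 25]
labels = ["low", "low", "high", "high"]
print(best_binary_split(values, labels))
```

With these made-up labels, splitting at `<= 10` separates the two classes completely, so it has the highest information gain ratio of the three candidate split points.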

Decision-Tree Construction Algorithm

    Procedure GrowTree(S)
        Partition(S);

    Procedure Partition(S)
        if (purity(S) > δ_p or |S| < δ_s) then
            return;
        for each attribute A
            evaluate splits on attribute A;
        Use best split found (across all attributes) to partition S into S_1, S_2, …, S_r;
        for i = 1, 2, …, r
            Partition(S_i);

Other Types of Classifiers
- Neural net classifiers are studied in artificial intelligence and are not covered here
- Bayesian classifiers use Bayes theorem, which says
  $p(c_j \mid d) = \frac{p(d \mid c_j)\, p(c_j)}{p(d)}$
  where
  - $p(c_j \mid d)$ = probability of instance $d$ being in class $c_j$,
  - $p(d \mid c_j)$ = probability of generating instance $d$ given class $c_j$,
  - $p(c_j)$ = probability of occurrence of class $c_j$, and
  - $p(d)$ = probability of instance $d$ occurring

Naïve Bayesian Classifiers
- Bayesian classifiers require
  - computation of $p(d \mid c_j)$
  - precomputation of $p(c_j)$
  - $p(d)$ can be ignored since it is the same for all classes
- To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate
  $p(d \mid c_j) = p(d_1 \mid c_j) \cdot p(d_2 \mid c_j) \cdots p(d_n \mid c_j)$
  - Each of the $p(d_i \mid c_j)$ can be estimated from a histogram on $d_i$ values for each class $c_j$
    - the histogram is computed from the training instances
  - Histograms on multiple attributes are more expensive to compute and store

Regression
- Regression deals with the prediction of a value, rather than a class.
  - Given values for a set of variables, $X_1, X_2, \ldots, X_n$, we wish to predict the value of a variable $Y$.
- One way is to infer coefficients $a_0, a_1, a_2, \ldots, a_n$ such that
  $Y = a_0 + a_1 X_1 + a_2 X_2 + \cdots + a_n X_n$
- Finding such a linear polynomial is called linear regression.
  - In general, the process of finding a curve that fits the data is also called curve fitting.
- The fit may only be approximate
  - because of noise in the data, or
  - because the relationship is not exactly a polynomial
- Regression aims to find coefficients that give the best possible fit.

Association Rules
- Retail shops are often interested in associations between different items that people buy.
  - Someone who buys bread is quite likely also to buy milk
  - A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
- Associations information can be used in several ways.
  - E.g., when a customer buys a particular book, an online shop may suggest associated books.
- Association rules:
  - $\text{bread} \Rightarrow \text{milk}$
  - $\text{DB-Concepts}, \text{OS-Concepts} \Rightarrow \text{Networks}$
  - Left hand side: antecedent, right hand side: consequent
  - An association rule must have an associated population; the population consists of a set of instances
    - E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population

Association Rules (Cont.)
- Rules have an associated support, as well as an associated confidence.
- Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule.
  - E.g., the support for the rule $\text{milk} \Rightarrow \text{screwdrivers}$ is low.
- Confidence is a measure of how often the consequent is true when the antecedent is true.
  - E.g., the rule $\text{bread} \Rightarrow \text{milk}$ has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk.

Finding Association Rules
- We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater)
- Naïve algorithm
  1. Consider all possible sets of relevant items.
  2. For each set find its support (i.e., count how many transactions purchase all items in the set).
     - Large itemsets: sets with sufficiently high support
  3. Use large itemsets to generate association rules.
     - From itemset $A$ generate the rule $A - \{b\} \Rightarrow b$ for each $b \in A$.
       - Support of rule = support($A$).
       - Confidence of rule = support($A$) / support($A - \{b\}$)

Finding Support
- Determine support of itemsets via a single pass on set of transactions
  - Large itemsets: sets with a high count at the end of the pass
- If memory is not enough to hold all counts for all itemsets, use multiple passes, considering only some itemsets in each pass.
- Optimization: once an itemset is eliminated because its count (support) is too small, none of its supersets needs to be considered.
- The a priori technique to find large itemsets:
  - Pass 1: count support of all sets with just 1 item. Eliminate those items with low support
  - Pass $i$: candidates: every set of $i$ items such that all its $i-1$ item subsets are large
    - Count support of all candidates
    - Stop if there are no candidates
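
The passes of the a priori technique, and the rule generation step with its support and confidence formulas, can be sketched in a few lines. This is an unoptimized illustration: the function names `apriori` and `rules`, the thresholds, and the sample transactions are all assumptions made for the example, and support is expressed as a fraction of all transactions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support fraction} for every large itemset."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def count_support(candidates):
        # One pass over the transactions, counting every candidate itemset.
        counts = {c: 0 for c in candidates}
        for basket in baskets:
            for c in candidates:
                if c <= basket:
                    counts[c] += 1
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    # Pass 1: all sets with just one item.
    items = {item for basket in baskets for item in basket}
    large = count_support({frozenset([i]) for i in items})
    result = dict(large)
    size = 2
    while large:
        previous = set(large)
        surviving = sorted({i for s in previous for i in s})
        # Pass i: keep only sets whose (i-1)-item subsets are all large.
        candidates = {
            frozenset(c)
            for c in combinations(surviving, size)
            if all(frozenset(sub) in previous for sub in combinations(c, size - 1))
        }
        if not candidates:
            break
        large = count_support(candidates)
        result.update(large)
        size += 1
    return result


def rules(large, min_confidence):
    """From each large itemset A, generate A - {b} => b with its support and confidence."""
    found = []
    for itemset, support in large.items():
        if len(itemset) < 2:
            continue
        for b in itemset:
            antecedent = itemset - {b}
            confidence = support / large[antecedent]
            if confidence >= min_confidence:
                found.append((antecedent, b, support, confidence))
    return found


# Made-up transactions for illustration.
sales = [
    {"bread", "milk"},
    {"bread", "milk", "cereal"},
    {"bread", "cereal"},
    {"milk", "screwdriver"},
    {"bread", "milk"},
]
large_itemsets = apriori(sales, min_support=0.4)
for antecedent, consequent, support, confidence in rules(large_itemsets, 0.7):
    print(sorted(antecedent), "=>", consequent, support, confidence)
```

Because every subset of a large itemset is itself large, the lookup `large[antecedent]` in `rules` always succeeds.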

Other Types of Associations
- Basic association rules have several limitations
- Deviations from the expected probability are more interesting
  - E.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both
  - We are interested in positive as well as negative correlations between sets of items
    - Positive correlation: co-occurrence is higher than predicted
    - Negative correlation: co-occurrence is lower than predicted
- Sequence associations / correlations
  - E.g., whenever bonds go up, stock prices go down in 2 days
- Deviations from temporal patterns
  - E.g., deviation from a steady growth
  - E.g., sales of winter wear go down in summer
    - Not surprising, part of a known pattern.
    - Look for deviation from value predicted using past patterns

Clustering
- Clustering: Intuitively, finding clusters of points in the given data such that similar points lie in the same cluster
- Can be formalized using distance metrics in several ways
  - Group points into $k$ sets (for a given $k$) such that the average distance of points from the centroid of their assigned group is minimized
    - Centroid: point defined by taking average of coordinates in each dimension.
  - Another metric: minimize average distance between every pair of points in a cluster
- Has been studied extensively in statistics, but on small data sets
  - Data mining systems aim at clustering techniques that can handle very large data sets
  - E.g., the Birch clustering algorithm (more shortly)

Hierarchical Clustering
- Example from biological classification
  - (the word classification here does not mean a prediction mechanism)
  - chordata
    - mammalia
      - leopards
      - humans
    - reptilia
      - snakes
      - crocodiles
- Other examples: Internet directory systems (e.g., Yahoo, more on this later)
- Agglomerative clustering algorithms
  - Build small clusters, then cluster small clusters into bigger clusters, and so on
- Divisive clustering algorithms
  - Start with all items in a single cluster, repeatedly refine
    (break) clusters into smaller ones

Clustering Algorithms
- Clustering algorithms have been designed to handle very large datasets
- E.g., the Birch algorithm
  - Main idea: use an in-memory R-tree to store points that are being clustered
  - Insert points one at a time into the R-tree, merging a new point with an existing cluster if it is less than some $\delta$ distance away
  - If there are more leaf nodes than fit in memory, merge existing clusters that are close to each other
  - At the end of first pass we get a large number of clusters at the leaves of the R-tree
    - Merge clusters to reduce the number of clusters

Collaborative Filtering
- Goal: predict what movies/books/… a person may be interested in, on the basis of
  - Past preferences of the person
  - Other people with similar past preferences
  - The preferences of such people for a new movie/book/…
- One approach based on repeated clustering
  - Cluster people on the basis of preferences for movies
  - Then cluster movies on the basis of being liked by the same clusters of people
  - Again cluster people based on their preferences for (the newly created clusters of) movies
  - Repeat above till equilibrium
- Above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest
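
The goal stated above (predict a person's interest from people with similar past preferences) can be illustrated with a simple neighbor-weighted average. Note that this is a deliberately simpler technique than the repeated-clustering approach described in the slides, not an implementation of it. The function names, the similarity formula, and the ratings are all assumptions made for the example.

```python
def similarity(a, b):
    """Similarity in (0, 1]: 1 / (1 + mean absolute rating gap on shared items)."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    mean_gap = sum(abs(a[item] - b[item]) for item in shared) / len(shared)
    return 1.0 / (1.0 + mean_gap)


def predict(ratings, person, item):
    """Similarity-weighted average of other people's ratings for the item."""
    weighted_sum = 0.0
    total_weight = 0.0
    for other, prefs in ratings.items():
        if other == person or item not in prefs:
            continue
        weight = similarity(ratings[person], prefs)
        weighted_sum += weight * prefs[item]
        total_weight += weight
    return weighted_sum / total_weight if total_weight > 0 else None


# Made-up ratings on a 1-5 scale.
ratings = {
    "ann": {"Alien": 5, "Brazil": 4, "Casablanca": 1},
    "bob": {"Alien": 5, "Brazil": 5, "Casablanca": 1, "Dune": 5},
    "cat": {"Alien": 1, "Brazil": 2, "Casablanca": 5, "Dune": 1},
}
print(predict(ratings, "ann", "Dune"))
```

Here `ann` agrees closely with `bob` and disagrees with `cat`, so the prediction for `Dune` is pulled toward `bob`'s rating.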

Other Types of Mining
- Text mining: application of data mining to textual documents
  - cluster Web pages to find related pages
  - cluster pages a user has visited to organize their visit history
  - classify Web pages automatically into a Web directory
- Data visualization systems help users examine large volumes of data and detect patterns visually
  - Can visually encode large amounts of information on a single screen
  - Humans are very good at detecting visual patterns

End of Chapter

Figures 20.01, 20.02, 20.03, 20.05 [images not preserved]