Data Warehousing and Data Mining (Part 2)

Uploaded by: jk****g. Document ID: 253025573. Uploaded: 2024-11-27. Format: PPTX. Pages: 39. Size: 1.61 MB.


Database System Concepts, 6th Edition (©Silberschatz, Korth and Sudarshan)

Chapter 20: Data Analysis
- Decision Support Systems
- Data Warehousing
- Data Mining
- Classification
- Association Rules
- Clustering

Decision Support Systems
- Decision-support systems are used to make business decisions, often based on data collected by on-line transaction-processing systems.
- Examples of business decisions:
  - What items to stock?
  - What insurance premium to charge?
  - To whom to send advertisements?
- Examples of data used for making decisions:
  - Retail sales transaction details
  - Customer profiles (income, age, gender, etc.)

Decision-Support Systems: Overview
- Data analysis tasks are simplified by specialized tools and SQL extensions
  - Example tasks:
    - For each product category and each region, what were the total sales in the last quarter and how do they compare with the same quarter last year?
    - As above, for each product category and each customer category
- Statistical analysis packages (e.g., S++) can be interfaced with databases
  - Statistical analysis is a large field, but not covered here
- Data mining seeks to discover knowledge automatically in the form of statistical rules and patterns from large databases.
- A data warehouse archives information gathered from multiple sources, and stores it under a unified schema, at a single site.
  - Important for large businesses that generate data from multiple divisions, possibly at multiple sites
  - Data may also be purchased externally

Data Warehousing
- Data sources often store only current data, not historical data
- Corporate decision making requires a unified view of all organizational data, including historical data
- A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site
  - Greatly simplifies querying, permits study of historical trends
  - Shifts decision support query load away from transaction processing systems

[Figure: Data Warehousing]

Design Issues
- When and how to gather data
  - Source driven architecture: data sources transmit new information to warehouse, either continuously or periodically (e.g., at night)
  - Destination driven architecture: warehouse periodically requests new information from data sources
  - Keeping warehouse exactly synchronized with data sources (e.g., using two-phase commit) is too expensive
    - Usually OK to have slightly out-of-date data at warehouse
    - Data/updates are periodically downloaded from online transaction processing (OLTP) systems.
- What schema to use
  - Schema integration

More Warehouse Design Issues
- Data cleansing
  - E.g., correct mistakes in addresses (misspellings, zip code errors)
  - Merge address lists from different sources and purge duplicates
- How to propagate updates
  - Warehouse schema may be a (materialized) view of schema from data sources
- What data to summarize
  - Raw data may be too large to store on-line
  - Aggregate values (totals/subtotals) often suffice
  - Queries on raw data can often be transformed by query optimizer to use aggregate values

Warehouse Schemas
- Dimension values are usually encoded using small integers and mapped to full values via dimension tables
- Resultant schema is called a star schema
  - More complicated schema structures
    - Snowflake schema: multiple levels of dimension tables
    - Constellation: multiple fact tables

[Figure: Data Warehouse Schema]

Data Mining
- Data mining is the process of semi-automatically analyzing large databases to find useful patterns
- Prediction based on past history
  - Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ..) and past history
  - Predict if a pattern of phone calling card usage is likely to be fraudulent
- Some examples of prediction mechanisms:
  - Classification
    - Given a new item whose class is unknown, predict to which class it belongs
  - Regression formulae
    - Given a set of mappings for an unknown function, predict the function result for a new parameter value

Data Mining (Cont.)
- Descriptive Patterns
  - Associations
    - Find books that are often bought by "similar" customers. If a new such customer buys one such book, suggest the others too.
  - Associations may be used as a first step in detecting causation
    - E.g., association between exposure to chemical X and cancer
  - Clusters
    - E.g., typhoid cases were clustered in an area surrounding a contaminated well
    - Detection of clusters remains important in detecting epidemics

Classification Rules
- Classification rules help assign new objects to classes.
  - E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk?
- Classification rules for above example could use a variety of data, such as educational level, salary, age, etc.
  - ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
  - ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
- Rules are not necessarily exact: there may be some misclassifications
- Classification rules can be shown compactly as a decision tree.
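
The two credit rules on the Classification Rules slide can be read as a small function. This is a minimal sketch: the `Person` record, the `credit_class` name, and the `None` result for applicants that neither rule covers are assumptions made for illustration, since the slide gives only the two rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    # Hypothetical record type; the slide only names the attributes.
    degree: str
    income: float


def credit_class(p: Person) -> Optional[str]:
    """Apply the two classification rules from the slide, in order."""
    if p.degree == "masters" and p.income > 75_000:
        return "excellent"
    if p.degree == "bachelors" and 25_000 <= p.income <= 75_000:
        return "good"
    # Assumption: no rule fires, so no class is assigned.
    return None
```

A decision tree would encode the same tests as nested branches, first on `degree` and then on `income`.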

[Figure: Decision Tree]

Construction of Decision Trees
- Training set: a data sample in which the classification is already known.
- Greedy top down generation of decision trees.
  - Each internal node of the tree partitions the data into groups based on a partitioning attribute, and a partitioning condition for the node
  - Leaf node:
    - all (or most) of the items at the node belong to the same class, or
    - all attributes have been considered, and no further partitioning is possible.

Best Splits
- Pick best attributes and conditions on which to partition
- The purity of a set $S$ of training instances can be measured quantitatively in several ways.
  - Notation: number of classes = $k$, number of instances = $|S|$, fraction of instances in class $i$ = $p_i$.
- The Gini measure of purity is defined as
  $\mathrm{Gini}(S) = 1 - \sum_{i=1}^{k} p_i^2$
  - When all instances are in a single class, the Gini value is 0
  - It reaches its maximum (of $1 - 1/k$) if each class has the same number of instances.
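
The Gini measure can be checked numerically with a few lines of Python. This is a sketch that assumes the training set is given as a plain list of class labels; the function name `gini` is chosen here for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini(S) = 1 - sum of p_i^2, where p_i is the fraction of class i in S."""
    n = len(labels)
    if n == 0:
        raise ValueError("Gini is undefined for an empty set")
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())
```

With four equally frequent classes the value is 1 - 1/4, and with a single class it is 0, matching the two boundary cases stated on the slide.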

Best Splits (Cont.)
- Another measure of purity is the entropy measure, which is defined as
  $\mathrm{entropy}(S) = -\sum_{i=1}^{k} p_i \log_2 p_i$
- When a set $S$ is split into multiple sets $S_i$, $i = 1, 2, \ldots, r$, we can measure the purity of the resultant set of sets as:
  $\mathrm{purity}(S_1, S_2, \ldots, S_r) = \sum_{i=1}^{r} \frac{|S_i|}{|S|}\, \mathrm{purity}(S_i)$
- The information gain due to a particular split of $S$ into $S_i$, $i = 1, 2, \ldots, r$:
  $\text{Information-gain}(S, \{S_1, S_2, \ldots, S_r\}) = \mathrm{purity}(S) - \mathrm{purity}(S_1, S_2, \ldots, S_r)$

Best Splits (Cont.)
- Measure of "cost" of a split:
  $\text{Information-content}(S, \{S_1, S_2, \ldots, S_r\}) = -\sum_{i=1}^{r} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$
- Information-gain ratio:
  $\text{Information-gain ratio} = \frac{\text{Information-gain}(S, \{S_1, S_2, \ldots, S_r\})}{\text{Information-content}(S, \{S_1, S_2, \ldots, S_r\})}$
- The best split is the one that gives the maximum information gain ratio

Finding Best Splits
- Categorical attributes (with no meaningful order):
  - Multi-way split, one child for each value
  - Binary split: try all possible breakup of values into two sets, and pick the best
- Continuous-valued attributes (can be sorted in a meaningful order)
  - Binary split:
    - Sort values, try each as a split point
      - E.g., if values are 1, 10, 15, 25, split at ≤ 1, ≤ 10, ≤ 15
    - Pick the value that gives best split
  - Multi-way split:
    - A series of binary splits on the same attribute has roughly equivalent effect
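
The entropy-based formulas and the binary split search for a continuous attribute can be combined into one short sketch. It assumes entropy is used as the purity measure and that a split is written as "value ≤ v"; all function names are illustrative and do not come from the slides.

```python
import math
from collections import Counter


def entropy(labels):
    """entropy(S) = -sum of p_i * log2(p_i) over the classes present in S."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_purity(subsets):
    """Weighted purity of the subsets S_1..S_r, each weighted by |S_i| / |S|."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * entropy(s) for s in subsets if s)


def information_gain(labels, subsets):
    return entropy(labels) - split_purity(subsets)


def information_content(subsets):
    """The "cost" of a split: entropy of the subset-size distribution."""
    total = sum(len(s) for s in subsets)
    return sum(-(len(s) / total) * math.log2(len(s) / total)
               for s in subsets if s)


def gain_ratio(labels, subsets):
    content = information_content(subsets)
    # Assumption: a split that leaves everything in one subset scores 0.
    return information_gain(labels, subsets) / content if content > 0 else 0.0


def best_binary_split(values, labels):
    """Try 'value <= v' for every distinct v except the largest one."""
    best = None
    for v in sorted(set(values))[:-1]:
        left = [lab for x, lab in zip(values, labels) if x <= v]
        right = [lab for x, lab in zip(values, labels) if x > v]
        ratio = gain_ratio(labels, [left, right])
        if best is None or ratio > best[1]:
            best = (v, ratio)
    return best  # (split value, information-gain ratio), or None
```

For the slide's values 1, 10, 15, 25 with hypothetical labels low, low, high, high, the candidates are ≤ 1, ≤ 10 and ≤ 15, and the search picks ≤ 10 because it separates the two classes completely.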

Decision-Tree Construction Algorithm
  Procedure GrowTree(S)
      Partition(S);

  Procedure Partition(S)
      if (purity(S) > δ_p or |S| < δ_s) then
          return;
      for each attribute A
          evaluate splits on attribute A;
      Use best split found (across all attributes) to partition S into S_1, S_2, …, S_r;
      for i = 1, 2, …, r
          Partition(S_i);

Other Types of Classifiers
- Neural net classifiers are studied in artificial intelligence and are not covered here
- Bayesian classifiers use Bayes theorem, which says
  $p(c_j \mid d) = \frac{p(d \mid c_j)\, p(c_j)}{p(d)}$
  where
  - $p(c_j \mid d)$ = probability of instance $d$ being in class $c_j$,
  - $p(d \mid c_j)$ = probability of generating instance $d$ given class $c_j$,
  - $p(c_j)$ = probability of occurrence of class $c_j$, and
  - $p(d)$ = probability of instance $d$ occurring

Naïve Bayesian Classifiers
- Bayesian classifiers require
  - computation of $p(d \mid c_j)$
  - precomputation of $p(c_j)$
  - $p(d)$ can be ignored since it is the same for all classes
- To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate
  $p(d \mid c_j) = p(d_1 \mid c_j) \cdot p(d_2 \mid c_j) \cdots p(d_n \mid c_j)$
  - Each of the $p(d_i \mid c_j)$ can be estimated from a histogram on $d_i$ values for each class $c_j$
    - the histogram is computed from the training instances
  - Histograms on multiple attributes are more expensive to compute and store
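
The histogram-based estimate used by a naïve Bayesian classifier can be sketched as follows. The sketch assumes categorical attributes stored as tuples and applies no smoothing, so a value never seen with a class gets probability 0; the names `train` and `classify` are illustrative.

```python
from collections import Counter, defaultdict


def train(instances, classes):
    """Build p(c_j) counts and one histogram per (attribute index, class)."""
    class_counts = Counter(classes)
    hist = defaultdict(Counter)  # hist[(i, c)][value] = number of occurrences
    for d, c in zip(instances, classes):
        for i, value in enumerate(d):
            hist[(i, c)][value] += 1
    return class_counts, hist


def classify(d, class_counts, hist):
    """Return the class maximizing p(c_j) * product of p(d_i | c_j)."""
    total = sum(class_counts.values())

    def score(c):
        s = class_counts[c] / total  # p(c_j)
        for i, value in enumerate(d):
            s *= hist[(i, c)][value] / class_counts[c]  # p(d_i | c_j)
        return s

    # p(d) is the same for every class, so it is left out of the score.
    return max(class_counts, key=score)
```

Each histogram covers a single attribute, which is what keeps the tables small compared with histograms over combinations of attributes.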

Regression
- Regression deals with the prediction of a value, rather than a class.
  - Given values for a set of variables $X_1, X_2, \ldots, X_n$, we wish to predict the value of a variable $Y$.
- One way is to infer coefficients $a_0, a_1, a_2, \ldots, a_n$ such that
  $Y = a_0 + a_1 X_1 + a_2 X_2 + \cdots + a_n X_n$
- Finding such a linear polynomial is called linear regression.
  - In general, the process of finding a curve that fits the data is also called curve fitting.
- The fit may only be approximate
  - because of noise in the data, or
  - because the relationship is not exactly a polynomial
- Regression aims to find coefficients that give the best possible fit.
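
For a single variable (n = 1), the coefficients can be computed in closed form. The sketch below assumes that "best possible fit" means least squares, which the slides do not state explicitly; `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Least-squares estimate of a0 and a1 in Y = a0 + a1 * X1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all X values are equal; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a1 = sxy / sxx
    a0 = mean_y - a1 * mean_x
    return a0, a1
```

With noisy data the returned line no longer passes through every point, which is the approximate fit the slide describes.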

Association Rules
- Retail shops are often interested in associations between different items that people buy.
  - Someone who buys bread is quite likely also to buy milk
  - A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
- Associations information can be used in several ways.
  - E.g., when a customer buys a particular book, an online shop may suggest associated books.
- Association rules:
  - bread ⇒ milk
  - DB-Concepts, OS-Concepts ⇒ Networks
  - Left hand side: antecedent, right hand side: consequent
  - An association rule must have an associated population; the population consists of a set of instances
    - E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population

Association Rules (Cont.)
- Rules have an associated support, as well as an associated confidence.
- Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule.
  - E.g., the support for the rule milk ⇒ screwdrivers is low.
- Confidence is a measure of how often the consequent is true when the antecedent is true.
  - E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk.

Finding Association Rules
- We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater)
- Naïve algorithm
  1. Consider all possible sets of relevant items.
  2. For each set find its support (i.e., count how many transactions purchase all items in the set).
     - Large itemsets: sets with sufficiently high support
  3. Use large itemsets to generate association rules.
     - From itemset $A$ generate the rule $A - \{b\} \Rightarrow b$ for each $b \in A$.
       - Support of rule = support($A$).
       - Confidence of rule = support($A$) / support($A - \{b\}$)

Finding Support
- Determine support of itemsets via a single pass on set of transactions
  - Large itemsets: sets with a high count at the end of the pass
- If memory is not enough to hold all counts for all itemsets, use multiple passes, considering only some itemsets in each pass.
- Optimization: once an itemset is eliminated because its count (support) is too small, none of its supersets needs to be considered.
- The a priori technique to find large itemsets:
  - Pass 1: count support of all sets with just 1 item. Eliminate those items with low support
  - Pass i: candidates: every set of i items such that all its i-1 item subsets are large
    - Count support of all candidates
    - Stop if there are no candidates
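
The a priori passes and the rule-generation step can be sketched together. The sketch assumes transactions are Python sets held in memory and recounts support by scanning the list each time, which is simpler than the single-pass counting the slides describe; all names are illustrative.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def large_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with enough support."""
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]  # pass 1: single items
    large = {}
    size = 1
    while candidates:
        current = {}
        for c in candidates:
            s = support(c, transactions)
            if s >= min_support:
                current[c] = s
        large.update(current)
        size += 1
        # Pass i: keep only sets whose (i-1)-item subsets are all large.
        remaining = sorted({i for c in current for i in c})
        candidates = [
            frozenset(c) for c in combinations(remaining, size)
            if all(frozenset(sub) in current for sub in combinations(c, size - 1))
        ]
    return large


def rules(large):
    """From each large itemset A, generate A - {b} => b for each b in A."""
    out = []
    for a, sup in large.items():
        if len(a) < 2:
            continue
        for b in a:
            antecedent = a - {b}
            confidence = sup / large[antecedent]
            out.append((antecedent, b, sup, confidence))
    return out
```

If five of six hypothetical transactions contain bread and four of those also contain milk, the rule bread ⇒ milk comes out with confidence 0.8, the same figure as the slide's 80 percent example.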

Other Types of Associations
- Basic association rules have several limitations
- Deviations from the expected probability are more interesting
  - E.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both
  - We are interested in positive as well as negative correlations between sets of items
    - Positive correlation: co-occurrence is higher than predicted
    - Negative correlation: co-occurrence is lower than predicted
- Sequence associations / correlations
  - E.g., whenever bonds go up, stock prices go down in 2 days
- Deviations from temporal patterns
  - E.g., deviation from a steady growth
  - E.g., sales of winter wear go down in summer
    - Not surprising, part of a known pattern.
    - Look for deviation from value predicted using past patterns

Clustering
- Clustering: intuitively, finding clusters of points in the given data such that similar points lie in the same cluster
- Can be formalized using distance metrics in several ways
  - Group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized
    - Centroid: point defined by taking average of coordinates in each dimension.
  - Another metric: minimize average distance between every pair of points in a cluster
- Has been studied extensively in statistics, but on small data sets
  - Data mining systems aim at clustering techniques that can handle very large data sets
  - E.g., the Birch clustering algorithm (more shortly)

Hierarchical Clustering
- Example from biological classification
  - (the word classification here does not mean a prediction mechanism)
      chordata
        mammalia
          leopards
          humans
        reptilia
          snakes
          crocodiles
- Other examples: Internet directory systems (e.g., Yahoo, more on this later)
- Agglomerative clustering algorithms
  - Build small clusters, then cluster small clusters into bigger clusters, and so on
- Divisive clustering algorithms
  - Start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones

Clustering Algorithms
- Clustering algorithms have been designed to handle very large datasets
- E.g., the Birch algorithm
  - Main idea: use an in-memory R-tree to store points that are being clustered
  - Insert points one at a time into the R-tree, merging a new point with an existing cluster if it is less than some δ distance away
  - If there are more leaf nodes than fit in memory, merge existing clusters that are close to each other
  - At the end of first pass we get a large number of clusters at the leaves of the R-tree
    - Merge clusters to reduce the number of clusters
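
The insert-and-merge step can be illustrated without the R-tree. The sketch below is not the Birch algorithm: it assumes a linear scan over cluster centroids instead of a tree, measures distance to the centroid (the slides do not say which distance is meant), and leaves out the memory-bounded merging of close clusters.

```python
import math


def centroid(points):
    """Point obtained by averaging the coordinates in each dimension."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def incremental_clusters(points, delta):
    """Insert points one at a time, merging into the nearest close cluster."""
    clusters = []  # each cluster is a list of points
    for p in points:
        nearest, nearest_dist = None, None
        for c in clusters:
            dist = math.dist(p, centroid(c))
            if nearest_dist is None or dist < nearest_dist:
                nearest, nearest_dist = c, dist
        if nearest is not None and nearest_dist < delta:
            nearest.append(p)      # within delta: merge into that cluster
        else:
            clusters.append([p])   # otherwise start a new cluster
    return clusters
```

The result depends on the order in which points arrive, which is one reason a later pass that merges clusters is useful.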

Collaborative Filtering
- Goal: predict what movies/books/… a person may be interested in, on the basis of
  - Past preferences of the person
  - Other people with similar past preferences
  - The preferences of such people for a new movie/book/…
- One approach based on repeated clustering
  - Cluster people on the basis of preferences for movies
  - Then cluster movies on the basis of being liked by the same clusters of people
  - Again cluster people based on their preferences for (the newly created clusters of) movies
  - Repeat above till equilibrium
- Above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest

Other Types of Mining
- Text mining: application of data mining to textual documents
  - cluster Web pages to find related pages
  - cluster pages a user has visited to organize their visit history
  - classify Web pages automatically into a Web directory
- Data visualization systems help users examine large volumes of data and detect patterns visually
  - Can visually encode large amounts of information on a single screen
  - Humans are very good at detecting visual patterns
