# Section 9.1. Graph Mining.ppt

June 23, 2019,Mining and Searching Graphs in Graph Databases,1,Data Mining: Concepts and Techniques — Chapter 9 — 9.1. Graph mining,,June 23, 2019,Mining and Searching Graphs in Graph Databases,2,June 23, 2019,Mining and Searching Graphs in Graph Databases,3,Graph Mining,Methods for Mining Frequent Subgraphs Mining Variant and Constrained Substructure Patterns Applications: Graph Indexing Similarity Search Classification and Clustering Summary,,June 23, 2019,Mining and Searching Graphs in Graph Databases,4,Why Graph Mining?,Graphs are ubiquitous Chemical compounds (Cheminformatics) Protein structures, biological pathways/networks (Bioinformactics) Program control flow, traffic flow, and workflow analysis XML databases, Web, and social network analysis Graph is a general model Trees, lattices, sequences, and items are degenerated graphs Diversity of graphs Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2-D/3-D) Complexity of algorithms: many problems are of high complexity,June 23, 2019,Mining and Searching Graphs in Graph Databases,5,Graph, Graph, Everywhere,Aspirin,Yeast protein interaction network,from H. Jeong et al Nature 411, 41 (2001),Internet,Co-author network,June 23, 2019,Mining and Searching Graphs in Graph Databases,6,Graph Pattern Mining,Frequent subgraphs A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold Applications of graph pattern mining Mining biochemical structures Program control flow analysis Mining XML structures or Web communities Building blocks for graph classification, clustering, compression, comparison, and correlation analysis,June 23, 2019,Mining and Searching Graphs in Graph Databases,7,Example: Frequent Subgraphs,GRAPH DATASET,FREQUENT PATTERNS (MIN SUPPORT IS 2),(A),(B),(C),(1),(2),June 23, 2019,Mining and Searching Graphs in Graph Databases,8,EXAMPLE (II),GRAPH DATASET,FREQUENT PATTERNS (MIN SUPPORT IS 2),June 23, 2019,Mining and Searching Graphs in Graph Databases,9,Graph Mining Algorithms,Incomplete beam search – Greedy (Subdue) Inductive logic programming (WARMR) Graph theory-based approaches Apriori-based approach Pattern-growth approach,June 23, 2019,Mining and Searching Graphs in Graph Databases,10,SUBDUE (Holder et al. KDD’94),Start with single vertices Expand best substructures with a new edge Limit the number of best substructures Substructures are evaluated based on their ability to compress input graphs Using minimum description length (DL) Best substructure S in graph G minimizes: DL(S) + DL(G\S) Terminate until no new substructure is discovered,June 23, 2019,Mining and Searching Graphs in Graph Databases,11,WARMR (Dehaspe et al. KDD’98),Graphs are represented by Datalog facts atomel(C, A1, c), bond (C, A1, A2, BT), atomel(C, A2, c) : a carbon atom bound to a carbon atom with bond type BT WARMR: the first general purpose ILP system Level-wise search Simulate Apriori for frequent pattern discovery,June 23, 2019,Mining and Searching Graphs in Graph Databases,12,Frequent Subgraph Mining Approaches,Apriori-based approach AGM/AcGM: Inokuchi, et al. (PKDD’00) FSG: Kuramochi and Karypis (ICDM’01) PATH#: Vanetik and Gudes (ICDM’02, ICDM’04) FFSM: Huan, et al. (ICDM’03) Pattern growth approach MoFa, Borgelt and Berthold (ICDM’02) gSpan: Yan and Han (ICDM’02) Gaston: Nijssen and Kok (KDD’04),June 23, 2019,Mining and Searching Graphs in Graph Databases,13,Properties of Graph Mining Algorithms,Search order breadth vs. depth Generation of candidate subgraphs apriori vs. pattern growth Elimination of duplicate subgraphs passive vs. active Support calculation embedding store or not Discover order of patterns path tree graph,June 23, 2019,Mining and Searching Graphs in Graph Databases,14,Apriori-Based Approach,…,G,G1,G2,Gn,,,,k-edge,(k+1)-edge,G’,,G’’,,,JOIN,June 23, 2019,Mining and Searching Graphs in Graph Databases,15,Apriori-Based, Breadth-First Search,AGM (Inokuchi, et al. PKDD’00) generates new graphs with one more node,Methodology: breadth-search, joining two graphs,FSG (Kuramochi and Karypis ICDM’01) generates new graphs with one more edge,June 23, 2019,Mining and Searching Graphs in Graph Databases,16,PATH (Vanetik and Gudes ICDM’02, ’04),Apriori-based approach Building blocks: edge-disjoint path,,,,,,,,,,,A graph with 3 edge-disjoint paths,construct frequent paths construct frequent graphs with 2 edge-disjoint paths construct graphs with k+1 edge-disjoint paths from graphs with k edge-disjoint paths repeat,June 23, 2019,Mining and Searching Graphs in Graph Databases,17,FFSM (Huan, et al. ICDM’03),Represent graphs using canonical adjacency matrix (CAM) Join two CAMs or extend a CAM to generate a new graph Store the embeddings of CAMs All of the embeddings of a pattern in the database Can derive the embeddings of newly generated CAMs,June 23, 2019,Mining and Searching Graphs in Graph Databases,18,Pattern Growth Method,…,G,G1,G2,Gn,,,,k-edge,(k+1)-edge,(k+2)-edge,,,,,,,,,,duplicate graph,June 23, 2019,Mining and Searching Graphs in Graph Databases,19,MoFa (Borgelt and Berthold ICDM’02),Extend graphs by adding a new edge Store embeddings of discovered frequent graphs Fast support calculation Also used in other later developed algorithms such as FFSM and GASTON Expensive Memory usage Local structural pruning,June 23, 2019,Mining and Searching Graphs in Graph Databases,20,GSPAN (Yan and Han ICDM’02),Right-Most Extension,Theorem: Completeness,The Enumeration of Graphs using Right-most Extension is COMPLETE,June 23, 2019,Mining and Searching Graphs in Graph Databases,21,DFS Code,Flatten a graph into a sequence using depth first search,0,1,2,3,4,,,,,,,June 23, 2019,Mining and Searching Graphs in Graph Databases,22,DFS Lexicographic Order,Let Z be the set of DFS codes of all graphs. Two DFS codes a and b have the relation a=b (DFS Lexicographic Order in Z) if and only if one of the following conditions is true. Leta = (x0, x1, …, xn) and b = (y0, y1, …, yn),,June 23, 2019,Mining and Searching Graphs in Graph Databases,23,DFS Code Extension,Let a be the minimum DFS code of a graph G and b be a non-minimum DFS code of G. For any DFS code d generated from b by one right-most extension,,THEOREM [ RIGHT-EXTENSION ] The DFS code of a graph extended from a Non-minimum DFS code is NOT MINIMUM,June 23, 2019,Mining and Searching Graphs in Graph Databases,24,GASTON (Nijssen and Kok KDD’04),Extend graphs directly Store embeddings Separate the discovery of different types of graphs path tree graph Simple structures are easier to mine and duplication detection is much simpler,June 23, 2019,Mining and Searching Graphs in Graph Databases,25,Graph Pattern Explosion Problem,If a graph is frequent, all of its subgraphs are frequent ─ the Apriori property An n-edge frequent graph may have 2n subgraphs Among 422 chemical compounds which are confirmed to be active in an AIDS antiviral screen dataset, there are 1,000,000 frequent graph patterns if the minimum support is 5%,June 23, 2019,Mining and Searching Graphs in Graph Databases,26,Closed Frequent Graphs,Motivation: Handling graph pattern explosion problem Closed frequent graph A frequent graph G is closed if there exists no supergraph of G that carries the same support as G If some of G’s subgraphs have the same support, it is unnecessary to output these subgraphs (nonclosed graphs) Lossless compression: still ensures that the mining result is complete,June 23, 2019,Mining and Searching Graphs in Graph Databases,27,CLOSEGRAPH (Yan & Han, KDD’03),…,A Pattern-Growth Approach,G,G1,G2,Gn,,,,k-edge,(k+1)-edge,At what condition, can we stop searching their children i.e., early termination?,If G and G’ are frequent, G is a subgraph of G’. If in any part of the graph in the dataset where G occurs, G’ also occurs, then we need not grow G, since none of G’s children will be closed except those of G’.,June 23, 2019,Mining and Searching Graphs in Graph Databases,28,Handling Tricky Exception Cases,,(graph 1),,,a,c,b,d,,,(pattern 2),,(pattern 1),,,,(graph 2),,,a,c,b,d,,,,,,a,b,,,,,a,c,d,,,June 23, 2019,Mining and Searching Graphs in Graph Databases,29,Experimental Result,The AIDS antiviral screen compound dataset from NCI/NIH The dataset contains 43,905 chemical compounds Among these 43,905 compounds, 423 of them belongs to CA, 1081 are of CM, and the remaining are in class CI,June 23, 2019,Mining and Searching Graphs in Graph Databases,30,Discovered Patterns,20%,10%,5%,June 23, 2019,Mining and Searching Graphs in Graph Databases,31,Performance (1): Run Time,Minimum support (in %),Run time per pattern (msec),June 23, 2019,Mining and Searching Graphs in Graph Databases,32,Performance (2): Memory Usage,Minimum support (in %),Memory usage (GB),June 23, 2019,Mining and Searching Graphs in Graph Databases,33,Number of Patterns: Frequent vs. Closed,CA,Minimum support,Number of patterns,June 23, 2019,Mining and Searching Graphs in Graph Databases,34,Runtime: Frequent vs. Closed,CA,Minimum support,Run time (sec),June 23, 2019,Mining and Searching Graphs in Graph Databases,35,Do the Odds Beat the Curse of Complexity?,Potentially exponential number of frequent patterns The worst case complexty vs. the expected probability Ex.: Suppose Walmart has 104 kinds of products The chance to pick up one product 10-4 The chance to pick up a particular set of 10 products: 10-40 What is the chance this particular set of 10 products to be frequent 103 times in 109 transactions? Have we solved the NP-hard problem of subgraph isomorphism testing? No. But the real graphs in bio/chemistry is not so bad A carbon has only 4 bounds and most proteins in a network have distinct labels,June 23, 2019,Mining and Searching Graphs in Graph Databases,36,Graph Mining,Methods for Mining Frequent Subgraphs Mining Variant and Constrained Substructure Patterns Applications: Graph Indexing Similarity Search Classification and Clustering Summary,,June 23, 2019,Mining and Searching Graphs in Graph Databases,37,Constrained Patterns,Density Diameter Connectivity Degree Min, Max, Avg,June 23, 2019,Mining and Searching Graphs in Graph Databases,38,Constraint-Based Graph Pattern Mining,Highly connected subgraphs in a large graph usually are not artifacts (group, functionality),Recurrent patterns discovered in multiple graphs are more robust than the patterns mined from a single graph,,,,June 23, 2019,Mining and Searching Graphs in Graph Databases,39,No Downward Closure Property,Given two graphs G and G’, if G is a subgraph of G’, it does not imply that the connectivity of G is less than that of G’, and vice versa.,,,,,,,,,,G,G’,June 23, 2019,Mining and Searching Graphs in Graph Databases,40,Minimum Degree Constraint,Let G be a frequent graph and X be the set of edges which can be added to G such that G U e (e ε X) is connected and frequent. Graph G U X is the maximal graph that can be Extended (one step) from the vertices belong to G,,,,,G,,,,,G U X,,,,,,,,,,June 23, 2019,Mining and Searching Graphs in Graph Databases,41,Pattern-Growth Approach,Find a small frequent candidate graph Remove vertices (shadow graph) whose degree is less than the connectivity Decompose it to extract the subgraphs satisfying the connectivity constraint Stop decomposing when the subgraph has been checked before Extend this candidate graph by adding new vertices and edges Repeat,June 23, 2019,Mining and Searching Graphs in Graph Databases,42,Pattern-Reduction Approach,Decompose the relational graphs according to the connectivity constraint,,June 23, 2019,Mining and Searching Graphs in Graph Databases,43,Pattern-Reduction Approach (cont.),Intersect them and decompose the resulting subgraphs,,intersect,intersect,,,final result,June 23, 2019,Mining and Searching Graphs in Graph Databases,44,Graph Mining,Methods for Mining Frequent Subgraphs Mining Variant and Constrained Substructure Patterns Applications: Classification and Clustering Graph Indexing Similarity Search Summary,,June 23, 2019,Mining and Searching Graphs in Graph Databases,45,Graph Clustering,Graph similarity measure Feature-based similarity measure Each graph is represented as a feature vector The similarity is defined by the distance of their corresponding vectors Frequent subgraphs can be used as features Structure-based similarity measure Maximal common subgraph Graph edit distance: insertion, deletion, and relabel Graph alignment distance,June 23, 2019,Mining and Searching Graphs in Graph Databases,46,Graph Classification,Local structure based approach Local structures in a graph, e.g., neighbors surrounding a vertex, paths with fixed length Graph pattern-based approach Subgraph patterns from domain knowledge Subgraph patterns from data mining Kernel-based approach Random walk (Gärtner ’02, Kashima et al. ’02, ICML’03, Mahé et al. ICML’04) Optimal local assignment (Fröhlich et al. ICML’05) Boosting (Kudo et al. NIPS’04),June 23, 2019,Mining and Searching Graphs in Graph Databases,47,Graph Pattern-Based Classification,Subgraph patterns from domain knowledge Molecular descriptors Subgraph patterns from data mining General idea Each graph is represented as a feature vector x = {x1, x2, …, xn}, where xi is the frequency of the i-th pattern in that graph Each vector is associated with a class label Classify these vectors in a vector space,June 23, 2019,Mining and Searching Graphs in Graph Databases,48,Subgraph Patterns from Data Mining,Sequence patterns (De Raedt and Kramer IJCAI’01) Frequent subgraphs (Deshpande et al, ICDM’03) Coherent frequent subgraphs (Huan et al. RECOMB’04) A graph G is coherent if the mutual information between G and each of its own subgraphs is above some thresholdClosed frequent subgraphs (Liu et al. SDM’05),June 23, 2019,Mining and Searching Graphs in Graph Databases,49,Kernel-based Classification,Random walk Marginalized Kernels (Gärtner ’02, Kashima et al. ’02, ICML’03, Mahé et al. ICML’04) and are paths in graphs and and are probability distributions on paths is a kernel between paths, e.g.,,June 23, 2019,Mining and Searching Graphs in Graph Databases,50,Kernel-based Classification,Optimal local assignment (Fröhlich et al. ICML’05),,Can be extended to include neighborhood information e.g.,where could be an RBF-kernel to measure the similarity of neighborhoods of vertices and , is a damping parameter,,June 23, 2019,Mining and Searching Graphs in Graph Databases,51,Boosting in Graph Classification,Decision stumps Simple classifiers in which the final decision is made by single features. A rule is a tuple . If a molecule contains substructure , it is classified as .GainApplying boosting,,June 23, 2019,Mining and Searching Graphs in Graph Databases,52,Graph Compression,Extract common subgraphs and simplify graphs by condensing these subgraphs into nodes,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,June 23, 2019,Mining and Searching Graphs in Graph Databases,53,Graph Mining,Methods for Mining Frequent Subgraphs Mining Variant and Constrained Substructure Patterns Applications: Classification and Clustering Graph Indexing Similarity Search Summary,