Social Network Analysis | Murat Can Ganiz

Public Course Note

This page summarizes recurring content from recent offerings of the Social Network Analysis course. Semester-specific classroom links, exact dates, private student information, grades, and internal comments are intentionally omitted. Assessment weights, datasets, and project details may evolve each semester.

Course Overview

Social Network Analysis studies social and information systems as graphs. Students learn how to represent networks, compute graph-level and node-level measures, visualize large networks, detect communities, reason about tie strength and trust, predict missing or future links, and model propagation processes such as epidemics, influence, and threshold-based adoption.

The course combines mathematical graph concepts, algorithmic methods, and applied analysis. A major project asks students to analyze a significant real-world social graph, interpret its structure, and communicate results through reproducible code, visualizations, short reports, and presentations.

Graph theory Network measures Centrality Visualization Tie strength Trust Link prediction Propagation Community detection Small-world analysis SNAP datasets

Learning Outcomes

Network Representation

Represent social, information, communication, collaboration, and interaction data as directed, undirected, weighted, unweighted, bipartite, or multiplex graphs.
Use adjacency matrices, adjacency lists, edge lists, and attribute tables correctly.
Explain how data collection choices affect observed network structure.

Structural Measurement

Compute and interpret degree, strength, density, components, diameter, path length, clustering coefficients, assortativity, k-core structure, and centrality measures.
Compare local, ego-network, group-level, and whole-network measurements.
Identify small-world behavior by comparing real networks with appropriate random baselines.

Algorithms and Models

Apply algorithms for centrality, link prediction, entity resolution, community detection, graph partitioning, and diffusion simulation.
Evaluate community detection with modularity, conductance, and, when ground truth is available, NMI or ARI.
Model propagation using epidemic, threshold, independent cascade, linear threshold, and stochastic-process perspectives.

Applied Interpretation

Produce interpretable network visualizations with meaningful layouts, colors, node sizes, edge widths, and filtered views.
Explain what patterns mean in social, organizational, online-platform, or information-flow contexts.
Prepare reproducible analyses and defend design choices during project presentations.

Textbooks and Reading Structure

Recent offerings combine social-network-oriented readings with more general network science and graph analysis material.

Main Social Web Text

Jennifer Golbeck, Analyzing the Social Web. Topics used in the course include foundations, network structure and measures, visualization, tie strength, trust, user attributes and behavior, building networks, entity resolution, link prediction, and propagation.

Social Network Analysis Reference

Borgatti, Everett, Johnson, and Agneessens, Analyzing Social Networks. Recent offerings use this text for introductory concepts and mathematical foundations.

Network Science Reference

Mark Newman, Networks. Selected readings cover matrix algorithms, graph partitioning, and community structure.

Core Topic Map

The exact week-by-week order may change, but the course usually follows this conceptual progression.

1. Introduction and Graph Theory Foundations

Networks as abstractions of social relations, nodes and edges, directed and undirected graphs, weighted graphs, bipartite graphs, multigraphs, ego networks, paths, walks, cycles, connectedness, graph data structures, and fundamental graph algorithms.

2. Network Structure and Measures

Degree distributions, in-degree and out-degree, node strength, density, components and the giant connected component, shortest paths, diameter, effective diameter, average path length, local and global clustering, transitivity, assortativity, mixing matrices, and k-core decomposition.

3. Centrality and Roles

Degree centrality, betweenness centrality, closeness centrality, eigenvector centrality, PageRank, hubs and authorities, centrality comparison, role interpretation, and centrality correlation analysis.

4. Network Visualization

Random, circular, grid, force-directed, Yifan Hu, ForceAtlas2, and related layouts; visual encoding with size, color, edge width, filtering, k-core views, ego-network views, readability, and layout comparison.

5. Tie Strength, Trust, and User Attributes

Strong and weak ties, homophily, social capital, trust relationships, user attributes, behavioral signals, similarity, influence, and the relationship between network structure and observed behavior.

6. Building Networks from Data

Data collection, APIs, logs, communication records, online interactions, edge definitions, temporal windows, sampling, missing edges, duplicated entities, bipartite projections, privacy, ethics, and reproducibility.

7. Entity Resolution and Link Prediction

Entity matching, duplicate detection, record linkage, common-neighbor methods, Jaccard coefficient, Adamic–Adar, preferential attachment, resource allocation, supervised link prediction, train/test splits over edges, and leakage risks.

8. Propagation in Networks

Epidemic models, SIR/SIS-style reasoning, independent cascade models, linear threshold models, stochastic propagation, seed selection, influence maximization intuition, containment, and the firefighter problem.

9. Community Detection and Graph Partitioning

Communities, modularity, conductance, Louvain and Leiden-style modularity optimization, Infomap, spectral methods, stochastic block models, degree-corrected block models, graph partitioning, and interpretation of discovered groups.

Representative Assessment Pattern

Exact weights may change by semester. A recent offering used the following structure:

Quiz 1: 5%
Quiz 2: 5%
Quiz 3: 5%
Midterm exam: 25%
Project: 20%
Final exam: 40%

Students should bring a scientific calculator to exams and be comfortable with logarithms, matrix-style calculations, probability-style reasoning, graph metrics, centrality computations, propagation steps, and small numerical examples.

Term Project: Structural and Community Analysis of a Large Social Graph

The project is designed to give students experience with a complete network analysis pipeline. Teams typically work with a SNAP dataset or a self-collected graph of significant size. A representative size requirement is approximately at least 2,000 nodes and 10,000 edges, or a justified sampled subgraph.

Dataset and Basic Description

Describe node meaning, edge meaning, directionality, edge weights, edge types, and time windows if available.
Report node degree or strength distributions and edge weight/type distributions.
Analyze reciprocity for directed graphs and summarize attributes if available.
Inspect ego networks of obviously significant nodes.

Whole-Network Metrics

Components, component sizes, and giant connected component.
Density, diameter or approximate effective diameter, and average path length.
Degree distribution, ideally on log–log axes when appropriate.
Global clustering, average local clustering, assortativity, mixing, and k-core summary.

Centrality and Role Analysis

Compute degree, betweenness, closeness, and eigenvector centrality or PageRank.
Compare top-k node lists across centrality measures.
Interpret the social or structural role of high-ranking nodes.
Include a centrality correlation heatmap or a simple comparison table.

Small-World Evidence

Compare average path length and clustering coefficient to a comparable random baseline.
Use a baseline with the same number of nodes and edges, or preferably a degree-sequence-aware baseline when appropriate.
Report a small-world measure such as sigma or omega and explain the result clearly.

Community Detection

Apply at least two distinct community detection algorithms.
Suggested methods include Louvain, Leiden, Infomap, spectral methods, and stochastic block models.
Compare communities using modularity and conductance; use NMI or ARI if ground truth exists.
Explain what the communities mean in the context of the dataset.

Visualization

Show at least four layouts such as random, circular, grid, and force-directed.
Use visual encodings: node size for degree or centrality, color for community or attribute, and edge width for weight.
Include one filtered view, such as a k-core subgraph or edge-weight thresholded graph.
Use captions to explain the visual insight, not only the layout algorithm.

Optional Extra Credit: Diffusion and Propagation

Students may simulate diffusion models such as SIR, independent cascade, or linear threshold and compare seed strategies such as random, high-degree, and high-PageRank seeds. A concise figure and several focused takeaways are usually sufficient.

Expected Project Deliverables

Executive Summary

A short PDF, typically around two pages, summarizing the dataset, key network numbers, small-world verdict, community structure, and top insights.

Figure Deck

A single PDF containing required analysis panels such as degree distribution, component overview, centrality leaderboard, centrality comparison, force-directed layout, layout comparison, mixing/assortativity, small-world evidence, and community comparison.

Reproducible Code

A repository with README, scripts or notebooks, clear environment instructions, and a one-command or one-notebook workflow to regenerate figures whenever possible.

Presentation

A concise presentation, typically around 8–10 minutes, focusing on the most important figures, interpretations, and methodological choices.

Suggested Project Pipeline

Week 1: Proposal

Choose dataset, define node and edge meaning, report expected size, directionality, weighting, risks, and include one teaser plot.

Week 2: Graph Construction

Build the graph, clean identifiers, remove duplicates, create baseline statistics, and draft degree distribution and component panels.

Week 3: Visualization Drafts

Generate at least two layouts, choose visual encodings, and decide which filters or subgraphs will make the graph readable.

Week 4: Centrality and First Communities

Compute centralities, prepare top-k tables, interpret roles, and run an initial community detection algorithm.

Week 5: Layouts and Filters

Complete layout comparison, filtered views, and visual interpretation captions.

Week 6: Community Comparison and Small-World Analysis

Compare at least two community methods and complete random-baseline comparisons for small-world evidence.

Week 7: Final Packaging

Finish executive summary, figure deck, repository README, final presentation, and optional diffusion analysis.

Useful Public Datasets and Tools

Dataset Sources

Stanford Large Network Dataset Collection
Social networks, communication networks, collaboration networks, citation networks, web graphs, and networks with ground-truth communities.
Self-collected datasets may be used when the graph is large enough, ethically collected, and clearly documented.

Analysis Tools

Python: NetworkX, pandas, NumPy, scikit-learn, matplotlib.
Large-scale graph analysis: igraph, graph-tool, SNAP.py, or specialized graph libraries where appropriate.
Visualization: Gephi, Cytoscape, ForceAtlas2, Yifan Hu, and Python-based plotting workflows.

Study Questions

The following questions are useful for exam preparation, project presentations, and self-study. They combine conceptual, computational, and applied reasoning.

Foundations and Graph Representation

Define a social network in terms of nodes, edges, and relations. Give three examples from online or real-world systems.
Explain the difference between directed and undirected graphs. Give an example where direction matters.
Explain the difference between weighted and unweighted graphs. What can edge weight represent in a social network?
What is a bipartite network? Give an example involving users and items, authors and papers, or students and courses.
What is a multiplex network? Why might representing different relation types separately be useful?
Compare adjacency matrices, adjacency lists, and edge lists. Which representation is more appropriate for sparse large graphs?
What is an ego network? What types of questions can ego-network analysis answer?
Explain the meaning of a walk, path, cycle, shortest path, geodesic distance, and connected component.
What is the giant connected component? Why is it important in large social graphs?
How can sampling or missing data distort observed network properties?

Network Measures

Define degree, in-degree, out-degree, and strength. How do these measures differ in directed and weighted graphs?
How is graph density calculated for an undirected graph? How does the formula change for directed graphs without self-loops?
What is the diameter of a network? Why is exact diameter often hard to compute for very large graphs?
Define average path length and effective diameter. Why might effective diameter be preferable for noisy real-world networks?
Explain local clustering coefficient, average local clustering coefficient, global clustering coefficient, and transitivity.
What does a high clustering coefficient suggest about triadic closure in a social network?
What is assortativity? Explain degree assortativity and attribute assortativity with examples.
What is a mixing matrix? How can it reveal homophily or heterophily?
What is k-core decomposition? How can it identify structurally cohesive parts of a network?
Why should degree distributions often be plotted on log–log axes for large networks?
What is the difference between a heavy-tailed degree distribution and a normal-like degree distribution?
How would you interpret a graph with many disconnected components and a small giant connected component?

Centrality and Node Roles

Compare degree centrality, betweenness centrality, closeness centrality, eigenvector centrality, and PageRank.
What type of social role is captured by high betweenness centrality?
Why can a node with low degree still have high betweenness centrality?
What does eigenvector centrality reward that degree centrality does not?
How does PageRank differ from simple eigenvector centrality?
Why can closeness centrality be problematic in disconnected graphs?
What are hubs and authorities? In what kinds of networks does this distinction matter?
Why should top-k centrality lists be interpreted with domain knowledge rather than automatically equated with importance?
What does a centrality-correlation table reveal about the structure of a network?
How can centrality measures be misleading when the graph is sampled or incomplete?

Visualization

Compare random, circular, grid, and force-directed layouts. What does each layout reveal or hide?
Why can force-directed layouts be visually attractive but analytically dangerous if over-interpreted?
What are appropriate visual encodings for node size, node color, edge width, and edge color?
Why is filtering often necessary before visualizing large networks?
Give two examples of useful filtered views, such as a k-core view or weight-thresholded view.
How can community labels be visualized effectively?
What common visualization mistakes can make a network figure misleading?
Why should every network visualization have a caption explaining the analytical message?

Tie Strength, Trust, and Attributes

What is tie strength? What behavioral signals might indicate a strong tie?
Explain the difference between strong ties and weak ties in social network analysis.
What is homophily? How can it be measured using node attributes?
How can trust be represented as an edge attribute?
Why is trust not always symmetric?
What is trust propagation? What assumptions does it make?
How can user behavior help explain observed network structure?
Why should sensitive user attributes be handled carefully in social network analysis?

Building Networks, Entity Resolution, and Link Prediction

What design decisions are required when building a network from raw interaction data?
How does the definition of an edge affect all later analysis results?
What is entity resolution? Why is it important before constructing a social graph?
Give examples of duplicate or ambiguous entities in real datasets.
What is link prediction? Give two practical applications.
Compute and interpret common-neighbor score for a pair of nodes.
Compute and interpret Jaccard coefficient for link prediction.
What is the Adamic–Adar score and why does it downweight high-degree common neighbors?
What is preferential attachment and what assumption does it encode?
How would you create train and test sets for link prediction without leaking future information?
Why can evaluating link prediction on randomly removed edges be unrealistic for temporal networks?

Propagation in Networks

What is the difference between biological contagion, information diffusion, and behavioral adoption?
Explain the SIR model in terms of susceptible, infected, and recovered states.
How does an SIS model differ from an SIR model?
What is the independent cascade model?
What is the linear threshold model?
How does seed selection affect the outcome of a diffusion process?
Compare random, high-degree, high-betweenness, and high-PageRank seed strategies.
What does it mean for a propagation model to be stochastic?
What is the firefighter problem in networks?
How can network clustering slow down or accelerate diffusion depending on the process?
Why are repeated simulations often needed when studying stochastic propagation?
What ethical issues arise when modeling influence or information spread?

Community Detection and Graph Partitioning

What is a community in a network? Why is there no single universally correct definition?
Explain modularity. What does a high modularity value suggest?
What is the resolution limit of modularity-based community detection?
How do Louvain and Leiden-style methods search for communities?
What is Infomap’s intuition for community detection?
How do spectral methods use matrices and eigenvectors for graph partitioning?
What is conductance, and why is it useful for evaluating communities?
When ground-truth communities are available, how can NMI or ARI be used?
Why can two different algorithms produce different but still meaningful community partitions?
How should communities be interpreted in context rather than treated as automatically meaningful social groups?
What is a stochastic block model? How does it differ from purely modularity-based methods?
What does degree correction add to a stochastic block model?

Small-World Analysis

What does it mean for a network to have small-world properties?
Why should a real graph be compared to a random baseline before claiming a small-world effect?
What are C and L in small-world analysis?
What are C_rand and L_rand?
How is the small-world coefficient sigma interpreted?
What is the intuition behind omega-style small-world measures?
Why should the random baseline preserve the number of nodes and edges, and sometimes the degree sequence?
Can a disconnected graph be analyzed for small-world properties directly? What preprocessing choices may be needed?

Project and Reproducibility Questions

What should be included in a dataset description for a network analysis project?
Why should raw data, cleaned data, and graph construction code be clearly separated?
What belongs in a reproducible README for a graph analysis project?
How can a figure deck communicate results more effectively than a long unstructured report?
What are the risks of choosing a graph that is too small for meaningful network analysis?
How can team members divide work while still understanding the whole project?
Why should project conclusions discuss limitations and data-collection bias?
How can graph analysis results be useful for community management, recommendation, information security, marketing, public health, or organizational analysis?

Representative Calculation Practice

Students should be able to solve small examples by hand or with a scientific calculator.

Given a small undirected graph, compute degree centrality for every node.
Given an adjacency matrix, identify connected components and the giant connected component.
Given shortest-path distances, compute closeness centrality for selected nodes.
Given all shortest paths in a small graph, compute betweenness centrality for one node.
Given a node and its neighbors, compute the local clustering coefficient.
Given two candidate missing edges, compute common-neighbor, Jaccard, Adamic–Adar, and preferential-attachment scores.
Given a simple threshold model, determine which nodes become active after each time step.
Given SIR transition probabilities or deterministic rules, trace the state of each node for several time steps.
Given community assignments, compute or interpret modularity qualitatively and identify high-conductance boundary cases.
Given C, L, C_rand, and L_rand, compute a small-world coefficient and interpret it.