Network Analysis in the Age of Large Network Dataset Collections – Challenges, Solutions and Applications

Tutorial at CIKM 2017 by Jérôme Kunegis and Renaud Lambiotte

This tutorial is co-located with the International Conference on Information and Knowledge Management (CIKM 2017), held November 6–10, 2017, in Singapore.

Location: Pan Pacific Hotel Singapore
Room: Ocean 9
Date: November 10, 2017
Time: Half-day (8:30–12:00, with a break from 10:00–10:30)

Tutorial Summary

We present an overview of methods and techniques for working with collections of network datasets. While it has long been observed that many datasets used in data mining are networks, only recently have such datasets become easily available via repositories such as SNAP, KONECT and ICON. These collections now allow new types of analyses, studies and evaluations that were not possible before. In this tutorial, we present the state of the art in this area as well as the practical skills needed to work with such collections: choosing datasets, acquiring them, implementing large numbers of experiments, and handling the statistical and machine-learning issues related to their use. The presenters of this tutorial are affiliated with the KONECT project, one such repository, but the skills taught in the tutorial are not exclusive to it.

Tutorial Slides

Overview

Tutorial type: half-day (3 hours plus breaks)
Target audience: Both novice and experienced practitioners in data mining, machine learning and related areas whose work involves the use of real-world network datasets. In particular, the tutorial will be of use to researchers who wish to increase the statistical significance of their research. To a smaller extent, the tutorial will also provide valuable insight for practitioners wishing to maintain their own (small or large) repository of network datasets.
Prerequisites: Attendees should have knowledge of graph theory, be familiar with the general use of datasets and with basic statistical methods, and know the essential concepts of machine learning. Knowledge of any particular programming language is not required, although a few examples will be given in GNU Octave.
Benefits: We strive to make attendees aware that the use of multiple network datasets in research can lead both to higher statistical significance of results and to better performance of machine learning and prediction methods. Attendees should leave the tutorial with a good overview of recent developments in the field, of which work by the presenters represents only a small part. Additionally, we will present first-hand insight into a large corpus of network datasets, and give attendees practical tools to exploit such corpora themselves.

Introduction

Network datasets play a central role in data mining, machine learning and related disciplines. In fact, a large number of papers published at CIKM and other conferences use network datasets in their experiments. Such datasets are sometimes crawled by researchers for a single study, but more often than not practitioners use network datasets provided by other researchers. Indeed, several projects have emerged in the last decade with the goal of aggregating such datasets: websites in this category are the Pajek Datasets page, the UCINET collection, the Arizona State University page, Network Repository, and the Stanford Network Analysis Project (SNAP). More recently, the KONECT project (Koblenz Network Collection, by the presenters of this tutorial) as well as the most recent Index of Complex Networks (ICON) have taken a systematic approach to such datasets, tagging the networks with their properties (directed, undirected, bipartite, signed, etc.), and providing precomputed network metrics (such as the clustering coefficient, diameter, etc.) and plots (such as the degree distribution, the hop plot, etc.).

The availability of these repositories opens up new possibilities in the field: it is now possible to evaluate algorithms or methods on dozens, hundreds, or even thousands of datasets. While this availability has the potential to give radically new insights into the datasets as well as into the evaluated methods, it also raises new challenges on multiple levels. Which datasets should be chosen? How can the statistical significance of algorithms be determined? On a practical level, which data structures should be used, and how can the task of running the resulting high number of experiments be managed?

This tutorial will start with a brief overview of the field, cover the structural diversity of datasets, and present the file formats and data structures available for handling them. The core of the tutorial (the Applications and Case Studies parts) is dedicated to a selection of data mining methods specific to the field; the goal is not to teach these methods in isolation, but to give attendees the knowledge to apply the corresponding techniques to other data mining and machine learning methods. We then present specific data mining problems arising from the management of such repositories, in particular the detection of duplicate datasets. We end with a very brief overview of the Stu build automation system, which the presenters use successfully for the KONECT project, again with a focus on the generality of the methods rather than on the specific technology.

Outline

The tutorial will be a half-day tutorial of about 180 minutes (excluding breaks), divided into the following five parts. The indicated durations are approximate. The tutorial will cover both work by the presenters (in relation to the KONECT project) and work by other researchers, in particular by the authors of the SNAP and ICON repositories.

  1. Motivation: (20 min)
    • Ubiquity of networked data: Discuss whether “everything is a network.” Show to what extent many available datasets are in fact networks, and stress the diversity of what can be a network. Briefly discuss what is not a network. Review the different types of applications (both academic and otherwise) that can be realized with networks: prediction problems, unsupervised problems, model fitting, testing of algorithms, etc., in particular with respect to the diverse fields and subfields of research involved.
    • Availability of datasets: Numerous corpora of network datasets are now available, including SNAP, KONECT, ICON, and others. Many began as collections by individual researchers and have grown to a large size over time.
  2. Networks as a data structure: (25 min)
    • Structural diversity of networks: Review the types of networks in existence: directed, bipartite, signed, temporal, “pathway”-like networks, hypergraphs, etc. Stress that each of the applications presented earlier places different requirements on datasets.
    • Data structures for networks: Present the different ways network datasets are represented on computers, with a focus on their practical usage with SNAP, KONECT, and ICON. Stress the fact that the textual file format used by these three projects is simple, but may lead to confusion as to the precise type of network at hand. (A standard example being: Is this network directed or not?) A minimal loading sketch in GNU Octave is given after this outline.
    • Licenses and legal issues: Cover the different types of licenses (or the lack of a license) used by the main projects in the area. Relate experiences both from KONECT, as well as from other projects, where applicable.
  3. Applications: (50 min)
    • Comparison of datasets: The simplest application possible when multiple datasets are available is a comparison. We cover comparisons via network statistics (i.e., numerical metrics), via plots, and via various distributions (e.g., centralities, distances). We show, using examples from KONECT, the various ways networks can be compared, and in particular, how different measures compare at that task.
    • Detecting outliers: We show how outliers can be found, i.e., datasets for which one property differs significantly from the rest of the collection (a z-score sketch follows this outline).
    • Statistical significance: We sketch the relationship between the number of datasets used in a study and the statistical significance of the results (illustrated by a sign-test sketch after this outline). As an example application, we use the evaluation of recommender systems.
    • Small datasets vs large datasets: We show how a practitioner may choose between using large or small datasets, and what tradeoffs are involved in the choice.
  4. Case Studies: (60 min)
    • Degree distributions: We show, using the example of degree distributions, one of the most basic and commonly studied distributions in networks, that the question “Do networks generally have power-law degree distributions?” can be answered in wildly different ways depending on the approach (an exponent-fitting sketch follows this outline). We review early work by Clauset and colleagues and end by presenting recent work by Broido and Clauset using ICON data.
    • Network repositories as a training corpus: We use recent unpublished work by the authors to show that performance on a learning problem can be improved by training on a corpus of network datasets as a whole.
    • Transfer learning: Using the example of social network role analysis, we show how the availability of many network datasets allows us to perform and evaluate transfer learning, i.e., learning a function on one dataset and applying it to another.
    • Classification of datasets: Based on multiple works, we show how corpora of network datasets can be used to classify networks, i.e., to detect whether a given network is (for instance) a social network or a knowledge network. The problem has been approached from different angles, giving insight into the different ways to use network dataset collections.
    • Motifs in temporal networks: Based on a recent paper by Paranjape and colleagues, we show how a collection of datasets was used to detect structural motifs in temporal networks, and how the results were compared across datasets.
  5. Management of network dataset repositories: (25 min)
    • Detection of erroneous data: Based on work by Reif, we show how errors in a network dataset collection can be detected. This includes duplicates and near-duplicates, for instance directed networks in which the edge directions have been stored incorrectly.
    • Execution of data mining code on many network datasets: We briefly present the Stu build automation software, which was specifically designed to allow computation of many tasks that are interrelated via a complex dependency graph, in particular with respect to parallelization, the use of multiple programming languages, and reproducibility of results.
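
The following GNU Octave sketch illustrates the file-format point of part 2: it loads a network from the whitespace-separated edge-list format used, in similar forms, by SNAP, KONECT and ICON. The filename is hypothetical, vertex IDs are assumed to be 1-based integers, and real files may carry additional weight or timestamp columns.

  % Minimal sketch: read a plain-text edge list into a sparse adjacency
  % matrix.  The filename is hypothetical; Octave's load() skips header
  % lines that begin with '%' (as in KONECT) or '#' (as in SNAP).
  E = load('edges.txt');        % one edge per line: "u v"
  u = E(:,1);
  v = E(:,2);
  n = max([u; v]);              % number of vertices, assuming 1-based IDs
  A = sparse(u, v, 1, n, n);    % adjacency matrix, read as if directed
  % The file alone does not say whether the network is directed (the
  % ambiguity mentioned above); for an undirected network, symmetrize:
  A = spones(A + A');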
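
For the outlier detection of part 3, one simple approach is to standardize a single statistic across the whole collection and flag datasets with extreme z-scores. All names and values below are made up for illustration; they are not real KONECT numbers.

  % Hedged sketch: flag outlier datasets by the z-score of one statistic.
  % All dataset names and values are invented for illustration.
  names = {'net-a', 'net-b', 'net-c', 'net-d', 'net-e', 'net-f', 'net-g', 'net-h'};
  c = [0.11 0.09 0.13 0.54 0.10 0.12 0.08 0.11];  % e.g., clustering coefficients
  z = (c - mean(c)) / std(c);                     % standardized scores
  outliers = names(abs(z) > 2)                    % here: net-d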
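
The statistical-significance discussion of part 3 can be made concrete with a sign test: if two methods were in truth equivalent, the number of datasets on which method A beats method B would follow a Binomial(m, 1/2) distribution. The counts below are hypothetical.

  % Hedged sketch: two-sided sign test over a collection of m datasets.
  m = 30;              % number of datasets in the evaluation (hypothetical)
  w = 22;              % datasets on which method A beats method B (hypothetical)
  p = 0;
  for i = w:m          % one-sided tail of Binomial(m, 1/2)
    p += nchoosek(m, i) * 0.5^m;
  end
  p = min(2 * p, 1)    % two-sided p-value (tail doubled); here about 0.016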
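
Finally, for the degree-distribution case study of part 4, the exponent of a power-law tail can be estimated by the maximum-likelihood method of Clauset, Shalizi and Newman. The sketch below uses their discrete approximation with a fixed, illustrative cutoff kmin; the full method selects kmin by minimizing a Kolmogorov–Smirnov statistic. A is a symmetric sparse adjacency matrix as built in the first sketch.

  % Sketch: maximum-likelihood estimate of a power-law degree exponent,
  % following Clauset, Shalizi and Newman (discrete approximation).
  d = full(sum(A, 2));    % degree of every vertex
  kmin = 5;               % illustrative cutoff, not estimated here
  k = d(d >= kmin);       % tail of the degree sequence
  alpha = 1 + numel(k) / sum(log(k / (kmin - 0.5)))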

Material

Presenters

The tutorial is presented by Jérôme Kunegis and Renaud Lambiotte, both affiliated with the Namur Center for Complex Systems (naXys) at the University of Namur.

Jérôme KUNEGIS Dr. Jérôme KUNEGIS is a postdoctoral researcher at the Namur Center for Complex Systems (naXys) at the University of Namur, Belgium. Dr. Kunegis leads the KONECT project, one of the network dataset repositories covered by the tutorial. Dr. Kunegis graduated from the Technical University of Berlin in computer science in 2006, and received his PhD in 2011 from the University of Koblenz–Landau, Germany, on the spectral analysis of evolving networks. Dr. Kunegis has been teaching in the fields of network science, database systems, and graph theory since 2012.

Renaud LAMBIOTTE Prof. Renaud LAMBIOTTE has a PhD in physics from the Université libre de Bruxelles. After postdoctoral positions at ENS Lyon, the University of Liège, the Université catholique de Louvain, and Imperial College London, he is now an associate professor at the Mathematical Institute of the University of Oxford and a professor at the University of Namur. His main research interests are the modelling and analysis of processes taking place on large networks, with a particular focus on social and brain networks.
