Network Analysis in the Age of Large Network
Dataset Collections – Challenges, Solutions and
Applications
Tutorial at CIKM 2017 by Jérôme Kunegis and Renaud Lambiotte
This tutorial is co-located with the
International Conference on Information and Knowledge Management (CIKM 2017), held November 6–10, 2017 in Singapore.
Location: Pan Pacific Hotel Singapore
Room: Ocean 9
Date: November 10, 2017
Time: Half-day (8:30–12:00, with a break from 10:00–10:30)
Tutorial Summary
We present an overview of methods and techniques for working with collections of network datasets.
While it has long been observed that many datasets used in data mining
are networks, their easy availability via repositories such as SNAP,
KONECT and ICON is only a recent development. These collections now
allow new types of analyses, studies and evaluations that were not
possible before. In this tutorial, we present the state of the art in
this area, as well as the practical skills needed to work with such
collections, including choosing datasets, acquiring them, implementing
large numbers of experiments, and handling the statistical and
machine-learning issues related to their use. The presenters of this
tutorial are affiliated with the KONECT project, one such repository,
but the skills taught in the tutorial are not exclusive to it.
Tutorial Slides
Overview
Tutorial type: half-day (3 hours plus breaks)
Target audience: Both novice and experienced practitioners in data
mining, machine learning and related areas whose work involves the use
of real-world network datasets. In particular, the tutorial will be of
use to researchers who wish to increase the statistical significance of
their results. To a smaller extent, the tutorial will also provide
insight for practitioners who wish to maintain their own (small or
large) repository of network datasets.
Prerequisites: Attendees should have knowledge of graph theory, be
familiar with the general use of datasets and with basic statistical
methods, and know the essential concepts of machine learning. Knowledge
of any particular programming language is not required, although a few
examples will be given in GNU Octave.
Benefits: We aim to make attendees of this tutorial aware that the use
of multiple network datasets in research can lead both to higher
statistical significance of results and to better performance of
machine learning and prediction methods. Attendees should leave the
tutorial with a good overview of recent developments in the field, of
which work by the presenters represents only a small part.
Additionally, we will share first-hand insight into a large corpus of
network datasets, and give attendees practical tools to exploit such
corpora themselves.
Introduction
Network datasets play a central role in data mining, machine learning
and related disciplines. In fact, a large number of papers published at
CIKM and other conferences use network datasets in their experiments.
Such network datasets are sometimes crawled by researchers for a single
study, but more often than not, practitioners use network datasets
provided by other researchers.
In fact, several projects have emerged in the last decade with the goal
of aggregating such network datasets: Websites in this category
are the Pajek Datasets page, the
UCINET
collection, the Arizona State University
page, Network
Repository, and the Stanford Network
Analysis Project (SNAP). More recently, the KONECT
project (the Koblenz Network Collection, maintained by the presenters
of this tutorial) and, most recently, the Index of Complex Networks
(ICON) have taken a more systematic approach to such datasets, tagging
the networks with their properties (directed, undirected, bipartite,
signed, etc.), and providing precomputed network metrics (such as the
clustering coefficient and the diameter) and plots (such as the degree
distribution and the hop plot). The availability of these
repositories opens up new possibilities in the field: It is now possible to evaluate
algorithms or methods on dozens, hundreds, or even thousands of datasets. While
this availability has the potential to give radical new insights into
the datasets as well as into the evaluated methods, it also raises new
challenges on multiple levels. Which datasets should be chosen? How can
the statistical significance of results be determined? On a practical
level, questions arise as to which data structures to use, and how to
manage the task of running the resulting large number of experiments.
This tutorial will start with a brief overview of the field, cover the
structural diversity of datasets, and present the file formats and data
structures available for handling them. The core of the tutorial (the
parts Applications and Case Studies) is dedicated to a selection of
data mining methods specific to the field; the goal is not to teach
these methods in isolation, but to give attendees the knowledge to
apply corresponding techniques to other data mining and machine
learning methods. Then, we present specific data mining problems
arising from the management of such repositories, in particular the
detection of duplicate datasets. We end with a very brief overview of
the Stu build automation system, which is used successfully by the
presenters for the KONECT project, again with a focus on the generality
of the methods rather than on the specific technology.
Outline
The tutorial will be a half-day tutorial of about 180 minutes
(excluding breaks), divided into the following five parts. The
indicated durations are approximate. The tutorial will cover both work
by the presenters (in relation to the KONECT project) and work by other
researchers, in particular the authors of the repositories SNAP and
ICON.
- Motivation: (20 min)
- Ubiquity of networked data: Discuss whether “everything is a
network.”
Show to what extent many available datasets are in fact networks,
and stress the diversity of what can be a network. Briefly
discuss what is not a network. Review the different
types of applications (both academic and otherwise) that can be
realized with networks: prediction problems, unsupervised
problems, model fitting, testing of algorithms, etc., in
particular with respect to the diverse fields and subfields of research
involved.
- Availability of datasets: Numerous corpora of network datasets are
now available, including SNAP, KONECT, and ICON; many began as
collections by individual researchers and have grown large over time.
- Networks as a data structure: (25 min)
- Structural diversity of networks: Review the types of
networks in existence: directed, bipartite, signed, temporal,
“pathway”-like networks, hypergraphs, etc. Stress that each
application presented earlier places different requirements on the
datasets.
- Data structures for networks: Present the different ways
network datasets are represented on computers, with a focus on
their practical usage with SNAP, KONECT, and ICON. Stress the
fact that the textual file format used by these three projects is
simple, but may lead to confusion as to the precise type of
network at hand. (A standard example: is this network directed or
not? See the sketch at the end of this part.)
- Licenses and legal issues: Cover the different types of
licenses (or the lack of a license) used by the main projects in
the area. Relate experiences both from KONECT and from other
projects, where applicable.
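As an illustration of the file-format point above, the following GNU Octave sketch loads a network given as a plain edge list (one "u v" pair per line, as used with minor variations by SNAP, KONECT, and ICON; the file name is hypothetical, and comment lines are assumed to have been stripped) and measures how many edges are reciprocated. Note that this heuristic cannot distinguish an undirected network stored with one line per edge from a directed network with low reciprocity, which is exactly the ambiguity mentioned above.

    % Load a whitespace-separated edge list; the file name is hypothetical.
    E = dlmread('out.example-network');  % one "u v" line per edge
    u = E(:,1);  v = E(:,2);
    n = max([u; v]);                     % assuming contiguous 1-based node IDs

    % Build a sparse adjacency matrix; sparse() sums duplicate entries
    A = sparse(u, v, 1, n, n);

    % Fraction of edges that also appear in the reverse direction
    reciprocity = nnz(spones(A) & spones(A')) / nnz(A);
    printf('Fraction of reciprocated edges: %g\n', reciprocity);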
- Applications: (50 min)
- Comparison of datasets: The simplest application when multiple
datasets are available is a comparison. We cover
comparisons via network statistics (i.e., numerical metrics), via
plots, and via various distributions (e.g., centralities,
distances). Using examples from KONECT, we show the various ways
networks can be compared, and in particular how different
measures perform at that task.
- Detecting outliers: We show how outliers can be found, i.e.,
datasets for which one property differs significantly from that of
other datasets.
- Statistical significance: We sketch the relationship between
the number of datasets used in a study and the statistical
significance of the results, using the evaluation of recommender
systems as an example application (see the sketch after this part).
- Small datasets vs. large datasets: We show how a practitioner
may choose between large and small datasets, and what tradeoffs
the choice involves.
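To make the preceding points concrete, here is a minimal GNU Octave sketch of a paired comparison of two methods across a collection of datasets. The dataset names and the functions load_network() and evaluate_method() are hypothetical placeholders, not actual repository identifiers or APIs; the point is that the standard error of the mean difference shrinks as the number of datasets grows.

    % Evaluate two methods on a list of datasets.  The dataset names
    % and both functions called below are placeholders.
    datasets = {'net-a', 'net-b', 'net-c'};
    k = numel(datasets);
    acc = zeros(k, 2);
    for i = 1 : k
      A = load_network(datasets{i});            % placeholder loader
      acc(i,1) = evaluate_method(A, 'method-1');
      acc(i,2) = evaluate_method(A, 'method-2');
    end

    % Paired comparison across datasets: the standard error of the mean
    % difference decreases as 1/sqrt(k), i.e., more datasets give a more
    % statistically significant comparison.
    d = acc(:,1) - acc(:,2);
    printf('mean difference = %g, standard error = %g\n', ...
           mean(d), std(d) / sqrt(k));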
- Case Studies: (60 min)
- Degree distributions: Using the example of degree
distributions, one of the most basic and commonly studied
distributions in networks, we show that the question “Do networks
generally have power-law degree distributions?” can be answered
in wildly different ways, depending on the approach (see the
sketch after this part). We review early work by Clauset and
colleagues and end by presenting recent work by Broido and Clauset
using ICON data.
- Network repositories as a training corpus: We use recent
unpublished work by the authors to show that a learning problem
can be improved by being trained on a corpus of network datasets
as a whole.
- Transfer learning: Using the example of social network role
analysis, we show how the availability of many network datasets
allows us to perform and evaluate transfer learning, i.e.,
learning a function on one dataset and applying it to another.
- Classification of datasets: Based on multiple works,
we show how corpora of network
datasets can be used to classify networks, i.e., to detect whether a
given network is (for instance) a social network or a knowledge
network. The problem has been approached from different angles,
giving insight into the different ways to use network dataset
collections.
- Motifs in temporal networks: Based on a recent
paper by Paranjape and colleagues, we show how a
collection of datasets was used to detect motifs in temporal
networks, and how the results were compared across datasets.
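As a starting point for the degree-distribution discussion, the following GNU Octave sketch computes and plots the degree distribution of a network given as a sparse adjacency matrix A (as built in the earlier sketch). A roughly straight line on the log-log plot is often taken as informal evidence for a power law; the works discussed above replace this visual check with maximum-likelihood fits and formal model comparison.

    % Degree distribution of a network with sparse adjacency matrix A;
    % symmetrize first (A = A | A') if the network is undirected.
    deg  = full(sum(spones(A), 2));   % node degrees
    deg  = deg(deg > 0);              % ignore isolated nodes
    dist = accumarray(deg, 1);        % dist(d) = number of nodes of degree d
    d    = find(dist);
    loglog(d, dist(d), '.');
    xlabel('degree');  ylabel('number of nodes');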
- Management of network dataset repositories: (25 min)
- Detection of erroneous data: Based on work by Reif,
we show how errors in a network dataset collection can be detected.
These include duplicates and near-duplicates, as well as, for
instance, directed networks in which the edge directions have been
stored incorrectly.
- Execution of data mining code on many network datasets: We
briefly present the Stu build automation software, which was
specifically designed to allow the computation of many tasks that
are interrelated via a complex dependency graph, in particular with
respect to parallelization, the use of multiple programming
languages, and the reproducibility of results (see the sketch
below).
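Stu itself uses a declarative, Makefile-like rule syntax; independently of that syntax, the core idea can be sketched in GNU Octave as follows. The dataset names and run_experiment() are hypothetical placeholders; the pattern of skipping results that already exist is what makes large experiment runs resumable and trivially parallelizable across independent processes.

    % Run one experiment per dataset, skipping results that already
    % exist, so that interrupted runs can be resumed.  Dataset names
    % and run_experiment() are placeholders.
    datasets = {'net-a', 'net-b', 'net-c'};
    for i = 1 : numel(datasets)
      result_file = sprintf('result.%s.mat', datasets{i});
      if exist(result_file, 'file')
        continue;                     % "target" is already up to date
      end
      result = run_experiment(datasets{i});
      save('-binary', result_file, 'result');
    end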
Material
Presenters
The tutorial is presented by Jérôme Kunegis and Renaud Lambiotte,
both affiliated with the Namur Center for Complex Systems (naXys) at
the University of Namur.
Dr. Jérôme KUNEGIS is a postdoctoral researcher at the
Namur Center for Complex Systems
(naXys) at the University of Namur, Belgium. Dr. Kunegis leads the
KONECT project, one of the network dataset repositories covered by the
tutorial.
He graduated in computer science from the Technical University of
Berlin in 2006, and received his PhD from the University of
Koblenz–Landau, Germany, in 2011, on the spectral analysis of evolving
networks.
He has taught in the fields of network science, database systems, and
graph theory since 2012.
Prof. Renaud LAMBIOTTE
holds a PhD in physics from the Université libre de
Bruxelles. After postdocs at ENS Lyon, the University of Liège, the
Université catholique de Louvain,
and Imperial College London, he is now an associate professor at the
Mathematical Institute of the University of Oxford and a professor at
the University of Namur. His main research interests are the modelling
and analysis of processes taking place on large networks, with a
particular focus on social and brain networks.