Data

The information on this website is distributed for academic and research purposes only.

Macro- and mesoscale pattern interdependencies in Complex Networks

Identifying and explaining the structure of complex networks at different scales has become an important problem across disciplines. At the mesoscale, modular architecture has attracted most of the attention. At the macroscale, other arrangements (e.g. nestedness or core-periphery) have been studied in parallel, but to a much lesser extent. However, empirical evidence increasingly suggests that characterising a network with a unique pattern typology may be too simplistic, since a system can integrate properties from distinct organisations at different scales. Here, we explore the relationship between some of these different organisational patterns: two at the mesoscale (modularity and in-block nestedness); and one at the macroscale (nestedness). We show experimentally and analytically that nestedness imposes bounds on modularity, with exact analytical results in idealised scenarios. Specifically, we show that nestedness and modularity are interdependent. Furthermore, we analytically evidence that in-block nestedness provides a natural combination between nested and modular networks, taking structural properties of both. Far from a mere theoretical exercise, understanding the boundaries that discriminate each architecture is fundamental, to the extent that modularity and nestedness are known to have strong dynamical effects on processes, such as species abundances and stability in Ecology.


Source (citation)
Palazzi, M., Borge-Holthoefer, J., Tessone, C., and Solé-Ribalta, A. (2018). Antagonistic structural patterns in complex networks. arXiv preprint arXiv:1810.12785.

Online division of labour: emergent structures in Open Source Software

The development of Open Source Software fundamentally depends on the participation and commitment of volunteer developers to progress. Several works have presented strategies to increase the on-boarding and engagement of new contributors, but little is known about how these diverse groups of developers self-organise to work together. To understand this, one must consider that, on one hand, platforms like GitHub provide a virtually unlimited development framework: any number of actors can potentially join to contribute in a decentralised, distributed, remote, and asynchronous manner. On the other, however, it seems reasonable that some sort of hierarchy and division of labour must be in place to meet human biological and cognitive limits, and also to achieve some level of efficiency. These latter features (hierarchy and division of labour) should translate into recognisable structural arrangements when projects are represented as developer-file bipartite networks. In this paper we analyse a set of popular open source projects from GitHub, placing the accent on three key properties: nestedness, modularity and in-block nestedness, which typify the emergence of heterogeneities among contributors, the emergence of subgroups of developers working on specific subgroups of files, and a mixture of the two previous, respectively. These analyses show that indeed projects evolve into internally organised blocks. Furthermore, the distribution of sizes of such blocks is bounded, connecting our results to the celebrated Dunbar number both in off- and on-line environments. Our analyses create a link between bio-cognitive constraints, group formation and online working environments, opening up a rich scenario for future research on (online) work team assembly.


Source (citation)
Palazzi, María J., et al. "Online division of labour: emergent structures in Open Source Software." arXiv preprint arXiv:1903.03375 (2019).

Twitter Timeseries

Every case study in the paper (except Higgs and Unfiltered Stream) comes in two "flavors", corresponding to two alternative geographical partitions.
In general, the time series in *.dat files are represented in columns. That is, each column represents a geographic unit (city, state, country, etc); and each row corresponds to a time unit. THE EXCEPTION TO THIS IS THE HIGGS DATASET, IN WHICH TIME SERIES ARE TRANSPOSED.
In general, a "time unit" is 1 minute: every value in the time series aggregates the number of messages for a given minute. THE EXCEPTION TO THIS IS THE UNFILTERED STREAM DATASET, IN WHICH EACH ROW CORRESPONDS TO 1 SECOND.
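
The layout described above can be read with a few lines of Python. This is only a minimal sketch: it assumes the *.dat files contain whitespace-separated numbers with no header line, which the description does not state, and the function name `load_timeseries` is hypothetical. The `transposed` flag covers the Higgs exception by returning every dataset in the same orientation (rows = time units, columns = geographic units).

```python
import io

import numpy as np


def load_timeseries(source, transposed=False):
    """Return a 2-D array with one row per time unit and one column per geographic unit.

    ASSUMPTION: the file holds whitespace-separated numeric values and no
    header line. Adjust the np.loadtxt arguments if the real files differ.
    """
    data = np.loadtxt(source, ndmin=2)
    # The Higgs file stores geographic units in rows, so it is flipped here
    # to match the layout of the other datasets.
    return data.T if transposed else data


# Synthetic stand-in for a file: 2 time units, 3 geographic units.
demo = io.StringIO("1 0 2\n3 1 0\n")
series = load_timeseries(demo)
print(series.shape)

# With the real files one would call, for example:
# series = load_timeseries("batmanstates-timeseries.dat")
# higgs = load_timeseries("higgscountry-timeseries.dat", transposed=True)
```

`source` may be a file name or an open file-like object, since `np.loadtxt` accepts both.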

1.    US datasets [Batman, Google]
1.1. The finer partition is at the level of states (50 states + DC + Puerto Rico); which implies that the file has 52 columns. They are sorted according to "usstates_order.txt".
1.2. The alternative partition is at the level of "divisions" (see US Census Bureau-designated regions and divisions: https://en.wikipedia.org/wiki/List_of_regions_of_the_United_States#Census_Bureau-designated_regions_and_divisions); which implies that the files have 9 columns. They are sorted according to the labelling in Fig S10 of the paper.

Relevant files:

  • batmandivisions-timeseries.dat
  • batmanstates-timeseries.dat
  • googledivisions-timeseries.dat
  • googlestates-timeseries.dat
  • usstates_order.txt
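
Because the columns carry no names, the order file is what identifies each one. The sketch below pairs columns with labels; it assumes that an order file such as "usstates_order.txt" lists one label per line, in column order, which is a guess about its format. The helper names `read_order` and `label_columns` are hypothetical, and the labels in the demo are placeholders rather than the real file contents.

```python
import numpy as np


def read_order(lines):
    """Return the labels found in an iterable of text lines.

    ASSUMPTION: one label per line; blank lines are ignored.
    """
    return [line.strip() for line in lines if line.strip()]


def label_columns(series, labels):
    """Map each label to its column of a (time units x geographic units) array."""
    series = np.asarray(series)
    if series.ndim != 2 or series.shape[1] != len(labels):
        raise ValueError(
            f"expected {len(labels)} columns, got array of shape {series.shape}"
        )
    return {label: series[:, i] for i, label in enumerate(labels)}


# Synthetic demo with placeholder labels (not the real file contents).
labels = read_order(["unit_a\n", "\n", "unit_b\n"])
columns = label_columns(np.array([[1, 2], [3, 4]]), labels)
print(columns["unit_b"])

# With the real files one would do, for example:
# with open("usstates_order.txt") as handle:
#     labels = read_order(handle)
```

The column-count check is a cheap way to catch a mismatched pair of files, for instance a states order file used with a divisions time series.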

2.   Brazil dataset
2.1. The finer partition is at the level of basins, which is roughly equivalent to metropolitan areas (supra-city level); which implies that the file has 97 columns. They are sorted according to file "brazilbasins_order.txt".
2.2. The alternative partition is at the level of "states"; which implies that the files have 27 columns. They are sorted according to the labelling in Fig S10 of the paper.

Relevant files:
  • brazilbasins-timeseries.dat
  • brazilbasins_order.txt
  • brazilstates-timeseries.dat

3.   Spain dataset
3.1. The finer partition is at the level of metropolitan areas; which implies that the file has 57 columns. 56 correspond to metropolitan areas; the extra one ("Resto") covers any place in Spain not included in the others. They are sorted according to "spainmetro_order.txt".
3.2. The alternative partition is at the level of "autonomous communities" (see Spain autonomous communities list: https://en.wikipedia.org/wiki/Ranked_lists_of_Spanish_autonomous_communities#By_population); which implies that the files have 16 columns (Extremadura, Ceuta and Melilla were left out because of lack of data from those areas). They are sorted according to the labelling in Fig S10 of the paper.

Relevant files:
  • spainmetro-timeseries.dat
  • spainmetro_order.txt
  • spaincommunities-timeseries.dat

4.  Higgs dataset:
*****************************************************
Note that, exceptionally, for this dataset rows correspond to countries and columns correspond to time.
*****************************************************
In the paper we report results for a single partition at the level of countries; which implies that the file has 61 rows (only the 61 most active countries in the event were considered). They are sorted according to "higgscountries_order.txt".

Relevant files:
  • higgscountry-timeseries.dat
  • higgscountries_order.txt
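
Since the Higgs file is transposed relative to the others, it may be convenient to flip it once after loading so that the same analysis code applies to every dataset. The following is a sketch only: the function name `higgs_to_time_by_country` is hypothetical, and the check on 61 rows simply encodes the row count given above.

```python
import numpy as np

N_HIGGS_COUNTRIES = 61  # number of rows stated in the dataset description


def higgs_to_time_by_country(raw):
    """Flip a (countries x time) array into (time x countries)."""
    raw = np.asarray(raw)
    if raw.ndim != 2 or raw.shape[0] != N_HIGGS_COUNTRIES:
        raise ValueError(
            f"expected {N_HIGGS_COUNTRIES} rows (one per country), got shape {raw.shape}"
        )
    return raw.T


# Synthetic stand-in: 61 countries observed over 10 time units.
fake_higgs = np.zeros((N_HIGGS_COUNTRIES, 10))
print(higgs_to_time_by_country(fake_higgs).shape)
```

After the flip, column i corresponds to line i of "higgscountries_order.txt", assuming that file lists one country per line.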

5.   Unfiltered Twitter dataset: In the paper we report results for a single partition at the level of countries; which implies that the file has 61 columns (only the 61 most active countries in the event were considered). They are sorted according to "unfilteredcountries_order.txt".

Relevant files:
  • unfilteredcountry-timeseries.dat
  • unfilteredcountries_order.txt
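
To compare the per-second Unfiltered Stream series with the per-minute datasets, the rows can be summed in blocks of 60. This is an illustrative sketch, not the procedure used in the paper. The function name `seconds_to_minutes` is hypothetical, the first row is assumed to fall on a minute boundary, and dropping a trailing incomplete minute is a design choice made here.

```python
import numpy as np


def seconds_to_minutes(per_second):
    """Sum a (seconds x units) array into (minutes x units) message counts.

    ASSUMPTIONS: row 0 starts a minute; any trailing rows that do not fill
    a whole minute are discarded.
    """
    per_second = np.asarray(per_second)
    n_units = per_second.shape[1]
    n_minutes = per_second.shape[0] // 60
    trimmed = per_second[: n_minutes * 60]
    # Group every 60 consecutive rows and add them up per geographic unit.
    return trimmed.reshape(n_minutes, 60, n_units).sum(axis=1)


# Synthetic stand-in: 125 seconds of one message per second in 3 units.
fake_stream = np.ones((125, 3))
per_minute = seconds_to_minutes(fake_stream)
print(per_minute.shape)
```

Summing rather than averaging keeps the values as message counts, which matches how the per-minute files are described.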


Source (citation)
J. Borge-Holthoefer, N. Perra, B. Gonçalves, S. González-Bailón, A. Arenas, Y. Moreno and A. Vespignani - Science Advances, Vol. 2, no. 4, e1501158, DOI: 10.1126/sciadv.1501158 (2016)