Notebook for 'Computing normalized knowledge diversity'

Adrien Bonnardel, 12/2023

This notebook presents the work on computing knowledge diversity.

It extends the work on 'Measuring and controlling knowledge diversity' [Bourahla 2022], which can be found here. Some cells of the original notebook [Notebook] are reproduced to display the data.

If you are reading this notebook as a simple HTML page, it has been generated from the notebook found in this archive.

This is not a maintained software package (there is no git repository), but all the code is available as a Python file (kdiv.py).

It is provided under the MIT License.

Diversity measures

The diversity measure implements entropy-based diversity, following Tom Leinster's work [Leinster 2021] (parameters: a distribution, a similarity and $q$; result: a float).

This $q$-parametrised measure computes a diversity index from a distribution ($p$) of a population into categories for which a similarity measure ($Z$) exists.
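For reference, here is a minimal Python sketch of such a measure; it follows the definition of similarity-sensitive diversity in [Leinster 2021] and is not the kdiv implementation itself, whose exact interface may differ.

    import numpy as np

    def diversity(p, Z, q):
        """Similarity-sensitive diversity of order q (sketch after [Leinster 2021]).

        p: relative abundances of the categories (non-negative, summing to 1)
        Z: square similarity matrix (Z[i][j] in [0, 1], Z[i][i] = 1)
        q: sensitivity parameter (q >= 0)
        """
        p = np.asarray(p, dtype=float)
        Zp = np.asarray(Z, dtype=float) @ p
        s = p > 0                                  # restrict to the categories actually present
        if q == 1:                                 # limit case q = 1
            return float(np.prod(Zp[s] ** (-p[s])))
        if np.isinf(q):                            # limit case q = infinity
            return float(1.0 / np.max(Zp[s]))
        return float(np.sum(p[s] * Zp[s] ** (q - 1)) ** (1.0 / (1.0 - q)))

With the identity similarity, this reduces to the usual Hill numbers (for instance, the exponential of Shannon entropy at $q=1$).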

We recall below the distributions and similarities used in [Notebook].

Ontology distributions

Here are the 7 distributions of the paper (a,b,c,d,e,f,g) of 5 ontologies (A, B, C, D, E) among 10 agents. They are encoded as arrays.

We provide three extra distributions (h, i, j) for the sake of experimentation.
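As a sketch of this encoding (the variable name is illustrative, not that of the notebook), the ten distributions can be written as lists of agent counts per ontology, in the order A, B, C, D, E; the values are those of the results table further down.

    # Number of agents holding each ontology, in the order A, B, C, D, E (10 agents in total).
    distributions = {
        'a': [0, 0, 10, 0, 0],
        'b': [0, 5, 0, 5, 0],
        'c': [0, 2, 6, 2, 0],
        'd': [1, 1, 6, 1, 1],
        'e': [5, 0, 0, 0, 5],
        'f': [1, 2, 4, 2, 1],
        'g': [2, 2, 2, 2, 2],
        'h': [3, 0, 4, 0, 3],
        'i': [4, 1, 0, 1, 4],
        'j': [6, 3, 1, 0, 0],
    }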

Distances

The distances between the 5 ontologies are coded into arrays.

Hence, there is no programmatic connection between knowledge distances and the diversity measure.

Such measures may be found in:

Hereafter, these distances $d$ are turned into a similarity $Z=e^{-d}$. kdiv.distsim(distance) computes this transformation.
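In numpy terms, this transformation is an element-wise exponential; the sketch below is an assumption about how kdiv.distsim could be written, not a copy of it.

    import numpy as np

    def distsim(distance):
        """Turn a distance matrix d into a similarity matrix Z = exp(-d), element-wise."""
        return np.exp(-np.asarray(distance, dtype=float))

A distance of 0 thus gives a similarity of 1, so the diagonal of $Z$ is made of 1s.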

unstructdist
  A B C D E
A 0 1 1 1 1
B 1 0 1 1 1
C 1 1 0 1 1
D 1 1 1 0 1
E 1 1 1 1 0
   
linearstructdist
  A B C D E
A 0 1 2 3 4
B 1 0 1 2 3
C 2 1 0 1 2
D 3 2 1 0 1
E 4 3 2 1 0

graphsemdist
  A B C D E
A 0.000000 0.333333 0.666667 1.000000 1.000000
B 0.333333 0.000000 0.333333 1.000000 1.000000
C 0.666667 0.333333 0.000000 0.333333 0.666667
D 1.000000 1.000000 0.333333 0.000000 0.333333
E 1.000000 1.000000 0.666667 0.333333 0.000000
   
namesemdist
  A B C D E
A 0.000000 0.428571 0.714286 0.571429 0.285714
B 0.428571 0.000000 0.500000 0.500000 0.666667
C 0.714286 0.500000 0.000000 0.500000 0.666667
D 0.571429 0.500000 0.500000 0.000000 0.333333
E 0.285714 0.666667 0.666667 0.333333 0.000000

Finding $Dmax$ and normalising measures

In order to normalise diversity it is necessary to find what the maximal diversity $Dmax$ is.

It has been shown that $Dmax$ and the distributions associated with it are independent of $q$ [Leinster 2021]: they only depend on $Z$.

The process for finding $Dmax$ involves, in principle:

  • for each submatrix $Z_B$ of $Z$ generated by a subset $B$ of the categories,
  • computing its weighting, i.e. a vector $w_B$ of weights such that $Z_B\cdot w_B=1$ ($1$ being the vector containing only $1$ values),
  • selecting $B$ such that the magnitude $Dmax=|Z_B|=\sum_{w\in w_B} w$ of $Z_B$ is maximum.
  • The optimal distribution can then be reconstructed as $p_B=\frac{w_B}{|Z_B|}$.

This is implemented by the function kdiv.find_max_dist, taking the similarity $Z$ as argument and returning a tuple with $Dmax$, $B$, $w_B$, $Z_B$ and $p_B$, for $B$ the basis of the maximal distribution.
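A straightforward, if exponential, sketch of this procedure is given below. It is an illustrative re-implementation of what kdiv.find_max_dist is described to return, not the kdiv code itself; it additionally skips subsets whose weighting has negative components, since those cannot yield a distribution.

    import numpy as np
    from itertools import combinations

    def find_max_dist(Z):
        """Brute-force search for Dmax (sketch).

        Returns (Dmax, B, w_B, Z_B, p_B), with B the subset of category indices
        supporting the maximising distribution p_B.
        """
        Z = np.asarray(Z, dtype=float)
        n = Z.shape[0]
        best = None
        for size in range(1, n + 1):
            for B in combinations(range(n), size):
                Z_B = Z[np.ix_(B, B)]
                try:
                    w_B = np.linalg.solve(Z_B, np.ones(size))   # weighting: Z_B . w_B = 1
                except np.linalg.LinAlgError:
                    continue                    # singular submatrix: no weighting
                if np.any(w_B < 0):
                    continue                    # negative weights cannot form a distribution
                magnitude = w_B.sum()           # |Z_B|
                if best is None or magnitude > best[0]:
                    best = (magnitude, B, w_B, Z_B, w_B / magnitude)
        return best

This enumerates all $2^n-1$ non-empty subsets of categories, which is only practical for small $n$, such as the 5 ontologies considered here.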

We display below, for each distance, side by side: the plot of the non-normalised and normalised measures varying with $q$, and the optimal distribution $p_B$.

With unstructured distance

With linearly structured distance

With graph-based semantic distance

With named-class-based semantic distance

As can be seen:

  • the maximal distribution is not necessarily the equidistribution within categories, and
  • it can be a distribution over a subset of the set of categories (as for the graph-based semantic distance).

Results

The results are also displayed as a table featuring, for each distribution, its diversity (entr) and normalised diversity (nentr) under each similarity.

They extend those of Table 2 in the initial paper.
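As an illustration, one entry of the table can be recomputed directly, assuming (as the values suggest) that the table is computed at $q=1$: the normalised diversity is simply the diversity divided by $Dmax$. The self-contained snippet below, which is a check rather than the code producing the table, recomputes distribution g under the unstructured similarity.

    import numpy as np

    # Unstructured distance (0 on the diagonal, 1 elsewhere) and its similarity Z = exp(-d).
    d = np.ones((5, 5)) - np.eye(5)
    Z = np.exp(-d)

    p_g = np.full(5, 0.2)                          # distribution g: 2 agents per ontology
    entr = 1.0 / np.prod((Z @ p_g) ** p_g)         # diversity of order q = 1
    dmax = np.linalg.solve(Z, np.ones(5)).sum()    # this Z is positive definite, so Dmax = |Z|
    print(round(entr, 2), round(entr / dmax, 2))   # approximately 2.02 and 1.00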

                   a      b      c      d      e      f      g      h      i      j
categ    A      0.00   0.00   0.00   1.00   5.00   1.00   2.00   3.00   4.00   6.00
         B      0.00   5.00   2.00   1.00   0.00   2.00   2.00   0.00   1.00   3.00
         C     10.00   0.00   6.00   6.00   0.00   4.00   2.00   4.00   0.00   1.00
         D      0.00   5.00   2.00   1.00   0.00   2.00   2.00   0.00   1.00   0.00
         E      0.00   0.00   0.00   1.00   5.00   1.00   2.00   3.00   4.00   0.00
nostruct entr   1.00   1.46   1.55   1.61   1.46   1.88   2.02   1.72   1.72   1.52
         nentr  0.49   0.72   0.77   0.80   0.72   0.93   1.00   0.85   0.85   0.75
linear   entr   1.00   1.76   1.59   1.85   1.96   2.25   2.78   2.45   2.41   1.59
         nentr  0.35   0.62   0.56   0.65   0.69   0.79   0.98   0.86   0.85   0.56
graphsem entr   1.00   1.46   1.23   1.33   1.46   1.44   1.59   1.53   1.57   1.22
         nentr  0.61   0.90   0.75   0.81   0.90   0.88   0.97   0.94   0.96   0.75
namesem  entr   1.00   1.24   1.28   1.35   1.14   1.44   1.47   1.40   1.27   1.27
         nentr  0.68   0.84   0.87   0.91   0.77   0.97   0.99   0.95   0.86   0.86

Optimisation attempts

The previous algorithm is costly, in $O(n^3)$ at least: we have to compute the weighting of all the submatrices.

By [Leinster 2021, Proposition 10.3], if the matrix is positive semi-definite, then $Dmax = |Z|$. To determine whether a matrix is positive semi-definite, we can either check that all its eigenvalues are $\geq 0$, or attempt a Cholesky decomposition, which raises an error when the matrix is not positive definite. But in np.linalg, both routines (eigvals and cholesky) have the same time complexity, $O(n^3)$, which is the same as our current algorithm.
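A possible numpy sketch of this check is given below; the tolerance and the choice of np.linalg.eigvalsh rather than a Cholesky attempt are choices made for this illustration, not part of kdiv.

    import numpy as np

    def dmax_if_psd(Z, tol=1e-12):
        """Return Dmax = |Z| when Z is positive semi-definite (sketch), None otherwise."""
        Z = np.asarray(Z, dtype=float)
        # Check positive semi-definiteness through the eigenvalues of the symmetric matrix.
        # The alternative, np.linalg.cholesky(Z), raises LinAlgError when Z is not
        # positive definite.  Both checks are O(n^3).
        if np.min(np.linalg.eigvalsh(Z)) < -tol:
            return None
        w = np.linalg.solve(Z, np.ones(len(Z)))    # weighting of the whole matrix: Z . w = 1
        return float(w.sum())                      # its magnitude |Z|, i.e. Dmax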

If the similarity matrix has no 0 on the diagonal, which is expected from similarity matrices, it is possible to use the Gauss-Seidel method to solve the weighting system with a lower time complexity, which seems to be around $O(n^{2.376})$.

We have used the implementation of the method provided here.
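For reference, a minimal Gauss-Seidel sketch solving the weighting system $Z\cdot w=1$ is given below; it is written from the textbook definition of the method and is not the implementation linked above.

    import numpy as np

    def gauss_seidel_weighting(Z, iterations=50, tol=1e-6):
        """Iteratively solve Z . w = 1 by the Gauss-Seidel method (sketch).

        Requires non-zero diagonal entries, which holds for similarity matrices
        since every category is fully similar to itself.
        """
        Z = np.asarray(Z, dtype=float)
        n = Z.shape[0]
        w = np.zeros(n)
        for _ in range(iterations):
            previous = w.copy()
            for i in range(n):
                # Update w[i] in place, reusing the components already updated in this sweep.
                residual = 1.0 - Z[i, :i] @ w[:i] - Z[i, i + 1:] @ w[i + 1:]
                w[i] = residual / Z[i, i]
            if np.max(np.abs(w - previous)) < tol:
                break
        return w

Convergence of Gauss-Seidel is only guaranteed for particular matrices (for instance, strictly diagonally dominant or symmetric positive definite ones), but it behaves well on the small similarity matrices used here.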

For 5 categories (here the ontologies), the example shows that almost the same weight values can be found after around 45-50 steps, while the other algorithm, provided by Remark 7.2, takes 125.

Comparison

The code below computes the execution time of the three methods for finding $Dmax$ and the maximising distribution.
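That code is not reproduced in this page; the sketch below shows how such timings could be obtained with time.perf_counter, assuming the find_max_dist and gauss_seidel_weighting sketches above are in scope (the actual notebook code may differ).

    import time
    import numpy as np

    def average_time(method, Z, repeats=100):
        """Average execution time of method(Z), in seconds (sketch)."""
        start = time.perf_counter()
        for _ in range(repeats):
            method(Z)
        return (time.perf_counter() - start) / repeats

    # Hypothetical comparison on the unstructured similarity, assuming the sketches above
    # (find_max_dist and gauss_seidel_weighting) are defined in the same session.
    Z = np.exp(-(np.ones((5, 5)) - np.eye(5)))
    for name, method in [("standard", find_max_dist),
                         ("Gauss-Seidel", gauss_seidel_weighting),
                         ("whole matrix", lambda M: np.linalg.solve(M, np.ones(len(M))))]:
        print(name, format(average_time(method, Z), ".2e"))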

We can see that the Gauss-Seidel method is more efficient than the standard one on these $5\times 5$ matrices.

                  Time (standard)  Distribution (standard)         Time (Gauss-Seidel)  Distribution (Gauss-Seidel)     Time (Whole matrix)  Distribution (Whole matrix)
unstructdist      3.17e-03         [0.2, 0.2, 0.2, 0.2, 0.2]       1.31e-03             [0.2, 0.2, 0.2, 0.2, 0.2]       9.39e-05             [0.2, 0.2, 0.2, 0.2, 0.2]
linearstructdist  3.15e-03         [0.26, 0.16, 0.16, 0.16, 0.26]  1.57e-03             [0.26, 0.16, 0.16, 0.16, 0.26]  1.03e-04             [0.26, 0.16, 0.16, 0.16, 0.26]
graphsemdist      2.83e-03         [0.25, 0.25, 0.0, 0.25, 0.25]   1.39e-03             [0.25, 0.25, 0.0, 0.25, 0.25]   1.39e-04             [0.25, 0.25, -0.01, 0.25, 0.25]
namesemdist       2.81e-03         [0.19, 0.2, 0.28, 0.15, 0.18]   1.40e-03             [0.19, 0.2, 0.28, 0.15, 0.18]   7.77e-05             [0.19, 0.2, 0.28, 0.15, 0.18]

Amazingly, the computation on the whole $Z$ (last two columns) provides the same results as the others, even for the graph-based semantic distance (the '-0.01' is also produced by the Gauss-Seidel method but is hidden by the computation).


[Bourahla 2022] Yasser Bourahla, Jérôme David, Jérôme Euzenat, Meryem Naciri, Measuring and controlling knowledge diversity, Proc. 1st JOWO workshop on formal models of knowledge diversity (FMKD), Jönköping (SE), 2022. https://moex.inria.fr/files/papers/bourahla2022c.pdf

[Leinster 2021] Tom Leinster, Entropy and diversity: the axiomatic approach, Cambridge University Press, Cambridge (UK), 2021. https://arxiv.org/pdf/2012.02113.pdf

[Notebook] Knowledge diversity notebook: https://moex.inria.fr/software/kdiv/index.html