.. _projection_clusters:

Projection Clusters
===================

The hierarchy of the h-NNE is inspired by the FINCH clustering (https://github.com/ssarfraz/FINCH-Clustering).
This means that during the generation of the hierarchy a collection of partitions of the dataset is generated.
This set of partitions can be useful when one is dealing with unlabeled data. Using different levels of the hierarchy,
one can identify clusters which align with the h-NNE projection structure. We already presented the functionality in
the demo 1 notebook. Here we look at a more realistic scenario with a larger dataset containing no labels.
The dataset we use is a list of 3 million word embeddings of dimension 300 based on the Google news dataset.

WARNING: To run this code be sure to use a server with at least 64GB of RAM.

NOTE: Here we are using h-NNE v1, as v2 visually separates the FINCH clusters of some of the top levels.

To install the required libraries with pip run:

.. code-block:: bash

    pip install gensim
    pip install matplotlib
    pip install hnne

Next, we will need the dataset. It can be downloaded from this website: https://code.google.com/archive/p/word2vec.
The file needed is 'GoogleNews-vectors-negative300.bin.gz', there is a link to it under the section
'Pre-trained word and phrase vectors'. Once the data is downloaded, one only needs to provide the correct path to
'GoogleNews-vectors-negative300.bin.gz' to the 'load_google_news' function.

Once you have the data, we can get started. First import the libraries:

.. code-block:: python

    from gensim import models
    import numpy as np
    import matplotlib.pyplot as plt

    from hnne import HNNE

Right after, load the data:

.. code-block:: python

    def load_google_news(data_path):
        return models.KeyedVectors.load_word2vec_format(
            data_path,
            binary=True
        ).vectors

    data_path = './GoogleNews-vectors-negative300.bin.gz'
    data = load_google_news(data_path)

Project the data with h-NNE:

.. code-block:: python

    hnne = HNNE(hnne_version="v1")
    projection = hnne.fit_transform(data, verbose=True)

Along with the projected point, the partitions of all levels of h-NNE are available via the
`.hierarchy_parameters.partitions` attribute. Below we display the projected data labeled based on some of the top
level partitions:

.. code-block:: python

    partitions = hnne.hierarchy_parameters.partitions
    partition_sizes = hnne.hierarchy_parameters.partition_sizes
    number_of_levels = partitions.shape[1]

    _, ax = plt.subplots(1, 4, figsize=(10*(4 + 1), 10))

    ax[0].set_title('Unlabelled data')
    ax[0].scatter(*projection.T, s=1)

    for i in range(1, 4):
        partition_idx = number_of_levels - i
        ax[i].set_title(f'Partition of level {i}: {partition_sizes[partition_idx]} clusters')
        ax[i].scatter(*projection.T, s=1, c=partitions[:, partition_idx], cmap='Spectral')
    plt.show()

This clustering can be used to improve visualization or even preprocess the data. No matter which dimension you
project to, the same partitions will be used by h-NNE.

.. image:: projection_clusters.png

An extended version of this example can be found at `this notebook`__.

.. __: https://github.com/koulakis/h-nne/blob/main/notebooks/demo3_clustering_for_free.ipynb