Ivo Verhoeven

Sampling Maximally Diverse Subsets


Applied machine learning often requires reasoning about points in high-dimensional spaces, where distances encode some form of information. A few times now I’ve needed to summarize such semantically rich spaces using a very small subset of samples. In such cases, you ideally want samples that come from diverse, non-overlapping regions of the space, maximizing the information content per sample.

Symbolically, we can express this as an optimization problem of the following form:

$$ \begin{align} \underset{\mathcal{X}}{\operatorname{arg max}}\quad&\min\quad\{d(x, x^\prime)|x, x^{\prime}\in\mathcal{X}, x\not= x^\prime\}\\ &\text{s.t.}\quad|\mathcal{X}|=k \end{align} $$

Put otherwise, we want to find a subset $\mathcal{X}$ of size $k$ in which nearest neighbours are as far apart as possible, taking us from the left image below to the right.

A visual depiction of maximally diverse subsampling of a uniformly distributed square.

While the chosen samples in the right image come from the left image, the points are nicely spread out. The sample does not exhibit clumpiness, and each region of the space is fairly represented.
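To make the objective concrete, here is a minimal sketch of the quantity being maximized, assuming Euclidean distance and a subset stored as a `(k, dim)` NumPy array:

```python
import numpy as np
from scipy.spatial.distance import pdist


def maximin_objective(subset: np.ndarray) -> float:
    # Smallest pairwise distance within the subset;
    # a maximally diverse subset makes this as large as possible
    return float(pdist(subset).min())
```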

As it happens, such problems crop up all over the place. Unfortunately, similar problems (e.g., the maximally diverse grouping problem, bin packing) are understood to be NP-hard. Given that our dataset is going to comprise thousands or millions of datapoints, brute-force search is probably out of the question.

Then again, exact solutions are overrated. Can we find an approximation that is good enough, as quickly as possible?

Space-Filling Designs

One area where we want maximally diverse subsets is experimental design in computer science. For example, when running a hyperparameter optimization search, you’d like to avoid running costly experiments with too-similar hyperparameters. Instead, you initially want to explore the space of possible values as much as possible before homing in on an optimum.

Experimental designs that maximize the diversity in responses like this are often called ‘space-filling designs’. The Wootton, Sergent, Phan-Tan-Luu (WSP) algorithm [1]1 prescribes a relatively simple method for deriving such a space-filling design from a candidate set of points:

1. Generate a set of N points
2. Compute the pairwise distance matrix between all N points
3. Choose a seed point and a minimal distance `min_dist`
4. Remove all points whose distance to the seed point is smaller than `min_dist`
5. Choose the point closest to the seed point whose distance is greater than `min_dist`
6. Make the chosen point the new seed point, and remove the original
7. Iterate steps 4-6 until no more points can be chosen

We can easily implement something like this in Python. Assuming we have access to the pairwise distance matrix, and have an initial guess for `min_dist`, the code should look something like this2:

```python
import jaxtyping as jtyping
import numpy as np


def wsp_space_filling_design(
    min_dist: float,
    seed_point: int,
    dist_matrix: jtyping.Float[np.ndarray, " num_points num_points"],
) -> jtyping.Int[np.ndarray, " num_chosen_points"]:
    # A point should never be able to choose itself, so set diagonals to nan
    dist_matrix_ = np.copy(dist_matrix)
    np.fill_diagonal(dist_matrix_, np.nan)

    # Add the seed point to the list of chosen points
    chosen_points = [seed_point]

    cur_point = seed_point

    # Start the iterations
    while True:
        # Find all points within a circle of radius min_dist around the current point
        points_within_circle = dist_matrix_[cur_point, :] < min_dist

        # Eliminate those points from ever being chosen
        dist_matrix_[points_within_circle, :] = np.nan
        dist_matrix_[:, points_within_circle] = np.nan

        # If no points are able to be chosen, stop
        if np.all(np.isnan(dist_matrix_[cur_point, :])):
            break

        # Find the nearest neighbour that is not within that circle
        # Choose it as the next point
        nearest_outside_point = np.nanargmin(dist_matrix_[cur_point, :])

        chosen_points.append(nearest_outside_point)

        # Make sure the current point can no longer be chosen
        dist_matrix_[cur_point, :] = np.nan
        dist_matrix_[:, cur_point] = np.nan

        cur_point = nearest_outside_point

    return np.asarray(chosen_points)
```

We can visualize this nicely below. You can identify each seed point as the only point without an incoming arrow. The arrows indicate the point chosen at each iteration, being the closest point at least `min_dist` away. The plotted circles have radius `0.5 * min_dist`, giving nicely ‘packed’ solutions. Two touching circles contain points that are exactly `min_dist` apart.

Some WSP space-filling designs

At the moment, the starting position has a significant effect on the overall structure of the different solutions, but this effect decreases as the minimum distance becomes smaller. To simplify things, from now on I only use the point nearest to the centroid as the initial seed point (right-most figure).

Binary Searching a Solution

While a WSP space-filling design gives us a subset that ‘feels’ right, it assumes access to the very quantity we’re trying to solve for: the maximin distance. Ideally, we’d avoid a brute-force search over the entire range of possible distance values.

Since we do have access to a set of candidate points, though, we can do this efficiently by running a binary search over the sorted list of unique pairwise distances, and then selecting the solution that best approximates our desired batch size $k$.

Specifically, for a sample of $N$ points, there are $\binom{N}{2}=\frac{N!}{2!(N-2)!}$ possible pairwise combinations, meaning with binary search, the solution can be found in

$$ \begin{align} \mathcal{O}\left(\log_{2}\frac{N!}{2!(N-2)!}\right)&=\mathcal{O}\left(\log_{2}\frac{N!}{(N-2)!}-1\right) \end{align} $$

time. Since $\frac{N!}{(N-2)!}=N(N-1)$, this amounts to only $\mathcal{O}(\log_2 N)$ WSP runs; the dominant one-off cost is sorting the $\binom{N}{2}$ unique distances, at $\mathcal{O}(N^2\log_2 N)$.
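For a concrete sense of scale, a quick back-of-the-envelope check (the candidate count here is a hypothetical example):

```python
import math

n_points = 1_000
n_dists = math.comb(n_points, 2)           # 499_500 unique pairwise distances
max_iters = math.ceil(math.log2(n_dists))  # at most 19 binary-search steps
```

The extended algorithm looks something like this: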

1. Gather a set of N candidate points
2. Compute the pairwise distance matrix between all candidates
3. Find all unique distances, and store them as a sorted list
4. Binary search the list for a candidate distance
5. For each candidate distance, generate a WSP space-filling design
6. Stop the search once you find a WSP space-filling design that has $k$ points

Pythonically, this looks something like:

```python
import numpy as np
import sklearn.metrics


def binary_search_min_dist(points, desired_batch_size, verbose: bool = True):
    # Pairwise distances
    dist_matrix = sklearn.metrics.pairwise_distances(points)

    # Find all the unique distances, dropping the zero self-distance
    unique_dists = np.unique(np.ravel(dist_matrix))[1:]

    # Define the centroid point: the candidate closest to the data mean
    # This is where we start our search
    centroid = np.argmin(
        np.sum(np.abs(points - np.mean(points, axis=0)), axis=1), axis=0
    )

    # Initiate binary search parameters
    left = 0
    right = unique_dists.shape[0] - 1

    # Binary search through all possible distances
    i = 0
    while left <= right:
        # Binary search select current distance
        middle = left + (right - left) // 2

        # Get the minimum distance
        cur_min_d = unique_dists[middle]

        chosen_points = wsp_space_filling_design(
            min_dist=cur_min_d,
            seed_point=centroid,
            dist_matrix=dist_matrix,
        )

        if verbose:
            print(f"{i:>2d} | size={len(chosen_points):d}, d={cur_min_d:.6f}")

        # If the selected subset is too small, decrease the distance
        # Also covers the infeasible (length 1, seed-only) solution case
        if desired_batch_size > len(chosen_points):
            right = middle - 1

        # If large enough, try a larger distance
        else:
            left = middle + 1

        i += 1

    # Return the most recently evaluated design and its distance
    return chosen_points, cur_min_d
```

This leaves some room for customization. I don’t actually use an early-stopping criterion, as each iteration is fast enough that I can let the binary search run to completion. As established, the seed point impacts the final WSP space-filling design quite a bit. Here I only explore one solution per distance, but should your search budget allow for it, trying a series of random seed points might lead to more accurate batch size estimates for each distance.
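Putting the pieces together, here is a hypothetical end-to-end run on a uniformly distributed unit square, matching the setup from the figures (the RNG seed and sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(size=(1_000, 2))  # uniformly distributed unit square

# Indices of the chosen subset, and the minimum distance the search settled on
chosen_idx, min_dist = binary_search_min_dist(points, desired_batch_size=16)
subset = points[chosen_idx]
```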

The following figure displays solutions for batch sizes of 8, 16, and 32, all starting from the mean centroid. While the solutions are all quite different, they exhibit a similar pattern: initially the path snakes around the seed point, while towards the end, larger and larger leaps are made to fit in some final points at the edges of the covered space.

Some found solutions

Next Steps & Limitations

While we’ve arrived at a relatively cheap method for generating subsets with reasonably spread-out points, this approach does have some caveats. So far I’ve stuck with Euclidean distance, which gives rise to a nice geometric intuition (hence the figures); different distance metrics probably lose this. I’d imagine that cosine or angular distances might work, which would effectively choose points based on their projection onto the unit hypersphere. In a later paper [2], the same authors apply the WSP space-filling design algorithm to variable reduction, using the correlation matrix as a pairwise distance matrix.
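Since `wsp_space_filling_design` only consumes a precomputed distance matrix, swapping the metric is a one-line change. A sketch using scikit-learn’s built-in cosine distance (the `min_dist` value here is an arbitrary assumption, and `binary_search_min_dist` would additionally need its internal Euclidean distance call parameterized):

```python
import sklearn.metrics

# Cosine distance matrix; everything downstream stays unchanged
cos_dist_matrix = sklearn.metrics.pairwise_distances(points, metric="cosine")

chosen_idx = wsp_space_filling_design(
    min_dist=0.1,  # arbitrary illustrative value
    seed_point=0,
    dist_matrix=cos_dist_matrix,
)
```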

So far we’ve been looking at points in a low-dimensional space (2D), whereas most applications have hundreds or thousands of dimensions. The cited paper [1] tests WSP space-filling designs against some alternatives in higher dimensions (up to 50D) and finds that it compares favourably, but it never compares against the global optimum. Thus, truly high-dimensional spaces remain untested.

Finally, so far I’ve only looked at uniformly distributed points. This is probably the least realistic assumption, as most embedding spaces necessarily introduce some form of clustering or manifold structure. The current approach is constrained to the set of observed points, so no subset will include points in ’no man’s space’. In a sense, this ensures the subset respects the manifold.

However, in high-density regions it might be beneficial to use smaller distances locally than globally, i.e. to focus on details. I’d assume this could be implemented using a distance better suited to non-Euclidean spaces, for example one derived from the RBF kernel.
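As a hedged sketch of that idea: for any kernel with $k(x, x)=1$, such as the RBF kernel, $d(x, x^\prime)=\sqrt{2\left(1-k(x, x^\prime)\right)}$ is a valid distance. It behaves like a scaled Euclidean distance up close but saturates for faraway points, so relative to the global scale it emphasizes local detail. The `gamma` bandwidth below is an arbitrary assumption:

```python
import numpy as np
import sklearn.metrics

# RBF similarities in (0, 1]; gamma is an assumed bandwidth
kernel = sklearn.metrics.pairwise.rbf_kernel(points, gamma=5.0)

# Kernel-induced distance: ~Euclidean up close, saturating far away
rbf_dist_matrix = np.sqrt(np.clip(2.0 * (1.0 - kernel), 0.0, None))
```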

Another approach might be to explicitly compute a density estimate, sampling points inversely proportional to their density [3].
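As a rough illustration of that direction (this is a sketch, not the method of [3]; the bandwidth is an assumed hyperparameter), one could weight candidates by their inverse estimated density:

```python
import numpy as np
from sklearn.neighbors import KernelDensity


def inverse_density_sample(points, k, bandwidth=0.1, seed=0):
    # Kernel density estimate evaluated at every candidate point
    log_density = KernelDensity(bandwidth=bandwidth).fit(points).score_samples(points)

    # Sampling weights inversely proportional to the estimated density
    weights = np.exp(-log_density)
    weights /= weights.sum()

    rng = np.random.default_rng(seed)
    return rng.choice(len(points), size=k, replace=False, p=weights)
```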

I’ve put a copy of the notebook used to generate these figures in a public GitHub gist.

References

  1. J. Santiago, M. Claeys-Bruno, and M. Sergent, Construction of space-filling designs using WSP algorithm for high dimensional spaces, Chemometrics and Intelligent Laboratory Systems, vol. 113, pp. 26–31, 2012. doi:10.1016/j.chemolab.2011.06.003
  2. D. Ballabio, V. Consonni, A. Mauri, M. Claeys-Bruno, M. Sergent, and R. Todeschini, A novel variable reduction method adapted from space-filling designs, Chemometrics and Intelligent Laboratory Systems, vol. 136, pp. 147–154, 2014. doi:10.1016/j.chemolab.2014.05.010
  3. B. Shang, D. Apley, and S. Mehrotra, Diversity Subsampling: Custom Subsamples from Large Data Sets, INFORMS Journal on Data Science, vol. 2, no. 2, pp. 161–182, 2023. doi:10.1287/ijds.2022.00017

  1. this was the closest citation I could find. The origins of the algorithm seem to be spread across a series of different, independent papers, which is not too strange given its simplicity ↩︎

  2. `jtyping`, the import I use to annotate arrays, is shorthand for the amazing jaxtyping library. The general syntax is `dtype[type, "dimensions"]`. I find it massively increases the readability of numpy/pytorch/jax code. Consider checking it out ↩︎

#machine learning #code