How to add a column with the index of the nearest row?

nimrah · March 2, 2023, 7:37pm

I am trying to figure out how to add a new column to a Pandas DataFrame that contains the index of the nearest row for each entry. Specifically, I am looking for a way to find the index location of the closest value in a specified column, and then use that information to create a new column that contains the index of the row with the closest value. This technique could be useful in adding contextual information to our data and uncovering relationships between rows.

I have tried to do this, but I have not been successful in achieving the desired outcome. Therefore, I am seeking help to guide me in the right direction. If you have any suggestions or recommendations for how I could add a column with the index of the nearest row, please share your insights with me. I would greatly appreciate your help in solving this problem.

safa · April 19, 2023, 5:44pm

@nimrah, to create a new column in a pandas DataFrame that contains the row number of the nearest row by Euclidean distance, you can use SciPy's KDTree class. It finds nearest neighbors using a k-d tree data structure, which is efficient for large datasets.
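
Here is a minimal sketch of the KDTree approach. The DataFrame, the column names x and y, and the new column names nearest_row and nearest_dist are assumptions made up for illustration. It queries with k=2 because each row's closest match in its own tree is the row itself.

```python
import pandas as pd
from scipy.spatial import KDTree

# Hypothetical sample data: two numeric columns used as coordinates.
df = pd.DataFrame({
    "x": [0.0, 1.0, 5.0, 5.0, 1.5],
    "y": [0.0, 0.0, 5.0, 7.0, 0.5],
})

coords = df[["x", "y"]].to_numpy()
tree = KDTree(coords)

# k=2: column 0 is the row itself (distance 0, assuming no duplicate points),
# column 1 is the nearest *other* row.
distances, positions = tree.query(coords, k=2)

# query returns positional row numbers; map them to index labels.
df["nearest_row"] = df.index[positions[:, 1]].to_numpy()
df["nearest_dist"] = distances[:, 1]

print(df)
```

If the DataFrame has a non-default index, the df.index lookup is what turns the positions returned by query into actual index labels.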

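The k parameter in the query method determines how many nearest neighbors to return. Note that when you query the tree with the same points it was built from, the closest match for each row is the row itself (distance 0), so use k=2 and take the second result if you want the nearest other row.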

mubashir_rizvi · May 3, 2023, 5:32pm

You can use scikit-learn's NearestNeighbors class to calculate the nearest neighbors and add a new column to your DataFrame. The n_neighbors parameter in the constructor determines how many nearest neighbors to find; for this task, two nearest neighbors are used.
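
A minimal sketch of this approach follows. The sample DataFrame and its x and y columns are assumptions for illustration; only NearestNeighbors, n_neighbors, kneighbors, and the nearest_indices column name come from the description here.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical sample data: two numeric feature columns.
df = pd.DataFrame({
    "x": [0.0, 1.0, 5.0, 5.0, 1.5],
    "y": [0.0, 0.0, 5.0, 7.0, 0.5],
})

features = df[["x", "y"]].to_numpy()

# Fit on the data, then ask for the two nearest neighbors of every row.
nn = NearestNeighbors(n_neighbors=2).fit(features)
distances, indices = nn.kneighbors(features)

# Store both neighbor positions per row as a list. Because the rows are
# queried against themselves, the first entry is the row's own position
# (assuming no duplicate points) and the second is the nearest other row.
df["nearest_indices"] = pd.Series(indices.tolist(), index=df.index)

print(df)
```

To keep only the nearest other row, you could store indices[:, 1] instead of the full list.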

The kneighbors method of this class returns two arrays: distances and indices. The distances array contains the distances to each of the n_neighbors nearest neighbors for each row, while indices contains the row positions of the corresponding nearest neighbors in the original DataFrame.
The indices are then used to create a new column in the DataFrame named nearest_indices. I hope this method helps you!