How to split a dataset into equal-sized subsets based on an indexing variable and calculate the mean using NumPy?

Hey, I was assigned schoolwork that basically asks me to write code that splits a dataset into equal-sized subsets based on some indexing variable and then computes the mean of each subset. I had a hard time working out the logic for this. Can anyone please provide a solution?


Hello @safiaa.02, the solution below splits a random dataset of 100 entries into groups according to the unique values of a random index array and then calculates the mean of each group.

  • The code generates an array D of 100 random numbers between 0 and 1, which serves as our sample dataset. It also generates an array S of 100 random integers between 0 and 9 to be used as the indexing variable.
  • Next, np.unique() is used to find the unique values in S and their corresponding indices. An array of zeros, with one entry per unique value in S, is created to store the mean of the D values for each unique value.
  • The code then loops over each unique value in S and calculates the mean of the D values that belong to it. This is done with boolean indexing: the mask S == value selects the D entries at the positions where S equals the current unique value.
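
The steps above can be sketched roughly as follows. This is a minimal sketch rather than a definitive version: the names D and S come from the description, while rng, unique_vals, means and the fixed seed are my own assumptions for illustration. It only uses the unique values returned by np.unique(), not the indices.

```python
import numpy as np

# Seeded generator so the run is reproducible (the seed is an assumption).
rng = np.random.default_rng(0)

# D: sample dataset of 100 random floats in [0, 1).
D = rng.random(100)
# S: indexing variable, 100 random integers from 0 to 9 inclusive.
S = rng.integers(0, 10, size=100)

# Sorted unique values that actually occur in S.
unique_vals = np.unique(S)

# One slot per unique value to hold the group mean.
means = np.zeros(len(unique_vals))

for i, val in enumerate(unique_vals):
    # Boolean mask picks the D entries whose index value equals val.
    means[i] = D[S == val].mean()

for val, m in zip(unique_vals, means):
    print(f"group {val}: mean = {m:.4f}")
```

Note that with a random S the groups will generally not have exactly the same size. If you need strictly equal-sized subsets, the index array has to be constructed so that each value occurs the same number of times.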