Importing Every nth Row from a CSV File

Sometimes you only want to work with specific rows, so you need to import only those rows when reading a CSV file. This thread discusses different techniques for reading every nth row from a CSV file and creating a DataFrame from it. Before getting to the methods, note that there are two ways to supply the CSV file to them:

  1. Download the file and provide the path.
  2. Provide the URL at which the file is located.

The methods in this thread use the second way.

1. Using "read_csv()" and "iloc" indexer:

  • The pandas read_csv() function is used here to read the CSV file from a URL.
  • The iloc indexer is then used to select every nth row. The third argument of the slice passed to iloc is the step interval, which is what achieves this.

With a step of 10, every 10th row is loaded into the DataFrame. The original row labels are kept, so if you want to reset the index and make it sequential, you can simply use df.reset_index(drop=True).
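
A minimal sketch of this method. The in-memory CSV (csv_text) and its column names id and value are made up so the example runs without a download; with a real file you would pass its URL or path to pd.read_csv() instead.

```python
import io

import pandas as pd

# Hypothetical stand-in data: 25 rows with id 0..24.
# In practice, pass the file's URL or path to pd.read_csv() instead.
csv_text = "id,value\n" + "\n".join(f"{i},{2 * i + 5}" for i in range(25))

n = 10  # keep every 10th row

df = pd.read_csv(io.StringIO(csv_text))  # the whole file is read first
every_nth = df.iloc[::n]                 # slice is start:stop:step, only step is set

# The selected rows keep their original labels (0, 10, 20).
# drop=True discards those labels and renumbers the rows from 0.
every_nth_reset = every_nth.reset_index(drop=True)

print(every_nth)
print(every_nth_reset)
```

Keep in mind that the full file is still loaded into memory before the rows are selected.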

2. Using "chunksize" argument of "read_csv()":

  • The pandas read_csv() function has a chunksize argument for reading data in small chunks. In this example, chunks of 10 rows are created.
  • Since each chunk has 10 rows, the first row of each chunk is read using the iloc indexer and saved in a list.
  • The list is then passed to the pd.concat() function to combine all the rows into a DataFrame.
  • The result obtained has rows and columns in opposite positions (each selected row ends up as a column), so df.transpose() is used to interchange their positions.

Note: If you want to reset the index and make it sequential, you can simply use df.reset_index(drop = True).
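
The steps above can be sketched roughly as follows. The in-memory CSV and its column names are made up for illustration, and axis=1 in pd.concat() is an assumption that matches the swapped rows and columns described above.

```python
import io

import pandas as pd

# Hypothetical stand-in data: 25 rows with id 0..24.
# In practice, pass the file's URL or path to pd.read_csv() instead.
csv_text = "id,value\n" + "\n".join(f"{i},{2 * i + 5}" for i in range(25))

n = 10  # chunk size, so the first row of each chunk is every 10th row

first_rows = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=n):
    # chunk.iloc[0] is the first row of the chunk, returned as a Series
    first_rows.append(chunk.iloc[0])

# Assumption: with axis=1 each Series becomes a column, so the result is
# transposed to turn the selected rows back into rows.
df = pd.concat(first_rows, axis=1).transpose()

# Replace the original labels (0, 10, 20) with a sequential index.
df = df.reset_index(drop=True)

print(df)
```

Only one chunk is held in memory at a time, which can help with large files. If the columns have mixed types, each row Series is of object dtype, so the columns of the result may need converting back afterwards.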

3. Using "skiprows" argument of "read_csv()":

  • The skiprows argument of the read_csv() function is used to skip rows while reading a CSV file.
  • A simple lambda function is used in this example to skip the rows whose index is not a multiple of our desired number n.
  • A drawback of this approach is that it uses the 0th row as the column header. To correct this issue:
    → Save that row by reading it back from the column header using df.columns.
    → Replace that row in the header with the actual column names by assigning them to df.columns.
    → Add the first row in the DataFrame using loc[-1]. (-1 is used so that it doesn’t replace any other row since no row has index -1)
    → Add 1 to the index using df.index to convert -1 to 0 again and make it the first row.
    → Sort the index using df.sort_index() so that it appears sequential.

Note: In the lambda function, n-1 is used instead of n because the index of rows starts from 0 and not 1.
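
Here is a rough sketch of the correction steps listed above. The exact lambda from the thread is not reproduced here, so the one below is an assumption, as are the in-memory CSV and its column names. The last part shows an alternative that is not described in the thread: keeping the header line in the lambda so no correction is needed.

```python
import io

import pandas as pd

# Hypothetical stand-in data: 25 rows with id 0..24.
# In practice, pass the file's URL or path to pd.read_csv() instead.
csv_text = "id,value\n" + "\n".join(f"{i},{2 * i + 5}" for i in range(25))

n = 10
column_names = ["id", "value"]  # assumed to be known in advance

# Assumed lambda: keep file lines 1, 11, 21 (data rows 0, 10, 20) and skip
# everything else, including the real header on line 0.
df = pd.read_csv(io.StringIO(csv_text), skiprows=lambda i: i % n != 1)

# The 0th data row has been taken as the header. Header values are strings,
# so this sketch converts them with int(); real data needs its own conversion.
first_row = [int(v) for v in df.columns]

df.columns = column_names   # put the actual column names in the header
df.loc[-1] = first_row      # add the saved row under a label no row uses
df.index = df.index + 1     # shift labels so that -1 becomes 0
df = df.sort_index()        # move the restored row to the top

print(df)

# Alternative (not from the thread): never skip line 0, so the real header
# is kept and the correction steps are not needed.
df_alt = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=lambda i: i > 0 and (i - 1) % n != 0,
)

print(df_alt)
```

The callable passed to skiprows receives file line indices (the header is line 0) and returns True for the lines to skip.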

4. Using NumPy and slicing:

  • np.genfromtxt() is used to load data from a text file, including CSV files, into a NumPy array.
  • [1::n] skips the 0th item of the array (the column names), and the third argument of the slice is the step interval that picks every nth item, which in this case is every 10th.
  • Since we skipped the column names, we specify them when we create a DataFrame using pd.DataFrame() constructor.
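
A minimal sketch of the NumPy approach, assuming purely numeric data. The in-memory CSV and the column names id and value are made up for illustration.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in data: 25 rows with id 0..24.
# In practice, pass the file's URL or path to np.genfromtxt() instead.
csv_text = "id,value\n" + "\n".join(f"{i},{2 * i + 5}" for i in range(25))

n = 10

# With the default float dtype, the header line is read as a row of nan.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

# Start at index 1 to skip the header row, then take every nth row.
every_nth = data[1::n]

# The column names were skipped, so they are supplied by hand.
df = pd.DataFrame(every_nth, columns=["id", "value"])

print(df)
```

Because the default dtype is float, the values come back as floats, and any text columns would be read as nan. For mixed data, one of the pandas methods is likely the easier choice.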