Sometimes you only need to work with specific rows, and it helps to import just those rows when reading a CSV file. This thread discusses different techniques for reading every nth row from a CSV file and creating a DataFrame from it. Before getting to the methods, note that there are two ways to supply a CSV file to them:
- Download the file and provide the path.
- Provide the URL at which the file is located.
The methods in this thread use the second way, passing a URL.
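As a rough illustration of the two ways, the sketch below reads a CSV from a local path and shows the URL form only as a comment. The file name sample.csv, its contents, and the example.com address are hypothetical placeholders, not the file used in this thread.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data standing in for a real CSV file.
csv_text = "id,value\n1,10\n2,20\n3,30\n"

# Way 1: a local path. A temporary file stands in for a downloaded CSV.
with tempfile.TemporaryDirectory() as tmp_dir:
    local_path = os.path.join(tmp_dir, "sample.csv")
    with open(local_path, "w", newline="") as handle:
        handle.write(csv_text)
    df = pd.read_csv(local_path)

# Way 2: a URL is passed to read_csv() in the same position.
# The address below is a placeholder and is not fetched here.
# df = pd.read_csv("https://example.com/sample.csv")

print(df.shape)
```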
1. Using "read_csv()" and "iloc" indexer:
- Pandas read_csv() function is used here to read the CSV file using a URL.
- The iloc indexer is then used to filter out every nth row. The third argument in the iloc slice is the step interval, which can be used to achieve this.
With a step of 10, only every 10th row is kept in the DataFrame. Note that the whole file is still read before the rows are filtered. If you want to reset the index and make it sequential, you can simply use df.reset_index(drop=True).
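A minimal sketch of this first method follows. It assumes a small in-memory CSV (built with io.StringIO) in place of the URL used in the thread. The column names id and value are made up for the example.

```python
import io

import pandas as pd

# Hypothetical data: a header line plus 30 data rows, standing in for the URL.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(30))

n = 10  # keep every nth row

df = pd.read_csv(io.StringIO(csv_text))  # the whole file is read first
every_nth = df.iloc[::n]                 # slice is start:stop:step, so step = n

# The original row labels are retained after slicing.
print(every_nth.index.tolist())  # → [0, 10, 20]

# Optional: make the index sequential again.
every_nth = every_nth.reset_index(drop=True)
```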
2. Using "chunksize" argument of "read_csv()":
- Pandas read_csv() function has a chunksize argument for loading and reading data in small chunks. In this example, chunks of 10 rows are created.
- Since each chunk has 10 rows, the first row from each chunk is read using the iloc indexer and saved in a list.
- The list is then passed to the pd.concat() function to combine all the rows into a DataFrame.
- The result has rows and columns in opposite positions, so the df.transpose() function is used to interchange them.
Note: If you want to reset the index and make it sequential, you can simply use df.reset_index(drop=True).
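The chunk-based steps could be sketched roughly as below. This assumes that the rows are combined with axis=1, which would explain why a transpose is needed. The thread's exact call is not shown, and the in-memory CSV is a stand-in for the URL.

```python
import io

import pandas as pd

# Hypothetical data: a header line plus 30 data rows, standing in for the URL.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(30))

n = 10  # chunk size, which is also the step between kept rows

first_rows = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=n):
    # chunk.iloc[0] is a Series holding the first row of this chunk.
    first_rows.append(chunk.iloc[0])

# Assumption: with axis=1 each Series becomes a column, so the result is
# transposed to turn those columns back into rows.
df = pd.concat(first_rows, axis=1).transpose()

print(df.index.tolist())  # → [0, 10, 20]
```

One alternative, not taken from the thread, is to select chunk.iloc[[0]] (a one-row DataFrame) so that pd.concat() stacks rows directly and no transpose is needed.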
3. Using "skiprows" argument of "read_csv()":
- The skiprows argument of the read_csv() function is used to skip rows while reading a CSV file.
- A simple lambda function is used in this example to skip rows whose position is not a multiple of our desired number n.
- A drawback of this approach is that it uses the 0th row it reads as the column header. To correct this issue:
→ Save that row from the column header using df.columns.
→ Replace that row in the header with the column names using df.columns.
→ Add the saved row to the DataFrame using loc[-1]. (-1 is used so that it doesn't replace any other row, since no row has index -1.)
→ Add 1 to the index using df.index to convert -1 to 0 and make it the first row.
→ Sort the index using df.sort_index() so that it appears sequential.
Note: In the lambda function, n-1 is used instead of n because the index of rows starts from 0 and not 1.
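The skiprows approach and the header repair can be sketched as follows. The lambda here is an assumption chosen to reproduce the drawback described above, and may differ from the one used in the thread. The column names, the in-memory CSV, and the integer conversion of the saved row are likewise illustrative.

```python
import io

import pandas as pd

# Hypothetical data: a header line plus 30 data rows, standing in for the URL.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2 + 1}" for i in range(30))

n = 10

# Assumed lambda: keep file lines 1, 11, 21 (data rows 0, 10, 20) and skip
# everything else, including the real header on line 0.
df = pd.read_csv(io.StringIO(csv_text), skiprows=lambda i: (i - 1) % n != 0)

# The first kept row was consumed as the header, so save it. Header values
# come back as strings; int() is only valid because this data is numeric.
first_row = [int(v) for v in df.columns]

df.columns = ["id", "value"]  # restore the real column names
df.loc[-1] = first_row        # -1 cannot clash with an existing row label
df.index = df.index + 1       # shift labels so that -1 becomes 0
df = df.sort_index()          # move the restored row to the top

print(df["id"].tolist())  # → [0, 10, 20]
```

If the file has a header line, a lambda that also keeps line 0, such as lambda i: i != 0 and (i - 1) % n != 0, may avoid the repair steps altogether.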
4. Using NumPy and slicing:
- np.genfromtxt() is used to load data from a text file, including CSV files, into a NumPy array.
- [1::n] is used to skip the 0th item of the array (which holds the column names), and the third argument in the slice is the step interval used to get every nth item, which in this case is every 10th.
- Since we skipped the column names, we specify them when we create the DataFrame using the pd.DataFrame() constructor.
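A small sketch of the NumPy route is shown below. It assumes purely numeric data and an in-memory CSV in place of the URL. The column names are hypothetical and are passed by hand, as described above.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data: a header line plus 30 data rows, standing in for the URL.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(30))

n = 10

# With the default float dtype, the header line is parsed as a row of NaN.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

# Start at 1 to skip the header row, then step by n to keep every nth row.
every_nth = data[1::n]

# The column names were skipped, so they are supplied to the constructor.
df = pd.DataFrame(every_nth, columns=["id", "value"])

print(df["id"].tolist())  # → [0.0, 10.0, 20.0]
```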