Header And Skiprows Difference In Pandas Unclear
Solution 1:
You can follow this article, which explains the difference between the header and skiprows parameters with examples from a downloadable Olympics dataset (olympics.csv).
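To summarize: the default behavior for pd.read_csv() is to read in every line of the file and use the first line as the column names (header = 0). In this dataset, that first line is an unnecessary row of column numbers, so the real column names end up as the first row of data.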
import pandas as pd
df = pd.read_csv('olympics.csv')
df.head()
0 1 2 3 4 ... 11 12 13 14 15
0 NaN № Summer 01 ! 02 ! 03 ! ... № Games 01 ! 02 ! 03 ! Combined total
1 Afghanistan (AFG) 13 0 0 2 ... 13 0 0 2 2
2 Algeria (ALG) 12 5 2 8 ... 15 5 2 8 15
3 Argentina (ARG) 23 18 24 28 ... 41 18 24 28 70
4 Armenia (ARM) 5 1 2 9 ... 11 1 2 9 12
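The default behavior can be reproduced without the real file. The following is a minimal sketch that assumes a tiny in-memory CSV (csv_text is a made-up stand-in for olympics.csv, not the actual data) whose first line is a row of column numbers:

```python
import io
import pandas as pd

# Hypothetical stand-in for olympics.csv: a line of column numbers,
# then the real header line, then the data.
csv_text = """0,1,2
Country,Gold,Silver
Afghanistan,0,0
Algeria,5,2
"""

# header=0 is the default, so the first line supplies the column names.
df = pd.read_csv(io.StringIO(csv_text))

print(list(df.columns))      # labels taken from the line of numbers
print(df.iloc[0].tolist())   # the real header is now an ordinary data row
```

Because the real header is read as data, every column holds strings rather than numbers, which is one reason to fix this at read time.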
However, the skiprows parameter lets you skip one or more lines when you read in the .csv file. Passing an integer skips that many lines from the top of the file:
df1 = pd.read_csv('olympics.csv', skiprows = 1)
df1.head()
Unnamed: 0 № Summer 01 ! 02 ! ... 01 !.2 02 !.2 03 !.2 Combined total
0 Afghanistan (AFG) 13 0 0 ... 0 0 2 2
1 Algeria (ALG) 12 5 2 ... 5 2 8 15
2 Argentina (ARG) 23 18 24 ... 18 24 28 70
3 Armenia (ARM) 5 1 2 ... 1 2 9 12
4 Australasia (ANZ) [ANZ] 2 3 4 ... 3 4 5 12
And if you want to skip several specific rows, you can pass a list of 0-indexed line numbers (notice the missing countries):
df2 = pd.read_csv('olympics.csv', skiprows = [0, 2, 3])
df2.head()
Unnamed: 0 № Summer 01 ! 02 ! ... 01 !.2 02 !.2 03 !.2 Combined total
0 Argentina (ARG) 23 18 24 ... 18 24 28 70
1 Armenia (ARM) 5 1 2 ... 1 2 9 12
2 Australasia (ANZ) [ANZ] 2 3 4 ... 3 4 5 12
3 Australia (AUS) [AUS] [Z] 25 139 152 ... 144 155 181 480
4 Austria (AUT) 26 18 33 ... 77 111 116 304
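The two forms of skiprows can be compared side by side. This sketch again assumes a small made-up CSV (csv_text is illustrative, not the Olympics file): an integer skips that many lines from the top, while a list skips exactly the listed 0-indexed lines.

```python
import io
import pandas as pd

# Hypothetical stand-in for olympics.csv.
csv_text = """0,1,2
Country,Gold,Silver
Afghanistan,0,0
Algeria,5,2
Argentina,18,24
Armenia,1,2
"""

# Integer: skip the first line; the next line becomes the header.
first_skipped = pd.read_csv(io.StringIO(csv_text), skiprows=1)

# List: skip lines 0, 2 and 3; line 1 survives and becomes the header.
some_skipped = pd.read_csv(io.StringIO(csv_text), skiprows=[0, 2, 3])

print(first_skipped["Country"].tolist())
print(some_skipped["Country"].tolist())
```

Line numbers in the list refer to lines of the file, not to rows of the resulting DataFrame, which is why 2 and 3 remove the first two countries.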
The header parameter tells pandas which row to use for the column names; the data starts on the row after it, and anything above the header row is dropped. In the following case, it does the same thing as skiprows = 1:
# this gives the same result as df1 = pd.read_csv('olympics.csv', skiprows = 1)
df4 = pd.read_csv('olympics.csv', header = 1)
df4.head()
Unnamed: 0 № Summer 01 ! 02 ! ... 01 !.2 02 !.2 03 !.2 Combined total
0 Afghanistan (AFG) 13 0 0 ... 0 0 2 2
1 Algeria (ALG) 12 5 2 ... 5 2 8 15
2 Argentina (ARG) 23 18 24 ... 18 24 28 70
3 Armenia (ARM) 5 1 2 ... 1 2 9 12
4 Australasia (ANZ) [ANZ] 2 3 4 ... 3 4 5 12
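The equivalence can be checked directly with DataFrame.equals. This is a small sketch on a made-up in-memory CSV (csv_text is an assumption standing in for olympics.csv):

```python
import io
import pandas as pd

# Hypothetical stand-in for olympics.csv.
csv_text = """0,1,2
Country,Gold,Silver
Afghanistan,0,0
Algeria,5,2
"""

# Use line 1 as the header; line 0 above it is discarded.
by_header = pd.read_csv(io.StringIO(csv_text), header=1)

# Skip line 0; the default header=0 then picks up the next line.
by_skiprows = pd.read_csv(io.StringIO(csv_text), skiprows=1)

print(by_header.equals(by_skiprows))
```

Both calls end up treating the same line as the header, so the resulting frames match in labels, values, and dtypes for this input.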
However, you cannot use the header parameter to skip several non-adjacent rows, so you would not be able to replicate df2 using the header parameter alone. Hopefully this clears things up.
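To illustrate the limitation, the sketch below (on a made-up CSV; csv_text and the variable names are assumptions, not part of the original answer) shows that header = 1 keeps every data row, and that matching the skiprows list requires an extra step after reading:

```python
import io
import pandas as pd

# Hypothetical stand-in for olympics.csv.
csv_text = """0,1,2
Country,Gold,Silver
Afghanistan,0,0
Algeria,5,2
Argentina,18,24
Armenia,1,2
"""

with_header = pd.read_csv(io.StringIO(csv_text), header=1)
with_skiprows = pd.read_csv(io.StringIO(csv_text), skiprows=[0, 2, 3])

# header=1 alone keeps all four countries.
print(with_header["Country"].tolist())

# One possible workaround: drop the unwanted rows by position afterwards.
trimmed = with_header.drop(index=[0, 1]).reset_index(drop=True)
print(trimmed.equals(with_skiprows))
```

Dropping rows after the read works here, but skiprows avoids parsing those lines in the first place.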