
Header And Skiprows Difference In Pandas Unclear

Can anyone please explain, with a good example, the difference between header and skiprows in pd.read_excel('name', header=number, skiprows=number)?

Solution 1:

You can follow this article, which explains the difference between the header and skiprows parameters with examples from the Olympics dataset, which can be downloaded here.

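To summarize: by default, pd.read_csv() reads every row and uses the first one as the column names. In this dataset that first row is an unnecessary row of column numbers, so the real column names end up as the first row of data. (The examples use pd.read_csv(), but pd.read_excel() accepts the same header and skiprows parameters.)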

import pandas as pd
df = pd.read_csv('olympics.csv')
df.head()

                   0         1     2     3     4  ...       11    12    13    14              15
0                NaN  № Summer  01 !  02 !  03 !  ...  № Games  01 !  02 !  03 !  Combined total
1  Afghanistan (AFG)        13     0     0     2  ...       13     0     0     2               2
2      Algeria (ALG)        12     5     2     8  ...       15     5     2     8              15
3    Argentina (ARG)        23    18    24    28  ...       41    18    24    28              70
4      Armenia (ARM)         5     1     2     9  ...       11     1     2     9              12
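To try that default without downloading anything, here is a minimal sketch. The SAMPLE string is a made-up three-column miniature of the file (an assumption for illustration, not the real olympics.csv); it shows the numbers on the first line becoming the column names.

```python
import io
import pandas as pd

# Hypothetical miniature of the file: line 0 holds column numbers,
# line 1 holds the real column names, the remaining lines are data.
SAMPLE = """0,1,2
,Summer,Gold
Afghanistan,13,0
Algeria,12,5
Argentina,23,18
"""

df = pd.read_csv(io.StringIO(SAMPLE))  # header=0 is the default
print(list(df.columns))     # ['0', '1', '2']
print(df.iloc[0].tolist())  # the real names land in the first data row
```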

However, the skiprows parameter lets you skip one or more lines at the start of the file when you read in the .csv file. The first line that is not skipped then becomes the header:

df1 = pd.read_csv('olympics.csv', skiprows = 1)
df1.head()

                Unnamed: 0   Summer  01 !  02 !  ...  01 !.2  02 !.2  03 !.2  Combined total
0        Afghanistan (AFG)        13     0     0  ...       0       0       2               2
1            Algeria (ALG)        12     5     2  ...       5       2       8              15
2          Argentina (ARG)        23    18    24  ...      18      24      28              70
3            Armenia (ARM)         5     1     2  ...       1       2       9              12
4  Australasia (ANZ) [ANZ]         2     3     4  ...       3       4       5              12
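The column names in that output (Unnamed: 0, 01 !.2 and so on) come from pandas itself: an empty header cell gets a placeholder name and repeated names get a numeric suffix. A small sketch of this, using a made-up SAMPLE string that stands in for the real file:

```python
import io
import pandas as pd

# Hypothetical miniature: the real header (line 1) has an empty first
# cell and a repeated name, like the Olympics file.
SAMPLE = """0,1,2,3
,Summer,Gold,Gold
Afghanistan,13,0,0
Algeria,12,5,5
"""

df1 = pd.read_csv(io.StringIO(SAMPLE), skiprows=1)  # skip the first line only
print(list(df1.columns))  # ['Unnamed: 0', 'Summer', 'Gold', 'Gold.1']
```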

And if you want to skip specific rows, you can pass a list of zero-based line numbers (notice the missing countries):

df2 = pd.read_csv('olympics.csv', skiprows = [0, 2, 3])
df2.head()

                  Unnamed: 0   Summer  01 !  02 !  ...  01 !.2  02 !.2  03 !.2  Combined total
0            Argentina (ARG)        23    18    24  ...      18      24      28              70
1              Armenia (ARM)         5     1     2  ...       1       2       9              12
2    Australasia (ANZ) [ANZ]         2     3     4  ...       3       4       5              12
3  Australia (AUS) [AUS] [Z]        25   139   152  ...     144     155     181             480
4              Austria (AUT)        26    18    33  ...      77     111     116             304
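Besides an integer or a list, skiprows also accepts a function that is called with each zero-based line number and returns True for lines to skip. A rough sketch comparing the two forms, again on a made-up SAMPLE string and not the real dataset:

```python
import io
import pandas as pd

# Hypothetical miniature of the file (line numbers are zero-based).
SAMPLE = """0,1,2
,Summer,Gold
Afghanistan,13,0
Algeria,12,5
Argentina,23,18
Armenia,5,1
"""

# Skip file lines 0, 2 and 3; line 1 is the first one left, so it is the header.
by_list = pd.read_csv(io.StringIO(SAMPLE), skiprows=[0, 2, 3])

# Same selection expressed as a callable on the line number.
by_callable = pd.read_csv(io.StringIO(SAMPLE), skiprows=lambda i: i in {0, 2, 3})

print(by_list["Unnamed: 0"].tolist())  # ['Argentina', 'Armenia']
print(by_list.equals(by_callable))     # True
```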

The header parameter tells pandas which row holds the column names; everything above that row is discarded and the data starts on the row below it. In the following case, that does the same thing as skiprows = 1:

# this gives the same result as df1 = pd.read_csv('olympics.csv', skiprows = 1)
df4 = pd.read_csv('olympics.csv', header = 1)
df4.head()

                Unnamed: 0   Summer  01 !  02 !  ...  01 !.2  02 !.2  03 !.2  Combined total
0        Afghanistan (AFG)        13     0     0  ...       0       0       2               2
1            Algeria (ALG)        12     5     2  ...       5       2       8              15
2          Argentina (ARG)        23    18    24  ...      18      24      28              70
3            Armenia (ARM)         5     1     2  ...       1       2       9              12
4  Australasia (ANZ) [ANZ]         2     3     4  ...       3       4       5              12
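If you want to check the equivalence rather than compare printed output by eye, something like the following should work. The SAMPLE string is a made-up stand-in for olympics.csv, so treat it as a sketch only.

```python
import io
import pandas as pd

# Hypothetical miniature: a junk first line, then the real header, then data.
SAMPLE = """0,1,2
,Summer,Gold
Afghanistan,13,0
Algeria,12,5
"""

by_skiprows = pd.read_csv(io.StringIO(SAMPLE), skiprows=1)  # drop line 0, header inferred from the next line
by_header = pd.read_csv(io.StringIO(SAMPLE), header=1)      # line 1 is the header, line 0 is discarded

print(by_skiprows.equals(by_header))  # True
```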

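However, you cannot use the header parameter to skip an arbitrary set of rows: a list passed to header is interpreted as several header rows (producing MultiIndex columns), not as rows to drop. You would not be able to replicate df2 using the header parameter alone. Hopefully this clears things up.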
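One more point that often causes confusion is what happens when both parameters are given: skiprows is applied first, and header is then counted within the lines that remain. A minimal sketch on a made-up SAMPLE string (not the real dataset):

```python
import io
import pandas as pd

# Hypothetical miniature of the file.
SAMPLE = """0,1,2
,Summer,Gold
Afghanistan,13,0
Algeria,12,5
Argentina,23,18
Armenia,5,1
"""

# Line 0 is skipped first; header=1 then points at the second remaining
# line, which is file line 2 (the Afghanistan row).
combined = pd.read_csv(io.StringIO(SAMPLE), skiprows=1, header=1)
print(list(combined.columns))  # ['Afghanistan', '13', '0']

# header=None means "no header row": every remaining line is data and
# the columns are numbered 0, 1, 2.
no_header = pd.read_csv(io.StringIO(SAMPLE), skiprows=2, header=None)
print(list(no_header.columns))  # [0, 1, 2]
print(len(no_header))           # 4
```

So skiprows decides which lines pandas sees at all, and header decides which of the surviving lines supplies the column names.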

