
Weird Exponential Increase In Running Time When Using DataFrame.mean() (pandas Performance Non-numeric Column)

I am playing around with a dataset of weather data (to reproduce: the data can be found here; unzip it and run the code below), and I wanted to normalize the data. To do this, I tried t

Solution 1:

I ran some tests, and the culprit in this case appears to be "Date Time", the non-numeric column.
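A quick way to confirm which columns are non-numeric is to inspect the dtypes. The sketch below uses a tiny hypothetical frame rather than the real dataset; only the "Date Time" column name comes from the post, and the "T (degC)" column and all values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical miniature frame: one string timestamp column, one numeric column.
df = pd.DataFrame({
    "Date Time": ["01.01.2009 00:10:00", "01.01.2009 00:20:00"],
    "T (degC)": [-8.02, -8.41],
})

# Columns whose dtype is not numeric are the ones mean() cannot reduce cheaply.
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

Any column listed here is a candidate for dropping, indexing, or converting before calling mean() on the whole frame.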

First, when calculating the mean of each column on its own, there is clearly no exponential behavior. (The original chart plotted the number of rows on the x-axis against time on the y-axis.)
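A per-column measurement of this kind could be sketched roughly as follows. This is not the code behind the chart; it uses a synthetic numeric column (the name "T (degC)" and the random values are assumptions) and a simple wall-clock timer instead of %%timeit.

```python
import time

import numpy as np
import pandas as pd

# Synthetic numeric column standing in for one of the weather measurements.
rng = np.random.default_rng(0)
col = pd.Series(rng.normal(size=80_000), name="T (degC)")

# Time mean() on growing prefixes of the column.
timings = {}
for n_rows in (10_000, 20_000, 40_000, 80_000):
    start = time.perf_counter()
    col.iloc[:n_rows].mean()
    timings[n_rows] = time.perf_counter() - start

for n_rows, seconds in timings.items():
    print(f"{n_rows:>6} rows: {seconds * 1e6:.0f} microseconds")
```

A single run like this is noisy; repeating each measurement several times and taking the minimum gives a steadier picture.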

Second, I calculated the means for the entire data frame in the following three scenarios (each with 80K rows), timing each with %%timeit:

  • The data frame as is, including the date/time column: jena_climate_df[0:80000].mean(axis=0) - 10.2 seconds.
  • Setting the date/time column as the index: jena_climate_df.set_index("Date Time")[0:80000].mean(axis=0) - 40 ms (about 0.4% of the time of the first test).
  • Dropping the date/time column: jena_climate_df.drop("Date Time", axis=1)[0:80000].mean(axis=0) - 19.8 ms (about 0.2% of the time of the first test).
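The two fast variants can be reproduced on synthetic data with a sketch like the one below. The frame, its numeric column names, and the `timed` helper are assumptions for illustration, not the original dataset or benchmark. It also shows `numeric_only=True` as a third option. The slow call on the unmodified frame is left out on purpose: in recent pandas versions (2.0 and later, as far as I know) it raises a TypeError on a string column instead of silently skipping it.

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for the weather data (hypothetical values and columns).
n_rows = 80_000
rng = np.random.default_rng(0)
timestamps = pd.date_range("2009-01-01", periods=n_rows, freq="10min")
df = pd.DataFrame({
    "Date Time": timestamps.strftime("%d.%m.%Y %H:%M:%S"),  # kept as strings
    "p (mbar)": rng.normal(990.0, 8.0, n_rows),
    "T (degC)": rng.normal(9.0, 8.0, n_rows),
})


def timed(label, fn):
    """Run fn once, print the elapsed wall-clock time, and return its result."""
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {(time.perf_counter() - start) * 1e3:.1f} ms")
    return result


# Move the string column out of the way by making it the index.
means_indexed = timed("set_index", lambda: df.set_index("Date Time").mean(axis=0))
# Remove the string column entirely.
means_dropped = timed("drop", lambda: df.drop("Date Time", axis=1).mean(axis=0))
# Ask pandas to consider numeric columns only.
means_numeric = timed("numeric_only", lambda: df.mean(axis=0, numeric_only=True))
```

All three calls return the same per-column means; they differ only in how the string column is kept out of the reduction.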

Hope this helps.
