Weird Exponential Increase In Running Time When Using Dataframe.mean() (pandas Performance Non-numeric Column)
I am playing around with a dataset of weather data (to reproduce: the data can be found here; unzip it and run the code below), and I wanted to normalize the data. To do this, I tried t
Solution 1:
I did some tests, and it seems that the culprit, in this case, is "Date Time" - the non-numeric column.
First, when calculating the mean for different columns on their own, there's clearly no exponential behavior (see chart below - the X axis is the number of rows, the y-axis is time). 
Second, I calculated the means for the entire data frame in the following three scenarios (each with 80K rows) and timed each one with %%timeit:
- The whole data frame, including the date/time column: jena_climate_df[0:80000].mean(axis=0) - 10.2 seconds.
- Setting the date/time column as the index: jena_climate_df.set_index("Date Time")[0:80000].mean(axis=0) - 40 ms (about 0.4% of the first test).
- Dropping the date/time column: jena_climate_df.drop("Date Time", axis=1)[0:80000].mean(axis=0) - 19.8 ms (about 0.2% of the first test).
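For anyone who wants to try the two workarounds without downloading the data, here is a minimal sketch on a small synthetic frame. Only the "Date Time" column name comes from the tests above; the other column names, the values, and the row count are made up for illustration. The numeric_only=True variant is an extra option that was not part of my timings. The sketch deliberately skips the plain .mean() call on the full frame, since depending on your pandas version that call may be slow or may raise an error on a string column.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the weather data (made-up values, NOT the real dataset).
n_rows = 1_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    # Date/time stored as strings, i.e. a non-numeric (object) column.
    "Date Time": pd.date_range("2009-01-01", periods=n_rows, freq="10min")
                   .strftime("%d.%m.%Y %H:%M:%S"),
    # Placeholder numeric columns (names are assumptions).
    "p (mbar)": rng.normal(990.0, 8.0, n_rows),
    "T (degC)": rng.normal(9.0, 8.0, n_rows),
})

# Workaround 1: drop the non-numeric column before averaging.
means_by_drop = df.drop("Date Time", axis=1).mean(axis=0)

# Workaround 2: move the non-numeric column into the index.
means_by_index = df.set_index("Date Time").mean(axis=0)

# Extra option (not in the timings above): let pandas skip non-numeric columns.
means_numeric_only = df.mean(axis=0, numeric_only=True)

print(means_by_drop)
```

All three give the same per-column means; in a notebook you can put %timeit in front of each expression to compare them on your own data.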
Hope this helps.