Weird Exponential Increase In Running Time When Using Dataframe.mean() (pandas Performance Non-numeric Column)
I am playing around with a dataset of weather data (to reproduce, the data can be found here; unzip it and run the code below), and I wanted to normalize the data. To do this, I tried t
Solution 1:
I did some tests, and it seems that the culprit, in this case, is "Date Time" - the non-numeric column.
First, when calculating the mean for each column on its own, there's clearly no exponential behavior (in the timing chart, the x-axis is the number of rows and the y-axis is time).
Second, I calculated the mean for the entire data frame in the following three scenarios (each with 80K rows), timing each with %%timeit:
- The data frame as-is, with the date/time column left in place: jena_climate_df[0:80000].mean(axis=0) - 10.2 seconds.
- Setting the date/time column as the index: jena_climate_df.set_index("Date Time")[0:80000].mean(axis=0) - 40 ms (about 0.4% of the previous test).
- And finally, dropping the date/time column: jena_climate_df.drop("Date Time", axis=1)[0:80000].mean(axis=0) - 19.8 ms (about 0.2% of the original time).
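For anyone who wants to try the two fast variants without downloading the dataset, here is a minimal sketch on synthetic data. It assumes a frame shaped roughly like the weather data: one string "Date Time" column plus a few float columns. The column names col_a, col_b and col_c, the timed helper, and the select_dtypes variant are my own additions for illustration, not part of the original test, and the timings will differ from the numbers above. The slow as-is call is left out on purpose, because recent pandas versions may raise an error instead of skipping the string column.

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for the weather data (assumption: one string timestamp
# column plus float columns; col_a/col_b/col_c are made-up names).
n_rows = 80_000
rng = np.random.default_rng(0)
timestamps = pd.date_range("2009-01-01", periods=n_rows, freq="10min")
df = pd.DataFrame({
    "Date Time": timestamps.strftime("%d.%m.%Y %H:%M:%S"),
    "col_a": rng.normal(size=n_rows),
    "col_b": rng.normal(size=n_rows),
    "col_c": rng.normal(size=n_rows),
})


def timed(label, func):
    """Run func once, print the elapsed wall-clock time, return its result."""
    start = time.perf_counter()
    result = func()
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{label}: {elapsed_ms:.1f} ms")
    return result


# Variant from the answer: drop the non-numeric column before averaging.
means_dropped = timed(
    "drop column", lambda: df.drop("Date Time", axis=1).mean(axis=0)
)

# Variant from the answer: move the non-numeric column into the index.
means_indexed = timed(
    "set_index", lambda: df.set_index("Date Time").mean(axis=0)
)

# Extra variant (not in the answer): keep only numeric columns by dtype.
means_numeric = timed(
    "select_dtypes", lambda: df.select_dtypes(include="number").mean(axis=0)
)
```

All three calls average only the numeric columns, so they should produce the same per-column means; the point is that the string column never reaches mean().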
Hope this helps.