Pandas Adding Rows To Df In Loop

March 31, 2024 Post a Comment

I'm parsing data in a loop and once it's parsed and structured I would like to then add it to a data frame. The end format of the data frame I would like is something like the foll

Solution 1:

IIUC, you need to convert the single dict to dataframe firstly, then we do append, in case we do not have duplicate 'id' we need groupby get the first value

df=pd.DataFrame()
l=[{'id':234,'2018-01':2},
{'id':534,'2018-01':5},
{'id':534,'2018-03':4}]

for x in l:
    df=df.append(pd.Series(x).to_frame().T.set_index('id')).groupby(level=0).first()
    print(df)

     2018-01
id          
234        2
     2018-01
id          
234        2
534        5
     2018-01  2018-03
id                   
234      2.0      NaN
534      5.0      4.0

Solution 2:

It is not advisable to generate a new data frame at each iteration and append it, this is quite expensive. If your data is not too big and fits into memory, you can make a list of dictionaries first and then pandas allows you to simply do:

df = pd.DataFrame(your_list_of_dicts)
df.set_index('id')

If making a list is to expensive (because you'd like to save memory for the data frame) consider using a generator instead of a list. The basic anatomy of a generator function is this:

defdatagen(your_input):
    for item in your_input:
        # your code to make a dictyielddict

The generator object data = datagen(input) will not store the dicts but yields a dict at each iteration. It can generate items on demand. When you do pd.DataFrame(data), pandas will stream all the data and make a data frame. Generators can be used for data pipelines (like pipes in UNIX) and are very powerful for big data workflows. Be aware, however, that a generator object can be consumed only once, that is if you run pd.DataFrame(data) again, you will get an empty data frame.

Solution 3:

The easiest way I've found in Pandas (although not intuitive) to iteratively append new data rows to a dataframe is using df.loc[ ] to reference the last (nonexistent) row, with len(df) as the index:

df.loc[ len(df) ] = [new, row, of, data]

This will "append" the new data row to the end of the dataframe in-place.

The above example is for an empty Dataframe with exactly 4 columns, such as:

df = pandas.DataFrame(  columns=["col1", "col2", "col3", "col4"]  )

df.loc[ ] indexing can insert data at any Row at all, whether or not it exists yet. It seems it will never give an IndexError, like an numpy.array or List would if you tried to assign to a nonexistent row. For a brand-new, empty DataFrame, len(df) returns 0, and thus references the first, blank row, and then increases by one each time you add a row.

–––––

I do not know the speed/memory efficiency cost of this method, but it works great for my modest datasets (few thousand rows). At least from a memory perspective, I imagine that a large loop appending data to to the target DataFrame directly would use less memory than generating an intermediate List of duplicate data first, then generating a DataFrame from that list. Time "efficiency" could be a different question entirely, one for the other SO gurus to comment on.

–––––

However for the OP's specific case where you also requested to combine the columns if the data is for an existing identically-named column, you'd need som logic during your for loop.

Instead I would make the DataFrame "dumb" and just import the data as-is, repeating dates as they come, eg. your post-loop DataFrame would look like this, with simple column names describing the raw data:

df:iddatedata2342018-01   25342018-01   55352018-03   4

(has two entries for the same date).

Then I would use the DataFrame's databasing functions to organize this data how you like, probably using some combination of df.unique() and df.sort(). Will look into that more later.

Python Channel

Pandas Adding Rows To Df In Loop

Solution 1:

Solution 2:

Solution 3:

Post a Comment for "Pandas Adding Rows To Df In Loop"