
Counting Particular Occurrences In Python In Csv File

I have a CSV file with 4 columns {Tag, User, Quality, Cluster_id}. Using Python, I would like to do the following: for every cluster_id (from 1 to 500), I want to see for each user,

Solution 1:

collections.defaultdict should be a great help here:

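# WARNING: Untested
import csv
from collections import defaultdict

# A defaultdict whose missing keys are themselves auto-creating defaultdicts
auto_vivificator = lambda: defaultdict(auto_vivificator)

data = auto_vivificator()

def is_good(quality):
    # Assumption: the Quality column holds the strings "good" and "bad"
    return quality == "good"

# open your csv file (the file name and header row are assumptions)
with open("cluster.csv") as f:
    csv_file = csv.reader(f)
    next(csv_file)  # skip the header row
    for tag, user, quality, cluster in csv_file:
        counts = data[cluster].setdefault(user, defaultdict(int))
        if is_good(quality):
            counts["good"] += 1
        else:
            counts["bad"] += 1

for cluster, users in data.items():
    print("Cluster:", cluster)
    for user, quality_metrics in users.items():
        print("User:", user)
        print(dict(quality_metrics))
        print()  # A blank line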
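
For comparison, here is a minimal self-contained sketch of the same nested-count idea, using csv.DictReader so columns are read by name and collections.Counter for the tallies. The count_quality helper and the in-memory sample rows are hypothetical stand-ins, not part of the original question; it assumes the file has a header row with the four column names given above.

```python
import csv
import io
from collections import Counter, defaultdict


def count_quality(rows):
    """Return {cluster_id: {user: Counter({quality: n})}} for an iterable of row dicts."""
    counts = defaultdict(lambda: defaultdict(Counter))
    for row in rows:
        counts[row["Cluster_id"]][row["User"]][row["Quality"]] += 1
    return counts


# Hypothetical sample data standing in for the real CSV file.
sample = io.StringIO(
    "Tag,User,Quality,Cluster_id\n"
    "ground,u004,bad,1\n"
    "ground,u004,good,1\n"
    "xxx,u003,bad,1\n"
    "bag,u003,good,11\n"
)

counts = count_quality(csv.DictReader(sample))

# Cluster ids are strings here because the csv module does not convert types.
for cluster, users in sorted(counts.items()):
    for user, tally in sorted(users.items()):
        print(cluster, user, dict(tally))
```

To run this on a real file, pass csv.DictReader(open_file) instead of the sample. A Counter returns 0 for a quality value a user never had, so no special-casing is needed when reading the result.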

Solution 2:

Since someone's already posted a defaultdict solution, I'm going to give a pandas one, just for variety. pandas is a very handy library for data processing. Among other nice features, it can handle this counting problem in one line, depending on what kind of output is required. Really:

import pandas as pd

df = pd.read_csv("cluster.csv")
counted = df.groupby(["Cluster_id", "User", "Quality"]).size()
counted.to_csv("counted.csv")
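
If the output you need is one row per (Cluster_id, User) with a column for each Quality value, one possible variation is to unstack the grouped counts. This is only a sketch: the in-memory sample rows below are hypothetical and stand in for cluster.csv.

```python
import io

import pandas as pd

# Hypothetical sample standing in for cluster.csv.
sample = io.StringIO(
    "Tag,User,Quality,Cluster_id\n"
    "ground,u004,bad,1\n"
    "ground,u004,good,1\n"
    "ground,u005,bad,1\n"
    "ground,u005,bad,1\n"
    "bag,u003,good,11\n"
)
df = pd.read_csv(sample)

# Count rows per (Cluster_id, User, Quality), then pivot the Quality level
# into columns, filling combinations that never occur with 0.
table = (
    df.groupby(["Cluster_id", "User", "Quality"])
    .size()
    .unstack("Quality", fill_value=0)
)
print(table)
```

The result is a DataFrame indexed by (Cluster_id, User), so a single count can be looked up with something like table.loc[(1, "u005"), "bad"].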

--

Just to give a trailer for what pandas makes easy, we can load the file -- the main data storage object in pandas is called a "DataFrame":

>>> import pandas as pd
>>> df = pd.read_csv("cluster.csv")
>>> df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 500000 entries, 0 to 499999
Data columns:
Tag           500000  non-null values
User          500000  non-null values
Quality       500000  non-null values
Cluster_id    500000  non-null values
dtypes: int64(1), object(3)

We can check that the first few rows look okay:

>>> df[:5]
   Tag  User Quality  Cluster_id
0  bbb  u001     bad          39
1  bbb  u002     bad          36
2  bag  u003    good          11
3  bag  u004    good           9
4  bag  u005     bad          26

and then we can group by Cluster_id and User, and do work on each group:

>>> for name, group in df.groupby(["Cluster_id", "User"]):
...     print('group name:', name)
...     print('group rows:')
...     print(group)
...     print('counts of Quality values:')
...     print(group["Quality"].value_counts())
...     input()  # pause until Enter is pressed
...
group name: (1, 'u003')
group rows:
        Tag  User Quality  Cluster_id
372002  xxx  u003     bad           1
counts of Quality values:
bad    1

group name: (1, 'u004')
group rows:
           Tag  User Quality  Cluster_id
126003  ground  u004     bad           1
348003  ground  u004    good           1
counts of Quality values:
good    1
bad     1

group name: (1, 'u005')
group rows:
           Tag  User Quality  Cluster_id
42004   ground  u005     bad           1
258004  ground  u005     bad           1
390004  ground  u005     bad           1
counts of Quality values:
bad    3
[etc.]

If you're going to be doing a lot of processing of CSV files, pandas is definitely worth a look.
