Counting Particular Occurrences In Python In Csv File
I have a CSV file with 4 columns {Tag, User, Quality, Cluster_id}. Using Python I would like to do the following: for every cluster_id (from 1 to 500), I want to see for each user,
Solution 1:
collections.defaultdict should be a great help here:
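# WARNING: Untested
import csv
from collections import defaultdict

def is_good(quality):  # placeholder; adapt to your Quality values
    return quality == "good"

auto_vivificator = lambda: defaultdict(auto_vivificator)
data = auto_vivificator()
# open your csv file ("your_file.csv" is a placeholder name)
with open("your_file.csv", newline="") as f:
    csv_file = csv.reader(f)
    next(csv_file)  # skip the header row, if there is one
    for tag, user, quality, cluster in csv_file:
        counts = data[cluster].setdefault(user, defaultdict(int))
        if is_good(quality):
            counts["good"] += 1
        else:
            counts["bad"] += 1

for cluster, users in data.items():
    print("Cluster:", cluster)
    for user, quality_metrics in users.items():
        print("User:", user)
        print(quality_metrics)
    print()  # A blank line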
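For anyone who wants something to run right away, here is a minimal self-contained sketch of the same idea using collections.Counter. It assumes the file has a header row with the four column names from the question. The count_quality name and the in-memory sample rows are made up for illustration; they are not from the question. Because it counts whatever values appear in the Quality column, no is_good helper is needed.

```python
import csv
import io
from collections import Counter, defaultdict


def count_quality(rows):
    """Return {cluster_id: {user: Counter of Quality values}}."""
    counts = defaultdict(lambda: defaultdict(Counter))
    for row in rows:
        counts[row["Cluster_id"]][row["User"]][row["Quality"]] += 1
    return counts


# Tiny in-memory stand-in for the real file (made-up rows).
sample = io.StringIO(
    "Tag,User,Quality,Cluster_id\n"
    "ground,u004,bad,1\n"
    "ground,u004,good,1\n"
    "ground,u005,bad,1\n"
    "bag,u003,good,11\n"
)

# csv.DictReader uses the header row for the keys, so Cluster_id stays a string.
result = count_quality(csv.DictReader(sample))
for cluster, users in sorted(result.items()):
    for user, qualities in sorted(users.items()):
        print(cluster, user, dict(qualities))
```

To use it on a real file, pass csv.DictReader(open_file) instead of the sample. Note that Cluster_id values are kept as strings here, so convert with int() if you need numeric ordering.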
Solution 2:
Since someone's already posted a defaultdict solution, I'm going to give a pandas one, just for variety. pandas is a very handy library for data processing. Among other nice features, it can handle this counting problem in one line, depending on what kind of output is required. Really:
import pandas as pd
df = pd.read_csv("cluster.csv")
counted = df.groupby(["Cluster_id", "User", "Quality"]).size()
counted.to_csv("counted.csv")
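If a table with one row per (cluster, user) and one column per Quality value is closer to the output you want, the grouped counts can be pivoted with unstack. This is only a sketch: the small DataFrame below is made up so the example runs without cluster.csv, and table is just an illustrative name.

```python
import pandas as pd

# Made-up rows standing in for cluster.csv
df = pd.DataFrame(
    {
        "Tag": ["ground", "ground", "ground", "bag"],
        "User": ["u004", "u004", "u005", "u003"],
        "Quality": ["bad", "good", "bad", "good"],
        "Cluster_id": [1, 1, 1, 11],
    }
)

# Count rows per (Cluster_id, User, Quality), then move Quality into columns.
# fill_value=0 puts a zero where a user has no rows of a given quality.
counted = df.groupby(["Cluster_id", "User", "Quality"]).size()
table = counted.unstack("Quality", fill_value=0)
print(table)
```

Each row of table then holds the bad and good counts for one user within one cluster, and table.to_csv(...) would write it out in that shape.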
--
Just to give a trailer for what pandas makes easy, we can load the file -- the main data storage object in pandas is called a "DataFrame":
>>> import pandas as pd
>>> df = pd.read_csv("cluster.csv")
>>> df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 500000 entries, 0 to 499999
Data columns:
Tag           500000  non-null values
User          500000  non-null values
Quality       500000  non-null values
Cluster_id    500000  non-null values
dtypes: int64(1), object(3)
We can check that the first few rows look okay:
>>> df[:5]
   Tag  User Quality  Cluster_id
0  bbb  u001     bad          39
1  bbb  u002     bad          36
2  bag  u003    good          11
3  bag  u004    good           9
4  bag  u005     bad          26
and then we can group by Cluster_id and User, and do work on each group:
>>> for name, group in df.groupby(["Cluster_id", "User"]):
...     print('group name:', name)
...     print('group rows:')
...     print(group)
...     print('counts of Quality values:')
...     print(group["Quality"].value_counts())
...     input()  # pause until Enter is pressed
...
group name: (1, 'u003')
group rows:
Tag User Quality Cluster_id
372002 xxx u003 bad 1
counts of Quality values:
bad 1
group name: (1, 'u004')
group rows:
Tag User Quality Cluster_id
126003 ground u004 bad 1
348003 ground u004 good 1
counts of Quality values:
good 1
bad 1
group name: (1, 'u005')
group rows:
Tag User Quality Cluster_id
42004 ground u005 bad 1
258004 ground u005 bad 1
390004 ground u005 bad 1
counts of Quality values:
bad 3
[etc.]
If you're going to be doing a lot of processing of csv files, pandas is definitely worth having a look at.