Importing Csv Into Python

August 21, 2024 Post a Comment

I have a CSV dataset that looks like this: FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME 41,41,USA,UK,113764,John 53,43,USA,USA,145963,Fred 47,37,USA,UK,42857,Dan 47,44

Solution 1:

Go with pandas, it will save you the trouble:

import pandas as pd

df = pd.read_csv('first.csv')
print(df)

Solution 2:

You could use the dtype argument:

import numpy as np

output = np.genfromtxt("main.csv", delimiter=',', skip_header=1, dtype='f, f, |S6, |S6, f, |S6')

print(output)

Output:

[( 41.,  41., b'USA', b'UK',  113764., b'John')
 ( 53.,  43., b'USA', b'USA',  145963., b'Fred')
 ( 47.,  37., b'USA', b'UK',   42857., b'Dan')
 ( 47.,  44., b'UK', b'USA',   95352., b'Mark')]

Solution 3:

Alternative from using pandas is to use csv library

import csv
import numpy as np
ls = list(csv.reader(open('first.csv', 'r')))
val_array = np.array(ls)[1::] # exclude first row (columns name)

Solution 4:

With a few general paramters genfromtxt can read this file (in PY3 here):

In [100]: data = np.genfromtxt('stack43444219.txt', delimiter=',', names=True, dtype=None)
In [101]: data
Out[101]: 
array([(41, 41, b'USA', b'UK', 113764, b'John'),
       (53, 43, b'USA', b'USA', 145963, b'Fred'),
       (47, 37, b'USA', b'UK',  42857, b'Dan'),
       (47, 44, b'UK', b'USA',  95352, b'Mark')], 
      dtype=[('FirstAge', '<i4'), ('SecondAge', '<i4'), ('FirstCountry', 'S3'), ('SecondCountry', 'S3'), ('Income', '<i4'), ('NAME', 'S4')])

This is a structured array. 2 fields are integer, 2 are string (byte string by default), another integer, and string.

The default genfromtxt reads all lines as data. I uses names=True to get to use the first line a field names.

It also tries to read all strings a float (default dtype). The string columns then load as nan.

All of this is in the genfromtxt docs. Admittedly they are long, but they aren't hard to find.

Access fields by name, data['FirstName'] etc.

Using thecsv reader gives a 2d array of strings:

In [102]: ls =list(csv.reader(open('stack43444219.txt','r')))
In [103]: ls
Out[103]: 
[['FirstAge', 'SecondAge', 'FirstCountry', 'SecondCountry', 'Income', 'NAME'],
 ['41', '41', 'USA', 'UK', '113764', 'John'],
 ['53', '43', 'USA', 'USA', '145963', 'Fred'],
 ['47', '37', 'USA', 'UK', '42857', 'Dan'],
 ['47', '44', 'UK', 'USA', '95352', 'Mark']]
In [104]: arr=np.array(ls)
In [105]: arr
Out[105]: 
array([['FirstAge', 'SecondAge', 'FirstCountry', 'SecondCountry', 'Income',
        'NAME'],
       ['41', '41', 'USA', 'UK', '113764', 'John'],
       ['53', '43', 'USA', 'USA', '145963', 'Fred'],
       ['47', '37', 'USA', 'UK', '42857', 'Dan'],
       ['47', '44', 'UK', 'USA', '95352', 'Mark']], 
      dtype='<U13')

Solution 5:

I think the an issue that you could be running into is the data that you are trying to parse is not all numerics and this could potentially cause unexpected behavior.

One way to detect the types would be to try and identify the types before they are added to your array. For example:

for obj in my_data:
    iftype(obj) == int:
        # process or add your data to numpyelse:
        # cast or discard the data

Python Channel