
Pandas read_csv Not Obeying a Regex sep

Data:

from io import StringIO
import pandas as pd

s = '''ID,Level,QID,Text,ResponseID,responseText,date_key,last
375280046,S,D3M,Which is your favorite?,D5M0,option 1,2012-08-08 0

Solution 1:

Let's look at this SO Post.

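Use the regular expression r',(?=\S)'. It matches a comma only when the next character is not whitespace, so a comma followed by a space inside a text field is not treated as a separator.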

from io import StringIO
import pandas as pd

s = '''ID,Level,QID,Text,ResponseID,responseText,date_key,last
375280046,S,D3M,Which is your favorite?,D5M0,option 1,2012-08-08 00:00:00,ynot
375280046,S,D3M,How often? (at home, at work, other),D3M0,Work,2010-03-31 00:00:00,okkk
375280046,M,A78,Do you prefer a, b, or c?,A78C,a,2010-03-31 00:00:00,abc
376918925,M,A78,Which ONE (select only one),A78E,Milk,2004-02-02 00:00:00,launch Wed., '''

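# Regex separators are only supported by the python engine;
# naming it explicitly avoids the fallback ParserWarning.
df = pd.read_csv(StringIO(s), sep=r',(?=\S)', engine='python')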

Output:

              ID                                 Level   QID      Text  \
375280046 S  D3M               Which is your favorite?  D5M0  option 1   
          S  D3M  How often? (at home, at work, other)  D3M0      Work   
          M  A78             Do you prefer a, b, or c?  A78C         a   
376918925 M  A78           Which ONE (select only one)  A78E      Milk   

                ResponseID  responseText  date_key          last  
375280046 S  2012-08-08 00             0         0          ynot  
          S  2010-03-31 00             0         0          okkk  
          M  2010-03-31 00             0         0           abc  
376918925 M  2004-02-02 00             0         0  launch Wed.,  
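
To see what the lookahead does on its own, independent of pandas, one row of the sample data can be split directly with the standard re module. This is only a sketch of the regex behaviour; the variable names are illustrative, and it does not reproduce whatever read_csv does before or after splitting.

```python
import re

# Match a comma only when the next character is non-whitespace.
pattern = re.compile(r',(?=\S)')

# Second data row from the sample above; its Text field contains ", " twice.
line = ("375280046,S,D3M,How often? (at home, at work, other),"
        "D3M0,Work,2010-03-31 00:00:00,okkk")

fields = pattern.split(line)
print(len(fields))   # number of fields found in the row
print(fields[3])     # the Text field, with its embedded commas intact
```

The commas inside "(at home, at work, other)" are each followed by a space, so the lookahead fails there and the field stays in one piece.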

Solution 2:

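read_csv appears to strip trailing whitespace from each line before applying the separator regex. The last row then ends in a bare comma ("launch Wed.,"), and a negative lookahead on whitespace alone, such as (?!\s), accepts that comma as a separator because nothing follows it. This can be worked around by extending the lookahead so that a comma immediately before the end of the string is rejected as well: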

pd.read_csv(StringIO(s), sep=r',(?!\s|\Z)', engine='python')
Out[347]: 
          ID Level  QID                                  Text ResponseID  \
0  375280046     S  D3M               Which is your favorite?       D5M0   
1  375280046     S  D3M  How often? (at home, at work, other)       D3M0   
2  375280046     M  A78             Do you prefer a, b, or c?       A78C   
3  376918925     M  A78           Which ONE (select only one)       A78E   

  responseText             date_key          last  
0     option 1  2012-08-08 00:00:00          ynot  
1         Work  2010-03-31 00:00:00          okkk  
2            a  2010-03-31 00:00:00           abc  
3         Milk  2004-02-02 00:00:00  launch Wed.,  
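
The effect of the stripping can be illustrated with the re module alone. The sketch below assumes the line is stripped before it is split, as described above, and compares a whitespace-only negative lookahead (an assumed stand-in for the pattern that misbehaved) with the \Z variant. The variable names are illustrative.

```python
import re

# Last data row from the sample; it ends with a comma and a space.
last_line = ("376918925,M,A78,Which ONE (select only one),A78E,Milk,"
             "2004-02-02 00:00:00,launch Wed., ")

# Assumption: the parser strips the line before splitting it.
stripped = last_line.strip()

naive = re.compile(r',(?!\s)')      # rejects only commas followed by whitespace
fixed = re.compile(r',(?!\s|\Z)')   # also rejects a comma at the end of the string

naive_fields = naive.split(stripped)
fixed_fields = fixed.split(stripped)

print(len(naive_fields), repr(naive_fields[-1]))
print(len(fixed_fields), repr(fixed_fields[-1]))
```

With the whitespace-only lookahead, the trailing comma of the stripped line is split off and the row gains an empty extra field; with \Z added, the row keeps the same number of fields as the header.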
