Skip to content Skip to sidebar Skip to footer

Skipping Headers And Parsing Xml To Convert To Json

I have a xml file like this: 'HTTP/1.1 100 Continue HTTP/1.1 200 OK Expires: 0 Buffer: false Pragma: No-cache Cache-Control: no-cache Server: Transaction_Serve

Solution 1:

You could use regex to select only the xml part Something like: /<document>(.*)/gs or /"">(.*)/gs

but how are you fetching that website? This looks similar to something I've been doing with curl, but you should be able to get only body out from curl.

Then you use some library to convert xml to json.

For that part you can use something like Converting XML to JSON using Python?

P.S. (I know this would be better as a comment, but i don't have enough reputation so putting it here.)


Solution 2:

You can read each file, remove the unwanted header using a concept as shown below.

import re

file = '''\
"HTTP/1.1 100 Continue
 HTTP/1.1 200 OK
 Expires: 0
 Buffer: false
 Pragma: No-cache
 Cache-Control: no-cache
 Server: Transaction_Server/4.1.0(zOS)
 Connection: close
 Content-Type: text/html
 Content-Length: 33842
 Date: Sat, 02 Aug 2014 09:27:02 GMT

<?xml version=""1.0"" encoding=""UTF-8""?>
<creditBureau xmlns=""http://www.transunion.com/namespace"" xmlns:xsi=""http://www.w3.org/2001/XMLSchema-instance"">

<document>response</document>
<version>2.9</version>
<transactionControl><userRefNumber>Credit Report Example</userRefNumber>
<subscriber><industryCode>Z</industryCode></subscriber></transactionControl>'''

# list concept.
file_list = file.split('\n')
start = file_list.index('<?xml version=""1.0"" encoding=""UTF-8""?>')
new_list = file_list[start:]
print('joined from list:\n', '\n'.join(new_list), sep='')

# regexp concept.
new_string = re.sub(r'\A.*(<\?xml.*)\Z', r'\1', file, flags=re.S)
print('regexp:\n', new_string, sep='')

The regexp might be quicker though you have plenty of files to test with.

Edit:

Use like this on test.xml:

import re

with open('test.xml') as r:
    file = r.read()

new_string = re.sub(r'\A.*(<\?xml.*)\Z', r'\1', file, flags=re.S)

print(new_string)

Edit:

Another example showing bulk overwriting of xml files. Always test first before using on many files. Small test works fine for me.

import glob, re

for file in glob.iglob('*.xml'):
    with open(file) as r:
        current_string = r.read()

    new_string = re.sub(r'\A.*(<\?xml.*)\Z', r'\1', current_string, flags=re.S)

    with open(file, 'w') as w:
        w.write(new_string)

Specify the codec for reading and writing may be needed.


Post a Comment for "Skipping Headers And Parsing Xml To Convert To Json"