Skipping Headers And Parsing Xml To Convert To Json
Solution 1:
You could use regex to select only the xml part
Something like: /<document>(.*)/gs or /"">(.*)/gs
but how are you fetching that website? This looks similar to something I've been doing with curl, but you should be able to get only body out from curl.
Then you use some library to convert xml to json.
For that part you can use something like Converting XML to JSON using Python?
P.S. (I know this would be better as a comment, but i don't have enough reputation so putting it here.)
Solution 2:
You can read each file, remove the unwanted header using a concept as shown below.
import re
file = '''\
"HTTP/1.1 100 Continue
HTTP/1.1 200 OK
Expires: 0
Buffer: false
Pragma: No-cache
Cache-Control: no-cache
Server: Transaction_Server/4.1.0(zOS)
Connection: close
Content-Type: text/html
Content-Length: 33842
Date: Sat, 02 Aug 2014 09:27:02 GMT
<?xml version=""1.0"" encoding=""UTF-8""?>
<creditBureau xmlns=""http://www.transunion.com/namespace"" xmlns:xsi=""http://www.w3.org/2001/XMLSchema-instance"">
<document>response</document>
<version>2.9</version>
<transactionControl><userRefNumber>Credit Report Example</userRefNumber>
<subscriber><industryCode>Z</industryCode></subscriber></transactionControl>'''
# list concept.
file_list = file.split('\n')
start = file_list.index('<?xml version=""1.0"" encoding=""UTF-8""?>')
new_list = file_list[start:]
print('joined from list:\n', '\n'.join(new_list), sep='')
# regexp concept.
new_string = re.sub(r'\A.*(<\?xml.*)\Z', r'\1', file, flags=re.S)
print('regexp:\n', new_string, sep='')
The regexp might be quicker though you have plenty of files to test with.
Edit:
Use like this on test.xml:
import re
with open('test.xml') as r:
file = r.read()
new_string = re.sub(r'\A.*(<\?xml.*)\Z', r'\1', file, flags=re.S)
print(new_string)
Edit:
Another example showing bulk overwriting of xml files. Always test first before using on many files. Small test works fine for me.
import glob, re
for file in glob.iglob('*.xml'):
with open(file) as r:
current_string = r.read()
new_string = re.sub(r'\A.*(<\?xml.*)\Z', r'\1', current_string, flags=re.S)
with open(file, 'w') as w:
w.write(new_string)
Specify the codec for reading and writing may be needed.
Post a Comment for "Skipping Headers And Parsing Xml To Convert To Json"