
Get All Same Attribute Values For Xml In Python

I am trying to get all 'points' attribute values from the 'TextRegion --> Coords' tags, but I keep getting errors. Note: there are tags called 'TextRegion' and 'ImageRegion' whic

Solution 1:

Assuming the unclosed opening tag of the root element in your posted XML is just a copy/paste issue, your main problem is the classic XML parsing issue: the nodes sit under a default namespace. A default namespace is declared by an attribute named xmlns without a colon-separated prefix, unlike a prefixed declaration such as xmlns:doc="...".
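To see why un-prefixed lookups fail, here is a minimal sketch using the built-in ElementTree. The sample document is hypothetical (the region id and points values are made up); only the namespace URI is taken from this answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; only the namespace URI comes from the answer.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion id="r1"><Coords points="1,2 3,4"/></TextRegion>
  </Page>
</PcGts>"""

root = ET.fromstring(SAMPLE)

# The parser stores the namespace URI inside every tag name ("Clark notation").
print(root.tag)
# {http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}PcGts

# An un-prefixed path searches for elements in no namespace, so nothing matches.
unprefixed = root.findall('Page/TextRegion/Coords')
print(len(unprefixed))
# 0
```

The search does not raise an error here; it just silently returns an empty list, which is why this problem is easy to miss.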

As a result, you need to define a temporary namespace prefix in Python and apply it to every named element in the path, which you can do with a dictionary passed into findall (not find_all).

from lxml import etree as ET

tree = ET.parse('0004.xml')
nsmp = {'doc': 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15'}

root = tree.getroot()
print(root.tag)

# SPECIFY NAMESPACE AND PREFIX ALL NAMED ELEMENTS
for tag in root.findall('doc:Page/doc:TextRegion/doc:Coords', namespaces=nsmp):
    value = tag.get('points')
    print(value)

# 1653,146 1651,148
# 2071,326 2069,328 2058,328 2055
# 2247,2825 2247,2857 2266,2857 2268,2860 2268
# 731,2828 731,2839 728,2841

By the way, lxml is a feature-rich XML library (it requires a third-party installation) that, among other powerful tools, supports full XPath 1.0. The above code still works with Python's built-in etree simply by changing the import line to from xml.etree import ElementTree as ET.
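As a rough illustration of the built-in variant, the sketch below runs the same findall call with xml.etree on an inline document. The sample XML is hypothetical (ids and points values are invented for the example); it includes an ImageRegion to show that the path only picks up Coords under TextRegion.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample document; the ids and points values are made up.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion id="r1"><Coords points="10,10 20,10 20,20"/></TextRegion>
    <ImageRegion id="r2"><Coords points="30,30 40,30 40,40"/></ImageRegion>
    <TextRegion id="r3"><Coords points="50,50 60,50 60,60"/></TextRegion>
  </Page>
</PcGts>"""

# Temporary prefix mapped to the document's default namespace URI.
nsmp = {'doc': 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15'}

root = ET.fromstring(SAMPLE)

# Every step of the path carries the prefix; ImageRegion is not in the path,
# so its Coords element is skipped.
text_points = [
    coords.get('points')
    for coords in root.findall('doc:Page/doc:TextRegion/doc:Coords', namespaces=nsmp)
]
print(text_points)
# ['10,10 20,10 20,20', '50,50 60,50 60,60']
```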

However, lxml extends this library with features such as selecting attribute values directly with xpath:

tree = ET.parse('0004.xml')

# SPECIFY NAMESPACE AND PREFIX ALL NAMED ELEMENTS
for pts in tree.xpath('//doc:Coords/@points', namespaces=nsmp):
    print(pts)

# 1653,146 1651,148
# 2071,326 2069,328 2058,328 2055
# 2247,2825 2247,2857 2266,2857 2268,2860 2268
# 731,2828 731,2839 728,2841
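One caveat worth sketching: a descendant search on Coords alone matches every Coords element, including any under ImageRegion. The example below uses the built-in ElementTree, whose limited XPath subset (as far as I know) cannot select attributes directly in the path, so it reads the attribute with get instead. The sample XML is hypothetical, with invented ids and points values.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample document; the ids and points values are made up.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion id="r1"><Coords points="10,10 20,10 20,20"/></TextRegion>
    <ImageRegion id="r2"><Coords points="30,30 40,30 40,40"/></ImageRegion>
    <TextRegion id="r3"><Coords points="50,50 60,50 60,60"/></TextRegion>
  </Page>
</PcGts>"""

nsmp = {'doc': 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15'}
root = ET.fromstring(SAMPLE)

# Descendant search on Coords alone: matches Coords under ANY region type.
all_points = [c.get('points') for c in root.iterfind('.//doc:Coords', nsmp)]

# Naming the parent in the path restricts the match to TextRegion children.
text_only = [c.get('points') for c in root.iterfind('.//doc:TextRegion/doc:Coords', nsmp)]

print(len(all_points), len(text_only))
# 3 2
```

With lxml, the corresponding restriction would presumably be to name the parent in the expression as well, for example '//doc:TextRegion/doc:Coords/@points'.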
