XPATH - Html With A Lot Of Children
Consider the html in the page variable. How do I access the tds? I want to access them with something like xpath('/table/tr/td/text()'), and I don't want to spell out the other trs. Unfortunately this
Solution 1:
Use the xpath //td/text():
things = tree.xpath('//td/text()')
The //td stands for "find any td element at any depth".
Works for me.
Printing td elements grouped per table:
from lxml import html

doc = html.fromstring(page)
for table_elm in doc.xpath("//table"):
    print("another table")
    # the leading "." keeps the search inside this table
    things = table_elm.xpath('.//td/text()')
    print(things)
Note that in this case the leading . in the xpath is significant: it makes the search relative to table_elm instead of starting from the document root.
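To see what the leading . changes, here is a minimal sketch that runs both forms on the same table element. The page string is a made-up stand-in for the one in the question, and the variable names are only for illustration.

```python
from lxml import html

# Hypothetical stand-in for the question's `page` variable.
page = (
    "<html><body>"
    "<table><tr><td>table1 td1</td><td>table1 td2</td></tr></table>"
    "<table><tr><td>table2 td1</td><td>table2 td2</td></tr></table>"
    "</body></html>"
)

doc = html.fromstring(page)
first_table = doc.xpath("//table")[0]

# A leading "//" is absolute: the search starts at the document root,
# even though xpath() is called on a single table element.
absolute = first_table.xpath("//td/text()")

# A leading ".//" is relative: only descendants of first_table are searched.
relative = first_table.xpath(".//td/text()")

print(absolute)  # td texts from both tables
print(relative)  # td texts from the first table only
```

Without the dot, every table in the loop would print the same full list.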
Solution 2:
You don't have to convert the BeautifulSoup object to str:
soup = str(BeautifulSoup(page, 'html.parser'))
You can use something like this:
>>> soup = BeautifulSoup(page, 'html.parser')
>>> for td in soup.find_all('td'):
...     print(td)
...
<td>table1 td1</td>
<td>table1 td2</td>
<td>table2 td1</td>
<td>table2 td2</td>
<td>table3 td1</td>
<td>table3 td2</td>
Or you can use print(td.text) if you want the text inside the element.
Solution 3:
A tr inside of a tr is invalid HTML, and this seems to be "fixed" by the html.fromstring() parser.
You can test this with this xpath:
things = tree.xpath('//table/tr/*')
And output with:
for thing in things:
    print(thing.tag)
Which generates:
td
td
td
td
td
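If you want to check how the parser rearranged things on your own input, a rough sketch like the following prints the ancestor chain of every td after parsing. The markup here is a made-up example with a nested tr, not the page from the question, and the exact nesting you see may depend on the lxml/libxml2 version.

```python
from lxml import html

# Hypothetical markup with a tr nested inside another tr (invalid HTML).
page = (
    "<html><body><table>"
    "<tr><td>outer td</td>"
    "<tr><td>inner td</td></tr>"
    "</tr>"
    "</table></body></html>"
)

tree = html.fromstring(page)

# List each td together with its ancestors (nearest parent first),
# which shows where the parser placed it in the repaired tree.
for td in tree.xpath("//td"):
    ancestors = [a.tag for a in td.iterancestors()]
    print(td.text, "<-", " < ".join(ancestors))

# Whatever the parser did with the rows, //td still reaches every cell.
texts = tree.xpath("//td/text()")
print(texts)
```

This is also why //td/text() is the safer choice here: it does not depend on how the rows ended up nested.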