
Python Beautifulsoup Extract Value Without Identifier

I am facing a problem and don't know how to solve it properly. I want to extract the price (130€ in the first example, and 130€ in the second as well). The problem is that the attribute

Solution 1:

How about a findAll solution? First collect all possible id prefixes, then iterate over them and look up the label and content element for each prefix:

>>> from bs4 import BeautifulSoup
>>> import re
>>> html = """
...        <span id="07_lbl" class="lbl">Price:</span>
...        <span id="07_content" class="content">130  €</span>
...        <span id="08_lbl" class="lbl">Value:</span>
...        <span id="08_content" class="content">90000  €</span>
...
...
...        <span id="03_lbl" class="lbl">Price:</span>
...        <span id="03_content" class="content">130  €</span>
...        <span id="04_lbl" class="lbl">Value:</span>
...        <span id="04_content" class="content">90000  €</span>
... """
>>>
>>> soup = BeautifulSoup(html, "html.parser")
>>> # Collect the id prefix ("07", "08", ...) of every span whose id ends in "_content"
>>> span_id_prefixes = [
...     span['id'].replace("_content", "")
...     for span in soup.findAll('span', attrs={'id': re.compile(r'(_content$)')})
... ]
>>> for prefix in span_id_prefixes:
...     lbl     = soup.find('span', attrs={'id': '%s_lbl' % prefix})
...     content = soup.find('span', attrs={'id': '%s_content' % prefix})
...     if lbl and content:
...         print(lbl.text, content.text)
...
Price: 130  €
Value: 90000  €
Price: 130  €
Value: 90000  €
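If you need the pairs as data rather than printed output, the same prefix idea could be wrapped in a small helper. This is only a sketch: the name `label_value_pairs` is made up for illustration, and it assumes every value span has an id of the form `<prefix>_content` with a matching `<prefix>_lbl`. Whitespace inside the text is collapsed so the results are easier to compare.

```python
import re
from bs4 import BeautifulSoup

HTML = """
<span id="07_lbl" class="lbl">Price:</span>
<span id="07_content" class="content">130  €</span>
<span id="08_lbl" class="lbl">Value:</span>
<span id="08_content" class="content">90000  €</span>
"""

def label_value_pairs(html):
    """Hypothetical helper: return (label, value) tuples matched by id prefix."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for content in soup.find_all("span", id=re.compile(r"_content$")):
        # Strip the "_content" suffix to get the shared prefix, e.g. "07"
        prefix = content["id"][: -len("_content")]
        lbl = soup.find("span", id=prefix + "_lbl")
        if lbl is not None:
            # Collapse runs of whitespace so "130  €" becomes "130 €"
            value = " ".join(content.get_text().split())
            pairs.append((lbl.get_text(strip=True), value))
    return pairs

pairs = label_value_pairs(HTML)
# Keep only the values whose label is "Price:"
prices = [value for label, value in pairs if label == "Price:"]
print(prices)
```

Filtering on the label text afterwards gives just the prices, which is what the question asked for.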

Solution 2:

Here is how you would easily extract only the price values like you had in mind in your original post.

html = """
        <span id="07_lbl" class="lbl">Price:</span>
        <span id="07_content" class="content">130  €</span>
        <span id="08_lbl" class="lbl">Value:</span>
        <span id="08_content" class="content">90000  €</span>
        <span id="03_lbl" class="lbl">Price:</span>
        <span id="03_content" class="content">130  €</span>
        <span id="04_lbl" class="lbl">Value:</span>
        <span id="04_content" class="content">90000  €</span>
"""

from bs4 import BeautifulSoup
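soup = BeautifulSoup(html, "html.parser")

# Find every <span> whose text is exactly 'Price:'
# (the string= argument was called text= in older bs4 releases)
price_texts = soup.find_all('span', string='Price:')
for element in price_texts:
    # .next_sibling (an attribute, not a method) might work, too, with a parent
    # element present, but it can return the whitespace between the tags
    price_value = element.find_next_sibling('span')
    print(price_value.get_text())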

# It prints:
# 130  €
# 130  €

This solution has less code and, IMO, is clearer.
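To illustrate the comment about next_sibling: when there is whitespace between the two spans, the plain `.next_sibling` attribute returns that whitespace string, while `find_next_sibling('span')` skips ahead to the next span tag. A small sketch, assuming a wrapping `<div>` that is not part of the original markup:

```python
from bs4 import BeautifulSoup

# The <div> wrapper is an assumption for this example
html = """<div>
<span id="07_lbl" class="lbl">Price:</span>
<span id="07_content" class="content">130  €</span>
</div>"""

soup = BeautifulSoup(html, "html.parser")
label = soup.find("span", string="Price:")

# The node directly after the label is the newline between the two tags
raw_sibling = label.next_sibling

# find_next_sibling ignores text nodes and returns the next <span> tag
tag_sibling = label.find_next_sibling("span")

print(repr(raw_sibling))
print(tag_sibling.get_text())
```

So `find_next_sibling` is the safer choice whenever the markup is pretty-printed.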

Solution 3:

Try Beautiful Soup's select function. It uses CSS selectors:

# soup_expose_html is the BeautifulSoup object for the page
for span in soup_expose_html.select('span[id$="_content"]'):
    print(span.text)

The result is a list of all spans whose id ends with _content.
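Note that this selector matches the Value spans as well as the Price spans. One way to keep only the prices is to check the label span just before each match. A rough sketch, assuming the label is always the closest preceding span sibling as in the markup from the question:

```python
from bs4 import BeautifulSoup

html = """
<span id="07_lbl" class="lbl">Price:</span>
<span id="07_content" class="content">130  €</span>
<span id="08_lbl" class="lbl">Value:</span>
<span id="08_content" class="content">90000  €</span>
"""

soup = BeautifulSoup(html, "html.parser")

def clean(text):
    # Collapse runs of whitespace, e.g. "130  €" -> "130 €"
    return " ".join(text.split())

# Every span whose id ends with "_content" (prices and values alike)
all_values = [clean(span.get_text()) for span in soup.select('span[id$="_content"]')]

# Keep only the spans whose preceding label span reads "Price:"
price_values = []
for span in soup.select('span[id$="_content"]'):
    label = span.find_previous_sibling("span")
    if label is not None and label.get_text(strip=True) == "Price:":
        price_values.append(clean(span.get_text()))

print(all_values)
print(price_values)
```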
