How To Extract URL From HTML Anchor Element Using Python3?
I want to extract a URL from a web page's HTML source. Example: xyz.com source code: Download XYZ I want to e
Solution 1:
You can use the built-in xml.etree.ElementTree instead:
>>> import xml.etree.ElementTree as ET
>>> url = '<a href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
>>> ET.fromstring(url).attrib.get('href')
'/example/hello/get/9f676bac2bb3.zip'
This works for this particular example, but xml.etree.ElementTree is not an HTML parser. Consider using BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup(url, 'html.parser').a.get('href')
'/example/hello/get/9f676bac2bb3.zip'
Or, lxml.html:
>>> import lxml.html
>>> lxml.html.fromstring(url).attrib.get('href')
'/example/hello/get/9f676bac2bb3.zip'
Personally, I prefer BeautifulSoup: it makes HTML parsing easy, transparent and fun.
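If installing a third-party package is not an option, the standard library's html.parser.HTMLParser is one more possibility. The following is only a minimal sketch: the class name LinkExtractor, the helper extract_links() and the sample page are made up for illustration. It also shows why xml.etree.ElementTree is a poor fit for real pages: a single unclosed <br> is enough to make it raise ParseError.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Hypothetical helper: return all (href, text) pairs found in html."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample: the unclosed <br> is valid HTML but not well-formed XML.
page = '<p>Files:<br><a href="/example/hello/get/9f676bac2bb3.zip">XYZ</a></p>'

try:
    ET.fromstring(page)
    et_ok = True
except ET.ParseError:
    et_ok = False  # ElementTree rejects the document

links = extract_links(page)
```

Because each entry carries the link text as well, picking the link labelled XYZ is a matter of filtering links on the second item of each pair.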
To follow the link and download the file, you need to build a full URL including the scheme and domain (urljoin() would help) and then use urlretrieve(). Example:
>>> BASE_URL = 'http://example.com'
>>> from urllib.parse import urljoin
>>> from urllib.request import urlretrieve
>>> href = BeautifulSoup(url, 'html.parser').a.get('href')
>>> urlretrieve(urljoin(BASE_URL, href))
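When called with only a URL, urlretrieve() saves to a temporary file and returns its path along with the response headers. If you would rather keep the original file name, something along these lines might work. This is a sketch under assumptions: resolve_download() and download() are hypothetical helper names, the page URLs are placeholders, and the "download.bin" fallback is an arbitrary choice.

```python
import os
import posixpath
from urllib.parse import urljoin, urlparse
from urllib.request import urlretrieve


def resolve_download(page_url, href):
    """Return (absolute_url, filename) for an href found on page_url."""
    full_url = urljoin(page_url, href)
    # URL paths always use forward slashes, hence posixpath
    filename = posixpath.basename(urlparse(full_url).path)
    return full_url, filename


def download(page_url, href, dest_dir="."):
    """Fetch the linked file into dest_dir and return the local path."""
    full_url, filename = resolve_download(page_url, href)
    # Fallback name (an assumption) for URLs whose path ends with a slash
    target = os.path.join(dest_dir, filename or "download.bin")
    urlretrieve(full_url, target)
    return target


page_url = "http://example.com/files/index.html"  # placeholder page address

# An href starting with "/" is resolved against the site root
root_relative = resolve_download(page_url, "/example/hello/get/9f676bac2bb3.zip")

# An href without a leading "/" is resolved against the page's directory
page_relative = resolve_download(page_url, "get/9f676bac2bb3.zip")

# An href that is already absolute is returned unchanged
absolute = resolve_download(page_url, "https://cdn.example.org/9f676bac2bb3.zip")
```

Passing the address of the page the link came from, rather than just the bare domain, matters for the second case: a relative href is interpreted relative to that page's directory.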
UPD (for the different html posted in comments):
>>> from bs4 import BeautifulSoup
>>> data = '<html> <head> <body><example><example2> <a href="/example/hello/get/9f676bac2bb3.zip">XYZ</a> </example2></example></body></head></html>'
>>> href = BeautifulSoup(data, 'html.parser').find('a', string='XYZ').get('href')
>>> href
'/example/hello/get/9f676bac2bb3.zip'