How To Call Correct Class From Url Domain
I am working on a web crawler where I want to call the correct class to scrape the web elements from a given URL. Currently I have created: import sys
Solution 1:
The problem is that k.domain returns bbc, but you wrote url = 'bbc.co.uk', so the lookup key never matches the registered one. Use one of these solutions:
- use url = 'bbc.co.uk' along with k.registered_domain
- use url = 'bbc' along with k.domain
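To make the mismatch concrete, here is a minimal sketch of the lookup with the two candidate keys. The values are hard-coded rather than computed by tldextract, and the plain dict and the lookup helper are stand-ins for the scraper registry, so treat it as an illustration only:

```python
# Stand-in for Scraper.scrapers: the class was registered under 'bbc.co.uk'.
scrapers = {"bbc.co.uk": "BBCScraper"}

# Hard-coded for illustration: for https://www.bbc.co.uk/ the answer above
# says k.domain is 'bbc', while k.registered_domain is 'bbc.co.uk'.
domain = "bbc"
registered_domain = "bbc.co.uk"


def lookup(key):
    """Return the registered scraper name, or None when the key is unknown."""
    return scrapers.get(key)


print(lookup(domain))             # None
print(lookup(registered_domain))  # BBCScraper
```

Whichever option you pick, the key used at registration time and the key used at lookup time have to be the same kind of value.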
Also add a parameter to the scrape method so that it receives the response:
    from abc import abstractmethod

    import requests
    import tldextract


    class Scraper:
        scrapers = {}

        def __init_subclass__(scraper_class):
            Scraper.scrapers[scraper_class.url] = scraper_class

        @classmethod
        def for_url(cls, url):
            k = tldextract.extract(url)
            return cls.scrapers[k.registered_domain]()

        @abstractmethod
        def scrape(self, content: requests.Response):
            pass


    class BBCScraper(Scraper):
        url = 'bbc.co.uk'

        def scrape(self, content: requests.Response):
            return "Scraped BBC News"


    if __name__ == "__main__":
        url = 'https://www.bbc.co.uk/'
        scraper = Scraper.for_url(url)
        r = scraper.scrape(requests.get(url))
        print(r)  # Scraped BBC News
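The registration itself relies on __init_subclass__, which Python calls on the base class every time a subclass is defined. The following sketch isolates that mechanism using only the standard library. Note the assumptions: for_host is a hypothetical helper that matches on the hostname suffix instead of calling tldextract, and ExampleScraper with example.com is an invented second site.

```python
from urllib.parse import urlparse


class Scraper:
    scrapers = {}  # maps a domain string to the subclass that handles it

    def __init_subclass__(cls, **kwargs):
        # Runs once per subclass definition, after the class body is evaluated.
        super().__init_subclass__(**kwargs)
        Scraper.scrapers[cls.domain] = cls

    @classmethod
    def for_host(cls, url):
        # Hypothetical helper: naive suffix match, not a tldextract replacement.
        host = urlparse(url).hostname or ""
        for domain, scraper_class in cls.scrapers.items():
            if host == domain or host.endswith("." + domain):
                return scraper_class()
        raise LookupError(f"no scraper registered for {host!r}")


class BBCScraper(Scraper):
    domain = "bbc.co.uk"

    def scrape(self):
        return "Scraped BBC News"


class ExampleScraper(Scraper):  # hypothetical second site
    domain = "example.com"

    def scrape(self):
        return "Scraped example.com"


print(sorted(Scraper.scrapers))                            # ['bbc.co.uk', 'example.com']
print(Scraper.for_host("https://www.bbc.co.uk/").scrape())  # Scraped BBC News
```

Adding a new site then only requires defining another subclass; the dispatch code does not change.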
Improvement
I'd suggest storing the URL in an attribute and moving the requests.get call into the scrape method, so there is less code in the main block:
    from abc import abstractmethod

    import requests
    import tldextract


    class Scraper:
        scrapers = {}

        def __init_subclass__(scraper_class):
            Scraper.scrapers[scraper_class.domain] = scraper_class

        @classmethod
        def for_url(cls, url):
            k = tldextract.extract(url)
            return cls.scrapers[k.registered_domain](url)

        @abstractmethod
        def scrape(self):
            pass


    class BBCScraper(Scraper):
        domain = 'bbc.co.uk'

        def __init__(self, url):
            self.url = url

        def scrape(self):
            # Annotate the response type; do not assign to requests.Response
            rep: requests.Response = requests.get(self.url)
            content = rep.text  # all HTML content
            return "Scraped BBC News" + content[:20]


    if __name__ == "__main__":
        url = 'https://www.bbc.co.uk/'
        scraper = Scraper.for_url(url)
        r = scraper.scrape()
        print(r)  # Scraped BBC News<!DOCTYPE html><html
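One caveat worth noting: @abstractmethod is only enforced when the class uses ABCMeta, for example by inheriting from abc.ABC. On a plain class, as in the code above, a subclass that forgets scrape can still be instantiated. A possible variant, assuming you want incomplete scrapers rejected at instantiation time, might look like this. IncompleteScraper and example.com are hypothetical, and the network call is left out so the sketch runs offline:

```python
from abc import ABC, abstractmethod


class Scraper(ABC):
    scrapers = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        Scraper.scrapers[cls.domain] = cls

    def __init__(self, url):
        self.url = url

    @abstractmethod
    def scrape(self):
        """Return the scraped result for self.url."""


class BBCScraper(Scraper):
    domain = "bbc.co.uk"

    def scrape(self):
        # The real version would call requests.get(self.url) here.
        return "Scraped BBC News"


class IncompleteScraper(Scraper):  # hypothetical: forgets to implement scrape()
    domain = "example.com"


print(BBCScraper("https://www.bbc.co.uk/").scrape())  # Scraped BBC News

try:
    IncompleteScraper("https://example.com/")
except TypeError as exc:
    print("refused:", type(exc).__name__)  # refused: TypeError
```

Moving __init__ into the base class is a design choice made for this sketch; it saves each scraper from repeating the same constructor.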