Scraping Customer Reviews From Dm.de
Solution 1:
Like most modern websites, dm.de appears to load its content through JavaScript only after the initial page load. This is a problem because Python's requests library and Scrapy only speak HTTP; they do not execute any JavaScript.
The same thing happens on Amazon, but there it is detected and you get a JavaScript-free version.
You can try this for yourself by disabling JavaScript in your browser and then opening the site you want to scrape.
Solutions include using a scraper that supports JavaScript, or scraping with an automated browser (a full browser supports JavaScript, of course). Selenium with Chromium worked well for me.
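A quick way to confirm that the reviews are missing from the raw HTTP response is to look for the review markup in the HTML you actually received. The following is a minimal sketch using only the standard library. The two HTML strings are made-up stand-ins for a raw response and a browser-rendered page, and the class name `bv-content-list-reviews` is an assumption about what the rendered review list looks like, so check it against the real page.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any start tag carries the given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.found = True


def has_class(html, css_class):
    """Return True if any element in `html` has `css_class`."""
    finder = ClassFinder(css_class)
    finder.feed(html)
    return finder.found


# Hypothetical stand-ins: what requests might see vs. what a browser renders.
static_html = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'
rendered_html = '<html><body><ol class="bv-content-list bv-content-list-reviews"><li>...</li></ol></body></html>'

print(has_class(static_html, "bv-content-list-reviews"))    # False
print(has_class(rendered_html, "bv-content-list-reviews"))  # True
```

If the check fails on `requests.get(...).text` but succeeds on the page source exported from a browser, the content is being injected by JavaScript.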
Solution 2:
I don't have time to play around with the parameters, but everything needed to get that JSON back is in the request URL.
import requests
import json

num_reviews = 100
query = 'passkey=caYXUVe0XKMhOqt6PdkxGKvbfJUwOPDhKaZoAyUqWu2KE&apiversion=5.5&displaycode=18357-de_de&resource.q0=reviews&filter.q0=isratingsonly%3Aeq%3Afalse&filter.q0=productid%3Aeq%3A596141&filter.q0=contentlocale%3Aeq%3Ade*%2Cde_DE&sort.q0=submissiontime%3Adesc&stats.q0=reviews&filteredstats.q0=reviews&include.q0=authors%2Cproducts%2Ccomments&filter_reviews.q0=contentlocale%3Aeq%3Ade*%2Cde_DE&filter_reviewcomments.q0=contentlocale%3Aeq%3Ade*%2Cde_DE&filter_comments.q0=contentlocale%3Aeq%3Ade*%2Cde_DE&limit.q0=' +str(num_reviews) + '&offset.q0=0&limit_comments.q0=3&callback=bv_1111_19110'
url = "https://api.bazaarvoice.com/data/batch.json?"
request_url = url + query
response = requests.get(request_url)

# The response is JSONP, i.e. callback({...}); strip the wrapper to get plain JSON
jsonStr = response.text.split('(', 1)[-1].rsplit(')', 1)[0]
jsonData = json.loads(jsonStr)

reviews = jsonData['BatchedResults']['q0']['Results']
for each in reviews:
    print('Rating: %s\n%s\n' % (each['Rating'], each['ReviewText']))
Output:
Rating:5
Immer wieder zufrieden
Rating:5
ich bin mit dem Produkt sehr zufrieden und kann es nur weiterempfehlen.
Rating:5
Super Creme - zieht schnell ein - angenehmer Geruch - hält lange vor - nicht fettend - ich hatte schon das Gefühl, dass meine Falten weniger geworden sind. Sehr zu empfehlen
Rating:5
Das Produkt erfüllt meine Erwärtungen in jeder Hinsicht-ich kaufe es gerne immer wieder
Rating:5
riecht super, zieht schnell ein und hinterlsst ein tolles Hautgefhl
Rating:3
ganz ok...die Creme fühlt sich nur etwas seltsam an auf der Haut...ich konnte auch nicht wirklich eine Verbesserung des Hautbildes erkennen
Rating:4
Für meinen Geschmack ist das Produkt zu fettig/dick zum auftauen.
Rating:1
Ich bin seit mehreren Jahren treuer Benutzer von L'oreal Produkten und habe bis jetzt immer das blaue Gesichtsgel verwendet. Mit dem war ich mehr als zufrieden. Jetzt habe ich die rote Creme gekauft und bin total enttäuscht. Nach ca. einer Stunde entwickelt sich ein sehr seltsamer Geruch, es riecht nach ranssigem Öl! Das ist im Gesicht nicht zu ertragen.
....
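The only non-obvious step above is unwrapping the JSONP response, since the `callback` parameter makes the API return `bv_1111_19110({...})` rather than bare JSON. Here is a small sketch of that step in isolation. The `unwrap_jsonp` helper name and the sample string are mine; the sample is a tiny hand-written imitation of the response shape, not real API output.

```python
import json


def unwrap_jsonp(text):
    """Parse the JSON payload out of a JSONP string such as callback({...})."""
    start = text.index('(') + 1   # first '(' opens the callback call
    end = text.rindex(')')        # last ')' closes it
    return json.loads(text[start:end])


# Hand-written sample in the same shape as the batch response (not real output).
sample = ('bv_1111_19110({"BatchedResults": {"q0": {"Results": '
          '[{"Rating": 5, "ReviewText": "Immer wieder zufrieden"}]}}})')

data = unwrap_jsonp(sample)
for review in data['BatchedResults']['q0']['Results']:
    print('Rating: %s\n%s\n' % (review['Rating'], review['ReviewText']))
```

Taking the first `(` and the last `)` means parentheses inside review texts do not break the parsing.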
Edit:
There is a ton of cleaning up to do to make this more compact, but here's the basic query:
import requests
import json
url = "https://api.bazaarvoice.com/data/batch.json"
num_reviews = 100
payload = {
    'passkey': 'caYXUVe0XKMhOqt6PdkxGKvbfJUwOPDhKaZoAyUqWu2KE',
    'apiversion': '5.5',
    'displaycode': '18357-de_de',
    'resource.q0': 'reviews',
    'filter.q0': 'productid:eq:596141',
    'sort.q0': 'submissiontime:desc',
    'stats.q0': 'reviews',
    'filteredstats.q0': 'reviews',
    'include.q0': 'authors,products,comments',
    'filter_reviews.q0': 'contentlocale:eq:de*,de_DE',
    'filter_reviewcomments.q0': 'contentlocale:eq:de*,de_DE',
    'filter_comments.q0': 'contentlocale:eq:de*,de_DE',
    'limit.q0': str(num_reviews),
    'offset.q0': '0',
    'limit_comments.q0': '3',
    'resource.q1': 'reviews',
    'filter.q1': 'productid:eq:596141',
    'sort.q1': 'submissiontime:desc',
    'stats.q1': 'reviews',
    'filteredstats.q1': 'reviews',
    'include.q1': 'authors,products,comments',
    'filter_reviews.q1': 'contentlocale:eq:de*,de_DE',
    'filter_reviewcomments.q1': 'contentlocale:eq:de*,de_DE',
    'filter_comments.q1': 'contentlocale:eq:de*,de_DE',
    'limit.q1': str(num_reviews),
    'offset.q1': '0',
    'limit_comments.q1': '3',
    'resource.q2': 'reviews',
    'filter.q2': 'productid:eq:596141',
    'sort.q2': 'submissiontime:desc',
    'stats.q2': 'reviews',
    'filteredstats.q2': 'reviews',
    'include.q2': 'authors,products,comments',
    'filter_reviews.q2': 'contentlocale:eq:de*,de_DE',
    'filter_reviewcomments.q2': 'contentlocale:eq:de*,de_DE',
    'filter_comments.q2': 'contentlocale:eq:de*,de_DE',
    'limit.q2': str(num_reviews),
    'offset.q2': '0',
    'limit_comments.q2': '3',
    'callback': 'bv_1111_19110',
}
response = requests.get(url, params=payload)

# The response is JSONP, i.e. callback({...}); strip the wrapper to get plain JSON
jsonStr = response.text.split('(', 1)[-1].rsplit(')', 1)[0]
jsonData = json.loads(jsonStr)

# Print the reviews returned by every query in the batch (q0, q1, q2)
for k, v in jsonData['BatchedResults'].items():
    for each in v['Results']:
        print('Rating: %s\n%s\n' % (each['Rating'], each['ReviewText']))
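As for making it more compact: the three query blocks differ only in their `q0`/`q1`/`q2` suffix, so the payload can be generated in a loop. Below is a rough sketch with a hypothetical `build_payload` helper. Note one deliberate change: it gives each query its own offset (0, 100, 200) so the batch walks through successive pages, whereas the payload above uses offset 0 for all three. That paging behaviour is my assumption and I have not verified it against the API.

```python
def build_payload(product_id, page_size=100, pages=3):
    """Build batch parameters with one query (q0, q1, ...) per page of reviews."""
    payload = {
        'passkey': 'caYXUVe0XKMhOqt6PdkxGKvbfJUwOPDhKaZoAyUqWu2KE',
        'apiversion': '5.5',
        'displaycode': '18357-de_de',
        'callback': 'bv_1111_19110',
    }
    locale = 'contentlocale:eq:de*,de_DE'
    for i in range(pages):
        q = 'q%d' % i
        payload.update({
            'resource.' + q: 'reviews',
            'filter.' + q: 'productid:eq:%s' % product_id,
            'sort.' + q: 'submissiontime:desc',
            'stats.' + q: 'reviews',
            'filteredstats.' + q: 'reviews',
            'include.' + q: 'authors,products,comments',
            'filter_reviews.' + q: locale,
            'filter_reviewcomments.' + q: locale,
            'filter_comments.' + q: locale,
            'limit.' + q: str(page_size),
            # assumption: a different offset per query fetches the next slice
            'offset.' + q: str(i * page_size),
            'limit_comments.' + q: '3',
        })
    return payload


payload = build_payload('596141')
print(len(payload))
print(payload['offset.q0'], payload['offset.q1'], payload['offset.q2'])
```

The result can be passed straight to `requests.get(url, params=payload)` in place of the hand-written dictionary.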
Solution 3:
I tried hard to scrape dm product detail pages properly with Scrapy and bs4, but could not get a fully accurate scraper. That's why I decided to move to Selenium. It is slow, but it gives 100% accurate scraping results.
import random
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

# `url` must be set to the dm.de product detail page you want to scrape.
driver = webdriver.Chrome()  # assumption: any Selenium WebDriver should work here

# one list per scraped field, filled page by page
reviews, stars, titles, users = [], [], [], []
hauttyps, dates, helpUp, helpDown = [], [], [], []

try:
    driver.get(url)
    print("Current URL is Valid --> OK")
    print("Current URL : ", url)
except Exception as e:
    print("URL : ", url, " -->> is Invalid!!!")
    print("Error Occurred : ", e)
    driver.quit()
    raise  # nothing more to do without a loaded page

driver.maximize_window()
driver.set_page_load_timeout(10)

## close overlay and cookies
time.sleep(round(random.uniform(1.0, 1.5), 2))  # give time to properly load the page initially
try:
    driver.find_element(By.XPATH, '//*[@id="custom-layer-wrapper"]/section/header/button').click()
    driver.find_element(By.XPATH, '//*[@id="overlays"]/div[2]/div/div/div[2]/button').click()
except Exception as e:
    print(e)

driver.execute_script("window.scrollTo(0, document.body.scrollHeight*0.65);")  # scroll down to next review page button
time.sleep(round(random.uniform(4.5, 5.5), 2))  # give the reviews time to load

while True:
    try:
        # iterate through each comment page
        response = driver.execute_script("return document.documentElement.outerHTML")  # export rendered HTML
        # now extract the reviews
        soup = BeautifulSoup(response, 'lxml')
        soup = soup.find('ol', {'class': 'bv-content-list-reviews'})
        # product_title = product_title + soup.find('div',{'data-dmid' : 'detail-page-headline'}).text
        reviews += soup.find_all('div', {'class': 'bv-content-summary-body-text'})
        stars += soup.find_all('span', {'class': 'bv-content-rating bv-rating-ratio'})
        titles += soup.find_all('div', {'class': 'bv-content-title-container'})
        users += soup.find_all('div', {'class': 'bv-content-author-name'})
        hauttyps += soup.find_all('div', {'class': 'bv-content-tag-dimensions'})
        dates += soup.find_all('div', {'class': 'bv-content-datetime'})
        # for item in driver.find_elements_by_css_selector('[itemprop="dateCreated"]'):
        #     dates.append(item.get_attribute('content'))
        helpUp += soup.find_all('button', {'class': 'bv-content-btn-feedback-yes'})
        helpDown += soup.find_all('button', {'class': 'bv-content-btn-feedback-no'})

        ## Go to next Review page
        # button_next = driver.find_element_by_xpath('//*[@id="BVRRContainer"]/div/div/div/div/div[3]/div/ul/li[2]/a/span[2]')
        # button_next = driver.find_element_by_css_selector('#BVRRContainer > div > div > div > div > div.bv-content-pagination > div > ul > li.bv-content-pagination-buttons-item.bv-content-pagination-buttons-item-next > a > span.bv-content-btn-pages-next')
        button_next = driver.find_element(By.PARTIAL_LINK_TEXT, '►')
        button_next.location_once_scrolled_into_view  # accessing this property scrolls the button into view
        button_next.click()
        time.sleep(round(random.uniform(2.5, 3.0), 2))  # give the next page time to load
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight*0.90);")  # scroll down to next review page button
        time.sleep(round(random.uniform(4.5, 5.0), 2))  # give the reviews time to load
    except Exception as e:
        # no review list or no "next" button found: assume this was the last page
        print(e)
        print("----REACHED THE LAST PAGE-----")
        break

time.sleep(3)
driver.quit()