
Scraping @Type From HTML Script Into A Csv File Using Pandas

I am trying web scraping for the first time and am having a lot of trouble, especially because the website I am supposed to use tries its best to block scraping libraries. I downl

Solution 1:

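import json
import re

import requests

# Browser-like User-Agent, since the question notes that the site blocks scraping libraries.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0'
}


def main(url, page):
    params = {
        'page': page,
        'sort': 'dd',
        'filter': 'reviews-dd'
    }
    r = requests.get(url, params=params, headers=headers)
    r.raise_for_status()
    # Capture the quoted argument of JSON.parse(...) embedded in the page source.
    match = re.search(r'\.parse\((.*)\)', r.text)
    if match is None:
        raise ValueError('preloaded state not found in the response')
    # The argument is a string literal holding JSON, hence the double decode.
    goal = json.loads(json.loads(match.group(1)))

    print(goal.keys())


main('https://www.zomato.com/beirut/divvy-ashrafieh/reviews', 1)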

Output:

dict_keys(['pages', 'blogData', 'pageUrlMappings', 'careers', 'allJobs', 'department', 'aboutus', 'sneakpeek', 'apiState', 'entities', 'user', 'uiLogic', 'location', 'gAds', 'footer', 'langKeys', 'deviceSpecificInfo', 'pageBlockerInfo', 'fullPageAds', 'networkState', 'fetchConfigs', 'hrefLangInfo', 'pageConfig', 'partnershipLoginModal', 'partnershipLoginOptionModal', 'doesNotDeliverModal', 'backButton'])
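
Once goal is a plain dictionary, a small recursive search is one way to locate the fields you want before writing them to a CSV file. The following is only a sketch on made-up data: find_dicts_with_key, the sample dictionary, and the field names reviewText and ratingV2 are hypothetical, and the real layout under keys such as 'entities' has to be inspected on the actual response.

```python
import csv
import io


def find_dicts_with_key(obj, key):
    """Yield every dict nested anywhere inside obj that contains key."""
    if isinstance(obj, dict):
        if key in obj:
            yield obj
        for value in obj.values():
            yield from find_dicts_with_key(value, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_dicts_with_key(item, key)


# Made-up stand-in for the decoded page state; not the site's real structure.
sample = {
    'entities': {
        'REVIEWS': {
            '1': {'reviewText': 'Great food', 'ratingV2': '5'},
            '2': {'reviewText': 'Slow service', 'ratingV2': '2'},
        }
    },
    'user': {'name': 'anonymous'},
}

# One flat row per dict that carries the (hypothetical) reviewText field.
rows = [
    {'text': d['reviewText'], 'rating': d.get('ratingV2', '')}
    for d in find_dicts_with_key(sample, 'reviewText')
]

# Written to an in-memory buffer here; a file opened with
# open('reviews.csv', 'w', newline='') can be used the same way.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['text', 'rating'])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

With pandas installed, pandas.DataFrame(rows).to_csv('reviews.csv', index=False) should produce an equivalent file from the same list of rows.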

Solution 2:

From the soup, you can select the <script> that has the text

window.__PRELOADED_STATE__ = .....

and

  • Extract the string (which holds the data in JSON format) by stripping off the surrounding JavaScript
  • Parse that string into a Python object using the json module
  • Extract the data you need from the resulting object.
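
The steps above can be sketched on a small hand-written page. This is an illustration under assumptions, not the site's real markup: the html string and the state dictionary are made up, and a regular expression stands in for selecting the <script> from the soup so that the example runs with the standard library alone.

```python
import json
import re

# Made-up state and HTML that imitate the pattern described above.
state = {'pages': {'current': 'reviews'}, 'user': {'name': 'anonymous'}}
js_literal = json.dumps(json.dumps(state))  # JSON text wrapped in a quoted, escaped string
html = (
    '<html><body>'
    '<script>window.__PRELOADED_STATE__ = JSON.parse(' + js_literal + ');</script>'
    '</body></html>'
)

# Step 1: isolate the argument of JSON.parse(...), surrounding quotes included.
match = re.search(r'window\.__PRELOADED_STATE__\s*=\s*JSON\.parse\((.*)\);', html)
if match is None:
    raise ValueError('preloaded state not found')
raw = match.group(1)

# Step 2: decode twice: first the quoted string literal, then the JSON inside it.
decoded = json.loads(json.loads(raw))

# Step 3: read whatever you need from the resulting dictionary.
current_page = decoded['pages']['current']
```

The double call to json.loads mirrors Solution 1: the first call turns the quoted literal into a plain string, and the second parses that string into a dictionary.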

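In my code, x is the text of the above-mentioned <script> element (for example, the tag's .string in BeautifulSoup).

import json

# Trim the exact JavaScript wrapper but keep the quoted argument of JSON.parse.
x = x.strip()
x = x.removeprefix('window.__PRELOADED_STATE__ = JSON.parse(')  # Python 3.9+
x = x.removesuffix(');')

# The argument is a quoted string that contains JSON, so decode it twice, as in Solution 1.
data = json.loads(json.loads(x))

data is now a regular Python dictionary, and you can pull the values you need from it by key.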
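
One detail worth checking when trimming fixed text in Python: str.lstrip and str.rstrip treat their argument as a set of characters rather than as a literal prefix or suffix, so they can eat into the payload. A small illustration on a made-up string (payload is hypothetical, chosen so that the effect is visible):

```python
payload = 'JSON.parse("NOPE")'

# lstrip removes any leading characters found in the argument set,
# so it keeps going past the prefix and also consumes 'N' and 'O'.
stripped = payload.lstrip('JSON.parse("')

# removeprefix (Python 3.9+) removes the exact prefix and nothing more.
exact = payload.removeprefix('JSON.parse("')

print(stripped)
print(exact)
```

On Python versions before 3.9, slicing with len(prefix) after a startswith check is a common substitute for removeprefix.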

