Scraping @Type From HTML Script Into A Csv File Using Pandas

March 25, 2023

I am trying web scraping for the first time and am having a lot of trouble, mainly because the website I am supposed to use does its best to block scraping libraries. I downl

Solution 1:

import json
import re

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0'
}


def main(url, page):
    params = {
        'page': page,
        'sort': 'dd',
        'filter': 'reviews-dd'
    }
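    r = requests.get(url, params=params, headers=headers)
    # The page embeds its state as JSON.parse("<escaped JSON>"); capture the
    # quoted string literal that is passed to JSON.parse.
    match = re.search(r'\.parse\((.*)\)', r.text).group(1)
    # The first loads() unescapes the string literal into JSON text,
    # the second parses that text into a dict.
    goal = json.loads(json.loads(match))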

    print(goal.keys())


main('https://www.zomato.com/beirut/divvy-ashrafieh/reviews', 1)

Output:

dict_keys(['pages', 'blogData', 'pageUrlMappings', 'careers', 'allJobs', 'department', 'aboutus', 'sneakpeek', 'apiState', 'entities', 'user', 'uiLogic', 'location', 'gAds', 'footer', 'langKeys', 'deviceSpecificInfo', 'pageBlockerInfo', 'fullPageAds', 'networkState', 'fetchConfigs', 'hrefLangInfo', 'pageConfig', 'partnershipLoginModal', 'partnershipLoginOptionModal', 'doesNotDeliverModal', 'backButton'])
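
The two nested json.loads calls are needed because the page hands a JavaScript string literal to JSON.parse: the first call unescapes that literal into JSON text, and the second turns the text into a dict. A minimal sketch on a made-up one-line script (the sample string and its values are hypothetical, not the site's real payload):

```python
import json
import re

# Hypothetical stand-in for the relevant part of r.text.
sample = r'window.__PRELOADED_STATE__ = JSON.parse("{\"pages\":{\"current\":1},\"user\":null}");'

# Capture everything between ".parse(" and the last ")", quotes included.
literal = re.search(r'\.parse\((.*)\)', sample).group(1)

json_text = json.loads(literal)  # first pass: string literal -> JSON text (str)
state = json.loads(json_text)    # second pass: JSON text -> Python dict

print(type(json_text).__name__, type(state).__name__)
print(list(state.keys()))
```

Calling json.loads only once here would return a str rather than a dict, which is an easy mistake to miss until a key lookup fails.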

Solution 2:

From the soup, you can select the <script> that contains the text

window.__PRELOADED_STATE__ = .....

and then:

1. Extract the string (an escaped JSON string literal) by stripping off the surrounding JavaScript.
2. Convert it to a Python object using the json module.
3. Extract the data you need from the result.
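
Selecting the right <script> can be sketched with the standard library alone. The example below uses html.parser instead of a soup so it runs without extra packages; with BeautifulSoup the same selection would typically be a loop over soup.find_all('script'). The class name ScriptCollector, the helper find_state_script, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser

MARKER = 'window.__PRELOADED_STATE__'


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self.in_script = True
            self.scripts.append('')

    def handle_endtag(self, tag):
        if tag == 'script':
            self.in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last entry.
        if self.in_script:
            self.scripts[-1] += data


def find_state_script(html):
    """Return the text of the first <script> containing MARKER, or None."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    return next((s.strip() for s in collector.scripts if MARKER in s), None)


# Hypothetical page with one unrelated script and one state script.
sample_html = '''<html><head>
<script>var unrelated = 1;</script>
<script>window.__PRELOADED_STATE__ = JSON.parse("{\\"pages\\":{}}");</script>
</head><body></body></html>'''

x = find_state_script(sample_html)
print(x)
```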

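In my code, x is the text of the above-mentioned <script> element (a str)

import json

# lstrip()/rstrip() strip sets of characters, not fixed strings, so remove the
# JavaScript wrapper with removeprefix()/removesuffix() (Python 3.9+) and keep
# the surrounding quotes.
x = x.strip()
x = x.removeprefix('window.__PRELOADED_STATE__ = JSON.parse(')
x = x.removesuffix(');')

# The first loads() unescapes the string literal; the second parses the JSON.
json_string = json.loads(json.loads(x))

json_string is now a Python dict (the parsed JSON), so you can pull the data you need from it by key.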
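
Once the state is a dict, getting it into a CSV file is mostly a matter of flattening the part you care about into rows. The sketch below is an assumption-heavy illustration: 'entities' is one of the real top-level keys, but the 'REVIEWS' level, the field names, and the values are invented, so inspect the actual dict before adapting it. It uses the standard csv module to stay dependency-free.

```python
import csv
import io

# Hypothetical, heavily simplified shape; the real state nests much deeper.
state = {
    'entities': {
        'REVIEWS': {
            '101': {'rating': 4, 'text': 'Good food'},
            '102': {'rating': 2, 'text': 'Slow service'},
        }
    }
}

# One row per review, with the dict key carried along as an id column.
rows = [
    {'review_id': review_id, **fields}
    for review_id, fields in state['entities']['REVIEWS'].items()
]

# Write to an in-memory buffer here; open('reviews.csv', 'w', newline='')
# would be used for a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['review_id', 'rating', 'text'])
writer.writeheader()
writer.writerows(rows)

print(buffer.getvalue())
```

With pandas installed, pandas.DataFrame(rows).to_csv('reviews.csv', index=False) should write the same table from the same list of rows.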
