Html Parsing Line By Line

Question

I'm working on Python code intended to parse HTML. The objective is to find strings in each line and change them as seen below: Original: 'Criar Alerta'

Solution 1:

I believe that the following is what you are looking for.

Let's use 3 lines, two of which contain words in your dictionary, and one doesn't - just to test the code:

rep = """
<li class="current"><a style="color:#00233C;" href="index.html"><i class="icon icon-home"></i>  Início</a></li>
<li class="current"><a style="color:#00233C;" href="index.html"><i class="icon icon-home"></i>  Nunca</a></li>
<li class="current"><a style="color:#00233C;" href="index.html"><i class="icon icon-home"></i>  Criar Alerta</a></li>
"""

And use your dictionary (hint: it's never a good idea to define a dictionary as dict; it's just asking for trouble somewhere down the road...)
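To make the hint concrete, here is a minimal sketch of the kind of trouble meant. The function name `shadowed` is made up for illustration; the point is only that once `dict` is rebound to a mapping, the built-in constructor is no longer reachable under that name in that scope.

```python
def shadowed():
    # Hypothetical example: the local name hides the built-in dict type.
    dict = {"Início": "Start"}
    # This now tries to call the mapping itself and raises TypeError.
    return dict([("Ajuda", "Help")])


try:
    shadowed()
    failed = False
except TypeError as err:
    failed = True
    print("Shadowing broke dict():", err)

# With a different variable name, the built-in keeps working as usual.
rep_dict = dict([("Ajuda", "Help")])
print(rep_dict)
```
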

rep_dict = {
    "Início": "Start",
    "Ajuda": "Help",
    "Criar Alerta": "Create Alert",
    "Materiais e Estruturas": "Structures and Materials"
}

Now to the code:

from bs4 import BeautifulSoup

soup = BeautifulSoup(rep, 'lxml')  # the 'lxml' parser requires the lxml package

only_a_tags = soup.find_all('a')

for tag in only_a_tags:
    for word in rep_dict:
        # Print only the tags that contain one of the dictionary keys
        if word in str(tag):
            print(str(tag).replace(word, rep_dict[word]))

Output:

<a href="index.html" style="color:#00233C;"><i class="icon icon-home"></i>  Start</a>
<a href="index.html" style="color:#00233C;"><i class="icon icon-home"></i>  Create Alert</a>

The item containing "Nunca" was not printed because "Nunca" is not a key in rep_dict.
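If you need the whole document back with the `<li>` wrappers intact, rather than just the printed `<a>` strings, one option is to replace the text nodes in place and serialize the soup afterwards. The following is only a sketch under a few assumptions: the function name `translate_links` is made up, it uses the built-in 'html.parser' instead of 'lxml' so that no `<html>`/`<body>` wrapper is added, and it only replaces a text node when its stripped content exactly matches a dictionary key.

```python
from bs4 import BeautifulSoup

# Sample markup and a shortened mapping, taken from the example above.
html = (
    '<li class="current"><a style="color:#00233C;" href="index.html">'
    '<i class="icon icon-home"></i>  Início</a></li>\n'
    '<li class="current"><a style="color:#00233C;" href="index.html">'
    '<i class="icon icon-home"></i>  Nunca</a></li>\n'
    '<li class="current"><a style="color:#00233C;" href="index.html">'
    '<i class="icon icon-home"></i>  Criar Alerta</a></li>'
)
rep_dict = {"Início": "Start", "Criar Alerta": "Create Alert"}


def translate_links(markup, mapping):
    """Return markup with the text inside <a> tags replaced per mapping."""
    soup = BeautifulSoup(markup, "html.parser")
    for a_tag in soup.find_all("a"):
        # Copy the text nodes into a list first, so the tree is not
        # modified while it is being searched.
        for node in list(a_tag.find_all(string=True)):
            key = node.strip()
            if key in mapping:
                # Keep the surrounding whitespace, swap only the words.
                node.replace_with(node.replace(key, mapping[key]))
    return str(soup)


result = translate_links(html, rep_dict)
print(result)
```

Because the match is on the whole stripped text node, a link whose text merely contains a key as part of a longer phrase is left untouched. Whether that is what you want depends on your file.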

Solution 2:

@Jack Fleeting

In the example below, I want to replace "Início" by "Start":

Original:

<li class="current"><a style="color:#00233C;" href="index.html"><i class="icon icon-home"></i>  Início</a></li>

Expected result:

<li class="current"><a style="color:#00233C;" href="index.html"><i class="icon icon-home"></i>  Start</a></li>

An example from the dictionary:

dict = {
    "Início": "Start",
    "Ajuda": "Help",
    "Criar Alerta": "Create Alert",
    "Materiais e Estruturas": "Structures and Materials"
    ...
}

Below is the code I've written to practice HTML parsing with BeautifulSoup. (I noticed that all strings to be replaced are inside "a" tags, so I used SoupStrainer("a").)

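from bs4 import BeautifulSoup
from bs4 import SoupStrainer

# html_file is assumed to hold the path to the HTML file
with open(html_file, 'rb') as src:
    doc = src.read()  # the with block closes the file automatically

# Parse only the <a> tags, since all strings to replace are inside them
only_a_tags = SoupStrainer("a")
parse_1 = 'html.parser'
soup = BeautifulSoup(doc, parse_1, parse_only=only_a_tags)

print(soup.prettify())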

The original line is parsed and printed as follows:

<a href="index.html" style="color:#00233C;"><i class="icon icon-home"></i>
 Início
</a>

Given the output above, I'm not certain whether I'll be able to obtain the expected result.

My intention is to find the string(s) in each line, look up the equivalent in the dictionary, and perform the replacement.

For now, I want to know how to perform this string replacement using BeautifulSoup. After that, I will write a 'for' loop to perform the replacement for all of the lines in the HTML file.

My first attempt (before knowing about BeautifulSoup) was to work on a .txt version of the HTML file read as binary, which proved very time-consuming and unproductive.
