
Reduce Ram Usage In Python Script

I've written a quick little program to scrape book data off of a UNESCO website which contains information about book translations. The code is doing what I want it to, but by the

Solution 1:

I guess I'll just list off some of the problems or possible improvements in no particular order:

  1. Follow PEP 8.

    Right now, you've got lots of variables and functions named using camel-case, like setAuthor. That's not the conventional style for Python; Python code would typically name that set_author (and published_country rather than PublishedCountry, etc.). You can even change the names of some of the things you're calling: for one, BeautifulSoup supports findAll for compatibility, but find_all is recommended.

    Besides naming, PEP 8 also specifies a few other things; for example, you'd want to rewrite this:

    if len(resultNumber) == 0 : return 0

    as this:

    if len(result_number) == 0:
        return 0

    or even taking into account the fact that empty lists are falsy:

    if not result_number:
        return 0

  2. Pass a SoupStrainer to BeautifulSoup.

    The information you're looking for is probably in only part of the document; you don't need to parse the whole thing into a tree. Pass a SoupStrainer as the parse_only argument to BeautifulSoup. This should reduce memory usage by discarding unnecessary parts early.

  3. decompose the soup when you're done with it.

    Python primarily frees memory through reference counting. Removing all circular references (as decompose does) lets reference counting reclaim the tree as soon as you drop it. Python also has a semi-traditional garbage collector to deal with circular references, but reference counting is much faster.

  4. Don't make Book.__init__ write things to disk.

    In most cases, I wouldn't expect just creating an instance of a class to write something to disk. Remove the call to export; let the user call export if they want it to be put on the disk.

  5. Stop holding on to so much data in memory.

    You're accumulating all this data into a dictionary just to export it afterwards. The obvious thing to do to reduce memory is to dump it to disk as soon as possible. Your comment indicates that you're putting it in a dictionary to be flexible; but that doesn't mean you have to collect it all in a list: use a generator, yielding items as you scrape them. Then the user can iterate over it just like a list:

    for book in scrape_books():
        book.export()
    

    …but with the advantage that at most one book will be kept in memory at a time.

  6. Use the functions in os.path rather than munging paths yourself.

    Your code right now is rather fragile when it comes to path names. If I accidentally removed the trailing slash from destinationDirectory, something unintended happens. Using os.path.join prevents that from happening and deals with cross-platform differences:

    >>> os.path.join("/Users/robbie/Test/", "USA")
    '/Users/robbie/Test/USA'
    >>> os.path.join("/Users/robbie/Test", "USA")  # still works!
    '/Users/robbie/Test/USA'
    >>> # or say we were on Windows:
    >>> os.path.join(r"C:\Documents and Settings\robbie\Test", "USA")
    'C:\\Documents and Settings\\robbie\\Test\\USA'
    
  7. Abbreviate attrs={"class":...} to class_=....

    BeautifulSoup 4.1.2 introduces searching with class_, which removes the need for the verbose attrs={"class":...}.
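
Points 2, 3 and 7 can be combined in a few lines. The following is only a minimal sketch: the sample markup, the "record" class name and the extract_records helper are hypothetical stand-ins, since the real page structure isn't shown in the question. It assumes the bs4 package is installed and uses the built-in html.parser backend.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical markup standing in for one results page.
SAMPLE_HTML = """
<html><body>
  <div class="nav"><a href="/">Home</a></div>
  <table>
    <tr><td class="record">First record</td></tr>
    <tr><td class="record">Second record</td></tr>
    <tr><td class="note">Not a record</td></tr>
  </table>
  <div class="footer">Footer text</div>
</body></html>
"""


def extract_records(html):
    """Return the text of every <td class="record"> cell (hypothetical helper)."""
    # Only <td> tags are turned into tree nodes; the rest is discarded early.
    only_cells = SoupStrainer("td")
    soup = BeautifulSoup(html, "html.parser", parse_only=only_cells)

    # class_ is the short form of attrs={"class": "record"}.
    texts = [cell.get_text(strip=True)
             for cell in soup.find_all("td", class_="record")]

    # Copy out plain strings first, then tear the tree down.
    soup.decompose()
    return texts


if __name__ == "__main__":
    print(extract_records(SAMPLE_HTML))
```

The important detail is that plain strings are copied out before decompose is called, so nothing keeps a reference to the tree afterwards.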

I imagine there are even more things you can change, but that's quite a few to start with.
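
Points 4, 5 and 6 also fit together. Here is a rough sketch using only the standard library. The Book fields, the JSON file layout, the export_all helper and the raw_records argument are all assumptions made for illustration; the real scraper would yield books from the pages it downloads.

```python
import json
import os
import tempfile


class Book:
    """Plain data holder; creating one does not touch the disk."""

    def __init__(self, title, published_country):
        self.title = title
        self.published_country = published_country

    def export(self, destination_directory):
        # os.path.join copes with a missing or extra trailing slash.
        country_dir = os.path.join(destination_directory, self.published_country)
        os.makedirs(country_dir, exist_ok=True)
        # Assumed naming scheme: one JSON file per title.
        path = os.path.join(country_dir, self.title + ".json")
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(vars(self), handle)
        return path


def scrape_books(raw_records):
    """Hypothetical stand-in for the scraping loop: yields one Book at a time."""
    for title, country in raw_records:
        yield Book(title, country)


def export_all(raw_records, destination_directory):
    # Each Book is written out and then dropped; only the paths are kept.
    return [book.export(destination_directory)
            for book in scrape_books(raw_records)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(export_all([("Hamlet", "USA"), ("Faust", "DEU")], tmp))
```

Because scrape_books is a generator, a caller who really does want everything in memory can still write list(scrape_books(...)), so no flexibility is lost.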

Solution 2:

What do you want the book list for, in the end? You should export each book at the end of the "for url in range" block (inside the loop) and do without the allbooks dict. If you really need a list, define exactly which pieces of information you need and keep only those, rather than keeping full Book objects.
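
As a rough illustration of that idea (the record layout, the BookSummary fields and the export callback below are hypothetical, not taken from the original script), each record is written out inside the loop and only a small tuple is kept:

```python
from collections import namedtuple

# Keep only the fields needed later, not the whole record.
BookSummary = namedtuple("BookSummary", ["title", "country"])


def process(records, export):
    """Export every record immediately and return lightweight summaries."""
    summaries = []
    for record in records:
        export(record)  # write the full record out right away
        summaries.append(BookSummary(record["title"], record["country"]))
    return summaries


if __name__ == "__main__":
    sample = [{"title": "Hamlet", "country": "USA", "notes": "long text"}]
    print(process(sample, export=print))
```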
