[Answer]-BeautifulSoup parse returning empty set

0👍

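Looking at the docs, attrs is an awkwardly designed argument and easy to get wrong: it has to be spelled attrs and given a dict, whereas any other keyword, including a misspelled attr, is treated like **kwargs and becomes a filter on an HTML attribute of that name.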

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class indicates that you actually want to pass the class_ kwarg:

>>> from bs4 import BeautifulSoup
>>> src = """ <div class="s">
...    <div>
...       <div class="f kv" style="white-space:nowrap">
...          <cite class="vurls">www.somewebsite.com/</cite>&lrm;
...       </div>
...    </div>
... </div>
...
... """
>>> soup = BeautifulSoup(src)
>>> soup.find_all('cite')
[<cite class="vurls">www.somewebsite.com/</cite>]
>>> soup.find_all('cite', attr={'class': 'vurls'})
[]
>>> soup.find_all('cite', class_='vurls')
[<cite class="vurls">www.somewebsite.com/</cite>]
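For comparison, here is a minimal sketch of the same lookup as a script rather than a REPL session. It assumes Python 3 with bs4 installed and names the built-in "html.parser" explicitly; the variable names are only illustrative. It is meant to show that the attrs dict (with the "s") finds the tag just like class_ does, while the misspelled attr keyword does not.

```python
from bs4 import BeautifulSoup

SRC = """
<div class="s">
  <div>
    <div class="f kv" style="white-space:nowrap">
      <cite class="vurls">www.somewebsite.com/</cite>
    </div>
  </div>
</div>
"""

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(SRC, "html.parser")

# Two equivalent ways to filter on the CSS class.
by_kwarg = soup.find_all("cite", class_="vurls")
by_attrs = soup.find_all("cite", attrs={"class": "vurls"})

# Misspelled keyword: treated as a filter on an HTML attribute called "attr".
by_typo = soup.find_all("cite", attr={"class": "vurls"})

for label, found in [("class_", by_kwarg), ("attrs", by_attrs), ("attr", by_typo)]:
    print(label, [tag.get_text() for tag in found])
```

If the first two lists are also empty on a real page, the problem is more likely in the markup or the decoding step than in the filter itself.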

1👍

Check the value of the headers variable and report back; it looks like you still have the wrong encoding:

def url_list(self):
    #setup mechanize
    ###
    ### Mechanize settings are here.
    ###

    for url in urls:
        rawMechSiteInfo = mech.open(url)  #mechanize browse each url
        mech_response = mech.response()
        headers = mech_response.info()
        print "headers ", headers.getheader('Content-Type')
        #results = unicode(mech_response.read()) 
        #BSObjOfUrl = BeautifulSoup(results)
        #HarvestLinks = BSObjOfUrl.find_all(u'cite', class_='vurls')
    #return HarvestLinks
    return
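Once you know what the Content-Type header says, one possible next step is to decode the body with the charset it declares. This is only a sketch in Python 3 using the standard library: charset_from_content_type is a hypothetical helper, and raw and header are stand-ins for mech_response.read() and headers.getheader('Content-Type').

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, or `default`."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when the header carries no charset parameter.
    return msg.get_content_charset() or default


# Stand-ins (assumptions) for the response body and its header value.
raw = "www.somewebsite.com/\u200e".encode("utf-8")
header = "text/html; charset=UTF-8"

encoding = charset_from_content_type(header)
text = raw.decode(encoding)  # decode with the declared charset instead of guessing
print(encoding, repr(text))
```

If the server sends no charset at all, the helper falls back to UTF-8, which is a guess and may need adjusting for the site in question.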

0👍

I have never used mechanize, but I use urllib2 and beautifulsoup4 all the time.
I have run into encoding and decoding issues several times, so maybe some of my experience will help.

When you read text from the page with elem.text, you always get unicode. Sometimes printing unicode directly to the screen works fine; sometimes the console does not display it correctly. That leaves two cases:

  1. You have already read the data in correctly, and the only problem is that it does not display properly in the IDE console (Eclipse, PyCharm, etc.). You can still write the unicode to your database or file without doing anything extra, and it will often be displayed correctly when you view the data outside the IDE.

  2. If you want to see the text while you are writing your code (who doesn’t?), you can print elem.text.encode('utf-8'), which I have always had good luck with.
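To make the difference between the text and its encoded form concrete, here is a small sketch. It is written for Python 3, where str is already unicode, and the text value is just a stand-in for elem.text.

```python
# Stand-in (assumption) for elem.text: a URL followed by an invisible U+200E mark.
text = "www.somewebsite.com/\u200e"

# Encoding gives explicit bytes that do not depend on the console's settings.
encoded = text.encode("utf-8")

# Decoding with the same codec gets the original text back unchanged.
decoded = encoded.decode("utf-8")

print(repr(text))     # repr() makes invisible characters easier to spot
print(encoded)
print(decoded == text)
```

Printing the repr or the encoded bytes is only a debugging aid; what you store in a file or database should still be the decoded text written with an explicit encoding.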
