Which HTML parser should I choose, and why doesn't BeautifulSoup work?

I need to parse an HTML page in the *windows-1251* charset (it's in Russian). The problem is that this is a web application, and I have to use Python 2.4 with no way to install modules on the server myself. The only thing I could do was ask the administrator to install the **lxml** module, but it wasn't built correctly for 2.4, and importing **lxml.html** fails. Now I'm choosing between **BeautifulSoup** and **html5lib**, but I haven't found any simple examples of using html5lib (I just need to extract the text from a particular **div** element, stripping all the other tags inside it). BeautifulSoup, in turn, raises the error ***'junk characters in start tag: u'\u041f\u0440\u043e\u0434\u0430\u0436\u0430>'***, and decoding the source page from **CP1251** to **unicode** or any other charset didn't help. What am I doing wrong? Or which parser should I use?
Which release of BeautifulSoup are you using? See crummy.com/software/BeautifulSoup/3.1-problems.html: avoid 3.1.* (unless you're using Python 3) and stick with 3.0.x (for x >= 8).
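
For the task itself (decode the windows-1251 bytes, then pull the text out of one **div** with the inner tags stripped), here is a rough sketch using only the standard library. It is written for a current Python 3 and its `html.parser` module, not for the 2.4 server in the question. The older `HTMLParser` module in Python 2 is much stricter, and as far as I know it is where the "junk characters in start tag" message comes from. The `id` value `"ad"`, the sample markup, and the names `DivTextExtractor` and `extract_div_text` are made up for illustration.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text inside the first <div> whose id equals target_id."""

    def __init__(self, target_id):
        HTMLParser.__init__(self)
        self.target_id = target_id
        self.depth = 0       # > 0 while we are inside the target div
        self.done = False    # True once the target div has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            # Track nested divs so we stop at the matching close tag.
            if tag == "div":
                self.depth += 1
        elif tag == "div" and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        # Keep text only; tags inside the div never reach this method.
        if self.depth:
            self.chunks.append(data)


def extract_div_text(raw_bytes, target_id, encoding="windows-1251"):
    """Decode raw_bytes, then return the whitespace-normalized div text."""
    text = raw_bytes.decode(encoding)  # bytes -> unicode before parsing
    parser = DivTextExtractor(target_id)
    parser.feed(text)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Hypothetical page fragment, encoded the way the server would send it.
sample = (
    '<html><body><div id="ad"><b>Продажа</b> <a href="#">квартир</a></div>'
    '<div id="other">ignored</div></body></html>'
).encode("windows-1251")

print(extract_div_text(sample, "ad"))
```

The point of the sketch is the order of operations: decode to unicode once, at the boundary, and only then hand the text to the parser. With BeautifulSoup 3.0.x the same two steps apply, but the parsing calls are different.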
