I need to parse an HTML page in the *windows-1251* charset (it's in Russian).
The problem is that this is a web application and I have to use Python 2.4, with no way to install modules on the server. The only thing I tried was asking an administrator to install the **lxml** module, but it wasn't built correctly for 2.4, and importing **lxml.html** fails.
Now I'm trying to choose between **BeautifulSoup** and **html5lib**, but I couldn't find any simple examples of using html5lib (I just need to extract the text from a certain **div** element, stripping all the tags inside it). BeautifulSoup, in turn, raises the error ***'junk characters in start tag: u'\u041f\u0440\u043e\u0434\u0430\u0436\u0430>'***, and decoding the source page from **CP1251** to **unicode** or any other charset didn't help.
What am I doing wrong? Or which parser should I use?

Which release of BeautifulSoup are you using? See crummy.com/software/BeautifulSoup/3.1-problems.html; avoid 3.1.* (unless you're using Python 3) and stick with 3.0.x (for x >= 8).
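For what it's worth, here is a rough sketch of pulling the text out of one **div** with BeautifulSoup 3.0.x. I have not run it on 2.4, and several things are assumptions for illustration only: the helper name `extract_div_text`, the `content` id, and the sample markup. It relies on the `fromEncoding` argument so that BeautifulSoup does the CP1251 decoding itself, and on `findAll(text=True)` to keep only the text nodes.

```python
# Sketch only: assumes BeautifulSoup 3.0.x is importable (for example as a
# BeautifulSoup.py file placed next to the script).

# Hypothetical stand-in for the raw bytes of a windows-1251 page.
SAMPLE = (u'<div id="content"><b>\u041f\u0440\u043e\u0434\u0430\u0436\u0430</b>'
          u' <i>1</i></div>').encode('cp1251')


def extract_div_text(raw_html, div_id, soup_class):
    """Return the text of <div id=div_id> with nested tags stripped, or None."""
    # Pass the undecoded bytes and let BeautifulSoup 3 decode them.
    soup = soup_class(raw_html, fromEncoding='windows-1251')
    div = soup.find('div', {'id': div_id})
    if div is None:
        return None
    # findAll(text=True) returns only the text nodes inside the div.
    return u''.join(div.findAll(text=True)).strip()


try:
    from BeautifulSoup import BeautifulSoup  # module name used by 3.0.x
except ImportError:
    BeautifulSoup = None  # not available here; the helper is left unused

if BeautifulSoup is not None:
    text = extract_div_text(SAMPLE, 'content', BeautifulSoup)
```

The parser class is passed in as an argument only so the helper can be tried without BeautifulSoup present; in real code you would call `BeautifulSoup(...)` directly.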