Options for HTML scraping?

I'm thinking of trying [Beautiful Soup][1], a Python package for HTML scraping. Are there any other HTML scraping packages I should be looking at? Python is not a requirement; I'm actually interested in hearing about other languages as well.

The story so far:

- Python
  - [Beautiful Soup][2]
  - [lxml][3]
  - [HTQL][4]
  - [Scrapy][5]
  - [Mechanize][6]
- Ruby
  - [Nokogiri][7]
  - [Hpricot][8]
  - [Mechanize][9]
  - [scrAPI][10]
  - [scRUBYt!][11]
  - [wombat][12]
  - [Watir][13]
- .NET
  - [Html Agility Pack][14]
  - [WatiN][15]
- Perl
  - [WWW::Mechanize][16]
  - [Web-Scraper][17]
- Java
  - [Tag Soup][18]
  - [HtmlUnit][19]
  - [Web-Harvest][20]
  - [jARVEST][21]
  - [jsoup][22]
  - [Jericho HTML Parser][23]
- JavaScript
  - [request][24]
  - [cheerio][25]
  - [artoo][26]
  - [node-horseman][27]
  - [phantomjs][28]
- PHP
  - [Goutte][29]
  - [htmlSQL][30]
  - [PHP Simple HTML DOM Parser][31]
  - [PHP Scraping with CURL][32]
  - [ScarletsQuery][33]
- Most of them
  - [Screen-Scraper][34]

[1]: http://en.wikipedia.org/wiki/Beautiful_Soup
[2]: http://www.crummy.com/software/BeautifulSoup/
[3]: http://codespeak.net/lxml/
[4]: http://htql.net/
[5]: http://scrapy.org/
[6]: http://wwwsearch.sourceforge.net/mechanize/
[7]: http://nokogiri.org/
[8]: https://github.com/hpricot/hpricot/
[9]: https://github.com/tenderlove/mechanize
[10]: http://rubyforge.org/projects/scrapi/
[11]: http://scrubyt.org/
[12]: https://github.com/felipecsl/wombat
[13]: http://watir.com
[14]: http://html-agility-pack.net/?z=codeplex
[15]: http://watin.org/
[16]: http://search.cpan.org/dist/WWW-Mechanize/
[17]: http://search.cpan.org/dist/Web-Scraper/
[18]: http://home.ccil.org/~cowan/XML/tagsoup/
[19]: http://htmlunit.sourceforge.net/
[20]: http://web-harvest.sourceforge.net/
[21]: http://sing.ei.uvigo.es/jarvest
[22]: http://jsoup.org/
[23]: http://jericho.htmlparser.net/docs/index.html
[24]: https://github.com/request/request
[25]: https://github.com/cheeriojs/cheerio
[26]: http://medialab.github.io/artoo/
[27]: https://github.com/johntitus/node-horseman
[28]: http://phantomjs.org/
[29]: https://github.com/FriendsOfPHP/Goutte
[30]: https://github.com/hxseven/htmlSQL
[31]: http://sourceforge.net/projects/simplehtmldom/
[32]: http://www.jacobward.co.uk/web-scraping-with-php-curl-part-1/
[33]: https://github.com/ScarletsFiction/ScarletsQuery
[34]: http://www.screen-scraper.com/

(related) Best Methods to parse HTML
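
To make concrete what I mean by "HTML scraping", here is a minimal sketch of the kind of task I'd want a package to handle: pulling every link and its text out of a page. It deliberately uses only Python's standard-library `html.parser`, not any of the packages named in this question, and the `LinkCollector` class, the `extract_links` helper, and the sample markup are all made up for illustration. I'd expect a dedicated scraping package to do this in far fewer lines and to cope better with badly broken markup.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> in a document.

    Hypothetical helper for illustration; not part of any listed package.
    """

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) tuples
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a> that has an href
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample input; note the first <li> is never closed.
SAMPLE = """
<ul>
  <li><a href="http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a>
  <li><a href="http://jsoup.org/">jsoup</a></li>
</ul>
"""

links = extract_links(SAMPLE)
for href, text in links:
    print(text, href)
```

Anchors without an `href` are skipped in this sketch, and nested tags inside a link contribute only their text.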
