Web Scraping - Topic

Member

Posts: 2,303

Joined: Oct 11 2007

Gold: 0.00

Feb 24 2015 03:08pm

Hi guys,

I don't really know if I'm going to get anything from posting here but I have some sort of a problem with one of my side projects.

Basically, I run a web scraping program I wrote in Java. I keep adding more and more websites to it, and I ran into a problem trying to scrape a website that loads its content dynamically using AJAX.

So here's what I've done so far.

-Added Selenium dependencies
-Installed the PhantomJS headless browser

After that I load the website using the combination of Selenium and PhantomJS so that it runs all the JavaScript etc.

My main problem is that every single page takes up to 5 secs to load and I have to visit ~400 pages, so a full run can take up to about 33 minutes.

Note that I've disabled image loading etc., and that PhantomJS does not render any display.

Could anyone suggest something to help me? Another library to look into?

Thank you.

j0ltk0la

Member

Posts: 62,215

Joined: Jun 3 2007

Gold: 9,039.20

Feb 25 2015 03:32am

You should only be using Selenium when you have to. It is very slow to render anything and shouldn't be relied upon.

I would look into threading and running your crawler asynchronously when Selenium doesn't need to be used. If you don't have a condition set up to skip PhantomJS, then this might not be helpful.
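To make the threading idea concrete, here is a minimal sketch in Java that pushes a list of URLs through a fixed-size thread pool. The class name `ParallelCrawl`, the method `fetchAll`, the `fetcher` function, and the example.com URLs are all hypothetical placeholders, not part of anyone's actual crawler; `fetcher` stands in for whatever loads a single page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelCrawl {

    // Runs `fetcher` for every URL on a fixed-size pool and returns the
    // results in the same order as the input list.
    // `fetcher` is a hypothetical stand-in for the code that loads one page.
    public static List<String> fetchAll(List<String> urls,
                                        Function<String, String> fetcher,
                                        int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String url : urls) {
                tasks.add(() -> fetcher.apply(url));
            }
            List<String> pages = new ArrayList<>();
            // invokeAll blocks until every task is done; the returned
            // futures are in the same order as the submitted tasks.
            for (Future<String> future : pool.invokeAll(tasks)) {
                pages.add(future.get());
            }
            return pages;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URLs and a fake fetcher, for illustration only.
        List<String> urls = Arrays.asList(
                "http://example.com/page/1",
                "http://example.com/page/2");
        List<String> pages = fetchAll(urls, url -> "<html>" + url + "</html>", 8);
        System.out.println(pages.size() + " pages fetched");
    }
}
```

The pool size (8 here) is a guess. How far it can be raised depends on the target site and on how much memory each headless browser instance needs.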

When I run scrapers in Python I like to go barebones and only use requests unless I need something more. I like getting information from the network requests directly instead of dirty parsing.
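The same "go to the network request directly" idea can be sketched in Java with the JDK's built-in `java.net.http.HttpClient` (Java 11 or newer). This only helps if the page's JavaScript pulls its data from a URL you can spot in the browser's network tab. The endpoint below is a made-up placeholder, and the `X-Requested-With` header is an assumption about what such an endpoint might expect.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class DirectAjaxFetch {

    // Builds a GET request that imitates the XHR call made by the page's
    // JavaScript. The headers are assumptions; copy the real ones from the
    // browser's network tab for the site in question.
    public static HttpRequest buildRequest(String endpoint) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .timeout(Duration.ofSeconds(10))
                .header("X-Requested-With", "XMLHttpRequest")
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    // Sends the request and returns the raw response body (often JSON).
    public static String fetch(HttpClient client, String endpoint) throws Exception {
        HttpResponse<String> response =
                client.send(buildRequest(endpoint), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) {
        // Hypothetical endpoint; nothing is sent here, the request is only built.
        HttpRequest request = buildRequest("http://example.com/ajax/items?page=1");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Calling `fetch(HttpClient.newHttpClient(), endpoint)` would actually send the request. When it works, this skips the headless browser entirely, so each page costs one HTTP round trip instead of a full render.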

If you want to experiment with other libraries and you don't mind branching out from Java to Python, you could look into Scrapy.

Eleven11

Member

Posts: 2,303

Joined: Oct 11 2007

Gold: 0.00

Feb 25 2015 08:37pm

Quote (j0ltk0la @ 25 Feb 2015 04:32)

Yeah, actually it is a big project and I've already translated it from C#, which did not suit my needs. I have many crawlers; they only use Selenium when their targets have AJAX content.

Thanks for the tip, but I multithreaded it yesterday and got it down to 8 mins. Still not fast enough though. The websites I scrape don't have any web service or anything like that, so dirty parsing is my only option...

AbDuCt

Member

Posts: 13,425

Joined: Sep 29 2007

Gold: 0.00

Warn: 20%

Feb 25 2015 10:18pm

I would try using mechanize, although I believe it is only available for Ruby and Python.

j0ltk0la

Member

Posts: 62,215

Joined: Jun 3 2007