d2jsp Forums > Off-Topic > Computers & IT > Programming & Development > Web Scraping
Member
Posts: 2,303
Joined: Oct 11 2007
Gold: 0.00
Feb 24 2015 03:08pm
Hi guys,

I don't really know if I'm going to get anything from posting here but I have some sort of a problem with one of my side projects.

Basically, I run a web scraping program I wrote in Java. I keep adding more websites to it, and I hit a problem trying to scrape a website that loads its content dynamically using AJAX.

So here's what I've done so far.

-Added the Selenium dependencies
-Installed the PhantomJS headless browser

After that, I load the website using Selenium together with PhantomJS so that all of the page's JavaScript runs.

My main problem is that every single page takes up to 5 seconds to load and I have to visit ~400 pages.

Note that I've disabled image loading etc., and that PhantomJS does not render any display.

Could anyone suggest something to help me? Is there another library I should look into?

Thank you.
Member
Posts: 62,215
Joined: Jun 3 2007
Gold: 9,039.20
Feb 25 2015 03:32am
You should only be using Selenium when you have to. It is very slow to render anything and shouldn't be relied upon.

I would look into threading and running your crawler asynchronously for the pages where Selenium doesn't need to be used. If you don't have a condition set up to skip PhantomJS for those pages, then this might not be helpful.
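
A minimal sketch of that idea, not anyone's actual crawler: pages are fetched concurrently in a thread pool, and the slow browser path is taken only for pages flagged as needing JavaScript. `fetch_static`, `fetch_rendered`, the `needs_js` flag, and the example URLs are all hypothetical stand-ins. The same shape should map onto a Java `ExecutorService`.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Hypothetical stand-ins: a real crawler would issue a plain HTTP request in
# fetch_static and drive Selenium/PhantomJS in fetch_rendered.
def fetch_static(url):
    time.sleep(0.01)   # simulated cheap plain-HTTP fetch
    return f"static:{url}"

def fetch_rendered(url):
    time.sleep(0.05)   # simulated slow headless-browser render
    return f"rendered:{url}"

def fetch(page):
    """Use the browser path only when the page is flagged as needing JavaScript."""
    url, needs_js = page
    return fetch_rendered(url) if needs_js else fetch_static(url)

def crawl(pages, workers=8):
    """Fetch all pages concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, pages))

# Assumed example input: (url, needs_js) pairs.
pages = [("http://example.com/a", False), ("http://example.com/b", True)]
results = crawl(pages)
```

The worker count of 8 is arbitrary here. With a real headless browser each worker would likely need its own driver instance, so memory use grows with the pool size.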

When I run scrapers in Python I like to go barebones and only use requests unless I need something more. I prefer getting the information directly from the network requests instead of doing dirty parsing.
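
To illustrate the "network requests directly" approach: many AJAX pages fill themselves in from a JSON endpoint that is visible in the browser's network tab, and that response can be parsed without rendering anything. The sketch below uses only the standard library and a canned response. The payload shape (`items`, `name`, `price`) is assumed for illustration, and the approach only applies when the site actually exposes such an endpoint.

```python
import json

def parse_items(body):
    """Extract (name, price) pairs from a JSON response body.

    The keys 'items', 'name' and 'price' are assumed for illustration;
    a real endpoint will have its own structure.
    """
    data = json.loads(body)
    return [(item["name"], item["price"]) for item in data.get("items", [])]

# Canned response standing in for what an XHR endpoint might return.
sample_body = '{"items": [{"name": "widget", "price": 9.5}, {"name": "gadget", "price": 12}]}'
rows = parse_items(sample_body)
```

In practice the body would come from an HTTP call, for example `requests.get(endpoint_url).text`, where `endpoint_url` is whatever address the network tab shows for the page in question.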

If you want to experiment with other libraries and you don't mind branching out from Java to Python, you could look into Scrapy.
Member
Posts: 2,303
Joined: Oct 11 2007
Gold: 0.00
Feb 25 2015 08:37pm
Quote (j0ltk0la @ 25 Feb 2015 04:32)
You should only be using Selenium when you have to. It is very slow to render anything and shouldn't be relied upon.

I would look into threading and running your crawler asynchronously for the pages where Selenium doesn't need to be used. If you don't have a condition set up to skip PhantomJS for those pages, then this might not be helpful.

When I run scrapers in Python I like to go barebones and only use requests unless I need something more. I prefer getting the information directly from the network requests instead of doing dirty parsing.

If you want to experiment with other libraries and you don't mind branching out from Java to Python, you could look into Scrapy.


Yeah, actually it is a big project and I've already translated it from C#, which did not suit my needs. I have many crawlers, and they only use Selenium when their targets have AJAX content.

Thanks for the tip, but I multithreaded it yesterday and went down to 8 mins. Still not fast enough, though. The website I scrape doesn't have any web service or anything like that, so dirty parsing is my only option...
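
For a rough sense of scale, here is a back-of-the-envelope calculation using the figures from this thread (about 400 pages, up to 5 seconds each, 8 minutes after multithreading). The 5-second figure is an upper bound, so the sequential total is a worst case and the speedup is only approximate.

```python
pages = 400
worst_case_secs_per_page = 5
threaded_total_secs = 8 * 60

# Worst-case single-threaded run: every page takes the full 5 seconds.
sequential_total_secs = pages * worst_case_secs_per_page

# Average wall-clock cost per page after multithreading.
effective_secs_per_page = threaded_total_secs / pages

# Speedup relative to the worst-case sequential run.
speedup = sequential_total_secs / threaded_total_secs

print(f"sequential worst case: {sequential_total_secs / 60:.1f} min")
print(f"effective time per page: {effective_secs_per_page:.1f} s")
print(f"speedup: {speedup:.1f}x")
```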
Member
Posts: 13,425
Joined: Sep 29 2007
Gold: 0.00
Warn: 20%
Feb 25 2015 10:18pm
I would try using mechanize, although I believe it is only available for Ruby and Python.
Member
Posts: 62,215
Joined: Jun 3 2007
Gold: 9,039.20
Feb 26 2015 03:29am
Quote (AbDuCt @ Feb 25 2015 10:18pm)
I would try using mechanize, although I believe it is only available for Ruby and Python.


There's still no good Mechanize port for Python 3 yet :(

@OP,

AJAX is the bane of web scraping in any language. Selenium is a very reliable way around it, but there isn't a good way to make it faster.