Quote (j0ltk0la @ 25 Feb 2015 04:32)
You should only be using Selenium when you have to; it is very slow to render anything and shouldn't be relied upon.
I would look into threading and running your crawler asynchronously when Selenium doesn't need to be used. If you don't have a condition set up to skip PhantomJS, then this might not be helpful.
When I run scrapers in Python I like to go barebones and only use requests unless I need something more. I like getting information from the network requests directly instead of dirty parsing.
If you want to experiment with other libraries and you don't mind branching out from Java to Python, you could look into Scrapy.
Yeah, actually it is a big project, and I've already translated it from C#, which did not suit my needs. I have many crawlers; they only use Selenium when their targets have AJAX content.
Thanks for the tip, but I multithreaded it yesterday and got down to 8 minutes. Still not fast enough, though. The websites I scrape don't have any web service or anything like that, so dirty parsing is my only option.
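
For anyone following along, here is a rough sketch of the setup being discussed: a thread pool where each target only goes through the slow browser path when it really has AJAX content. It is in Python since that's what was suggested above; the real project is Java, so treat this as an illustration of the idea and not the actual code. The names `Target`, `needs_js`, `crawl`, `fetch_plain` and `fetch_rendered` are all made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Target:
    url: str
    needs_js: bool  # hypothetical flag: True when the page builds its content with AJAX


def crawl(
    targets: List[Target],
    fetch_plain: Callable[[str], str],
    fetch_rendered: Callable[[str], str],
    max_workers: int = 8,
) -> Dict[str, str]:
    """Fetch every target in a thread pool, using the slow renderer only when flagged."""

    def fetch(target: Target) -> str:
        # Pick the cheap HTTP path unless this target is flagged as needing a browser.
        fetcher = fetch_rendered if target.needs_js else fetch_plain
        return fetcher(target.url)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its own page.
        pages = pool.map(fetch, targets)
        return {t.url: page for t, page in zip(targets, pages)}
```

In practice `fetch_plain` would probably be something like `lambda url: requests.get(url, timeout=10).text`, and `fetch_rendered` would wrap the Selenium driver. One thing to watch: WebDriver instances are not thread-safe, so each worker thread would most likely need its own driver.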