Java - Open A Url To Get Page Source
Retired Moderator
Posts: 21,073
Joined: Apr 7 2008
Gold: 5,135.90
Trader: Trusted
Jan 7 2014 12:45am
Wondering if anyone can help me with this.

First off, why is this in Java? Because that is what I am most familiar with. If you know of any easier alternatives, please let me know.

ArtofApocalypse and myself maintain a spreadsheet in the league subforum found here: http://forums.d2jsp.org/topic.php?t=69074957&f=133
We want to have something that will check all of the URLs in the second column and see if there is a change in the current rank.

I have a program working that does what we need, but the way I open the URL randomly hangs (it works perfectly on the first half of the list, then gets stuck on one of the pages and cannot open it).
The code I use for opening the url is this:
Code
URL url = new URL(inputUrl);
URLConnection spoof = url.openConnection();

//Spoof the connection so we look like a web browser
spoof.setRequestProperty( "User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)" );

BufferedReader in = new BufferedReader(new InputStreamReader(spoof.getInputStream())); //Hangs here every time.

... the rest of my code that is needed ...


It randomly gets stuck halfway through with this solution. Does anybody know a better way to open the connection? The only thing I need is the page source.
-----------------------------------------
Another question as well. Has anybody done any work with opening an Excel file in Java? I am using Apache POI, but it only works with the 97-03 Excel format and explodes if you have a bunch of formatting. Are there any alternatives for this?

Thanks.

E/UPDATE: I isolated which line it is failing on. It seems to be pretty random. It failed on the 25th link once, the 47th the second time, etc. It hangs at those for over 10 minutes before I stop it.
Perhaps it would go through eventually, but this is totally useless if it is going to hang for 10+ mins.

This post was edited by Kagura on Jan 7 2014 01:51am
Member
Posts: 5,269
Joined: Oct 18 2006
Gold: 21,400.00
Jan 7 2014 02:42am
I recently started using Selenium for my scraping programs. It is mostly meant for navigating through webpages, but it can give you the page source so you can parse it.

As for your code, I do not see the mistake. I have a couple of programs that I have been using for a while that contain essentially the same line that is hanging for you. My only guess is that the website you are parsing is not fooled by your spoof and is blocking the content.

As for the Excel question: I also used Apache POI for a long time to read and write Excel files. I have since moved on to just writing CSV or tab-delimited files, but that only works if you are the one writing what you read. I have not found a library for the newer Excel format.
Member
Posts: 11,610
Joined: Oct 28 2008
Gold: 1,795.00
Jan 7 2014 03:13am
Oh, how I loathe spreadsheets.
Member
Posts: 10,812
Joined: Oct 15 2009
Gold: Locked
Warn: 20%
Jan 7 2014 09:22am
Quote (Kagura @ Jan 6 2014 11:45pm)
E/UPDATE: I isolated which line it is failing on. It seems to be pretty random. It failed on the 25th link once, the 47th the second time, etc. It hangs at those for over 10 minutes before I stop it.
Perhaps it would go through eventually, but this is totally useless if it is going to hang for 10+ mins.
As far as the code goes, I can't say; I don't do Windows and I don't do Java. The fact that it works for a while and then stops working for about 10 minutes makes me think you are getting IP banned. You are probably hitting those links too fast. Try putting some sleep/wait statements in the loop.

Oh, and as someone else mentioned, I would go with comma-separated text files for the spreadsheet; that way you will never have a problem reading them.
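
A minimal sketch of the comma-separated approach in plain Java, with no spreadsheet library involved. The class name `RankCsv` and the name,url,rank column layout are assumptions made for illustration (the thread only says the URLs sit in the second column), and the parser assumes no field contains a comma or quotes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reads and writes simple comma-separated rows.
// Assumes fields never contain commas, quotes, or line breaks.
class RankCsv {

    // Split each non-blank line into its fields; -1 keeps trailing empty fields.
    static List<String[]> parse(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            rows.add(line.split(",", -1));
        }
        return rows;
    }

    // Join fields back into one CSV line.
    static String toLine(String... fields) {
        return String.join(",", fields);
    }
}
```

The lines themselves can be read with `java.nio.file.Files.readAllLines` and written back with `Files.write`; each row's second element (`row[1]`) would then be the URL to check.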

Member
Posts: 5,269
Joined: Oct 18 2006
Gold: 21,400.00
Jan 7 2014 11:14am
Quote (Azrad @ Jan 7 2014 08:22am)
As far as the code goes, I can't say; I don't do Windows and I don't do Java. The fact that it works for a while and then stops working for about 10 minutes makes me think you are getting IP banned. You are probably hitting those links too fast. Try putting some sleep/wait statements in the loop.

Oh, and as someone else mentioned, I would go with comma-separated text files for the spreadsheet; that way you will never have a problem reading them.


This is a great point and definitely possible. I've only been IP banned from one website using this method (easy fix since I work with the company and I wasn't hurting them). The problem with this scenario is that you would probably get banned from then on. When I was banned, I couldn't even manually go to their website on my IP address until the admin cleared my IP.

Another possibility is that your program is running too fast for the website. If you are running fast consecutive getSource() calls you may want to try adding a delay.
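
The delay suggestion could be sketched roughly as below. `PoliteLoop` and `forEachWithDelay` are hypothetical names, and the right delay is a guess that depends on the site; the action passed in would be whatever code fetches and parses one page.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: runs an action for each URL with a pause in between.
class PoliteLoop {

    static void forEachWithDelay(List<String> urls, long delayMs, Consumer<String> action)
            throws InterruptedException {
        for (int i = 0; i < urls.size(); i++) {
            if (i > 0) {
                Thread.sleep(delayMs); // wait before every request except the first
            }
            action.accept(urls.get(i));
        }
    }
}
```

For example, `PoliteLoop.forEachWithDelay(urls, 2000, url -> System.out.println(url))` would visit the URLs two seconds apart.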
Member
Posts: 10,812
Joined: Oct 15 2009
Gold: Locked
Warn: 20%
Jan 7 2014 11:35am
Quote (xandumx @ Jan 7 2014 10:14am)
This is a great point and definitely possible.  I've only been IP banned from one website using this method (easy fix since I work with the company and I wasn't hurting them).  The problem with this scenario is that you would probably get banned from then on.  When I was banned, I couldn't even manually go to their website on my IP address until the admin cleared my IP.

Another possibility is that your program is running too fast for the website.  If you are running fast consecutive getSource() calls you may want to try adding a delay.


Yeah, I've been IP banned from several servers for this kind of thing. Sometimes the ban lasts a long time (long enough for me to give up on waiting for it to expire), sometimes it is for less than 15 minutes. I guess it just depends on the configuration.
Retired Moderator
Posts: 21,073
Joined: Apr 7 2008
Gold: 5,135.90
Trader: Trusted
Jan 7 2014 12:19pm
I found it odd that whenever I got "stuck", if I started the program again it would run immediately and work again, which made me think they weren't blocking me.
I ran it again with a minute between page requests and it seemed to work fine. I'm going to play around with this a little more, but perhaps that was the only issue. I still don't understand why I can run it again and it works. I assumed that if they block you they would block an IP, but maybe not.
Member
Posts: 10,812
Joined: Oct 15 2009
Gold: Locked
Warn: 20%
Jan 7 2014 04:01pm
Perhaps the function/library you are using has no timeout by default, so it waits for a reply and, if it never gets one, it just continues to block execution. That might explain the unusual behavior. Check the documentation; maybe it has some optional arguments.
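
For `java.net.URLConnection` this guess matches the documented behavior: the connect and read timeouts both default to 0, which means wait forever. A minimal sketch that sets both, reusing the User-Agent string from the first post; the class name `PageFetcher`, the method name `fetch`, and the UTF-8 charset are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: fetches a page's source with explicit timeouts.
class PageFetcher {

    static String fetch(String inputUrl, int connectTimeoutMs, int readTimeoutMs)
            throws IOException {
        URLConnection conn = new URL(inputUrl).openConnection();
        conn.setRequestProperty("User-Agent",
                "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");
        conn.setConnectTimeout(connectTimeoutMs); // 0 (the default) would mean no limit
        conn.setReadTimeout(readTimeoutMs);       // a stalled read now throws SocketTimeoutException

        StringBuilder source = new StringBuilder();
        // Charset is assumed to be UTF-8; a real page may declare a different one.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                source.append(line).append('\n');
            }
        }
        return source.toString();
    }
}
```

With something like `PageFetcher.fetch(inputUrl, 10000, 10000)`, a page that never answers should fail with a `java.net.SocketTimeoutException` after about ten seconds instead of hanging, and the loop can then skip or retry that link.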
Member
Posts: 3,099
Joined: Nov 26 2006
Gold: 0.10
Jan 7 2014 05:30pm
My guess is that you are requesting too many webpages within a short amount of time and the server denies your requests. Try spreading out the requests, or use Riot's API.
Member
Posts: 32,925
Joined: Jul 23 2006
Gold: 3,804.50
Jan 7 2014 05:57pm
Quote (Kagura @ Jan 7 2014 01:19pm)
I found it odd that whenever I got "stuck", if I started the program again it would run immediately and work again, which made me think they weren't blocking me.
I ran it again with a minute between page requests and it seemed to work fine. I'm going to play around with this a little more, but perhaps that was the only issue. I still don't understand why I can run it again and it works. I assumed that if they block you they would block an IP, but maybe not.


It's likely flood control. That's what YouTube's API does. By the time you wait a few seconds to try again, it's all fixed up.
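
If it is flood control, one way to handle it is to retry a failed request after a growing pause rather than restarting the program by hand. This is only a sketch: `Retry` and `withBackoff` are hypothetical names, the doubling delay is one common choice and not something the site is known to require, and it only helps if the request fails with an `IOException` (for example a read timeout) instead of blocking forever.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper: retries a task that fails with IOException,
// doubling the wait between attempts.
class Retry {

    static <T> T withBackoff(Callable<T> task, int maxAttempts, long initialDelayMs)
            throws Exception {
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts, give up
                }
                Thread.sleep(delayMs);
                delayMs *= 2; // wait longer before the next try
            }
        }
    }
}
```

The task passed in would be whatever code opens one URL and reads its source, for example a lambda wrapping the existing `getInputStream()` logic with timeouts set.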