d2jsp
d2jsp Forums > Off-Topic > Computers & IT > Programming & Development > Data Mining
Member
Posts: 10,228
Joined: Oct 27 2008
Gold: 9,834.89
Apr 16 2016 12:48pm
This is what I want to do:

Create a new tab in Google Chrome.
"Loop" through website URLs (e.g. www.1.com, www.2.com, www.3.com, www.4.com) in whatever order I want (up to 10,000, let's say).
Access the source code of each website, grab certain data, and store it somewhere (e.g. Notepad) based on a condition (e.g. if I see the text "sale", I store the website URL and the word after "sale").

What's the way to approach this? Thank you
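
For reference, the condition-and-store step described above could be sketched roughly like this in Python 3. The names `word_after` and `record_matches` and the tab-separated output format are made up for illustration, and the sketch assumes the page text has already been downloaded as a string.

```python
import re


def word_after(text, keyword="sale"):
    """Return the word following `keyword` in `text`, or None if there is no such match."""
    # \b keeps "sale" from matching inside "wholesale" or "sales".
    pattern = r"\b" + re.escape(keyword) + r"\b\W+(\w+)"
    match = re.search(pattern, text, re.IGNORECASE)
    return match.group(1) if match else None


def record_matches(pages, out_path, keyword="sale"):
    """Append 'url<TAB>word' to out_path for each (url, text) pair that matches."""
    with open(out_path, "a", encoding="utf-8") as out:
        for url, text in pages:
            word = word_after(text, keyword)
            if word is not None:
                out.write(f"{url}\t{word}\n")
```

The output is a plain text file that opens in Notepad; `pages` can be any iterable of `(url, text)` pairs, so the download step can be swapped out freely.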
Member
Posts: 32,925
Joined: Jul 23 2006
Gold: 3,804.50
Apr 16 2016 12:54pm
why do you want this in chrome? it would make more sense to do it outside of a browser.

pick your favourite web scraper. or you can just make your own. i like python's mechanize as a virtual browser if you need cookies/authentication/etc. if you don't need it, pick whatever language you're comfortable with to make an http call and parse html. i used node/jquery, but python, java, c#, etc can all do it.

This post was edited by carteblanche on Apr 16 2016 12:54pm
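
As a rough illustration of "make an http call and parse html" without any third-party scraper, here is a minimal Python 3 sketch using only the standard library. `TextExtractor`, `html_to_text` and `fetch_html` are hypothetical names, and the 10-second timeout and UTF-8 fallback are assumptions, not anything from the thread.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def fetch_html(url, timeout=10):
    """Download a page and decode it, replacing undecodable bytes."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Something like `html_to_text(fetch_html(url))` would then give text to search. Sites that need cookies or a login would still call for something like mechanize, as mentioned above.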
Member
Posts: 10,228
Joined: Oct 27 2008
Gold: 9,834.89
Apr 16 2016 01:03pm
Quote (carteblanche @ Apr 16 2016 10:54am)
why do you want this in chrome? it would make more sense to do it outside of a browser.

pick your favourite web scraper. or you can just make your own. i like python's mechanize as a virtual browser if you need cookies/authentication/etc. if you don't need it, pick whatever language you're comfortable with to make an http call and parse html. i used node/jquery, but python, java, c#, etc can all do it.


I would like to use Java. Do you know how I can make my own web scraper? (I know none)

edit: Would this work? http://jaunt-api.com/

This post was edited by Michael515 on Apr 16 2016 01:04pm
Member
Posts: 13,425
Joined: Sep 29 2007
Gold: 0.00
Warn: 20%
Apr 16 2016 01:40pm
Code
require "http/client"
10000.times do |i|
  HTTP::Client.get("http://www.#{i}.com") do |response|
    File.open("#{i}.source", "w") do |file|
      file.puts response.body
    end
  end
end


Done.
Member
Posts: 3,028
Joined: Mar 23 2016
Gold: 7,568.50
Apr 16 2016 01:42pm
Code
import urllib2
import re

max = 100

for i in range(1, max):
    try:
        page = urllib2.urlopen("http://www." + str(i) + ".com").read()
        if re.search(" sale ", page, re.DOTALL|re.IGNORECASE) != None:
            ##whatever
    except:
        ##whatever


This post was edited by annexusquam on Apr 16 2016 01:42pm
Member
Posts: 10,228
Joined: Oct 27 2008
Gold: 9,834.89
Apr 16 2016 04:00pm
Quote (AbDuCt @ Apr 16 2016 11:40am)
Code
require "http/client"
10000.times do |i|
  HTTP::Client.get("http://www.#{i}.com") do |response|
    File.open("#{i}.source", "w") do |file|
      file.puts response.body
    end
  end
end


Done.


Quote (annexusquam @ Apr 16 2016 11:42am)
Code
import urllib2
import re

max = 100

for i in range(1, max):
    try:
        page = urllib2.urlopen("http://www." + str(i) + ".com").read()
        if re.search(" sale ", page, re.DOTALL|re.IGNORECASE) != None:
            ##whatever
    except:
        ##whatever


Where would you guys run this code in?
Member
Posts: 32,925
Joined: Jul 23 2006
Gold: 3,804.50
Apr 16 2016 04:02pm
Quote (Michael515 @ Apr 16 2016 06:00pm)
Where would you guys run this code in?


looks like ruby and python
Member
Posts: 3,028
Joined: Mar 23 2016
Gold: 7,568.50
Apr 16 2016 06:42pm
Quote (Michael515 @ Apr 16 2016 11:00pm)
Where would you guys run this code in?


python 2.7 but you'll have to replace the ##comments with something, at least
Code
pass
before it'll work.
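
For anyone on Python 3 (where `urllib2` no longer exists), one way the loop with the placeholders filled in might look is sketched below. `fetch_page` and `find_sale_pages` are invented names, the numbered `.com` URLs are just the hypothetical ones from the question, and treating every `OSError` or `ValueError` as "skip this site" is an assumption.

```python
import re
from urllib.request import urlopen


def fetch_page(url, timeout=10):
    """Download a page as text, replacing undecodable bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def find_sale_pages(urls, fetch=fetch_page):
    """Return the URLs whose page contains ' sale ' (case-insensitive)."""
    hits = []
    for url in urls:
        try:
            page = fetch(url)
        except (OSError, ValueError):
            # Unreachable host, HTTP error, timeout, or malformed URL: skip it.
            continue
        if re.search(" sale ", page, re.IGNORECASE) is not None:
            hits.append(url)
    return hits


# Hypothetical numbered URLs, as in the code quoted above.
urls = ["http://www.%d.com" % i for i in range(1, 100)]
```

Calling `find_sale_pages(urls)` does the actual crawl. The `fetch` parameter exists so the loop can be tried against canned strings first, without hitting the network.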

Member
Posts: 13,425
Joined: Sep 29 2007
Gold: 0.00
Warn: 20%
Apr 16 2016 07:03pm
Quote (carteblanche @ Apr 16 2016 06:02pm)
looks like ruby and python


I wish Ruby's HTTP library was that easy to use. Crystal has by far a better/simpler syntax for HTTP-related stuff. If you compare the two side by side you will notice the difference right away:

http://ruby-doc.org/stdlib-2.3.0/libdoc/net/http/rdoc/Net/HTTP.html
http://crystal-lang.org/api/HTTP/Client.html

A ruby implementation would be more like:

Code
require "net/http"

10000.times do |i|
  # Net::HTTP.get takes a host and a path and returns the body as a String
  body = Net::HTTP.get("www.#{i}.com", "/")
  File.open("#{i}.source", "w") do |file|
    file.puts body
  end
end


Edit: I've been using Crystal a lot recently just because it performs a lot better than Ruby. I had to write a recursive function for traversing a JSON string:

Code
def collectComments(db, imageJson, parentComment = "")
  imageJson.each do |parent|
    if !parent["children"].as_a.empty?
      collectComments(db, parent["children"], parent["comment"])
    end

    if parentComment != ""
      childComment = parent["comment"]
      db.execute("Insert Into Comments Values (?, ?)", parentComment.to_s, childComment.to_s)
    end
  end
end


If I had tried to use Ruby for that, it would probably take minutes to parse one JSON response rather than a few seconds, because the JSON string has about 70 entries in the main array, and each entry has children that have children of their own, sometimes up to 10 levels deep.
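
For comparison with the Python posted earlier in the thread, the same recursive walk might be sketched like this. It assumes each entry is a dict with `"comment"` and `"children"` keys, as in the Crystal code, and it returns the (parent, child) pairs instead of inserting them into a database; `collect_comment_pairs` is a made-up name.

```python
import json


def collect_comment_pairs(comments, parent_comment=None):
    """Return (parent, child) comment pairs from a nested comment tree."""
    pairs = []
    for node in comments:
        children = node.get("children") or []
        if children:
            # Descend first, passing this node's comment down as the parent.
            pairs.extend(collect_comment_pairs(children, node["comment"]))
        if parent_comment is not None:
            pairs.append((parent_comment, node["comment"]))
    return pairs


sample = json.loads(
    '[{"comment": "a", "children": ['
    '{"comment": "b", "children": [{"comment": "c", "children": []}]}]},'
    ' {"comment": "d", "children": []}]'
)
print(collect_comment_pairs(sample))  # [('b', 'c'), ('a', 'b')]
```

Top-level comments have no parent, so they only show up as the first element of a pair.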

This post was edited by AbDuCt on Apr 16 2016 07:06pm
Member
Posts: 2,551
Joined: Mar 23 2016
Gold: 1,230.11
Apr 25 2016 10:15am
Quote (annexusquam @ 16 Apr 2016 18:42)
python 2.7 but you'll have to replace the ##comments with something, at least
Code
pass
before it'll work.


2.7 is the way to go. Funny how I don't see many people using the newer versions.