Programming Help - Fetching Data From Website (Will Pay Fg For ... Idk)
Member
Posts: 62,215
Joined: Jun 3 2007
Gold: 9,039.20
Dec 5 2014 07:08pm
Code
import requests
from bs4 import BeautifulSoup


def url_list(path='URL-LIST.txt'):
    """Read one URL per line, skipping blank lines."""
    with open(path, 'r') as r:
        # r.read() would return one big string; iterate the file to get lines
        return [line.strip() for line in r if line.strip()]


def htmls(urls):
    """
    Get the page for each URL,
    yield the responses 1-by-1
    """
    for url in urls:
        # yield will create a generator and return one page at a time for you
        # to parse, instead of a full list; more memory efficient and may make
        # more sense for this
        yield requests.get(url)


def table_scraps(page, res=None):
    """
    - Assuming the data is a basic HTML table
      <div id="heh" >
    - adjust soup to search based on attrs

    - Another example
      <td align="right">
    - soup.find_all('td', attrs={"align": "right"})
    - soup.find_all('div', attrs={"id": "heh"})
    - etc.
    """
    if res is None:  # avoid sharing one default list between calls
        res = []
    soup = BeautifulSoup(page.text, 'html.parser')
    content = soup.find_all('div', attrs={"id": "heh"})

    # what you want to do with the data
    # using patterns, stripping tags, etc.
    # insert data into res, a list
    for tag in content:
        res.append(tag.get_text(strip=True))  # placeholder: keep only the text
    return res


if __name__ == "__main__":
    results = []
    for page in htmls(url_list()):
        table_scraps(page, results)
    print(results)


Haven't tested any of it, but it should get you started if you're going the Python route. BeautifulSoup is what I would use for basic HTML scraping.

Actually, I would watch this to get a good overview of scraping in general


This post was edited by killg0re on Dec 5 2014 07:09pm
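
The "read a URL list, then yield one page at a time" idea from the post can be sketched with only the standard library. This is a rough illustration, not the poster's code: `read_urls`, `fetch`, and `pages` are made-up names, and `urllib` stands in for `requests` so the sketch has no third-party dependencies.

```python
from typing import Callable, Iterable, Iterator, List
from urllib.request import urlopen


def read_urls(lines: Iterable[str]) -> List[str]:
    """Keep one URL per line, dropping blank lines and surrounding whitespace."""
    return [line.strip() for line in lines if line.strip()]


def fetch(url: str) -> str:
    """Download one page as text (stdlib stand-in for requests.get(url).text)."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def pages(urls: Iterable[str], get: Callable[[str], str] = fetch) -> Iterator[str]:
    """Yield one page at a time; nothing is downloaded until the caller iterates."""
    for url in urls:
        yield get(url)
```

An open file can be passed straight to `read_urls`, since iterating a file yields its lines. The `get` parameter exists only so a fake downloader can be swapped in while experimenting offline.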
Member
Posts: 15,717
Joined: Aug 20 2007
Gold: 481.00
Dec 6 2014 11:33am
Python is a bunch of trouble for a simple task

You can fetch everything you want in C# or VB extremely easily, in under 20 lines of code
Member
Posts: 1,358
Joined: Dec 30 2012
Gold: 0.10
Dec 6 2014 06:44pm
Quote (t9x @ Dec 6 2014 09:33am)
Python is a bunch of trouble for a simple task

You can fetch everything you want in C# or VB extremely easily, in under 20 lines of code


uhh, the code killg0re posted is about 20 lines if you exclude the comments and it downloads + scrapes? not seeing how python is a bunch of trouble tbh
Member
Posts: 376
Joined: Sep 12 2014
Gold: 1,949.00
Dec 6 2014 07:12pm
Hey guys really only been on my mobile this wknd gunna go over this really closely as soon as I have time. Just wanted to toss you guys some fg for helping my thinking on this. Thx again, I'll post back with hopefully some good updates :)

This post was edited by yamamotocannon on Dec 6 2014 07:13pm
Member
Posts: 62,215
Joined: Jun 3 2007
Gold: 9,039.20
Dec 6 2014 11:40pm
Python has a lot of support and almost everyone has written a scraper. Has awesome libraries too, like Scrapy
Member
Posts: 1,358
Joined: Dec 30 2012
Gold: 0.10
Dec 6 2014 11:45pm
Quote (killg0re @ Dec 6 2014 09:40pm)
Python has a lot of support and almost everyone has written a scraper. Has awesome libraries too, like Scrapy


iirc Googlebot was originally written in Python
Member
Posts: 15,717
Joined: Aug 20 2007
Gold: 481.00
Dec 7 2014 09:14am
Quote (SelfTaught @ Dec 6 2014 08:44pm)
uhh, the code killg0re posted is about 20 lines if you exclude the comments and it downloads + scrapes? not seeing how python is a bunch of trouble tbh


Quote (killg0re @ Dec 7 2014 01:40am)
Python has a lot of support and almost everyone has written a scraper. Has awesome libraries too, like Scrapy


okay well if that code works as is sure. but python is not the optimal language.

.net can do this without any added libraries.

and how do you know if that code will even work for his situation

is the data really in a div? what else is in the div? what are you scraping, how does the program know what to scrape?

is the data easier to retrieve by doing ElementById? can it all be retrieved in a simple get request?

you dont know if those 20 lines are going to work

Quote


In terms of my python progress I'm able to open the webpages I want, but parsing elements a lot of my tables / websites are java based which is putting an additional layer of complexity since they don't just appear when you look at HTML on the page. From my understanding I have to actually click on the element to see the relevant HTML (or something along those lines)??



If it is java based, his best bet is to use .NET and an HTTP request. Most likely the response he is getting is JSON, which is easily deserialized and can be made into a simple List<object>

Code

using System;
using Newtonsoft.Json; // Json.NET, which provides JsonProperty
[Serializable]
public class MyClass
{
    [JsonProperty("json name")]
    public string MyData { get; set; }
}


He makes his class structure, and then after he does the GET request, which is 5 lines of code at most, he uses JsonConvert.DeserializeObject (from Json.NET) and it's done. He has everything he wants in its own property.

No parsing, no looping, nothing.

This post was edited by t9x on Dec 7 2014 09:23am
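
The same "deserialize the JSON into typed properties" idea can be sketched in Python with only the standard library. The `Row` class, its `name` and `price` fields, and the sample payload are all invented for illustration; the real keys depend on whatever the site actually returns.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Row:
    # Hypothetical fields; rename to match the keys in the real response.
    name: str
    price: float


def parse_rows(body: str) -> List[Row]:
    """Turn a JSON array of objects into a list of Row instances."""
    return [
        Row(name=item["name"], price=float(item["price"]))
        for item in json.loads(body)
    ]


# Made-up payload standing in for the body of the GET response.
sample = '[{"name": "widget", "price": 9.5}, {"name": "gadget", "price": "12"}]'
rows = parse_rows(sample)
```

Each element of `rows` then exposes its values as attributes (`rows[0].name`, `rows[0].price`), which is roughly what the C# class above gives you.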
Member
Posts: 62,215
Joined: Jun 3 2007
Gold: 9,039.20
Dec 7 2014 09:23am
Quote (t9x @ Dec 7 2014 09:14am)
okay well if that code works as is sure. but python is not the optimal language.

.net can do this without any added libraries.

and how do you know if that code will even work for his situation

is the data really in a div? what else is in the div? what are you scraping, how does the program know what to scrape?

is the data easier to retrieve by doing ElementById? can it all be retrieved in a simple get request?

you dont know if those 20 lines are going to work


I know the code I wrote isn't going to work as-is, because I don't know all the information about what he's trying to scrape. You can do it by DOM, by div tags, by anything you want after you get the response. Python has a large number of libraries, a huge user base, and a community around it, without even considering how easy it is to read, write, and create programs with minimal effort.
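
Selecting "by div tags" or by any other attribute, once the HTML is in hand, might look like the sketch below. It uses the standard library's `html.parser` rather than BeautifulSoup, purely so it runs without installing anything. `AttrTextCollector` and `find_text` are hypothetical names, and the sample markup is invented.

```python
from html.parser import HTMLParser
from typing import Dict, List


class AttrTextCollector(HTMLParser):
    """Collect the text inside every element with a given tag and attributes."""

    def __init__(self, want_tag: str, want_attrs: Dict[str, str]) -> None:
        super().__init__()
        self.want_tag = want_tag
        self.want_attrs = want_attrs
        self.depth = 0  # > 0 while inside a matching element
        self.results: List[str] = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.want_tag:
                self.depth += 1  # nested element with the same tag name
        elif tag == self.want_tag and self.want_attrs.items() <= dict(attrs).items():
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.want_tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def find_text(html: str, tag: str, attrs: Dict[str, str]) -> List[str]:
    """Return the stripped text of each matching element, in document order."""
    parser = AttrTextCollector(tag, attrs)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]
```

With BeautifulSoup installed, `soup.find_all('div', attrs={"id": "heh"})` covers the same ground in one call; the point here is only that the lookup is driven by a tag name plus an attribute dictionary.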

Member
Posts: 13,728
Joined: Jul 11 2007
Gold: 0.00
Dec 7 2014 01:22pm
From abduct:

t9x, I know you're trying with good intentions, but it doesn't matter which language you use. Picking one language over another and saying it is better just shows your inexperience in the subject. That's not even mentioning the wild assumptions you make but won't let others make while posting example code. What if there is no JSON returned... now you just look like an asshat.

Anyways, Ruby master race as always, and anything you post it can do in far fewer lines. As per our exquisite taste in assumptions, I assume the data is within some div tag, in which case you can modify the regex to match and return the data you need.

Code
require 'open-uri'
URI.open("http://yoursite.com/lol.php") { |page| puts page.read.match(%r{<div>(.*?)</div>}m)[1] }


This post was edited by Switch11 on Dec 7 2014 01:27pm
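
For comparison, the regex approach looks much the same in Python. This is only a sketch under the same assumption the post makes, that the data sits inside a plain div. `DIV_RE` and `div_contents` are made-up names, and the pattern will trip over nested divs.

```python
import re
from typing import List

# Non-greedy match of whatever sits between <div ...> and the next </div>.
# DOTALL lets "." span newlines; nested divs will confuse this pattern.
DIV_RE = re.compile(r"<div\b[^>]*>(.*?)</div>", re.DOTALL | re.IGNORECASE)


def div_contents(html: str) -> List[str]:
    """Return the stripped inner text/markup of every div matched by DIV_RE."""
    return [inner.strip() for inner in DIV_RE.findall(html)]
```

Regexes are fine for a quick one-off against markup you control, but an HTML parser tends to hold up better once the page structure gets more complicated.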
Member
Posts: 15,717
Joined: Aug 20 2007
Gold: 481.00
Dec 7 2014 03:30pm
Quote (Switch11 @ Dec 7 2014 03:22pm)
From abduct:

t9x, I know you're trying with good intentions, but it doesn't matter which language you use. Picking one language over another and saying it is better just shows your inexperience in the subject. That's not even mentioning the wild assumptions you make but won't let others make while posting example code. What if there is no JSON returned... now you just look like an asshat.

Anyways, Ruby master race as always, and anything you post it can do in far fewer lines. As per our exquisite taste in assumptions, I assume the data is within some div tag, in which case you can modify the regex to match and return the data you need.

Code
require 'open-uri'
URI.open("http://yoursite.com/lol.php") { |page| puts page.read.match(%r{<div>(.*?)</div>}m)[1] }


Quote
In terms of my python progress I'm able to open the webpages I want, but parsing elements a lot of my tables / websites are java based which is putting an additional layer of complexity since they don't just appear when you look at HTML on the page. From my understanding I have to actually click on the element to see the relevant HTML (or something along those lines)??


sounds like json to me
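
Whether the response really is JSON is cheap to check before committing to either approach. A minimal sketch, assuming the response body is already available as a string; `parse_body` is a hypothetical helper, not part of any library.

```python
import json
from typing import Any, Tuple


def parse_body(body: str) -> Tuple[str, Any]:
    """Classify a response body as ("json", decoded value) or ("text", raw body)."""
    try:
        return "json", json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return "text", body
```

If the first element comes back as "json", the decoded value can be used directly; otherwise the raw text can be handed to an HTML parser.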