Programming Help - Fetching Data From Website - Topic

Member

Posts: 62,215

Joined: Jun 3 2007

Gold: 9,039.20

Dec 5 2014 07:08pm

Code

import requests
from bs4 import BeautifulSoup


def url_list(path='URL-LIST.txt'):
    """Read one URL per line, skipping blank lines."""
    with open(path, 'r') as f:
        return [line.strip() for line in f if line.strip()]


def htmls(urls):
    """
    Get the page for each URL and yield the responses 1-by-1.
    yield makes this a generator: it returns one page at a time for you
    to parse instead of a full list, which is more memory efficient.
    """
    for url in urls:
        yield requests.get(url)


def table_scraps(html, res=None):
    """
    - Assuming the data is a basic HTML table
      <div id="heh">
    - adjust soup to search based on attrs

    - Another example
      <td align="right">
    - soup.find_all('td', attrs={"align": "right"})
    - soup.find_all('div', attrs={"id": "heh"})
    - etc.
    """
    if res is None:  # avoid sharing one default list between calls
        res = []
    soup = BeautifulSoup(html.text, 'html.parser')
    content = soup.find_all('div', attrs={"id": "heh"})

    # what you want to do with the data
    # using patterns, stripping tags, etc.
    # here: insert the text of each match into res, a list
    for tag in content:
        res.append(tag.get_text(strip=True))
    return res


if __name__ == "__main__":
    results = []
    for page in htmls(url_list()):
        table_scraps(page, results)
    print(results)

Haven't tested any of it, but should get you started if you're doing the Python route, BeautifulSoup is what I would use for basic HTML scraping.
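
To show what the generator comment in the code is getting at, here is a minimal sketch. `fake_fetch` and `logging_fetch` are hypothetical stand-ins for `requests.get` so it runs without a network; the only point is that nothing gets fetched until the caller asks for the next page.

```python
def fake_fetch(url):
    # Hypothetical stand-in for requests.get(url).text
    return "<html>page for %s</html>" % url


def pages(urls, fetch=fake_fetch):
    """Yield one page at a time instead of building a full list."""
    for url in urls:
        yield fetch(url)


fetched = []  # records which URLs have actually been requested


def logging_fetch(url):
    fetched.append(url)
    return fake_fetch(url)


gen = pages(["http://a", "http://b"], fetch=logging_fetch)
first = next(gen)  # only the first URL has been fetched at this point
```

Each further `next(gen)` (or a `for` loop over `gen`) fetches exactly one more page, so you can parse and discard pages as you go.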

Actually, I would watch this to get a good overview of scraping in general

This post was edited by killg0re on Dec 5 2014 07:09pm

t9x

Member

Posts: 15,717

Joined: Aug 20 2007

Gold: 481.00

#12

Dec 6 2014 11:33am

Python is a bunch of trouble for a simple task.

You can fetch everything you want in C# or VB extremely easily, in under 20 lines of code.

SelfTaught

Member

Posts: 1,358

Joined: Dec 30 2012

Gold: 0.10

#13

Dec 6 2014 06:44pm

Quote (t9x @ Dec 6 2014 09:33am)

Python is a bunch of trouble for a simple task.

You can fetch everything you want in C# or VB extremely easily, in under 20 lines of code.

uhh, the code killg0re posted is about 20 lines if you exclude the comments and it downloads + scrapes? not seeing how python is a bunch of trouble tbh

yamamotocannon

Member

Posts: 376

Joined: Sep 12 2014

Gold: 1,949.00

#14

Dec 6 2014 07:12pm

Hey guys really only been on my mobile this wknd gunna go over this really closely as soon as I have time. Just wanted to toss you guys some fg for helping my thinking on this. Thx again, I'll post back with hopefully some good updates

This post was edited by yamamotocannon on Dec 6 2014 07:13pm

j0ltk0la

Member

Posts: 62,215

Joined: Jun 3 2007

Gold: 9,039.20

#15

Dec 6 2014 11:40pm

Python has a lot of support and almost everyone has written a scraper. Has awesome libraries too, like Scrapy

SelfTaught

Member

Posts: 1,358

Joined: Dec 30 2012

Gold: 0.10

#16

Dec 6 2014 11:45pm

Quote (killg0re @ Dec 6 2014 09:40pm)

Python has a lot of support and almost everyone has written a scraper. Has awesome libraries too, like Scrapy

iirc Googlebot was originally written in Python

t9x

Member

Posts: 15,717

Joined: Aug 20 2007

Gold: 481.00

#17

Dec 7 2014 09:14am

Quote (SelfTaught @ Dec 6 2014 08:44pm)

uhh, the code killg0re posted is about 20 lines if you exclude the comments and it downloads + scrapes? not seeing how python is a bunch of trouble tbh

Quote (killg0re @ Dec 7 2014 01:40am)

Python has a lot of support and almost everyone has written a scraper. Has awesome libraries too, like Scrapy

okay well if that code works as is sure. but python is not the optimal language.

.net can do this without any added libraries.

and how do you know if that code will even work for his situation

is the data really in a div? what else is in the div? what are you scraping, how does the program know what to scrape?

is the data easier to retrieve by doing ElementById? can it all be retrieved in a simple get request?
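
For reference, a by-id lookup is short in Python as well. With BeautifulSoup it would be `soup.find(id="heh")`. The sketch below does the same thing with only the standard library; the class name `TextById`, the helper `text_by_id`, and the sample markup are made up for illustration, and it assumes the element contains no unclosed void tags such as `<br>`.

```python
from html.parser import HTMLParser


class TextById(HTMLParser):
    """Collect the text inside the first element with a given id."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.depth = 0      # > 0 while we are inside the wanted element
        self.chunks = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            self.depth += 1  # nested tag inside the wanted element
        elif dict(attrs).get("id") == self.wanted_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True  # wanted element is closed, stop collecting

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_by_id(html, wanted_id):
    parser = TextById(wanted_id)
    parser.feed(html)
    return "".join(parser.chunks).strip()


# Made-up markup standing in for the real page
sample = '<div id="other">skip</div><div id="heh"><b>42</b> items</div>'
result = text_by_id(sample, "heh")
```

Whether this finds anything still depends on the data being present in the HTML that the GET request returns.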

you don't know if those 20 lines are going to work

Quote

In terms of my python progress I'm able to open the webpages I want, but parsing elements a lot of my tables / websites are java based which is putting an additional layer of complexity since they don't just appear when you look at HTML on the page. From my understanding I have to actually click on the element to see the relevant HTML (or something along those lines)??

If it is java based, his best bet is to use .NET and an HTTP request. Most likely the response he is getting is JSON, which is easily deserialized and can be made into a simple List<object>.

Code

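using System;
using Newtonsoft.Json; // assumes the Json.NET library is referenced

[Serializable]
public class MyClass
{
    [JsonProperty("json name")] // name of the field in the JSON response
    public string MyData { get; set; }
}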

He makes his class structure, and then after he does the GET request, which is 5 lines of code at most, he calls JsonConvert.DeserializeObject and it's done. He has everything he wants in its own property.

No parsing, no looping, nothing.
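
The same idea can be sketched in Python, assuming the endpoint really does return a JSON array of objects. The `Row` class, its `name` and `price` fields, and the sample payload are all hypothetical; with `requests`, the body would come from `requests.get(url).text` (or `.json()` for the already parsed form).

```python
import json
from dataclasses import dataclass


@dataclass
class Row:
    # Hypothetical fields; use whatever keys the real response contains
    name: str
    price: float


def parse_rows(body):
    """Turn a JSON array of objects into a list of Row instances."""
    return [Row(name=item["name"], price=float(item["price"]))
            for item in json.loads(body)]


# Made-up payload standing in for the body of the GET response
body = '[{"name": "widget", "price": 9.5}, {"name": "gadget", "price": 12}]'
rows = parse_rows(body)
```

If the response turns out not to be JSON, `json.loads` raises `json.JSONDecodeError` (a `ValueError` subclass), which is a quick way to find out.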

This post was edited by t9x on Dec 7 2014 09:23am

j0ltk0la

Member

Posts: 62,215

Joined: Jun 3 2007

Gold: 9,039.20

#18

Dec 7 2014 09:23am

Quote (t9x @ Dec 7 2014 09:14am)

I know the code I wrote isn't going to work as-is, because I don't know all the information about what he's trying to scrape. You can do it by DOM, by div tags, by anything you want after you get the response. Python has a large number of libraries and support for it, a huge user base, and a community around it, without even considering how easy it is to read, write, and create programs with minimal effort.

Switch11

Member

Posts: 13,728

Joined: Jul 11 2007

Gold: 0.00

#19

Dec 7 2014 01:22pm

From abduct:

t9x I know you're trying with good intentions but it doesn't matter which language you use. Picking one language over another and saying it is better just shows your inexperience in the subject. That's not to mention the wild assumptions you make but won't let others make while posting example code. What if there is no JSON returned... now you just look like an asshat.

Anyways Ruby master race as always, and anything you post it can do in far fewer lines. As per our exquisite taste in assumptions, I assume the data is within a random div tag, in which case you can modify the regex to match and return the data you need.

Code

require 'open-uri'
URI.open("http://yoursite.com/lol.php") { |page| puts page.read.match(%r{<div>(.*?)</div>}m)[1] }
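
For comparison, a rough Python version of the same regex approach, using only the standard library. The helper name `first_div_text` and the sample markup are made up, and like any regex over HTML it breaks on nested divs.

```python
import re


def first_div_text(html):
    """Return the contents of the first <div>...</div>, or None if absent."""
    # re.DOTALL lets the non-greedy group span line breaks inside the div
    match = re.search(r"<div[^>]*>(.*?)</div>", html, re.DOTALL)
    return match.group(1) if match else None


# Made-up markup standing in for the downloaded page
sample = "<p>intro</p><div>needed data</div><div>other</div>"
found = first_div_text(sample)
```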

This post was edited by Switch11 on Dec 7 2014 01:27pm

t9x

Member

Posts: 15,717

Joined: Aug 20 2007

Gold: 481.00

#20

Dec 7 2014 03:30pm

Quote (Switch11 @ Dec 7 2014 03:22pm)

Code

require 'open-uri'
URI.open("http://yoursite.com/lol.php") { |page| puts page.read.match(%r{<div>(.*?)</div>}m)[1] }

Quote

In terms of my python progress I'm able to open the webpages I want, but parsing elements a lot of my tables / websites are java based which is putting an additional layer of complexity since they don't just appear when you look at HTML on the page. From my understanding I have to actually click on the element to see the relevant HTML (or something along those lines)??

sounds like json to me
