Programming Help - Fetching Data From Website - Topic

Dec 3 2014 04:14pm

Hi,

I'm working on a project where I want to write a webcrawler that will populate a database. Basically:

1) Go to a website
2) Select the right table out of many options
3) Copy the correct row
4) Take a value from the row
5) Store the value in some database of some sort

Basically, I would want to know the most pain-free way of doing this.

I have already written exactly this program in Excel using some rudimentary VBA and Excel's built-in macro writer. It works well enough, but the project has expanded beyond Excel's capabilities and needs to be faster / more dynamic.

Let me know what you think!!

From my research it seems like PHP or Python can do the trick with MySQL. I know exactly 0 about these languages. I do have Visual Studio on my computer though...

I would gladly give all my fg if I could really get my knowledge furthered on this.

Also would read / look at any resources that you trust in earnest.

Thanks for your time.

This post was edited by yamamotocannon on Dec 3 2014 04:22pm

carteblanche

Dec 3 2014 04:47pm

i'm guessing you have some kind of financial incentive to do it, so my suggestion is hire someone to do it for you. they will complete it faster and provide a more robust solution than you'd make on your own.

with that said, it depends on how complex your data is. if it's as simple as going to a URL and finding an html table whose structure is always the same, then any language that can send an HTTP request and receive html will be fine. VB.NET can do it for sure, and I'm guessing VBA can do it. if it's more complex than that, you'd have to be more specific. eg: from that main url, are you going to search for 20 more URLs and each of those go 20 URLs deep, and you have to dynamically try to figure out what tables have data that you care about? is the data inside an html table or is it inside a flash object / image / popup? etc
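For the simple case described above (one URL, one HTML page back), a minimal Python sketch using only the standard library could look like this. The function name `fetch_html`, the User-Agent value, and the timeout are assumptions made up for illustration; nothing here is specific to any particular site.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Send a GET request and return the response body as text."""
    # Some sites reject requests without a User-Agent; this value is a placeholder.
    request = urllib.request.Request(url, headers={"User-Agent": "table-fetcher/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_html` with one of the URLs from the pre-determined list would return the page source as a string, ready to hand to an HTML parser.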

yamamotocannon

Dec 3 2014 05:18pm

Quote (carteblanche @ Dec 3 2014 05:47pm)

Haha

My main goal is to learn how to do this myself, just as painlessly as possible!!

I'll try to write it out in more detail

For ELEMENT A

1) Go to website 1 from a pre-determined list
2) Select the correct table out of many options
3) Select the correct row (the most recent, let's say -- the table might have only 5 rows or it could have 15; I just want the most recent, usually the last or first row)
4) Take multiple values from the row based on column heading
5) Repeat process for other relevant tables, but still at the same URL
6) Populate database
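Steps 2 to 4 above can be sketched in Python with only the standard library. This is a rough sketch under assumptions: the page is plain HTML with non-nested tables, the first row of each table holds the column headings, and the sample markup and heading names ("Date", "Price", "Volume") are invented for the example.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table seen so far
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def latest_values(html, wanted, newest_last=True):
    """Return the wanted columns from the most recent row of the first
    table whose heading row contains every wanted heading."""
    parser = TableCollector()
    parser.feed(html)
    for table in parser.tables:
        if len(table) < 2:
            continue  # need a heading row plus at least one data row
        header = table[0]
        if all(name in header for name in wanted):
            row = table[-1] if newest_last else table[1]
            record = dict(zip(header, row))
            return {name: record[name] for name in wanted}
    return None


# Invented sample page: the first table is irrelevant, the second is the target.
SAMPLE = """
<table><tr><th>Name</th></tr><tr><td>ignored</td></tr></table>
<table>
  <tr><th>Date</th><th>Price</th><th>Volume</th></tr>
  <tr><td>2014-12-01</td><td>10.5</td><td>300</td></tr>
  <tr><td>2014-12-02</td><td>11.0</td><td>250</td></tr>
</table>
"""
print(latest_values(SAMPLE, ["Date", "Price"]))
```

Selecting the table by its headings rather than by its position on the page is a design choice: it keeps working if the site adds or reorders tables, as long as the headings stay the same.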

Repeat process for website 2,3,4,5,etc for ELEMENT A still. I would expect the row in the database to look something like

Element A [key data from website 1, more data from website 1, data from website 2, no data from website 3 - maybe a placeholder or 0, data from website 4]
Element B [same story]
...
...
...
Element AAAABZ [same story]

It doesn't have to go any deeper than the initial URL since the predetermined lists are solid (I hope)

Ultimately this would create a dataset I would work with in R or SAS
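One possible shape for that dataset, sketched with Python's built-in `sqlite3` and `csv` modules. Everything named here is hypothetical: the table name `dataset`, the column names, and the numbers are placeholders, and SQLite is just a convenient stand-in for "some database of some sort".

```python
import csv
import io
import sqlite3

# Hypothetical scraped results: element -> {column name: value}.
# A website that returned nothing is simply absent from the inner dict.
scraped = {
    "Element A": {"site1_key": 1.5, "site1_more": 7.0, "site2": 3.2, "site4": 9.9},
    "Element B": {"site1_key": 2.5, "site3": 4.1},
}
columns = ["site1_key", "site1_more", "site2", "site3", "site4"]

conn = sqlite3.connect(":memory:")  # pass a file path to keep data between runs
conn.execute(
    "CREATE TABLE dataset (element TEXT PRIMARY KEY, "
    + ", ".join(f"{c} REAL" for c in columns)
    + ")"
)
for element, values in scraped.items():
    # Missing values become NULL, which stays distinguishable from a real 0.
    row = [element] + [values.get(c) for c in columns]
    placeholders = ", ".join("?" for _ in row)
    conn.execute(f"INSERT OR REPLACE INTO dataset VALUES ({placeholders})", row)
conn.commit()

# Export as CSV text, a format both R and SAS can import.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["element"] + columns)
writer.writerows(conn.execute("SELECT * FROM dataset ORDER BY element"))
csv_text = buffer.getvalue()
print(csv_text)
```

Storing NULL instead of 0 for "no data from website 3" is a judgment call; a 0 placeholder would also work, but it can be mistaken for a real measurement later in the analysis.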

I have done a lot of this work in VBA -- it works OK, though I cheated and used the Excel Macro Writer to retrieve the tables by manually selecting them, and then writing a subsequent VBA macro to locate the correct rows.

My motivations for doing this project are twofold.

First, I really want to do this project.

Second, I think that at my current job/career, learning how to do this would really help me. I'm not trying to be a pure programmer per se... but someone who works with data and knows how to create my own complex datasets. Getting the right data is half the battle.

Further complication is that I want the most up to date data so I would be running this macro like... daily

Definitely would appreciate ANY help/comments/advice

This post was edited by yamamotocannon on Dec 3 2014 05:18pm

SelfTaught

Dec 3 2014 06:58pm

use libcurl or poco for your network library

once you have downloaded the page, convert it to xml or xhtml using tidy++

once converted, parse your data from it using an xml library that offers xpath. i like pugixml

/e oops i assumed c/c++ was being used for some reason -_-

i guess what i said still works but you might need to use different libraries to accomplish the steps.

you could also use php for this kind of thing pretty easily with simple html dom parser: http://simplehtmldom.sourceforge.net/

This post was edited by SelfTaught on Dec 3 2014 07:17pm

SelfTaught

Dec 3 2014 07:51pm

Here's a more detailed reply:

First off, you need to decide which language you're going to use. I personally use c++ for tasks like this because it's fast. With that being said, you could write a web crawler in just about any language. I don't know that I'd recommend that you try to write a web crawler as you learn a new programming language. Crawlers can get pretty in depth and you'll probably just give up at some point. If you do decide to try and write one anyways, there are a few things you need for the task:

- A networking library to download pages with (probably libcurl, since PHP's cURL extension is built on it and bindings are available for many other languages.)
- An html to xml/xhtml library
- An xml parsing library that offers XPath

The html to xml/xhtml library isn't strictly necessary but is recommended since you plan on parsing the html and extracting data from it. You can use regular expressions, but read this stackoverflow page to see why you shouldn't: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags . I'd use tidy to convert the html. As far as I can tell, you can get tidy for c++, c, PHP, Perl, and a couple of other languages. Here's a link to the library: http://tidy.sourceforge.net/ . Their tutorials / documentation go over how to use it, but the gist is: you pass in the html from the page you've downloaded, and it outputs either xhtml or xml (you choose which in its settings).

Once the conversion is done, you can parse reliably without worrying much about the unexpected behavior you might get with regular expressions. To parse, use a library that supports XPath. Here is some information you can read over to get an understanding of what XPath expressions are and how they're used: http://www.w3schools.com/xpath/ . Basically, XPath lets you target the specific data that you want to extract.
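To make the XPath idea concrete, here is a small Python sketch using the standard library's `xml.etree.ElementTree`, which supports a limited subset of XPath. The XHTML snippet and the table id `prices` are invented for illustration, and the snippet is assumed to be already well-formed (the state it would be in after a tidy-style conversion).

```python
import xml.etree.ElementTree as ET

# Hypothetical page, already converted to well-formed XHTML.
XHTML = """
<html>
  <body>
    <table id="summary"><tr><td>not wanted</td></tr></table>
    <table id="prices">
      <tr><th>Date</th><th>Price</th></tr>
      <tr><td>2014-12-01</td><td>10.5</td></tr>
      <tr><td>2014-12-02</td><td>11.0</td></tr>
    </table>
  </body>
</html>
"""

root = ET.fromstring(XHTML)
# Pick the table by its id attribute, then take its last row.
last_row = root.find(".//table[@id='prices']/tr[last()]")
cells = [td.text for td in last_row.findall("td")]
print(cells)
```

One caveat: converted XHTML often declares the XHTML namespace on the root element, in which case the tag names in the path need a namespace prefix. The sample above leaves the namespace out to keep the path readable.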

Here's some pseudo code that you can study to get an idea of how a web crawling algorithm works

Code

Ask the user for the starting URL and the file type the crawler should collect.

Add the starting URL to the (empty) list of URLs to search.

While the list of URLs to search is not empty
{
    Take the first URL from the list.
    Mark this URL as already searched.

    If the URL protocol is not HTTP then
        skip this URL and go back to while

    If a robots.txt file exists on the site then
        If it includes a "Disallow" statement for this URL then
            skip this URL and go back to while

    Open the URL.

    If the opened URL is not an HTML file then
        skip this URL and go back to while

    Iterate over the HTML file.

    While the HTML text contains another link
    {
        If a robots.txt file exists on the link's site then
            If it includes a "Disallow" statement for the link then
                skip this link and go back to the inner while

        If the link points to an HTML file then
            If the link isn't marked as searched and isn't already in the list then
                Add it to the list of URLs to search.

        Else if the link's file type is the one the user requested
            Add it to the list of files found.
    }
}

got that pseudo code from here: http://www.devbistro.com/articles/Misc/Implementing-Effective-Web-Crawler
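For comparison, here is a compact Python sketch of the same kind of crawl loop. It is a simplification under stated assumptions, not a translation of the article: `fetch` and `allowed` are hypothetical callbacks supplied by the caller (`allowed` stands in for the robots.txt check, which the standard library's `urllib.robotparser` could provide), the "file type requested by the user" branch is left out, and the pages come from an in-memory dict so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, allowed=lambda url: True, max_pages=100):
    """Breadth-first crawl. fetch(url) returns HTML text or None;
    allowed(url) stands in for a robots.txt check."""
    queue = deque([start_url])
    seen = {start_url}  # every URL ever queued, so nothing is queued twice
    visited = []        # pages actually downloaded, in order
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if urlparse(url).scheme not in ("http", "https"):
            continue  # skip mailto:, ftp:, and similar links
        if not allowed(url):
            continue
        html = fetch(url)
        if html is None:
            continue  # not an HTML page, or the download failed
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Tiny in-memory "website" used in place of real HTTP requests.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/">home</a> <a href="/b">B</a>',
    "http://example.test/b": '<a href="mailto:x@example.test">mail</a>',
}
order = crawl("http://example.test/", PAGES.get)
print(order)
```

With a fixed list of known URLs, most of this loop is unnecessary; only the fetch-and-parse step inside it is needed for each URL.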

This post was edited by SelfTaught on Dec 3 2014 07:53pm

yamamotocannon

Dec 3 2014 08:09pm

First, thanks for the reply, I'm going over the links you added in more detail right now.

I think I am using the term "crawler" as a misnomer sort of, since I already have the list of websites that I want.

I still need to understand everything you are referencing about parsing though with xml.

Thanks for the time, I'll post back after I go through it!!

t9x

Dec 5 2014 08:17am

You should do this in C# or VB .Net

You can get Visual Studio express for free.

Then you should figure out how your website shows you the data

Here are your options: you can use a programmatic web browser and grab each piece of data by element ID

Or you can use Http Requests and get the whole page and parse the info you need.

Using a web browser is much slower than Http Requests. Some websites use queries in the URL to show exactly what you want, this makes Http Requests very easy to work with.
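As a small illustration of the query-string point, here is a Python sketch that builds such a URL. The base URL and the parameter names (`element`, `table`, `rows`) are made up; a real site defines its own.

```python
from urllib.parse import urlencode

BASE_URL = "http://example.test/report"  # placeholder, not a real endpoint


def build_url(base, **params):
    """Append URL-encoded query parameters to a base URL."""
    return base + "?" + urlencode(params)


url = build_url(BASE_URL, element="A", table="prices", rows=1)
print(url)
```

When a site works this way, looping over the list of elements only means changing one parameter per request, which is a large part of why plain HTTP requests are easier to work with than driving a browser.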

A web browser works on a Document_Completed event that gets called every time a page finishes loading, meaning you have to set up a step-by-step process with Booleans to skip over the steps you have already run.

I can help you do this, if you PM me or have skype or something and you give me a little more detail about the website, I can guide you in the right direction.

I currently work full-time making applications that retrieve data off of 3rd party websites, I also do database management and pure business applications.

This is a sample of an HTTP request in C#. I just started C# 2 weeks ago and have already made 2 applications doing exactly what you want, so it's fairly easy to learn

Code

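// Needs: using System.IO; using System.Net; using Newtonsoft.Json;
// cookieJar is a CookieContainer created earlier and shared across requests.
var urlPQ = "URL goes here with OData query";
var webReq = (HttpWebRequest)WebRequest.Create(urlPQ);
webReq.CookieContainer = cookieJar;
using (var webResp = (HttpWebResponse)webReq.GetResponse())
{
    // ContentLength is -1 when the server sends no Content-Length header,
    // so only the status code is checked here.
    if (webResp.StatusCode == HttpStatusCode.OK)
    {
        using (var reader = new StreamReader(webResp.GetResponseStream()))
        {
            string s = reader.ReadToEnd();
            // Search is a class whose properties mirror the JSON fields.
            var json = JsonConvert.DeserializeObject<Search>(s);
            return json.SearchID;
        }
    }
}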

I used an Http Request to get a Json return ( this is what the website used for data management ), I got my Search ID out of it by having a Serializable class with a Json Property "Search"

With a pre-built class of what you are looking for, you can do this on each website, and get everything you want. This would be all the code you need for the retrieval

Your return also does not have to be Json to Deserialize it, it can be anything, Json is actually harder to deserialize than most data returns
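The same "pre-built class" idea can be sketched in Python with the standard `json` module and a dataclass. Only the `SearchID` field name comes from the example above; the `Status` field and the sample payload are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class Search:
    """Holds only the fields we care about from the JSON response."""

    search_id: int
    status: str

    @classmethod
    def from_json(cls, text):
        data = json.loads(text)
        # Fields not listed here (such as "Extra" below) are ignored.
        return cls(search_id=data["SearchID"], status=data.get("Status", ""))


# Invented payload standing in for the body of an HTTP response.
payload = '{"SearchID": 12345, "Status": "done", "Extra": "ignored"}'
search = Search.from_json(payload)
print(search.search_id)
```

Defining one small class per website keeps the site-specific field names in one place, so the rest of the program only deals with `search.search_id`-style attributes.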

This post was edited by t9x on Dec 5 2014 08:24am

yamamotocannon

Dec 5 2014 11:09am

Quote (t9x @ Dec 5 2014 09:17am)

You should do this in C# or VB .Net

Haha ugh just spent the last 2 days learning Python :S

Though what you are describing is more or less what I'm looking for.

In terms of my Python progress, I'm able to open the webpages I want, but parsing elements is harder: a lot of my tables / websites are JavaScript based, which adds an additional layer of complexity since the tables don't just appear when you look at the HTML of the page. From my understanding I have to actually click on the element to see the relevant HTML (or something along those lines)??

Seems like I might be able to get around that problem by downloading the webpage to my computer and then reading it that way??

This post was edited by yamamotocannon on Dec 5 2014 11:26am

yamamotocannon

Dec 5 2014 01:43pm

It seems a lot of the tables I want are hidden in <div id="main data here"> or something like that... not sure how to work around it when the HTML I download leaves those fields blank (downloaded through python or not)
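One way to confirm that symptom is to check whether the placeholder div in the downloaded HTML has any text in it. The Python sketch below does that with the standard library; the id `main-data` and both sample snippets are invented for illustration. If the div comes back empty in the raw download but shows a table in the browser, the content is most likely filled in by a script after the page loads, and a plain download will not contain it.

```python
from html.parser import HTMLParser


class DivTextProbe(HTMLParser):
    """Collect the text found inside the div with the given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while inside the target div
        self.found = False  # True once the target div has been seen
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # a nested div inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)


def div_status(html, target_id):
    """Return 'missing', 'empty', or 'filled' for the div with target_id."""
    probe = DivTextProbe(target_id)
    probe.feed(html)
    if not probe.found:
        return "missing"
    return "filled" if "".join(probe.text).strip() else "empty"


# Invented samples: what a plain download might contain vs. a rendered page.
STATIC = '<div id="main-data"></div><script src="load.js"></script>'
RENDERED = '<div id="main-data"><table><tr><td>42</td></tr></table></div>'
print(div_status(STATIC, "main-data"), div_status(RENDERED, "main-data"))
```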

SelfTaught

Dec 5 2014 05:32pm

Quote (yamamotocannon @ Dec 5 2014 11:43am)

Can you provide a link to one of the pages and point out the data that you're trying to mine from it?

I'll give you an example of how I'd do it in c++
