Grabbing And Saving Files Online - Topic

Michael515

Member

Posts: 10,228

Joined: Oct 27 2008

Gold: 9,834.89

May 1 2016 10:25pm

I would like to go to a website that has a bunch of hyperlinks leading to PDF documents. I'd like to parse the page, grab all of these documents, then save them onto my computer for easy access in the future (e.g., if the website goes down).

For example: On this site, https://www.artofproblemsolving.com/wiki/index.php?title=Mathematics_competitions_resources#Free
about halfway down under the title "Problem Sets", there are a bunch of documents coded as hyperlinks. I'd like to save those through a program, without actually individually clicking on each and saving them.

If anyone can walk me through building this in Java (I use the NetBeans IDE, if that makes a difference), I'd be willing to pay you FG for your help and time. General advice is also much appreciated.

AbDuCt

Member

Posts: 13,425

Joined: Sep 29 2007

Gold: 0.00

Warn: 20%

May 1 2016 11:13pm

I can do this in Ruby if you ever change your mind.

Michael515

Member

Posts: 10,228

Joined: Oct 27 2008

Gold: 9,834.89

May 2 2016 12:12am

Quote (AbDuCt @ May 1 2016 09:13pm)

I can do this in Ruby if you ever change your mind.

Will do. Do you know if there's a specific name for what I'm trying to achieve here?

Ideophobe

Member

Posts: 14,631

Joined: Sep 14 2006

Gold: 575.56

May 2 2016 04:15am

Yeah, there's a really convenient library from Apache called Commons IO.
It can be done with the base Java libraries, but you should get that package anyway. It's super convenient, and why reinvent the wheel when you can write the program in about 20 lines of code lol

Code

commented "walk through"

Code

// Uses the Apache Commons IO library
// Available at http://commons.apache.org/io/download_io.cgi
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.FilenameUtils;

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class JSP_Request_2 {
    public static void main(String[] args) throws IOException {
        System.out.println("start");
        // Make sure this directory exists; if it doesn't, change the path to wherever you want to put the files
        String directory = "C:\\Users\\user\\IdeaProjects\\JSP_Request_2\\";
        // Make a URL out of the link to the page you wanted
        URL input = new URL("http://www.artofproblemsolving.com/wiki/index.php?title=Mathematics_competitions_resources#Free");
        // Make a reader to scan the HTML; try-with-resources closes the stream even if something fails
        try (BufferedReader in = new BufferedReader(new InputStreamReader(input.openStream()))) {
            // Holds each line as it is read
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                // Find a string that is unique to the HTML and makes a good starting point
                if (inputLine.contains("Problem Sets</span></h3>")) {
                    // Keep reading until a line contains the stop marker (or the page runs out)
                    while (inputLine != null && !inputLine.contains("Other Geometry")) {
                        inputLine = in.readLine();
                        if (inputLine == null) {
                            break;
                        }
                        // Split the line into separate strings delimited by a "
                        String[] words = inputLine.split("\"");
                        // Ignore the <ul> and </ul> lines, which split into too few pieces
                        if (words.length > 5) {
                            // words[5] is where the file link is in the split;
                            // Commons IO grabs the file name so we don't have to parse it by hand
                            String file = FilenameUtils.getBaseName(words[5]);
                            // Save as directory + file name + ".pdf", because not all of the links have the extension
                            download(directory + file + ".pdf", words[5]);
                            System.out.println("Downloaded : " + file);
                        }
                    }
                }
            }
        }
        System.out.println("done");
    }

    // Commons IO does the whole download in one call; it takes more code with plain java.io
    public static void download(String fileName, String fileUrl) throws IOException {
        FileUtils.copyURLToFile(new URL(fileUrl), new File(fileName));
    }
}

now i gotta go delete 20 pdfs lol
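
For anyone who'd rather skip the extra jar, the download step can probably be done with the base libraries in a few lines too. This is only a rough sketch: the class name PlainDownload and the save helper are made up for illustration, and the main method feeds it an in-memory stream instead of hitting the real site.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper class, not part of the program above
class PlainDownload {
    // Copies everything from the stream into target, replacing any existing file.
    // Returns the number of bytes written.
    static long save(InputStream in, Path target) throws IOException {
        return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }

    // Plays the same role as the Commons IO call: fetch fileUrl and store it as fileName.
    static void download(String fileName, String fileUrl) throws IOException {
        try (InputStream in = new URL(fileUrl).openStream()) {
            save(in, Paths.get(fileName));
        }
    }

    public static void main(String[] args) throws IOException {
        // Offline demo: an in-memory stream stands in for the network stream
        Path target = Files.createTempFile("demo", ".pdf");
        byte[] fake = "%PDF-1.4 placeholder".getBytes(StandardCharsets.US_ASCII);
        long written = save(new ByteArrayInputStream(fake), target);
        System.out.println(written + " bytes written to " + target);
        Files.delete(target);
    }
}
```

Swapping the body of the download method above for this should behave about the same, though Commons IO also creates missing parent directories, which this sketch does not.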

This post was edited by Ideophobe on May 2 2016 04:44am

annexusquam

Member

Posts: 3,028

Joined: Mar 23 2016

Gold: 7,568.50

May 2 2016 04:22am

If you actually need this and aren't just writing it for practice, the easiest thing would be a program that crawls through a website recursively, configured to keep only files of a certain type. If you're on *nix it should be fairly easy with wget (something like https://fak3r.com/2008/07/28/howto-recursively-download-only-specific-file-types/ ). On Windows I use https://www.httrack.com to mirror whole websites, but I don't know how customizable it is.

Ideophobe

Member

Posts: 14,631

Joined: Sep 14 2006

Gold: 575.56

May 2 2016 05:32am

i was thinking about it over breakfast, and the weird file names really bother me

Code

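while (inputLine != null && !inputLine.contains("Other Geometry")) {
    inputLine = in.readLine();
    if (inputLine == null) {
        break;
    }
    // Split on quotes and angle brackets so the link text comes out as its own piece
    String[] words = inputLine.split("[\"<>]");
    // words[8] is the link and words[10] is the link text, so there must be at least 11 pieces
    if (words.length > 10) {
        download(directory + words[10] + ".pdf", words[8]);
        System.out.println("Downloaded : " + words[10]);
    }
}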

AbDuCt

Member

Posts: 13,425

Joined: Sep 29 2007

Gold: 0.00

Warn: 20%

May 2 2016 06:46am

Quote (Michael515 @ May 2 2016 02:12am)

Will do. Do you know if there's a specific name for what I'm trying to achieve here?

As for the name, what you're looking to create is known as a web scraper/spider. You'll want an XML/HTML scraping/parsing library as well as the core library for issuing HTTP requests. The XML/HTML library should support XPath, which makes the programming easier.

Using string searches is a terrible way to process html.

For instance, in Ruby I would do something like this (it probably won't work and is untested; I quickly generated the XPath and didn't check whether it was correct):

Code

require "open-uri"
require "uri"
require "nokogiri"

# URI.open is needed on Ruby 3.0+, where plain open no longer fetches URLs
doc = Nokogiri::HTML(URI.open("https://www.artofproblemsolving.com/wiki/index.php?title=Mathematics_competitions_resources#Free"))
links = doc.xpath("//div[@id='mw-content-text']/ul[9]/a/@href")

links.each do |link|
  url = link.value
  uri = URI.parse(url)
  IO.copy_stream(URI.open(url), "./#{File.basename(uri.path)}")
end

Quote (annexusquam @ May 2 2016 06:22am)

If you actually need this and aren't just writing it for practice, the easiest thing would be a program that crawls through a website recursively, configured to keep only files of a certain type. If you're on *nix it should be fairly easy with wget (something like https://fak3r.com/2008/07/28/howto-recursively-download-only-specific-file-types/ ). On Windows I use https://www.httrack.com/ to mirror whole websites, but I don't know how customizable it is.

This is another option; wget or curl may be able to solve this easily. Curl the webpage, grep for URLs matching an HTTP URL ending in pdf, then use wget/curl to fetch each link.
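
Since the original question was about Java, the "grep for URLs ending in pdf" step could be roughly approximated with a regex. Treat this as a sketch under assumptions: the class name PdfLinkGrep and the sample HTML are invented, and it only finds links that literally end in .pdf, so extensionless links or ones with a query string would be missed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only
class PdfLinkGrep {
    // http(s) URLs ending in .pdf; the URL stops at whitespace, quotes or angle brackets,
    // and the lookahead requires one of those (or end of input) right after ".pdf"
    private static final Pattern PDF_URL = Pattern.compile(
            "https?://[^\\s\"'<>]+\\.pdf(?=[\\s\"'<>]|$)", Pattern.CASE_INSENSITIVE);

    // Returns every matching URL in the order it appears in the text.
    static List<String> pdfLinks(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = PDF_URL.matcher(html);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up page fragment; example.com is a placeholder host
        String html = "<li><a href=\"http://example.com/sets/a.pdf\">A</a></li>"
                + "<li><a href=\"http://example.com/sets/b.html\">B</a></li>"
                + "<li><a href='https://example.com/sets/C.PDF'>C</a></li>";
        for (String link : pdfLinks(html)) {
            System.out.println(link);
        }
    }
}
```

Each URL it returns could then be handed to whatever download method you already have.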

This post was edited by AbDuCt on May 2 2016 06:51am

Ideophobe

Member

Posts: 14,631

Joined: Sep 14 2006

Gold: 575.56

May 2 2016 11:25am

won't work

Code

http://cdn.artofproblemsolving.com/aops20/attachments/85314_e1e907a6c92d64bea241ed35b6414d3a

This post was edited by Ideophobe on May 2 2016 11:26am

AbDuCt

Member

Posts: 13,425

Joined: Sep 29 2007

Gold: 0.00

Warn: 20%

May 2 2016 11:44am

Quote (Ideophobe @ May 2 2016 01:25pm)

won't work

Code

http://cdn.artofproblemsolving.com/aops20/attachments/85314_e1e907a6c92d64bea241ed35b6414d3a

What doesn't work? For a programmer you're as descriptive as a 2 year old.

This post was edited by AbDuCt on May 2 2016 11:45am

AbDuCt

Member

Posts: 13,425

Joined: Sep 29 2007

Gold: 0.00

Warn: 20%

May 2 2016 06:03pm

Anyways not coding from my phone anymore like my previous post. Here is a working tested version:

Code

require "open-uri"
require "uri"
require "nokogiri"

# URI.open is needed on Ruby 3.0+, where plain open no longer fetches URLs
doc = Nokogiri::HTML(URI.open("http://www.artofproblemsolving.com/wiki/index.php?title=Mathematics_competitions_resources#Free"))
links = doc.xpath("//div[@id='mw-content-text']/ul[9]/li/a/@href")

links.each do |link|
  url = link.value
  uri = URI.parse(url)
  filename = File.basename(uri.path)
  # Most of the links lack an extension, so add one when it is missing
  filename += ".pdf" unless filename.end_with?(".pdf")
  IO.copy_stream(URI.open(url), "./#{filename}")
end

Fixed the XPath from the previous version and added an extension concatenation, since most of the files lack one. Every file would still be a valid PDF when opened, but Windows relies on the extension to decide which program opens it.
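
Since the original question asked for Java, here is roughly how the same two ideas (XPath for the links, then tacking on .pdf) might look using only what ships with the JDK. It's a sketch with assumptions: the class name XPathLinks and the sample markup are invented, and the built-in parser only accepts well-formed XML, so a real page would most likely need a lenient HTML parsing library in front of it.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, named for illustration only
class XPathLinks {
    // Returns the href of every <a> that sits directly inside an <li> of a <ul>.
    static List<String> hrefs(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//ul/li/a/@href", doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getNodeValue());
        }
        return result;
    }

    // Last path segment of the URL, with ".pdf" appended when it is missing.
    static String fileNameFor(String url) {
        String path = URI.create(url).getPath();
        String name = path.substring(path.lastIndexOf('/') + 1);
        return name.endsWith(".pdf") ? name : name + ".pdf";
    }

    public static void main(String[] args) throws Exception {
        // Made-up, well-formed fragment; example.com is a placeholder host
        String page = "<div id=\"mw-content-text\"><ul>"
                + "<li><a href=\"http://example.com/files/set1.pdf\">Set 1</a></li>"
                + "<li><a href=\"http://example.com/attachments/85314_abc\">Set 2</a></li>"
                + "</ul></div>";
        for (String href : hrefs(page)) {
            System.out.println(href + " -> " + fileNameFor(href));
        }
    }
}
```

The XPath here is deliberately looser than the ul[9] one above, because the sample fragment only has a single list.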

This post was edited by AbDuCt on May 2 2016 06:04pm
