Grabbing And Saving Files Online
Member
Posts: 10,228
Joined: Oct 27 2008
Gold: 9,834.89
May 1 2016 10:25pm
I would like to go to a website, and on it, there would be a bunch of hyperlinks that lead to PDF documents. I'd like to parse through the page, grab all of these documents, then save them onto my computer for easy access in the future (ex: if the website crashes).

For example: On this site, https://www.artofproblemsolving.com/wiki/index.php?title=Mathematics_competitions_resources#Free
about halfway down under the title "Problem Sets", there are a bunch of documents coded as hyperlinks. I'd like to save those through a program, without actually individually clicking on each and saving them.

If anyone can walk me through this code-building in Java (I use the Netbeans IDE if that makes a difference), I'd be willing to pay you FG for your help and time. Or, general advice is much appreciated also :)
Member
Posts: 13,425
Joined: Sep 29 2007
Gold: 0.00
Warn: 20%
May 1 2016 11:13pm
I can do this in Ruby if you ever change your mind.
Member
Posts: 10,228
Joined: Oct 27 2008
Gold: 9,834.89
May 2 2016 12:12am
Quote (AbDuCt @ May 1 2016 09:13pm)
I can do this in Ruby if you ever change your mind.


Will do. Do you know if there's a specific name for what I'm trying to achieve here?
Member
Posts: 14,631
Joined: Sep 14 2006
Gold: 575.56
May 2 2016 04:15am
Yeah, there's a really convenient library from Apache called Commons IO.
It can be done with the base Java libraries, but you should get that package anyway. It's super convenient, and why reinvent the wheel when you can write the program in 20 lines of code lol
Code
// Using Commons IO library
// Available at http://commons.apache.org/io/download_io.cgi
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.FilenameUtils;

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class JSP_Request_2 {
    public static void main(String[] args) throws IOException {
        System.out.println("start");
        // Make sure this directory exists; if it doesn't, change the path to wherever you want to put the files
        String directory = "C:\\Users\\user\\IdeaProjects\\JSP_Request_2\\";

        URL input = new URL("http://www.artofproblemsolving.com/wiki/index.php?title=Mathematics_competitions_resources#Free");
        BufferedReader in = new BufferedReader(new InputStreamReader(input.openStream()));

        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            if (inputLine.contains("Problem Sets</span></h3>")) {
                while (!inputLine.contains("Other Geometry")) {
                    inputLine = in.readLine();
                    if (inputLine == null) {
                        break; // end of page reached before the stop marker
                    }
                    String[] words = inputLine.split("[\"]");
                    if (words.length > 5) {
                        String file = FilenameUtils.getBaseName(words[5]);
                        System.out.println("Downloaded : " + file);
                        download(directory + file + ".pdf", words[5]);
                    }
                }
            }
        }
        in.close();
        System.out.println("done");
    }

    public static void download(String fileName, String fileUrl) throws IOException {
        FileUtils.copyURLToFile(new URL(fileUrl), new File(fileName));
    }
}


Commented walk-through:
Code
// Using Commons IO library
// Available at http://commons.apache.org/io/download_io.cgi
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.FilenameUtils;

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class JSP_Request_2 {
    public static void main(String[] args) throws IOException {
        // Start
        System.out.println("start");
        // Make sure this directory exists; if it doesn't, change the path to wherever you want to put the files
        String directory = "C:\\Users\\user\\IdeaProjects\\JSP_Request_2\\";
        // Make a URL out of the link to the page you wanted
        URL input = new URL("http://www.artofproblemsolving.com/wiki/index.php?title=Mathematics_competitions_resources#Free");
        // Make a reader to scan the HTML of it
        BufferedReader in = new BufferedReader(new InputStreamReader(input.openStream()));
        // Make a string to hold the value you read from each line
        String inputLine;
        // Read each line of the HTML
        while ((inputLine = in.readLine()) != null) {
            // Find a string that is unique to the HTML and looks like a good starting point,
            // and read through the HTML until you find it
            if (inputLine.contains("Problem Sets</span></h3>")) {
                // Find a string unique to the HTML that marks the last line to read; it stops the loop
                while (!inputLine.contains("Other Geometry")) {
                    // Set your string to the next line
                    inputLine = in.readLine();
                    // Stop if the page ends before the stop marker shows up
                    if (inputLine == null) {
                        break;
                    }
                    // Split the line into separate strings delimited by a "
                    String[] words = inputLine.split("[\"]");
                    // Ignore the <ul> and </ul> lines
                    if (words.length > 5) {
                        // Handy Commons IO utility that grabs the file name; it can be done
                        // with the base IO library, but that takes a bunch of parsing
                        String file = FilenameUtils.getBaseName(words[5]);
                        // Print the name of the file you want to download
                        System.out.println("Downloaded : " + file);
                        // Download the file to the directory set earlier, using the file name plus ".pdf",
                        // because not all of the links have the extension;
                        // words[5] is where the file link sits in the split
                        download(directory + file + ".pdf", words[5]);
                    }
                }
            }
        }
        // Close the stream
        in.close();
        // Done
        System.out.println("done");
    }

    // Commons ftw: this method is probably like 15 lines with the Java IO utils
    public static void download(String fileName, String fileUrl) throws IOException {
        FileUtils.copyURLToFile(new URL(fileUrl), new File(fileName));
    }
}

now i gotta go delete 20 pdfs lol
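
For anyone who would rather not add Commons IO, the download step can be sketched with the JDK alone. This is a minimal sketch, not the poster's code: the class name `PlainJavaDownload` and the demo in `main` are made up for illustration, and the demo copies from a local `file:` URL so it runs without network access.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper class, not part of the original post.
final class PlainJavaDownload {

    // Copies whatever the URL serves into target, overwriting an existing file.
    // Returns the number of bytes written.
    static long download(URL source, Path target) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent); // make sure the output folder exists
        }
        try (InputStream in = source.openStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo input: a temporary local file stands in for a remote PDF.
        Path src = Files.createTempFile("sample", ".pdf");
        Files.write(src, "not a real pdf".getBytes(StandardCharsets.UTF_8));

        Path dst = Files.createTempDirectory("downloads").resolve("copy.pdf");
        long bytes = download(src.toUri().toURL(), dst);
        System.out.println("Copied " + bytes + " bytes to " + dst);
    }
}
```

Swapping the `file:` URL for one of the page's links should behave the same way, assuming the server allows plain `openStream()` requests.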

This post was edited by Ideophobe on May 2 2016 04:44am
Member
Posts: 3,028
Joined: Mar 23 2016
Gold: 7,568.50
May 2 2016 04:22am
If you actually need this and aren't just writing it to practice, the easiest thing would be a program that crawls through a website recursively, configured to keep only files of a certain type. If you're on *nix it should be fairly easy with wget (something like https://fak3r.com/2008/07/28/howto-recursively-download-only-specific-file-types/ ). On Windows I use https://www.httrack.com to mirror whole websites, but I don't know how customizable it is.
Member
Posts: 14,631
Joined: Sep 14 2006
Gold: 575.56
May 2 2016 05:32am
i was thinking about it over breakfast, and the weird file names really bother me
Code
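while (!inputLine.contains("Other Geometry")) {
    inputLine = in.readLine();
    if (inputLine == null) {
        break; // end of page reached before the stop marker
    }
    // Split on ", < and > so the name between the tags becomes its own token
    String[] words = inputLine.split("[\"]|[<]|[>]");
    // words[8] holds the URL and words[10] the name used for the file,
    // so the line needs at least 11 tokens
    if (words.length > 10) {
        System.out.println("Downloaded : " + words[10]);
        download(directory + words[10] + ".pdf", words[8]);
    }
}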
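
To see where indices 5, 8 and 10 come from, it can help to print the tokens for one link line. The sketch below assumes a line shaped like `<li><a rel="nofollow" class="external text" href="...">Title</a></li>`. That layout is inferred from the indices used in the thread, and the URL and title in `SAMPLE` are made up.

```java
// Hypothetical demo class; SAMPLE is an assumed line layout, not copied from the page.
final class SplitDemo {
    static final String SAMPLE =
        "<li><a rel=\"nofollow\" class=\"external text\" "
        + "href=\"http://example.com/attachments/85314_abc\">Sample Problem Set</a></li>";

    // First version: split on double quotes only.
    static String[] splitOnQuotes(String line) {
        return line.split("[\"]");
    }

    // Second version: split on double quotes and angle brackets.
    static String[] splitOnQuotesAndBrackets(String line) {
        return line.split("[\"]|[<]|[>]");
    }

    private static void dump(String label, String[] tokens) {
        System.out.println(label);
        for (int i = 0; i < tokens.length; i++) {
            System.out.println("  [" + i + "] '" + tokens[i] + "'");
        }
    }

    public static void main(String[] args) {
        dump("quotes only:", splitOnQuotes(SAMPLE));
        dump("quotes and brackets:", splitOnQuotesAndBrackets(SAMPLE));
    }
}
```

With this sample, the URL lands at index 5 in the first split and at index 8 in the second, with the link text at index 10. Any change to the attribute order in the page's markup shifts those positions, which is the weakness of indexing into a split.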
Member
Posts: 13,425
Joined: Sep 29 2007
Gold: 0.00
Warn: 20%
May 2 2016 06:46am
Quote (Michael515 @ May 2 2016 02:12am)
Will do. Do you know if there's a specific name for what I'm trying to achieve here?


As for the name: what you are looking to create is known as a web scraper (or spider). You'll want an XML/HTML scraping/parsing library as well as the core library for issuing HTTP requests. The XML/HTML library should support XPath, which makes the programming easier.

Using string searches is a terrible way to process html.

For instance, in Ruby I would do something like this (it probably won't work and is untested; I quickly generated the XPath and didn't check whether it was correct):

Code
require "open-uri"
require "uri"
require "nokogiri"

doc = Nokogiri::HTML(open("https://www.artofproblemsolving.com/wiki/index.php?title=Mathematics_competitions_resources#Free"))
links = doc.xpath("//div[@id='mw-content-text']/ul[9]/a/@href")

links.each do |link|
  download_stream = open(link)
  uri = URI.parse(link)
  IO.copy_stream(download_stream, "./#{File.basename(uri.path)}")
end


Quote (annexusquam @ May 2 2016 06:22am)
If you actually need this and aren't just writing it to practice, the easiest thing would be a program that can crawl through a website (recursively) that you then configure to only keep files of a certain type. If you're on *nix it should be fairly easy with wget (something like https://fak3r.com/2008/07/28/howto-recursively-download-only-specific-file-types/ ), on windows I use https://www.httrack.com/ to mirror whole websites but I don't know how customizable it is.


This is another option; wget or curl may be able to solve this easily: curl the webpage, grep for HTTP URLs ending in pdf, then use wget/curl to fetch each link.
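
Since the original question asked for Java, the "grep for URLs ending in pdf" step could look roughly like the sketch below. The class name `PdfLinkFilter` and the sample markup are made up for illustration. Note that a filter like this skips any link without a `.pdf` suffix, and the thread notes that not all of the links on this page carry the extension.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class illustrating the grep-style approach.
final class PdfLinkFilter {
    // An absolute http(s) URL inside href="..." that ends in .pdf (any letter case).
    private static final Pattern PDF_HREF =
        Pattern.compile("href=\"(https?://[^\"]+\\.pdf)\"", Pattern.CASE_INSENSITIVE);

    // Returns every matching URL in document order.
    static List<String> pdfLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = PDF_HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up markup: one link with the extension, one without.
        String html =
            "<a href=\"http://example.com/sets/a.pdf\">A</a>"
            + "<a href=\"http://example.com/attachments/85314_abc\">B</a>";
        for (String link : pdfLinks(html)) {
            System.out.println(link);
        }
    }
}
```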

This post was edited by AbDuCt on May 2 2016 06:51am
Member
Posts: 14,631
Joined: Sep 14 2006
Gold: 575.56
May 2 2016 11:25am
won't work
Code
http://cdn.artofproblemsolving.com/aops20/attachments/85314_e1e907a6c92d64bea241ed35b6414d3a


This post was edited by Ideophobe on May 2 2016 11:26am
Member
Posts: 13,425
Joined: Sep 29 2007
Gold: 0.00
Warn: 20%
May 2 2016 11:44am
Quote (Ideophobe @ May 2 2016 01:25pm)
won't work
Code
http://cdn.artofproblemsolving.com/aops20/attachments/85314_e1e907a6c92d64bea241ed35b6414d3a


What doesn't work? For a programmer you're as descriptive as a 2 year old.

This post was edited by AbDuCt on May 2 2016 11:45am
Member
Posts: 13,425
Joined: Sep 29 2007
Gold: 0.00
Warn: 20%
May 2 2016 06:03pm
Anyways not coding from my phone anymore like my previous post. Here is a working tested version:

Code
require "open-uri"
require "uri"
require "nokogiri"

# URI.open is used because Kernel#open no longer fetches URLs as of Ruby 3.0
doc = Nokogiri::HTML(URI.open("http://www.artofproblemsolving.com/wiki/index.php?title=Mathematics_competitions_resources#Free"))
links = doc.xpath("//div[@id='mw-content-text']/ul[9]/li/a/@href")

links.each do |link|
  url = link.value
  download_stream = URI.open(url)
  uri = URI.parse(url)
  filename = File.basename(uri.path)
  # most of the links lack an extension, so add one when it is missing
  filename += ".pdf" unless filename.end_with?(".pdf")
  IO.copy_stream(download_stream, "./#{filename}")
end


Fixed the XPath from the previous version and added an extension concatenation, since most of the files lack one. They would all still be valid PDF documents when opened, but Windows relies on the file extension to choose which program opens them.
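
For the Java side of the question, the same XPath idea can be sketched with the JDK's built-in XML and XPath classes. This is a sketch under a big assumption: the JDK parser only accepts well-formed XML, and real-world HTML often is not, so a live page would likely need a tolerant HTML parser first. The class name `XPathLinks` and the sample markup are made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; assumes the markup is well-formed XML.
final class XPathLinks {

    // Evaluates an XPath that selects attribute nodes and returns their values.
    static List<String> hrefs(String wellFormedMarkup, String xpath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(wellFormedMarkup)));
        NodeList nodes = (NodeList) XPathFactory.newInstance()
            .newXPath()
            .evaluate(xpath, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getNodeValue());
        }
        return values;
    }

    // Appends ".pdf" when the name does not already end with it.
    static String withPdfExtension(String name) {
        return name.endsWith(".pdf") ? name : name + ".pdf";
    }

    public static void main(String[] args) throws Exception {
        // Made-up, well-formed stand-in for the wiki page's content div.
        String markup =
            "<div id=\"mw-content-text\"><ul>"
            + "<li><a href=\"http://example.com/sets/one.pdf\">One</a></li>"
            + "<li><a href=\"http://example.com/attachments/85314_abc\">Two</a></li>"
            + "</ul></div>";

        // Links sit inside <li> elements, so the path has to step through li.
        for (String href : hrefs(markup, "//div[@id='mw-content-text']/ul[1]/li/a/@href")) {
            String name = href.substring(href.lastIndexOf('/') + 1);
            System.out.println(href + " -> " + withPdfExtension(name));
        }
    }
}
```

Dropping the `/li` step from the path returns nothing for this markup, because `/` only matches direct children. That matches the fix described above.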

This post was edited by AbDuCt on May 2 2016 06:04pm