Python Help
Member
Posts: 29,702
Joined: Jun 10 2010
Gold: 4,707.16
Oct 15 2017 09:05pm
I have been googling and pulling my hair out trying to figure this out.

I am using BeautifulSoup's find_all to scrape a page.

I want to find every link that contains one string but not another:

this = GOODSTUFF

that = feed

Example of the page:

Code
Scrape dem' fucking torrents, bra
url1
url2
logged in
soup = <!DOCTYPE html>
<html>
<head>
<title>Browse Anime Torrents :: WEBSITE</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="/static/favicon-f8a3a024b0.ico" rel="shortcut icon"/>
<link href="/opensearch_torrents.xml" rel="search" title="WEBSITE anime torrents" type="application/opensearchdescription+xml"/>
<link href="/opensearch_torrents2.xml" rel="search" title="WEBSITE music torrents" type="application/opensearchdescription+xml"/>
<link href="/opensearch_artists.xml" rel="search" title="WEBSITE artists" type="application/opensearchdescription+xml"/>
<link href="/opensearch_requests.xml" rel="search" title="WEBSITE requests" type="application/opensearchdescription+xml"/>
<link href="/opensearch_forums.xml" rel="search" title="WEBSITE forums" type="application/opensearchdescription+xml"/>
<link href="/opensearch_users.xml" rel="search" title="WEBSITE users" type="application/opensearchdescription+xml"/>
<link href="/feed/rss_torrents_all/GOODSTUFF" rel="alternate" title="WEBSITE - All Torrents" type="application/rss+xml"/>
<link href="/feed/rss_torrents_anime/GOODSTUFF" rel="alternate" title="WEBSITE - Anime Torrents" type="application/rss+xml"/>
<span class="download_link">[<a href="https://WEBSITE.tv/torrent/223197/download/GOODSTUFF" title="Download">DL</a>]</span>
<span class="download_link">[<a href="https://WEBSITE.tv/torrent/223197/download/GOODSTUFF" title="Download">DL</a>]</span>
<a href="torrents.php?id=27682&amp;torrentid=223197">Game | NES | JPN | Archived | <img alt="Freeleech!" src="/static/c$
</td>


What I'm using:

Code
import re

download = soup.find_all(href=re.compile("GOODSTUFF"))


This post was edited by CyberGod on Oct 15 2017 09:13pm
Member
Posts: 29,702
Joined: Jun 10 2010
Gold: 4,707.16
Oct 16 2017 12:28pm
closed, figured it out
Member
Posts: 4,087
Joined: Oct 29 2011
Gold: 0.69
Oct 17 2017 10:30am
So you used a regex to sift through all the hrefs? I know you said you figured it out, I'm just curious is all!

How did you structure your loop to iterate through it all? Previously I've always used a for loop over var.find_all('a') to get a list of <a> elements, then called .get('href') on each element in turn.

Member
Posts: 29,702
Joined: Jun 10 2010
Gold: 4,707.16
Oct 17 2017 12:43pm
Quote (destroyered @ Oct 17 2017 12:30pm)
So you used a regex to sift through all the hrefs? I know you said you figured it out, I'm just curious is all!

How did you structure your loop to iterate through it all? Previously I've always used a for loop over var.find_all('a') to get a list of <a> elements, then called .get('href') on each element in turn.


Yes, regex:

soup.find_all(href=re.compile("http(?:s)://(.*)/GOODSTUFF"))

This is what someone helped me with; I'm terrible with regex.

It works perfectly as long as I don't go through a SOCKS5 proxy. For some reason it has problems with https through the proxy but works for http. Luckily, in this case I didn't need a proxy.
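
For anyone reading later, here is a small sketch of why that pattern leaves the feed links out. It uses only the re module and the href values copied from the example page in the first post. BeautifulSoup tests a compiled pattern with its search() method, so the list comprehension below does the same thing by hand. The feed hrefs are relative paths with no https:// in front, which is the only reason they are skipped.

```python
import re

# Pattern copied from the post above.
pattern = re.compile("http(?:s)://(.*)/GOODSTUFF")

# href values copied from the example page in the first post.
hrefs = [
    "/feed/rss_torrents_all/GOODSTUFF",
    "/feed/rss_torrents_anime/GOODSTUFF",
    "https://WEBSITE.tv/torrent/223197/download/GOODSTUFF",
    "torrents.php?id=27682&torrentid=223197",
]

# Mimic what find_all(href=pattern) does for each attribute value.
matches = [h for h in hrefs if pattern.search(h)]
print(matches)

# "(?:s)" is a mandatory "s", not an optional one, so a plain http:// link
# is rejected as well. "https?" would accept both schemes.
http_link = "http://WEBSITE.tv/torrent/223197/download/GOODSTUFF"
print(pattern.search(http_link) is None)
```

So the pattern works on this page because of how the links happen to be written, not because it checks for "feed" anywhere.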
Member
Posts: 4,087
Joined: Oct 29 2011
Gold: 0.69
Oct 19 2017 02:37pm
Quote (CyberGod @ Oct 17 2017 02:43pm)
Yes, regex:

soup.find_all(href=re.compile("http(?:s)://(.*)/GOODSTUFF"))

This is what someone helped me with; I'm terrible with regex.

It works perfectly as long as I don't go through a SOCKS5 proxy. For some reason it has problems with https through the proxy but works for http. Luckily, in this case I didn't need a proxy.


Interesting. Where are you pulling this HTML from, anyway? You could try sending a request header of some sort. As a last resort, if you were trying to steal the Declaration of Independence, you could set verify=False, but that skips TLS certificate verification and exposes you to some very sketchy stuff, so I highly recommend you don't resort to it.
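
A rough sketch of the header and proxy idea, assuming the page is fetched with the requests library (the thread never says which HTTP client is in use). build_request_kwargs is a made-up helper, and the host, port, and User-Agent are placeholders. It only builds the keyword arguments, so nothing is sent over the network. The socks5h:// scheme asks the proxy to resolve hostnames, while socks5:// resolves them locally; that may or may not be related to the https problem described above.

```python
def build_request_kwargs(proxy_host, proxy_port, user_agent):
    """Return keyword arguments meant for requests.get() (hypothetical helper)."""
    # socks5h:// lets the proxy do the DNS lookup; socks5:// resolves locally.
    proxy_url = f"socks5h://{proxy_host}:{proxy_port}"
    return {
        "headers": {"User-Agent": user_agent},
        "proxies": {"http": proxy_url, "https": proxy_url},
        "timeout": 30,
        # "verify" is deliberately left out, so certificate checking stays on.
    }


# Placeholder values, not taken from the thread.
kwargs = build_request_kwargs("127.0.0.1", 9050, "Mozilla/5.0")
print(kwargs["proxies"]["https"])
```

With SOCKS support installed (pip install requests[socks]), these would be passed along as requests.get(url, **kwargs).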

However, for a simple label like 'GOODSTUFF', where nothing changes and 'GOODSTUFF' is always what you'll be searching for, regex is most definitely overkill.

Try something like the following:

Code
for link in page.find_all('a'):  # iterates through all <a> elements
    href = link.get('href')  # None if the tag has no href attribute
    if href and 'GOODSTUFF' in href:  # checks for 'GOODSTUFF' in the current link
        print(link)  # or do anything else you'd like; href may be handier here
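
Another way to write the original "this but not that" rule directly is to pass a function as the href filter. BeautifulSoup calls it with each tag's href value (or None when the tag has no href). This is only a sketch: the HTML is a trimmed copy of the example page from the first post, and wanted is a name made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the example page from the first post.
html = """
<link href="/feed/rss_torrents_all/GOODSTUFF" rel="alternate"/>
<span class="download_link">[<a href="https://WEBSITE.tv/torrent/223197/download/GOODSTUFF" title="Download">DL</a>]</span>
<a href="torrents.php?id=27682&amp;torrentid=223197">Game</a>
"""
soup = BeautifulSoup(html, "html.parser")


def wanted(href):
    """Keep hrefs that contain GOODSTUFF but do not contain 'feed'."""
    # href is None for tags that have no href attribute at all.
    return href is not None and "GOODSTUFF" in href and "feed" not in href


# find_all checks every tag, so <link> and <a> are both considered.
links = [tag["href"] for tag in soup.find_all(href=wanted)]
print(links)
```

Unlike the loop over find_all('a'), this looks at every tag type, so the "feed" check is what keeps the <link> entries out.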


Hope this helps.

Edit: you already solved it, sorry for the pointless solution XD

This post was edited by destroyered on Oct 19 2017 02:38pm