No idea why I spent so long on this, but I did. Your code was really hard to read, so I started there: I split it into two parts, getting the links from the sitemaps and then everything else, and pulled a few bits out into separate functions. This checks about 2 URLs per second on my machine, which seems about right. How this is better (you can argue with me about this part):
It doesn't have to reopen and close the output file after each write.
Removed a fair bit of unneeded code.
Gave your variables better names (this does not improve speed in any way, but please do this, especially if you are asking for help with it).
Really the main thing: once you break it all up, it becomes fairly clear that what's slowing you down is waiting on the requests, which is pretty standard for web scraping. You can look into multithreading to avoid the wait. Once you get into multithreading, the benefit of breaking up your code will likely also become much more evident.
import requests
from bs4 import BeautifulSoup


# returns the <loc> elements (links) found in a sitemap
def get_links(sitemap_url):
    response = requests.get(sitemap_url)
    soup = BeautifulSoup(response.text, "lxml")
    return soup.find_all('loc')


# gets the final URL from your middle URL and looks through it for the thing you are looking for
def scrapey(link):
    link_id = link[link.find("PLAW"):]
    r = requests.get('https://www.govinfo.gov/wssearch/getContentDetail?packageId={}'.format(link_id))
    print(r.url)
    try:
        r = requests.get("https://{}".format(r.json()["download"]["modslink"][2:]))
        print(r.url)
        soup = BeautifulSoup(r.text, "lxml")
        references = soup.find_all('identifier', {'type': 'Statute citation'})
        for ref in references:
            if ref.text == "106 Stat. 4845":
                return r.url
        # no reference matched (this also covers pages with no references at all)
        return False
    except Exception:
        print("bah" + r.url)
        return False


sitemap_links_el = get_links("https://www.govinfo.gov/sitemap/PLAW_sitemap_index.xml")
sitemap_links = map(lambda x: x.text, sitemap_links_el)
nlinks_el = map(get_links, sitemap_links)
links = [num.text for elem in nlinks_el for num in elem]

# open the output file once and keep it open for the whole run
with open("output.txt", "a") as f:
    for link in links:
        url = scrapey(link)
        if url is False:
            print("no find")
        else:
            print("found on: {}".format(url))
            f.write("{}\n".format(url))
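
If you want to try the multithreading route mentioned above, here is a rough sketch using the standard library's thread pool. The helper name check_links and the default of 8 workers are my own assumptions, not anything from your code or tuned for govinfo.gov. It takes any function shaped like scrapey (returns a URL on a match, False otherwise), so the waiting on requests overlaps instead of happening one link at a time.

```python
from concurrent.futures import ThreadPoolExecutor


def check_links(links, check, max_workers=8):
    """Run check(link) for every link on a thread pool and return the hits.

    `check` is any function that returns a URL on a match and False
    otherwise. The name check_links and the default of 8 workers are
    assumptions for this sketch, not tuned values.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs the calls concurrently but yields results
        # in the same order as `links`
        results = pool.map(check, links)
        # consume the results while the pool is still open
        return [url for url in results if url is not False]
```

With this in place, the final loop would become something like iterating over check_links(links, scrapey) and writing each returned URL to the file. Keep the worker count modest so you don't hammer the server.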