simple web scraper very slow

Question

asked Apr 26, 2022 in Education by JackTerrance

I'm fairly new to Python and to web scraping in general. The code below works, but it seems awfully slow for the amount of information it is actually going through. Is there an easy way to cut down the execution time? I also suspect I have typed out more, and made it more complicated, than I needed to. Any help would be appreciated.

Currently the code starts at the sitemap index and iterates through the list of additional sitemaps it contains. Within each of those sitemaps it pulls the information needed to construct a URL for the JSON data of a web page. From the JSON data I pull an XML link, which I then search for a string. If the string is found, the link is appended to a text file.

    import io

    import requests
    from bs4 import BeautifulSoup

    # global variables
    start = 'https://www.govinfo.gov/wssearch/getContentDetail?packageId='
    dash = '-'
    urlSitemap = "https://www.govinfo.gov/sitemap/PLAW_sitemap_index.xml"

    old_xml = requests.get(urlSitemap)
    print(old_xml)
    new_xml = io.BytesIO(old_xml.content).read()
    final_xml = BeautifulSoup(new_xml)
    linkToBeFound = final_xml.findAll('loc')
    for loc in linkToBeFound:
        urlPLmap = loc.text
        old_xmlPLmap = requests.get(urlPLmap)
        print(old_xmlPLmap)
        new_xmlPLmap = io.BytesIO(old_xmlPLmap.content).read()
        final_xmlPLmap = BeautifulSoup(new_xmlPLmap)
        linkToBeFound2 = final_xmlPLmap.findAll('loc')
        for pls in linkToBeFound2:
            argh = pls.text.find('PLAW')
            theWanted = pls.text[argh:]
            thisShallWork = eval(requests.get(start + theWanted).text)
            print(requests.get(start + theWanted))
            dict1 = (thisShallWork['download'])
            finaldict = (dict1['modslink'])[2:]
            print(finaldict)
            url2 = 'https://' + finaldict
            try:
                old_xml4 = requests.get(url2)
                print(old_xml4)
                new_xml4 = io.BytesIO(old_xml4.content).read()
                final_xml4 = BeautifulSoup(new_xml4)
                references = final_xml4.findAll('identifier', {'type': 'Statute citation'})
                for sec in references:
                    if sec.text == "106 Stat. 4845":
                        print(dash * 20)
                        print(sec.text)
                        print(dash * 20)
                        sec313 = open('sec313info.txt', 'a')
                        sec313.write("\n")
                        sec313.write(pls.text + '\n')
                        sec313.close()
            except:
                print('error at: ' + url2)

1 Answer


JackTerrance · Answer 1 · 2022-04-26T02:53:53+0000

No idea why I spent so long on this, but I did. Your code was really hard to read, so I started with that. I broke it up into two parts: getting the links from the sitemaps, then everything else. I also broke a few bits out into separate functions. This checks about 2 URLs per second on my machine, which seems about right.

How this is better (you can argue with me about this part):

- The output file no longer has to be reopened and closed after each write.
- A fair bit of unneeded code is removed.
- Your variables have better names (this does not improve speed in any way, but please do this, especially if you are asking for help with the code).

Really the main thing: once you break it all up, it becomes fairly clear that what is slowing you down is waiting on the requests, which is pretty standard for web scraping. You can look into multithreading to avoid the wait. Once you get into multithreading, the benefit of breaking up your code will likely also become much more evident.

    import requests
    from bs4 import BeautifulSoup


    # returns sitemap links
    def get_links(s):
        old_xml = requests.get(s)
        new_xml = old_xml.text
        final_xml = BeautifulSoup(new_xml, "lxml")
        return final_xml.findAll('loc')


    # gets the final url from your middle url and looks through it
    # for the thing you are looking for
    def scrapey(link):
        link_id = link[link.find("PLAW"):]
        r = requests.get('https://www.govinfo.gov/wssearch/getContentDetail?packageId={}'.format(link_id))
        print(r.url)
        try:
            r = requests.get("https://{}".format(r.json()["download"]["modslink"][2:]))
            print(r.url)
            soup = BeautifulSoup(r.text, "lxml")
            references = soup.findAll('identifier', {'type': 'Statute citation'})
            for ref in references:
                if ref.text == "106 Stat. 4845":
                    return r.url
            # none of the references matched
            return False
        except Exception:
            print("bah" + r.url)
            return False


    sitemap_links_el = get_links("https://www.govinfo.gov/sitemap/PLAW_sitemap_index.xml")
    sitemap_links = map(lambda x: x.text, sitemap_links_el)
    nlinks_el = map(get_links, sitemap_links)
    links = [num.text for elem in nlinks_el for num in elem]

    with open("output.txt", "a") as f:
        for link in links:
            url = scrapey(link)
            if url is False:
                print("no find")
            else:
                print("found on: {}".format(url))
                f.write("{}\n".format(url))
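
The multithreading suggestion could be sketched roughly as follows. This is a minimal sketch, not part of the original answer: `find_matches` and its `check` and `max_workers` parameters are hypothetical names, and `check` stands in for a function like `scrapey` that returns a URL on a match and `False` otherwise. It assumes that function is safe to call from several threads at once, and it uses a stand-in checker so that it runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def find_matches(links, check, max_workers=8):
    """Run check(link) for every link on a thread pool; return the truthy results.

    `check` is assumed to behave like `scrapey`: it returns a URL string when
    the citation is found and False otherwise. Results keep the order of `links`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, while the calls themselves
        # overlap, so time spent waiting on one request is used by the others.
        results = list(pool.map(check, links))
    return [url for url in results if url]


if __name__ == "__main__":
    # Hypothetical stand-in for scrapey, used only to demonstrate the call.
    def fake_check(link):
        return link if link.endswith("match") else False

    print(find_matches(["a/match", "b/miss", "c/match"], fake_check))
```

With the code above, this would presumably be called as `find_matches(links, scrapey)`, with the returned URLs written to the output file afterwards. Keeping `max_workers` modest is a reasonable default so the server is not flooded with simultaneous requests.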