Web scraping Google search results
Sometimes you might need to get the URLs from a Google search result. There are many ways to accomplish this task, and I would like to discuss the one we use here at Walden Systems, because we believe it is the most straightforward way to get the URLs of a Google search result.
Google search results are generated by JavaScript, which is why we use Selenium to extract the HTML that the JavaScript produces. We prefer PhantomJS as the webdriver because it executes in the background; if we used Firefox as the webdriver, a new Firefox window would open for each request. In addition, BeautifulSoup is used for parsing the HTML.
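As a side note, if PhantomJS is not available, Firefox can also be run without opening a window by passing it a headless option. The snippet below is only a rough sketch of that alternative and assumes Selenium with geckodriver is installed; the rest of this article uses PhantomJS.

from selenium import webdriver

# Sketch of a headless Firefox alternative to PhantomJS (assumes geckodriver is installed).
l_options = webdriver.FirefoxOptions()
l_options.add_argument( '-headless' )              # run Firefox without opening a window
l_browser = webdriver.Firefox( options=l_options )
l_browser.get( 'https://www.google.com/webhp?#num=10&q=test' )
print( len( l_browser.page_source ) )              # the rendered HTML is available as usual
l_browser.quit()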
In order to do this, we use the following script:
from selenium import webdriver
from bs4 import BeautifulSoup

def getGoogleLinksForSearchText( p_searchText, p_num ) :
    # Start a PhantomJS browser so the JavaScript-generated results are rendered.
    l_browser = webdriver.PhantomJS( executable_path='/FULL_DIRECTORY_PATH_OF_PHANTOMJS/phantomjs' )

    # Build the Google search URL with the number of results and the search text.
    l_searchGoogle = 'https://www.google.com/webhp?#num=' + p_num + '&q=' + p_searchText
    l_browser.get( l_searchGoogle )
    l_pageSource = l_browser.page_source
    bsObj = BeautifulSoup( l_pageSource, 'lxml' )

    # Each result title is an <h3 class="r"> element containing the link.
    l_elements = bsObj.findAll( "h3", { "class": "r" } )

    l_links = [ ]
    l_str_to_find = "/url?q="
    l_str_to_end = "&sa="
    for l_element in l_elements :
        l_str = l_element.find( "a" ).attrs[ 'href' ]

        # The target URL sits between "/url?q=" and "&sa=" in the href.
        l_begin_loc = l_str.find( l_str_to_find, 0, len( l_str ) )
        if ( -1 < l_begin_loc ) :
            l_begin_loc = l_begin_loc + len( l_str_to_find )
            l_end_loc = l_str.find( l_str_to_end, l_begin_loc, len( l_str ) )
            if ( -1 < l_end_loc ) :
                l_url = l_str[ l_begin_loc : l_end_loc ]
                l_links.append( l_url )

    # Close the PhantomJS process so it does not keep running in the background.
    l_browser.quit()

    return l_links

l_results = getGoogleLinksForSearchText( "Oscars seacrest", "50" )
for l_result in l_results :
    print( l_result )
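For reference, the href parsing could also be done with the standard library's urllib.parse instead of the manual substring search above. This is only an illustrative sketch and assumes the links still follow the "/url?q=...&sa=..." format; extractTargetUrl is a hypothetical helper name, not part of the script above.

from urllib.parse import urlparse, parse_qs

def extractTargetUrl( p_href ) :
    # hrefs look like "/url?q=https://example.com/page&sa=U&ved=..."
    l_query = urlparse( p_href ).query     # "q=https://example.com/page&sa=U&ved=..."
    l_params = parse_qs( l_query )         # { 'q': [ 'https://example.com/page' ], 'sa': [ 'U' ], ... }
    return l_params.get( 'q', [ None ] )[ 0 ]

print( extractTargetUrl( "/url?q=https://example.com/page&sa=U&ved=abc" ) )
# prints https://example.com/page

The parse_qs approach has the advantage of URL-decoding the value of q and of not depending on the exact position of the &sa= parameter.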