Web Scraping a JavaScript-Heavy Website in Python and Using Pandas for Analysis

I set out to try using the Python library BeautifulSoup to get data on the retailers that would be attending a market, as shown on this webpage: https://www.americasmart.com/browse/#/exhibitor?market=23.

What I found, however, was that parsing the page with BeautifulSoup (BeautifulSoup(page_html, 'html.parser')) only returned the header and footer of the page. It turns out that this website relies on JavaScript to populate most of the data on the page, so the data I was looking for was not in the initial HTML.

To find the AJAX request that returned the data I needed, I looked under the XHR and JS tabs in the Network section of the Google Chrome browser (see image below). The credit for this idea came from this blog post: https://blog.hartleybrody.com/web-scraping-cheat-sheet/#useful-libraries. You need to hover over the "Name" fields and right-click to copy the link address, which you then paste into the code below. There are a bunch of options here, but I just tried them one by one to see which request retrieved the data I was looking for.

The Network tab on the Google Chrome Inspect menu
import requests

url = 'https://wem.americasmart.com/api/v1.2/Search/LinesAndPhotosByMarket?status=ACTIVE_AND_UPCOMING&marketID=23'
r = requests.get(url)
info = r.json()

For this website, the data was returned in a list of dictionaries. I had to play around with the indexes to extract the data I needed. Ultimately, I wanted to create a Pandas DataFrame with location information about every retailer, and then merge it with a narrowed down list of retailers, housed in a .csv file. The link to the full code is on my github: https://github.com/mdibble2/Projects/blob/master/Web%2BScraping%2Bwith%2BAJAX%2Brequest%20(1).ipynb
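As a rough sketch of that step, the JSON comes back as a list of dictionaries, which Pandas can turn directly into a DataFrame. The field names below (companyName, building, showroom) are hypothetical stand-ins — the real keys have to be found by inspecting the actual API response:

```python
import pandas as pd

# Hypothetical structure: the real API returns a list of dictionaries,
# but the exact keys here are assumptions, not the actual response fields.
info = [
    {"companyName": "Acme Home", "building": "B1", "showroom": "10-110"},
    {"companyName": "Luxe Decor", "building": "B2", "showroom": "3-220"},
]

# A list of flat dictionaries converts directly to a DataFrame,
# one row per retailer.
web_df = pd.DataFrame(info)
print(web_df)
```

In practice the interesting records may be nested a level or two down, which is where the index experimentation mentioned above comes in.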

I used the df.merge() function to select only the rows of data from the website that matched the retailers listed in the .csv file (i.e. an inner join). The drawback to this approach is that every retailer name needed to be spelled correctly, with the correct capitalization, in order to match the primary key of my first table, the one generated from the website.
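The inner join described above can be sketched like this (the retailer names and the companyName column are hypothetical — in the project the two tables came from the website response and the .csv file):

```python
import pandas as pd

# Hypothetical stand-ins for the two tables: one built from the website's
# JSON response, one loaded from a hand-maintained .csv of retailers.
web_df = pd.DataFrame({
    "companyName": ["Acme Home", "Luxe Decor", "Urban Nest"],
    "showroom": ["10-110", "3-220", "7-005"],
})
csv_df = pd.DataFrame({"companyName": ["Acme Home", "Urban Nest"]})

# An inner join keeps only rows whose companyName appears in BOTH tables;
# any typo or capitalization mismatch in the .csv silently drops that retailer.
merged = web_df.merge(csv_df, on="companyName", how="inner")
print(merged)
```

Because the join key is shared by name, on="companyName" could even be omitted; passing how="inner" explicitly just makes the join type obvious.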

To check that there were not any misspelled retailers, I created a series from the retailer names in the merged dataset and another series from the retailer names in the .csv dataset. I then combined these and used .drop_duplicates() to see which values appeared in only one of the two. I found that I had misspelled or incorrectly abbreviated 9 retailers. Once I corrected these records, the merged dataset gave all the information I required.
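One way to sketch that check: stack the two name series and drop every value that appears in both, so only the unmatched names survive. The names here are hypothetical — 'Luxe Dcor' plays the role of a typo the inner join would have dropped:

```python
import pandas as pd

# Hypothetical data: names that survived the merge vs. names in the .csv.
# 'Luxe Dcor' is a deliberate typo standing in for a misspelled retailer.
merged_names = pd.Series(["Acme Home", "Urban Nest"], name="companyName")
csv_names = pd.Series(["Acme Home", "Luxe Dcor", "Urban Nest"], name="companyName")

# keep=False drops ALL copies of any duplicated value, so whatever
# remains appears in only one of the two lists - i.e. a failed match.
leftovers = pd.concat([merged_names, csv_names]).drop_duplicates(keep=False)
print(leftovers.tolist())
```

Each leftover is a name to go fix in the .csv before re-running the merge.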

Some takeaways from this mini project were:

  • BeautifulSoup cannot extract text that a website generates client-side with JavaScript — it only sees the HTML the server initially returns
  • It was actually easier to extract the data from the JSON returned by the AJAX request than it would have been with BeautifulSoup
  • When merging datasets with Pandas, the join key columns must either be named the same or be specified in the .merge() function. The values must also match exactly (obvious, I know, but it created more work in my analysis)

An Industrial Engineer turned Data Scientist who is interested in all things technology! Find me on LinkedIn: https://www.linkedin.com/in/megandibble1/
