I set out to try using the Python library BeautifulSoup to get data on the retailers that would be attending a market, as shown on this webpage: https://www.americasmart.com/browse/#/exhibitor?market=23.
To find the AJAX request that returned the data I needed, I looked under the XHR and JS tabs in the Network section of Google Chrome's developer tools (see image below). Credit for this idea goes to this blog post: https://blog.hartleybrody.com/web-scraping-cheat-sheet/#useful-libraries. Hover over the "Name" fields, right click, and copy the link address, then paste it into the code below. There are a bunch of options here, but I just tried them one by one to see which retrieved the data I was looking for.
import requests

url = 'https://wem.americasmart.com/api/v1.2/Search/LinesAndPhotosByMarket?status=ACTIVE_AND_UPCOMING&marketID=23'
r = requests.get(url)
info = r.json()
For this website, the data was returned as a list of dictionaries. I had to play around with the keys and indexes to extract the data I needed. Ultimately, I wanted to create a Pandas DataFrame with location information about every retailer, and then merge it with a narrowed-down list of retailers housed in a .csv file. The link to the full code is on my github: https://github.com/mdibble2/Projects/blob/master/Web%2BScraping%2Bwith%2BAJAX%2Brequest%20(1).ipynb
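The parsing step looked roughly like this. Since the actual response schema isn't shown here, the field names below (lineName, building, showroom) are hypothetical stand-ins for whatever keys the API actually returns:

```python
import pandas as pd

# Hypothetical sample of the JSON response: a list of dictionaries,
# one per retailer. The real key names would come from inspecting r.json().
info = [
    {"lineName": "Acme Home Goods", "building": "Building 1", "showroom": "3-101"},
    {"lineName": "Bella Decor", "building": "Building 2", "showroom": "7-204"},
]

# A list of flat dictionaries converts directly into a DataFrame,
# with one column per key and one row per dictionary.
df = pd.DataFrame(info)
print(df.shape)  # (2, 3)
```

If the dictionaries are nested, pd.json_normalize() can flatten them first.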
I used the df.merge() function to select only the rows of data from the website that matched the retailers listed in the csv file (i.e. an inner join). The drawback to this approach is that every retailer needed to be spelled correctly with the correct capitalization in order to match the primary key of my first table, the one generated from the website.
To check that there were not any misspelled retailers, I created a series from the retailer names in the merged dataset and another series from the retailer names in the csv dataset. I then merged these together and used df.drop_duplicates to see if any unique values remained besides those in the merged dataset. I found that I had misspelled or incorrectly abbreviated 9 retailers. Once I corrected these records, the merged dataset gave all the information I required.
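The duplicate check described above can be sketched as follows (again with hypothetical names): concatenate the two name series, then drop every value that appears in both, leaving only the names present in one source but not the other:

```python
import pandas as pd

# Retailer names that survived the inner join (hypothetical).
merged_names = pd.Series(["Acme Home Goods", "Cozy Living"])

# Retailer names from the CSV, including one misspelling.
csv_names = pd.Series(["Acme Home Goods", "Cozy Living", "Coze Living"])

# keep=False drops ALL occurrences of any duplicated value, so whatever
# survives exists in only one of the two sources -- a likely misspelling.
combined = pd.concat([merged_names, csv_names])
leftovers = combined.drop_duplicates(keep=False)
print(leftovers.tolist())  # ['Coze Living']
```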
Some takeaways from this mini project were:
- It was actually easier to extract the data from the JSON response of the AJAX request than it would have been to parse the rendered HTML with BeautifulSoup
- When merging datasets with Pandas, the join key columns must either share the same name or be specified explicitly in the .merge() call. Their values must also match exactly (obvious, I know, but it created extra work in my analysis)
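On that last point, when the key columns are named differently in the two tables, .merge() takes left_on/right_on arguments. A minimal sketch with hypothetical column names:

```python
import pandas as pd

site_df = pd.DataFrame({"lineName": ["Acme Home Goods"], "showroom": ["3-101"]})
csv_df = pd.DataFrame({"Retailer": ["Acme Home Goods"]})

# The key columns have different names, so point .merge() at each one
# explicitly instead of renaming a column first.
merged = site_df.merge(csv_df, left_on="lineName", right_on="Retailer")
print(len(merged))  # 1
```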