Web Scraping Monster.com for Data Science Keywords

May 20, 2019

The other day I was reading up on web scraping because I realized my GitHub portfolio lacked any projects that required extracting information from HTML source. While Googling some examples, I came across this nice blog post by Jesse Steinweg-Woods about scraping Indeed.com for key skills that employers are looking for in data scientists. Jesse showed the graphic below by Swami Chandrasekaran to demonstrate the point that it would take a lifetime (or more) to master all the tools required to qualify for every data science job listing.

Rather than learn everything, we should learn the tools that have the greatest probability of ending up on the “requirements” list of a job posting. We can go to a website like Indeed and collect the keywords commonly used in data science postings. Then we can plot the keyword versus its frequency, as a function of the city in which one would like to work.

Jesse developed some nice code in Python to:

  • Construct a URL to the search results for job postings matching a given city and state (or nationwide, if none are specified)
  • Extract and tally keywords from data science job postings listed in the search results

Below is a plot of the percentage of job listings containing each term from a pre-defined list of keywords for NYC-based data scientist positions. Jesse plotted this in 2015, when he wrote his code (credit to him for the graphic).

percentage of data scientist job ads with key skill from Indeed.com, 2015

For my project I decided to apply a similar analysis to Monster.com, another job listings board. I was curious how similar the results would be to those from Indeed. Of course, it’s been four years since Jesse’s original analysis, so there are more variables in play than the change in platform. But I had a feeling R, Python, and SQL would remain the top three languages on the list.

The project is available on my GitHub and I describe my study in this Jupyter notebook, but I’ll explain the logic below.

Web Scraping

I used the requests library to fetch a page of search results for a specified job query and the Beautiful Soup library to parse it. Below, we’re looking up “Data Scientist” positions in “New York, NY”. I load the first 10 “pages” of results in a single request to avoid having to load more listings dynamically (by pressing the “Load more jobs” button):

import requests
from bs4 import BeautifulSoup

first_url = 'https://www.monster.com/jobs/search/?q=data-scientist&where=New-York__2C-NY&stpage=1&page=10'

response = requests.get(first_url)
soup = BeautifulSoup(response.text, 'html.parser')
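The search URL above is hard-coded, but the same pattern could be generated for other queries and cities. Here is a minimal sketch. The helper build_search_url is hypothetical, and its encoding rules (hyphens for spaces, "__2C" for the comma between city and state) are inferred only from the single example URL above, so other inputs may need different handling.

```python
def build_search_url(query, city, state, pages=10):
    """Build a Monster.com search URL shaped like the example URL above."""
    def slug(text):
        # Spaces appear as hyphens in the example URL
        return text.strip().replace(' ', '-')

    # "__2C" appears to stand in for the comma between city and state (assumption)
    where = f"{slug(city)}__2C-{slug(state)}"
    return (
        "https://www.monster.com/jobs/search/"
        f"?q={slug(query).lower()}&where={where}&stpage=1&page={pages}"
    )


# Rebuilds the URL used in this post
print(build_search_url('Data Scientist', 'New York', 'NY'))
```

Keeping the URL construction in one place makes it easy to rerun the same analysis for another city or job title.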

The soup object is further searchable for HTML tags with specified attributes that correspond to the fields we need. For example, each listing is stored in an HTML <section> with the unique ID stored in an attribute called data-jobid. The following code gets us the unique job IDs corresponding to each listing in the results:

all_listings = soup.find_all('section', attrs={'data-jobid': True})
ids = [item['data-jobid'] for item in all_listings]

The same approach gets us the company, job title, and location of each result. Each listing also contains a link leading to a static webpage that contains the job description. The URLs are stored as a link within an <h2> tag that has the class title. The following code gets all the URLs:

new_urls = [item.find('h2', attrs={'class': 'title'}).find('a', href=True)['href'] for item in all_listings]

In order to load the descriptions, we need to fetch the source from each of the above URLs. We will fetch the first URL and grab the description, which is stored in a <div> with the id JobDescription:

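job_url = new_urls[0]
response = requests.get(job_url)
response.raise_for_status()  # stop early on an HTTP error
soup = BeautifulSoup(response.text, 'html.parser')
job_body = soup.find('div', attrs={'id': 'JobDescription'})
# find() returns None when the div is missing, so guard before extracting text
desc = job_body.get_text(separator=' ') if job_body is not None else ''  # add whitespace between HTML tags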

We repeat this process for each listing obtained in a search. Now we’d like to explore how to parse the keywords in these descriptions.
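Repeating the fetch for every listing could be sketched as the loop below. This is not the code from my project, just an illustration: fetch_description is a hypothetical callable that wraps the requests and Beautiful Soup steps shown above, and the pause between requests is an added assumption to avoid hitting the site too quickly.

```python
import time


def collect_descriptions(urls, fetch_description, delay_seconds=1.0):
    """Fetch one description per URL, skipping listings that fail to load."""
    descriptions = {}
    for i, url in enumerate(urls):
        try:
            descriptions[url] = fetch_description(url)
        except Exception as err:  # e.g. a network error or a missing description
            print(f'Skipping {url}: {err}')
        # Pause between requests (but not after the last one)
        if i < len(urls) - 1:
            time.sleep(delay_seconds)
    return descriptions
```

Passing the fetcher in as an argument keeps the loop easy to test with a fake function instead of live requests.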

Keyword Parsing

Let’s say we have loaded all the descriptions. Now we want to calculate the frequency of each data science keyword across the listings. I went with the same list of keywords that Jesse used in his original study. The code is a bit long, but basically we break the description down into a list of lowercase words, filter out stopwords with the help of the NLTK library, and count each term. If all_words is the list of words extracted from the description of a single listing and DATA_SCI_KEYWORDS is the list of keywords, then we can construct a pandas DataFrame with the keywords and their frequencies as follows:

import pandas as pd
from collections import Counter

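# Count every word in this listing's description
freqs = Counter(all_words)
# Look up each keyword; a Counter returns 0 for words that never appear
out_dict = {x: freqs[x.lower()] for x in DATA_SCI_KEYWORDS}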
df = pd.DataFrame.from_dict(out_dict, orient='index', columns=['Frequency']).reset_index()
df = df.rename(columns={'index': 'Keyword'}).sort_values(by='Frequency', ascending=False).reset_index(drop=True)
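The word-splitting and stopword-filtering steps described above are not shown, so here is a minimal sketch of how they might look. It relies on several assumptions: STOPWORDS is a tiny stand-in for NLTK's English stopword list, KEYWORDS is a hypothetical short list rather than the full one, and the regular expression is my own choice. It also reports the share of listings mentioning each keyword, rather than raw word counts, to match the percentage plot from Indeed.

```python
import re
from collections import Counter

# Tiny stand-in for NLTK's English stopword list (assumption, keeps this self-contained)
STOPWORDS = {'a', 'an', 'and', 'in', 'is', 'of', 'or', 'the', 'to', 'with'}
# Hypothetical short keyword list; the real study uses a longer one
KEYWORDS = ['Python', 'R', 'SQL', 'Hadoop']


def tokenize(description):
    """Lowercase a description and split it into words, dropping stopwords."""
    # Keep '+' and '#' so terms like "c++" and "c#" survive as single tokens
    words = re.findall(r'[a-z][a-z0-9+#]*', description.lower())
    return [w for w in words if w not in STOPWORDS]


def keyword_percentages(descriptions, keywords=KEYWORDS):
    """Return the percentage of descriptions mentioning each keyword at least once."""
    hits = Counter()
    for description in descriptions:
        unique_words = set(tokenize(description))  # count a keyword once per listing
        hits.update(k for k in keywords if k.lower() in unique_words)
    total = len(descriptions)
    return {k: 100.0 * hits[k] / total if total else 0.0 for k in keywords}


sample = [
    'Experience with Python and SQL is required.',
    'We use R, Python, and Hadoop.',
]
print(keyword_percentages(sample))
```

One caveat with this kind of matching: single-letter keywords such as R also match stray letters (for example the "R" in "R&D"), so the counts for those terms are likely to be slightly inflated.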

In the full implementation, I return a DataFrame of the frequencies across all the listings obtained in a search. I ran this for Data Scientist jobs in the NYC area and created a bar plot of the top 10 keywords:

So Python, R, and SQL reign supreme in 2019! Python, though, now appears to be a more common requirement than R. Good thing I’ve made it my language of choice these days!

Check out the project on my GitHub, go through the Jupyter notebook, and let me know what you think!