
Overview of Machine Learning Data Extraction Tool
Machine learning is highly effective at sorting through data. Machine learning is a branch of Artificial Intelligence (AI) in which a digital machine processes an enormous amount of input data. By being exposed to millions of data points, the machine learns, improves, and evolves, and becomes able to make decisions when processing new and additional data. Python is the most popular language used in machine learning.
Example:
As an example, you can feed data from job sites into a computer system that is programmed to find the best job matches according to criteria that have worked well in the past. Once fresh information is supplied, the machine can pull out the desired data points with very high accuracy.
Machine learning depends on quality data in bulk quantities, and retrieving the desired information is usually a challenge.
In this article, we will show how to collect the desired data from the World Wide Web (www). The data will be collected from job sites for three parameters: the job title, the company website, and the name of the company offering the job. For simplicity's sake, the tool crawls job sites and extracts only the data associated with these three parameters.
The software tool (a bot, spider, or crawler) is written in the Python programming language and takes advantage of a framework known as "Scrapy." Scrapy provides a settings file in which one can configure specific parameters that help the tool run properly. Each job site requires a slightly different version of the spider code; to keep things focused, we provide the code for a tool that crawls the Glassdoor site and extracts the desired data.
The code below sets up the parameters in Scrapy's settings file.
Code for the Settings File
# -*- coding: utf-8 -*-
# Scrapy settings for glassdoorjobs project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'glassdoorjobs'
SPIDER_MODULES = ['glassdoorjobs.spiders']
NEWSPIDER_MODULE = 'glassdoorjobs.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
URLLENGTH_LIMIT = 4000
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 1
Explanation of the Code
The setting values that were altered from their defaults are explained below.
BOT_NAME = 'glassdoorjobs'
BOT_NAME is the name of the crawler and can be set to anything. Here we use the name "glassdoorjobs" for relevance.
SPIDER_MODULES = ['glassdoorjobs.spiders']
SPIDER_MODULES is the list of modules where Scrapy will look for spiders. In our case, the list contains only one module, "glassdoorjobs.spiders."
NEWSPIDER_MODULE, which is empty ('') by default, is the module in which new spiders are created when you use the "genspider" command; here it is set to "glassdoorjobs.spiders" as well.
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
Here a Chrome user-agent string is defined; it identifies the browser to the server (i.e., which browser is in use). This string is sent in the User-Agent HTTP header with every request made to the job site (in our case, Glassdoor.com).
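If needed, the same header can also be set on an individual request rather than globally; the following is a minimal sketch (the URL is simply the job-search page used later in this article):
import scrapy

request = scrapy.Request(
    url='https://www.glassdoor.com/Job/jobs.htm',
    headers={
        # Overrides the global USER_AGENT setting for this request only
        'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/67.0.3396.99 Safari/537.36'),
    },
)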
ROBOTSTXT_OBEY = False
Each site on the internet has a text file for robots (also known as spiders, crawlers, or bots in this context), called robots.txt, which tells robots which parts of the site they are allowed to visit. On many sites, access to the pages containing the useful information is restricted in this file, so we set the value to "False" to have our robot skip it.
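As a rough illustration of what robots.txt controls (this check is not part of the spider itself), Python's standard library can read and query a site's robots.txt:
from urllib.robotparser import RobotFileParser

# Download and parse the site's robots.txt
rp = RobotFileParser('https://www.glassdoor.com/robots.txt')
rp.read()
# True if a generic crawler ('*') may fetch the given URL according to robots.txt
print(rp.can_fetch('*', 'https://www.glassdoor.com/Job/jobs.htm'))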
URLLENGTH_LIMIT = 4000
The URL length limit is explicitly set to "4000" characters because Scrapy's default limit is 2083 characters and URLs longer than that are ignored; the job-search URLs used here can exceed that default.
Creating the Spider for Data Extraction
A spider was written to extract data from the job site.
import scrapy

class JobsSpider(scrapy.Spider):
    name = "jobs_spider"

    def start_requests(self):
        urls = [
            'https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=false&clickSource=searchBtn&typedKeyword=&sc.keyword=&locT=C&locId=3206630&jobType='
            #'https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=false&clickSource=searchBtn&typedKeyword=manager&sc.keyword=manager&locT=C&locId=3206630&jobType='
        ]
The first URL in the "urls" list represents a search for all jobs in Lahore (locId=3206630), and the second, commented-out URL represents a search for jobs containing the keyword "manager" in Lahore.
The programmer then creates a class named "JobsSpider," which inherits from "Spider," a base class provided by the Scrapy library.
The programmer creates the function “start_requests.”
To obtain such URLs, search the job website as you normally would; for example, if you are looking for a "Manager" vacancy, search for "manager" and fill in the other appropriate fields, then copy and paste the resulting URL into the code as shown in the snippet above. A sketch of building the same URL programmatically follows.
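As a minimal sketch, the search URL can also be built programmatically instead of copied from the browser; the parameter names below are simply those visible in the URL above:
from urllib.parse import urlencode

params = {
    'suggestCount': 0,
    'suggestChosen': 'false',
    'clickSource': 'searchBtn',
    'typedKeyword': 'manager',  # search keyword
    'sc.keyword': 'manager',
    'locT': 'C',
    'locId': 3206630,           # location id for Lahore
    'jobType': '',
}
url = 'https://www.glassdoor.com/Job/jobs.htm?' + urlencode(params)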
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_jobs_links)
In this code, the spider takes each URL from the "urls" list and generates a request for it. When the server's response arrives, the callback "parse_jobs_links" is executed to parse the job links.
    def parse_jobs_links(self, response):
        job_links = response.xpath('//body/div[@class[starts-with(.,"pageContentWrapper")]]').xpath('.//div[@id[starts-with(.,"JobResults")]]').xpath('.//ul[@class[starts-with(.,"jlGrid")]]').xpath('.//div[@class[starts-with(.,"logoWrap")]]/a/@href').extract()
The function "parse_jobs_links" is defined with the parameter "response." The "response" object is queried using the "XPath" language. XPath is a language for navigating through the elements and attributes of an XML document, and it can be used with HTML as well; it has over 200 built-in functions. After chaining several XPath expressions, the extract() function returns the matched data, in this case the links to the individual job postings. A small, self-contained example of these selectors follows.
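To illustrate how such chained selectors behave, here is a small, runnable sketch; the HTML snippet is invented for illustration and is far simpler than Glassdoor's real markup:
from scrapy import Selector

# Invented HTML for illustration only
html = '''
<ul class="jlGrid">
  <li><div class="logoWrap"><a href="/partner/jobListing.htm?id=111">Job 1</a></div></li>
  <li><div class="logoWrap"><a href="/partner/jobListing.htm?id=222">Job 2</a></div></li>
</ul>
'''

sel = Selector(text=html)
# extract() returns every match as a list of strings
links = sel.xpath('//div[@class[starts-with(.,"logoWrap")]]/a/@href').extract()
print(links)  # ['/partner/jobListing.htm?id=111', '/partner/jobListing.htm?id=222']
# extract_first() returns only the first match (or None if there is no match)
print(sel.xpath('//div[@class[starts-with(.,"logoWrap")]]/a/@href').extract_first())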
        for link in job_links:
            yield scrapy.Request(url='https://www.glassdoor.com' + link, callback=self.parse_job_data)
This snippet issues a request for every link in the "job_links" list. When the response for a link is received, the callback "parse_job_data" is executed.
        next_page = response.css('li.next a').xpath('./@href').extract_first()
        print("url:", next_page)
The "css" in the above code is similar to "xpath": it allows us to select elements in the HTML page. The "xpath" and "css" functions can be used separately; here, the programmer chains the two. The code looks for the link behind the "NEXT" button at the bottom of the job site's results page. A short sketch comparing the two selector styles follows.
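As a rough illustration of how the two selector styles can express the same query (again using invented HTML rather than Glassdoor's real markup):
from scrapy import Selector

html = '<ul><li class="next"><a href="/Job/jobs.htm?p=2">Next</a></li></ul>'
sel = Selector(text=html)

# A CSS selector chained with an XPath step, as in the spider above
print(sel.css('li.next a').xpath('./@href').extract_first())     # /Job/jobs.htm?p=2
# The same query written purely in XPath
print(sel.xpath('//li[@class="next"]/a/@href').extract_first())  # /Job/jobs.htm?p=2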
        if next_page:
            yield scrapy.Request(url='https://www.glassdoor.com' + next_page, callback=self.parse_jobs_links)
Once the spider has parsed a results page, it looks for the "next page" link. If one is found, it issues a request for that page with "parse_jobs_links" as the callback once more, so the required data is fetched from the new page as well.
    def parse_job_data(self, response):
        job_title = response.xpath('//div[@class[contains(.,"JobTitle")]]/h2/text()').extract_first()
The programmer defines a new function, parse_job_data, which fetches the job title (the first required piece of information) by applying "xpath" to the source code of the job page.
        company_name = response.xpath('//span[@class[starts-with(., "strong ib")]]/text()').extract_first()
The same parse_job_data function gets the company name using a similar "XPath" expression.
        company_website = response.css('div#CompanyContainer').xpath('.//span[@class[contains(.,"value website")]]').xpath('.//a[@href]//text()').extract_first()
Similarly, CSS and XPath selectors are combined to extract the company's website.
        if company_name:
            company_name = company_name.lstrip()
        else:
            company_name = ''
        yield {
            'job_title': job_title,
            'company_name': company_name,
            'company_website': company_website
        }
The "if" condition removes any spaces at the beginning of the company name after it is retrieved from the source code. The yielded dictionary is where we finally collect the job titles, company names, and company websites.
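The items yielded by the spider are usually collected by running the crawler and exporting a feed, for example from the command line (scrapy crawl jobs_spider -o jobs.json). As a minimal sketch, the same can be done from a Python script; the FEEDS setting used here is available in recent Scrapy versions, and the snippet assumes the JobsSpider class above is defined in (or imported into) the same script:
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'USER_AGENT': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/67.0.3396.99 Safari/537.36'),
    'ROBOTSTXT_OBEY': False,
    'URLLENGTH_LIMIT': 4000,
    'FEEDS': {'jobs.json': {'format': 'json'}},  # write the yielded items to jobs.json
})
process.crawl(JobsSpider)  # the spider class defined above
process.start()            # blocks until the crawl finishes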
In brief, this is how to scrape and collect data from the World Wide Web (www). You can use this technique to collect data for your next machine learning project.
To discuss our services, expertise and how we can help you, please contact us.