Web Scraping using Python

Mohit Nakhale
6 min read · Jun 29, 2021

What is Web Scraping?

The practice of collecting and processing data from web pages using computer software is known as web scraping. It’s an effective method for generating datasets for study and education.

Web scraping is a way of extracting huge volumes of data from websites in an automated manner. The majority of this data is unstructured HTML data that is transformed to structured data in a spreadsheet or database before being used in various applications.

How Web Scraping Works

Web scrapers can retrieve all of the data on a site or only the information that a user is looking for. It’s preferable to define the data you’re looking for so that the web scraper pulls only that information, and does it quickly. For example, you might want to scrape an Amazon page listing the many types of juicers available, but you only need the different juicer models, not the user reviews.

When a web scraper has to scrape a site, it is first given the URLs of the sites it needs to scrape. The HTML code for those sites is then loaded, and a more complex scraper may extract all of the CSS and JavaScript components as well.

The scraper then extracts the needed data from the HTML code and outputs it in the user-specified format. The data is usually stored in the form of an Excel spreadsheet, CSV file or text file but it may also be saved in other forms like a JSON file.


In this blog post, we’ll scrape the Times Jobs website and fetch the job posts that are relevant to us, based on a technology stack of our choice.

Steps & Execution

  • Select a website to scrape and check whether it allows scraping. To check, we can send a GET request to the website we wish to scrape and see if we get a response code of 200.
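A minimal sketch of that check (the search URL below is a hypothetical example, not the exact one used in this post; the live request is left commented out):

```python
import requests

def is_scrapable(status_code: int) -> bool:
    # A 200 (OK) status code means the server served the page successfully.
    return status_code == 200

# Hypothetical Times Jobs search URL -- substitute the page you want to scrape:
# response = requests.get(
#     "https://www.timesjobs.com/candidate/job-search.html?txtKeywords=python",
#     timeout=10,
# )
# print(is_scrapable(response.status_code))
```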
  • Let’s fetch the HTML source of the website using the text attribute of the response.
  • In order to parse the HTML content into Python objects, we’ll make use of the Beautiful Soup library and the lxml parser.
  • Here, we’re using the lxml parser instead of using the default HTML parser, because the lxml parser can also deal well with broken HTML code, where the default HTML parser fails.
  • Let’s create a Beautiful Soup object and parse the HTML code using the lxml parser. To create one, we’ll first have to import BeautifulSoup from the bs4 library.
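Putting these steps together, a minimal sketch — using an inline HTML string (with a deliberately unclosed tag) in place of a live response, so it stays self-contained; in the real script, html_text would be requests.get(url).text:

```python
from bs4 import BeautifulSoup

# Stand-in for requests.get(url).text; note the unclosed <p> tag.
html_text = "<html><body><h1>Job Search</h1><p>Broken markup</body></html>"

soup = BeautifulSoup(html_text, "lxml")  # lxml copes with the broken markup
print(soup.h1.text)  # -> Job Search
```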
  • In order to collect selective data from the website, we need to inspect it using the developer tools offered by the browser. Here, we’ll filter the HTML elements by the class applied to them.
  • Inspecting the page, we can see that the job posts listed on the website use the <li> tag and have the class ‘clearfix job-bx wht-shd-bx’ applied to them. To fetch all such job posts, we’ll call the find_all() method with ‘li’ and ‘clearfix job-bx wht-shd-bx’ as arguments and store the results in a jobs list.
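A sketch of that call, run against a stripped-down stand-in for the listing markup (the sample list items are made up):

```python
from bs4 import BeautifulSoup

# Stand-in markup mirroring the structure described above.
html_text = """
<ul>
  <li class="clearfix job-bx wht-shd-bx">Job post 1</li>
  <li class="clearfix job-bx wht-shd-bx">Job post 2</li>
  <li class="sponsored-bx">Not a job post</li>
</ul>
"""

soup = BeautifulSoup(html_text, "lxml")
# Passing the full class string matches elements whose class attribute
# is exactly this string, so the sponsored item is skipped.
jobs = soup.find_all("li", class_="clearfix job-bx wht-shd-bx")
print(len(jobs))  # -> 2
```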
  • Now that we have a list of all the job posts, let’s loop through each one of them and extract the ‘Company Name’, ‘Required Skills’, ‘Link for the Job Description’, ‘Published Date’.
  • We’ll store each required field in its own variable. Let’s look at how each one is extracted.
  • Inspecting a job post in the developer tools, we can see that the Company Name is listed in the <h3> tag and has the class ‘joblist-comp-name’ applied to it.
  • So, we’ll apply the find() method to our job variable which contains a single job post.
  • We’ll pass the ‘h3’ & ‘joblist-comp-name’ class to this method to fetch the company name.
  • We’ll use the text attribute to fetch the string of the company name.
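A sketch of this extraction on a single (made-up) job post; the .strip() call just tidies the surrounding whitespace:

```python
from bs4 import BeautifulSoup

# A single job post, as held by the `job` loop variable above.
job = BeautifulSoup(
    '<ul><li class="clearfix job-bx wht-shd-bx">'
    '<h3 class="joblist-comp-name"> Acme Corp </h3></li></ul>',
    "lxml",
).li

company_name = job.find("h3", class_="joblist-comp-name").text.strip()
print(company_name)  # -> Acme Corp
```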
  • Similarly, we can see that the Key Skills are listed in a <span> tag that has the class ‘srp-skills’ applied to it.
  • So, we’ll again make use of the find() method on our job variable and pass the ‘span’ & ‘srp-skills’ class to this method to fetch the key skills mentioned in the job post.
  • We’ll again make use of the text attribute.
  • We have two variables here, namely skills and prettified_skills.
  • The value of the skills variable is a string that contains a lot of whitespace. To remove it, we strip the leading and trailing whitespace, replace the intermediate whitespace with an empty string, and store the resulting skills in the prettified_skills list.
  • One advantage of doing this is that we can easily filter out job posts that fall outside our tech stack.
  • Here, we can see that the Job Description Link is an anchor tag <a>.
  • This anchor tag falls under an <h2> tag which itself falls under the <header> tag.
  • Rather than using the find() method, we’ll make use of the dot operator. Check the code snippet below.
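A sketch of the dot-operator approach (the job URL here is made up):

```python
from bs4 import BeautifulSoup

job = BeautifulSoup(
    '<ul><li><header class="clearfix"><h2>'
    '<a href="https://www.timesjobs.com/job-detail/12345">Python Developer</a>'
    '</h2></header></li></ul>',
    "lxml",
).li

# Walk down the DOM tree with the dot operator (header -> h2 -> a)
# and read the anchor's href attribute.
job_link = job.header.h2.a["href"]
print(job_link)  # -> https://www.timesjobs.com/job-detail/12345
```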
  • Here, we’ve used the dot operator to fetch the value of the job link, following the hierarchical order of the DOM tree: the header tag first, then the h2 tag and then the anchor tag.
  • We’re accessing the ‘href’ attribute of the anchor tag to scrape the link of the job and storing it in the job_link variable.
  • We see that the Published Date is present inside a <span> tag having a class ‘sim-posted’.
  • Here, instead of passing the keyword argument of the class name, we’re passing a dictionary that contains the class name. This approach allows us to filter the required information based on multiple HTML attributes like style, name, value, type, etc.
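A sketch of the dictionary-based lookup on a made-up post:

```python
from bs4 import BeautifulSoup

job = BeautifulSoup(
    '<ul><li><span class="sim-posted"><span>Posted few days ago</span></span></li></ul>',
    "lxml",
).li

# A dictionary of attributes instead of the class_ keyword argument;
# the same pattern filters on any attribute (style, name, type, ...).
published_date = job.find("span", {"class": "sim-posted"}).text.strip()
print(published_date)  # -> Posted few days ago
```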

Now that we have retrieved all the information that we need, let’s define our technology stack and filter out job posts which do not contain any of the skills mentioned in our technology stack.
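A sketch of that filter, with a made-up tech stack and sample skills:

```python
# Hypothetical tech stack -- list the skills you care about.
my_tech_stack = ["python", "django", "sql"]

# Skills scraped from one job post (sample data).
prettified_skills = ["python", "rest", "aws"]

# True if at least one of our skills appears in the job post's skills.
skills_check = any(skill in prettified_skills for skill in my_tech_stack)
print(skills_check)  # -> True
```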

  • In the code snippet above, we’ve declared a global list named my_tech_stack which contains the skills of our choice.
  • Within the for loop, we’ve made use of the any() function. It checks whether any of the skills mentioned in the my_tech_stack list are present in the prettified_skills list; if there’s a match it returns True, otherwise False. The result gets stored in the skills_check variable.
  • Now, if there’s a skill match, i.e. the value of the skills_check variable turns out to be True, then we print the job post on the terminal.
  • Let’s add some extra functionality: running this Python script periodically (every 15 minutes) to fetch the job posts from the website and write the posts that meet our requirements to a text file.
  • The final code snippet looks as follows
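Piecing the steps above together, a sketch of the full script might look like this. The URL, tech stack, and output file name are placeholders of our choosing, and the 15-minute scheduling loop is left commented out so the script doesn’t run unattended by accident:

```python
import time

import requests
from bs4 import BeautifulSoup

# Hypothetical values -- adjust the URL, stack, and output path for your search.
URL = "https://www.timesjobs.com/candidate/job-search.html?txtKeywords=python"
MY_TECH_STACK = ["python", "django", "sql"]
OUTPUT_FILE = "matched_jobs.txt"


def parse_jobs(html_text):
    """Extract (company, skills, link, posted) tuples from a listing page."""
    soup = BeautifulSoup(html_text, "lxml")
    results = []
    for job in soup.find_all("li", class_="clearfix job-bx wht-shd-bx"):
        company = job.find("h3", class_="joblist-comp-name").text.strip()
        skills = job.find("span", class_="srp-skills").text
        prettified_skills = skills.strip().replace(" ", "").split(",")
        job_link = job.header.h2.a["href"]
        published_date = job.find("span", {"class": "sim-posted"}).text.strip()
        results.append((company, prettified_skills, job_link, published_date))
    return results


def scrape_once():
    """Fetch the listing page and append matching posts to the text file."""
    html_text = requests.get(URL, timeout=10).text
    with open(OUTPUT_FILE, "a") as out:
        for company, skills, link, posted in parse_jobs(html_text):
            if any(skill in skills for skill in MY_TECH_STACK):
                out.write(f"{company} | {', '.join(skills)} | {link} | {posted}\n")


# To run the scraper every 15 minutes:
# while True:
#     scrape_once()
#     print("Waiting 15 minutes...")
#     time.sleep(15 * 60)
```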

You can find the link to this code here. Thank you.