Web Scraping 101 with BS4
I am going to teach you some simple web scraping. This isn't rocket science, and if you know basic Python it will be very easy to follow along.
For this tutorial, I am going to scrape the Coronavirus statistics bar at the top of the popular news website Mihaaru.
We will be using:
1) Google Chrome
2) Python 3
The packages we will be using are:
bs4 - pip install bs4
requests - pip install requests
lxml - pip install lxml
The requests package will be used to download the page's full HTML. BeautifulSoup, from bs4, will be used to parse that HTML. We will then use the parsed HTML object to find specific elements and read their text values.
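Here is a minimal sketch of those two steps, fetching and parsing. The URL and the function names fetch_html and parse_html are my own assumptions for illustration, so adjust them to the page you are actually scraping.

```python
import requests
from bs4 import BeautifulSoup

# Assumed address of the Mihaaru homepage; change it if the site has moved.
URL = "https://mihaaru.com"


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return response.text


def parse_html(html, parser="lxml"):
    """Turn raw HTML into a searchable BeautifulSoup object."""
    return BeautifulSoup(html, parser)


# Typical usage (needs network access, so it is left commented out):
# soup = parse_html(fetch_html(URL))
```

Keeping the download and the parsing in separate functions makes it easy to test the parsing part against a saved copy of the page without hitting the website every time.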
To scrape the top bar, we first need to find the div or container it sits in. Using a browser (Google Chrome in our case), I inspect the text we are going to scrape.
What we find is a list inside a div that has the class coronavirus-special-coverage-item.
Now, by looking at the HTML, we have figured out some important things:
1) This is the second div with the same class name
2) All the details are inside <li> tags
3) Within each <li>, the details are inside <span> tags with the classes number and clabel
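The three observations above can be sketched as follows. This is not the original script, just an illustration of the idea. SAMPLE_HTML is made-up markup that mirrors the structure described (the labels and numbers are invented), and extract_stats is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up sample that mirrors the structure described above;
# the real page's markup, labels and numbers will differ.
SAMPLE_HTML = """
<div class="coronavirus-special-coverage-item"><p>Some other block</p></div>
<div class="coronavirus-special-coverage-item">
  <ul>
    <li><span class="number">100</span><span class="clabel">Confirmed</span></li>
    <li><span class="number">40</span><span class="clabel">Recovered</span></li>
  </ul>
</div>
"""


def extract_stats(html, parser="html.parser"):
    """Return {label: number} taken from the second matching div."""
    soup = BeautifulSoup(html, parser)
    divs = soup.find_all("div", class_="coronavirus-special-coverage-item")
    if len(divs) < 2:
        return {}  # the page layout is not what we expected
    stats = {}
    # Observation 1: the stats live in the second div, so use index 1.
    for item in divs[1].find_all("li"):  # observation 2: one <li> per stat
        # Observation 3: the value and its label are in two <span> tags.
        number = item.find("span", class_="number")
        label = item.find("span", class_="clabel")
        if number is not None and label is not None:
            stats[label.get_text(strip=True)] = number.get_text(strip=True)
    return stats


print(extract_stats(SAMPLE_HTML))
# prints {'Confirmed': '100', 'Recovered': '40'}
```

The values come back as strings; convert them with int() if you need to do arithmetic on them, after removing any thousands separators the site may use.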
That is all the information we need to write the script. So I'll go ahead and write the script for you.
The code looks a bit heavy due to the comments, but if you read carefully you should be able to follow what I am doing there.
Code snippet/Gist can be found here.
Repl.it for the code can be found here. I used html.parser here instead of lxml but it shouldn't make a difference.
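If you are not sure whether lxml is installed where your script runs, one option is to fall back to the built-in parser. The sketch below assumes a hypothetical helper called make_soup. It relies on bs4 raising FeatureNotFound when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred="lxml"):
    """Parse with the preferred parser, falling back to the built-in one."""
    try:
        return BeautifulSoup(html, preferred)
    except FeatureNotFound:
        # bs4 raises this when the requested parser is not installed.
        return BeautifulSoup(html, "html.parser")


snippet = '<ul><li><span class="number">7</span></li></ul>'
soup = make_soup(snippet)
print(soup.find("span", class_="number").get_text())
# prints 7
```

For small, well-formed markup like this, both parsers find the same elements. They can differ in how they repair badly broken HTML, so stick to one parser once your script works.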
And if you need any help with scraping or running this, do contact me via Telegram on Baivaru Tech Tips