Day 1 09/05
As my first approach to web scraping, I decided to read blog posts and general overviews of these methods and the Python libraries involved. I wanted a general perspective on what these processes are, what tools are available, and how to prioritize my learning process in a realistic way.
I don’t have a strong foundation in Python, but I have heard that I can work reasonably well with some of its packages. This is the main concern I want to address with this general overview.
Resource 1:
Kokatjuhha, J. (2018, January 9). Web Scraping Tutorial with Python: Tips and Tricks. Retrieved September 5, 2018, from https://hackernoon.com/web-scraping-tutorial-with-python-tips-and-tricks-db070e70e071
This resource provides a general perspective on web scraping: definitions, use cases, tools, and practical tips (how to speed up, dos and don’ts, how not to be detected, etc.). The author also provides plenty of links for further work, as well as Python snippets, and uses the following definition: “Web scraping is a technique used to extract data from websites through an automated process.”
Takeaways:
· There is no universal solution for web scraping. The data is unstructured, so depending on the context different strategies might apply.
· Understanding the website structure is important. By the same token, inspecting a website in the browser is very useful for working out that structure. Some knowledge of HTML tags helps with more complex structures and will remain useful in the future.
· Scraping is not always practical or legal. This is important because I wanted to automatically scrape some content inside a MOOC course for research purposes, even though the site’s terms and conditions ask users not to scrape. I’ll have to take that into account in my strategy and in the ethical considerations for my research.
o This inspired me to look for specific work on scraping edX and MOOC content.
· Some sites have rules for scraping that can be seen in robots.txt. A good practice is to read and follow them.
· Before scraping, always check if there is a public API available.
· Two main tools are recommended:
o Scrapy: a stand-alone, ready-to-use data-extraction framework.
o BeautifulSoup with Requests: BeautifulSoup is a library that lets you parse the HTML source code in a clean way. Along with it you need the Requests library to fetch the content of the URL (a minimal sketch combining both appears after this list).
o The author chose BeautifulSoup in order to be forced to figure out a lot of things that Scrapy handles on its own, hoping to learn faster from her mistakes. This sounds like good advice for prioritizing my own learning.
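To make the robots.txt and BeautifulSoup/Requests points above more concrete, here is a minimal sketch of my own (not code from the article): it checks robots.txt first and then fetches and parses a page. The URL is a hypothetical placeholder.

```python
# Minimal sketch, assuming beautifulsoup4 and requests are installed
# (pip install beautifulsoup4 requests). The URL below is a placeholder.
from urllib.robotparser import RobotFileParser

import requests
from bs4 import BeautifulSoup

url = "https://example.com/some-page"  # hypothetical page to scrape

# Read the site's robots.txt and check whether fetching this URL is allowed.
robots = RobotFileParser("https://example.com/robots.txt")
robots.read()

if robots.can_fetch("*", url):
    response = requests.get(url, timeout=10)      # Requests fetches the raw HTML
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")  # BeautifulSoup parses it
    print(soup.title.get_text() if soup.title else "(no <title> found)")
    for link in soup.find_all("a"):               # list the links on the page
        print(link.get("href"))
else:
    print("robots.txt disallows scraping this URL")
```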
Resource 2:
5 Tasty Python Web Scraping Libraries. (n.d.). Retrieved September 5, 2018, from https://elitedatascience.com/python-web-scraping-libraries
This resource focuses on five Python libraries, which are described in order of increasing complexity:
· The Farm: Requests
· The Stew: Beautiful Soup 4
· The Salad: lxml
· The Restaurant: Selenium
· The Chef: Scrapy
Takeaways:
· Lots of tutorials, courses, and resources are provided for each library.
· There is no need to learn all of them, although Requests is the most strongly recommended, because it determines how you communicate with websites.
o Requests is an HTTP library, which means you can use it to access web pages.
· BeautifulSoup and lxml are presented as rough equivalents, good enough for scraping a few pages.
o Beautiful Soup (BS4) is a parsing library that can use different parsers. A parser is simply a program that can extract data from HTML and XML documents.
o One advantage of BS4 is its ability to automatically detect encodings. This allows it to gracefully handle HTML documents with special characters.
· Learning Scrapy is recommended if you need to build a real spider or web crawler, instead of just scraping a few pages here and there. It can be used to manage requests, preserve user sessions, follow redirects, and handle output pipelines (a minimal spider sketch follows this list).
o Scrapinghub - Cloud-based crawling service by the creators of Scrapy. The first cloud unit is free.
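As a concrete picture of what such a spider looks like, here is a minimal Scrapy spider sketch following the pattern in Scrapy’s own documentation; the practice site quotes.toscrape.com and its CSS selectors are my illustration choices, not something taken from this resource.

```python
# Minimal Scrapy spider sketch (my own illustration, not the resource's code).
# Save as quotes_spider.py and run with:
#   scrapy runspider quotes_spider.py -o quotes.json
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # quotes.toscrape.com is a public practice site used in Scrapy's tutorial
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract each quote block and yield it as a structured item
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link; Scrapy manages the request queue itself
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

This is exactly the kind of work (request scheduling, following links, structured output) that the article says Scrapy automates away compared with plain BeautifulSoup.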
Resource 3:
Web Scraping Using Python. (2018, July 26). Retrieved September 5, 2018, from https://www.datacamp.com/tutorials/web-scraping-using-python
This resource is a tutorial that shows how to extract data from the web, manipulate and clean it using Python’s Pandas library, and visualize it using Python’s Matplotlib library.
Main goals:
· Data extraction from the web using Python’s Beautiful Soup module.
· Data manipulation and cleaning using Python’s Pandas library.
· Data visualization using Python’s Matplotlib library.
Using specific modules, the tutorial walks through Python snippets for each of these steps. The extraction part seems the most important for me right now. The goal of the tutorial is to take a table from a webpage and convert it into a DataFrame for easier manipulation in Python, which sounds like a very reasonable first learning goal (sketched below).
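As a rough sketch of that “table on a webpage into a DataFrame” idea (my own illustration, not the tutorial’s code; the URL and table layout are hypothetical), the flow could look like this:

```python
# Minimal sketch: scrape an HTML table and turn it into a pandas DataFrame.
# Assumes the page contains at least one <table> whose first row holds headers.
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://example.com/results"  # hypothetical page with an HTML table

html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")  # grab the first table on the page

# Collect the text of every cell, row by row
rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")]

# First row as column names, remaining rows as data
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df.head())  # the DataFrame is now ready for cleaning with pandas
```

When the table is well formed, pandas.read_html(url) can do much the same in one call, but assembling the DataFrame manually with BeautifulSoup is closer to the extraction step I want to practice.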
Resource 4:
Hassan, K. (2016, April 29). Extracting data from websites using Scrapy. Retrieved September 5, 2018, from https://medium.com/@kaismh/extracting-data-from-websites-using-scrapy-e1e1e357651a#.sw7c9ycio
The author provides a detailed tutorial on how to scrape an e-commerce site for business insights. The author also provides a link to a paper that compares a diverse set of open-source crawlers, one of which is Scrapy.
It is interesting to note that, judging by the comments, the tutorial seems to have replication problems. Some commenters attribute this to changes in the original website that was scraped in the first place.
General insights
I gained a general sense of what it means to scrape a website and of the different layers of the process. I want to focus on extracting information from websites; therefore, starting with Python’s Beautiful Soup seems the best choice. It also seems to maximize the learning process that would probably be automated away by Scrapy. I will start by watching Beautiful Soup and Requests tutorials to get more hands-on practice.
I feel that I need to rework my Python foundation in order to take advantage of the library, so I will complement my learning by finishing the DataCamp course on Python, as well as getting a better understanding of the pandas module installed with Anaconda. After this, I’ll go back to the DataCamp tutorial mentioned as the third resource and run each part of the code provided in the notebook.
Finally, learning that scraping can be a grey area in legal and ethical terms is relevant to my future research purposes.