Saturday 13:00–13:45

Python tools for webscraping

Jose Manuel Ortega

Audience level: Intermediate

Description

When we want to automate the extraction of a website's content, we often find that the site does not offer an API for the data we need, so scraping techniques are required to retrieve the data automatically. Some of the most powerful tools for extracting data from web pages can be found in the Python ecosystem.

Abstract

Introduction to webscraping

Web scraping is the process of collecting or extracting data from web pages automatically. Nowadays it is a very active field that shares goals with the semantic web, natural language processing, artificial intelligence and human-computer interaction.

Python tools for webscraping

Some of the most powerful tools for extracting data can be found in the Python ecosystem, among which we highlight Beautiful Soup, Webscraping, PyQuery and Scrapy.

Comparison between webscraping tools

The tools mentioned will be compared, showing the advantages and disadvantages of each and highlighting the mechanisms each one offers for data extraction, such as regular expressions, CSS selectors and XPath expressions.
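
As a rough illustration of how two of these extraction techniques differ, the sketch below pulls the same links out of a small page fragment, once with a regular expression and once with an XPath expression. It uses only the standard library: `xml.etree.ElementTree` supports a limited XPath subset and requires well-formed markup, whereas CSS selectors need a third-party library such as Beautiful Soup or PyQuery and are left out here. The page fragment and the function names are assumptions made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """<html><body>
<ul id="talks">
  <li class="talk"><a href="/talks/1">Python tools for webscraping</a></li>
  <li class="talk"><a href="/talks/2">Another talk</a></li>
</ul>
</body></html>"""


def links_with_regex(html):
    # Regular expression: quick to write, but brittle when the markup changes.
    return re.findall(r'<a href="([^"]+)">([^<]+)</a>', html)


def links_with_xpath(html):
    # XPath (ElementTree subset): navigates the parsed tree instead of raw text.
    root = ET.fromstring(html)
    anchors = root.findall(".//li[@class='talk']/a")
    return [(a.get("href"), a.text) for a in anchors]


regex_links = links_with_regex(PAGE)
xpath_links = links_with_xpath(PAGE)
print(regex_links)
print(xpath_links)
```

Both functions return the same list of (href, text) pairs for this fragment. The regular expression depends on the exact attribute order and quoting, while the XPath version depends on the document structure.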

Project example with Scrapy

Scrapy is a framework written in Python for automated data extraction that can be used for a wide range of applications, such as data mining. When using Scrapy we have to create a project, and each project consists of:

1. Items: the elements to be extracted.
2. Spiders: the heart of the project, where we define the data extraction procedure.
3. Pipelines: the procedures that process the extracted elements, such as data validation and cleansing of HTML code.
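
The division of labour between the three parts listed above can be sketched in plain Python. This is an analogy only, not Scrapy's API: in a real project the item would subclass `scrapy.Item`, the spider would subclass `scrapy.Spider`, and the pipeline would be a class with a `process_item` method. The names `TalkItem`, `talk_spider`, `clean_pipeline` and the page content are hypothetical.

```python
import re
from dataclasses import dataclass
from html import unescape

# Hypothetical page content; a real spider would download it.
PAGE = (
    "<li><b> Python tools &amp; scraping </b></li>"
    "<li><b></b></li>"
    "<li><b>Scrapy intro</b></li>"
)


@dataclass
class TalkItem:
    # Item: declares which fields we want to extract.
    title: str


def talk_spider(html):
    # Spider: knows where the data lives in the page and yields raw items.
    for raw_title in re.findall(r"<b>(.*?)</b>", html):
        yield TalkItem(title=raw_title)


def clean_pipeline(items):
    # Pipeline: cleanses each item and drops the ones that fail validation.
    for item in items:
        title = unescape(item.title).strip()
        if title:  # validation step: discard items with an empty title
            yield TalkItem(title=title)


talks = list(clean_pipeline(talk_spider(PAGE)))
print(talks)
```

Keeping extraction (the spider) separate from cleaning and validation (the pipeline) means either side can change without touching the other.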

Outline

Introduction to webscraping (5 min)
I will mention the main scraping techniques:
- 1.1. Web scraping
- 1.2. Screen scraping
- 1.3. Report mining
- 1.4. Spiders

Python tools for webscraping (10 min)
For each library I will give an introduction with a basic example. In some examples I will use the requests library for sending HTTP requests.
- 2.1. BeautifulSoup
- 2.2. Webscraping
- 2.3. PyQuery

Comparing scraping tools (5 min)
- 3.1. Introduction to techniques for obtaining data from web pages, such as regular expressions, CSS selectors and XPath expressions
- 3.2. Comparative table of the main features of each tool

Project example with Scrapy (10 min)
- 4.1. Project structure with Scrapy
- 4.2. Components (Scheduler, Spider, Pipeline, Middlewares)
- 4.3. Generating reports in JSON, CSV and XML formats
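
To give an idea of what the three report formats in 4.3 look like for the same scraped data, here is a minimal standard-library sketch. It is not Scrapy's export mechanism: Scrapy ships its own feed exports, typically selected through the output file passed to `scrapy crawl`. The sample items and the function names below are assumptions for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical scraped items, used only for illustration.
ITEMS = [
    {"title": "Python tools for webscraping", "level": "Intermediate"},
    {"title": "Another talk", "level": "Beginner"},
]


def to_json(items):
    # JSON: one array of objects, convenient for other programs.
    return json.dumps(items, indent=2)


def to_csv(items):
    # CSV: one header row followed by one row per item.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(items[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


def to_xml(items):
    # XML: one <item> element per record, one child element per field.
    root = ET.Element("items")
    for item in items:
        node = ET.SubElement(root, "item")
        for key, value in item.items():
            ET.SubElement(node, key).text = value
    return ET.tostring(root, encoding="unicode")


json_report = to_json(ITEMS)
csv_report = to_csv(ITEMS)
xml_report = to_xml(ITEMS)
print(csv_report)
```

The three functions take the same list of dictionaries, so the choice of format can be left until the moment the report is written.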