Web Crawling in Python
Last Updated on June 21, 2023
In the old days, collecting data was tedious and often very expensive. Machine learning projects cannot live without data. Luckily, we have a lot of data on the web at our disposal nowadays. We can copy data from the web to create our dataset. We can manually download files and save them to disk, but we can do it more efficiently by automating the data harvesting. There are several tools in Python that can help with the automation.
After finishing this tutorial, you will learn:
- How to use the requests library to read online data using HTTP
- How to read tables on web pages using pandas
- How to use Selenium to emulate browser operations
Kick-start your project with my new book Python for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started!

Web Crawling in Python
Photo by Ray Bilcliff. Some rights reserved.
Overview
This tutorial is divided into three parts; they are:
- Using the requests library
- Reading tables on the web using pandas
- Reading dynamic content with Selenium
Using the Requests Library
When we talk about writing a Python program to read from the web, it is inevitable that we can't avoid the requests
library. You need to install it (together with BeautifulSoup and lxml, which we will cover later):
pip install requests beautifulsoup4 lxml
It gives you an interface that lets you interact with the web easily.
The very simple use case would be to read a web page from a URL:
import requests

# Lat-Lon of New York
URL = "https://weather.com/weather/today/l/40.75,-73.98"
resp = requests.get(URL)
print(resp.status_code)
print(resp.text)
200
<!doctype html><html dir="ltr" lang="en-US"><head>
<meta data-react-helmet="true" charset="utf-8"/><meta data-react-helmet="true" name="viewport" content="width=device-width, initial-scale=1, viewport-fit=cover"/>
...
If you're familiar with HTTP, you can probably recall that a status code of 200 means the request was successfully fulfilled. Then we can read the response. In the above, we read the textual response and get the HTML of the web page. Should it be a CSV or some other textual data, we can get it in the text
attribute of the response object. For example, this is how we can read a CSV from the Federal Reserve Economic Data:
import io

import pandas as pd
import requests

URL = "https://fred.stlouisfed.org/graph/fredgraph.csv?id=T10YIE&cosd=2017-04-14&coed=2022-04-14"
resp = requests.get(URL)
if resp.status_code == 200:
    csvtext = resp.text
    csvbuffer = io.StringIO(csvtext)
    df = pd.read_csv(csvbuffer)
    print(df)
            DATE T10YIE
0     2017-04-17   1.88
1     2017-04-18   1.85
2     2017-04-19   1.85
3     2017-04-20   1.85
4     2017-04-21   1.84
...          ...    ...
1299  2022-04-08   2.87
1300  2022-04-11   2.91
1301  2022-04-12   2.86
1302  2022-04-13    2.8
1303  2022-04-14   2.89

[1304 rows x 2 columns]
If the data is in the form of JSON, we can read it as text or even let requests
decode it for you. For example, the following pulls some data from GitHub in JSON format and converts it into a Python dictionary:
import requests

URL = "https://api.github.com/users/jbrownlee"
resp = requests.get(URL)
if resp.status_code == 200:
    data = resp.json()
    print(data)
{'login': 'jbrownlee', 'id': 12891, 'node_id': 'MDQ6VXNlcjEyODkx',
 'avatar_url': 'https://avatars.githubusercontent.com/u/12891?v=4', 'gravatar_id': '',
 'url': 'https://api.github.com/users/jbrownlee',
 'html_url': 'https://github.com/jbrownlee',
 ...
 'company': 'Machine Learning Mastery',
 'blog': 'https://machinelearningmastery.com', 'location': None, 'email': None,
 'hireable': None, 'bio': 'Making developers awesome at machine learning.',
 'twitter_username': None, 'public_repos': 5, 'public_gists': 0, 'followers': 1752,
 'following': 0, 'created_at': '2008-06-07T02:20:58Z', 'updated_at': '2023-02-22T19:56:27Z'}
But if the URL gives you some binary data, such as a ZIP file or a JPEG image, you need to get it from the content
attribute instead, as this is the binary data. For example, this is how we can download an image (the logo of Wikipedia):
import requests

URL = "https://en.wikipedia.org/static/images/project-logos/enwiki.png"
wikilogo = requests.get(URL)
if wikilogo.status_code == 200:
    with open("enwiki.png", "wb") as fp:
        fp.write(wikilogo.content)
Given we already obtained the web page, how should we extract the data? This is beyond what the requests
library can provide to us, but we can use a different library to help. There are two ways we can do it, depending on how we want to specify the data.
The first way is to consider the HTML as a kind of XML document and use the XPath language to extract the element. In this case, we can make use of the lxml
library to first create a document object model (DOM) and then search by XPath:
...
from lxml import etree

# Create DOM from HTML text
dom = etree.HTML(resp.text)
# Search for the temperature element and get the content
elements = dom.xpath("//span[@data-testid='TemperatureValue' and contains(@class,'CurrentConditions')]")
print(elements[0].text)
61°
XPath is a string that specifies how to find an element. The lxml object provides an xpath()
function to search the DOM for elements that match the XPath string, which can be multiple matches. The XPath above means to find an HTML element anywhere with the <span>
tag and with the attribute data-testid
matching "TemperatureValue
" and class
beginning with "CurrentConditions
." We can learn this from the developer tools of the browser (e.g., the Chrome screenshot below) by inspecting the HTML source.
This example is to find the temperature of New York City, provided by this particular element we get from this web page. We know the first element matched by the XPath is what we need, and we can read the text inside the <span>
tag.
The other way is to use CSS selectors on the HTML document, for which we can make use of the BeautifulSoup library:
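The snippet below is a minimal sketch of this approach, mirroring the lxml version above; the same soup.select() call also appears in the complete listing further down:

...
from bs4 import BeautifulSoup

# Create DOM from HTML text
soup = BeautifulSoup(resp.text, "lxml")
# Search for the temperature element and get the content
elements = soup.select('span[data-testid="TemperatureValue"][class^="CurrentConditions"]')
print(elements[0].text)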
61°
In the above, we first pass our HTML text to BeautifulSoup. BeautifulSoup supports various HTML parsers, each with different capabilities. In the above, we use the lxml
library as the parser, as recommended by BeautifulSoup (and it is also often the fastest). CSS selector is a different mini-language, with pros and cons compared to XPath. The selector above is equivalent to the XPath we used in the previous example. Therefore, we can get the same temperature from the first matched element.
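As a small aside, here is a minimal sketch (with a toy HTML string) of how the parser is chosen when constructing the soup; html.parser ships with Python, while lxml must be installed separately:

from bs4 import BeautifulSoup

html = "<p>Hello <b>world</b></p>"
# Built-in parser: no extra dependency, but usually slower
print(BeautifulSoup(html, "html.parser").b.text)
# lxml parser: needs the lxml package, usually the fastest
print(BeautifulSoup(html, "lxml").b.text)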
The following is the complete code to print the current temperature of New York according to the real-time information on the web:
import requests
from bs4 import BeautifulSoup
from lxml import etree

# Reading temperature of New York
URL = "https://weather.com/weather/today/l/40.75,-73.98"
resp = requests.get(URL)

if resp.status_code == 200:
    # Using lxml
    dom = etree.HTML(resp.text)
    elements = dom.xpath("//span[@data-testid='TemperatureValue' and contains(@class,'CurrentConditions')]")
    print(elements[0].text)
    # Using BeautifulSoup
    soup = BeautifulSoup(resp.text, "lxml")
    elements = soup.select('span[data-testid="TemperatureValue"][class^="CurrentConditions"]')
    print(elements[0].text)
As you can imagine, you can collect a time series of the temperature by running this script on a regular schedule. Similarly, we can collect data automatically from various websites. This is how we can obtain data for our machine learning projects.
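As a sketch of that idea, assuming you are happy to append each reading to a local CSV file (the file name here is arbitrary), the script below can be run from cron or any other scheduler:

import csv
import datetime

import requests
from lxml import etree

URL = "https://weather.com/weather/today/l/40.75,-73.98"
resp = requests.get(URL)
if resp.status_code == 200:
    dom = etree.HTML(resp.text)
    elements = dom.xpath("//span[@data-testid='TemperatureValue' and contains(@class,'CurrentConditions')]")
    # Append a timestamped reading to the CSV file
    with open("nyc_temperature.csv", "a", newline="") as fp:
        csv.writer(fp).writerow([datetime.datetime.now().isoformat(), elements[0].text])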
Reading Tables on the Web Using Pandas
Very often, web pages will use tables to carry data. If the page is simple enough, we may even skip inspecting it to figure out the XPath or CSS selector and use pandas to get all tables on the page in one shot. It is simple enough to be done in one line:
import pandas as pd

tables = pd.read_html("https://www.federalreserve.gov/releases/h15/")
print(tables)
[                              Instruments 2022Apr7 2022Apr8 2022Apr11 2022Apr12 2022Apr13
0         Federal funds (effective) 1 2 3      0.33     0.33      0.33      0.33      0.33
1                Commercial Paper 3 4 5 6       NaN      NaN       NaN       NaN       NaN
2                            Nonfinancial       NaN      NaN       NaN       NaN       NaN
3                                 1-month      0.30     0.34      0.36      0.39      0.39
4                                 2-month      n.a.     0.48      n.a.      n.a.      n.a.
5                                 3-month      n.a.     n.a.      n.a.      0.78      0.78
6                               Financial       NaN      NaN       NaN       NaN       NaN
7                                 1-month      0.49     0.45      0.46      0.39      0.46
8                                 2-month      n.a.     n.a.      0.60      0.71      n.a.
9                                 3-month      0.85     0.81      0.75      n.a.      0.86
10                  Bank prime loan 2 3 7      3.50     3.50      3.50      3.50      3.50
11     Discount window primary credit 2 8      0.50     0.50      0.50      0.50      0.50
12             U.S. government securities       NaN      NaN       NaN       NaN       NaN
13  Treasury bills (secondary market) 3 4       NaN      NaN       NaN       NaN       NaN
14                                 4-week      0.21     0.20      0.21      0.19      0.23
15                                3-month      0.68     0.69      0.78      0.74      0.75
16                                6-month      1.12     1.16      1.22      1.18      1.17
17                                 1-year      1.69     1.72      1.75      1.67      1.67
18           Treasury constant maturities       NaN      NaN       NaN       NaN       NaN
19                              Nominal 9       NaN      NaN       NaN       NaN       NaN
20                                1-month      0.21     0.20      0.22      0.21      0.26
21                                3-month      0.68     0.70      0.77      0.74      0.75
22                                6-month      1.15     1.19      1.23      1.20      1.20
23                                 1-year      1.78     1.81      1.85      1.77      1.78
24                                 2-year      2.47     2.53      2.50      2.39      2.37
25                                 3-year      2.66     2.73      2.73      2.58      2.57
26                                 5-year      2.70     2.76      2.79      2.66      2.66
27                                 7-year      2.73     2.79      2.84      2.73      2.71
28                                10-year      2.66     2.72      2.79      2.72      2.70
29                                20-year      2.87     2.94      3.02      2.99      2.97
30                                30-year      2.69     2.76      2.84      2.82      2.81
31                   Inflation indexed 10       NaN      NaN       NaN       NaN       NaN
32                                 5-year     -0.56    -0.57     -0.58     -0.65     -0.59
33                                 7-year     -0.34    -0.33     -0.32     -0.36     -0.31
34                                10-year     -0.16    -0.15     -0.12     -0.14     -0.10
35                                20-year      0.09     0.11      0.15      0.15      0.18
36                                30-year      0.21     0.23      0.27      0.28      0.30
37 Inflation-indexed long-term average 11      0.23     0.26      0.30      0.30      0.33,
                      0               1
0                  n.a.  Not available.]
The read_html()
function in pandas reads a URL and finds all the tables on the page. Each table is converted into a pandas DataFrame, and then all of them are returned in a list. In this example, we are reading the various interest rates from the Federal Reserve, which happens to have only one table on this page. The table columns are identified by pandas automatically.
Chances are that not all tables are what we are interested in. Sometimes, a web page will use a table merely as a way to format the page, but pandas may not be smart enough to tell. Hence we need to examine and cherry-pick the results returned by the read_html()
function, as sketched below.
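A minimal sketch of that cherry-picking, assuming we key on the "Instruments" column name seen in the output above:

import pandas as pd

tables = pd.read_html("https://www.federalreserve.gov/releases/h15/")
print(f"Found {len(tables)} table(s)")
for i, df in enumerate(tables):
    # Inspect the shape and the first few column names of each table
    print(i, df.shape, list(df.columns)[:3])

# Keep only the table(s) whose first column is named "Instruments"
rates = [df for df in tables if df.columns[0] == "Instruments"]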
Want to Get Started With Python for Machine Learning?
Take my free 7-day email crash course now (with sample code).
Click to sign-up and also get a free PDF Ebook version of the course.
Reading Dynamic Content With Selenium
A significant portion of modern-day web pages is full of JavaScript. This gives us a fancier experience but becomes a hurdle when using a program to extract data. One example is Yahoo's home page, which, if we just load the page and find all news headlines, shows far fewer than what we can see in the browser:
import requests
from lxml import etree

# Read Yahoo home page
URL = "https://www.yahoo.com/"
resp = requests.get(URL)
dom = etree.HTML(resp.text)
# Print news headlines
elements = dom.xpath('//h3/a[u[@class="StretchedBox"]]')
for elem in elements:
    print(etree.tostring(elem, method="text", encoding="unicode"))
This is because web pages like this rely on JavaScript to populate the content. Famous web frameworks such as AngularJS or React are behind powering this category. The Python requests
library does not understand JavaScript. Therefore, you will see the result differently. If the data you want to fetch from the web is one of them, you can study how the JavaScript is invoked and mimic the browser's behavior in your program. But this is probably too tedious to make it work.
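For illustration only, mimicking the browser usually means finding the background request that delivers the data in the browser's network tab and replaying it with requests; the endpoint, parameters, and headers below are purely hypothetical:

import requests

# Hypothetical JSON endpoint discovered in the browser's network tab;
# real sites use their own URLs, parameters, and often extra headers or cookies
API = "https://example.com/api/headlines"
headers = {"User-Agent": "Mozilla/5.0", "Referer": "https://example.com/"}
resp = requests.get(API, params={"count": 20}, headers=headers)
if resp.status_code == 200:
    for item in resp.json().get("items", []):
        print(item.get("title"))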
The alternative is to ask a real browser to read the web page rather than using requests
. This is what Selenium can do. Before we can use it, we need to install the library:
pip install selenium
But Selenium is only a framework to control browsers. You need to have the browser installed on your computer as well as the driver that connects Selenium to the browser. If you intend to use Chrome, you need to download and install ChromeDriver too. You need to put the driver in the executable path so that Selenium can invoke it like a normal command. For example, in Linux, you just need to get the chromedriver
executable from the downloaded ZIP file and put it in /usr/local/bin
.
Similarly, if you're using Firefox, you need the GeckoDriver. For more details on setting up Selenium, you should refer to its documentation.
Afterward, you can use a Python script to control the browser's behavior. For example:
import time

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By

# Launch Chrome browser in headless mode
options = webdriver.ChromeOptions()
options.add_argument("headless")
browser = webdriver.Chrome(options=options)

# Load web page
browser.get("https://www.yahoo.com")
# Network transport takes time. Wait until the page is fully loaded
def is_ready(browser):
    return browser.execute_script(r"""
        return document.readyState === 'complete'
    """)
WebDriverWait(browser, 30).until(is_ready)

# Scroll to bottom of the page to trigger JavaScript action
browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(1)
WebDriverWait(browser, 30).until(is_ready)

# Search for news headlines and print
elements = browser.find_elements(By.XPATH, '//h3/a[u[@class="StretchedBox"]]')
for elem in elements:
    print(elem.text)

# Close the browser once finished
browser.close()
The above code works as follows. We first launch the browser in headless mode, which means we ask Chrome to start but not display on the screen. This is important if we want to run our script remotely, as there may not be any GUI support. Note that every browser is developed differently, and thus the options syntax we used is specific to Chrome. If we use Firefox, the code would be this instead:
options = webdriver.FirefoxOptions()
options.set_headless()
browser = webdriver.Firefox(firefox_options=options)
After we launch the browser, we give it a URL to load. But since it takes time for the network to deliver the page, and the browser will take time to render it, we should wait until the browser is ready before we proceed to the next operation. We detect whether the browser has finished rendering by using JavaScript. We make Selenium run a JavaScript snippet for us and tell us the result using the execute_script()
function. We leverage Selenium's WebDriverWait
tool to run it until it succeeds or until a 30-second timeout. As the page is loaded, we scroll to the bottom of the page so the JavaScript can be triggered to load more content. Then we wait one second unconditionally to make sure the browser has triggered the JavaScript, then wait until the page is ready again. Afterward, we can extract the news headline elements using XPath (or alternatively using a CSS selector). Because the browser is an external program, we are responsible for closing it in our script.
Using Selenium is different from using the requests
library in several aspects. First, you never have the web content in your Python code directly. Instead, you refer to the browser's content whenever you need it. Hence the web elements returned by the find_elements()
function refer to objects inside the external browser, so we must not close the browser before we finish consuming them. Secondly, all operations should be based on browser interaction rather than network requests. Thus you need to control the browser by emulating keyboard and mouse actions. But in return, you have the full-featured browser with JavaScript support. For example, you can use JavaScript to check the size and position of an element on the page, which you will know only after the HTML elements are rendered.
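A minimal sketch of that idea, continuing from the Yahoo example above (assuming the browser and elements from that listing are still alive):

# Ask the browser, via JavaScript, where the first headline is rendered
elem = elements[0]
rect = browser.execute_script(
    "const r = arguments[0].getBoundingClientRect();"
    " return {x: r.x, y: r.y, width: r.width, height: r.height};", elem)
print(rect["x"], rect["y"], rect["width"], rect["height"])
# Selenium also exposes the same information directly
print(elem.location, elem.size)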
There are far more functions provided by the Selenium framework than we can cover here. It is powerful, but since it is connected to the browser, using it is more demanding than the requests
library and much slower. Usually, this is the last resort for harvesting information from the web.
Further Reading
Another famous web crawling library in Python that we didn't cover above is Scrapy. It is like combining the requests
library with BeautifulSoup into one. The web protocol is complex. Sometimes we need to handle web cookies or provide extra data to the requests using the POST method. All of these can be done with the requests library with a different function or extra arguments.
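For illustration, here is a minimal sketch of sending a POST request and persisting cookies with a requests session; the URL and form fields are hypothetical:

import requests

# A session keeps cookies across requests (e.g., a login followed by a data fetch)
session = requests.Session()
# Hypothetical login form: the URL and field names depend on the actual site
resp = session.post("https://example.com/login",
                    data={"username": "alice", "password": "secret"})
if resp.status_code == 200:
    # Cookies set by the login are sent automatically on the next request
    print(session.get("https://example.com/api/private-data").json())

The following are some resources for you to go deeper: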
Articles
- An overview of HTTP from MDN
- XPath from MDN
- XPath tutorial from W3Schools
- CSS Selector Reference from W3Schools
- Selenium Python binding
API documentation
Books
- Python Web Scraping, 2nd Edition, by Katharine Jarmul and Richard Lawson
- Web Scraping with Python, 2nd Edition, by Ryan Mitchell
- Learning Scrapy, by Dimitrios Kouzis-Loukas
- Python Testing with Selenium, by Sujay Raghavendra
- Hands-On Web Scraping with Python, by Anish Chapagain
Summary
In this tutorial, you saw the tools we can use to fetch content from the web.
Specifically, you learned:
- How to use the requests library to send an HTTP request and extract data from its response
- How to build a document object model from HTML so we can find specific information on a web page
- How to read tables on a web page quickly and easily using pandas
- How to use Selenium to control a browser to handle dynamic content on a web page
Get a Handle on Python for Machine Learning!
Be More Confident to Code in Python
…from learning the practical Python tricks
Discover how in my new Ebook:
Python for Machine Learning
It provides self-study tutorials with hundreds of working code to equip you with skills including:
debugging, profiling, duck typing, decorators, deployment,
and much more…
Showing You the Python Toolbox at a High Level for
Your Projects
See What’s Inside