
Creating a simple web scraper using NodeJS, axios and cheerio

29 March, 2021

In this tutorial we'll create a simple web scraper using NodeJS, axios and cheerio. We'll also use Node's built-in fs module to save the scraped data into a JSON file.

The URL we'll be scraping is https://www.starwars.com/news, which contains a list of news items (highlighted in red in the screenshot below). We'll scrape the data and save it into a JSON file.

[Image: the Star Wars news page with the list of news items highlighted]


Create the project folder

Create an empty folder to hold the project files. I've called mine scraping-tutorial.

Create a package.json file

Inside the project folder, create a package.json file to keep track of the project's dependencies:

npm init -y


Install dependencies

Next, we'll install the axios and cheerio libraries. (The fs module is built into Node, so there's nothing extra to install for it.)

npm install axios cheerio


Writing the scraper

In the root of the project folder, create an empty JavaScript file for the scraper. I've called mine scraper.js.

At the top of the file import the dependencies.

const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs');

Next, let's check that we can reach the URL by making a simple GET request with axios, passing in the URL.

axios.get('https://www.starwars.com/news')
    .then((response) => {
        console.log(response.data);
    })
    .catch((error) => {
        console.log(error);
    });
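As a side note: some sites reject requests that don't look like they come from a browser. axios lets you pass a config object as the second argument to get(), where you can set request headers such as a User-Agent. This isn't needed for this tutorial; the sketch below just shows the option, and the header value is only an example.

axios.get('https://www.starwars.com/news', {
        // Example User-Agent value; swap in whatever identifies your scraper
        headers: { 'User-Agent': 'Mozilla/5.0 (compatible; scraping-tutorial)' }
    })
    .then((response) => {
        console.log(response.status);
    })
    .catch((error) => {
        console.log(error);
    });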


Test the script

To test our script, open a terminal at the root of the project folder and run the following command. Notice that I've logged response.data instead of response: the axios response object has several properties, but data is the one that contains our HTML payload.

node scraper.js

If successful, you should get a bunch of HTML logged to the terminal similar to the image below:

[Image: the HTML response logged to the terminal]
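If you're curious about what else the axios response contains, you can log a few of its other properties alongside data, for example the HTTP status code and the response headers:

axios.get('https://www.starwars.com/news')
    .then((response) => {
        console.log(response.status);                  // HTTP status code, e.g. 200
        console.log(response.headers['content-type']); // content type of the response
        console.log(response.data);                    // the HTML payload we're after
    })
    .catch((error) => {
        console.log(error);
    });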


Loop through each article and get its data

Now that we've got HTML being returned, we can go through the page and extract the data we want using the cheerio library. First, let's use Chrome DevTools to inspect the page and identify the elements that contain the data we want to extract.

Inspecting the page, we can see that the articles are <li>'s nested under a <ul> with a class of 'news-articles'.
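As a rough sketch (the real markup has more attributes and nesting, so treat this as an approximation), the structure we're targeting looks something like this:

<ul class="news-articles">
    <li>
        <h2>Article title</h2>
        <!-- ...the rest of the article markup: link, author byline, image, etc... -->
    </li>
    <!-- ...more <li> items, one per article... -->
</ul>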

Let's use cheerio to loop through each <li> and get the article contents.

...

axios.get('https://www.starwars.com/news')
    .then((response) => {
        let $ = cheerio.load(response.data);
        
        $('.news-articles li').each((index, element) => {
            console.log($(element));
        });
    })
    .catch((error) => {
        console.log(error);
    });

In the code above I've loaded response.data into cheerio with let $ = cheerio.load(response.data), which lets us query the HTML response using the $ symbol, similar to jQuery. Next, I've looped through all the <li>'s with cheerio's each method and logged each <li> node to the console.

[Image: an <li> node logged to the console]

Now that we can access each <li> element through the each loop, we can start extracting text. Let's start by getting the article titles. Inspecting each <li>, we can see that the title text is nested under the <h2> element.

To access the <h2> we can use cheerio's find() method and pass it 'h2' so it finds the <h2> under each <li>. We'll then grab the text with the text() method and use the trim() method to clean up any whitespace.

...

axios.get('https://www.starwars.com/news')
    .then((response) => {
        let $ = cheerio.load(response.data);
        
        $('.news-articles li').each((index, element) => {
            console.log($(element).find('h2').text().trim());
        });
    })
    .catch((error) => {
        console.log(error);
    });

And we can now see that we're only logging each article's title.

[Image: the article titles logged to the console]


Saving the scraped data into a JSON file

Now that we're getting the article titles back, we can save these to a JSON file using the fs module.

First we'll declare an empty articles array to push our articles into, then wrap each article we push as an object, and finally write a file called articles.json to the project root with the fs.writeFile() method.

const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs');

axios.get('https://www.starwars.com/news')
    .then((response) => {
        let $ = cheerio.load(response.data);
        let articles = [];

        $('.news-articles li').each((index, element) => {
            articles.push({
                title: $(element).find('h2').text().trim()
            });
        });

        // Write the scraped data to articles.json in the project root
        fs.writeFile('./articles.json', JSON.stringify(articles), (error) => {
            if (error) throw error;
        });
    })
    .catch((error) => {
        console.log(error);
    });

Now if you run the script again, you'll see that an articles.json file has been written into the project folder containing stringified JSON.
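The exact contents will depend on what's on the news page when you run the script; with placeholder titles it looks something like this:

[{"title":"Example article title"},{"title":"Another example article title"},{"title":"A third example article title"}]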

Here's the final code together with other article details such as the URL and author.

const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs');

axios.get('https://www.starwars.com/news')
    .then((response) => {
        let $ = cheerio.load(response.data);
        let articles = [];

        $('.news-articles li').each((index, element) => {
            articles.push({
                url: $(element).find('a').attr('href'),
                title: $(element).find('h2').text().trim(),
                author: $(element).find('.byline-author').text().trim()
            });
        });
        
        // Write the scraped data to articles.json in the project root
        fs.writeFile('./articles.json', JSON.stringify(articles), (error) => {
            if (error) throw error;
        });
    })
    .catch((error) => {
        console.log(error);
    });
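As an optional variation, the same scraper can be written with async/await, and you can pass null, 2 to JSON.stringify() to pretty-print the output file. Here's a rough sketch of that version, using the same URL and selectors as above:

const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs').promises;

async function scrape() {
    try {
        const response = await axios.get('https://www.starwars.com/news');
        const $ = cheerio.load(response.data);
        const articles = [];

        $('.news-articles li').each((index, element) => {
            articles.push({
                url: $(element).find('a').attr('href'),
                title: $(element).find('h2').text().trim(),
                author: $(element).find('.byline-author').text().trim()
            });
        });

        // null, 2 pretty-prints the JSON with two-space indentation
        await fs.writeFile('./articles.json', JSON.stringify(articles, null, 2));
    } catch (error) {
        console.log(error);
    }
}

scrape();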

Happy scraping!
