Feb 23, 2023
Web Scraping with Ruby on Rails

What is Web Scraping?

Web scraping is the process of extracting data from websites so it can be used in other applications. It is mainly used for gathering valuable data at scale, and Ruby on Rails makes it straightforward to build a web scraping application.

Understanding Ruby

Ruby is a high-level, multi-paradigm, fully interpreted programming language. Program code is stored as plain text and passed to the interpreter for execution. The Rails framework makes it easy to build software quickly, which is why start-ups often prefer Ruby for building a minimum viable product.

Ruby is actively maintained and developed by its community. Its libraries and helper tools, known as gems, encourage best coding practices in almost every situation, which makes Ruby feel like a near-universal solution for new software.

Web Scraper Application in Ruby on Rails: How to Create an RoR Web Scraping App?

Let’s start scraping some data! The first thing we have to do is pick a website to scrape the data from. Once that is done, follow the steps below.

Step 1: Setting up the environment

Here let’s learn about all the prerequisites we need to build a web scraper using Ruby.

  • IDE: Here, we will be using Visual Studio Code, which is lightweight and needs no additional configuration. You can pick an IDE of your choice.

  • Ruby’s latest version: Download Ruby from the official website and pick the latest version depending on your OS. 

  • Bundler: Bundler is a Ruby gem for dependency management.

  • Watir: Watir is a gem for automated browser testing. It is powered by Selenium and can imitate a user’s behavior in the browser.

  • Webdrivers: A gem recommended by Watir that automatically downloads the latest driver for your browser.

  • Nokogiri: This gem makes analyzing web pages easy. It can quickly parse XML and HTML, supports CSS3 selectors and XPath, and can detect broken HTML documents.

With these in place, you can set up the Ruby environment. Create a new directory on your computer and open it in the IDE. To install the first gem, open the terminal window and run the command:

> gem install bundler

In the root directory of your project, create a file named Gemfile and add the other gems as dependencies:

source 'https://rubygems.org'

gem 'watir', '~> 6.19', '>= 6.19.1'
gem 'webdrivers', '~> 4.6'
gem 'nokogiri', '~> 1.11', '>= 1.11.7'

Open the terminal window and run the command below to install the gems:

> bundle install

To hold the web scraper’s code, create a scraper.rb file. You can then run it with the command:

> ruby scraper.rb

Step 2: Inspecting the webpage you intend to scrape

Open the webpage that you want to scrape, right-click on it, and choose the Inspect option. This opens the browser’s developer tools, where you can examine the website’s HTML.
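For example, each article card on the page used later in this tutorial sits in a <div> identified by its class name. The snippet below is a minimal, hypothetical sketch of that kind of markup, parsed with Nokogiri so you can see how class names map to the data we want; the live page’s markup will differ.

require 'nokogiri'

# Hypothetical sketch of one article card, loosely modelled on the class
# names used later in this tutorial (the real page will look different).
sample_html = <<~HTML
  <div class="td_module_10">
    <div class="td-module-thumb">
      <a title="Sample article title" href="/sample-article"></a>
    </div>
    <div class="item-details">
      <p class="td-description">A short description of the article.</p>
      <span class="td-price">$0</span>
    </div>
  </div>
HTML

doc = Nokogiri::HTML(sample_html)
# Pull the title attribute out of the anchor inside the card.
puts doc.at_xpath("//div[contains(@class, 'td_module_10')]//a/@title")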

Step 3: Sending the HTTP request

To get the HTML onto your local machine, you have to send an HTTP request; Watir will take care of fetching the document. Open the IDE and write the code below.

Code:

Start with all the required imports:

require 'watir'
require 'webdrivers'
require 'nokogiri'

After initializing the browser instance, open the website you want to scrape, access its HTML, and pass it to the Nokogiri constructor, which parses the result.

Code:

browser = Watir::Browser.new
browser.goto 'https://blog.eatthismuch.com/latest-articles/'
parsed_page = Nokogiri::HTML(browser.html)
File.open("parsed.txt", "w") { |f| f.write "#{parsed_page}" }
browser.close

The result is saved in a text file named parsed.txt, where you can inspect the HTML. After receiving the response, it is important to close the browser connection.
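By default, Watir opens a visible browser window. If you would rather run without one, Watir 6 also supports a headless option; a minimal sketch, assuming Chrome is installed locally:

require 'watir'
require 'webdrivers'
require 'nokogiri'

# Launch Chrome without a visible window (assumes Chrome is installed).
browser = Watir::Browser.new :chrome, headless: true
browser.goto 'https://blog.eatthismuch.com/latest-articles/'
parsed_page = Nokogiri::HTML(browser.html)
browser.close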

Step 4: Extracting specific sections

Now we have an HTML document, but we want the data inside it, so we have to turn the previous response into information humans can read. First, we will extract the website’s title. A nice property of Ruby is that, with few exceptions, everything is an object; even strings have methods and attributes.
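A quick illustration of that idea, separate from the scraper itself:

# With few exceptions, every value in Ruby is an object with methods.
puts "web scraping".class     # => String
puts "web scraping".upcase    # => WEB SCRAPING
puts 42.even?                 # => true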

We can access the website’s title through the parsed_page object attribute.

Code:

puts parsed_page.title

Next, let’s collect all the links on the page:

links = parsed_page.css('a').map { |element| element["href"] }
puts links

The map method pulls the href attribute out of each link, and those attributes give us each article’s address. To get each article’s title and description as well, we go back to the HTML: each article sits under a <div> tag identified by its class name.

There are multiple ways to do this search. We are going to look for all the <div> tags with the class name td_module_10, iterate through each of them, and extract the inner <div> and <a> tags that hold the title, description, and price.

Code:

property_cards = parsed_page.xpath("//div[contains(@class, 'td_module_10')]")
property_cards.each do |card|
  title = card.xpath("div[@class='td-module-thumb']/a/@title")
  description = card.xpath("div[@class='item-details']/p[@class='td-description']")
  price = card.xpath("div[@class='item-details']/span[@class='td-price']")
end

Here the XPath expressions do all the work, since we are searching for HTML elements by their ancestors and class names.
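If the XPath syntax feels verbose, Nokogiri also understands CSS selectors. The sketch below is a rough equivalent of the lookup above, using the same assumed class names; at_css returns the first match or nil, so the safe-navigation operator guards against missing elements.

# Same search expressed with CSS selectors instead of XPath.
property_cards = parsed_page.css('div.td_module_10')
property_cards.each do |card|
  title       = card.at_css('div.td-module-thumb a')&.attribute('title')
  description = card.at_css('div.item-details p.td-description')
  price       = card.at_css('div.item-details span.td-price')
end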

Step 5: Exporting the data to CSV

An article aggregator like this becomes much more useful when its data can be passed to other applications. The first step is to export the parsed data to an external file; we will create a CSV file, since CSV is easy for other applications to read and can be opened in Excel if additional processing is needed.

We start by importing the CSV library:

require 'csv'

We then open the CSV file in append mode and wrap the earlier code in it. Now the scraper looks like this:

CSV.open("properties.csv", "a+") do |csv|
  csv << ["title", "description", "price"]
  property_cards = parsed_page.xpath("//div[contains(@class, 'td_module_10')]")
  property_cards.each do |card|
    title = card.xpath("div[@class='td-module-thumb']/a/@title")
    description = card.xpath("div[@class='item-details']/p[@class='td-description']")
    price = card.xpath("div[@class='item-details']/span[@class='td-price']")
    csv << [title.first.value, description.first.value, price.first.text.strip]
  end
end
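Once the file exists, the same csv library can read it back for further processing. A small usage sketch, assuming the header row written above:

require 'csv'

# Read the exported file back, treating the first row as column headers.
CSV.foreach("properties.csv", headers: true) do |row|
  puts "#{row['title']} - #{row['price']}"
end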

Now the parsed data is presented to you in a straightforward, approachable, and clean way.

Web scraping is a powerful technique that lets you access and analyze large amounts of data from several sources.

It helps you access, process, and aggregate data quickly, and how daunting the task is depends largely on the tools you use.

Developing a web scraper in Ruby on Rails can be a great option, especially when it is done by professional Ruby on Rails developers.

We are a team of expert RoR developers working with various organizations as their Rails technical partners.

Feel free to get in touch with us to hire dedicated Ruby developers.

Sachiin Gevariya

Sachin Gevariya is a Founder and Technical Director at Essence Solusoft. He is dedicated to making the best use of modern technologies to craft end-to-end solutions and has vast knowledge of cloud management. He loves coding and still writes code, helps employees deliver quality solutions to clients, and is always eager to learn new technologies and apply them to build the best solutions.