Ultimate Guide to Scraping Linkarooie with Ruby

Prerequisites

  • Basic Ruby knowledge
  • Familiarity with HTML/CSS selectors
  • A Ruby environment with Bundler installed
  • A goal of outputting the scraped data in Markdown or HTML format

Introduction

Linkarooie lets you create shareable link profiles, achievements, and more. In this comprehensive guide, you’ll learn how to:

  1. Fetch Linkarooie pages via HTTParty.
  2. Parse HTML content with Nokogiri.
  3. Output data as Markdown or HTML (using Redcarpet).
  4. Automate with Cron or GitHub Actions.
  5. Enhance reliability via rate limiting, testing (RSpec + VCR), data validation, and config files.
  6. Containerise everything using Docker and Docker Compose.

1. Project Setup

1.1. Directory and Files

Follow these steps to create the initial structure:

# 1. Create a directory for your project
mkdir linkarooie-scraper
cd linkarooie-scraper

# 2. Create a Gemfile for your Ruby dependencies
touch Gemfile

# 3. Create your scraper script
touch linkarooie_scraper.rb

# 4. (Optional) Create a Dockerfile and docker-compose.yml for containerisation
touch Dockerfile
touch docker-compose.yml

Your folder now contains:

linkarooie-scraper/
├── Gemfile
├── linkarooie_scraper.rb
├── Dockerfile
└── docker-compose.yml

1.2. Define Gems in Your Gemfile

source "https://rubygems.org"

gem "httparty"
gem "nokogiri"
gem "redcarpet" # Only if you want HTML conversion from Markdown

# For testing (optional best practice improvements):
# gem "rspec"
# gem "vcr"
# gem "webmock"

Then install:

bundle install

2. Docker & Docker Compose Setup

Below are example files for a basic Docker workflow, using Ruby 3.3 and the new Docker Compose. Feel free to customise.

2.1. Dockerfile

# Dockerfile

# Start with the official Ruby 3.3 image
FROM ruby:3.3

# Create a directory for our app
ENV APP_HOME=/usr/src/app
RUN mkdir -p $APP_HOME
WORKDIR $APP_HOME

# Copy Gemfile and Gemfile.lock first for efficient caching
COPY Gemfile Gemfile.lock ./

# Install dependencies
RUN bundle install

# Copy the rest of our code
COPY . .

# Set default command
CMD [ "ruby", "linkarooie_scraper.rb" ]

Note: You may not have a Gemfile.lock yet. It’s good practice to generate one locally by running bundle install on your host, then copy it in.
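
If you prefer not to install the gems on your host just to produce the lockfile, bundle lock resolves the dependencies and writes Gemfile.lock without installing anything:

# Generate Gemfile.lock without installing gems
bundle lock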

2.2. Docker Compose

# docker-compose.yml
services:
  linkarooie:
    build: .
    volumes:
      - .:/usr/src/app
    command: bundle exec ruby linkarooie_scraper.rb

Important: the standalone docker-compose command has been replaced by Docker Compose V2, which is invoked as docker compose. The file above uses the Compose Specification format, which no longer needs a top-level version: key.

Running with Docker

# 1. Build the Docker image
docker compose build

# 2. Run the container
docker compose up

Once it’s finished, you should see logs indicating the scraper ran. Because docker-compose.yml mounts your project directory into the container (.:/usr/src/app), the Markdown/HTML reports will appear in your local reports/daily directory.
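
If you prefer a one-off run that removes the container when it exits, you can also use:

# Run the scraper once and remove the container afterwards
docker compose run --rm linkarooie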


3. Fetching Your Linkarooie Profile with HTTParty

Inside linkarooie_scraper.rb, add the following code for the HTTP request:

require "httparty"

url = "https://linkarooie.com/loftwah"
response = HTTParty.get(url, headers: { "User-Agent" => "Mozilla/5.0" })

if response.code == 200
  puts "Page fetched successfully!"
  html_content = response.body
else
  puts "Failed to fetch page. Status code: #{response.code}"
  exit(1)
end

Tip: Including a custom User-Agent header can help avoid being blocked by websites that disallow default HTTP libraries.
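
HTTParty raises an exception for network-level problems (timeouts, DNS failures) rather than returning a response object, so it can be worth adding a timeout and a rescue around the request. A minimal sketch (the 10-second timeout is an arbitrary choice):

require "httparty"

url = "https://linkarooie.com/loftwah"

begin
  # timeout: applies to both opening the connection and reading the response
  response = HTTParty.get(url, headers: { "User-Agent" => "Mozilla/5.0" }, timeout: 10)
rescue Net::OpenTimeout, Net::ReadTimeout, SocketError => e
  abort "Request to #{url} failed: #{e.message}"
end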


4. Parsing the HTML with Nokogiri

Continuing in linkarooie_scraper.rb:

require "nokogiri"

doc = Nokogiri::HTML(html_content)

# Extract the main heading (profile title)
profile_title = doc.at_css("h1.text-2xl.font-bold")&.text
puts "Profile Title: #{profile_title}"

# Extract link items in the “Links” section
link_elements = doc.css(".links-section ul li")
links_data = link_elements.map do |li|
  link_title = li.at_css("h2 a")&.text&.strip
  link_url   = li.at_css("h2 a")&.[]("href")
  link_desc  = li.at_css("p")&.text&.strip
  { title: link_title, url: link_url, description: link_desc }
end

puts "Found #{links_data.size} link items!"

# Extract images (like banners, avatars, etc.)
images = doc.css("img").map { |img| img["src"] }.compact
puts "Found #{images.size} images!"

5. Output as Markdown

Markdown is excellent for quick reporting. Still within linkarooie_scraper.rb:

require "fileutils"

# Create a directory for reports if it doesn’t exist
FileUtils.mkdir_p("reports/daily")

report_date = Time.now.strftime("%Y-%m-%d")
markdown_path = "reports/daily/#{report_date}.md"

markdown_content = <<~MD
  # Linkarooie Profile Report
  **Date:** #{report_date}

  ## Profile Title
  #{profile_title}

  ## Links
MD

links_data.each do |link|
  markdown_content << "- **#{link[:title]}**\n"
  markdown_content << "  - URL: [#{link[:url]}](#{link[:url]})\n"
  markdown_content << "  - Description: #{link[:description]}\n\n"
end

markdown_content << "## Images\n"
images.each do |img|
  markdown_content << "![Image](#{img})\n"
end

File.write(markdown_path, markdown_content)
puts "Markdown report saved to #{markdown_path}"

6. Converting Markdown to HTML with Redcarpet (Optional)

Add the following snippet if you want HTML output (still in linkarooie_scraper.rb):

require "redcarpet"

markdown_renderer = Redcarpet::Markdown.new(Redcarpet::Render::HTML)
html_report = markdown_renderer.render(markdown_content)

html_path = "reports/daily/#{report_date}.html"
File.write(html_path, html_report)
puts "HTML report saved to #{html_path}"

This gives you two output formats: a Markdown report (reports/daily/YYYY-MM-DD.md) and an HTML report (reports/daily/YYYY-MM-DD.html).


7. Rate Limiting (Best Practice)

If you need to scrape multiple pages, add a short delay or backoff:

URLS_TO_SCRAPE = [
  "https://linkarooie.com/profile1",
  "https://linkarooie.com/profile2"
]

URLS_TO_SCRAPE.each_with_index do |url, index|
  response = HTTParty.get(url, headers: { "User-Agent" => "Mozilla/5.0" })

  if response.code == 200
    doc = Nokogiri::HTML(response.body)
    # Parse data...
  else
    warn "Failed to fetch #{url}. Status: #{response.code}"
  end

  # Sleep 2 seconds between requests (avoid hammering the server)
  sleep(2) unless index == URLS_TO_SCRAPE.size - 1
end
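
For transient failures (HTTP 429 or 5xx responses), a simple retry with exponential backoff is another option. Here is a minimal sketch, not tuned for any particular site:

require "httparty"

def fetch_with_retries(url, max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    response = HTTParty.get(url, headers: { "User-Agent" => "Mozilla/5.0" })
    raise "HTTP #{response.code}" unless response.code == 200
    response
  rescue StandardError => e
    raise if attempts >= max_attempts
    warn "Attempt #{attempts} failed (#{e.message}), retrying..."
    sleep(2**attempts) # back off 2, 4, 8... seconds between attempts
    retry
  end
end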

8. Testing and Data Validation (Best Practices)

8.1. Testing with RSpec

  1. Create a spec folder and add spec_helper.rb.
  2. Write tests to confirm your selectors still work:

# spec/scraper_spec.rb
require "spec_helper"
require_relative "../linkarooie_scraper"

RSpec.describe "LinkarooieScraper" do
  it "scrapes the loftwah profile correctly" do
    data = scrape_linkarooie("https://linkarooie.com/loftwah")
    expect(data[:profile_title]).to include("Dean Lofts")
    expect(data[:links]).not_to be_empty
  end
end
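
Note that require_relative executes any top-level code in linkarooie_scraper.rb, so requiring it from a spec would also run the scraper itself. A common way around this (a sketch, assuming the report-writing code sits at the top level of the script) is to guard the entry point so it only runs when the file is invoked directly:

# At the bottom of linkarooie_scraper.rb
if __FILE__ == $PROGRAM_NAME
  data = scrape_linkarooie("https://linkarooie.com/loftwah")
  # ... build and write the Markdown/HTML reports ...
end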

8.2. VCR for Stable Tests

Add gem "vcr" and gem "webmock". Configure them, then wrap your tests:

# spec/scraper_spec.rb
VCR.use_cassette("loftwah_profile") do
  data = scrape_linkarooie("https://linkarooie.com/loftwah")
  # ...
end
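
A minimal configuration in spec/spec_helper.rb might look like this (the cassette directory name is just a conventional choice):

# spec/spec_helper.rb
require "vcr"
require "webmock/rspec"

VCR.configure do |config|
  config.cassette_library_dir = "spec/cassettes"
  config.hook_into :webmock
end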

This ensures consistent test results without relying on live site availability.

8.3. Data Validation

Validate data before adding it to your reports:

require "uri"

# Define the helpers before they are used below.
def sanitize_text(text)
  text_string = text.to_s.strip
  text_string.empty? ? "Untitled" : text_string
end

def validate_url(url)
  uri = URI.parse(url.to_s)
  (uri.is_a?(URI::HTTP) || uri.is_a?(URI::HTTPS)) ? url : "Invalid URL"
rescue URI::InvalidURIError
  "Invalid URL"
end

links_data = link_elements.map do |li|
  {
    title: sanitize_text(li.at_css("h2 a")&.text),
    url: validate_url(li.at_css("h2 a")&.[]("href")),
    description: sanitize_text(li.at_css("p")&.text)
  }
end

9. Configuration in YAML (Optional)

To avoid hardcoded values, store them in config/scraper.yml:

scraper:
  base_url: "https://linkarooie.com"
  profile_path: "/loftwah"
  selectors:
    profile_title: "h1.text-2xl.font-bold"
    link_items: ".links-section ul li"
    link_title: "h2 a"
    link_url: "h2 a"
    link_desc: "p"
  output:
    directory: "reports/daily"
    date_format: "%Y-%m-%d"
  rate_limit_seconds: 2

Then load it:

require "yaml"

config = YAML.load_file("config/scraper.yml")
base_url = config.dig("scraper", "base_url")
profile_path = config.dig("scraper", "profile_path")
selectors = config.dig("scraper", "selectors")
rate_limit = config.dig("scraper", "rate_limit_seconds")

full_url = "#{base_url}#{profile_path}"
response = HTTParty.get(full_url, headers: { "User-Agent" => "Mozilla/5.0" })
# ...
sleep(rate_limit)
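
The loaded selectors can then stand in for the hard-coded CSS strings used earlier, for example:

doc = Nokogiri::HTML(response.body)
profile_title = doc.at_css(selectors["profile_title"])&.text
link_elements = doc.css(selectors["link_items"])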

10. Automating with Cron or GitHub Actions

10.1. Cron (Linux/macOS)

Run daily at 9 AM:

crontab -e
# Add (adjust the path to your project directory):
0 9 * * * cd /path/to/linkarooie-scraper && /usr/bin/env bundle exec ruby linkarooie_scraper.rb >> scraper.log 2>&1

10.2. GitHub Actions

Create a .github/workflows/scraper.yml file and use checkout@v4 with Ruby 3.3:

name: "Daily Linkarooie Scrape"
on:
  schedule:
    - cron: "0 9 * * *"

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Ruby
        uses: ruby/setup-ruby@v1
        with:
          ruby-version: "3.3"

      - name: Install Dependencies
        run: bundle install

      - name: Run Scraper
        run: bundle exec ruby linkarooie_scraper.rb

Note: If you prefer to run this inside Docker during GitHub Actions, you can build and run the container instead of installing Ruby. However, the above example demonstrates using Ruby directly on the GitHub Actions runner.
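
Also note that files written on the runner are discarded when the job finishes. If you want to keep the generated reports, one option (a sketch using actions/upload-artifact) is to add a step after the scraper runs:

      - name: Upload Reports
        uses: actions/upload-artifact@v4
        with:
          name: linkarooie-reports
          path: reports/daily/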


11. Final linkarooie_scraper.rb Example

Here’s a condensed version of the scraper script that you can copy and paste:

#!/usr/bin/env ruby

require "httparty"
require "nokogiri"
require "redcarpet"
require "fileutils"
require "uri"

URL = "https://linkarooie.com/loftwah"
RATE_LIMIT_SECONDS = 2

def scrape_linkarooie(url)
  response = HTTParty.get(url, headers: { "User-Agent" => "Mozilla/5.0" })
  raise "Failed to fetch (Status #{response.code})" unless response.code == 200

  doc = Nokogiri::HTML(response.body)

  profile_title = doc.at_css("h1.text-2xl.font-bold")&.text
  link_els = doc.css(".links-section ul li")
  links_data = link_els.map do |li|
    link_title = li.at_css("h2 a")&.text&.strip
    link_url   = li.at_css("h2 a")&.[]("href")
    link_desc  = li.at_css("p")&.text&.strip
    { title: link_title, url: link_url, description: link_desc }
  end
  images = doc.css("img").map { |img| img["src"] }.compact

  { profile_title: profile_title, links: links_data, images: images }
end

# With a single URL there is nothing to rate limit; when scraping several
# profiles, call sleep(RATE_LIMIT_SECONDS) between requests.
data = scrape_linkarooie(URL)

FileUtils.mkdir_p("reports/daily")
report_date = Time.now.strftime("%Y-%m-%d")

markdown_file = "reports/daily/#{report_date}.md"
markdown_content = <<~MD
  # Linkarooie Profile Report
  **Date:** #{report_date}

  ## Profile Title
  #{data[:profile_title]}

  ## Links
MD

data[:links].each do |l|
  markdown_content << "- **#{l[:title]}**\n"
  markdown_content << "  - URL: [#{l[:url]}](#{l[:url]})\n"
  markdown_content << "  - Description: #{l[:description]}\n\n"
end

markdown_content << "## Images\n"
data[:images].each do |img|
  markdown_content << "![Image](#{img})\n"
end

File.write(markdown_file, markdown_content)
puts "Markdown report saved to #{markdown_file}"

# Convert to HTML
html_file = "reports/daily/#{report_date}.html"
renderer = Redcarpet::Markdown.new(Redcarpet::Render::HTML)
html_report = renderer.render(markdown_content)
File.write(html_file, html_report)
puts "HTML report saved to #{html_file}"

Conclusion

By following this guide, you’ll have a solid, repeatable process for scraping Linkarooie profiles. You’ve learned to fetch pages with HTTParty, parse them with Nokogiri, generate Markdown and HTML reports, keep requests polite with rate limiting, test your selectors with RSpec and VCR, validate the scraped data, drive the scraper from a YAML config, and automate the whole thing with Cron, GitHub Actions, or Docker.