This article is an excerpt from the following book: Wicked Cool Ruby Scripts: Useful Scripts That Solve Difficult Problems.
The purpose of this script is to validate all the links on a web page. Checking link validity is important for several reasons. First, encountering broken links is very frustrating for a viewer. Second, valid links make a site more professional. Finally, if your website contains a link to someone else's site and they move or remove a page, you have no way of knowing without specifically checking.
Without automating this task, a person would have to click each link by hand to prove the paths were valid. Extremely small sites are easy to validate, but sites with many links are tedious and time-consuming. This is an example of a task that, when done manually, could take several hours. With the use of some Ruby tricks, you can cut that time down to 10 seconds! Writing the script will take a little time, but it's reusable.
The Code
require 'uri'
require 'open-uri'
require 'rubyful_soup'

begin
  print "\n\nEnter website to crawl (ex. http://www.google.com): "
  url = gets.chomp # strip the trailing newline before parsing
  puts url
  uri = URI.parse(url)
  # On Ruby 3.0 and later, use URI.open(uri).read instead.
  html = open(uri).read
rescue Exception => e
  print "Unable to connect to the url:"
  puts "ERROR ---- #{e}"
end

soup = BeautifulSoup.new(html)
# Collect the href of every anchor tag, then drop javascript and mailto links.
links = soup.find_all('a').map { |a| a['href'] }
links.delete_if { |href| href =~ /javascript|mailto/ }

links.each do |l|
  if l
    begin
      link = URI.parse(l)
      # Fill in the scheme and host for relative links.
      link.scheme ||= 'http'
      link.host ||= uri.host
      link.path = uri.path + link.path unless link.path[0] == ?/
      link = URI.parse(link.to_s)
      open(link).read
    rescue Exception => e
      puts "#{link} failed because #{e}"
    end
  end
end
For starters, we need to talk about HTML manipulation and interfacing with websites. Ruby has several ways of accessing the Web, but the simplest to use, by far, is open-uri. If you are familiar with wget, then getting to know open-uri should be easy; with my wicked little gems, I'm halfway to scraping web pages. For Internet scraping activities, I typically use rubyful_soup, an HTML/XML parser for Ruby, in combination with uri and open-uri. The rubyful_soup gem can be installed like any of the other gems used throughout the book. As you follow the examples in the book, you will see just how powerful rubyful_soup can be.
The script begins with some error handling in case the user mistakenly enters a bad URL or a connection cannot be made to the root directory of the web address. Either way, the error is caught and reported. Note that, as listed, the script does not prompt again; the user has to rerun it to correct the mistake.
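The listing rescues the error a single time. One way to give the user repeated chances is a small retry loop. The following is only a sketch: `prompt_for_url`, its `input`/`output` parameters, and the `max_attempts` limit are hypothetical names that do not appear in the book, and the helper checks only that the input is a well-formed http(s) URL, not that the site is reachable.

```ruby
require 'uri'

# Hypothetical helper (not from the book): keep asking until the input parses
# as an http(s) URL or the attempts run out. Checks syntax only.
def prompt_for_url(input: $stdin, output: $stdout, max_attempts: 3)
  max_attempts.times do
    output.print "Enter website to crawl (ex. http://www.google.com): "
    line = input.gets
    return nil if line.nil? # end of input, so stop asking

    begin
      uri = URI.parse(line.strip)
      # URI::HTTPS is a subclass of URI::HTTP, so both schemes pass.
      return uri if uri.is_a?(URI::HTTP) && uri.host
      output.puts "ERROR ---- not an http(s) URL: #{line.strip}"
    rescue URI::InvalidURIError => e
      output.puts "ERROR ---- #{e.message}"
    end
  end
  nil # every attempt failed
end
```

Calling `uri = prompt_for_url` in place of the `gets` line would return a parsed URI, or `nil` once the attempts are used up.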
After the URL has been entered, it is parsed using the uri library. The URL you provide is opened using the open(uri).read call. This single line opens the URL and reads in all of the HTML source code. Pretty cool, huh? Did you ever think scraping a web page would be so easy?
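The open(uri).read call reflects the Ruby version the book targeted. On Ruby 3.0 and later, open-uri no longer extends Kernel#open to handle URLs, and URI.open is the call to use. A minimal sketch of the fetch step under that assumption follows; `fetch_html` and its `opener` parameter are hypothetical names, and the parameter exists only so the function can be tried without a network connection.

```ruby
require 'open-uri'

# Hypothetical wrapper (not from the book). URI.open is provided by open-uri
# on current Ruby versions. Returns the page body, or nil if the fetch fails.
def fetch_html(url, opener: URI.method(:open))
  opener.call(url.to_s).read
rescue StandardError => e
  warn "Unable to connect to the url: #{url}"
  warn "ERROR ---- #{e}"
  nil
end
```

A call such as `html = fetch_html(uri)` would then stand in for `html = open(uri).read`, with `nil` signalling that the page could not be read.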
If there are any issues navigating to your URL, the script will show you the error and print the specific error message. Now on to the fun part, where rubyful_soup shows its power.
A new batch of rubyful_soup is made by initializing BeautifulSoup and passing in our HTML source code. The soup allows us to easily parse the HTML source code. Sure, you could write a fancy regular expression or check each line for an HREF, but this feature is already supported by the soup! Just tell the soup to find all of the links in the source and save them to our array entitled links. We also want to remove javascript and mailto links, because these will make the parsers unhappy when they start testing link validity. Once the links are cleaned up, the script starts to iterate through each one.
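For comparison, the regular-expression route mentioned above can be sketched with nothing but the standard library. This is a crude stand-in for the soup, not a replacement: `extract_links` is a hypothetical helper, and the pattern handles only quoted href attributes on anchor tags.

```ruby
# Hypothetical helper (not from the book): a regex-based stand-in for
# soup.find_all('a'). It will miss unusual markup, which is why a real
# parser is the better choice.
def extract_links(html)
  pattern = /<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1/im
  hrefs = html.scan(pattern).map { |_quote, href| href }
  # Same filter as the script: drop javascript and mailto links.
  hrefs.reject { |href| href =~ /javascript|mailto/ }
end

html = <<~HTML
  <a href="/about.html">About</a>
  <a href='http://example.com/'>Example</a>
  <a href="mailto:someone@example.com">Mail</a>
  <a href="javascript:void(0)">Menu</a>
HTML

p extract_links(html)
# => ["/about.html", "http://example.com/"]
```

The two surviving entries are the ones the link checker would go on to test; the mailto and javascript entries are discarded before any request is made.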
Because we are checking for the validity of each link, what we are really checking for is any link that throws an error. If no errors are thrown, we know for certain that the link is valid. To interpret the output, we use a little more error-handling-fu and start checking each link. If the links are valid, the script will move on. If a link is bad, it will be logged. In this script, I have chosen to output the bad links to the command prompt, but you can hack the script to output to a text file or whatever you want. For more information, see Wicked Cool Ruby Scripts: Useful Scripts That Solve Difficult Problems.
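As one possible take on the "output to a text file" hack, the sketch below collects failures and appends them to a log file. It is an illustration rather than the book's code: `check_links`, the `bad_links.txt` default, and the `fetcher` parameter are hypothetical, and URI.join is swapped in for the script's manual scheme and host patching when resolving relative links.

```ruby
require 'uri'
require 'open-uri'

# Hypothetical helper (not from the book). Resolves each href against the page
# URL with URI.join, tries to fetch it, and appends any failures to a log file.
# The fetcher parameter exists so the logic can be tried without a network.
def check_links(base_url, hrefs, log_path: 'bad_links.txt',
                fetcher: ->(url) { URI.open(url).read })
  failures = []
  hrefs.compact.each do |href|
    begin
      link = URI.join(base_url, href).to_s
      fetcher.call(link)
    rescue StandardError => e
      # link is nil when URI.join itself failed, so fall back to the raw href.
      failures << "#{link || href} failed because #{e}"
    end
  end

  unless failures.empty?
    File.open(log_path, 'a') do |file|
      failures.each { |line| file.puts line }
    end
  end
  failures
end
```

Something like `check_links(uri.to_s, links)` would replace the `links.each` loop; the returned array can still be printed to the command prompt if both outputs are wanted.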