Robots_txt: Test whether a URL may be crawled by checking robots.txt

Last Updated:	2008-03-04 (8 years ago)
Ratings:	36%
Unique User Downloads:	Total: 1,336
Download Rankings:	All time: 2,802 This week: 1,016

Version:	`robots_txt` 1.1
License:	GNU General Publi...
PHP version:	5.0
Categories:	PHP 5, Searching

Description

This class can be used to check whether a page may be crawled by looking at the robots.txt file of its site.

It takes the URL of a page and retrieves the robots.txt file of the same site.
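
As a rough illustration of that first step, the sketch below derives the robots.txt location from a page URL. The function name `robotsTxtUrlFor` is hypothetical and is not part of the Robots_txt class; it only assumes that robots.txt sits at the root of the same scheme, host and port as the page.

```php
<?php
// Hypothetical helper (not part of the Robots_txt class): build the
// robots.txt URL for the site that hosts $strPageUrl.
function robotsTxtUrlFor($strPageUrl)
{
    $arrParts = parse_url($strPageUrl);

    // parse_url() returns false on badly malformed URLs; relative URLs
    // come back without a scheme or host.
    if ($arrParts === false || empty($arrParts['scheme']) || empty($arrParts['host'])) {
        return null;
    }

    $strUrl = $arrParts['scheme'] . '://' . $arrParts['host'];
    if (isset($arrParts['port'])) {
        $strUrl .= ':' . $arrParts['port'];
    }

    return $strUrl . '/robots.txt';
}

echo robotsTxtUrlFor('http://slashdot.org/test'), "\n";
```

Returning `null` for relative URLs is a design choice of this sketch; a crawler would normally resolve relative links against the page they came from before asking.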

The class parses the robots.txt file and looks up the rules defined in it to see whether the site allows crawling the intended page.
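
The parsing and lookup described above might look roughly like the following. This is a simplified sketch, not the class's actual code: the function names `disallowedPrefixes` and `pathAllowed` are hypothetical, only `User-agent` and `Disallow` lines are interpreted, and matching is a plain prefix comparison with no support for `Allow` lines or wildcards.

```php
<?php
// Hypothetical helper: collect the Disallow prefixes that apply to
// $strUserAgent, falling back to the "*" group if no group names the agent.
function disallowedPrefixes($strRobotsTxt, $strUserAgent)
{
    $arrGroups  = array();  // lower-cased agent token => list of Disallow prefixes
    $arrCurrent = array();  // agents named by the group currently being read
    $blnInRules = false;    // true once the current group has seen a rule line

    foreach (preg_split('/\r\n|\r|\n/', $strRobotsTxt) as $strLine) {
        $strLine = trim(preg_replace('/#.*$/', '', $strLine)); // strip comments
        if ($strLine === '' || strpos($strLine, ':') === false) {
            continue;
        }
        list($strField, $strValue) = explode(':', $strLine, 2);
        $strField = strtolower(trim($strField));
        $strValue = trim($strValue);

        if ($strField === 'user-agent') {
            if ($blnInRules) {           // a User-agent after rules starts a new group
                $arrCurrent = array();
                $blnInRules = false;
            }
            $strAgent = strtolower($strValue);
            $arrCurrent[] = $strAgent;
            if (!isset($arrGroups[$strAgent])) {
                $arrGroups[$strAgent] = array();
            }
        } else {
            $blnInRules = true;
            if ($strField === 'disallow' && $strValue !== '') {
                // An empty Disallow value means "nothing is disallowed".
                foreach ($arrCurrent as $strAgent) {
                    $arrGroups[$strAgent][] = $strValue;
                }
            }
        }
    }

    $strNeedle = strtolower($strUserAgent);
    foreach ($arrGroups as $strAgent => $arrPrefixes) {
        $strAgent = (string) $strAgent;  // numeric-looking keys come back as integers
        if ($strAgent !== '*' && $strAgent !== '' && strpos($strNeedle, $strAgent) !== false) {
            return $arrPrefixes;
        }
    }
    return isset($arrGroups['*']) ? $arrGroups['*'] : array();
}

// Hypothetical helper: a path is allowed unless it starts with a Disallow prefix.
function pathAllowed($strPath, $arrPrefixes)
{
    foreach ($arrPrefixes as $strPrefix) {
        if (strpos($strPath, $strPrefix) === 0) {
            return false;
        }
    }
    return true;
}
```

For example, given a robots.txt body in `$strRobotsTxt`, `pathAllowed('/test', disallowedPrefixes($strRobotsTxt, 'Slurp'))` reports whether `/test` is open to an agent identifying itself as Slurp.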

The class also stores the time at which a page is crawled, so that the next time a page of the same site is requested it can honor the site's crawl delay and request rate limits.

Innovation Award

January 2008
Number 8

robots.txt is a file that sites place in the Web root of their domain to tell search engine crawlers, and Web robots in general, which pages should not be crawled.

This class can parse a robots.txt file of a domain to determine whether a given page should be crawled or not.

It is useful for implementing a friendly crawler that respects the wishes of site owners who do not want certain pages crawled by Web robot programs.

Manuel Lemos

Andy Pieters

Name:	Andy Pieters `<contact>`
Classes:	1 package by Andy Pieters
Country:	United Kingdom

Innovation award

Nominee: 1x

Details

		The robots exclusion standard is considered proper netiquette, so any kind of script that exhibits
		crawling-like behavior is expected to abide by it.

		The intended use of this class is to feed it a URL before you visit it. The class will
		automatically attempt to read the site's robots.txt file and will return a boolean value
		indicating whether you are allowed to visit this URL.

		Crawl-delay and Request-rate values are capped at 60 seconds.

		The class will block until the detected crawl-delay (or request-rate) allows visiting the URL.

		For instance, if Crawl-delay is set to 3, the Robots_txt::urlAllowed() method will block for up
		to 3 seconds when called a second time. An internal clock is kept with the last visited time, so
		if the delay has already expired, the method will not block.
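
The blocking behaviour described above can be sketched as a small throttle that remembers the last visit per host. The class name `CrawlThrottle` and its methods are hypothetical and are not the internals of Robots_txt; the 60-second cap follows the limit stated above, and Request-rate handling is left out.

```php
<?php
// Hypothetical throttle illustrating the "internal clock" idea.
class CrawlThrottle
{
    const MAX_DELAY = 60; // seconds; mirrors the 60-second cap described above

    private $arrLastVisit = array(); // host => timestamp of the last visit

    // Seconds still to wait before $strHost may be visited again at time $fltNow.
    public function secondsToWait($strHost, $fltDelay, $fltNow)
    {
        // Clamp the requested delay into the range 0..MAX_DELAY.
        $fltDelay = max(0, min((float) $fltDelay, self::MAX_DELAY));
        if (!isset($this->arrLastVisit[$strHost])) {
            return 0.0; // never visited, so there is nothing to wait for
        }
        $fltElapsed = $fltNow - $this->arrLastVisit[$strHost];
        return max(0.0, $fltDelay - $fltElapsed);
    }

    // Record that $strHost was visited at time $fltNow.
    public function markVisited($strHost, $fltNow)
    {
        $this->arrLastVisit[$strHost] = $fltNow;
    }

    // Block until the delay has expired, then record the visit.
    public function waitFor($strHost, $fltDelay)
    {
        $fltWait = $this->secondsToWait($strHost, $fltDelay, microtime(true));
        if ($fltWait > 0) {
            usleep((int) ceil($fltWait * 1000000));
        }
        $this->markVisited($strHost, microtime(true));
    }
}
```

Passing the current time into `secondsToWait()` keeps the arithmetic separate from the sleeping, which makes the delay logic easy to check without actually waiting.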

		Example usage

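		// Assumption: the core file listed under Files is on the include path.
		require_once 'Robots.txt.class.php';

		$strUserAgent = 'Slurp';	// the user agent name your crawler identifies as
		$arrUrlsToVisit = array('http://slashdot.org/', 'http://slashdot.org/test');

		foreach ($arrUrlsToVisit as $strUrlToVisit) {
			// urlAllowed() blocks until any crawl delay for the site has expired
			if (Robots_txt::urlAllowed($strUrlToVisit, $strUserAgent)) {
				// visit the URL and do the processing here
			}
		}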

		The simple example above will ensure you abide by the wishes of the site owners.

		Note: an unofficial, non-standard extension exists that limits the times at which crawlers
			  are allowed to visit a site. I choose to ignore this extension because I feel it
			  is unreasonable.

		Note: You are only *required* to specify your userAgent the first time you call the urlAllowed method, and
			  only the first value is ever used.
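
One way to get the "first value wins" behaviour from the note above is a static property that is set once and never overwritten. This is only a sketch of the idea: the class `AgentMemo` and its `resolve()` method are hypothetical, and the real urlAllowed() signature may differ.

```php
<?php
// Hypothetical illustration: remember the first user agent ever supplied.
class AgentMemo
{
    private static $strUserAgent = null;

    public static function resolve($strUserAgent = null)
    {
        // Store the value only if nothing has been stored yet.
        if (self::$strUserAgent === null && $strUserAgent !== null) {
            self::$strUserAgent = $strUserAgent;
        }
        return self::$strUserAgent;
    }
}
```

With this pattern, a later call that passes a different user agent is silently ignored, which matches the note but can surprise callers who crawl under more than one name in the same process.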
			  
Example Usage
	var_dump(Robots_txt::urlAllowed('http://slashdot.org/','Slurp'));
	var_dump(Robots_txt::urlAllowed('http://slashdot.org/test','Slurp'));

Files

File	Role	Description
`Robots.txt.class.php`	Class	Core file
`README.txt`	Doc.	Usage Examples

	robots_txt-2008-03-04.zip 4KB
	robots_txt-2008-03-04.tar.gz 4KB


User Ratings

User Comments (1)

	All time
Utility:	50%
Consistency:	62%
Documentation:	50%
Examples:	-
Tests:	-
Videos:	-
Overall:	36%
Rank:	2807

Says not allowed also if it is: http://www.
4 years ago (Ivan Spadacenta)

10%

