Feature #14798
Robots.txt and indexed search
| Status: | Accepted | Start date: | 2005-06-06 |
|---|---|---|---|
| Priority: | Should have | Due date: | |
| Assigned To: | - | % Done: | 0% |
| Category: | Indexed Search | Spent time: | - |
| Target version: | 7.4 (Backend) | | |
| PHP Version: | 5.5 | Sprint Focus: | |
| Complexity: | | | |
Description
I've got a handful of PDF files that I don't want indexed. To keep
Google and other search engines away from them, I use a robots.txt
file.
My requested feature is to have TYPO3 honor robots.txt files and skip
indexing the files listed there.
(issue imported from #M1170)
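The requested behavior can be sketched with Python's standard-library robots.txt parser (a minimal sketch; the `typo3-indexer` agent name and the file paths are made-up examples, not anything TYPO3 ships):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for the site being indexed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /fileadmin/private.pdf
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Before indexing a file, check it against the robots.txt rules
# and drop anything that is disallowed.
candidates = [
    "http://www.mysite.com/fileadmin/private.pdf",
    "http://www.mysite.com/fileadmin/public.pdf",
]
to_index = [url for url in candidates
            if parser.can_fetch("typo3-indexer", url)]
print(to_index)  # only public.pdf survives the filter
```

The same prefix-matching check would have to run inside the indexer's crawl loop, before the document is fetched and parsed.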
History
#1 Updated by Michael Stucki about 10 years ago
Are you talking about internal or external files?
Let's say your site is http://www.mysite.com/mysitedir/ and you have several links. Which of them do you think should be checked against robots.txt?
http://www.mysite.com/mysitedir/fileadmin/test.pdf
http://www.mysite.com/mysitedir/Intro.5.0.html
http://www.mysite.com/anothersitedir/fileadmin/test.pdf
http://www.mysite.com/anothersitedir/Intro.5.0.html
http://www.anothersite.com/mysitedir/fileadmin/test.pdf
http://www.anothersite.com/mysitedir/Intro.5.0.html
#2 Updated by Jody Cleveland about 10 years ago
I would think anything listed in robots.txt should be skipped, as long as it is within the site. More like this one:
#3 Updated by Michael Stucki about 10 years ago
I will implement this if you can do some research for me about robots.txt:
- Is there an RFC?
- Where is the file expected to be found: only in / or in any directory of the rootline?
- Does it accept regular expressions or only plain strings?
- Any other special formatting?
#4 Updated by Jody Cleveland almost 10 years ago
Here's the RFC:
3.3 Formal Syntax

This is a BNF-like description, using the conventions of RFC 822 [5],
except that "|" is used to designate alternatives. Briefly, literals
are quoted with "", parentheses "(" and ")" are used to group
elements, optional elements are enclosed in [brackets], and elements
may be preceded with <n>* to designate n or more repetitions of the
following element; n defaults to 0.

    robotstxt    = *blankcomment
                 | blankcomment record *( 1*commentblank 1*record )
                   *blankcomment
    blankcomment = 1*(blank | commentline)
    commentblank = *commentline blank *(blankcomment)
    blank        = *space CRLF
    CRLF         = CR LF
    record       = *commentline agentline *(commentline | agentline)
                   1*ruleline *(commentline | ruleline)
    agentline    = "User-agent:" *space agent [comment] CRLF
    ruleline     = (disallowline | allowline | extension)
    disallowline = "Disallow" ":" *space path [comment] CRLF
    allowline    = "Allow" ":" *space rpath [comment] CRLF
    extension    = token : *space value [comment] CRLF
    value        = <any CHAR except CR or LF or "#">
    commentline  = comment CRLF
    comment      = blank "#" anychar
    space        = 1*(SP | HT)
    rpath        = "/" path
    agent        = token
    anychar      = <any CHAR except CR or LF>
    CHAR         = <any US-ASCII character (octets 0 - 127)>
    CTL          = <any US-ASCII control character
                       (octets 0 - 31) and DEL (127)>
    CR           = <US-ASCII CR, carriage return (13)>
    LF           = <US-ASCII LF, linefeed (10)>
    SP           = <US-ASCII SP, space (32)>
    HT           = <US-ASCII HT, horizontal-tab (9)>

The syntax for "token" is taken from RFC 1945 [2], reproduced here for
convenience:

    token        = 1*<any CHAR except CTLs or tspecials>
    tspecials    = "(" | ")" | "<" | ">" | "@"
                 | "," | ";" | ":" | "\" | <">
                 | "/" | "[" | "]" | "?" | "="
                 | "{" | "}" | SP | HT

The syntax for "path" is defined in RFC 1808 [6], reproduced here for
convenience:

    path        = fsegment *( "/" segment )
    fsegment    = 1*pchar
    segment     = *pchar
    pchar       = uchar | ":" | "@" | "&" | "="
    uchar       = unreserved | escape
    unreserved  = alpha | digit | safe | extra
    escape      = "%" hex hex
    hex         = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                  "a" | "b" | "c" | "d" | "e" | "f"
    alpha       = lowalpha | hialpha
    lowalpha    = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
                  "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
                  "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
    hialpha     = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
                  "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
                  "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
    digit       = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
                  "8" | "9"
    safe        = "$" | "-" | "_" | "." | "+"
    extra       = "!" | "*" | "'" | "(" | ")" | ","

I believe the robots.txt file needs to be in the root of the site:
This section contains an example of how a /robots.txt may be used.
A fictional site may have the following URLs:

    http://www.fict.org/
    http://www.fict.org/index.html
    http://www.fict.org/robots.txt
    http://www.fict.org/server.html
    http://www.fict.org/services/fast.html
    http://www.fict.org/services/slow.html
    http://www.fict.org/orgo.gif
    http://www.fict.org/org/about.html
    http://www.fict.org/org/plans.html
    http://www.fict.org/%7Ejim/jim.html
    http://www.fict.org/%7Emak/mak.html

The site may in the /robots.txt have specific rules for robots that
send a HTTP User-agent "UnhipBot/0.1", "WebCrawler/3.0", and
"Excite/1.0", and a set of default rules:

    # /robots.txt for http://www.fict.org/
    # comments to webmaster@fict.org

    User-agent: unhipbot
    Disallow: /

    User-agent: webcrawler
    User-agent: excite
    Disallow:

    User-agent: *
    Disallow: /org/plans.html
    Allow: /org/
    Allow: /serv
    Allow: /~mak
    Disallow: /

The following matrix shows which robots are allowed to access URLs:

| URL | unhipbot | webcrawler & excite | other |
|---|---|---|---|
| http://www.fict.org/ | No | Yes | No |
| http://www.fict.org/index.html | No | Yes | No |
| http://www.fict.org/robots.txt | Yes | Yes | Yes |
| http://www.fict.org/server.html | No | Yes | Yes |
| http://www.fict.org/services/fast.html | No | Yes | Yes |
| http://www.fict.org/services/slow.html | No | Yes | Yes |
| http://www.fict.org/orgo.gif | No | Yes | No |
| http://www.fict.org/org/about.html | No | Yes | Yes |
| http://www.fict.org/org/plans.html | No | Yes | No |
| http://www.fict.org/%7Ejim/jim.html | No | Yes | No |
| http://www.fict.org/%7Emak/mak.html | No | Yes | Yes |

I took all this from this document:
http://www.robotstxt.org/wc/norobots-rfc.html
I hope that helps, and I really appreciate you looking into this. If there's anything else you need, let me know.
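For reference, the default-rules part of the RFC example above can be checked mechanically with Python's standard-library `urllib.robotparser` (a sketch; note that, unlike the RFC example, Python's parser does not special-case /robots.txt itself):

```python
from urllib.robotparser import RobotFileParser

# The /robots.txt from the RFC example above.
ROBOTS_TXT = """\
User-agent: unhipbot
Disallow: /

User-agent: webcrawler
User-agent: excite
Disallow:

User-agent: *
Disallow: /org/plans.html
Allow: /org/
Allow: /serv
Allow: /~mak
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Reproduce a few rows of the access matrix.
checks = [
    ("unhipbot",   "http://www.fict.org/",               False),
    ("webcrawler", "http://www.fict.org/server.html",    True),
    ("other",      "http://www.fict.org/org/about.html", True),
    ("other",      "http://www.fict.org/org/plans.html", False),
]
for agent, url, expected in checks:
    allowed = parser.can_fetch(agent, url)
    print(f"{agent:10} {url:40} {'Yes' if allowed else 'No'}")
    assert allowed == expected
```

Rules are matched by simple path prefix, first match wins, and an empty `Disallow:` means "allow everything", which is why webcrawler and excite get through.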
#5 Updated by Mathias Schreiber 8 months ago
- Description updated (diff)
- Status changed from New to Accepted
- Target version changed from 0 to 7.0
- PHP Version set to 5.5
#6 Updated by Mathias Schreiber 7 months ago
- Target version changed from 7.0 to 7.1 (Cleanup)
#7 Updated by Benjamin Mack about 1 month ago
- Target version changed from 7.1 (Cleanup) to 7.4 (Backend)