Robots.txt

Post by **Buletti** » Thu 22. Dec 2005, 16:38

I have a stupid question.
If a website runs on PHP, which folders should be allowed for the robots?
I mean, the structure of the site is now different from a normal HTML site, where the info is stored in an info folder.

But if I use a structure level called RECIPES, for example, there is no recipes folder in my root...

Post by **juergen** » Thu 22. Dec 2005, 19:05

Take a PHP site, e.g.

http://blablabla.de and look for its robots.txt.

This is one I found that way:

Code:

# robots.txt for http://www.philipp-XXXXXX.de/
# file created: 08.08.01
User-agent: *
# Disallow: /cgi-bin/ # exclude robots from specified tree
# Disallow: /scripts/
Kompetenz in Präzisionsgewindespindeln, Fein- und Trapezgewindespindeln, gewindeschleifen,
Praezisionsgewindespindeln, Gewindespindeln, Feingewindespindeln, Trapezgewindespindeln,
Trapezgewindestangen, Feingewindetriebe, Trapezgewindetriebe, Gewindekerne, Gewinderollen,
rundschleifen, Schnittwerkzeuge, Präzisionsdrehteile, Praezisionsdrehteile, Präsionsfrästeile,
Praezisionsfraesteile, Werkzeugfertigung, drehen, fräsen, fraesen, aussenrundschleifen,
innenrundschleifen, flachschleifen, spitzenloses, schleifen, schneckenschleifen,
honen, Maschinen, Werkzeugbau.

All my sites run without a robots.txt; one ranks at the top, the rest rank well.

greetz

Jürgen the robot

Post by **trip** » Fri 23. Dec 2005, 07:54

The robots.txt file is a very good way of directing search engines...
E.g. Google reads it to learn where it should not go, and many other search engines also obey simple directives telling them which directories to ignore as well as how long to wait between requests...
robots.txt can become very in-depth...

TriP

Post by **Buletti** » Fri 23. Dec 2005, 12:27

Thank you for the answers. I see a light in the dark.

That the robots.txt is for directing search engines is easy for me to understand....

O.K., for Google you can disallow the cgi-bin or other directories.

But DF6IH writes that he is running his sites without a robots.txt.
That means the spiders crawl every folder. Is that right? Or do they crawl nothing, because they have no instruction to crawl anything?
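
On the question of what happens without a robots.txt: under the robots exclusion convention, a missing or empty file means nothing is excluded, so well-behaved spiders may crawl every URL they can find. A minimal Python sketch using the standard library's `urllib.robotparser` illustrates this; the path used here is a made-up example, not a real phpwcms file.

```python
from urllib.robotparser import RobotFileParser

# An empty rule set stands in for a site that has no robots.txt at all.
parser = RobotFileParser()
parser.parse([])

# With no rules, nothing is excluded. The path below is hypothetical.
no_rules_allowed = parser.can_fetch("Googlebot", "/include/some_script.php")
print(no_rules_allowed)
```

This only shows how one parser reads an empty rule set; whether a spider actually finds a given URL still depends on links pointing to it.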

And if we apply this to phpwcms, for example, the spiders crawl every folder (allow *), but there are just the PHP files and no HTML-like content.
So what's the intention? Do the spiders "see" the PHP website like we do? That is, do the spiders see just the content?
For example, there is no point in spiders crawling the FCKEditor subfolders...

And what do the words in the robots.txt file mean? DF6IH writes:
Kompetenz in Präzisionsgewindespindeln, Fein- und Trapezgewindespindeln, gewindeschleifen,

Are these keywords describing the content? If so, maybe I should put some in my robots.txt file, too.

My robots.txt file looks like this:

Code:

User-agent: *
Disallow: /cgi-bin/
Disallow: /logs/
Disallow: /config/
Disallow: /include/
Disallow: /img/
Disallow: /phpwcms_ftp/
Disallow: /picture/
Disallow: /phpwcms_code_snippets/

If I understand you correctly, I should change it to allow all folders except cgi-bin, config...
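
As a rough sketch of how the file above is read: Disallow lines are matched as prefixes against the URL path, not against folders on disk, and anything that matches no Disallow line is allowed by default. The Python example below uses the standard library's `urllib.robotparser`; the three paths are assumptions chosen for illustration, including `/index.php?recipes` as a stand-in for a structure level that is served by a script in the root rather than from its own folder.

```python
from urllib.robotparser import RobotFileParser

# The rules from the post above.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /logs/
Disallow: /config/
Disallow: /include/
Disallow: /img/
Disallow: /phpwcms_ftp/
Disallow: /picture/
Disallow: /phpwcms_code_snippets/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Hypothetical URL paths, chosen only for illustration.
paths = ["/index.php?recipes", "/include/example.php", "/img/logo.gif"]
results = {path: parser.can_fetch("*", path) for path in paths}

for path, allowed in results.items():
    print(path, "allowed" if allowed else "blocked")
```

Since everything not listed is allowed already, a file like this needs no explicit allow line for the remaining folders.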

Thank you for teaching me. But you see, whenever I ask a question, many more come up after I read your answers.

The site http://www.blahblahblah.de you mentioned before is a moderator's website.
Anyway, I wish you all a happy Christmas...


Post by **Phadda** » Fri 23. Dec 2005, 12:34

The robots.txt is also a good information file for script kiddiez: it points them to directories that may have some interesting content in them to "hack" :)

Look at the MS robots.txt, hehe:
http://www.microsoft.com/robots.txt

Post by **frold** » Fri 23. Dec 2005, 13:02

Phadda wrote: The robots.txt is also a good information file for script kiddiez: it points them to directories that may have some interesting content in them to "hack"

Look at the MS robots.txt, hehe:
http://www.microsoft.com/robots.txt

Or this one: http://www.whitehouse.gov/robots.txt

Looks like they don't want bad press (look at all the Iraq links :S). Now that is what you can call censorship!! Shame on you, Bush!!

Post by **trip** » Fri 23. Dec 2005, 13:13

Here is another example of what a robots.txt file can do:

Code:

User-agent: msnbot
Disallow: /cgi-bin

Basically, it's telling the MSN spiders to stay away from the cgi-bin.

Code:

User-Agent: MSNbot
Crawl-Delay: 20

This sets a crawl delay for the MSN bot, asking it to wait 20 seconds between requests.

Code:

User-agent: Googlebot-Image
Disallow: /images

This asks Google's image crawler not to fetch anything under /images.

Basically, anything you do not want to be spidered needs to be listed in the robots.txt file.

So from the above you can see it can be used for a lot.
TriP
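
To see how per-crawler groups like the ones above are applied, here is a small Python sketch using the standard library's `urllib.robotparser` (its `crawl_delay` method needs Python 3.6 or later). It is only one parser's reading of the rules, and real spiders may differ in details. The two msnbot snippets are merged into one group here, and the paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The three snippets above combined into one file. The two msnbot snippets
# are merged into a single group, because this parser only uses the first
# group that matches a given user agent.
RULES = """\
User-agent: msnbot
Disallow: /cgi-bin
Crawl-delay: 20

User-agent: Googlebot-Image
Disallow: /images
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Hypothetical paths, for illustration only.
msn_cgi = parser.can_fetch("msnbot", "/cgi-bin/form.cgi")
image_bot = parser.can_fetch("Googlebot-Image", "/images/photo.jpg")
other_bot = parser.can_fetch("Googlebot", "/images/photo.jpg")
msn_delay = parser.crawl_delay("msnbot")

print(msn_cgi, image_bot, other_bot, msn_delay)
```

Each group applies only to the crawler it names, so a spider that is not mentioned in any group, and has no `User-agent: *` group to fall back on, is not restricted by this file.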

Post by **Peekay** » Thu 12. Apr 2007, 13:02

I too would be interested to know if anyone has scored better with the search engines by creating a robots.txt file excluding /phpwcms_filestorage/ etc., or are the results the same using no robots.txt file at all?

In my tests using mod_rewrite, a PHPWCMS site does (eventually) get deep-indexed by Google without robots.txt.