Robots.txt

Buletti
Posts: 43
Joined: Tue 27. Sep 2005, 10:51
Location: Hamburg / Berlin / Germany

Post by Buletti »

I have a stupid question.
If a website is running under PHP, which folders are the robots allowed into?
I mean, the structure of the site is now different from normal HTML sites, where the info is stored in the info folder.

But if I use the structure RECIPEs, for example, there is no folder named recipes in my root...
juergen
Moderator
Posts: 4556
Joined: Mon 10. Jan 2005, 18:10
Location: Weinheim

Post by juergen »

:idea: Take a PHP site, e.g.

http://blablabla.de and look for its robots.txt ;-)

This is one I found that way:

Code:

# robots.txt for http://www.philipp-XXXXXX.de/
# file created: 08.08.01
User-agent: *
# Disallow: /cgi-bin/ # exclude robots from specified tree
# Disallow: /scripts/
 Kompetenz in Präzisionsgewindespindeln, Fein- und Trapezgewindespindeln, gewindeschleifen, 
 Praezisionsgewindespindeln, Gewindespindeln, Feingewindespindeln, Trapezgewindespindeln, 
 Trapezgewindestangen, Feingewindetriebe, Trapezgewindetriebe, Gewindekerne, Gewinderollen, 
 rundschleifen, Schnittwerkzeuge, Präzisionsdrehteile, Praezisionsdrehteile, Präsionsfrästeile, 
 Praezisionsfraesteile, Werkzeugfertigung, drehen, fräsen, fraesen, aussenrundschleifen, 
 innenrundschleifen, flachschleifen, spitzenloses, schleifen, schneckenschleifen, 
 honen, Maschinen, Werkzeugbau.
:lol:

All my sites run without robots.txt: one ranks at the top, the rest rank well.

greetz

Jürgen the robot
:evil: :lol:
trip
Posts: 657
Joined: Tue 17. Feb 2004, 09:56
Location: Cape Town, South Africa

Post by trip »

the robots.txt file is a very good way of directing search engines...
e.g. Google reads it to learn where not to go... many other search engines also obey its simple commands, which tell them what directories to ignore and, for some crawlers, how long to wait between requests...
robots.txt can become very in-depth...

TriP
Buletti
Posts: 43
Joined: Tue 27. Sep 2005, 10:51
Location: Hamburg / Berlin / Germany

Post by Buletti »

Thank you for the answers. I see a light in the dark.
:lol:
That robots.txt is for directing search engines is easy for me to understand...

OK, for Google you can disallow cgi-bin or other directories.

But DF6IH writes that he runs his sites without robots.txt.
Does that mean the spiders crawl every folder? Or do they crawl nothing, because nothing tells them what to crawl?

And if we apply this to phpwcms, for example, the spiders crawl every folder (allow *), but there are only PHP files and no HTML-like content.
So what is the point? Do the spiders "see" the PHP website the way we do? That is, do the spiders see only the content?
For example, there is no use in spiders crawling the FCKEditor subfolders...

And what do the words in the robots.txt file mean? DF6IH writes:
Kompetenz in Präzisionsgewindespindeln, Fein- und Trapezgewindespindeln, gewindeschleifen,

Are these keywords for the content? If so, maybe I should put them in my robots.txt file, too.

My robots.txt file looks like this:

Code:

User-agent:*
Disallow: /cgi-bin/
Disallow: /logs/
Disallow: /config/
Disallow: /include/
Disallow: /img/
Disallow: /phpwcms_ftp/
Disallow: /picture/
Disallow: /phpwcms_code_snippets/
If I understand you correctly, I should change it to allow all folders except cgi-bin, config...
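To see what the rules posted above permit, here is a minimal sketch using Python's standard-library urllib.robotparser. The helper name allowed and the test paths (/index.php?recipes, /config/somefile.php) are made up for illustration, and this shows only how the Python parser reads the file; real crawlers may interpret details differently.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt posted above.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /logs/
Disallow: /config/
Disallow: /include/
Disallow: /img/
Disallow: /phpwcms_ftp/
Disallow: /picture/
Disallow: /phpwcms_code_snippets/
"""


def allowed(rules: str, path: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


# Hypothetical paths, for illustration only.
print(allowed(RULES, "/index.php?recipes"))    # not under any Disallow prefix
print(allowed(RULES, "/config/somefile.php"))  # under Disallow: /config/
print(allowed("", "/config/somefile.php"))     # empty robots.txt: no restrictions
```

The rules match on URL path prefixes, not on folders that exist on disk, so a generated page such as /index.php?recipes is fetchable even though there is no recipes folder. With no rules at all, the parser treats every path as allowed.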

Thank you for teaching me. But you see, when I ask a question, many more come up after I read your answers.
:roll:
The site http://www.blahblahblah.de you mentioned before is a moderator's website.
Anyway, I wish you all a happy Christmas... :lol:
Last edited by Buletti on Fri 23. Dec 2005, 12:38, edited 1 time in total.
Phadda
Posts: 4
Joined: Sat 19. Nov 2005, 13:41

Post by Phadda »

the robots.txt is also a good information file for script kiddiez: it shows them which directories may have some interesting content worth trying to "hack" :)

look at the MS robots.txt, hehe
http://www.microsoft.com/robots.txt
frold
Posts: 2151
Joined: Tue 25. Nov 2003, 22:42

Post by frold »

Phadda wrote: the robots.txt is also a good information file for script kiddiez: it shows them which directories may have some interesting content worth trying to "hack" :)

look at the MS robots.txt, hehe
http://www.microsoft.com/robots.txt
or this one : http://www.whitehouse.gov/robots.txt

Looks like they don't want bad publicity (look at all the Iraq links :S) - now that is what you can call censorship!! Shame on you, Bush!!
http://www.studmed.dk Portal for doctors and medical students in Denmark
trip
Posts: 657
Joined: Tue 17. Feb 2004, 09:56
Location: Cape Town, South Africa

Post by trip »

here is another example of what a robots.txt file can do

Code:

User-agent: msnbot
Disallow: /cgi-bin 
basically it's telling the MSN spider to stay away from the cgi-bin

Code:

User-Agent: MSNbot
Crawl-Delay: 20
asks msnbot to wait 20 seconds between requests (crawl delay)

Code:

User-agent: Googlebot-Image
Disallow: /images
asks the Google Images spider not to crawl anything under /images

basically anything you do not want to be spidered needs to be listed in the robots.txt file

so from the above you can see it can be used for a lot
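A rough way to check per-spider rules like the ones above is Python's standard-library urllib.robotparser. This is only a sketch: the test paths are invented, and the two msnbot snippets are merged into one group here, because this parser uses only the first group that matches a user agent. Real crawlers may handle such details differently.

```python
from urllib.robotparser import RobotFileParser

# The three snippets above combined into one file; the two msnbot
# snippets are merged into a single group (an assumption of this sketch).
RULES = """\
User-agent: msnbot
Disallow: /cgi-bin
Crawl-delay: 20

User-agent: Googlebot-Image
Disallow: /images
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Hypothetical paths, for illustration only.
print(parser.can_fetch("msnbot", "/cgi-bin/test.cgi"))           # blocked for msnbot
print(parser.can_fetch("Googlebot-Image", "/images/a.png"))      # blocked for the image spider
print(parser.can_fetch("Googlebot-Image", "/cgi-bin/test.cgi"))  # msnbot rule does not apply here
print(parser.crawl_delay("msnbot"))                              # delay requested from msnbot
print(parser.crawl_delay("Googlebot-Image"))                     # no delay set for this spider
```

Each group applies only to the spider it names, so the cgi-bin rule for msnbot does not stop the image spider, and a spider with no matching group is not restricted at all.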
TriP
Peekay
Posts: 286
Joined: Sun 25. Jul 2004, 23:24
Location: UK

Post by Peekay »

I too would be interested to know if anyone has scored better with the search engines by creating a robots.txt file excluding /phpwcms_filestorage/ etc., or are the results the same using no robots.txt file at all? :?:

In my tests using mod_rewrite, a PHPWCMS site does (eventually) get deep-indexed by Google without robots.txt.