04.20.06
Truth About Web Crawlers
By Maksym Nesen
Wouldn't it be nice to be able to leave some code in your web site
that tells the search engine crawlers to make your site number
one?
Unfortunately, a robots.txt file or robots meta tag won't do that,
but they can help crawlers index your site better and keep the
unwanted ones out. First, a few definitions:
Search Engine Spiders or Crawlers - A web crawler (also known as a web
spider) is a program that browses the World Wide Web in a methodical,
automated manner. Web crawlers are mainly used to create a copy of
all the visited pages for later processing by a search engine, which
indexes the downloaded pages to provide fast searches.
A web crawler is one type of bot, or software agent. In general, it
starts with a list of URLs to visit. As it visits these URLs, it identifies
all the hyperlinks in the page and adds them to the list of URLs to
visit, recursively browsing the Web according to a set of policies.
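The visiting loop described above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the `crawl` function, the `LinkExtractor` class and the in-memory `FAKE_WEB` pages are hypothetical names made up for the example, and a real crawler would fetch pages over the network and obey robots.txt.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Visit pages breadth-first and return the URLs in visiting order.

    `fetch` is a caller-supplied function that maps a URL to its HTML,
    or returns None when the page cannot be retrieved.
    """
    to_visit = deque(seed_urls)   # the list of URLs still to visit
    seen = set(seed_urls)         # every URL ever queued, to avoid loops
    visited = []
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return visited


# A tiny made-up "web" held in memory so the sketch runs offline.
FAKE_WEB = {
    "http://example.com/": '<a href="http://example.com/a">A</a> '
                           '<a href="http://example.com/b">B</a>',
    "http://example.com/a": '<a href="http://example.com/b">B</a>',
    "http://example.com/b": '<a href="http://example.com/">home</a>',
}

order = crawl(["http://example.com/"], FAKE_WEB.get)
print(order)
```

The `seen` set is what keeps the crawler from going around in circles when pages link back to each other.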
Robots.txt - The robots exclusion standard, or robots.txt protocol,
is a convention to prevent well-behaved web spiders and other web
robots from accessing all or part of a website. The parts that should
not be accessed are listed in a file called robots.txt in the
top-level directory of the website.
The robots.txt protocol is purely advisory and relies on the cooperation
of the web robot, so marking an area of your site out of bounds
with robots.txt does not guarantee privacy. Many web site administrators
have been caught out trying to use the robots file to make private
parts of a website invisible to the rest of the world. However, the
file is necessarily publicly available and is easily checked by anyone
with a web browser.
The robots.txt patterns are matched by simple prefix comparisons
against the start of the path, so take care that patterns meant to
match a directory end with the final '/' character. Otherwise all
files with names starting with that pattern will match, rather than
just those in the directory intended.
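To see the effect of the trailing '/', here is a small sketch using `urllib.robotparser` from the Python standard library. The helper `allowed`, the crawler name `examplebot` and the file names are invented for the example, and other crawlers may implement matching slightly differently.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules, path, agent="examplebot"):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


without_slash = "User-agent: *\nDisallow: /duplicate"
with_slash = "User-agent: *\nDisallow: /duplicate/"

# Without the trailing slash, a file that merely starts with the same
# characters as the directory name is blocked as well.
print(allowed(without_slash, "/duplicate-report.html"))

# With the trailing slash, only paths inside the directory are blocked.
print(allowed(with_slash, "/duplicate-report.html"))
print(allowed(with_slash, "/duplicate/page.html"))
```

Only the first and last checks should come back as blocked; the middle one is allowed because `/duplicate/` is not a prefix of `/duplicate-report.html`.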
Meta Tag - Meta tags are used to provide structured data about a page.
In the early 2000s, search engines veered away from reliance on Meta
tags, as many web sites used inappropriate keywords or stuffed their
pages with keywords to obtain any and all traffic possible.
Some search engines, however, still take Meta tags into some consideration
when delivering results. In recent years, search engines have become
smarter, penalizing websites that cheat (by repeating the same
keyword several times to get a boost in the search ranking). Instead
of going up in the rankings, these websites will go down or,
on some search engines, will be removed from the index completely.
Index a site - The act of crawling your site and gathering information.
How can the robots.txt file and meta tag help you?
In the robots.txt file you can tell harmful web crawlers to leave
your web site alone, and give helpful hints to the ones you want to
crawl your site. Below is an example of how to stop a web crawler
from crawling your site:
# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /
ia_archiver is the crawler name for the Wayback Machine, which you may
have heard of, and the / after Disallow tells ia_archiver not to index
any of your site. The # lets you write comments to yourself so
you can keep track of what you typed.
Type the above three lines into Notepad on your computer and save
the file to the root directory of your web site as robots.txt. Web crawlers
look for this file first when they arrive at a web site, before doing
anything else. This helps the crawler do its job, and lets the web site
owner tell the spider what to do. Say, for instance, you have some data
that you don't want the crawlers to see (like duplicate content served
to other browsers or referrers).
You can deter crawlers from indexing the 'duplicate' directory by
typing this into your robots.txt file.
User-agent: *
Disallow: /duplicate/
The * after User-agent says that this rule applies to all crawlers,
and /duplicate/ after Disallow tells them to ignore this directory
and not crawl it. Each User-agent line and the Disallow lines that
follow it form one record, and records must be separated from each
other by a blank line for the file to be read correctly.
So this is how you would combine the two rules above into one robots.txt
file:
# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /duplicate/
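If you want to check a file like this before uploading it, one option is Python's standard `urllib.robotparser` module. The sketch below parses the combined file from a string; the crawler name `somebot`, the page paths and the `check` helper are made up for the example, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /duplicate/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def check(agent, path):
    """Return True if `agent` is allowed to fetch `path`."""
    return parser.can_fetch(agent, path)


# ia_archiver has its own record and is shut out of the whole site.
print("ia_archiver /index.html:", check("ia_archiver", "/index.html"))

# Every other crawler falls under the * record and only loses /duplicate/.
print("somebot /index.html:", check("somebot", "/index.html"))
print("somebot /duplicate/old.html:", check("somebot", "/duplicate/old.html"))
```

Remember that this only tells you what a well-behaved crawler should do. It does nothing to stop a crawler that ignores the file.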
One thing to note that is very important: anyone can read the robots.txt
file of a site. So if you have information that you don't want anyone
to see, don't list it in the robots.txt file. If the directory
that you don't want anyone to see is not linked to from your web site,
crawlers that find pages by following links generally won't index it
anyway.
An alternative way to block indexing is to put a robots meta
tag into the page itself. It looks like this:
<meta name="robots" content="noindex,nofollow">
You put this into the <head> tag of your web page. This line tells the robot
crawlers not to index the page and not to follow any of the
hyperlinks on the page. As another example,
<meta name="robots" content="noindex,follow">
tells the robot crawlers not to index the page, but to follow the
hyperlinks on it.
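To make the meaning of the content values concrete, here is a rough sketch of how a crawler might read a robots meta tag, using `html.parser` from the Python standard library. The class `RobotsMetaParser`, the function `robots_directives` and the sample page are hypothetical, and actual crawlers may treat unusual or conflicting values in their own way.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html):
    """Return the set of robots meta directives in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = (
    '<html><head><meta name="robots" content="noindex,follow"></head>'
    "<body><a href='/next.html'>next</a></body></html>"
)
found = robots_directives(page)
may_index = "noindex" not in found    # is the page allowed in the index?
may_follow = "nofollow" not in found  # may the links on it be followed?
print(sorted(found), may_index, may_follow)
```

For the sample page, the crawler would skip indexing but still follow the link to /next.html.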
Did You Know That Google Has Its Own Meta Tag?
It looks like this: <meta name="googlebot" content="noindex,nofollow,noarchive">.
This tells the Google robot crawler not to index
the page, not to follow any of the links, and not to store
cached versions of your web site. You will want the last of these if you
update the content on your site frequently. It prevents the web user from
seeing outdated content that hasn't been refreshed because it is stored in
the cache.
You can use the meta tag to specifically talk to Google's robots to
avoid complications or if you are optimizing your site for Google's
search engine.