Robots on the web are not mechanical devices that go around doing repetitive, physical tasks. In fact, this technology is quite complex to someone who isn't well versed in its intricate and layered workings. On the surface, the Internet is a tool that everyone takes advantage of without really knowing what it is comprised of. If you look deeper into the Internet, you will find scripts called "web crawlers": programs that are created to travel throughout the Internet (robotstxt.org), and a file named "robots.txt", which regulates most of these robots. These robots can be very useful to companies, and very detrimental to them as well. Throughout this essay, I aim to inform you about what this txt file is, what web robots are, and what web crawlers are capable of.
What is robots.txt? At its core, it is the Robots Exclusion Protocol: a file that the owner of a website writes to give instructions to web robots about their site (robotstxt.org). In this file the owner can write a set of instructions that will be given to the web spiders. If you don't have any instructions to give these web robots, or you wish for all robots to scan and index your website's pages, then simply removing "robots.txt" from the website's hierarchy entirely will allow the web spiders to crawl without instruction. These spiders will first scan this file, read whether or not their robot is allowed to enter the website, or what pages they are allowed to scan, and then proceed with web crawling. Here are a couple of examples of how you would edit this file. If the owner of a website doesn't want any robots to scan their website, they would put in their txt file "User-agent: *" and on the line below: "Disallow: /" (robotstxt.org). The "User-agent" line refers to a specific robot. For example, if you wanted to refer to Googlebot, you would write "User-agent: Googlebot", while using an asterisk refers to all robots. The "Disallow:" line refers to what robots are prohibited from scanning. For instance, if you had a file called "topsecret.html", then writing "Disallow: /topsecret.html" will prevent the corresponding robot from indexing that file into the search engine, effectively hiding the file from all upstanding and above-board robots, that is, robots that were written to read the txt file (Sexton, Patrick). Opposite to "Disallow" there is the "Allow" directive, which, as you can assume, describes what files the robot can access when the robot was prohibited access to a certain directory, or folder (Sexton, Patrick).
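To make these directives concrete, here are the two examples above as they would appear in a robots.txt file (the bot and file names come from the examples in the text; the "#" lines are just annotations, which the protocol treats as comments):

```
# Example 1: bar every robot from the entire site.
User-agent: *
Disallow: /

# Example 2 (a separate file): bar only Googlebot from
# one file, leaving the rest of the site open to it.
User-agent: Googlebot
Disallow: /topsecret.html
```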
For instance, if I had a folder named "topsecret", but I wanted certain bots to be able to index a file inside it called "nottopsecret.html", then the file would contain the "Disallow" directive with "/topsecret", and on the following line the "Allow" directive with "/topsecret/nottopsecret.html", resulting in the upstanding robots being allowed into that folder only to view the nottopsecret.html file. Even though you are able to bar web robots from certain pages of your website this way, not all robots comply with, or even check, the robots.txt file; some simply ignore the restrictions (robotstxt.org). This will be discussed further later in the essay.
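Well-behaved crawlers automate this check before fetching any page. As a sketch of how that works, Python's standard library includes urllib.robotparser, which can be fed the hypothetical "topsecret" rules above (one caveat: this particular parser applies the first matching rule, so the Allow line is listed before the Disallow line):

```python
from urllib.robotparser import RobotFileParser

# Rules mirroring the example: bar the "topsecret" folder,
# but allow one file inside it.
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Allow: /topsecret/nottopsecret.html",
    "Disallow: /topsecret",
])

# A compliant robot checks each URL before fetching it:
print(parser.can_fetch("MyBot", "/topsecret/nottopsecret.html"))  # True
print(parser.can_fetch("MyBot", "/topsecret/secret.html"))        # False
print(parser.can_fetch("MyBot", "/index.html"))                   # True
```

A robot that skips this check entirely is exactly the kind of non-compliant crawler the essay returns to later.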
Web robots (also known as web spiders and web crawlers) are programs that are sent to various websites to index those sites' information, and are mostly used by search engines, like Google. Web indexing refers to the methods used to index the content of websites on the Internet (Wikipedia). Some websites use alphabetical indexes, which would be useful for browsing terms, while search engines primarily use keywords and metadata, such as keywords placed within the META tags in HTML code, a practice that has led to spamming of unrelated terms to gain attention ("Indexing the Web"). The web robots crawl through individual websites, recording data or words into a search engine's index, which is a large database of words and where they come from ("Wp Themes Planet"). This catalog of words and web pages has multiple uses for different people. For instance, this catalog can be used to help marketers keep up with the terms currently being used, so that they can cater to the current times. These crawlers also have to be carefully coded: when crawling a page, a robot might stumble upon code on the website that should only activate when a person actually views the page. Not all web robots are designed to do the same thing as other robots. Instead of scanning for the robots.txt file and following the restrictions and instructions given to them by the website owner, some are programmed and written to do more malicious things and ignore the file. One example of this is email harvesting. Email harvesting is the process of obtaining email addresses through multiple means, usually for the purpose of spamming or sending bulk emails (Techopedia). Another example of this malicious and abusive use of bots is data scraping. Data scraping (or web scraping) is the act of downloading data straight off the website, rather than cataloging it into an index (Quora).
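The "large database of words and where they come from" described above is essentially an inverted index. A minimal sketch in Python (the page URLs and contents here are invented for illustration) shows the core idea:

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of page URLs it appears on,
    the way a search engine's index maps terms to documents."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

# Hypothetical pages a crawler might have fetched:
pages = {
    "http://example.com/a": "web crawlers index the web",
    "http://example.com/b": "robots obey the robots.txt file",
}
index = build_index(pages)
print(sorted(index["the"]))  # both pages contain "the"
```

A real search engine's index also stores positions, link data, and ranking signals, but the word-to-pages mapping is the backbone.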
While the act of data scraping itself isn't bad, using it to get information that is otherwise not public is. Not everything is black or white when it comes to web crawling; there is a middle, grey area in which a number of companies operate. The crawlers that operate in this grey zone are sometimes built with regular expressions (regex), a notation used to find patterns in text. The programmer in this case would look through the website that they plan to crawl, and then, using regex, develop patterns for the web crawler to follow in gathering data. For instance, suppose I run an e-commerce company and I want to see how much my competitors are selling certain items for, or plan to sell them for. I would then develop a web crawler that ignores the robots.txt file, crawls through my competitor's website, pulls their item prices off the site, and builds a data table so that I can analyze this data. These are just a few of the ways that web crawlers are used.
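As a sketch of that grey-area price-scraping pattern, here is how such a crawler might use a regular expression to pull prices out of downloaded HTML; the markup is invented for illustration, and a real scraper would tailor the pattern to the target site's actual pages:

```python
import re

# A fragment of HTML such a crawler might have downloaded:
html = """
<div class="item"><span class="name">Widget</span>
  <span class="price">$19.99</span></div>
<div class="item"><span class="name">Gadget</span>
  <span class="price">$4.50</span></div>
"""

# The regex encodes the pattern the programmer found by
# inspecting the competitor's pages: a dollar sign followed
# by digits, a dot, and two more digits.
prices = [float(p) for p in re.findall(r"\$(\d+\.\d{2})", html)]
print(prices)  # [19.99, 4.5]
```

The resulting list could then be loaded into a data table and compared against the company's own prices.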
Although robots.txt may sound like an unnecessary and obsolete file, since anyone can program a web crawler to ignore it, robots.txt is an important part of anyone's website: coding it incorrectly, or simply disallowing all web robots from crawling and indexing your website, could result in your website not being indexed by search engines, effectively hiding your website from search results.
“The Web Robots Pages.” The Web Robots Pages. N.p., n.d. Web. 10 Feb. 2017. http://www.robotstxt.org/
“What are the biggest differences between web crawling and web scraping?” Quora. N.p., n.d. Web. 14 Feb. 2017. https://www.quora.com/What-are-the-biggest-differences-between-web-crawling-and-web-scraping
Sexton, Patrick. “The robots.txt file explained and illustrated.” Varvy. N.p., 29 Apr. 2016. Web. 12 Feb. 2017. https://varvy.com/robottxt.html
“Web indexing.” Wikipedia. Wikimedia Foundation, 11 Jan. 2017. Web. 12 Feb. 2017. https://en.wikipedia.org/wiki/Web_indexing
“Web scraping.” Wikipedia. Wikimedia Foundation, 04 Feb. 2017. Web. 15 Feb. 2017. https://en.wikipedia.org/wiki/Web_scraping
“Indexing the Web.” American Society for Indexing. N.p., n.d. Web. 11 Feb. 2017. http://www.asindexing.org/reference-shelf/indexing-the-web/
“Wp Themes Planet.” Wp Themes Planet RSS. N.p., n.d. Web. 12 Feb. 2017. http://www.wpthemesplanet.com/2009/09/how-does-web-crawler-spider-work/
“What is Email Harvesting? – Definition from Techopedia.” Techopedia.com. N.p., n.d. Web. 13 Feb. 2017. https://www.techopedia.com/definition/1657/email-harvesting