Have you ever clicked a link on a web page only to be told that the page doesn't exist? That is a 404 error, or 'page not found'. The tricky part is that search engines sometimes aren't aware of a 404 and think it's a real page. We call this a soft 404. Warning: we are about to get real techie on you here!
The '404' or 'Not Found' error message is a standard HTTP response code indicating that the server could not find the requested file. A 404 and a soft 404 both mean the same thing: the requested file is not present on the web server. But a soft 404 returns the server status code 200 (OK), which tells a search engine that the page exists (which it doesn't). So we need to learn how to tell search engines that soft 404 pages are actually 404 pages (broken links) and that they should therefore not crawl them.
'Soft 404s' in Google Webmaster Tools is one of Google's latest updates. It gives you more control over the robots that crawl your web pages. Why is this important? Search engines pay close attention to a web page's server status code. If many of your web pages return a 404 server status code, your site can be considered low quality and pushed down in search engine rankings. Usually we try hard to find and fix the broken links on a website, but in some cases they are hard to find.
Soft 404s – What Does This Mean?
In short, broken links that are labeled as 200 (OK – server status code) by web servers are called soft 404s.
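One rough way to check your own site for this is to request a URL that certainly doesn't exist and look at the status code that comes back. The Python sketch below does that. The function names `classify` and `probe` and the random test path are our own illustration and are not part of any tool mentioned in this post.

```python
import urllib.error
import urllib.request
import uuid


def classify(status, requested_url, final_url):
    """Classify the response to a request for a page known not to exist."""
    if status in (404, 410):
        return "hard 404"  # correct: the server says the page is missing
    if status == 200 and final_url != requested_url:
        return "soft 404 (redirected to a page that returns 200)"
    if status == 200:
        return "soft 404"
    return "other (%d)" % status


def probe(site):
    """Request a random path that should not exist on `site` (needs network)."""
    url = site.rstrip("/") + "/" + uuid.uuid4().hex + ".html"
    try:
        # urlopen follows redirects, so resp.geturl() is the final URL.
        with urllib.request.urlopen(url, timeout=10) as resp:
            return classify(resp.status, url, resp.geturl())
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses such as a real 404.
        return classify(err.code, url, url)


# Example (requires network access):
# print(probe("https://www.example.com"))
```

If `probe` reports anything starting with "soft 404", the server is answering 200 (OK) for a page that isn't there.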
How can a web server report broken links as 200 (OK)?
An incorrectly set up custom 404 page makes the web server report all the broken links (404) on your website as 200 (OK).
Correct Setup – ErrorDocument 404 /unknown-page.php (Relative path is used to define the custom 404 page)
Incorrect Setup – ErrorDocument 404 https://www.techwyse.com/unknown-page.php (Absolute path, i.e. a full URL, is used to define the custom 404 page)
If the error document setup uses an absolute path for your custom 404 page, the web server redirects the visitor to that page, which is then served with a 200 (OK) server status code and not a proper 404. Tools like Xenu or GPablo (or other broken link checkers) will not be able to find the broken links if you have an absolute path in the error document setup.
In this case, you cannot identify the broken links on your website. Search engines will also treat the broken URL as an existing page and crawl it. This may lead to a duplicate content penalty if the page served matches any other page on your website.
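If your site runs on Apache, a quick scan of the configuration can flag the risky setup. The following is a minimal Python sketch, assuming the config or .htaccess text has already been read into a string. `check_error_documents` is a hypothetical helper name, and it only looks at ErrorDocument 404 lines.

```python
import re

# Matches uncommented "ErrorDocument 404 <target>" lines.
ERROR_DOC = re.compile(r"^[ \t]*ErrorDocument[ \t]+404[ \t]+(\S+)",
                       re.IGNORECASE | re.MULTILINE)
# Matches targets that start with a scheme such as http:// or https://.
FULL_URL = re.compile(r"^[a-z][a-z0-9+.-]*://", re.IGNORECASE)


def check_error_documents(config_text):
    """Return (target, verdict) pairs for each ErrorDocument 404 line."""
    findings = []
    for match in ERROR_DOC.finditer(config_text):
        target = match.group(1)
        if FULL_URL.match(target):
            verdict = "full URL: the server redirects, so the 404 status is lost"
        elif target.startswith("/"):
            verdict = "local path: the 404 status is preserved"
        else:
            verdict = "other form (for example an inline message): check manually"
        findings.append((target, verdict))
    return findings


sample = (
    "ErrorDocument 404 /unknown-page.php\n"
    "ErrorDocument 404 https://www.techwyse.com/unknown-page.php\n"
)
for target, verdict in check_error_documents(sample):
    print(target, "->", verdict)
```

The scan only tells you which form is in use. It is still worth requesting a missing page afterwards to confirm what status code the server really returns.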
Google has started showing these soft 404s in Google Webmaster Tools.
Technically, Google should report only pages that do not exist yet return a 200 (OK) server status code. At present, however, Google Webmaster Tools also lists some genuine 404 pages among the soft 404s, which Google will have to correct.
The following actions are required for each of the soft 404s listed for your website in Google Webmaster Tools.
1) Page contains the correct content and properly returns a 200 response - Not actually a soft 404; no action required
2) Page returns a 404 status response - Not actually a soft 404; no action required
3) Page doesn't exist but returns a 200 response code - Set up a 301 redirect to a more accurate URL
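The three cases above can be sketched as a small decision function. This is only an illustration in Python: `action_for` and its two inputs are names we made up, and you would still have to work out by hand whether each reported page really exists.

```python
def action_for(page_exists, status):
    """Suggest an action for a URL listed as a soft 404.

    page_exists: True if the URL is supposed to serve real content.
    status: the HTTP status code the server actually returns.
    """
    if page_exists and status == 200:
        return "no action: valid page, not actually a soft 404"
    if status == 404:
        return "no action: proper 404, not actually a soft 404"
    if not page_exists and status == 200:
        return "301 redirect to a more accurate URL"
    # Anything else is outside the three cases in the list.
    return "review manually"


for exists, code in [(True, 200), (False, 404), (False, 200)]:
    print(exists, code, "->", action_for(exists, code))
```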
Once every listed URL meets one of these conditions, your site is free from broken links. Search engine robots are always interested in crawling strong, content-driven web pages. Make sure they visit only your valid pages and not the unwanted ones.
Thanks for explaining this in detail Elan.
Thank you so much for this article – at last I understand what has been happening on my sites and using your correction of relative path did the trick……
@Dan: Sites that are developed in .NET or other development platforms don't have the chance of getting 200 (server status code) for broken links, and that's why I didn't mention them here.
Elan, informative read about soft 404. Thanks much for detailing it vividly. Now I clearly got the logic behind the concept. Thanks for that clarification and comments.
Ah, that's such a great insight, Elan. I thought that all pages that do not exist are considered as 404 by Google. But to know about the 'status code' is really a revelation. I think the research you have done is great.
Does Webmaster Tools differentiate between soft 404 and 404 pages and show them separately too?
This one is worth noting! I just found one on my site. Good initiative from Google to update it in Webmaster Tools.
What about sites made in .Net? It seems everyone assumes PHP is the ONLY development platform out there 😀