Every page that loads in a web browser comes with a response code in the HTTP response, which may or may not be visible on the web page itself.
A server can return many different response codes to communicate the loading status of a page; one of the best-known is the 404 response code.
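Generally, any code from 400 to 499 indicates a client-side error, meaning the requested page couldn’t be served. A 404 response code specifically means the server found nothing at that URL. In practice it usually signals that the page is gone and probably isn’t coming back anytime soon, although the code that formally says "removed for good" is 410.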
What’s a Soft 404 Error?
A soft 404 error isn’t an official response code sent to a web browser. It’s just a label Google adds to a page within its index.
As Google crawls pages, it allocates resources carefully, ensuring that no time is wasted crawling missing pages that do not need to be indexed.
However, some servers are poorly configured and return a 200 code for a missing page when they should return a 404 response code. If the HTTP response, which visitors never see, reports a 200 code even though the web page clearly states that the page isn’t found, the page might be indexed, which is a waste of resources for Google.
To combat this issue, Google notes the characteristics of 404 pages and attempts to discern whether the 404 page really is a 404 page. In other words, Google learned that if it looks like a 404, smells like a 404, and acts like a 404, then it’s probably a genuine 404 page.
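The mismatch described above can be checked from your own side. The following is a minimal sketch, not Google's actual method: it flags a response whose status is 200 but whose text reads like an error page, and it probes a site with a random URL that should not exist. The phrase list, the function names, and the user agent string are all assumptions made for illustration.

```python
import urllib.error
import urllib.request
import uuid

# Illustrative, incomplete list of phrases that often appear on "not found" pages.
NOT_FOUND_PHRASES = ("page not found", "error 404", "404 not found", "no longer exists")


def looks_like_soft_404(status: int, body: str) -> bool:
    """Return True when a page answers 200 but its text reads like an error page."""
    if status != 200:
        return False
    text = body.lower()
    return any(phrase in text for phrase in NOT_FOUND_PHRASES)


def fetch(url: str, timeout: float = 10.0) -> tuple[int, str]:
    """Fetch a URL and return (status code, body text), also for 4xx/5xx responses."""
    request = urllib.request.Request(url, headers={"User-Agent": "soft-404-check"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as error:
        return error.code, error.read().decode("utf-8", errors="replace")


def answers_200_for_missing_page(site_root: str) -> bool:
    """Request a random URL that should not exist; True suggests a misconfigured server."""
    random_url = site_root.rstrip("/") + "/" + uuid.uuid4().hex
    status, _ = fetch(random_url)
    return status == 200
```

Calling `answers_200_for_missing_page("https://example.com")` performs a real request, so run it only against a site you manage. Note that `urlopen` follows redirects, so a server that redirects every missing URL to its home page would also show up as answering 200.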
Potentially Misidentified as Soft 404
There are also cases wherein the page isn’t actually missing, but certain characteristics have triggered Google to categorize it as a missing page.
Some of these characteristics include a small amount or lack of content on the page and having too many similar pages on the site.
These characteristics are also similar to the factors that the Panda algorithm tackles. The Panda update considers thin and duplicate content as negative ranking factors.
Therefore, fixing these issues will help avoid both soft 404s and Panda issues.
404 errors have two main causes:
- An error in the link, directing users to a page that doesn’t exist.
- A link going to a page that used to exist and suddenly disappeared.
If the cause of the 404 is a linking error, you just have to fix the links.
The difficult part of this task is finding all the broken links on a site.
It can be more challenging for large, complex sites that have thousands or millions of pages. In instances like this, crawling tools come in handy. You can try using software such as Xenu, DeepCrawl, Screaming Frog, or Botify.
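For a single page, or a small site, the same idea can be sketched with Python's standard library. This is an illustration under stated assumptions rather than a replacement for the crawling tools above: the function names are made up for this example, only `http` and `https` links are checked, and network failures other than HTTP error responses are not handled.

```python
import urllib.error
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            absolute = urljoin(self.base_url, href)
            # Skip mailto:, tel:, javascript: and similar non-web links.
            if absolute.startswith(("http://", "https://")):
                self.links.append(absolute)


def extract_links(base_url: str, html: str) -> list[str]:
    """Return all web links found in the HTML, as absolute URLs."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


def status_of(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a URL using a HEAD request."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def broken_links(base_url: str, html: str) -> list[str]:
    """Return the links on a page that answer with a 404."""
    return [url for url in extract_links(base_url, html) if status_of(url) == 404]
```

Some servers reject `HEAD` requests, so a real check may need to fall back to `GET`.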
A Page That No Longer Exists
When a page no longer exists, you have two options:
- Restore the page if it was accidentally removed.
- 301 redirect it to the closest related page if it was removed on purpose.
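Redirects are usually configured in the web server or CMS, so the details depend on your setup. Purely as an illustration of what a 301 redirect does, here is a minimal WSGI sketch in Python. The `REDIRECTS` mapping and its paths are hypothetical, and the app serves no real pages: it only shows the two responses involved.

```python
# Hypothetical mapping from removed paths to their closest related pages.
REDIRECTS = {
    "/old-red-shoes": "/shoes",
}


def app(environ, start_response):
    """Answer 301 for removed pages that have a replacement, and 404 otherwise."""
    path = environ.get("PATH_INFO", "/")
    target = REDIRECTS.get(path)
    if target is not None:
        # The Location header tells browsers and crawlers where the page moved.
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    # A real 404 status, so the missing page is not mistaken for a live one.
    start_response("404 Not Found", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"Page not found"]
```

To try it locally, the app can be served with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` from the standard library.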
First, you have to locate all the linking errors on the site. As with finding broken links on a large site, crawling tools can help. However, they may not find orphaned pages, which are pages that no navigation menu or other page on the site links to.
Orphaned pages often appear after a redesign: a page that used to be part of the website loses its internal links, while external links from other websites may still point to it. To double check if these kinds of pages exist on your site, you can use a variety of tools.
Google Search Console
Search Console will report 404 pages as Google’s crawler goes through all the pages it can find. This can include links from other sites going to a page that used to exist on your website.
You won’t find a missing page report in Google Analytics by default. However, you can track them in a number of ways.
For one, you can create a custom report and segment out pages that have a page title mentioning Error 404 – Page Not Found.
Another way to find orphaned pages within Google Analytics is to create custom content groupings and to assign all 404 pages to a content group.
Site: Operator Search Command
Searching Google for “site:example.com” will list all pages of example.com that are indexed by Google. You can then individually check if the pages are loading or if they’re giving 404s.
To do this at scale, I like using WebCEO, which has a feature to run the site: operator not only on Google, but also on Bing, Yahoo, Yandex, Naver, Baidu, and Seznam.
Since each search engine will only give you a subset of your pages, running the command on several of them can give you a larger list of pages on your site. This list can be exported and run through tools for a bulk 404 check. I simply add all the URLs as links in an HTML file and load it in Xenu to check for 404 errors in bulk.
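Building that HTML file by hand gets tedious for long lists. A small script along these lines could generate it. This is a sketch, and the function name and output layout are my own choices rather than anything a particular tool requires.

```python
from html import escape


def build_link_page(urls: list[str]) -> str:
    """Return a bare HTML page with one link per unique URL, in input order."""
    # dict.fromkeys removes duplicates while keeping the original order.
    unique = list(dict.fromkeys(url.strip() for url in urls if url.strip()))
    items = "\n".join(
        f'<li><a href="{escape(url, quote=True)}">{escape(url)}</a></li>'
        for url in unique
    )
    return f"<!DOCTYPE html>\n<html><body>\n<ul>\n{items}\n</ul>\n</body></html>\n"
```

Writing the result to disk is then one line, for example `pathlib.Path("links.html").write_text(build_link_page(urls), encoding="utf-8")`, and the file can be opened in the link checker of your choice.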
Other Backlink Research Tools
Backlink research tools like Majestic, Ahrefs, Moz Open Site Explorer, Sistrix, LinkResearchTools, and CognitiveSEO can also help.
Most of these tools will export a list of backlinks linking to your domain. From there, you can check all the pages that are being linked to and look for 404 errors.
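Export formats differ from tool to tool, so the following is only a sketch. It assumes a CSV export with a header row and a column named `Target URL`, which is a hypothetical name you would replace with whatever your tool uses. It returns each linked-to page once, ready for a status check.

```python
import csv
from typing import Iterable


def linked_targets(export_lines: Iterable[str], column: str = "Target URL") -> list[str]:
    """Return the unique linked-to URLs from a backlink CSV export, in first-seen order.

    `export_lines` can be an open file or any iterable of CSV lines.
    The default column name is an assumption, not a real tool's format.
    """
    seen: dict[str, None] = {}
    for row in csv.DictReader(export_lines):
        url = (row.get(column) or "").strip()
        if url:
            seen.setdefault(url, None)
    return list(seen)
```

With a file on disk, this would be called as `linked_targets(open("backlinks.csv", newline="", encoding="utf-8"))`.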
How to Fix Soft 404 Errors
Crawling tools won’t detect a soft 404 because it isn’t really a 404 error. But you can use crawling tools to detect the issues that tend to trigger one. Here are a few things to find:
- Thin Content: Some crawling tools not only report pages that have thin content, but also show a total word count. From there, you can sort URLs by word count. Start with the pages that have the fewest words and evaluate whether each one has thin content.
- Duplicate Content: Some crawling tools are sophisticated enough to discern what percentage of the page is template content. If the main content is nearly the same as many other pages, you should look into these pages and determine why duplicate content exists on your site.
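If your crawler doesn't report these numbers, both checks can be approximated with a short script. The sketch below counts visible words per page and compares two pages word by word. The helper names are made up for this example, the pages are assumed to be already downloaded as HTML strings, and unlike the more sophisticated tools it does not separate template content from main content.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def visible_words(html: str) -> list[str]:
    """Return the words a visitor would see on the page."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.parts).split()


def word_count(html: str) -> int:
    return len(visible_words(html))


def similarity(html_a: str, html_b: str) -> float:
    """Return a 0.0 to 1.0 ratio of how much two pages' visible text overlaps."""
    return SequenceMatcher(None, visible_words(html_a), visible_words(html_b)).ratio()


def thinnest_first(pages: dict[str, str]) -> list[tuple[str, int]]:
    """Sort pages (URL -> HTML) by word count, fewest words first."""
    counts = ((url, word_count(html)) for url, html in pages.items())
    return sorted(counts, key=lambda item: item[1])
```

What counts as "thin" or "too similar" is a judgment call, so treat the numbers as a way to prioritize manual review rather than as a verdict.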
Aside from the crawling tools, you can also use Google Search Console and check under crawl errors to find pages that are listed under soft 404s.
Crawling an entire site to find issues that cause soft 404s allows you to locate and correct problems before Google even detects them.
After detecting these soft 404 issues, you will need to correct them.
Most of the time, the solutions are common sense. They can be as simple as expanding pages with thin content or replacing duplicate content with new, unique content.
Throughout this process, here are a few things to consider:
- Consolidate Pages: Sometimes thin content is caused by being too specific with the page topic, which can leave you with little to say. Merging several thin pages into one page can be more appropriate if the topics are related. Not only does this solve thin content issues, but it can fix duplicate content issues as well. For example, an e-commerce site selling shoes that come in different colors and sizes may have a different URL for each size and color combination. This leaves a large number of pages with content that is thin and relatively identical. The more effective approach is to put this all on one page instead and enumerate the options available.
- Find Technical Issues That Cause Duplicate Content: Even with the simplest web crawling tool, like Xenu (which doesn’t look at content but only URLs, response codes, and title tags), you can find duplicate content issues by looking at URLs. These include www vs. non-www URLs, http vs. https, URLs with and without index.html, URLs with and without tracking parameters, etc. A good summary of these common duplicate content issues found in URL patterns can be found on slide 6 of this presentation.
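The URL patterns listed above lend themselves to an automated check on a crawler's URL export. Here is a rough sketch that reduces each URL to a normalized form and groups the ones that collapse to the same address. The set of tracking parameters and the choice of `https` without `www` as the normalized form are assumptions for illustration, so adjust both to match your site.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; replace with the ones your site actually uses.
TRACKING_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "gclid", "fbclid",
}


def normalize(url: str) -> str:
    """Reduce a URL to one form: https, no www, no index.html, no tracking parameters."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    path = parts.path
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    path = path or "/"
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    return urlunsplit(("https", host, path, urlencode(kept), ""))


def duplicate_groups(urls: list[str]) -> dict[str, list[str]]:
    """Group URLs that normalize to the same address; keep groups with 2+ members."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}
```

Each group that comes back is a set of URLs likely serving the same content, which makes it a candidate for a redirect or canonical tag.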
Google Treats 404 Errors & Soft 404 Errors the Same Way
A soft 404 is not a real 404 error, but Google will deindex those pages if they aren’t fixed quickly. It is best to crawl your site regularly to see if 404 or soft 404 errors occur. Crawling tools should be a major component of your SEO arsenal.