Rebecca Berbel, Author at Search London

The Secrets of Log Monitoring for the Curious SEO

Posted on February 16, 2019 (updated February 19, 2019) by Rebecca Berbel

When log monitoring and SEO come up, we hear a lot about the value of crawl budget optimization, monitoring Googlebot behavior and tracking organic visitors. But here are a few of the gritty secrets that don’t get shared as often.

What is log monitoring?

In SEO, server records of requests for URLs can be used to learn more about how bots and people use a website. Web servers record every request, its timestamp, the URL that was requested, who requested it, and the response that the server provided.

Requests are logged in various formats, but most look something like this:

Bot visit, identified by the Googlebot user-agent and IP address:

www.oncrawl.com:80 66.249.73.145 - - [07/Feb/2018:17:06:04 +0000] "GET /blog/ HTTP/1.1" 200 14486 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"

Organic visit, identified by the Google address as the referer:

www.oncrawl.com:80 37.14.184.94 - - [07/Feb/2018:17:06:04 +0000] "GET /blog/ HTTP/1.1" 200 37073 "https://www.google.es/" "Mozilla/5.0 (Linux; Android 7.0; SM-G920F Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.137 Mobile Safari/537.36" "-"
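To make the fields in these two lines concrete, here is a minimal Python sketch that splits a line in this layout into named parts. It assumes the exact field order shown above and plain ASCII quotes and hyphens, as a server would write them; the names LOG_PATTERN and parse_line are illustrative, and other log formats would need a different pattern.

```python
import re
from datetime import datetime

# Pattern for the layout shown above: host:port, client IP, two unused fields,
# timestamp, request line, status, bytes, referer, user-agent.
LOG_PATTERN = re.compile(
    r'(?P<host>[^:\s]+):(?P<port>\d+) '
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not fit the pattern."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["port"] = int(fields["port"])
    fields["status"] = int(fields["status"])
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    return fields


BOT_LINE = (
    'www.oncrawl.com:80 66.249.73.145 - - [07/Feb/2018:17:06:04 +0000] '
    '"GET /blog/ HTTP/1.1" 200 14486 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"'
)
hit = parse_line(BOT_LINE)
print(hit["ip"], hit["path"], hit["status"], hit["bytes"])
```

Once lines are parsed this way, a bot visit can be spotted through the user_agent field and an organic visit through the referer field.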

Why logs are the key to SEO success

Log data provides concrete and un-aggregated figures to answer questions that are at the heart of SEO:

How many of my pages have been explored by Google? By users coming from the SERPs?
How often does Google (or organic visitors) visit my pages?
Does Google (or organic visitors) visit some pages more than others?
What URLs produce errors on my site?

Following patterns in Googlebot visits can also provide information about your site’s ranking and indexing. Drops in crawl frequency often precede ranking drops by 2-5 days; a peak in crawl behavior followed by an increase in mobile bot activity and a decline in desktop bot activity is a good sign you’ve moved to Mobile-First Indexing–even if you haven’t received the official email.

Example of a website that has switched to the mobile-first index
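The shift from desktop to mobile bot activity described above could be tracked with a sketch like the one below. It assumes hits are already available as (day, user-agent) pairs, and it treats any Googlebot user-agent containing "Mobile" as the smartphone crawler, which is a heuristic rather than an official rule. The function name mobile_share_by_day and the sample user-agents are illustrative.

```python
from collections import defaultdict


def mobile_share_by_day(hits):
    """Return {day: share of Googlebot hits made by the mobile crawler}."""
    totals = defaultdict(lambda: [0, 0])  # day -> [mobile hits, all hits]
    for day, user_agent in hits:
        if "Googlebot" not in user_agent:
            continue  # ignore visitors and other bots
        totals[day][1] += 1
        if "Mobile" in user_agent:  # heuristic for the smartphone crawler
            totals[day][0] += 1
    return {day: mobile / total for day, (mobile, total) in sorted(totals.items())}


DESKTOP = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
MOBILE = ("Mozilla/5.0 (Linux; Android 6.0.1) Mobile Safari/537.36 "
          "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)")

sample_hits = [
    ("2018-02-07", DESKTOP),
    ("2018-02-07", MOBILE),
    ("2018-02-08", MOBILE),
    ("2018-02-08", MOBILE),
]
print(mobile_share_by_day(sample_hits))
```

A share that climbs and stays high over several days would match the pattern described for a move to Mobile-First Indexing.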

And if you’re having a hard time ranking pages that have never had a “hit” from a googlebot, you know that on-page SEO isn’t the issue: Google doesn’t know what’s on the page yet.

Log data is also essential when analyzing crawl budget on groups of pages, when examining the relationship between crawl and organic traffic, or when understanding the relationship between on-page factors and behavior by bots and visitors.

Because this data comes directly from the server, the gatekeeper of access to a website, it is complete, definitive, always up-to-date, and 100% reliable.

Log lines come in all shapes and sizes

Not all server logs are presented in the same format. Their fields might be in a different order. They might not contain all of the same information.

Some information that is often not included in log files by default includes:

The host. If you’re using your log files for SEO, this is extremely useful. It helps differentiate between requests for HTTP and HTTPS pages, and between subdomains of the same domain.
The port. The port used to transfer data can provide additional information on the protocol used.
The transfer time. If you don’t have other means of determining page speed, the time required to transfer all of the page content to the requester can be very useful.
The number of bytes transferred. The number of bytes helps you spot unusually large or small pages, as well as unwieldy media resources.

Identifying bots is not always easy: the good, the bad, and the missing

Bad bots

Sometimes it’s hard to tell what’s a bot and what’s not. To identify bots, it’s best to start by looking at the user-agent, which contains the name of the visitor, such as “google” or “googlebot”.

But because Google uses legitimate bots to crawl websites, scammers and scrapers (who steal content from your website) often name their bots after Google’s in hopes that they won’t be caught.

Google recommends using reverse DNS lookup to check the IP addresses of their bots. They provide the range of IP addresses their bots use. Bots whose entire user-agent and IP address do not check out should be discounted, and–most likely–blocked.
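The reverse DNS check can be sketched in Python as follows: look up the hostname for the IP, confirm that it ends in a Google domain, then resolve that hostname forward and confirm it points back to the same IP. The function name and the injectable reverse_lookup and forward_lookup parameters are assumptions made so the logic can be tried without network access; by default they fall back to the standard socket module, and the default forward lookup covers IPv4 only.

```python
import socket

# Domains a genuine Googlebot hostname is expected to end with.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if the IP passes a reverse-then-forward DNS check."""
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        hostname = reverse_lookup(ip)
        if not hostname.endswith(GOOGLE_SUFFIXES):
            return False
        # The hostname must resolve back to the IP that made the request.
        return ip in forward_lookup(hostname)
    except OSError:
        # Covers failed lookups (socket.herror and socket.gaierror).
        return False
```

For example, is_verified_googlebot("66.249.73.145") would run both lookups against live DNS. Since DNS lookups are slow, caching the result per IP is usually worthwhile when processing a whole log file.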

This isn’t a rare case. Bad bots can account for over a quarter of the total website traffic on a typical site, according to this study from 2016.

Good bots

On the other hand, bot monitoring also produces some surprises: at OnCrawl, SEO experts often uncover new googlebots before they’re announced. Based on the type of pages they request, we’re sometimes able to guess their role at Google. Among the bots we identified before Google announced them are:

[Ajax] : JavaScript crawler
Google-AMPHTML: AMP exploration
Google-speakr: crawls pages for Google’s page-reading service. It gained a lot of attention in early February 2019 as industry leaders tweeted about having noticed it.

Knowing what type of bot your site attracts gives you the keys to understanding how Google sees and treats your pages.

Missing bots

We’ve also discovered that, although Googlebot-News is still listed in the official list of bots, this bot is not used for crawling news articles. In fact, we’ve never spotted it in the wild.

Server errors disguised as valid pages

Sometimes server errors produce blank pages, but the server doesn’t realize this is an error. It reports the page as a status 200 (“everything’s ok!”) and no one’s the wiser–but neither bots nor visitors seeing blank pages get to see the URL’s actual content.

Monitoring the number of bytes transferred in server logs per URL will reveal this sort of error, if one occurs.
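As a rough illustration, the sketch below flags status 200 responses whose byte count is far below the usual size for the same URL. It assumes log lines have already been parsed into dicts with path, status and bytes keys; the function name and the 10% threshold are arbitrary choices for the example, not recommended values.

```python
from collections import defaultdict
from statistics import median


def find_suspicious_200s(records, ratio=0.1):
    """Return {path: [byte counts]} for 200 responses much smaller than usual."""
    sizes = defaultdict(list)
    for record in records:
        if record["status"] == 200:
            sizes[record["path"]].append(record["bytes"])

    flagged = {}
    for path, values in sizes.items():
        typical = median(values)
        # Anything under `ratio` times the median size is treated as suspicious.
        small = [value for value in values if value < typical * ratio]
        if small:
            flagged[path] = small
    return flagged


sample_records = [
    {"path": "/blog/", "status": 200, "bytes": 14486},
    {"path": "/blog/", "status": 200, "bytes": 14320},
    {"path": "/blog/", "status": 200, "bytes": 14510},
    {"path": "/blog/", "status": 200, "bytes": 0},
    {"path": "/about/", "status": 200, "bytes": 5000},
    {"path": "/about/", "status": 200, "bytes": 5100},
]
print(find_suspicious_200s(sample_records))
```

A URL that always serves a blank page has no "normal" size to compare against, so a fixed minimum byte count may be a useful second check.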

Hiding spots for orphan pages

Orphan pages, or pages that are not linked to from any other pages in your site’s structure, can be a major SEO issue. They underperform because links confer popularity and because they’re difficult for browsing users (and bots) to discover naturally on a website.

Any list of known pages, when examined with crawl data, can be useful for finding orphan pages. But few lists of pages are as complete as the URLs extracted from log data: logs contain every page that Google crawls or has tried to crawl, as well as every page visitors have visited or tried to visit.
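In its simplest form, this comparison is a set difference: URLs that appear in the logs but were not reached by a crawl of the site's internal links are orphan candidates. The sketch below assumes both lists already use the same URL form (same host, same trailing-slash convention); the function name and sample URLs are illustrative.

```python
def find_orphan_candidates(log_urls, crawled_urls):
    """Return URLs seen in server logs but absent from the site crawl."""
    return sorted(set(log_urls) - set(crawled_urls))


urls_in_logs = ["/blog/", "/blog/old-campaign/", "/pricing/", "/blog/"]
urls_in_crawl = ["/", "/blog/", "/pricing/"]
print(find_orphan_candidates(urls_in_logs, urls_in_crawl))
```

Candidates still need a manual look, since the list can also include redirected or deleted URLs that bots keep requesting.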

Sharing with SEA

SEA (paid search engine advertising) also profits from log monitoring. Google Ads verifies URLs associated with paid results using the following bots:

AdsBot-Google
AdsBot-Google-Mobile

The presence of these bots on your site can correspond with increases in spending and new campaigns.

What’s really behind Google’s crawl stats

There are two sources for establishing your crawl budget:

Google Search Console: in the old Google Search Console, a graph of pages crawled per day and an average daily crawl rate are provided under Crawl > Crawl Stats.
Log data: the count of googlebot hits over a period of time, divided by the number of days in the period, gives another daily crawl rate.

Often, these two rates–which purportedly measure the same thing–are different values.

Here’s why:

SEOs often only count hits from SEO-related bots (“googlebot” in its desktop and mobile versions)
Google Search Console seems to provide a total for all of Google’s bots, whether or not their role is associated with SEO. This is the case of bots like AdSense (“Mediapartners-Google”), which crawls monetized pages on which Google places ads.
Google doesn’t list all of its bots or all of the bots included in its crawl budget graph.
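The gap between the two figures can be illustrated with a small sketch that counts the same set of hits both ways: SEO bots only, and all Google bots together. The classification relies on user-agent substrings for the bots named in this article (Googlebot, AdsBot-Google, Mediapartners-Google); the function names are illustrative and the list of non-SEO bots is certainly incomplete.

```python
from collections import Counter

SEO_TOKENS = ("Googlebot",)
OTHER_GOOGLE_TOKENS = ("AdsBot-Google", "Mediapartners-Google")


def classify(user_agent):
    """Label a user-agent as an SEO bot, another Google bot, or not Google."""
    if any(token in user_agent for token in OTHER_GOOGLE_TOKENS):
        return "other Google bot"
    if any(token in user_agent for token in SEO_TOKENS):
        return "SEO bot"
    return "not Google"


def crawl_totals(user_agents):
    """Count hits as an SEO would (SEO bots only) and as a combined total."""
    counts = Counter(classify(ua) for ua in user_agents)
    seo_only = counts["SEO bot"]
    return {"seo_only": seo_only, "all_google": seo_only + counts["other Google bot"]}


sample_user_agents = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "AdsBot-Google (+http://www.google.com/adsbot.html)",
    "Mediapartners-Google",
    "Mozilla/5.0 (Linux; Android 7.0) Chrome/64.0.3282.137 Mobile Safari/537.36",
]
print(crawl_totals(sample_user_agents))
```

On this sample, the combined total is twice the SEO-only count, which is the kind of difference that can appear between log-based figures and Search Console.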

This poses two main problems:

The inclusion of non-SEO bots can disguise SEO crawl trends that are subsumed in activity by other bots. Drops in activity and unexpected peaks may look alarming, but have nothing to do with SEO; conversely, important SEO indicators may go unnoticed.
As features and reports are phased out of the old Google Search Console, it can be nerve-wracking to rely on Google Search Console for such essential information. It’s difficult to say whether this report will remain available in the long term.

Basing crawl analysis on log data is a good way around these uncertainties.

Log data and the GDPR

Under the European Union’s GDPR, the collection, storage, and processing of personal data are subject to additional safeguards and protocols. Log data may fall in a gray zone under this legislation: in many cases, the European Commission considers the IP addresses of people (not bots) to be personal information.

Some log analysis solutions, including OnCrawl, offer solutions for this issue. For example, OnCrawl has developed tools that strip IP addresses from log lines that do not belong to bots in order to avoid storing and processing this information unnecessarily.
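As an illustration of the general idea only (this is not OnCrawl's implementation), the sketch below replaces the IP address on any line that does not belong to a bot before the line is stored. It assumes the layout shown earlier in this article, where the IP is the second space-separated field. The names strip_visitor_ip, PLACEHOLDER_IP and claims_googlebot are hypothetical.

```python
PLACEHOLDER_IP = "0.0.0.0"


def strip_visitor_ip(line, is_bot):
    """Replace the IP field with a placeholder unless the line is a bot hit."""
    parts = line.split(" ", 2)  # host:port, IP, rest of the line
    if len(parts) < 3 or is_bot(line):
        return line
    host, _ip, rest = parts
    return f"{host} {PLACEHOLDER_IP} {rest}"


def claims_googlebot(line):
    # Stand-in only: a real pipeline should rely on verified bot IPs,
    # not on the user-agent string, which anyone can fake.
    return "Googlebot" in line


VISITOR_LINE = (
    'www.oncrawl.com:80 37.14.184.94 - - [07/Feb/2018:17:06:04 +0000] '
    '"GET /blog/ HTTP/1.1" 200 37073 "https://www.google.es/" '
    '"Mozilla/5.0 (Linux; Android 7.0) Chrome/64.0.3282.137 Mobile Safari/537.36" "-"'
)
print(strip_visitor_ip(VISITOR_LINE, claims_googlebot))
```

Keeping a placeholder in the IP position, rather than deleting the field, means the line still fits the same parsing rules afterwards.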

TL;DR Log data isn’t just about crawl budget

There are plenty of secrets you don’t often hear mentioned in discussions about log files.

Here are the top ten takeaways:

Log data is the only 100% reliable source for all site traffic information.
Make sure your server logs the data you’re interested in–not all data is required in logs.
Verify that bots that look like Google really are Google.
Monitoring the different Google bots that visit your site allows you to discover how Google crawls.
Not all official Googlebots are active bots.
In addition to 4xx and 5xx HTTP status codes, keep an eye out for errors that serve empty pages with a 200 status.
Use log data to uncover orphan pages.
Use log data to track SEA campaign effects.
Crawl budget and crawl rate are best monitored using log data.
Be aware of privacy concerns under the GDPR.

Rebecca works at OnCrawl, who were the headline sponsors at Search London’s 8th birthday party. They are still offering an exclusive 30-day free trial; visit OnCrawl at www.oncrawl.com to find out more.

Investing in SEO Crawl Budget to Increase the Value of SEO Actions

Posted on January 23, 2019 (updated January 27, 2019) by Rebecca Berbel

Discussions about crawl budget often either spark debates or sound too technical. But making sure your crawl budget fits your website’s needs is one of the most important boosts you can give your SEO.

Why invest in a healthier crawl budget?

SEO functions on one basic principle: if you can provide a web page that best fulfills Google’s criteria for answers for a given query, your page will appear before others in the results and be visited more often by searchers. More visits mean more brand awareness and more marketing leads for sales and pre-sales to process.

This principle assumes that Google is able to find and examine your page in order to evaluate it as a potential match for search queries. This happens when Google crawls and indexes your page. A perfectly optimized page that is never crawled by Google will never be presented in the search results.

The search engine process for finding pages and displaying them in search results.

In short: Google’s page crawls are a requirement for SEO to work.

A healthy crawl budget ensures that the important pages on your site are crawled in a timely fashion. An investment in crawl budget, therefore, is an essential investment in an SEO strategy.

What is crawl budget?

“Crawl budget” refers to the number of pages on a website that a search engine discovers or explores within a given time period.

Crawl budget is SEO’s best attempt to measure abstract and complex concepts:

How much attention does a search engine give your website?
What is your website’s ability to get pages indexed?

Graphical representation of daily googlebot hits on a website.

How much budget do I have?

The term “budget” is controversial, as it suggests that search engines like Google set a number for each site, and that you as an SEO ought to be able to petition for more budget for your site. This isn’t the case.

From Google’s point of view, crawling is expensive, and the number of pages that can be crawled in a day is limited. Google attempts to crawl as many pages as possible on the web, taking into account popularity, update frequency, information about new pages, and the web server’s ability to handle crawl traffic, among other criteria.

Since we have little direct influence on the amount of budget we get, the game becomes one of how to direct Google’s bots to the right pages at the right time.

No, really. How much crawl budget do I have?

The best way to determine how many times Google crawls your website’s URLs per day is to monitor googlebot hits in your server logs. Most SEOs take into account all hits by Google bots related to SEO and exclude bots like AdsBot-Google (which verifies the quality and pertinence of a page used in a paid campaign).

Visits by Google’s AdsBot that should be removed from an SEO crawl budget.

Because spammers often spoof Google bots to get access to a site, make sure you validate the IPs of bots that present as googlebots. If you use the log analyzer available in OnCrawl, it does this step for you.

Take the sum of the hits over a period of time and divide it by the number of days in that period. The result is your daily crawl budget.
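The arithmetic is simple enough to sketch directly. The example assumes you already have a count of validated SEO bot hits per day; the function name and the figures are made up for illustration. Days with no hits are missing from the input but still count toward the length of the period.

```python
from datetime import date


def daily_crawl_budget(hits_per_day, start, end):
    """Average hits per day over the inclusive period from start to end."""
    days_in_period = (end - start).days + 1
    total_hits = sum(
        count for day, count in hits_per_day.items() if start <= day <= end
    )
    return total_hits / days_in_period


sample_hits_per_day = {
    date(2019, 1, 1): 120,
    date(2019, 1, 2): 80,
    date(2019, 1, 4): 100,  # no hits were logged on January 3
}
print(daily_crawl_budget(sample_hits_per_day, date(2019, 1, 1), date(2019, 1, 4)))
```

Here 300 hits over 4 days give a daily crawl budget of 75.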

If you can’t obtain access to your server logs, you can currently still use the old Google Search Console to get an estimate. The Google Search Console data on crawl rates provides a single “daily average” figure that includes all Google bots. This is your crawl budget (it will be inflated by the inclusion of additional bots).

Managing crawl budget by prioritizing quality URLs

Since you can’t control the amount of budget you get, making sure your budget is spent on valuable URLs is very important. And if you’re going to spend your crawl budget on optimal URLs, the first step is to know which URLs are worth the most on your site.

As obvious as it sounds, you will want to use your budget on the pages that can earn the most visits, conversions and revenue. Don’t forget that this list of pages may evolve over time or with seasonality. Adapt these pages to make them more accessible and attractive to bots.

Bots are most likely to visit pages with a number of qualities:

General site health: pages on a website that is functional, able to support crawl requests without going down, reasonably rapid, and reliable; it is not spam and has not been hacked
Crawlability: pages receive internal links, respond when requested, and aren’t forbidden to bots
Site architecture: pages are linked to from topic-level pages and thematic content pages link to one another, using pertinent anchor text
Web authority: pages are referenced (linked to) from qualitative outside sources
Freshness: pages are added or updated when necessary, and pages are linked to from new pages or pages with fresh content
Sitemaps: pages are found in XML sitemaps submitted to the search engine with an appropriate <lastmod> date
Quality content: content is readable and responds to search query intent
Ranking performance: pages rank well but are not in the first position

If this list looks a lot like your general SEO strategy, there’s a reason for that: quality URLs for bots and quality URLs for users have nearly identical requirements, with an extra focus on crawlability for bots.

Stretching your crawl budget to cover essentials

You can stretch your crawl budget to cover more pages, just like you can stretch a financial budget.

Cut unnecessary spending

A first level of unnecessary spending concerns any googlebot hits on pages you don’t want to show up in search results. Duplicate content, pages that should be redirected, and pages that have been removed all fall into this category. You may also want to include, for example, confirmation pages when a form is successfully sent, or pages in your sign-up tunnel, as well as test pages, archived pages and low-quality pages.

If you have prioritized your pages, you can also include pages with no or very low priority in this group.

Viewing the number of googlebot hits per day for different page categories.

To avoid spending crawl budget on these pages, keep bots away from them. You can use redirections as well as directives aimed at bots to herd bots in a different direction.
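One common form of directive aimed at bots is a robots.txt file. As a sketch, the Python standard library's urllib.robotparser can check which URLs a given set of rules keeps bots away from. The rules and paths below are invented for the example, and this parser handles simple path-prefix rules only, so it may not match Google's own interpretation in every case.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules keeping bots out of low-value sections.
ROBOTS_TXT = """\
User-agent: *
Disallow: /signup/
Disallow: /form-confirmation/
Disallow: /test/
"""


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given robots.txt rules forbid for a bot."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


sample_paths = ["/blog/", "/signup/step-1", "/test/page"]
print(blocked_urls(ROBOTS_TXT, sample_paths))
```

Running a list of low-priority URLs through a check like this before deploying new rules helps confirm that the rules block what you intend and nothing more.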

Limit budget drains

Sometimes unexpected configurations can become a drain on crawl budget.

Google will spend twice as much budget when two similar pages point to different canonical URLs. In particular, if your site uses facets or query string parameters, going over your canonical strategy can help you save on crawl budget. Tools like the canonical evaluations in OnCrawl can help make this task easier.

Tracking hits by googlebots on pages with similar content that do not declare a single canonical URL.

Using 302 redirects, which tell search engines that the content has been temporarily moved to a new URL, can also spend more budget than expected. Google will often return to re-crawl pages with a 302 status in order to find out whether the redirect is still in place, or whether the temporary period is over.
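To see whether 302 redirects are drawing repeated crawls on your own site, bot hits can be counted per URL for that status. This sketch assumes the records are already filtered to validated googlebot hits and parsed into dicts with path and status keys; the function name and the min_hits threshold are illustrative.

```python
from collections import Counter


def frequent_302s(bot_records, min_hits=3):
    """Return (path, hits) pairs for 302 URLs crawled at least min_hits times."""
    counts = Counter(
        record["path"] for record in bot_records if record["status"] == 302
    )
    return [(path, hits) for path, hits in counts.most_common() if hits >= min_hits]


sample_bot_records = (
    [{"path": "/promo/", "status": 302}] * 5
    + [{"path": "/old-page/", "status": 302}] * 2
    + [{"path": "/blog/", "status": 200}] * 4
)
print(frequent_302s(sample_bot_records))
```

URLs at the top of this list are candidates for a permanent redirect if the move is not actually temporary.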

Reduce investments with few returns

User traffic data, either from analytics sources such as Google Analytics or from server log data, can help pinpoint areas where you’ve been investing crawl budget with little return for your efforts in user traffic.

Examples of pages you might be over-investing in include URLs that rank in the first few pages of the SERPs but have never had organic traffic, newly crawled pages that take much longer than average to receive their first SEO visit, and frequently crawled pages that do not rank.
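One of these cases, pages that are crawled often but receive no organic visits, can be approximated by combining the two kinds of log lines. The sketch assumes you have one list of URLs hit by validated googlebots and one list of URLs visited from the SERPs over the same period; the function name and the sample data are illustrative.

```python
from collections import Counter


def crawled_without_visits(bot_hit_urls, organic_visit_urls):
    """Return (url, bot hits) pairs for crawled URLs with no organic visits."""
    crawl_counts = Counter(bot_hit_urls)
    visited = set(organic_visit_urls)
    unvisited = [
        (url, hits) for url, hits in crawl_counts.most_common() if url not in visited
    ]
    return unvisited


sample_bot_hits = ["/blog/"] * 6 + ["/tags/misc/"] * 4 + ["/archive/2014/"] * 2
sample_organic_visits = ["/blog/", "/blog/", "/pricing/"]
print(crawled_without_visits(sample_bot_hits, sample_organic_visits))
```

The most heavily crawled URLs come first, which makes it easier to see where the largest share of budget is going without a return in traffic.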

Crawl rate for pages that receive no organic visits in strategic page groups: before and after implementing improvements.

Returns on crawl budget investments

When you improve how crawl budget is spent on your site, you can see valuable returns:

Reduced crawling of pages you don’t want to rank
Increased crawling of pages that are being crawled for the first time
Reduced time between publishing and ranking a page
Improved crawl frequency for certain groups of pages
More effective impact of SEO optimizations
Improved rankings

Some of these are direct effects of your crawl budget management, such as reduced crawling of pages you’ve told Google not to crawl. Others are indirect: for example, as your SEO work has more of an effect, your site’s authority and popularity increase, increasing your rankings.

In both cases, though, a healthy crawl budget is at the core of an effective SEO strategy.

OnCrawl

OnCrawl is a technical SEO platform that uses real data to help you make better SEO decisions. Interested in monitoring or improving your crawl budget, as well as other technical SEO elements, and in using a powerful platform with friendly support provided by experienced SEO experts? Ask them about their free trial at the Search LDN event on Monday, February 4th, where they are headline sponsors. If you cannot make it for whatever reason, visit OnCrawl at www.oncrawl.com.