The Secrets of Log Monitoring for the Curious SEO

When log monitoring and SEO come up, we hear a lot about the value of crawl budget optimization, monitoring Googlebot behavior and tracking organic visitors. But here are a few of the gritty secrets that don’t get shared as often.

What is log monitoring?

In SEO, server records of requests for URLs can be used to learn more about how bots and people use a website. Web servers record every request, its timestamp, the URL that was requested, who requested it, and the response that the server provided.

Requests are logged in various formats, but most look something like this:

Bot visit, identified by the Googlebot user-agent and IP address:

www.oncrawl.com:80 66.249.73.145 - - [07/Feb/2018:17:06:04 +0000] "GET /blog/ HTTP/1.1" 200 14486 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"

 

Organic visit, identified by the Google address as the referer:

www.oncrawl.com:80 37.14.184.94 - - [07/Feb/2018:17:06:04 +0000] "GET /blog/ HTTP/1.1" 200 37073 "https://www.google.es/" "Mozilla/5.0 (Linux; Android 7.0; SM-G920F Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.137 Mobile Safari/537.36" "-"
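
A line in this format can be split into named fields with a regular expression. The sketch below assumes the exact layout of the two examples above (virtual host and port first, then the usual combined-format fields), written with plain ASCII hyphens and double quotes as a server would emit them. The field names and the `parse_line` helper are illustrative choices, not part of any standard, and other servers may need a different pattern.

```python
import re
from datetime import datetime

# Assumed layout: host:port ip ident user [time] "request" status bytes "referer" "user-agent" ...
LOG_PATTERN = re.compile(
    r'(?P<host>[^:\s]+):(?P<port>\d+) (?P<ip>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match the assumed layout."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["port"] = int(fields["port"])
    fields["status"] = int(fields["status"])
    # Some servers log "-" when no body was sent; treat that as zero bytes.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    return fields

sample = ('www.oncrawl.com:80 66.249.73.145 - - [07/Feb/2018:17:06:04 +0000] '
          '"GET /blog/ HTTP/1.1" 200 14486 "-" '
          '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"')
hit = parse_line(sample)
print(hit["host"], hit["url"], hit["status"], hit["bytes"])
```

Returning `None` for lines that do not match makes it easy to count and inspect the lines a pattern fails on before trusting any totals.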

 

Why logs are the key to SEO success

Log data provides concrete and un-aggregated figures to answer questions that are at the heart of SEO:

  • How many of my pages have been explored by Google? By users coming from the SERPs?
  • How often does Google (or organic visitors) visit my pages?
  • Does Google (or organic visitors) visit some pages more than others?
  • What URLs produce errors on my site?
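
Once log lines are parsed into fields, each of these questions becomes a simple count. The following is a minimal sketch, assuming records are already available as dictionaries with the hypothetical keys `url`, `status`, `user_agent` and `referer`. It uses naive substring tests to separate Googlebot hits from organic visits, which does not protect against spoofed user-agents.

```python
from collections import Counter

def summarize(records):
    """Count Googlebot hits, organic visits and error responses per URL."""
    bot_hits, organic_visits, errors = Counter(), Counter(), Counter()
    for r in records:
        if "googlebot" in r["user_agent"].lower():
            bot_hits[r["url"]] += 1
        elif "google." in r["referer"].lower():
            organic_visits[r["url"]] += 1
        if r["status"] >= 400:
            errors[r["url"]] += 1
    return bot_hits, organic_visits, errors

# Made-up records for illustration only.
records = [
    {"url": "/blog/", "status": 200, "user_agent": "Googlebot/2.1", "referer": "-"},
    {"url": "/blog/", "status": 200, "user_agent": "Chrome/64.0", "referer": "https://www.google.es/"},
    {"url": "/old-page/", "status": 404, "user_agent": "Googlebot/2.1", "referer": "-"},
]
bot_hits, organic_visits, errors = summarize(records)
print(len(bot_hits), "pages explored by Googlebot")
print("Most crawled:", bot_hits.most_common(1))
print("URLs with errors:", dict(errors))
```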

 

Following patterns in Googlebot visits can also provide information about your site’s ranking and indexing. Drops in crawl frequency often precede ranking drops by 2-5 days; a peak in crawl behavior followed by an increase in mobile bot activity and a decline in desktop bot activity is a good sign you’ve moved to Mobile-First Indexing–even if you haven’t received the official email.

Example of a website that has switched to the mobile-first index
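
The shift between desktop and mobile bot activity can be tracked by splitting Googlebot hits by crawler type for each day. The sketch below assumes that the smartphone crawler's user-agent contains both "Googlebot" and "Mobile" and that the desktop crawler's does not. The `(day, user_agent)` input shape, the user-agent strings and the sample figures are illustrative.

```python
from collections import defaultdict

def mobile_share_by_day(hits):
    """Map each day to the fraction of Googlebot hits made by the smartphone crawler."""
    counts = defaultdict(lambda: {"mobile": 0, "desktop": 0})
    for day, user_agent in hits:
        if "Googlebot" not in user_agent:
            continue  # ignore visitors and other bots
        kind = "mobile" if "Mobile" in user_agent else "desktop"
        counts[day][kind] += 1
    return {
        day: c["mobile"] / (c["mobile"] + c["desktop"])
        for day, c in sorted(counts.items())
    }

DESKTOP = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
MOBILE = ("Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 "
          "(KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 "
          "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)")

# Made-up hits: mostly desktop on the first day, mostly mobile on the second.
hits = [
    ("2018-02-06", DESKTOP), ("2018-02-06", DESKTOP), ("2018-02-06", DESKTOP), ("2018-02-06", MOBILE),
    ("2018-02-07", MOBILE), ("2018-02-07", MOBILE), ("2018-02-07", MOBILE), ("2018-02-07", DESKTOP),
]
shares = mobile_share_by_day(hits)
for day, share in shares.items():
    print(day, f"{share:.0%} mobile")
```

A mobile share that moves from a minority to a sustained majority is the pattern to look for; a single day's value is too noisy to act on.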

 

And if you’re having a hard time ranking pages that have never had a “hit” from Googlebot, you know that on-page SEO isn’t worth your effort yet: Google doesn’t know what’s on the page.

 

Log data is also essential when analyzing crawl budget on groups of pages, when examining the relationship between crawl and organic traffic, or when understanding the relationship between on-page factors and behavior by bots and visitors.

Because this data comes directly from the server, the gatekeeper of access to a website, it is complete, definitive, always up-to-date, and 100% reliable.

Log lines come in all shapes and sizes

Not all server logs are presented in the same format. Their fields might be in a different order. They might not contain all of the same information.

Information that is often missing from log files by default includes:

  • The host. If you’re using your log files for SEO, this is extremely useful. It helps differentiate between requests for HTTP and HTTPS pages, and between subdomains of the same domain.
  • The port. The port used to transfer data can provide additional information on the protocol used.
  • The transfer time. If you don’t have other means of determining page speed, the time required to transfer all of the page content to the requester can be very useful.
  • The number of bytes transferred. The number of bytes helps you spot unusually large or small pages, as well as unwieldy media resources.

 

Identifying bots is not always easy: the good, the bad, and the missing

Bad bots

Sometimes it’s hard to tell what’s a bot and what’s not. To identify bots, it’s best to start by looking at the user-agent, which contains the name of the visitor, such as “google” or “googlebot”.

But because Google uses legitimate bots to crawl websites, scammers and scrapers (who steal content from your website) often name their bots after Google’s in hopes that they won’t be caught.

Google recommends using reverse DNS lookup to check the IP addresses of their bots. They provide the range of IP addresses their bots use. Bots whose user-agent and IP address do not both check out should be discounted, and–most likely–blocked.
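
The reverse DNS check can be sketched in a few lines: resolve the IP address to a hostname, confirm that the hostname belongs to googlebot.com or google.com, then resolve that hostname forward and confirm it points back to the same IP. The function name and the injectable lookup parameters below are illustrative choices, and `socket.gethostbyname_ex` only covers IPv4.

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_googlebot(ip, reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the domain, then forward-resolve to confirm."""
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False  # no PTR record, or the lookup failed
    if not hostname.endswith(GOOGLE_SUFFIXES):
        return False  # the user-agent may say Googlebot, but the IP is not Google's
    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    # The forward lookup must lead back to the IP we started from.
    return ip in addresses
```

Calling `is_verified_googlebot("66.249.73.145")` performs live DNS queries, so results are worth caching per IP address. The lookups are passed in as parameters only so that the logic can be exercised offline with stand-in functions.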

This isn’t a rare case. Bad bots can account for over a quarter of the total website traffic on a typical site, according to this study from 2016.

Good bots

On the other hand, bot monitoring also produces some surprises: at OnCrawl, SEO experts often uncover new googlebots before they’re announced. Based on the type of pages they request, we’re sometimes able to guess their role at Google. Among the bots we identified before Google announced them are:

  • [Ajax]: JavaScript crawler
  • Google-AMPHTML: AMP exploration
  • Google-speakr: crawls pages for Google’s page-reading service. It gained a lot of attention in early February 2019 as industry leaders tweeted about having noticed it.

 

Knowing what type of bot your site attracts gives you the keys to understanding how Google sees and treats your pages.

Missing bots

We’ve also discovered that, although Googlebot-News is still listed in the official list of bots, this bot is not used for crawling news articles. In fact, we’ve never spotted it in the wild.

Server errors disguised as valid pages

Sometimes server errors produce blank pages, but the server doesn’t realize this is an error. It reports the page as a status 200 (“everything’s ok!”) and no one’s the wiser–but neither bots nor visitors seeing blank pages get to see the URL’s actual content.

Monitoring the number of bytes transferred in server logs per URL will reveal this sort of error, if one occurs.
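
One way to apply this is to compare each status 200 response with the typical size served for the same URL and flag responses that are far smaller. In the sketch below, the `(url, status, bytes)` input shape, the use of the median as "typical" and the 10% threshold are all assumptions to tune for a given site.

```python
from collections import defaultdict
from statistics import median

def suspicious_200s(records, ratio=0.1):
    """Return (url, bytes, typical_bytes) for 200 responses far smaller than usual for that URL."""
    sizes = defaultdict(list)
    for url, status, size in records:
        if status == 200:
            sizes[url].append(size)
    flagged = []
    for url, values in sizes.items():
        typical = median(values)
        for size in values:
            if size < typical * ratio:
                flagged.append((url, size, typical))
    return flagged

# Made-up records: one /blog/ response was served with an empty body.
records = [
    ("/blog/", 200, 14486), ("/blog/", 200, 14502), ("/blog/", 200, 0),
    ("/pricing/", 200, 8120), ("/pricing/", 200, 8133),
]
print(suspicious_200s(records))  # -> [('/blog/', 0, 14486)]
```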

 

Hiding spots for orphan pages

Orphan pages, or pages that are not linked to from any other pages in your site’s structure, can be a major SEO issue. They underperform because links confer popularity and because they’re difficult for browsing users (and bots) to discover naturally on a website.

 

Any list of known pages, when examined with crawl data, can be useful for finding orphan pages. But few lists of pages are as complete as the URLs extracted from log data: logs contain every page that Google crawls or has tried to crawl, as well as every page visitors have visited or tried to visit.
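
In practice this comparison is a set difference: URLs that appear in the logs but that a crawler following internal links never reached are orphan candidates. A minimal sketch, with hypothetical URL lists and assuming both lists use the same URL normalization:

```python
def find_orphans(log_urls, crawled_urls):
    """Return URLs seen in logs but never reached by following links in a crawl."""
    return sorted(set(log_urls) - set(crawled_urls))

log_urls = ["/", "/blog/", "/blog/old-post/", "/landing/2017-promo/"]
crawled_urls = ["/", "/blog/", "/about/"]
print(find_orphans(log_urls, crawled_urls))
# -> ['/blog/old-post/', '/landing/2017-promo/']
```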

Sharing with SEA

 

SEA (paid search engine advertising) also profits from log monitoring. Google Ads verifies URLs associated with paid results using the following bots:

  • AdsBot-Google
  • AdsBot-Google-Mobile

The presence of these bots on your site can correspond with increases in spending and new campaigns.

What’s really behind Google’s crawl stats

When we talk about crawl budget, there are two sources for establishing it:

  1. Google Search Console: in the old Google Search Console, a graph of pages crawled per day and an average daily crawl rate are provided under Crawl > Crawl Stats.
  2. Log data: the count of googlebot hits over a period of time, divided by the number of days in the period, gives another daily crawl rate.

Often, these two rates–which purportedly measure the same thing–are different values.

Here’s why:

  • SEOs often only count hits from SEO-related bots (“googlebot” in its desktop and mobile versions)
  • Google Search Console seems to provide a total for all of Google’s bots, whether or not their role is associated with SEO. This is the case of bots like AdSense (“Mediapartners-Google”), which crawls monetized pages on which Google places ads.
  • Google doesn’t list all of its bots or all of the bots included in its crawl budget graph.

This poses two main problems:

  1. The inclusion of non-SEO bots can disguise SEO crawl trends that are subsumed in activity by other bots. Drops in activity and unexpected peaks may look alarming, but have nothing to do with SEO; conversely, important SEO indicators may go unnoticed.
  2. As features and reports are phased out of the old Google Search Console, it can be nerve-wracking to rely on Google Search Console for such essential information. It’s difficult to say whether this report will remain available in the long term.

Basing crawl analysis on log data is a good way around these uncertainties.
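
The log-based rate is simply hits divided by days, and computing it for two different sets of bots shows how far the totals can drift apart. In this sketch the token lists are assumptions: the "googlebot" token also matches variants such as Googlebot-Image, and the broader list is illustrative and certainly incomplete.

```python
from datetime import date

SEO_BOTS = ("googlebot",)  # desktop and mobile Googlebot share this token
ALL_GOOGLE_BOTS = ("googlebot", "mediapartners-google", "adsbot-google")

def daily_crawl_rate(hits, bot_tokens, start, end):
    """Average hits per day from user-agents containing any token, over start..end inclusive."""
    days = (end - start).days + 1
    count = sum(
        1 for day, user_agent in hits
        if start <= day <= end and any(t in user_agent.lower() for t in bot_tokens)
    )
    return count / days

# Made-up hits over two days.
hits = [
    (date(2019, 2, 1), "Googlebot/2.1"),
    (date(2019, 2, 1), "Mediapartners-Google"),
    (date(2019, 2, 2), "Googlebot/2.1"),
    (date(2019, 2, 2), "AdsBot-Google (+http://www.google.com/adsbot.html)"),
    (date(2019, 2, 2), "Mozilla/5.0 Chrome/64.0"),
]
start, end = date(2019, 2, 1), date(2019, 2, 2)
print(daily_crawl_rate(hits, SEO_BOTS, start, end))         # -> 1.0
print(daily_crawl_rate(hits, ALL_GOOGLE_BOTS, start, end))  # -> 2.0
```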

Log data and the GDPR

Under the European Union’s GDPR, the collection, storage, and processing of personal data are subject to additional safeguards and protocols. Log data may fall in a gray zone under this legislation: in many cases, the European Commission considers the IP addresses of people (not bots) to be personal information.

Some log analysis solutions, including OnCrawl, offer solutions for this issue. For example, OnCrawl has developed tools that strip IP addresses from log lines that do not belong to bots in order to avoid storing and processing this information unnecessarily.
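
The general idea can be illustrated with a rough sketch; this is not OnCrawl's implementation. It assumes the client IP is the first IPv4 address on the line, identifies bots by a short illustrative list of user-agent tokens matched anywhere in the line, and replaces visitor IPs with a placeholder before the line is stored.

```python
import re

BOT_TOKENS = ("googlebot", "adsbot-google", "mediapartners-google")  # illustrative list
IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

def strip_visitor_ip(line):
    """Mask the first IPv4 address on lines that do not look like they come from a known bot."""
    if any(token in line.lower() for token in BOT_TOKENS):
        return line  # keep bot IPs: they are needed to verify the bot
    return IPV4.sub("0.0.0.0", line, count=1)

visitor = ('www.oncrawl.com:80 37.14.184.94 - - [07/Feb/2018:17:06:04 +0000] '
           '"GET /blog/ HTTP/1.1" 200 37073 "https://www.google.es/" '
           '"Mozilla/5.0 (Linux; Android 7.0; SM-G920F Build/NRD90M) AppleWebKit/537.36 '
           '(KHTML, like Gecko) Chrome/64.0.3282.137 Mobile Safari/537.36" "-"')
bot = ('www.oncrawl.com:80 66.249.73.145 - - [07/Feb/2018:17:06:04 +0000] '
       '"GET /blog/ HTTP/1.1" 200 14486 "-" '
       '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"')
print(strip_visitor_ip(visitor))
print(strip_visitor_ip(bot))
```

Because a scraper can claim a bot user-agent, a stricter version would keep the IP only for bots that have also passed IP verification.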

 

TL;DR Log data isn’t just about crawl budget

There are plenty of secrets you don’t often hear mentioned in discussions about log files.

Here are the top ten takeaways:

  1. Log data is the only 100% reliable source for all site traffic information.
  2. Make sure your server logs the data you’re interested in–not all fields are logged by default.
  3. Verify that bots that look like Google really are Google.
  4. Monitoring the different Google bots that visit your site allows you to discover how Google crawls.
  5. Not all official Googlebots are active bots.
  6. In addition to 4xx and 5xx HTTP status codes, keep an eye out for errors that serve empty pages in 200.
  7. Use log data to uncover orphan pages.
  8. Use log data to track SEA campaign effects.
  9. Crawl budget and crawl rate are best monitored using log data.
  10. Be aware of privacy concerns under the GDPR.

 

OnCrawl

OnCrawl is a technical SEO platform that uses real data to help you make better SEO decisions. Interested in improving your SEO by combining log, crawl and 3rd-party data on a powerful platform with friendly support provided by experienced SEO experts? If you missed us at the Search LDN event on Monday, February 4th where we were headline sponsors, you can still ask us about our free trial or visit OnCrawl at www.oncrawl.com.

Exciting Opportunities in Ipswich with StrategiQ

StrategiQ Marketing has spoken at SearchLDN in the past, so we wanted to share some news from them. They are looking to expand their team and are searching for a new SEO professional to come on board.

Based in Ipswich, the role is best suited to those with at least two years of SEO experience and a keenly maintained knowledge of the industry and its ever-changing search landscape. Working with their Head of Search, Chris Green (who also spoke at SearchLDN), and the rest of the marketing team, this person will be responsible for taking ownership across a full spectrum of SEO operations and delivery – from initial research and proposals to technical audits, keyword research and implementation of recommendations alongside our talented development team. To be successful in this position, a strong understanding of ranking factors and on-site optimisation is key and proficiency in G Suite and other web-based analytics applications is a must.

More importantly, people who apply should have strong communication skills and be ambitious, driven and creative as well as willing to learn and take the training necessary to becoming a true specialist in this area. Our team is a close-knit group who aspire to be industry thought leaders and innovators – and we want this person to share a similar mindset and drive.

In the past year alone, StrategiQ Marketing has contributed to breaking new ground in the industry with our innovative hourly rank tracking experiment, DomainCanary tool and JavaScript indexing research. A number of our team have also given talks at high-profile events such as BrightonSEO, Sascon, Ungagged and Search London, as well as contributing to webinars for SEMrush and Digital Olympus. They have also been recently shortlisted for the ‘Best Employer’ category in the 2018 Suffolk Business Awards.

Outside of the daily role for this position, there are a number of other benefits to working at StrategiQ:

  • A competitive salary of £25 – £35k (dependent on experience)
  • £1k training budget & a personalised development plan
  • 23 days holiday per year, plus all British Bank Holidays
  • Monthly MVP (‘most valued player’) award – the winner each month receives a ‘duvet day’ and a night out on the company card
  • Onsite shower facilities – the office is surrounded by fields & there’s a gym next door
  • Opportunity to participate in CSR fundraising events
  • Opportunity to be involved in company strategy days and team building activities

Don’t just take StrategiQ’s word for it though. Here’s what the team have to say about StrategiQ:

“I love working at StrategiQ because it feels like my destiny is in my own hands and we are actively encouraged to set goals for ourselves and the flexibility is there for us to achieve whatever we want to achieve – no glass ceilings!” – Yasmin, Designer

 

“I have now been in full time employment for 5 years and StrategiQ is my third local agency within this time. I can honestly and proudly say that StrategiQ is without a shadow of a doubt the best employer I have had in this time. Neither of my two previous agencies have come close to showing the passion and desire for bettering and rewarding employees like StrategiQ does. It is the first agency I have been at where I feel like I genuinely matter as a valued member of the team in the present, and in the future.” Charlie, Paid Search Manager

 

If the thought of being part of StrategiQ’s team excites you, please send your CV and cover letter to chris@strategiq.co