Log File Analysis for SEO
Server access logs record every HTTP request made to a server, including the timestamp, requested URL, HTTP status code, response size, and the user agent of the client that made the request. For SEO purposes, the most valuable information in logs is the Googlebot record: which URLs it fetched, when, how often, and what the server returned.
Log file analysis sits at the intersection of technical SEO and data analysis. It requires more setup than most SEO auditing tasks, but it provides evidence that no other tool can offer. This is direct observation of crawl behaviour, not an approximation of it.
What logs contain
A typical Apache or Nginx access log line looks like:
66.249.64.1 - - [01/May/2026:09:14:23 +0000] "GET /technical-seo/crawl-budget/ HTTP/1.1" 200 8432 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Each entry includes:
- IP address of the requester.
- Timestamp of the request.
- HTTP method and URL requested.
- Status code the server returned.
- Response size in bytes.
- User agent string identifying the crawler or other client.
By filtering for Googlebot user agents and aggregating across log files, you can reconstruct exactly what Google crawled, when, and what the server told it.
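As a minimal sketch of that filtering and aggregation in Python, assuming the combined log format shown above and a hypothetical file name of access.log (field positions vary between server configurations):

import re
from collections import Counter

# Regex for the combined log format shown above; adjust to your server's configuration.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) "[^"]*" "(?P<agent>[^"]*)"'
)

crawled = Counter()
with open("access.log") as log:  # hypothetical file name
    for line in log:
        match = LOG_PATTERN.match(line)
        if match and "Googlebot" in match.group("agent"):
            crawled[match.group("url")] += 1

# Most frequently crawled URLs first
for url, hits in crawled.most_common(20):
    print(f"{hits:6d}  {url}")

Counting by the url field gives crawl frequency per URL; swapping in the status or time fields gives the other views described later in this section.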
Verifying Googlebot requests
Not every request with a Googlebot user agent string is genuine. User agents can be spoofed. To verify that a request is from real Googlebot, perform a reverse DNS lookup on the IP address, then a forward DNS lookup on the returned hostname to confirm it points back to the same IP. Genuine Googlebot requests resolve to googlebot.com or google.com hostnames. Requests that do not resolve to those domains are not from Google infrastructure.
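A minimal sketch of that check using Python's standard socket module; the IP below is the one from the sample log line, and in practice you would run this only over the distinct IPs seen in the logs, caching the results:

import socket

def is_verified_googlebot(ip):
    """Reverse-DNS the IP, check the hostname, then forward-confirm it."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
    except OSError:
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip == socket.gethostbyname(hostname)  # forward confirmation
    except OSError:
        return False

print(is_verified_googlebot("66.249.64.1"))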
AI crawlers in logs
Server logs are the most accurate way to measure AI crawler traffic. The major AI crawlers to identify are:
- GPTBot (OpenAI, model training): user agent GPTBot/1.1
- OAI-SearchBot (OpenAI, ChatGPT search and citations): user agent OAI-SearchBot/1.0
- ChatGPT-User (OpenAI, user-triggered fetches and Custom GPT requests): user agent ChatGPT-User/1.0
- Google-Extended (Google AI training, distinct from Googlebot): user agent Google-Extended
- ClaudeBot (Anthropic): user agent ClaudeBot/0.9
- PerplexityBot: user agent PerplexityBot/1.0
- Meta-ExternalAgent (Meta AI): user agent Meta-ExternalAgent
These crawlers can consume meaningful bandwidth on high-traffic sites. Logs allow you to quantify that usage and, if desired, configure robots.txt rules with actual data rather than guesswork. OpenAI updated its ChatGPT-User documentation in December 2025 to no longer commit to honouring robots.txt, so logs are the only reliable way to verify its actual behaviour on your site.
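A minimal sketch of that quantification, matching the user agent tokens listed above against each log line and summing the response-size field as a rough bandwidth figure (same assumed log format and hypothetical access.log file name as before):

import re
from collections import Counter, defaultdict

AI_BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "Google-Extended",
           "ClaudeBot", "PerplexityBot", "Meta-ExternalAgent"]

LOG_PATTERN = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "\S+ \S+ [^"]*" \d{3} (?P<size>\S+) '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

requests = Counter()
bytes_served = defaultdict(int)
with open("access.log") as log:  # hypothetical file name
    for line in log:
        match = LOG_PATTERN.match(line)
        if not match:
            continue
        for bot in AI_BOTS:
            if bot in match.group("agent"):
                requests[bot] += 1
                if match.group("size").isdigit():
                    bytes_served[bot] += int(match.group("size"))
                break

for bot, count in requests.most_common():
    print(f"{bot:20s} {count:8d} requests  {bytes_served[bot] / 1_048_576:.1f} MiB")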
What log analysis reveals
Crawl frequency by URL — Which pages Googlebot visits most often, and which it rarely visits. Important pages that are crawled less frequently than low-value pages indicate a crawl budget problem.
Status codes served to Googlebot — If Googlebot is frequently receiving 5xx errors, 404s for URLs that should exist, or 301s that chain to another 301, logs surface these problems before they show up in Search Console.
Uncrawled important pages — URLs that should be crawled but do not appear in logs at all. These may be orphaned pages with no internal links, or pages inadvertently blocked.
Crawl anomalies — A sudden drop in Googlebot’s crawl volume can precede a ranking drop. The anomaly is often visible in logs before it appears in Search Console crawl stats, giving an earlier warning.
URLs Google wastes budget on — Parameter URLs, session IDs, faceted navigation variants, and admin pages that Googlebot is visiting despite having no ranking value.
Discrepancies with Search Console — GSC crawl stats show an aggregate daily crawl count, but do not reveal individual URL detail. Logs show the full picture: which URLs were crawled, at what frequency, and what response they received. It is common to find that URLs GSC reports as crawled were receiving non-200 responses at crawl time; one way to surface this is sketched below.
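As an illustration of the status-code checks above, a minimal Python sketch that tallies non-200 responses served to Googlebot by URL (same assumed combined log format and hypothetical access.log file name as earlier):

import re
from collections import defaultdict

LOG_PATTERN = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "\S+ (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# url -> {status code -> count}, Googlebot requests only
status_by_url = defaultdict(lambda: defaultdict(int))
with open("access.log") as log:  # hypothetical file name
    for line in log:
        match = LOG_PATTERN.match(line)
        if match and "Googlebot" in match.group("agent"):
            status_by_url[match.group("url")][match.group("status")] += 1

# Report every URL where Googlebot received anything other than a 200
for url, statuses in sorted(status_by_url.items()):
    errors = {code: n for code, n in statuses.items() if code != "200"}
    if errors:
        print(url, errors)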
Tools for log analysis
Screaming Frog Log File Analyser is purpose-built for this task. It parses log files, filters by bot, and produces crawl frequency reports per URL. It integrates with Screaming Frog SEO Spider output to cross-reference crawl data with on-site data.
Logstash + Kibana (ELK stack) is a more powerful option for sites with very large log volumes. Setup requires technical resource but produces flexible, real-time dashboards.
Python or command-line tools (awk, grep, cut) work well for targeted one-off analysis if you are comfortable with the command line. A simple shell command can extract all Googlebot requests and count them by URL in minutes.
Cloudflare Analytics provides partial bot traffic data for sites behind Cloudflare, but does not replace server-level logs for detailed per-URL analysis.
Setting up log access
Log files are stored on the server or by the hosting provider. The path varies by server type:
- Apache: typically /var/log/apache2/access.log
- Nginx: typically /var/log/nginx/access.log
Managed hosting providers (Kinsta, WP Engine, Cloudways) often provide log download through their dashboards. Cloudflare and CDN providers may buffer or modify logs; ensure you are analysing origin server logs rather than CDN-edge logs if you want accurate crawler data.
Logs are typically rotated and compressed daily. For meaningful SEO analysis, you need at least 30 days of logs, and 90 days is preferable for detecting crawl patterns and anomalies.
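Because of that rotation, an analysis script usually needs to walk the current log plus a set of gzipped archives rather than a single file. A minimal Python sketch, assuming the rotated files sit in the same directory and keep an access.log* prefix; the directory path and naming scheme are assumptions to adjust for your host:

import glob
import gzip

def read_log_lines(directory="/var/log/nginx"):  # assumed path; see above
    """Yield lines from the current log and its rotated, gzipped archives."""
    for path in sorted(glob.glob(f"{directory}/access.log*")):
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", errors="replace") as log:
            yield from log

# Example: total Googlebot requests across all retained log files
print(sum(1 for line in read_log_lines() if "Googlebot" in line))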
Log analysis versus Search Console
Search Console and server logs answer different questions. Search Console reports what Google has indexed and what queries pages appear for. Logs report what Google actually requested from the server. The gap between the two is where most log analysis value lives: pages Google crawled but did not index, pages that returned errors at crawl time, and patterns in crawl frequency that explain indexation delays.
Neither source is complete alone. Used together, they provide a more accurate picture of crawl behaviour than either provides independently.