
SEO Assumptions

Recently, there was a huge leak of internal Google API documentation, accidentally published to GitHub, which exposed a bunch of the metrics Google uses to evaluate and rank pages. This document is my take on what it all means, based on what I am reading from various sites that understand the keywords and metrics better than I do, and it consists of key "Ethan takeaways" about what this actually means for making a site rank better with Google. For those not familiar with the term LLM, it means "some type of AI". In an industry context, it means an algorithm that reads text. In a "web designer" context, it means ChatGPT.

Raw doc found here

A lot of this was taken from Search Engine Land

Page Quality

There is a metric called pageQuality that is essentially exactly what it sounds like: how good the page is. How Google figures this out, however, is what matters. Google seems to care about:

  • Complexity of content
    • unique information and depth of information stand out as ways to score high on “effort” calculations. Coincidentally, these things have also been proven to satisfy users.
  • "Replicability", or, how easy someone could remake the page structure and, presumably, come up with the content on their own or with a LLM.
  • Images, particularly unique images or images edited enough to seem unique
    • This might be a reason to use Google Lens to check whether an image can be reverse image searched. It may be that if the image comes from a good source and can be reverse searched, it boosts the page; if it comes from a bad source, it does not; and if it comes from "nowhere" (meaning it cannot be reverse searched), it is considered unique and high quality. More on this in the pagespeed section.
  • Videos
    • Same as images, but harder to track, since it is harder to tell whether a video is unique. Perhaps the "QA slideshow videos" are considered unique despite being low quality, and perhaps Google will favor pages with YouTube videos because embedding them supports a Google service.
  • Tools and utilities

EEAT and YMYL

EEAT stands for "Experience, Expertise, Authoritativeness, and Trustworthiness." YMYL stands for "Your Money or Your Life."

Basically, Google has metrics for both of these. EEAT measures how trustworthy and expert you appear, and scoring well on it is great. YMYL is less a score you want to be "high" or "low" on and more a category: pages about money, health, and safety are held to a much stricter quality standard. Essentially, if you seem trustworthy and have good content that isn't spammy or predatory, you rank better. But doing something like "scare tactics" ("your roof could collapse at any minute, so buy flood insurance!!!") is likely understood by Google to be threatening and bad. And if your site is Forbes (likely a good site), that doesn't directly help you if your content is bad, whereas a really, really great article on an otherwise shitty site will be impacted more positively by that breath of fresh air.

Topic Borders

Topical authority is a concept rooted in Google's patent research. If you've read the patents, you'll see that many of the insights SEOs have gleaned from them are supported by this leak.

siteFocusScore denotes how much a site is focused on a specific topic. siteRadius measures how far page embeddings deviate from the site embedding. In plain speech, Google creates a topical identity for your website, and every page is measured against that identity. siteEmbeddings are compressed site/page embeddings.

Basically, anything embedded on a page is somehow analyzed and compared to "known good" others in the same sector (Hometown Roofing Company is analyzed and compared against Megaroofing International), as well as compared against your own site, or at least what Google assumes your site is about. This implies, to me, that "core pages" like home, services, and about play a HUGE role in determining what Google assumes your site is about, which means that the rest of the site failing to match those "promises or ideals" could be very, very bad for SEO.
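The leak names siteFocusScore, siteRadius and siteEmbeddings but not the math behind them, so here is a minimal sketch of how a site-level topical identity and a "radius" could be derived from page embeddings. Everything below (the function names, the mean-of-embeddings center, cosine distance) is my assumption, not Google's implementation.

```python
# Illustrative sketch only: the leak names siteFocusScore, siteRadius and
# siteEmbeddings but not their formulas. All names here are hypothetical.
import numpy as np

def site_embedding(page_embeddings: np.ndarray) -> np.ndarray:
    """Approximate a site's topical identity as the mean of its page embeddings."""
    return page_embeddings.mean(axis=0)

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def site_radius(page_embeddings: np.ndarray) -> float:
    """How far pages drift, on average, from the site's topical identity."""
    center = site_embedding(page_embeddings)
    return float(np.mean([cosine_distance(p, center) for p in page_embeddings]))

# A page far off-topic (e.g. a roofing site publishing a crypto post)
# increases the radius and dilutes the site's topical focus.
pages = np.random.rand(10, 768)   # stand-in for real page embeddings
print(site_radius(pages))
```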

Host NSR

Host NSR is site rank computed for host-level (website) sitechunks. This value encodes nsr, site_pr and new_nsr. It is important to note that nsr_data_proto seems to be the newest version of this, but not much info can be found about it.

In essence, a sitechunk is a chunk of your domain, and site rank is computed by measuring these chunks. This makes sense, because we already know Google does this on a page-by-page, paragraph, and topical basis.

It almost seems like a chunking system designed to poll random quality metric scores rooted in aggregates. It’s kinda like a pop quiz (rough analogy).

Basically, Google takes metrics from Chrome users to determine how many people go to a website, and then ranks sites with more visits higher. It used to be thought that only clicks on Google Search did this, but we are now learning that it is not limited to Google Search: even typing www.site.com directly into the address bar counts, as long as you are on Chrome. This means bookmarks help sites rank better, so a site with frequent updates that users care about (a good blog) will likely rank higher in this area than a site that sells something, unless of course it is something you buy often.

HostAge

There is no evidence to suggest that newer sites rank better than older ones. Google tracks the dates on your pages to determine freshness and does something with this data, but it is clearly not using it to make old sites rank better. Probably it only applies to new pages?

Bolding/Emphasizing Content

While it’s conjecture, one of the most interesting things I found was the term weight (literal font size).

This would indicate that bolding your words or the size of the words, in general, has some sort of impact on document scores.
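Since this is pure conjecture, here is only a toy illustration of the idea: count a term more heavily when it appears in bolded or larger text. The tag-to-weight table and the function are entirely made up.

```python
# Toy illustration of the conjecture only: weight a term more heavily when it
# appears in bolded or larger text. The weights and tag list are invented.
STYLE_WEIGHTS = {"h1": 3.0, "h2": 2.0, "strong": 1.5, "b": 1.5, "p": 1.0}

def term_weight(term: str, occurrences: list[tuple[str, str]]) -> float:
    """occurrences: (term, enclosing_tag) pairs extracted from the page."""
    return sum(STYLE_WEIGHTS.get(tag, 1.0)
               for t, tag in occurrences if t.lower() == term.lower())

# "roof repair" in an h1 plus a bolded mention outweighs plain mentions.
occ = [("roof repair", "h1"), ("roof repair", "strong"), ("roof repair", "p")]
print(term_weight("roof repair", occ))   # 3.0 + 1.5 + 1.0 = 5.5
```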

Short/Thin Content

This one I was VERY wrong about. Short content does not equal thin content. Short content simply has a different scoring system applied to it.

Links from newer webpages are better than links inserted into older content. Basically, editing a page is less effective than creating a whole new page, and linking to "good" sites from a new page ranks better than putting those same links on an old page that has been edited.

How is no one talking about this one? An entire page dedicated to anchor text observation, measurement, calculation and assessment.

  • “Over how many days 80% of these phrases were discovered” is an interesting one.
  • Spam phrase fraction of all anchors of the document (likely a link farm detection tactic – sell fewer links per page).
  • The average daily rate of spam anchor discovery.
  • How many spam phrases are found in the anchors among unique domains.
  • Total number of trusted sources for this URL.
  • The number of trusted anchors with anchor text matching spam terms.
  • Trusted examples are simply a list of trusted sources.
  • At the end of it all, you get a spam probability and a spam penalty (a rough sketch of how these signals might combine follows this list).
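The leak lists these anchor metrics but not how they are combined, so the sketch below is a hypothetical blend of them into a spam probability; every weight and threshold is invented for illustration.

```python
# Hypothetical combination of the anchor-spam signals listed above; the leak
# names the metrics but not the formula, so all weights here are invented.
from dataclasses import dataclass

@dataclass
class AnchorSpamSignals:
    spam_phrase_fraction: float       # spam phrases / all anchors of the document
    daily_spam_discovery_rate: float  # average spam anchors discovered per day
    spam_phrase_domains: int          # unique domains whose anchors contain spam phrases
    trusted_sources: int              # total trusted sources for this URL
    trusted_spam_anchors: int         # trusted anchors whose text matches spam terms

def anchor_spam_probability(s: AnchorSpamSignals) -> float:
    """Blend the signals into a 0..1 spam probability (illustrative weights)."""
    raw = (0.5 * s.spam_phrase_fraction
           + 0.3 * min(s.daily_spam_discovery_rate / 10.0, 1.0)
           + 0.2 * min(s.spam_phrase_domains / 50.0, 1.0))
    # Being a trusted source buys some slack, unless the trusted anchors look spammy too.
    trust_discount = 0.5 if s.trusted_sources > 0 and s.trusted_spam_anchors == 0 else 1.0
    return min(raw * trust_discount, 1.0)

signals = AnchorSpamSignals(0.4, 25.0, 80, 3, 0)
print(round(anchor_spam_probability(signals), 2))  # high fraction + fast discovery => penalty
```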

trustedTarget is a metric associated with spam anchors, and it says “True if this URL is on trusted source.” When you become “trusted” you can get away with more, and if you’ve investigated these “trusted sources,” you’ll see that they get away with quite a bit. On a positive note, Google has a Trawler policy that essentially appends “spam” to known spammers, and most crawls auto-reject spammers’ IPs.

This one’s fascinating, and comes directly from the anonymous source who first shared the leak. In their words: “Google has three buckets/tiers for classifying their link indexes (low, medium, high quality). Click data is used to determine which link graph index tier a document belongs to. See SourceType here, and TotalClicks here.” In summary:

If Forbes . com /Cats/ has no clicks it goes into the low-quality index and the link is ignored

If Forbes . com /Dogs/ has a high volume of clicks from verifiable devices (all the Chrome-related data discussed previously), it goes into the high-quality index and the link passes ranking signals

Once the link becomes “trusted” because it belongs to a higher tier index, it can flow PageRank and anchors, or be filtered/demoted by link spam systems. Links from the low-quality link index won’t hurt a site’s ranking; they are merely ignored.
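Here is a rough sketch of that three-tier logic under my own assumptions: only the low/medium/high buckets and the idea that click volume from verifiable devices decides the bucket come from the quote above; the thresholds and function names are made up.

```python
# Rough sketch of the three-tier link index described above; the click
# thresholds are invented, only the low/medium/high structure comes from the leak.
def link_index_tier(total_clicks: int, verified_device_clicks: int) -> str:
    if total_clicks == 0:
        return "low"      # the link graph ignores the link entirely
    if verified_device_clicks >= 1000:
        return "high"     # the link can pass PageRank and anchor signals
    return "medium"

def link_passes_signals(source_page_clicks: int, verified_clicks: int) -> bool:
    """A backlink only counts if its source page sits in a non-low tier."""
    return link_index_tier(source_page_clicks, verified_clicks) != "low"

print(link_passes_signals(0, 0))        # False: no clicks, link is ignored
print(link_passes_signals(5000, 4200))  # True: high-traffic page, link counts
```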

I'm not sure if the author is interpreting the information right, or if I'm even correctly interpreting what the author is saying, but is this in reference to backlinks? I.e., if a backlink gets no clicks, it is useless and doesn't pass PageRank? Can anyone else chime in?

If this is true, it is pretty huge. We've always known that a good backlink should drive traffic to your site, but now we have confirmation that click data is basically the first factor considered when judging a link, and that if the page/link doesn't meet a certain threshold for click data, the process stops there and the link is simply ignored. No clicks = USELESS backlink? So all of those paid links, niche edits, guest posts, etc. that don't generate traffic, or that sit on pages that don't receive any traffic, could be even more useless than we thought.

titlematchScore

titlematchScore is a sitewide signal that tells how well page titles match user queries. So page titles (both the title tag shown in the browser tab and the on-page h1) that match what a user might search are better than "Home".

Example:
  • Home | Smart Recipes (bad)
  • Great Recipes for Dinners, Everyday | Smart Recipes (likely better)

Putting keywords in your title tag and matching real search queries is important. Basically, a stuffed fragment like "best roofer AZ" is bad, but something a person would actually search, like "emergency roof repair with 24 hour service", would be great.
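The actual titlematchScore formula isn't in the leak, but a toy token-overlap score shows the intuition: a title that shares words with the query scores high, a generic "Home" title scores near zero. The function below is purely illustrative.

```python
# Toy approximation of a title/query match: token overlap between the page
# title and a search query. Google's real titlematchScore formula is unknown.
def title_match_score(title: str, query: str) -> float:
    title_tokens = set(title.lower().split())
    query_tokens = set(query.lower().split())
    if not query_tokens:
        return 0.0
    return len(title_tokens & query_tokens) / len(query_tokens)

query = "emergency roof repair 24 hour service"
print(title_match_score("Home | Smart Roofing", query))                        # 0.0
print(title_match_score("Emergency Roof Repair with 24 Hour Service", query))  # 1.0
```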

Scoring in the absence of measurement

Basically, you have chunks of your site with values associated with them, and those values are averaged and applied to any unscored document.

Google is making scores based on topics, internal links, referring domains, ratios, clicks and all sorts of other things. If normalized site rank hasn't been computed for a chunk (Google uses chunks of your website and pages for scoring purposes), the existing scores associated with other chunks are averaged and applied to the unscored chunk.
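A minimal sketch of that fallback, assuming chunks map to URL prefixes and scores are simple floats (both assumptions are mine, not from the docs):

```python
# Sketch of the fallback described above: when a chunk has no computed score,
# assign it the average of the chunks that do. Purely illustrative.
from typing import Optional

def fill_missing_chunk_scores(chunk_scores: dict[str, Optional[float]]) -> dict[str, float]:
    known = [v for v in chunk_scores.values() if v is not None]
    fallback = sum(known) / len(known) if known else 0.0
    return {chunk: (v if v is not None else fallback) for chunk, v in chunk_scores.items()}

scores = {"/services/": 0.82, "/about/": 0.74, "/blog/new-post/": None}
print(fill_missing_chunk_scores(scores))   # the new post inherits the 0.78 average
```

The practical consequence is the same one the author draws next: a strong (or weak) average across your existing chunks is what every new, unmeasured page starts from.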

I don’t think you can optimize for this, but one thing has been made abundantly clear:

You need to really focus on consistent quality, or you’ll end up hurting your SEO scores across the board by lowering your score average or topicality.

Demotions

  • Poor navigational experience hurts your score.
  • Location identity hurts your scores for pages trying to rank for a location not necessarily linked to your location identity.
  • Links that don’t match the target site will hurt your score.
  • User click dissatisfaction hurts your score.
  • It’s important to note that click satisfaction scores aren’t based on dwell time. If you continue searching for information NavBoost deems to be the same, you’ll get the scoring demotion.
  • Spam
    • gibberishScores are mentioned. This refers to spun content, filler AI content and straight nonsense. Some people say Google can’t understand content. Heck, Google says they don’t understand the content. I’d say Google can pretend to understand at the very least, and it sure mentions a lot about content quality for an algorithm with no ability to “understand.”
    • phraseAnchorSpamPenalty: Combined penalty for anchor demotion. This is not a link demotion or authority demotion. This is a demotion of the score specifically tied to the anchor. Anchors have quite a bit of importance.
    • trendSpam: In my opinion, this is CTR manipulation-centered. “Count of matching trend spam queries.”
    • keywordStuffingScore: Like it sounds, this is a score of keyword stuffing spam.
    • spamBrainTotalDocSpamScore: Spam score identified by spam brain going from 0 to 1.
    • spamRank: Measures the likelihood that a document links to known spammers. The value is given as 0 and 65535, which looks like the endpoints of a 16-bit range rather than two discrete values.
    • spamWordScore: Apparently, certain words are spammy. I primarily found this score relating to anchors. (A toy sketch of how scores like these might demote a document follows this list.)
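None of the weights or combination logic below come from the leak; only the metric names do. This is just a toy illustration of how several 0-to-1 spam scores could drag down a document's overall score.

```python
# Purely hypothetical: only the metric names come from the leak. Shows how a
# few 0..1 spam scores could translate into a demotion of the document score.
def spam_demotion_multiplier(gibberish_score: float,
                             keyword_stuffing_score: float,
                             spam_brain_total_doc_spam_score: float) -> float:
    """Return a multiplier in (0, 1] applied to the document's base score."""
    worst = max(gibberish_score, keyword_stuffing_score, spam_brain_total_doc_spam_score)
    return max(1.0 - worst, 0.05)   # heavy spam crushes the score but never zeroes it

base_score = 0.9
demoted = base_score * spam_demotion_multiplier(0.1, 0.7, 0.2)
print(round(demoted, 2))   # keyword stuffing alone drags 0.9 down to 0.27
```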

Advice

  • You should invest in a well-designed site with intuitive architecture so you can optimize for NavBoost.
  • If you have a site where SEO is important, you should remove / block pages that aren’t topically relevant. You can contextually bridge two topics to reinforce topical connections. Still, you must first establish your target topic and ensure each page scores well by optimizing for everything I’m sharing at the bottom of this document.
  • Because embeddings are used on a page-by-page and site-wide basis, we must optimize our headings around queries and make the paragraphs under the headings answer those queries clearly and succinctly.
  • Clicks and impressions are aggregated and applied on a topical basis, so you should write more content that can earn more impressions and clicks. Even if you’re only chipping away at the impression and click count, if you provide a good experience and are consistent with your topic expansion, you’ll start winning, according to the leaked docs.
  • Irregularly updated content has the lowest storage priority for Google and is definitely not showing up for freshness. It is very important to update your content. Seek ways to update the content by adding unique info, new images, and video content. Aim to kill two birds with one stone by scoring high on the “effort calculations” metric.
  • While it’s difficult to maintain high-quality content and publishing frequency, there is a reward. Google is applying site-level chard scores, which predict site/page quality based on your content. Google measures variances in any way you can imagine, so consistency is key.
  • Impressions for the entire website are part of the quality NSR data. This means you should really value the impression growth as it is a good sign.
  • Entities are very important. Salience scores for entities and top entity identification are mentioned.
  • Remove poorly performing pages. If user metrics are bad, no links point to the page and the page has had plenty of opportunity to thrive, then that page should be eliminated. Site-wide scores and scoring averages are mentioned throughout the leaked docs, and it is just as valuable to delete the weakest links as it is to optimize your new article (with some caveats).
  • Clicks matter more than anything else, by far, so it might not be a bad idea to get external marketing going to start generating clicks. While this doesn't directly confirm that Google prefers people who use Google Ads, using Google Ads generates clicks, which means Google likes you more. Not any more than any other ad service generating clicks, necessarily, but it does work.
  • Exact match domains can be bad for search ranking: www.prescott-az-best-roofer.com is not as good as www.wymaroofing.com

Other takeaways

  • Google has a specific flag that indicates whether a site is a “small personal site.”
  • Google denied having a "sandbox" that holds back new sites, but yep, the docs confirm it exists.
  • The number and diversity of your backlinks still matter a lot.
  • Having authors with expertise and authority helps. Maybe putting the author's name on every page, plus an about page that says "Joe Blow has 25 years of experience", might help rankings if every page is listed as written by him?
  • An anecdote from Reddit:
I’ve seen a massive jump in our search rankings and traffic since March (4x). The website is a local therapy business. I couldn’t figure out why then I read there was an algorithm update in March of this year. We had created a “knowledge section” center of our site last year that almost never received traffic and now gets hundreds of visits per month. Prior to March we had low traffic, but high engagement and conversion from those visitors. I wonder if we are now outranking blogs for some of the condition related information searches due to this change.
  • A lot of long-held SEO theories have been validated, so trust your instincts.