How Google works: Paul Haahr at SMX
A transcription of Google engineer Paul Haahr’s session at SMX entitled “How Google Works”.
It’s very rare that we get to hear from a Google Search Engineer at a Search Industry conference. There are always lots of helpful Google Webmaster Analysts present but since Matt Cutts went on sabbatical we’ve not had much in the way of a replacement.
That’s why Paul Haahr’s session at the SMX West conference was really welcome.
I wanted to analyse this presentation after initially discovering it via Bill Slawski. As a training resource for my team, I thought it might be smart to transcribe, take notes, add useful links and screenshots and generally extract as much value from this rare opportunity as I could.
I also recommend you take a look at Rae Hoffman’s notes here. One important takeaway for me was that Paul didn’t explicitly mention issues like algorithm updates and machine learning in his presentation – I don’t have a view on why that was, but nevertheless the value of his presentation is exceptional.
Conferences like SMX are the backbone of our industry for agencies and in-house teams alike. SMX will be visiting London on May 18th – 19th. It’s a great place to learn new ideas and meet interesting people so if you’re in London that week, you really should come along. If content like this isn’t a good reason to join us there, I don’t know what is. Massive thanks to Paul Haahr and Danny and his team for putting stuff like this out there.
Video source: SMX West 2016 – How Google Works: A Google Ranking Engineer’s Perspective
How Google Works: A Google Ranking Engineer’s Story
Danny: So this session is “How Google Works: A Google Ranking Engineer’s Story.” We are really lucky to have Paul Haahr, who’s a principal engineer at Google. He has one of those titles that sounds very subdued but actually doesn’t really reflect that he’s part of the senior leadership of Google’s ranking team.
So it’s a real…you’re very lucky to have him out here talking today. He’s going to go through some slides and give you a sense of what it’s like from his perspective working with the team that builds out the ranking process, and then we should have some time for Q and A after that.
So if you please welcome Paul.
Paul: I’m Paul Haahr. I’ve actually been working on ranking at Google for 14 years as of tomorrow. As I told Danny, my claim to fame in this room should really be that I was Matt Cutts’ office mate for about two years.
I’ve worked on retrieval. I’ve worked on a lot of different parts of ranking [1,2]. I’ve worked on indexing. These days, I manage a couple of small teams. I participate in our launch review process and I even still do some coding.
I want to talk about Google Search today for just a couple of minutes. There are two themes, I would say, that are going through Google Search today, maybe an emerging one that I’m not going to talk as much about.
We need to get it to them very quickly despite the fact they probably have low bandwidth [11,12,13]. When you’re searching on a mobile device, you’re much less likely to type. You’re much more likely to use voice or just tap on a click target.
The other thing that’s going on with Google Search, I’m sure everybody here has noticed, there’s a lot of search features. We are doing spelling suggestions like we’ve always done but AutoComplete is playing a much bigger role.
And so, that’s sort of where I would characterize things. We’re going more and more into a world where search is being thought of as an assistant to all parts of your life, and some of that is showing up as search is reaching out to help people directly.
But I am going to talk about ranking, which is a very specific sub-problem of all the search problem, which I would describe as the “10 blue links” problem. And so, I’m mostly going to be talking about classic search and how we’ve been doing things for ages and ages now.
So everybody’s used to “10 blue links.”
That’s all there used to be a little while ago. And I reduce the “10 Blue Links” problem to what do we show and what order do we show them in?
And I should also mention I’m not going to be talking about ads at all. And probably everybody in this room knows more about Google Ads than I do. Ads are great. They make us a lot of money. They work very well for advertisers.
But my job, we’re explicitly told, “Don’t think about the effect on ads. Don’t think about the effect on revenue. Just think about helping the user.”
So, I’m going to start with talking about what we call “Life of a Query.” This is actually modelled on a class we do for every new engineer at Google. Whether they’re working on Search or Android or Ads or self-driving cars, every engineer coming into Google gets a half-day class on how our systems are put together [24,25,26,27].
I’m not going to give you a half-day class. I’m going to give a five-minute version of it just to understand what our systems are like.
So there are two parts of a search engine. There’s what we do before we see a query and once we have the query. So before we have the query, we crawl the web.
Everybody here is used to seeing Googlebot crawl their sites. Not much to say there. We try to crawl as comprehensive a part of the web as we can. It’s measured these days in the billions of pages. I don’t even have an exact number. After we’ve gathered all those pages, we analyse the crawled pages.
In the old days, analysing crawled pages was we extracted the links. There was a little bit else, but it was basically just “Give me the links off the web.” These days, we do a lot of semantic analysis and annotation. Some of this is linguistic. Some of this is related to the Knowledge Graph. Some of it is things like address extraction and so on.
In most cases, you should not have to do anything special and we get to see the same version of the page with full rendering with CSS and all that that your users see.
That’s been a real benefit, I think, for both users and webmasters. And then after that, we build an index. Everybody knows what an index in a book is, and a web index is very similar.
For every word, it is a list of the pages that the word appears on. As a practical matter to deal with the scale of the web, we break the web up into groups of millions of pages which are distributed randomly. Each of these groups of millions of pages is called an index shard, and we have thousands of shards for the web index. And then, as well as the list of words, we have some per-document metadata.
So that is the index building process. These days, it runs continuously. It processes, again, some number of billions of pages a day.
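To make the index description above concrete, here is a minimal Python sketch. It is an illustration only, not Google's implementation: it assumes whitespace tokenisation and a tiny in-memory corpus, and the names `build_shards`, `lookup` and `num_shards` are hypothetical.

```python
import random
from collections import defaultdict


def build_shards(pages, num_shards, seed=0):
    """Randomly assign pages to shards and build one inverted index per shard.

    `pages` maps a URL to its text. All names here are hypothetical.
    """
    rng = random.Random(seed)  # fixed seed so the toy example is repeatable
    shards = [{"index": defaultdict(set), "metadata": {}} for _ in range(num_shards)]
    for url, text in pages.items():
        shard = shards[rng.randrange(num_shards)]  # pages are distributed randomly
        words = text.lower().split()
        for word in words:
            shard["index"][word].add(url)  # word -> pages the word appears on
        shard["metadata"][url] = {"length": len(words)}  # per-document metadata
    return shards


def lookup(shards, word):
    """Ask every shard for the pages containing `word` and union the answers."""
    hits = set()
    for shard in shards:
        hits |= shard["index"].get(word.lower(), set())
    return hits


pages = {
    "a.example/1": "google ranking talk",
    "b.example/2": "ranking signals and metrics",
    "c.example/3": "web index shards",
}
shards = build_shards(pages, num_shards=2)
print(sorted(lookup(shards, "ranking")))
```

Whichever shard a page lands in, a lookup has to consult every shard, which is why the query is later sent to all of them.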
Now, I’m going to turn to what happens at serving time when we actually get a query.
I’m going to break that into three parts. We do a query understanding part, where we try to figure out what the query means, we do retrieval and scoring, which I’ll talk a little bit more about, and then, after we’ve done the retrieval part, we do some adjustments [36,37,38,39,40].
So query understanding. First question is do we know any named entities in the query? The San Jose Convention Centre, we know what that is. Matt Cutts, we know what that is. And so we label those. And then, are there useful synonyms? Does General Motors in this context mean…? Does GM mean General Motors? Does GM mean genetically modified? And my point there is just that context matters. We look at the whole query for context.
Once we have this expanded query, we send this query to all the shards that I just talked about from the index. And for each shard then, we find all the pages that match. All is an exaggeration.
We find pages that match. We compute a score for the query and the page. Computing the score is the heart of ranking in a lot of ways. We come up with a number that represents how good a match the query is for the page.
Once we have that, each shard sends back the top N pages by score. And it’s a small number of pages for each shard. We’ve got a lot of shards. The central server then combines all the top pages, sorts by the score, and then we’re almost done.
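The scatter-and-merge step just described can be sketched in a few lines of Python. This is a toy under stated assumptions: the scores are made-up numbers, and `shard_top_n` and `merge_shards` are hypothetical names, not anything from Google's servers.

```python
import heapq


def shard_top_n(scored_pages, n):
    """Return the n highest-scoring (score, url) pairs from one shard."""
    return heapq.nlargest(n, scored_pages)


def merge_shards(per_shard_results, k):
    """Combine every shard's top pages and keep the k best overall."""
    combined = [hit for results in per_shard_results for hit in results]
    return heapq.nlargest(k, combined)


# Made-up (score, url) pairs standing in for two shards' matches.
shard_a = [(0.91, "a.example"), (0.40, "b.example"), (0.15, "c.example")]
shard_b = [(0.77, "d.example"), (0.62, "e.example"), (0.05, "f.example")]

partial = [shard_top_n(shard, n=2) for shard in (shard_a, shard_b)]
top = merge_shards(partial, k=3)
print(top)
```

Because each shard returns only a small list, the central merge stays cheap even with thousands of shards.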
And then, we do some post-retrieval adjustments. So this is looking at diversity by host, looking at how much duplication there is. Spam demotions [44,45] kick in at this point and a whole bunch of other little-ish things come in at that point. Then, we generate snippets. We’ve got our top 10, we produce a search results page after we’ve merged with other search features and send it back to the user.
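One way to picture the “diversity by host” adjustment is a cap on how many results a single host may contribute. The talk does not say how Google actually does this, so the sketch below is purely an assumption: `limit_per_host` and the cap of two are hypothetical, and a real system might demote rather than drop.

```python
from urllib.parse import urlparse


def limit_per_host(ranked_urls, max_per_host=2):
    """Keep ranked order but drop results beyond `max_per_host` for any host."""
    seen = {}
    kept = []
    for url in ranked_urls:
        host = urlparse(url).netloc
        seen[host] = seen.get(host, 0) + 1
        if seen[host] <= max_per_host:
            kept.append(url)
    return kept


ranked = [
    "https://a.example/1",
    "https://a.example/2",
    "https://a.example/3",
    "https://b.example/1",
]
diverse = limit_per_host(ranked)
print(diverse)
```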
So what I’m trying to convey, I guess, in this talk to some degree is what do ranking engineers do. And the first version is just we write code for the servers that I just talked about. That’s a very operational definition. It doesn’t actually get at anything useful yet. So we’ll see if we can get more useful.
Talking a little bit more about the scoring process, that is computing this one number that represents the match between a query and a page.
We base this on what we call scoring signals. A signal is just some piece of information that’s used in scoring. We break these down into two categories. There are the ones that are just based on the page, so your PageRank, your language, whether the page is mobile-friendly. And there are the ones that are query-dependent, so things that take into account both the page and what the user is searching for. So keyword hits and synonyms [47,48], and proximity all factor into this.
And so, version two of what ranking engineers do is we either look for new signals or we combine old signals in new ways. And both of those turn out to be really hard and interesting would be my summary.
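As a rough illustration of the two signal categories, the sketch below blends a couple of query-independent signals with one query-dependent signal. The talk does not say how signals are combined; the weighted sum, the weights, the 0 to 1 PageRank value and every name here are assumptions made for the example.

```python
def keyword_hits(query, text):
    """Query-dependent signal: fraction of query words found on the page."""
    query_words = query.lower().split()
    if not query_words:
        return 0.0
    page_words = set(text.lower().split())
    return sum(word in page_words for word in query_words) / len(query_words)


def score(query, page, weights):
    """Blend the signals into one number (a weighted sum is an assumption)."""
    signals = {
        # Query-independent: depends only on the page.
        "pagerank": page["pagerank"],
        "mobile_friendly": 1.0 if page["mobile_friendly"] else 0.0,
        # Query-dependent: depends on the page and the query together.
        "keyword_hits": keyword_hits(query, page["text"]),
    }
    return sum(weights[name] * value for name, value in signals.items())


weights = {"pagerank": 0.5, "mobile_friendly": 0.25, "keyword_hits": 0.25}
page_one = {"pagerank": 0.5, "mobile_friendly": True,
            "text": "events at the san jose convention centre"}
page_two = {"pagerank": 0.5, "mobile_friendly": False,
            "text": "san francisco events"}

print(score("san jose events", page_one, weights))
print(score("san jose events", page_two, weights))
```

Changing `weights` or adding an entry to `signals` corresponds to the two activities named above: combining old signals in new ways, or looking for new ones.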
All right. But that doesn’t get at how we determine what we want to do. That’s just how we do it. And metrics are really what we use as our guide.
Lord Kelvin supposedly said, “If you cannot measure it, you cannot improve it.” He actually said something that was much more Victorian and publishable in a science journal, but the popular version of the quote is much easier to understand.
So we measure ourselves in a whole lot of different directions. The key metrics that I want to talk about today. Relevance. Is the page useful at answering what the user was looking for? And this is our top-line metric. This is the one that we cite all the time internally. This is the one that we compare ourselves, usually, to other search engines with. And so, this is the big internal metric.
But we also have other ones such as quality. How good are the results that we show? How good are the individual pages? Or Time to Results, where faster is better. And so, we have a lot of metrics that we measure ourselves on.
And I should mention that all these metrics are based on looking at the whole search results page rather than just one result at a time. And to do this whole search results page, there’s just a convention that basically everybody who does search uses, something like this, which is position one is worth the most, position two is worth half of what position one is, position three is worth one-third. This is normally known as reciprocal rank weighting. And it goes on from there.
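The position weighting described above (1, 1/2, 1/3, and so on) can be written down directly. In this sketch the per-result ratings are assumed to lie between 0 and 1, and dividing by the total weight is an extra normalisation choice made for the example; `page_score` is a hypothetical name.

```python
def page_score(result_ratings):
    """Weight per-result ratings by 1/position and return the weighted average.

    `result_ratings[0]` is the rating of the result shown at position one.
    """
    if not result_ratings:
        return 0.0
    weights = [1 / position for position in range(1, len(result_ratings) + 1)]
    weighted = sum(w * r for w, r in zip(weights, result_ratings))
    return weighted / sum(weights)


good_first = page_score([1.0, 0.5, 0.0])  # best result at position one
good_last = page_score([0.0, 0.5, 1.0])   # same results, reversed order
print(good_first > good_last)
```

The same three results score higher when the best one is at the top, which is the behaviour a whole-page metric is meant to reward.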
So what do ranking engineers do? We try to optimize the metrics. We try to improve the scores that we get on our metrics.
And to compute our metrics…I apologize if I hit anybody with the laser pointer. I keep hitting the wrong button.
Where do the metrics actually come from?
We have an evaluation process that’s based on two separate things, live experiments and human rater experiments.
Now, live experiments are probably familiar to just about everybody in the webmaster community.
And we do largely the same things that most websites do with live experiments, which is we do A/B experiments on real traffic. And then we look for changes in click patterns. And I should just mention that we run a lot of experiments. It is very rare if you do a search on Google and you’re not in at least one experiment.
Now, not all of these are ranking experiments. Famously, Google tested the colour blue that we used for links and other blue highlighting with 41 different blues and came to the conclusion what was the perfect blue. And this caused a lot of designers some angst because they just wanted to trust their instinct. You can argue both sides of that case. But anyway, we do a lot of experiments.
I want to mention that interpreting live experiments is actually a really challenging thing often. And I’m going to take you through an example. I apologize for the subscripts here. That’s just the conventions that we tend to use. This is the only subscripts on my slides.
But consider that you have two pages, page one and page two, that are possible pages to show for some query that a user gives. For page one, the answer is on the page. User clicks, they see the answer, they’re good. For page two, the answer’s on the page but our snippeting algorithm also pulls the answer into the snippet. And now, we have two algorithms. Algorithm A puts P-1 before P-2. So the user going down, they see P-1. It looks like it could be a good result. They click on it. They go to the page. Our live experiment analysis says that’s good. We got a click and it was high up on the page. Algorithm B puts P-2 before P-1. The user’s going down the page.
They see P-2. They see the answer in the snippet. They don’t bother clicking. To us, at least to a naive interpretation, that looks bad. Do we really believe that? Because actually, we think that P-2 is at least as good as P-1 in terms of answering the question on the page and getting the user the result earlier should be good. It’s quite challenging to distinguish between the case where the user left because the answer wasn’t in the snippet, or because they didn’t think the answer was there at all, and the case where the user left because they got a good answer in the snippet. So live experiments are challenging but very useful nonetheless.
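The difficulty in this example can be shown with a deliberately naive click metric. The metric below is hypothetical (it simply credits 1/position for each click) and the `got_answer` flag is something a real click log would not contain; it is included only to show what the metric cannot see.

```python
def naive_click_credit(click_positions):
    """Sum 1/position over the results the user clicked (hypothetical metric)."""
    return sum(1 / position for position in click_positions)


# Algorithm A ranks P-1 first; the user clicks it to find the answer.
session_a = {"clicks": [1], "got_answer": True}
# Algorithm B ranks P-2 first; the snippet already answers the question.
session_b = {"clicks": [], "got_answer": True}
# A user who saw nothing useful and gave up also produces no clicks.
session_abandoned = {"clicks": [], "got_answer": False}

for name, session in [("A", session_a), ("B", session_b),
                      ("abandoned", session_abandoned)]:
    print(name, naive_click_credit(session["clicks"]), session["got_answer"])
```

Sessions B and “abandoned” receive the same credit even though only one of them left the user satisfied, which is the ambiguity described above.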
The other thing we do is human rater experiments.
There’s a long history of doing experiments like this in information retrieval. What we do is we show real people experimental search results. We ask those people how good are those results, and I’ll talk a little bit more about how we do that. We get some ratings. We average them across the raters. We do this for large query sets.
So we get lots of ratings and we get something that we think is statistically significant. Tools support doing this in an automated way. It’s very similar to Mechanical Turk processes that people do outside Google.
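The averaging step can be sketched as follows. This is a simplified illustration, not Google's tooling: ratings are assumed to be slider values between 0 and 1, the standard error of the per-query means stands in for a proper significance test, and `summarize` is a hypothetical name.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(ratings_by_query):
    """Average ratings per query, then across queries, with a standard error.

    Needs at least two queries, since `stdev` requires two data points.
    """
    per_query = [mean(ratings) for ratings in ratings_by_query.values()]
    overall = mean(per_query)
    stderr = stdev(per_query) / sqrt(len(per_query))
    return overall, stderr


# Made-up ratings: two raters per query, three queries.
ratings = {
    "cnn": [0.8, 0.6],
    "honda odyssey": [0.4, 0.6],
    "trader joe's": [1.0, 0.8],
}
overall, stderr = summarize(ratings)
print(round(overall, 3), round(stderr, 3))
```

With a large query set the standard error shrinks, which is why the ratings are collected across many queries and raters.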
And we actually published guidelines that explain the criteria to raters. We really want humans in the loop on this one because people have good intuitions. People search for themselves and they have experience. They can tell what’s a good search result and what’s bad. But we also need to explain what we’re looking for and what we think our users are looking for. As people may know, we published our human rater guidelines last year.
They’re about 160 pages of great detail about how we think about what’s good for users.
If you’re wondering why Google is doing something, often the answer is to make it look more like what the rater guidelines say.
I’m going to give you a bunch of examples from the rater guidelines over the next bunch of slides.
So, for example, this is actually what our rater tool looks like for the raters minus the red arrows. This is, again, pulled from the rater guidelines. Basically, they get a set of search results. They get told what the query is on the top. There’s actually some information there about where the user is or where we think the user is. And there’s some sliders there that the raters can play with.
Here’s an example of an actual rating item. So they have a slider for what’s called the “Needs Met” rating and a slider for the “Page Quality” rating.
So they get asked this. In the context of a query, set the sliders to where you think they belong.
So there are these two scales, Needs Met, which is our version of relevance these days, which is does this page address the user’s need? And then, there’s Page Quality, which is how good is the page?
And you should be saying, “But you said you were mobile first. Why aren’t you asking, ‘Is this page mobile friendly?’”
And actually, all of our Needs Met instructions are about mobile user needs. So we give this general prefix, “Needs Met rating tasks ask raters to focus on mobile user needs and think about how helpful and satisfying the result is for the mobile users.”
So that’s implicit. But we also make it mobile-centric by using many more mobile queries than desktop queries in samples. We actually oversample mobile. Right now, mobile traffic has just passed desktop. But we have basically more than twice as many mobile queries as desktop in our samples.
We pay attention to the user’s location. You saw that in the tool sample I showed before. The tools also display a mobile user experience. And the raters actually visit the websites on their smartphones, not on their desktop computers. So there, the raters are really getting a very mobile-centric experience.
Okay, Needs Met rating. I’m going to start with the best category. Here are the categories.
It goes from Fully Meets through Highly, Moderately, and Slightly Meets all the way down to Fails to Meet. Oh, I left an extra S in there. Obviously, Fully Meets is great. Fails to Meet is awful and we’ve got things in the middle.
So two examples of Fully Meets. You search for CNN, we give you CNN.com. That’s a great result. The user who is searching on Google for CNN probably wants something like that. On the other hand, we know we’re in a mobile era. We know people like apps a lot. So you search for Yelp and you have the Yelp app installed on your phone, you probably actually want to open the Yelp app, or at least as likely. For Fully Meets, we really want the case of an unambiguous query and something that can wholly satisfy whatever a user wants to do about that query.
So in either case, I think, showing the Yelp website would probably be Fully Meets. Showing the CNN app would also be Fully Meets as well.
I hope there’s a way I can go back, and go to Highly Meets. Highly Meets is this is an informational query and this is a good source of information. Two of these happen to be from Wikipedia. I think ESPN would have been a probably better Highly Meets example here. But the idea is, “This is a great source of information. It’s authoritative. It’s got some expertise to it. It’s probably comprehensive for the query in question.” And this is what Highly Meets is meant to be.
We actually have a category between… We actually give the raters slider bars where they can go anywhere they want on them. And we have a sort of Very Highly Meets in between Highly and Fully that’s meant to capture the idea that this would be Fully Meets if there weren’t another great interpretation. So this is two examples of the query, Trader Joe’s. The first one shows a map with three nearby stores. The second one shows the Trader Joe’s website. The user might want the website. So showing the map is not quite adequate. The user might want the map. Showing the website is not quite adequate. We wanted to have a distinction of, “This is better than just the Wikipedia page about Trader Joe’s, which seems like a pretty useless thing by and large.” But we didn’t want to be able to say, “Hey, you totally nailed this query by getting the map there and not getting the website.” Or vice versa.
Some more examples of Highly Meets would be showing pictures on a query where we think the user is looking for pictures, showing a map for an ambiguous query. The query here is turmeric, which is, yes, a spice but it’s also a restaurant in Sunnyvale. If the user is in Sunnyvale and they’re searching for turmeric, the map is probably a good guess. We want to give that Highly Meets.
Moderately Meets is it’s good information. For the query Shutterfly, the CrunchBase page about Shutterfly, it’s interesting. It’s certainly not the first thing, but it might be useful on the first page. Similarly for Tom Cruise, a fan site about Tom Cruise or a general star site about Tom Cruise, that seems like a good site. It’s not the most authoritative but it’ll have useful information.
Slightly Meets, less good information. In this case, one of the examples I think is really spot on. Search for a Honda Odyssey, you get the Kelley Blue Book page about the 2010 Honda Odyssey. User didn’t say 2010. They’re possibly interested in 2010. They’re more likely interested in something more recent. So that would be an example. So again, this is acceptable but not great information and we’d hope there is better.
Fails to Meet is where it starts getting laughable. Search for German cars and get Subaru. Probably didn’t mean that. Searching for a rodent removal company and getting one half the world away. Probably not too useful. We all have horror stories of our own. There were three bugs that acted in concert about 10 years ago before United and Continental had merged such that you searched for United Airlines and you get Continental at position one. I was responsible for two of the three bugs that were working in concert there, and that was very embarrassing.
I’m going to now turn from Relevance to Page Quality. After a lot of iteration, we’ve ended up at three important concepts that we think of for describing the quality of a page.
It’s expertise, authoritativeness, and trustworthiness. So is the author here an expert on what they’re talking about? Is the website or is the webpage authoritative about it? And can you trust it? And then, there’s clearly some queries where medical or financial information is involved, buying a product, where trustworthiness is probably the most important of those three.
The rating scale here is from high quality to low quality, and it’s sort of obvious what we’re looking for.
For high quality pages, it’s a satisfying amount of high quality main content. It’s got the expertise, authority, and trustworthiness. And the website has a good reputation as sort of a key principle there.
Low quality, it’s the opposite. A couple of other things to throw in, a website that has an explicit negative reputation. And we all know about those sorts of sites. Or the secondary content is distracting or unhelpful. Secondary content is largely ads and other things not necessarily tied to the user’s information need.
So optimizing our metrics, how do we do that?
We’ve got a team of a few hundred software engineers. They are focused on our metrics and our signals. They run lots and lots of experiments and they make a lot of changes.
Yeah. So that’s what we spend our time doing. The process is usually we start with an idea. Sometimes the idea is, “There’s a set of problems I really want to solve.” Or sometimes the idea is: “I’ve got this new source of data and I think it’s really useful.”
And then you repeat over and over. You write some code. You generate some data. You run a bunch of experiments and you analyse the results of those experiments. This is a typical software development process. Can take weeks, can take months in that stage. You know, it might not pan out. A lot of things never pan out. When it does pan out, we get a launch report written. We run some final experiments. A launch report is written by a quantitative analyst who is basically a statistician and someone who’s an expert at analysing our experiments.
These reports are really, really great because they summarize. First of all, they’re from a mostly objective perspective relative to the team that was working on the experiments. They keep us honest. And then, there’s the launch review process, which is, for the most part, a meeting every Thursday morning. I came here from our launch review meeting, where the leads in the area hear about what the project is trying to do, hear a summary from the analyst. In theory, we’ve read all of the dozen or so reports. In practice, hopefully one of us has read it. At least one of us has read one of each of the reports. And we try to debate basically what…is this good for the users? Is this good for the system architecture? Are we going to be able to keep improving the system after one of these changes is made?
I’d like to think that that’s a really kind and fair process. The teams that come before launch review might disagree. They’ve been known to be quite contentious.
We did a videotape a few years ago of one of those meetings so you could see one or two items discussed there. And I should actually mention, I didn’t put it on the slide, but after launch review, assuming something is approved, getting into production can be an easy thing. Some teams will ship it the same week. Sometimes you have to rewrite your code to actually make it suitable for our production architecture, making things fast enough, making things clean enough.
That can take a while. I’ve known it to take months. In the worst case I know about, it took just shy of two years between approval and actually launching.
So what do ranking engineers do? By and large, we try to move results with good human ratings and live experiment ratings up and move results with bad ratings down.
What goes wrong with this? And how do we fix it? I’m going to talk about two kinds of problems that we have. There are others but these are at the core of most of the things we’ve seen.
One is if we get systematically bad ratings or if the metrics don’t capture the things we care about.
So bad ratings, here’s an example, Texas farm fertilizer.
That’s actually a brand of fertilizer. I only learned that by looking at the search. What we showed at position one was a three pack of local results on a map. It is very unlikely the user doing the search is going to want to go to the manufacturer’s headquarters. Now, I looked at the Street View. I looked through the pictures. You can actually drive your 18-wheeler up to the loading dock and get fertilizer that way. It is unlikely that the user doing the search actually wanted to do that. You can actually buy this in stores like Home Depot and Lowe’s, so that seems like a much more likely route. Maybe somebody wanted to order it online. I don’t know. But the raters, on average, called this pretty close to Highly Meets, and that seems crazy to us.
And then, we actually saw a pattern of losses that we were getting here, which is that in a series of experiments especially that were increasing the triggering of maps, raters were saying, “This looks great!” So the team was taking that as “Wow, we should show more maps.” And there was a cycle going on of experiments that were showing more maps and the raters saying, “Yeah, we like the maps there.” And that was a pattern that we saw, we realized that it was wrong, so what we started doing was creating some more examples for the rater guidelines.
And here’s exactly that query showing, in this case, I think we actually are showing the front end of the store. You can’t see the loading dock in that little picture. And told the raters, “Actually, this is a Fails to Meet. The user is not looking for this.” Now, you could argue maybe it should have been the next category up or somewhere there. But we really wanted to get the message across that if you don’t think the user is going to go there, don’t show them that. We were having this issue around things like radio stations. Very rarely do you want to go to the radio station or go to a newspaper or go to the state lottery office. When you search for the state lottery, you’re looking for the numbers.
Another problem case, missing metrics. We were having an issue with quality.
And this was particularly bad, we think of it as, around 2008, 2009 to 2011. We were getting lots of complaints about low quality content and they were right. We were seeing the same low quality thing. But our relevance metrics kept going up. And that’s because the low quality pages can be very relevant. This is basically the definition of a content farm in our vision of the world.
So we thought we were doing great. Our numbers were saying we were doing great and we were delivering a terrible user experience. And it turned out we weren’t measuring what we needed to.
So what we ended up doing was defining an explicit Quality metric, which got directly at the issue of quality. It’s not the same as Relevance. This is why we have that second slider there now. And it enabled us to develop quality-related signals separate from relevance signals and really improve them independently.
So when the metrics miss something, what ranking engineers need to do is fix the rating guidelines or develop new metrics. And so with that, I’m going to say thanks.
This transcription is based on Paul Haahr’s “How Google Works” presented at SMX West, in March 2016. I have added links and references which are designed to expand on a phrase used by the presenter and may not represent his views.