Cleverer Ranking Factors: Co-Occurrence, Co-Citation, and Anti-Manipulation

Google badly wants to be human.

Imagine Google as a giant, complex program used by so many of the computers in the world that it has huge influence over how all of us conduct business.

…What? No, not like Skynet from the Terminator movies.

OK, how about this: a computer that has so many uses and applications that if we were on a spaceship, we could let it be the pilot, because that’s what it would want to do anyway.

…No, no, not like HAL 9000 from 2001: A Space Odyssey.

Let me rethink this: a machine that wasn’t designed to think or reason on its own, but is programmed in such a way that it is supposed to emulate how a human might see the world.

…Huh? No! Not like I, Robot, Blade Runner, or Battlestar Galactica!

OR Cloud Atlas! Geez!

Clearly, almost every computer or robot we’re exposed to in books, movies, or TV (with the exception of R2-D2 and C-3PO, apparently) ends up turning against its human masters and attempting to destroy them, with varying strategies and levels of effectiveness. (Let’s admit, the Cylons had that one on lock.) So it’s going to be difficult to believe that Google probably won’t develop the ability to think for itself and slowly devise a plan to carry out the end goal of its programming at the expense of humanity.

But we can try, right? Can we try to pretend that Google isn’t out to get us and actually wants to, *gulp*, help us?

Let’s start at the bottom:

What is Google trying to accomplish?

All search engines exist to help you find answers to your questions. Remember Ask Jeeves? I do. I remember thinking how easy it would be to type in a full question and get exactly the answer I was looking for without having to use the funky search operators they taught me in the computer lab at school.

Problem was, it didn’t work so well. More often than not, my keywords would get confused with my question phrasing and I’d end up having to type keywords and search operators…into Google.

Sigh. Oh well.

So why was Google such a success? People could (and do) speculate on this ad nauseam, but from a technical perspective, the founders seemed to find the best algorithm to rank search results (and later, to effectively print money by selling ads).

Because we’re talking about computers, and because the language of computers is numbers, computers have to use math in order to do the things that we humans do with our more abstract brains. To solve problems, computers use algorithms. An algorithm (the recipe a program follows) is a mix of formal logic and math that takes information in and spits a result out. It just so happened that Google did this best.

For example, when we type “director of The Matrix” into Google, we don’t want to see pages about, say, the director of a company that specializes in matrix analysis. Ideally, we want a page about the movie The Matrix with the names of its directors, the Wachowskis, listed right where we can easily see them. Or better yet, we’d like the first result to be a page just about the Wachowskis that will inevitably give us information about their involvement with The Matrix.

Did you just think to yourself, “you mean like Wikipedia?”

[Image: Google search results for “director of The Matrix”]

…you’re getting it. Check out #1 on those search results.

Then the folks at Google saw this and thought, “why not cut to the chase and give the answer right on the search page?” That’s what they have been working on for the last several years with features like the Knowledge Graph, the carousel, and more. (In fact, search “cast of The Matrix”: Bam! Carousel.)

These are just more ways to get us the answers we want quicker, and that’s exactly what Google ultimately wants to do. (You know, besides make unprecedented amounts of money on the ad economy they created.)

[Image: Google search results page]

Look at all those answers!

What’s changed this time then, you ask? Short answer: lots. Sure, computers still speak math and humans like me still do not, but by using different values and variables to plug into different equations, search engines like Google are fine-tuning how and why one web page will rank higher than another, even if they offer very similar information.

What sense would it make, then, to use such a strictly logical solution for such an abstract problem? That’s why the search engine game (not just Google) has shifted from ranking websites based on backlinks (essentially, how many links point to a website) to more verbal concepts like co-citation and co-occurrence, while some other core mathematical variables remain.

What’s co-occurrence?

Briefly, co-occurrence is the repeated appearance of two keywords or ideas within close proximity of each other. The key is context.

For example, the word “coke” has several different meanings. It can be the nickname for Coca-Cola, slang for cocaine, or, as your friendly neighborhood industrialist could tell you, a fuel derived from coal. So why does Google assume we mean Coca-Cola when we search for the word “coke”? Co-occurrence may be the answer. Some factors to consider:

  1. Coca-Cola is a giant worldwide company with a ubiquitous product, and “Coke” is its well-known nickname.
  2. Many, many more people drink Coke and talk about Coke than do coke.
  3. Since it’s not 1850, the fuel coke isn’t really used outside of industrial circles (and coal is more common anyway), so this coke probably isn’t talked about too much.

By looking at how the word “coke” is used, we can determine that it is usually associated with Coca-Cola these days, both offline and online. On the internet, the word “Coke” appears alongside key phrases like “Coca-Cola,” “soda,” or “soft drink” millions and millions more times than it does alongside “drugs” or “fuel,” so Google takes a bet that when a user types in “coke,” he or she is looking for Coca-Cola, which is how most people think of the word anyway.
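If you like seeing the idea in code, here is a toy sketch of that bet in Python. To be clear, this is not how Google does it (nobody outside Google knows the details). The handful of "pages," the context words for each meaning, and the five-word window are all made up for illustration.

```python
from collections import Counter

# Made-up snippets standing in for web pages (purely illustrative).
PAGES = [
    "coke is the soda everyone calls coca-cola",
    "grab a cold coke or another soft drink",
    "coca-cola launched a new coke soda flavor",
    "police seized coke and other drugs",
    "steel mills burn coke as fuel",
]

# Hypothetical context words for each meaning of "coke".
SENSES = {
    "Coca-Cola": {"coca-cola", "soda", "soft", "drink"},
    "cocaine": {"drugs"},
    "fuel": {"fuel", "steel"},
}


def cooccurrence_counts(pages, keyword, window=5):
    """Count the words that appear within `window` words of `keyword`."""
    counts = Counter()
    for page in pages:
        words = page.split()
        for i, word in enumerate(words):
            if word != keyword:
                continue
            start, end = max(0, i - window), i + window + 1
            # Neighbors on both sides, excluding the keyword itself.
            counts.update(words[start:i] + words[i + 1:end])
    return counts


def likely_sense(pages, keyword, senses, window=5):
    """Score each meaning by how often its context words sit near `keyword`."""
    counts = cooccurrence_counts(pages, keyword, window)
    scores = {
        sense: sum(counts[w] for w in context)
        for sense, context in senses.items()
    }
    return max(scores, key=scores.get), scores


best, scores = likely_sense(PAGES, "coke", SENSES)
print(best)  # -> Coca-Cola
```

The soda wins simply because its context words show up next to “coke” more often than the others do, which is the whole trick, just at a much smaller scale.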

Another good example is a well-known “Google bomb.” A decade ago, some people (whether as a joke, to make a point, or both) used methods that were effective at the time to make a page about former President George W. Bush rank #1 for the search query “miserable failure.” Google has since gotten wise to the issue and the methods used no longer apply, so the Google bomb no longer works.

However, now we can see co-occurrence in action. Even though a page on George W. Bush alone does not rank for the query “miserable failure,” all of the front-page results are about the infamous Google bomb, and they almost always include (you guessed it) George W. Bush’s name. So in this case, if a user wanted a definition of the phrase “miserable failure” for some reason, they would instead be given a bunch of results on this event, all including two separate but now inextricably related key phrases: “miserable failure” and “George W. Bush.” In a weird way, then, the Google bomb does still work.

And co-citation?

This one is a bit trickier and goes back to search engines’ bread and butter: links.

But the odd thing about co-citation is that it has nothing to do with two pages linking to one another directly. Instead, Google sees two sites as related because other websites link to both of them while talking about the same topic.

This principle wasn’t created by Google, however. In fact, it already existed in the ink-and-paper world. Wikipedia has a pretty good explanation of the concept along with a handy graphic (just imagine each document as a web page):

[Image: Wikipedia’s co-citation diagram]

CPI = Co-citation Proximity Index: the measure of how strongly two documents on a similar subject are related to each other, based on how close together their citations appear on the page
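For the code-curious, here is a rough Python sketch of both ideas: counting how many pages link to the same pair of sites, and giving extra credit when the two links sit close together. The pages, site names, and link positions are invented, and the 1 / (1 + distance) weighting is my own stand-in for "closer counts more," not the actual CPI formula.

```python
from collections import defaultdict
from itertools import combinations

# Made-up linking pages. Each entry lists the sites a page links to and
# the word position where each link appears (hypothetical data).
LINKING_PAGES = {
    "blog-a": [("site-x", 10), ("site-y", 14)],
    "blog-b": [("site-x", 5), ("site-y", 205)],
    "blog-c": [("site-x", 30), ("site-z", 40)],
}


def cocitation_scores(linking_pages):
    """Return (co-citation counts, proximity-weighted scores) per site pair."""
    counts = defaultdict(int)
    proximity = defaultdict(float)
    for links in linking_pages.values():
        for (site_a, pos_a), (site_b, pos_b) in combinations(links, 2):
            if site_a == site_b:
                continue  # two links to the same site are not a co-citation
            pair = tuple(sorted((site_a, site_b)))
            counts[pair] += 1
            # Assumed weighting: links that sit closer together count more.
            proximity[pair] += 1 / (1 + abs(pos_a - pos_b))
    return dict(counts), dict(proximity)


counts, proximity = cocitation_scores(LINKING_PAGES)
print(counts)
print(proximity)
```

Notice that site-x and site-y never link to each other. They end up related only because two other pages mention them together, and the page that mentions them a few words apart contributes far more than the one that mentions them two hundred words apart.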

A great web example is the one Moz founder Rand Fishkin stumbled across last fall: why does a website like Consumer Reports rank so high for the large-volume search phrase “cellphone ratings” when the page never mentions that phrase and barely even mentions the individual words? The answer must be co-citation: that is, multiple websites independently linking to Consumer Reports when addressing the issue of cellphone ratings.
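Here is one way to picture that in Python. Treat it as a guess at the general shape of the idea, not a description of Google's ranking: the link data below is invented, and the rule (credit a site whenever a page that links to it uses every word of the query) is an assumption made for the sake of the example.

```python
from collections import Counter

# Hypothetical data: the text surrounding a link, and the site it points to.
LINKS = [
    ("our favorite cellphone ratings come from this guide", "consumer-reports"),
    ("compare cellphone ratings before you buy", "consumer-reports"),
    ("cellphone ratings and reviews roundup", "some-other-site"),
    ("best lawn mower reviews", "consumer-reports"),
]


def sites_associated_with(query, links):
    """Rank link targets by how many linking pages use every query word."""
    terms = query.lower().split()
    hits = Counter()
    for text, target in links:
        words = text.lower().split()
        if all(term in words for term in terms):
            hits[target] += 1
    return hits.most_common()


ranking = sites_associated_with("cellphone ratings", LINKS)
print(ranking)
```

The target site never has to say “cellphone ratings” itself. The pages pointing at it do the talking.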

So…why these more abstract concepts?

There are many possible answers that we could argue, but my guess would be this: co-occurrence and co-citation are very difficult to manipulate.

When we think about the steps Google in particular has taken in the last few years toward combating spam and even punishing domains that try to game the system, it only makes sense that they would try to find a quantitative method of evaluating content that would give more than one person the keys to the car, so to speak. That’s why you hear so many industry experts, analysts, and even Google itself stress the quality of content. Since Google is now better able to tell the difference between good content and bad content and even punish the creators and propagators of really bad content, it’s essential that webmasters start being honest and make a legitimate effort to make their website a high-quality and authoritative resource.

This seems like an overall positive from a user’s/consumer’s perspective, but what about businesses that are trying so hard to be seen and convert leads? While on one hand businesses should always strive for high quality, on the other, quality is tough. It’s also time-consuming and expensive.

So what should you do?

Make the investment.

Whether that means taking time out of your week to keep your website updated, paying for a high-quality website in the first place, or even incurring a monthly cost for an online marketer, it will all eventually pay off. Why? More people are gaining access to the internet, more people are using the internet, and the internet is becoming more accessible. All the time.

In fact, the internet is often the first place people go for answers now. Not just answers, but communication, news, entertainment, and more. And the fact that the creators of the tools that facilitate these wants and needs want to make that experience better for their users is naturally positive. In Google’s case, that happens to mean that a computer needs to value information more like a human would in order to relate to its human users. How is that a bad thing?

…Or maybe I’m playing right into our future electronic masters’ hands…