Using Screaming Frog and Python to Find Internal Linking Opportunities

Whether you are looking for internal link opportunities for a new piece of content or adding more to an existing piece, this post will help you surface those opportunities in record time.

Problem 1: You just published a new piece of content on a large site. You now need to find mentions of that topic on existing pages so you can create internal links.

Problem 2: You have an existing page that needs more internal links. You need to look through all of the content for pages that mention that topic but do not already link to the page.

Finding internal linking opportunities to a brand new page is the easier problem to solve. You have a few choices:

Using the WordPress admin

If your site runs on WordPress, you can search in the WordPress admin. Click on Pages (or Posts) and in the upper-right corner, enter the words you are hoping to find on the page. You can then click on each page or post that is surfaced and search for the text on that page. Command+F will let you find the words on the page, if they exist. This method grows old as you wait for pages to load in WordPress.

Think of it as chopping down a tree with an ax. It gets the job done, but you know there has to be a better way.

Better: Using Google's "site:" operator

You can head over to Google and use the "site:" search operator. Using it in combination with the phrase you are looking for in quotation marks will surface a list of pages that mention those words.

Example: Searching for site:mysite.com "internal linking" will surface all the pages on your site that Google knows about that contain the words "internal linking".

Think of this method as using a chain saw to cut down a tree.

This method works great for finding link opportunities to brand new pages. But what if you are looking to find internal link opportunities to an existing page that is already well linked? You can try the methods mentioned above, but you quickly run into having to check whether each page already links to your target. That gets old fast.

Real-life scenario

Let's say you are the SEO working on garagegymreviews.com (I really like this site). Boss says, "Hey, we are on page 2 for the phrase 'squat rack'. We'd like our page https://www.garagegymreviews.com/best-squat-racks to do better. Can you find some more internal links to this page?"

You could use the 'site:' method described above, but it is slow and tedious. You will also waste a lot of time looking at urls that contain the words "squat rack" but already link to our target url.

Enter Screaming Frog and Python

I am going to show you a simple way to find internal linking opportunities using a Screaming Frog crawl file and the Python Pandas library.

How it works:

  • Configure Screaming Frog to capture the page contents via custom extraction.
  • Export this custom extraction to a csv file.
  • Use a Jupyter notebook to import the custom extraction csv into a Pandas DataFrame.
  • Use a few lines of Python code to narrow the results to those that contain the target phrase you are looking for and do not contain the url of the page you want to link to.

Setting Up Custom Extraction in Screaming Frog

In Screaming Frog, you can set the crawler to store the content from certain elements on the page. Since modern websites are template-driven, a single css selector path lets the crawler store the main content of every page you crawl.

In this example, we will use the site garagegymreviews.com. Looking at this url: https://www.garagegymreviews.com/best-budget-barbells, we can right-click on some text at the top of the article and select "Inspect Element". I am using the Safari browser, but it works the same in Chrome. Hover over the html until you see the section you want light up with highlighting, then right-click on that element and select Copy -> Selector Path.

Once you have that value copied to your clipboard, go into Screaming Frog and in the menu select Configuration -> Custom -> Extraction. This lands you on the screen where you set up your custom extraction for this task.

This tells SF to grab all the content within the html div that has the class="post-container". Be sure to set it to extract inner html, which will be important later when we search this column for urls.
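
If you want to double-check that the selector actually grabs the article body before kicking off a crawl, you can test it outside of Screaming Frog. This is purely an optional sanity check, a minimal sketch that assumes the class is post-container and that you have the requests, beautifulsoup4, and lxml packages installed:

import requests
from bs4 import BeautifulSoup

# fetch one post and try the same selector we'll give Screaming Frog
html = requests.get('https://www.garagegymreviews.com/best-budget-barbells').text
soup = BeautifulSoup(html, 'lxml')
matches = soup.select('div.post-container')

# we expect one match containing the article text
print(len(matches))
print(matches[0].get_text()[:200] if matches else 'selector not found')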

Now that this is set, you can start the crawl.

Tip: Pause the crawl 30 seconds in and make sure the custom extraction is working as expected. In our case, we see that it is pulling in the html on individual posts, and not pulling in html on other pages like category, about, etc. This is what we want. Once the crawl is done, click on the custom extraction tab and then the export button to save this csv for later use.

Jupyter Notebook

If you have never used a Jupyter notebook before, just know that it is an interface for running Python code one step at a time. You'll need to be familiar with using the terminal on your machine. Here is the guide to installing and running Jupyter.

Once you have Jupyter running, you'll want to import the libraries we'll need.

Note: You don't have to use a Jupyter notebook for this; you could just as easily write this code in a Python file, run it on your machine, and get the same results.

import pandas as pd
import os

Next, we will grab the custom extraction csv from Screaming Frog and use it to create a Pandas DataFrame. The code below creates a variable (crawl) and assigns the DataFrame to that variable.

crawl = pd.read_csv('/Users/jonathanholloway/python-seo/crawls/garage-gym/custom_extraction_all.csv')
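
Before going further, it is worth a quick peek at what actually came in. This is just standard Pandas (nothing specific to Screaming Frog) and shows the column names and the first few rows:

# peek at the column names and the first few rows of the crawl
print(crawl.columns.tolist())
crawl.head()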

Next, we will reduce this DataFrame down to just the columns we need: the address and the custom extraction. The code below creates a new variable (crawl_reduced) and assigns it a new version of the DataFrame that only contains the columns 'Address' and 'Main Content 1'. Note that we named the custom extraction 'Main Content', but in the csv, SF calls the column 'Main Content 1'.

crawl_reduced = crawl[['Address', 'Main Content 1']].copy()  # .copy() avoids Pandas warnings when we modify this DataFrame later

Up next, we want to create a new version of our DataFrame that contains only the records that mention a certain word or phrase AND do not already contain a link to a certain url (the page we want to point links to).

Here is what that looks like:

link_opps = crawl_reduced[
    crawl_reduced['Main Content 1'].str.contains('squat rack', na=False)
    & ~crawl_reduced['Main Content 1'].str.contains('https://www.garagegymreviews.com/best-squat-racks', na=False)
]

This code creates a new variable and assigns it the records from the crawl_reduced DataFrame where the html contains the phrase 'squat rack' but does not contain our target url. The na=False arguments simply tell Pandas to treat rows where the extraction came back empty as non-matches.
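
One optional tweak: str.contains is case-sensitive by default, so a mention like 'Squat Rack' at the start of a sentence would be missed. If you want to catch those too, you can pass case=False on the phrase check (standard Pandas, nothing specific to this workflow):

# match 'squat rack', 'Squat Rack', 'SQUAT RACK', and so on
has_phrase = crawl_reduced['Main Content 1'].str.contains('squat rack', case=False, na=False)
has_link = crawl_reduced['Main Content 1'].str.contains('https://www.garagegymreviews.com/best-squat-racks', na=False)
link_opps = crawl_reduced[has_phrase & ~has_link]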

Finally, we get to export our link opps to a csv file and see the results.

link_opps[['Address']].to_csv('link_opps/squat-racks.csv')
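
Two small things to watch for here: to_csv will fail if the link_opps folder doesn't already exist, and by default it also writes the DataFrame index as an extra column. If either of those trips you up, something like this (using the os module we imported earlier) handles both:

# create the output folder if needed, then write just the urls without the index column
os.makedirs('link_opps', exist_ok=True)
link_opps[['Address']].to_csv('link_opps/squat-racks.csv', index=False)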

Now you have a nice neat csv file with a list of urls that mention your target phrase, but do not link to your target url in the body of the post.

Good but contains false positives

It is not always perfect right away

Every website is different and has quirks, and the site I used in this example is no exception. This site has a nice "Further reading" section which changes every time you load the page. This can result in some false positives in our csv file. For example, maybe when we ran our crawl, the further reading section had text that mentioned 'squat rack', but when you go to the url and Command+F for 'squat rack', you don't see any mentions.

This is because of this site's unique functionality and how the template is coded. In this case, the unwanted section was inside the main content's div, which meant keeping it out of the custom extraction was not possible (I couldn't figure it out).

SPOILER ALERT: I solved this "little problem" and it took me an embarrassingly long time. Though, if you are an in-house SEO working on a portfolio of large sites, then fixing a problem like this is worth it, as it saves time in the long run. It took the csv from 100 urls down to 20. That is a lot of false positives. Chances are, the sites you work on will not have this unique problem.

Bonus: How I fixed these false positives

In the case of garagegymreviews.com, the further reading section is coded inside the main content div. This makes excluding it in our custom extraction impossible (at least for me). There is a solution, though: a Python library called Beautiful Soup.

We will add a function that takes in a parameter (in this case, the html) and parses it with Beautiful Soup. A for loop then finds all the divs with a certain css class and, for each one, removes the div and its contents. Finally, the function returns the cleaned-up html as a string.

from bs4 import BeautifulSoup  # Beautiful Soup needs to be installed, along with the lxml parser

# function that will clean up the html by removing the section we don't want
def cleanup(html):
    # pages with no extracted content come through as NaN, so return an empty string for those
    if pd.isna(html):
        return ''
    soup = BeautifulSoup(html, 'lxml')
    for div in soup.find_all("div", {'class': 'read-further-section'}):
        div.decompose()
    return str(soup)

# run the cleanup function on the column containing all the html
crawl_reduced['Main Content 1'] = crawl_reduced['Main Content 1'].apply(cleanup)
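
With the html cleaned up, the filtering and export are the same as before, just run against the cleaned column. Here it is again so this bonus section stands on its own (same code as above, with the optional index=False):

# re-run the same filter against the cleaned-up html, then export
link_opps = crawl_reduced[
    crawl_reduced['Main Content 1'].str.contains('squat rack', na=False)
    & ~crawl_reduced['Main Content 1'].str.contains('https://www.garagegymreviews.com/best-squat-racks', na=False)
]
link_opps[['Address']].to_csv('link_opps/squat-racks.csv', index=False)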

After adding in this step, the csv contained no false positives, but the opportunities list was much shorter at ~20 urls (hat tip to the garagegymreviews SEO, whoever you are).

That's better

Using your brain

You might be thinking, "what if this gives me a bunch of mentions of 'squat rack' where it doesn't make sense to add a link?". Fortunately for us, this still requires a human brain.

You might also think, "why would I go to all of these pages and link only the words 'squat rack' to my target url?". Hopefully you don't do that. This csv file of opportunities serves only as the jumping-off point. You can look at each page by itself and decide what the best anchor text is. You might want to vary the anchor text using phrases like:

  • best squat racks
  • cheaper squat racks
  • top squat racks
  • squat racks ranked
  • etc

You might also look at the page and see that the context doesn't fit, and skip that page altogether.

You have to look at each sentence containing the words 'squat rack' and imagine how it might be rewritten to best fit the context we are looking for.

Doing this avoids having to shoehorn anchor text into sentences where it doesn't belong. If you do that, you end up with spammy content, which isn't good for anyone these days.

Conclusion

Hopefully these methods save you time in finding internal linking opportunities. Do not be intimidated by the Python code. It's ok if you don't quite understand how it works. This is all basic Python that you can familiarize yourself with over a weekend or two.

Lastly, most sites do not have a related links section that changes on page load. I could have found another site to use as the perfect example for this post, but I kept this one because it shows the real problems you run into in the wild.

Cheers