Internal Link Reporting for Key Pages

This post will help you make some sense of what your internal links look like within the content of your site.

What we will be doing:

  • Configure Screaming Frog
  • Crawl entire site with screaming frog
  • Clean up crawl data
  • Defining a list of key pages you want to report on.
  • Export all inlinks from screaming frog to a csv file
  • Use python and the pandas library to get a list of what pages contains links to the key pages in the content, and what anchor text is being used.
  • Use python and the pandas library to get a list of pages that do not contain a link to any of the key pages, but meet criteria to potentially contain an internal link

Today's guinea pig is BestReviews.com

The reason I chose this site for this example is that it has a lot of different categories and pages. It's also an affiliate site, which is what I am most familiar with. It has also been on my radar as it's ranking surged to amazing heights, only to be struck back down by the product reviews algo update.

Why you need to do this type of reporting

When a site gets large, it is easy to lose sight of the internal linking strategy you started out with. You have to check in on things to make sure key pages are being linked to enough from within the content. It is also good for checking in on how the current links look, what anchor text is being used, and if the context is correct.

You can also use this type of reporting to surface pages that do not contain links to key pages, but otherwise fit the requirements for being able to link to your key pages.

I can imagine a scenario where hundreds of new pages are being published each month or quarter, and you want to check in on how the team is doing with internal linking of new and existing pages.

Boot up Screaming Frog and configure it so it isn't crawling and storing css, javascript, or images. Make sure link position is turned on and configured to be accurate for site you are crawling. For our case, I added custom link positions called "ToC" and "H1SelfLink" in order to filter out the jump links present on many pages. Some simple xpath can be used to label these elements of the page.

We will be using link position to narrow down our data to only contain links in the main content of the page. Do a sample crawl and see what internal links are pulled in on the page. I like to exclude sections where the page links to itself via anchor links.

Cleaning up the crawl file

Once the crawl is complete, remove all the rows for pages blocked by robots.txt and pages that are not a 200 status code. You can do this by sorting by status, highlighting all the pages you don't need, and right clicking and then clicking "Remove".

Defining a list of key pages to report on

For this example, we will imagine a scenario where we are looking at pages in a certain category that are within striking distance of page 1. For BestReviews.com that could be pages in their Toys and Games category that are currently ranking in positions 11-20. We can use ahrefs to get this data.

Organic keywords, filtered by positions 11-20, and url containing "toys-games"

From here we can quickly build a list of urls we want to target in our report. In ahrefs, we can export 25 rows to excel, then clean them up by removing the unnecessary columns and removing duplicates. This leaves us with:

So now that we have our key pages, we are ready to fire up a Jupyter notebook and start filtering our crawl file.

Start with the imports

import pandas as pd
import openpyxl as openpyxl

Next we will import our csv file as a pandas dataframe. This csv file was generated by highlighting all the valid urls in SF, clicking the inlinks tab, then exporting.

# load the inlinks export from screaming frog into a pandas data frame
df1 = pd.read_csv('inlinks.csv')

Next, we will define our list of key pages.

#list of key pages to pull internal link anchor text for

key_pages = ['https://bestreviews.com/toys-games/theme-toys/best-naruto-toys',
'https://bestreviews.com/toys-games/educational/best-marble-run-toys',
'https://bestreviews.com/toys-games/train-sets/best-train-sets-for-adults',
'https://bestreviews.com/toys-games/electronic/best-little-live-pets',
'https://bestreviews.com/toys-games/arts-crafts/best-model-car-kits',
'https://bestreviews.com/toys-games/table-games/best-mini-claw-machines',
'https://bestreviews.com/toys-games/stuffed-animals/best-furreal-friends',
'https://bestreviews.com/toys-games/educational/best-osmo-toys',
'https://bestreviews.com/toys-games/building-sets/best-plus-plus-blocks',
'https://bestreviews.com/toys-games/arts-crafts/best-kinetic-sand-sets',
'https://bestreviews.com/toys-games/theme-toys/best-alvin-and-the-chipmunks-toys',
'https://bestreviews.com/toys-games/building-sets/best-knex-sets',
'https://bestreviews.com/toys-games/arts-crafts/best-water-doodle-drawing-mats',
'https://bestreviews.com/toys-games/outdoors/best-nerf-bow-and-arrow-sets',
'https://bestreviews.com/toys-games/stuffed-animals/best-giant-teddy-bears',
'https://bestreviews.com/toys-games/table-games/best-card-shufflers',
'https://bestreviews.com/toys-games/theme-toys/best-dora-the-explorer-toys',
'https://bestreviews.com/toys-games/board-games/best-mahjong-sets',
'https://bestreviews.com/toys-games/electronic/best-tamagotchi-toys']

Next, we will add a column to the datafile to label which key page the internal link points to. This makes the reports easier to understand.

# add a column to label which key page the internal link points to. Probably a better way to do this.

df1.loc[df1.To == "https://bestreviews.com/toys-games/theme-toys/best-naruto-toys", "Key Page"] = "Best Naruto Toys"
df1.loc[df1.To == "https://bestreviews.com/toys-games/educational/best-marble-run-toys", "Key Page"] = "Best Marble Run Toys"
df1.loc[df1.To == "https://bestreviews.com/toys-games/train-sets/best-train-sets-for-adults", "Key Page"] = "Best Train Sets for Adults"
df1.loc[df1.To == "https://bestreviews.com/toys-games/electronic/best-little-live-pets", "Key Page"] = "Best Little Live Pets"
df1.loc[df1.To == "https://bestreviews.com/toys-games/arts-crafts/best-model-car-kits", "Key Page"] = "Best Model Car Kits"
df1.loc[df1.To == "https://bestreviews.com/toys-games/table-games/best-mini-claw-machines", "Key Page"] = "Best Mini Claw Machines"
df1.loc[df1.To == "https://bestreviews.com/toys-games/stuffed-animals/best-furreal-friends", "Key Page"] = "Buest Furreal Friends"
df1.loc[df1.To == "https://bestreviews.com/toys-games/educational/best-osmo-toys", "Key Page"] = "Best Osmo Toys"
df1.loc[df1.To == "https://bestreviews.com/toys-games/building-sets/best-plus-plus-blocks", "Key Page"] = "Best Plus Plus Blocks"
df1.loc[df1.To == "https://bestreviews.com/toys-games/arts-crafts/best-kinetic-sand-sets", "Key Page"] = "Best Kinetic Sand Sets"
df1.loc[df1.To == "https://bestreviews.com/toys-games/theme-toys/best-alvin-and-the-chipmunks-toys", "Key Page"] = "Best Chipmunks Toys"
df1.loc[df1.To == "https://bestreviews.com/toys-games/building-sets/best-knex-sets", "Key Page"] = "Best Knex Sets"
df1.loc[df1.To == "https://bestreviews.com/toys-games/arts-crafts/best-water-doodle-drawing-mats", "Key Page"] = "Best Water Doodle Mats"
df1.loc[df1.To == "https://bestreviews.com/toys-games/outdoors/best-nerf-bow-and-arrow-sets", "Key Page"] = "Best Nerf Bow & Arrows"
df1.loc[df1.To == "https://bestreviews.com/toys-games/stuffed-animals/best-giant-teddy-bears", "Key Page"] = "Best Giant Teddy Bears"
df1.loc[df1.To == "https://bestreviews.com/toys-games/table-games/best-card-shufflers", "Key Page"] = "Best Card Shuffler"
df1.loc[df1.To == "https://bestreviews.com/toys-games/theme-toys/best-dora-the-explorer-toys", "Key Page"] = "Best Dora Explorer Toys"
df1.loc[df1.To == "https://bestreviews.com/toys-games/board-games/best-mahjong-sets", "Key Page"] = "Best Mahjob Sets"
df1.loc[df1.To == "https://bestreviews.com/toys-games/electronic/best-tamagotchi-toys", "Key Page"] = "Best Tomagotchi Toys"

And finally, we will write some code to create a new data frame with just the url and anchor text for links pointing to key pages from within the content. We also rename one of the columns for easier comprehension in later reports.

# create new data table to show just the from URL and anchor text for links pointing to key pages from within the content.

anchor_text = df1.loc[(df1['To'].isin(key_pages)) & (df1['Link Position'] == 'Content'), ['Key Page', 'From', 'Anchor Text']]
# change column name From to Has Link On, for better understanding when viewing later.
anchor_text.rename(columns={"From": "Has Link On"}, inplace = True)

Once this is done, we write everything to an excel file.

with pd.ExcelWriter('output.xlsx') as writer:  
     anchor_text.to_excel(writer, sheet_name='Key Page Anchors')

From here, you can open up the excel file and view the results. If you notice anything weird, you can filter and delete certain urls. In our case, we had urls from a staging site pulled in somehow.

Example of output

Now that we have this data, we can add some pivot tables to better understand it.

Quick look at anchors to these key pages
A count of how many links each key page has from within content

From here you can make decisions on whether or not these key pages are lacking internal links.

You can alter these reports, what data you present, and more based on your reporting needs. Some other examples I have tried are creating extre sheets that contain urls that:

  • Have at least one link to a key page in the content (good for maybe adding others alongside it)
  • Have no links to any key pages (better for smaller sites where categories are closely related).

Cheers