This blog post explores the development of a powerful On-Page SEO Analyzer built with Streamlit. This tool helps you quickly gain insights into the SEO health of a website by extracting and displaying key on-page factors and performing broken link analysis from sitemaps.
What We'll Cover:
Understanding the code structure and core functionalities.
Exploring each function in detail.
The Streamlit app setup and its interactive features.
Example usage scenarios.
Code Breakdown
1. Import Statements
streamlit: The foundation of our application, providing the web framework that lets us create interactive dashboards.
cloudscraper: Lets us fetch pages from sites behind Cloudflare's anti-bot protection, which blocks plain HTTP clients.
BeautifulSoup: Powerful for parsing HTML code, turning messy raw HTML into a structured format we can work with.
json: The library to handle JSON (JavaScript Object Notation) data, often used for structured data like schema markup.
requests: Enables us to make HTTP requests to fetch data from websites.
csv: For reading and writing CSV files, useful for managing URLs in this tool.
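Collected at the top of the script, those imports look roughly like this (the comment annotations are mine; the non-stdlib packages are assumed to be installed via pip):

```python
import csv                      # read/write URL lists and broken-link reports
import json                     # decode JSON-LD schema markup

import cloudscraper             # fetch pages behind Cloudflare protection
import requests                 # plain HTTP requests for sitemaps and status checks
import streamlit as st          # the interactive web UI
from bs4 import BeautifulSoup   # turn raw HTML into a searchable tree
```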
2. fetch_and_parse_html(url)
This function takes a URL as input and attempts to:
Fetch the HTML: It uses cloudscraper to get the HTML content from the given URL.
Parse the HTML: The fetched HTML is then passed to BeautifulSoup to parse it and create a structured representation.
Return the Data: If the process is successful, it returns the BeautifulSoup object. If there's an error, it displays an error message in Streamlit and returns None.
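A minimal sketch of this function, assuming cloudscraper is installed (the sketch falls back to a plain requests session if it is not, reports errors with print instead of st.error so it also runs outside Streamlit, and the 10-second timeout is my choice):

```python
from bs4 import BeautifulSoup

try:
    import cloudscraper                       # assumption: cloudscraper is available
    _session = cloudscraper.create_scraper()  # requests-compatible session that handles Cloudflare
except ImportError:
    import requests                           # fallback for environments without cloudscraper
    _session = requests.Session()

def fetch_and_parse_html(url):
    """Fetch a page and return a BeautifulSoup tree, or None on any failure."""
    try:
        response = _session.get(url, timeout=10)
        response.raise_for_status()           # treat 4xx/5xx responses as failures
        return BeautifulSoup(response.text, "html.parser")
    except Exception as exc:
        # In the Streamlit app this would be surfaced with st.error(...).
        print(f"Could not fetch {url}: {exc}")
        return None
```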
3. extract_meta_data(soup)
This function focuses on extracting crucial meta tags from the HTML.
metatitle: Retrieves the title of the webpage (found in the <title> tag).
metadescription: Fetches the content of the meta description tag (<meta name="description">).
robots_directives: Extracts the robots meta tag, which gives crawling instructions to search engines (such as whether to index the page or follow its links).
viewport: Extracts the meta viewport tag for responsiveness.
charset: Retrieves the charset used for the page.
html_language: Gets the language specified in the <html> tag.
If any of these elements are not found, it returns a default "Not Found" message.
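Here is one way such a function could look, using only BeautifulSoup; the dictionary keys follow the names in the post, and the exact fallback handling is a sketch:

```python
from bs4 import BeautifulSoup

def extract_meta_data(soup):
    """Collect key meta tags, defaulting to 'Not Found' for anything missing."""
    title = soup.find("title")
    description = soup.find("meta", attrs={"name": "description"})
    robots = soup.find("meta", attrs={"name": "robots"})
    viewport = soup.find("meta", attrs={"name": "viewport"})
    charset = soup.find("meta", charset=True)   # matches any <meta charset="...">
    html_tag = soup.find("html")
    return {
        "metatitle": title.get_text(strip=True) if title else "Not Found",
        "metadescription": description.get("content", "Not Found") if description else "Not Found",
        "robots_directives": robots.get("content", "Not Found") if robots else "Not Found",
        "viewport": viewport.get("content", "Not Found") if viewport else "Not Found",
        "charset": charset["charset"] if charset else "Not Found",
        "html_language": html_tag.get("lang", "Not Found") if html_tag else "Not Found",
    }
```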
4. extract_alternates_and_canonicals(soup)
This function targets essential elements for web crawlers and search engines.
canonical: Extracts the URL from the canonical link tag (<link rel="canonical" ...>), which tells search engines which URL should be treated as the main version of the page.
hreflangs: Fetches a list of hreflang tags. These specify language variations of a page and are important for international SEO.
mobile_alternate: Retrieves the URL in the mobile alternate link tag (<link rel="alternate" media="...">), which points crawlers to a separate version of the page optimized for mobile devices.
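A sketch of the extraction, assuming hreflang variants and mobile alternates are declared with <link> tags in the page head as described above:

```python
from bs4 import BeautifulSoup

def extract_alternates_and_canonicals(soup):
    """Pull canonical, hreflang, and mobile-alternate links from the head."""
    canonical = soup.find("link", rel="canonical")
    hreflangs = [
        (link.get("hreflang"), link.get("href"))
        for link in soup.find_all("link", rel="alternate", hreflang=True)
    ]
    # Mobile alternates are rel="alternate" links with a media query attribute.
    mobile = soup.find("link", rel="alternate", media=True)
    return {
        "canonical": canonical["href"] if canonical else "Not Found",
        "hreflangs": hreflangs or "Not Found",
        "mobile_alternate": mobile["href"] if mobile else "Not Found",
    }
```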
5. extract_schema_markup(soup)
Schema markup (like JSON-LD) helps search engines understand the content of a page and can enrich how it appears in search results.
schema_types: Finds and extracts the types of schema used on the page, such as Product, Article, or Event.
breadcrumb_urls: Extracts breadcrumbs if they are present in the schema markup.
breadcrumb_depth: Determines the number of steps in the breadcrumb trail.
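One possible implementation, assuming the schema is embedded in JSON-LD <script> blocks; the breadcrumb handling below covers both plain-string and object-valued item entries, and the depth key is spelled breadcrumb_depth here:

```python
import json
from bs4 import BeautifulSoup

def extract_schema_markup(soup):
    """Collect @type values and breadcrumb URLs from JSON-LD script blocks."""
    schema_types, breadcrumb_urls = [], []
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue                                  # skip malformed blocks
        blocks = data if isinstance(data, list) else [data]
        for block in blocks:
            stype = block.get("@type")
            if stype:
                schema_types.append(stype)
            if stype == "BreadcrumbList":
                for item in block.get("itemListElement", []):
                    url = item.get("item")
                    if isinstance(url, dict):         # item may be an object with @id
                        url = url.get("@id")
                    if url:
                        breadcrumb_urls.append(url)
    return {
        "schema_types": schema_types or "Not Found",
        "breadcrumb_urls": breadcrumb_urls or "Not Found",
        "breadcrumb_depth": len(breadcrumb_urls),
    }
```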
6. extract_content_data(soup, url)
This function analyzes the content on the page.
text_length: Calculates the total number of characters within all paragraph tags (<p>).
h1: Extracts all h1 headings on the page.
headers: Gets all the headings from h1 to h6 along with their respective tag names (h1, h2, etc.).
images: Extracts image URLs and their alternative text (alt) attributes.
internal_links: Finds and extracts all internal links (links to pages within the same domain) along with their text, URL, and "nofollow" status.
external_links: Extracts external links (links to other domains).
number_internal_links: Counts the number of internal links found.
number_external_links: Counts the number of external links found.
contextual_links: Finds links that appear within the body of the text, potentially within paragraph (<p>) tags or headings.
div_links: Identifies links within <div> or <span> tags.
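The whole extraction could be sketched like this; the CSS selectors used for contextual and div/span links are my interpretation of the descriptions above:

```python
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

def extract_content_data(soup, url):
    """Analyze text, headings, images, and link structure of the page."""
    domain = urlparse(url).netloc
    text_length = sum(len(p.get_text()) for p in soup.find_all("p"))
    h1 = [h.get_text(strip=True) for h in soup.find_all("h1")]
    headers = [(h.name, h.get_text(strip=True))
               for h in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])]
    images = [(img.get("src"), img.get("alt", "")) for img in soup.find_all("img")]

    internal_links, external_links = [], []
    for a in soup.find_all("a", href=True):
        href = urljoin(url, a["href"])                 # resolve relative URLs
        nofollow = "nofollow" in (a.get("rel") or [])
        entry = (a.get_text(strip=True), href, nofollow)
        if urlparse(href).netloc == domain:
            internal_links.append(entry)
        else:
            external_links.append(entry)

    # Links inside body text (paragraphs/headings) vs. links inside div/span wrappers.
    contextual_links = [a["href"] for a in
                        soup.select("p a[href], h1 a[href], h2 a[href], h3 a[href]")]
    div_links = [a["href"] for a in soup.select("div > a[href], span > a[href]")]

    return {
        "text_length": text_length,
        "h1": h1,
        "headers": headers,
        "images": images,
        "internal_links": internal_links,
        "external_links": external_links,
        "number_internal_links": len(internal_links),
        "number_external_links": len(external_links),
        "contextual_links": contextual_links,
        "div_links": div_links,
    }
```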
7. extract_open_graph(soup)
Open Graph metadata helps websites optimize how their content is displayed when shared on social media platforms. This function pulls out those Open Graph properties.
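Since Open Graph properties are <meta> tags whose property attribute starts with og:, the extraction can be a short dictionary comprehension (a sketch, not necessarily the post's exact code):

```python
from bs4 import BeautifulSoup

def extract_open_graph(soup):
    """Map each og:* property to its content value."""
    return {
        meta["property"]: meta.get("content", "")
        for meta in soup.find_all("meta", property=True)
        if meta["property"].startswith("og:")
    }
```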
8. display_results(data)
This function displays the extracted data in a clear and structured way.
It takes a dictionary of data as input,
converts lists into strings so they can be shown in a table,
and presents the SEO data in a Streamlit table for easy visualization.
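The list-to-string conversion can live in a small pure helper (flatten_for_table is my name for it), which the Streamlit-facing display_results then feeds to st.table:

```python
def flatten_for_table(data):
    """Turn list values into comma-joined strings so every value is table-friendly."""
    return {
        key: ", ".join(str(item) for item in value) if isinstance(value, list) else str(value)
        for key, value in data.items()
    }

def display_results(data):
    # Imported here so the helper above stays usable without Streamlit installed.
    import streamlit as st
    # Render the flattened dictionary as a two-column (key, value) table.
    st.table(list(flatten_for_table(data).items()))
```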
9. process_sitemap(sitemap_url, max_urls=100)
This function is designed to work with sitemaps, which are XML files that list the pages on a website.
Download Sitemap: Downloads the sitemap XML file using requests.
Extract URLs: Parses the downloaded sitemap to find all the URLs (the <loc> tags) in the sitemap.
Save URLs: Saves the extracted URLs to a file named sitemap.txt.
Check Status Codes: Makes requests to each of the extracted URLs (up to a maximum number specified by max_urls), and checks their HTTP status codes. URLs returning a 200 (OK) status code indicate working pages.
Report Broken Links: Saves any URLs with a non-200 status code to a file named broken_urls.csv, indicating a broken link. The CSV file has two columns: "URL" and "Status Code."
Display Summary: Shows a message informing the user of how many URLs were scanned from the sitemap and how many broken links were discovered.
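Those five steps might be sketched as follows; extract_sitemap_urls is a helper name I introduce, the 10-second timeouts are assumptions, and unreachable URLs are recorded with status 0 so they land in the broken-links report too:

```python
import csv
import requests
from bs4 import BeautifulSoup

def extract_sitemap_urls(xml_text):
    """Pull every <loc> URL out of a sitemap document."""
    soup = BeautifulSoup(xml_text, "html.parser")   # lxml's "xml" parser also works
    return [loc.get_text(strip=True) for loc in soup.find_all("loc")]

def process_sitemap(sitemap_url, max_urls=100):
    """Download a sitemap, save its URLs, and report any non-200 pages."""
    xml_text = requests.get(sitemap_url, timeout=10).text
    urls = extract_sitemap_urls(xml_text)

    with open("sitemap.txt", "w") as f:             # keep a plain-text copy of the URLs
        f.write("\n".join(urls))

    broken = []
    for page_url in urls[:max_urls]:                # cap the number of requests
        try:
            status = requests.get(page_url, timeout=10).status_code
        except requests.RequestException:
            status = 0                              # unreachable counts as broken
        if status != 200:
            broken.append((page_url, status))

    with open("broken_urls.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["URL", "Status Code"])
        writer.writerows(broken)

    # In the app, these counts are shown to the user via Streamlit.
    return len(urls[:max_urls]), len(broken)
```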
10. Streamlit App Setup:
Title: Sets the app title to "On-Page SEO Analyzer".
URL Input: A text input field to let users enter a URL to analyze.
Sitemap Input: An input for the URL of a sitemap if the user wants to check links in the sitemap (optional).
Start Button: When pressed, triggers the SEO audit to analyze the provided URL or process the sitemap.
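Wired together, the bottom of the script could look like this sketch (it assumes the functions discussed above are defined in the same file; the exact widget labels and control flow may differ from the original):

```python
import streamlit as st

st.title("On-Page SEO Analyzer")

url = st.text_input("Enter the URL you want to analyze:")
sitemap_url = st.text_input("Enter the Sitemap URL (optional):")

if st.button("Start SEO Audit"):
    if url:
        soup = fetch_and_parse_html(url)     # defined earlier in the script
        if soup is not None:
            data = {}
            for extractor in (extract_meta_data, extract_alternates_and_canonicals,
                              extract_schema_markup, extract_open_graph):
                data.update(extractor(soup))
            data.update(extract_content_data(soup, url))
            display_results(data)
    if sitemap_url:
        process_sitemap(sitemap_url)         # broken-link report for the sitemap
```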
This is the detailed code explanation for the SEO analyzer tool built with Streamlit. Now, let's illustrate how it would be used!
Example Usage:
Open Streamlit: Run the code (using streamlit run your_filename.py).
Input URL: In the "Enter the URL you want to analyze:" field, enter a URL, for example, https://www.example.com.
Sitemap Input (Optional): If the website has a sitemap, enter the sitemap URL in the "Enter the Sitemap URL (optional):" field (e.g., https://www.example.com/sitemap.xml).
Click "Start SEO Audit": The tool will:
Analyze the given URL and display the extracted SEO data.
Download, process, and report broken links from the sitemap if a sitemap URL was provided.
Print a message about the broken links found and the number of URLs scanned.
This app combines single-page SEO analysis with broken-link checking to help improve the SEO performance of a website. It can be further expanded with features like keyword analysis, image optimization suggestions, or integration with third-party SEO APIs.