Understanding CSS Basics for Web Scraping
Many of the web scraping resources I came across stress the need to understand the basics of HTML; from my research, that advice doesn't seem to extend to CSS. With the benefit of hindsight, I think it's helpful to learn the fundamentals of CSS beyond just what it is.
HTML and CSS are like inseparable besties. When web scraping, we use query languages like XPath and CSS selectors (more about that in a future blog) to describe to the computer how it should find the content we seek. We tell it which characteristics of, or relationships between, HTML/CSS elements to look for. If it helps, you can think of it as using Search in a Word document.
It's likely that when you embark on your first web scraping project, you'll be following web scraping tutorials. I recall being perplexed and struggling to follow the first couple of videos I watched. I had difficulty discerning what's HTML and what's CSS. Explanations in those videos are often scant. Without some foundational understanding, it's also hard to comprehend the documentation for the libraries required in web scraping.
I don't know about you, but when I want to learn something, I like to spend time understanding it quite deeply. I don't like being told how it is without understanding the why or the purpose it serves. Knowing the why and the purpose, I can recall more of what I've learned.
So, in this blog, let's look at the following:
- What is CSS?
- CSS Syntax
- CSS selectors
- How to add CSS to HTML?
What is CSS?
In my last blog, I explained that CSS stands for Cascading Style Sheets. It is a styling language that dresses up and beautifies HTML. With CSS, you can make text bold, change the size and background colour of a table, add borders, and edit pretty much any aspect of a webpage to make it attractive.
Now, a little styling can be achieved using HTML elements (tags) like <font>, <small>, <big> and <center>, and attributes like color, border, etc. But developers steer clear of them unless they want to inflict nightmares on themselves and other poor developers. Let me help you understand why it's a nightmare. Imagine you have typed a draft book of 365 pages and didn't use the Styles feature in Microsoft Word. Then your editor requests that you change the font of all the subheadings. You would have to comb through 365 pages, locating and updating the subheadings one by one. Needless to say, that is PAINFUL! 😫
Using CSS, we can make the change in one location, and the change will cascade through to all the subheadings and that, my friend, is the beauty of CSS!
Oh, by the way, HTML was never intended to contain tags for formatting a web page. With the potential grief that can be inflicted using HTML to style webpages, many of the elements and attributes I mentioned above are no longer supported in HTML5 (the latest version).
CSS Syntax
A CSS rule is made up of a selector, a property and its value.
The selector points to the HTML element you want to style. Each property and its value, known as a property-value pair, form a declaration. A colon separates the property from its value.
You can have multiple declarations. Each declaration is separated from the next with a semicolon. Curly braces enclose the declaration block.
Let me explain the example in the image:
- p is a selector in CSS. It points to the HTML element we want to style, which is the <p> (paragraph) element.
- background-color is a property, and yellow is the property value.
- opacity is a property, and 0.3 is the property value. Opacity can take a value from 0 to 1, where 0 means fully transparent and 1 means fully opaque.
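In case the picture doesn't come through, here is my reconstruction of the kind of rule those bullets describe. It's a minimal sketch built from the description above, not a copy of the original image:

```css
/* Selector: p, which points to every <p> element on the page */
p {
  background-color: yellow; /* declaration 1: property, colon, value, semicolon */
  opacity: 0.3;             /* declaration 2: 0 is fully transparent, 1 is fully opaque */
}
```

Everything between the curly braces is the declaration block.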
CSS Selectors
Before venturing into CSS selectors, I want to clarify that the CSS selectors referred to here differ from the CSS selector mentioned in the introduction. That said, the two are inextricably related. The CSS selector in the introduction instructs the computer how to search for what we want to scrape. The CSS selectors we're about to delve into are the ones illustrated in the CSS syntax above; they define the HTML elements you want to style with CSS. To avoid confusion, I will refer to the CSS selector from the introduction as the CSS locator.
CSS selectors are divided into five categories. The conventions for these selectors also apply to CSS locators, so knowing them will speed up your learning of CSS locators.
The five CSS selector categories are:
- Simple selectors: select elements based on name, id or class
- Combinator selectors: select elements based on a specific relationship between them
- Pseudo-class selectors: select elements based on a certain state
- Pseudo-element selectors: select and style a part of an element
- Attribute selectors: select elements based on an attribute or attribute value
We won't spend time on all five. From what I could gather, understanding simple selectors, combinator selectors and attribute selectors is more than sufficient for web scraping.
Simple Selectors
There are several simple CSS selectors:
- CSS element selector
- CSS id selector
- CSS class selector
- CSS universal selector
CSS Element Selector
The element selector selects HTML elements based on the element name.
Here, all <h1> elements on the page will be centre-aligned with yellow text.
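A rule along these lines would do it. This is my own minimal sketch of the element selector, with sample HTML shown in the comment:

```css
/* Matches every <h1> on the page, e.g. <h1>Welcome</h1> */
h1 {
  text-align: center;
  color: yellow;
}
```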
CSS id Selector
The id selector uses the HTML id attribute to identify a specific element. The id of an element should be unique within a page.
To select an element with a specific id, put a hash (#) character before the element's id.
The CSS colour will be applied to the HTML element with id="special_h1".
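As a rough sketch, an id selector for that element might look like this. The colour value and the sample heading text are my assumptions for illustration:

```css
/* Matches only the element with id="special_h1",
   e.g. <h1 id="special_h1">Today's specials</h1> */
#special_h1 {
  color: blue; /* colour chosen for illustration */
}
```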
CSS Class Selector
The class selector selects elements with a specific class attribute. Unlike an id, a class isn't unique: many elements can share the same class.
You need a period (.) character before the class name to select elements with a specific class.
In the example here, HTML elements with class="right" will be right-aligned with red text.
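Here's a minimal sketch of such a class selector. The sample elements in the comment are made up for illustration:

```css
/* Matches any element with class="right", whatever its tag,
   e.g. <p class="right">...</p> and <h2 class="right">...</h2> */
.right {
  text-align: right;
  color: red;
}
```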
We can also specify that we only want a specific HTML element with a particular class name. In this example, only the text in <p> elements with class="right" will be right-aligned with red text.
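To narrow the match to one kind of element, the element name goes directly in front of the period, with no space. A sketch, again with made-up sample HTML in the comment:

```css
/* Matches <p class="right">...</p>
   but not <h2 class="right">...</h2> */
p.right {
  text-align: right;
  color: red;
}
```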
Something to note: HTML elements can refer to more than one class. The class names are separated by spaces inside the class attribute.
Here, the <p> element will be styled according to both the right class and the small class:
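A sketch of how that could look. The font size is my assumption, picked only to give the small class something to do:

```css
/* HTML: <p class="right small">Fine print</p>
   Both rules below apply, because the element carries both classes. */
.right {
  text-align: right;
  color: red;
}

.small {
  font-size: 12px; /* size chosen for illustration */
}
```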
CSS Universal Selector
The universal selector, represented by the * character, selects all HTML elements on the page.
The CSS rule here will be applied to every HTML element on the page.
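For example, something like the following. The two declarations are placeholders I picked for illustration:

```css
/* Matches every element on the page: <body>, <h1>, <p>, <a>, and so on */
* {
  text-align: center; /* placeholder declarations for illustration */
  color: green;
}
```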
Combinator Selectors
There are four different types:
- Descendant selector (space)
- Child selector (>)
- Adjacent sibling selector (+)
- General sibling selector (~)
Descendant Selector (space)
The descendant selector matches all elements that are descendants of the element specified, however deeply nested. All <h2> elements inside <div> elements are selected in this example. I'd like to draw your attention to the space between the two elements in the syntax because it's easily overlooked.
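A minimal sketch of the descendant selector, with sample HTML in the comment to show what is and isn't matched. The markup and the yellow background are my own illustration:

```css
/* HTML:
   <div>
     <h2>Matched (child of the div)</h2>
     <section>
       <h2>Matched (grandchild of the div)</h2>
     </section>
   </div>
   <h2>Not matched (outside any div)</h2>
*/
div h2 {
  background-color: yellow;
}
```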
Child Selector (>)
The child selector selects all elements that are direct children of the element specified. All the <h2> elements that are direct children of a <div> element will be selected in this example.
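Compared with the descendant selector, the child selector stops at one level down. A sketch with illustrative markup in the comment:

```css
/* HTML:
   <div>
     <h2>Matched (direct child of the div)</h2>
     <section>
       <h2>Not matched (its parent is the section, not the div)</h2>
     </section>
   </div>
*/
div > h2 {
  background-color: yellow;
}
```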
Adjacent Sibling Selector (+)
The adjacent sibling selector selects an element that comes directly after another specific element. Sibling elements must have the same parent, and adjacent means immediately following.
In this example, the <h2> elements placed immediately after <div> elements are selected.
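Here's a sketch of the adjacent sibling selector. The sample elements in the comment all share the same parent and are made up for illustration:

```css
/* HTML:
   <div>Some box</div>
   <h2>Matched (immediately follows the div)</h2>
   <h2>Not matched (follows an h2, not a div)</h2>
*/
div + h2 {
  background-color: yellow;
}
```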
General Sibling Selector (~)
The general sibling selector selects all elements that are later siblings of the element specified, whether or not they immediately follow it. In this example, all <h2> elements that come after a <div> and share its parent are selected.
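A sketch of the general sibling selector. As before, the sample elements in the comment share one parent and are my own illustration:

```css
/* HTML:
   <h2>Not matched (comes before the div)</h2>
   <div>Some box</div>
   <p>Some text</p>
   <h2>Matched (later sibling of the div)</h2>
   <h2>Matched (later sibling of the div)</h2>
*/
div ~ h2 {
  background-color: yellow;
}
```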
Attribute Selectors
Attribute selectors allow you to select an HTML element given a specific attribute or attribute value.
CSS [attribute] Selector
The [attribute] selector selects elements that have the specified attribute.
The following example selects all <a> elements with a target attribute:
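Written out, such a rule could look like this. The links in the comment and the yellow background are assumptions for illustration:

```css
/* Matches <a href="/about" target="_blank">About</a>
   but not <a href="/about">About</a> (no target attribute) */
a[target] {
  background-color: yellow;
}
```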
CSS [attribute="value"] Selector
The [attribute="value"] selector is used to select elements whose attribute has a specific value.
In this example, all the <a> elements with a target value of "empty" will be selected.
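A sketch using the "empty" value from the text above. On real pages you'll more often meet values such as target="_blank", and the selector works the same way:

```css
/* Matches <a href="/about" target="empty">About</a>
   but not <a href="/about" target="_blank">About</a> */
a[target="empty"] {
  background-color: yellow;
}
```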
CSS [attribute~="value"] and CSS [attribute|="value"] Selectors
The [attribute~="value"] selector is used to select elements whose attribute value contains a specific whole word.
In this example, all the elements with a title attribute containing a space-separated list of words that includes "bird" will be selected.
Elements with titles like title="yellow bird", "dark bird", "bower bird" and "noisy bird" will be returned. These won't be returned: "yellow-bird", "dark-bird" and "bird-poo".
For hyphenated values, use [attribute|="value"] instead. It matches when the attribute value is exactly the value or begins with the value followed by a hyphen, so [title|="bird"] returns "bird-poo" but not "yellow-bird" or "dark-bird".
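Side by side, the two selectors behave like this. The border declarations are placeholders of mine; the title values come from the examples above:

```css
/* Whole-word match in a space-separated list.
   Matches title="yellow bird" and title="bird";
   does not match title="yellow-bird" or title="bird-poo" */
[title~="bird"] {
  border: 2px solid yellow;
}

/* Exact value, or value followed by a hyphen.
   Matches title="bird" and title="bird-poo";
   does not match title="yellow-bird" or title="yellow bird" */
[title|="bird"] {
  border: 2px solid red;
}
```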
CSS [attribute^="value"], CSS [attribute$="value"] and CSS [attribute*="value"] Selectors
The [attribute^="value"] selector is used to select elements whose attribute value begins with the specified value.
In this example, all the elements with a class attribute value that starts with "cl" will be selected.
If you want elements whose attribute value ends with a specific value, use [attribute$="value"].
If you want elements whose attribute value contains a specific value anywhere, use [attribute*="value"].
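Here's a sketch of all three. Only the "cl" example comes from the text; the href values and declarations are made up for illustration:

```css
/* Starts with "cl": matches class="clear" and class="cloud nine",
   but not class="big cloud" (the whole attribute value starts with "big") */
[class^="cl"] {
  background-color: yellow;
}

/* Ends with ".pdf": matches <a href="/files/report.pdf">Report</a> */
a[href$=".pdf"] {
  color: red;
}

/* Contains "product" anywhere: matches <a href="/shop/product/42">Item</a> */
a[href*="product"] {
  font-weight: bold;
}
```

These three are handy for scraping because real pages often have long or generated attribute values where only part of the value is predictable.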
How to add CSS to HTML?
There are three ways to insert a style sheet:
- External CSS
- Internal CSS
- Inline CSS
External CSS
With an external style sheet, your styling is coded in a CSS document, so you can change the aesthetic of an entire website just by tweaking that one file!
To add external CSS to an HTML page, include a reference to the external style sheet file in a <link> element inside the head section.
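As a sketch, here is what a small external style sheet might contain, with the matching <link> element shown in the comment. The file name styles.css and the rules are my assumptions:

```css
/* Contents of a hypothetical file named styles.css.
   The page pulls it in from its head section with:
   <link rel="stylesheet" href="styles.css"> */
h1 {
  color: navy;
}

p {
  font-size: 16px;
}
```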
Internal CSS
An internal style sheet is used when one single HTML page has a unique style. The internal style is written inside the <style> element in the head section.
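A sketch of internal CSS. The rules are the same kind you'd put in an external file; the comment shows where they sit in the page. The colours are placeholders:

```css
/* These rules would sit inside the page itself:
   <head>
     <style>
       ...the rules below...
     </style>
   </head> */
body {
  background-color: linen;
}

h1 {
  color: maroon;
}
```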
Inline CSS
Inline CSS can be used when applying a unique style for a single element. To use inline styles, add the style attribute to the relevant element. The style attribute can contain any CSS property.
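With inline CSS there is no selector and there are no curly braces, only the declarations. A sketch, with the inline form in the comment and a style-sheet rule underneath for comparison (heading text and values are illustrative):

```css
/* Inline form, written directly on one element:
   <h1 style="color: blue; text-align: center;">A heading</h1>

   The same two declarations as a style-sheet rule, for comparison.
   Unlike the inline version, this rule applies to every <h1> on the page. */
h1 {
  color: blue;
  text-align: center;
}
```

When scraping, inline styles show up as a style attribute on the element, so they can be targeted with the attribute selectors covered earlier.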
Rightio, I've covered a good amount of what I think will help with web scraping. In your web scraping adventure, if you come across any CSS knowledge that I've not covered but is good to know for scraping information, please share it with me.