Today we will play around with web scraping in F#. Web scraping is useful for gathering and processing data from the internet. In this blog post, we will navigate through a fictional bookstore website, extract the author of every book listed, count how many books each author has, and find out who wrote the most.
Let's pretend our fictional bookstore site uses pagination: a list of page numbers, usually found at the bottom of the page, where clicking a number takes you to the page with that number. Each page also shows a list of books, and clicking a book takes you to a page with more information about it, including the author's name.
Looking at the page we can identify two important elements:
- list of books
- pagination
Web scraping in dev tools
Before we go into the code, we can quickly make sure we get the right elements by using the dev console and querySelector.
If you hover over the console results, your queried element should be highlighted.
F# Implementation
Note: This will be done in an .fsx file that can be run with dotnet fsi, or in VS Code if you have the Ionide extension.
Now let's write some code
We will start by adding the FSharp.Data NuGet package, which will help us with web scraping:
#r "nuget: FSharp.Data, 5.0.2"
open FSharp.Data
Let's get the first page:
let url = "https://www.fictional-bookstore.com/page/1/"
let page = HtmlDocument.Load(url)
Because we will iterate over all of the pages, let's make sure we get the last page number:
let lastPageNumber =
    page.CssSelect(".foo")
    // HasClass takes the bare class name, without the leading dot used in CSS selectors
    |> Seq.filter (fun elem -> elem.HasClass("next") |> not)
    |> Seq.last
    |> fun elem -> elem.DirectInnerText().Trim()
    |> int
Using the CssSelect method we select the pagination elements with the class name 'foo', which include the numbers for each page. The pagination also includes a "Next" button, so we filter that out and take the last number.
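To try this step without hitting a real site, here is a minimal sketch that runs the same pipeline against an inline HTML string. The markup is made up for illustration (the class names 'foo' and 'next' follow the description above, everything else is an assumption), and it relies on HtmlDocument.Parse to build the document from a string.

```fsharp
#r "nuget: FSharp.Data, 5.0.2"
open FSharp.Data

// Hypothetical pagination markup; a real site's structure will differ.
let sampleHtml = """
<html><body>
  <ul>
    <li><a class="foo" href="/page/1/">1</a></li>
    <li><a class="foo" href="/page/2/">2</a></li>
    <li><a class="foo" href="/page/3/">3</a></li>
    <li><a class="foo next" href="/page/2/">Next</a></li>
  </ul>
</body></html>"""

let samplePage = HtmlDocument.Parse(sampleHtml)

let sampleLastPage =
    samplePage.CssSelect(".foo")
    // Drop the "Next" button; HasClass expects the class name without a dot
    |> Seq.filter (fun elem -> not (elem.HasClass("next")))
    |> Seq.last
    |> fun elem -> elem.DirectInnerText().Trim()
    |> int

printfn "Last page: %d" sampleLastPage
```

If the filter did not remove the "Next" element, the last item would be the text "Next" and the conversion to int would throw, so this is a quick way to check the selector logic before pointing it at live pages.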
Let's create a function that will get us the name from the book page:
let nameFromBookpage (link: string) : string =
    link
    |> HtmlDocument.Load
    |> fun page -> page.CssSelect(".book-author")
    |> Seq.head
    |> fun post -> post.Descendants "a" |> Seq.head
    |> fun element -> element.DirectInnerText().Trim()
The function takes a link as its parameter. This link goes directly to the book's page, because that's where the author's name is. The name is wrapped in a clickable element, so we find the a tag and extract its text.
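Note that Seq.head throws if the sequence is empty, so a book page without the expected markup would stop the whole run. As a rough sketch of a more forgiving variant, the helper below (tryAuthorName is a hypothetical name, not part of the original code) returns an option instead. It works on an already loaded HtmlDocument, and the sample markup is invented for the demonstration.

```fsharp
#r "nuget: FSharp.Data, 5.0.2"
open FSharp.Data

// Hypothetical helper: returns None instead of throwing when
// the author element or its link is missing.
let tryAuthorName (bookPage: HtmlDocument) : string option =
    bookPage.CssSelect(".book-author")
    |> List.tryHead
    |> Option.bind (fun node -> node.Descendants "a" |> Seq.tryHead)
    |> Option.map (fun link -> link.DirectInnerText().Trim())

// Made-up book pages, one with an author link and one without.
let withAuthor =
    HtmlDocument.Parse("""<html><body><div class="book-author">by <a href="/authors/jane-smith/">Jane Smith</a></div></body></html>""")

let withoutAuthor =
    HtmlDocument.Parse("""<html><body><p>No author listed</p></body></html>""")

printfn "%A" (tryAuthorName withAuthor)
printfn "%A" (tryAuthorName withoutAuthor)
```

Callers can then decide what to do with a missing author, for example skipping the book with Seq.choose rather than crashing halfway through the scrape.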
Now we are going to make a function that will extract the link to the book:
let linkFromPost (post: HtmlNode) : string =
    post.Descendants "a"
    |> Seq.head
    |> fun html -> html.AttributeValue "href"
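One thing to watch for: href values are often relative (for example "/books/some-title/"), and HtmlDocument.Load needs an absolute address. We don't know which form our fictional site uses, so the following is only a sketch under the assumption that links may be relative. resolveLink is a hypothetical helper, and the book paths are invented.

```fsharp
open System

// Hypothetical helper: resolve an href against the URL of the page it was found on.
// An href that is already absolute is returned unchanged.
let resolveLink (pageUrl: string) (href: string) : string =
    let baseUri = Uri(pageUrl)
    let resolved = Uri(baseUri, href)
    resolved.AbsoluteUri

let relativeExample =
    resolveLink "https://www.fictional-bookstore.com/page/1/" "/books/some-title/"

let absoluteExample =
    resolveLink "https://www.fictional-bookstore.com/page/1/" "https://www.fictional-bookstore.com/books/other-title/"

printfn "%s" relativeExample
printfn "%s" absoluteExample
```

If the site does use relative links, the result of linkFromPost could be passed through a helper like this before it reaches nameFromBookpage.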
Now that we have all these helpful functions, let's write another function! This time for getting the list of names from the page:
let listNamesFromPage (page: HtmlDocument) : string list =
    page
    |> fun page -> page.CssSelect(".list-books")
    |> Seq.head
    |> fun main -> main.CssSelect(".book-item")
    |> List.map (fun post -> linkFromPost post |> nameFromBookpage)
The page parameter contains the list of book summaries, each with a link to the book's own page. Using the CssSelect method we extract the element containing the list of book summaries, then iterate over that list to get the link for each book, and finally get the author's name using the link.
We now have the ability to get a list of names from a page, but we still need to group them by author, count them, and make sure we collect the authors from every page, not just the first.
Let's do it:
let numbers = Seq.init lastPageNumber (fun i -> i + 1)
let collectionAuthors =
    numbers
    |> Seq.map (fun index ->
        HtmlDocument.Load($"https://www.fictional-bookstore.com/page/{index}/")
        |> listNamesFromPage)
    |> Seq.collect id
    |> Seq.countBy id
    |> Seq.sortByDescending snd
We iterate over the page numbers from 1 up to the last page, and in each iteration we use the index to load that page and get its list of names. We concatenate all of these lists into one sequence of names, which can contain duplicates! We then use countBy to group identical names together, giving us tuples of the author and how many times that author appeared. Lastly, we use sortByDescending to order the result so that the author with the most books comes first.
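The counting and sorting part can be tried on its own, without any network access. Here is a small sketch with a hard-coded list of names (the data is made up purely for illustration):

```fsharp
// Made-up names standing in for the scraped authors.
let sampleNames =
    [ "John Smith"; "Jane Smith"; "John Smith"
      "James Bond"; "John Smith"; "Jane Smith" ]

let sampleCounts =
    sampleNames
    |> Seq.countBy id            // pairs of (name, number of occurrences)
    |> Seq.sortByDescending snd  // highest count first
    |> Seq.toList

printfn "%A" sampleCounts
// [("John Smith", 3); ("Jane Smith", 2); ("James Bond", 1)]
```

Passing id as the key function means each name is its own grouping key, and snd picks the count out of each tuple for sorting.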
Once you run it, there will be a bit of delay as it searches all the pages but you should end up with this:
(John Smith, 73)
(Jane Smith, 31)
(James Bond, 28)
Conclusion
As you can see, web scraping can be a great tool for gathering information and learning from it 🙂