Web Scraping with JavaScript
Analog Forest 🌳
October 19, 2020
If you try to google "web scraping tutorial", you'll get a bunch of tech articles on the subject that tell you how to achieve the result using Python. The toolkit is pretty standard for these posts: Python 3 (hopefully not 2) as an engine, the requests library for fetching, and Beautiful Soup 4 (which is 6 years old) for parsing.
I've also seen a few articles where they teach you how to parse HTML content with regular expressions. Spoiler: don't do this.
The problem is that I saw articles like this 5 years ago, and this stack has mostly not changed. More importantly, the solution is not native to JavaScript developers. If you would like to use technologies you are more familiar with, such as ES2020, Node, and browser APIs, you will miss the direct guidance.
I've tried to fill the gap and create "the missing doc".
Overview
Check if the data is available in a request
Before you start any programming, always check for the easiest available way. In our case, it would be a direct network request for the data.
Open the developer tools (F12 in most browsers), then switch to the Network tab and reload the page.
If the data is not baked into the HTML (as it isn't in half of modern web applications), there is a good chance that you don't need to scrape and parse at all: you can grab the underlying request directly.
If you are not so lucky and still need to do the scraping, here is a general overview of the process:
- fetch the page with the required data
- extract the data from the page markup to some in-language structure (Object, Array, Set)
- process the data: filter it, transform it to your needs, prepare it for future use
- save the data: write it to a database or dump it to the filesystem
That would be the easiest case for parsing; in more sophisticated ones you can bump into pagination, link navigation, bot protection (captchas), and even real-time site interaction. But none of that will be covered in the current guide, sorry.
Fetching
As an example for this guide, we will scrape goal data for Messi from Transfermarkt. You can check his stats on the site. To load the page from the Node environment, you will need your favorite request library. You could also use the raw HTTP/S module, but it doesn't even support promises, so I've picked node-fetch for this task. Your code will look something like this:
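A minimal sketch with node-fetch (v2, CommonJS); the exact stats URL is illustrative:

```js
const fetch = require('node-fetch');

// illustrative URL for Messi's goal stats on Transfermarkt
const url = 'https://www.transfermarkt.com/lionel-messi/alletore/spieler/28003';

async function fetchPage(url) {
  const response = await fetch(url, {
    // some sites reject requests without a browser-like User-Agent
    headers: { 'User-Agent': 'Mozilla/5.0' },
  });
  return response.text(); // the raw HTML as a string
}
```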
Tools for parsing
There are two major alternatives for this task, conveniently represented by two high-quality, most-starred, and actively maintained libraries.
The first approach is to build a syntax tree from the markup text and then navigate it with familiar browser-like syntax. This one is fully covered by cheerio, which declares itself "jQuery for the server" (IMO, they need to revise their marketing vibes for 2020).
The second way is to build the whole browser DOM, but without the browser itself. We can do this with the wonderful jsdom, which is a Node.js implementation of many web standards.
Let's take a closer look at both of them.
cheerio
Despite the analogy, cheerio doesn't have jQuery in its dependencies; it just reimplements the most-known methods from scratch.
Basic usage is really easy: you load the HTML, and that's it, you can use jQuery selectors and methods right away.
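A minimal sketch of that flow (the sample markup is made up):

```js
const cheerio = require('cheerio');

// you load an HTML string
const $ = cheerio.load('<h2 class="title">Hello world</h2>');

// done, now you can use jQuery-like selectors/methods
console.log($('h2.title').text()); // "Hello world"
```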
You can probably pick this one if you need to save on size (cheerio is lightweight and fast) or if you are really familiar with jQuery syntax and for some reason want to bring it to your new project. Cheerio is a nice way to do any kind of work with HTML that you need in your application.
jsdom
This one is a bit more complicated: it tries to emulate the part of a whole browser that works with HTML and JS (apart from rendering the result). It's used heavily for testing and, well, scraping.
Let's spin up jsdom: you need to use the constructor with your HTML, and then you can access the standard browser API.
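A minimal sketch (again with made-up markup):

```js
const { JSDOM } = require('jsdom');

// you need to use the constructor with your HTML
const dom = new JSDOM('<h2 class="title">Hello world</h2>');

// then you can access the standard browser API
console.log(dom.window.document.querySelector('h2.title').textContent); // "Hello world"
```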
jsdom is a lot heavier, and it does a lot more work. You should understand why you would choose it over the alternatives.
Parsing
In our example, I want to stick with jsdom. It will help us show one last approach at the end of the article. The parsing part is really vital but very short.
So we'll start with building a DOM from the fetched HTML:
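Continuing the sketch, reusing fetchPage from the fetching step:

```js
const { JSDOM } = require('jsdom');

// inside an async function, continuing from the fetching step
const html = await fetchPage(url);
const { document } = new JSDOM(html).window;
```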
Then you can select the table content with a CSS selector and the browser API. Don't forget to create a real array from the NodeList that querySelectorAll returns.
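A sketch of that extraction; the #yw1 selector for the Transfermarkt stats table is an assumption and may need adjusting:

```js
// querySelectorAll returns a NodeList; spread it into a real Array
const rows = [...document.querySelectorAll('#yw1 table tbody tr')]
  .map((row) => [...row.cells].map((cell) => cell.textContent.trim()));
```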
Now you have a two-dimensional array to work with. This part is finished; now you need to process this data to get clean, ready-to-work-with stats.
Processing
First, let's check the lengths of our rows. Each row is a stat about a goal, and we mostly don't care how many rows there are. But each row can contain a different number of cells, so we have to deal with that.
We map over the rows and get the lengths, then deduplicate the results to see what options we have here.
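One way to do that, as a sketch:

```js
// collect the distinct row lengths
const lengths = [...new Set(rows.map((row) => row.length))];
console.log(lengths); // [1, 5, 14, 15] in this case
```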
Not that bad: only 4 different shapes, with 1, 5, 14, and 15 cells.
Since we don't need the rank data from the extra cell in the 15-cell case, it is safe to delete it.
A row with only one cell is actually useless: it is just the name of the season, so we will skip it.
For the 5-cell case (when the player scored several goals in one match), we need to find the previous full row and use its data for the empty stats.
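A sketch that applies those three rules; the positions of the rank cell and the shared match cells are assumptions:

```js
let lastFullRow = null;

const cleaned = rows
  .filter((row) => row.length > 1) // drop the one-cell season-name rows
  .map((row) => {
    // assumption: the extra cell in the 15-cell shape is the rank and comes first
    let cells = row.length === 15 ? row.slice(1) : row;

    if (cells.length === 5) {
      // several goals in one match: borrow the shared match data
      // from the previous full row (cell positions are an assumption)
      cells = [...lastFullRow.slice(0, 9), ...cells];
    } else {
      lastFullRow = cells;
    }

    return cells;
  });
```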
Now we just have to manually map the data to keys; nothing scientific here, and no smart way to avoid it.
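Something like the following, where the key names and cell indexes are invented for illustration:

```js
// key names and cell indexes are invented for illustration
const stats = cleaned.map((cells) => ({
  competition: cells[0],
  matchday: cells[1],
  date: cells[2],
  opponent: cells[7],
  minute: cells[12],
}));
```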
Saving
We will just dump our result to a file, converting it to a string first with the JSON.stringify method.
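A sketch (the file name is arbitrary):

```js
const fs = require('fs');

// pretty-print with two-space indentation and write to disk
fs.writeFileSync('goals.json', JSON.stringify(stats, null, 2));
```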
Bonus: One-time parsing with a snippet
Since we used jsdom with its browser-compatible API, we actually don't need any Node environment to parse the data. If we only need the data once from a particular page, we can just run some code in the Console tab of your browser's developer tools. Try opening any player stats page on Transfermarkt and pasting this giant non-readable snippet into the console:
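The original snippet was minified; here is a readable sketch of what it does, with the same assumed selector as above:

```js
// run in the Console on a player's stats page
// (readable version; the original snippet was minified)
var rows = [...document.querySelectorAll('#yw1 table tbody tr')]
  .map((row) => [...row.cells].map((cell) => cell.textContent.trim()))
  .filter((row) => row.length > 1);
```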
And now just apply the magic copy function that is integrated into the browser devtools. It will copy the data to your clipboard.
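For example:

```js
copy(rows); // devtools-only helper: puts the data on your clipboard
```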
Not that hard, right? And no need to deal with pip anymore. I hope you found this article useful. Stay tuned; next time we will visualize this scraped data with modern JS libs.
You can find the whole script for this article in the following codesandbox:
by Pavel Prokudin. I write about web development and modern technologies. Follow me on Twitter or on Twitch!

Web Scraping with Go and goQuery

Web scraping is essentially parsing the HTML output of a website and taking the parts you want to use for something. In theory, that's a big part of how Google works as a search engine: it goes to every web page it can find and stores a copy locally.
For this tutorial, you should have Go installed and ready to go, as in, your $GOPATH set and the required compiler installed.
Parsing a page with goQuery
goQuery is pretty much like jQuery, just for Go. It gives you easy access to the HTML structure of a page and enables you to pick which elements you want to access by attribute or content.
If you compare the functions, they are very close to jQuery, with .Text() for the text content of an element and .Attr() or .AttrOr() for attribute values.
In order to get started with goQuery, just run the following in your terminal:
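The command was presumably the standard go get:

```sh
go get github.com/PuerkitoBio/goquery
```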
Scraping Links of a Page with golang and goQuery
Now let's create our test project. I did that with the following:
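The original commands are missing from this copy; a plausible setup, given the $GOPATH mention above, would be:

```sh
mkdir -p $GOPATH/src/scraping-example
cd $GOPATH/src/scraping-example
```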
Now we can create the example files for the programs listed below. Usually you shouldn't have multiple main() functions inside one directory, but we'll make an exception, because we're beginners, right?
List all Posts on Blog Page
The following program will list all articles on my blog's front page, composed of their title and a link to the post.
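The original listing was stripped from this copy; here is a sketch of such a program. The blog URL is assumed, and the selector matches the one discussed below:

```go
package main

import (
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// the blog URL is an assumption for this sketch
	res, err := http.Get("https://jonathanmh.com")
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()

	doc, err := goquery.NewDocumentFromReader(res.Body)
	if err != nil {
		log.Fatal(err)
	}

	// every matched entry title contains a link to the post
	doc.Find("#main article .entry-title").Each(func(index int, item *goquery.Selection) {
		title := item.Text()
		linkTag := item.Find("a")
		link, _ := linkTag.Attr("href")
		log.Printf("Post #%d: %s - %s", index, title, link)
	})
}
```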
Since we're using .Each() we also get a numeric index, which starts at 0 and goes as far as there are elements matching the selector #main article .entry-title on the page.
If you come from a language where functions can't have multiple return values, look at this for a second: link, _ := linkTag.Attr("href"). If we defined a name instead of _ and called it something like present, we could test whether the attribute is actually set.
The output of the above program should be something like the following:
Scrape all Links on the Page with Go
Scraping all links on a page doesn't look much different, to be honest; we just use a more general selector, body a, and run the logging for each of the links. I'm getting the content of the respective <a> tag by using linkText := linkTag.Text().
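Again as a sketch, since the original listing is missing:

```go
// inside main(), after building doc as in the previous example
doc.Find("body a").Each(func(index int, linkTag *goquery.Selection) {
	linkText := linkTag.Text()
	link, _ := linkTag.Attr("href")
	log.Printf("Link #%d: '%s' -> %s", index, linkText, link)
})
```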
The output of the above code should be something like:
Now we know how to get all links from a page, including their link text! That would probably be pretty useful to a bunch of SEO or analytics people, because it shows the context in which another website is linked, and what kind of keywords it should be associated with.
Get Title and Meta Data with Golang scraping
Lastly, we should cover what we typically don't select with jQuery a lot: the page title and the meta description.
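A sketch of that selection; the fallback string is an assumption:

```go
// inside main(), after building doc as in the previous examples
pageTitle := doc.Find("title").Contents().Text()
// the fallback value is an assumption for this sketch
pageDescription := doc.Find("meta[name=description]").AttrOr("content", "no description found")
log.Printf("Page Title: %s", pageTitle)
log.Printf("Page Description: %s", pageDescription)
```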
This should yield:
Now, what's a little bit different in the above example is that we're using AttrOr(value, fallback_value) to be sure we have data at all. This is kind of a shorthand instead of writing a check for whether an attribute was found or not.
For the title we can just plainly select the contents of a *Selection, because it's typically the only tag of its kind on a website: pageTitle := doc.Find("title").Contents().Text().
Summary
Go is still pretty new to me, but it's getting more and more familiar. Some of the things the compiler worries about make me rethink how I think about code in general, which is a great thing. In terms of libraries, goQuery is very awesome, and I want to thank the author for providing such a powerful parsing library that is so incredibly easy to use.
Do you do web scraping / crawling? What do you use it for? Did you like the post or do you have some suggestions? Let me know in the comments!
Thank you for reading! If you have any comments, additions or questions, please leave them in the form below! You can also tweet them at me.
If you want to read more like this, follow me on Feedly or other RSS readers.