Puppeteer Sharp: Crawl the Web using C# and Headless Chrome
Written on April 30th, 2019 by Steven McLintock
Puppeteer Sharp is a port of Puppeteer, the popular headless Chrome Node.js API built by Google. Puppeteer Sharp is written in C# and was released in 2017 by Darío Kondratiuk to offer the same functionality to .NET developers.
Puppeteer Sharp enables a .NET developer to programmatically control, or ‘puppeteer’, the open-source Chromium web browser. A key convenience of the Puppeteer API is that it can drive a headless instance of the browser, which does not display a UI and therefore performs better.
Why use Puppeteer Sharp?
If you are a .NET developer, installing the Puppeteer Sharp NuGet package into your project enables tasks such as:
- Crawling the web using a headless web browser
- Automated testing of a web application using a test framework
To use Puppeteer Sharp in a new or existing .NET project, install the latest version of the NuGet package ‘PuppeteerSharp’.
The first step needed to ‘puppeteer’ a web browser is downloading a revision of Chromium to the local machine. This is the browser that Puppeteer Sharp will use to interact with a website.
Fortunately, we can use C# to download either the default revision, or a revision the developer specifies. The revision will only download if it does not already exist on the local machine.
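As a minimal sketch of the download step, assuming a recent PuppeteerSharp release in which `BrowserFetcher.DownloadAsync()` can be called without arguments, and a .NET version that supports top-level statements. To my knowledge, releases from around the time of writing expected a revision argument such as `BrowserFetcher.DefaultRevision`, so check the signature for the version you install:

```csharp
using System;
using PuppeteerSharp;

// Downloads the default browser build for the current platform.
// If that build is already present locally, it is not downloaded again.
var fetcher = new BrowserFetcher();
await fetcher.DownloadAsync();

Console.WriteLine("Browser download step finished.");
```

Running this once at application start-up is usually enough, since later runs reuse the local copy.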
If the download is successful, the build of the browser required for your operating system will appear in your project directory.
Load a Webpage
First, we will programmatically initiate an instance of the headless web browser, load a new tab and go to ‘https://www.bing.com/maps’:
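A sketch of what this could look like, assuming a recent PuppeteerSharp release (parameterless `DownloadAsync()`, and browser and page objects that support `await using`):

```csharp
using System;
using PuppeteerSharp;

// Make sure a browser build is available locally.
await new BrowserFetcher().DownloadAsync();

// Launch a headless browser, open a new tab and navigate to Bing Maps.
await using var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true });
await using var page = await browser.NewPageAsync();
await page.GoToAsync("https://www.bing.com/maps");

// Print the page title as a quick check that the page loaded.
Console.WriteLine(await page.GetTitleAsync());
```

Setting `Headless = false` instead opens a visible browser window, which can be handy while developing a crawler.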
With the webpage successfully loaded in the headless browser, let’s interact with the webpage by searching for a local tourist attraction:
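The following sketch shows one way to do this. The CSS selector and the search text are hypothetical placeholders of mine, not taken from Bing Maps' real markup, so inspect the page and substitute the selector that matches the search box:

```csharp
using System;
using PuppeteerSharp;

// Hypothetical values: adjust both to the page's real markup and your own query.
const string searchBoxSelector = "input[type='search']";
const string attraction = "CN Tower, Toronto";

await new BrowserFetcher().DownloadAsync();
await using var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true });
await using var page = await browser.NewPageAsync();
await page.GoToAsync("https://www.bing.com/maps");

// Wait until the search box exists, type the query, then submit it with Enter.
await page.WaitForSelectorAsync(searchBoxSelector);
await page.TypeAsync(searchBoxSelector, attraction);
await page.Keyboard.PressAsync("Enter");

Console.WriteLine($"Submitted search for: {attraction}");
```

In a real crawler you would normally follow the submit with another `WaitForSelectorAsync` call that targets an element of the results panel, so that later steps only run once the results have rendered.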
If you would like to store the HTML to parse elements such as the address or description, you can easily store the HTML in a variable:
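For instance, a minimal sketch using `GetContentAsync`, which returns the page's current HTML as a string (again assuming a recent PuppeteerSharp release):

```csharp
using System;
using PuppeteerSharp;

await new BrowserFetcher().DownloadAsync();
await using var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true });
await using var page = await browser.NewPageAsync();
await page.GoToAsync("https://www.bing.com/maps");

// Capture the HTML of the page as it currently stands in the browser.
string html = await page.GetContentAsync();

Console.WriteLine($"Captured {html.Length} characters of HTML.");
```

The resulting string can then be handed to whichever HTML parser you prefer to extract individual elements.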
Once you are finished, close the browser to free up resources:
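One possible shape for this, sketched with an explicit `try`/`finally` so the browser is closed even if navigation throws:

```csharp
using PuppeteerSharp;

await new BrowserFetcher().DownloadAsync();
var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true });

try
{
    var page = await browser.NewPageAsync();
    await page.GoToAsync("https://www.bing.com/maps");
}
finally
{
    // Closes every open tab and shuts down the browser process.
    await browser.CloseAsync();
}
```

Declaring the browser with `await using` achieves a similar effect with less ceremony in releases that support it.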
Screenshots and PDF Documents
One of the benefits of Puppeteer Sharp is the ability to generate screenshots and PDF documents of the current page. This can be particularly useful for debugging, automated testing, or capturing a webpage at a specific resolution.
If you would like to take a screenshot of the current page:
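A short sketch of the screenshot call. The output file name `screenshot.png` is an arbitrary choice for illustration:

```csharp
using System;
using PuppeteerSharp;

await new BrowserFetcher().DownloadAsync();
await using var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true });
await using var page = await browser.NewPageAsync();
await page.GoToAsync("https://www.bing.com/maps");

// Writes a PNG of the current viewport to the working directory.
await page.ScreenshotAsync("screenshot.png");

Console.WriteLine("Screenshot saved to screenshot.png");
```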
Alternatively, to generate a PDF document of the current page:
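Sketched in the same way, with `page.pdf` as an arbitrary output file name. As far as I am aware, PDF generation in Chrome has historically required headless mode, so the browser is launched headless here:

```csharp
using System;
using PuppeteerSharp;

await new BrowserFetcher().DownloadAsync();
await using var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true });
await using var page = await browser.NewPageAsync();
await page.GoToAsync("https://www.bing.com/maps");

// Renders the current page to a PDF file in the working directory.
await page.PdfAsync("page.pdf");

Console.WriteLine("PDF saved to page.pdf");
```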
Change the View Port
If you need to test a webpage at a specific display size, such as to view how the page would appear on a mobile handset, you can use Puppeteer Sharp to change the size of the view port of the current page:
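A sketch of this using `SetViewportAsync`. The 375 x 667 size is simply an example of a phone-like resolution that I picked, and the screenshot at the end is only there to make the effect visible:

```csharp
using System;
using PuppeteerSharp;

await new BrowserFetcher().DownloadAsync();
await using var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true });
await using var page = await browser.NewPageAsync();

// Example dimensions roughly matching a small mobile handset.
await page.SetViewportAsync(new ViewPortOptions
{
    Width = 375,
    Height = 667
});

await page.GoToAsync("https://www.bing.com/maps");

// Capture the page as rendered at the new size.
await page.ScreenshotAsync("mobile.png");
Console.WriteLine("Mobile-sized screenshot saved to mobile.png");
```

Setting the view port before navigating means the page lays itself out at the target size from the start.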
Whilst the functionality discussed thus far is useful to monitor and detect issues related to the user interface of a webpage, a .NET developer may also use Puppeteer Sharp to closely examine any network performance issues.
To accomplish this we can programmatically start and stop a trace log:
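A minimal sketch of tracing a single page load, assuming the `Tracing` API on the page object. The output path `trace.json` is an arbitrary name:

```csharp
using System;
using PuppeteerSharp;

await new BrowserFetcher().DownloadAsync();
await using var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true });
await using var page = await browser.NewPageAsync();

// Start recording before the activity you want to measure.
await page.Tracing.StartAsync(new TracingOptions { Path = "trace.json" });

await page.GoToAsync("https://www.bing.com/maps");

// Stop recording; the trace is written to the path given above.
await page.Tracing.StopAsync();

Console.WriteLine("Trace written to trace.json");
```

The resulting file can be loaded into the Performance panel of Chrome DevTools for inspection.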
If a trace log is not capturing the amount of detail you require in your debugging session, you can programmatically enable Chrome DevTools to yield further insight:
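One way this might look, assuming the `Devtools` launch option is available in the PuppeteerSharp version you are using. Opening DevTools implies a visible (non-headless) browser window, so the sketch waits for a key press before shutting down:

```csharp
using System;
using PuppeteerSharp;

await new BrowserFetcher().DownloadAsync();

// Devtools = true opens the DevTools panel for each tab in a visible browser window.
await using var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Devtools = true });
await using var page = await browser.NewPageAsync();
await page.GoToAsync("https://www.bing.com/maps");

// Keep the browser open so the page can be inspected by hand.
Console.WriteLine("Inspect the page in DevTools, then press Enter to exit.");
Console.ReadLine();
```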
Connect to a Remote Browser
One last feature of Puppeteer Sharp that I would like to mention is the ability to connect to a remote browser. This may be useful if you are using a serverless environment where installing a browser is not an option, such as the scalable ‘Azure Functions’.
One such service that complements this feature is browserless.io:
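A sketch of connecting to a remote browser over WebSocket. The environment variable name `BROWSER_WS_ENDPOINT` is a hypothetical one introduced here for illustration; the actual endpoint URL and any access token come from your remote browser provider:

```csharp
using System;
using PuppeteerSharp;

// Hypothetical variable holding the provider's WebSocket endpoint (wss://...).
var endpoint = Environment.GetEnvironmentVariable("BROWSER_WS_ENDPOINT")
    ?? throw new InvalidOperationException("Set BROWSER_WS_ENDPOINT to the remote browser's WebSocket URL.");

// No local download or launch: attach to a browser that is already running elsewhere.
var browser = await Puppeteer.ConnectAsync(new ConnectOptions
{
    BrowserWSEndpoint = endpoint
});

var page = await browser.NewPageAsync();
await page.GoToAsync("https://www.bing.com/maps");
Console.WriteLine(await page.GetTitleAsync());

// Clean up the tab, then detach from the remote browser.
await page.CloseAsync();
browser.Disconnect();
```

Because nothing is downloaded or launched locally, this pattern suits environments where installing a browser is not possible.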
Contribute to Puppeteer Sharp
If you would like to use this excellent API, be sure to visit puppeteersharp.com. If you would like to contribute to the project, you can find the GitHub repository at github.com/kblok/puppeteer-sharp.
Lastly, if you do use this useful port of the Google Chrome NodeJS API, please show your appreciation by donating to their Open Source Collective.