At Zyte (formerly Scrapinghub) we are known for our ability to help companies make mission-critical business decisions through the use of web scraped data.
But for anyone who enjoys a freshly poured pint of stout, there is one mission-critical question that creates a debate like no other…
“Who serves the best pint of Guinness?”
So with St Patrick's Day quickly approaching, we decided to turn our expertise in large-scale data extraction to answering this mission-critical question.
Although this is a somewhat humorous question, the data extraction and analysis methods used are applicable to numerous high-value business use cases and are used by the world’s leading companies to gain a competitive edge in their respective markets.
In this article, we’re going to explore how to architect a web scraping and data science solution to find the best pint of Guinness in Dublin. Most importantly, we’ll answer the question: which Dublin pub serves the best pint of Guinness?
Step #1 - Identify rich data
Anyone who has ever enjoyed a pint of the black stuff knows that its taste is highly influenced by the skill of the person pouring it and the quality of the equipment they use.
With that in mind, our first task is to identify where we can find web data that contains rich insights into the quality of a pub’s Guinness and where the coverage levels are sufficient for all pubs in Dublin.
After a careful analysis of our options (pub websites, social media, reviews, articles, etc.), we decided customer reviews would be our best option. They provide the best combination of relevant, high-granularity data and coverage to answer this question.
Step #2 - Extract review data
The next step would be to develop a web scraping infrastructure to extract this review data at scale using Scrapy. To do so we’d need to create two separate types of spiders:
We’d also need to run these spiders on a web scraping infrastructure that can reliably extract the review data with no data quality issues. To do so, we’d configure the web scraping infrastructure as follows:
Due to data protection regulations such as GDPR, it is important that the extraction spiders don’t extract any personal information about the customers who submitted the reviews. As a result, the data extraction spiders need to anonymize customer reviews.
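As a rough illustration of the anonymization step, the sketch below drops reviewer-identifying fields from a scraped review before it is stored. The field names in `PERSONAL_FIELDS` and the `anonymize_review` helper are hypothetical, since the article does not describe the actual review schema.

```python
# Hypothetical field names: the real review schema is not given in the article.
PERSONAL_FIELDS = {"author_name", "author_url", "author_location", "avatar_url"}


def anonymize_review(raw_review):
    """Return a copy of the review without fields that identify the reviewer."""
    return {
        field: value
        for field, value in raw_review.items()
        if field not in PERSONAL_FIELDS
    }


raw = {
    "pub": "Example Pub",
    "rating": 5,
    "text": "Lovely pint.",
    "author_name": "Jane Doe",
    "author_url": "https://example.com/users/jane",
}
clean = anonymize_review(raw)
print(clean)
```

Dropping fields only covers structured data. Names or contact details typed into the review text itself would need separate handling.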
Step #3 - Text pre-processing
Once the unstructured review data has been extracted from the site, the next step is to convert the text data into a collection of text documents, or “corpus”, and pre-process it in advance of analysis.
Natural Language Processing (NLP) techniques have difficulty modeling unstructured and messy text, preferring instead well-defined, fixed-length inputs and outputs. As a result, the raw text typically needs to be converted into numbers, specifically vectors of numbers. In some representations, such as word embeddings, similar words are assigned vectors that sit close together.
The simplest way to convert a corpus to a vector format is the bag-of-words approach, where each unique word is represented by a unique number.
To use this approach the review data first needs to be cleaned up and structured. Here are some of the common pre-processing steps that can be implemented using a library such as Python’s NLTK, one of the most widely used Python libraries for text processing:
The goal of this pre-processing step is to ensure the text corpus is clean and contains only the core words required for text mining.
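To make the cleaning step concrete, here is a minimal sketch using only the Python standard library. The steps chosen (lowercasing, stripping punctuation and digits, tokenizing, removing stop words) and the tiny `STOP_WORDS` set are assumptions for illustration. A real project would more likely use the stop word list and tokenizers that ship with a library such as NLTK.

```python
import re

# A tiny illustrative stop word list, not the one used in the original project.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "of", "here", "this", "to", "in"}


def preprocess(text):
    """Lowercase, strip punctuation and digits, tokenize, and drop stop words."""
    lowered = text.lower()
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    tokens = letters_only.split()
    return [token for token in tokens if token not in STOP_WORDS]


tokens = preprocess("The Guinness here is always terrific!")
print(tokens)  # ['guinness', 'always', 'terrific']
```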
Once cleaned, the review data then needs to be vectorized to enable analysis. Here is an example review prior to being vectorized:
review_21 = X
review_21

Output:
"One of the greatest of Dublin's great bars, the Guinness here is always terrific, the atmosphere is friendly and it is perfect especially around Christmas -- snug warm and welcoming."
Here is how the review is represented once it has been vectorized using the bag-of-words approach. Each unique word is assigned a unique number, and the frequency of the word's appearance is recorded.
bow_21 = bow_transformer.transform([review_21])
bow_21

Output:
(0, 2079) 1
(0, 2006) 1
(0, 6295) 1
(0, 8609) 1
(0, 9152) 1
(0, 13620) 1
(0, 14781) 1
(0, 12165) 1
(0, 16179) 1
(0, 17816) 1
(0, 22077) 1
(0, 24797) 1
(0, 26102) 1
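The output above reads as pairs of (document row, word index) followed by a count. The `bow_transformer` object is presumably a fitted vectorizer such as scikit-learn's `CountVectorizer`, though the article does not say so. The sketch below reproduces the same idea with only the standard library. The helper names and the two-review corpus are made up for illustration.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each unique token to an integer index, in sorted order."""
    unique_tokens = sorted({token for doc in documents for token in doc})
    return {token: index for index, token in enumerate(unique_tokens)}


def to_bag_of_words(tokens, vocabulary):
    """Return {word index: count} for the tokens found in the vocabulary."""
    counts = Counter(tokens)
    return {vocabulary[t]: n for t, n in counts.items() if t in vocabulary}


# Two already pre-processed reviews (illustrative data).
corpus = [
    ["guinness", "always", "terrific"],
    ["friendly", "atmosphere", "terrific", "guinness", "guinness"],
]
vocabulary = build_vocabulary(corpus)
bow = to_bag_of_words(corpus[1], vocabulary)

for index, count in sorted(bow.items()):
    print((0, index), count)
```

Here "guinness" appears twice in the second review, so its index carries a count of 2, while every word in the article's example appears once.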
Once the text corpus has been cleaned, structured, and vectorized, the next step is to analyze the review data to determine which pubs have the best Guinness reviews.
Although there is no definitive method of achieving this goal, for the purposes of this project we decided not to overcomplicate things and instead do a simple analysis of the data to see what insights it can yield.
One approach would be to filter the review data by looking for the word “guinness”. This would enable us to identify all the reviews that specifically mention “guinness”, an essential requirement when trying to determine who pours the best pint of the black stuff.
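A keyword filter like this could be sketched as follows. The review records and the `mentions_guinness` helper are illustrative, not taken from the original project.

```python
import re

# Match "guinness" as a whole word, regardless of capitalization.
GUINNESS_PATTERN = re.compile(r"\bguinness\b", re.IGNORECASE)


def mentions_guinness(review_text):
    """Return True if the review text mentions Guinness."""
    return GUINNESS_PATTERN.search(review_text) is not None


# Illustrative records, not real review data.
reviews = [
    {"pub": "Pub A", "text": "The Guinness here is always terrific."},
    {"pub": "Pub B", "text": "Great food and live music."},
    {"pub": "Pub C", "text": "Cold, flat guinness. Disappointing."},
]
guinness_reviews = [r for r in reviews if mentions_guinness(r["text"])]
print([r["pub"] for r in guinness_reviews])  # ['Pub A', 'Pub C']
```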
Next, we need a way to determine whether Guinness was mentioned in a positive or negative context.
One powerful method is to build a sentiment classifier with the Multinomial Naive Bayes implementation in scikit-learn (a variant of Naive Bayes suited to text documents). The classifier is trained on a labeled dataset (30% of the overall dataset, with reviews labeled as having positive or negative sentiment) and then applied to the entire dataset, categorizing every review as either positive or negative.
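To show what Multinomial Naive Bayes does under the hood, here is a small from-scratch sketch with add-one (Laplace) smoothing. It is not scikit-learn's `MultinomialNB`, which would be the practical choice. The `TinyMultinomialNB` class and the four training reviews are invented for illustration.

```python
import math
from collections import Counter


class TinyMultinomialNB:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, documents, labels):
        self.classes = sorted(set(labels))
        self.vocabulary = {token for doc in documents for token in doc}
        self.class_counts = Counter(labels)
        self.token_counts = {label: Counter() for label in self.classes}
        for doc, label in zip(documents, labels):
            self.token_counts[label].update(doc)
        self.total_docs = len(labels)
        return self

    def _log_score(self, tokens, label):
        # log P(class) + sum of log P(token | class)
        score = math.log(self.class_counts[label] / self.total_docs)
        denominator = sum(self.token_counts[label].values()) + len(self.vocabulary)
        for token in tokens:
            if token in self.vocabulary:  # skip words never seen in training
                score += math.log((self.token_counts[label][token] + 1) / denominator)
        return score

    def predict(self, tokens):
        return max(self.classes, key=lambda label: self._log_score(tokens, label))


# Invented, already pre-processed training reviews.
train_docs = [
    ["guinness", "terrific", "creamy"],
    ["perfect", "pint", "guinness"],
    ["guinness", "flat", "warm"],
    ["terrible", "pint", "sour"],
]
train_labels = ["positive", "positive", "negative", "negative"]

model = TinyMultinomialNB().fit(train_docs, train_labels)
print(model.predict(["creamy", "perfect", "guinness"]))  # positive
print(model.predict(["flat", "sour", "pint"]))           # negative
```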
To ensure the accuracy of these sentiment predictions, the results need to be analyzed and compared to the actual reviews. Our aim is to have an accuracy of 90% or above.
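Accuracy here is simply the share of reviews where the predicted label matches the hand-assigned one. A minimal sketch of that check, using made-up labels rather than real project results:

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the hand-assigned labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Made-up labels for five held-out reviews.
predicted = ["positive", "positive", "negative", "positive", "negative"]
actual = ["positive", "negative", "negative", "positive", "negative"]

score = accuracy(predicted, actual)
meets_target = score >= 0.90
print(score, meets_target)  # 0.8 False
```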
Finally, with a fully classified database of Guinness reviews, we should now be in a position to analyze this data and determine which pub serves the best Guinness in Dublin.
In this simple analysis project, we carried out analysis using the following assumptions and weighting criteria:
Using this methodology we were able to get an interesting insight into the quality of Guinness in every bar in Dublin and find the best place to get a pint of the black stuff.
So enough with the data science mumbo jumbo, what do our results say?
Winner: Kehoes Pub - 9 South Anne Street
Of the 74 reviews analyzed, 36 display positive sentiment for pints of Guinness, or 48.6% of all reviews. That is both the highest ratio of reviews mentioning Guinness in a positive light and the highest total number of reviews mentioning Guinness, a great sign that they serve the best Guinness in Dublin.
To validate our results, the Zyte team did our due diligence and sampled Kehoes’ Guinness. We can safely say that those reviews weren’t lying: a great pint of stout!
Worthy runners-up…
Runner-Up #1: John Kavanagh The Gravediggers - 1 Prospect Square
Of the 54 reviews analyzed, 25 display positive sentiment for pints of Guinness, or 46.3% of all reviews.
Runner-Up #2: Mulligan’s Pub - 8 Poolbeg St
Of the 49 reviews analyzed, 21 display positive sentiment for pints of Guinness, or 42.9% of all reviews.
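The percentages reported for the three pubs can be reproduced from the review counts. The sketch below ranks the pubs by positive share alone, which is an assumption, as the full weighting criteria are not listed in this article.

```python
# (pub, reviews analyzed, reviews positive about Guinness), as reported above.
results = [
    ("Mulligan's Pub", 49, 21),
    ("Kehoes Pub", 74, 36),
    ("John Kavanagh The Gravediggers", 54, 25),
]


def positive_share(total_reviews, positive_reviews):
    """Percentage of analyzed reviews that are positive, to one decimal place."""
    return round(100 * positive_reviews / total_reviews, 1)


# Rank by positive share only (an assumed simplification of the weighting).
ranked = sorted(results, key=lambda row: row[2] / row[1], reverse=True)
for name, total, positive in ranked:
    print(f"{name}: {positive}/{total} = {positive_share(total, positive)}%")
```

This prints Kehoes Pub at 48.6%, John Kavanagh The Gravediggers at 46.3%, and Mulligan's Pub at 42.9%, matching the figures above.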
So if you’re looking for the best place to find a great pint of Guinness this Saint Patrick’s Day, be sure to check out these great options.
At Zyte (formerly Scrapinghub) we specialize in turning unstructured web data into structured data. If you would like to learn more about how you can use web scraped data in your business then feel free to contact our Solution architecture team, who will talk you through the services we offer startups right through to Fortune 100 companies.
We always love to hear what our readers think of our content and any questions you might have. So please leave a comment below with what you thought of the article and what you are working on right now.
Until next time…
Happy St Patrick's Day! ☘️