Data mining and web scraping sound like two buzzwords for the same thing. Quite often, data mining is misunderstood as the process of obtaining information from a website, which is not quite right. This article will help you understand what data mining is and how it differs from web scraping.
Just as mining for gold means digging through rock to find the treasure, data mining means sorting through extensive data sets to find the valuable information you or your business needs. It’s a component of the overall data science and analytics process.
When you hear “data mining”, you might think it’s interchangeable with web scraping. However, data mining doesn’t involve the actual gathering, extraction, or scraping of data. It’s the process of analyzing large amounts of data to deliver helpful insights and trends for businesses that rely on data. Web scraping, or web data extraction, on the other hand, is the process of collecting data from a website in an automated manner.
When you have collected the data you need, you can start data mining, that is, analyzing the data sets. That is an extremely simplified view, though: there are many steps to take before the actual mining can begin, which the next section covers.
One good way of explaining how data mining works is the Cross-Industry Standard Process for Data Mining (CRISP-DM). Published in 1999 to standardize the data mining process across industries, it is nowadays the most common methodology for data mining, analytics, and data science projects.
CRISP-DM defines six phases that you run through in a data mining project. It is not a single, linear run, however: individual phases can be repeated several times, and sometimes multiple switches between phases are necessary. Depending on the results of each phase, you may need to jump back to an earlier phase or go through the same phase again.
The following is a brief description of the individual phases of the CRISP-DM standard model:
Business understanding: A data mining project starts with setting the specific goals and requirements of the project. The result of this phase is a formulation of the task and a rough description of the planned approach.
Data understanding: Once the business problem is understood, it is time to get an overview of the available data and its quality. Often this data comes from various sources, in both structured and unstructured forms, and needs cleaning.
Data preparation: The goal of the data preparation phase is to select the final data set that includes all relevant data needed for the analysis and model creation.
Modeling: In this phase, the data mining methods suitable for the task are applied to the data set created during data preparation. These methods can include clustering, predictive modeling, classification, estimation, or a combination of them. Optimizing parameters and creating several candidate models are typical for this phase. It might even require going back to the data preparation phase if you need to select other variables or prepare different sources.
Evaluation: Evaluating and testing the models allows an exact comparison of the created models against the task, so the most suitable one can be selected. This phase lets you look at the progress so far and ensure it’s on track to meet the business goals. If it’s not, you may need to go back to previous steps before the project is ready for deployment.
Deployment: Now it’s time to deploy the accurate and reliable model in the real world. The deployment can take place within the organization, or the model can be shared with customers and stakeholders. The work doesn’t end when the last line of code is complete; deployment requires careful thought, a roll-out plan, and a way to make sure the right people are informed.
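To make the modeling and evaluation phases a little more concrete, here is a minimal sketch in plain Python. It fits a toy k-means clustering to a hypothetical data set of customer (age, spend) points, builds several candidate models, and compares them by within-cluster error — the kind of model-building and comparison loop CRISP-DM describes. This is an illustrative sketch, not a production pipeline; the data and parameters are invented for the example.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Toy k-means: returns (centroids, total within-cluster squared error)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k random points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                            + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    error = sum(
        min((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for cx, cy in centroids)
        for p in points
    )
    return centroids, error

# Hypothetical prepared data set: two obvious groups of customer (age, spend) points.
data = [(25, 40), (27, 42), (24, 38), (61, 10), (63, 12), (60, 9)]

# "Modeling": build several candidate models; "evaluation": compare their errors.
for k in (1, 2, 3):
    _, err = kmeans(data, k)
    print(f"k={k}: within-cluster error {err:.1f}")
```

In a real project, a poor error across all candidates would be the signal to loop back to data preparation, exactly as the phase description above suggests.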
Data mining helps to make accurate predictions, recognize patterns and outliers, and often informs forecasting. It is used to identify gaps and errors in business operations, and in combination with predictive analytics, machine learning, and similar techniques it can set a business apart from the competition. No wonder data mining techniques are widely used in business areas like marketing, risk management, and fraud detection.
A real-life example of data mining can be seen while shopping online: Amazon’s “frequently bought together” feature, or the recommendation sections on Spotify and Netflix. All of them use data mining algorithms to analyze consumer behavior and identify patterns. Improving the user experience this way falls under market basket analysis, a common use case for data mining. Using extracted product data helps to identify customer and shopping trends.
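As a toy illustration of market basket analysis (not Amazon’s actual algorithm), one can count how often pairs of products appear together in past orders and recommend the most frequent co-purchases. The order history below is invented for the example:

```python
from collections import Counter
from itertools import combinations

# Hypothetical order history: each order is the set of products bought together.
orders = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"phone", "charger"},
    {"laptop", "mouse"},
    {"laptop", "mouse", "case"},
]

# Count how often each unordered pair of products is bought together.
pair_counts = Counter()
for order in orders:
    for a, b in combinations(sorted(order), 2):
        pair_counts[(a, b)] += 1

def frequently_bought_together(product, top=2):
    """Return the products most often co-purchased with `product`."""
    related = Counter()
    for (a, b), n in pair_counts.items():
        if a == product:
            related[b] += n
        elif b == product:
            related[a] += n
    return [item for item, _ in related.most_common(top)]

print(frequently_bought_together("phone"))  # ['case', 'charger']
```

Real recommender systems work on millions of orders and use more sophisticated measures (support, confidence, lift), but the underlying idea of mining co-occurrence patterns is the same.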
As we mentioned briefly above, web scraping or web data extraction goes hand in hand with data mining.
To find relevant information that can be used for analytics and predictive modeling, the amount of data available is a critical factor. Since the goal is to discover patterns and correlations in sequential or non-sequential data, and to verify that the obtained data is of good quality, the more data available, the better.
So what you need is data. But how do you get it? This is where web scraping comes in.
Every one of us has copied and pasted information from a website at some point, which is essentially what a web scraper does, just on a tiny scale. To gather enough data to drive any insights from it, web scraping uses intelligent automation to retrieve thousands, even millions of data points from the internet’s seemingly endless frontier.
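The idea can be sketched with Python’s standard library alone. A real scraper would fetch pages over HTTP (with `urllib` or a library such as `requests`) and handle many page layouts; here we simply parse a small inline HTML snippet with a hypothetical product listing and pull out the structured data a copy-and-paste human would collect by hand:

```python
from html.parser import HTMLParser

# A small inline snippet standing in for a fetched product page.
HTML = """
<ul>
  <li class="product"><span class="name">Mug</span> <span class="price">$7.99</span></li>
  <li class="product"><span class="name">Teapot</span> <span class="price">$24.50</span></li>
</ul>
"""

class ProductParser(HTMLParser):
    """Collect (name, price) pairs from span.name / span.price elements."""
    def __init__(self):
        super().__init__()
        self.field = None     # which field we are currently inside, if any
        self.row = {}         # fields gathered for the current product
        self.products = []    # completed (name, price) tuples

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self.field = cls

    def handle_data(self, data):
        if self.field:
            self.row[self.field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.field = None
        if tag == "li" and "name" in self.row:
            self.products.append((self.row["name"], self.row["price"]))
            self.row = {}

parser = ProductParser()
parser.feed(HTML)
print(parser.products)  # [('Mug', '$7.99'), ('Teapot', '$24.50')]
```

Run at scale across many pages, this kind of extraction produces exactly the large, structured data sets that data mining then analyzes.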
Although data mining and web scraping are different things, in the end they work towards the same goal: helping businesses thrive.