Data Mining vs. Data Extraction: What's the Difference?

Data mining and data extraction are two common data science buzzwords that people often confuse. Data mining is frequently misunderstood as simply extracting or obtaining data, but it involves considerably more than that.

Data mining is a technique for analyzing large data sets with statistical and mathematical methods to find hidden patterns or trends and derive value from them. Data extraction, by contrast, is the act of retrieving data from (usually unstructured or poorly structured) data sources into centralized locations for storage or further processing.

Specifically, unstructured data sources include web pages, emails, documents, PDFs, scanned text, mainframe reports, spool files, classifieds, etc. The centralized locations may be on-site, cloud-based, or a hybrid of the two. It is important to keep in mind that data extraction doesn’t include the processing or analysis that may take place later.
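To make the idea of data extraction concrete, here is a minimal Python sketch that pulls structured records out of free-form text. The sample ads, the regular expression, and the function name extract_ads are all assumptions made up for illustration; real sources are messier and usually need a dedicated tool or a more robust parser. Note that the code only collects and stores the fields, it does not analyze them.

```python
import re

# Hypothetical unstructured input: free-form classified ads.
RAW_ADS = """\
For sale: 2015 Honda Civic, $8,500. Call 555-0101.
Selling a 2018 Toyota Corolla for $12,300, phone 555-0102
Lost cat, please call 555-0103
"""

# This pattern is an assumption about how the sample ads are worded:
# a four-digit year, a make, a model, and later a dollar amount.
AD_PATTERN = re.compile(
    r"(?P<year>\d{4})\s+(?P<make>[A-Z][a-z]+)\s+(?P<model>[A-Z][a-z]+)"
    r".*?\$(?P<price>\d{1,3}(?:,\d{3})*)"
)


def extract_ads(text):
    """Turn each matching line into a dictionary; skip lines that do not match."""
    records = []
    for line in text.splitlines():
        match = AD_PATTERN.search(line)
        if match is None:
            continue  # not a car ad in the expected shape
        records.append({
            "year": int(match["year"]),
            "make": match["make"],
            "model": match["model"],
            "price": int(match["price"].replace(",", "")),
        })
    return records


# The "centralized location" here is just an in-memory list.
records = extract_ads(RAW_ADS)
for record in records:
    print(record)
```

The resulting list of dictionaries could then be written to a database or a CSV file, where later processing or analysis can pick it up.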

Key Differences Between Data Mining and Data Extraction

  1. Data mining is also known as knowledge discovery in databases, knowledge extraction, data/pattern analysis, and information harvesting. Data extraction is often used interchangeably with web data extraction, web scraping, web crawling, data retrieval, data harvesting, etc.
  2. Data mining is mostly performed on structured data, while data extraction usually retrieves data from unstructured or poorly structured data sources.
  3. The goal of data mining is to make available data more useful for generating insights. The goal of data extraction is to collect data and gather it in a place where it can be stored or further processed.
  4. Data mining relies on mathematical methods to reveal patterns or trends. Data extraction relies on programming languages or data extraction tools to crawl the data sources.
  5. The purpose of data mining is to find facts that were previously unknown or ignored, while data extraction deals with existing information.
  6. Data mining is considerably more complex and typically requires a large investment in staff training. Data extraction, when conducted with the right tool, can be extremely easy and cost-effective.
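
As a contrast to extraction, the following Python sketch shows a very small data mining task on data that is already structured. It counts how often pairs of items are bought together and keeps the pairs whose support (the share of transactions containing them) meets a threshold, which is a simplified form of frequent itemset mining. The transactions, the threshold, and the function name frequent_pairs are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical, already-structured data: one set of items per purchase.
TRANSACTIONS = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def frequent_pairs(transactions, min_support):
    """Return item pairs whose support is at least min_support.

    Support is the fraction of transactions that contain both items.
    """
    counts = Counter()
    for items in transactions:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    total = len(transactions)
    return {
        pair: count / total
        for pair, count in counts.items()
        if count / total >= min_support
    }


patterns = frequent_pairs(TRANSACTIONS, min_support=0.5)
print(patterns)
```

With this sample data, the only pair that reaches the threshold is bread and butter, which appears in three of the four transactions. That is a pattern no single record states directly, which is the kind of previously unknown fact data mining aims to surface.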
