Data science and data mining are hot topics in the industry. Companies can’t seem to hire enough people to crunch their numbers and do their analytics.
Harvard Business Review even called data scientist the sexiest job of the 21st century!
While there is a ton of good in data mining, some major issues are still present in 2022.
In this 6-minute read, we go over a comprehensive list of the major issues that industry leaders in data mining keep screaming about.
- Most Data Is Messy Data
- Missing Data And Its Effects On Solutions
- Dealing With Distributed Data
- Different levels of data security and access
- Expensive And Timely Data Upkeep
- Data Science is A Newer Field
- Fast Paced industry
- Understanding The Business Context
- Dealing With “People” Problems
- Always a new algorithm
- Navigating Initial Assumptions
- Amount of Background Knowledge Needed
- Scalability of Good Solutions
- Production vs. Training
- Model Drift And Its Effects On Business
- Result evaluation
- Knowing How To Correctly Deliver Your Results
- There Are Always Changing Requirements
- Budget Seems Smaller in Data Mining
- Proving a positive Return on investment (ROI)
1. Most Data Is Messy Data
Most of your data mining projects are going to start with messy data.
While most analytical professionals love the modeling part of their job, a commonly cited estimate is that they spend about 80% of their time cleaning messy data.
This trend doesn’t seem to be subsiding either.
The data mining process continues to get more in-depth, with new and innovative approaches coming out every day.
While we wish this increase in depth came with a decrease in messy data, it doesn’t.
Data continues to be messy. As a data scientist, you’ll need to get used to jumbled columns, weird date formats, non-standard units, and a thousand other data issues that will make your datasets messy.
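To make the "weird date formats" and "non-standard units" problems concrete, here is a minimal cleaning sketch using only the Python standard library. The list of date formats, the unit table, and the function names `parse_date` and `to_kilograms` are assumptions made up for this illustration; real datasets will need their own rules.

```python
from datetime import date, datetime
from typing import Optional

# Assumed formats for this sketch; extend the tuple for your own data.
# Order matters: "03/05/2022" is read as month/day/year here.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")

# Assumed unit table: multiply by the factor to get kilograms.
TO_KG = {"kg": 1.0, "lb": 0.45359237, "g": 0.001}


def parse_date(raw: str) -> Optional[date]:
    """Try each known format in turn; return None if none match."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def to_kilograms(raw: str) -> Optional[float]:
    """Convert strings such as '2.5 kg' or '500g' to kilograms."""
    text = raw.strip().lower()
    # Check longer unit names first so 'kg' is matched before 'g'.
    for unit in sorted(TO_KG, key=len, reverse=True):
        if text.endswith(unit):
            number = text[: -len(unit)].strip()
            try:
                return float(number) * TO_KG[unit]
            except ValueError:
                return None
    return None


print(parse_date("03/05/2022"))  # 2022-03-05
print(parse_date("not a date"))  # None
print(to_kilograms("2.5 kg"))    # 2.5
```

Returning `None` for values that cannot be parsed keeps the bad rows visible, so you can count and inspect them instead of silently guessing.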
2. Missing Data And Its Effects On Solutions
Another major problem in data mining is missing data.
We saw above that data is usually messy, but what happens when it’s missing?
Missing data impacts your analysis, biases models, and creates situations where normal data points look like outliers.
Even worse, when entire variables are missing rather than just individual points, our models could be meaningless.
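As a rough illustration of how missing data biases an analysis, the sketch below uses made-up numbers and assumes the largest values are the ones that go missing. The helpers `drop_missing` and `impute_mean` are hypothetical names for two common workarounds, not a recommended recipe.

```python
from statistics import mean, pstdev

# Toy data, invented for illustration: the "true" values we never fully see.
true_values = [20, 25, 30, 35, 40, 80, 90, 100]

# Assumption: the largest values are the ones that go missing (here, above 75).
observed = [v if v <= 75 else None for v in true_values]


def drop_missing(values):
    """Keep only the points that were actually observed."""
    return [v for v in values if v is not None]


def impute_mean(values):
    """Fill each gap with the mean of the observed points."""
    fill = mean(drop_missing(values))
    return [fill if v is None else v for v in values]


print(mean(true_values))              # 52.5
print(mean(drop_missing(observed)))   # 30
print(mean(impute_mean(observed)))    # 30
print(pstdev(impute_mean(observed)) < pstdev(true_values))  # True
```

Both workarounds report a mean of 30 when the true mean is 52.5, and mean imputation also shrinks the spread. Against that compressed picture, a perfectly normal value of 80 would look like an outlier.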
3. Dealing With Distributed Data
As we’ve increased the volume of data, we’ve also increased our technological capabilities for dealing with it.
Tools like Cassandra, HDFS, Amazon Web Services, and many others make it very easy for the data that we need for our data mining project to exist all over the world.
This creates major issues for analysis, as you’ll have another thing to worry about when retrieving your data: where it lives and how you’ll get it.
4. Different levels of data security and access
Sometimes in data mining, you’ll be unable to obtain the data you need.
This could be for many reasons, but a lot of the time, it has to do with data security.
Many companies and business units inside companies do not want someone poking around and crunching numbers.
This could lead to discovering patterns that could make some decision-makers look bad.
To combat this, many lock down the “goods” (the data you really need) and only allow access to certain parts of their data.
These parts are usually uninformative (which is why they’re available) and can make your data mining projects impossible.
5. Expensive And Timely Data Upkeep
Clive Humby said it best: “Data is the new oil. Like oil, data is valuable, but if unrefined, it cannot really be used.”
Maintaining high-quality data to be continuously used in data mining projects is expensive.
This is a major issue in data mining, as companies have to commit to hiring analytical professionals who can both maintain this complex data and analyze it.
6. Data Science is A Newer Field
Data science being a newer field creates some problems in data mining projects.
Schools have just started to teach it, businesses have just started to address it, and managers have just started to manage it.
With data science being a newer field, there aren’t as many hashed-out processes compared to something like software engineering.
7. Fast Paced industry
Not only is data science new, but it also moves at the speed of light.
In June of 2021 alone, 3,057 articles were posted on machine learning. (Source)
Many of the top minds in academia all around the world are working day and night to push the fields of data mining and machine learning forward.
The issue is simple: It’s hard to keep up.
8. Understanding The Business Context
Another major issue in data mining is solving the wrong problem.
While analytical professionals are usually pretty advanced in their technological skills, their business savvy is sometimes lacking.
This creates a point of contention between leaders and analytical professionals trying to solve the problem, as understanding the problem and business context is sometimes difficult.
9. Dealing With People Problems
One of the biggest issues in data mining (if not the biggest) is “people” problems.
Sometimes the information that these algorithms find … isn’t the nicest.
For example, what if you ran an analysis of different business units, and your algorithm told you to eliminate half of a unit…
Would you do it?
What would you do?
Algorithms and people can work really well together, but while your algorithms do not have emotions (yet), you do.
Sometimes, you and those around you are the biggest issues in your data mining projects.
10. Always a new algorithm
In data mining, new and improved algorithms are being published every day.
Should you implement them?
Would they work better on your current dataset?
Should we redo everything we’ve done in production?
Ignoring the shiny new object is hard in data mining and creates issues when publications promise immediate results and high performance over (now old) algorithms you’ve implemented.
11. Navigating Initial Assumptions
Working with data is hard.
What’s even harder is working with data you have some emotional connection to.
This creates problems during your data mining projects, as your subconscious biases can influence how and when you do your analysis.
Sometimes, we have initial assumptions about the outcome of a data mining project, which leads us to do our analysis in a way that derives that outcome instead of looking at the project objectively.
12. Amount of Background Knowledge Needed
What is the runtime of Kmeans clustering?
What are the benefits of normalizing your data over using something like np.log?
The amount of mathematical knowledge needed to perform data mining projects correctly is insanely high, and added to that is the need for a deep understanding of computer science principles.
As time goes on and automated machine learning (AutoML) improves, the skill level needed for data science professionals may drop, but for now it remains high.
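To give a feel for the kind of background knowledge these questions assume, here is a small sketch comparing z-score standardization with a log transform on made-up, heavily skewed data. The function names are illustrative, and `math.log` stands in for `np.log` so the example needs no third-party packages. (On the K-means question: the standard Lloyd's algorithm takes time proportional to points × clusters × dimensions × iterations.)

```python
import math
from statistics import mean, pstdev


def standardize(values):
    """Z-score: shift to mean 0 and scale to standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def log_transform(values):
    """Natural log of each value; only valid for positive inputs."""
    return [math.log(v) for v in values]


def skewness(values):
    """Population skewness: the mean cubed z-score (0 for symmetric data)."""
    mu, sigma = mean(values), pstdev(values)
    return mean([((v - mu) / sigma) ** 3 for v in values])


# Toy data, invented for illustration: each value is 10x the previous one.
data = [1, 10, 100, 1000, 10000]

print(round(skewness(data), 2))
print(round(skewness(standardize(data)), 2))    # same as the raw skewness
print(round(abs(skewness(log_transform(data))), 2))  # 0.0
```

Standardizing changes the scale but leaves the shape of the distribution, including its skew, untouched. The log transform changes the shape itself, which is why the two are not interchangeable.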
13. Scalability of Good Solutions
Scale is a massive problem in most forms of computing, and data mining is no different.
Once a solution is found, will it be able to handle 1000 users at a time?
How are you going to set up that architecture?
Should that architecture be built by you or others?
Do the others understand your problem and the model’s needs?
Scaling good solutions is not any easier than building correct solutions, and many data mining professionals forget that to get past the finish line, scale has to be considered.
14. Production vs. Training
From experience, how a data mining system works during training and testing is much different from how that same system will work in production.
This is independent of your mining methodology or data mining algorithm. Production environments just bring a whole new array of data mining challenges that are impossible to see coming.
The time spent correctly handling these issues is a major data mining problem.
15. Model Drift And Its Effects On Business
Let’s say you’ve fought through the incomplete data, you’ve discovered patterns and new insights and handled the entire data flow correctly in production.
You ran your tests, and your model was working perfectly.
However, it’s now three months later, and your model seems to worsen by the day.
Model Drift is the slow decay of your model’s accuracy as the world slowly changes.
The models built for yesterday may not be the best models for today.
A common problem data mining professionals have is keeping all their models up to date and ensuring their data mining results apply today.
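One simple way to notice drift is to keep comparing recent accuracy against the accuracy measured at launch. The sketch below assumes you eventually get true labels for production predictions; the labels, the 0.05 tolerance, and the function names are made up for illustration and should be tuned to your own project.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def has_drifted(baseline_accuracy, recent_accuracy, tolerance=0.05):
    """Flag drift when accuracy has dropped by more than the tolerance."""
    return (baseline_accuracy - recent_accuracy) > tolerance


# Toy labels, invented for illustration.
launch_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
launch_pred = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # 9 of 10 correct at launch

recent_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
recent_pred = [0, 0, 1, 0, 0, 1, 1, 0, 1, 0]  # 6 of 10 correct three months later

baseline = accuracy(launch_true, launch_pred)
recent = accuracy(recent_true, recent_pred)

print(baseline)                       # 0.9
print(recent)                         # 0.6
print(has_drifted(baseline, recent))  # True
```

A check like this only tells you that something changed, not why. It is a trigger for investigation or retraining, not a fix in itself.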
16. Result evaluation
Your data mining algorithms are spitting out numbers, and your loss function says they’re good.
What do these results mean?
Understanding and correctly assessing your results is hard.
It’s even harder when you’re on a time crunch, and someone needed these results an hour ago.
Even if you do everything right and make it to the final boss, incorrectly assessing your results would make all the prior steps useless.
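Here is one classic way a good-looking number can mislead you. The sketch below uses made-up, heavily imbalanced labels and a hypothetical "model" that always predicts the majority class; the function names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model caught."""
    caught = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual = sum(t == positive for t in y_true)
    return caught / actual if actual else 0.0


# Toy labels, invented for illustration: 5 positives (say, fraud) in 100 rows.
y_true = [1] * 5 + [0] * 95

# A "model" that always predicts the majority class.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))  # 0.95
print(recall(y_true, y_pred))    # 0.0
```

An accuracy of 95% sounds great until you notice the model never catches a single positive case. Comparing against a trivial baseline and checking more than one metric is a cheap way to avoid this trap.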
17. Knowing How To Correctly Deliver Your Results
You’ve used the right data mining tools, picked the correct data mining algorithms, and accurately assessed your results – now it’s time to present them.
No matter what data mining methods you choose, poorly presenting your results can make everything you’ve previously done worthless.
Forgetting a critical year, focusing on the wrong aspect, and not driving toward the result are just some of the things that can make your accurate results look meaningless.
18. There Are Always Changing Requirements
Data Mining has a revolving door when it comes to requirements.
Throughout the full scope of a data mining project, it’s not shocking to see requirements change four or five times.
Changing requirements is a major issue in data mining projects and something you should consistently work with your leadership to stop from happening.
19. Budget Seems Smaller in Data Mining
Around this time of year, budgets are always tight.
While software engineering projects seem to have unlimited budgets, data mining projects do not.
Working sophisticated problems on tight budgets is not easy and is a constant issue for data mining professionals.
20. Proving a positive Return on investment (ROI)
Finally, our last issue for data mining professionals is proving ROI.
This isn’t because the ROI on your projects is low. It’s because sometimes the changes these analyses recommend are hard to implement.
If those changes are never implemented, the same findings will keep showing up in later analyses, and the return never gets realized.
Additional Articles in our Data Mining Series
Here at EML, we have a complete series breaking down those tough-to-learn topics in data mining.
You can find the rest of that series here:
- Bayes Classification In Data Mining: A complete introduction to Bayes Classification, with full python code (like all articles in this series).
- Outlier Detection In Data Mining: An introduction to outlier detection, plus a couple of proven ways to find outliers in your data – all coded for you in python.
- Correlation Analysis In Data Mining: Teaching you the ins and outs of a correlation analysis during data mining projects.
- Summarization in data mining: Summarization is tough and even tougher in data mining. Here we have a full guide implemented with python.