What challenges do most data scientists ignore?



Pradeep Menon is an influential expert with extensive experience in big data, data science and data architecture. This is the first in a series of brief data science articles that he wrote this year. It introduces the basic laws, commonly used algorithms, and problem types in data science, giving readers an overview of the data science landscape.

The British mathematician and Tesco Clubcard architect Clive Humby coined the phrase "data is the new oil" in 2006. He said:

"Data is the new oil. It's valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value."

The iPhone revolution and the growth of the mobile economy created a perfect opportunity for big data technology to develop. In 2012, HBR (Harvard Business Review) published an article that brought data scientists to the fore. The article, "Data Scientist: The Sexiest Job of the 21st Century" (https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century), describes this "new breed" as a mix of data hacker, analyst, communicator and trusted adviser.

Every company is currently trying to become more "data-driven", and machine learning has helped a great deal. Much of the material, however, is highly specialized and difficult to understand, so this series of articles aims to simplify data science. The author draws on Stanford University courses and the textbook An Introduction to Statistical Learning (http://www-bcf.usc.edu/~gareth/ISL/) to present data science in a simple, understandable form.

Data science is a multidisciplinary field that mainly includes:

  • Business knowledge
  • Statistical learning, also known as machine learning
  • Computer programming

The focus of this series is on simplifying the machine learning side of data science. This first article introduces the basic laws, commonly used algorithms, and problem types in data science.

Core laws

(Image reproduced from (Menon, 2017))

Data is a strategic resource: this is an organizational mindset. The questions to ask are, "Are we using all of the data we collect and store? Can we extract meaningful insights from it?" I'm pretty sure the answer to both questions is no. Companies born in the cloud are inherently data-driven, and treating data as a strategic resource comes naturally to them. Most organizations, however, have not adopted this mindset.

A systematic process for knowledge extraction: Extracting insights from data requires an organized process with distinct stages and clear, achievable goals for each stage. The Cross-Industry Standard Process for Data Mining (CRISP-DM) (https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining) is one such process.

Sleep with the data: Organizations should invest in people who are passionate about data. Turning data into insight is not alchemy, and there are no all-powerful alchemists. What organizations need are evangelists who understand the value of data and can identify and create data resources, along with professionals who can connect data, technology and business.

Accept uncertainty: data science is not a silver bullet, and it is not a crystal ball that can predict the future. Like reports and key performance indicators, it is one input to decisions. Data science is a tool, not an end in itself. It deals in probabilities, not absolutes. Managers and decision makers have to accept this fact and incorporate quantified uncertainty into the decision-making process. Organizations can thrive under this uncertainty only if they adopt a culture of experimentation and learn quickly from mistakes.

BAB law (Business-Analytics-Business): I think this is the most important law. Most data science literature focuses on models and algorithms, and the equations themselves carry no business context. BAB puts the emphasis on the business part: putting algorithms into a business context is critical. Define the business problem, solve it with analytics, and then integrate the answer into the business process. That sequence is what BAB stands for: Business-Analytics-Business.



Building on the second law, this section introduces the data science process. Here are the stages of a typical data science project:

1. Define the business problem

Einstein is often quoted as saying: "Everything should be made as simple as possible, but not simpler." This quote is the key to defining a business problem. The problem statement must be precise, and the goal to be achieved must be clearly defined. In my experience, business teams are often so busy with the tasks at hand that the challenges that need to be addressed get overlooked. Brainstorming sessions, workshops and interviews can help identify these challenges and formulate hypotheses. For example, suppose that a telecommunications company's revenue declined year over year because its customer base shrank. In this case, the business problem can be defined as follows:

  • The company needs to grow its customer base by targeting new customer segments while reducing customer churn.

2. Break it down into machine learning tasks

A well-defined business problem needs to be mapped to machine learning tasks. In the example above, the company needs to expand its customer base by targeting new segments and reducing customer churn. How can we break that down into machine learning problems? One possible decomposition is:

  • Reduce customer churn by x%.
  • Identify new customer segments for targeted marketing.

3. Data preparation

After formulating the business problem and breaking it down into machine learning tasks, we need a deep understanding of the associated data in order to devise a suitable analysis strategy. Key aspects to note include the source of the data, its quality, and its variance.
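As a rough illustration of checking data quality and variance, the sketch below profiles two columns of a small customer table. The records, column names, and the `profile` helper are hypothetical and chosen only for this example; a real project would read from the actual data source.

```python
from statistics import mean, pvariance

# Hypothetical customer records; None marks a missing value.
records = [
    {"monthly_bill": 42.0, "tenure_months": 12},
    {"monthly_bill": None, "tenure_months": 3},
    {"monthly_bill": 58.5, "tenure_months": 40},
    {"monthly_bill": 61.0, "tenure_months": None},
    {"monthly_bill": 39.5, "tenure_months": 7},
]


def profile(rows, column):
    """Return the missing-value share, mean, and population variance of one column."""
    values = [row[column] for row in rows]
    present = [v for v in values if v is not None]
    return {
        "missing_share": (len(values) - len(present)) / len(values),
        "mean": mean(present),
        "variance": pvariance(present),
    }


report = {col: profile(records, col) for col in ("monthly_bill", "tenure_months")}
for col, stats in report.items():
    print(col, stats)
```

A high share of missing values, or a column with almost no variance, is an early signal that the column needs cleaning or may not be useful for modeling.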

4. Exploratory data analysis

Just like an astronaut exploring the universe, a data scientist must explore the unknown patterns in the data, gain insight into its hidden characteristics, and record new discoveries. Exploratory data analysis (EDA) is an exciting task. It lets us understand the data better, examine its nuances, discover hidden patterns, engineer new features, and work out a modeling strategy.
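One small piece of EDA can be sketched as follows: checking whether a candidate feature moves together with the outcome we care about. The sample values, the threshold of four calls, and the helper names are assumptions made up for this illustration.

```python
from math import sqrt

# Hypothetical sample: support calls per customer and whether they churned (1) or stayed (0).
support_calls = [0, 1, 1, 2, 4, 5, 6, 7]
churned = [0, 0, 0, 0, 1, 0, 1, 1]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def churn_rate_by_group(xs, ys, threshold):
    """Churn rate for customers below the threshold and for those at or above it."""
    low = [y for x, y in zip(xs, ys) if x < threshold]
    high = [y for x, y in zip(xs, ys) if x >= threshold]
    return sum(low) / len(low), sum(high) / len(high)


r = pearson(support_calls, churned)
low_rate, high_rate = churn_rate_by_group(support_calls, churned, threshold=4)
print(f"correlation: {r:.2f}")
print(f"churn rate below 4 calls: {low_rate:.2f}, at 4 or more: {high_rate:.2f}")
```

A pattern like this would only be a hypothesis to carry into modeling, not a conclusion; correlation on a small sample says nothing about cause.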

5. Modeling

After the exploratory data analysis, we move on to modeling. In this phase, we choose the most appropriate algorithm for the specific machine learning problem, for example regression, decision trees, or random forests.

6. Deployment and Evaluation

Finally, we deploy the models we have built, monitor them continuously, observe how they perform in the real world, and recalibrate them as needed.

Typically, modeling and deployment make up only 20% of the total work; the remaining 80% goes into exploring the data and understanding it thoroughly.

Types of machine learning problems


Simply put, machine learning falls into two categories: supervised learning and unsupervised learning.

1. Supervised learning

A supervised learning task has a predefined target. The modeler supervises the machine learning model as it is built, steering it toward that target. Supervised learning can be further broken down into two categories: regression and classification.


Regression: Regression models are very common in machine learning tasks and are used to estimate or predict a numeric variable. Two examples:

  • What is the estimated sales potential for the next quarter?
  • How many transactions can be closed in the next year?
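To make the regression idea concrete, here is a minimal sketch that fits a straight line to past quarterly sales by ordinary least squares and extrapolates one quarter ahead. The sales figures and the `fit_line` helper are invented for illustration, and a straight-line trend is an assumption, not a recommendation for real forecasting.

```python
# Hypothetical quarterly sales (in thousands); quarters are numbered 1 to 6.
quarters = [1, 2, 3, 4, 5, 6]
sales = [112.0, 118.0, 131.0, 139.0, 152.0, 158.0]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(quarters, sales)
forecast_q7 = intercept + slope * 7  # numeric prediction for the next quarter
print(f"estimated sales for quarter 7: {forecast_q7:.1f}")
```

The output is a number rather than a category, which is what distinguishes regression from the classification models described next.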


Classification: As the name suggests, a classification model sorts items into a set of specific categories. It applies to a wide range of uses. Some typical examples:

  • Spam filtering uses a classification model to sort incoming emails into spam and non-spam based on certain characteristics.
  • Churn prediction is another important application of classification models. Telecommunications companies generally use a churn model to predict whether users will churn (that is, stop using the service).
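The churn example can be sketched with one of the simplest classifiers, k-nearest neighbours, which labels a customer by majority vote among the most similar known customers. The training data, feature choice, and `predict` helper are hypothetical; the article does not prescribe a particular algorithm for churn.

```python
from collections import Counter
from math import dist

# Hypothetical labelled customers: (monthly support calls, months of tenure) -> label.
training = [
    ((0, 36), "stays"),
    ((1, 48), "stays"),
    ((1, 24), "stays"),
    ((5, 6), "churns"),
    ((6, 3), "churns"),
    ((4, 9), "churns"),
]


def predict(features, labelled, k=3):
    """Classify by majority vote among the k nearest labelled points."""
    nearest = sorted(labelled, key=lambda item: dist(item[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(predict((5, 4), training))   # many calls, short tenure
print(predict((0, 40), training))  # few calls, long tenure
```

The features here are left unscaled to keep the sketch short, so tenure dominates the distance; in practice features are usually brought to a comparable scale first.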

2. Unsupervised learning

Unsupervised learning has no defined target, so the results can sometimes be difficult to interpret. There are many types of unsupervised learning tasks. The most common are:

  • Clustering: Clustering groups items according to similarity. For example, customer segmentation uses clustering algorithms.
  • Association: Association algorithms find products that are frequently bought together. In market basket analysis, association algorithms are used to bundle products for sale.
  • Link prediction: Link prediction is used to determine the connections between data elements. Sites like Facebook, Amazon, and Netflix use link prediction algorithms extensively to recommend friends, products, and movies.
  • Data reduction: Data reduction methods reduce the number of features in a data set. They represent a data set that has a large number of attributes using fewer attributes.
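As an illustration of the clustering task listed above, the following sketch segments customers by annual spend with a one-dimensional k-means (Lloyd's algorithm). The spend values, the choice of two segments, and the starting centroids are assumptions for this example only.

```python
# Hypothetical annual spend per customer.
spend = [120, 135, 150, 900, 950, 1010, 140, 980]


def kmeans_1d(values, centroids, iterations=10):
    """Lloyd's algorithm in one dimension, starting from the given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:  # converged: assignments can no longer change
            break
        centroids = updated
    return centroids, clusters


# Start from the smallest and largest value (an arbitrary but deterministic choice).
centroids, clusters = kmeans_1d(spend, centroids=[min(spend), max(spend)])
for center, members in zip(centroids, clusters):
    print(f"segment around {center:.2f}: {sorted(members)}")
```

No labels were supplied; the two segments emerge from the data alone, which is also why interpreting them is left to the analyst.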

From machine learning tasks to algorithms

Once the business problem has been broken down into machine learning tasks, one or more algorithms can be applied to each task. Usually a model is trained with several algorithms, and the algorithm or ensemble of algorithms that produces the best results is selected for deployment.
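Selecting among several algorithms can be sketched as training each candidate on the same data and comparing errors on held-out data. The data set, the two candidate models, and the use of mean squared error are assumptions chosen to keep the example small; they are not taken from any particular toolkit.

```python
# Hypothetical data: x = advertising spend, y = sales.
data = [(1, 12.0), (2, 14.5), (3, 15.5), (4, 18.5), (5, 20.0),
        (6, 22.5), (7, 24.0), (8, 26.5), (9, 28.0)]
train, holdout = data[:6], data[6:]  # last three points are held out


def fit_mean(rows):
    """Baseline model: always predict the training mean."""
    avg = sum(y for _, y in rows) / len(rows)
    return lambda x: avg


def fit_line(rows):
    """Least-squares straight line through the training points."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)


def mse(model, rows):
    """Mean squared error of a model on the given rows."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)


candidates = {"mean baseline": fit_mean, "linear regression": fit_line}
scores = {name: mse(fit(train), holdout) for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print(scores)
print("selected:", best)
```

Scoring on data the models did not see during training is what makes the comparison meaningful; scoring on the training data would favor whichever model memorizes best.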

Microsoft Azure Machine Learning has more than 30 pre-built algorithms that can be used to train machine learning models.

You can use the Azure Machine Learning cheat sheet to examine these algorithms.


Data science is a very broad field. It is exciting, and it is both a science and an art. In this article, we have seen only the tip of the iceberg. There is no point in exploring the "how" without first understanding the "why". In the following articles, we will continue with the "how" of machine learning methods.