Dr. Daniel Koh

Probability Theory #13 — Normal Distribution

The normal distribution is the most widely used distribution in many areas of study, for two main reasons. First, it is symmetrical: the left half and the right half mirror each other around the peak of the distribution, which sits at the average value. This makes it well suited to describing many natural phenomena. Second, because this property reflects much of the reality around us, it offers intuitive and explainable interpretations of everyday events. However, it is very important to note that data science is not confined to the normal distribution: many real-world data sets fit better under other distributions, such as the exponential distribution and the beta distribution. As data scientists, we need to explore other distributions and not limit ourselves to one.

One of the key features of the normal distribution is that the mean, median and mode coincide. We assume that they are equal. In reality, though, the mean of a sample is unlikely to be exactly equal to its median and mode, because data fluctuate. In my personal opinion, the deviation between mean, median and mode is itself a form of chance encounter. If we draw multiple samples from a population, we should observe a similar deviation between mean, median and mode in each sample, that is, a pattern. If this pattern is not observed, then we know chance plays a greater role in the data. It’s as if we are saying: “We draw 100 samples, each containing 1,000 data points, and 50% of these samples do not have a similar deviation of mean, median and mode.” However, we do not want the deviation to differ so much from sample to sample that chance plays the greater role in the deviation itself. Consider instead the case where we say: “We draw 100 samples, each containing 1,000 data points, and each sample shows a deviation of approximately 20% between mean and median, or 30% between mean and mode.” When such a consistent deviation occurs, we know that extreme values likely exist in the data. This leads us to our second point.
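The thought experiment above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the two populations (a normal one with mean 50 and standard deviation 10, and an exponential one with mean 50) are hypothetical, the "deviation" is measured as the mean-median gap in percent of the mean, and the mode is left out because estimating it for continuous data requires binning choices.

```python
import random
import statistics


def mean_median_gap_pct(values):
    """Gap between mean and median, as a percentage of the mean."""
    mean = statistics.fmean(values)
    return 100 * (mean - statistics.median(values)) / mean


def gaps_across_samples(draw, n_samples=100, sample_size=1000):
    """Draw repeated samples with `draw()` and return one gap per sample."""
    return [
        mean_median_gap_pct([draw() for _ in range(sample_size)])
        for _ in range(n_samples)
    ]


rng = random.Random(13)  # fixed seed so the run is repeatable

# Hypothetical symmetric population: normal, mean 50, standard deviation 10.
normal_gaps = gaps_across_samples(lambda: rng.gauss(50, 10))
# Hypothetical skewed population: exponential with mean 50 (rate = 1/50).
skewed_gaps = gaps_across_samples(lambda: rng.expovariate(1 / 50))

for label, gaps in (("normal", normal_gaps), ("exponential", skewed_gaps)):
    print(
        f"{label:>12}: typical gap = {statistics.median(gaps):6.2f}% of the mean, "
        f"spread across samples = {statistics.pstdev(gaps):.2f} points"
    )
```

In the symmetric case the gap hovers around zero, so whatever gap a single sample shows is mostly chance. In the skewed case the gap is large and similar in every sample, which is the kind of consistent pattern that points to extreme values in the data.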

When extreme values exist in the data, we consider them outliers. They are like outcasts in society. The other key feature of working with the normal distribution is the exclusion of outliers. We assume that outliers disrupt the normality of the data and that we need to remove them before analyzing it. That’s true for statistical discussions. However, from a philosophical perspective, it can be quite contentious to devalue the outliers for the sake of gaining a normality that is supposed to reflect reality. Hence, before we remove any outliers, we need to make sure that they do not play a major role in the character of the data. Sometimes the outliers are the main contributors to the interpretation of results, even though they are not the main contributors to the statistical analysis.
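One way to check how much the outliers matter before removing them is sketched below. The interquartile-range rule (Tukey fences with k = 1.5) is my choice for this illustration and not a method prescribed in this article, and the data set is entirely hypothetical: 995 ordinary values plus 5 extreme ones.

```python
import random
import statistics


def iqr_fences(values, k=1.5):
    """Return (low, high) Tukey fences: k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


rng = random.Random(13)  # fixed seed so the run is repeatable

# Hypothetical data: 995 ordinary values plus 5 extreme ones.
ordinary = [rng.gauss(100, 15) for _ in range(995)]
extreme = [900.0, 950.0, 1000.0, 1100.0, 1200.0]
data = ordinary + extreme

low, high = iqr_fences(data)
outliers = [v for v in data if v < low or v > high]
kept = [v for v in data if low <= v <= high]

print(f"flagged {len(outliers)} of {len(data)} values as outliers")
print(f"mean with outliers:    {statistics.fmean(data):.1f}")
print(f"mean without outliers: {statistics.fmean(kept):.1f}")
# How much of the overall total do the flagged values carry?
print(f"share of the total carried by flagged values: "
      f"{100 * sum(outliers) / sum(data):.1f}%")
```

If a handful of flagged values carries a noticeable share of the total, removing them changes the character of the data, and the decision deserves more thought than a mechanical filter.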

The third point relates to the spread of the data. Given the symmetry of the normal distribution, we assume that the left half below the average mirrors the right half above it. If the average income of a nation is US$2,000 and 1 million people earn that amount, we should expect roughly as many people earning US$1,500 (say, 600,000) as people earning US$2,500 (say, 590,000). From this perspective, we need to be sure that this assumption holds in reality before we even consider using the normal distribution to describe the data. I used income as an example, but I know that in reality the left half and the right half of an income distribution are unlikely to be symmetrical. So I wouldn’t contemplate using the normal distribution to describe the income levels of people in a country, especially in affluent nations like Singapore.
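The symmetry assumption can be checked directly by counting how many values fall in equal-width bands on either side of the average. The sketch below uses simulated figures only: a normal population with mean 2,000, and a lognormal population with the same mean standing in for a right-skewed income distribution. The lognormal shape and its parameter sigma = 0.6 are assumptions for illustration, not estimates from real income data.

```python
import math
import random
import statistics


def band_counts(values, offset=500, half_width=100):
    """Count values in equal-width bands `offset` below and above the mean."""
    mean = statistics.fmean(values)
    below = sum(1 for v in values if abs(v - (mean - offset)) <= half_width)
    above = sum(1 for v in values if abs(v - (mean + offset)) <= half_width)
    return below, above


rng = random.Random(13)  # fixed seed so the run is repeatable
n = 100_000

# Symmetric case: normal with mean 2,000 and standard deviation 500.
symmetric = [rng.gauss(2000, 500) for _ in range(n)]

# Skewed case: lognormal with the same mean of 2,000.
sigma = 0.6  # assumed shape parameter
mu = math.log(2000) - sigma**2 / 2  # lognormal mean = exp(mu + sigma^2 / 2)
skewed = [rng.lognormvariate(mu, sigma) for _ in range(n)]

results = {"normal": band_counts(symmetric), "lognormal": band_counts(skewed)}
for label, (below, above) in results.items():
    print(f"{label:>10}: {below:6d} around mean-500 vs {above:6d} around mean+500 "
          f"(ratio {below / above:.2f})")
```

A ratio close to 1 is what the normal distribution promises. A ratio well above 1, as in the skewed case, means far more people sit below the average than above it, and the normal distribution is a poor description.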

Last, as most of us learned in school, the normal distribution forms the basis for more complex models such as linear regression. However, as I mentioned previously, we do not always observe normality in practice, even though normality reflects much of reality. Real data come in all forms, and any type of distribution may potentially describe them. Because the normal distribution offers only a limited capability to analyze all types of data, we have seen the rise of more complex algorithms such as Decision Trees, Support Vector Machines, and Particle Swarm Optimization. These are advanced methods used in data science. I’m pretty sure my data friends in DataFrens.sg would know some of them. I have personally used them for several years, and they can answer a wider range of real-world problems.

How about we get GPT4.0 to summarize this discussion?

To summarize, the passage presents a nuanced view of the normal distribution’s role in data science. It highlights the distribution’s symmetry and suitability for modeling natural phenomena, while noting that real-world data often deviate from this ideal, especially in the variability of the mean, median, and mode. The discussion acknowledges the limitations of the normal distribution in fully capturing complex data characteristics, such as the presence of outliers and the asymmetry in distributions like income levels. It also points to the importance of considering alternative distributions and advanced methods like Decision Trees, Support Vector Machines, and Particle Swarm Optimization, which can address a broader range of real-world data scenarios. The passage underscores the necessity of a context-driven approach in data analysis, rather than relying solely on traditional statistical models.

Dr. Daniel Koh

Daniel started off his career as a senior list researcher with a British publishing firm. Back then, his role involved sourcing contacts through the internet and entering data into the Microsoft Dynamics CRM system (Microsoft Dynamics CRM 3.0). Progressively, he explored the option of using Visual Basic scripting within Excel to automate the contact sourcing process.

He successfully developed and implemented the scripts, leading to a 95% increase in data entry efficiency. He then moved on to the role of CRM executive with Fuji Xerox Singapore.

As a CRM executive, he liaised with third-party vendors on technical enhancements of the CRM system (Microsoft Dynamics CRM 4.0 and 365). He also performed functional enhancements of the CRM system for hundreds of end users.

His notable achievement was the development of the CRM boy that led to a 98% improvement in data quality and data integrity in the CRM system. Following his Master’s studies in Consumer Insight with Nanyang Business School, he took on the role of Analytics instructor with Singapore Management University. He prepared class notes and technical walkthroughs, and taught Analytics to undergraduate students from various disciplines. Subsequently, he took on various consultant roles in the consultancy, manufacturing and information technology industries in Singapore.

He travelled to Paris, London, Sri Lanka, Japan and Malaysia to fulfill his role as a consultant. The cultural and professional exchanges between local and overseas data analytics practitioners gave him a very good overview of the expectations and motivations of people around the world. He also had a chance to relocate to the United States for one year, focusing on Operations Management.

Prior to his current freelance status, he took on the role of Data Science Lead in a Singaporean software company. His primary role was to develop Artificial Intelligence using logic, data science and machine learning techniques through in-depth, full-stack scripting. He also developed customized reporting for his customers. In his view, 95% of today’s reporting can be automated, which can free up staff from daily manual work.

He holds a Bachelor of Science in Marketing (BSc. Marketing Pass with Merit) from Singapore University of Social Sciences (from which he graduated as valedictorian), a Master of Science in Marketing and Consumer Insights (MSc. Marketing and Consumer Insights) from Nanyang Technological University, and a Doctor of Business Administration (DBA) from Swiss School of Business and Management.

Data Science
Statistics
Probability