A Median-Based Machine-Learning Approach for Predicting Random Sampling Bernoulli Distribution Parameter

12 trang Gia Huy 16/05/2022 2770

Download

Bạn đang xem tài liệu "A Median-Based Machine-Learning Approach for Predicting Random Sampling Bernoulli Distribution Parameter", để tải tài liệu gốc về máy bạn click vào nút DOWNLOAD ở trên

Tài liệu đính kèm:

a_median_based_machine_learning_approach_for_predicting_rand.pdf

Nội dung text: A Median-Based Machine-Learning Approach for Predicting Random Sampling Bernoulli Distribution Parameter

Vietnam Journal of Computer Science Vol. 6, No. 1 (2019) 17–28 #.c The Author(s) DOI: 10.1142/S2196888819500015 A Median-Based Machine-Learning Approach for Predicting Random Sampling Bernoulli Distribution Parameter Hoang Pham*,‡ and David H. Pham† *Department of Industrial and Systems Engineering School of Engineering, Rutgers University Piscataway, NJ 08854, USA †Department of Computer Science School of Arts and Sciences, Rutgers University Piscataway, NJ 08854, USA ‡hopham@soe.rutgers.edu Accepted 8 November 2018 Published 11 December 2018 In real-life applications, we often do not have population data but we can collect several samples from a large sample size of data. In this paper, we propose a median-based machine- learning approach and algorithm to predict the parameter of the Bernoulli distribution. We illustrate the proposed median approach by generating various sample datasets from Bernoulli population distribution to validate the accuracy of the proposed approach. We also analyze the eđectiveness of the median methods using machine-learning techniques including correction method and logistic regression. Our results show that the median-based measure outperforms the mean measure in the applications of machine learning using sampling distribution approaches. Keywords: Bernoulli distribution; median-based method; machine learning; sampling distribu- tion; prediction. Vietnam J. Comp. Sci. 2019.06:17-28. Downloaded from www.worldscientific.com 1. Introduction We are living in a world full of data. Data is everywhere. Millions of datasets are by 1.53.49.125 on 12/20/21. Re-use and distribution is strictly not permitted, except for Open Access articles. generated each and every day. When it comes to analyzing the data, analysts are often faced with the task of selection of methods to analyze the data. In the Big Data era, given a large dataset ¯lled with information involving lots of uncertainties, using diđerent methods and tools especially in the machine-learning (ML) techniques ‡Corresponding author. This is an Open Access article published by World Scienti¯c Publishing Company. It is distributed under the terms of the Creative Commons Attribution 4.0 (CC-BY) License. Further distribution of this work is permitted, provided the original work is properly cited. 17
18 H. Pham & D. H. Pham today we can get diđerent results.1 Some common measures in the ML resampling approaches in Data Analytics are the mean and median. Hence ¯nding ways to better use the median or mean to analyze data samples with better accuracy and to make better decisions is important. A median is the middle number in the dataset when the values in your dataset are put in order. The mean is mainly the average of all the values in the dataset that are included in the calculation. Many agents commonly report the household incomes using the median ¯gures rather than the sample means to re°ect the household incomes of an area. Similarly, school o±cials often report the salary of their graduates using the sample median instead of the sample mean. Does the median value represent a good measure rather than the mean? It will depend on the applications. Consider a group of house prices in area A consisting of 11 houses, for example. If the current estimated house prices in this area are reported as follows (in thousands of dollars): 550, 520, 480, 600, 580, 720, 650, 1500, 690, 710, 475, then the sample mean of the house price in area A is $747,000, while the house price using the sample median is $600,000. A reasonable accurate assessment of house prices in this given area seems to be around $600,000. If we consider the house with the high expense of $1.5 million in this area as an outlier, then in this case both the median and mean values are close to each other which are 590 and 597.5, respectively. Since the mean considers all the values in the calculation, it is easily eđected by the single highly expensive house. The median is however resistant even if we add another outlier to the sample. For example, if you have a sample size of ¯ve such as: 8, 3, 6, 20, and 10. The median is 8 while the mean (or average) is 9.4. If we then add another observation of 50, the median likely will not be notably ađected by such an outlier (if we believe this is an outlier) and is resistant and thus will not stray far from the current median value. The mean can easily be swayed by any outlier. In this case, the median is 9 whereas the mean becomes 16.17. When the sample size is large and it does not include any outlier or observation in the sample that stands far from the rest in the population, the mean value usually provides a better measure. If the data happens to include some outliers, the Vietnam J. Comp. Sci. 2019.06:17-28. Downloaded from www.worldscientific.com median however is less ađected by outliers. As for the distribution function, if there is a skewed distribution, the median can provide more accurate measure than the mean. by 1.53.49.125 on 12/20/21. Re-use and distribution is strictly not permitted, except for Open Access articles. In real-life applications we would not have complete population data but we can collect many samples of a large sample size. In this study, we propose a median-based machine-learning approach and algorithm to accurately predict the best parameter estimate of the Bernoulli distribution using the medians of sampling distribution. We analyze several sample datasets generated from Bernoulli population distribution to validate the accuracy of proposed approach and identify the eđectiveness of the median-based methods using machine-learning techniques. Our results show that the median-based measure always outperforms the mean measure in the applications of machine learning using sampling distribution approaches.
Predicting Random Sampling Bernoulli Distribution Parameter 19 2. The Median-Based ML Approach in Random Sampling Bernoulli Distribution Let X1; X2; ; Xn constitute a random sample of size n from a Bernoulli distribu- tion with parameter p, 0 < p < 1, which is an experiment with two outcomes, gen- erally referred to as success (x ẳ 1) and failure (x ẳ 0). A Bernoulli random variable X with success probability p has the probability density function fðxịẳpxð1 pị1x; x ẳ 0; 1: ð1ị The cumulative distribution function of X is 1 pxẳ 0; 1; FðxịẳPðX xịẳ 1 otherwise: The expected value of a Bernoulli random variable X is: EðXịẳp. If Z is the number of heads (or successes) in n Bernoulli trials where each trail has a success probability equal to p, then we know that Z has a binomial distribution n PðZ ẳ zịẳ pzð1 pịnz; z ẳ 1; 2; ; n: ð2ị z and that the mean of Z is EðZịẳnp. Let Z W ẳ ; n then W is an unbiased estimator of p. Estimating the proportion or success probability p is commonly challenging in real-life practice due to the fact that we do not have the population data. The use of Z/n to estimate p is very common and interesting. Di±cult problems in today's e-Commerce and Web applications can often allow for errors caused due to the uncertainty of the sample since it will depend on the applications and the sample size in data collection. For example, a drug-reaction surveillance program was carried out in a university hospital. Out of the total of 1000 patients monitored, it was found that 45 experienced some form of adverse reactions to their medications. If z is taken Vietnam J. Comp. Sci. 2019.06:17-28. Downloaded from www.worldscientific.com to be the number of patients (out of 1000) experiencing adverse reactions, it would be reasonable to assume that Z is binomial: 1000 z 1 1000z: by 1.53.49.125 on 12/20/21. Re-use and distribution is strictly not permitted, except for Open Access articles. PðZ ẳ zịẳ 45 p ð pị where p is the probability that a random patient has an adverse reaction. The unbiased estimator for p in this case would be 45 0:045: W ẳ 1000 ẳ Another example is on °ipping of coins to estimate the probability of landing head when °ipping a coin. An experiment was recently conducted with a group of
20 H. Pham & D. H. Pham Table 1. Results for a group of 30 students, each °ipping a coin 45 times. Student # Heads Tails Prob (H) Student # Heads Tails Prob (H) 1 25 20 0.555556 16 16 29 0.355556 2 22 23 0.488889 17 24 21 0.533333 3 21 24 0.466667 18 19 26 0.422222 4 23 22 0.511111 19 14 31 0.311111 5 23 22 0.511111 20 20 25 0.444444 6 17 28 0.377778 21 28 17 0.622222 7 12 33 0.266667 22 21 24 0.466667 8 25 20 0.555556 23 21 24 0.466667 9 27 18 0.6 24 27 18 0.6 10 30 15 0.666667 25 23 22 0.511111 11 26 19 0.577778 26 19 26 0.422222 12 23 22 0.511111 27 26 19 0.577778 13 24 21 0.533333 28 26 19 0.577778 14 25 20 0.555556 29 17 28 0.377778 15 20 25 0.444444 30 22 23 0.488889 343 332 Mean 323 352 Mean 0.508148 0.491852 0.508148 0.478519 0.521481 0.478519 The probability of getting head: 0.493333 30 students where each student °ipped the coin 45 times and we recorded the numbers of heads from their trials. The results are shown in Table 1. The estimated probability of a head landing up is given by 666 0:493333: W ẳ 1350 ẳ A week later, the same group with two additional students has repeated the same procedure, with each one °ipping the coin 45 times. The results are shown in Table 2. This time the estimated probability of a head up is 703 0:488194: W ẳ 1440 ẳ Vietnam J. Comp. Sci. 2019.06:17-28. Downloaded from www.worldscientific.com It is reasonable to observe that the results of these two experiments are not the same due to random sampling. An interesting question is that: How likely the probability of landing head would be by 1.53.49.125 on 12/20/21. Re-use and distribution is strictly not permitted, except for Open Access articles. 0.5 when you independently °ip the coin? As discussed earlier, we can only conduct sample experiments and obtain as many samples from the experiments, but we would not have the entire population data in general and even in this °ipping coins case. This leads to an interesting research problem: Is there a methodology that can be used to estimate the probability of success (or obtaining head) based on random sam- pling? In this study, we propose a median-based machine-learning sampling approach to estimate the success probability parameter p from Bernoulli population distribution. Let a sample of size n be taken from an in¯nite (or large) population with a density function fðxị. Suppose that the random variables X1; X2; ; Xn constitute
Predicting Random Sampling Bernoulli Distribution Parameter 21 Table 2. Results for a group of 32 students, each °ipping the coin 45 times. Student # Heads Tails Prob (H) Student # Heads Tails Prob (H) 1 19 26 0.422222 17 19 26 0.422222 2 23 22 0.511111 18 24 21 0.533333 3 25 20 0.555556 19 22 23 0.488889 4 25 20 0.555556 20 26 19 0.577778 5 21 24 0.466667 21 17 28 0.377778 6 21 24 0.466667 22 22 23 0.488889 7 18 27 0.4 23 22 23 0.488889 8 22 23 0.488889 24 18 27 0.4 9 26 19 0.577778 25 23 22 0.511111 10 22 23 0.488889 26 24 21 0.533333 11 26 19 0.577778 27 26 19 0.577778 12 19 26 0.422222 28 18 27 0.4 13 24 21 0.533333 29 20 25 0.444444 14 19 26 0.422222 30 19 26 0.422222 15 22 23 0.488889 31 26 19 0.577778 16 20 25 0.444444 32 25 20 0.555556 352 368 Mean 351 369 Mean 0.488889 0.511111 0.488889 0.4875 0.5125 0.4875 Probability of getting head: 0.488194 a sample of size n from an in¯nite population with a density function fðxị. Let Y1 ẳ minfX1; X2; ; Xng; Yn ẳ maxfX1; X2; ; Xng; Ymd ẳ medianfX1; X2; ; Xng: In general, Yr is called the rth-order statistic of the sample. The median of the random sample Ymd can be written as 8 >Ynỵ1 if n is odd; Yn ỵ Yn 1 if n is even: 2 2 2 ỵ 3 4 8 10 18 15 2 by 1.53.49.125 on 12/20/21. Re-use and distribution is strictly not permitted, except for Open Access articles. Given X1 ẳ , X2 ẳ , X3 ẳ , X4 ẳ , X5 ẳ , X6 ẳ , X7 ẳ , for example. Then Y1 ẳ 2, Y7 ẳ 18, and Ymd ẳ 8. A known result for the probability density of the rth-order statistic is given below. Theorem (Order statistics2). For a random sample of size n from an in¯nite population having value x and a continuous density fðxị, the probability density of the rth-order statistic Yr is given by n! g ðyịẳ ẵFðyịr1ẵ1 Fðyịnrfðyị: ð3ị r ðr 1ị!ðn rị!
22 H. Pham & D. H. Pham The distributions of order statistics from continuous populations have been found to be very useful in many real-life applications including survival data analysis, radar target detection, and image processing. However, the probability density function of order statistics from a discrete population often cannot be obtained in a closed form.3 A Bernoulli trial is an experiment where the outcome is one of the two possible outcomes, namely \success" or \failure"; head or tail. For simplicity, we denote these two outcomes as one and zero, respectively. The probability of getting one (or landing \head") is p and the probability of getting zero (or landing \tail") is (1 p). We now present our proposed algorithm sketch in detail and the algorithm using median-based machine-learning approach to generate the datasets for computing the estimated proportion parameter p drawn from Bernoulli population distribution. Proposed algorithm sketch We will need to generate m sample medians (means and other features) from k sequences of samples of size N drawn from Bernoulli (pị population distribution. The following sketch describes in detail the proposed algorithm: Step 1. A sequence of N independent Bernoulli trials (i.e., coin °ips) are drawn from Bernoulli distribution with a success probability parameter p (or probability of landing head). Calculate the estimated probability of success, p~, by dividing the total number of successes over the number of trials. That is, Total number of successes p~ ẳ : Number of trials ðNị In the °ipping a coin experiment, for example, if p ẳ 0:5 then the chance of landing heads is same as that of landing tails. It is called a \fair coin". If, on the other hand, there is a possibility that the chance of landing on one side is greater than the other, then we called it an \unfair coin". . Summary: A sequence of N independent sampling trials are drawn from Bernoulli (p) distribution where p is assigned. Obtain the estimated probability of p, p~. Vietnam J. Comp. Sci. 2019.06:17-28. Downloaded from www.worldscientific.com Step 2. Perform k sequences of N trials (from step 1) in each sequence. We now obtain k values of estimated probability p~. . Summary: Perform k sequences, each with N independent trials (from step 1). by 1.53.49.125 on 12/20/21. Re-use and distribution is strictly not permitted, except for Open Access articles. Construct (or form) a vector of k estimated proportion values, called the estimated probability vector p (i.e., vector with k 1). ~ Step 3. Create d features (dimensions) by the following. Create the ¯rst set of three features (dimensions) by calculating the median, mean, and the average of the ¯rst observation and the last observation from the original estimated probability vector p, respectively. Then continue to create another set of three features (dimensions) by~ calculating the median, mean and the average of the ¯rst observation and the last observation, respectively, from excluding % of the ¯rst observations as well as the
Predicting Random Sampling Bernoulli Distribution Parameter 23 last observations from the original estimated probability vector p. Similarly, continue the same procedure to create another set of three features from~ the remaining esti- mated probability vector. So we now obtain a set of d feature variables each with one observation. . Summary: Construct the ¯rst set of three features (dimensions) from the estimated probability vector p by calculating the median, mean, and the average of the ¯rst observation and the~ last observation from the original estimated probability vector p, respectively. Subsequently, continue to construct the next set of three features~ by calculating the median, mean, and the average of the ¯rst observation and the last observation, respectively, from excluding % of the ¯rst observations as well as the last observations from the original estimated probability vector p. Repeat the same procedure until we obtain d features. ~ Step 4. Repeat the same procedure (steps 1–3) for m experiments. . Summary: Repeat steps 1–3 until we obtain a matrix P of size (m d). Step 5. Obtain the median of the medians with respect to each column vector from matrix P. . Summary: Obtain the median from the medians of each column vector from matrix P. Step 6. Run the entire procedure for s times. Compare the results between the medians of medians of results based on the algorithm above and the true population distribution p. . Repeat steps 1–5 for s times and compare the results out of s trials. Proposed algorithm (1) Generate a sequence of N trials: x1; ; xN , from a Bernoulli (p) population distribution. (2) Obtain the following: Vietnam J. Comp. Sci. 2019.06:17-28. Downloaded from www.worldscientific.com P N 1 x p~ðjị ẳ iẳ i ; j ẳ 1; 2; ; k: N by 1.53.49.125 on 12/20/21. Re-use and distribution is strictly not permitted, except for Open Access articles. (3) Repeat the previous steps k times for a random sample of size N. (4) Calculate the median, mean, and average ¯rst and last observations (called half) as follows: medianðf1ị ẳ medianðp~ị; P k 1 p~ meanðf2ị ẳ iẳ i ; k p~1 ỵ p~ ðf3ị k : half ẳ 2
24 H. Pham & D. H. Pham (5) Generate f1, f2; ; fd features by repeating step (4) by excluding % each time from the set of k values [in step (3)]. (6) Repeat previous steps (1)–(5) m times. ; ; ; ; (7) Obtain: P ẳðp1 p2 pd1 pdị, a matrix of size (m d) with a set of d features with m values~ each.~ ~ ~ (8) The estimated proportion p is given by p^ ẳ median fmedianfpij for i ẳ 1; 2; ; mgg: ðjẳ1;2; ;dị ~ In summary, the sample medians of the unbiased parameter estimator of the Bernoulli distribution function have empirically shown to be the best estimate for the true population parameter p. We now present the main result as follows: Theorem (Median-Based ML). Consider k sequences of N independent Bernoulli trials, each with success probability p. Let pij represent the ith sample median of proportion (success probability) value of~ jth feature (or dimension) for i ẳ 1; 2; ; m and j ẳ 1; 2; ; d. The estimated success probability p, p^, which is an unbiased estimator of the population proportion p and will convert to the true p almost surely, is given p^ ẳ median fmedianfpij for i ẳ 1; 2; ; mgg; ðjẳ1;2; ;dị ~ where lim p^ ! p a:s: as N does not require to be large: N This shows that the estimated success probability p using the proposed median-based ML approach will approximately attain the true population parameter p of Bernoulli population distribution where N does not require to be large. 3. The Median-Based ML Approach Analytical Results Vietnam J. Comp. Sci. 2019.06:17-28. Downloaded from www.worldscientific.com In this section, we apply the proposed algorithm on °ipping of coin application. We ¯rst generate several sets of sample data from the Bernoulli population distribution for various values of p, probability of landing head. We wrote a program in R to by 1.53.49.125 on 12/20/21. Re-use and distribution is strictly not permitted, except for Open Access articles. obtain the results as discussed in this section. We generate m ẳ 64 sample medians containing d ẳ 15 features (or predictors) from k ẳ 256 samples of size N ẳ 200 drawn from Bernoulli (p) population distri- bution where p is given. Table 3 shows the comparison results of predicting the correct population p from the Bernoulli distribution using the median and mean sampling methods for p ẳ 0:4, 0.5, and 0.6. It shows that the proposed median method is very accurate in prediction of the proportion parameter of Bernoulli dis- tribution. In other words, we can obtain the true success probability parameter almost all the times using the proposed median-based machine-learning approach.
Predicting Random Sampling Bernoulli Distribution Parameter 25 Table 3. Results based on 20 predictions for k ẳ 256, m ẳ 64, and s ẳ 20. p ẳ 0:4 p ẳ 0:5 p ẳ 0:6 N Mean Median Mean Median Mean Median 10 7 16 12 20 6 17 20 6 16 6 20 4 15 30 4 14 3 20 5 15 40 0 16 0 20 1 16 50 0 17 0 18 0 16 100 0 17 3 18 0 15 200 2 18 1 18 0 15 1000 0 9 0909 For example, 18 out of 20 times the median method predicted exactly the same true population parameter where the true population is p ẳ 0:5, while the mean method provided the correct answer just once. We used the true population pa- rameter with the set of sampling values: m ẳ 64, d ẳ 15, k ẳ 256,andN ẳ 200 drawn from Bernoulli population distribution. In summary, given that N ẳ 200, k ẳ 256, m ẳ 64, s ẳ 20, and p ẳ 0:5. Pre- diction results are as follows: Median method: Accuracy ẳ 18 (out of 20 samples). Mean method: Accuracy ẳ 1 (out of 20 samples). Given the same set of sampling values: m ẳ 64, d ẳ 15, k ẳ 256, but N ẳ 30 drawn from Bernoulli population distribution, 20 out of 20 times the median-based method predicted the exact same population parameter value p ẳ 0:5, while the mean method provided just three correct answers of the true population parameter. It is worth noting that even for a small size N ẳ 10, the proposed median-based ML also predicts very well with 100% correct result for the proportion population parameter from Bernoulli population distribution. We observe that the percentage of predicting the correct parameter population Vietnam J. Comp. Sci. 2019.06:17-28. Downloaded from www.worldscientific.com result of Bernoulli distribution decreases when N is getting very large (N > 1000). Table 4 presents the ¯rst ¯ve out of 64 sample (m ẳ 64) data generated from the by 1.53.49.125 on 12/20/21. Re-use and distribution is strictly not permitted, except for Open Access articles. Table 4. First ¯ve out of the 64 sample data. p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 p13 p14 p15 1 1 0.501 0.503 0.500 0.501 0.503 0.503 0.501 0.503 0.505 0.502 0.513 0.505 0.503 0.513 2 1 0.500 0.508 0.495 0.500 0.508 0.495 0.499 0.508 0.495 0.499 0.508 0.495 0.499 0.508 3 0 0.499 0.495 0.495 0.498 0.495 0.495 0.499 0.495 0.495 0.499 0.495 0.495 0.496 0.495 4 1 0.501 0.510 0.500 0.499 0.510 0.500 0.499 0.498 0.500 0.498 0.498 0.500 0.499 0.495 5 1 0.501 0.503 0.500 0.501 0.503 0.500 0.501 0.503 0.500 0.499 0.503 0.500 0.499 0.510 Note: Here, p1 ẳ 1 if the estimate of p1 is the same as the population (i.e., predicted result is correct); otherwise p1 ẳ 0.
Vietnam J. Comp. Sci. 2019.06:17-28. Downloaded from www.worldscientific.com by 1.53.49.125 on 12/20/21. Re-use and distribution is strictly not permitted, except for Open Access articles. 26 .Pa .H Pham H. D. & Pham H. Fig. 1. Scatter data plots of all 15 predictors based on all the data.
Predicting Random Sampling Bernoulli Distribution Parameter 27 Fig. 2. Correlation diagram plots of all 15 variables (predictors). given values: m ẳ 64, d ẳ 15, k ẳ 256, s ẳ 20, p ẳ 0:5,andN ẳ 200 drawn from Bernoulli population distribution. Figures 1 and 2 show the scatter plots and the correlation plots of all 15 predictors based on the same given values above. Table 5 presents diđerent approaches using ML-based techniques such as re- Vietnam J. Comp. Sci. 2019.06:17-28. Downloaded from www.worldscientific.com gression, stepwise regression, correlation analysis, and important variables. The results indicate clearly that those predictors using the median approach have strong relationship with the dependent variable where it contains the entire 256 samples. by 1.53.49.125 on 12/20/21. Re-use and distribution is strictly not permitted, except for Open Access articles. Table 5. Variable selections based on various methods for N ẳ 200, k ẳ 256, m ẳ 64, s ẳ 20 and p ẳ 0:5. Correlation Regression Regression Relative importance Stepwise Predictors analysis (> 60%) (ANOVA) (based on 0.001 level) variables regression P2 P4 P5 P7 P10
28 H. Pham & D. H. Pham Table 6. Prediction results using logistic regression. Prediction Reference 001 010 139 Accuracy: 0.7692 95% CI: (0.4619, 0.9496) Sensitivity: 0.250000 Speci¯city: 1.000000 We also use logistic regression to ¯t the training dataset and then predict the results using the same dataset above where we split the dataset into 80% of training data and 20% of test data. The accuracy and speci¯city are signi¯cantly high with 77% and 100%, respectively (see Table 6). It shows that the model predicts very well for the future test data. In conclusion, using the median-based ML approach provides the best estimate of the Bernoulli distribution function rather than the mean approach. References 1. H. Pham (ed.), Springer Handbook of Engineering Statistics, Springer Handbooks (Springer, 2006). 2. H. A. David and H. N. Nagaraja, Order Statistics, 3rd edn. (John Wiley & Sons, 2003). 3. D. L. Evans, L. M. Leemis and J. H. Drew, The distribution of order statistics for discrete random variables with applications to bootstrapping, INFORMS J. Comput. 18 (2006) 19. Vietnam J. Comp. Sci. 2019.06:17-28. Downloaded from www.worldscientific.com by 1.53.49.125 on 12/20/21. Re-use and distribution is strictly not permitted, except for Open Access articles.