Random data consists of values drawn from a probability distribution rather than produced by a predictable pattern. The distribution determines how likely each value is, but any individual value cannot be known in advance.
Random data is useful in many experiments, such as simulating real-world data, creating synthetic data for machine learning training, and statistical sampling.
NumPy is a powerful package that supports many mathematical and statistical computations, including the generation of random data, from single values to multidimensional arrays and matrices.
This article will delve into how we can generate random data using NumPy. Let’s get started.
Generating Random Data with NumPy
First, ensure that the NumPy package is installed in your environment. If not, you can install it with pip by running pip install numpy in a terminal.
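A minimal sketch, using the standard library's importlib.util.find_spec, that checks whether NumPy is importable in the current environment before continuing:

```python
import importlib.util

# Look up NumPy without importing it, so a missing package does not raise.
numpy_installed = importlib.util.find_spec("numpy") is not None

if numpy_installed:
    import numpy as np
    print("NumPy version:", np.__version__)
else:
    # Terminal command, to be run outside the Python interpreter.
    print("NumPy not found; install it with: pip install numpy")
```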
Once the package is successfully installed, we can proceed to the main part of the article.
First, we set a seed for reproducibility. Numbers generated by a computer are pseudo-random: they appear random but are fully determined by a starting point, called the seed.
To set the seed in NumPy, use the following code:
import numpy as np
np.random.seed(101)
You can use any integer from 0 to 2**32 - 1 as the seed, which becomes our starting point. The np.random module in NumPy will be our main tool for this article.
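To see what reproducibility means in practice, the sketch below seeds the generator twice with the same value and compares the draws. The second half uses np.random.default_rng, the Generator interface available in newer NumPy versions (1.17 and later), as an alternative to the global seed. The variable names are illustrative only.

```python
import numpy as np

# Seeding twice with the same value restarts the same pseudo-random sequence.
np.random.seed(101)
first_run = np.random.rand(3)

np.random.seed(101)
second_run = np.random.rand(3)

print(np.array_equal(first_run, second_run))  # True: identical draws

# Newer NumPy code often uses a Generator object instead of the global seed.
rng_a = np.random.default_rng(101)
rng_b = np.random.default_rng(101)
print(np.array_equal(rng_a.random(3), rng_b.random(3)))  # True
```

A Generator keeps its state in the object itself, so separate parts of a program can hold independent random streams.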
Once we set the seed, we can generate random floating-point numbers with NumPy. Let’s generate five random floating-point numbers.
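A minimal sketch of this step, assuming the legacy np.random.rand function (np.random.random_sample would work as well):

```python
import numpy as np

np.random.seed(101)  # fixed seed so the draw is repeatable

# Five floats drawn uniformly from the half-open interval [0, 1).
five_floats = np.random.rand(5)
print(five_floats)
```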
Output>>
array([0.51639863, 0.57066759, 0.02847423, 0.17152166, 0.68527698])
We can also obtain multidimensional arrays using NumPy. For example, the following code generates a 3×3 array filled with random floating-point numbers.
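One way to write this, again assuming np.random.rand, is to pass each dimension as a separate argument:

```python
import numpy as np

# Each positional argument is one dimension: 3 rows by 3 columns.
# The values depend on the current generator state, so they change
# from run to run unless a seed is set first.
matrix = np.random.rand(3, 3)
print(matrix)
```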
Output>>
array([[0.26618856, 0.77888791, 0.89206388],
[0.0756819 , 0.82565261, 0.02549692],
[0.5902313 , 0.5342532 , 0.58125755]])
Next, we can generate random integers within a specific range. The lower bound is inclusive and the upper bound is exclusive, so the following code draws five integers from 1 to 999:
np.random.randint(1, 1000, size=5)
Output>>
array([974, 553, 645, 576, 937])
So far, all the samples have followed a uniform distribution, meaning each value in the range had an equal chance of occurring. If we repeated the generation process many times, the relative frequencies of the values would converge toward equality.
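The equal-frequency claim can be checked empirically. The sketch below uses a hypothetical die-rolling setup (integers 1 to 6, 60,000 draws) and counts how often each value appears:

```python
import numpy as np

np.random.seed(101)

# Draw many integers from 1 to 6 (the upper bound 7 is exclusive), like die rolls.
rolls = np.random.randint(1, 7, size=60_000)

# Count how often each face appears; index 0 is unused, so drop it.
counts = np.bincount(rolls, minlength=7)[1:]
frequencies = counts / rolls.size

print(frequencies)  # each entry should be close to 1/6
```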
We can generate random data from various distributions. Here, we generate ten random data points from the standard normal distribution.
np.random.normal(0, 1, 10)
Output>>
array([-1.31984116, 1.73778011, 0.25983863, -0.317497 , 0.0185246 ,
-0.42062671, 1.02851771, -0.7226102 , -1.17349046, 1.05557983])
The above code draws values from the standard normal distribution, which has a mean of zero (the first parameter) and a standard deviation of one (the second parameter). Such values are often called Z-scores.
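With a larger sample, the mean and standard deviation can be verified directly. The sketch below also shifts and scales the draws to an arbitrary, illustrative mean of 50 and standard deviation of 5:

```python
import numpy as np

np.random.seed(101)

# A large sample from the standard normal distribution (mean 0, std 1).
z = np.random.normal(0, 1, 100_000)
print(z.mean(), z.std())  # close to 0 and 1

# Shifting and scaling standard normal draws gives a normal with mean 50, std 5.
scaled = 50 + 5 * z
print(scaled.mean(), scaled.std())  # close to 50 and 5
```

Passing loc=50 and scale=5 to np.random.normal directly draws from the same distribution.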
We can generate random data following other distributions as well. Here’s how to use the Poisson distribution to generate random data.
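A minimal sketch of this step, assuming an average rate of 5 and ten draws:

```python
import numpy as np

# lam is the average number of events per interval; size is how many draws to make.
poisson_samples = np.random.poisson(lam=5, size=10)
print(poisson_samples)
```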
Output>>
array([10, 6, 3, 3, 8, 3, 6, 8, 3, 3])
The sample from the Poisson distribution simulates counts of random events that occur at a specific average rate (5 per interval here); each generated count varies around that average.
We can also generate random data following the binomial distribution.
np.random.binomial(10, 0.5, 10)
Output>>
array([5, 7, 5, 4, 5, 6, 5, 7, 4, 7])
The above code simulates experiments following the binomial distribution. Imagine performing ten coin flips (the first parameter, 10) with a 0.5 probability of heads (the second parameter); how many times does the coin show heads? As the result above shows, we repeated this experiment ten times (the third parameter).
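The coin-flip interpretation can be made explicit by simulating the individual flips and counting heads. This is a sketch for illustration, with hypothetical variable names; it follows the same distribution as np.random.binomial but will not produce the same numbers:

```python
import numpy as np

np.random.seed(101)

n_experiments, n_flips, p_heads = 10, 10, 0.5

# One row per experiment, one column per flip; True marks heads.
flips = np.random.rand(n_experiments, n_flips) < p_heads

# Counting heads per row gives one binomial draw per experiment.
heads_per_experiment = flips.sum(axis=1)
print(heads_per_experiment)
```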
Let’s try the exponential distribution with the following code.
np.random.exponential(1, 10)
Output>>
array([0.7916478 , 0.59574388, 0.1622387 , 0.99915554, 0.10660882,
0.3713874 , 0.3766358 , 1.53743068, 1.82033544, 1.20722031])
The exponential distribution models the time between independent events. For example, the above code can be interpreted as the waiting time until a bus arrives at a station, which is random but averages 1 minute (the first parameter, the scale).
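The sketch below checks that the average waiting time matches the scale parameter and turns the first few waits into arrival times. The sample size and variable names are illustrative:

```python
import numpy as np

np.random.seed(101)

# scale is the mean waiting time (1 minute here), i.e. 1 / rate.
waits = np.random.exponential(scale=1, size=100_000)
print(waits.mean())  # close to 1

# Cumulative sums of the waits give successive arrival times in minutes.
arrival_times = np.cumsum(waits[:5])
print(arrival_times)
```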
For advanced generation, you can combine results from different distributions to create data samples following a custom distribution. For example, 70% of the random data generated below follows a normal distribution, while the rest follows an exponential distribution.
def combined_distribution(size=10):
    # Split the sample: 70% normal, the remainder exponential,
    # so the two parts always add up to size
    n_normal = int(0.7 * size)
    n_exponential = size - n_normal
    # normal distribution
    normal_samples = np.random.normal(loc=0, scale=1, size=n_normal)
    # exponential distribution
    exponential_samples = np.random.exponential(scale=1, size=n_exponential)
    # Combine the samples
    combined_samples = np.concatenate([normal_samples, exponential_samples])
    # Shuffle the samples in place
    np.random.shuffle(combined_samples)
    return combined_samples

samples = combined_distribution()
samples
Output>>
array([-1.42085224, -0.04597935, -1.22524869, 0.22023681, 1.13025524,
0.74561453, 1.35293768, 1.20491792, -0.7179921 , -0.16645063])
Custom distributions like this are more flexible than any single distribution, especially when we want simulated data to resemble real-world cases, which are usually more complex.
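A variation on the same idea, sketched here with a hypothetical helper named mixture_distribution, picks the source distribution independently for every sample. The share of normal samples is then 70% on average rather than exactly 70%:

```python
import numpy as np

def mixture_distribution(size=10, p_normal=0.7):
    # Hypothetical helper: pick a component for every sample independently.
    use_normal = np.random.rand(size) < p_normal
    normal_samples = np.random.normal(loc=0, scale=1, size=size)
    exponential_samples = np.random.exponential(scale=1, size=size)
    # Keep the normal draw where use_normal is True, else the exponential draw.
    return np.where(use_normal, normal_samples, exponential_samples)

np.random.seed(101)
mixture = mixture_distribution(size=1000)
print(mixture[:5])
```

No shuffle is needed here, because the component is chosen at random position by position.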
Conclusion
NumPy is a powerful Python package for mathematical and statistical computations. It generates random data that can be used for various purposes, such as data simulations, synthetic data for machine learning, and more.
In this article, we explained how to generate random data with NumPy, from seeding and uniform samples to draws from common distributions and a custom combination of them.
Cornellius Yudha Wijaya is the Deputy Director of Data Science and a Data Writer. While working full-time at Allianz Indonesia, he enjoys sharing Python and data tips through social media and writing. Cornellius writes on various topics related to AI and machine learning.