Will we run out of machine learning data? Evidence based on dataset size trends

Preface#

Recently, artificial intelligence has become very popular. Abroad there are ChatGPT, Claude, Bing AI, and Google Bard; in China there are Wenxin Yiyan, iFlytek Spark, and others, all evolving rapidly. At the bottom of the ChatGPT webpage there is a line of small text indicating the current version, which has now reached "ChatGPT May 24 Version". Can these artificial intelligences keep evolving this quickly?

The essence of artificial intelligence can be described as simulating human intelligent behavior and decision-making processes through the use of mathematical models and algorithms.

Parameters are the adjustable variables in an artificial intelligence model that control its behavior and performance. The more parameters a model has, the more possibilities it can take into account and the more comprehensive its outputs can be; the values of those parameters are learned from training data, and the more training material there is, the better the model's outputs tend to be. Both increasing the number of parameters and improving their values require large amounts of data, yet at any given moment the data available for learning is finite.
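As a minimal illustration of this point (my own example, not from the paper or any of the models mentioned above), the two parameters of a tiny linear model are learned from training data by gradient descent; with more or better data, the parameter estimates improve:

```python
# Minimal sketch: a model's parameters (here w and b) are adjusted to fit
# training data, so more and better data generally means better parameters.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)              # training inputs
y = 3.0 * x + 0.5 + rng.normal(0, 0.1, 200)   # targets from a noisy linear rule

w, b = 0.0, 0.0                                # the model's two parameters
lr = 0.1
for _ in range(500):                           # gradient descent on mean squared error
    pred = w * x + b
    grad_w = 2 * np.mean((pred - y) * x)
    grad_b = 2 * np.mean(pred - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"learned parameters: w={w:.2f}, b={b:.2f}")  # close to the true 3.0 and 0.5
```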

The paper "Will we run out of ML data? Evidence from projecting dataset size trends" analyzes the question of "Will we run out of machine learning data?"

  • 2026: Exhaustion of "high-quality data"
  • 2030 to 2050: Exhaustion of all language data
  • 2030 to 2060: Exhaustion of all visual data

Below is the translation of the main parts of the paper:

(If you wish to read the original text directly, please click the link at the end.)


Based on our previous analysis of dataset size trends, we have predicted the growth of dataset sizes in the language and visual domains. We explore the limits of this trend by estimating the total amount of unlabeled data available in the coming decades.

Abstract#

We analyzed the growth of dataset sizes used in natural language processing and computer vision and extrapolated it with two methods: projecting the historical growth rate forward, and estimating the compute-optimal dataset size implied by projections of future compute availability. We also investigated the growth in data usage by estimating the total stock of unlabeled data available on the internet over the coming decades. Our analysis indicates that the stock of high-quality language data will be exhausted soon, possibly before 2026. By contrast, the stock of low-quality language data and of image data will be exhausted much later: between 2030 and 2050 for low-quality language data, and between 2030 and 2060 for image data. Our work suggests that, unless data efficiency improves significantly or new sources of data become available, the current trend of machine learning models relying on ever-larger datasets may slow down.

Key Points#

  • We used historical growth rates and calculated the optimal dataset size based on current scaling laws and existing computational availability to predict the growth of training datasets for visual and language models (Section III-A).
  • We also predicted the growth of the total stock of unlabeled data, including high-quality language data (Section III-B).
  • As of October 2022, language datasets have been growing exponentially at more than 50% per year, and the largest contains about 2e12 words (Section IV-A).
  • Currently, the stock of language data is growing at 7% per year, but our model predicts it will slow to 1% by 2100. This stock currently lies between 7e13 and 7e16 words, which is 1.5 to 4.5 orders of magnitude larger than the largest dataset currently in use (Section IV-B1).
  • Based on these trends, we are likely to exhaust language data between 2030 and 2050 (Section IV-D).
  • However, language models are typically trained on high-quality data. The stock of high-quality language data lies between 4.6e12 and 1.7e13 words, which is less than an order of magnitude larger than the largest dataset (Section IV-B2).
  • We are only an order of magnitude away from exhausting high-quality data, which is likely to occur between 2023 and 2027 (Section IV-D).
  • Compared to language data, projections of the future growth of image dataset sizes are less clear, since the historical trend appears to have stalled over the past four years (even though recent models use more data than ever before; see [1]). The growth rate nonetheless seems to lie between 18% and 31% per year, and the current largest dataset contains about 3e9 images (Section IV-A).
  • Currently, the stock of visual data is growing at 8% per year, but will eventually slow to 1% by 2100. Currently, it lies between 8.11e12 and 2.3e13 images, which is three to four orders of magnitude larger than the largest dataset currently in use (Section IV-C).
  • Based on these trend predictions, we are likely to exhaust visual data between 2030 and 2070 (Section IV-D).

I. Introduction#

Training data is one of the three main factors determining the performance of machine learning (ML) models, acting in conjunction with algorithms and computational power. According to the current understanding of scaling laws, future machine learning capabilities will heavily rely on the availability of large amounts of data for training large models [2, 3].

Previous research compiled a database containing over 200 training datasets for machine learning models [1] and estimated the historical growth rates of dataset sizes for visual and language models.

To understand the limits of this trend, we developed probabilistic models to estimate the total amount of image and language data available from 2022 to 2100. Based on our predictions of dataset size trends, we further estimated the limits of these trends due to the exhaustion of available data.

II. Previous Research#

Data stock: There have been various estimates of the size of the internet and the amount of information available on it [4, 5, 6]. In recent years, however, such reports have not broken the data down by type (e.g., images, videos, or blog posts), instead aggregating all data types into a single figure in bytes [7].

Data bottlenecks in machine learning: In [8], the authors estimated the stock of high-quality data and used scaling laws [3] to predict that, even with compute-optimal scaling, the data stock would not allow language models to be scaled to more than about 1.6 times the size of DeepMind's Chinchilla model [3]. We improve on this analysis by building an explicit model of dataset size growth, together with more detailed estimates of the data stock over time, which allows us to predict when datasets will become as large as the total data stock.


III. Research Methods#

A. Predicting the Growth of Training Dataset Sizes#


Previous research compiled historical trends of dataset sizes across different application domains (the domains included in Figure 2 are vision, language, recommendation, speech, drawing, and games; however, only the vision and language domains contain enough data for a meaningful trend) [1].

We define dataset size as the number of unique data points used for model training. Each domain has a different definition for "data point." Specifically, for language data, we define a data point as a word; for image data, we define a data point as an image. More details on this choice of dataset size metric can be found in [1].

Using historical trends and the size of the largest dataset used to date, we can estimate the future evolution of dataset sizes. However, this prediction assumes that past trends will continue indefinitely. In reality, there are limits to the amount of data that models can be trained on. One of the most important limitations is computational availability. This is because increasing the amount of training data for a given model requires additional computational resources, and the amount of computational resources available is limited by hardware supply and the cost of purchasing or renting hardware.

To account for this limitation, we made another prediction based on computational availability and computationally optimal dataset sizes. Scaling laws can be used to predict the optimal balance between model size and dataset size given a computational budget (in FLOPs) [2, 3]. Specifically, the optimal dataset size is proportional to the square root of the computational budget:

$$D \propto \sqrt{C}$$

Previous research [9] projected the future availability of compute for the largest training runs (Figure 3; note that this projection carries significant uncertainty and includes scenarios in which spending on compute grows by several orders of magnitude, reaching levels on the order of 1% of current GDP). We use these projections to estimate the compute-optimal training dataset size achievable in each future year.
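To make the square-root relationship concrete, here is a small sketch of my own. It relies on two common approximations that are not spelled out in this text: training cost C ≈ 6·N·D FLOPs for N parameters and D tokens, and a Chinchilla-style optimum of roughly 20 training tokens per parameter (D ≈ 20·N).

```python
# Sketch under the stated assumptions: C ≈ 6·N·D and D ≈ 20·N imply
# C ≈ 6·D²/20, so the compute-optimal dataset size scales as sqrt(C).
import math

def optimal_dataset_size(compute_flops: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal number of training tokens for a FLOP budget."""
    return math.sqrt(tokens_per_param * compute_flops / 6.0)

for c in (1e23, 1e24, 1e25):
    print(f"C = {c:.0e} FLOPs -> D ≈ {optimal_dataset_size(c):.2e} tokens")
# A 100x increase in compute raises the optimal dataset size only ~10x.
```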

B. Estimating the Rate of Data Accumulation#

In recent years, unsupervised learning has been used successfully to create foundation models, which are trained on large amounts of unlabeled data and can then be fine-tuned for many tasks with only small amounts of labeled data. In addition, unsupervised models can generate valuable pseudo-labels for unlabeled data [10]. For these reasons, we focus on the stock and accumulation rate of unlabeled data, even though the amount of labeled data required is comparatively small (Figure 4; note that while transfer learning greatly reduces the need for labeled data, it does not eliminate it entirely, and labeled data is usually harder to obtain than unlabeled data, so despite the smaller quantities required it may still become a bottleneck).

Before delving into the details, let us consider a theoretical framework for our expectations regarding the rate of data accumulation. The vast majority of data is user-generated and stored on social media platforms, blogs, forums, etc. Three factors determine how much content is produced over a given period: population size, internet penetration rate, and the average amount of data generated per internet user. Population size has been extensively studied, so we use standard United Nations forecast data [11]. The internet penetration rate (the proportion of the population using the internet) has grown from 0% in 1990 to 50% in 2018, and now exceeds 60% [12]. We model it as a sigmoid function of time and fit it to the data in [12].
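The sketch below illustrates this accumulation model in code. The functional forms and constants are rough stand-ins I chose for illustration (the paper fits its sigmoid to the data in [12] and uses the UN projections [11]); the per-user data rate in particular is a made-up placeholder.

```python
# Rough sketch of the accumulation model described above (illustrative constants):
# data produced per year ≈ population(t) × internet_penetration(t) × data_per_user,
# with penetration modeled as a sigmoid in time and data_per_user held constant.
import math

def population(year: float) -> float:
    """Crude logistic stand-in for UN population projections (people)."""
    return 10.4e9 / (1.0 + math.exp(-0.03 * (year - 1990)))

def internet_penetration(year: float) -> float:
    """Sigmoid fraction of the population online: ~0 in 1990, ~50% in 2018, >60% now."""
    return 1.0 / (1.0 + math.exp(-0.13 * (year - 2018)))

DATA_PER_USER = 10_000  # hypothetical words produced per internet user per year

def words_produced(year: float) -> float:
    return population(year) * internet_penetration(year) * DATA_PER_USER

for y in (2022, 2030, 2050, 2100):
    print(y, f"{words_produced(y):.2e} words/year")
```

Because both population and penetration saturate, yearly data production in this model eventually flattens out, which is why the stock's growth rate declines over time.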

The average amount of data generated by users varies according to geographical and temporal internet usage trends and is not easy to analyze (this would require considering the cultural, demographic, and socio-economic impacts of different countries and periods, which is beyond the scope of this paper). For simplicity, let us assume that the average amount of data generated by users remains constant over time.

This model closely matches the historical number of internet users (Figure 2). To check whether it can also predict internet data generation, we fit it to Reddit submission data and compared it with purely exponential and sigmoid models; our model fits the data better (see Appendix C for details).

C. Rate of Accumulation of High-Quality Data#

We have developed a model of the rate at which user-generated content accumulates. For language, however, such content is usually of lower quality than more specialized sources such as books or scientific papers. Models trained on the latter perform better [13], which is why such data is commonly used when training language models [14, 15, 3]. We know very little about data quality for image models or how to identify high-quality image data (beyond very rough metrics such as image resolution, there are other ways to assess image data quality; for example, comparing the robustness under distribution shift of image-text models trained on different commonly used datasets shows that no single dataset yields better robustness across all shifts [16]), so in this section we focus on language.

Due to our limited understanding of the trade-offs involved in using high-quality versus low-quality data, we provide separate estimates and growth predictions for high-quality and low-quality data. To determine high-quality data, we rely on practitioners' expertise and examine the composition of datasets used to train large language models. The most common sources in these datasets include books, news articles, scientific papers, Wikipedia, and filtered web content (filtered web content is selected from regular web content using quality proxy metrics, such as the number of upvotes for links shared on Reddit; datasets like MassiveWeb and WebText were constructed in this way; other common data sources include GitHub (for code), subtitles and transcriptions of educational videos, records of podcasts or parliamentary meetings, and emails).

A common feature of these sources is that their content has been filtered for usefulness or quality. For news, scientific articles, or open-source code, this filtering is enforced by professional standards (such as peer review); for Wikipedia, by time-tested community editing; and for filtered web content, by the active engagement of many users. Though imperfect, this criterion helps us identify other sources of high-quality data, and we adopt it as our working definition of high-quality data.

Some high-quality data, such as filtered web content and Wikipedia, is generated by dedicated contributors on the internet. This means we can model it with the same model used for general user-generated content.

However, other sources of high-quality data are generated by domain experts (such as scientists, authors, and open-source developers). In this case, the rate of generation is not determined by population or internet penetration but by the economic scale and the share of the economy allocated to creative fields (such as science and art).

Over the past 20 years, R&D spending in OECD countries has roughly accounted for 2% of their GDP [17]. Although this figure is slowly increasing, we will assume it remains stable. Therefore, the rate of data accumulation should be roughly proportional to the size of the world economy, which grows at about 4% per year. This prediction aligns with observed growth in scientific publications [18].

We estimate the share of these two categories (dedicated contributors and professionals) within high-quality data by examining existing datasets and classifying their subcomponents.

D. Limiting Factors#

There may be several reasons for errors in our estimates of the growth rate of dataset sizes:

  • In the future, less data may be needed to achieve the same level of performance. This possibility is particularly high, as there have been significant improvements in data efficiency in other fields [19, 8].

  • The availability of computational resources may grow more slowly than expected, due to potential barriers to technological efficiency, supply chain disruptions, or reduced willingness to invest.

  • The current scaling laws may be incorrect, as has happened in the past (in [2], the authors suggested that for every tenfold increase in computational resources, the size of the training dataset should increase fivefold. In a more recent study [3], they revisited this issue and suggested that for every tenfold increase in computational resources, the size of the training dataset should increase threefold.). Even without additional improvements in data efficiency, there may be better scaling methods that use less data.

  • Multimodal models may perform better through transfer learning, effectively expanding the database to include combinations of all data modalities.

Additionally, there are some limitations to our estimates of the data stock:

  • Synthetic data could make the effective data stock nearly unlimited. However, we are uncertain about how useful synthetic data is and about the cost of training on it.
  • The widespread adoption of autonomous vehicles may lead to an unprecedented number of road video recordings, which could significantly impact data generation.
  • Similarly, actors with large budgets (such as governments or large corporations) may be able to increase data production through sufficient investment, especially in high-quality data in niche areas. Some possibilities include extensive screen recording or large-scale surveillance.
  • We may find better ways to extract high-quality data from low-quality sources, such as by designing robust automated quality metrics.

IV. Analysis#

A. Trends in Training Dataset Sizes#

Previous research [1] identified historical growth rates of training datasets across different domains. Since the language and vision domains are the only ones with large amounts of data, we limit our analysis to these two domains. These trends are summarized in Table I.


B. Language Data#

1) Low-Quality Data#


We used five different models to estimate the volume and accumulation rate of data. Table II summarizes these different models, which are further illustrated in Figure 3a and explained in more detail in Appendix A. The integrated model estimates the current total stock to be between 6.9e13 and 7.1e16 words, with a current growth rate between 6.41% and 17.49% per year.

It is important to note that the high end of this estimate comes from our two least trusted, highly theoretical models. Our interpretation of this range is as follows: 1e14 words is very likely to be held by a single, well-funded participant like Google; 1e15 words is collectively held by all major participants (i.e., all tech companies); and 1e16 words is what humanity might be able to collectively produce through global, sustained efforts over many years, adopting some practices currently outside the Overton window, such as recording all text messages, phone calls, and video conferences.

Using the integrated data stock model as the upper limit on dataset expansion, we projected the size of the training dataset and found that it grows rapidly until it exhausts the data stock, after which its growth slows down significantly (Figure 3c).
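The following toy projection is a deliberate simplification of the paper's probabilistic models: it uses the round numbers quoted in the key points above (a 2e12-word dataset growing at about 50% per year, a stock of 7e13 to 7e16 words growing at about 7% per year) and treats the growth rates as constant, which the paper does not. Treat the output as illustrative only.

```python
# Toy projection: find the year in which an exponentially growing training
# dataset catches up with a (more slowly growing) data stock.

def exhaustion_year(dataset_0: float, dataset_growth: float,
                    stock_0: float, stock_growth: float,
                    start_year: int = 2022, horizon: int = 2100) -> int | None:
    dataset, stock = dataset_0, stock_0
    for year in range(start_year, horizon + 1):
        if dataset >= stock:
            return year
        dataset *= 1.0 + dataset_growth
        stock *= 1.0 + stock_growth
    return None  # not exhausted within the horizon

for stock in (7e13, 7e15, 7e16):   # low, middle, and high stock estimates
    print(f"stock {stock:.0e} words -> exhausted around {exhaustion_year(2e12, 0.50, stock, 0.07)}")
# -> roughly 2033, 2047, and 2053 for these three stock levels
```

Even this crude version lands in the same ballpark as the paper's 2030-2050 window for low-quality language data.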

Table II

2) High-Quality Data#

We studied the composition of several high-quality datasets and determined the scalability of each component to investigate high-quality data. We considered three datasets: The Pile [13], MassiveText [3], and the PaLM pre-training dataset [15].


From these datasets, we can see that high-quality datasets are typically composed of the following components: 50% user-generated content (such as Pile-CC, OpenWebText2, social media conversations, filtered web pages, MassiveWeb, C4), 15-20% books, 10-20% scientific papers, <10% code, and <10% news. Additionally, they all include known small high-quality datasets such as Wikipedia (Figure 4a).

We estimated the amount of available text in digitized books, public GitHub repositories, and scientific papers. Assuming that this text makes up 30% to 50% of a hypothetical high-quality dataset, we arrive at an estimate of roughly 9e12 words (range 4.6e12 to 1.7e13). We assume that the amount of high-quality data grows at 4-5% per year, in line with the global economy, as explained above (see Figure 4b). Details of the model can be found in Appendix A.

Using the high-quality data stock as the upper limit when projecting the growth of language datasets, rather than the low-quality stock, we found the same deceleration pattern, but the deceleration occurs earlier, starting before 2026 (Figure 4c).
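Reusing the toy exhaustion_year helper sketched in the low-quality section above with these high-quality numbers (about 9e12 words growing at roughly 4.5% per year, against a 2e12-word dataset growing at about 50% per year) reproduces the much earlier cut-off:

```python
# With the high-quality stock (~9e12 words, ~4.5%/year growth) instead of the
# full web-text stock, the toy crossover lands in the mid-2020s.
print(exhaustion_year(2e12, 0.50, 9e12, 0.045))  # -> 2027, inside the paper's 2023-2027 window
```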

C. Visual Data#


For the visual domain, we used two different estimation methods: one provided by Rise Above Research [20], and the other used a combination of images and videos published on the most popular social media platforms. The integrated model shows that the number of images on the internet today is between 8.11e12 and 2.3e13, with a current annual growth rate of about 8%. These models are summarized in Table III and Figure 5a.

Using the integrated data stock model as the upper limit on dataset expansion, we projected the size of training datasets based on both the historical trend and the compute-optimal extrapolation. Since we are unsure whether the recent high outliers indicate a new, faster growth trend, the historical projection is highly uncertain. The compute-based projection is also more uncertain than for language, because we do not understand the scaling laws of the vision domain as well (images can have different resolutions, which makes tokenizing images more variable than tokenizing text).

Similar to the language case, dataset size grows exponentially until it reaches the size of the data stock, after which its growth slows down significantly (Figure 5c).

We are unclear about the quality of unlabeled visual data and how to distinguish high-quality data, so we did not attempt to estimate it.

Table III

D. Will Data Become a Bottleneck?#

So far, we have found that the data stock grows far more slowly than training dataset sizes (see Figures 3c, 4c, and 5c). This means that if current trends continue, exhausting our data stock is inevitable. Furthermore, the high-quality data stock is much smaller than the low-quality stock. Dataset size projections based on historical trends and on compute availability are very similar in the initial years, but begin to diverge afterward.


We calculated, for each year, the probability that dataset sizes will have exhausted the data stock (Figure 6). Although there is considerable uncertainty about the exhaustion dates for low-quality language and visual data, it seems unlikely to happen before 2030 or after 2060. If current trends continue, however, the high-quality language stock will almost certainly be exhausted before 2027. The quantiles of these distributions are shown in Table IV.
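In the spirit of Figure 6 (though much cruder than the paper's probabilistic models), one can sample the uncertain inputs and reuse the toy exhaustion_year helper from above to get a spread of exhaustion years. The sampling ranges below are my own illustrative choices based on the figures quoted earlier, not the paper's distributions.

```python
# Crude Monte Carlo over uncertain inputs for low-quality language data.
import random

random.seed(0)
years = []
for _ in range(10_000):
    stock = 10 ** random.uniform(13.85, 16.85)   # log-uniform over ~7e13..7e16 words
    growth = random.uniform(0.40, 0.60)          # uncertain dataset growth rate
    y = exhaustion_year(2e12, growth, stock, 0.07)
    if y is not None:
        years.append(y)

years.sort()
print("median:", years[len(years) // 2],
      "90% interval:", years[int(0.05 * len(years))], "-", years[int(0.95 * len(years))])
```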

V. Discussion#

The scaling laws of language models indicate that scalability depends on the amount of available data [3, 8]. From this perspective, about half of the improvements in language models over the past four years have come from training on more data. If there is no further room for data expansion, this will lead to a slowdown in artificial intelligence progress.

Whether extrapolating from historical trends or from compute constraints, language and visual data accumulate far more slowly than the growth in dataset sizes we have observed to date. We may therefore face a training data bottleneck: for language models between 2030 and 2040, and for image models between 2030 and 2060 (Figure 6).

This is particularly evident for high-quality language data, which seems likely to be exhausted before 2027. It is currently unclear whether larger quantities of lower-quality data can substitute for high-quality data, but even if they can, this would not be enough to avoid a slowdown entirely, because our ability to scale up training datasets is also limited by compute availability.

Based on these predictions, one might think that a slowdown is inevitable. However, we have ample reason to believe that our models have not fully captured the evolution of machine learning progress (see the Limiting Factors section).

In particular, the future evolution of data efficiency and the impact of data quality on performance are crucial for predicting future data demands. Unfortunately, our understanding of these variables is not sufficient to provide detailed predictions. Future work could attempt to incorporate these considerations into the analysis.

VI. Conclusion#

We have projected the growth of training dataset sizes and of the data stock. The data stock grows far more slowly than dataset sizes, so if current trends continue, datasets will eventually stop growing because the data has been exhausted. According to our model, this could happen between 2030 and 2040 for language data, and between 2030 and 2060 for visual data. In addition, high-quality language data will be exhausted before 2026.

If our assumptions are correct, data will become the primary bottleneck for scaling machine learning models, and we may see a slowdown in artificial intelligence progress as a result. However, as mentioned earlier, there are multiple reasons to doubt that these trends will continue as predicted, such as the potential for algorithmic innovations in data efficiency.


Others#

For references and other sections, please refer to the original text: Will we run out of ML data? Evidence from projecting dataset size trends

So what happens after the depletion of data resources?

There are several potential solutions and possible directions for development:

  1. Data augmentation: data augmentation generates additional training samples from existing data. By applying transformations, perturbations, and synthesis methods, the scale and diversity of training data can be expanded, helping models learn and generalize better even when the original dataset is limited (a minimal sketch follows this list).
  2. Transfer learning: transfer learning reuses existing knowledge and models to solve new problems. By applying a pre-trained model, or parts of one, to a new task, existing knowledge and experience can be leveraged, reducing reliance on large amounts of new data and accelerating training when data is scarce.
  3. Reinforcement learning and self-learning: reinforcement learning learns optimal behavior through interaction with an environment and can adapt better than traditional supervised learning when data is limited. Self-learning techniques, in addition, let machines actively gather information and experience from their environment and improve through continuous self-training and exploration.
  4. Data sharing and collaboration: when data resources are limited, collaboration and data sharing between institutions, researchers, or companies can accelerate model progress. Provided privacy protection and data security principles are respected, reasonable data sharing offers more possibilities for the development of machine learning.
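As a concrete example of the first item, here is a minimal text-augmentation sketch (my own illustration, not a technique from the paper): it produces extra training variants of a sentence by randomly deleting and swapping words, in the spirit of simple "easy data augmentation" methods.

```python
# Minimal text augmentation: random word deletion and random word swaps.
import random

def augment(sentence: str, n_variants: int = 3, p_delete: float = 0.1,
            n_swaps: int = 1, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    words = sentence.split()
    variants = []
    for _ in range(n_variants):
        new = [w for w in words if rng.random() > p_delete]  # random deletion
        for _ in range(n_swaps):                             # random swap
            if len(new) > 1:
                i, j = rng.sample(range(len(new)), 2)
                new[i], new[j] = new[j], new[i]
        variants.append(" ".join(new))
    return variants

print(augment("training data is one of the main factors determining model performance"))
```

Real pipelines typically combine such cheap perturbations with stronger techniques, such as back-translation or paraphrasing by another model.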

It should be noted that the above solutions are not exhaustive, and the field of machine learning is still evolving and innovating. More technologies and methods to address data scarcity may emerge in the future. Additionally, with technological advancements and new data collection methods, we can also expect more data resources to become available, further promoting the development of machine learning.
