New patent for DeepSeek: Reduces network resource consumption during data collection

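IT Home, April 2: According to the China Patent Publication and Announcement website of the China National Intellectual Property Administration, a patent titled "A Breadth Data Collection Method and System," filed by DeepSeek affiliate Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd., was published on April 1.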

The patent abstract shows:

The beneficial effects of the invention are: it discovers as many web links as possible while reducing the traffic impact on the website; it analyzes downloaded content to infer the quality of links that have not yet been downloaded, which reduces low-quality and duplicate page downloads, improves data quality and download efficiency, and reduces network resource consumption during data collection. A separate information backfill queue is used to ensure the atomicity and stability of modifications to the web page information base.
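The abstract does not say how link quality is inferred or how duplicates are filtered. As a rough illustration only, the sketch below assumes a simple heuristic: an undownloaded URL inherits the average quality score of pages already fetched from the same host. The `Frontier` class, its method names, and the 0-to-1 score scale are hypothetical and are not taken from the patent.

```python
import heapq
from collections import defaultdict
from urllib.parse import urlsplit


class Frontier:
    """Toy crawl frontier: skips duplicate URLs and orders the rest by inferred quality."""

    def __init__(self, default_score=0.5):
        self.default_score = default_score      # assumed score for hosts with no history
        self.seen = set()                       # every URL ever enqueued
        self.heap = []                          # entries: (-score, insertion order, url)
        self.counter = 0
        self.host_scores = defaultdict(list)    # observed page scores, grouped by host

    def inferred_score(self, url):
        """Guess a URL's quality from pages already downloaded on the same host."""
        scores = self.host_scores[urlsplit(url).netloc]
        return sum(scores) / len(scores) if scores else self.default_score

    def add(self, url):
        """Enqueue a newly discovered URL; return False if it is a duplicate."""
        if url in self.seen:
            return False
        self.seen.add(url)
        # The score is fixed at enqueue time; a fuller design might re-score later.
        heapq.heappush(self.heap, (-self.inferred_score(url), self.counter, url))
        self.counter += 1
        return True

    def record(self, url, score):
        """Store the measured quality of a downloaded page."""
        self.host_scores[urlsplit(url).netloc].append(score)

    def pop(self, min_score=0.0):
        """Return the best pending URL, or None if nothing reaches min_score."""
        if not self.heap or -self.heap[0][0] < min_score:
            return None
        return heapq.heappop(self.heap)[2]
```

In this sketch, raising `min_score` is what cuts low-quality downloads: links found on hosts whose earlier pages scored poorly stay in the queue and are never fetched.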

The background section says: In recent years, with the advancement of artificial intelligence technology, natural language processing (NLP) has made great progress. Many large language models (LLMs) have been trained in this field, which studies theories and methods for effective communication between humans and computers in natural language.

Training a large language model requires constructing a high-quality, diverse large language model dataset. This in turn requires collecting a large amount of high-quality text from web pages and processing it into model input for training.

However, existing data collection technologies have many problems: when crawling complex websites, they cannot obtain the full set of links; they tend to over-download, which can crash the target website; and they perform no content quality analysis or inference on downloaded pages, resulting in repeated or low-quality downloads that reduce the efficiency of data collection.
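The over-downloading problem is commonly handled with per-host politeness delays. The sketch below shows that general technique and is not the mechanism claimed in the patent; `HostThrottle`, `min_interval`, and the injectable `clock` parameter are names assumed here for illustration.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforces a minimum gap between consecutive requests to the same host."""

    def __init__(self, min_interval=1.0, clock=time.monotonic):
        self.min_interval = min_interval   # seconds between requests to one host
        self.clock = clock                 # injectable so the logic can be tested
        self.next_allowed = {}             # host -> earliest time of the next request

    def wait_time(self, url):
        """Seconds the caller should still wait before requesting this URL."""
        host = urlsplit(url).netloc
        if host not in self.next_allowed:
            return 0.0
        return max(0.0, self.next_allowed[host] - self.clock())

    def mark_request(self, url):
        """Record that a request to this URL's host is being sent now."""
        host = urlsplit(url).netloc
        self.next_allowed[host] = self.clock() + self.min_interval
```

A crawler loop would call `wait_time(url)` before each fetch, sleep for that long (or pick a URL on another host instead), and then call `mark_request(url)`.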

Therefore, when obtaining large amounts of web page data, collecting Internet data quickly, accurately, safely, and efficiently has become very important.