An n-gram collection extracted from the Portuguese Web
Description
The n-gram collection was extracted from the collected documents whose identified language was Portuguese. We extracted word n-grams up to the fifth order (5-grams). A set of regular expressions was applied to tokenize the text. After extraction, all n-grams containing a token longer than 32 characters were discarded, as were n-grams with a frequency below 5. The collection is available as a set of UTF-8 encoded files containing the n-grams and their frequencies (2010-11-10).
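The extraction steps described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the tokenizer pattern `TOKEN_RE`, the function names, and the choice to count from unigrams upward are assumptions made for illustration, since the description does not list the actual regular expressions or the lowest order kept. Only the three thresholds (order 5, 32 characters, frequency 5) come from the description.

```python
import re
from collections import Counter

MAX_ORDER = 5        # n-grams up to the fifth order (from the description)
MAX_TOKEN_LEN = 32   # n-grams with a longer token are discarded (from the description)
MIN_FREQ = 5         # n-grams with a lower frequency are discarded (from the description)

# Hypothetical tokenizer: runs of word characters, or single punctuation marks.
# The original regular expressions are not given in the description.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into tokens using the assumed pattern above."""
    return TOKEN_RE.findall(text)


def count_ngrams(documents, max_order=MAX_ORDER):
    """Count every word n-gram of order 1..max_order over all documents."""
    counts = Counter()
    for text in documents:
        tokens = tokenize(text)
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def filter_ngrams(counts, max_token_len=MAX_TOKEN_LEN, min_freq=MIN_FREQ):
    """Drop rare n-grams and n-grams that contain an overlong token."""
    return {
        ngram: freq
        for ngram, freq in counts.items()
        if freq >= min_freq and all(len(tok) <= max_token_len for tok in ngram)
    }
```

For example, `filter_ngrams(count_ngrams(docs))` applied to a list of Portuguese text strings returns a dictionary mapping token tuples to frequencies. Holding all counts in memory is only practical for small inputs; a web-scale crawl would need a distributed or on-disk counting strategy.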
This collection was built by David Batista, second-place winner of the Arquivo.pt Award 2021 with the work Politiquices.pt.
Related publication: https://www.davidsbatista.net/assets/documents/publications/WPT05_fala2010.pdf
Also published at Harvard Dataverse
Producer
Latest update
29 August 2023
License
Metadata quality:
Data description filled
Files documented
License filled
Update frequency followed
File formats are open
Temporal coverage filled
Spatial coverage filled
All files are available
Information
Tags
License
ID
64ee072ff1b5a534ce7a4ed3
Temporality
Creation
29 August 2023
Frequency
Punctual
Temporal coverage
1996/01/01 to 2022/12/02
Latest update
29 August 2023
Spatial coverage
Territorial coverage
Portugal
Territorial coverage granularity
Country