A n-grams collection extracted from the Portuguese Web
Arquivo.pt - search pages from the past
Arquivo.pt lets you search and access web pages preserved since 1996. Arquivo.pt is a public service managed by the Fundação para a Ciência e a Tecnologia (FCT) that continuously archives the content of websites of interest to the Portuguese community. Arquivo.pt provides…
Information
- License
- Creative Commons CCZero
- ID
- 64ee072ff1b5a534ce7a4ed3
Temporality
- Temporal coverage
- 1996/01/01 to 2022/12/02
- Frequency
- Punctual
- Creation date
- August 29, 2023
- Latest resource update
- August 29, 2023
Geographic dimensions
- Territorial coverage granularity
- Country
- Territorial coverage
- Portugal
Description
The n-grams collection was extracted from the collected documents whose identified language was Portuguese. We extracted word n-grams up to the fifth order (5-grams). A set of regular expressions was applied to tokenize the text. After the extraction, all n-grams containing a token longer than 32 characters were discarded. N-grams with frequencies below 5 were discarded as well. The n-grams collection is available as a set of UTF-8 encoded files, containing the n-grams and their frequencies (2010-11-10).
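The extraction steps described above can be illustrated with a minimal Python sketch. It is not the code used to build the collection: the tokenizer pattern `TOKEN_RE` and the function names are hypothetical, since the actual regular expressions are not given here. Only the three thresholds (order 5, 32 characters, frequency 5) come from the description.

```python
import re
from collections import Counter

# Hypothetical tokenizer: the regular expressions actually used for the
# collection are not documented here, so this pattern is only a placeholder.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*")

MAX_ORDER = 5        # word n-grams up to the fifth order
MAX_TOKEN_LEN = 32   # n-grams containing a longer token are discarded
MIN_FREQ = 5         # n-grams with a frequency below 5 are discarded


def tokenize(text):
    """Split text into word tokens using the placeholder pattern."""
    return TOKEN_RE.findall(text)


def count_ngrams(documents, max_order=MAX_ORDER,
                 max_token_len=MAX_TOKEN_LEN, min_freq=MIN_FREQ):
    """Count word n-grams over all documents and apply both filters."""
    counts = Counter()
    for text in documents:
        tokens = tokenize(text)
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                ngram = tuple(tokens[i:i + n])
                # Drop any n-gram that contains an over-long token.
                if any(len(tok) > max_token_len for tok in ngram):
                    continue
                counts[ngram] += 1
    # Keep only n-grams that reach the minimum frequency.
    return {ng: c for ng, c in counts.items() if c >= min_freq}
```

Calling `count_ngrams` on an iterable of document strings returns a dictionary that maps each surviving n-gram (a tuple of tokens) to its frequency. The sketch holds all counts in memory, which would not scale to a full web archive.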
This collection was built by David Batista, who won 2nd place in the Arquivo.pt Award 2021 with the work Politiquices.pt.
Related publication: https://www.davidsbatista.net/assets/documents/publications/WPT05_fala2010.pdf
Also published at Harvard Dataverse
Files 1