An n-gram collection extracted from the Portuguese Web

Metadata quality: 1.0/1
Data description filled
Resources documented
License filled
Update frequency followed
File formats are open
Temporal coverage filled
Spatial coverage filled
Updated on 29 August 2023, Creative Commons CCZero

Arquivo.pt - search pages from the past

Arquivo.pt lets you search and access web pages preserved since 1996. Arquivo.pt is a public service managed by the Fundação para a Ciência e a Tecnologia (FCT) that continuously archives the content of websites of interest to the Portuguese community. Arquivo.pt provides…

51 datasets

Information

License
Creative Commons CCZero
ID
64ee072ff1b5a534ce7a4ed3

Temporality

Temporal coverage
1996/01/01 to 2022/12/02
Frequency
One-off
Creation date
29 August 2023
Latest resource update
29 August 2023

Geographic dimensions

Territorial coverage granularity
Country
Territorial coverage
Portugal


Description

The n-gram collection was extracted from the collected documents whose identified language was Portuguese. We extracted word n-grams up to the fifth order (5-grams). A set of regular expressions was applied to tokenize the text. After extraction, all n-grams containing a token longer than 32 characters were discarded, as were n-grams with a frequency below 5. The collection is available as a set of UTF-8 encoded files containing the n-grams and their frequencies (2010-11-10).
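The extraction steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the original pipeline: the actual regular expressions are not given here, so the `TOKEN_RE` pattern is a simple stand-in, and the function names `tokenize`, `count_ngrams`, and `filter_ngrams` are hypothetical. Only the three thresholds (order up to 5, tokens of at most 32 characters, frequency of at least 5) come from the description.

```python
import re
from collections import Counter

# Assumption: a single word-character pattern stands in for the
# (unpublished here) set of tokenization regular expressions.
TOKEN_RE = re.compile(r"\w+")

MAX_ORDER = 5        # n-grams up to the fifth order
MAX_TOKEN_LEN = 32   # n-grams with longer tokens are discarded
MIN_FREQ = 5         # n-grams with lower frequency are discarded


def tokenize(text):
    """Split text into word tokens (no case folding is applied)."""
    return TOKEN_RE.findall(text)


def count_ngrams(documents, max_order=MAX_ORDER):
    """Count every word n-gram of order 1..max_order across all documents."""
    counts = Counter()
    for text in documents:
        tokens = tokenize(text)
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def filter_ngrams(counts, max_token_len=MAX_TOKEN_LEN, min_freq=MIN_FREQ):
    """Drop n-grams that are too rare or that contain an over-long token."""
    return {
        ngram: freq
        for ngram, freq in counts.items()
        if freq >= min_freq and all(len(tok) <= max_token_len for tok in ngram)
    }
```

Calling `filter_ngrams(count_ngrams(docs))` on a list of strings returns a dictionary that maps token tuples to their frequencies. Counting before filtering keeps the two thresholds independent, so either can be changed without re-tokenizing.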

This collection was built by David Batista, who won 2nd place in the Arquivo.pt Award 2021 with the work Politiquices.pt.

Related publication: https://www.davidsbatista.net/assets/documents/publications/WPT05_fala2010.pdf
Also published at Harvard Dataverse

Files 1

