A n-grams collection extracted from the Portuguese Web

Metadata quality: 1.0/1
Data description filled
Resources documented
License filled
Update frequency followed
File formats are open
Temporal coverage filled
Spatial coverage filled
Updated on August 29, 2023 — Creative Commons CCZero

Arquivo.pt - search pages from the past

Arquivo.pt lets you search and access web pages preserved since 1996. Arquivo.pt is a public service managed by the Fundação para a Ciência e a Tecnologia (FCT) that continuously archives the content of websites of interest to the Portuguese community. Arquivo.pt provides…

48 datasets

Information

License
Creative Commons CCZero
ID
64ee072ff1b5a534ce7a4ed3

Temporality

Temporal coverage
1996/01/01 to 2022/12/02
Frequency
Punctual
Creation date
August 29, 2023
Latest resource update
August 29, 2023

Geographic dimensions

Territorial coverage granularity
Country
Territorial coverage
Portugal

Description

The n-grams collection was extracted from the collected documents whose identified language was Portuguese. We extracted word n-grams up to the fifth order (5-grams). A set of regular expressions was applied to tokenize the text. After the extraction, all n-grams containing a token longer than 32 characters were discarded. N-grams with frequencies below 5 were discarded as well. The n-grams collection is available as a set of UTF-8 encoded files, containing the n-grams and their frequencies (2010-11-10).
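The extraction steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the tokenizer pattern `TOKEN_RE` and the function name `extract_ngrams` are hypothetical, since the actual regular expressions are not given here. Only the three thresholds (order 5, 32 characters, frequency 5) come from the description.

```python
import re
from collections import Counter

# Hypothetical tokenizer: the regular expressions actually used for this
# collection are not given in the description, so a simple word pattern
# stands in for them.
TOKEN_RE = re.compile(r"\w+")

MAX_ORDER = 5          # word n-grams up to the fifth order
MAX_TOKEN_LENGTH = 32  # n-grams containing a longer token are discarded
MIN_FREQUENCY = 5      # n-grams with a frequency below 5 are discarded


def extract_ngrams(documents):
    """Count word n-grams (orders 1..MAX_ORDER) and apply both filters."""
    counts = Counter()
    for text in documents:
        tokens = TOKEN_RE.findall(text)
        for n in range(1, MAX_ORDER + 1):
            # Slide a window of n tokens over the document.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1

    # Filtering happens after extraction, as in the description.
    return {
        ngram: freq
        for ngram, freq in counts.items()
        if freq >= MIN_FREQUENCY
        and all(len(token) <= MAX_TOKEN_LENGTH for token in ngram)
    }
```

Calling `extract_ngrams` on an iterable of document strings returns a dictionary that maps each surviving n-gram (a tuple of tokens) to its frequency. The sketch counts in memory for clarity; a collection at web scale would need a streaming or distributed count instead.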

This collection was built by David Batista, who won 2nd place in the Arquivo.pt Award 2021 with the work Politiquices.pt.

Related publication: https://www.davidsbatista.net/assets/documents/publications/WPT05_fala2010.pdf
Also published at Harvard Dataverse

Files 1

Community resources 0

Reuses 0
