CNPJ Source

Source

class publicbr.cnpj._source.CNPJSource(spark_session, file_dir)

Class used to extract CNPJ data.

Parameters
  • spark_session (pyspark.sql.SparkSession) – Spark Session used to manipulate data

  • file_dir (str) – Root directory where the data will be saved

spark

Spark session used in data manipulation

Type

pyspark.sql.SparkSession

raw_dir

Path to the directory used to store raw data

Type

str

trusted_dir

Path to the directory used to store cleaned data

Type

str

crawler

Object used to extract data from the public source

Type

CNPJCrawler

cleaners

Dict with the cleaners used to consolidate tables

Type

Dict[Cleaner]

create(download=True, overwrite=True, **kwargs)

Wrapper for method execution.

Parameters
  • download (bool) – Whether the raw files should be downloaded.

  • overwrite (bool) – Whether files that already exist should be overwritten.

  • **kwargs

    mode : str

    Write mode used when data already exists at the designated path:

      • append: Append the contents of the DataFrame to the existing data.

      • overwrite: Overwrite the existing data.

      • ignore: Silently ignore this operation.

      • error or errorifexists (default): Raise an error.

    n_partitions : int

    Number of DataFrame partitions

    partition_col : str

    Column used to partition the DataFrame on writing

    key :

    Other options passed to DataFrameWriter.options

Returns

returns an instance of the object

Return type

self
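The wrapper described above can be pictured as an extract step followed by a transform step, each returning the instance so that calls can be chained. The sketch below is a stand-in written for illustration only: `PipelineSketch` and its `calls` log are hypothetical and are not part of publicbr, and it assumes that `create` runs `extract` only when `download` is true, which the description suggests but does not state.

```python
class PipelineSketch:
    """Hypothetical stand-in mirroring the create/extract/transform layout."""

    def __init__(self):
        # Records which steps ran and with which arguments.
        self.calls = []

    def extract(self, overwrite):
        self.calls.append(("extract", overwrite))
        return self  # returning self allows method chaining

    def transform(self, **kwargs):
        self.calls.append(("transform", kwargs))
        return self

    def create(self, download=True, overwrite=True, **kwargs):
        # Assumption: the download flag gates the extract step.
        if download:
            self.extract(overwrite)
        return self.transform(**kwargs)


pipeline = PipelineSketch().create(download=False, mode="overwrite", n_partitions=8)
print(pipeline.calls)
```

With `download=False` only the transform step is recorded; with the defaults, the extract step would run first.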

extract(overwrite)

Extract data from public CNPJ data source, using the CNPJCrawler.

Parameters

overwrite (bool) – Whether files that already exist should be overwritten.

Returns

returns an instance of the object

Return type

self

transform(**kwargs)

Transform raw data extracted from public CNPJ data source.

Parameters

**kwargs

mode : str

Write mode used when data already exists at the designated path:

  • append: Append the contents of the DataFrame to the existing data.

  • overwrite: Overwrite the existing data.

  • ignore: Silently ignore this operation.

  • error or errorifexists (default): Raise an error.

n_partitions : int

Number of DataFrame partitions

partition_col : str

Column used to partition the DataFrame on writing

key :

Other options passed to DataFrameWriter.options

Returns

returns an instance of the object

Return type

self
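One way to read the keyword arguments above is as three named settings plus a remainder that is forwarded to the writer. The helper below is a minimal sketch of that split, not the package's code: `split_write_kwargs` is a hypothetical name, the `None` defaults for `n_partitions` and `partition_col` are assumptions, and the `uf` column and `compression` option are illustrative values only.

```python
def split_write_kwargs(**kwargs):
    """Hypothetical helper: separate the named settings from writer options."""
    settings = {
        # "error" is the documented default write mode.
        "mode": kwargs.pop("mode", "error"),
        # Assumed defaults: None means "not specified by the caller".
        "n_partitions": kwargs.pop("n_partitions", None),
        "partition_col": kwargs.pop("partition_col", None),
    }
    # Whatever is left would be forwarded to DataFrameWriter.options.
    return settings, kwargs


settings, writer_options = split_write_kwargs(
    mode="overwrite", partition_col="uf", compression="snappy"
)
print(settings)
print(writer_options)
```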

Crawler

class publicbr.cnpj._crawler.CNPJCrawler(save_dir)

Class used to extract CNPJ data from the public source.

Parameters

save_dir (str) – Path to where the downloaded data should be stored. The directory is created if it does not already exist.

base_url

URL containing all the files to be downloaded

Type

str

save_dir

Path to where the downloaded data should be stored

Type

str

files

Names of the files to be downloaded

Type

str

download_url(url, save_path) → None

Downloads the file at the given URL.

Parameters
  • url (str) – Full URL of the file, created by joining the base_url and a file name.

  • save_path (str) – Path of the destination file

Returns

returns an instance of the object

Return type

self
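Downloading a file to a destination path can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the crawler's implementation: `download_to_path` is a hypothetical name, and the package may well use a different HTTP client.

```python
import shutil
import urllib.request


def download_to_path(url, save_path):
    """Stream the resource at `url` into the file at `save_path`."""
    with urllib.request.urlopen(url) as response, open(save_path, "wb") as out:
        # Copy in chunks so large archives are not held in memory at once.
        shutil.copyfileobj(response, out)
```

Streaming matters here because the archives can be large; reading the whole response into memory before writing would be wasteful.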

get_data(overwrite) → None

Wrapper to download each file in files.

Parameters

overwrite (bool) – Whether files that already exist should be overwritten.

Returns

returns an instance of the object

Return type

self
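The interaction between the file list and the overwrite flag can be sketched as a planning step that decides which files still need downloading. `plan_downloads` is a hypothetical helper, and skipping files that already exist when `overwrite` is false is an assumption about the behaviour, not something the description confirms.

```python
import os
from urllib.parse import urljoin


def plan_downloads(base_url, files, save_dir, overwrite):
    """Return (url, save_path) pairs for the files that need downloading."""
    plan = []
    for name in files:
        save_path = os.path.join(save_dir, name)
        if os.path.exists(save_path) and not overwrite:
            continue  # keep the copy that is already on disk
        # base_url must end with "/" for urljoin to append the file name.
        plan.append((urljoin(base_url, name), save_path))
    return plan
```

Each resulting pair matches the `url` and `save_path` arguments that a download function would take.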

run(overwrite=True) → None

Wrapper for method execution.

Parameters

overwrite (bool) – Whether files that already exist should be overwritten.

Returns

returns an instance of the object

Return type

self

unzip() → None

Extract data from the downloaded zipped files.

Returns

returns an instance of the object

Return type

self
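Extracting the downloaded archives can be sketched with the standard `zipfile` module. `unzip_all` is a hypothetical name, and extracting into the same directory that holds the archives is an assumption made for the example.

```python
import zipfile
from pathlib import Path


def unzip_all(save_dir):
    """Extract every .zip archive found directly inside `save_dir`."""
    extracted = []
    for archive in sorted(Path(save_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(save_dir)
            # Keep track of the member names for later inspection.
            extracted.extend(zf.namelist())
    return extracted
```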

Consolidation

class publicbr.cnpj._consolidation.AuxCleaner(spark_session, file_dir, save_dir)

Class used to clean the auxiliary tables that compose the CNPJ data. Currently, they are the following:

  • CNAE

  • Municípios

  • Natureza Jurídica

  • País

  • Qualificação de Sócios

  • Motivo da Situação Cadastral

Parameters
  • spark_session (pyspark.sql.SparkSession) – Spark Session used to manipulate data

  • file_dir (str) – Path to where the raw data is stored.

  • save_dir (str) – Path to where the consolidated data should be stored

spark

Spark session used in data manipulation

Type

pyspark.sql.SparkSession

file_dir

Path to where the raw data is stored.

Type

str

save_dir

Path to where the consolidated data should be stored

Type

str

file_ids

Names used to identify the auxiliary tables

Type

List[str]

files

Names of the raw files

Type

str

clean(mode='error', n_partitions=32, **kwargs) → None

Wrapper for method execution.

Parameters
  • mode (str) – Write mode used when data already exists at the designated path:

      • append: Append the contents of the DataFrame to the existing data.

      • overwrite: Overwrite the existing data.

      • ignore: Silently ignore this operation.

      • error or errorifexists (default): Raise an error.

  • n_partitions (int) – Number of data partitions in execution

  • **kwargs

    partition_col : str

    Column used to partition the DataFrame on writing

    key :

    Other options passed to DataFrameWriter.options

Returns

returns an instance of the object

Return type

self
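The four write modes differ only in what happens when data already exists at the destination. The function below is a plain-Python sketch of that decision table: `resolve_write_action` and its return labels are hypothetical, and `FileExistsError` stands in for whatever exception Spark raises in the error case.

```python
VALID_MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}


def resolve_write_action(mode, path_exists):
    """Map a write mode and the destination state to the resulting action."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if not path_exists:
        return "write"  # nothing to conflict with, every mode just writes
    if mode == "append":
        return "append"
    if mode == "overwrite":
        return "replace"
    if mode == "ignore":
        return "skip"
    # "error" and "errorifexists" refuse to touch existing data.
    raise FileExistsError("data already exists at the destination path")


print(resolve_write_action("ignore", path_exists=True))  # skip
```

Because `error` is the default, running a cleaner twice against the same destination fails unless another mode is passed explicitly.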

define_schema(file_id) → str

Creates schema used in the file reading.

Parameters

file_id (str) – Name used to identify the auxiliary table

Returns

String specifying the schema of the DataFrame

Return type

str
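A schema expressed as a string is typically a DDL-formatted list of column names and types, a form that Spark's reader accepts. The sketch below builds such a string; `ddl_schema` is a hypothetical helper and the column names `codigo` and `descricao` are illustrative assumptions, not the columns the package actually defines.

```python
def ddl_schema(columns):
    """Join (name, type) pairs into a DDL-formatted schema string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


schema = ddl_schema([("codigo", "STRING"), ("descricao", "STRING")])
print(schema)  # codigo STRING, descricao STRING
```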

get_files() → List[str]

Gets the correct files of the auxiliary tables from the raw directory

Returns

Names of the auxiliary table files

Return type

List[str]

transform_data(df) → pyspark.sql.dataframe.DataFrame

Performs the necessary transformations to clean the raw data.

Parameters

df (pyspark.sql.dataframe.DataFrame) – Spark DataFrame of the read raw data

Returns

Spark DataFrame of the consolidated data

Return type

pyspark.sql.dataframe.DataFrame

class publicbr.cnpj._consolidation.EmpresasCleaner(spark_session, file_dir, save_dir)

Class used to clean the table containing general information about the company, such as share capital.

Parameters
  • spark_session (pyspark.sql.SparkSession) – Spark Session used to manipulate data

  • file_dir (str) – Path to where the raw data is stored.

  • save_dir (str) – Path to where the consolidated data should be stored

spark

Spark session used in data manipulation

Type

pyspark.sql.SparkSession

file_dir

Path to where the raw data is stored.

Type

str

save_dir

Path to where the consolidated data should be stored

Type

str

aux_paths

Dict with the paths to the auxiliary tables used in cleaning

Type

Dict[str]

int_dir

Path to directory of intermediary tables

schema

Schema used to read raw data

Type

str

df

Spark DataFrame of raw data

Type

pyspark.sql.dataframe.DataFrame

df_cleaned

Spark DataFrame of cleaned data

Type

pyspark.sql.dataframe.DataFrame

clean(mode='error', n_partitions=32, **kwargs) → None

Wrapper for method execution.

Parameters
  • mode (str) – Write mode used when data already exists at the designated path:

      • append: Append the contents of the DataFrame to the existing data.

      • overwrite: Overwrite the existing data.

      • ignore: Silently ignore this operation.

      • error or errorifexists (default): Raise an error.

  • n_partitions (int) – Number of data partitions in execution

  • **kwargs

    partition_col : str

    Column used to partition the DataFrame on writing

    key :

    Other options passed to DataFrameWriter.options

Returns

returns an instance of the object

Return type

self

define_schema() → None

Creates schema used in the file reading.

Returns

returns an instance of the object

Return type

self

transform_data() → None

Performs the necessary transformations to clean the raw data.

Returns

returns an instance of the object

Return type

self

class publicbr.cnpj._consolidation.EstabCleaner(spark_session, file_dir, save_dir)

Class used to clean the largest dataset, which contains the information about each company at the moment of registration, such as main economic activity, location, and contacts.

Parameters
  • spark_session (pyspark.sql.SparkSession) – Spark Session used to manipulate data

  • file_dir (str) – Path to where the raw data is stored.

  • save_dir (str) – Path to where the consolidated data should be stored

spark

Spark session used in data manipulation

Type

pyspark.sql.SparkSession

file_dir

Path to where the raw data is stored.

Type

str

save_dir

Path to where the consolidated data should be stored

Type

str

aux_paths

Dict with the paths to the auxiliary tables used in cleaning

Type

Dict[str]

int_dir

Path to directory of intermediary tables

schema

Schema used to read raw data

Type

str

df

Spark DataFrame of raw data

Type

pyspark.sql.dataframe.DataFrame

df_cleaned

Spark DataFrame of cleaned data

Type

pyspark.sql.dataframe.DataFrame

clean(mode='error', n_partitions=32, **kwargs) → None

Wrapper for method execution.

Parameters
  • mode (str) – Write mode used when data already exists at the designated path:

      • append: Append the contents of the DataFrame to the existing data.

      • overwrite: Overwrite the existing data.

      • ignore: Silently ignore this operation.

      • error or errorifexists (default): Raise an error.

  • n_partitions (int) – Number of data partitions in execution

  • **kwargs

    partition_col : str

    Column used to partition the DataFrame on writing

    key :

    Other options passed to DataFrameWriter.options

Returns

returns an instance of the object

Return type

self

define_schema() → None

Creates schema used in the file reading.

Returns

returns an instance of the object

Return type

self

transform_data() → None

Performs the necessary transformations to clean the raw data.

Returns

returns an instance of the object

Return type

self

class publicbr.cnpj._consolidation.SimplesCleaner(spark_session, file_dir, save_dir)

Class used to clean the Simples table, which contains data on mostly micro and small companies that opted into the Simples or MEI category.

Parameters
  • spark_session (pyspark.sql.SparkSession) – Spark Session used to manipulate data

  • file_dir (str) – Path to where the raw data is stored.

  • save_dir (str) – Path to where the consolidated data should be stored

spark

Spark session used in data manipulation

Type

pyspark.sql.SparkSession

file_dir

Path to where the raw data is stored.

Type

str

save_dir

Path to where the consolidated data should be stored

Type

str

file_path

Path to raw data

Type

str

save_path

Path to write cleaned data

Type

str

schema

Schema used to read raw data

Type

str

df

Spark DataFrame of raw data

Type

pyspark.sql.dataframe.DataFrame

df_cleaned

Spark DataFrame of cleaned data

Type

pyspark.sql.dataframe.DataFrame

clean(mode='error', n_partitions=32, **kwargs) → None

Wrapper for method execution.

Parameters
  • mode (str) – Write mode used when data already exists at the designated path:

      • append: Append the contents of the DataFrame to the existing data.

      • overwrite: Overwrite the existing data.

      • ignore: Silently ignore this operation.

      • error or errorifexists (default): Raise an error.

  • n_partitions (int) – Number of data partitions in execution

  • **kwargs

    partition_col : str

    Column used to partition the DataFrame on writing

    key :

    Other options passed to DataFrameWriter.options

Returns

returns an instance of the object

Return type

self

define_schema() → None

Creates schema used in the file reading.

Returns

returns an instance of the object

Return type

self

transform_data() → None

Performs the necessary transformations to clean the raw data.

Returns

returns an instance of the object

Return type

self

class publicbr.cnpj._consolidation.SociosCleaner(spark_session, file_dir, save_dir)

Class used to clean the table containing information about partners.

Parameters
  • spark_session (pyspark.sql.SparkSession) – Spark Session used to manipulate data

  • file_dir (str) – Path to where the raw data is stored.

  • save_dir (str) – Path to where the consolidated data should be stored

spark

Spark session used in data manipulation

Type

pyspark.sql.SparkSession

file_dir

Path to where the raw data is stored.

Type

str

save_dir

Path to where the consolidated data should be stored

Type

str

aux_paths

Dict with the paths to the auxiliary tables used in cleaning

Type

Dict[str]

int_dir

Path to directory of intermediary tables

schema

Schema used to read raw data

Type

str

int_path

Path to intermediary table written

df

Spark DataFrame of raw data

Type

pyspark.sql.dataframe.DataFrame

df_cleaned

Spark DataFrame of cleaned data

Type

pyspark.sql.dataframe.DataFrame

clean(mode='error', n_partitions=32, **kwargs) → None

Wrapper for method execution.

Parameters
  • mode (str) – Write mode used when data already exists at the designated path:

      • append: Append the contents of the DataFrame to the existing data.

      • overwrite: Overwrite the existing data.

      • ignore: Silently ignore this operation.

      • error or errorifexists (default): Raise an error.

  • n_partitions (int) – Number of data partitions in execution

  • **kwargs

    partition_col : str

    Column used to partition the DataFrame on writing

    key :

    Other options passed to DataFrameWriter.options

Returns

returns an instance of the object

Return type

self

define_schema() → None

Creates schema used in the file reading.

Returns

returns an instance of the object

Return type

self

transform_data() → None

Performs the necessary transformations to clean the raw data.

Returns

returns an instance of the object

Return type

self