CNPJ Source
Source
- class publicbr.cnpj._source.CNPJSource(spark_session, file_dir)
Class used to extract CNPJ data.
- Parameters
spark_session (pyspark.sql.SparkSession) – Spark Session used to manipulate data
file_dir (str) – Root directory where the data will be saved
- spark
Spark session used in data manipulation
- Type
pyspark.sql.SparkSession
- raw_dir
Path to the directory used to store raw data
- Type
str
- trusted_dir
Path to the directory used to store cleaned data
- Type
str
- crawler
Object used to extract data from the public source
- Type
publicbr.cnpj._crawler.CNPJCrawler
- cleaners
Dict with the cleaners used to consolidate tables
- Type
Dict[Cleaner]
- create(download=True, overwrite=True, **kwargs)
Wrapper for method execution.
- Parameters
download (bool) – Whether the raw files should be downloaded.
overwrite (bool) – Whether existing files should be overwritten.
**kwargs –
- mode : str
Write mode to use when data already exists at the destination path:
* append: Append the contents of the DataFrame to the existing data
* overwrite: Overwrite existing data
* ignore: Silently ignore this operation
* error or errorifexists (default): Raise an error
- n_partitions : int
Number of DataFrame partitions
- partition_col : str
Column used to partition the DataFrame on writing
- key :
Other options passed to DataFrameWriter.options
- Returns
returns an instance of the object
- Return type
self
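The way create relates to extract and transform can be sketched as below. This is a minimal stand-in, not the library's implementation: the class name PipelineSketch and its calls list are hypothetical, and the assumption that create runs extract only when download is true and then always runs transform is inferred from the parameter descriptions above.

```python
class PipelineSketch:
    """Hypothetical stand-in that mimics the documented method-chaining style."""

    def __init__(self):
        # Records which steps ran, so the control flow is easy to inspect.
        self.calls = []

    def extract(self, overwrite):
        self.calls.append(("extract", overwrite))
        return self  # returning self allows chained calls

    def transform(self, **kwargs):
        self.calls.append(("transform", kwargs))
        return self

    def create(self, download=True, overwrite=True, **kwargs):
        # Assumption: the download step is skipped when download is False.
        if download:
            self.extract(overwrite)
        return self.transform(**kwargs)


# Skip the download and only re-run the transformation step.
pipeline = PipelineSketch().create(download=False, mode="overwrite")
print(pipeline.calls)
```

Returning self from each step is what makes a call such as PipelineSketch().extract(True).transform() possible.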
- extract(overwrite)
Extract data from the public CNPJ data source, using the CNPJCrawler.
- Parameters
overwrite (bool) – Whether existing files should be overwritten.
- Returns
returns an instance of the object
- Return type
self
- transform(**kwargs)
Transform raw data extracted from the public CNPJ data source.
- Parameters
**kwargs –
- mode : str
Write mode to use when data already exists at the destination path:
* append: Append the contents of the DataFrame to the existing data
* overwrite: Overwrite existing data
* ignore: Silently ignore this operation
* error or errorifexists (default): Raise an error
- n_partitions : int
Number of DataFrame partitions
- partition_col : str
Column used to partition the DataFrame on writing
- key :
Other options passed to DataFrameWriter.options
- Returns
returns an instance of the object
- Return type
self
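One plausible way to handle the keyword arguments listed for transform is to pull out the named ones and forward the remainder as writer options. The helper below is a hypothetical sketch under that assumption: split_write_options is not part of the library, the None defaults for n_partitions and partition_col are guesses, and the compression option is only an illustrative extra.

```python
def split_write_options(**kwargs):
    """Separate the documented keyword arguments from extra writer options."""
    mode = kwargs.pop("mode", "error")  # "error" is the documented default mode
    n_partitions = kwargs.pop("n_partitions", None)  # assumed default
    partition_col = kwargs.pop("partition_col", None)  # assumed default
    # Whatever is left would be forwarded to DataFrameWriter.options.
    return mode, n_partitions, partition_col, kwargs


result = split_write_options(mode="overwrite", n_partitions=8, compression="snappy")
print(result)
```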
Crawler
- class publicbr.cnpj._crawler.CNPJCrawler(save_dir)
Class used to extract CNPJ data from the public source.
- Parameters
save_dir (str) – Path where the downloaded data should be stored. The directory is created if it does not already exist.
- base_url
URL containing all the files to be downloaded
- Type
str
- save_dir
Path to where the consolidated data should be stored
- Type
str
- files
Names of the files to be downloaded
- Type
str
- download_url(url, save_path) → None
Function that downloads data from the URL.
- Parameters
url (str) – Full URL of the file, created by joining the base_url and a file name.
save_path (str) – Path of the destination file
- Returns
returns an instance of the object
- Return type
self
- get_data(overwrite) → None
Wrapper to download each file in files.
- Parameters
overwrite (bool) – Whether existing files should be overwritten.
- Returns
returns an instance of the object
- Return type
self
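The loop described for get_data, together with the URL joining described for download_url, might look roughly like the sketch below. It only plans the downloads instead of performing them, so it runs offline. The function plan_downloads, the example.com URL, and the file names are hypothetical, and skipping existing files when overwrite is false is an assumption based on the parameter description.

```python
import os
import tempfile
from urllib.parse import urljoin


def plan_downloads(base_url, files, save_dir, overwrite):
    """Return (url, save_path) pairs for the files that still need downloading."""
    os.makedirs(save_dir, exist_ok=True)  # create the directory if it is missing
    plan = []
    for name in files:
        save_path = os.path.join(save_dir, name)
        if os.path.exists(save_path) and not overwrite:
            continue  # keep the file that is already on disk
        plan.append((urljoin(base_url, name), save_path))
    return plan


with tempfile.TemporaryDirectory() as tmp:
    # Pretend one file was downloaded in an earlier run.
    open(os.path.join(tmp, "a.zip"), "w").close()
    pending = plan_downloads(
        "https://example.com/data/", ["a.zip", "b.zip"], tmp, overwrite=False
    )
    pending_urls = [url for url, _ in pending]

print(pending_urls)
```

Note that urljoin only appends the file name when the base URL ends with a slash; without it, the last path segment is replaced.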
- run(overwrite=True) → None
Wrapper for method execution.
- Parameters
overwrite (bool) – Whether existing files should be overwritten.
- Returns
returns an instance of the object
- Return type
self
- unzip() → None
Extract data from the downloaded zipped files.
- Returns
returns an instance of the object
- Return type
self
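Extracting the downloaded archives can be illustrated with the standard zipfile module. The sketch below is hypothetical: unzip_all is not the library's method, and extracting into the same directory that holds the archives is an assumption. It builds a tiny archive first so it runs without any download.

```python
import os
import tempfile
import zipfile


def unzip_all(save_dir):
    """Extract every .zip archive found in save_dir into that same directory."""
    extracted = []
    for name in sorted(os.listdir(save_dir)):
        if not name.endswith(".zip"):
            continue
        with zipfile.ZipFile(os.path.join(save_dir, name)) as archive:
            archive.extractall(save_dir)
            extracted.extend(archive.namelist())
    return extracted


with tempfile.TemporaryDirectory() as tmp:
    # Build a small archive so the sketch is self-contained.
    with zipfile.ZipFile(os.path.join(tmp, "sample.zip"), "w") as archive:
        archive.writestr("sample.csv", "1;example\n")
    members = unzip_all(tmp)
    was_extracted = os.path.exists(os.path.join(tmp, "sample.csv"))

print(members, was_extracted)
```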
Consolidation
- class publicbr.cnpj._consolidation.AuxCleaner(spark_session, file_dir, save_dir)
Class used to clean the auxiliary tables that compose the CNPJ data. Currently, they are the following:
CNAE
Municípios
Natureza Jurídica
País
Qualificação de Sócios
Motivo da Situação Cadastral
- Parameters
spark_session (pyspark.sql.SparkSession) – Spark Session used to manipulate data
file_dir (str) – Path to where the raw data is stored.
save_dir (str) – Path to where the consolidated data should be stored
- spark
Spark session used in data manipulation
- Type
pyspark.sql.SparkSession
- file_dir
Path to where the raw data is stored.
- Type
str
- save_dir
Path to where the consolidated data should be stored
- Type
str
- file_ids
Names used to identify the auxiliary tables
- Type
List[str]
- files
Names of the raw files
- Type
str
- clean(mode='error', n_partitions=32, **kwargs) → None
Wrapper for method execution.
- Parameters
mode (str) – Write mode to use when data already exists at the destination path:
* append: Append the contents of the DataFrame to the existing data
* overwrite: Overwrite existing data
* ignore: Silently ignore this operation
* error or errorifexists (default): Raise an error
n_partitions (int) – Number of data partitions in execution
**kwargs –
- partition_col : str
Column used to partition the DataFrame on writing
- key :
Other options passed to DataFrameWriter.options
- Returns
returns an instance of the object
- Return type
self
- define_schema(file_id) → str
Creates the schema used to read the file.
- Parameters
file_id (str) – Name used to identify the auxiliary table
- Returns
String specifying the schema of the DataFrame
- Return type
str
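Since define_schema returns a string, one natural reading is a DDL-formatted schema of the kind Spark's DataFrameReader.schema accepts (for example "col0 INT, col1 DOUBLE"). Whether the library uses exactly this format is an assumption. The helper build_ddl_schema and the column names below are hypothetical and only show how such a string can be assembled.

```python
def build_ddl_schema(columns):
    """Join (name, type) pairs into a DDL-style schema string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


# Hypothetical two-column layout for a code/description lookup table.
schema = build_ddl_schema([("code", "STRING"), ("description", "STRING")])
print(schema)  # code STRING, description STRING
```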
- get_files() → List[str]
Gets the correct files of the auxiliary tables from the raw directory
- Returns
Names of the auxiliary table files
- Return type
List[str]
- transform_data(df) → pyspark.sql.dataframe.DataFrame
Performs the necessary transformations to clean the raw data.
- Parameters
df (pyspark.sql.dataframe.DataFrame) – Spark DataFrame of the read raw data
- Returns
Spark DataFrame of the consolidated data
- Return type
pyspark.sql.dataframe.DataFrame
- class publicbr.cnpj._consolidation.EmpresasCleaner(spark_session, file_dir, save_dir)
Class used to clean the table containing general information about the company, such as share capital.
- Parameters
spark_session (pyspark.sql.SparkSession) – Spark Session used to manipulate data
file_dir (str) – Path to where the raw data is stored.
save_dir (str) – Path to where the consolidated data should be stored
- spark
Spark session used in data manipulation
- Type
pyspark.sql.SparkSession
- file_dir
Path to where the raw data is stored.
- Type
str
- save_dir
Path to where the consolidated data should be stored
- Type
str
- aux_paths
Dict with the path to the auxiliary tables used in cleaning
- Type
Dict[str]
- int_dir
Path to directory of intermediary tables
- schema
Schema used to read raw data
- Type
str
- df
Spark DataFrame of raw data
- Type
pyspark.sql.dataframe.DataFrame
- df_cleaned
Spark DataFrame of cleaned data
- Type
pyspark.sql.dataframe.DataFrame
- clean(mode='error', n_partitions=32, **kwargs) → None
Wrapper for method execution.
- Parameters
mode (str) – Write mode to use when data already exists at the destination path:
* append: Append the contents of the DataFrame to the existing data
* overwrite: Overwrite existing data
* ignore: Silently ignore this operation
* error or errorifexists (default): Raise an error
n_partitions (int) – Number of data partitions in execution
**kwargs –
- partition_col : str
Column used to partition the DataFrame on writing
- key :
Other options passed to DataFrameWriter.options
- Returns
returns an instance of the object
- Return type
self
- define_schema() → None
Creates the schema used to read the file.
- Returns
returns an instance of the object
- Return type
self
- transform_data() → None
Performs the necessary transformations to clean the raw data.
- Returns
returns an instance of the object
- Return type
self
- class publicbr.cnpj._consolidation.EstabCleaner(spark_session, file_dir, save_dir)
Class used to clean the biggest dataset, which contains all the information about the company at the moment of registration, such as main economic activity, location, and contacts.
- Parameters
spark_session (pyspark.sql.SparkSession) – Spark Session used to manipulate data
file_dir (str) – Path to where the raw data is stored.
save_dir (str) – Path to where the consolidated data should be stored
- spark
Spark session used in data manipulation
- Type
pyspark.sql.SparkSession
- file_dir
Path to where the raw data is stored.
- Type
str
- save_dir
Path to where the consolidated data should be stored
- Type
str
- aux_paths
Dict with the path to the auxiliary tables used in cleaning
- Type
Dict[str]
- int_dir
Path to directory of intermediary tables
- schema
Schema used to read raw data
- Type
str
- df
Spark DataFrame of raw data
- Type
pyspark.sql.dataframe.DataFrame
- df_cleaned
Spark DataFrame of cleaned data
- Type
pyspark.sql.dataframe.DataFrame
- clean(mode='error', n_partitions=32, **kwargs) → None
Wrapper for method execution.
- Parameters
mode (str) – Write mode to use when data already exists at the destination path:
* append: Append the contents of the DataFrame to the existing data
* overwrite: Overwrite existing data
* ignore: Silently ignore this operation
* error or errorifexists (default): Raise an error
n_partitions (int) – Number of data partitions in execution
**kwargs –
- partition_col : str
Column used to partition the DataFrame on writing
- key :
Other options passed to DataFrameWriter.options
- Returns
returns an instance of the object
- Return type
self
- define_schema() → None
Creates the schema used to read the file.
- Returns
returns an instance of the object
- Return type
self
- transform_data() → None
Performs the necessary transformations to clean the raw data.
- Returns
returns an instance of the object
- Return type
self
- class publicbr.cnpj._consolidation.SimplesCleaner(spark_session, file_dir, save_dir)
Class used to clean the Simples table, which contains data mostly on micro and small companies that opted into the Simples or MEI category.
- Parameters
spark_session (pyspark.sql.SparkSession) – Spark Session used to manipulate data
file_dir (str) – Path to where the raw data is stored.
save_dir (str) – Path to where the consolidated data should be stored
- spark
Spark session used in data manipulation
- Type
pyspark.sql.SparkSession
- file_dir
Path to where the raw data is stored.
- Type
str
- save_dir
Path to where the consolidated data should be stored
- Type
str
- file_path
Path to raw data
- Type
str
- save_path
Path to write cleaned data
- Type
str
- schema
Schema used to read raw data
- Type
str
- df
Spark DataFrame of raw data
- Type
pyspark.sql.dataframe.DataFrame
- df_cleaned
Spark DataFrame of cleaned data
- Type
pyspark.sql.dataframe.DataFrame
- clean(mode='error', n_partitions=32, **kwargs) → None
Wrapper for method execution.
- Parameters
mode (str) – Write mode to use when data already exists at the destination path:
* append: Append the contents of the DataFrame to the existing data
* overwrite: Overwrite existing data
* ignore: Silently ignore this operation
* error or errorifexists (default): Raise an error
n_partitions (int) – Number of data partitions in execution
**kwargs –
- partition_col : str
Column used to partition the DataFrame on writing
- key :
Other options passed to DataFrameWriter.options
- Returns
returns an instance of the object
- Return type
self
- define_schema() → None
Creates the schema used to read the file.
- Returns
returns an instance of the object
- Return type
self
- transform_data() → None
Performs the necessary transformations to clean the raw data.
- Returns
returns an instance of the object
- Return type
self
- class publicbr.cnpj._consolidation.SociosCleaner(spark_session, file_dir, save_dir)
Class used to clean the table containing information about partners.
- Parameters
spark_session (pyspark.sql.SparkSession) – Spark Session used to manipulate data
file_dir (str) – Path to where the raw data is stored.
save_dir (str) – Path to where the consolidated data should be stored
- spark
Spark session used in data manipulation
- Type
pyspark.sql.SparkSession
- file_dir
Path to where the raw data is stored.
- Type
str
- save_dir
Path to where the consolidated data should be stored
- Type
str
- aux_paths
Dict with the path to the auxiliary tables used in cleaning
- Type
Dict[str]
- int_dir
Path to directory of intermediary tables
- schema
Schema used to read raw data
- Type
str
- int_path
Path to intermediary table written
- df
Spark DataFrame of raw data
- Type
pyspark.sql.dataframe.DataFrame
- df_cleaned
Spark DataFrame of cleaned data
- Type
pyspark.sql.dataframe.DataFrame
- clean(mode='error', n_partitions=32, **kwargs) → None
Wrapper for method execution.
- Parameters
mode (str) – Write mode to use when data already exists at the destination path:
* append: Append the contents of the DataFrame to the existing data
* overwrite: Overwrite existing data
* ignore: Silently ignore this operation
* error or errorifexists (default): Raise an error
n_partitions (int) – Number of data partitions in execution
**kwargs –
- partition_col : str
Column used to partition the DataFrame on writing
- key :
Other options passed to DataFrameWriter.options
- Returns
returns an instance of the object
- Return type
self
- define_schema() → None
Creates the schema used to read the file.
- Returns
returns an instance of the object
- Return type
self
- transform_data() → None
Performs the necessary transformations to clean the raw data.
- Returns
returns an instance of the object
- Return type
self