Data Exploration
Data exploration is the process of examining and understanding your dataset's structure and content. This guide will walk you through a typical dataset exploration workflow using qsv commands in your terminal.
For the following examples, we'll use a sample dataset from the Dubai Land Department (DLD) containing real estate transaction information. You can access the dataset here:
- Dataset overview: Dubai Pulse - DLD Transactions
- Direct download link: transactions.csv
Best Practices
- Be consistent: Use the same command structure and naming conventions throughout your exploration.
- Document your process: Keep track of the commands you run and their purposes.
- Iterative exploration: Revisit earlier steps as you gain new insights into your dataset.
- Version control: Save different versions of your cleaned and transformed data.
- Automate when possible: Create shell scripts for repetitive tasks.
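To put the last two practices into action, you can wrap the commands from this guide in a small shell script and save dated copies of its outputs. The sketch below is only a starting point; the input file name, the area_name_en column, and the output paths are assumptions you should adapt to your own project.
#!/bin/sh
# explore.sh - a repeatable exploration run (file and column names are assumed)
set -e
INPUT='DLD_Transactions_English_500.csv'
RUN_DIR="exploration_$(date +%Y%m%d)"   # one dated folder per run, for simple versioning
mkdir -p "$RUN_DIR"
qsv count "$INPUT" > "$RUN_DIR/row_count.txt"                   # dataset size
qsv headers "$INPUT" > "$RUN_DIR/columns.txt"                   # column names
qsv stats "$INPUT" | qsv table > "$RUN_DIR/summary_stats.txt"   # summary statistics
qsv frequency -s area_name_en "$INPUT" | qsv table > "$RUN_DIR/area_frequency.txt"   # value counts for one column
echo "Exploration outputs written to $RUN_DIR"
Keeping a script like this under version control documents exactly how each set of outputs was produced.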
Data Exploration Workflow with qsv
1. Initial Data Loading and Inspection
a. Preview the data as a table:
qsv sample 10 'DLD_Transactions_English_500.csv' \
| qsv table
This command randomly selects 10 rows from the CSV file, giving you a quick overview of the data structure and content.
b. Preview the data in a flattened format:
qsv sample 5 'DLD_Transactions_English_500.csv' \
| qsv flatten
This command shows 5 randomly sampled rows in a flattened format (one field per line), which is useful for reading records when the dataset has many columns.
c. Check dimensions:
qsv count 'DLD_Transactions_English_500.csv'
qsv headers 'DLD_Transactions_English_500.csv' \
| qsv table
The first command counts the number of rows in the file. The second command displays the header row, showing all column names.
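Optionally, you can also build an index of the file at this point. On large files, an index lets commands such as count, slice, and stats run much faster; for a small sample like this one it makes little difference:
qsv index 'DLD_Transactions_English_500.csv'
This creates an index file next to the CSV, which subsequent qsv commands pick up automatically.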
d. View the first few rows:
qsv slice --len 5 'DLD_Transactions_English_500.csv' \
| qsv table
This displays the first 5 rows of the dataset.
e. View the last few rows:
qsv slice --start -5 'DLD_Transactions_English_500.csv' \
| qsv table
This displays the last 5 rows of the dataset (a negative --start counts back from the end of the file).
2. Understand Data Structure
a. Examine data types and summarize the data:
qsv stats 'DLD_Transactions_English_500.csv' \
| qsv table
This command provides summary statistics for each column and displays them in a table.
b. Identify missing values:
qsv stats 'DLD_Transactions_English_500.csv' \
| qsv select field,nullcount \
| qsv table
This command chain shows the number of null values in each column in a tabular format.
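When a dataset has many columns, it helps to sort this output so the columns with the most missing values come first. One way to do that, assuming the stats output keeps the field and nullcount column names used above:
qsv stats 'DLD_Transactions_English_500.csv' \
| qsv select field,nullcount \
| qsv sort -s nullcount -N -R \
| qsv table
The -N flag sorts numerically and -R reverses the order, so the columns with the highest null counts appear at the top.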
c. Check for duplicate rows:
qsv dedup 'DLD_Transactions_English_500.csv' \
| qsv count
This deduplicates the rows and counts the result; if the count is lower than the row count from step 1c, the dataset contains duplicate rows.
d. Examine a specific column's data:
qsv select area_name_en 'DLD_Transactions_English_500.csv' \
| qsv table
This command selects and displays only the 'area_name_en' column.
e. Get unique values in a column:
qsv select area_name_en 'DLD_Transactions_English_500.csv' \
| qsv dedup \
| qsv table
This selects the 'area_name_en' column and removes duplicate rows, leaving only the unique values in that column.
f. Get the top N values in a column:
qsv frequency -s area_name_en --limit 5 'DLD_Transactions_English_500.csv' \
| qsv table
This displays the 5 most frequent values in the 'area_name_en' column, along with their counts.
3. Data Cleaning
a. Handle missing values (example: remove rows with missing values):
qsv search -s transaction_id '[^[:space:]]' 'DLD_Transactions_English_500.csv' > 'cleaned_transactions.csv'
qsv sample 10 'cleaned_transactions.csv' \
| qsv table
This removes rows with missing transaction_ids and displays a sample of the result.
b. Standardize data (example: convert dates to a consistent format):
qsv select transaction_id,instance_date 'cleaned_transactions.csv' \
| qsv datefmt --formatstr '%Y-%m-%d' instance_date \
| qsv table
This formats the instance_date column to "YYYY-MM-DD" and displays the result.
c. Remove duplicate rows:
qsv dedup 'cleaned_transactions.csv' > 'unique_transactions.csv'
qsv count 'unique_transactions.csv'
This removes duplicate rows and shows the count of unique transactions.
d. Fill missing values with a default:
qsv fill --default Unknown property_type_en 'cleaned_transactions.csv' \
| qsv table
This fills any missing values in the 'property_type_en' column with 'Unknown'.
e. Replace specific values:
qsv replace -s area_name_en 'Dubai' 'Dubai City' 'cleaned_transactions.csv' \
| qsv table
This replaces the value "Dubai" with "Dubai City" in the 'area_name_en' column (qsv replace matches a regex pattern, so it also rewrites "Dubai" wherever it appears inside longer values).
f. Remove rows with specific values:
qsv search -v -s area_name_en 'Dubai City' 'cleaned_transactions.csv' > 'filtered_transactions.csv'
qsv count 'filtered_transactions.csv'
This removes rows where the 'area_name_en' column contains "Dubai City" and shows the count of the resulting dataset.
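Once you are satisfied with each cleaning step individually, the same operations can be chained into a single pipeline. A rough sketch that reuses the commands and column names from steps a, c, and d above:
qsv search -s transaction_id '[^[:space:]]' 'DLD_Transactions_English_500.csv' \
| qsv dedup \
| qsv fill --default Unknown property_type_en \
> 'cleaned_transactions.csv'
This drops rows without a transaction_id, removes duplicates, and fills missing property types in one pass.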
4. Exploratory Data Analysis (EDA)
a. Examine unique values in a column:
qsv frequency -s area_name_en 'cleaned_transactions.csv' \
| qsv table
This counts how often each value appears in the 'area_name_en' column and displays the result in a table.
b. Basic statistical analysis:
qsv stats -s meter_sale_price 'cleaned_transactions.csv' \
| qsv table
This provides summary statistics for the meter_sale_price column in a tabular format.
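If you need more detail on a numeric column, qsv stats can compute additional measures (such as cardinality, median, quartiles, and mode) when you pass the --everything flag; this is slower but gives a fuller picture:
qsv stats --everything -s meter_sale_price 'cleaned_transactions.csv' \
| qsv table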
5. Data Transformation
a. Feature engineering (extract year from date):
qsv select transaction_id,instance_date 'cleaned_transactions.csv' \
| qsv datefmt --formatstr '%Y' instance_date \
| qsv table
This extracts just the year from the instance_date column and displays the result.
b. Data aggregation (example: count transactions by year):
qsv select transaction_id,instance_date 'cleaned_transactions.csv' \
| qsv datefmt --formatstr '%Y' instance_date \
| qsv frequency -s instance_date \
| qsv table
This counts the number of transactions per year and displays the result in a table.
6. Documentation and Reporting
Generate a summary report:
qsv stats 'cleaned_transactions.csv' \
| qsv table > 'summary_report.txt'
This generates summary statistics for all columns and saves them to a text file in a tabular format.
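You can extend this into a small reporting script that also writes a frequency report for each categorical column you care about. A sketch, assuming the column names used earlier in this guide:
#!/bin/sh
# report.sh - write a stats report plus per-column frequency reports (column names are assumed)
set -e
INPUT='cleaned_transactions.csv'
qsv stats "$INPUT" | qsv table > 'summary_report.txt'
for col in area_name_en property_type_en; do
  qsv frequency -s "$col" "$INPUT" | qsv table > "frequency_${col}.txt"
done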
7. Prepare Data for Modeling
Save processed data:
qsv select transaction_id,area_name_en,meter_sale_price,instance_date 'cleaned_transactions.csv' > 'prepared_data.csv'
qsv sample 10 'prepared_data.csv' \
| qsv table
This selects specific columns for modeling, saves them to 'prepared_data.csv', and displays a sample of the result.
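Before handing the prepared file to a modeling pipeline, you may also want to infer a schema for it and validate the file against that schema. The commands below are a sketch: qsv schema writes its inferred JSON Schema to a file named after the input (assumed here to be 'prepared_data.csv.schema.json'), and qsv validate then checks the data against it.
qsv schema 'prepared_data.csv'
qsv validate 'prepared_data.csv' 'prepared_data.csv.schema.json'
This gives downstream consumers of the data a machine-readable description of what to expect.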
This workflow demonstrates how to use qsv commands in your terminal to explore and prepare your data. Remember that data exploration is often an iterative process, and you may need to revisit earlier steps as you gain new insights into your dataset.
Note: Some advanced operations like visualization and hypothesis testing are not directly available in qsv. For these tasks, you might need to use the output from qsv with other tools or programming languages.