By admin
In today’s AI-driven landscape, the spotlight often shines on sophisticated machine learning models like neural networks, decision trees, and ensemble methods. In reality, though, most of the work happens before a model is ever trained. Data science isn’t just part of the machine learning process—it is the foundation.
Before we can trust predictions, optimize designs, or automate workflows, we must ensure the data behind these models is well-understood, clean, and structured. In fact, research and practical experience both show that about 80% of machine learning effort goes into data collection, cleaning, and exploration—not model tuning.
This isn’t just theory. At Sidian, when we analyzed a concrete strength dataset of 1,456 entries to determine optimal concrete mixes, nearly all our time was spent wrangling messy data, examining patterns, and preparing it for trustworthy predictions. Let’s break this process down and understand why these steps matter so much.
Why Data Science is the Core of AI Success
AI projects often fail because of a misunderstanding: people expect the algorithm to solve everything. But when project files, specifications, lab reports, and logs are poorly structured or scattered across systems, even the best algorithm can’t compensate.
Messy data = unreliable results.
That’s why a disciplined data science process is crucial. Without it, AI models may give outputs that seem impressive but are ultimately useless or misleading. Additionally, models need large quantities of data, and that data must be structured in a form the modeling software can actually use.
The 5 Key Steps in Machine Learning
A machine learning project moves through five essential stages:
- Business Understanding / Planning: Why are we analyzing this data? What decision do we want to support? This keeps projects goal-focused.
- Data Collection & Preparation: Gather data from all available sources: spreadsheets, databases, sensors, reports, even handwritten records. In our concrete strength study, this meant consolidating different lab test results and supplier batch records. Once collected, the data needs cleaning:
  - Removing outliers
  - Filling or removing missing values
  - Standardizing formats
  - Normalizing numerical ranges
- Exploratory Data Analysis (EDA): The most powerful and often overlooked step: examining distributions, checking correlations, plotting histograms and scatter plots. EDA is where we first “listen” to what the data is telling us.
- Model Building & Insights: Only after the data has been understood should we start building predictive models, not before. Once the model meets the goals set out in the planning stage, deployment can be considered.
- Deployment & Monitoring: After a model is built, its performance must be tracked over time to detect drift and changes in data patterns, ensuring sustained reliability.
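The cleaning operations listed under step 2 can be sketched in a few lines of pandas. This is a minimal illustration, not our production pipeline; the column names in the usage below are hypothetical stand-ins for the concrete dataset, and format standardization (units, date formats) is dataset-specific, so it is only noted in a comment.

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of step 2's cleaning: missing values, outliers, normalization."""
    df = df.copy()
    num = df.select_dtypes(include="number").columns
    # Fill missing numeric values with the column median
    df[num] = df[num].fillna(df[num].median())
    # Remove outliers: drop rows more than 3 standard deviations from the mean
    z = (df[num] - df[num].mean()) / df[num].std()
    df = df[(z.abs() <= 3).all(axis=1)].copy()
    # Normalize numerical ranges to [0, 1] (min-max scaling)
    df[num] = (df[num] - df[num].min()) / (df[num].max() - df[num].min())
    # Standardizing formats (units, date formats, labels) is dataset-specific
    # and would be handled with additional, column-aware steps here.
    return df
```

Even this toy version makes the ordering visible: impute before computing outlier statistics, and normalize last so the scaling reflects the cleaned data.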
Steps 1–3 are the most crucial parts of the process: they lay the backbone of the ML model, which is why most of the time is spent there.
Deep Dive into Data Preprocessing and EDA
Data arrives in many forms: structured tables, scanned PDFs, sensor logs, field notes. Before models can help us, we need to unify these into a single, clean dataset. In the concrete dataset we worked on, for instance, water-cement ratios, coarse aggregate volumes, and curing times had originally been spread across different reports in different units and formats. (In our case the data arrived pre-consolidated as an open-source dataset, but that is atypical in practice.)
Once consolidated, EDA becomes a decision-support tool:
- Histograms reveal distributions (e.g., are most samples clustered around a certain cement content?)
- Scatter plots uncover relationships (e.g., how does strength vary with curing time?)
- Correlation matrices identify dependencies (e.g., does aggregate size correlate with compressive strength?)
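These first EDA passes are only a few lines in pandas. The tiny inline dataset and its column names are hypothetical, chosen to echo the concrete example; summary statistics stand in here for the histograms you would normally plot.

```python
import pandas as pd

# Hypothetical slice of a concrete-mix dataset (illustrative values only)
df = pd.DataFrame({
    "cement": [300.0, 350.0, 400.0, 450.0, 500.0],   # kg/m^3
    "curing_days": [7, 14, 28, 28, 90],
    "strength": [25.0, 33.1, 41.8, 44.2, 55.9],      # MPa
})

# Distributions: summary statistics (in practice, also plot histograms)
print(df["cement"].describe())

# Relationships and dependencies: pairwise Pearson correlations
corr = df.corr()
print(corr["strength"].sort_values(ascending=False))
```

On real data, a strong correlation like cement-versus-strength here is exactly the kind of signal EDA surfaces before any model is chosen.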
This diagnostic phase is essential because it informs not just what our data says—but what models we can responsibly use.
For example:
- Linear Regression requires assumptions: linear relationships, normally distributed errors, homoscedasticity, low multicollinearity.
- Decision Trees don’t require these assumptions and provide transparent, explainable results, but are prone to overfitting if the dataset is noisy or too small.
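Some of these assumptions can be checked directly during EDA. Below is a minimal sketch on synthetic data, assuming scipy is available: fit a line, then probe the residuals for normality and roughly constant spread. The synthetic relationship and the split at x = 5 are illustrative choices, not a general recipe.

```python
import numpy as np
from scipy import stats

# Synthetic data deliberately built to satisfy linear-regression assumptions
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)
y = 3.0 * x + 5.0 + rng.normal(0.0, 1.0, size=x.size)

# Fit a line and inspect the residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Normally distributed errors? (D'Agostino-Pearson test; p_normal is a p-value)
_, p_normal = stats.normaltest(residuals)

# Rough homoscedasticity check: residual spread should not grow with x
spread_low = residuals[x < 5].std()
spread_high = residuals[x >= 5].std()
```

If the residuals fail these checks on real data, that is a cue to consider tree-based models instead of forcing a linear fit.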
How Understanding Data Helps Predict Model Choice
By carefully analyzing data characteristics during EDA, we can match our modeling approach appropriately:
- Linearity & normality → Linear Regression
- Complex, nonlinear patterns → Decision Trees or Random Forests
- High-dimensional data → LASSO or Ridge Regression
- Categorical-dominated datasets → Tree-based models excel
- Time-series data → ARIMA or LSTM networks
Good data analysis is a guidepost for model strategy.
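The mapping above can be written down as a simple rule-of-thumb dispatcher. The function, its inputs, and the high-dimensionality threshold are all illustrative assumptions, not universal cutoffs; real model selection also weighs validation results.

```python
def suggest_model(n_features: int, n_samples: int, is_time_series: bool,
                  mostly_categorical: bool, looks_linear: bool) -> str:
    """Illustrative rule-of-thumb mapping from EDA findings to a model family."""
    if is_time_series:
        return "ARIMA or LSTM"
    if mostly_categorical:
        return "tree-based model (Decision Tree / Random Forest)"
    if n_features > n_samples // 10:  # high-dimensional relative to sample size
        return "LASSO or Ridge Regression"
    if looks_linear:
        return "Linear Regression"
    return "Decision Tree or Random Forest"

# e.g., a modest, linear-looking tabular dataset like our 1,456-row concrete set
print(suggest_model(8, 1456, False, False, True))
```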
Practical Implications for Engineering and Business Leaders
For industries like civil engineering—where lives, safety, and millions of dollars are at stake—“black box” models are not acceptable without transparency and validation.
Data security, cleanliness, and structure must be priorities before embarking on any AI-driven initiative. Organizations increasingly demand:
- The ability to search and learn from past project data
- Reduced time spent on back-and-forth reviews and coordination
- Automation of low-margin processes like bidding
- Rapid generation of feasible designs
But all of these depend first on having clean, well-understood, well-structured data.
Conclusion
At Sidian, we’ve learned first-hand that AI isn’t magic—it’s math on data.
Just as CAD revolutionized design by merging engineering with computers, AI is poised to transform civil engineering again—but only for those who embrace disciplined data science practices.
Before we can rely on predictions or automation, we must respect the fundamentals: thoughtful data collection, careful cleaning, insightful exploration, and methodical model selection. Data science is the engine of machine learning success—and it starts with good data.