A Coding Guide to Build a Functional Data Analysis Workflow Using Lilac for Transforming, Filtering, and Exporting Structured Insights

In this tutorial, we demonstrate a fully functional data analysis pipeline built with the Lilac library, without relying on signals. It combines Lilac's dataset management capabilities with Python's functional programming idioms to create a clean, extensible workflow. From setting up a project and generating realistic sample data to extracting insights and exporting filtered outputs, the tutorial emphasizes reusable, testable code. Core functional utilities, such as pipe, map_over, and filter_by, are used to build a composable flow, while Pandas handles the detailed data transformations and quality analysis.
!pip install lilac[all] pandas numpy
To start, we install the required libraries with the command !pip install lilac[all] pandas numpy. This gives us the full Lilac suite along with Pandas and NumPy for smooth data handling and analysis. Run this in the notebook before proceeding.
import json
import uuid
import pandas as pd
from pathlib import Path
from typing import List, Dict, Any, Tuple, Optional
from functools import reduce, partial
import lilac as ll
We import all the core libraries: json and uuid for data handling and generating unique project names, pandas for working with tabular data, and Path from pathlib for directory management. We also bring in type hints for clearer function signatures and functools for functional composition patterns. Finally, we import the core Lilac library as ll to manage our datasets.
def pipe(*functions):
    """Compose functions left to right (pipe operator)"""
    return lambda x: reduce(lambda acc, f: f(acc), functions, x)


def map_over(func, iterable):
    """Functional map wrapper"""
    return list(map(func, iterable))


def filter_by(predicate, iterable):
    """Functional filter wrapper"""
    return list(filter(predicate, iterable))


def create_sample_data() -> List[Dict[str, Any]]:
    """Generate realistic sample data for analysis"""
    return [
        {"id": 1, "text": "What is machine learning?", "category": "tech", "score": 0.9, "tokens": 5},
        {"id": 2, "text": "Machine learning is AI subset", "category": "tech", "score": 0.8, "tokens": 6},
        {"id": 3, "text": "Contact support for help", "category": "support", "score": 0.7, "tokens": 4},
        {"id": 4, "text": "What is machine learning?", "category": "tech", "score": 0.9, "tokens": 5},
        {"id": 5, "text": "Deep learning neural networks", "category": "tech", "score": 0.85, "tokens": 4},
        {"id": 6, "text": "How to optimize models?", "category": "tech", "score": 0.75, "tokens": 5},
        {"id": 7, "text": "Performance tuning guide", "category": "guide", "score": 0.6, "tokens": 3},
        {"id": 8, "text": "Advanced optimization techniques", "category": "tech", "score": 0.95, "tokens": 3},
        {"id": 9, "text": "Gradient descent algorithm", "category": "tech", "score": 0.88, "tokens": 3},
        {"id": 10, "text": "Model evaluation metrics", "category": "tech", "score": 0.82, "tokens": 3},
    ]
In this section, we define reusable functional utilities. The pipe function lets us compose transformations into a clear left-to-right chain, while map_over and filter_by let us transform or filter iterable data in a functional style. We then create a sample dataset that mimics real-world records, with fields such as text, category, score, and tokens, which we will use later to demonstrate Lilac's data curation capabilities.
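As a quick, hypothetical usage sketch (not part of the original walkthrough), the three utilities can be composed directly on the raw sample records, here keeping only the tech entries and projecting each one down to its text and score:

# Hypothetical usage sketch of the utilities above (not part of the pipeline itself).
sample = create_sample_data()

tech_summaries = pipe(
    partial(filter_by, lambda r: r["category"] == "tech"),                   # keep tech records only
    partial(map_over, lambda r: {"text": r["text"], "score": r["score"]}),   # project each record down to two fields
)(sample)

print(len(tech_summaries))  # 8 of the 10 sample records are tagged "tech"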
def setup_lilac_project(project_name: str) -> str:
    """Initialize Lilac project directory"""
    project_dir = f"./{project_name}-{uuid.uuid4().hex[:6]}"
    Path(project_dir).mkdir(exist_ok=True)
    ll.set_project_dir(project_dir)
    return project_dir


def create_dataset_from_data(name: str, data: List[Dict]) -> ll.Dataset:
    """Create Lilac dataset from data"""
    data_file = f"{name}.jsonl"
    with open(data_file, 'w') as f:
        for item in data:
            f.write(json.dumps(item) + '\n')

    config = ll.DatasetConfig(
        namespace="tutorial",
        name=name,
        source=ll.sources.JSONSource(filepaths=[data_file])
    )
    return ll.create_dataset(config)
With the setup_lilac_project function, we create a unique working directory for our Lilac project and register it via the Lilac API. Using create_dataset_from_data, we convert our raw dictionaries into a .jsonl file and create a Lilac dataset by defining its configuration. This prepares the data for clean, structured analysis.
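As a small, hypothetical standalone check (the directory name will differ each run because of the random uuid suffix), the two helpers can be exercised like this:

# Hypothetical standalone usage of the helpers above; names are illustrative only.
demo_dir = setup_lilac_project("demo_project")            # e.g. ./demo_project-3f9a1c
demo_dataset = create_dataset_from_data("demo_data", create_sample_data())
print(f"Project directory: {demo_dir}")
print(demo_dataset.to_pandas(['id', 'text']).head())      # peek at the ingested rows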
def extract_dataframe(dataset: ll.Dataset, fields: List[str]) -> pd.DataFrame:
    """Extract data as pandas DataFrame"""
    return dataset.to_pandas(fields)


def apply_functional_filters(df: pd.DataFrame) -> Dict[str, pd.DataFrame]:
    """Apply various filters and return multiple filtered versions"""
    filters = {
        'high_score': lambda df: df[df['score'] >= 0.8],
        'tech_category': lambda df: df[df['category'] == 'tech'],
        'min_tokens': lambda df: df[df['tokens'] >= 4],
        'no_duplicates': lambda df: df.drop_duplicates(subset=['text'], keep='first'),
        'combined_quality': lambda df: df[(df['score'] >= 0.8) & (df['tokens'] >= 3) & (df['category'] == 'tech')]
    }
    return {name: filter_func(df.copy()) for name, filter_func in filters.items()}
We extract the dataset into a Pandas DataFrame using extract_dataframe, which lets us work with selected fields in a familiar format. Then, with apply_functional_filters, we define and apply a set of logical filters, such as high-score selection, category-based filtering, token-count constraints, duplicate removal, and a combined quality condition, to produce multiple filtered views of the data.
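For a quick sanity check, here is a hypothetical illustration that builds a DataFrame straight from the sample records (bypassing Lilac) and counts how many rows survive each filter:

# Hypothetical illustration: inspect the filtered views on the raw sample data.
df_demo = pd.DataFrame(create_sample_data())
views = apply_functional_filters(df_demo)
for view_name, view_df in views.items():
    print(f"{view_name}: {len(view_df)} rows")
# 'no_duplicates' should drop the repeated "What is machine learning?" entry.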
def analyze_data_quality(df: pd.DataFrame) -> Dict[str, Any]:
    """Analyze data quality metrics"""
    return {
        'total_records': len(df),
        'unique_texts': df['text'].nunique(),
        'duplicate_rate': 1 - (df['text'].nunique() / len(df)),
        'avg_score': df['score'].mean(),
        'category_distribution': df['category'].value_counts().to_dict(),
        'score_distribution': {
            'high': len(df[df['score'] >= 0.8]),
            'medium': len(df[(df['score'] >= 0.6) & (df['score'] < 0.8)]),
            'low': len(df[df['score'] < 0.6])
        },
        'token_stats': {
            'mean': df['tokens'].mean(),
            'min': df['tokens'].min(),
            'max': df['tokens'].max()
        }
    }


def create_data_transformations() -> Dict[str, callable]:
    """Create various data transformation functions"""
    return {
        'normalize_scores': lambda df: df.assign(norm_score=df['score'] / df['score'].max()),
        'add_length_category': lambda df: df.assign(
            length_cat=pd.cut(df['tokens'], bins=[0, 3, 5, float('inf')], labels=['short', 'medium', 'long'])
        ),
        'add_quality_tier': lambda df: df.assign(
            quality_tier=pd.cut(df['score'], bins=[0, 0.6, 0.8, 1.0], labels=['low', 'medium', 'high'])
        ),
        'add_category_rank': lambda df: df.assign(
            category_rank=df.groupby('category')['score'].rank(ascending=False)
        )
    }
To assess dataset quality, we use analyze_data_quality, which measures key metrics such as total and unique records, the duplicate rate, category breakdown, and score/token distributions. This gives us a clear picture of the dataset's readiness and reliability. We also define transformation functions with create_data_transformations, enabling enrichments such as score normalization, token-length categorization, quality-tier assignment, and per-category score ranking.
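The sketch below (an illustrative assumption, again working on a DataFrame built directly from the sample data) shows the quality report and a single transformation in isolation:

# Hypothetical illustration of the quality report and one transformation.
df_demo = pd.DataFrame(create_sample_data())

report = analyze_data_quality(df_demo)
print(f"Duplicate rate: {report['duplicate_rate']:.1%}")   # 10.0% for the 10-row sample (one repeated text)
print(f"Categories: {report['category_distribution']}")

ranked = create_data_transformations()['add_category_rank'](df_demo)
print(ranked[['text', 'category', 'category_rank']].head())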
def apply_transformations(df: pd.DataFrame, transform_names: List[str]) -> pd.DataFrame:
    """Apply selected transformations"""
    transformations = create_data_transformations()
    selected_transforms = [transformations[name] for name in transform_names if name in transformations]
    return pipe(*selected_transforms)(df.copy()) if selected_transforms else df


def export_filtered_data(filtered_datasets: Dict[str, pd.DataFrame], output_dir: str) -> None:
    """Export filtered datasets to files"""
    Path(output_dir).mkdir(exist_ok=True)
    for name, df in filtered_datasets.items():
        output_file = Path(output_dir) / f"{name}_filtered.jsonl"
        with open(output_file, 'w') as f:
            for _, row in df.iterrows():
                f.write(json.dumps(row.to_dict()) + '\n')
        print(f"Exported {len(df)} records to {output_file}")
Next, through apply_transformations, we selectively apply the requested transformations in a functional chain, ensuring our data is enriched and consistently structured. Once filtered, we use export_filtered_data to write each filtered dataset to a separate .jsonl file. This lets us store subsets, such as high-quality entries or deduplicated records, in a structured format for downstream use.
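A minimal, hypothetical end-to-end use of these two helpers on the sample data (the output path here is illustrative) might look like this:

# Hypothetical illustration: enrich the sample data, then export a single view.
df_demo = pd.DataFrame(create_sample_data())
enriched = apply_transformations(df_demo, ['normalize_scores', 'add_quality_tier'])
export_filtered_data({'demo_view': enriched}, './demo_exports')
# Writes ./demo_exports/demo_view_filtered.jsonl with one JSON object per record.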
def main_analysis_pipeline():
    """Main analysis pipeline demonstrating functional approach"""
    print("🚀 Setting up Lilac project...")
    project_dir = setup_lilac_project("advanced_tutorial")

    print("📊 Creating sample dataset...")
    sample_data = create_sample_data()
    dataset = create_dataset_from_data("sample_data", sample_data)

    print("📋 Extracting data...")
    df = extract_dataframe(dataset, ['id', 'text', 'category', 'score', 'tokens'])

    print("🔍 Analyzing data quality...")
    quality_report = analyze_data_quality(df)
    print(f"Original data: {quality_report['total_records']} records")
    print(f"Duplicates: {quality_report['duplicate_rate']:.1%}")
    print(f"Average score: {quality_report['avg_score']:.2f}")

    print("🔄 Applying transformations...")
    transformed_df = apply_transformations(df, ['normalize_scores', 'add_length_category', 'add_quality_tier'])

    print("🎯 Applying filters...")
    filtered_datasets = apply_functional_filters(transformed_df)

    print("\n📈 Filter Results:")
    for name, filtered_df in filtered_datasets.items():
        print(f"  {name}: {len(filtered_df)} records")

    print("💾 Exporting filtered datasets...")
    export_filtered_data(filtered_datasets, f"{project_dir}/exports")

    print("\n🏆 Top Quality Records:")
    best_quality = filtered_datasets['combined_quality'].head(3)
    for _, row in best_quality.iterrows():
        print(f"  • {row['text']} (score: {row['score']}, category: {row['category']})")

    return {
        'original_data': df,
        'transformed_data': transformed_df,
        'filtered_data': filtered_datasets,
        'quality_report': quality_report
    }


if __name__ == "__main__":
    results = main_analysis_pipeline()
    print("\n✅ Analysis complete! Check the exports folder for filtered datasets.")
Finally, in main_analysis_pipeline, we execute the full workflow, from project setup and dataset creation to quality analysis, transformation, filtering, and export, and we print the top-quality records as a quick snapshot. This function represents our complete data curation loop, powered by Lilac and functional programming.
In conclusion, readers come away with a practical understanding of how to build a reproducible data pipeline that leverages Lilac's dataset abstractions and functional programming patterns for clean, maintainable analysis. The pipeline covers all the critical stages, including dataset creation, transformation, filtering, quality analysis, and export, and offers flexibility for both experimentation and production. It also shows how to attach meaningful metadata such as normalized scores, quality tiers, and length categories, which can be useful in downstream tasks such as model fine-tuning or human review.
Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.
