Examples
In this section, we provide comprehensive examples to demonstrate the workflow for processing different types of file sources using our schema-driven approach. These examples will guide you through the steps required to configure, execute, and evaluate the data based on specific file formats.
For each file source, we will cover the following key steps:
- Running the Schema Generation Command: This step involves executing a command to generate a schema based on the specified file format. The schema will define how the data should be processed and evaluated.
- Creating the Schema: We provide the schema configuration file, detailing the necessary settings and parameters for processing the data.
- Description of the Data We Want to Search: This section includes a sample dataset in the specified file format that we aim to search and analyze.
- File to Run the Schema Against: Here, we present another sample dataset against which the schema will be applied to find matches.
- Expected Results: Finally, we show the expected output after running the schema against the sample data, demonstrating the effectiveness of the schema-driven approach.
By following these examples, you will see how to handle different file formats, configure the schema for each, and interpret the results of the evaluation, so that you can apply the same workflow to your own datasets.
Working with JSON data
To analyze and process JSON data effectively, we use a schema-driven approach. This process involves running a command to generate a schema based on JSON-like data and evaluating the similarity of data entries against the schema.
Running the Schema Generation Command
First, we execute a command to generate a schema from the JSON-like data. This schema will be used to evaluate and process the data further.
findanywhere_schema jsonlike string_based_evaluation \
--threshold constant \
--out schema.yml
This command generates a schema named schema.yml using the findanywhere_schema tool. The schema is based on the JSON-like data format and uses a string-based evaluation method. The --threshold constant option sets a constant threshold value for the evaluation process.
Creating the Schema
The schema configuration file defines how the data will be processed and evaluated. Below is an example of such a schema:
deduction:
config: {}
name: average
evaluation:
config:
aggregate: max
similarity: jaro_winkler
similarity_parameter: {}
name: string_based_evaluation
source:
config:
encoding: utf-8
errors: surrogateescape
file_format: !!python/object/apply:findanywhere.adapters.source.jsonlike.JSONLikeFormat
- json
tokenizer: delimiter
tokenizer_config: {}
name: jsonlike
threshold:
config:
constant: 0.8
name: constant
This YAML configuration file includes several sections:
- deduction: Defines the deduction method to use, which in this case is average.
- evaluation: Specifies the evaluation method (string_based_evaluation) and its configuration, including the similarity measure (jaro_winkler) and aggregation method (max).
- source: Describes the source data format and how it should be read, including encoding, error handling, and tokenizer configuration.
- threshold: Sets the constant threshold value for evaluation.
Description of the Data We Want to Search
Here is the JSON data we aim to search. This dataset contains user information such as IDs, emails, first names, and last names.
[
{
"id": "alice.ashcroft", "email": "alice.ashcroft@here.local",
"first_name": "Alice", "last_name": "Ashcroft"
},
{
"id": "bob.bones", "email": "charlie.st.claire@here.local",
"first_name": "Bob", "last_name": "Bones"
}
]
This dataset includes two user records:
- Alice Ashcroft with ID alice.ashcroft and email alice.ashcroft@here.local
- Bob Bones with ID bob.bones and email charlie.st.claire@here.local
File to Run the Schema Against
We also have a JSON file containing data entries we want to match against our schema.
[
{"firstname": "Alice", "lastname": "Ashcroft", "order_no": 5},
{"name": "Bones", "other": "Bob", "order_no": 6}
]
This dataset includes two records:
- An entry with firstname “Alice”, lastname “Ashcroft”, and order_no 5.
- An entry with name “Bones”, other “Bob”, and order_no 6.
Expected Results
After running the schema against the data file, we expect to obtain a result indicating the best matches based on the similarity evaluation. Below is an example of the expected output:
[
{
"of": "Alice",
"best_matches": {
"first_name": {
"position": {"path": [0, "firstname"], "token": 0},
"value": "Alice",
"similarity": 1.0
},
"id": {
"position": {"path": [0, "firstname"], "token": 0},
"value": "Alice",
"similarity": 1.0
},
"last_name": {
"position": {"path": [0, "lastname"], "token": 0
},
"value": "Ashcroft",
"similarity": 1.0
}
},
"score": 1.0
},
{
"of": "Bob",
"best_matches": {
"first_name": {
"position": {"path": [1, "other"], "token": 0},
"value": "Bob",
"similarity": 1.0
},
"id": {
"position": {"path": [1, "other"], "token": 0},
"value": "Bob",
"similarity": 1.0
},
"last_name": {
"position": {"path": [1, "name"], "token": 0},
"value": "Bones",
"similarity": 1.0
}
},
"score": 1.0
}
]
The result includes:
- A match for “Alice” with perfect similarity scores for first_name, id, and last_name.
- A match for “Bob” with perfect similarity scores for first_name, id, and last_name.
This output demonstrates that the schema and evaluation process correctly identified and matched the relevant data fields from the input dataset.
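The path in each position can be read as a sequence of list indices and dictionary keys into the searched JSON document. The following is a minimal sketch of that reading, not part of findanywhere itself; the helper name resolve_path is hypothetical, and the data is copied from the file shown above.

```python
from functools import reduce

# Sample data copied from the "File to Run the Schema Against" example.
search_in = [
    {"firstname": "Alice", "lastname": "Ashcroft", "order_no": 5},
    {"name": "Bones", "other": "Bob", "order_no": 6},
]


def resolve_path(document, path):
    """Follow a list of list indices and dict keys into nested JSON data.

    Hypothetical helper for illustration; not a findanywhere function.
    """
    return reduce(lambda node, key: node[key], path, document)


# The paths below are taken from the "position" entries of the result.
first_name_hit = resolve_path(search_in, [1, "other"])
last_name_hit = resolve_path(search_in, [1, "name"])
print(first_name_hit, last_name_hit)
```

Resolving a reported path this way is a quick check that a match points at the field you expect, which helps when field names differ between the two files (for example other versus first_name).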
Working with malformed CSV data
A “malformed CSV” refers to a CSV (Comma-Separated Values) file that deviates from the standard format where each line contains the same number of fields, separated consistently by a single specified delimiter, typically a comma. In the context of this document, a malformed CSV may exhibit one or more of the following anomalies:
Inconsistent Delimiters: Instead of using a consistent delimiter like a comma, the file uses varied delimiters (e.g., ‘##’, commas, spaces), or mixes multiple types of delimiters unpredictably across the data.
Irregular Number of Fields: Some rows contain different numbers of fields compared to others. This could include extra fields or missing fields, which may lead to misalignment when attempting to parse the data into a structured form.
Non-standard Encoding or Escaping: Issues in text encoding or improper escaping of special characters (like newline characters within a field) can cause further complications in parsing the file.
Such inconsistencies can lead to challenges in data processing and analysis, as standard CSV parsing tools or libraries may fail to correctly interpret or extract the data fields. Special handling and customized parsing strategies are often required to work effectively with malformed CSV files.
Assuming the following malformed CSV file
This section provides an example of a CSV file that does not adhere to a standard single-delimiter format, which can pose challenges during data processing:
name##age
alice##32
bob##24##blue
charlie##30##cyan##cheeseburger
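To illustrate why such a file is awkward for standard tooling, the sketch below feeds the sample to Python's built-in csv module, which only accepts a single-character delimiter, and then counts the fields per row after a plain string split. The variable names are illustrative only.

```python
import csv

lines = [
    "name##age",
    "alice##32",
    "bob##24##blue",
    "charlie##30##cyan##cheeseburger",
]

# The standard csv module rejects a multi-character delimiter such as "##".
try:
    list(csv.reader(lines, delimiter="##"))
    csv_error = None
except TypeError as exc:
    csv_error = str(exc)

# Splitting on the literal "##" works, but the rows have different lengths.
field_counts = [len(line.split("##")) for line in lines]
is_irregular = len(set(field_counts)) > 1

print(csv_error)
print(field_counts, is_irregular)
```

The differing field counts are the reason the file is treated as delimited text rather than as a table in the steps that follow.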
Create initial schema
To begin processing the malformed CSV file, we first generate an initial schema with the findanywhere_schema tool. The command selects the textfile source and the string_based_evaluation method with a constant threshold, and writes the result to schema.yml:
findanywhere_schema textfile string_based_evaluation \
--threshold constant \
--out schema.yml
Modify schema to reflect source file, treated as text with delimiter
Once the initial schema is created, we modify it to reflect the structure of our source file. The file is treated as plain text, and the tokenizer is configured to split each line on the ‘##’ delimiter. The adjusted schema also uses token_best_fit_similarity as the similarity measure and keeps the constant threshold at 0.8:
deduction:
config: {}
name: average
evaluation:
config:
aggregate: max
similarity: token_best_fit_similarity
similarity_parameter: {}
name: string_based_evaluation
source:
config:
encoding: utf-8
errors: surrogateescape
tokenizer_config:
delimiters:
- "##"
name: textfile
threshold:
config:
constant: 0.8
name: constant
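As a rough picture of what delimiter-based tokenization means for this file, the sketch below splits each line on a list of literal delimiters and records a line index and token index for every token. It is an illustration under assumptions, not findanywhere's tokenizer: the function names are hypothetical, and the 0-based indexing is a choice made for this sketch.

```python
import re


def tokenize(line, delimiters):
    """Split a line on any of the given literal delimiter strings."""
    pattern = "|".join(re.escape(delimiter) for delimiter in delimiters)
    return re.split(pattern, line.rstrip("\n"))


def iter_tokens(lines, delimiters):
    """Yield (line_index, token_index, token); indices are 0-based here."""
    for line_index, line in enumerate(lines):
        for token_index, token in enumerate(tokenize(line, delimiters)):
            yield line_index, token_index, token


sample = ["name##age\n", "alice##32\n", "bob##24##blue\n"]
tokens = list(iter_tokens(sample, ["##"]))
for entry in tokens:
    print(entry)
```

Because every token keeps its own position, rows with extra or missing fields do not need to be aligned into columns before they can be searched.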
Data we want to search
The schema is then used to search the CSV file for the entries of a JSON file (search_for.json), shown below. Each entry contains an identifier and an email address, and optionally a town:
[
{"id": "alice", "email": "alice.ashcroft@here.local", "town": "Ashville"},
{"id": "charlie", "email": "charlie.st.claire@here.local"}
]
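To run the search, we pass the schema, the search data, and the CSV file to the findanywhere command:
findanywhere schema.yml search_for.json search_in.csv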
Yields results
The results of the search are shown below. For each searched field, a result reports the best-matching token with its position (line and token index), its value, and its similarity. The score summarizes how closely the entry as a whole matches:
{
"of": "alice",
"best_matches": {
"email": {
"position": {"line": 1, "token": 2},
"value": "alice.ashcroft@here.local",
"similarity": 1.0
},
"id": {
"position": {"line": 1, "token": 0},
"value": "alice.ashcroft",
"similarity": 0.8714285714285714
},
"town": {
"position": {"line": 1, "token": 1},
"value": "5th Avenue Ashville",
"similarity": 1.0
}
},
"score": 0.9571428571428572
}
{
"of": "alice",
"best_matches": {
"email": {
"position": {"line": 2, "token": 3},
"value": " California,bob.bones@here.local\n",
"similarity": 0.7280286738351255
},
"id": {
"position": {"line": 2, "token": 1},
"value": "Alice Ashcroft Memorial Lane",
"similarity": 0.8666666666666667
},
"town": {
"position": {"line": 2, "token": 2},
"value": "Ashville Cyan County",
"similarity": 1.0}
},
"score": 0.8648984468339308
}
{
"of": "charlie",
"best_matches": {
"email": {
"position": {"line": 3, "token": 1},
"value": "charlie.st.claire@here.local",
"similarity": 1.0
},
"id": {
"position": {"line": 3, "token": 0},
"value": "charlie.st.claire",
"similarity": 0.8823529411764706}
},
"score": 0.9411764705882353
}
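The scores above are consistent with the average deduction configured in the schema: each score equals the mean of the similarities of that result's best matches. The sketch below checks this for the three results and compares each score with the constant threshold of 0.8. That the threshold is applied to the deduced score rather than to individual similarities is an assumption, suggested by the second result, where one similarity is below 0.8 but the result is still reported.

```python
from math import isclose

THRESHOLD = 0.8  # constant threshold from the schema

# Similarities and scores copied from the three results above.
results = [
    {"of": "alice",
     "similarities": {"email": 1.0, "id": 0.8714285714285714, "town": 1.0},
     "score": 0.9571428571428572},
    {"of": "alice",
     "similarities": {"email": 0.7280286738351255,
                      "id": 0.8666666666666667, "town": 1.0},
     "score": 0.8648984468339308},
    {"of": "charlie",
     "similarities": {"email": 1.0, "id": 0.8823529411764706},
     "score": 0.9411764705882353},
]


def average_score(similarities):
    """Mean of the per-field similarities (the 'average' deduction)."""
    return sum(similarities.values()) / len(similarities)


for result in results:
    deduced = average_score(result["similarities"])
    # Compare with a tolerance, since floating-point sums may differ
    # in the last digit from the reported value.
    matches_reported = isclose(deduced, result["score"], rel_tol=1e-9)
    print(result["of"], matches_reported, deduced >= THRESHOLD)
```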
Working with XML data
In this section, we will cover the process of searching an XML file using a schema and evaluation method. The goal is to identify specific data within the XML file by defining a schema and applying a string-based evaluation method.
To start, we need an XML file that contains the data we want to search. We will also define the schema and the evaluation method to process the XML file.
We use the findanywhere_schema command to create a schema for the xmlfile source with the string-based evaluation method. The --threshold parameter selects the threshold method (here, constant), and the --deduce_score parameter specifies how the overall score is calculated (here, average).
findanywhere_schema xmlfile string_based_evaluation --threshold constant --out schema.yml --deduce_score average
After running the command, the schema generation results will look like this. The schema defines how the XML data should be processed and evaluated.
deduction:
config: {}
name: average
evaluation:
config:
aggregate: max
similarity: jaro_winkler
similarity_parameter: {}
name: string_based_evaluation
source:
config:
include_attributes: true
tokenizer: delimiter
tokenizer_config: {}
name: xmlfile
threshold:
config:
constant: 0.8
name: constant
Here is the XML file that contains the library data. This file includes metadata about the library and details about the books available in the library.
<?xml version="1.0" encoding="UTF-8"?>
<Library>
<Metadata>
<LibraryName>Central City Library</LibraryName>
<Location>123 Library St, Central City</Location>
<Contact>Email: info@centrallibrary.com, Phone: 123-456-7890</Contact>
<EstablishedYear>1902</EstablishedYear>
</Metadata>
<Books>
<Book lent_to="Alice Ashcroft">
<Title>The Great Gatsby</Title>
<Author>F. Scott Fitzgerald</Author>
<PublicationYear>1925</PublicationYear>
<ISBN>9780743273565</ISBN>
<Genre>Fiction</Genre>
</Book>
<Book>
<Title>To Kill a Mockingbird</Title>
<Author>Harper Lee</Author>
<PublicationYear>1960</PublicationYear>
<ISBN>9780061120084</ISBN>
<Genre>Fiction</Genre>
</Book>
<Book>
<Title>1984</Title>
<Author>George Orwell</Author>
<PublicationYear>1949</PublicationYear>
<ISBN>9780451524935</ISBN>
<Genre>Dystopian</Genre>
</Book>
</Books>
</Library>
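To give an idea of what searching this file with include_attributes enabled involves, the sketch below walks a shortened copy of the XML with Python's xml.etree.ElementTree and lists whitespace-separated tokens from both element text and attribute values. It is an assumption-laden illustration, not findanywhere's implementation: the function name iter_tokens is hypothetical, and the lowercased tags and the attribute@tag label merely mirror the labels that appear in the results of this section.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the library file, reduced to a single book.
XML = """<Library><Books>
  <Book lent_to="Alice Ashcroft">
    <Title>The Great Gatsby</Title>
    <Author>F. Scott Fitzgerald</Author>
  </Book>
</Books></Library>"""


def iter_tokens(root, include_attributes=True):
    """Yield (label, token_index, token) for text and attribute values."""
    for element in root.iter():
        tag = element.tag.lower()
        if include_attributes:
            for name, value in element.attrib.items():
                for index, token in enumerate(value.split()):
                    yield f"{name}@{tag}", index, token
        text = (element.text or "").strip()
        for index, token in enumerate(text.split()):
            yield tag, index, token


tokens = list(iter_tokens(ET.fromstring(XML)))
for entry in tokens:
    print(entry)
```

With attributes included, the borrower's name stored in lent_to becomes searchable alongside the element text; without them, only titles and authors would yield tokens in this sample.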
We will use the following JSON data to search within the XML file. This data contains the names we want to find in the XML file.
[
{
"first_name": "Francis",
"middle_name": "Scott",
"last_name": "Fritzgerald"
},
{
"first_name": "Alice",
"last_name": "Ashcroft"
}
]
To perform the search, we use the findanywhere command, specifying the schema file, the JSON data file, and the XML file.
findanywhere schema.yml search_data.json books.xml
The command returns the results of the search. The output indicates the best matches found in the XML file for each entry in the JSON data, along with their positions and similarity scores.
{
"of": "0",
"best_matches": {
"first_name": {
"position": {
"tag": "title",
"path": ["Library", "Books", "Book"],
"index": 1,
"token": 2
},
"value": "a",
"similarity": 0.7142857142857143
},
"last_name": {
"position": {
"tag": "author",
"path": ["Library", "Books", "Book"],
"index": 0,
"token": 2
},
"value": "Fitzgerald",
"similarity": 0.9727272727272728
},
"middle_name": {
"position": {
"tag": "author",
"path": ["Library", "Books", "Book"],
"index": 0,
"token": 1
},
"value": "Scott",
"similarity": 1.0
}
},
"score": 0.8956709956709957
}
{
"of": "1",
"best_matches": {
"first_name": {
"position": {
"tag": "lent_to@book",
"path": ["Library", "Books", "Book"],
"index": 0,
"token": 0
},
"value": "Alice",
"similarity": 1.0
},
"last_name": {
"position": {
"tag": "lent_to@book",
"path": ["Library", "Books", "Book"],
"index": 0,
"token": 1
},
"value": "Ashcroft",
"similarity": 1.0
}
},
"score": 1.0
}
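The similarity values above can be related to the jaro_winkler measure named in the schema. The sketch below is a textbook Jaro-Winkler implementation (prefix weight 0.1, prefix capped at four characters, case-sensitive), written for illustration; whether findanywhere uses exactly these parameters is an assumption. Under that assumption, the misspelled search term "Fritzgerald" compared with the token "Fitzgerald", and "Francis" compared with the token "a", give the values reported in the first result.

```python
from math import isclose


def jaro(s1, s2):
    """Textbook Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, char in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)


last_name = jaro_winkler("Fritzgerald", "Fitzgerald")
first_name = jaro_winkler("Francis", "a")
print(isclose(last_name, 0.9727272727272728, abs_tol=1e-12))
print(isclose(first_name, 0.7142857142857143, abs_tol=1e-12))
```

The low first_name value shows why short tokens can still appear as best matches: a single shared character is enough for a non-zero similarity, and the averaged score may stay above the threshold when the other fields match well.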
Working with HTML data
findanywhere_schema website string_based_evaluation --threshold constant --out schema.yml --deduce_score average
The command above generates a schema file (schema.yml) for the website source, using the string_based_evaluation method with a constant threshold. The resulting schema deduces the overall score by averaging the evaluations.
deduction:
config: {}
name: average
evaluation:
config:
aggregate: max
similarity: jaro_winkler
similarity_parameter: {}
name: string_based_evaluation
source:
config:
html_parser_name: html.parser
tokenizer: delimiter
tokenizer_config: {}
name: website
threshold:
config:
constant: 0.8
name: constant
The schema file (schema.yml) includes:
- deduction: Configuration for averaging scores (average).
- evaluation: Utilizes string_based_evaluation with parameters such as maximum aggregation and Jaro-Winkler similarity.
- source: Specifies website as the source, using HTML parser html.parser with delimiter tokenization.
- threshold: Sets a constant threshold of 0.8 for scoring.
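As a rough illustration of how text in an HTML page can be turned into searchable tokens with a tag and a path, the sketch below uses Python's standard html.parser module. It is not findanywhere's website source: the class name is hypothetical, the sample markup is made up for this example, and the index field seen in the tool's output is not modelled.

```python
from html.parser import HTMLParser


class TextTokenCollector(HTMLParser):
    """Collect (tag, path, token_index, token) for text inside elements."""

    def __init__(self):
        super().__init__()
        self._stack = []   # currently open tags, outermost first
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        if not self._stack:
            return
        tag, path = self._stack[-1], tuple(self._stack[:-1])
        for token_index, token in enumerate(data.split()):
            self.tokens.append((tag, path, token_index, token))


# Made-up sample page; a real page would be downloaded or read from disk.
SAMPLE_HTML = (
    "<html><head><title>Example Domain</title></head>"
    "<body><div><h1>Example Domain</h1>"
    "<p>More information...</p></div></body></html>"
)

collector = TextTokenCollector()
collector.feed(SAMPLE_HTML)
collector.close()
for entry in collector.tokens:
    print(entry)
```

This simple stack does not handle void elements such as br or img, which never receive an end tag; a more complete version would need to skip them.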
[
{
"keyword": "Example"
},
{
"reference": "Information"
}
]
In the JSON above, we specify our search criteria for https://example.com or a local file. We are interested in extracting information related to the keyword “Example” and the reference “Information”.
findanywhere schema.yml search_data.json http://example.com --sequential 1
We run the findanywhere command with the schema (schema.yml) and the search data (search_data.json). Because multiprocessing is not required for a website, we opt for sequential mode (--sequential 1).
{
"of": "0",
"best_matches": {
"keyword": {
"position": {
"tag": "h1",
"path": ["html", "body", "div"],
"index": 0,
"token": 0
},
"value": "Example",
"similarity": 1.0
}
},
"score": 1.0
}
{
"of": "1",
"best_matches": {
"reference": {
"position": {
"tag": "p",
"path": ["html", "body", "div"],
"index": 25,
"token": 1
},
"value": "information...",
"similarity": 0.8744588744588745
}
},
"score": 0.8744588744588745
}
{
"of": "1",
"best_matches": {
"reference": {
"position": {
"tag": "a",
"path": ["html", "body", "div", "p"],
"index": 1,
"token": 1
},
"value": "information...",
"similarity": 0.8744588744588745
}
},
"score": 0.8744588744588745
}
{
"of": "0",
"best_matches": {
"keyword": {
"position": {
"tag": "title",
"path": ["html", "head"],
"index": 0,
"token": 0
},
"value": "Example",
"similarity": 1.0
}
},
"score": 1.0
}
After execution, the results provide details on the position of each matched element (tag, path, index, token), the matched value (value), the similarity score (similarity), and the overall score (score). Results are returned for both the keyword and the reference specified in the search data.
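The results above appear as a sequence of JSON objects rather than a single JSON array, so json.loads on the whole output would fail. The sketch below reads such a sequence with the standard library's json.JSONDecoder.raw_decode. It assumes the output has been captured as text in the form shown; the helper name iter_json_objects is hypothetical and the sample is a shortened copy of two results.

```python
import json


def iter_json_objects(text):
    """Yield each JSON value from a string of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    index, length = 0, len(text)
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while index < length and text[index].isspace():
            index += 1
        if index == length:
            return
        value, index = decoder.raw_decode(text, index)
        yield value


# Shortened copies of two results from the output above.
stream = """
{"of": "0", "best_matches": {"keyword": {"value": "Example", "similarity": 1.0}}, "score": 1.0}
{
  "of": "1",
  "best_matches": {"reference": {"value": "information...", "similarity": 0.8744588744588745}},
  "score": 0.8744588744588745
}
"""

results = list(iter_json_objects(stream))
for result in results:
    print(result["of"], result["score"])
```

Once parsed, the results can be filtered or sorted by score like any other list of dictionaries.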